How Gzip Just Beat AI
by Saketh Thota · 1/26/2024

What's the Big Deal?
A few weeks ago you might have heard talk of a paper that went viral on Hacker News and on YouTube via Tsoding. The paper describes an unusual approach to text classification that boasts surpassing the performance of deep neural networks (DNNs).
I say unusual because instead of using a complex model with lots of parameters and training, they just used gzip.
Gzip: The Forgotten Hero
Though it might come as a shock, compressors like gzip have long been used in other language applications like plagiarism detection. The goal of a compression algorithm is simply to find redundancies in a file and remove them without losing any information. In doing so, it also captures regularity in the text being compressed. Since texts from the same category share more regularity than texts from different categories, we can use gzip in some pretty clever ways.
If we take $C(x)$ to represent compressed lengths, where $C(xy)$ is the compressed length of the concatenation of $x$ and $y$, then $C(xy) - C(x)$ is the information distance between $x$ and $y$, or how many more bytes are needed to encode $y$ given $x$. This means that if $C(xy) - C(x) < C(xz) - C(x)$, then $y$ is more similar to $x$ than $z$ is.
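To make this concrete, here is a minimal sketch (not the paper's code) that measures compressed lengths with zlib and compares how many extra bytes are needed to encode two candidate texts once $x$ is already known. The sample strings are purely illustrative.

```cpp
#include <zlib.h>
#include <iostream>
#include <string>
#include <vector>

// Compressed length of a string, i.e. C(x), using zlib at its highest level.
size_t compressed_size(const std::string& s) {
    uLongf dest_len = compressBound(s.size());
    std::vector<Bytef> dest(dest_len);
    compress2(dest.data(), &dest_len,
              reinterpret_cast<const Bytef*>(s.data()), s.size(),
              Z_BEST_COMPRESSION);
    return dest_len;
}

int main() {
    std::string x = "the stock market rallied after the earnings report";
    std::string y = "the stock market fell after the earnings report";
    std::string z = "the goalkeeper saved a penalty in the final minute";

    // C(xy) - C(x): extra bytes needed to encode y once x is known.
    size_t cost_y = compressed_size(x + y) - compressed_size(x);
    size_t cost_z = compressed_size(x + z) - compressed_size(x);

    std::cout << "C(xy) - C(x) = " << cost_y << "\n";  // expected to be smaller,
    std::cout << "C(xz) - C(x) = " << cost_z << "\n";  // since y shares more text with x
}
```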
We’re now left with 2 obstacles: how do we calculate information distance, and how do we classify our text?
NCD: Close Enough
Information distance $E(x, y)$ is defined as the length of the shortest binary program that converts $x$ to $y$. It is well defined as follows:

$$E(x, y) = \max\{K(x \mid y), K(y \mid x)\}$$

Unfortunately, $K$ in this definition refers to Kolmogorov complexity, and its use here renders $E(x, y)$ incomputable.
If we use a common compressor, as we intend with gzip, we can approximate the information distance with what's known as the normalized compression distance (NCD):

$$NCD(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}$$
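The formula maps almost directly onto code. Here is a rough sketch, assuming the zlib-based compressed_size helper from the earlier snippet:

```cpp
#include <algorithm>
#include <string>

// Normalized compression distance between two strings, using compressed_size
// (the zlib helper above) as the approximation of C(x).
double ncd(const std::string& a, const std::string& b) {
    const double ca  = compressed_size(a);
    const double cb  = compressed_size(b);
    const double cab = compressed_size(a + b);
    return (cab - std::min(ca, cb)) / std::max(ca, cb);
}
```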
Now that we have all the pieces, we can finally get to classifying our text.
k-NN: Bringing it All Together
The k-nearest neighbors algorithm (k-NN) classifies a sample by assigning it the label that is most frequent among its k nearest neighbors. In our case, we can use NCD to measure the distance from our sample to each of the training samples, then assign the most common label among the k closest ones.
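A minimal sketch of that last step, assuming the NCDs to every training sample have already been computed (the names dists, labels, and k are illustrative, not from the paper's code):

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <numeric>
#include <vector>

// dists[i] is the NCD between the test sample and training sample i,
// labels[i] is that training sample's class. Returns the predicted class.
int knn_classify(const std::vector<double>& dists,
                 const std::vector<int>& labels,
                 int k) {
    // Order training indices by distance to the test sample (only the k closest).
    std::vector<std::size_t> idx(dists.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](std::size_t a, std::size_t b) { return dists[a] < dists[b]; });

    // Majority vote among the k nearest neighbors.
    std::map<int, int> votes;
    for (int i = 0; i < k; ++i) ++votes[labels[idx[i]]];
    return std::max_element(votes.begin(), votes.end(),
                            [](const auto& a, const auto& b) { return a.second < b.second; })
           ->first;
}
```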
Implementation
After some fiddling with zlib and rapidcsv, I was able to get a working implementation of the paper's proposed algorithm in C++.
Results
Training on the AG News dataset, I was able to achieve about 85% accuracy after testing a third of the test samples. The original paper achieved 92% accuracy.
Due to CPU limitations, I was not able to test on the whole set. In the future I would like to try C++'s thread library, as much of the computation here can be parallelized.
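Since each test sample is classified independently, the work splits across threads naturally. A rough sketch of what that could look like, assuming a per-sample classify() function (a hypothetical name for the NCD + k-NN step above):

```cpp
#include <string>
#include <thread>
#include <vector>

int classify(const std::string& sample);  // assumed: runs NCD + k-NN for one test sample

// Spreads the test set over num_threads workers; predictions must be
// pre-sized to tests.size() before calling.
void classify_all(const std::vector<std::string>& tests,
                  std::vector<int>& predictions,
                  int num_threads) {
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Each thread handles a strided slice of the test set.
            for (std::size_t i = t; i < tests.size(); i += num_threads) {
                predictions[i] = classify(tests[i]);
            }
        });
    }
    for (auto& w : workers) w.join();
}
```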