
TurboQuant Insights

A recent paper by Google claims 6x model memory compression - finally got around to reading it.

idea #1

We want to quantize vectors - but some vectors are hurt worse by scale-based quantization than others.

For example, a vector with a heavy outlier, something like [0.01, -0.02, -0.001, 0.05, 91.5], has its scale factor biased toward preserving that outlier. This completely zeros out the other values when quantized via the likes of INT4. The KV-cache is full of vectors like this.

With symmetric INT4 quantization (integer range [-8, 7]), the scale factor would be 91.5 / 7 ≈ 13.07, and the vector quantizes to [0, 0, 0, 0, 7].
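A minimal sketch of this, assuming symmetric per-vector INT4 with the scale pinned to the largest absolute value (the paper's exact scale convention may differ):

```python
import numpy as np

def quantize_int4(v):
    """Symmetric INT4 quantization: map values to integers in [-8, 7]."""
    scale = np.abs(v).max() / 7  # largest value maps to 7
    q = np.clip(np.round(v / scale), -8, 7).astype(int)
    return q, scale

v = np.array([0.01, -0.02, -0.001, 0.05, 91.5])
q, scale = quantize_int4(v)
print(scale)      # ~13.07
print(q)          # [0 0 0 0 7]
print(q * scale)  # dequantized: [0. 0. 0. 0. 91.5] -- only the outlier survives
```

Every non-outlier dimension rounds to zero: the grid spacing of ~13.07 is far coarser than the information those dimensions carry.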

Btw, something like this can be mitigated by using separate scale factors for groups of dimensions. However, as the number of scale factors goes up, the memory savings go down.

What this paper proposes (which is apparently not an OG idea) is to randomly rotate these vectors before quantizing. Under a completely random rotation, there is very little chance the resulting vector has outliers of this magnitude - which yields significantly better quantization performance. Moreover, the accuracy loss decreases as the number of dimensions goes up (seems intuitive).

A randomly rotated version of the same [0.01, -0.02, -0.001, 0.05, 91.5] might look something like [41.3, -38.7, 44.2, -36.9, 42.8] - rotation preserves the norm (≈ 91.5), so the outlier's mass gets spread across all dimensions. A quantized version of this vector would have much lower error in attention computation.
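To see the effect end to end, here's a NumPy sketch: draw a random rotation via the QR decomposition of a Gaussian matrix (one standard construction - the paper may use faster structured transforms instead of a dense matrix), rotate an outlier-heavy vector, quantize, then rotate back. The 128-dim test vector and outlier magnitude are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_int4(v):
    """Symmetric INT4 round-trip: quantize then dequantize (a sketch)."""
    scale = np.abs(v).max() / 7
    return np.clip(np.round(v / scale), -8, 7) * scale

# Synthetic 128-dim "key" vector: moderate values plus one heavy outlier.
d = 128
v = rng.standard_normal(d)
v[0] = 20.0

# Random rotation: orthonormal Q from QR of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

rv = Q @ v  # same norm as v; the outlier's mass is spread across all dims
err_plain = np.linalg.norm(quantize_int4(v) - v)
err_rot = np.linalg.norm(Q.T @ quantize_int4(rv) - v)  # rotate back, compare
print(err_plain, err_rot)  # rotated reconstruction error is several times smaller
```

Without the rotation, the scale is dictated by the outlier and the moderate values mostly round to zero; after rotation, no single coordinate dominates, so the INT4 grid resolves all of them.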

idea #2

The technique above minimizes the mean squared error between original and quantized vectors, but it introduces biases into inner products (which attention computes between Q and K). These biases are not random, and thus can stack up deterministically.

We need QJL here as well (will write later)