TurboQuant Insights

A recent paper from Google claims 6x compression of model memory - I finally got around to reading it. A lot of the short, quick-to-digest blog posts simplify it a little too much, I think (especially by glossing over point 2).

  1. Randomly rotating vectors before quantizing is not a silver bullet; it exploits a pattern specific to KV caches, i.e., the trick works well precisely for KV-cache vectors.
  2. Quantization, with or without rotation, is not accurate enough for dot products and can alter the ranking of tokens because of rounding biases.

idea #1

We want to quantize vectors, but some vectors suffer more under scale-based quantization than others.

For example, a vector with a heavy outlier, something like [0.01, -0.02, -0.001, 0.05, 91.5], gets its scale factor dominated by that outlier. Quantizing with something like INT4 then zeros out all the other values. The KV cache is full of vectors like this.

With symmetric INT4 quantization (integer levels -7 to 7), the scale factor would be about 13.07 (91.5 / 7), and the vector quantizes to [0, 0, 0, 0, 7].
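Here's that round trip as a minimal numpy sketch, assuming symmetric INT4 with a single per-vector scale:

```python
import numpy as np

v = np.array([0.01, -0.02, -0.001, 0.05, 91.5])

# Symmetric INT4: integers in [-7, 7], one floating-point scale per vector.
scale = np.abs(v).max() / 7        # ~13.07, dominated by the outlier
q = np.round(v / scale)            # [0., 0., 0., 0., 7.]
print(q, scale)
print(q * scale)                   # dequantized: everything but the outlier is gone
```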

Btw, something like this can be mitigated by using separate scale factors for groups of dimensions. However, as the number of scale factors goes up, the memory savings go down - a rough sketch of the trade-off is below.
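Here's that group-wise variant (a hypothetical `int4_grouped` helper, not from the paper):

```python
import numpy as np

def int4_grouped(x, group=2):
    """INT4 round trip with one scale per `group` dims instead of per vector."""
    out = np.empty_like(x)
    for i in range(0, len(x), group):
        blk = x[i:i + group]
        s = max(np.abs(blk).max() / 7, 1e-12)  # guard against all-zero groups
        out[i:i + group] = np.round(blk / s) * s
    return out

v = np.array([0.01, -0.02, -0.001, 0.05, 91.5])
print(int4_grouped(v))  # small values survive now, but we store 3 scales for 5 dims
```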

What this paper proposes (which is apparently not an OG idea) is to randomly rotate these vectors before quantizing. When rotated completely randomly, there is very little chance the resulting vector has outliers of this magnitude, resulting in significantly better quantization performance. Moreover, the accuracy loss decreases as the number of dimensions goes up (seems intuitive).

A randomly rotated version of the same [0.01, -0.02, -0.001, 0.05, 91.5] might look something like [29.5, 44.2, -38.1, 31.6, 55.66], which preserves the L2 norm (about 91.5). Under the same INT4 scheme, this quantizes to roughly [4, 6, -5, 4, 7] with a scale factor of about 7.95 (55.66 / 7), which yields much lower error in attention computation.
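Here's a sketch of the effect in a more realistic dimension, with a milder outlier so the benefit is visible. The dense QR-based rotation is just for illustration; fast implementations typically use something like a randomized Hadamard transform instead:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

# A KV-cache-like vector: mostly small values plus one outlier channel.
k = rng.normal(size=d)
k[0] = 30.0

def int4(x):
    """Symmetric INT4 round trip: integer levels -7..7, one scale per vector."""
    s = np.abs(x).max() / 7
    return np.round(x / s) * s

# Random orthogonal matrix via QR of a Gaussian matrix (norm-preserving).
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

err_plain = np.linalg.norm(int4(k) - k)
err_rot = np.linalg.norm(R.T @ int4(R @ k) - k)  # rotate, quantize, rotate back
print(f"plain INT4 error:   {err_plain:.2f}")
print(f"rotated INT4 error: {err_rot:.2f}")
```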

This technique is referred to as PolarQuant.

idea #2

Quantizers that minimize the MSE between the original and the reconstructed vector do so via smarter rounding: each value latches onto its nearest centroid.

For example, suppose the quantizer only allows centroids at 0 and 1, and we use a query vector like [1, 1]. Two keys [0.49, 0.49] and [0.51, 0.51] are almost the same in the original space, giving inner products of 0.98 and 1.02. But after MSE-optimal rounding they become [0, 0] and [1, 1], and now the inner products jump to 0 and 2. So even though the reconstruction error per vector is small, the attention score error is not really just noise. It is a result of centroid selection.
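The same numbers as a tiny sketch (with centroids at 0 and 1, plain rounding is the MSE-optimal choice):

```python
import numpy as np

q = np.array([1.0, 1.0])
keys = [np.array([0.49, 0.49]), np.array([0.51, 0.51])]

for k in keys:
    k_hat = np.round(k)              # nearest centroid in {0, 1}
    print(q @ k, "->", q @ k_hat)    # 0.98 -> 0.0, 1.02 -> 2.0
```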

We need to debias this with the smallest memory footprint possible. In TurboQuant this is done by constructing a crude but unbiased estimator that costs just 1 bit per dimension. The paper calls it QJL, or Quantized Johnson-Lindenstrauss.

There's some math involved in deriving the exact correction value, and what we get is not even a good estimator - it's quite noisy - but it is unbiased: its expected value is exactly the correction term we need.
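Here's a sketch of what a QJL-style estimator could look like: quantize the key side down to the sign bits of a random Gaussian projection (plus the key's norm), then rescale so the expectation works out. The projection, the sqrt(pi/2) constant, and the function name are my reconstruction of the idea, not necessarily the paper's exact scheme:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
q = rng.normal(size=d)
k = rng.normal(size=d)
k_norm = np.linalg.norm(k)

def qjl_estimate(q):
    S = rng.normal(size=(d, d))        # fresh Gaussian projection each trial
    bits = np.sign(S @ k)              # all we keep of k: d sign bits + its norm
    # For jointly Gaussian rows: E[<s, q> * sign(<s, k>)] = sqrt(2/pi) * <q, k> / ||k||,
    # hence the sqrt(pi/2) * ||k|| / d rescaling below.
    return np.sqrt(np.pi / 2) / d * k_norm * (S @ q) @ bits

true = q @ k
ests = [qjl_estimate(q) for _ in range(2000)]
print(true, np.mean(ests))             # any single estimate is noisy; the mean isn't
```

In practice a single fixed projection would be shared across all keys; the fresh projection per trial here is only to show the unbiasedness empirically.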

Conclusion

TurboQuant manages to fuse the memory gains of quantization with a debiased correction estimator, keeping the accuracy loss very minimal. The idea of rotation before quantization can also be found in papers like RaBitQ.