Random Rotation. One Bit Residual. Google's KV Cache Math.
A 1984 lemma, a Beta distribution, and 6x less memory.
6x KV cache reduction | 8x attention speedup | 0 accuracy loss
On LongBench with Llama-3.1-8B, Google's new KV cache quantizer scores 50.06 at 3.5 bits per coordinate. The FP16 baseline scores 50.06. Those are not the same sentence twice.
The method is TurboQuant, from Amir Zandieh and Vahab Mirrokni at Google Research, with collaborators at NYU and KAIST. It shrinks the KV cache to one-sixth its size with zero accuracy loss, using a random rotation and a math trick from 1984. The paper posted to arXiv in April 2025 and is being presented at ICLR 2026 later this month.
The buried lede is not the 6x compression. It is that TurboQuant's distortion is now within roughly 2.7x of the Shannon information-theoretic lower bound. There is very little room left below this line. The compression race for KV cache, along this axis, is almost over.
Context Lengths Ate the GPU
Every time a transformer generates a token, it stores a key and a value vector for that token in every attention layer. Multiply by model depth, multiply by context length, and the KV cache becomes the single biggest memory consumer during inference.
At 128K tokens, a 70B model's KV cache alone consumes around 40 GB, nearly double the headroom available on two H100 SXM5s after loading the model weights. You can run out of memory for the cache before you run out of memory for the model.
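That 40 GB figure is easy to reproduce from the cache formula: two vectors (K and V) per token, per layer, per KV head. A back-of-envelope sketch, assuming Llama-3-70B's published shape (80 layers, 8 grouped-query KV heads, head dimension 128):

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_coord):
    # K and V vectors for every token in every layer
    total_bytes = 2 * layers * kv_heads * head_dim * context_len * bytes_per_coord
    return total_bytes / 1e9

# FP16 stores 2 bytes per coordinate; TurboQuant's 3.5 bits is 3.5/8 bytes
fp16 = kv_cache_gb(80, 8, 128, 128_000, 2)
quant = kv_cache_gb(80, 8, 128, 128_000, 3.5 / 8)
print(f"{fp16:.1f} GB -> {quant:.1f} GB")  # ~41.9 GB -> ~9.2 GB
```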
The obvious fix is to quantize the KV cache the way we quantize model weights: knock FP16 down to 4 bits, store less. But existing quantizers break in a specific way.
Standard methods like per-channel min-max store normalization constants for every small block of data, which adds 1 to 2 extra bits per coordinate in overhead and eats most of your savings.
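The overhead arithmetic is simple: a block-wise quantizer typically stores an FP16 scale and an FP16 zero-point per block, so the metadata cost per coordinate is 32 bits divided by the block size. A quick sketch with common block sizes (illustrative, not any particular library's defaults):

```python
# FP16 scale + FP16 zero-point = 32 bits of metadata per block
for block_size in (32, 16):
    overhead = 32 / block_size  # extra bits per coordinate
    print(f"block of {block_size}: +{overhead:.0f} bits/coord")
```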
More importantly, MSE-optimal quantizers introduce a systematic bias in the dot products that compute attention scores. The model starts attending to the wrong tokens. Quality collapses at the bit widths you actually want.
Rotate, Then Sign-Correct
TurboQuant splits the compression budget into two stages. Neither stage needs training, calibration data, or a fine-tuning pass. Both are data-oblivious, which means the same quantizer works on any model out of the box.
Stage 1: PolarQuant (rotate, then scalar-quantize)
1. random orthogonal rotation
2. coordinates become Beta(d/2, d/2)
3. optimal Lloyd-Max scalar quantizer, (b-1) bits
→ no normalization constants stored

Stage 2: QJL (1984 math, correct the residual bias)
1. take the Stage 1 residual error
2. Johnson-Lindenstrauss projection
3. retain the sign bit, +1 or -1
→ 1 bit total, unbiased dot products
Stage 1: Rotate Into a Known Distribution
Multiply any input vector by a random orthogonal matrix and, in high dimensions, each coordinate of the result follows a concentrated Beta distribution. Any two distinct coordinates become nearly independent. This is not magic; it is concentration of measure.
The practical consequence is large. Because the post-rotation distribution is known in advance, you do not need to store per-block scale and zero-point constants.
You can apply the optimal Lloyd-Max scalar quantizer to each coordinate independently, using a codebook computed once at compile time. Traditional quantizers spend 1 to 2 bits per coordinate holding normalization constants. PolarQuant spends zero. That is Stage 1, and it uses (b-1) bits per coordinate.
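The concentration effect is easy to see numerically. The sketch below (a NumPy toy, not the paper's kernel) rotates the worst-case input, a vector with all of its mass on one coordinate, by a random orthogonal matrix and checks that the result spreads into a fixed, concentrated distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

# Random orthogonal matrix: QR decomposition of a Gaussian matrix
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Worst-case input: all mass on a single coordinate
x = np.zeros(d)
x[0] = 1.0

y = Q @ x  # rotation preserves the norm but spreads the mass
print(f"norm {np.linalg.norm(y):.3f}, coord std {y.std():.3f}")
# norm stays 1.0; coordinate std concentrates near 1/sqrt(d) ~ 0.088
```

Because the post-rotation coordinate distribution is fixed in advance, the Lloyd-Max codebook for it can be computed once offline; nothing data-dependent needs to be stored alongside the cache.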
Stage 2: One Bit Fixes the Bias
Scalar quantization minimizes MSE but introduces a small bias in dot products. For the attention mechanism, which computes query-times-key to produce attention scores, this bias accumulates across tokens. Unchecked, it flips which tokens the model attends to.
The Johnson-Lindenstrauss lemma, published in 1984, says that random projections from high dimensions to low dimensions preserve pairwise distances approximately. Zandieh and colleagues apply this with an extreme twist: project the residual error down through a random Gaussian matrix, then retain the sign bit of each projected value.
One bit per coordinate of overhead. The result is a provably unbiased estimator for the original dot product.
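The sign-bit estimator can be sketched in a few lines. This is a toy reconstruction of the idea, not the paper's code: for a Gaussian matrix S, the expectation of sign(S r) times (S q) is proportional to the dot product of q and r divided by the norm of r, so keeping the sign bits plus one scalar (the residual norm) yields an unbiased dot-product estimate after rescaling by sqrt(pi/2):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 128, 8192  # m is oversized here purely to show convergence

q = rng.standard_normal(d)  # a query vector
r = rng.standard_normal(d)  # stand-in for the Stage 1 residual error

S = rng.standard_normal((m, d))  # Johnson-Lindenstrauss projection
bits = np.sign(S @ r)            # what gets stored: one sign bit per row

# Unbiased estimate of <q, r> from the sign bits and the residual norm
est = np.linalg.norm(r) * np.sqrt(np.pi / 2) * np.mean(bits * (S @ q))
print(f"estimate {est:.2f} vs exact {q @ r:.2f}")
```

In practice the sketch dimension is small; the point is that the estimator's error averages out across tokens instead of accumulating as a bias.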
Total bit budget: (b-1) from Stage 1 plus 1 from Stage 2 equals b bits per coordinate. At b=3, the method achieves what the paper calls absolute quality neutrality on LongBench, Needle-In-A-Haystack, ZeroSCROLLS, RULER, and L-Eval. At b=4, it matches FP16 across every benchmark tested.
Two Users to Eleven on the Same H100
On LongBench with Llama-3.1-8B-Instruct, TurboQuant at 3.5 bits scores 50.06. The FP16 baseline scores 50.06. The next-best comparable method, KIVI at 3 bits, scores 48.50. At 4x compression, TurboQuant achieves 0.997 retrieval accuracy on Needle-In-A-Haystack tests across document lengths from 4K to 104K tokens.
On attention speed, the paper reports up to 8x speedup over a JAX FP16 baseline on an NVIDIA H100 in 4-bit mode. Community reimplementations have independently validated the quality claims on Qwen2.5-3B-Instruct (99.5% cosine similarity at 3-bit), Gemma 3 4B on an RTX 4090 (character-identical output at 2-bit), and MoE architectures on 8x RTX 3090.
The economic picture: on a single H100 SXM5 at $5.80 per hour, a 70B model serving 32K context previously supported about 2 concurrent users because the KV cache ate all the remaining memory. With TurboQuant, the same setup supports 11 concurrent users. That drops cost-per-user from roughly $2,088 per month to $380 per month. At 128K context the ratio becomes even more dramatic, because the baseline requires additional GPUs simply to have headroom.
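The cost-per-user figures follow from straight division, assuming 24/7 utilization of the card:

```python
hourly = 5.80               # H100 SXM5 on-demand rate quoted above
monthly = hourly * 24 * 30  # ~$4,176 per card per month
print(f"FP16: ${monthly / 2:,.0f}/user  TurboQuant: ${monthly / 11:,.0f}/user")
# 2 users -> ~$2,088 each; 11 users -> ~$380 each
```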
Key Insights
Random rotation is a zero-cost pre-processor that converts unknown distributions into known ones. This is the structural insight worth carrying forward. If you have a quantizer that performs optimally on a specific distribution, and you have data from an unknown distribution, a random orthogonal matrix gets you most of the way.
The cost is a matrix multiply at quantization time and a matrix multiply at dequantization time. That is cheap compared to what you gain: you can precompute the optimal codebook once, offline, and never touch it again.
Data-oblivious beats calibrated for practical deployment. NVIDIA's rival KVTC achieves 20x compression, higher than TurboQuant's 6x, but it requires a per-model PCA calibration step offline.
TurboQuant works on any model instantly with no dataset preparation. For teams shipping many models or serving heterogeneous workloads, the operational simplicity of skipping calibration beats a higher compression ratio.
The Shannon-limit framing is the most important part of the paper. TurboQuant's error at 3-bit is within roughly 2.7x of the information-theoretic lower bound for lossy compression of Gaussian vectors.
If you are building KV cache compression along this axis, that 2.7x gap is the entire frontier you have left to close. Future work will come from attacking the problem from different angles, like entropy coding the quantized output, exploiting cross-layer redundancy, or using model-specific structure that a data-oblivious method cannot see.
The memory chip selloff was wrong for Jevons paradox reasons. When TurboQuant published, memory chip stocks dropped 3 to 6 percent on the logic that AI systems would need less memory.
But cheaper memory per token historically causes more inference, not less. The total serving demand goes up. The demand curve for inference at long context was never close to saturated at the old prices.
Quick Hits
No official Google code, but the community already shipped. Google Research has not released an implementation as of April 2026. Independent engineers have built working versions in Triton/PyTorch, Rust (with vLLM integration), and MLX. At least one has validated character-identical output to FP16 at 2-bit precision on Gemma 3 4B.
Value vectors are more sensitive than keys. Community benchmarks have found that value quantization breaks first. Two-bit values drop cosine similarity to around 0.94, while 4-bit values hold at 0.997. The practical rule: 3-bit keys, 4-bit values, and you land almost exactly at the FP16 baseline.
The companion papers matter. QJL was published at AAAI 2025, PolarQuant is scheduled for AISTATS 2026, and TurboQuant is the umbrella at ICLR 2026. The underlying ideas have been public and peer-reviewed for over a year. The practitioner rediscovery cycle is sometimes slow.
Small models break the method. On models under 3B parameters, quantization noise from TurboQuant produces repetitive output, especially at 3-bit. If you serve small models, test carefully before deploying.
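The key-versus-value sensitivity above comes from measurements on real model activations, but the probe itself is easy to sketch. The toy below uses crude uniform round-to-nearest quantization on a synthetic Gaussian vector (deliberately not TurboQuant's Lloyd-Max codebook), just to show how the cosine-similarity check works and that fidelity climbs with bit width; the absolute numbers will not match the community figures:

```python
import numpy as np

rng = np.random.default_rng(2)
v = rng.standard_normal(4096)  # stand-in for a value vector

def uniform_quantize(x, bits, clip_sigmas=4.0):
    # Symmetric round-to-nearest over +/- clip_sigmas std devs; NOT Lloyd-Max
    levels = 2 ** bits
    step = 2 * clip_sigmas * x.std() / levels
    return np.clip(np.round(x / step), -(levels // 2), levels // 2 - 1) * step

cos = {}
for bits in (2, 3, 4):
    q = uniform_quantize(v, bits)
    cos[bits] = (v @ q) / (np.linalg.norm(v) * np.linalg.norm(q))
    print(f"{bits}-bit: cosine similarity {cos[bits]:.4f}")
```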
The Take
The most interesting thing about TurboQuant is how old the math is. The Johnson-Lindenstrauss lemma is from 1984. Lloyd-Max scalar quantization is from the 1960s. What changed is someone noticed that a random rotation bridges the two.
The KV cache problem was an unknown-distribution problem. Google solved it by multiplying by a random matrix and bolting a 1984 lemma onto the residual.
For this week: if you run long-context inference at 32K or beyond, the no-calibration property matters more than the raw compression ratio. You can test a community implementation against your existing workload in an afternoon. Expect to land at 3-bit keys and 4-bit values with no measurable quality drop on any model over 3B parameters.
The paid archive has a step-by-step TurboQuant integration walkthrough for vLLM, including Triton kernel patterns and a benchmark harness for comparing compressed versus FP16 attention scores on your own workload.
The Open Question
TurboQuant is near-optimal for lossy compression of Gaussian vectors. NVIDIA's KVTC reaches 20x compression using a different approach: PCA decorrelation and entropy coding borrowed from JPEG. Both are at ICLR 2026 later this month.
The question neither paper answers yet: can you compose them? A data-oblivious rotation stage, followed by a calibrated entropy coder, followed by the one-bit residual corrector. That might get you past 10x with quality neutrality. Nobody has published the experiment.
Google compressed the KV cache to near the information-theoretic limit using a random rotation and a 1984 lemma. The math is older than the hardware it runs on.
Next week: Meta's Muse Spark shipped with 16 tools wired into its chat harness. We walk through each one and what it reveals about how the big labs are building agent scaffolding in 2026.


