TurboQuant: Google’s New Compression Trick That Actually Works

Google Research just dropped three new compression algorithms at ICLR and AISTATS this year, and one of them — TurboQuant — is genuinely interesting. Not because it’s another “breakthrough” that’ll change everything overnight, but because it solves a stupid problem that’s been bugging everyone working with large language models and vector search.

The problem? Vector quantization is great for shrinking models, but it comes with this annoying memory overhead. Most methods need to store quantization constants in full precision for every small block of data, which adds 1-2 extra bits per number. That partially defeats the purpose of compression in the first place. It’s like buying a smaller apartment but then filling it with giant furniture.
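
To make that overhead concrete, here is a minimal toy sketch (my own illustration, not the paper's scheme) of standard blockwise int4 quantization: every block of 32 values drags along its own full-precision scale, which works out to a full extra bit per number.

```python
import numpy as np

def blockwise_int4_quantize(x, block_size=32):
    """Toy blockwise quantizer. Each block of 32 values needs its own
    fp32 scale: 32 bits of metadata per 32 values, i.e. 1 extra bit
    per number on top of the 4-bit codes themselves."""
    blocks = x.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0 + 1e-12  # one fp32 constant per block
    codes = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return codes, scales  # the scales are the hidden memory overhead

x = np.random.randn(1024).astype(np.float32)
codes, scales = blockwise_int4_quantize(x)
print("overhead bits per value:", scales.size * 32 / x.size)  # -> 1.0
```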

TurboQuant fixes that. And it does it without sacrificing accuracy.

The two-step dance

TurboQuant works in two stages. First, it randomly rotates the data vectors — a clever geometric trick that simplifies the data’s structure, making it easier to apply standard quantization. This first stage uses most of the compression budget to capture the main signal.
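
Here is a minimal sketch of stage one under my reading of the paper, with function names of my own choosing; I use a dense random orthogonal matrix for clarity, while a real implementation would presumably use a faster structured rotation.

```python
import numpy as np

def random_rotation(d, seed=0):
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def rotate_and_quantize(x, rotation, bits=4):
    """Stage 1: rotate the vector, then apply a plain uniform quantizer.
    The rotation spreads energy evenly across coordinates, so a single
    scale works well and fewer levels are wasted on outliers."""
    z = rotation @ x
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(z).max() / levels
    codes = np.clip(np.round(z / scale), -levels - 1, levels).astype(np.int8)
    residual = z - codes * scale  # the small leftover error handled in stage 2
    return codes, scale, residual
```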

Then comes the interesting part. TurboQuant takes the small error left over from the first stage and compresses it with something called the Quantized Johnson-Lindenstrauss (QJL) algorithm, using just 1 bit per entry. Think of it as a cheap correction pass that removes the bias the coarse quantizer would otherwise leave in the attention score calculations.
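
Continuing the sketch with stage two (again my own reading; the `residual` comes from the stage-one function above, and in practice the projection matrix would be regenerated from a shared seed rather than stored):

```python
import numpy as np

def one_bit_residual(residual, m=64, seed=1):
    """Stage 2: project the leftover error with a random Gaussian matrix
    and keep only sign bits, plus the residual's norm as a single scalar."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((m, residual.shape[0]))
    signs = np.where(S @ residual >= 0, 1, -1).astype(np.int8)  # one bit per projected coordinate
    return signs, np.linalg.norm(residual)
```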

This two-stage approach means you get high compression without the hidden errors that usually creep in when you try to squeeze too hard.

The 1-bit trick that costs nothing

QJL itself is worth talking about separately. It uses the Johnson-Lindenstrauss Transform to shrink high-dimensional data while preserving the essential distances between points. Each resulting vector number gets reduced to a single sign bit — +1 or -1. That’s it.

The magic is that this requires zero memory overhead. To maintain accuracy, QJL uses a special estimator that balances a high-precision query with the low-precision simplified data. This lets the model calculate attention scores accurately without the usual memory tax.
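
Here is a rough sanity check of that estimator as I understand it from the QJL paper; the function names are mine, and the sqrt(pi/2) factor is the standard correction for keeping only signs. Note that the compressed side carries just one norm per vector rather than a constant per block.

```python
import numpy as np

def qjl_encode(key, S):
    """Key side: one sign bit per projected coordinate plus the key's norm."""
    signs = np.where(S @ key >= 0, 1, -1).astype(np.int8)
    return signs, np.linalg.norm(key)

def qjl_inner_product(query, key_signs, key_norm, S):
    """Asymmetric estimate of <query, key>: the query stays in full
    precision while the key is 1-bit; sqrt(pi/2) undoes the bias
    introduced by the sign step."""
    m = S.shape[0]
    return key_norm * np.sqrt(np.pi / 2) / m * float((S @ query) @ key_signs)

# quick check on correlated vectors
d, m = 128, 512
rng = np.random.default_rng(0)
S = rng.standard_normal((m, d))
q = rng.standard_normal(d)
k = q + 0.5 * rng.standard_normal(d)
signs, norm = qjl_encode(k, S)
print("true:", q @ k, "estimate:", qjl_inner_product(q, signs, norm, S))
```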

I’ve seen similar ideas tried before with binary quantization, but they always introduced significant accuracy loss. Google’s results here are better than I expected.

PolarQuant: A different angle

The third algorithm, PolarQuant, takes a completely different approach. Instead of representing vectors using standard Cartesian coordinates (X, Y, Z), it converts them into polar coordinates — angles and magnitudes. This eliminates the need for those expensive quantization constants entirely.
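
I haven't dug into PolarQuant's exact scheme, but the core idea can be sketched like this (a toy version with my own parameter choices): split the vector into 2-D pairs, store each pair as a radius and an angle, and the angle then lives in the fixed range [-pi, pi], so it can be quantized on a shared uniform grid with no stored constants at all.

```python
import numpy as np

def to_polar_pairs(x):
    """View a vector as d/2 two-dimensional pairs and convert each pair
    to (radius, angle). The angle always lies in [-pi, pi]."""
    pairs = x.reshape(-1, 2)
    radius = np.linalg.norm(pairs, axis=1)
    angle = np.arctan2(pairs[:, 1], pairs[:, 0])
    return radius, angle

def quantize_angle(angle, bits=4):
    """Uniform grid over the known [-pi, pi) range: no per-block scales."""
    levels = 2 ** bits
    codes = np.floor((angle + np.pi) / (2 * np.pi) * levels)
    return np.clip(codes, 0, levels - 1).astype(np.int8)

def dequantize_angle(codes, bits=4):
    levels = 2 ** bits
    return (codes + 0.5) * (2 * np.pi / levels) - np.pi
```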

The trade-off is that PolarQuant works best for specific types of data distributions. It’s not a universal solution, but for key-value cache compression, it performs remarkably well.

What this means in practice

All three techniques showed great promise for reducing key-value bottlenecks without sacrificing model performance. For anyone running large language models in production, this is directly relevant. The KV cache is one of the biggest memory hogs in transformer inference, and anything that shrinks it without hurting quality is worth paying attention to.

The paper claims zero accuracy loss with massive compression ratios. I’m always skeptical of “zero loss” claims — there’s usually some degradation at scale — but the theoretical foundations here are solid. These aren’t heuristic hacks; they’re mathematically grounded algorithms with rigorous proofs.

The bigger picture

What I find interesting is that Google is releasing this as open research rather than keeping it internal. The ICLR and AISTATS publications suggest they want community adoption. That’s good for everyone working on efficient AI inference.

The real test will be how these algorithms perform in production environments with real-world data distributions. Benchmarks are one thing; deployment at scale is another. But the theoretical advances here are genuinely new, and the problem they solve is real.

If you’re working on vector search or LLM inference, TurboQuant is worth a closer look. The code should be available after the conference presentations, and I’ll be testing it myself when it drops.
