Gemma 4 QAT models shrink memory, speed decoding

For anyone hoping to run advanced AI on a phone without turning the device into a heat source, the latest Gemma 4 update lands at a practical moment: smaller models that need less memory to run.
Google has released new Gemma 4 model checkpoints using quantization-aware training (QAT), now available for download. The goal is straightforward: quantization reduces the amount of memory required to run lightweight models, but Google says doing that work during training rather than afterward helps keep quality steadier. These Gemma 4 versions are designed to retain performance better than checkpoints refined with post-training quantization (PTQ), and they’re also intended to accelerate decode speed.
The company frames the shift around a core difference in approach. PTQ quantizes a model after training, which Google says can weaken performance. With QAT, quantization is incorporated into the training process itself, producing checkpoints that perform better than models using PTQ, according to Google’s blog post.
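To make the distinction concrete, here is a minimal sketch of the quantize-then-dequantize round trip that quantization schemes of this kind typically rely on. It is a generic illustration, not Google's implementation: the function name `fake_quantize`, the symmetric 4-bit grid, and the sample weights are all assumptions chosen for the example. In a PTQ workflow this rounding would be applied once to already-trained weights; in common QAT setups the same rounding runs inside the training forward pass, so the weights can adapt to the error it introduces.

```python
def fake_quantize(weights, bits=4):
    """Round weights to a symmetric signed integer grid, then map back to floats.

    Hypothetical helper for illustration only; real toolkits use per-channel
    or per-block scales and more elaborate schemes.
    """
    qmax = 2 ** (bits - 1) - 1                      # largest code, e.g. 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # float value of one grid step
    # Integer codes, clamped to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    # Dequantized values: what the model actually computes with.
    dequantized = [c * scale for c in codes]
    return dequantized, codes, scale


weights = [0.82, -0.31, 0.05, -0.77]  # made-up sample values
dequantized, codes, scale = fake_quantize(weights, bits=4)
errors = [abs(w - d) for w, d in zip(weights, dequantized)]

print("integer codes:", codes)
print("largest rounding error:", max(errors))
```

The rounding error per weight is bounded by half a grid step. PTQ simply accepts that error after the fact, while QAT lets training compensate for it.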
To make the compressed models practical on everyday hardware, Google also points to a custom mobile-quantization schema. That method uses pre-calculated settings, 2-bit compression in certain parts of the model, and vocabulary list and short-term memory compression. The end result for users is a smaller model that consumes less system memory—exactly the kind of tradeoff that matters when compute is limited.
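As a rough illustration of why 2-bit storage saves space, the sketch below packs four 2-bit codes into a single byte. The bit layout and the function names `pack_2bit` and `unpack_2bit` are assumptions made for this example; Google's article does not describe the actual storage format of its mobile schema.

```python
def pack_2bit(codes):
    """Pack integers in 0..3 into bytes, four codes per byte (lowest bits first)."""
    if any(c < 0 or c > 3 for c in codes):
        raise ValueError("each code must fit in 2 bits (0..3)")
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, c in enumerate(codes[i:i + 4]):
            byte |= c << (2 * j)  # place each code in its own 2-bit slot
        packed.append(byte)
    return bytes(packed)


def unpack_2bit(packed, count):
    """Recover the first `count` 2-bit codes from packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]


codes = [3, 0, 1, 2, 2, 1]  # made-up sample codes
packed = pack_2bit(codes)
assert unpack_2bit(packed, len(codes)) == codes
print(len(codes), "codes stored in", len(packed), "bytes")
```

Each weight stored this way takes a quarter of a byte, compared with two bytes for a bfloat16 value.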
The new QAT-optimized Gemma 4 checkpoints come in five sizes: Gemma 4 E2B, Gemma 4 E4B, Gemma 4 12B, Gemma 4 26B A4B, and Gemma 4 31B. Google also says the smallest versions, including the text-only Gemma 4 E2B model, require less than a gigabyte of memory to run, making them a fit for phones. The company also shared approximate memory requirements for loading each size of the new QAT models.
There’s another reason the release feels built for real-world use: it’s not just one download. Google says there are four different formats available for download: unquantized QAT checkpoints, GPT-Generated Unified Format (GGUF), mobile-optimized models, and Compressed Tensors. In Google’s words, the models preserve “similar quality to bfloat16 while dramatically reducing the memory requirements to load the model.”
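A back-of-the-envelope calculation shows how bit width drives the memory needed just to hold the weights. The parameter count below is a placeholder chosen for round numbers, not a figure from Google, and the estimate ignores activations, cache memory, and the small overhead of storing quantization scales.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Placeholder parameter count for illustration; not an official Gemma 4 figure.
HYPOTHETICAL_PARAMS = 2_000_000_000

for bits in (16, 8, 4, 2):
    gb = weight_memory_gb(HYPOTHETICAL_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {gb:.2f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts storage by a factor of four, which is the kind of reduction that can bring a small model toward the one-gigabyte range.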
After users download the Gemma 4 QAT model weights, the checkpoints can be run on phones, laptops, or desktops. Google notes that the mobile and desktop models are available on Hugging Face as well as in LM Studio.
The timing also matters. Google’s launch of the laptop-grade Gemma 4 12B model earlier this week is now followed by these new QAT-based checkpoints—an immediate expansion of the line aimed at squeezing more performance out of the same, limited hardware.
So they made it smaller but it still works right? Idk why my phone would be “heat source” anyway lol
This reads like they just made the AI cheaper to run on your phone. Like “less memory” but do they mean it’s worse? Also “decode speed” sounds like streaming, but I’m probably misunderstanding.
I don’t get the QAT vs PTQ thing. Isn’t quantization the same no matter when you do it? Like if it’s 2-bit compression then it’s gonna lose accuracy, period. But maybe I’m mixing it up with video codecs.
“Less than a gigabyte” for Gemma 4?? That feels too good to be true. My coworker said all AI on phones is basically smoke and mirrors. But if they’re doing short-term memory compression too, does that mean it forgets faster? I’m just guessing, I didn’t finish the whole thing.