FP-Quant QAT Collection
High-quality QAT FP4 models to use with the fp_quant vLLM/Transformers integration on Blackwell NVIDIA GPUs. See https://arxiv.org/abs/2509.23202
This is the official QAT FP-Quant checkpoint of meta-llama/Llama-3.2-1B-Instruct, produced as described in the "Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization" paper.
This model can be run on Blackwell-generation NVIDIA GPUs via QuTLASS and FP-Quant in either transformers or vLLM.
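For example, here is a minimal transformers sketch. It assumes the checkpoint ships with its FP-Quant quantization config, so a plain `from_pretrained` call dispatches to the QuTLASS kernels; the repository id used below is a placeholder, not the actual id of this checkpoint.

```python
# Minimal sketch: load and run the FP4 QAT checkpoint via the transformers fp_quant
# integration on a Blackwell GPU with QuTLASS installed.
# The repository id below is a placeholder for this checkpoint's actual id.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "ISTA-DASLab/Llama-3.2-1B-Instruct-FPQuant-QAT"  # placeholder id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto", device_map="auto")

prompt = "Explain microscaling FP4 quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For the vLLM path mentioned above, the same repository id would be passed to `vllm serve` or the `LLM` class.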
The approximate recipe for training this model (up to local batch size and learning rate) is available here.
This checkpoint achieves the following accuracy relative to the original model and round-to-nearest (RTN) quantization:
| Model | MMLU | GSM8k | HellaSwag | Winogrande | Avg |
|---|---|---|---|---|---|
| meta-llama/Llama-3.2-1B-Instruct | 46.2 | 46.3 | 59.8 | 61.6 | 53.5 |
| RTN | 30.9 | 19.4 | 51.6 | 57.2 | 39.8 |
| QAT (this model) | 28.1 | 36.9 | 57.0 | 58.8 | 45.2 |
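For reference, a hedged sketch of how comparable numbers could be reproduced with EleutherAI's lm-evaluation-harness; the exact task variants, few-shot settings, and repository id behind the table above are assumptions, not taken from this card.

```python
# Hypothetical reproduction sketch using lm-evaluation-harness (pip install lm-eval).
# Task names, batch size, and the repository id are assumptions; the card does not
# state how the numbers above were produced.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ISTA-DASLab/Llama-3.2-1B-Instruct-FPQuant-QAT,dtype=auto",  # placeholder id
    tasks=["mmlu", "gsm8k", "hellaswag", "winogrande"],
    batch_size=8,
)

# Print per-task metrics, e.g. {"mmlu": {"acc,none": ...}, ...}
for task, metrics in results["results"].items():
    print(task, metrics)
```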