HQQ-INT8-INT4 google/gemma-3-4b-it model
- Developed by: pytorch
- License: apache-2.0
- Quantized from Model: google/gemma-3-4b-it
- Quantization Method: HQQ-INT8-INT4
- Terms of Use: Terms
Gemma3-4B is quantized by the PyTorch team using torchao, with 8-bit embeddings and 8-bit dynamic activation + 4-bit weight (INT8-INT4) linear layers. The model is suitable for mobile deployment with ExecuTorch.
We provide the quantized pte file for direct use in ExecuTorch. (The provided pte file is exported with a max_seq_length/max_context_length of 1024; if you wish to change this, re-export the quantized model following the instructions in Exporting to ExecuTorch.)
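For a quick check in Python (rather than on device), the already-quantized checkpoint can also be loaded directly with transformers; a minimal sketch, assuming torchao and a recent transformers release are installed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the pre-quantized checkpoint from the Hub; the torchao quantization
# config stored with the model is applied automatically on load.
model_id = "pytorch/gemma-3-4b-it-HQQ-INT8-INT4"
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```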
Running in a mobile app
To run in a mobile app, download the quantized pte and tokenizer and follow the instructions here.
Quantization Recipe
First, install the required packages:
pip install git+https://github.com/huggingface/transformers@main
pip install --pre torchao torch --index-url https://download.pytorch.org/whl/nightly/cu126
Untie weights
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    AutoTokenizer,
    TorchAoConfig,
)
from transformers.modeling_utils import find_tied_parameters

model_id = "google/gemma-3-4b-it"
MODEL_NAME = model_id.split("/")[-1]
save_to_local_path = f"{MODEL_NAME}-untied-weights"

untied_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="cuda:0"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Untie lm_head from the input embeddings so the two can be quantized independently
if getattr(
    untied_model.config.get_text_config(decoder=True), "tie_word_embeddings"
):
    setattr(
        untied_model.config.get_text_config(decoder=True),
        "tie_word_embeddings",
        False,
    )
    untied_model._tied_weights_keys = []
    untied_model.lm_head.weight = torch.nn.Parameter(
        untied_model.lm_head.weight.clone()
    )

print("tied weights:", find_tied_parameters(untied_model))

# save locally
untied_model.save_pretrained(save_to_local_path)
tokenizer.save_pretrained(save_to_local_path)
processor.save_pretrained(save_to_local_path)
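As an optional sanity check (not part of the original recipe), you can reload the saved checkpoint and confirm that nothing is tied anymore:

```python
from transformers import AutoModelForCausalLM
from transformers.modeling_utils import find_tied_parameters

# Reload the locally saved untied checkpoint; an empty result means
# lm_head and embed_tokens are now separate tensors.
reloaded = AutoModelForCausalLM.from_pretrained(
    save_to_local_path, torch_dtype="auto"
)
print("tied weights after reload:", find_tied_parameters(reloaded))
```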
Quantization
We used the following code to get the quantized model:
from torchao.quantization.quant_api import (
    IntxWeightOnlyConfig,
    Int8DynamicActivationIntxWeightConfig,
    ModuleFqnToConfig,
    quantize_,
)
from torchao.quantization.granularity import PerGroup, PerAxis
import torch

USER_ID = "YOUR_USER_ID"

# We start from the model with untied weights
model_to_quantize = save_to_local_path

int8_int4_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
    intx_choose_qparams_algorithm="hqq_scale_only",
)
int8_int8_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int8,
    weight_granularity=PerAxis(0),
    intx_choose_qparams_algorithm="hqq_scale_only",
)
int8_weight_only_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int8,
    granularity=PerAxis(0),
    intx_choose_qparams_algorithm="hqq_scale_only",
)

# Map module fully-qualified names to quantization configs: linears default to
# int8 dynamic activation + int4 weight, embeddings use int8 weight-only, and the
# vision tower's fc2 layers use int8 dynamic activation + int8 weight.
fqn_to_config = {}
fqn_to_config["_default"] = int8_int4_config
fqn_to_config["model.language_model.embed_tokens"] = int8_weight_only_config
fqn_to_config["model.vision_tower.vision_model.embeddings.position_embedding"] = int8_weight_only_config
for i in range(27):
    fqn_to_config[
        f"model.vision_tower.vision_model.encoder.layers.{i}.mlp.fc2"
    ] = int8_int8_config
quant_config = ModuleFqnToConfig(fqn_to_config)

quantization_config = TorchAoConfig(
    quant_type=quant_config,
    include_input_output_embeddings=True,
    modules_to_not_convert=[],
)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_to_quantize,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_to_quantize)
processor = AutoProcessor.from_pretrained(model_to_quantize)
# Push to hub
save_to = f"{USER_ID}/{MODEL_NAME}-HQQ-INT8-INT4"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
processor.push_to_hub(save_to)
# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {
        "role": "system",
        "content": "",
    },
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)
inputs = tokenizer(
    templated_prompt,
    return_tensors="pt",
).to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0][len(templated_prompt):])
The response from the manual testing is:
That's a really fascinating question! And a very common one when people interact with AI like me.
The short answer is: I can *simulate* conversation and respond to you in a way that *feels* like talking, but I'm not conscious in the way a human is.
Model Quality
| Benchmark | gemma-3-4b-it | pytorch/gemma-3-4b-it-HQQ-INT8-INT4 |
|---|---|---|
| mmlu | 57.68 | 55.65 |
| chartqa (multimodal) | 50.56 | 42.88 |
Reproduce Model Quality Results
We rely on lm-evaluation-harness to evaluate the quality of the quantized model.
You need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install
baseline
lm_eval --model hf --model_args pretrained=google/gemma-3-4b-it --tasks mmlu --device cuda:0 --batch_size auto
int8 dynamic activation and int4 weight quantization using HQQ (HQQ-INT8-INT4)
lm_eval --model hf --model_args pretrained=pytorch/gemma-3-4b-it-HQQ-INT8-INT4 --tasks mmlu --device cuda:0 --batch_size auto
multi-modal eval
You need to install lmms-eval from source:
pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
lmms-eval --model gemma3 --model_args "pretrained=google/gemma-3-4b-it,trust_remote_code=True,device_map=auto" --tasks chartqa --batch_size 1
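The quantized checkpoint can be evaluated analogously (this mirrors the baseline command above; it assumes lmms-eval loads the torchao quantization config saved in the checkpoint):
lmms-eval --model gemma3 --model_args "pretrained=pytorch/gemma-3-4b-it-HQQ-INT8-INT4,trust_remote_code=True,device_map=auto" --tasks chartqa --batch_size 1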
Exporting to ExecuTorch
To export to ExecuTorch, we use optimum-executorch.
We first install ExecuTorch and optimum-executorch:
# Set up executorch
git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
popd
# Install optimum-executorch
git clone https://github.com/huggingface/optimum-executorch.git
pushd optimum-executorch
python install_dev.py --skip_override_torch
popd
Now we can export our model to an ExecuTorch pte file and upload it to HuggingFace. The command below exports the model for the XNNPACK backend with a context length of 1024, but this can be adjusted.
optimum-cli export executorch --model "pytorch/gemma-3-4b-it-HQQ-INT8-INT4" --task "multimodal-text-to-text" --recipe "xnnpack" --use_custom_sdpa --use_custom_kv_cache --max_seq_len 1024 --output_dir ./
hf upload pytorch/gemma-3-4b-it-HQQ-INT8-INT4 model.pte
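To do a quick host-side check that the exported program loads, you can use ExecuTorch's Python runtime bindings; a minimal sketch, assuming the pip-installed executorch package and that the export above produced model.pte in the current directory:

```python
from executorch.runtime import Runtime

# Load the exported .pte and list its entry points. Actual on-device
# inference goes through the ExecuTorch mobile runners mentioned above.
runtime = Runtime.get()
program = runtime.load_program("model.pte")
print("methods in model.pte:", program.method_names)
```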
Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization
The model's quantization is powered by TorchAO, a framework presented in the paper TorchAO: PyTorch-Native Training-to-Serving Model Optimization.
Abstract: We present TorchAO, a PyTorch-native model optimization framework leveraging quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and leverages a novel tensor subclass abstraction to represent a variety of widely-used, backend agnostic low precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space in a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at https://github.com/pytorch/ao.
Resources
- Official TorchAO GitHub Repository: https://github.com/pytorch/ao
- TorchAO Documentation: https://docs.pytorch.org/ao/stable/index.html
Disclaimer
PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.
Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the licenses the models are released under, including any limitations of liability or disclaimers of warranties provided therein.