HQQ-INT8-INT4 google/gemma-3-4b-it model

  • Developed by: pytorch
  • License: apache-2.0
  • Quantized from Model: google/gemma-3-4b-it
  • Quantization Method: HQQ-INT8-INT4
  • Terms of Use: Terms

Gemma-3-4B is quantized by the PyTorch team using torchao, with 8-bit embeddings, 8-bit dynamic activations, and 4-bit weights for the linear layers (INT8-INT4). The model is suitable for mobile deployment with ExecuTorch.

We provide the quantized pte for direct use in ExecuTorch. (The provided pte file is exported with a max_seq_length/max_context_length of 1024; if you wish to change this, re-export the quantized model following the instructions in Exporting to ExecuTorch.)

Running in a mobile app

To run in a mobile app, download the quantized pte and tokenizer and follow the instructions here.
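If you prefer to script the download, here is a minimal sketch using huggingface_hub; model.pte matches the file uploaded in Exporting to ExecuTorch below, while the tokenizer filename is an assumption and should be checked against the repo's file list.

# Sketch: download the exported pte and tokenizer from the Hub
# (the tokenizer filename is an assumption -- check the repo's file list)
from huggingface_hub import hf_hub_download

pte_path = hf_hub_download("pytorch/gemma-3-4b-it-HQQ-INT8-INT4", "model.pte")
tokenizer_path = hf_hub_download("pytorch/gemma-3-4b-it-HQQ-INT8-INT4", "tokenizer.json")
print(pte_path, tokenizer_path)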

Quantization Recipe

First, install the required packages:

pip install git+https://github.com/huggingface/transformers@main
pip install --pre torchao torch --index-url https://download.pytorch.org/whl/nightly/cu126
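
Optionally, as a quick sanity check (a sketch), confirm that the nightly torch build and torchao are the ones being picked up:

python -c "import torch, torchao; print(torch.__version__, torchao.__version__)"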

Untie weights

Gemma-3 ties the lm_head weights to the token embeddings. Since the embeddings and the linear layers are quantized with different configs below, we first untie the weights and save a local copy of the model:

import torch
from transformers import (
  AutoModelForCausalLM,
  AutoProcessor,
  AutoTokenizer,
  TorchAoConfig,
)

model_id = "google/gemma-3-4b-it"
MODEL_NAME = model_id.split("/")[-1]
save_to_local_path = f"{MODEL_NAME}-untied-weights"

untied_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="cuda:0"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

from transformers.modeling_utils import find_tied_parameters

# Turn off weight tying in the text config
if getattr(
    untied_model.config.get_text_config(decoder=True), "tie_word_embeddings"
):
    setattr(
        untied_model.config.get_text_config(decoder=True),
        "tie_word_embeddings",
        False,
    )

# Give lm_head its own copy of the weights
untied_model._tied_weights_keys = []
untied_model.lm_head.weight = torch.nn.Parameter(
    untied_model.lm_head.weight.clone()
)

# Should now report no tied parameters
print("tied weights:", find_tied_parameters(untied_model))

# save locally
untied_model.save_pretrained(save_to_local_path)
tokenizer.save_pretrained(save_to_local_path)
processor.save_pretrained(save_to_local_path)
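
As an optional sanity check (a sketch, not part of the original recipe), the saved checkpoint can be reloaded to confirm that nothing is tied anymore:

# Optional: reload the untied checkpoint and verify
reloaded = AutoModelForCausalLM.from_pretrained(save_to_local_path, torch_dtype="auto")
print("tied weights after reload:", find_tied_parameters(reloaded))  # expect an empty result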

Quantization

We used the following code to produce the quantized model:

from torchao.quantization.quant_api import (
    IntxWeightOnlyConfig,
    Int8DynamicActivationIntxWeightConfig,
    ModuleFqnToConfig,
    quantize_,
)
from torchao.quantization.granularity import PerGroup, PerAxis
import torch

USER_ID = "YOUR_USER_ID"

# We start from the model with untied weights
model_to_quantize = save_to_local_path


# Linear layers (the "_default" entry below): int8 dynamic activations + int4 weights,
# group size 32, with HQQ used to choose the scales
int8_int4_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
    intx_choose_qparams_algorithm="hqq_scale_only",
)
# Vision encoder fc2 layers: int8 dynamic activations + int8 weights, per-channel
int8_int8_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int8,
    weight_granularity=PerAxis(0),
    intx_choose_qparams_algorithm="hqq_scale_only",
)
# Embeddings: int8 weight-only, per-channel
int8_weight_only_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int8,
    granularity=PerAxis(0),
    intx_choose_qparams_algorithm="hqq_scale_only",
)

# Map module fully-qualified names to configs; "_default" covers every other linear layer
fqn_to_config = {}
fqn_to_config["_default"] = int8_int4_config
fqn_to_config["model.language_model.embed_tokens"] = int8_weight_only_config
fqn_to_config["model.vision_tower.vision_model.embeddings.position_embedding"] = int8_weight_only_config
for i in range(27):
    fqn_to_config[f"model.vision_tower.vision_model.encoder.layers.{i}.mlp.fc2"] = int8_int8_config
quant_config = ModuleFqnToConfig(fqn_to_config)
quantization_config = TorchAoConfig(
    quant_type=quant_config,
    include_input_output_embeddings=True,
    modules_to_not_convert=[],
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_to_quantize,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_to_quantize)
processor = AutoProcessor.from_pretrained(model_to_quantize)

# Push to hub
save_to = f"{USER_ID}/{MODEL_NAME}-HQQ-INT8-INT4"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
processor.push_to_hub(save_to)

# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {
        "role": "system",
        "content": "",
    },
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)
inputs = tokenizer(
    templated_prompt,
    return_tensors="pt",
).to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0][len(templated_prompt):])

The response from the manual testing is:

That's a really fascinating question! And a very common one when people interact with AI like me.

The short answer is: I can *simulate* conversation and respond to you in a way that *feels* like talking, but I'm not conscious in the way a human is.
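
Once pushed, the quantized checkpoint can be reloaded directly from the Hub in a fresh session. This is a minimal sketch (replace YOUR_USER_ID with your own namespace), assuming the torchao quantization config serialized with the checkpoint is picked up automatically by from_pretrained:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: reload the pushed quantized checkpoint (replace YOUR_USER_ID)
repo_id = "YOUR_USER_ID/gemma-3-4b-it-HQQ-INT8-INT4"
quantized_model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="cuda:0",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(repo_id)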

Model Quality

Benchmark               gemma-3-4b-it   pytorch/gemma-3-4b-it-HQQ-INT8-INT4
mmlu                    57.68           55.65
chartqa (multimodal)    50.56           42.88

Reproduce Model Quality Results

We rely on lm-evaluation-harness to evaluate the quality of the quantized model.

You need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install

baseline

lm_eval --model hf --model_args pretrained=google/gemma-3-4b-it --tasks mmlu --device cuda:0 --batch_size auto

int8 dynamic activation and int4 weight quantization using HQQ (HQQ-INT8-INT4)

lm_eval --model hf --model_args pretrained=pytorch/gemma-3-4b-it-HQQ-INT8-INT4 --tasks mmlu --device cuda:0 --batch_size auto

multi-modal eval

You need to install lmms-eval from source: pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git

lmms-eval --model gemma3 --model_args "pretrained=google/gemma-3-4b-it,trust_remote_code=True,device_map=auto" --tasks chartqa --batch_size 1
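
The analogous command for the quantized checkpoint is sketched below; we have not verified that the lmms-eval gemma3 loader handles the torchao-quantized repo out of the box, so treat it as an assumption:

lmms-eval --model gemma3 --model_args "pretrained=pytorch/gemma-3-4b-it-HQQ-INT8-INT4,trust_remote_code=True,device_map=auto" --tasks chartqa --batch_size 1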

Exporting to ExecuTorch

To export to ExecuTorch, we use optimum-executorch.

We first install ExecuTorch and optimum-executorch:

# Set up executorch
git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
popd

# Install optimum-executorch
git clone https://github.com/huggingface/optimum-executorch.git
pushd optimum-executorch
python install_dev.py --skip_override_torch
popd

Now we can export our model to an ExecuTorch pte file and upload it to HuggingFace. The command below exports the model for the XNNPACK backend with a context length of 1024, but this can be adjusted.

optimum-cli export executorch --model "pytorch/gemma-3-4b-it-HQQ-INT8-INT4" --task "multimodal-text-to-text" --recipe "xnnpack" --use_custom_sdpa --use_custom_kv_cache --max_seq_len 1024 --output_dir ./
hf upload pytorch/gemma-3-4b-it-HQQ-INT8-INT4 model.pte

Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization

The model's quantization is powered by TorchAO, a framework presented in the paper TorchAO: PyTorch-Native Training-to-Serving Model Optimization.

Abstract: We present TorchAO, a PyTorch-native model optimization framework leveraging quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and leverages a novel tensor subclass abstraction to represent a variety of widely-used, backend agnostic low precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space in a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at this https URL .

Disclaimer

PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.

Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the licenses the models are released under, including any limitations of liability or disclaimers of warranties provided therein.
