Qwen3-30B-A3B-Instruct-2507-FP8-dynamic

Model Overview

  • Model Architecture: Qwen3MoeForCausalLM
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Activation quantization: FP8
    • Weight quantization: FP8
  • Intended Use Cases:
    • Function calling.
    • Subject matter experts via fine-tuning.
    • Multilingual instruction following.
    • Translation.

Quantized version of Qwen/Qwen3-30B-A3B-Instruct-2507.

Model Optimizations

This model was obtained by quantizing the weights and activations of Qwen/Qwen3-30B-A3B-Instruct-2507 to the FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within the transformer blocks of the language model are quantized.
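As a rough illustration of that ~50% figure, the sketch below estimates the weight footprint before and after quantization. It assumes roughly 30.5B parameters, nearly all of them in the quantized linear layers, and ignores quantization scale tensors; the numbers are approximations, not measurements.

# Back-of-the-envelope estimate of weight memory before and after FP8 quantization.
# Assumes ~30.5B parameters, almost all in quantized linear layers (approximation).
params = 30.5e9
bf16_gb = params * 2 / 1e9   # 16-bit weights: 2 bytes per parameter, ~61 GB
fp8_gb = params * 1 / 1e9    # 8-bit weights: 1 byte per parameter, ~30 GB (scales ignored)
print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")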

It runs faster than Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 under vLLM/SGLang on an RTX 4090 or H100, with no difference in model ability on most benchmarks.

See the vLLM documentation on FP8 quantization for details.

Deployment

Use with vLLM

You may need to tune the MoE kernels for best performance with vLLM or SGLang, much as in the Qwen3-Next guide for vLLM.

vllm serve bash99/Qwen3-30B-A3B-Instruct-2507-FP8-Dynamic --tensor_parallel_size 2
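Once the server is running it exposes an OpenAI-compatible endpoint. The following is a minimal usage sketch with the openai Python client; the localhost URL/port and the prompt are assumptions for illustration, not part of the original card.

from openai import OpenAI

# vLLM serves an OpenAI-compatible API; localhost:8000 is the default port (assumed here).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="bash99/Qwen3-30B-A3B-Instruct-2507-FP8-Dynamic",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(response.choices[0].message.content)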

Creation

This model was quantized using the llm-compressor library as shown below.

Creation details
from transformers import AutoModelForCausalLM, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
import sys

MODEL_ID = sys.argv[1]

# Load model.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to FP8 with per-channel scales via PTQ
#   * quantize the activations to FP8 with dynamic per-token scales
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$", "re:.*router$"],
)

# Apply quantization and save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID + "-FP8-Dynamic"

oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

print(f"========== quantizeing to {SAVE_DIR}, done ==============")
Run the script with the base model as its argument:

python quantize_tofp8.py Qwen/Qwen3-30B-A3B-Instruct-2507
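As a quick sanity check on the output (a sketch, assuming llm-compressor writes a compressed-tensors quantization_config into the saved config.json; the local output path below mirrors the SAVE_DIR produced by the script above):

import json
import os

SAVE_DIR = "Qwen/Qwen3-30B-A3B-Instruct-2507-FP8-Dynamic"  # assumed local output path

with open(os.path.join(SAVE_DIR, "config.json")) as f:
    cfg = json.load(f)

# Expect a quantization_config block describing the FP8 dynamic scheme.
print(json.dumps(cfg.get("quantization_config", {}), indent=2))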