Qwen3-30B-A3B-Instruct-2507-FP8-dynamic

Model Overview

  • Model Architecture: Qwen3MoeForCausalLM
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Activation quantization: FP8
    • Weight quantization: FP8
  • Intended Use Cases:
    • Function calling.
    • Subject matter experts via fine-tuning.
    • Multilingual instruction following.
    • Translation.

Quantized version of Qwen/Qwen3-30B-A3B-Instruct-2507.

Model Optimizations

This model was obtained by quantizing the weights and activations of Qwen/Qwen3-30B-A3B-Instruct-2507 to the FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within the transformer blocks of the language model are quantized.
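As a rough illustration of that ~50% figure, the sketch below estimates the weight footprint before and after quantization. It assumes roughly 30.5B parameters, nearly all of them in the quantized linear layers, and ignores quantization scale tensors; the numbers are approximations, not measurements.

# Back-of-the-envelope estimate of weight memory before and after FP8 quantization.
# Assumes ~30.5B parameters, almost all in quantized linear layers (approximation).
params = 30.5e9
bf16_gb = params * 2 / 1e9   # 16-bit weights: 2 bytes per parameter, ~61 GB
fp8_gb = params * 1 / 1e9    # 8-bit weights: 1 byte per parameter, ~30 GB (scales ignored)
print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")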

It runs faster than Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 under vLLM/SGLang on an RTX 4090 or H100, with no difference in model ability on most benchmarks.

See the vLLM documentation on FP8 quantization for details.

Deployment

Use with vLLM

You may need to tune the MoE kernels for best performance with vLLM or SGLang, much as in the Qwen3-Next guide for vLLM.

vllm serve bash99/Qwen3-30B-A3B-Instruct-2507-FP8-Dynamic --tensor_parallel_size 2
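Once the server is running it exposes an OpenAI-compatible endpoint. The following is a minimal usage sketch with the openai Python client; the localhost URL/port and the prompt are assumptions for illustration, not part of the original card.

from openai import OpenAI

# vLLM serves an OpenAI-compatible API; localhost:8000 is the default port (assumed here).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="bash99/Qwen3-30B-A3B-Instruct-2507-FP8-Dynamic",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(response.choices[0].message.content)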

Creation

This model was quantized using the llm-compressor library as shown below.

Creation details
from transformers import AutoModelForCausalLM, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
import sys

MODEL_ID = sys.argv[1]

# Load model.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to FP8 with per-channel scales via PTQ
#   * quantize the activations to FP8 with dynamic per-token scales
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$", "re:.*router$"],
)

# Apply quantization and save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID + "-FP8-Dynamic"

oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

print(f"========== quantizeing to {SAVE_DIR}, done ==============")
Run the script with the base model as its argument:

python quantize_tofp8.py Qwen/Qwen3-30B-A3B-Instruct-2507
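As a quick sanity check on the output (a sketch, assuming llm-compressor writes a compressed-tensors quantization_config into the saved config.json; the local output path below mirrors the SAVE_DIR produced by the script above):

import json
import os

SAVE_DIR = "Qwen/Qwen3-30B-A3B-Instruct-2507-FP8-Dynamic"  # assumed local output path

with open(os.path.join(SAVE_DIR, "config.json")) as f:
    cfg = json.load(f)

# Expect a quantization_config block describing the FP8 dynamic scheme.
print(json.dumps(cfg.get("quantization_config", {}), indent=2))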