Qwen3-30B-A3B-Instruct-2507-FP8-dynamic
Model Overview
- Model Architecture: Qwen3MoeForCausalLM
- Input: Text
- Output: Text
- Model Optimizations:
  - Activation quantization: FP8
  - Weight quantization: FP8
- Intended Use Cases:
  - Function calling.
  - Subject matter experts via fine-tuning.
  - Multilingual instruction following.
  - Translation.
Quantized version of Qwen/Qwen3-30B-A3B-Instruct-2507.
Model Optimizations
This model was obtained by quantizing the weights and activations of Qwen/Qwen3-30B-A3B-Instruct-2507 to the FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within the transformer blocks of the language model are quantized.
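As a rough back-of-the-envelope check of the memory saving (a sketch: the ~30.5B parameter count is approximate, and embeddings, norms, and quantization scales actually stay in higher precision):
params = 30.5e9                      # approximate total parameter count
bf16_gib = params * 2 / 1024**3      # 16-bit weights: 2 bytes per parameter
fp8_gib = params * 1 / 1024**3       # 8-bit weights: 1 byte per parameter
print(f"BF16 ~{bf16_gib:.0f} GiB, FP8 ~{fp8_gib:.0f} GiB ({fp8_gib / bf16_gib:.0%} of original)")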
It runs faster than Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 under vLLM/SGLang on a 4090 or H100, with no difference in model ability on most benchmarks.
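A rough way to check throughput yourself is to time offline generation with vLLM's Python API. The snippet below is a minimal sketch; the prompts, token budget, and tensor_parallel_size are only illustrative.
import time
from vllm import LLM, SamplingParams
# Load the quantized checkpoint; adjust tensor_parallel_size / max_model_len for your GPUs.
llm = LLM(model="bash99/Qwen3-30B-A3B-Instruct-2507-FP8-Dynamic",
          tensor_parallel_size=2, max_model_len=8192)
params = SamplingParams(temperature=0.7, max_tokens=512)
prompts = ["Explain the difference between FP8 and INT8 quantization."] * 8
start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} generated tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")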
Deployment
Use with vLLM
You may need to tune the fused MoE kernels for best performance with vLLM or SGLang, following essentially the same procedure as the Qwen3-Next guide for vLLM.
vllm serve bash99/Qwen3-30B-A3B-Instruct-2507-FP8-Dynamic --tensor_parallel_size 2
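Once the server is up, it exposes an OpenAI-compatible API; a minimal query sketch, assuming the default port 8000:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="bash99/Qwen3-30B-A3B-Instruct-2507-FP8-Dynamic",
    messages=[{"role": "user", "content": "Give a short introduction to FP8 quantization."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)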
Creation
This model was quantized using the llm-compressor library as shown below.
Creation details
from transformers import AutoModelForCausalLM, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
import sys
MODEL_ID = sys.argv[1]
# Load model.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
# Configure the quantization algorithm and scheme.
# In this case, we:
# * quantize the weights to fp8 with per channel via ptq
# * quantize the activations to fp8 with dynamic per token
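# * skip lm_head, any vision layers, and the MoE router/gate layers (see the ignore list below), keeping them in the original precision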
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$", "re:.*router$"],
)
# Apply quantization and save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID + "-FP8-Dynamic"
oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
print(f"========== quantizing to {SAVE_DIR}, done ==============")
python quantize_tofp8.py Qwen/Qwen3-30B-A3B-Instruct-2507
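As a quick sanity check that the saved checkpoint carries a compressed-tensors quantization config, the config can be inspected (a sketch; it loads the published repo, but a local SAVE_DIR path works the same way):
from transformers import AutoConfig
cfg = AutoConfig.from_pretrained("bash99/Qwen3-30B-A3B-Instruct-2507-FP8-Dynamic")
print(cfg.quantization_config)   # quantization scheme, targets, and ignored modules recorded at save time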