Update README.md
README.md CHANGED
@@ -21,7 +21,6 @@ base_model:
 - **License:** apache-2.0
 - **Quantized from Model:** Qwen/Qwen3-8B
 - **Quantization Method:** QAT INT4
-- **Terms of Use**: [Terms][terms]

[Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) fine-tuned with [unsloth](https://github.com/unslothai/unsloth) using quantization-aware training (QAT) from [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao), then quantized with int4 weight-only quantization, by the PyTorch team.
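
As a quick start, here is a minimal loading sketch with `transformers`. The repository id `pytorch/Qwen3-8B-QAT-INT4` is a placeholder for this model's actual id; `torchao` must be installed so the quantization config shipped with the checkpoint can be applied:

```python
# Minimal sketch: load the int4 torchao checkpoint with transformers.
# Requires: pip install torch torchao transformers accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Qwen3-8B-QAT-INT4"  # placeholder id -- use this repo's name

# torchao checkpoints carry their quantization config alongside the weights,
# so a plain from_pretrained call restores the int4 model.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Give me a short introduction to large language models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```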
Use it directly or serve it with [vLLM](https://docs.vllm.ai/en/latest/) for a 62% VRAM reduction (6.24 GB needed) and a 1.45x speedup on H100 GPUs.
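
A sketch of offline inference with vLLM's Python API follows; an OpenAI-compatible server can likewise be launched with `vllm serve <model-id>`. The model id is again a placeholder:

```python
# Minimal sketch: offline batch inference with vLLM.
# Requires: pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/Qwen3-8B-QAT-INT4")  # placeholder id -- use this repo's name
params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=64)  # illustrative values

outputs = llm.generate(["Give me a short introduction to large language models."], params)
print(outputs[0].outputs[0].text)
```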