---
language:
  - en
license: apache-2.0
tags:
  - llm
  - quantized
  - 8-bit
  - w8a16
  - text-generation
  - gpt
  - valiantlabs
  - gptq
  - int8
  - fp8
library_name: llmcompressor
base_model:
  - ValiantLabs/gpt-oss-20b-ShiningValiant3
---

# gpt-oss-20b-ShiningValiant3-W8A16

This is a W8A16 (8-bit weight, 16-bit activation) quantized version of the ValiantLabs/gpt-oss-20b-ShiningValiant3 model, processed using LLM Compressor.

## Model Details

- **Original Model:** ValiantLabs/gpt-oss-20b-ShiningValiant3
- **Quantization Method:** W8A16 (8-bit weights, 16-bit activations)
- **Quantization Library:** LLM Compressor
- **Precision:** 8-bit weights with 16-bit activations
- **Compatible with:** vLLM, transformers

## Usage

The model can be loaded and used with the standard Hugging Face transformers library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("groxaxo/gpt-oss-20b-ShiningValiant3-W8A16")
tokenizer = AutoTokenizer.from_pretrained("groxaxo/gpt-oss-20b-ShiningValiant3-W8A16")

# Generate text
input_text = "Hello, how are you today?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
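
The checkpoint can also be served with vLLM. Below is a minimal sketch, assuming a recent vLLM build with compressed-tensors support; the prompt and sampling settings are illustrative only:

```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint; vLLM picks up the quantization config automatically
llm = LLM(model="groxaxo/gpt-oss-20b-ShiningValiant3-W8A16")
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)

# Generate a completion for a single prompt
outputs = llm.generate(["Hello, how are you today?"], sampling_params)
print(outputs[0].outputs[0].text)
```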

## Benefits of W8A16 Quantization

- **Reduced Memory Footprint:** ~50% reduction in model size compared to FP16 (see the rough estimate below)
- **Faster Inference:** Lower weight-memory bandwidth can improve inference speed on kernels that support 8-bit weights
- **Compatibility:** Works with standard Hugging Face tooling
- **Quality Preservation:** Retains most of the original model's performance
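
As a rough illustration of the memory savings, the back-of-the-envelope sketch below counts weight storage only; it assumes an approximate 20B parameter count and ignores activations, KV cache, and quantization scale metadata:

```python
# Back-of-the-envelope weight-storage estimate for a ~20B-parameter model.
# Assumption: parameter count is approximate; runtime memory also includes
# activations, KV cache, and quantization scale/zero-point metadata.
params = 20e9
fp16_gb = params * 2 / 1e9   # 2 bytes per weight in FP16 -> ~40 GB
int8_gb = params * 1 / 1e9   # 1 byte per weight in W8A16 -> ~20 GB
print(f"FP16 weights: ~{fp16_gb:.0f} GB, W8A16 weights: ~{int8_gb:.0f} GB")
```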

## Quantization Process

This model was quantized with LLM Compressor's W8A16 scheme, which applies 8-bit quantization to the model weights while keeping activations in 16-bit precision. The process used GPTQ, a post-training quantization algorithm, with calibration on a subset of data to preserve model quality. A sketch of a comparable recipe is shown below.
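
For reference, here is a minimal sketch of such a recipe. The calibration dataset, sample count, and sequence length are illustrative assumptions rather than the exact settings used for this checkpoint, and import paths can differ slightly between llmcompressor releases:

```python
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Load the full-precision base model
model = AutoModelForCausalLM.from_pretrained(
    "ValiantLabs/gpt-oss-20b-ShiningValiant3", torch_dtype="auto", device_map="auto"
)

# GPTQ recipe: quantize all Linear layers to 8-bit weights (W8A16),
# leaving the output head in full precision
recipe = GPTQModifier(targets="Linear", scheme="W8A16", ignore=["lm_head"])

# One-shot calibration-based quantization
oneshot(
    model=model,
    dataset="open_platypus",      # illustrative calibration dataset
    recipe=recipe,
    max_seq_length=2048,          # assumed calibration sequence length
    num_calibration_samples=512,  # assumed number of calibration samples
)

# Save the compressed checkpoint
model.save_pretrained("gpt-oss-20b-ShiningValiant3-W8A16", save_compressed=True)
```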

## Limitations

- The model may show slight performance degradation compared to the full-precision version
- Not all hardware runs 8-bit operations efficiently