Llama-3-Nanda-10B-Chat-Quantized-8-Bits
Model Overview
This is an 8-bit quantized version of the MBZUAI/Llama-3-Nanda-10B-Chat model, intended for efficient inference with minimal quality loss. The weights were quantized with Hugging Face's BitsAndBytesConfig, reducing the memory footprint from ~37 GB (FP32) to ~10.5 GB and making the model deployable on consumer hardware.
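For reference, the sketch below shows one way such an 8-bit export can be produced from the original checkpoint with bitsandbytes. The exact script used for this repository is not included in the card, so treat the output directory name and the version requirement noted in the comment as assumptions.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_id = "MBZUAI/Llama-3-Nanda-10B-Chat"

# Load the base model directly in 8-bit via bitsandbytes
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Serialize the 8-bit weights and tokenizer to a local folder
# (saving 8-bit bitsandbytes checkpoints requires recent transformers/bitsandbytes releases)
model.save_pretrained("Llama-3-Nanda-10B-Chat-Quantized-8-Bits")
tokenizer.save_pretrained("Llama-3-Nanda-10B-Chat-Quantized-8-Bits")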
Model Details
- Base Model: MBZUAI/Llama-3-Nanda-10B-Chat
- Model Type: Causal Language Model (LLaMA-3 Architecture)
- Parameters: 9.98 billion parameters
- Quantization: 8-bit (INT8) using BitsAndBytesConfig
- Memory Footprint: ~10.5GB (vs ~37GB for FP32)
- Architecture: Transformer decoder with 40 layers
- Context Length: Inherited from the base Llama-3-Nanda-10B-Chat model (quantization does not change the context window)
- Vocabulary Size: 153,856 tokens
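The parameter count above, and the architecture summary in the next section, can be reproduced from a loaded model. A minimal check, assuming the model has been loaded as in the Quick Start section below:

# Assumes `model` was loaded as shown in the Quick Start section
print(f"{model.num_parameters() / 1e9:.2f}B parameters")  # ~9.98B
print(model)  # prints the module tree shown under Architecture Details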
Architecture Details
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(153856, 4096)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=153856, bias=False)
)
Key Architecture Features:
- Hidden Size: 4096
- Intermediate Size: 14336
- Number of Layers: 40
- Number of Attention Heads: 32
- Key/Value Heads: 8 (Grouped Query Attention)
- Activation Function: SiLU (Swish)
- Normalization: RMSNorm
- Position Encoding: Rotary Position Embedding (RoPE)
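These figures are consistent with the projection shapes in the module dump above; a quick sanity check of the grouped-query attention dimensions:

hidden_size = 4096
num_attention_heads = 32
num_key_value_heads = 8  # grouped-query attention: 4 query heads share each KV head

head_dim = hidden_size // num_attention_heads    # 128
q_out = num_attention_heads * head_dim           # 4096, matches q_proj / o_proj
kv_out = num_key_value_heads * head_dim          # 1024, matches k_proj / v_proj
print(head_dim, q_out, kv_out)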
Capabilities
This model excels in:
- Multilingual Conversations: Supports Hindi, English, and other languages
- Question Answering: Provides detailed, informative responses
- Cultural Knowledge: Demonstrates understanding of regional contexts (UAE, India, etc.)
- Safety: Includes built-in safety mechanisms to decline inappropriate requests
- Code Generation: Can assist with programming tasks
- Creative Writing: Capable of generating creative content
Usage
Quick Start
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Load model with 8-bit quantization
model_path = "FilledVaccum/Llama-3-Nanda-10B-Chat-Quantized-8-Bits"

# Configure 8-bit quantization
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Load tokenizer and model; device_map="auto" places the weights on the available GPU(s)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)

# Llama-3 chat template with a simple system prompt
prompt_template = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
    "You are a helpful AI assistant.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>{Question}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>"
)

def get_response(question, max_length=500):
    # Build the prompt and move the input ids to the device the model was placed on
    text = prompt_template.format(Question=question)
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        generate_ids = model.generate(
            input_ids,
            top_p=0.95,
            temperature=0.2,
            max_length=max_length,
            min_length=30,
            repetition_penalty=1.3,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    # Keep only the text after the final "assistant" header
    return response.split("assistant")[-1].strip()

# Example usage
question = "What is artificial intelligence?"
response = get_response(question)
print(response)
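If the bundled tokenizer ships a chat template (not verified here), the same prompt can be built with transformers' apply_chat_template instead of hard-coding the special tokens:

# Alternative prompt construction, assuming tokenizer.chat_template is defined
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is artificial intelligence?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model answers next
    return_tensors="pt",
).to(model.device)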
Multilingual Example
# Hindi example: "Tell me some interesting facts about the UAE?"
hindi_question = "मुझे यूएई के बारे में कुछ रोचक तथ्य बताएं?"
hindi_response = get_response(hindi_question)
print(hindi_response)
Generation Parameters
Recommended parameters for different use cases:
Creative Writing:
generate_ids = model.generate(
    input_ids,
    temperature=0.8,
    top_p=0.9,
    max_length=1000,
    do_sample=True,
    repetition_penalty=1.1
)
Factual Q&A:
generate_ids = model.generate(
    input_ids,
    temperature=0.2,
    top_p=0.95,
    max_length=500,
    do_sample=True,
    repetition_penalty=1.3
)
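For interactive use, output can also be streamed token by token with transformers' TextStreamer; the sketch below reuses the model, tokenizer, and input_ids from the Quick Start:

from transformers import TextStreamer

# Print tokens to stdout as they are generated instead of waiting for the full sequence
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    input_ids,
    streamer=streamer,
    max_length=500,
    temperature=0.2,
    top_p=0.95,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)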
System Requirements
Minimum Requirements:
- GPU Memory: 12GB VRAM (for 8-bit quantization)
- System RAM: 16GB
- Storage: 15GB free space
- CUDA: 11.0 or higher
Recommended Requirements:
- GPU Memory: 16GB+ VRAM
- System RAM: 32GB+
- Storage: 25GB+ free space (for caching)
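Before downloading the weights, the local GPU can be checked against these figures (single-GPU sketch; the 12 GB threshold mirrors the minimum listed above):

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 1024**3
    print(f"{props.name}: {total_gib:.1f} GiB VRAM")
    if total_gib < 12:
        print("Warning: below the 12 GB minimum recommended for 8-bit inference")
else:
    print("No CUDA device detected")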
Multi-GPU Setup:
The model supports automatic device mapping across multiple GPUs:
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto"  # Automatically distributes layers across available GPUs
)
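If automatic placement over-fills one card, per-device limits can be passed through max_memory (the values below are illustrative, not measured; model_path and quantization_config are defined in the Quick Start):

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto",
    # Cap usage per device; integer keys are GPU indices, "cpu" holds any overflow
    max_memory={0: "10GiB", 1: "10GiB", "cpu": "32GiB"},
)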
Performance Benchmarks
| Metric | Value |
|---|---|
| Memory Footprint | 10.47 GB |
| Parameters | 9.98B |
| Quantization | 8-bit INT8 |
| Inference Speed | ~2-3x faster than FP16 |
| Quality Retention | ~95% of original model |
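The memory-footprint figure can be reproduced on a loaded model with the built-in helper:

# Size of the loaded weights and buffers, in GiB (expected ~10.5 GiB for this 8-bit model)
print(f"{model.get_memory_footprint() / 1024**3:.2f} GiB")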
Training Details
This model is based on MBZUAI/Llama-3-Nanda-10B-Chat, which was fine-tuned for:
- Multilingual conversation capabilities
- Cultural awareness and regional knowledge
- Safety and alignment
- Instruction following
Limitations
- Quantization Effects: Some precision loss compared to the original FP32 model
- Context Window: Limited by the base model's context length
- Language Bias: May perform better in English and Hindi compared to other languages
- Knowledge Cutoff: Training data has a specific cutoff date
- Hardware Requirements: Still requires significant computational resources
Safety and Bias
The model includes safety mechanisms to:
- Decline inappropriate or harmful requests
- Avoid generating offensive content
- Provide helpful and constructive responses
However, users should be aware of potential biases and limitations inherent in large language models.
License
This model inherits the license from the base MBZUAI/Llama-3-Nanda-10B-Chat model. Please refer to the original model's license for usage terms.
Citation
If you use this model, please cite:
@misc{llama3-nanda-10b-quantized,
  title     = {Llama-3-Nanda-10B-Chat-Quantized-8-Bits},
  author    = {FilledVaccum},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/FilledVaccum/Llama-3-Nanda-10B-Chat-Quantized-8-Bits}
}
Also cite the original base model:
@misc{llama3-nanda-10b,
  title     = {Llama-3-Nanda-10B-Chat},
  author    = {MBZUAI},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/MBZUAI/Llama-3-Nanda-10B-Chat}
}
Acknowledgments
- MBZUAI for the original Llama-3-Nanda-10B-Chat model
- Meta AI for the LLaMA-3 architecture
- Hugging Face for the transformers library and quantization tools
- BitsAndBytes team for the quantization implementation
Contact
For questions or issues related to this quantized version, please open an issue in the model repository.
This model is provided as-is for research and educational purposes. Users are responsible for ensuring appropriate and ethical use.