Llama-3-Nanda-10B-Chat-Quantized-8-Bits

Model Overview

This is an 8-bit quantized version of the MBZUAI/Llama-3-Nanda-10B-Chat model, optimized for efficient inference while maintaining high performance. The model has been quantized using BitsAndBytesConfig to reduce memory footprint from ~37GB to ~10.5GB, making it more accessible for deployment on consumer hardware.

Model Details

  • Base Model: MBZUAI/Llama-3-Nanda-10B-Chat
  • Model Type: Causal Language Model (LLaMA-3 Architecture)
  • Parameters: 9.98 billion
  • Quantization: 8-bit (INT8) using BitsAndBytesConfig
  • Memory Footprint: ~10.5GB (vs ~37GB for FP32; see the estimate after this list)
  • Architecture: Transformer decoder with 40 layers
  • Context Length: inherited from the Llama-3 base model (8K tokens)
  • Vocabulary Size: 153,856 tokens
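
The memory figures above are consistent with a simple parameters-times-bytes estimate. The sketch below is back-of-the-envelope only; the explanation of the gap between the pure INT8 figure and the measured ~10.5GB (some modules kept in higher precision, which is typical for bitsandbytes 8-bit loading) is an inference, not a statement from the original card.

# Back-of-the-envelope weight memory at different precisions
num_params = 9.98e9

for precision, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1)]:
    weights_gb = num_params * bytes_per_param / 1024**3
    print(f"{precision}: ~{weights_gb:.1f} GB of weights")

# FP32: ~37.2 GB, FP16/BF16: ~18.6 GB, INT8: ~9.3 GB
# The measured ~10.5 GB is slightly above the pure-INT8 figure because some
# modules (typically embeddings and lm_head) stay in higher precision.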

Architecture Details

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(153856, 4096)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=153856, bias=False)
)

Key Architecture Features:

  • Hidden Size: 4096
  • Intermediate Size: 14336
  • Number of Layers: 40
  • Number of Attention Heads: 32
  • Key/Value Heads: 8 (Grouped Query Attention; see the shape check after this list)
  • Activation Function: SiLU (Swish)
  • Normalization: RMSNorm
  • Position Encoding: Rotary Position Embedding (RoPE)
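
The attention projection shapes in the module dump above follow directly from these numbers: 32 query heads of dimension 4096 / 32 = 128 give the 4096-wide q_proj, while the 8 shared key/value heads give the 1024-wide k_proj and v_proj. A quick arithmetic check (no assumptions beyond the figures listed above):

hidden_size = 4096
num_attention_heads = 32
num_key_value_heads = 8   # Grouped Query Attention: 4 query heads share each KV head

head_dim = hidden_size // num_attention_heads      # 128
q_out = num_attention_heads * head_dim             # 4096 -> q_proj / o_proj width
kv_out = num_key_value_heads * head_dim            # 1024 -> k_proj / v_proj width

print(head_dim, q_out, kv_out)  # 128 4096 1024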

Capabilities

This model excels in:

  • Multilingual Conversations: Supports Hindi, English, and other languages
  • Question Answering: Provides detailed, informative responses
  • Cultural Knowledge: Demonstrates understanding of regional contexts (UAE, India, etc.)
  • Safety: Includes built-in safety mechanisms to decline inappropriate requests
  • Code Generation: Can assist with programming tasks
  • Creative Writing: Capable of generating creative content

Usage

Quick Start

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Load model with 8-bit quantization
model_path = "FilledVaccum/Llama-3-Nanda-10B-Chat-Quantized-8-Bits"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Configure 8-bit quantization
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)

# Chat template
prompt_template = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>{Question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"

def get_response(question, max_length=500):
    text = prompt_template.format(Question=question)
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    
    with torch.no_grad():
        generate_ids = model.generate(
            input_ids,
            top_p=0.95,
            temperature=0.2,
            max_length=max_length,
            min_length=30,
            repetition_penalty=1.3,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    
    return response.split("assistant")[-1].strip()

# Example usage
question = "What is artificial intelligence?"
response = get_response(question)
print(response)

Multilingual Example

# Hindi example ("Tell me some interesting facts about the UAE")
hindi_question = "मुझे यूएई के बारे में कुछ रोचक तथ्य बताएं?"
hindi_response = get_response(hindi_question)
print(hindi_response)
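
For interactive use you may prefer to see tokens as they are produced rather than waiting for the full reply. transformers ships a TextStreamer that can be passed to generate(); the sketch below reuses the model, tokenizer, and prompt_template from the Quick Start, and the sampling settings are illustrative rather than taken from the original card.

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

text = prompt_template.format(Question="What is artificial intelligence?")
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

# Tokens are written to stdout as soon as they are generated
model.generate(
    input_ids,
    streamer=streamer,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)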

Generation Parameters

Recommended parameters for different use cases:

Creative Writing:

generate_ids = model.generate(
    input_ids,
    temperature=0.8,
    top_p=0.9,
    max_length=1000,
    do_sample=True,
    repetition_penalty=1.1
)

Factual Q&A:

generate_ids = model.generate(
    input_ids,
    temperature=0.2,
    top_p=0.95,
    max_length=500,
    do_sample=True,
    repetition_penalty=1.3
)
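
Note that max_length in the snippets above counts the prompt tokens as well as the generated reply, so long prompts leave less room for the answer. If you want to bound only the new text, generate() also accepts max_new_tokens; a minimal variant of the factual Q&A settings:

generate_ids = model.generate(
    input_ids,
    temperature=0.2,
    top_p=0.95,
    max_new_tokens=300,   # limits only the generated reply, independent of prompt length
    do_sample=True,
    repetition_penalty=1.3,
    pad_token_id=tokenizer.eos_token_id
)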

System Requirements

Minimum Requirements:

  • GPU Memory: 12GB VRAM (for 8-bit quantization)
  • System RAM: 16GB
  • Storage: 15GB free space
  • CUDA: 11.0 or higher

Recommended Requirements:

  • GPU Memory: 16GB+ VRAM
  • System RAM: 32GB+
  • Storage: 25GB+ free space (for caching)
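
A quick way to confirm that your GPU meets these figures before downloading the weights (a small helper sketch; it assumes PyTorch with CUDA support is installed):

import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA device found; bitsandbytes 8-bit loading requires a GPU.")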

Multi-GPU Setup:

The model supports automatic device mapping across multiple GPUs:

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto"  # Automatically distributes across available GPUs
)
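
If you need to reserve headroom on each card (for activations or for other processes), from_pretrained also accepts a max_memory map alongside device_map="auto". The limits below are illustrative values, not recommendations from the original card:

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto",
    max_memory={0: "11GiB", 1: "11GiB", "cpu": "30GiB"}  # example per-device limits
)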

Performance Benchmarks

Metric              Value
------              -----
Memory Footprint    10.47 GB
Parameters          9.98B
Quantization        8-bit INT8
Inference Speed     ~2-3x faster than FP16
Quality Retention   ~95% of original model
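
The memory footprint figure can be checked against a loaded model: transformers exposes get_memory_footprint(), which reports the bytes occupied by the model's parameters and buffers (a quick check, assuming the model object from the Quick Start):

footprint_gb = model.get_memory_footprint() / 1024**3
print(f"Model memory footprint: {footprint_gb:.2f} GB")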

Training Details

This model is based on MBZUAI/Llama-3-Nanda-10B-Chat, which was fine-tuned for:

  • Multilingual conversation capabilities
  • Cultural awareness and regional knowledge
  • Safety and alignment
  • Instruction following

Limitations

  • Quantization Effects: Some precision loss compared to the original FP32 model
  • Context Window: Limited by the base model's context length
  • Language Bias: May perform better in English and Hindi compared to other languages
  • Knowledge Cutoff: Training data has a specific cutoff date
  • Hardware Requirements: Still requires significant computational resources

Safety and Bias

The model includes safety mechanisms to:

  • Decline inappropriate or harmful requests
  • Avoid generating offensive content
  • Provide helpful and constructive responses

However, users should be aware of potential biases and limitations inherent in large language models.

License

This model inherits the license from the base MBZUAI/Llama-3-Nanda-10B-Chat model. Please refer to the original model's license for usage terms.

Citation

If you use this model, please cite:

@misc{llama3-nanda-10b-quantized,
  title={Llama-3-Nanda-10B-Chat-Quantized-8-Bits},
  author={FilledVaccum},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/FilledVaccum/Llama-3-Nanda-10B-Chat-Quantized-8-Bits}
}

Also cite the original base model:

@misc{llama3-nanda-10b,
  title={Llama-3-Nanda-10B-Chat},
  author={MBZUAI},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/MBZUAI/Llama-3-Nanda-10B-Chat}
}

Acknowledgments

  • MBZUAI for the original Llama-3-Nanda-10B-Chat model
  • Meta AI for the LLaMA-3 architecture
  • Hugging Face for the transformers library and quantization tools
  • BitsAndBytes team for the quantization implementation

Contact

For questions or issues related to this quantized version, please open an issue in the model repository.


This model is provided as-is for research and educational purposes. Users are responsible for ensuring appropriate and ethical use.
