Llama-3.1-8B-R1-Distill

This model is a supervised fine-tune (SFT) of meta-llama/Llama-3.1-8B-Instruct on the open-r1/Mixture-of-Thoughts dataset, trained with the Open-R1 library to replicate the step-by-step reasoning capabilities of the DeepSeek-R1 distilled models.

Model Description

  • Base Model: meta-llama/Llama-3.1-8B-Instruct
  • Model Type: Causal Language Model
  • Language(s): English
  • License: Llama 3.1 Community License
  • Finetuned from model: meta-llama/Llama-3.1-8B-Instruct

This model demonstrates strong performance across reasoning, mathematical problem-solving, scientific understanding, and code generation tasks. It has been specifically trained to think step-by-step using reasoning traces in a structured format with <think> and </think> tags.

Training Details

Training Data

The model was fine-tuned on the Mixture-of-Thoughts dataset, which contains 350k verified reasoning traces distilled from DeepSeek-R1. The dataset composition includes:

  • Mathematics: 93.7k reasoning traces for mathematical problems
  • Code: 83.1k reasoning traces for competitive programming problems (Python and C++)
  • Science: 173k reasoning traces for scientific problems
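
The dataset can be loaded directly with the Hugging Face datasets library. A minimal sketch, assuming the dataset exposes a chat-style messages column (the "all" config matches the training configuration below):

from datasets import load_dataset

# Load the combined math/code/science mixture used for training
dataset = load_dataset("open-r1/Mixture-of-Thoughts", "all", split="train")
print(dataset)                    # row count and column names
print(dataset[0]["messages"][0])  # each row stores a chat-style list of messages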

Training Procedure

  • Training Framework: Open-R1 library with TRL (Transformer Reinforcement Learning)
  • Training Type: Supervised Fine-Tuning (SFT)
  • Optimization: DeepSpeed ZeRO-2 with gradient checkpointing
  • Hardware: 8x NVIDIA B200 node
  • Precision: BFloat16
  • Learning Rate: 4.0e-5
  • Learning Rate Scheduler: Cosine with minimum learning rate
  • Training Epochs: 5
  • Training Tokens: 100B
  • Max Sequence Length: 32,768 tokens
  • Batch Size: 2 per device with gradient accumulation
  • Gradient Accumulation Steps: 8
  • Max Gradient Norm: 0.2
  • Warmup Ratio: 0.03

Training Configuration

# Key training parameters
model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
dataset_name: open-r1/Mixture-of-Thoughts
dataset_config: all
learning_rate: 4.0e-05
num_train_epochs: 5
max_length: 32768
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
bf16: true
gradient_checkpointing: true
use_liger_kernel: true
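
For reference, the run can be approximated in plain TRL. A minimal sketch, assuming recent TRL/Transformers releases (field names such as max_length and use_liger_kernel follow current versions; older TRL uses max_seq_length). The min_lr_rate value is an assumption, since only "cosine with minimum learning rate" is stated above, and this is not the exact Open-R1 training script:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hyperparameters mirror the YAML above
training_args = SFTConfig(
    output_dir="Llama-3.1-8B-R1-Distill",
    learning_rate=4.0e-5,
    num_train_epochs=5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    max_grad_norm=0.2,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate": 0.1},  # assumption: exact minimum LR not documented
    max_length=32768,
    bf16=True,
    gradient_checkpointing=True,
    use_liger_kernel=True,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    args=training_args,
    train_dataset=load_dataset("open-r1/Mixture-of-Thoughts", "all", split="train"),
)
trainer.train()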

Performance

Benchmark results are not yet available; the model is still in training. This section will be updated once evaluation is complete.

Usage

Basic Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "your-username/Llama-3.1-8B-R1-Distill"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"  # requires flash-attn; omit to fall back to the default attention
)

# Example: Mathematical reasoning
prompt = """Solve this step by step: A rectangle has a length that is 3 times its width. If the perimeter is 32 units, what are the dimensions?"""

# device_map="auto" may shard the model, so move inputs to the model's device
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    temperature=0.1,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
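
Because the base model ships with the Llama 3.1 chat template, prompts can also be built with the tokenizer's apply_chat_template helper rather than raw strings (generation settings here are illustrative):

messages = [
    {"role": "user", "content": "Solve this step by step: A rectangle has a length that is 3 times its width. If the perimeter is 32 units, what are the dimensions?"}
]
# apply_chat_template inserts the Llama 3.1 header/eot tokens shown in the next section
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=500, temperature=0.1, do_sample=True)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))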

Structured Reasoning Format

This model is trained to use a structured reasoning format with <think> tags:

def format_reasoning_prompt(question, system_prompt=None):
    if system_prompt is None:
        system_prompt = "You are a helpful assistant that thinks step by step. Show your reasoning process within <think> tags before providing your final answer."
    
    return f"""<|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

<think>
"""

# Example for coding problems
coding_prompt = format_reasoning_prompt(
    "Write a Python function to find the longest palindromic substring in a given string.",
    "You are an expert programmer. Think through the problem step by step, consider different approaches, and then provide a clean implementation."
)

inputs = tokenizer(coding_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=800, temperature=0.1, do_sample=True)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
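
Because the reasoning is delimited by <think> and </think>, it can be separated from the final answer with a simple parse. A minimal sketch using only the standard library (tag placement can vary between generations, so the fallback returns the raw text):

import re

def split_reasoning(text):
    """Split a <think>-formatted completion into (reasoning, answer)."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return None, text  # no complete reasoning block found
    return match.group(1).strip(), text[match.end():].strip()

reasoning, answer = split_reasoning(response)
print("Reasoning:\n", reasoning)
print("Final answer:\n", answer)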

Model Architecture

  • Architecture: Llama 3.1
  • Parameters: ~8 billion
  • Context Length: 128K tokens (inherited from base model)
  • Training Context: 32K tokens
  • Vocabulary Size: 128,256
  • Attention: Grouped Query Attention (GQA)
  • Activation: SwiGLU
  • Positional Encoding: RoPE (Rotary Position Embedding)
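
These properties can be verified directly from the published configuration; the field names below follow the Transformers LlamaConfig, and the values in comments reflect the Llama 3.1 8B config:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print(cfg.num_attention_heads, cfg.num_key_value_heads)  # 32 query heads, 8 KV heads -> GQA
print(cfg.max_position_embeddings)  # 131072, i.e. the 128K context window
print(cfg.vocab_size)               # 128256
print(cfg.hidden_act)               # "silu", the gate activation inside SwiGLU
print(cfg.rope_theta)               # 500000.0, the RoPE base frequency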

Capabilities

Mathematics

  • Step-by-step problem solving
  • Advanced mathematical reasoning (algebra, calculus, geometry)
  • Competition-level problems (AIME, IMO-style)
  • Statistical analysis and probability

Science

  • Physics problem solving
  • Chemistry calculations
  • Biology conceptual understanding
  • Graduate-level scientific reasoning

Programming

  • Code generation in Python and C++
  • Competitive programming problems
  • Algorithm design and optimization
  • Code explanation and debugging

General Reasoning

  • Logical reasoning and inference
  • Multi-step problem decomposition
  • Analytical thinking
  • Abstract reasoning

Limitations

  • Language: Primarily trained on English content
  • Knowledge Cutoff: Limited to the knowledge cutoff of the base model's training data
  • Reasoning Errors: May occasionally make logical errors in complex multi-step problems
  • Code Execution: Cannot execute code; provides code generation only
  • Real-time Information: No access to current information or internet
  • Domain Specificity: Best performance on math, science, and coding; may struggle with other specialized domains

Ethical Considerations

  • Bias: May reflect biases present in training data
  • Misuse: Should not be used for generating harmful, illegal, or malicious content
  • Academic Integrity: Users should be transparent about AI assistance in academic contexts
  • Verification: Important mathematical, scientific, or coding results should be independently verified
  • Professional Use: Should not replace professional expertise in critical applications

Training Infrastructure

  • Library: Open-R1
  • Total Training Tokens: 100B
  • Framework: PyTorch with Transformers and TRL
  • Optimization: DeepSpeed ZeRO-2
  • Memory Optimization: Gradient checkpointing, Liger kernels
  • Monitoring: Weights & Biases integration
  • Hardware Used: 8x NVIDIA B200 GPUs
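
For illustration, a minimal ZeRO-2 configuration sketch, written as a Python dict that Transformers' TrainingArguments accepts through its deepspeed argument ("auto" defers values to the trainer's own arguments; the actual Open-R1 recipe may differ):

# Illustrative ZeRO-2 settings, not the exact recipe used for this run
zero2_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}
# e.g. SFTConfig(..., deepspeed=zero2_config)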

Citation

If you use this model in your research, please cite:

@misc{llama31-r1-distill,
  title={Llama-3.1-8B-R1-Distill: A Step-by-Step Reasoning Model},
  author={[Your Name]},
  year={2025},
  url={https://huggingface.co/your-username/Llama-3.1-8B-R1-Distill}
}

@misc{openr1,
  title={Open R1: A fully open reproduction of DeepSeek-R1},
  url={https://github.com/huggingface/open-r1},
  author={Hugging Face},
  month={January},
  year={2025}
}

@misc{mixture-of-thoughts,
  title={Mixture-of-Thoughts},
  author={Hugging Face Open R1 Team},
  year={2025},
  url={https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts}
}

Acknowledgments

  • Base Model: Meta AI for Llama 3.1
  • Training Framework: Hugging Face for the Open-R1 library and TRL
  • Dataset: Open-R1 team for the Mixture-of-Thoughts dataset
  • Inspiration: DeepSeek AI for the original R1 reasoning approach

License

This model is released under the Llama 3.1 Community License. Please see the official license for terms and conditions.

Model Card Contact

For questions about this model card or the model itself, please open an issue in the model repository.
