---
license: apache-2.0
language:
- en
- code
library_name: transformers
tags:
- smallcoder
- code-llm
- sft
- 303m
- trc
datasets:
- HuggingFaceFW/fineweb-edu
- nvidia/Nemotron-Pretraining-SFT-v1
- bigcode/starcoderdata
- nvidia/Nemotron-Pretraining-Code-v1
- HuggingFaceFW/finewiki
- open-web-math/open-web-math
- nvidia/Nemotron-CC-Math-v1
- nvidia/OpenCodeInstruct
- nvidia/OpenMathInstruct-2
---
# SmallCoder (303M)
SmallCoder is a 303-million-parameter Large Language Model (LLM) trained from scratch, specializing in code generation and algorithmic reasoning.
This checkpoint is the result of a 6 Billion token Supervised Fine-Tuning (SFT) run, which fixed a critical End-of-Sequence (EOS) token bug present in previous versions.
This model demonstrates state-of-the-art (SOTA) coding performance for its size, outperforming models larger than 1B parameters and competing with models 23x its size.
Trained with support from Google's TPU Research Cloud (TRC) program.
## Key Performance (Benchmarks)
The goal of SmallCoder was to maximize coding performance in a compact (<500M) package. This model achieves SOTA scores that rival or exceed models in the 1B+ class.
| Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
|---|---|---|---|
| SmallCoder (S4.1) | 303M | 27.4% | 31.0% |
| TinyLlama-1.1B | 1.1B | ~26.4% | ~27.6% |
| MPT-1B-Instruct | 1.0B | ~22.0% | ~25.0% |
| Zephyr-1.3B SFT | 1.3B | 31.0% | 34.0% |
| Mistral-7B Base | 7B | 30.5% | 47.5% |
SmallCoder (303M) nearly achieves parity with Mistral 7B on HumanEval while being 23x smaller.
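For reference, pass@1/pass@k on code benchmarks is conventionally computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021). A minimal sketch of the metric itself, not this card's actual evaluation code; the sample counts below are purely illustrative:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = samples generated per problem and c = samples that pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: 10 samples for one problem, 3 of them pass.
print(f"{pass_at_k(n=10, c=3, k=1):.3f}")  # 0.300 = expected pass@1
```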
## Model Architecture
This model uses a Llama-type architecture (MHA) with 303M parameters.
- Architecture: LlamaForCausalLM (MHA)
- Hidden Size: 768
- Layers: 24
- Attention Heads: 8
- KV Heads: 8 (Standard MHA)
- Vocab Size: 49152 (Tokenizer: `bigcode/starcoder`)
- Max Context: 1024 tokens
The corresponding `LlamaConfig` (abridged):

```python
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=49152,
    hidden_size=768,
    num_hidden_layers=24,
    intermediate_size=3072,
    num_attention_heads=8,
    num_key_value_heads=8,
    max_position_embeddings=1024,
    # ... (remaining fields omitted in the original card)
)
```
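As a sanity check, the parameter count can be reproduced by instantiating this configuration and summing parameter sizes. A minimal sketch; `tie_word_embeddings=False` is an assumption made here (it is what brings the total near the advertised 303M) and is not stated in the card:

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=49152,
    hidden_size=768,
    num_hidden_layers=24,
    intermediate_size=3072,
    num_attention_heads=8,
    num_key_value_heads=8,
    max_position_embeddings=1024,
    tie_word_embeddings=False,  # assumption: untied input/output embeddings
)

model = LlamaForCausalLM(config)  # randomly initialized, CPU, ~1.2 GB in fp32
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # roughly 302M with these settings
```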
## Training Plan (4 Stages)
This model is the result of a multi-stage training curriculum totaling 29.8 Billion tokens.
### Stage 1: Linguistic Base (Completed)
- Tokens: 6.3B
- Dataset: `FineWeb-Edu`
- Objective: Learn natural language.
- Loss: 10.87 → 2.58
### Stage 2: Code Specialization (Completed)
- Tokens: 7.5B
- Dataset: `Nemotron Synthetic Code Q/A CoT` (60%) / `StarCoderData` (40%)
- Objective: Learn code syntax and reasoning.
- Loss: 5.00 → 1.25
### Stage 3: Math & Knowledge (Completed)
- Tokens: 10B
- Dataset: `Nemotron CC-Math-4plus` (40%) / `FineWiki-EN` (35%) / `Nemotron CC-Math-4` (15%) / `OpenWebMath` (10%)
- Objective: Learn mathematical reasoning.
- Loss: 2.77 → 1.55
- Result: A solid base model (Wikitext PPL: 35.4).
### Stage 4.1: SFT (EOS-Fixed) (Completed)
- Tokens: 6B
- Starting Checkpoint: `stage-3/`
- Dataset: `Nemotron-SFT-Code` (45%), `OpenCodeInstruct` (30%), `OpenMathInstruct-2` (15%), `Nemotron-SFT-General` (10%)
- Objective: Align on code instructions and fix the EOS generation bug.
- Loss: 1.73 → ~0.70 (low point)
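Each stage blends its sources by weight. Purely as an illustration of the mixture arithmetic (not the actual training pipeline), here is a minimal sketch of proportional source sampling under the Stage 4.1 weights; a real pipeline would more likely use something like `datasets.interleave_datasets(..., probabilities=...)`:

```python
import random

# Stage 4.1 mixture weights, copied from the list above.
MIXTURE = {
    "Nemotron-SFT-Code": 0.45,
    "OpenCodeInstruct": 0.30,
    "OpenMathInstruct-2": 0.15,
    "Nemotron-SFT-General": 0.10,
}

def sample_source(rng: random.Random) -> str:
    """Pick which dataset the next training example is drawn from,
    proportionally to the mixture weights."""
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly 4500 / 3000 / 1500 / 1000 draws
```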
## Detailed Benchmarks (Stage 4.1)
The SFT (Code) scores are excellent. The generalist scores (Math, Reasoning) are low, indicating the SFT has heavily specialized the model (a "code specialist").
| Task | Benchmark | n-shot | Metric | Score |
|---|---|---|---|---|
| Code | HumanEval | 0 | pass@1 | 27.4% |
| Code | MBPP | 3 | pass@1 | 31.0% |
| Math | GSM8k | 0 | exact_match | 4.55% |
| General | Wikitext | 0 | word_perplexity | 167.6 |
| Reasoning | ARC Easy | 0 | acc_norm | 34.6% |
| Reasoning | ARC Challenge | 0 | acc_norm | 22.8% |
| Commonsense | HellaSwag | 0 | acc_norm | 28.3% |
HumanEval/MBPP scores are based on manual analysis (`max_gen_toks=512`), as the official lm-eval harness fails to evaluate this model due to SFT formatting and truncation issues.
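Since the standard harness could not be used, a manual functional check looks roughly like the following. This is a minimal sketch with a hypothetical problem and test (the real HumanEval problems ship their own unit tests), not the exact evaluation script used for this card:

```python
def passes(candidate_code: str, test_code: str, entry_point: str) -> bool:
    """Run a generated solution against a benchmark-style unit test.
    WARNING: exec() of model output is unsafe; real evaluations sandbox this."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function
        exec(test_code, namespace)        # define check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False

# Hypothetical example in the HumanEval format (not an actual benchmark item).
candidate = "def add(a, b):\n    return a + b\n"
test = "def check(candidate):\n    assert candidate(2, 3) == 5\n"
print(passes(candidate, test, entry_point="add"))  # True
```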
## Known Limitations
- Code Specialist: Heavily optimized for code (27.4% HumanEval) at the expense of other skills. Performance on math (GSM8k: 4.55%) and general knowledge (Wikitext PPL: 167) is low. This is a code specialist model, not a generalist.
- Limited Context: This model was trained exclusively on a sequence length of 1024 tokens and cannot handle longer prompts.
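Because of the 1024-token limit noted above, long inputs should be truncated (or rejected) before generation. A minimal sketch; the split between prompt budget and generation budget is an arbitrary illustrative choice:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Beebey/smallcoder-303m")

MAX_CONTEXT = 1024   # model's maximum sequence length
GEN_BUDGET = 256     # illustrative: tokens reserved for the generated answer

# Keep the *end* of the prompt (the "Assistant:" cue) if truncation is needed.
tokenizer.truncation_side = "left"

prompt = "User: Explain what this function does.\nAssistant:"
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=MAX_CONTEXT - GEN_BUDGET,
)
print(inputs["input_ids"].shape)  # at most (1, 768) prompt tokens
```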
## How to Use
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Beebey/smallcoder-303m"
device = "cuda"  # or "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
).to(device)

# Note the 'User:' and 'Assistant:' formatting
prompt = "User: Write a Python function to compute the Fibonacci sequence.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generation: the model was trained to emit tokenizer.eos_token_id,
# so it should stop automatically.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
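The decoded `response` above still contains the prompt. To keep only the model's reply, decode just the newly generated tokens (a small follow-up to the snippet above):

```python
# Continues from the snippet above: slice off the prompt tokens
# so only the generated continuation is decoded.
prompt_length = inputs["input_ids"].shape[1]
reply = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(reply.strip())
```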
## Acknowledgements

This model was trained with support from Google's TPU Research Cloud (TRC) program. We thank Google for providing access to the TPU v4 infrastructure that made this training run possible.