---
license: apache-2.0
language:
  - en
  - code
library_name: transformers
pipeline_tag: text-generation
tags:
  - smallcoder
  - code-llm
  - code-generation
  - sft
  - pretraining
  - tpu
  - 303m
  - trc
datasets:
  - HuggingFaceFW/fineweb-edu
  - nvidia/Nemotron-Pretraining-SFT-v1
  - bigcode/starcoderdata
  - nvidia/Nemotron-Pretraining-Code-v1
  - HuggingFaceFW/finewiki
  - open-web-math/open-web-math
  - nvidia/Nemotron-CC-Math-v1
  - nvidia/OpenCodeInstruct
  - nvidia/OpenMathInstruct-2
---

# 🧠 SmallCoder (303M)

SmallCoder is a 303M parameter LLaMA-style language model trained from scratch for code generation and algorithmic reasoning.

This checkpoint represents a 6B-token Supervised Fine-Tuning (SFT) run that fixed a critical End-of-Sequence (EOS) token bug from earlier versions.

Despite its compact size, SmallCoder achieves state-of-the-art (SOTA) coding performance among models under 500M parameters, rivaling 1B–7B-parameter LLMs.

Trained with support from Google’s TPU Research Cloud (TRC) program.


## 🚀 Key Results

| Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
|---|---|---|---|
| **SmallCoder (Stage 4.1)** | 303M | 27.4 % | 31.0 % |
| TinyLlama-1.1B | 1.1B | ~26.4 % | ~27.6 % |
| MPT-1B-Instruct | 1.0B | ~22.0 % | ~25.0 % |
| Zephyr-1.3B-SFT | 1.3B | 31.0 % | 34.0 % |
| Mistral-7B-Base | 7B | 30.5 % | 47.5 % |

βš–οΈ SmallCoder nearly matches Mistral 7B on HumanEval while being 23Γ— smaller.


## 🧬 Model Architecture

A LLaMA-type causal decoder with standard Multi-Head Attention (MHA).

```python
LlamaConfig(
    vocab_size=49152,               # StarCoder tokenizer
    hidden_size=768,
    num_hidden_layers=24,
    num_attention_heads=8,
    num_key_value_heads=8,
    intermediate_size=3072,
    max_position_embeddings=1024,
)
```
| Parameter | Value |
|---|---|
| Total parameters | ≈ 303 M |
| Context length | 1 024 tokens |
| Tokenizer | `bigcode/starcoder` |
| Architecture type | LLaMA (MHA, non-GQA) |
| Precision | bfloat16 |
| Optimizer | AdamW (XLA) |
| Hardware | TPU v4-32 (TRC) |
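
The configuration above can be rebuilt directly with 🤗 Transformers to sanity-check the parameter count. This is a minimal sketch (randomly initialized weights, no checkpoint loading), assuming only the hyperparameters listed in the table:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Rebuild the architecture from the published hyperparameters (random weights).
config = LlamaConfig(
    vocab_size=49152,               # StarCoder tokenizer vocabulary
    hidden_size=768,
    num_hidden_layers=24,
    num_attention_heads=8,
    num_key_value_heads=8,          # equal to the head count -> plain MHA, no GQA
    intermediate_size=3072,
    max_position_embeddings=1024,
)
model = LlamaForCausalLM(config)

# Should report roughly 0.30 B parameters.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters")
```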

## 📚 Training Curriculum (4 Stages, 29.8B tokens)

| Stage | Tokens (B) | Dataset | Objective | Loss ↓ |
|---|---|---|---|---|
| 1. Linguistic Base | 6.3 | FineWeb-Edu | General English grounding | 10.87 → 2.58 |
| 2. Code Specialization | 7.5 | 60 % Nemotron Synthetic Code / 40 % StarCoderData | Code syntax & reasoning | 5.00 → 1.25 |
| 3. Math & Knowledge | 10.0 | Nemotron CC-Math / FineWiki / OpenWebMath | Mathematical reasoning | 2.77 → 1.55 |
| 4.1 SFT (EOS Fixed) | 6.0 | Nemotron SFT / OpenCodeInstruct / OpenMathInstruct-2 | Instruction-tuned code alignment | 1.73 → ~0.70 |

> 🧩 Total ≈ 29.8 B tokens of curated curriculum learning.
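
To illustrate the kind of weighted mixing used in these stages (not the actual training pipeline), the sketch below approximates the Stage 2 mixture (60 % Nemotron synthetic code / 40 % StarCoderData) with `datasets.interleave_datasets`. The split names, configs, and streaming setup are assumptions; check the dataset cards before reuse:

```python
from datasets import load_dataset, interleave_datasets

# Streamed sources for a Stage 2-style mixture (configs/splits are illustrative).
code_synth = load_dataset("nvidia/Nemotron-Pretraining-Code-v1", split="train", streaming=True)
starcoder = load_dataset("bigcode/starcoderdata", data_dir="python", split="train", streaming=True)

# 60 % synthetic code / 40 % StarCoderData, matching the Stage 2 ratio above.
mixture = interleave_datasets([code_synth, starcoder], probabilities=[0.6, 0.4], seed=42)

for example in mixture.take(3):
    print(sorted(example.keys()))
```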


## 📊 Detailed Benchmarks (Stage 4.1 SFT)

| Domain | Benchmark | Metric | Score |
|---|---|---|---|
| Code | HumanEval (0-shot) | pass@1 | 27.4 % |
| Code | MBPP (3-shot) | pass@1 | 31.0 % |
| Math | GSM8K (0-shot) | exact match | 4.55 % |
| Knowledge | WikiText-2 | perplexity ↓ | 167.6 |
| Reasoning | ARC (Easy / Challenge) | acc_norm | 34.6 % / 22.8 % |
| Commonsense | HellaSwag | acc_norm | 28.3 % |

> HumanEval and MBPP scores were computed with a manual evaluation (max_new_tokens=512, temperature=0.2) because of SFT prompt-format truncation issues in lm-eval.
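
For reference, a rough sketch of that manual setup follows. It is not the exact harness used: the `User:` / `Assistant:` prompt wrapper is a guess, and it assumes the OpenAI `human-eval` package for scoring:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from human_eval.data import read_problems, write_jsonl  # pip install human-eval

model_id = "Beebey/smallcoder-303m"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

samples = []
for task_id, problem in read_problems().items():
    # Hypothetical prompt wrapper around the official HumanEval prompt.
    prompt = f"User: Complete the following Python function.\n{problem['prompt']}\nAssistant:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=512,        # settings reported above
            do_sample=True,
            temperature=0.2,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
    completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    samples.append({"task_id": task_id, "completion": completion})

write_jsonl("samples.jsonl", samples)
# Score with the human-eval CLI: evaluate_functional_correctness samples.jsonl
```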


## ⚠️ Known Limitations

1. **Code-specialized model.** Tuned for Python and algorithmic reasoning; performance on general text, math, and commonsense tasks is weak.

2. **Short context.** Trained on 1 024-token sequences only; performance degrades on longer inputs.

3. **Tokenizer bias.** Uses the `bigcode/starcoder` BPE vocabulary, which is optimized for code, not prose.


## 💻 Usage Example

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Beebey/smallcoder-303m"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

prompt = """User: Write a Python function to compute Fibonacci numbers.
Assistant:"""
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

> 💡 Trained using the "User:" / "Assistant:" dialogue format.


## 🧾 Citation

If you use SmallCoder (303M) in your research, please cite:

```bibtex
@misc{smallcoder303m,
  title  = {SmallCoder: A 303M-parameter Code LLM trained from scratch},
  author = {Da Silva, Ilan},
  year   = {2025},
  url    = {https://huggingface.co/Beebey/smallcoder-303m},
  note   = {Trained with Google TPU Research Cloud (TRC) support}
}
```

πŸ™ Acknowledgements

This model was trained with support from the Google TPU Research Cloud (TRC) program. Special thanks to the open datasets that enabled this work: FineWeb, StarCoderData, Nemotron, and OpenWebMath.


## 🧩 Summary

| Category | Description |
|---|---|
| Type | Code LLM (LLaMA-style) |
| Parameters | 303 M |
| Training tokens | ~29.8 B |
| Specialty | Code generation & reasoning |
| Context window | 1 024 tokens |
| Tokenizer | `bigcode/starcoder` |
| License | Apache 2.0 |
| Hardware | TPU v4 (TRC Program) |

> 🔬 SmallCoder (303M) demonstrates that a carefully designed sub-500M model can achieve near-SOTA coding performance, matching 1B-class models on HumanEval and proving that efficient, compact, open models still matter.