---
license: apache-2.0
language:
- en
- code
library_name: transformers
tags:
- smallcoder
- code-llm
- sft
- 303m
- trc
datasets:
- HuggingFaceFW/fineweb-edu
- nvidia/Nemotron-Pretraining-SFT-v1
- bigcode/starcoderdata
- nvidia/Nemotron-Pretraining-Code-v1
- HuggingFaceFW/finewiki
- open-web-math/open-web-math
- nvidia/Nemotron-CC-Math-v1
- nvidia/OpenCodeInstruct
- nvidia/OpenMathInstruct-2
---
# SmallCoder (303M)
SmallCoder is a 303-million-parameter Large Language Model (LLM) trained from scratch, specializing in code generation and algorithmic reasoning.
This checkpoint is the result of a 6 Billion token Supervised Fine-Tuning (SFT) run, which fixed a critical End-of-Sequence (EOS) token bug present in previous versions.
This model demonstrates state-of-the-art (SOTA) coding performance for its size, outperforming models larger than 1B parameters and competing with models 23x its size.
Trained with support from Google's TPU Research Cloud (TRC) program.
## Key Performance (Benchmarks)
The goal of SmallCoder was to maximize coding performance in a compact (<500M) package. This model achieves SOTA scores that rival or exceed models in the 1B+ class.
| Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
|---|---|---|---|
| SmallCoder (S4.1) | 303M | 27.4% | 31.0% |
| TinyLlama-1.1B | 1.1B | ~26.4% | ~27.6% |
| MPT-1B-Instruct | 1.0B | ~22.0% | ~25.0% |
| Zephyr-1.3B SFT | 1.3B | 31.0% | 34.0% |
| Mistral-7B Base | 7B | 30.5% | 47.5% |
SmallCoder (303M) nearly achieves parity with Mistral 7B on HumanEval while being 23x smaller.
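For reference, pass@1/pass@k on code benchmarks is conventionally computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021). A minimal sketch of the metric itself, not this card's actual evaluation code; the sample counts below are purely illustrative:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = samples generated per problem and c = samples that pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: 10 samples for one problem, 3 of them pass.
print(f"{pass_at_k(n=10, c=3, k=1):.3f}")  # 0.300 = expected pass@1
```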
## Model Architecture
This model uses a Llama-type architecture (MHA) with 303M parameters.
- Architecture: LlamaForCausalLM (MHA)
- Hidden Size: 768
- Layers: 24
- Attention Heads: 8
- KV Heads: 8 (Standard MHA)
- Vocab Size: 49152 (Tokenizer: `bigcode/starcoder`)
- Max Context: 1024 tokens
The corresponding `LlamaConfig` (abridged):

```python
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=49152,
    hidden_size=768,
    num_hidden_layers=24,
    intermediate_size=3072,
    num_attention_heads=8,
    num_key_value_heads=8,
    max_position_embeddings=1024,
    # ... (remaining fields omitted in the original card)
)
```
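As a sanity check, the parameter count can be reproduced by instantiating this configuration and summing parameter sizes. A minimal sketch; `tie_word_embeddings=False` is an assumption made here (it is what brings the total near the advertised 303M) and is not stated in the card:

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=49152,
    hidden_size=768,
    num_hidden_layers=24,
    intermediate_size=3072,
    num_attention_heads=8,
    num_key_value_heads=8,
    max_position_embeddings=1024,
    tie_word_embeddings=False,  # assumption: untied input/output embeddings
)

model = LlamaForCausalLM(config)  # randomly initialized, CPU, ~1.2 GB in fp32
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # roughly 302M with these settings
```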
## Training Plan (4 Stages)
This model is the result of a multi-stage training curriculum totaling 29.8 Billion tokens.
### Stage 1: Linguistic Base (Completed)
- Tokens: 6.3B
- Dataset: `FineWeb-Edu`
- Objective: Learn natural language.
- Loss: 10.87 → 2.58
### Stage 2: Code Specialization (Completed)
- Tokens: 7.5B
- Dataset: `Nemotron Synthetic Code Q/A CoT` (60%) / `StarCoderData` (40%)
- Objective: Learn code syntax and reasoning.
- Loss: 5.00 → 1.25
### Stage 3: Math & Knowledge (Completed)
- Tokens: 10B
- Dataset: `Nemotron CC-Math-4plus` (40%) / `FineWiki-EN` (35%) / `Nemotron CC-Math-4` (15%) / `OpenWebMath` (10%)
- Objective: Learn mathematical reasoning.
- Loss: 2.77 → 1.55
- Result: A solid base model (Wikitext PPL: 35.4).
### Stage 4.1: SFT (EOS-Fixed) (Completed)
- Tokens: 6B
- Starting Checkpoint: `stage-3/`
- Dataset: `Nemotron-SFT-Code` (45%), `OpenCodeInstruct` (30%), `OpenMathInstruct-2` (15%), `Nemotron-SFT-General` (10%)
- Objective: Align on code instructions and fix the EOS generation bug.
- Loss: 1.73 → ~0.70 (low point)
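Each stage blends its sources by weight. Purely as an illustration of the mixture arithmetic (not the actual training pipeline), here is a minimal sketch of proportional source sampling under the Stage 4.1 weights; a real pipeline would more likely use something like `datasets.interleave_datasets(..., probabilities=...)`:

```python
import random

# Stage 4.1 mixture weights, copied from the list above.
MIXTURE = {
    "Nemotron-SFT-Code": 0.45,
    "OpenCodeInstruct": 0.30,
    "OpenMathInstruct-2": 0.15,
    "Nemotron-SFT-General": 0.10,
}

def sample_source(rng: random.Random) -> str:
    """Pick which dataset the next training example is drawn from,
    proportionally to the mixture weights."""
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly 4500 / 3000 / 1500 / 1000 draws
```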
## Detailed Benchmarks (Stage 4.1)
The SFT (Code) scores are excellent. The generalist scores (Math, Reasoning) are low, indicating the SFT has heavily specialized the model (a "code specialist").
| Task | Benchmark | n-shot | Metric | Score |
|---|---|---|---|---|
| Code | HumanEval | 0 | pass@1 | 27.4% |
| Code | MBPP | 3 | pass@1 | 31.0% |
| Math | GSM8k | 0 | exact_match | 4.55% |
| General | Wikitext | 0 | word_perplexity | 167.6 |
| Reasoning | ARC Easy | 0 | acc_norm | 34.6% |
| Reasoning | ARC Challenge | 0 | acc_norm | 22.8% |
| Commonsense | HellaSwag | 0 | acc_norm | 28.3% |
HumanEval/MBPP scores are based on manual analysis (`max_gen_toks=512`), as the official lm-eval harness fails to evaluate this model due to SFT formatting and truncation issues.
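Since the standard harness could not be used, a manual functional check looks roughly like the following. This is a minimal sketch with a hypothetical problem and test (the real HumanEval problems ship their own unit tests), not the exact evaluation script used for this card:

```python
def passes(candidate_code: str, test_code: str, entry_point: str) -> bool:
    """Run a generated solution against a benchmark-style unit test.
    WARNING: exec() of model output is unsafe; real evaluations sandbox this."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function
        exec(test_code, namespace)        # define check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False

# Hypothetical example in the HumanEval format (not an actual benchmark item).
candidate = "def add(a, b):\n    return a + b\n"
test = "def check(candidate):\n    assert candidate(2, 3) == 5\n"
print(passes(candidate, test, entry_point="add"))  # True
```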
## Known Limitations
- Code Specialist: Heavily optimized for code (27.4% HumanEval) at the expense of other skills. Performance on math (GSM8k: 4.55%) and general knowledge (Wikitext PPL: 167) is low. This is a code specialist model, not a generalist.
- Limited Context: This model was trained exclusively on a sequence length of 1024 tokens and cannot handle longer prompts.
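Because of the 1024-token limit noted above, long inputs should be truncated (or rejected) before generation. A minimal sketch; the split between prompt budget and generation budget is an arbitrary illustrative choice:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Beebey/smallcoder-303m")

MAX_CONTEXT = 1024   # model's maximum sequence length
GEN_BUDGET = 256     # illustrative: tokens reserved for the generated answer

# Keep the *end* of the prompt (the "Assistant:" cue) if truncation is needed.
tokenizer.truncation_side = "left"

prompt = "User: Explain what this function does.\nAssistant:"
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=MAX_CONTEXT - GEN_BUDGET,
)
print(inputs["input_ids"].shape)  # at most (1, 768) prompt tokens
```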
## How to Use
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Beebey/smallcoder-303m"
device = "cuda"  # or "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
).to(device)

# Note the 'User:' and 'Assistant:' formatting
prompt = "User: Write a Python function to compute the Fibonacci sequence.\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generation: the model was trained to emit tokenizer.eos_token_id,
# so it should stop automatically.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
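The decoded `response` above still contains the prompt. To keep only the model's reply, decode just the newly generated tokens (a small follow-up to the snippet above):

```python
# Continues from the snippet above: slice off the prompt tokens
# so only the generated continuation is decoded.
prompt_length = inputs["input_ids"].shape[1]
reply = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(reply.strip())
```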
## Acknowledgements

This model was trained with support from Google's TPU Research Cloud (TRC) program. We thank Google for providing access to the TPU v4 infrastructure that made this training run possible.