Beebey committed on
Commit
297e718
·
verified ·
1 Parent(s): a1acc74

Update README.md

Files changed (1)
  1. README.md +116 -90
README.md CHANGED
@@ -4,10 +4,14 @@ language:
 - en
 - code
 library_name: transformers
 tags:
 - smallcoder
 - code-llm
 - sft
 - 303m
 - trc
 datasets:
@@ -22,149 +26,171 @@ datasets:
 - nvidia/OpenMathInstruct-2
 ---

- # SmallCoder (303M)

- SmallCoder is a **303 Million parameter** Large Language Model (LLM) trained from scratch, specializing in code generation and algorithmic reasoning.

- This checkpoint is the result of a 6 Billion token Supervised Fine-Tuning (SFT) run, which **fixed a critical End-of-Sequence (EOS) token bug** present in previous versions.

- This model demonstrates state-of-the-art (SOTA) coding performance for its size, outperforming models larger than 1B parameters and competing with models 23x its size.

- **Trained with support from Google's TPU Research Cloud (TRC) program.**

- ## 🚀 Key Performance (Benchmarks)

- The goal of SmallCoder was to maximize coding performance in a compact (<500M) package. This model achieves SOTA scores that rival or exceed models in the 1B+ class.

 | Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
- | :--- | :---: | :---: | :---: |
- | **SmallCoder (S4.1)** | **303M** | **27.4%** | **31.0%** |
- | TinyLlama-1.1B | 1.1B | ~26.4% | ~27.6% |
- | MPT-1B-Instruct | 1.0B | ~22.0% | ~25.0% |
- | Zephyr-1.3B SFT | 1.3B | 31.0% | 34.0% |
- | Mistral-7B Base | 7B | 30.5% | 47.5% |

- SmallCoder (303M) nearly achieves **parity with Mistral 7B** on HumanEval while being **23x smaller**.

- ## 🧠 Model Architecture

- This model uses a Llama-type architecture (MHA) with 303M parameters.

- * **Architecture**: LlamaForCausalLM (MHA)
- * **Hidden Size**: 768
- * **Layers**: 24
- * **Attention Heads**: 8
- * **KV Heads**: 8 (Standard MHA)
- * **Vocab Size**: 49152 (Tokenizer: `bigcode/starcoder`)
- * **Max Context**: 1024 tokens

 ```python
 LlamaConfig(
-     vocab_size=49152,
     hidden_size=768,
     num_hidden_layers=24,
-     intermediate_size=3072,
     num_attention_heads=8,
     num_key_value_heads=8,
     max_position_embeddings=1024,
-     ...
 )
 ```

- ## 🛠️ Training Plan (4 Stages)

- This model is the result of a multi-stage training curriculum totaling **29.8 Billion tokens**.
-
- ### Stage 1: Linguistic Base (Completed)
-
- * **Tokens**: 6.3B
- * **Dataset**: `FineWeb-Edu`
- * **Objective**: Learn natural language.
- * **Loss**: 10.87 → **2.58**

- ### Stage 2: Code Specialization (Completed)

- * **Tokens**: 7.5B
- * **Dataset**: `Nemotron Synthetic Code Q/A CoT` (60%) / `StarCoderData` (40%)
- * **Objective**: Learn code syntax and reasoning.
- * **Loss**: 5.00 → **1.25**

- ### Stage 3: Math & Knowledge (Completed)

- * **Tokens**: 10B
- * **Dataset**: `Nemotron CC-Math-4plus` (40%) / `FineWiki-EN` (35%) / `Nemotron CC-Math-4` (15%) / `OpenWebMath` (10%)
- * **Objective**: Learn mathematical reasoning.
- * **Loss**: 2.77 → **1.55**
- * **Result**: A solid base model (Wikitext PPL: 35.4).

- ### Stage 4.1: SFT (EOS-Fixed) (Completed)

- * **Tokens**: 6B
- * **Starting Checkpoint**: `stage-3/`
- * **Dataset**: `Nemotron-SFT-Code` (45%), `OpenCodeInstruct` (30%), `OpenMathInstruct-2` (15%), `Nemotron-SFT-General` (10%)
- * **Objective**: Align on code instructions and fix the EOS generation bug.
- * **Loss**: 1.73 → **~0.70** (low point)

- -----

- ## 📊 Detailed Benchmarks (Stage 4.1)

- The SFT (Code) scores are excellent. The generalist scores (Math, Reasoning) are low, indicating the SFT has heavily specialized the model (a "code specialist").

- | Task | Benchmark | n-shot | Metric | Score |
- | :--- | :--- | :---: | :--- | :---: |
- | **Code** | **HumanEval** | 0 | **pass@1** | **27.4%** |
- | **Code** | **MBPP** | 3 | **pass@1** | **31.0%** |
- | **Math** | **GSM8k** | 0 | exact_match | **4.55%** |
- | **General** | **Wikitext** | 0 | word_perplexity | 167.6 |
- | **Reasoning** | **ARC Easy** | 0 | acc_norm | 34.6% |
- | **Reasoning** | **ARC Challenge** | 0 | acc_norm | 22.8% |
- | **Commonsense** | **HellaSwag** | 0 | acc_norm | 28.3% |

- *`humaneval`/`mbpp` scores are based on manual analysis (`max_gen_toks=512`), as official `lm-eval` benchmarks fail to evaluate this model due to SFT formatting and truncation issues.*

- ## ⚠️ Known Limitations

- 1. **Code Specialist:** Heavily optimized for code (27.4% HEval) at the expense of other skills. Performance on math (`gsm8k` 4.55%) and general knowledge (PPL 167) is low. **This is a code specialist model, not a generalist.**
- 2. **Limited Context:** This model was trained exclusively on a sequence length of **1024 tokens**. It cannot handle longer prompts.

- ## ⚡ How to Use

 ```python
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM

 model_id = "Beebey/smallcoder-303m"
- device = "cuda" # or "cpu"

 tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForCausalLM.from_pretrained(
-     model_id,
-     torch_dtype=torch.bfloat16
- ).to(device)

- # Note the 'User:' and 'Assistant:' formatting
- prompt = "User: Write a Python function to compute the Fibonacci sequence.\nAssistant:"
 inputs = tokenizer(prompt, return_tensors="pt").to(device)

- # Generation
- # The model was trained to use tokenizer.eos_token_id
- # It should stop automatically.
- outputs = model.generate(
-     **inputs,
-     max_new_tokens=512,
-     pad_token_id=tokenizer.eos_token_id,
-     eos_token_id=tokenizer.eos_token_id
- )

- response = tokenizer.decode(outputs[0], skip_special_tokens=True)
- print(response)
 ```

- ## Acknowledgements

- ### Trained with the Google TRC

- This model was trained with support from Google's **TPU Research Cloud (TRC)** program. We thank Google for providing access to the TPU v4 infrastructure that made this training run possible.

 - en
 - code
 library_name: transformers
+ pipeline_tag: text-generation
 tags:
 - smallcoder
 - code-llm
+ - code-generation
 - sft
+ - pretraining
+ - tpu
 - 303m
 - trc
 datasets:

 - nvidia/OpenMathInstruct-2
 ---

+ # 🧠 SmallCoder (303M)

+ **SmallCoder** is a **303M parameter** LLaMA-style language model trained **from scratch** for **code generation** and **algorithmic reasoning**.

+ This checkpoint represents a **6B-token Supervised Fine-Tuning (SFT)** run that fixed a critical **End-of-Sequence (EOS) token bug** from earlier versions.

+ Despite its compact size, SmallCoder achieves **state-of-the-art (SOTA) coding performance for <500M models**, rivaling 1B–7B parameter LLMs.

+ > Trained with support from **Google's TPU Research Cloud (TRC)** program.

+ ---

+ ## 🚀 Key Results

 | Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
+ |:------|:----:|:------------------:|:--------------:|
+ | **SmallCoder (Stage 4.1)** | **303M** | **27.4%** | **31.0%** |
+ | TinyLlama-1.1B | 1.1B | ~26.4% | ~27.6% |
+ | MPT-1B-Instruct | 1.0B | ~22.0% | ~25.0% |
+ | Zephyr-1.3B-SFT | 1.3B | 31.0% | 34.0% |
+ | Mistral-7B-Base | 7B | 30.5% | 47.5% |

+ > ⚖️ **SmallCoder nearly matches Mistral 7B on HumanEval while being 23× smaller.**

+ ---

+ ## 🧬 Model Architecture

+ A **LLaMA-type causal decoder** with standard Multi-Head Attention (MHA).

 ```python
 LlamaConfig(
+     vocab_size=49152,             # StarCoder tokenizer
     hidden_size=768,
     num_hidden_layers=24,
     num_attention_heads=8,
     num_key_value_heads=8,
+     intermediate_size=3072,
     max_position_embeddings=1024,
 )
 ```

+ | Parameter | Value |
+ | :--- | :--- |
+ | Total parameters | ≈ 303M |
+ | Context length | 1024 tokens |
+ | Tokenizer | `bigcode/starcoder` |
+ | Architecture type | LLaMA (MHA, non-GQA) |
+ | Precision | bfloat16 |
+ | Optimizer | AdamW XLA |
+ | Hardware | TPU v4-32 (TRC) |
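
The ≈303M figure can be sanity-checked from the config alone; below is a minimal sketch (not part of this repository) that rebuilds the `LlamaConfig` above and counts parameters, assuming untied input/output embeddings since the card does not say whether they are tied.

```python
# Sanity-check sketch (not from this repo): rebuild the config above and count parameters.
# Assumption: input/output embeddings are untied; tying them would drop the total to ~265M.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=49152,             # bigcode/starcoder tokenizer
    hidden_size=768,
    num_hidden_layers=24,
    num_attention_heads=8,
    num_key_value_heads=8,        # standard MHA, no GQA
    intermediate_size=3072,
    max_position_embeddings=1024,
    tie_word_embeddings=False,    # assumption, see note above
)

model = LlamaForCausalLM(config)
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e6:.1f}M parameters")   # ~302M, consistent with the ~303M figure
```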

+ ---

+ ## 📚 Training Curriculum (4 Stages, 29.8B tokens)

+ | Stage | Tokens (B) | Dataset | Objective | Loss ↓ |
+ | :--- | :---: | :--- | :--- | :---: |
+ | **1. Linguistic Base** | 6.3 | FineWeb-Edu | General English grounding | 10.87 → 2.58 |
+ | **2. Code Specialization** | 7.5 | 60% Nemotron Synthetic Code / 40% StarCoderData | Code syntax & reasoning | 5.00 → 1.25 |
+ | **3. Math & Knowledge** | 10.0 | Nemotron CC-Math / FineWiki / OpenWebMath | Mathematical reasoning | 2.77 → 1.55 |
+ | **4.1 SFT (EOS Fixed)** | 6.0 | Nemotron SFT / OpenCodeInstruct / OpenMathInstruct-2 | Instruction-tuned code alignment | 1.73 → ~0.70 |

+ > 🧩 Total ≈ 29.8B tokens of curated curriculum learning.
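
To make the sampling ratios concrete, here is a minimal sketch of weighted interleaving with `datasets.interleave_datasets`, in the spirit of Stage 2's 60/40 split. It mixes two StarCoderData language subsets purely for simplicity; the actual run mixed Nemotron synthetic code Q/A (60%) with StarCoderData (40%), and the dataset ids and loading arguments below are assumptions, not the training pipeline.

```python
# Illustrative weighted interleaving (not the training code): two StarCoderData subsets
# stand in for the real 60% synthetic-Q/A / 40% raw-code mix of Stage 2.
from datasets import load_dataset, interleave_datasets

python_code = load_dataset("bigcode/starcoderdata", data_dir="python",
                           split="train", streaming=True)
java_code = load_dataset("bigcode/starcoderdata", data_dir="java",
                         split="train", streaming=True)

mixed = interleave_datasets(
    [python_code, java_code],
    probabilities=[0.6, 0.4],   # sampling weights, analogous to the 60/40 split above
    seed=42,
)

# Peek at a few mixed examples.
for example in mixed.take(3):
    print(list(example.keys()))
```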

+ ---

+ ## 📊 Detailed Benchmarks (Stage 4.1 SFT)

+ | Domain | Benchmark | Metric | Score |
+ | :--- | :--- | :--- | :---: |
+ | **Code** | HumanEval (0-shot) | pass@1 | **27.4%** |
+ | **Code** | MBPP (3-shot) | pass@1 | **31.0%** |
+ | **Math** | GSM8k (0-shot) | exact match | **4.55%** |
+ | **Knowledge** | Wikitext-2 | perplexity ↓ | **167.6** |
+ | **Reasoning** | ARC (Easy / Challenge) | acc norm | 34.6% / 22.8% |
+ | **Commonsense** | HellaSwag | acc norm | 28.3% |

+ > `humaneval`/`mbpp` were computed with manual evaluation (`max_new_tokens=512`, `temp=0.2`) due to SFT format truncation issues in `lm-eval`.
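
A minimal sketch of that manual protocol (not the exact evaluation script): one sampled completion per HumanEval task at `temperature=0.2` with `max_new_tokens=512`; pass@1 is then the fraction of tasks whose completion passes the task's unit tests, which should be run in an isolated process rather than inline.

```python
# Sketch of the manual pass@1 protocol described above (not the exact script used).
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Beebey/smallcoder-303m"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

problems = load_dataset("openai_humaneval", split="test")

def complete(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=512,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Keep only the newly generated tokens.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# pass@1 would be: (# tasks whose completion passes task["test"]) / len(problems),
# with the tests executed in a sandboxed subprocess.
print(complete(problems[0]["prompt"])[:300])
```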

+ ---

+ ## ⚠️ Known Limitations

+ 1. **Code-Specialized Model**
+    Tuned for Python and algorithmic reasoning. Poor performance on general text, math, and commonsense tasks.

+ 2. **Short Context**
+    Trained on **1024-token** sequences only. Performance degrades on longer inputs (see the truncation sketch below).

+ 3. **Tokenizer Bias**
+    Uses the `bigcode/starcoder` BPE vocabulary, which is optimized for code rather than prose.
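
For the context limit in particular, a minimal guard (illustrative, not shipped with the model) is to reserve a generation budget and truncate the prompt to what remains of the 1024-token window; the budget value below is an arbitrary choice.

```python
# Illustrative guard for the 1024-token context window (not part of this repo).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Beebey/smallcoder-303m")
tokenizer.truncation_side = "left"   # keep the end of the prompt (e.g. "Assistant:") when cutting

MAX_CONTEXT = 1024   # max_position_embeddings
GEN_BUDGET = 256     # tokens reserved for the reply (arbitrary choice)

prompt = "User: Explain what this very long code snippet does...\nAssistant:"
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=MAX_CONTEXT - GEN_BUDGET,   # keep prompt + reply within 1024 tokens
)
print(inputs["input_ids"].shape)
```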

+ ---

+ ## 💻 Usage Example

 ```python
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM

 model_id = "Beebey/smallcoder-303m"
+ device = "cuda" if torch.cuda.is_available() else "cpu"

 tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

+ prompt = """User: Write a Python function to compute Fibonacci numbers.
+ Assistant:"""
 inputs = tokenizer(prompt, return_tensors="pt").to(device)

+ with torch.no_grad():
+     outputs = model.generate(
+         **inputs,
+         max_new_tokens=512,
+         eos_token_id=tokenizer.eos_token_id,
+         pad_token_id=tokenizer.eos_token_id,
+     )

+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```

+ 💡 *Trained using the "User:" / "Assistant:" dialogue format.*
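
Building on the snippet above (it reuses `torch`, `model`, `tokenizer`, and `device` from there), a small hypothetical helper can wrap a user message in that format and return only the assistant's part of the decoded output:

```python
# Hypothetical convenience wrapper around the "User:" / "Assistant:" format;
# reuses torch, model, tokenizer and device from the usage example above.
def chat(user_message: str, max_new_tokens: int = 512) -> str:
    prompt = f"User: {user_message}\nAssistant:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Return only what follows the final "Assistant:" marker.
    return decoded.split("Assistant:")[-1].strip()

print(chat("Write a Python function that checks whether a string is a palindrome."))
```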

+ ---

+ ## 🧾 Citation
+
+ If you use **SmallCoder (303M)** in your research, please cite:
+
+ ```bibtex
+ @misc{smallcoder303m,
+   title  = {SmallCoder: A 303M-parameter Code LLM trained from scratch},
+   author = {Da Silva, Ilan},
+   year   = {2025},
+   url    = {https://huggingface.co/Beebey/smallcoder-303m},
+   note   = {Trained with Google TPU Research Cloud (TRC) support}
+ }
+ ```
+
+ ---
+
+ ## 🙏 Acknowledgements
+
+ This model was trained with support from the **Google TPU Research Cloud (TRC)** program.
+ Special thanks to the open datasets that enabled this work:
+ FineWeb, StarCoderData, Nemotron, and OpenWebMath.
+
+ ---
+
+ ## 🧩 Summary
+
+ | Category | Description |
+ | :--- | :--- |
+ | **Type** | Code LLM (LLaMA-style) |
+ | **Parameters** | 303M |
+ | **Training tokens** | ~29.8B |
+ | **Specialty** | Code generation & reasoning |
+ | **Context window** | 1024 tokens |
+ | **Tokenizer** | `bigcode/starcoder` |
+ | **License** | Apache 2.0 |
+ | **Hardware** | TPU v4 (TRC Program) |
+
+ ---

+ > 🔬 **SmallCoder (303M)** demonstrates that a carefully designed <500M model can achieve near-SOTA coding performance, matching 1B-class models on HumanEval, proving that *efficient, compact, open models* still matter.