Beebey committed on
Commit
297e718
·
verified ·
1 Parent(s): a1acc74

Update README.md

Files changed (1)
  1. README.md +116 -90
README.md CHANGED
@@ -4,10 +4,14 @@ language:
 - en
 - code
 library_name: transformers
 tags:
 - smallcoder
 - code-llm
 - sft
 - 303m
 - trc
 datasets:
@@ -22,149 +26,171 @@ datasets:
 - nvidia/OpenMathInstruct-2
 ---

- # SmallCoder (303M)

- SmallCoder is a **303 Million parameter** Large Language Model (LLM) trained from scratch, specializing in code generation and algorithmic reasoning.

- This checkpoint is the result of a 6 Billion token Supervised Fine-Tuning (SFT) run, which **fixed a critical End-of-Sequence (EOS) token bug** present in previous versions.

- This model demonstrates state-of-the-art (SOTA) coding performance for its size, outperforming models larger than 1B parameters and competing with models 23x its size.

- **Trained with support from Google's TPU Research Cloud (TRC) program.**

- ## 🚀 Key Performance (Benchmarks)

- The goal of SmallCoder was to maximize coding performance in a compact (<500M) package. This model achieves SOTA scores that rival or exceed models in the 1B+ class.

 | Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
- | :--- | :---: | :---: | :---: |
- | **SmallCoder (S4.1)** | **303M** | **27.4%** | **31.0%** |
- | TinyLlama-1.1B | 1.1B | ~26.4% | ~27.6% |
- | MPT-1B-Instruct | 1.0B | ~22.0% | ~25.0% |
- | Zephyr-1.3B SFT | 1.3B | 31.0% | 34.0% |
- | Mistral-7B Base | 7B | 30.5% | 47.5% |

- SmallCoder (303M) nearly achieves **parity with Mistral 7B** on HumanEval while being **23x smaller**.

- ## 🧠 Model Architecture

- This model uses a Llama-type architecture (MHA) with 303M parameters.

- * **Architecture**: LlamaForCausalLM (MHA)
- * **Hidden Size**: 768
- * **Layers**: 24
- * **Attention Heads**: 8
- * **KV Heads**: 8 (Standard MHA)
- * **Vocab Size**: 49152 (Tokenizer: `bigcode/starcoder`)
- * **Max Context**: 1024 tokens

 ```python
 LlamaConfig(
-     vocab_size=49152,
     hidden_size=768,
     num_hidden_layers=24,
-     intermediate_size=3072,
     num_attention_heads=8,
     num_key_value_heads=8,
     max_position_embeddings=1024,
-     ...
 )
 ```

- ## 🛠️ Training Plan (4 Stages)

- This model is the result of a multi-stage training curriculum totaling **29.8 Billion tokens**.
-
- ### Stage 1: Linguistic Base (Completed)
-
- * **Tokens**: 6.3B
- * **Dataset**: `FineWeb-Edu`
- * **Objective**: Learn natural language.
- * **Loss**: 10.87 → **2.58**

- ### Stage 2: Code Specialization (Completed)

- * **Tokens**: 7.5B
- * **Dataset**: `Nemotron Synthetic Code Q/A CoT` (60%) / `StarCoderData` (40%)
- * **Objective**: Learn code syntax and reasoning.
- * **Loss**: 5.00 → **1.25**

- ### Stage 3: Math & Knowledge (Completed)

- * **Tokens**: 10B
- * **Dataset**: `Nemotron CC-Math-4plus` (40%) / `FineWiki-EN` (35%) / `Nemotron CC-Math-4` (15%) / `OpenWebMath` (10%)
- * **Objective**: Learn mathematical reasoning.
- * **Loss**: 2.77 → **1.55**
- * **Result**: A solid base model (Wikitext PPL: 35.4).

- ### Stage 4.1: SFT (EOS-Fixed) (Completed)

- * **Tokens**: 6B
- * **Starting Checkpoint**: `stage-3/`
- * **Dataset**: `Nemotron-SFT-Code` (45%), `OpenCodeInstruct` (30%), `OpenMathInstruct-2` (15%), `Nemotron-SFT-General` (10%)
- * **Objective**: Align on code instructions and fix the EOS generation bug.
- * **Loss**: 1.73 → **~0.70** (low point)

- -----

- ## 📊 Detailed Benchmarks (Stage 4.1)

- The SFT (Code) scores are excellent. The generalist scores (Math, Reasoning) are low, indicating the SFT has heavily specialized the model (a "code specialist").

- | Task | Benchmark | n-shot | Metric | Score |
- | :--- | :--- | :---: | :--- | :---: |
- | **Code** | **HumanEval** | 0 | **pass@1** | **27.4%** |
- | **Code** | **MBPP** | 3 | **pass@1** | **31.0%** |
- | **Math** | **GSM8k** | 0 | exact_match | **4.55%** |
- | **General** | **Wikitext** | 0 | word_perplexity | 167.6 |
- | **Reasoning** | **ARC Easy** | 0 | acc_norm | 34.6% |
- | **Reasoning** | **ARC Challenge** | 0 | acc_norm | 22.8% |
- | **Commonsense** | **HellaSwag** | 0 | acc_norm | 28.3% |

- *`humaneval`/`mbpp` scores are based on manual analysis (`max_gen_toks=512`), as official `lm-eval` benchmarks fail to evaluate this model due to SFT formatting and truncation issues.*

- ## ⚠️ Known Limitations

- 1. **Code Specialist:** Heavily optimized for code (27.4% HEval) at the expense of other skills. Performance on math (`gsm8k` 4.55%) and general knowledge (PPL 167) is low. **This is a code specialist model, not a generalist.**
- 2. **Limited Context:** This model was trained exclusively on a sequence length of **1024 tokens**. It cannot handle longer prompts.

- ## ⚡ How to Use

 ```python
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM

 model_id = "Beebey/smallcoder-303m"
- device = "cuda" # or "cpu"

 tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForCausalLM.from_pretrained(
-     model_id,
-     torch_dtype=torch.bfloat16
- ).to(device)

- # Note the 'User:' and 'Assistant:' formatting
- prompt = "User: Write a Python function to compute the Fibonacci sequence.\nAssistant:"
 inputs = tokenizer(prompt, return_tensors="pt").to(device)

- # Generation
- # The model was trained to use tokenizer.eos_token_id
- # It should stop automatically.
- outputs = model.generate(
-     **inputs,
-     max_new_tokens=512,
-     pad_token_id=tokenizer.eos_token_id,
-     eos_token_id=tokenizer.eos_token_id
- )

- response = tokenizer.decode(outputs[0], skip_special_tokens=True)
- print(response)
 ```

- ## Acknowledgements

- ### Trained with the Google TRC

- This model was trained with support from Google's **TPU Research Cloud (TRC)** program. We thank Google for providing access to the TPU v4 infrastructure that made this training run possible.

 - en
 - code
 library_name: transformers
+ pipeline_tag: text-generation
 tags:
 - smallcoder
 - code-llm
+ - code-generation
 - sft
+ - pretraining
+ - tpu
 - 303m
 - trc
 datasets:

 - nvidia/OpenMathInstruct-2
 ---

+ # 🧠 SmallCoder (303M)

+ **SmallCoder** is a **303M parameter** LLaMA-style language model trained **from scratch** for **code generation** and **algorithmic reasoning**.

+ This checkpoint represents a **6B-token Supervised Fine-Tuning (SFT)** run that fixed a critical **End-of-Sequence (EOS) token bug** from earlier versions.

+ Despite its compact size, SmallCoder achieves **state-of-the-art (SOTA) coding performance for <500M models**, rivaling 1B–7B parameter LLMs.

+ > Trained with support from **Google's TPU Research Cloud (TRC)** program.

+ ---

+ ## 🚀 Key Results

 | Model | Size | HumanEval (pass@1) | MBPP (pass@1) |
+ |:------|:----:|:------------------:|:--------------:|
+ | **SmallCoder (Stage 4.1)** | **303M** | **27.4%** | **31.0%** |
+ | TinyLlama-1.1B | 1.1B | ~26.4% | ~27.6% |
+ | MPT-1B-Instruct | 1.0B | ~22.0% | ~25.0% |
+ | Zephyr-1.3B-SFT | 1.3B | 31.0% | 34.0% |
+ | Mistral-7B-Base | 7B | 30.5% | 47.5% |

+ > ⚖️ **SmallCoder nearly matches Mistral 7B on HumanEval while being 23× smaller.**

+ ---

+ ## 🧬 Model Architecture

+ A **LLaMA-type causal decoder** with standard Multi-Head Attention (MHA).

 ```python
 LlamaConfig(
+     vocab_size=49152,             # StarCoder tokenizer
     hidden_size=768,
     num_hidden_layers=24,
     num_attention_heads=8,
     num_key_value_heads=8,
+     intermediate_size=3072,
     max_position_embeddings=1024,
 )
 ```

+ | Parameter | Value |
+ | :--- | :--- |
+ | Total parameters | ≈ 303M |
+ | Context length | 1024 tokens |
+ | Tokenizer | `bigcode/starcoder` |
+ | Architecture type | LLaMA (MHA, non-GQA) |
+ | Precision | bfloat16 |
+ | Optimizer | AdamW XLA |
+ | Hardware | TPU v4-32 (TRC) |
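
The ≈303M figure can be sanity-checked from the config alone; below is a minimal sketch (not part of this repository) that rebuilds the `LlamaConfig` above and counts parameters, assuming untied input/output embeddings since the card does not say whether they are tied.

```python
# Sanity-check sketch (not from this repo): rebuild the config above and count parameters.
# Assumption: input/output embeddings are untied; tying them would drop the total to ~265M.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=49152,             # bigcode/starcoder tokenizer
    hidden_size=768,
    num_hidden_layers=24,
    num_attention_heads=8,
    num_key_value_heads=8,        # standard MHA, no GQA
    intermediate_size=3072,
    max_position_embeddings=1024,
    tie_word_embeddings=False,    # assumption, see note above
)

model = LlamaForCausalLM(config)
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e6:.1f}M parameters")   # ~302M, consistent with the ~303M figure
```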

+ ---

+ ## 📚 Training Curriculum (4 Stages, 29.8B tokens)

+ | Stage | Tokens (B) | Dataset | Objective | Loss ↓ |
+ | :--- | :---: | :--- | :--- | :---: |
+ | **1. Linguistic Base** | 6.3 | FineWeb-Edu | General English grounding | 10.87 → 2.58 |
+ | **2. Code Specialization** | 7.5 | 60% Nemotron Synthetic Code / 40% StarCoderData | Code syntax & reasoning | 5.00 → 1.25 |
+ | **3. Math & Knowledge** | 10.0 | Nemotron CC-Math / FineWiki / OpenWebMath | Mathematical reasoning | 2.77 → 1.55 |
+ | **4.1 SFT (EOS Fixed)** | 6.0 | Nemotron SFT / OpenCodeInstruct / OpenMathInstruct-2 | Instruction-tuned code alignment | 1.73 → ~0.70 |

+ > 🧩 Total ≈ 29.8B tokens of curated curriculum learning.
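
To make the sampling ratios concrete, here is a minimal sketch of weighted interleaving with `datasets.interleave_datasets`, in the spirit of Stage 2's 60/40 split. It mixes two StarCoderData language subsets purely for simplicity; the actual run mixed Nemotron synthetic code Q/A (60%) with StarCoderData (40%), and the dataset ids and loading arguments below are assumptions, not the training pipeline.

```python
# Illustrative weighted interleaving (not the training code): two StarCoderData subsets
# stand in for the real 60% synthetic-Q/A / 40% raw-code mix of Stage 2.
from datasets import load_dataset, interleave_datasets

python_code = load_dataset("bigcode/starcoderdata", data_dir="python",
                           split="train", streaming=True)
java_code = load_dataset("bigcode/starcoderdata", data_dir="java",
                         split="train", streaming=True)

mixed = interleave_datasets(
    [python_code, java_code],
    probabilities=[0.6, 0.4],   # sampling weights, analogous to the 60/40 split above
    seed=42,
)

# Peek at a few mixed examples.
for example in mixed.take(3):
    print(list(example.keys()))
```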

+ ---

+ ## 📊 Detailed Benchmarks (Stage 4.1 SFT)

+ | Domain | Benchmark | Metric | Score |
+ | :--- | :--- | :--- | :---: |
+ | **Code** | HumanEval (0-shot) | pass@1 | **27.4%** |
+ | **Code** | MBPP (3-shot) | pass@1 | **31.0%** |
+ | **Math** | GSM8k (0-shot) | exact match | **4.55%** |
+ | **Knowledge** | Wikitext-2 | perplexity ↓ | **167.6** |
+ | **Reasoning** | ARC (Easy / Challenge) | acc norm | 34.6% / 22.8% |
+ | **Commonsense** | HellaSwag | acc norm | 28.3% |

+ > `humaneval`/`mbpp` were computed with manual evaluation (`max_new_tokens=512`, `temp=0.2`) due to SFT format truncation issues in `lm-eval`.
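
A minimal sketch of that manual protocol (not the exact evaluation script): one sampled completion per HumanEval task at `temperature=0.2` with `max_new_tokens=512`; pass@1 is then the fraction of tasks whose completion passes the task's unit tests, which should be run in an isolated process rather than inline.

```python
# Sketch of the manual pass@1 protocol described above (not the exact script used).
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Beebey/smallcoder-303m"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

problems = load_dataset("openai_humaneval", split="test")

def complete(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=512,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Keep only the newly generated tokens.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# pass@1 would be: (# tasks whose completion passes task["test"]) / len(problems),
# with the tests executed in a sandboxed subprocess.
print(complete(problems[0]["prompt"])[:300])
```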

+ ---

+ ## ⚠️ Known Limitations

+ 1. **Code-Specialized Model**
+    Tuned for Python and algorithmic reasoning. Poor performance on general text, math, and commonsense tasks.

+ 2. **Short Context**
+    Trained on **1024-token** sequences only. Performance degrades on longer inputs (see the truncation sketch below).

+ 3. **Tokenizer Bias**
+    Uses the `bigcode/starcoder` BPE vocabulary, which is optimized for code rather than prose.
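
For the context limit in particular, a minimal guard (illustrative, not shipped with the model) is to reserve a generation budget and truncate the prompt to what remains of the 1024-token window; the budget value below is an arbitrary choice.

```python
# Illustrative guard for the 1024-token context window (not part of this repo).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Beebey/smallcoder-303m")
tokenizer.truncation_side = "left"   # keep the end of the prompt (e.g. "Assistant:") when cutting

MAX_CONTEXT = 1024   # max_position_embeddings
GEN_BUDGET = 256     # tokens reserved for the reply (arbitrary choice)

prompt = "User: Explain what this very long code snippet does...\nAssistant:"
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=MAX_CONTEXT - GEN_BUDGET,   # keep prompt + reply within 1024 tokens
)
print(inputs["input_ids"].shape)
```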

+ ---

+ ## 💻 Usage Example

 ```python
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM

 model_id = "Beebey/smallcoder-303m"
+ device = "cuda" if torch.cuda.is_available() else "cpu"

 tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

+ prompt = """User: Write a Python function to compute Fibonacci numbers.
+ Assistant:"""
 inputs = tokenizer(prompt, return_tensors="pt").to(device)

+ with torch.no_grad():
+     outputs = model.generate(
+         **inputs,
+         max_new_tokens=512,
+         eos_token_id=tokenizer.eos_token_id,
+         pad_token_id=tokenizer.eos_token_id,
+     )

+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```

+ 💡 *Trained using the "User:" / "Assistant:" dialogue format.*
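
Building on the snippet above (it reuses `torch`, `model`, `tokenizer`, and `device` from there), a small hypothetical helper can wrap a user message in that format and return only the assistant's part of the decoded output:

```python
# Hypothetical convenience wrapper around the "User:" / "Assistant:" format;
# reuses torch, model, tokenizer and device from the usage example above.
def chat(user_message: str, max_new_tokens: int = 512) -> str:
    prompt = f"User: {user_message}\nAssistant:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Return only what follows the final "Assistant:" marker.
    return decoded.split("Assistant:")[-1].strip()

print(chat("Write a Python function that checks whether a string is a palindrome."))
```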

+ ---

+ ## 🧾 Citation
+
+ If you use **SmallCoder (303M)** in your research, please cite:
+
+ ```bibtex
+ @misc{smallcoder303m,
+   title  = {SmallCoder: A 303M-parameter Code LLM trained from scratch},
+   author = {Da Silva, Ilan},
+   year   = {2025},
+   url    = {https://huggingface.co/Beebey/smallcoder-303m},
+   note   = {Trained with Google TPU Research Cloud (TRC) support}
+ }
+ ```
+
+ ---
+
+ ## 🙏 Acknowledgements
+
+ This model was trained with support from the **Google TPU Research Cloud (TRC)** program.
+ Special thanks to the open datasets that enabled this work:
+ FineWeb, StarCoderData, Nemotron, and OpenWebMath.
+
+ ---
+
+ ## 🧩 Summary
+
+ | Category | Description |
+ | :--- | :--- |
+ | **Type** | Code LLM (LLaMA-style) |
+ | **Parameters** | 303M |
+ | **Training tokens** | ~29.8B |
+ | **Specialty** | Code generation & reasoning |
+ | **Context window** | 1024 tokens |
+ | **Tokenizer** | `bigcode/starcoder` |
+ | **License** | Apache 2.0 |
+ | **Hardware** | TPU v4 (TRC Program) |
+
+ ---

+ > 🔬 **SmallCoder (303M)** demonstrates that a carefully designed <500M model can achieve near-SOTA coding performance, matching 1B-class models on HumanEval, proving that *efficient, compact, open models* still matter.