Full training details on 8 A100s (excluding the optional RL)
So I ran the full training as-is (without RL), going through all the stages below:
- create tokenizer
- do Pre-train (21K steps; 7 hrs)
- do Mid-train (~765 steps; a few min)
- do SFT (~650 steps; a few min)
I've also started studying the code repository closely and can post a follow-up here; in the meantime, I've captured my progress on X: https://x.com/richhsu556572/status/1980022537011744893?s=51
I'll follow up with more detailed study notes on the individual pieces, but for now I just want to report that, thankfully, A100s also work, and 80GB is more than enough to avoid OOM issues (as expected, instead of ~4 hrs it takes roughly double the time, at about 8 hrs)!
Now, here's the final report that y'all really care about.
nanochat training report
Generated: 2025-10-19 19:54:27
Environment
Git Information
- Branch: master
- Commit: d6d86cb (dirty)
- Message: update readme with a link to the CPU|MPS branch
Hardware
- Platform: Linux
- CPUs: 240 cores (240 logical)
- Memory: 1771.7 GB
- GPUs: 8x NVIDIA A100-SXM4-80GB
- GPU Memory: 634.0 GB total
- CUDA Version: 12.8
- Hourly Rate: $14.32/hour
Software
- Python: 3.10.12
- PyTorch: 2.8.0+cu128
Bloat
- Characters: 357,831
- Lines: 8,718
- Files: 44
- Tokens (approx): 89,457
- Dependencies (uv.lock lines): 2,004
Run started: 2025-10-19 19:54:32
Tokenizer training
timestamp: 2025-10-19 19:56:03
- max_chars: 2,000,000,000
- doc_cap: 10,000
- vocab_size: 65,536
- train_time: 71.4154
- num_special_tokens: 9
- token_bytes_min: 1
- token_bytes_max: 32
- token_bytes_mean: 6.9197
- token_bytes_std: 2.8748
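The token_bytes_* numbers above are just descriptive statistics over the byte length of each learned token. A minimal sketch of how they could be reproduced, assuming a `token_bytes` iterable of the raw byte string for each non-special token (the name and the std-dev convention are my assumptions, not read from the repo):

```python
import statistics

def token_byte_stats(token_bytes):
    """Summarize byte lengths of the learned BPE tokens: (min, max, mean, std)."""
    lengths = [len(tb) for tb in token_bytes]
    return (
        min(lengths),
        max(lengths),
        statistics.mean(lengths),
        statistics.pstdev(lengths),  # population std dev; the repo may differ
    )

# Toy example; the real run reports min=1, max=32, mean~6.92, std~2.87.
print(token_byte_stats([b"a", b" the", b"ing", b"tokenizer"]))
```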
Tokenizer evaluation
timestamp: 2025-10-19 19:56:15
Comparison with GPT-2
| Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio | Ours Tokens | Ours Ratio | Relative Diff % | 
|---|---|---|---|---|---|---|
| news | 1819 | 404 | 4.50 | 375 | 4.85 | +7.2% | 
| korean | 893 | 745 | 1.20 | 712 | 1.25 | +4.4% | 
| code | 1259 | 576 | 2.19 | 492 | 2.56 | +14.6% | 
| math | 1834 | 936 | 1.96 | 966 | 1.90 | -3.2% | 
| science | 1112 | 260 | 4.28 | 228 | 4.88 | +12.3% | 
| fwe-train | 4208518 | 900364 | 4.67 | 856883 | 4.91 | +4.8% | 
| fwe-val | 4908443 | 1059062 | 4.63 | 1010352 | 4.86 | +4.6% | 
Comparison with GPT-4
| Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % | 
|---|---|---|---|---|---|---|
| news | 1819 | 387 | 4.70 | 375 | 4.85 | +3.1% | 
| korean | 893 | 364 | 2.45 | 712 | 1.25 | -95.6% | 
| code | 1259 | 309 | 4.07 | 492 | 2.56 | -59.2% | 
| math | 1834 | 832 | 2.20 | 966 | 1.90 | -16.1% | 
| science | 1112 | 249 | 4.47 | 228 | 4.88 | +8.4% | 
| fwe-train | 4208518 | 874799 | 4.81 | 856883 | 4.91 | +2.0% | 
| fwe-val | 4908443 | 1029691 | 4.77 | 1010352 | 4.86 | +1.9% | 
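The "Ratio" and "Relative Diff %" columns follow directly from the byte and token counts: ratio is bytes per token (higher means better compression), and the relative diff compares our token count against the baseline tokenizer's. A small sketch of that arithmetic (my reading of the columns, not the repo's exact code):

```python
def compare_tokenizers(n_bytes: int, baseline_tokens: int, ours_tokens: int):
    """Reproduce one row of the comparison tables above."""
    baseline_ratio = n_bytes / baseline_tokens   # bytes per baseline token
    ours_ratio = n_bytes / ours_tokens           # bytes per our token
    # Positive = ours needs fewer tokens than the baseline for the same text.
    rel_diff_pct = (baseline_tokens - ours_tokens) / baseline_tokens * 100
    return baseline_ratio, ours_ratio, rel_diff_pct

# "news" row vs GPT-2: 1819 bytes, 404 GPT-2 tokens, 375 of ours -> 4.50, 4.85, +7.2%
print(compare_tokenizers(1819, 404, 375))
```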
Base model training
timestamp: 2025-10-20 03:02:00
- run: speedrun
- depth: 20
- max_seq_len: 2048
- num_iterations: -1
- target_flops: -1.0000
- target_param_data_ratio: 20
- device_batch_size: 32
- total_batch_size: 524,288
- embedding_lr: 0.2000
- unembedding_lr: 0.0040
- weight_decay: 0.0000
- matrix_lr: 0.0200
- grad_clip: 1.0000
- eval_every: 250
- eval_tokens: 10,485,760
- core_metric_every: 2000
- core_metric_max_per_task: 500
- sample_every: 2000
- model_tag:
- Number of parameters: 560,988,160
- Number of FLOPs per token: 3.491758e+09
- Calculated number of iterations: 21,400
- Number of training tokens: 11,219,763,200
- Tokens : Params ratio: 20.0000
- DDP world size: 8
- warmup_ratio: 0.0000
- warmdown_ratio: 0.2000
- final_lr_frac: 0.0000
- Minimum validation bpb: 0.8143
- Final validation bpb: 0.8143
- CORE metric estimate: 0.2133
- MFU %: 21.02%
- Total training flops: 3.917670e+19
- Total training time: 394.42m
- Peak memory usage: 75374.27MiB
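Several of the derived numbers above fall out of the config alone: the token budget is 20x the parameter count, the iteration count is that budget divided by the total batch size, and total FLOPs is FLOPs/token times tokens. The 21% MFU is roughly consistent with a per-GPU peak of 989 TFLOPS (H100 bf16 dense); against the A100's 312 TFLOPS peak, the same throughput is closer to 66%. A sketch of that arithmetic (the peak-FLOPS assumption is mine, inferred from the numbers, not read from the code):

```python
params = 560_988_160
flops_per_token = 3.491758e9
ratio = 20                     # target_param_data_ratio
total_batch_size = 524_288     # tokens per optimizer step
train_minutes = 394.42
num_gpus = 8

tokens = ratio * params                          # 11,219,763,200
iterations = tokens // total_batch_size          # 21,400
total_flops = flops_per_token * tokens           # ~3.92e19

achieved = total_flops / (train_minutes * 60 * num_gpus)  # FLOP/s per GPU
print(f"{tokens=:,} {iterations=:,} total_flops={total_flops:.3e}")
print(f"MFU vs 989 TFLOPS peak: {achieved / 989e12:.1%}")   # ~21%, as reported
print(f"MFU vs 312 TFLOPS peak: {achieved / 312e12:.1%}")   # ~66% of A100 peak
```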
Base model loss
timestamp: 2025-10-20 03:03:28
- train bpb: 0.8171
- val bpb: 0.8144
- sample 0: <|bos|>The capital of France is Paris. The capital of France is Paris. The capital of France is Paris.
- sample 1: <|bos|>The chemical symbol of gold is Au. The chemical symbol of gold is Au. The chemical symbol of gold is
- sample 2: <|bos|>If yesterday was Friday, then tomorrow will be Friday, and so on. This is a very common way of thinking about the
- sample 3: <|bos|>The opposite of hot is cold. The opposite of cold is hot. The opposite of hot is cold.
- sample 4: <|bos|>The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune,
- sample 5: <|bos|>My favorite color is red. I love it. I love it. I love it. I love
- sample 6: <|bos|>If 5x + 3 = 13, then x is a multiple of 5. If 5x + 3 =
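Validation quality is reported in bits per byte (bpb), i.e. the usual cross-entropy normalized by how many bytes each token covers, which makes it comparable across tokenizers. A minimal sketch of the conversion, assuming a mean nats-per-token loss and the tokenizer's bytes-per-token ratio (roughly 4.9 on fwe-val per the table above):

```python
import math

def bits_per_byte(mean_loss_nats: float, bytes_per_token: float) -> float:
    """Convert mean cross-entropy (nats/token) into bits/byte."""
    bits_per_token = mean_loss_nats / math.log(2)   # nats -> bits
    return bits_per_token / bytes_per_token

# Working backwards: a val bpb of ~0.814 at ~4.86 bytes/token
# corresponds to a cross-entropy of ~2.74 nats/token.
print(bits_per_byte(2.74, 4.86))  # ~0.813
```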
Base model evaluation
timestamp: 2025-10-20 03:10:53
- Model: base_model (step 21400)
- CORE metric: 0.2084
- hellaswag_zeroshot: 0.2626
- jeopardy: 0.1068
- bigbench_qa_wikidata: 0.5118
- arc_easy: 0.5325
- arc_challenge: 0.1274
- copa: 0.4000
- commonsense_qa: 0.0274
- piqa: 0.3645
- openbook_qa: 0.1200
- lambada_openai: 0.3813
- hellaswag: 0.2631
- winograd: 0.2234
- winogrande: 0.0545
- bigbench_dyck_languages: 0.1270
- agi_eval_lsat_ar: 0.0489
- bigbench_cs_algorithms: 0.3545
- bigbench_operators: 0.1429
- bigbench_repeat_copy_logic: 0.0312
- squad: 0.2391
- coqa: 0.2176
- boolq: -0.1267
- bigbench_language_identification: 0.1740
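The headline CORE number is just the mean of the per-task scores listed above (each task score appears to already be chance-adjusted, which is why boolq can dip below zero). A quick check that reproduces the 0.2084:

```python
task_scores = {
    "hellaswag_zeroshot": 0.2626, "jeopardy": 0.1068, "bigbench_qa_wikidata": 0.5118,
    "arc_easy": 0.5325, "arc_challenge": 0.1274, "copa": 0.4000,
    "commonsense_qa": 0.0274, "piqa": 0.3645, "openbook_qa": 0.1200,
    "lambada_openai": 0.3813, "hellaswag": 0.2631, "winograd": 0.2234,
    "winogrande": 0.0545, "bigbench_dyck_languages": 0.1270,
    "agi_eval_lsat_ar": 0.0489, "bigbench_cs_algorithms": 0.3545,
    "bigbench_operators": 0.1429, "bigbench_repeat_copy_logic": 0.0312,
    "squad": 0.2391, "coqa": 0.2176, "boolq": -0.1267,
    "bigbench_language_identification": 0.1740,
}
core = sum(task_scores.values()) / len(task_scores)
print(f"CORE = {core:.4f}")  # 0.2084
```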
Midtraining
timestamp: 2025-10-20 03:29:50
- run: speedrun
- dtype: bfloat16
- max_seq_len: 2048
- device_batch_size: 32
- unembedding_lr: 0.0040
- embedding_lr: 0.2000
- matrix_lr: 0.0200
- init_lr_frac: 1.0000
- weight_decay: 0.0000
- eval_every: 150
- eval_tokens: 10,485,760
- total_batch_size: 524,288
- dry_run: 0
- Number of iterations: 765
- DDP world size: 8
- Minimum validation bpb: 0.3963
Chat evaluation mid
timestamp: 2025-10-20 03:48:39
- source: mid
- task_name: None
- dtype: bfloat16
- temperature: 0.0000
- max_new_tokens: 512
- num_samples: 1
- top_k: 50
- batch_size: 8
- model_tag: None
- step: None
- max_problems: None
- ARC-Easy: 0.3119
- ARC-Challenge: 0.2927
- MMLU: 0.2975
- GSM8K: 0.0402
- HumanEval: 0.0976
- ChatCORE metric: 0.0681
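ChatCORE looks like the same idea applied to the chat tasks: each accuracy is re-centered by its chance baseline (0.25 for the 4-way multiple-choice tasks, 0 for GSM8K and HumanEval) and the results are averaged. That reading reproduces the 0.0681 here, and the same formula also reproduces the 0.0854 reported for SFT further below. A hedged sketch:

```python
def chat_core(scores: dict[str, float], baselines: dict[str, float]) -> float:
    """Mean of chance-adjusted task scores: (score - baseline) / (1 - baseline)."""
    adjusted = [(scores[t] - baselines[t]) / (1 - baselines[t]) for t in scores]
    return sum(adjusted) / len(adjusted)

baselines = {"ARC-Easy": 0.25, "ARC-Challenge": 0.25, "MMLU": 0.25,
             "GSM8K": 0.0, "HumanEval": 0.0}
mid = {"ARC-Easy": 0.3119, "ARC-Challenge": 0.2927, "MMLU": 0.2975,
       "GSM8K": 0.0402, "HumanEval": 0.0976}
print(f"{chat_core(mid, baselines):.4f}")  # 0.0681
```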
Chat SFT
timestamp: 2025-10-20 03:53:11
- run: speedrun
- source: mid
- dtype: bfloat16
- device_batch_size: 4
- num_epochs: 1
- max_iterations: -1
- target_examples_per_step: 32
- unembedding_lr: 0.0040
- embedding_lr: 0.2000
- matrix_lr: 0.0200
- weight_decay: 0.0000
- init_lr_frac: 0.0200
- eval_every: 100
- eval_steps: 100
- eval_metrics_every: 200
- Training rows: 20,843
- Number of iterations: 651
- Training loss: 1.1234
- Validation loss: 1.0146
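The SFT iteration count also falls out of the config: with device_batch_size=4 on 8 GPUs, each step sees the targeted 32 examples, and 20,843 training rows at 32 examples per step gives ~651 steps (apparently dropping the last partial batch). A small sanity check:

```python
training_rows = 20_843
device_batch_size = 4
ddp_world_size = 8
examples_per_step = device_batch_size * ddp_world_size  # 32, matches target_examples_per_step

iterations = training_rows // examples_per_step          # 651 (last partial batch dropped, presumably)
print(examples_per_step, iterations)
```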
Chat evaluation sft
timestamp: 2025-10-20 04:09:28
- source: sft
- task_name: None
- dtype: bfloat16
- temperature: 0.0000
- max_new_tokens: 512
- num_samples: 1
- top_k: 50
- batch_size: 8
- model_tag: None
- step: None
- max_problems: None
- ARC-Easy: 0.3338
- ARC-Challenge: 0.3046
- MMLU: 0.2955
- GSM8K: 0.0599
- HumanEval: 0.1220
- ChatCORE metric: 0.0854
Summary
- Characters: 357,831
- Lines: 8,718
- Files: 44
- Tokens (approx): 89,457
- Dependencies (uv.lock lines): 2,004
| Metric | BASE | MID | SFT | RL | 
|---|---|---|---|---|
| CORE | 0.2084 | - | - | - | 
| ARC-Challenge | - | 0.2927 | 0.3046 | - | 
| ARC-Easy | - | 0.3119 | 0.3338 | - | 
| GSM8K | - | 0.0402 | 0.0599 | - | 
| HumanEval | - | 0.0976 | 0.1220 | - | 
| MMLU | - | 0.2975 | 0.2955 | - | 
| ChatCORE | - | 0.0681 | 0.0854 | - | 
Total wall clock time: 8h14m
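At the quoted rate, the compute cost works out to roughly $118, assuming the $14.32/hour applies for the entire 8h14m of wall clock:

```python
hours = 8 + 14 / 60             # 8h14m of wall clock
rate = 14.32                    # $/hour for the 8xA100 node
print(f"~${hours * rate:.2f}")  # ~$117.90
```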
