Full training details on 8 A100s (excluding the optional RL)

by richardprobe

So I ran the full training as-is (without RL), going through all the stages below:

  1. Create the tokenizer
  2. Pre-train (~21K steps; ~7 hrs)
  3. Mid-train (~700 steps; a few minutes)
  4. SFT (~650 steps; a few minutes)

I've also started studying the code repository closely and can post a follow-up here; in the meantime, I've captured my progress on X: https://x.com/richhsu556572/status/1980022537011744893?s=51

I'll follow up with more detailed study notes on the individual pieces, but for now I just want to report that A100s also work: 80 GB is more than sufficient to avoid OOM issues, and, as expected, the run takes roughly double the time of the H100 speedrun (about 8 hrs instead of 4)!

Here's the final report that y'all really care about.

nanochat training report

Generated: 2025-10-19 19:54:27

Environment

Git Information

  • Branch: master
  • Commit: d6d86cb (dirty)
  • Message: update readme with a link to the CPU|MPS branch

Hardware

  • Platform: Linux
  • CPUs: 240 cores (240 logical)
  • Memory: 1771.7 GB
  • GPUs: 8x NVIDIA A100-SXM4-80GB
  • GPU Memory: 634.0 GB total
  • CUDA Version: 12.8
  • Hourly Rate: $14.32/hour

Software

  • Python: 3.10.12
  • PyTorch: 2.8.0+cu128

Bloat

  • Characters: 357,831
  • Lines: 8,718
  • Files: 44
  • Tokens (approx): 89,457
  • Dependencies (uv.lock lines): 2,004

Run started: 2025-10-19 19:54:32


Tokenizer training

timestamp: 2025-10-19 19:56:03

  • max_chars: 2,000,000,000
  • doc_cap: 10,000
  • vocab_size: 65,536
  • train_time: 71.4154
  • num_special_tokens: 9
  • token_bytes_min: 1
  • token_bytes_max: 32
  • token_bytes_mean: 6.9197
  • token_bytes_std: 2.8748
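For readers new to the report format: the token_bytes_* numbers are just summary statistics over the byte length of each token string in the learned 65,536-entry vocab. A minimal sketch of how they could be reproduced (the vocab_tokens argument is a hypothetical stand-in for however the repo exposes the vocab):

```python
import numpy as np

def token_byte_stats(vocab_tokens: list[bytes]) -> dict:
    """Summary stats over the byte length of each learned token.
    vocab_tokens: hypothetical list of the 65,536 token byte strings."""
    lengths = np.array([len(t) for t in vocab_tokens])
    return {
        "min": int(lengths.min()),      # 1: the single-byte tokens
        "max": int(lengths.max()),      # 32: longest merged token
        "mean": float(lengths.mean()),  # ~6.92 bytes/token in this run
        "std": float(lengths.std()),    # ~2.87
    }
```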

Tokenizer evaluation

timestamp: 2025-10-19 19:56:15

Comparison with GPT-2

| Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|-------|--------------|-------------|-------------|------------|-----------------|
| news | 1819 | 404 | 4.50 | 375 | 4.85 | +7.2% |
| korean | 893 | 745 | 1.20 | 712 | 1.25 | +4.4% |
| code | 1259 | 576 | 2.19 | 492 | 2.56 | +14.6% |
| math | 1834 | 936 | 1.96 | 966 | 1.90 | -3.2% |
| science | 1112 | 260 | 4.28 | 228 | 4.88 | +12.3% |
| fwe-train | 4208518 | 900364 | 4.67 | 856883 | 4.91 | +4.8% |
| fwe-val | 4908443 | 1059062 | 4.63 | 1010352 | 4.86 | +4.6% |

Comparison with GPT-4

| Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|-------|--------------|-------------|-------------|------------|-----------------|
| news | 1819 | 387 | 4.70 | 375 | 4.85 | +3.1% |
| korean | 893 | 364 | 2.45 | 712 | 1.25 | -95.6% |
| code | 1259 | 309 | 4.07 | 492 | 2.56 | -59.2% |
| math | 1834 | 832 | 2.20 | 966 | 1.90 | -16.1% |
| science | 1112 | 249 | 4.47 | 228 | 4.88 | +8.4% |
| fwe-train | 4208518 | 874799 | 4.81 | 856883 | 4.91 | +2.0% |
| fwe-val | 4908443 | 1029691 | 4.77 | 1010352 | 4.86 | +1.9% |
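A note on reading the tables: the ratios are bytes per token (higher = better compression), and the "Relative Diff %" column matches the token-count savings versus the baseline tokenizer (e.g. the news row of the GPT-2 table: (404 - 375) / 404 ≈ +7.2%). Here's a rough sketch of how such a comparison can be reproduced; the two baselines are available via tiktoken, while loading "our" trained tokenizer is omitted since the repo's own loader isn't shown here:

```python
import tiktoken

# Baseline tokenizers used in the tables above (both ship with tiktoken).
baselines = {
    "GPT-2": tiktoken.get_encoding("gpt2"),
    "GPT-4": tiktoken.get_encoding("cl100k_base"),
}

def compression(text: str, encode) -> tuple[int, float]:
    """Return (token count, bytes-per-token ratio) for one text sample."""
    n_bytes = len(text.encode("utf-8"))
    n_tokens = len(encode(text))
    return n_tokens, n_bytes / n_tokens

def relative_diff_pct(baseline_tokens: int, ours_tokens: int) -> float:
    """Token-count savings vs. the baseline; positive = ours compresses better."""
    return 100.0 * (baseline_tokens - ours_tokens) / baseline_tokens

# Demo on a placeholder string (the report's actual eval texts aren't included here):
text = "The quick brown fox jumps over the lazy dog."
for name, enc in baselines.items():
    n_tok, ratio = compression(text, enc.encode)
    print(name, n_tok, round(ratio, 2))

# Sanity check against the 'news' row of the GPT-2 table (404 vs 375 tokens):
print(relative_diff_pct(404, 375))   # ~7.2, matching the reported +7.2%
```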

Base model training

timestamp: 2025-10-20 03:02:00

  • run: speedrun
  • depth: 20
  • max_seq_len: 2048
  • num_iterations: -1
  • target_flops: -1.0000
  • target_param_data_ratio: 20
  • device_batch_size: 32
  • total_batch_size: 524,288
  • embedding_lr: 0.2000
  • unembedding_lr: 0.0040
  • weight_decay: 0.0000
  • matrix_lr: 0.0200
  • grad_clip: 1.0000
  • eval_every: 250
  • eval_tokens: 10,485,760
  • core_metric_every: 2000
  • core_metric_max_per_task: 500
  • sample_every: 2000
  • model_tag:
  • Number of parameters: 560,988,160
  • Number of FLOPs per token: 3.491758e+09
  • Calculated number of iterations: 21,400
  • Number of training tokens: 11,219,763,200
  • Tokens : Params ratio: 20.0000
  • DDP world size: 8
  • warmup_ratio: 0.0000
  • warmdown_ratio: 0.2000
  • final_lr_frac: 0.0000
  • Minimum validation bpb: 0.8143
  • Final validation bpb: 0.8143
  • CORE metric estimate: 0.2133
  • MFU %: 21.02%
  • Total training flops: 3.917670e+19
  • Total training time: 394.42m
  • Peak memory usage: 75374.27MiB
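Several of the derived numbers above follow directly from the config, which helps when sanity-checking a run on different hardware. A small arithmetic sketch; the MFU lines assume the report normalizes against H100 bf16 peak (~989 TFLOPS), which would explain why an A100 run shows only ~21% "MFU" despite being well utilized:

```python
params = 560_988_160
total_batch_size = 524_288           # tokens per optimizer step
device_batch_size, seq_len, world_size = 32, 2048, 8

train_tokens = 20 * params                         # target_param_data_ratio=20 -> 11,219,763,200
iterations = train_tokens // total_batch_size      # 21,400 (divides exactly here)
grad_accum = total_batch_size // (device_batch_size * seq_len * world_size)  # 1, no accumulation

flops_per_token = 3.491758e9                       # reported; roughly 6N plus attention terms
total_flops = flops_per_token * train_tokens       # ~3.92e19, matches the report

wall_s = 394.42 * 60
mfu_vs_h100 = total_flops / (wall_s * 8 * 989e12)  # ~0.21 -> lines up with the reported 21.02%
mfu_vs_a100 = total_flops / (wall_s * 8 * 312e12)  # ~0.66 implied vs. A100 bf16 peak
print(iterations, grad_accum, f"{total_flops:.3e}", f"{mfu_vs_h100:.2%}", f"{mfu_vs_a100:.2%}")
```

Taking the reported FLOPs and wall time at face value, the A100s are actually running at a respectable utilization; the low MFU figure appears to be an artifact of the H100-based normalization.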

Base model loss

timestamp: 2025-10-20 03:03:28

  • train bpb: 0.8171
  • val bpb: 0.8144
  • sample 0: <|bos|>The capital of France is Paris. The capital of France is Paris. The capital of France is Paris.
  • sample 1: <|bos|>The chemical symbol of gold is Au. The chemical symbol of gold is Au. The chemical symbol of gold is
  • sample 2: <|bos|>If yesterday was Friday, then tomorrow will be Friday, and so on. This is a very common way of thinking about the
  • sample 3: <|bos|>The opposite of hot is cold. The opposite of cold is hot. The opposite of hot is cold.
  • sample 4: <|bos|>The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune,
  • sample 5: <|bos|>My favorite color is red. I love it. I love it. I love it. I love
  • sample 6: <|bos|>If 5x + 3 = 13, then x is a multiple of 5. If 5x + 3 =
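For anyone new to bpb: bits-per-byte is the cross-entropy loss re-expressed per UTF-8 byte rather than per token, which makes it comparable across tokenizers. A minimal sketch of the conversion (the 4.86 bytes/token figure is taken from the fwe-val row of the tokenizer eval above; the exact normalization inside the repo may differ slightly):

```python
import math

def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Summed cross-entropy (in nats) over a text, re-expressed per UTF-8 byte."""
    return total_nats / math.log(2) / total_bytes

# Per-token view: bpb = ce_per_token / ln(2) / (bytes per token).
# With ~4.86 bytes/token on fwe-val, a val bpb of 0.8144 corresponds to a
# per-token cross-entropy of roughly:
ce_per_token = 0.8144 * math.log(2) * 4.86
print(ce_per_token)   # ~2.74 nats, i.e. per-token perplexity around 15.5
```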

Base model evaluation

timestamp: 2025-10-20 03:10:53

  • Model: base_model (step 21400)
  • CORE metric: 0.2084
  • hellaswag_zeroshot: 0.2626
  • jeopardy: 0.1068
  • bigbench_qa_wikidata: 0.5118
  • arc_easy: 0.5325
  • arc_challenge: 0.1274
  • copa: 0.4000
  • commonsense_qa: 0.0274
  • piqa: 0.3645
  • openbook_qa: 0.1200
  • lambada_openai: 0.3813
  • hellaswag: 0.2631
  • winograd: 0.2234
  • winogrande: 0.0545
  • bigbench_dyck_languages: 0.1270
  • agi_eval_lsat_ar: 0.0489
  • bigbench_cs_algorithms: 0.3545
  • bigbench_operators: 0.1429
  • bigbench_repeat_copy_logic: 0.0312
  • squad: 0.2391
  • coqa: 0.2176
  • boolq: -0.1267
  • bigbench_language_identification: 0.1740
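The CORE score averages per-task accuracies after centering each one by its random-guessing baseline, which is why boolq can come out negative: its raw accuracy sits below the 50% yes/no baseline. A sketch of the centering, assuming the standard DCLM-style definition:

```python
def centered(acc: float, baseline: float) -> float:
    """Rescale accuracy so random guessing maps to 0 and a perfect score to 1."""
    return (acc - baseline) / (1.0 - baseline)

# boolq is a yes/no task, so its baseline is 0.5; the reported -0.1267
# implies a raw accuracy of roughly 0.5 + (-0.1267) * 0.5 ≈ 0.437.
print(centered(0.437, 0.5))   # ≈ -0.127
# CORE is then (approximately) the mean of these centered scores across all tasks.
```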

Midtraining

timestamp: 2025-10-20 03:29:50

  • run: speedrun
  • dtype: bfloat16
  • max_seq_len: 2048
  • device_batch_size: 32
  • unembedding_lr: 0.0040
  • embedding_lr: 0.2000
  • matrix_lr: 0.0200
  • init_lr_frac: 1.0000
  • weight_decay: 0.0000
  • eval_every: 150
  • eval_tokens: 10,485,760
  • total_batch_size: 524,288
  • dry_run: 0
  • Number of iterations: 765
  • DDP world size: 8
  • Minimum validation bpb: 0.3963
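Since midtraining reuses the pretraining batch size, the amount of data it sees is easy to back out from the step count (approximate, since the final step may be partial):

```python
total_batch_size = 524_288   # tokens per step, same as pretraining
iterations = 765

mid_tokens = iterations * total_batch_size
print(f"{mid_tokens:,}")     # 401,080,320 -> ~0.4B tokens, ~3.6% of the 11.2B pretraining budget
```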

Chat evaluation mid

timestamp: 2025-10-20 03:48:39

  • source: mid
  • task_name: None
  • dtype: bfloat16
  • temperature: 0.0000
  • max_new_tokens: 512
  • num_samples: 1
  • top_k: 50
  • batch_size: 8
  • model_tag: None
  • step: None
  • max_problems: None
  • ARC-Easy: 0.3119
  • ARC-Challenge: 0.2927
  • MMLU: 0.2975
  • GSM8K: 0.0402
  • HumanEval: 0.0976
  • ChatCORE metric: 0.0681
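The ChatCORE number is consistent with the same centering idea: the three multiple-choice tasks are centered by a 0.25 random baseline, GSM8K and HumanEval (where guessing scores ~0) are taken as-is, and the mean is reported. Plugging in the numbers reproduces both the mid figure here and the SFT figure further below:

```python
def centered(acc: float, baseline: float) -> float:
    return (acc - baseline) / (1.0 - baseline)

def chat_core(arc_e, arc_c, mmlu, gsm8k, humaneval):
    mc = [centered(x, 0.25) for x in (arc_e, arc_c, mmlu)]   # 4-way multiple choice
    gen = [gsm8k, humaneval]                                  # baseline ~0
    scores = mc + gen
    return sum(scores) / len(scores)

print(chat_core(0.3119, 0.2927, 0.2975, 0.0402, 0.0976))  # ≈ 0.0681 (mid)
print(chat_core(0.3338, 0.3046, 0.2955, 0.0599, 0.1220))  # ≈ 0.0854 (sft)
```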

Chat SFT

timestamp: 2025-10-20 03:53:11

  • run: speedrun
  • source: mid
  • dtype: bfloat16
  • device_batch_size: 4
  • num_epochs: 1
  • max_iterations: -1
  • target_examples_per_step: 32
  • unembedding_lr: 0.0040
  • embedding_lr: 0.2000
  • matrix_lr: 0.0200
  • weight_decay: 0.0000
  • init_lr_frac: 0.0200
  • eval_every: 100
  • eval_steps: 100
  • eval_metrics_every: 200
  • Training rows: 20,843
  • Number of iterations: 651
  • Training loss: 1.1234
  • Validation loss: 1.0146
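The SFT step count also falls straight out of the config: 20,843 rows at 32 examples per optimizer step gives the 651 iterations above, and with device_batch_size=4 across 8 GPUs there is no gradient accumulation (what happens to the leftover 11 rows is an assumption; presumably they are dropped):

```python
training_rows = 20_843
target_examples_per_step = 32
device_batch_size, world_size = 4, 8

iterations = training_rows // target_examples_per_step                      # 651
grad_accum = target_examples_per_step // (device_batch_size * world_size)   # 1 (no accumulation)
print(iterations, grad_accum)
```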

Chat evaluation sft

timestamp: 2025-10-20 04:09:28

  • source: sft
  • task_name: None
  • dtype: bfloat16
  • temperature: 0.0000
  • max_new_tokens: 512
  • num_samples: 1
  • top_k: 50
  • batch_size: 8
  • model_tag: None
  • step: None
  • max_problems: None
  • ARC-Easy: 0.3338
  • ARC-Challenge: 0.3046
  • MMLU: 0.2955
  • GSM8K: 0.0599
  • HumanEval: 0.1220
  • ChatCORE metric: 0.0854

Summary

  • Characters: 357,831
  • Lines: 8,718
  • Files: 44
  • Tokens (approx): 89,457
  • Dependencies (uv.lock lines): 2,004

| Metric | BASE | MID | SFT | RL |
|--------|------|-----|-----|----|
| CORE | 0.2084 | - | - | - |
| ARC-Challenge | - | 0.2927 | 0.3046 | - |
| ARC-Easy | - | 0.3119 | 0.3338 | - |
| GSM8K | - | 0.0402 | 0.0599 | - |
| HumanEval | - | 0.0976 | 0.1220 | - |
| MMLU | - | 0.2975 | 0.2955 | - |
| ChatCORE | - | 0.0681 | 0.0854 | - |

Total wall clock time: 8h14m (roughly $118 at the quoted $14.32/hour)
