Full training details on 8 A100s (excluding the optional RL)

by richardprobe

So I ran the full training as-is (without RL), going through all the stages below:

  1. Create the tokenizer
  2. Pre-train (~21K steps; ~7 hrs)
  3. Mid-train (~700 steps; a few minutes)
  4. SFT (~650 steps; a few minutes)

I've also started studying the code repository closely and can post a follow-up here; in the meantime, I've captured my progress on X: https://x.com/richhsu556572/status/1980022537011744893?s=51

I'll follow up with more detailed study notes on the individual pieces, but for now I just want to report that A100s also work: 80 GB is more than sufficient to avoid OOM issues, and, as expected, the run takes roughly double the time of the H100 speedrun (about 8 hrs instead of 4)!

Here's the final report that y'all really care about.

nanochat training report

Generated: 2025-10-19 19:54:27

Environment

Git Information

  • Branch: master
  • Commit: d6d86cb (dirty)
  • Message: update readme with a link to the CPU|MPS branch

Hardware

  • Platform: Linux
  • CPUs: 240 cores (240 logical)
  • Memory: 1771.7 GB
  • GPUs: 8x NVIDIA A100-SXM4-80GB
  • GPU Memory: 634.0 GB total
  • CUDA Version: 12.8
  • Hourly Rate: $14.32/hour

Software

  • Python: 3.10.12
  • PyTorch: 2.8.0+cu128

Bloat

  • Characters: 357,831
  • Lines: 8,718
  • Files: 44
  • Tokens (approx): 89,457
  • Dependencies (uv.lock lines): 2,004

Run started: 2025-10-19 19:54:32


Tokenizer training

timestamp: 2025-10-19 19:56:03

  • max_chars: 2,000,000,000
  • doc_cap: 10,000
  • vocab_size: 65,536
  • train_time: 71.4154
  • num_special_tokens: 9
  • token_bytes_min: 1
  • token_bytes_max: 32
  • token_bytes_mean: 6.9197
  • token_bytes_std: 2.8748
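For readers new to the report format: the token_bytes_* numbers are just summary statistics over the byte length of each token string in the learned 65,536-entry vocab. A minimal sketch of how they could be reproduced (the vocab_tokens argument is a hypothetical stand-in for however the repo exposes the vocab):

```python
import numpy as np

def token_byte_stats(vocab_tokens: list[bytes]) -> dict:
    """Summary stats over the byte length of each learned token.
    vocab_tokens: hypothetical list of the 65,536 token byte strings."""
    lengths = np.array([len(t) for t in vocab_tokens])
    return {
        "min": int(lengths.min()),      # 1: the single-byte tokens
        "max": int(lengths.max()),      # 32: longest merged token
        "mean": float(lengths.mean()),  # ~6.92 bytes/token in this run
        "std": float(lengths.std()),    # ~2.87
    }
```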

Tokenizer evaluation

timestamp: 2025-10-19 19:56:15

Comparison with GPT-2

| Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|-------|--------------|-------------|-------------|------------|-----------------|
| news | 1819 | 404 | 4.50 | 375 | 4.85 | +7.2% |
| korean | 893 | 745 | 1.20 | 712 | 1.25 | +4.4% |
| code | 1259 | 576 | 2.19 | 492 | 2.56 | +14.6% |
| math | 1834 | 936 | 1.96 | 966 | 1.90 | -3.2% |
| science | 1112 | 260 | 4.28 | 228 | 4.88 | +12.3% |
| fwe-train | 4208518 | 900364 | 4.67 | 856883 | 4.91 | +4.8% |
| fwe-val | 4908443 | 1059062 | 4.63 | 1010352 | 4.86 | +4.6% |

Comparison with GPT-4

| Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|-------|--------------|-------------|-------------|------------|-----------------|
| news | 1819 | 387 | 4.70 | 375 | 4.85 | +3.1% |
| korean | 893 | 364 | 2.45 | 712 | 1.25 | -95.6% |
| code | 1259 | 309 | 4.07 | 492 | 2.56 | -59.2% |
| math | 1834 | 832 | 2.20 | 966 | 1.90 | -16.1% |
| science | 1112 | 249 | 4.47 | 228 | 4.88 | +8.4% |
| fwe-train | 4208518 | 874799 | 4.81 | 856883 | 4.91 | +2.0% |
| fwe-val | 4908443 | 1029691 | 4.77 | 1010352 | 4.86 | +1.9% |
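A note on reading the tables: the ratios are bytes per token (higher = better compression), and the "Relative Diff %" column matches the token-count savings versus the baseline tokenizer (e.g. the news row of the GPT-2 table: (404 - 375) / 404 ≈ +7.2%). Here's a rough sketch of how such a comparison can be reproduced; the two baselines are available via tiktoken, while loading "our" trained tokenizer is omitted since the repo's own loader isn't shown here:

```python
import tiktoken

# Baseline tokenizers used in the tables above (both ship with tiktoken).
baselines = {
    "GPT-2": tiktoken.get_encoding("gpt2"),
    "GPT-4": tiktoken.get_encoding("cl100k_base"),
}

def compression(text: str, encode) -> tuple[int, float]:
    """Return (token count, bytes-per-token ratio) for one text sample."""
    n_bytes = len(text.encode("utf-8"))
    n_tokens = len(encode(text))
    return n_tokens, n_bytes / n_tokens

def relative_diff_pct(baseline_tokens: int, ours_tokens: int) -> float:
    """Token-count savings vs. the baseline; positive = ours compresses better."""
    return 100.0 * (baseline_tokens - ours_tokens) / baseline_tokens

# Demo on a placeholder string (the report's actual eval texts aren't included here):
text = "The quick brown fox jumps over the lazy dog."
for name, enc in baselines.items():
    n_tok, ratio = compression(text, enc.encode)
    print(name, n_tok, round(ratio, 2))

# Sanity check against the 'news' row of the GPT-2 table (404 vs 375 tokens):
print(relative_diff_pct(404, 375))   # ~7.2, matching the reported +7.2%
```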

Base model training

timestamp: 2025-10-20 03:02:00

  • run: speedrun
  • depth: 20
  • max_seq_len: 2048
  • num_iterations: -1
  • target_flops: -1.0000
  • target_param_data_ratio: 20
  • device_batch_size: 32
  • total_batch_size: 524,288
  • embedding_lr: 0.2000
  • unembedding_lr: 0.0040
  • weight_decay: 0.0000
  • matrix_lr: 0.0200
  • grad_clip: 1.0000
  • eval_every: 250
  • eval_tokens: 10,485,760
  • core_metric_every: 2000
  • core_metric_max_per_task: 500
  • sample_every: 2000
  • model_tag:
  • Number of parameters: 560,988,160
  • Number of FLOPs per token: 3.491758e+09
  • Calculated number of iterations: 21,400
  • Number of training tokens: 11,219,763,200
  • Tokens : Params ratio: 20.0000
  • DDP world size: 8
  • warmup_ratio: 0.0000
  • warmdown_ratio: 0.2000
  • final_lr_frac: 0.0000
  • Minimum validation bpb: 0.8143
  • Final validation bpb: 0.8143
  • CORE metric estimate: 0.2133
  • MFU %: 21.02%
  • Total training flops: 3.917670e+19
  • Total training time: 394.42m
  • Peak memory usage: 75374.27MiB
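Several of the derived numbers above follow directly from the config, which helps when sanity-checking a run on different hardware. A small arithmetic sketch; the MFU lines assume the report normalizes against H100 bf16 peak (~989 TFLOPS), which would explain why an A100 run shows only ~21% "MFU" despite being well utilized:

```python
params = 560_988_160
total_batch_size = 524_288           # tokens per optimizer step
device_batch_size, seq_len, world_size = 32, 2048, 8

train_tokens = 20 * params                         # target_param_data_ratio=20 -> 11,219,763,200
iterations = train_tokens // total_batch_size      # 21,400 (divides exactly here)
grad_accum = total_batch_size // (device_batch_size * seq_len * world_size)  # 1, no accumulation

flops_per_token = 3.491758e9                       # reported; roughly 6N plus attention terms
total_flops = flops_per_token * train_tokens       # ~3.92e19, matches the report

wall_s = 394.42 * 60
mfu_vs_h100 = total_flops / (wall_s * 8 * 989e12)  # ~0.21 -> lines up with the reported 21.02%
mfu_vs_a100 = total_flops / (wall_s * 8 * 312e12)  # ~0.66 implied vs. A100 bf16 peak
print(iterations, grad_accum, f"{total_flops:.3e}", f"{mfu_vs_h100:.2%}", f"{mfu_vs_a100:.2%}")
```

Taking the reported FLOPs and wall time at face value, the A100s are actually running at a respectable utilization; the low MFU figure appears to be an artifact of the H100-based normalization.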

Base model loss

timestamp: 2025-10-20 03:03:28

  • train bpb: 0.8171
  • val bpb: 0.8144
  • sample 0: <|bos|>The capital of France is Paris. The capital of France is Paris. The capital of France is Paris.
  • sample 1: <|bos|>The chemical symbol of gold is Au. The chemical symbol of gold is Au. The chemical symbol of gold is
  • sample 2: <|bos|>If yesterday was Friday, then tomorrow will be Friday, and so on. This is a very common way of thinking about the
  • sample 3: <|bos|>The opposite of hot is cold. The opposite of cold is hot. The opposite of hot is cold.
  • sample 4: <|bos|>The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune,
  • sample 5: <|bos|>My favorite color is red. I love it. I love it. I love it. I love
  • sample 6: <|bos|>If 5x + 3 = 13, then x is a multiple of 5. If 5x + 3 =
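For anyone new to bpb: bits-per-byte is the cross-entropy loss re-expressed per UTF-8 byte rather than per token, which makes it comparable across tokenizers. A minimal sketch of the conversion (the 4.86 bytes/token figure is taken from the fwe-val row of the tokenizer eval above; the exact normalization inside the repo may differ slightly):

```python
import math

def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Summed cross-entropy (in nats) over a text, re-expressed per UTF-8 byte."""
    return total_nats / math.log(2) / total_bytes

# Per-token view: bpb = ce_per_token / ln(2) / (bytes per token).
# With ~4.86 bytes/token on fwe-val, a val bpb of 0.8144 corresponds to a
# per-token cross-entropy of roughly:
ce_per_token = 0.8144 * math.log(2) * 4.86
print(ce_per_token)   # ~2.74 nats, i.e. per-token perplexity around 15.5
```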

Base model evaluation

timestamp: 2025-10-20 03:10:53

  • Model: base_model (step 21400)
  • CORE metric: 0.2084
  • hellaswag_zeroshot: 0.2626
  • jeopardy: 0.1068
  • bigbench_qa_wikidata: 0.5118
  • arc_easy: 0.5325
  • arc_challenge: 0.1274
  • copa: 0.4000
  • commonsense_qa: 0.0274
  • piqa: 0.3645
  • openbook_qa: 0.1200
  • lambada_openai: 0.3813
  • hellaswag: 0.2631
  • winograd: 0.2234
  • winogrande: 0.0545
  • bigbench_dyck_languages: 0.1270
  • agi_eval_lsat_ar: 0.0489
  • bigbench_cs_algorithms: 0.3545
  • bigbench_operators: 0.1429
  • bigbench_repeat_copy_logic: 0.0312
  • squad: 0.2391
  • coqa: 0.2176
  • boolq: -0.1267
  • bigbench_language_identification: 0.1740
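The CORE score averages per-task accuracies after centering each one by its random-guessing baseline, which is why boolq can come out negative: its raw accuracy sits below the 50% yes/no baseline. A sketch of the centering, assuming the standard DCLM-style definition:

```python
def centered(acc: float, baseline: float) -> float:
    """Rescale accuracy so random guessing maps to 0 and a perfect score to 1."""
    return (acc - baseline) / (1.0 - baseline)

# boolq is a yes/no task, so its baseline is 0.5; the reported -0.1267
# implies a raw accuracy of roughly 0.5 + (-0.1267) * 0.5 ≈ 0.437.
print(centered(0.437, 0.5))   # ≈ -0.127
# CORE is then (approximately) the mean of these centered scores across all tasks.
```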

Midtraining

timestamp: 2025-10-20 03:29:50

  • run: speedrun
  • dtype: bfloat16
  • max_seq_len: 2048
  • device_batch_size: 32
  • unembedding_lr: 0.0040
  • embedding_lr: 0.2000
  • matrix_lr: 0.0200
  • init_lr_frac: 1.0000
  • weight_decay: 0.0000
  • eval_every: 150
  • eval_tokens: 10,485,760
  • total_batch_size: 524,288
  • dry_run: 0
  • Number of iterations: 765
  • DDP world size: 8
  • Minimum validation bpb: 0.3963
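Since midtraining reuses the pretraining batch size, the amount of data it sees is easy to back out from the step count (approximate, since the final step may be partial):

```python
total_batch_size = 524_288   # tokens per step, same as pretraining
iterations = 765

mid_tokens = iterations * total_batch_size
print(f"{mid_tokens:,}")     # 401,080,320 -> ~0.4B tokens, ~3.6% of the 11.2B pretraining budget
```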

Chat evaluation mid

timestamp: 2025-10-20 03:48:39

  • source: mid
  • task_name: None
  • dtype: bfloat16
  • temperature: 0.0000
  • max_new_tokens: 512
  • num_samples: 1
  • top_k: 50
  • batch_size: 8
  • model_tag: None
  • step: None
  • max_problems: None
  • ARC-Easy: 0.3119
  • ARC-Challenge: 0.2927
  • MMLU: 0.2975
  • GSM8K: 0.0402
  • HumanEval: 0.0976
  • ChatCORE metric: 0.0681
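The ChatCORE number is consistent with the same centering idea: the three multiple-choice tasks are centered by a 0.25 random baseline, GSM8K and HumanEval (where guessing scores ~0) are taken as-is, and the mean is reported. Plugging in the numbers reproduces both the mid figure here and the SFT figure further below:

```python
def centered(acc: float, baseline: float) -> float:
    return (acc - baseline) / (1.0 - baseline)

def chat_core(arc_e, arc_c, mmlu, gsm8k, humaneval):
    mc = [centered(x, 0.25) for x in (arc_e, arc_c, mmlu)]   # 4-way multiple choice
    gen = [gsm8k, humaneval]                                  # baseline ~0
    scores = mc + gen
    return sum(scores) / len(scores)

print(chat_core(0.3119, 0.2927, 0.2975, 0.0402, 0.0976))  # ≈ 0.0681 (mid)
print(chat_core(0.3338, 0.3046, 0.2955, 0.0599, 0.1220))  # ≈ 0.0854 (sft)
```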

Chat SFT

timestamp: 2025-10-20 03:53:11

  • run: speedrun
  • source: mid
  • dtype: bfloat16
  • device_batch_size: 4
  • num_epochs: 1
  • max_iterations: -1
  • target_examples_per_step: 32
  • unembedding_lr: 0.0040
  • embedding_lr: 0.2000
  • matrix_lr: 0.0200
  • weight_decay: 0.0000
  • init_lr_frac: 0.0200
  • eval_every: 100
  • eval_steps: 100
  • eval_metrics_every: 200
  • Training rows: 20,843
  • Number of iterations: 651
  • Training loss: 1.1234
  • Validation loss: 1.0146
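The SFT step count also falls straight out of the config: 20,843 rows at 32 examples per optimizer step gives the 651 iterations above, and with device_batch_size=4 across 8 GPUs there is no gradient accumulation (what happens to the leftover 11 rows is an assumption; presumably they are dropped):

```python
training_rows = 20_843
target_examples_per_step = 32
device_batch_size, world_size = 4, 8

iterations = training_rows // target_examples_per_step                      # 651
grad_accum = target_examples_per_step // (device_batch_size * world_size)   # 1 (no accumulation)
print(iterations, grad_accum)
```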

Chat evaluation sft

timestamp: 2025-10-20 04:09:28

  • source: sft
  • task_name: None
  • dtype: bfloat16
  • temperature: 0.0000
  • max_new_tokens: 512
  • num_samples: 1
  • top_k: 50
  • batch_size: 8
  • model_tag: None
  • step: None
  • max_problems: None
  • ARC-Easy: 0.3338
  • ARC-Challenge: 0.3046
  • MMLU: 0.2955
  • GSM8K: 0.0599
  • HumanEval: 0.1220
  • ChatCORE metric: 0.0854

Summary

  • Characters: 357,831
  • Lines: 8,718
  • Files: 44
  • Tokens (approx): 89,457
  • Dependencies (uv.lock lines): 2,004

| Metric | BASE | MID | SFT | RL |
|--------|------|-----|-----|----|
| CORE | 0.2084 | - | - | - |
| ARC-Challenge | - | 0.2927 | 0.3046 | - |
| ARC-Easy | - | 0.3119 | 0.3338 | - |
| GSM8K | - | 0.0402 | 0.0599 | - |
| HumanEval | - | 0.0976 | 0.1220 | - |
| MMLU | - | 0.2975 | 0.2955 | - |
| ChatCORE | - | 0.0681 | 0.0854 | - |

Total wall clock time: 8h14m (roughly $118 at the quoted $14.32/hour)
