| 2025-08-30 01:43:03 - pico-train - INFO - Step 32000 -- Evaluation Results | |
| 2025-08-30 01:43:03 - pico-train - INFO - └── paloma: 2.977755235898109e+26 | |
| 2025-08-30 01:43:05 - pico-train - INFO - ================================================== | |
| 2025-08-30 01:43:05 - pico-train - INFO - ✨ Training Configuration | |
| 2025-08-30 01:43:05 - pico-train - INFO - ================================================== | |
| 2025-08-30 01:43:05 - pico-train - INFO - ╭─────────────────────────────────────────────────────╮ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ checkpointing: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ checkpoints_dir: checkpoints │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ evaluation: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ eval_results_dir: eval_results │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ fabric_checkpoint_dir: fabric_state │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ fabric_checkpoint_filename: checkpoint.pt │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ hf_checkpoint: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ collection_slug: null │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ repo_id: ThomasTheMaker/pico-decoder-tiny │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ learning_dynamics: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ batch_size: 1 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ eval_data: null │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ layer_suffixes: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ - attention.v_proj │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ - attention.o_proj │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ - swiglu.w_2 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ sequence_idx: -1 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ learning_dynamics_dir: learning_dynamics │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ logs_dir: logs │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ run_name: pico-decoder-tiny-dolma5M-v1 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ runs_dir: runs │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ save_every_n_steps: 500 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ save_to_hf: true │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ training: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ auto_resume: true │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ data: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ dataloader: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ batch_size: 4 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ dataset: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ name: ThomasTheMaker/pretokenized-dolma-5M │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ tokenizer: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ name: allenai/OLMo-7B-0724-hf │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ vocab_size: 50304 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ evaluation: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ metrics: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ - paloma │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ paloma: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ batch_size: 1 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ dataset_name: pico-lm/pretokenized-paloma-tinsy │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ dataset_split: val │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ max_length: 2048 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ model: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ activation_hidden_dim: 384 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ attention_n_heads: 12 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ attention_n_kv_heads: 4 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ batch_size: 1024 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ d_model: 96 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ max_seq_len: 2048 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ model_type: pico_decoder │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ n_layers: 12 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ norm_eps: 1.0e-06 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ position_emb_theta: 10000.0 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ vocab_size: 50304 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ monitoring: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ logging: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ log_every_n_steps: 25 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ log_level: INFO │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ save_to_wandb: false │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ wandb: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ entity: boymyc │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ project: pico-decoder-tiny │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ training: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ fabric: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ accelerator: cuda │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ num_devices: 1 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ num_nodes: 1 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ precision: bf16-mixed │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ max_steps: 20000 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ optimization: │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ gradient_accumulation_steps: 4 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ lr: 5.0e-05 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ lr_scheduler: cosine │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ lr_warmup_steps: 8000 │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ optimizer: adamw │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - │ │ | |
| 2025-08-30 01:43:05 - pico-train - INFO - ╰─────────────────────────────────────────────────────╯ | |
| 2025-08-30 01:43:05 - pico-train - INFO - ================================================== | |
| 2025-08-30 01:43:05 - pico-train - INFO - Runtime Summary: | |
| 2025-08-30 01:43:05 - pico-train - INFO - ================================================== | |
| 2025-08-30 01:43:05 - pico-train - INFO - Starting from step: 32000 | |
| 2025-08-30 01:43:05 - pico-train - INFO - Model Setup: | |
| 2025-08-30 01:43:05 - pico-train - INFO - ├─ Total Parameters: 11,282,784 | |
| 2025-08-30 01:43:05 - pico-train - INFO - └─ Trainable Parameters: 11,282,784 | |
| 2025-08-30 01:43:05 - pico-train - INFO - Distributed Setup: | |
| 2025-08-30 01:43:05 - pico-train - INFO - ├─ Number of Devices: 1 | |
| 2025-08-30 01:43:05 - pico-train - INFO - ├─ Device Type: NVIDIA GeForce RTX 5090 | |
| 2025-08-30 01:43:05 - pico-train - INFO - └─ Available Memory: 33.68 GB | |
| 2025-08-30 01:43:05 - pico-train - INFO - Software Setup: | |
| 2025-08-30 01:43:05 - pico-train - INFO - ├─ Python Version: 3.10.12 | |
| 2025-08-30 01:43:05 - pico-train - INFO - ├─ PyTorch Version: 2.8.0+cu128 | |
| 2025-08-30 01:43:05 - pico-train - INFO - ├─ CUDA Version: 12.8 | |
| 2025-08-30 01:43:05 - pico-train - INFO - └─ Operating System: Linux 6.8.0-63-generic | |
| 2025-08-30 01:43:05 - pico-train - INFO - Batch Size Configuration: | |
| 2025-08-30 01:43:05 - pico-train - INFO - ├─ Global Batch Size: 4 | |
| 2025-08-30 01:43:05 - pico-train - INFO - ├─ Per Device Batch Size: 1 | |
| 2025-08-30 01:43:05 - pico-train - INFO - └─ Gradient Accumulation Steps: 4 | |
| 2025-08-30 01:43:05 - pico-train - INFO - ================================================== | |
| 2025-08-30 01:43:06 - pico-train - INFO - Step 32000 -- Training Metrics | |
| 2025-08-30 01:43:06 - pico-train - INFO - ├── Loss: 6.3376 | |
| 2025-08-30 01:43:06 - pico-train - INFO - ├── Learning Rate: 7.32e-06 | |
| 2025-08-30 01:43:06 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:43:06 - pico-train - INFO - Step 32000 -- Saving Learning Dynamics | |
| 2025-08-30 01:43:20 - pico-train - INFO - Step 32025 -- Training Metrics | |
| 2025-08-30 01:43:20 - pico-train - INFO - ├── Loss: 6.1999 | |
| 2025-08-30 01:43:20 - pico-train - INFO - ├── Learning Rate: 7.28e-06 | |
| 2025-08-30 01:43:20 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:43:33 - pico-train - INFO - Step 32050 -- Training Metrics | |
| 2025-08-30 01:43:33 - pico-train - INFO - ├── Loss: 6.1488 | |
| 2025-08-30 01:43:33 - pico-train - INFO - ├── Learning Rate: 7.24e-06 | |
| 2025-08-30 01:43:33 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:43:45 - pico-train - INFO - Step 32075 -- Training Metrics | |
| 2025-08-30 01:43:45 - pico-train - INFO - ├── Loss: 6.0460 | |
| 2025-08-30 01:43:45 - pico-train - INFO - ├── Learning Rate: 7.19e-06 | |
| 2025-08-30 01:43:45 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:43:58 - pico-train - INFO - Step 32100 -- Training Metrics | |
| 2025-08-30 01:43:58 - pico-train - INFO - ├── Loss: 6.1627 | |
| 2025-08-30 01:43:58 - pico-train - INFO - ├── Learning Rate: 7.15e-06 | |
| 2025-08-30 01:43:58 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:44:11 - pico-train - INFO - Step 32125 -- Training Metrics | |
| 2025-08-30 01:44:11 - pico-train - INFO - ├── Loss: 6.2085 | |
| 2025-08-30 01:44:11 - pico-train - INFO - ├── Learning Rate: 7.11e-06 | |
| 2025-08-30 01:44:11 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:44:23 - pico-train - INFO - Step 32150 -- Training Metrics | |
| 2025-08-30 01:44:23 - pico-train - INFO - ├── Loss: 6.1659 | |
| 2025-08-30 01:44:23 - pico-train - INFO - ├── Learning Rate: 7.06e-06 | |
| 2025-08-30 01:44:23 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:44:36 - pico-train - INFO - Step 32175 -- Training Metrics | |
| 2025-08-30 01:44:36 - pico-train - INFO - ├── Loss: 6.1719 | |
| 2025-08-30 01:44:36 - pico-train - INFO - ├── Learning Rate: 7.02e-06 | |
| 2025-08-30 01:44:36 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:44:48 - pico-train - INFO - Step 32200 -- Training Metrics | |
| 2025-08-30 01:44:48 - pico-train - INFO - ├── Loss: 6.2081 | |
| 2025-08-30 01:44:48 - pico-train - INFO - ├── Learning Rate: 6.98e-06 | |
| 2025-08-30 01:44:48 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:45:01 - pico-train - INFO - Step 32225 -- Training Metrics | |
| 2025-08-30 01:45:01 - pico-train - INFO - ├── Loss: 6.1955 | |
| 2025-08-30 01:45:01 - pico-train - INFO - ├── Learning Rate: 6.94e-06 | |
| 2025-08-30 01:45:01 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:45:14 - pico-train - INFO - Step 32250 -- Training Metrics | |
| 2025-08-30 01:45:14 - pico-train - INFO - ├── Loss: 6.1139 | |
| 2025-08-30 01:45:14 - pico-train - INFO - ├── Learning Rate: 6.89e-06 | |
| 2025-08-30 01:45:14 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:45:26 - pico-train - INFO - Step 32275 -- Training Metrics | |
| 2025-08-30 01:45:26 - pico-train - INFO - ├── Loss: 6.1075 | |
| 2025-08-30 01:45:26 - pico-train - INFO - ├── Learning Rate: 6.85e-06 | |
| 2025-08-30 01:45:26 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:45:39 - pico-train - INFO - Step 32300 -- Training Metrics | |
| 2025-08-30 01:45:39 - pico-train - INFO - ├── Loss: 6.0814 | |
| 2025-08-30 01:45:39 - pico-train - INFO - ├── Learning Rate: 6.81e-06 | |
| 2025-08-30 01:45:39 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:45:51 - pico-train - INFO - Step 32325 -- Training Metrics | |
| 2025-08-30 01:45:51 - pico-train - INFO - ├── Loss: 6.0880 | |
| 2025-08-30 01:45:51 - pico-train - INFO - ├── Learning Rate: 6.77e-06 | |
| 2025-08-30 01:45:51 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:46:04 - pico-train - INFO - Step 32350 -- Training Metrics | |
| 2025-08-30 01:46:04 - pico-train - INFO - ├── Loss: 6.1997 | |
| 2025-08-30 01:46:04 - pico-train - INFO - ├── Learning Rate: 6.73e-06 | |
| 2025-08-30 01:46:04 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:46:16 - pico-train - INFO - Step 32375 -- Training Metrics | |
| 2025-08-30 01:46:16 - pico-train - INFO - ├── Loss: 6.1376 | |
| 2025-08-30 01:46:16 - pico-train - INFO - ├── Learning Rate: 6.68e-06 | |
| 2025-08-30 01:46:16 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:46:29 - pico-train - INFO - Step 32400 -- Training Metrics | |
| 2025-08-30 01:46:29 - pico-train - INFO - ├── Loss: 6.1077 | |
| 2025-08-30 01:46:29 - pico-train - INFO - ├── Learning Rate: 6.64e-06 | |
| 2025-08-30 01:46:29 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:46:42 - pico-train - INFO - Step 32425 -- Training Metrics | |
| 2025-08-30 01:46:42 - pico-train - INFO - ├── Loss: 6.2641 | |
| 2025-08-30 01:46:42 - pico-train - INFO - ├── Learning Rate: 6.60e-06 | |
| 2025-08-30 01:46:42 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:46:54 - pico-train - INFO - Step 32450 -- Training Metrics | |
| 2025-08-30 01:46:54 - pico-train - INFO - ├── Loss: 6.1020 | |
| 2025-08-30 01:46:54 - pico-train - INFO - ├── Learning Rate: 6.56e-06 | |
| 2025-08-30 01:46:54 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:47:07 - pico-train - INFO - Step 32475 -- Training Metrics | |
| 2025-08-30 01:47:07 - pico-train - INFO - ├── Loss: 6.2170 | |
| 2025-08-30 01:47:07 - pico-train - INFO - ├── Learning Rate: 6.52e-06 | |
| 2025-08-30 01:47:07 - pico-train - INFO - └── Inf/NaN count: 0 | |
| 2025-08-30 01:47:19 - pico-train - INFO - Step 32500 -- 💾 Saving Checkpoint | |