Currently broken because of updates to the transformers library; I'll reimplement and retrain.
GLORT2 (GLORT2 Low Rank Transformer Transformer) is a transformer model in which every linear layer is itself another, smaller transformer. I combined the Q, K, and V projections into one operation, so each attention block uses one inner transformer instead of three, to save on parameters. I also played with using a transformer on the embeddings, but the results weren't great. The model is 768-dim with 10 layers, and each linear-layer replacement is a 384-dim, 1-layer transformer (the embeddings and LM head stay as ordinary layers).
Also, apologies: I just realized there's some residue from the projects I copied the model code from, including some "expanded lm head size" stuff in the config and code. Just ignore it; this isn't a serious project, so I don't care too much that it's there.
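To make the idea above concrete, here is a minimal PyTorch sketch of a "transformer as a linear layer" module and the fused QKV variant. This is my reading of the description, not the actual model code: the module names (`TransformerLinear`, `FusedQKV`), the head count, and the project-in/project-out wiring are all assumptions.

```python
import torch
import torch.nn as nn

class TransformerLinear(nn.Module):
    """Drop-in stand-in for nn.Linear that routes features through a
    small 1-layer transformer (a sketch of the GLORT2 idea; the exact
    wiring in the real model may differ)."""
    def __init__(self, d_in, d_out, d_inner=384, n_heads=4):
        super().__init__()
        self.proj_in = nn.Linear(d_in, d_inner)     # down-project to the inner width
        self.block = nn.TransformerEncoderLayer(
            d_model=d_inner, nhead=n_heads, batch_first=True)
        self.proj_out = nn.Linear(d_inner, d_out)   # up-project to the output width

    def forward(self, x):  # x: (batch, seq, d_in)
        return self.proj_out(self.block(self.proj_in(x)))

class FusedQKV(nn.Module):
    """One TransformerLinear producing Q, K, and V at once,
    i.e. one inner transformer instead of three."""
    def __init__(self, d_model=768, d_inner=384, n_heads=4):
        super().__init__()
        self.qkv = TransformerLinear(d_model, 3 * d_model, d_inner, n_heads)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)  # split the fused output
        return q, k, v
```

Note the small irony that the replacement still uses two plain `nn.Linear` projections to move between widths; I'm assuming the real model does something similar, since a 384-dim inner transformer can't consume 768-dim features directly.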
| Model | 512-token strided perplexity (Pile test set) | Training tokens |
|---|---|---|
| cerebras 111m | 21.550655364990234 | 2.2b |
| cerebras 256m | 15.203496932983398 | 5.1b |
| cerebras 590m | 12.098200798034668 | 11.something b |
| deduped pythia 70m (95.6M) | 22.393400192260742 | 300b |
| deduped pythia 160m (213M) | 13.933751106262207 | 300b |
| deduped pythia 410m (506M) | 9.61842155456543 | 300b |
| llama w same settings as cerebras 111m (119m) | 13.882301330566406 | 2.2b |
| llama plus w same settings as cerebras 111m and llama 70b embeddings (369m) | 13.565109252929688 | 2.2b |
| GLORT2 (205m) | 13.051741600036621 | 2.2b |
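The perplexity column above is a strided (sliding-window) evaluation. I don't know the exact script used, but the standard recipe with a 512-token stride looks like the sketch below, where `model` is assumed to be any HF-style causal LM that returns a `.loss` when given `labels` (tokens labeled `-100` are ignored).

```python
import torch

def strided_perplexity(model, input_ids, max_len=1024, stride=512):
    """Sliding-window perplexity (sketch): slide a max_len window by
    `stride` tokens, scoring only the new tokens in each step so every
    token is predicted with substantial left context."""
    nll_sum, n_tokens = 0.0, 0
    prev_end = 0
    for begin in range(0, input_ids.size(1), stride):
        end = min(begin + max_len, input_ids.size(1))
        trg_len = end - prev_end          # tokens newly scored this step
        ids = input_ids[:, begin:end]
        target = ids.clone()
        target[:, :-trg_len] = -100       # mask the overlapping prefix
        with torch.no_grad():
            loss = model(ids, labels=target).loss
        nll_sum += loss.item() * trg_len  # un-average, then re-average below
        n_tokens += trg_len
        prev_end = end
        if end == input_ids.size(1):
            break
    return torch.exp(torch.tensor(nll_sum / n_tokens)).item()
```

The window length (1024 here) is an assumption; only the stride of 512 comes from the table header.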
| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| arc_challenge | 1 | none | 25 | acc | 0.1706 | ± | 0.0110 |
| | | none | 25 | acc_norm | 0.2099 | ± | 0.0119 |
| truthfulqa_mc2 | 2 | none | 0 | acc | 0.4599 | ± | 0.0154 |
| winogrande | 1 | none | 5 | acc | 0.5083 | ± | 0.0141 |
| hellaswag | 1 | none | 10 | acc | 0.2728 | ± | 0.0044 |
| | | none | 10 | acc_norm | 0.2815 | ± | 0.0045 |
| gsm8k | 2 | get-answer | 5 | exact_match | 0 | ± | 0 |
MMLU (mean accuracy ≈ 0.264, I think):
| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| world_religions | 0 | none | 5 | acc | 0.1988 | ± | 0.0306 |
| virology | 0 | none | 5 | acc | 0.1928 | ± | 0.0307 |
| us_foreign_policy | 0 | none | 5 | acc | 0.2600 | ± | 0.0441 |
| sociology | 0 | none | 5 | acc | 0.2438 | ± | 0.0304 |
| security_studies | 0 | none | 5 | acc | 0.4000 | ± | 0.0314 |
| public_relations | 0 | none | 5 | acc | 0.2273 | ± | 0.0401 |
| professional_psychology | 0 | none | 5 | acc | 0.2484 | ± | 0.0175 |
| professional_medicine | 0 | none | 5 | acc | 0.4485 | ± | 0.0302 |
| professional_law | 0 | none | 5 | acc | 0.2445 | ± | 0.0110 |
| professional_accounting | 0 | none | 5 | acc | 0.2482 | ± | 0.0258 |
| prehistory | 0 | none | 5 | acc | 0.2562 | ± | 0.0243 |
| philosophy | 0 | none | 5 | acc | 0.2186 | ± | 0.0235 |
| nutrition | 0 | none | 5 | acc | 0.2941 | ± | 0.0261 |
| moral_scenarios | 0 | none | 5 | acc | 0.2503 | ± | 0.0145 |
| moral_disputes | 0 | none | 5 | acc | 0.1965 | ± | 0.0214 |
| miscellaneous | 0 | none | 5 | acc | 0.2554 | ± | 0.0156 |
| medical_genetics | 0 | none | 5 | acc | 0.3000 | ± | 0.0461 |
| marketing | 0 | none | 5 | acc | 0.1966 | ± | 0.0260 |
| management | 0 | none | 5 | acc | 0.1942 | ± | 0.0392 |
| machine_learning | 0 | none | 5 | acc | 0.2321 | ± | 0.0401 |
| logical_fallacies | 0 | none | 5 | acc | 0.2331 | ± | 0.0332 |
| jurisprudence | 0 | none | 5 | acc | 0.2407 | ± | 0.0413 |
| international_law | 0 | none | 5 | acc | 0.3719 | ± | 0.0441 |
| human_sexuality | 0 | none | 5 | acc | 0.2137 | ± | 0.0360 |
| human_aging | 0 | none | 5 | acc | 0.2646 | ± | 0.0296 |
| high_school_world_history | 0 | none | 5 | acc | 0.2489 | ± | 0.0281 |
| high_school_us_history | 0 | none | 5 | acc | 0.2304 | ± | 0.0296 |
| high_school_statistics | 0 | none | 5 | acc | 0.4722 | ± | 0.0340 |
| high_school_psychology | 0 | none | 5 | acc | 0.3083 | ± | 0.0198 |
| high_school_physics | 0 | none | 5 | acc | 0.3046 | ± | 0.0376 |
| high_school_microeconomics | 0 | none | 5 | acc | 0.3361 | ± | 0.0307 |
| high_school_mathematics | 0 | none | 5 | acc | 0.2630 | ± | 0.0268 |
| high_school_macroeconomics | 0 | none | 5 | acc | 0.3231 | ± | 0.0237 |
| high_school_government_and_politics | 0 | none | 5 | acc | 0.3523 | ± | 0.0345 |
| high_school_geography | 0 | none | 5 | acc | 0.3384 | ± | 0.0337 |
| high_school_european_history | 0 | none | 5 | acc | 0.2909 | ± | 0.0355 |
| high_school_computer_science | 0 | none | 5 | acc | 0.2600 | ± | 0.0441 |
| high_school_chemistry | 0 | none | 5 | acc | 0.2709 | ± | 0.0313 |
| high_school_biology | 0 | none | 5 | acc | 0.3161 | ± | 0.0265 |
| global_facts | 0 | none | 5 | acc | 0.1800 | ± | 0.0386 |
| formal_logic | 0 | none | 5 | acc | 0.1667 | ± | 0.0333 |
| elementary_mathematics | 0 | none | 5 | acc | 0.2540 | ± | 0.0224 |
| electrical_engineering | 0 | none | 5 | acc | 0.3103 | ± | 0.0386 |
| econometrics | 0 | none | 5 | acc | 0.2895 | ± | 0.0427 |
| conceptual_physics | 0 | none | 5 | acc | 0.2553 | ± | 0.0285 |
| computer_security | 0 | none | 5 | acc | 0.1900 | ± | 0.0394 |
| college_physics | 0 | none | 5 | acc | 0.3431 | ± | 0.0472 |
| college_medicine | 0 | none | 5 | acc | 0.2312 | ± | 0.0321 |
| college_mathematics | 0 | none | 5 | acc | 0.1800 | ± | 0.0386 |
| college_computer_science | 0 | none | 5 | acc | 0.3000 | ± | 0.0461 |
| college_chemistry | 0 | none | 5 | acc | 0.2900 | ± | 0.0456 |
| college_biology | 0 | none | 5 | acc | 0.2083 | ± | 0.0340 |
| clinical_knowledge | 0 | none | 5 | acc | 0.2038 | ± | 0.0248 |
| business_ethics | 0 | none | 5 | acc | 0.2100 | ± | 0.0409 |
| astronomy | 0 | none | 5 | acc | 0.1908 | ± | 0.0320 |
| anatomy | 0 | none | 5 | acc | 0.2963 | ± | 0.0394 |
| abstract_algebra | 0 | none | 5 | acc | 0.2000 | ± | 0.0402 |