How did you estimate model parameter count before training and use correct hyperparams?
As the title says.
It's pretty simple if you look at the config.json in this repo; using the information inside it, you can calculate the parameter count yourself.
Starting with the embedding layers: both the token and position embeddings are 192-dimensional, applied to a vocabulary of 32,000 tokens. The calculation for each is straightforward: 32,000 * 192 = 6,144,000 parameters. Since both tables are the same size in this model, together they total about 12.29 million parameters.
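Here's that step as a quick Python sketch (just the arithmetic above; the variable names are mine, and the position table is counted at the same size as the token table, following the accounting in this post):

```python
vocab_size = 32_000
hidden_size = 192

# One embedding table: 32,000 * 192
token_embedding = vocab_size * hidden_size  # 6,144,000
# Token + position embeddings, counted at the same size per the post
embedding_params = 2 * token_embedding      # 12,288,000 (~12.29M)
```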
The attention mechanism in this model uses 2 heads in a 192-dimensional embedding space, so each head gets 96 dimensions for its queries, keys, and values. The parameters for the attention matrices work out to 2 heads * 3 matrices per head (Q, K, V) * 192 embedding dimensions * 96 dimensions per head = 110,592. The output projection also has its own parameters, 192 * 192 = 36,864, giving a total of 147,456 parameters for the entire multi-head attention mechanism.
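The same arithmetic in Python (again, just a sketch of the numbers above):

```python
hidden_size = 192
num_heads = 2
head_dim = hidden_size // num_heads                  # 96

# Q, K, V projection matrices across both heads
qkv_params = num_heads * 3 * hidden_size * head_dim  # 110,592
# Output projection back to the embedding dimension
out_proj_params = hidden_size * hidden_size          # 36,864
attention_params = qkv_params + out_proj_params      # 147,456
```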
For the feed-forward network, which uses a width of 1024 for the intermediate layer, each linear transformation contributes significantly. The first transformation upscales from 192 to 1024, which is 192 * 1,024 = 196,608 parameters, and the second downscales back to 192 with the same count, 1,024 * 192 = 196,608. Together, these add up to 393,216 parameters for the layer.
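And the feed-forward block:

```python
hidden_size = 192
intermediate_size = 1024

up_proj = hidden_size * intermediate_size    # 196,608
down_proj = intermediate_size * hidden_size  # 196,608
ffn_params = up_proj + down_proj             # 393,216
```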
Layer normalization, despite being a smaller contributor to the total, adds 2 * 192 = 384 parameters for its scaling and centering factors. When all these components are summed, the total parameter count comes out to approximately 12.83 million parameters (a bit above the 10M target, but within my margin of "error").
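Summing the pieces from the sketches above gives the same figure:

```python
embedding_params = 12_288_000
attention_params = 147_456
ffn_params = 393_216
layer_norm_params = 2 * 192  # 384

total = embedding_params + attention_params + ffn_params + layer_norm_params
print(f"{total:,}")  # 12,829,056 -> ~12.83M
```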
Important parts from config.json for reference:
```json
{
. . .
"hidden_size": 192,
"intermediate_size": 1024,
"max_position_embeddings": 1024,
"num_attention_heads": 2,
"num_hidden_layers": 1,
"num_key_value_heads": 1,
. . .
"vocab_size": 32000
}
```
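If you'd rather check the exact number than estimate it, and the weights load with the transformers library, you can count the parameters directly (the repo id below is a placeholder; substitute the model you're checking):

```python
from transformers import AutoModelForCausalLM

# Placeholder repo id; substitute the actual checkpoint
model = AutoModelForCausalLM.from_pretrained("your/model-repo")
total = sum(p.numel() for p in model.parameters())
print(f"{total:,} parameters")
```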
Hope this answers your question; you can apply this to all of the other OpenLAiNN models.
Thank you so much! I pretrained a couple of models myself, but never got the param count right.
Are you thinking about open sourcing the training code? Thank you again.
I'm debating it. As I'm working on multiple projects right now, the training code will probably be open-sourced once I clean it up and make it not awful to look at and read. That will probably be around OpenLAiNN-2, as I am working on a second version with a different arch that might be interesting. No promises on that one though. :)
As of right now, I'm also working on creating some instruct-tuned versions of these models.
Are you still considering open sourcing the training code? Would be great if you did!
Hey, I'm actually working on a second revision of the model right now. I'm unsure of when I plan to open source it, and it's taken a while as I'm building the dataset and everything from scratch, but it's mostly complete, and I will likely start training this bigger and better version in the coming weeks. The weights will be open sourced, and while I can't make any promises, I plan to also open source the training code and the pretraining and finetuning datasets. So it would be completely possible to replicate the models, given you had a couple of spare GPUs lying around.
So TL;DR: very likely, but it might take a month or two, as I'm a one-man show doing this as a hobby.
Alright. Also, one more question: what challenges did you experience while training Planck-OpenLAiNN, specifically Planck-OpenLAiNN-10M and the 25M variant? I just need some advice for my model. Would be great if you shared some!
Any updates bro? I hope you're doing well.
I have been busy with work, but I got the first prototype models training successfully a few days ago. I've also been busy setting up my A100 server; I just want to take my time with it.
Like I said, this is a side project, but it will be done before the end of the year, and things should go quickly.