# training llama tokenizer
How does Meta train their sentencepiece tokenizer? You can print the config of their `tokenizer.model` as follows:
```python
import sentencepiece.sentencepiece_model_pb2

# tokenizer.model is a serialized protobuf; parse it and print the training config
mp = sentencepiece.sentencepiece_model_pb2.ModelProto()
mp.ParseFromString(open("tokenizer.model", "rb").read())
print(mp.trainer_spec)
print(mp.normalizer_spec)
```
this gives:
```
trainer_spec {
  input: "/large_experiments/theorem/datasets/MERGED/all.test1.merged"
  model_prefix: "spm_model_32k_200M_charcov099995_allowWSO__v2"
  model_type: BPE
  vocab_size: 32000
  self_test_sample_size: 0
  input_format: "text"
  character_coverage: 0.9999499917030334
  input_sentence_size: 200000000
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  num_threads: 80
  num_sub_iterations: 2
  max_sentence_length: 4192
  shuffle_input_sentence: true
  max_sentencepiece_length: 16
  split_by_unicode_script: true
  split_by_whitespace: true
  split_by_number: true
  treat_whitespace_as_suffix: false
  split_digits: true
  allow_whitespace_only_pieces: true
  vocabulary_output_piece_score: true
  hard_vocab_limit: true
  use_all_vocab: false
  byte_fallback: true
  required_chars: ""
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_surface: " \342\201\207 "
  unk_piece: "<unk>"
  bos_piece: "<s>"
  eos_piece: "</s>"
  pad_piece: "<pad>"
  train_extremely_large_corpus: false
  enable_differential_privacy: false
  differential_privacy_noise_level: 0.0
  differential_privacy_clipping_threshold: 0
}
normalizer_spec {
  name: "identity"
  precompiled_charsmap: ""
  add_dummy_prefix: true
  remove_extra_whitespaces: false
  normalization_rule_tsv: ""
}
```
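You can also peek at the vocabulary itself through the same proto. A small follow-on sketch (reusing `mp` from the snippet above; the field names come from the sentencepiece protobuf schema):

```python
# continuing from the snippet above: inspect the first few vocabulary entries
for i, p in enumerate(mp.pieces[:10]):
    print(i, repr(p.piece), p.score)
# for the Llama tokenizer you should see <unk>, <s>, </s> first,
# followed by the 256 byte-fallback pieces <0x00> ... <0xFF>
```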
We can use sentencepiece's `spm_train` to train the same kind of model, just optionally smaller. Here are their [options docs](https://github.com/google/sentencepiece/blob/master/doc/options.md) we can refer to. It's not much, but it helps.
We'll depart from their setup on one setting: I recommend changing `character_coverage` -> 1.0. We also want to make sure to note the following important settings that come up in the paper and are not necessarily the sentencepiece defaults:
```
--split_digits=true
--allow_whitespace_only_pieces=true
--byte_fallback=true
--normalization_rule_name=identity
```
With this in mind we can train a sentencepiece vocab in what I believe is probably the same way Meta trained theirs:
```
spm_train --input="$input" \
          --model_prefix="$model_prefix" \
          --model_type=bpe \
          --vocab_size="$vocab_size" \
          --self_test_sample_size=0 \
          --input_format="text" \
          --character_coverage=1.0 \
          --num_threads="$(nproc)" \
          --split_digits=true \
          --allow_whitespace_only_pieces=true \
          --byte_fallback=true \
          --unk_surface=" \342\201\207 " \
          --normalization_rule_name=identity
```
where `$input` is the input text file, `$model_prefix` is the output path prefix, `$vocab_size` is the desired vocabulary size, and by default we take over all the CPU resources of the machine.
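Once training finishes you should have `$model_prefix.model` and `$model_prefix.vocab` on disk. As a quick sanity check, here is a sketch (the file name and test strings are just placeholders) that loads the trained model with the Python bindings and looks at byte fallback in action:

```python
import sentencepiece as spm

# load the freshly trained model ("tok.model" is a placeholder for $model_prefix.model)
sp = spm.SentencePieceProcessor(model_file="tok.model")

print(sp.vocab_size())
print(sp.encode("Hello world 123", out_type=str))  # pieces; digits should be split individually
print(sp.encode("Hello world 123"))                # corresponding token ids

# a character that never appeared in training should fall back to raw byte pieces
# (e.g. <0xF0> ...) rather than <unk>, because of --byte_fallback=true
print(sp.encode("🦙", out_type=str))
```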
Lastly, note that sentencepiece is a bit weird and expects "sentences" delimited by newlines as its input. You can't just put in a massive block of text, and there is a hyperparameter (`max_sentence_length` above) that controls the maximum size of a "sentence". Fwiw I really dislike this design choice around a weird concept of a "sentence". It should just be a block of text with no assumptions. But here we are.
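So in practice you have to pre-chunk your corpus into lines before handing it to `spm_train`. A minimal sketch of that preprocessing (file names and the chunk length are made-up placeholders; `max_sentence_length` is measured in bytes, so this is only approximate):

```python
# rewrite one big text file as newline-delimited chunks for spm_train
# ("corpus.txt" / "corpus_lines.txt" are hypothetical file names)
MAX_LEN = 4000  # stay comfortably under spm_train's max_sentence_length (default 4192 bytes)

with open("corpus.txt", "r", encoding="utf-8") as fin, \
     open("corpus_lines.txt", "w", encoding="utf-8") as fout:
    for para in fin.read().split("\n\n"):       # treat paragraphs as "sentences"
        para = " ".join(para.split())           # collapse internal newlines/whitespace
        for i in range(0, len(para), MAX_LEN):  # chunk anything that is still too long
            chunk = para[i:i + MAX_LEN].strip()
            if chunk:
                fout.write(chunk + "\n")
```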
Look into the file `tinystories.py`, where we train the vocab in the same way but using the Python bindings instead.
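For reference, a minimal sketch of what that looks like through the Python bindings (this just mirrors the `spm_train` command above with made-up input/output names and a small vocab; the exact arguments in `tinystories.py` may differ):

```python
import os
import sentencepiece as spm

# train via the Python bindings; keyword arguments mirror the spm_train flags above
# ("tiny.txt", "tok4096" and vocab_size=4096 are just example values)
spm.SentencePieceTrainer.train(
    input="tiny.txt",               # newline-delimited text file
    model_prefix="tok4096",         # writes tok4096.model and tok4096.vocab
    model_type="bpe",
    vocab_size=4096,
    self_test_sample_size=0,
    input_format="text",
    character_coverage=1.0,
    num_threads=os.cpu_count(),
    split_digits=True,
    allow_whitespace_only_pieces=True,
    byte_fallback=True,
    unk_surface=r" \342\201\207 ",
    normalization_rule_name="identity",
)
```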