BPE tokenizer with byte fallback: 32k vocab, uncased

An uncased BPE tokenizer with byte fallback, intended for encoder models trained with an MLM objective:

  • Trained on pints-ai/Expository-Prose-V1; this tokenizer is primarily intended for English and code.
  • This tokenizer is uncased: "HELLO WORLD" tokenizes to the same IDs as "hello world".
  • model_max_length is set to 1e9 to avoid silent truncation by default. Set tokenizer.model_max_length to your model's maximum position embeddings when training, as in the sketch below.
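
A minimal usage sketch with transformers; the value 512 is a placeholder, so substitute your own model's maximum position embeddings:

```python
from transformers import AutoTokenizer

# load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("pszemraj/bytebpe-tokenizer-32k-mlm-uncased")

# uncased: both strings produce identical token IDs
assert tokenizer("HELLO WORLD")["input_ids"] == tokenizer("hello world")["input_ids"]

# align the tokenizer with the model's context length before training
# (512 is an example value; use your model's max position embeddings)
tokenizer.model_max_length = 512
```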
