---
library_name: transformers
license: apache-2.0
datasets:
- pints-ai/Expository-Prose-V1
language:
- en
---

# bpe tokenizer w byte-fallback: 32k vocab, uncased

Uncased BPE tokenizer for encoder models / MLM objectives, with byte fallback (a usage sketch follows this list):

- Trained on `pints-ai/Expository-Prose-V1`; this tokenizer is intended primarily for English and code.
- This tokenizer is **uncased**: "HELLO WORLD" tokenizes **the same** as "hello world".
- `model_max_length` is set to 1e9 so the tokenizer never silently truncates on its own. **Set `tokenizer.model_max_length` to your model's max position embeddings** when training.
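
A minimal usage sketch, assuming the tokenizer is loaded from the hub; the repo id `your-org/bpe-32k-uncased` and the max length of 512 are placeholders, substitute your actual hub id and your model's max position embeddings:

```python
from transformers import AutoTokenizer

# Placeholder repo id -- replace with the actual hub id of this tokenizer.
tokenizer = AutoTokenizer.from_pretrained("your-org/bpe-32k-uncased")

# The card ships model_max_length as 1e9 (effectively "no truncation");
# pin it to your model's max position embeddings before training.
tokenizer.model_max_length = 512  # hypothetical value

# Uncased: both casings produce identical token ids.
upper = tokenizer("HELLO WORLD")["input_ids"]
lower = tokenizer("hello world")["input_ids"]
assert upper == lower
```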