# 🔣 tokeniser-py Demonstration
Library GitHub (tokeniser-py-lite) | HF Dataset (unchunked) | GitHub Dataset (chunked) | GitHub Imp Files | PyPI Package (Main Lib) | PyPI Package (Lite Lib)

### Learn about language model tokenization
tokeniser-py's custom tokenizer processes text using tokens, which are common sequences of characters found in a body of text. The model learns the statistical relationships between these tokens and excels at producing the next token in a sequence. You can use the tool below to see how a piece of text might be tokenized by a language model, and the total count of tokens in that piece of text.
We've conducted a thorough analysis of the token efficiency of our tokeniser against other tokenizers.

Our tokenizer demonstrates noticeably different characteristics. We tested a 28-page blog post across different tokenizers:
**2.52 characters is the average (adjusted-frequency)-weighted token size, i.e. we weight each token's size by its true occurrences, obtained after adjusting its observed occurrences by its super-tokens' occurrences.
A super-token of a token, say 'e', is any token that contains 'e' (like 'ear', 'ears', 'years', etc.). When weighting token lengths, we find that smaller tokens get an unduly high weight because their occurrences inside super-tokens are counted as well.
To adjust for this, we hierarchically subtract the super-tokens' occurrences from each token's observed count to obtain its true frequency (see the sketch after this note).
Unadjusted frequency weighting gives an average of ~2.2 characters per token, and a raw (unweighted) average comes to ~4.6-4.7 characters per token.
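The adjustment can be sketched roughly as follows; the vocabulary, counts, and helper names here are made up for illustration and are not the library's actual data or API.

```python
# Sketch: hierarchically adjust observed substring counts by super-token counts,
# then weight token length by the adjusted (true) frequency.
# The vocabulary and counts below are made up for illustration only.

observed = {
    "e": 1000,   # inflated: counts every 'e' inside 'ear', 'ears', 'years', ...
    "ear": 120,
    "ears": 40,
    "years": 30,
}

def true_frequencies(observed):
    """Process longest tokens first and subtract each super-token's true count
    from every shorter token it contains."""
    adjusted = {}
    for token in sorted(observed, key=len, reverse=True):
        count = observed[token]
        for sup, sup_true in adjusted.items():
            if token != sup and token in sup:
                count -= sup_true * sup.count(token)
        adjusted[token] = count
    return adjusted

def weighted_avg_token_len(freqs):
    total = sum(freqs.values())
    return sum(len(tok) * c for tok, c in freqs.items()) / total

adjusted = true_frequencies(observed)

print("raw (unweighted) average length: ", sum(map(len, observed)) / len(observed))
print("observed-weighted average length:", weighted_avg_token_len(observed))
print("adjusted-weighted average length:", weighted_avg_token_len(adjusted))
```

With these toy counts, the adjusted weighting shifts weight away from the single-character token 'e' toward its longer super-tokens, which is why the adjusted-weighted average comes out larger than the observed-weighted one.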
Our tokenization strategy separates non-underscore special characters from alphanumeric tokens.
We define an alphanumeric token as any word that contains no special characters other than underscores.
For OpenAI's tokens, we considered any token containing at least one alphanumeric character (underscores excluded) as an alphanumeric token.
This difference is due to the different special-character handling methodologies followed by the two tokenisers (see the sketch below).
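A short sketch of the two classification rules, using hypothetical helper names (the actual analysis code may differ):

```python
import re

def is_alphanumeric_token_ours(token: str) -> bool:
    """Our convention: no special characters at all, except underscores."""
    return bool(re.fullmatch(r"[A-Za-z0-9_]+", token))

def is_alphanumeric_token_openai(token: str) -> bool:
    """Convention applied to OpenAI's tokens: at least one alphanumeric
    character (underscores excluded)."""
    return bool(re.search(r"[A-Za-z0-9]", token))

# Illustrative tokens, including a GPT-style token with a leading space
for tok in ["hello", "_private", " world", "##", "n't"]:
    print(f"{tok!r:12} ours={is_alphanumeric_token_ours(tok)!s:5} "
          f"openai={is_alphanumeric_token_openai(tok)!s}")
```

Tokens such as " world" or "n't" illustrate the gap: they count as alphanumeric under the OpenAI-side rule but not under ours.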
The tokeniser's better word-representation performance is not only due to technique differences but also because GPT-4 has fewer available tokens (100k vs. our 131k) and needs to reserve tokens for multimodal content, further reducing English-specific tokens.
Additionally, GPT-4's approach of combining special characters with alphanumeric content potentially reduces the availability of relevant alphanumeric tokens. Despite these constraints, GPT-4's tokeniser performs relatively well, though ours provides a valuable research preview of an alternative algorithm.
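For readers who want to reproduce a comparison of this kind, a rough sketch follows; it assumes tiktoken's `cl100k_base` encoding as a stand-in for GPT-4's tokenizer, and the sample text is a placeholder rather than our 28-page benchmark.

```python
import tiktoken
from tokeniser import Tokeniser

# Placeholder text; substitute the document you want to compare.
text = "Tokenisation efficiency depends heavily on the vocabulary."

# tokeniser-py: tokenise() returns the token list and the token count
t = Tokeniser()
our_tokens, our_count = t.tokenise(text)

# GPT-4-style tokenisation via tiktoken's cl100k_base encoding
enc = tiktoken.get_encoding("cl100k_base")
gpt4_count = len(enc.encode(text))

print(f"tokeniser-py token count: {our_count}")
print(f"cl100k_base token count:  {gpt4_count}")
```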
Our approach prioritizes semantic representation over token-count minimization.
", unsafe_allow_html=True) # Philosophy points with enhanced styling st.markdown("""Need a programmatic interface for tokenizing text? Check out our tokeniser-py package for Python.
Two versions are available: 0.5B (Validation-only data) and 1B (Validation + Test data).

```python
from tokeniser import Tokeniser

t = Tokeniser()

# tokenise() returns the list of tokens and the token count
tokens, count = t.tokenise("Your input text here.")

# Convert the tokens to their integer IDs
token_ids = t.token_ids(tokens)
```

Use `t.one_hot_tokens(token_ids)` for NumPy-based one-hot encoding, or `op='torch'` for PyTorch.

### 📁 Vocab Files
- `ordered_tokenizer_1b_val_test_data.json` — Ordered tokens (1B data)
- `unordered_tokenizer_1b_val_test_data.json` — Unordered tokens (1B)
- `count_tokenizer_1b_val_test_data.json` — Token counts (1B)
- Similar structure for 0.5B val-only version
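As a quick illustration of the one-hot helpers mentioned above, here is a sketch; passing `op='torch'` directly to `one_hot_tokens` is an assumption based on the note above, so check the package docs for the exact signature.

```python
from tokeniser import Tokeniser

t = Tokeniser()
tokens, count = t.tokenise("One-hot encoding example.")
token_ids = t.token_ids(tokens)

# NumPy-backed one-hot encoding (default)
one_hot_np = t.one_hot_tokens(token_ids)

# PyTorch-backed one-hot encoding, assuming `op` selects the backend
one_hot_torch = t.one_hot_tokens(token_ids, op='torch')

print(type(one_hot_np), type(one_hot_torch))
```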