", unsafe_allow_html=True) # Section 1: Tokenization Efficiency st.markdown("---") st.markdown("

Tokenization Efficiency

", unsafe_allow_html=True) # Quote block st.markdown("""

A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).

— OpenAI

""", unsafe_allow_html=True) # Section 2: Our Analysis st.markdown("

Our Analysis

", unsafe_allow_html=True) st.markdown("

We analysed the token efficiency of our tokeniser against other tokenizers:

", unsafe_allow_html=True) # Analysis points with enhanced styling st.markdown("""

• The GPT-2 tokenizer averages approximately 3.9 characters per token
• English text corpora typically have average word lengths of 4.7 to 5.1 characters; in our dataset the average was 4.73-4.79
• Thus, for our dataset, traditional tokenizers cover roughly ⁴⁄₅ of a word per token (100 tokens ≈ 80 words)
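The ⁴⁄₅ figure follows from dividing characters per token by characters per word. Below is a minimal sketch of that arithmetic using only the numbers quoted above. The function name `words_per_token` is ours, and the calculation assumes spaces and punctuation are ignored, which the estimate appears to do implicitly.

```python
def words_per_token(chars_per_token: float, chars_per_word: float) -> float:
    # Fraction of an average word covered by one average token.
    return chars_per_token / chars_per_word


GPT2_CHARS_PER_TOKEN = 3.9        # figure quoted above
WORD_LENGTH_RANGE = (4.73, 4.79)  # average word length observed in our dataset

for word_len in WORD_LENGTH_RANGE:
    ratio = words_per_token(GPT2_CHARS_PER_TOKEN, word_len)
    print(f"word length {word_len}: {ratio:.2f} words per token, "
          f"100 tokens ~ {100 * ratio:.0f} words")
```

Both ends of the observed word-length range land slightly above 0.8 words per token, which is where the rounded "100 tokens ≈ 80 words" comes from.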

""", unsafe_allow_html=True) # Section 3: tokeniser-py Efficiency st.markdown("

tokeniser-py efficiency

", unsafe_allow_html=True) st.markdown("

Our tokenizer demonstrates different characteristics:

", unsafe_allow_html=True) # Efficiency points with enhanced styling st.markdown("""

• Average token size of ~2.52 characters** across all token types
• For alphanumeric tokens only: ~3.97 characters per token
• This translates to approximately ⁹⁄₁₀ of a word per token (100 tokens ≈ 90 words)
• Unlike other tokenizers, we treat each space (' ') as a separate token rather than merging it with adjacent characters, which increases our total token count

""", unsafe_allow_html=True) # Section 4: Real-world Comparison with completely redesigned styling st.markdown("""

Real-world Comparison

We tested a 28-page blog post across different tokenizers:

GPT-4o/GPT-4: ~10.4k tokens

GPT-3: ~12.1k tokens

tokeniser-py: ~18.8k tokens (including ~8.4k space tokens and ~2.6k other special-char based tokens)

tokeniser-py (alphanumeric only): ~7.8k tokens

GPT-4/GPT-4o (alphanumeric): ~8k tokens

Vocabulary size: 131k tokens (tokeniser-py) vs. 100k tokens (GPT-4, multimodal)
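The tokeniser-py figures above are internally consistent: removing the space and special-character tokens from the total leaves the alphanumeric count. A short sketch of that check, using the rounded values from the list (variable names are ours):

```python
# Approximate counts for the 28-page blog post, in thousands of tokens,
# copied from the comparison above.
total_tokens_k = 18.8
space_tokens_k = 8.4
special_char_tokens_k = 2.6

alphanumeric_tokens_k = total_tokens_k - space_tokens_k - special_char_tokens_k
space_share = space_tokens_k / total_tokens_k

print(f"alphanumeric tokens: ~{alphanumeric_tokens_k:.1f}k")
print(f"share of tokens that are spaces: {space_share:.0%}")
```

Roughly 45% of the tokeniser-py total consists of single-space tokens, which is why the alphanumeric-only count is the more meaningful one to compare against the GPT-4 figure.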

""", unsafe_allow_html=True) # Note box with enhanced styling st.markdown("""

Note:

• **2.52 characters is the adjusted-frequency-weighted average token size, i.e. we weight each token's size by its true occurrences, obtained by adjusting its observed occurrences for the occurrences of its super-tokens.
• A super-token of a token, say 'e', is any token that contains 'e' (like 'ear', 'ears', 'years', etc.). When weighting token lengths, we found that smaller tokens receive unduly high weight because their occurrences inside super-tokens are counted as well. To correct for this, we hierarchically subtract the occurrences that come from a token's super-tokens from its count to get a true frequency.
• Unadjusted frequency weighting gives an average size of ~2.2 characters per token, and a raw (unweighted) average gives ~4.6-4.7 characters per token.
• Our tokenization strategy separates non-underscore special characters from alphanumeric tokens.
• We define an alphanumeric token as any token that contains no special characters (except underscores).
• For OpenAI's tokens, we counted any token containing at least one alphanumeric character (excluding underscores) as an alphanumeric token.
• This difference in definition is due to the different special-character handling in the two tokenisers.
• The tokeniser's better word representation is not only due to differences in technique but also because GPT-4 has fewer available tokens (100k vs. our 131k) and needs to reserve tokens for multimodal content, further reducing English-specific tokens.
• Additionally, GPT-4's approach of combining special characters with alphanumeric content potentially reduces the availability of relevant alphanumeric tokens. Despite these constraints, GPT-4's tokeniser performs relatively well, though ours provides a valuable research preview of an alternative algorithm.
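The super-token adjustment described in the first two notes can be illustrated on a toy vocabulary. This is a sketch of one possible reading, not the actual tokeniser-py code: it assumes "observed occurrences" are substring counts, resolves tokens from longest to shortest, and uses made-up counts. All function names are hypothetical.

```python
def count_occurrences(needle: str, haystack: str) -> int:
    # Count occurrences of needle in haystack, including overlapping ones.
    count, start = 0, haystack.find(needle)
    while start != -1:
        count += 1
        start = haystack.find(needle, start + 1)
    return count


def true_frequencies(observed: dict[str, int]) -> dict[str, int]:
    # Longest tokens first, so every super-token is resolved before its sub-tokens.
    true: dict[str, int] = {}
    for token in sorted(observed, key=len, reverse=True):
        # Occurrences of this token that actually belong to longer tokens.
        inherited = sum(
            true[sup] * count_occurrences(token, sup)
            for sup in true
            if token in sup
        )
        true[token] = observed[token] - inherited
    return true


def weighted_avg_length(freqs: dict[str, int]) -> float:
    # Average token length, weighting each token by its frequency.
    return sum(len(tok) * n for tok, n in freqs.items()) / sum(freqs.values())


# Toy observed counts (invented for illustration): 'e' is also counted inside
# 'ear' and 'years', and 'ear' is also counted inside 'years'.
observed = {"years": 2, "ear": 5, "e": 10}
adjusted = true_frequencies(observed)

print("adjusted counts:", adjusted)
print("unadjusted weighted average:", weighted_avg_length(observed))
print("adjusted weighted average:", weighted_avg_length(adjusted))
print("raw average:", sum(len(t) for t in observed) / len(observed))
```

On this toy data the ordering matches the pattern in the notes: the unadjusted weighted average is the smallest, the adjusted one is larger, and the raw unweighted average is the largest.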

""", unsafe_allow_html=True) # Section 5: Design Philosophy with enhanced styling st.markdown("

Design Philosophy

", unsafe_allow_html=True) st.markdown("

Our approach prioritizes semantic representation over token count minimization:

", unsafe_allow_html=True) # Philosophy points with enhanced styling st.markdown("""

• We deliberately separate special characters from alphanumeric tokens
• This leaves more of the vocabulary available for alphanumeric tokens
• While this may increase total token count, it improves semantic representation
• Our design philosophy favors representation quality over token count minimization
• For example, a space (' ') is emitted as a separate token in our system rather than being concatenated with adjacent characters, as in standard methods like OpenAI's
• This approach results in better word representations despite potentially larger token counts
• While a combination-based tokenizer may reduce token count, our focus on representation offers semantic advantages
• Combining special-character tokens with alphanumeric ones adds less semantic value than using pure alphanumeric tokens
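To make the separation concrete, here is a minimal sketch of the splitting rule only: runs of letters, digits and underscores stay together, and every other character, including each space, becomes its own piece. It is not the tokeniser-py algorithm itself (which maps text onto a learned vocabulary); the names `split_pieces` and `is_alphanumeric_piece` are hypothetical, and the pattern assumes ASCII text.

```python
import re

# Letters, digits and underscores group together; any other single character
# (space, punctuation, symbol) is emitted as its own piece. ASCII-only assumption.
PIECE_PATTERN = re.compile(r"[A-Za-z0-9_]+|[^A-Za-z0-9_]")
ALNUM_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def split_pieces(text: str) -> list[str]:
    # Split text into alphanumeric runs and single special characters.
    return PIECE_PATTERN.findall(text)


def is_alphanumeric_piece(piece: str) -> bool:
    # Alphanumeric in the sense used here: no special characters except underscores.
    return ALNUM_PATTERN.fullmatch(piece) is not None


pieces = split_pieces("hello world, user_id=42")
print(pieces)
print(sum(is_alphanumeric_piece(p) for p in pieces), "of", len(pieces), "pieces are alphanumeric")
```

Here the eight pieces include two spaces and two punctuation characters, so only half of them are alphanumeric. That illustrates why the total token count grows while the alphanumeric pieces stay free of attached spaces or punctuation.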

""", unsafe_allow_html=True) # Footer link st.markdown("""

Need a programmatic interface for tokenizing text? Check out our tokeniser-py package for Python.

tokeniser-py 🔣
