Commit 94b2188 · Parent(s): de55334 · Added clarifications for analysis
app.py
CHANGED
@@ -586,6 +586,11 @@ st.markdown("""
             <div class="bullet-point-icon">•</div>
             <div>This translates to approximately <span style="background-color:rgba(0,186,124,0.4); padding:2px 4px; border-radius:3px;">⁹⁄₁₀ of a word</span> (100 tokens ≈ 90 words)</div>
         </div>
+
+        <div class="bullet-point">
+            <div class="bullet-point-icon">•</div>
+            <div>Unlike other tokenizers, we handle spaces (' ') as separate tokens rather than concatenating them with other characters, which affects our total token count</div>
+        </div>
 """, unsafe_allow_html=True)
 
 # Section 4: Real-world Comparison with completely redesigned styling
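To make the space-handling difference concrete, here is a minimal word-level sketch in Python (illustrative only: real tokenisers merge subwords, and these function names are hypothetical, not tokeniser-py's API):

    import re

    def split_spaces_separate(text):
        # Scheme described in this commit: runs of spaces become standalone
        # tokens, so word tokens stay free of whitespace.
        return [t for t in re.split(r"( +)", text) if t]

    def split_spaces_attached(text):
        # GPT-style scheme: a leading space is folded into the following word.
        return re.findall(r" ?[^ ]+", text)

    print(split_spaces_separate("hello brave world"))
    # ['hello', ' ', 'brave', ' ', 'world'] -- 5 tokens
    print(split_spaces_attached("hello brave world"))
    # ['hello', ' brave', ' world'] -- 3 tokens

The separate-space scheme yields more tokens on the same text, which is the count effect the added bullet refers to.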
@@ -629,6 +634,13 @@ st.markdown("""
             <span style="font-size:1.1em; margin-left:8px;">~8k tokens</span>
         </div>
     </div>
+    <div class="comparison-item">
+        <div class="comparison-icon">6</div>
+        <div class="comparison-text">
+            <span style="background-color:rgba(0,186,124,0.15); padding:2px 4px; border-radius:3px; font-weight:500;">Token corpus size:</span>
+            <span style="font-size:1.1em; margin-left:8px;">131k (tokeniser-py) vs. 100k (GPT-4 multimodal)</span>
+        </div>
+    </div>
 </div>
 """, unsafe_allow_html=True)
 
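Of the two corpus sizes quoted, the GPT-4 figure can be checked directly with OpenAI's tiktoken package; the 131k figure is tokeniser-py's own count, taken from this commit and not verified here:

    import tiktoken

    # cl100k_base is the encoding GPT-4 uses; n_vocab reports its size.
    enc = tiktoken.get_encoding("cl100k_base")
    print(enc.n_vocab)  # roughly 100k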
@@ -650,7 +662,11 @@ st.markdown("""
             <span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.15); color:#ffcc00;">•</span>
             <span>For OpenAI's tokens, we considered any token containing at least one alphanumeric character (excluding underscores) as an alphanumeric token.</span><br>
             <span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.15); color:#ffcc00;">•</span>
-            <span>This difference is due to the different special-character handling methodologies followed by the two tokenisers.</span>
+            <span>This difference is due to the different special-character handling methodologies followed by the two tokenisers.</span><br>
+            <span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.15); color:#ffcc00;">•</span>
+            <span>The tokeniser's better word-representation performance is not only due to technique differences but also because GPT-4 has fewer available tokens <span style="background-color:rgba(255,204,0,0.15); padding:2px 4px; border-radius:3px;">(100k vs. our 131k)</span> and needs to reserve tokens for multimodal content, further reducing English-specific tokens.</span><br>
+            <span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.15); color:#ffcc00;">•</span>
+            <span>Additionally, GPT-4's approach of combining special characters with alphanumerical content potentially reduces the availability of relevant alphanumerical tokens. Despite these constraints, GPT-4's tokeniser performs relatively well, though ours provides a valuable research preview of an alternate algorithm.</span></p>
         </div>
 """, unsafe_allow_html=True)
 
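One plausible reading of that alphanumeric counting rule as a predicate (a sketch; is_alphanumeric_token is an illustrative name, not code from this commit):

    import re

    def is_alphanumeric_token(token):
        # A token counts as alphanumeric if it contains at least one
        # alphanumeric character; underscores are excluded (\w minus '_').
        return bool(re.search(r"[^\W_]", token))

    assert is_alphanumeric_token("hello")
    assert is_alphanumeric_token(" the")     # leading space still qualifies
    assert not is_alphanumeric_token("___")  # underscores excluded
    assert not is_alphanumeric_token("?!")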
@@ -679,6 +695,26 @@ st.markdown("""
             <div class="bullet-point-icon">•</div>
             <div>Our design philosophy favors representation quality over token count minimization</div>
         </div>
+
+        <div class="bullet-point">
+            <div class="bullet-point-icon">•</div>
+            <div>For example, space (' ') is broken out as a separate token in our system, rather than being concatenated as in standard methods like OpenAI's</div>
+        </div>
+
+        <div class="bullet-point">
+            <div class="bullet-point-icon">•</div>
+            <div>This approach results in better word representations despite potentially larger token counts</div>
+        </div>
+
+        <div class="bullet-point">
+            <div class="bullet-point-icon">•</div>
+            <div>While choosing a combination-based tokenizer may reduce token count, our focus on representation offers semantic advantages</div>
+        </div>
+
+        <div class="bullet-point">
+            <div class="bullet-point-icon">•</div>
+            <div>Combining special tokens with alphanumeric ones adds less semantic value than using pure alphanumeric tokens</div>
+        </div>
 """, unsafe_allow_html=True)
 
 # Footer link
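The concatenation behaviour attributed to OpenAI in the bullets above can be observed with tiktoken; exact token boundaries depend on the input, but leading spaces typically travel with the following word:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoding
    ids = enc.encode("hello brave world")
    print([enc.decode([i]) for i in ids])
    # Typically ['hello', ' brave', ' world']: spaces fold into the word
    # tokens instead of standing alone.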