Dark-O-Ether committed
Commit 94b2188 · Parent: de55334

Added clarifications for analysis

Files changed (1): app.py (+37 -1)
app.py CHANGED
@@ -586,6 +586,11 @@ st.markdown("""
 <div class="bullet-point-icon">•</div>
 <div>This translates to approximately <span style="background-color:rgba(0,186,124,0.4); padding:2px 4px; border-radius:3px;">⁹⁄₁₀ of a word</span> (100 tokens ≈ 90 words)</div>
 </div>
+
+ <div class="bullet-point">
+ <div class="bullet-point-icon">•</div>
+ <div>Unlike other tokenizers, we handle spaces (' ') as separate tokens rather than concatenating them with other characters, which affects our total token count</div>
+ </div>
 """, unsafe_allow_html=True)
 
 # Section 4: Real-world Comparison with completely redesigned styling
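To make the space-handling bullet above concrete, here is a minimal sketch contrasting GPT-4's cl100k_base encoding, which folds a leading space into the following token, with the space-as-separate-token scheme this commit describes. It assumes Python with the tiktoken package installed; split_spaces is a hypothetical illustration of the described behaviour, not tokeniser-py's actual implementation.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoding
    text = "hello world"

    # GPT-4 style: the space is concatenated into the next token (' world').
    print([enc.decode([t]) for t in enc.encode(text)])  # ['hello', ' world']

    # Space-as-separate-token style, as the added bullet describes (sketch only).
    def split_spaces(s):
        tokens, word = [], ""
        for ch in s:
            if ch == " ":
                if word:
                    tokens.append(word)
                    word = ""
                tokens.append(" ")  # each space becomes its own token
            else:
                word += ch
        if word:
            tokens.append(word)
        return tokens

    print(split_spaces(text))  # ['hello', ' ', 'world'] -- one extra token

This is why the bullet notes an effect on total token count: every inter-word space adds a token that merged schemes absorb into their word tokens.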
@@ -629,6 +634,13 @@ st.markdown("""
 <span style="font-size:1.1em; margin-left:8px;">~8k tokens</span>
 </div>
 </div>
+ <div class="comparison-item">
+ <div class="comparison-icon">6</div>
+ <div class="comparison-text">
+ <span style="background-color:rgba(0,186,124,0.15); padding:2px 4px; border-radius:3px; font-weight:500;">Token corpus size:</span>
+ <span style="font-size:1.1em; margin-left:8px;">131k (tokeniser-py) vs. 100k (GPT-4 multimodal)</span>
+ </div>
+ </div>
 </div>
 """, unsafe_allow_html=True)
 
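The corpus sizes quoted in the new comparison row can be partially sanity-checked in code. This sketch assumes the tiktoken package and verifies only the ~100k GPT-4 figure; the 131k tokeniser-py figure is taken from the text above, not computed here.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoding
    print(enc.n_vocab)  # 100277 -- the "100k" token corpus size cited above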
 
@@ -650,7 +662,11 @@ st.markdown("""
 <span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.15); color:#ffcc00;">•</span>
 <span>For OpenAI's tokens, we considered any token containing at least one alphanumeric character (excluding underscores) as an alphanumeric token.</span><br>
 <span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.15); color:#ffcc00;">•</span>
- <span>This difference is due to the different special characters handling methodology followed in both tokeniser.</span></p>
+ <span>This difference is due to the different special-character handling methodologies followed by the two tokenisers.</span><br>
+ <span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.15); color:#ffcc00;">•</span>
+ <span>Our tokeniser's stronger word-representation performance is not only due to technique differences but also because GPT-4 has fewer available tokens <span style="background-color:rgba(255,204,0,0.15); padding:2px 4px; border-radius:3px;">(100k vs our 131k)</span> and must reserve tokens for multimodal content, further reducing its English-specific tokens.</span><br>
+ <span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.15); color:#ffcc00;">•</span>
+ <span>Additionally, GPT-4's approach of combining special characters with alphanumeric content potentially reduces the number of purely alphanumeric tokens available. Despite these constraints, GPT-4's tokeniser performs relatively well, though ours offers a research preview of an alternative algorithm.</span></p>
 </div>
 """, unsafe_allow_html=True)
 
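The alphanumeric-token rule stated in the first bullet is easy to pin down in code. Below is a minimal sketch of that stated criterion, assuming the tiktoken package; is_alphanumeric_token is a hypothetical helper written for illustration, not the analysis script the authors actually used.

    import re
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def is_alphanumeric_token(token_id):
        text = enc.decode([token_id])
        # [^\W_] matches word characters minus the underscore, so the token
        # qualifies only if it contains at least one letter or digit.
        return re.search(r"[^\W_]", text) is not None

    ids = enc.encode("def my_func(x): return x + 1")
    alnum = [enc.decode([t]) for t in ids if is_alphanumeric_token(t)]
    other = [enc.decode([t]) for t in ids if not is_alphanumeric_token(t)]
    print(alnum)  # tokens containing at least one letter or digit
    print(other)  # purely special-character / whitespace tokens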
 
@@ -679,6 +695,26 @@ st.markdown("""
 <div class="bullet-point-icon">•</div>
 <div>Our design philosophy favors representation quality over token count minimization</div>
 </div>
+
+ <div class="bullet-point">
+ <div class="bullet-point-icon">•</div>
+ <div>For example, a space (' ') becomes its own token in our system rather than being concatenated onto a neighbouring token, as in standard methods like OpenAI's</div>
+ </div>
+
+ <div class="bullet-point">
+ <div class="bullet-point-icon">•</div>
+ <div>This approach results in better word representations despite potentially larger token counts</div>
+ </div>
+
+ <div class="bullet-point">
+ <div class="bullet-point-icon">•</div>
+ <div>While a combination-based tokenizer may reduce token counts, our focus on representation offers semantic advantages</div>
+ </div>
+
+ <div class="bullet-point">
+ <div class="bullet-point-icon">•</div>
+ <div>Combining special tokens with alphanumeric ones adds less semantic value than using pure alphanumeric tokens</div>
+ </div>
 """, unsafe_allow_html=True)
 
 # Footer link
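The last bullet's claim about combined special-plus-alphanumeric tokens can be observed directly in GPT-4's vocabulary. A small probe, again assuming the tiktoken package, shows punctuation fused into letter-bearing tokens:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for t in enc.encode("don't stop"):
        print(repr(enc.decode([t])))
    # Expect a fused punctuation+letter token such as "'t" alongside
    # word tokens like 'don' and ' stop'.

Each such fused token occupies a vocabulary slot that a purely alphanumeric scheme could spend on another word piece, which is the trade-off these bullets describe.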
 