Commit 94b2188 · Parent(s): de55334 · Added clarifications for analysis
app.py
CHANGED
@@ -586,6 +586,11 @@ st.markdown("""
             <div class="bullet-point-icon">•</div>
             <div>This translates to approximately <span style="background-color:rgba(0,186,124,0.4); padding:2px 4px; border-radius:3px;">⁹⁄₁₀ of a word</span> (100 tokens ≈ 90 words)</div>
         </div>
+
+        <div class="bullet-point">
+            <div class="bullet-point-icon">•</div>
+            <div>Unlike other tokenizers, we handle spaces (' ') as separate tokens rather than concatenating them with other characters, which affects our total token count</div>
+        </div>
 """, unsafe_allow_html=True)
 
 # Section 4: Real-world Comparison with completely redesigned styling
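To make the space-handling difference concrete, here is a minimal word-level sketch in Python (illustrative only: real tokenisers merge subwords, and these function names are hypothetical, not tokeniser-py's API):

    import re

    def split_spaces_separate(text):
        # Scheme described in this commit: runs of spaces become standalone
        # tokens, so word tokens stay free of whitespace.
        return [t for t in re.split(r"( +)", text) if t]

    def split_spaces_attached(text):
        # GPT-style scheme: a leading space is folded into the following word.
        return re.findall(r" ?[^ ]+", text)

    print(split_spaces_separate("hello brave world"))
    # ['hello', ' ', 'brave', ' ', 'world'] -- 5 tokens
    print(split_spaces_attached("hello brave world"))
    # ['hello', ' brave', ' world'] -- 3 tokens

The separate-space scheme yields more tokens on the same text, which is the count effect the added bullet refers to.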
@@ -629,6 +634,13 @@ st.markdown("""
             <span style="font-size:1.1em; margin-left:8px;">~8k tokens</span>
         </div>
     </div>
+    <div class="comparison-item">
+        <div class="comparison-icon">6</div>
+        <div class="comparison-text">
+            <span style="background-color:rgba(0,186,124,0.15); padding:2px 4px; border-radius:3px; font-weight:500;">Token corpus size:</span>
+            <span style="font-size:1.1em; margin-left:8px;">131k (tokeniser-py) vs. 100k (GPT-4 multimodal)</span>
+        </div>
+    </div>
 </div>
 """, unsafe_allow_html=True)
 
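Of the two corpus sizes quoted, the GPT-4 figure can be checked directly with OpenAI's tiktoken package; the 131k figure is tokeniser-py's own count, taken from this commit and not verified here:

    import tiktoken

    # cl100k_base is the encoding GPT-4 uses; n_vocab reports its size.
    enc = tiktoken.get_encoding("cl100k_base")
    print(enc.n_vocab)  # roughly 100k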
@@ -650,7 +662,11 @@ st.markdown("""
             <span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.15); color:#ffcc00;">•</span>
             <span>For OpenAI's tokens, we considered any token containing at least one alphanumeric character (excluding underscores) as an alphanumeric token.</span><br>
             <span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.15); color:#ffcc00;">•</span>
-            <span>This difference is due to the different special-character handling methodologies followed by the two tokenisers.</span>
+            <span>This difference is due to the different special-character handling methodologies followed by the two tokenisers.</span><br>
+            <span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.15); color:#ffcc00;">•</span>
+            <span>The tokeniser's better word-representation performance is not only due to technique differences but also because GPT-4 has fewer available tokens <span style="background-color:rgba(255,204,0,0.15); padding:2px 4px; border-radius:3px;">(100k vs. our 131k)</span> and needs to reserve tokens for multimodal content, further reducing English-specific tokens.</span><br>
+            <span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.15); color:#ffcc00;">•</span>
+            <span>Additionally, GPT-4's approach of combining special characters with alphanumerical content potentially reduces the availability of relevant alphanumerical tokens. Despite these constraints, GPT-4's tokeniser performs relatively well, though ours provides a valuable research preview of an alternate algorithm.</span></p>
         </div>
 """, unsafe_allow_html=True)
 
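One plausible reading of that alphanumeric counting rule as a predicate (a sketch; is_alphanumeric_token is an illustrative name, not code from this commit):

    import re

    def is_alphanumeric_token(token):
        # A token counts as alphanumeric if it contains at least one
        # alphanumeric character; underscores are excluded (\w minus '_').
        return bool(re.search(r"[^\W_]", token))

    assert is_alphanumeric_token("hello")
    assert is_alphanumeric_token(" the")     # leading space still qualifies
    assert not is_alphanumeric_token("___")  # underscores excluded
    assert not is_alphanumeric_token("?!")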
@@ -679,6 +695,26 @@ st.markdown("""
             <div class="bullet-point-icon">•</div>
             <div>Our design philosophy favors representation quality over token count minimization</div>
         </div>
+
+        <div class="bullet-point">
+            <div class="bullet-point-icon">•</div>
+            <div>For example, space (' ') is broken out as a separate token in our system, rather than being concatenated as in standard methods like OpenAI's</div>
+        </div>
+
+        <div class="bullet-point">
+            <div class="bullet-point-icon">•</div>
+            <div>This approach results in better word representations despite potentially larger token counts</div>
+        </div>
+
+        <div class="bullet-point">
+            <div class="bullet-point-icon">•</div>
+            <div>While choosing a combination-based tokenizer may reduce token count, our focus on representation offers semantic advantages</div>
+        </div>
+
+        <div class="bullet-point">
+            <div class="bullet-point-icon">•</div>
+            <div>Combining special tokens with alphanumeric ones adds less semantic value than using pure alphanumeric tokens</div>
+        </div>
 """, unsafe_allow_html=True)
 
 # Footer link
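The concatenation behaviour attributed to OpenAI in the bullets above can be observed with tiktoken; exact token boundaries depend on the input, but leading spaces typically travel with the following word:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoding
    ids = enc.encode("hello brave world")
    print([enc.decode([i]) for i in ids])
    # Typically ['hello', ' brave', ' world']: spaces fold into the word
    # tokens instead of standing alone.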