Commit a97920a · Parent: 94b2188

Updated README.md and links in the page
README.md CHANGED

@@ -13,6 +13,8 @@ short_description: Demonstrating the custom tokeniser library (tokeniser-py)
 
 # tokeniser-py 🐣 - Interactive Tokenization Visualizer
 
+**Imp Links: [PyPI Main Library (tokeniser-py)](https://pypi.org/project/tokeniser-py/) | [PyPI Lite Library (tokeniser-py-lite)](https://pypi.org/project/tokeniser-py-lite/) | [Main Library GitHub (tokeniser-py)](https://github.com/Tasmay-Tibrewal/tokeniser-py) | [Lite Library GitHub (tokeniser-py-lite)](https://github.com/Tasmay-Tibrewal/tokeniser-py-lite) | [Complete repo (unchunked) - HF](https://huggingface.co/datasets/Tasmay-Tib/Tokeniser) | [Complete repo (chunked) - GitHub](https://github.com/Tasmay-Tibrewal/Tokeniser) | [Imp Files Github](https://github.com/Tasmay-Tibrewal/Tokeniser-imp)**
+
 This Hugging Face Space demonstrates **tokeniser-py**, a custom tokenizer built from scratch for language model preprocessing. Unlike traditional tokenizers like BPE (Byte Pair Encoding), tokeniser-py uses a unique algorithm developed independently and trained on over 1 billion tokens from the SlimPajama dataset.
 
 ## 🚀 Features of this Demo
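For readers unfamiliar with the concept the README paragraph describes: a tokenizer splits raw text into vocabulary units before a language model processes it. The sketch below is a generic greedy longest-match tokenizer in plain Python, shown only to illustrate the idea; it does not reproduce tokeniser-py's actual algorithm, and the toy `vocab` is invented for the example.

```python
# Toy greedy longest-match tokenizer: illustrative only, NOT tokeniser-py's algorithm.
def tokenize(text, vocab):
    """Split `text` into the longest matching vocabulary entries, left to right.
    Characters with no vocabulary match fall back to single-character tokens."""
    tokens = []
    i = 0
    max_len = max(len(t) for t in vocab)
    while i < len(text):
        # Try the longest possible slice first, shrinking until a match is found.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # no match: emit a single character
            i += 1
    return tokens

# Invented toy vocabulary for demonstration.
vocab = {"token", "iser", "-", "py"}
print(tokenize("tokeniser-py", vocab))  # → ['token', 'iser', '-', 'py']
```

A real subword tokenizer works over a learned vocabulary of tens of thousands of entries (tokeniser-py's, per the README, was trained on over 1 billion tokens of SlimPajama text), but the segmentation idea is the same: map a string to a sequence of known units.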
app.py CHANGED

@@ -288,15 +288,19 @@ st.markdown("""
 <div class="header-container">
 <div>
 <h1>tokeniser-py 🐣</h1>
-<a href = "https://github.com/Tasmay-Tibrewal/tokeniser-py" class="link-top-a" style="display: inline;"><span style="background-color:rgba(100,146,154,0.17); padding:2px 4px; border-radius:3px;">Library GitHub</span></a>
+<a href = "https://github.com/Tasmay-Tibrewal/tokeniser-py" class="link-top-a" style="display: inline;"><span style="background-color:rgba(100,146,154,0.17); padding:2px 4px; border-radius:3px;">Library GitHub (tokeniser-py)</span></a>
 <p class="link-top" style="display: inline;"> | </p>
-<a href = "https://
+<a href = "https://github.com/Tasmay-Tibrewal/tokeniser-py-lite" class="link-top-a" style="display: inline;"><span style="background-color:rgba(100,146,154,0.17); padding:2px 4px; border-radius:3px;">Library GitHub (tokeniser-py-lite)</span></a>
+<p class="link-top" style="display: inline;"> | </p>
+<a href = "https://huggingface.co/datasets/Tasmay-Tib/Tokeniser" class="link-top-a"style="display: inline;"><span style="background-color:rgba(100,146,154,0.17); padding:2px 4px; border-radius:3px;">HF Dataset (unchunked)</span></a>
 <p class="link-top" style="display: inline;"> | </p>
 <a href = "https://github.com/Tasmay-Tibrewal/Tokeniser" class="link-top-a"style="display: inline;"><span style="background-color:rgba(100,146,154,0.17); padding:2px 4px; border-radius:3px;">GitHub Dataset (chunked)</span></a>
 <p class="link-top" style="display: inline;"> | </p>
 <a href = "https://github.com/Tasmay-Tibrewal/Tokeniser-imp" class="link-top-a"style="display: inline;"><span style="background-color:rgba(100,146,154,0.17); padding:2px 4px; border-radius:3px;">GitHub Imp Files</span></a>
 <p class="link-top" style="display: inline;"> | </p>
-<a href = "https://pypi.org/project/tokeniser-py/" class="link-top-a"style="display: inline;"><span style="background-color:rgba(100,146,154,0.17); padding:2px 4px; border-radius:3px;">PyPI Package</span></a>
+<a href = "https://pypi.org/project/tokeniser-py/" class="link-top-a"style="display: inline;"><span style="background-color:rgba(100,146,154,0.17); padding:2px 4px; border-radius:3px;">PyPI Package (Main Lib)</span></a>
+<p class="link-top" style="display: inline;"> | </p>
+<a href = "https://pypi.org/project/tokeniser-py-lite/" class="link-top-a"style="display: inline;"><span style="background-color:rgba(100,146,154,0.17); padding:2px 4px; border-radius:3px;">PyPI Package (Lite Lib)</span></a>
 <p></p>
 <p style="font-size: 20px;"><strong>Learn about language model tokenization</strong></p>
 <p style="font-size: 17px; margin-bottom: 5px;">