Dark-O-Ether committed on
Commit
d24df75
·
1 Parent(s): 8764b41

Fixed some requirements

Files changed (3)
  1. README.md +76 -1
  2. app.py +0 -5
  3. requirements.txt +0 -1
README.md CHANGED
@@ -11,4 +11,79 @@ license: mit
  short_description: Demonstrating the custom tokeniser library (tokeniser-py)
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # tokeniser-py 🔣 - Interactive Tokenization Visualizer
+
+ This Hugging Face Space demonstrates **tokeniser-py**, a custom tokenizer built from scratch for language model preprocessing. Unlike traditional tokenizers such as BPE (Byte Pair Encoding), tokeniser-py uses a unique algorithm developed independently and trained on over 1 billion tokens from the SlimPajama dataset.
+
+ ## 🚀 Features of this Demo
+
+ - **Interactive Tokenization**: Enter any text and see how it is broken down into tokens
+ - **Visual Token Representation**: Each token is displayed in a different color, with token IDs shown on hover
+ - **Multiple Model Options**: Choose between different tokenizer configurations (1b/0.5b models, ordered/unordered tokens)
+ - **Real-time Statistics**: See token count, character count, and the characters-per-token ratio
+ - **Educational Content**: Learn about tokenization efficiency and how tokeniser-py compares to other tokenizers
+
+ ## 📊 About tokeniser-py
+
+ tokeniser-py offers:
+
+ - A vocabulary of **131,072 tokens**
+ - Two vocabulary versions:
+   - `0.5B`: Trained on validation-only data
+   - `1B`: Trained on validation + test data (default)
+ - Tokens can be ordered or unordered (by frequency)
+ - Efficient token segmentation for out-of-vocabulary words using dynamic programming (a generic sketch of the idea follows this list)
+ - One-hot encoding support (NumPy or PyTorch)
+ - Customizable tokenization parameters
+
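To make the dynamic-programming segmentation idea concrete, here is a generic, minimal sketch: split an out-of-vocabulary word into the fewest pieces that all appear in the vocabulary. This illustrates the general technique only; it is not tokeniser-py's internal code, and the toy vocabulary and fewest-pieces objective are assumptions.

```python
# Generic DP segmentation sketch (illustrative only; not tokeniser-py's implementation).
# best[i] holds the cheapest segmentation of word[:i], or None if none exists.
def segment(word, vocab):
    n = len(word)
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in vocab and best[start] is not None:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]

print(segment("tokeniser", {"token", "iser", "tok", "en", "is", "er"}))
# -> ['token', 'iser']
```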
+ ## 🔍 Tokenization Efficiency
+
+ When comparing tokeniser-py to standard tokenizers such as those used by GPT-2/GPT-4 (a quick way to measure this on your own text is sketched after the list):
+
+ - Typical tokenizers: ~3.9 characters per token (~80 words per 100 tokens)
+ - tokeniser-py: ~2.52 characters per token overall, ~3.97 for alphanumeric tokens (~90 words per 100 tokens)
+ - tokeniser-py separates special characters from alphanumeric tokens, prioritizing semantic representation
+
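As a quick check of the characters-per-token figure on your own input, here is a minimal sketch using only the calls shown in the Usage section below (the exact ratio depends on the text you pass in):

```python
from tokeniser import Tokeniser

t = Tokeniser()  # default 1b unordered model

text = "Your input text here."
tokens, count = t.tokenise(text)  # token list and token count
print(f"{count} tokens for {len(text)} characters")
print(f"{len(text) / count:.2f} characters per token")
```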
+ ## 💻 Usage in Python
+
+ ```python
+ from tokeniser import Tokeniser
+
+ # Initialize tokenizer (defaults to 1b unordered model)
+ t = Tokeniser()
+
+ # For other models:
+ # t = Tokeniser(ln="0.5b", token_ordered=True)
+
+ # Tokenize text
+ tokens, count = t.tokenise("Your input text here.")
+
+ # Get token IDs
+ token_ids = t.token_ids(tokens)
+
+ # Convert to one-hot encoding (NumPy)
+ one_hot = t.one_hot_tokens(token_ids)
+
+ # For PyTorch:
+ # one_hot_torch = t.one_hot_tokens(token_ids, op='torch')
+ ```
+
+ ## 🔗 Resources
+
+ - [GitHub Repository](https://github.com/Tasmay-Tibrewal/tokeniser-py)
+ - [PyPI Package](https://pypi.org/project/tokeniser-py/)
+ - [Hugging Face Dataset](https://huggingface.co/datasets/Tasmay-Tib/Tokeniser)
+ - [GitHub Dataset (chunked)](https://github.com/Tasmay-Tibrewal/Tokeniser)
+ - [GitHub Implementation Files](https://github.com/Tasmay-Tibrewal/Tokeniser-imp)
+
+ ## 🧠 Design Philosophy
+
+ tokeniser-py prioritizes semantic representation over token-count minimization. By separating special characters from alphanumeric tokens, it keeps more of the vocabulary available for alphanumeric tokens, which improves semantic representation at the cost of slightly higher token counts.
+
+ ## 🔧 Installation
+
+ ```bash
+ pip install tokeniser-py
+ ```
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
app.py CHANGED
@@ -1,10 +1,5 @@
  import streamlit as st
- import streamlit.components.v1 as components
- import pandas as pd
- import random
  import json
- from streamlit_javascript import st_javascript
- import time
 
  # Set page configuration
  st.set_page_config(
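For orientation, here is a minimal sketch of how a Streamlit app like this Space might wire tokeniser-py to a text box. The widget choices, labels, and variable names are assumptions for illustration, not the Space's actual app.py:

```python
import streamlit as st
from tokeniser import Tokeniser

# Page configuration must be the first Streamlit call
st.set_page_config(page_title="tokeniser-py demo", layout="wide")

t = Tokeniser()  # default 1b unordered model

text = st.text_area("Enter text to tokenise", "Hello from tokeniser-py!")
if text:
    tokens, count = t.tokenise(text)  # token list and token count
    token_ids = t.token_ids(tokens)   # IDs for each token
    st.metric("Tokens", count)
    st.metric("Characters per token", round(len(text) / count, 2))
    st.write(tokens)
    st.write(token_ids)
```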
requirements.txt CHANGED
@@ -1,3 +1,2 @@
  streamlit>=1.27.0
- pandas>=1.5.0
  tokeniser-py