Update README.md
README.md CHANGED

@@ -23,3 +23,15 @@ This is a simple word-level tokenizer created using the [Tokenizers](https://git
- Normalization: [NFC](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms) (Normalization Form Canonical Composition), Strip, Lowercase
- Pre-tokenization: Whitespace
- Code: [wikitext-wordlevel.py](wikitext-wordlevel.py)
+
+The tokenizer can be used as simply as follows.
+
+```python
+tokenizer = Tokenizer.from_pretrained('dustalov/wikitext-wordlevel')
+
+tokenizer.encode("I'll see you soon").ids     # => [68, 14, 2746, 577, 184, 595]
+
+tokenizer.encode("I'll see you soon").tokens  # => ['i', "'", 'll', 'see', 'you', 'soon']
+
+tokenizer.decode([68, 14, 2746, 577, 184, 595])  # => "i ' ll see you soon"
+```
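
For context, the following is a minimal sketch of how a WordLevel tokenizer with the normalization (NFC, Strip, Lowercase) and Whitespace pre-tokenization listed above could be assembled and trained with the Tokenizers library. It is an illustrative assumption, not the contents of [wikitext-wordlevel.py](wikitext-wordlevel.py); the corpus file name and the `[UNK]` special token are hypothetical.

```python
# Sketch only: the actual training code lives in wikitext-wordlevel.py;
# the corpus file name and the [UNK] token below are assumptions.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import NFC, Lowercase, Sequence, Strip
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# Word-level model with an unknown-token fallback
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))

# Normalization: NFC, Strip, Lowercase
tokenizer.normalizer = Sequence([NFC(), Strip(), Lowercase()])

# Pre-tokenization: Whitespace
tokenizer.pre_tokenizer = Whitespace()

# Train on a plain-text corpus (hypothetical file name) and save
trainer = WordLevelTrainer(special_tokens=["[UNK]"])
tokenizer.train(["wikitext.txt"], trainer)
tokenizer.save("tokenizer.json")
```

Once saved, `Tokenizer.from_file("tokenizer.json")` (or `Tokenizer.from_pretrained` after uploading to the Hub) loads the tokenizer for the usage shown in the diff above.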