Update README.md
README.md CHANGED

@@ -23,3 +23,15 @@ This is a simple word-level tokenizer created using the [Tokenizers](https://git
- Normalization: [NFC](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms) (Normalization Form Canonical Composition), Strip, Lowercase
- Pre-tokenization: Whitespace
- Code: [wikitext-wordlevel.py](wikitext-wordlevel.py)
+
+The tokenizer can be used as simply as follows.
+
+```python
+tokenizer = Tokenizer.from_pretrained('dustalov/wikitext-wordlevel')
+
+tokenizer.encode("I'll see you soon").ids     # => [68, 14, 2746, 577, 184, 595]
+
+tokenizer.encode("I'll see you soon").tokens  # => ['i', "'", 'll', 'see', 'you', 'soon']
+
+tokenizer.decode([68, 14, 2746, 577, 184, 595])  # => "i ' ll see you soon"
+```
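
For context, the following is a minimal sketch of how a WordLevel tokenizer with the normalization (NFC, Strip, Lowercase) and Whitespace pre-tokenization listed above could be assembled and trained with the Tokenizers library. It is an illustrative assumption, not the contents of [wikitext-wordlevel.py](wikitext-wordlevel.py); the corpus file name and the `[UNK]` special token are hypothetical.

```python
# Sketch only: the actual training code lives in wikitext-wordlevel.py;
# the corpus file name and the [UNK] token below are assumptions.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import NFC, Lowercase, Sequence, Strip
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# Word-level model with an unknown-token fallback
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))

# Normalization: NFC, Strip, Lowercase
tokenizer.normalizer = Sequence([NFC(), Strip(), Lowercase()])

# Pre-tokenization: Whitespace
tokenizer.pre_tokenizer = Whitespace()

# Train on a plain-text corpus (hypothetical file name) and save
trainer = WordLevelTrainer(special_tokens=["[UNK]"])
tokenizer.train(["wikitext.txt"], trainer)
tokenizer.save("tokenizer.json")
```

Once saved, `Tokenizer.from_file("tokenizer.json")` (or `Tokenizer.from_pretrained` after uploading to the Hub) loads the tokenizer for the usage shown in the diff above.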