fix tokenizer by removing pretokenizer
#17
by stephantulkens - opened
This PR removes the redundant pretokenizer from the tokenizer JSON file. The pretokenizer splits on whitespace, but because of the preceding normalization step, all whitespace characters have already been replaced by the Metaspace token, so the split never matches anything. This can lead some toolkits to believe splitting is happening when it actually is not.
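For reference, this is roughly what the change amounts to. A minimal sketch, assuming the pretokenizer lives under the top-level "pre_tokenizer" key of tokenizer.json; the output file name and sample sentence below are illustrative only:
import json

from tokenizers import Tokenizer

# Drop the (redundant) pretokenizer entry from the tokenizer config.
# Assumes the top-level "pre_tokenizer" key of tokenizer.json.
with open("tokenizer.json", encoding="utf-8") as f:
    config = json.load(f)
config["pre_tokenizer"] = None

# "tokenizer_no_pretok.json" is a hypothetical file name for the patched config.
with open("tokenizer_no_pretok.json", "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False)

# Because the normalizer has already replaced whitespace with the Metaspace
# token, encodings should be identical with and without the pretokenizer.
old = Tokenizer.from_file("tokenizer.json")
new = Tokenizer.from_file("tokenizer_no_pretok.json")

sample = "The quick brown fox jumps over the lazy dog."
assert old.encode(sample).tokens == new.encode(sample).tokens
assert old.encode(sample).ids == new.encode(sample).ids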
I've run the following test:
from typing import cast, Iterator

from datasets import load_dataset, Dataset
from tokenizers import Tokenizer
from tqdm import tqdm


def batch_iterator(dataset: Dataset, batch_size: int = 1000, total: int = 10_000) -> Iterator[str]:
    # Yield up to `total` lines of text from the (streaming) dataset.
    i = 0
    for batch in dataset.iter(batch_size):
        for line in batch["text"]:  # type: ignore[no-any-return]
            yield line
            i += 1
        if i >= total:
            break


if __name__ == "__main__":
    # Compare the patched local tokenizer against the one on the hub.
    tok = Tokenizer.from_file("tokenizer.json")
    tok2 = Tokenizer.from_pretrained("google/embeddinggemma-300m")
    subsets = ("20231101.zh", "20231101.en", "20231101.ja", "20231101.dv")
    for subset in subsets:
        dataset = cast(Dataset, load_dataset(
            "wikimedia/wikipedia", subset, split="train", streaming=True
        ))
        print("Processing subset:", subset)
        for line in tqdm(batch_iterator(dataset)):
            x = tok.encode(line)
            y = tok2.encode(line)
            assert x.tokens == y.tokens
            assert x.ids == y.ids
These tests confirmed that, across a variety of languages, the old and new tokenizers produce identical tokens and ids, as expected.
stephantulkens changed pull request status to open