mjbommar
/

binary-tokenizer-005

mjbommar commited on Sep 15

Commit

3f7388f

verified ·

1 Parent(s): 90028a3

Clarify special tokens purpose for file boundaries

Files changed (1) hide show

README.md CHANGED Viewed

@@ -91,7 +91,8 @@ print(f"Tokens: {encoded.ids}")
 # Output: [0, 45689, 205, 22648, 1]
 # Where: 0='<|start|>', 45689='\x7fEL', 205='F', 22648='\x01\x01\x01\x00', 1='<|end|>'
-# The tokenizer adds special tokens <|start|> (id=0) and <|end|> (id=1)
 # Content tokens are: [45689, 205, 22648]
 # Note: Decoding adds spaces between tokens (BPE tokenizer behavior)

 # Output: [0, 45689, 205, 22648, 1]
 # Where: 0='<|start|>', 45689='\x7fEL', 205='F', 22648='\x01\x01\x01\x00', 1='<|end|>'
+# Special tokens: <|start|> (id=0) and <|end|> (id=1) mark file boundaries
+# This helps models identify file headers and distinguish between files
 # Content tokens are: [45689, 205, 22648]
 # Note: Decoding adds spaces between tokens (BPE tokenizer behavior)