mjbommar committed on
Commit 3f7388f · verified · 1 Parent(s): 90028a3

Clarify special tokens purpose for file boundaries

Files changed (1)
  1. README.md +2 -1
README.md CHANGED
@@ -91,7 +91,8 @@ print(f"Tokens: {encoded.ids}")
 # Output: [0, 45689, 205, 22648, 1]
 # Where: 0='<|start|>', 45689='\x7fEL', 205='F', 22648='\x01\x01\x01\x00', 1='<|end|>'
 
-# The tokenizer adds special tokens <|start|> (id=0) and <|end|> (id=1)
+# Special tokens: <|start|> (id=0) and <|end|> (id=1) mark file boundaries
+# This helps models identify file headers and distinguish between files
 # Content tokens are: [45689, 205, 22648]
 
 # Note: Decoding adds spaces between tokens (BPE tokenizer behavior)
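
The new comments say the special tokens mark file boundaries. A minimal sketch of what that could look like downstream, using only the IDs quoted in the README comments (0 for `<|start|>`, 1 for `<|end|>`): the helper names `strip_boundary_tokens` and `split_files` are hypothetical and not part of the tokenizer, and the sketch assumes IDs 0 and 1 never occur as content tokens.

```python
# Token IDs come from the README comments; the helper names are hypothetical.
START_ID = 0  # '<|start|>'
END_ID = 1    # '<|end|>'


def strip_boundary_tokens(ids):
    """Drop a leading <|start|> and a trailing <|end|> if present."""
    begin = 1 if ids and ids[0] == START_ID else 0
    end = len(ids) - 1 if len(ids) > begin and ids[-1] == END_ID else len(ids)
    return ids[begin:end]


def split_files(stream):
    """Split a concatenated ID stream into one content-token list per file.

    Assumes START_ID and END_ID never appear as content tokens.
    """
    files, current = [], None
    for tok in stream:
        if tok == START_ID:
            current = []              # a new file begins
        elif tok == END_ID:
            if current is not None:
                files.append(current)  # close the current file
            current = None
        elif current is not None:
            current.append(tok)       # content token inside a file
    return files


encoded_ids = [0, 45689, 205, 22648, 1]
print(strip_boundary_tokens(encoded_ids))       # [45689, 205, 22648]
print(split_files(encoded_ids + encoded_ids))   # two identical content lists
```

Tokens that fall outside a `<|start|>` ... `<|end|>` pair are ignored by `split_files`; whether that is the right policy depends on how the training data is assembled.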