Clarify special tokens purpose for file boundaries
README.md
CHANGED
@@ -91,7 +91,8 @@ print(f"Tokens: {encoded.ids}")
 # Output: [0, 45689, 205, 22648, 1]
 # Where: 0='<|start|>', 45689='\x7fEL', 205='F', 22648='\x01\x01\x01\x00', 1='<|end|>'
 
-#
+# Special tokens: <|start|> (id=0) and <|end|> (id=1) mark file boundaries
+# This helps models identify file headers and distinguish between files
 # Content tokens are: [45689, 205, 22648]
 
 # Note: Decoding adds spaces between tokens (BPE tokenizer behavior)
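
The boundary tokens in this change can be used to recover per-file content tokens from a flat id stream. The following is a minimal sketch, not part of the README: it assumes only the ids shown in the diff (`0` for `<|start|>`, `1` for `<|end|>`), and the helper name `split_files` is hypothetical. It works on plain lists of ids, so it does not need the tokenizer itself.

```python
# Special token ids as shown in the README example; treat them as
# assumptions if your tokenizer uses a different vocabulary.
START_ID = 0  # '<|start|>'
END_ID = 1    # '<|end|>'


def split_files(ids):
    """Split a flat token stream into per-file content token lists.

    Tokens outside a <|start|> ... <|end|> pair are ignored, and an
    unterminated file at the end of the stream is dropped.
    """
    files = []
    current = None  # None means we are not currently inside a file
    for tok in ids:
        if tok == START_ID:
            current = []  # begin a new file
        elif tok == END_ID:
            if current is not None:
                files.append(current)  # close the current file
            current = None
        elif current is not None:
            current.append(tok)  # ordinary content token
    return files


# The encoded ids from the README example.
encoded_ids = [0, 45689, 205, 22648, 1]
print(split_files(encoded_ids))  # [[45689, 205, 22648]]
```

Applied to a stream holding several encoded files back to back, the same helper returns one list per file, which is the "distinguish between files" behavior the new comment describes.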