Books from the Survivor Library (mostly ~1920s & earlier) OCR'd with recent VLMs
BEEspoke Data
community
AI & ML interests
'an LLM is only as good as the dataset it was trained on' - Sun Tzu
Recent Activity
smol_llama 220M fine-tunes we did
- BEE-spoke-data/smol_llama-220M-openhermes
  Text Generation • 0.2B • Updated • 590 • 5
- BEE-spoke-data/smol_llama-220M-open_instruct
  Text Generation • 0.2B • Updated • 10 • 2
- BEE-spoke-data/beecoder-220M-python
  Text Generation • 0.2B • Updated • 6 • 3
- BEE-spoke-data/zephyr-220m-sft-full
  Text Generation • 0.2B • Updated • 703 • 1
models fine-tuned to be knowledgeable about apiary practice
- BEE-spoke-data/TinyLlama-3T-1.1bee
  Text Generation • 1B • Updated • 8 • 2
- BEE-spoke-data/TinyLlama-1.1bee
  Text Generation • 1B • Updated • 2 • 1
- BEE-spoke-data/Meta-Llama-3-8Bee
  Text Generation • 8B • Updated • 4
- BEE-spoke-data/phi-1bee5
  Text Generation • 1B • Updated • 2 • 1
trained and adapted tokenizers (various)
🚧 "raw" pretrained smol_llama checkpoints - WIP 🚧
- BEE-spoke-data/smol_llama-101M-GQA
  Text Generation • 0.1B • Updated • 1.93k • 30
- BEE-spoke-data/smol_llama-81M-tied
  Text Generation • 81.3M • Updated • 632 • 9
- BEE-spoke-data/smol_llama-220M-GQA
  Text Generation • 0.2B • Updated • 1.83k • 13
- BEE-spoke-data/verysmol_llama-v11-KIx2
  Text Generation • 58.1M • Updated • 580 • 4
Pretrained encoder (fill-mask) models we made
text classification models for book genres
- BEE-spoke-data/albert-xxlarge-v2-description2genre
  Text Classification • 0.2B • Updated • 4 • 2
- BEE-spoke-data/mobilebert-uncased-title2genre
  Text Classification • 24.6M • Updated • 4 • 1
- BEE-spoke-data/roberta-large-title2genre
  Text Classification • 0.4B • Updated • 1 • 1
- BEE-spoke-data/roberta-base-description2genre
  Text Classification • 0.1B • Updated • 4
concept datasets extracted from fineweb