Description:
This is a Byte Pair Encoding (BPE) tokenizer trained specifically for Turkish text. The tokenizer was trained on a curated subset (~30 MB from each dataset) of multiple Turkish datasets, covering news, academic texts, legal Q&A, medical articles, books, and user reviews. The goal is to provide a high-quality subword tokenizer suitable for training or fine-tuning Turkish language models.
Vocab_size: 32768
Training datasets (~30 MB from each):
- omarkamali/wikipedia-monthly
- alibayram/hukuk_soru_cevap
- umutertugrul/turkish-hospital-medical-articles
- umutertugrul/turkish-medical-articles
- alibayram/tr-books
- selimfirat/bilkent-turkish-writings-dataset
- umutertugrul/turkish-academic-theses-dataset
- alibayram/onedio_haberler
- habanoz/news-tr-1.8M
- alibayram/hepsiburada_yorumlar
- alibayram/kitapyurdu_yorumlar
- alibayram/beyazperde_yorumlar
Total: ~360 MB
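
For reference, a tokenizer with the same configuration can be reproduced with the Hugging Face `tokenizers` library. This is a minimal sketch rather than the exact training script used for this model: the byte-level pre-tokenizer, the special tokens, and the corpus file names are assumptions; only the 32768 vocabulary size comes from this card.
```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

# Hypothetical local text files extracted from the datasets listed above
files = ["turkish_corpus_part1.txt", "turkish_corpus_part2.txt"]

# Byte-level BPE tokenizer (pre-tokenizer and special tokens are assumptions)
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32768,  # matches the vocab size reported for this tokenizer
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
)

tokenizer.train(files, trainer)
tokenizer.save("turkish-bpe-32k.json")
```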
Usage:
```python
from transformers import AutoTokenizer
fast_tokenizer = AutoTokenizer.from_pretrained("AhmetSemih/merged_dataset-32k-bpe-tokenizer", use_fast=True)
fast_tokenizer.encode("Bugün hava çok güzel.")
```
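For a quick sanity check, the returned ids can be mapped back to subword pieces and decoded; the sentence is the one from the snippet above, and the exact output depends on the learned merges.
```python
ids = fast_tokenizer.encode("Bugün hava çok güzel.")
print(ids)                                        # integer token ids
print(fast_tokenizer.convert_ids_to_tokens(ids))  # the subword pieces behind those ids
print(fast_tokenizer.decode(ids))                 # round-trips back to the input text
```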
Intended use:
- Training and fine-tuning Turkish language models
- Tokenization of Turkish text for NLP tasks (classification, summarization, question answering)
- Research and educational purposes