Description:

This is a Byte Pair Encoding (BPE) tokenizer trained specifically for Turkish text. The tokenizer was trained on a curated subset (~30 MB from each dataset) of multiple Turkish datasets, covering news, academic texts, legal Q&A, medical articles, books, and user reviews. The goal is to provide a high-quality subword tokenizer suitable for training or fine-tuning Turkish language models.

Vocab size: 32768

Training datasets (~30 MB from each):

  • omarkamali/wikipedia-monthly
  • alibayram/hukuk_soru_cevap
  • umutertugrul/turkish-hospital-medical-articles
  • umutertugrul/turkish-medical-articles
  • alibayram/tr-books
  • selimfirat/bilkent-turkish-writings-dataset
  • umutertugrul/turkish-academic-theses-dataset
  • alibayram/onedio_haberler
  • habanoz/news-tr-1.8M
  • alibayram/hepsiburada_yorumlar
  • alibayram/kitapyurdu_yorumlar
  • alibayram/beyazperde_yorumlar

Total: ~360 MB
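
For reference, the snippet below is a minimal sketch of how a BPE tokenizer of this size could be trained with the Hugging Face `tokenizers` library. The corpus path, pre-tokenization settings, and special tokens are illustrative assumptions, not the exact recipe used for this tokenizer.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Start from an empty byte-level BPE model.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32768,  # matches this tokenizer's vocabulary size
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],  # assumed special tokens
)

# "merged_turkish_corpus.txt" is a hypothetical export of the ~360 MB merged corpus.
tokenizer.train(["merged_turkish_corpus.txt"], trainer)
tokenizer.save("tokenizer.json")
```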

Usage:

```python
from transformers import AutoTokenizer

# Load the fast (Rust-backed) tokenizer from the Hugging Face Hub.
fast_tokenizer = AutoTokenizer.from_pretrained(
    "AhmetSemih/merged_dataset-32k-bpe-tokenizer", use_fast=True
)

# Encode a Turkish sentence into token IDs.
fast_tokenizer.encode("Bugün hava çok güzel.")
```
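
To inspect how a sentence is split, the standard `transformers` tokenizer methods allow a quick round trip (the exact output depends on the learned merges):

```python
ids = fast_tokenizer.encode("Bugün hava çok güzel.")
print(ids)                                        # token IDs
print(fast_tokenizer.convert_ids_to_tokens(ids))  # subword pieces
print(fast_tokenizer.decode(ids))                 # reconstructed text
```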

Intended use:

  • Training and fine-tuning Turkish language models (see the sketch after this list)
  • Tokenization of Turkish text for NLP tasks (classification, summarization, question answering)
  • Research and educational purposes
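
As an example of the first use case, the tokenizer can be paired with a freshly initialized language model whose embedding layer is sized to its vocabulary. The GPT-2 architecture below is only an illustrative choice, not a recommendation tied to this tokenizer.

```python
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained(
    "AhmetSemih/merged_dataset-32k-bpe-tokenizer", use_fast=True
)

# Size the model's embedding matrix to the tokenizer's vocabulary (32768 tokens).
config = GPT2Config(vocab_size=len(tokenizer))
model = GPT2LMHeadModel(config)
```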
