# SubRoBERTa: Reddit Subreddit Classification Model
This model is a fine-tuned RoBERTa-base model that classifies text into one of 10 subreddits. It was trained on posts from those subreddits to predict which one a given post belongs to.
## Model Description
- Model type: RoBERTa-base fine-tuned for sequence classification
- Language: English
- License: MIT
- Finetuned from model: FacebookAI/roberta-base
 
## Intended Uses & Limitations
This model is intended for classifying text into one of the following 10 subreddits (the exact label strings can be inspected as shown after the list):

- r/aitah
- r/buildapc
- r/dating_advice
- r/legaladvice
- r/minecraft
- r/nostupidquestions
- r/pcmasterrace
- r/relationship_advice
- r/techsupport
- r/teenagers
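
The label set above should correspond to the model's `id2label` mapping. A quick way to check the exact label strings (assuming the checkpoint's config stores them) is to load just the config:

```python
from transformers import AutoConfig

# Load only the configuration; no model weights are downloaded
config = AutoConfig.from_pretrained("marcoallanda/SubRoBERTa")

# id2label maps class indices to subreddit names
for idx, label in sorted(config.id2label.items()):
    print(idx, label)
```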
 
 
### Limitations
- The model was trained on English text only
- Performance may vary on text that differs significantly from the training data
- The model may not perform well on text that doesn't clearly belong to any of the target subreddits; such cases can be filtered by thresholding the predicted probability, as sketched below
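
For the last point, one common mitigation is to treat low-confidence predictions as "unknown". A minimal sketch (the 0.5 threshold is an illustrative assumption and should be tuned on your own data):

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "marcoallanda/SubRoBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def classify_with_threshold(text: str, threshold: float = 0.5):
    """Return the predicted subreddit, or None if the model is not confident."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = F.softmax(model(**inputs).logits, dim=-1)
    confidence, pred_id = probs.max(dim=-1)
    if confidence.item() < threshold:
        return None  # likely out-of-scope text
    return model.config.id2label[pred_id.item()]
```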
 
## Usage
Here's how to use the model:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load model and tokenizer
model_name = "marcoallanda/SubRoBERTa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "My computer won't turn on, what should I do?"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = F.softmax(logits, dim=-1)
    pred_id = torch.argmax(probs, dim=-1).item()
    pred_label = model.config.id2label[pred_id]

print(f"Predicted subreddit: {pred_label}")
```
## Training and Evaluation Data
The model was trained on a dataset of posts from the 10 target subreddits, split 80/20 into training and evaluation sets (a sketch of such a split is shown below).
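
A comparable split can be produced with the `datasets` library (a sketch; `reddit_posts.csv` is a placeholder for your own (text, label) data):

```python
from datasets import load_dataset

# Placeholder: load your own dataset of (text, label) pairs
dataset = load_dataset("csv", data_files="reddit_posts.csv")["train"]

# 80/20 train/eval split, with a fixed seed for reproducibility
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
```
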
## Training Procedure
The model was fine-tuned with the following hyperparameters (a configuration sketch follows the list):

- Training regime: Fine-tuning
- Learning rate: 2e-5
- Number of epochs: 10
- Batch size: 128
- Optimizer: AdamW
- Mixed precision: FP16
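
A `TrainingArguments` configuration matching these hyperparameters might look like the following (a sketch; the output path is a placeholder, and AdamW is the `Trainer`'s default optimizer):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="subroberta",           # placeholder path
    learning_rate=2e-5,
    num_train_epochs=10,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    fp16=True,                         # mixed precision
    eval_strategy="epoch",             # `evaluation_strategy` in older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",  # matches the selection criterion below
)
```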
 
## Training Results
The model was evaluated using accuracy and F1-macro scores. The best model was selected based on the F1-macro score.
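
Selection by F1-macro can be wired into the `Trainer` with a `compute_metrics` function along these lines (a sketch assuming scikit-learn; the `"f1_macro"` key corresponds to `metric_for_best_model` in the configuration above):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Compute accuracy and macro-averaged F1 for Trainer evaluation."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro"),
    }
```
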
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{SubRoBERTa,
  author = {Marco Allanda},
  title = {SubRoBERTa: Reddit Subreddit Classification Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/marcoallanda/SubRoBERTa}}
}
```