---
license: apache-2.0
base_model:
- distilbert/distilbert-base-uncased
tags:
- Safety
- Content Moderation
- Hate Speech Detection
- Toxicity Detection
language:
- en
library_name: transformers
datasets:
- Paul/hatecheck
- dvruette/toxic-completions
- nvidia/Aegis-AI-Content-Safety-Dataset-2.0
pipeline_tag: text-classification
---

# 🐾 PurrBERT-v1.1

**PurrBERT-v1.1** is a lightweight content-safety classifier built on top of [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased).
It's designed to flag harmful or unsafe user prompts before they reach an AI assistant.

This model is trained on a combination of:
- [HateCheck](https://huggingface.co/datasets/Paul/hatecheck)
- [Toxic Completions](https://huggingface.co/datasets/dvruette/toxic-completions)
- [Aegis AI Content Safety Dataset 2.0](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0)

---

## 📝 Model Description

- **Architecture**: DistilBERT with a classification head (2 labels: `SAFE` vs. `FLAGGED`)
- **Purpose**: Detect hate speech, toxic content, and unsafe prompts in English text.
- **Input**: A single string (the prompt text).
- **Output**: A binary prediction (see the pipeline sketch below):
  - `0` → SAFE
  - `1` → FLAGGED
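
For a quick smoke test of this input/output contract, the checkpoint can also be loaded through the generic `transformers` pipeline API. This is a minimal sketch, not part of the original card: the printed label strings may be the library defaults (`LABEL_0` / `LABEL_1`) if `id2label` was not set in the config, so interpret them via the index convention above.

```python
from transformers import pipeline

# Minimal sketch (assumption: default label names). Index 0 = SAFE, index 1 = FLAGGED,
# matching the convention described above.
classifier = pipeline("text-classification", model="purrgpt-community/purrbert-v1.1")

print(classifier("What's the weather like today?"))
# e.g. [{'label': 'LABEL_0', 'score': 0.99}] -> index 0, i.e. SAFE
```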

---

## 🧠 Training Details

- **Base model**: `distilbert-base-uncased`
- **Epochs**: 2 (initial run)
- **Optimizer**: AdamW
- **Batch size**: 16
- **Learning rate**: 2e-5
- **Weight decay**: 0.01

Loss dropped steadily during training, and metrics were evaluated on a held-out test set.
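
The card does not include the training script itself. As a rough illustration of how the hyperparameters above could be wired into the Hugging Face `Trainer`, here is a hedged sketch; the two-example dataset is a placeholder for the tokenized HateCheck / Toxic Completions / Aegis mixture, and none of the identifiers below are taken from the actual training code.

```python
from datasets import Dataset
from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Hypothetical reconstruction of the listed hyperparameters, not the original script.
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # 0 = SAFE, 1 = FLAGGED
)

# Toy stand-in for the tokenized training mixture described above.
toy_data = {"text": ["You are welcome here.", "You are worthless."], "label": [0, 1]}
train_dataset = Dataset.from_dict(toy_data).map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64),
    batched=True,
)

training_args = TrainingArguments(
    output_dir="purrbert-v1.1",
    num_train_epochs=2,                # "initial run"
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,                 # the Trainer optimizes with AdamW by default
)

Trainer(model=model, args=training_args, train_dataset=train_dataset).train()
```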

---

## 📊 Evaluation Results

On an Aegis test slice:

| Metric    | v1     | v2     |
|-----------|--------|--------|
| Accuracy  | 0.8050 | 0.8200 |
| Precision | 0.7731 | 0.8091 |
| Recall    | 0.8846 | 0.8558 |
| F1 Score  | 0.8251 | 0.8318 |

Latency per prompt on GPU: **~0.0230 sec**
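
The card does not show how these figures were produced. If you want to run the same kind of evaluation on your own test slice, a minimal scikit-learn sketch (with made-up example labels, not the card's data) looks like this:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up example labels: 1 = FLAGGED, 0 = SAFE. Replace with the gold labels and
# the model's predictions for your own test slice.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
```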

---

## 🚀 Usage

```python
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch

# Load the trained model and tokenizer
model = DistilBertForSequenceClassification.from_pretrained("purrgpt-community/purrbert-v1.1")
tokenizer = DistilBertTokenizerFast.from_pretrained("purrgpt-community/purrbert-v1.1")
model.eval()

def classify_prompt(prompt):
    # Tokenize, run a forward pass without gradients, and map the argmax index to a label
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    pred = torch.argmax(outputs.logits, dim=-1).item()
    return "SAFE" if pred == 0 else "FLAGGED"

print(classify_prompt("You are worthless and nobody likes you!"))
# → FLAGGED
```
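
To moderate many prompts at once, the same `model` and `tokenizer` from the snippet above can be reused in batches. The helper below is an illustrative extension, not part of the original card; the softmax probability and the 0.5 threshold are assumptions you may want to tune.

```python
import torch

def classify_prompts(prompts, batch_size=32):
    """Classify a list of prompts, returning (label, flagged_probability) pairs."""
    results = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)          # column 1 = FLAGGED probability
        for flagged_prob in probs[:, 1].tolist():
            label = "FLAGGED" if flagged_prob >= 0.5 else "SAFE"
            results.append((label, flagged_prob))
    return results

print(classify_prompts(["Have a great day!", "You are worthless and nobody likes you!"]))
```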

---

## ⚠️ Limitations & Bias

* The model is trained primarily on English datasets.
* It may produce false positives on edgy but non-harmful speech, or false negatives on subtle harms.
* It reflects biases present in its training datasets.

---

## 🐾 Intended Use

PurrBERT is intended for **moderating prompts** before they're passed to AI models, and for similar content-safety tasks.
It is **not** a replacement for professional moderation in high-risk settings.
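
As an illustration of that intended use, a simple prompt gate in front of an assistant could look like the sketch below. `classify_prompt` is the helper from the Usage section; `generate_reply` is a hypothetical placeholder for whichever downstream model you are protecting.

```python
REFUSAL_MESSAGE = "Sorry, I can't help with that request."

def moderated_reply(prompt):
    # Run the safety classifier first and only forward SAFE prompts downstream.
    if classify_prompt(prompt) == "FLAGGED":
        return REFUSAL_MESSAGE
    return generate_reply(prompt)  # hypothetical call to the assistant being protected
```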