---
license: apache-2.0
base_model:
- distilbert/distilbert-base-uncased
tags:
- Safety
- Content Moderation
- Hate Speech Detection
- Toxicity Detection
language:
- en
library_name: transformers
datasets:
- Paul/hatecheck
- dvruette/toxic-completions
- nvidia/Aegis-AI-Content-Safety-Dataset-2.0
pipeline_tag: text-classification
---

# 🐾 PurrBERT-v1.1

**PurrBERT-v1.1** is a lightweight content-safety classifier built on top of [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased).
It's designed to flag harmful or unsafe user prompts before they reach an AI assistant.

This model is trained on a combination of:
- [HateCheck](https://huggingface.co/datasets/Paul/hatecheck)
- [Toxic Completions](https://huggingface.co/datasets/dvruette/toxic-completions)
- [Aegis AI Content Safety Dataset 2.0](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0)

---

## 📝 Model Description

- **Architecture**: DistilBERT with a classification head (2 labels: `SAFE` vs. `FLAGGED`)
- **Purpose**: Detect hate speech, toxic content, and unsafe prompts in English text.
- **Input**: A single string (the prompt text).
- **Output**: A binary prediction (see the pipeline sketch below):
  - `0` → SAFE
  - `1` → FLAGGED
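
For a quick smoke test of this input/output contract, the checkpoint can also be loaded through the generic `transformers` pipeline API. This is a minimal sketch, not part of the original card: the printed label strings may be the library defaults (`LABEL_0` / `LABEL_1`) if `id2label` was not set in the config, so interpret them via the index convention above.

```python
from transformers import pipeline

# Minimal sketch (assumption: default label names). Index 0 = SAFE, index 1 = FLAGGED,
# matching the convention described above.
classifier = pipeline("text-classification", model="purrgpt-community/purrbert-v1.1")

print(classifier("What's the weather like today?"))
# e.g. [{'label': 'LABEL_0', 'score': 0.99}] -> index 0, i.e. SAFE
```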

---

## 🧠 Training Details

- **Base model**: `distilbert-base-uncased`
- **Epochs**: 2 (initial run)
- **Optimizer**: AdamW
- **Batch size**: 16
- **Learning rate**: 2e-5
- **Weight decay**: 0.01

Loss dropped steadily during training, and metrics were evaluated on a held-out test set.
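
The card does not include the training script itself. As a rough illustration of how the hyperparameters above could be wired into the Hugging Face `Trainer`, here is a hedged sketch; the two-example dataset is a placeholder for the tokenized HateCheck / Toxic Completions / Aegis mixture, and none of the identifiers below are taken from the actual training code.

```python
from datasets import Dataset
from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Hypothetical reconstruction of the listed hyperparameters, not the original script.
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # 0 = SAFE, 1 = FLAGGED
)

# Toy stand-in for the tokenized training mixture described above.
toy_data = {"text": ["You are welcome here.", "You are worthless."], "label": [0, 1]}
train_dataset = Dataset.from_dict(toy_data).map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64),
    batched=True,
)

training_args = TrainingArguments(
    output_dir="purrbert-v1.1",
    num_train_epochs=2,                # "initial run"
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,                 # the Trainer optimizes with AdamW by default
)

Trainer(model=model, args=training_args, train_dataset=train_dataset).train()
```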

---

## 📊 Evaluation Results

On an Aegis test slice:

| Metric    | v1     | v2     |
|-----------|--------|--------|
| Accuracy  | 0.8050 | 0.8200 |
| Precision | 0.7731 | 0.8091 |
| Recall    | 0.8846 | 0.8558 |
| F1 Score  | 0.8251 | 0.8318 |

Latency per prompt on GPU: **~0.0230 sec**
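
The card does not show how these figures were produced. If you want to run the same kind of evaluation on your own test slice, a minimal scikit-learn sketch (with made-up example labels, not the card's data) looks like this:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up example labels: 1 = FLAGGED, 0 = SAFE. Replace with the gold labels and
# the model's predictions for your own test slice.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
```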

---

## 🚀 Usage

```python
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch

# Load the trained model and tokenizer
model = DistilBertForSequenceClassification.from_pretrained("purrgpt-community/purrbert-v1.1")
tokenizer = DistilBertTokenizerFast.from_pretrained("purrgpt-community/purrbert-v1.1")
model.eval()

def classify_prompt(prompt):
    # Tokenize, run a forward pass without gradients, and map the argmax index to a label
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    pred = torch.argmax(outputs.logits, dim=-1).item()
    return "SAFE" if pred == 0 else "FLAGGED"

print(classify_prompt("You are worthless and nobody likes you!"))
# → FLAGGED
```
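
To moderate many prompts at once, the same `model` and `tokenizer` from the snippet above can be reused in batches. The helper below is an illustrative extension, not part of the original card; the softmax probability and the 0.5 threshold are assumptions you may want to tune.

```python
import torch

def classify_prompts(prompts, batch_size=32):
    """Classify a list of prompts, returning (label, flagged_probability) pairs."""
    results = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)          # column 1 = FLAGGED probability
        for flagged_prob in probs[:, 1].tolist():
            label = "FLAGGED" if flagged_prob >= 0.5 else "SAFE"
            results.append((label, flagged_prob))
    return results

print(classify_prompts(["Have a great day!", "You are worthless and nobody likes you!"]))
```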

---

## ⚠️ Limitations & Bias

* The model is trained primarily on English datasets.
* It may produce false positives on edgy but non-harmful speech, or false negatives on subtle harms.
* It reflects biases present in its training datasets.

---

## 🐾 Intended Use

PurrBERT is intended for **moderating prompts** before they're passed to AI models, and for similar content-safety tasks.
It is **not** a replacement for professional moderation in high-risk settings.
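
As an illustration of that intended use, a simple prompt gate in front of an assistant could look like the sketch below. `classify_prompt` is the helper from the Usage section; `generate_reply` is a hypothetical placeholder for whichever downstream model you are protecting.

```python
REFUSAL_MESSAGE = "Sorry, I can't help with that request."

def moderated_reply(prompt):
    # Run the safety classifier first and only forward SAFE prompts downstream.
    if classify_prompt(prompt) == "FLAGGED":
        return REFUSAL_MESSAGE
    return generate_reply(prompt)  # hypothetical call to the assistant being protected
```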