PurrBERT-v1.1 is a lightweight content-safety classifier built on top of DistilBERT.
It's designed to flag harmful or unsafe user prompts before they reach an AI assistant.
This model is trained on a combination of safety-labeled prompt datasets for binary classification (SAFE vs. FLAGGED), with labels 0 → SAFE and 1 → FLAGGED. It is fine-tuned from distilbert-base-uncased. Loss dropped steadily during training, and metrics were evaluated on a held-out test set.
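As a point of reference, here is a minimal fine-tuning sketch in the spirit of the setup above, using the Hugging Face `Trainer` API. The dataset files, column names, and hyperparameters are illustrative assumptions, not the exact recipe used to train PurrBERT.

```python
# Illustrative fine-tuning sketch (assumed setup, not PurrBERT's exact recipe).
from datasets import load_dataset
from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "SAFE", 1: "FLAGGED"},
    label2id={"SAFE": 0, "FLAGGED": 1},
)

# Hypothetical CSV files with "text" and "label" columns (0 = SAFE, 1 = FLAGGED).
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    # Pad to a fixed length so the default collator can batch the examples.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

# Hyperparameters here are placeholders, not the values used for the released model.
args = TrainingArguments(
    output_dir="purrbert-out",
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```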
On an Aegis test slice:
| Metric | v1 | v1.1 |
|---|---|---|
| Accuracy | 0.8050 | 0.8200 |
| Precision | 0.7731 | 0.8091 |
| Recall | 0.8846 | 0.8558 |
| F1 Score | 0.8251 | 0.8318 |
Latency per prompt on GPU: ~0.0230 sec
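The sketch below shows one generic way to compute these metrics and a per-prompt latency figure, using scikit-learn and `time.perf_counter`. The `texts` and `labels` lists are placeholders, not the actual Aegis slice.

```python
# Generic evaluation sketch; texts/labels are placeholders, not the real test data.
import time
import torch
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
model = DistilBertForSequenceClassification.from_pretrained("purrgpt-community/purrbert-v1.1").to(device)
tokenizer = DistilBertTokenizerFast.from_pretrained("purrgpt-community/purrbert-v1.1")
model.eval()

texts = ["Hello, how are you?", "You are worthless and nobody likes you!"]  # placeholder prompts
labels = [0, 1]  # placeholder gold labels: 0 = SAFE, 1 = FLAGGED

preds, latencies = [], []
for text in texts:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    start = time.perf_counter()
    with torch.no_grad():
        # .item() forces the GPU to finish, so the timing covers the full forward pass.
        pred = model(**inputs).logits.argmax(dim=-1).item()
    latencies.append(time.perf_counter() - start)
    preds.append(pred)

print("Accuracy :", accuracy_score(labels, preds))
print("Precision:", precision_score(labels, preds))
print("Recall   :", recall_score(labels, preds))
print("F1 Score :", f1_score(labels, preds))
print(f"Mean latency per prompt: {sum(latencies) / len(latencies):.4f} sec")
```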
```python
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch

# Load the trained model and tokenizer
model = DistilBertForSequenceClassification.from_pretrained("purrgpt-community/purrbert-v1.1")
tokenizer = DistilBertTokenizerFast.from_pretrained("purrgpt-community/purrbert-v1.1")
model.eval()

def classify_prompt(prompt):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    pred = torch.argmax(outputs.logits, dim=-1).item()
    return "SAFE" if pred == 0 else "FLAGGED"

print(classify_prompt("You are worthless and nobody likes you!"))
# → FLAGGED
```
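Building on the snippet above, a probability can be read out through a softmax instead of a hard SAFE/FLAGGED label, which is useful if you want to apply your own threshold. The 0.9 cutoff below is an arbitrary example, not a tuned value.

```python
import torch.nn.functional as F

def flag_probability(prompt):
    """Return the model's probability that the prompt should be FLAGGED."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return F.softmax(logits, dim=-1)[0, 1].item()  # index 1 = FLAGGED

score = flag_probability("You are worthless and nobody likes you!")
print(f"FLAGGED probability: {score:.3f}")
if score > 0.9:  # example threshold; tune it for your own precision/recall trade-off
    print("Blocking this prompt.")
```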
PurrBERT is intended for moderating prompts before they're passed to AI models or for content-safety tasks. It is not a replacement for professional moderation in high-risk settings.
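As an illustration of that pre-moderation pattern, here is a sketch of gating prompts before they reach a downstream assistant. `call_assistant` is a hypothetical stand-in for whatever model or API you actually use, and `classify_prompt` is the helper defined above.

```python
def call_assistant(prompt):
    # Hypothetical downstream model/API call; replace with your own assistant.
    return f"(assistant reply to: {prompt!r})"

def moderated_chat(prompt):
    # Run PurrBERT first; only forward SAFE prompts to the assistant.
    if classify_prompt(prompt) == "FLAGGED":
        return "Sorry, I can't help with that request."
    return call_assistant(prompt)

print(moderated_chat("What's the weather like today?"))
print(moderated_chat("You are worthless and nobody likes you!"))
```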
Base model: distilbert/distilbert-base-uncased