# ModernBERT-base Fine-tuned for Harmful Prompt Classification
A binary classifier fine-tuned on the WildGuardMix dataset to detect harmful or unsafe prompts.
Built on answerdotai/ModernBERT-base with flash attention for efficient inference.
## Model Overview
- Task: Harmful prompt detection (binary classification)
- Labels:
  - 1 → Harmful / Unsafe
  - 0 → Safe / Non-harmful

## Performance (Test Set)
| Metric | Score | 
|---|---|
| Accuracy | 95.9% | 
| F1 Score | 96.21% | 
| Precision | 96.39% | 
| Recall | 96.21% | 
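
For reference, a minimal sketch of how metrics like these are typically computed with scikit-learn. The toy labels below are placeholders, not the actual test split:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy example: replace with the real test-split labels and model predictions.
# 0 = safe / non-harmful, 1 = harmful / unsafe (label mapping from this card).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2%}")
print(f"F1:        {f1_score(y_true, y_pred):.2%}")        # binary F1 on the harmful (1) class
print(f"Precision: {precision_score(y_true, y_pred):.2%}")
print(f"Recall:    {recall_score(y_true, y_pred):.2%}")
```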
## Training Details
- Dataset: `allenai/wildguardmix` (`wildguardtrain` subset)
- Split:
  - 80/20 train/test
  - 90/10 train/validation (from the training set)
  - Stratified on: prompt harm label, adversarial flag, and subcategory
- Optimizer: AdamW (8-bit)
- Learning Rate: `1e-4` (cosine schedule, 10% warmup)
- Batch Size: 96
- Max Sequence Length: 256 tokens
- Epochs: 3

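A rough sketch of how these settings could map onto a Hugging Face `TrainingArguments` configuration. The original training script is not published, so the argument names and the use of `bitsandbytes` for 8-bit AdamW are assumptions:

```python
from transformers import TrainingArguments

# Hyperparameters taken from the list above; everything else is an illustrative default.
# Sequence length (256 tokens) is applied at tokenization time, not here.
training_args = TrainingArguments(
    output_dir="modernbert-wildguardmix-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=96,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                # 10% warmup
    optim="adamw_bnb_8bit",          # 8-bit AdamW (requires bitsandbytes)
)
```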
## Intended Use
This model is designed for binary classification of text prompts as:

- Harmful (1) → unsafe or toxic content
- Unharmful (0) → safe or benign content

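A minimal inference sketch with 🤗 Transformers, using the repo id from this card. It assumes the checkpoint loads as a standard sequence-classification model and that your installed `transformers` version supports ModernBERT:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "Jazhyc/modernbert-wildguardmix-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

prompt = "How do I make a dangerous chemical at home?"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    logits = model(**inputs).logits

label = logits.argmax(dim=-1).item()
print("harmful" if label == 1 else "safe")  # 1 = harmful / unsafe, 0 = safe
```
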
**Disclaimer:** This model should not be deployed in production systems without additional evaluation and alignment with domain-specific safety and ethical guidelines.