# DeepPass2-XLM-RoBERTa Fine-tuned for Secret Detection

## Model Description
DeepPass2 is a fine-tuned version of xlm-roberta-base specifically designed for detecting passwords and secrets in documents through token classification. Unlike traditional regex-based approaches, this model understands context to identify both structured tokens (API keys, JWTs) and free-form passwords.
- Developed by: Neeraj Gupta (SpecterOps)
- Model type: Token Classification (Sequence Labeling)
- Base model: xlm-roberta-base
- Language(s): English
- License: [Same as base model]
- Fine-tuned with: LoRA (Low-Rank Adaptation) via Unsloth
- Blog post: [What's Your Secret?: Secret Scanning by DeepPass2](https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/)
## Model Architecture

### Base Model
- Architecture: XLM-RoBERTa-base (Cross-lingual RoBERTa)
- Parameters: ~278M (base model)
- Max sequence length: 512 tokens
- Hidden size: 768
- Number of layers: 12
- Number of attention heads: 12
### LoRA Configuration

```python
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=64,                  # rank
    lora_alpha=128,        # scaling parameter
    lora_dropout=0.05,     # dropout probability
    bias="none",
    target_modules=["query", "key", "value", "dense"],
)
```
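A minimal sketch of attaching this adapter with `peft`'s `get_peft_model`, reusing `lora_config` from above (illustrative, not the exact training code; the `num_labels=2` head matches the binary token-classification task):

```python
from peft import get_peft_model
from transformers import AutoModelForTokenClassification

# Base model with a fresh 2-label (non-credential / credential) head
base = AutoModelForTokenClassification.from_pretrained("xlm-roberta-base", num_labels=2)

# Wrap with LoRA; only the adapter weights and the classification head train
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```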
## Intended Use

This model is the BERT-based detector described in the DeepPass2 blog post.

### Primary Use Cases
- Secret Detection: Identify passwords, API keys, tokens, and other sensitive credentials in documents
- Security Auditing: Scan documents for potential credential leaks
- Data Loss Prevention: Pre-screen documents before sharing or publishing
### Input

- Full DeepPass2 tool: text documents of any length, automatically split into 300-400-token chunks (see the sketch below)
- Model alone: a single text string of up to 512 tokens per input
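The chunking behavior can be approximated with the tokenizer itself. A minimal sketch, assuming chunk boundaries are drawn uniformly from the 300-400-token range described under Data Generation Process (`chunk_document` is an illustrative helper, not part of the released tooling):

```python
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def chunk_document(text: str, min_len: int = 300, max_len: int = 400) -> list[str]:
    """Split a document into chunks of 300-400 tokens with random boundaries."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks, start = [], 0
    while start < len(ids):
        end = start + random.randint(min_len, max_len)
        chunks.append(tokenizer.decode(ids[start:end]))
        start = end
    return chunks
```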
### Output

Token-level binary classification:

- `0`: non-credential token
- `1`: credential/password token
## Training Data

### Dataset Composition
- Total examples: 23,000 (20,800 training, 2,200 testing)
- Document types: synthetic emails, technical documents, logs, configuration files
- Password sources:
- Real breached passwords from CrackStation's "real human" dump
- Synthetic passwords generated by LLMs
- Structured tokens (API keys, JWTs, etc.)
### Data Generation Process

1. Base Documents: 2,000 long documents (2,000+ tokens each) generated using LLMs, 50% containing passwords and 50% without
2. Chunking: documents split into 300-400-token chunks with random boundaries
3. Password Injection: real passwords inserted using skeleton sentences such as "Your account has been created with username: {user} and password: {pass}" (sketched below)
4. Class Balance: <0.3% of tokens are passwords, maintaining the real-world distribution
## Training Procedure

### Hardware
- Trained on MacBook Pro (64GB RAM) with MPS acceleration
- Can be trained on systems with 8-16GB RAM
### Hyperparameters
- Epochs: 4
- Batch size: 8 (per device)
- Weight decay: 0.01
- Optimizer: AdamW (default in Trainer)
- Learning rate: Default (5e-5)
- Max sequence length: 512 tokens
- Random seed: 2
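These settings map directly onto Hugging Face `TrainingArguments`; a minimal sketch (the output path is a placeholder, and the explicit `learning_rate` simply restates the Trainer default):

```python
from transformers import TrainingArguments, set_seed

set_seed(2)  # the random seed listed above

training_args = TrainingArguments(
    output_dir="deeppass2-checkpoints",  # placeholder path
    num_train_epochs=4,
    per_device_train_batch_size=8,
    weight_decay=0.01,
    learning_rate=5e-5,                  # Trainer default
)
```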
### Training Process

#### Preprocessing

- Tokenization with offset mapping
- Label generation based on credential spans
- Padding to max_length with truncation

#### Fine-tuning

- LoRA adapters applied to attention layers
- Binary cross-entropy loss
- Token-level classification head
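The preprocessing step might look like the following sketch, assuming each example stores its text and character-level credential spans (the `text` and `spans` field names are illustrative):

```python
def tokenize_and_label(example, tokenizer, max_length=512):
    """Tokenize a chunk and label every token that overlaps a credential span."""
    enc = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_offsets_mapping=True,
    )
    labels = []
    for start, end in enc["offset_mapping"]:
        if start == end:            # special or padding token
            labels.append(-100)     # ignored by the cross-entropy loss
        elif any(s < end and start < e for s, e in example["spans"]):
            labels.append(1)        # token overlaps a credential span
        else:
            labels.append(0)
    enc["labels"] = labels
    return enc
```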
## Performance Metrics

### Chunk-Level Metrics
| Metric | Score |
|---|---|
| Strict Accuracy | 86.67% |
| Overlap Accuracy | 97.72% |
### Password-Level Metrics
| Metric | Count/Rate |
|---|---|
| True Positives | 1,201 |
| True Negatives | 1,112 |
| False Positives | 49 (3.9%) |
| False Negatives | 138 |
| Overlap True Positives | 456 |
| Recall | 89.7% |
### Definitions

- Strict Accuracy: all passwords in a chunk detected with 100% accuracy
- Overlap Accuracy: at least one password detected with >30% overlap with the ground truth
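The >30% criterion can be expressed as a span-overlap ratio; an illustrative helper (not the exact evaluation code):

```python
def overlap_ratio(pred: tuple[int, int], truth: tuple[int, int]) -> float:
    """Fraction of the ground-truth span covered by the predicted span."""
    (ps, pe), (ts, te) = pred, truth
    intersection = max(0, min(pe, te) - max(ps, ts))
    return intersection / (te - ts) if te > ts else 0.0

# A detection counts toward Overlap Accuracy when the ratio exceeds 0.30
print(overlap_ratio((10, 18), (12, 22)))  # 0.6
```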
## Limitations and Biases

### Known Limitations
- Context window: Limited to 512 tokens per chunk
- Training data: Primarily trained on LLM-generated documents which may not fully represent real-world documents
- Password types: Better at detecting structured/complex passwords than simple dictionary words
- Tokenization boundaries: SentencePiece tokenization can fragment passwords, affecting boundary detection
### Potential Biases
- May over-detect in technical documentation due to training distribution
- Tends to flag alphanumeric strings more readily than common words used as passwords
## Ethical Considerations

### Responsible Use
- Privacy: This model should only be used on documents you have permission to scan
- Security: Detected credentials should be handled securely and not logged or stored insecurely
- False Positives: Always verify detected credentials before taking action
### Misuse Potential
- Should not be used to scan documents without authorization
- Not intended for credential harvesting or malicious purposes
## Usage

### Installation

```bash
pip install transformers torch
```
### Quick Start

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "gneeraj/deeppass2-bert"  # or a local path to the merged model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# Classify tokens
def detect_passwords(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    # Extract tokens predicted as credentials (label 1), skipping special tokens
    password_tokens = [
        token for token, label in zip(tokens, predictions[0])
        if label == 1 and token not in tokenizer.all_special_tokens
    ]
    return password_tokens
```
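Because SentencePiece fragments passwords into subword pieces, mapping predictions back to character spans via `return_offsets_mapping` usually yields cleaner output than collecting raw tokens. A sketch reusing the `tokenizer` and `model` loaded above:

```python
def detect_password_spans(text):
    """Return (start, end, substring) character spans predicted as credentials."""
    inputs = tokenizer(
        text, return_tensors="pt", truncation=True,
        max_length=512, return_offsets_mapping=True,
    )
    offsets = inputs.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        logits = model(**inputs).logits
    labels = torch.argmax(logits, dim=-1)[0].tolist()

    spans, current = [], None
    for (start, end), label in zip(offsets, labels):
        if start == end:        # special tokens carry empty offsets
            continue
        if label == 1:          # open or extend a credential span
            current = (current[0], end) if current else (start, end)
        elif current:           # close the span on the first non-credential token
            spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(s, e, text[s:e]) for s, e in spans]

print(detect_password_spans("Your password: Tr0ub4dor&3 expires tomorrow."))
```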
## Integration with DeepPass2

For production use, integrate this model with the full DeepPass2 pipeline:

1. NoseyParker regex filtering
2. BERT token classification (this model)
3. LLM validation for false-positive reduction

See the DeepPass2 repository for the complete implementation.
## Citation

```bibtex
@software{gupta2025deeppass2,
  author       = {Gupta, Neeraj},
  title        = {DeepPass2: Fine-tuned XLM-RoBERTa for Secret Detection},
  year         = {2025},
  organization = {SpecterOps},
  url          = {https://huggingface.co/gneeraj/deeppass2-bert},
  note         = {Blog: \url{https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/}}
}
```
## Additional Information

### Model Versions

- v6.0-BERT: current production version with LoRA adapters
- merged-model: LoRA weights merged into the base model for easier deployment (see the sketch below)
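Merging can be reproduced with `peft`; a minimal sketch, assuming the Hub repo hosts the LoRA adapter (the output path is a placeholder):

```python
from peft import PeftModel
from transformers import AutoModelForTokenClassification

base = AutoModelForTokenClassification.from_pretrained("xlm-roberta-base", num_labels=2)
model = PeftModel.from_pretrained(base, "gneeraj/deeppass2-bert")

merged = model.merge_and_unload()           # fold LoRA weights into the base weights
merged.save_pretrained("deeppass2-merged")  # placeholder output path
```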
### Related Links

- [Blog post: What's Your Secret?: Secret Scanning by DeepPass2](https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/)
- [Base model: FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base)
### Contact

For questions or issues, please open an issue on the GitHub repository.