# DeepPass2-XLM-RoBERTa Fine-tuned for Secret Detection

## Model Description
DeepPass2 is a fine-tuned version of xlm-roberta-base specifically designed for detecting passwords and secrets in documents through token classification. Unlike traditional regex-based approaches, this model understands context to identify both structured tokens (API keys, JWTs) and free-form passwords.
- Developed by: Neeraj Gupta (SpecterOps)
- Model type: Token Classification (Sequence Labeling)
- Base model: xlm-roberta-base
- Language(s): English
- License: [Same as base model]
- Fine-tuned with: LoRA (Low-Rank Adaptation) via Unsloth
- Blog post: [What's Your Secret?: Secret Scanning by DeepPass2](https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/)
## Model Architecture

### Base Model
- Architecture: XLM-RoBERTa-base (Cross-lingual RoBERTa)
- Parameters: ~278M (base model)
- Max sequence length: 512 tokens
- Hidden size: 768
- Number of layers: 12
- Number of attention heads: 12
### LoRA Configuration

```python
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=64,                  # rank
    lora_alpha=128,        # scaling parameter
    lora_dropout=0.05,     # dropout probability
    bias="none",
    target_modules=["query", "key", "value", "dense"],
)
```
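A minimal sketch of attaching this adapter with `peft`'s `get_peft_model`, reusing `lora_config` from above (illustrative, not the exact training code; the `num_labels=2` head matches the binary token-classification task):

```python
from peft import get_peft_model
from transformers import AutoModelForTokenClassification

# Base model with a fresh 2-label (non-credential / credential) head
base = AutoModelForTokenClassification.from_pretrained("xlm-roberta-base", num_labels=2)

# Wrap with LoRA; only the adapter weights and the classification head train
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```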
## Intended Use

This model is the BERT-based detector described in the DeepPass2 blog post.

### Primary Use Cases
- Secret Detection: Identify passwords, API keys, tokens, and other sensitive credentials in documents
- Security Auditing: Scan documents for potential credential leaks
- Data Loss Prevention: Pre-screen documents before sharing or publishing
### Input

- Full DeepPass2 tool: text documents of any length, automatically split into 300-400-token chunks (see the sketch below)
- Model alone: a single text string of up to 512 tokens per input
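The chunking behavior can be approximated with the tokenizer itself. A minimal sketch, assuming chunk boundaries are drawn uniformly from the 300-400-token range described under Data Generation Process (`chunk_document` is an illustrative helper, not part of the released tooling):

```python
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def chunk_document(text: str, min_len: int = 300, max_len: int = 400) -> list[str]:
    """Split a document into chunks of 300-400 tokens with random boundaries."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks, start = [], 0
    while start < len(ids):
        end = start + random.randint(min_len, max_len)
        chunks.append(tokenizer.decode(ids[start:end]))
        start = end
    return chunks
```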
### Output

Token-level binary classification:

- `0`: non-credential token
- `1`: credential/password token
## Training Data

### Dataset Composition
- Total examples: 23,000 (20,800 training, 2,200 testing)
- Document types: synthetic emails, technical documents, logs, configuration files
- Password sources:
- Real breached passwords from CrackStation's "real human" dump
- Synthetic passwords generated by LLMs
- Structured tokens (API keys, JWTs, etc.)
### Data Generation Process

1. Base Documents: 2,000 long documents (2,000+ tokens each) generated using LLMs, 50% containing passwords and 50% without
2. Chunking: documents split into 300-400-token chunks with random boundaries
3. Password Injection: real passwords inserted using skeleton sentences such as "Your account has been created with username: {user} and password: {pass}" (sketched below)
4. Class Balance: <0.3% of tokens are passwords, maintaining the real-world distribution
## Training Procedure

### Hardware
- Trained on MacBook Pro (64GB RAM) with MPS acceleration
- Can be trained on systems with 8-16GB RAM
### Hyperparameters
- Epochs: 4
- Batch size: 8 (per device)
- Weight decay: 0.01
- Optimizer: AdamW (default in Trainer)
- Learning rate: Default (5e-5)
- Max sequence length: 512 tokens
- Random seed: 2
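These settings map directly onto Hugging Face `TrainingArguments`; a minimal sketch (the output path is a placeholder, and the explicit `learning_rate` simply restates the Trainer default):

```python
from transformers import TrainingArguments, set_seed

set_seed(2)  # the random seed listed above

training_args = TrainingArguments(
    output_dir="deeppass2-checkpoints",  # placeholder path
    num_train_epochs=4,
    per_device_train_batch_size=8,
    weight_decay=0.01,
    learning_rate=5e-5,                  # Trainer default
)
```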
### Training Process

#### Preprocessing

- Tokenization with offset mapping
- Label generation based on credential spans
- Padding to max_length with truncation

#### Fine-tuning

- LoRA adapters applied to attention layers
- Binary cross-entropy loss
- Token-level classification head
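The preprocessing step might look like the following sketch, assuming each example stores its text and character-level credential spans (the `text` and `spans` field names are illustrative):

```python
def tokenize_and_label(example, tokenizer, max_length=512):
    """Tokenize a chunk and label every token that overlaps a credential span."""
    enc = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_offsets_mapping=True,
    )
    labels = []
    for start, end in enc["offset_mapping"]:
        if start == end:            # special or padding token
            labels.append(-100)     # ignored by the cross-entropy loss
        elif any(s < end and start < e for s, e in example["spans"]):
            labels.append(1)        # token overlaps a credential span
        else:
            labels.append(0)
    enc["labels"] = labels
    return enc
```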
## Performance Metrics

### Chunk-Level Metrics
| Metric | Score |
|---|---|
| Strict Accuracy | 86.67% |
| Overlap Accuracy | 97.72% |
### Password-Level Metrics
| Metric | Count/Rate |
|---|---|
| True Positives | 1,201 |
| True Negatives | 1,112 |
| False Positives | 49 (3.9%) |
| False Negatives | 138 |
| Overlap True Positives | 456 |
| Recall | 89.7% |
### Definitions

- Strict Accuracy: all passwords in a chunk detected with 100% accuracy
- Overlap Accuracy: at least one password detected with >30% overlap with the ground truth
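The >30% criterion can be expressed as a span-overlap ratio; an illustrative helper (not the exact evaluation code):

```python
def overlap_ratio(pred: tuple[int, int], truth: tuple[int, int]) -> float:
    """Fraction of the ground-truth span covered by the predicted span."""
    (ps, pe), (ts, te) = pred, truth
    intersection = max(0, min(pe, te) - max(ps, ts))
    return intersection / (te - ts) if te > ts else 0.0

# A detection counts toward Overlap Accuracy when the ratio exceeds 0.30
print(overlap_ratio((10, 18), (12, 22)))  # 0.6
```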
## Limitations and Biases

### Known Limitations
- Context window: Limited to 512 tokens per chunk
- Training data: Primarily trained on LLM-generated documents which may not fully represent real-world documents
- Password types: Better at detecting structured/complex passwords than simple dictionary words
- Tokenization boundaries: SentencePiece tokenization can fragment passwords, affecting boundary detection
### Potential Biases
- May over-detect in technical documentation due to training distribution
- Tends to flag alphanumeric strings more readily than common words used as passwords
## Ethical Considerations

### Responsible Use
- Privacy: This model should only be used on documents you have permission to scan
- Security: Detected credentials should be handled securely and not logged or stored insecurely
- False Positives: Always verify detected credentials before taking action
### Misuse Potential
- Should not be used to scan documents without authorization
- Not intended for credential harvesting or malicious purposes
## Usage

### Installation

```bash
pip install transformers torch
```
### Quick Start

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "gneeraj/deeppass2-bert"  # or a local path to the merged model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# Classify tokens
def detect_passwords(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    # Extract tokens predicted as credentials (label 1), skipping special tokens
    password_tokens = [
        token for token, label in zip(tokens, predictions[0])
        if label == 1 and token not in tokenizer.all_special_tokens
    ]
    return password_tokens
```
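Because SentencePiece fragments passwords into subword pieces, mapping predictions back to character spans via `return_offsets_mapping` usually yields cleaner output than collecting raw tokens. A sketch reusing the `tokenizer` and `model` loaded above:

```python
def detect_password_spans(text):
    """Return (start, end, substring) character spans predicted as credentials."""
    inputs = tokenizer(
        text, return_tensors="pt", truncation=True,
        max_length=512, return_offsets_mapping=True,
    )
    offsets = inputs.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        logits = model(**inputs).logits
    labels = torch.argmax(logits, dim=-1)[0].tolist()

    spans, current = [], None
    for (start, end), label in zip(offsets, labels):
        if start == end:        # special tokens carry empty offsets
            continue
        if label == 1:          # open or extend a credential span
            current = (current[0], end) if current else (start, end)
        elif current:           # close the span on the first non-credential token
            spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(s, e, text[s:e]) for s, e in spans]

print(detect_password_spans("Your password: Tr0ub4dor&3 expires tomorrow."))
```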
## Integration with DeepPass2

For production use, integrate this model with the full DeepPass2 pipeline:

1. NoseyParker regex filtering
2. BERT token classification (this model)
3. LLM validation for false-positive reduction

See the DeepPass2 repository for the complete implementation.
## Citation

```bibtex
@software{gupta2025deeppass2,
  author       = {Gupta, Neeraj},
  title        = {DeepPass2: Fine-tuned XLM-RoBERTa for Secret Detection},
  year         = {2025},
  organization = {SpecterOps},
  url          = {https://huggingface.co/gneeraj/deeppass2-bert},
  note         = {Blog: \url{https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/}}
}
```
## Additional Information

### Model Versions

- v6.0-BERT: current production version with LoRA adapters
- merged-model: LoRA weights merged into the base model for easier deployment (see the sketch below)
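Merging can be reproduced with `peft`; a minimal sketch, assuming the Hub repo hosts the LoRA adapter (the output path is a placeholder):

```python
from peft import PeftModel
from transformers import AutoModelForTokenClassification

base = AutoModelForTokenClassification.from_pretrained("xlm-roberta-base", num_labels=2)
model = PeftModel.from_pretrained(base, "gneeraj/deeppass2-bert")

merged = model.merge_and_unload()           # fold LoRA weights into the base weights
merged.save_pretrained("deeppass2-merged")  # placeholder output path
```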
### Related Links

- [Blog post: What's Your Secret?: Secret Scanning by DeepPass2](https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/)
- [Base model: FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base)
### Contact

For questions or issues, please open an issue on the GitHub repository.