DeepPass2-XLM-RoBERTa Fine-tuned for Secret Detection

Model Description

DeepPass2 is a fine-tuned version of xlm-roberta-base specifically designed for detecting passwords and secrets in documents through token classification. Unlike traditional regex-based approaches, this model understands context to identify both structured tokens (API keys, JWTs) and free-form passwords.

Developed by: Neeraj Gupta (SpecterOps)
Model type: Token Classification (Sequence Labeling)
Base model: xlm-roberta-base
Language(s): English
License: same as the base model (xlm-roberta-base)
Fine-tuned with: LoRA (Low-Rank Adaptation) through Unsloth
Blog post: What's Your Secret?: Secret Scanning by DeepPass2

Model Architecture

Base Model

  • Architecture: XLM-RoBERTa-base (Cross-lingual RoBERTa)
  • Parameters: ~278M (base model)
  • Max sequence length: 512 tokens
  • Hidden size: 768
  • Number of layers: 12
  • Number of attention heads: 12

LoRA Configuration

LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=64,                    # Rank
    lora_alpha=128,          # Scaling parameter
    lora_dropout=0.05,       # Dropout probability
    bias="none",
    target_modules=["query", "key", "value", "dense"]
)
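
The configuration above matches the PEFT library's LoraConfig; below is a minimal sketch of wrapping the base model with it, assuming PEFT (not necessarily the blog's exact training setup):

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForTokenClassification

base = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2   # binary token labels: 0 / 1
)
lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    bias="none",
    target_modules=["query", "key", "value", "dense"],
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction of the ~278M params train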

Intended Use

This model is the BERT-based model used in the DeepPass2 blog post.

Primary Use Case

  • Secret Detection: Identify passwords, API keys, tokens, and other sensitive credentials in documents
  • Security Auditing: Scan documents for potential credential leaks
  • Data Loss Prevention: Pre-screen documents before sharing or publishing

Input

  • Full DeepPass2 tool: text documents of any length, automatically chunked into 300-400 token segments
  • Model on its own: a single text string of up to 512 tokens per input
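
A minimal sketch of the chunking behavior described above, operating on token IDs; the exact boundary logic in the DeepPass2 tool may differ:

import random

def chunk_tokens(token_ids, min_len=300, max_len=400):
    """Split a long token-ID sequence into 300-400 token segments."""
    chunks, i = [], 0
    while i < len(token_ids):
        size = random.randint(min_len, max_len)   # randomized segment boundary
        chunks.append(token_ids[i:i + size])
        i += size
    return chunks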

Output

  • Token-level binary classification:
    • 0: Non-credential token
    • 1: Credential/password token

Training Data

Dataset Composition

  • Total examples: 23,000 (20,800 training, 2,200 testing)
  • Document types: Synthetic Emails, technical documents, logs, configuration files
  • Password sources:
    • Real breached passwords from CrackStation's "real human" dump
    • Synthetic passwords generated by LLMs
    • Structured tokens (API keys, JWTs, etc.)

Data Generation Process

  1. Base Documents: 2,000 long documents (2000+ tokens each) generated using LLMs
    • 50% containing passwords, 50% without
  2. Chunking: Documents split into 300-400 token chunks with random boundaries
  3. Password Injection: Real passwords inserted using skeleton sentences (see the sketch after this list):
    "Your account has been created with username: {user} and password: {pass}"
    
  4. Class Balance: <0.3% of tokens are passwords (matching the real-world distribution)
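
A hypothetical sketch of step 3 at the character level: a skeleton sentence is filled with a real password, inserted at a random point in a chunk, and the credential's character span is recorded for labeling. The names and skeleton text are illustrative, not the actual generation code.

import random

SKELETONS = [
    "Your account has been created with username: {user} and password: {pw}",
]

def inject_password(chunk_text, user, password):
    """Insert a credential sentence; return new text and the password's span."""
    sentence = random.choice(SKELETONS).format(user=user, pw=password)
    cut = random.randint(0, len(chunk_text))              # random insertion point
    new_text = f"{chunk_text[:cut]} {sentence} {chunk_text[cut:]}"
    start = new_text.index(password)                      # character span of the secret
    return new_text, (start, start + len(password))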

Training Procedure

Hardware

  • Trained on a MacBook Pro (64GB RAM) with MPS acceleration
  • Can be trained on systems with 8-16GB RAM

Hyperparameters

  • Epochs: 4
  • Batch size: 8 (per device)
  • Weight decay: 0.01
  • Optimizer: AdamW (default in Trainer)
  • Learning rate: Default (5e-5)
  • Max sequence length: 512 tokens
  • Random seed: 2
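
A minimal Trainer setup matching the hyperparameters above; the dataset and output path are placeholders, not the blog's actual training script:

from transformers import (
    Trainer,
    TrainingArguments,
    DataCollatorForTokenClassification,
)

args = TrainingArguments(
    output_dir="deeppass2-checkpoints",   # placeholder path
    num_train_epochs=4,
    per_device_train_batch_size=8,
    weight_decay=0.01,
    seed=2,
    # learning rate left at the Trainer default of 5e-5
)
trainer = Trainer(
    model=model,                          # LoRA-wrapped model from above
    args=args,
    train_dataset=train_dataset,          # placeholder: tokenized, labeled chunks
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()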

Training Process

Preprocessing:

  • Tokenization with offset mapping
  • Label generation based on credential spans
  • Padding to max_length with truncation

Fine-tuning:

  • LoRA adapters applied to attention layers
  • Binary cross-entropy loss
  • Token-level classification head
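
A sketch of the preprocessing above: character-level credential spans are aligned to per-token labels via the tokenizer's offset mapping. Here `spans` is assumed to be a list of (start, end) character ranges produced during password injection.

def tokenize_and_label(tokenizer, text, spans, max_length=512):
    """Tokenize text and emit one 0/1 label per token from credential spans."""
    enc = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_offsets_mapping=True,
    )
    labels = []
    for start, end in enc["offset_mapping"]:
        if start == end:                                    # special/padding token
            labels.append(-100)                             # ignored by the loss
        elif any(s < end and start < e for s, e in spans):  # overlaps a credential
            labels.append(1)
        else:
            labels.append(0)
    enc["labels"] = labels
    enc.pop("offset_mapping")
    return enc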

Performance Metrics

Chunk-Level Metrics

Metric              Score
Strict Accuracy     86.67%
Overlap Accuracy    97.72%

Password-Level Metrics

Metric                  Count/Rate
True Positives          1,201
True Negatives          1,112
False Positives         49 (3.9%)
False Negatives         138
Overlap True Positives  456
Recall                  89.7%
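
For reference, the recall above follows from the counts: TP / (TP + FN) = 1,201 / (1,201 + 138) ≈ 89.7%.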

Definitions

  • Strict Accuracy: All passwords in chunk detected with 100% accuracy
  • Overlap Accuracy: At least one password detected with >30% overlap with ground truth
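
A hypothetical sketch of the overlap criterion, assuming character-level spans; the evaluation's exact matching logic may differ:

def span_overlap_fraction(pred, truth):
    """Fraction of the ground-truth span covered by the predicted span."""
    p0, p1 = pred
    t0, t1 = truth
    inter = max(0, min(p1, t1) - max(p0, t0))
    return inter / (t1 - t0) if t1 > t0 else 0.0

def chunk_overlap_hit(pred_spans, truth_spans, threshold=0.30):
    """True if any prediction overlaps any ground-truth password by >30%."""
    return any(
        span_overlap_fraction(p, t) > threshold
        for t in truth_spans
        for p in pred_spans
    )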

Limitations and Biases

Known Limitations

  1. Context window: Limited to 512 tokens per chunk
  2. Training data: Primarily trained on LLM-generated documents, which may not fully represent real-world documents
  3. Password types: Better at detecting structured/complex passwords than simple dictionary words
  4. Tokenization boundaries: SentencePiece tokenization can fragment passwords, affecting boundary detection

Potential Biases

  • May over-detect in technical documentation due to training distribution
  • Tends to flag alphanumeric strings more readily than common words used as passwords

Ethical Considerations

Responsible Use

  • Privacy: This model should only be used on documents you have permission to scan
  • Security: Detected credentials should be handled securely and not logged or stored insecurely
  • False Positives: Always verify detected credentials before taking action

Misuse Potential

  • Should not be used to scan documents without authorization
  • Not intended for credential harvesting or malicious purposes

Usage

Installation

pip install transformers torch

Quick Start

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "path/to/deeppass2-xlm-roberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference

# Classify tokens
def detect_passwords(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

    with torch.no_grad():
        outputs = model(**inputs)

    predictions = torch.argmax(outputs.logits, dim=-1)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    # Extract password tokens (SentencePiece pieces; "▁" marks a word start)
    password_tokens = [
        token for token, label in zip(tokens, predictions[0])
        if label == 1
    ]

    return password_tokens
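
Example call (the credential below is illustrative):

secrets = detect_passwords("Your account password is Tr0ub4dor&3, keep it safe.")
print(secrets)  # SentencePiece pieces predicted as credential tokens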

Integration with DeepPass2

For production use, integrate with the full DeepPass2 pipeline:

  1. NoseyParker regex filtering
  2. BERT token classification (this model)
  3. LLM validation for false positive reduction

See the DeepPass2 repository for the complete implementation.
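
A hypothetical orchestration sketch of the three stages; `regex_prefilter` and `llm_validate` are stand-ins for the NoseyParker and LLM components, not the repository's actual interfaces:

import re

def regex_prefilter(text):
    """Stand-in for NoseyParker: cheap triage for secret-like strings."""
    return bool(re.search(r"(password|api[_-]?key|secret|token)", text, re.I))

def llm_validate(text, candidate):
    """Stand-in for the LLM stage: confirm a candidate is a real secret."""
    return True  # placeholder

def scan_document(text):
    if not regex_prefilter(text):           # stage 1: regex filtering
        return []
    candidates = detect_passwords(text)     # stage 2: this model (Quick Start above)
    return [c for c in candidates if llm_validate(text, c)]  # stage 3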

Citation

@software{gupta2025deeppass2,
  author = {Gupta, Neeraj},
  title = {DeepPass2: Fine-tuned XLM-RoBERTa for Secret Detection},
  year = {2025},
  organization = {SpecterOps},
  url = {https://huggingface.co/deeppass2-bert},
  note = {Blog: \url{https://specterops.io/blog/2025/07/31/whats-your-secret-secret-scanning-by-deeppass2/}}
}

Additional Information

Model Versions

  • v6.0-BERT: Current production version with LoRA adapters
  • merged-model: LoRA weights merged with base model for easier deployment

Contact

For questions or issues, please open an issue on the GitHub repository.
