---
title: DeepSeek-OCR
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
tags:
  - ocr
  - vision-language-model
  - document-processing
  - vllm
  - deepseek
license: mit
---
# DeepSeek-OCR with vLLM

High-performance document OCR using DeepSeek-OCR with vLLM for efficient batch processing.

## Quick Start with Hugging Face Jobs

Process any image dataset without needing your own GPU:
```bash
# Basic usage (Gundam mode - adaptive resolution)
hf jobs run --flavor l4x1 \
  --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  input-dataset \
  output-dataset
```
# Quick test with 10 samples
hf jobs run --flavor l4x1 \
--secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python process_dataset.py \
your-input-dataset \
your-output-dataset \
--max-samples 10
That's it! The script will:

- Process images from your dataset
- Add OCR results as a new `markdown` column
- Push results to a new dataset with automatic documentation
- View results at: https://huggingface.co/datasets/[your-output-dataset]
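
When the job completes, you can inspect the results locally. A minimal sketch using the `datasets` library, with the repo id as a placeholder:

```python
# Minimal sketch: load the pushed dataset and peek at the OCR output.
# "your-output-dataset" is a placeholder for the repo you pushed to.
from datasets import load_dataset

ds = load_dataset("your-output-dataset", split="train")
print(ds.column_names)          # original columns plus "markdown" and "inference_info"
print(ds[0]["markdown"][:500])  # first 500 characters of the OCR text
```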
## Features

### Model Capabilities

- **LaTeX equations** - Mathematical formulas preserved in LaTeX format
- **Tables** - Extracted and formatted as HTML/markdown
- **Document structure** - Headers, lists, and formatting maintained
- **Image grounding** - Spatial layout and bounding box information
- **Complex layouts** - Multi-column and hierarchical structures
- **Multilingual** - Supports multiple languages

### Performance

- **vLLM AsyncEngine** - Optimized for throughput (~2500 tokens/s on A100)
- **Multiple resolution modes** - Choose speed vs. quality
- **Large context** - Up to 8K tokens
- **Batch optimized** - Efficient async processing
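
For context, here is a minimal sketch of what async batched inference with vLLM's `AsyncLLMEngine` can look like. This is illustrative, not the Space's actual code; the model ID, prompt, and engine settings are taken from the defaults documented below, and everything else is an assumption:

```python
# Illustrative sketch only - not the Space's actual implementation.
# Model ID, prompt, and engine settings follow this README's defaults.
import asyncio
import uuid

from PIL import Image
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

PROMPT = "<image>\n<|grounding|>Convert the document to markdown."

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="deepseek-ai/DeepSeek-OCR",
        trust_remote_code=True,       # the model ships custom vision encoders
        max_model_len=8192,
        gpu_memory_utilization=0.75,
    )
)


async def ocr_one(image: Image.Image) -> str:
    """Submit one image and await the finished completion."""
    params = SamplingParams(temperature=0.0, max_tokens=8192)
    result = None
    # generate() streams partial RequestOutputs; the last one is final.
    async for output in engine.generate(
        {"prompt": PROMPT, "multi_modal_data": {"image": image}},
        params,
        request_id=str(uuid.uuid4()),
    ):
        result = output
    return result.outputs[0].text


async def main(paths: list[str]) -> list[str]:
    # Submitting all requests concurrently lets vLLM batch them internally.
    images = [Image.open(p).convert("RGB") for p in paths]
    return await asyncio.gather(*(ocr_one(img) for img in images))


if __name__ == "__main__":
    print(asyncio.run(main(["page.png"]))[0])
```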
## Resolution Modes

| Mode | Resolution | Vision Tokens | Best For |
|---|---|---|---|
| `tiny` | 512×512 | 64 | Fast testing, simple documents |
| `small` | 640×640 | 100 | Balanced speed/quality |
| `base` | 1024×1024 | 256 | High quality documents |
| `large` | 1280×1280 | 400 | Maximum quality, detailed docs |
| `gundam` | Dynamic | Adaptive | Large documents, best overall |
## Usage Examples

### Basic Processing

```bash
# Default (Gundam mode)
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  my-images-dataset \
  ocr-results
```

### Fast Processing for Testing

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  large-dataset \
  test-output \
  --max-samples 100
```

### Random Sampling

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  ordered-dataset \
  random-sample \
  --max-samples 50 \
  --shuffle \
  --seed 42
```

### Custom Image Column

```bash
hf jobs run --flavor a10g-large --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  davanstrien/ufo-ColPali \
  ufo-ocr \
  --image-column image
```

### Private Output Dataset

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  private-input \
  private-output \
  --private
```
## Command-Line Options

### Required Arguments

| Argument | Description |
|---|---|
| `input_dataset` | Input dataset ID from Hugging Face Hub |
| `output_dataset` | Output dataset ID for Hugging Face Hub |

### Optional Arguments

| Option | Default | Description |
|---|---|---|
| `--image-column` | `image` | Column containing images |
| `--model` | `deepseek-ai/DeepSeek-OCR` | Model to use |
| `--resolution-mode` | `gundam` | Resolution preset (`tiny`/`small`/`base`/`large`/`gundam`) |
| `--max-model-len` | `8192` | Maximum model context length |
| `--max-tokens` | `8192` | Maximum tokens to generate |
| `--gpu-memory-utilization` | `0.75` | GPU memory usage (0.0-1.0) |
| `--prompt` | `<image>\n<\|grounding\|>Convert...` | Custom prompt |
| `--hf-token` | - | Hugging Face API token (or use env var) |
| `--split` | `train` | Dataset split to process |
| `--max-samples` | None | Limit samples (for testing) |
| `--private` | False | Make output dataset private |
| `--shuffle` | False | Shuffle dataset before processing |
| `--seed` | `42` | Random seed for shuffling |
## Output Format

The script adds two new columns to your dataset:

- `markdown` - The OCR text in markdown format
- `inference_info` - JSON metadata about the processing
### Inference Info Structure

```json
[
  {
    "column_name": "markdown",
    "model_id": "deepseek-ai/DeepSeek-OCR",
    "processing_date": "2025-10-21T12:00:00",
    "resolution_mode": "gundam",
    "base_size": 1024,
    "image_size": 640,
    "crop_mode": true,
    "prompt": "<image>\n<|grounding|>Convert the document to markdown.",
    "max_tokens": 8192,
    "gpu_memory_utilization": 0.75,
    "max_model_len": 8192,
    "script": "main.py",
    "script_version": "1.0.0",
    "space_url": "https://huggingface.co/spaces/davanstrien/deepseek-ocr",
    "implementation": "vllm-async (optimized)"
  }
]
```
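
Assuming the metadata is serialized as a JSON string in the `inference_info` column (as the structure above suggests), a minimal sketch for reading it back:

```python
# Minimal sketch: parse per-row processing metadata from the output dataset.
# Assumes inference_info is stored as a JSON string; the repo id is a placeholder.
import json

from datasets import load_dataset

ds = load_dataset("your-output-dataset", split="train")
info = json.loads(ds[0]["inference_info"])  # a list, one entry per OCR column
for entry in info:
    print(entry["column_name"], entry["model_id"], entry["resolution_mode"])
```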
## Technical Details

### Architecture

- **Model**: DeepSeek-OCR (3B parameters, based on Qwen2.5-VL)
- **Inference Engine**: vLLM 0.8.5 with AsyncEngine
- **Image Preprocessing**: Custom dynamic tiling based on aspect ratio
- **Vision Encoders**: Custom CLIP + SAM encoders
- **Context Length**: Up to 8K tokens
- **Optimization**: Flash Attention 2.7.3, async batch processing
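
To make the dynamic-tiling idea concrete, here is an illustrative sketch (not DeepSeek-OCR's actual preprocessing) of cutting a page into fixed-size tiles driven by its aspect ratio:

```python
# Illustrative sketch of aspect-ratio-based tiling - NOT the model's real
# preprocessing. Scale the short side to the tile size, then cut along
# the long side so tall or wide pages become several square-ish tiles.
from PIL import Image


def tile_image(image: Image.Image, tile: int = 1024) -> list[Image.Image]:
    w, h = image.size
    scale = tile / min(w, h)
    image = image.resize((round(w * scale), round(h * scale)))
    w, h = image.size
    tiles = []
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            box = (left, top, min(left + tile, w), min(top + tile, h))
            tiles.append(image.crop(box))
    return tiles
```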
### Hardware Requirements

- **Minimum**: L4 GPU (24GB VRAM) - `--flavor l4x1`
- **Recommended**: L40S/A10G (48GB VRAM) - `--flavor l40sx1` or `--flavor a10g-large`
- **Maximum Performance**: A100 (40GB+ VRAM) - `--flavor a100-large`
### Speed Benchmarks

| GPU | Resolution | Speed | Notes |
|---|---|---|---|
| L4 | Tiny | ~5-8 img/s | Good for testing |
| L4 | Gundam | ~2-3 img/s | Balanced |
| A100 | Gundam | ~8-12 img/s | Production speed |
| A100 | Large | ~5-7 img/s | Maximum quality |
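
These rates translate directly into wall-clock estimates; a back-of-the-envelope sketch using the mid-range figures from the table:

```python
# Rough runtime estimates from the benchmark table above (mid-range rates).
def estimated_hours(num_images: int, images_per_second: float) -> float:
    return num_images / images_per_second / 3600


# e.g. 50,000 pages in Gundam mode:
print(f"L4:   {estimated_hours(50_000, 2.5):.1f} h")   # ~5.6 h at ~2.5 img/s
print(f"A100: {estimated_hours(50_000, 10.0):.1f} h")  # ~1.4 h at ~10 img/s
```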
## Example Workflows

### 1. Process Historical Documents

```bash
hf jobs run --flavor l40sx1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python main.py \
  historical-scans \
  historical-text \
  --resolution-mode large \
  --shuffle
```

### 2. Extract Tables from Reports

```bash
hf jobs run --flavor a10g-large --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python main.py \
  financial-reports \
  extracted-tables \
  --resolution-mode gundam \
  --prompt "<image>\n<|grounding|>Convert the document to markdown."
```

### 3. Multi-language Documents

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python main.py \
  multilingual-docs \
  ocr-output \
  --resolution-mode base
```
## Related Resources

- Model: [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
- vLLM: [vllm-project/vllm](https://github.com/vllm-project/vllm)
- HF Jobs: Documentation
## License

MIT License - See model card for details

## Acknowledgments

- DeepSeek AI for the OCR model
- vLLM team for the inference engine
- Hugging Face for Jobs infrastructure

Built with ❤️ using vLLM and DeepSeek-OCR