---
title: DeepSeek-OCR
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
tags:
  - ocr
  - vision-language-model
  - document-processing
  - vllm
  - deepseek
license: mit
---

# DeepSeek-OCR with vLLM

High-performance document OCR using DeepSeek-OCR with vLLM for efficient batch processing.

## 🚀 Quick Start with Hugging Face Jobs

Process any image dataset without needing your own GPU:

```bash
# Basic usage (Gundam mode - adaptive resolution)
hf jobs run --flavor l4x1 \
    --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python process_dataset.py \
    input-dataset \
    output-dataset

# Quick test with 10 samples
hf jobs run --flavor l4x1 \
    --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python process_dataset.py \
    your-input-dataset \
    your-output-dataset \
    --max-samples 10
```

That's it! The script will:

- ✅ Process images from your dataset
- ✅ Add OCR results as a new `markdown` column
- ✅ Push results to a new dataset with automatic documentation
- 📊 View results at: `https://huggingface.co/datasets/[your-output-dataset]`
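
Once the job finishes, you can inspect the output directly with the `datasets` library. A minimal sketch (the dataset ID below is a placeholder for whatever output dataset you chose):

```python
from datasets import load_dataset

# Load the OCR results pushed by the job (replace with your output dataset ID)
ds = load_dataset("your-username/your-output-dataset", split="train")

print(ds.column_names)          # e.g. ['image', 'markdown', 'inference_info']
print(ds[0]["markdown"][:500])  # preview the first OCR transcription
```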

## 📋 Features

### Model Capabilities

- 📐 **LaTeX equations** - Mathematical formulas preserved in LaTeX format
- 📊 **Tables** - Extracted and formatted as HTML/markdown
- 📝 **Document structure** - Headers, lists, and formatting maintained
- 🖼️ **Image grounding** - Spatial layout and bounding box information
- 🔍 **Complex layouts** - Multi-column and hierarchical structures
- 🌍 **Multilingual** - Supports multiple languages

### Performance

- ⚡ **vLLM AsyncEngine** - Optimized for throughput (~2500 tokens/s on A100)
- 🎯 **Multiple resolution modes** - Choose speed vs. quality
- 🔥 **Large context** - Up to 8K tokens
- 💪 **Batch optimized** - Efficient async processing

## 🎛️ Resolution Modes

| Mode | Resolution | Vision Tokens | Best For |
|------|------------|---------------|----------|
| `tiny` | 512×512 | 64 | Fast testing, simple documents |
| `small` | 640×640 | 100 | Balanced speed/quality |
| `base` | 1024×1024 | 256 | High-quality documents |
| `large` | 1280×1280 | 400 | Maximum quality, detailed docs |
| `gundam` | Dynamic | Adaptive | Large documents, best overall |
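
Each preset boils down to three preprocessing parameters that also show up in `inference_info`: a base size, a tile image size, and whether dynamic cropping is enabled. A rough sketch of that mapping, inferred from the table above and the `inference_info` example further down (the non-gundam values are assumed to reuse their advertised square resolution; the helper name is illustrative, not the script's actual API):

```python
# Illustrative mapping of resolution presets to preprocessing parameters.
# The gundam values match the inference_info example below; the other modes
# are assumed to use their advertised square resolution with cropping off.
RESOLUTION_MODES = {
    "tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False},
    "small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False},
    "base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False},
    # Gundam: dynamic tiling into 640x640 crops plus a 1024x1024 global view
    "gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True},
}

def get_resolution_config(mode: str) -> dict:
    """Return the preprocessing parameters for a resolution preset."""
    return RESOLUTION_MODES[mode]
```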

## 💻 Usage Examples

### Basic Processing

```bash
# Default (Gundam mode)
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python process_dataset.py \
    my-images-dataset \
    ocr-results
```

### Fast Processing for Testing

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python process_dataset.py \
    large-dataset \
    test-output \
    --max-samples 100
```

### Random Sampling

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python process_dataset.py \
    ordered-dataset \
    random-sample \
    --max-samples 50 \
    --shuffle \
    --seed 42
```

### Custom Image Column

```bash
hf jobs run --flavor a10g-large --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python process_dataset.py \
    davanstrien/ufo-ColPali \
    ufo-ocr \
    --image-column image
```

### Private Output Dataset

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python process_dataset.py \
    private-input \
    private-output \
    --private
```

πŸ“ Command-Line Options

Required Arguments

Argument Description
input_dataset Input dataset ID from Hugging Face Hub
output_dataset Output dataset ID for Hugging Face Hub

Optional Arguments

Option Default Description
--image-column image Column containing images
--model deepseek-ai/DeepSeek-OCR Model to use
--resolution-mode gundam Resolution preset (tiny/small/base/large/gundam)
--max-model-len 8192 Maximum model context length
--max-tokens 8192 Maximum tokens to generate
--gpu-memory-utilization 0.75 GPU memory usage (0.0-1.0)
--prompt <image>\n<|grounding|>Convert... Custom prompt
--hf-token - Hugging Face API token (or use env var)
--split train Dataset split to process
--max-samples None Limit samples (for testing)
--private False Make output dataset private
--shuffle False Shuffle dataset before processing
--seed 42 Random seed for shuffling

## 📊 Output Format

The script adds two new columns to your dataset:

1. `markdown` - The OCR text in markdown format
2. `inference_info` - JSON metadata about the processing

### Inference Info Structure

```json
[
  {
    "column_name": "markdown",
    "model_id": "deepseek-ai/DeepSeek-OCR",
    "processing_date": "2025-10-21T12:00:00",
    "resolution_mode": "gundam",
    "base_size": 1024,
    "image_size": 640,
    "crop_mode": true,
    "prompt": "<image>\n<|grounding|>Convert the document to markdown.",
    "max_tokens": 8192,
    "gpu_memory_utilization": 0.75,
    "max_model_len": 8192,
    "script": "main.py",
    "script_version": "1.0.0",
    "space_url": "https://huggingface.co/spaces/davanstrien/deepseek-ocr",
    "implementation": "vllm-async (optimized)"
  }
]
```
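
Reading the metadata back is straightforward. A minimal sketch, assuming `inference_info` is stored as a JSON-encoded string per row (the dataset ID is a placeholder):

```python
import json
from datasets import load_dataset

# Replace with your own output dataset ID
ds = load_dataset("your-username/your-output-dataset", split="train")

# inference_info is a list so that later processing passes can append entries
info = json.loads(ds[0]["inference_info"])
for entry in info:
    print(entry["column_name"], entry["model_id"], entry["resolution_mode"])
```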

## 🔧 Technical Details

### Architecture

- **Model**: DeepSeek-OCR (3B-parameter MoE decoder)
- **Inference Engine**: vLLM 0.8.5 with AsyncEngine
- **Image Preprocessing**: Custom dynamic tiling based on aspect ratio
- **Vision Encoders**: Custom CLIP + SAM encoders
- **Context Length**: Up to 8K tokens
- **Optimization**: Flash Attention 2.7.3, async batch processing
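
The async batch processing amounts to submitting each image+prompt pair to vLLM's `AsyncLLMEngine` and gathering the results concurrently. A minimal sketch of that pattern, not the Space's actual code (the engine arguments mirror the defaults documented above; whether `trust_remote_code` is required is an assumption):

```python
import asyncio
from PIL import Image
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

PROMPT = "<image>\n<|grounding|>Convert the document to markdown."

async def ocr_one(engine: AsyncLLMEngine, image: Image.Image, request_id: str) -> str:
    sampling = SamplingParams(temperature=0.0, max_tokens=8192)
    # vLLM accepts a dict of prompt text plus multi-modal inputs
    request = {"prompt": PROMPT, "multi_modal_data": {"image": image}}
    final = None
    async for output in engine.generate(request, sampling, request_id):
        final = output  # keep the last (finished) output
    return final.outputs[0].text

async def main() -> None:
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(
            model="deepseek-ai/DeepSeek-OCR",
            trust_remote_code=True,       # assumption: custom model code
            max_model_len=8192,
            gpu_memory_utilization=0.75,
        )
    )
    images = [Image.open("page.png")]  # placeholder input
    texts = await asyncio.gather(
        *(ocr_one(engine, img, f"req-{i}") for i, img in enumerate(images))
    )
    print(texts[0])

if __name__ == "__main__":
    asyncio.run(main())
```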

### Hardware Requirements

- **Minimum**: L4 GPU (24GB VRAM) - `--flavor l4x1`
- **Recommended**: L40S (48GB VRAM) or A10G (24GB VRAM) - `--flavor l40sx1` or `--flavor a10g-large`
- **Maximum Performance**: A100 (40GB+ VRAM) - `--flavor a100-large`

### Speed Benchmarks

| GPU | Resolution | Speed | Notes |
|-----|------------|-------|-------|
| L4 | Tiny | ~5-8 img/s | Good for testing |
| L4 | Gundam | ~2-3 img/s | Balanced |
| A100 | Gundam | ~8-12 img/s | Production speed |
| A100 | Large | ~5-7 img/s | Maximum quality |

## 📚 Example Workflows

### 1. Process Historical Documents

```bash
hf jobs run --flavor l40sx1 --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python main.py \
    historical-scans \
    historical-text \
    --resolution-mode large \
    --shuffle
```

### 2. Extract Tables from Reports

```bash
hf jobs run --flavor a10g-large --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python main.py \
    financial-reports \
    extracted-tables \
    --resolution-mode gundam \
    --prompt "<image>\n<|grounding|>Convert the document to markdown."
```

### 3. Multi-language Documents

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python main.py \
    multilingual-docs \
    ocr-output \
    --resolution-mode base
```

## 🔗 Related Resources

## 📄 License

MIT License - see the model card for details.

πŸ™ Acknowledgments

  • DeepSeek AI for the OCR model
  • vLLM team for the inference engine
  • Hugging Face for Jobs infrastructure

Built with ❤️ using vLLM and DeepSeek-OCR