---
title: DeepSeek-OCR
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
tags:
  - ocr
  - vision-language-model
  - document-processing
  - vllm
  - deepseek
license: mit
---
# DeepSeek-OCR with vLLM

High-performance document OCR using DeepSeek-OCR with vLLM for efficient batch processing.

## Quick Start with Hugging Face Jobs

Process any image dataset without needing your own GPU:
```bash
# Basic usage (Gundam mode - adaptive resolution)
hf jobs run --flavor l4x1 \
  --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  input-dataset \
  output-dataset
```
# Quick test with 10 samples
hf jobs run --flavor l4x1 \
--secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python process_dataset.py \
your-input-dataset \
your-output-dataset \
--max-samples 10
That's it! The script will:

- Process images from your dataset
- Add OCR results as a new `markdown` column
- Push results to a new dataset with automatic documentation
- View results at: https://huggingface.co/datasets/[your-output-dataset]
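
When the job completes, you can inspect the results locally. A minimal sketch using the `datasets` library, with the repo id as a placeholder:

```python
# Minimal sketch: load the pushed dataset and peek at the OCR output.
# "your-output-dataset" is a placeholder for the repo you pushed to.
from datasets import load_dataset

ds = load_dataset("your-output-dataset", split="train")
print(ds.column_names)          # original columns plus "markdown" and "inference_info"
print(ds[0]["markdown"][:500])  # first 500 characters of the OCR text
```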
## Features

### Model Capabilities

- **LaTeX equations** - Mathematical formulas preserved in LaTeX format
- **Tables** - Extracted and formatted as HTML/markdown
- **Document structure** - Headers, lists, and formatting maintained
- **Image grounding** - Spatial layout and bounding box information
- **Complex layouts** - Multi-column and hierarchical structures
- **Multilingual** - Supports multiple languages

### Performance

- **vLLM AsyncEngine** - Optimized for throughput (~2500 tokens/s on A100)
- **Multiple resolution modes** - Choose speed vs. quality
- **Large context** - Up to 8K tokens
- **Batch optimized** - Efficient async processing
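
For context, here is a minimal sketch of what async batched inference with vLLM's `AsyncLLMEngine` can look like. This is illustrative, not the Space's actual code; the model ID, prompt, and engine settings are taken from the defaults documented below, and everything else is an assumption:

```python
# Illustrative sketch only - not the Space's actual implementation.
# Model ID, prompt, and engine settings follow this README's defaults.
import asyncio
import uuid

from PIL import Image
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

PROMPT = "<image>\n<|grounding|>Convert the document to markdown."

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="deepseek-ai/DeepSeek-OCR",
        trust_remote_code=True,       # the model ships custom vision encoders
        max_model_len=8192,
        gpu_memory_utilization=0.75,
    )
)


async def ocr_one(image: Image.Image) -> str:
    """Submit one image and await the finished completion."""
    params = SamplingParams(temperature=0.0, max_tokens=8192)
    result = None
    # generate() streams partial RequestOutputs; the last one is final.
    async for output in engine.generate(
        {"prompt": PROMPT, "multi_modal_data": {"image": image}},
        params,
        request_id=str(uuid.uuid4()),
    ):
        result = output
    return result.outputs[0].text


async def main(paths: list[str]) -> list[str]:
    # Submitting all requests concurrently lets vLLM batch them internally.
    images = [Image.open(p).convert("RGB") for p in paths]
    return await asyncio.gather(*(ocr_one(img) for img in images))


if __name__ == "__main__":
    print(asyncio.run(main(["page.png"]))[0])
```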
## Resolution Modes

| Mode | Resolution | Vision Tokens | Best For |
|---|---|---|---|
| `tiny` | 512×512 | 64 | Fast testing, simple documents |
| `small` | 640×640 | 100 | Balanced speed/quality |
| `base` | 1024×1024 | 256 | High quality documents |
| `large` | 1280×1280 | 400 | Maximum quality, detailed docs |
| `gundam` | Dynamic | Adaptive | Large documents, best overall |
## Usage Examples

### Basic Processing

```bash
# Default (Gundam mode)
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  my-images-dataset \
  ocr-results
```

### Fast Processing for Testing

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  large-dataset \
  test-output \
  --max-samples 100
```

### Random Sampling

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  ordered-dataset \
  random-sample \
  --max-samples 50 \
  --shuffle \
  --seed 42
```

### Custom Image Column

```bash
hf jobs run --flavor a10g-large --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  davanstrien/ufo-ColPali \
  ufo-ocr \
  --image-column image
```

### Private Output Dataset

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  private-input \
  private-output \
  --private
```
## Command-Line Options

### Required Arguments

| Argument | Description |
|---|---|
| `input_dataset` | Input dataset ID from Hugging Face Hub |
| `output_dataset` | Output dataset ID for Hugging Face Hub |

### Optional Arguments

| Option | Default | Description |
|---|---|---|
| `--image-column` | `image` | Column containing images |
| `--model` | `deepseek-ai/DeepSeek-OCR` | Model to use |
| `--resolution-mode` | `gundam` | Resolution preset (`tiny`/`small`/`base`/`large`/`gundam`) |
| `--max-model-len` | `8192` | Maximum model context length |
| `--max-tokens` | `8192` | Maximum tokens to generate |
| `--gpu-memory-utilization` | `0.75` | GPU memory usage (0.0-1.0) |
| `--prompt` | `<image>\n<\|grounding\|>Convert...` | Custom prompt |
| `--hf-token` | - | Hugging Face API token (or use env var) |
| `--split` | `train` | Dataset split to process |
| `--max-samples` | None | Limit samples (for testing) |
| `--private` | False | Make output dataset private |
| `--shuffle` | False | Shuffle dataset before processing |
| `--seed` | `42` | Random seed for shuffling |
## Output Format

The script adds two new columns to your dataset:

- `markdown` - The OCR text in markdown format
- `inference_info` - JSON metadata about the processing
### Inference Info Structure

```json
[
  {
    "column_name": "markdown",
    "model_id": "deepseek-ai/DeepSeek-OCR",
    "processing_date": "2025-10-21T12:00:00",
    "resolution_mode": "gundam",
    "base_size": 1024,
    "image_size": 640,
    "crop_mode": true,
    "prompt": "<image>\n<|grounding|>Convert the document to markdown.",
    "max_tokens": 8192,
    "gpu_memory_utilization": 0.75,
    "max_model_len": 8192,
    "script": "main.py",
    "script_version": "1.0.0",
    "space_url": "https://huggingface.co/spaces/davanstrien/deepseek-ocr",
    "implementation": "vllm-async (optimized)"
  }
]
```
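
Assuming the metadata is serialized as a JSON string in the `inference_info` column (as the structure above suggests), a minimal sketch for reading it back:

```python
# Minimal sketch: parse per-row processing metadata from the output dataset.
# Assumes inference_info is stored as a JSON string; the repo id is a placeholder.
import json

from datasets import load_dataset

ds = load_dataset("your-output-dataset", split="train")
info = json.loads(ds[0]["inference_info"])  # a list, one entry per OCR column
for entry in info:
    print(entry["column_name"], entry["model_id"], entry["resolution_mode"])
```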
## Technical Details

### Architecture

- **Model**: DeepSeek-OCR (3B parameters, based on Qwen2.5-VL)
- **Inference Engine**: vLLM 0.8.5 with AsyncEngine
- **Image Preprocessing**: Custom dynamic tiling based on aspect ratio
- **Vision Encoders**: Custom CLIP + SAM encoders
- **Context Length**: Up to 8K tokens
- **Optimization**: Flash Attention 2.7.3, async batch processing
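
To make the dynamic-tiling idea concrete, here is an illustrative sketch (not DeepSeek-OCR's actual preprocessing) of cutting a page into fixed-size tiles driven by its aspect ratio:

```python
# Illustrative sketch of aspect-ratio-based tiling - NOT the model's real
# preprocessing. Scale the short side to the tile size, then cut along
# the long side so tall or wide pages become several square-ish tiles.
from PIL import Image


def tile_image(image: Image.Image, tile: int = 1024) -> list[Image.Image]:
    w, h = image.size
    scale = tile / min(w, h)
    image = image.resize((round(w * scale), round(h * scale)))
    w, h = image.size
    tiles = []
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            box = (left, top, min(left + tile, w), min(top + tile, h))
            tiles.append(image.crop(box))
    return tiles
```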
### Hardware Requirements

- **Minimum**: L4 GPU (24GB VRAM) - `--flavor l4x1`
- **Recommended**: L40S/A10G (48GB VRAM) - `--flavor l40sx1` or `--flavor a10g-large`
- **Maximum Performance**: A100 (40GB+ VRAM) - `--flavor a100-large`
### Speed Benchmarks

| GPU | Resolution | Speed | Notes |
|---|---|---|---|
| L4 | Tiny | ~5-8 img/s | Good for testing |
| L4 | Gundam | ~2-3 img/s | Balanced |
| A100 | Gundam | ~8-12 img/s | Production speed |
| A100 | Large | ~5-7 img/s | Maximum quality |
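
These rates translate directly into wall-clock estimates; a back-of-the-envelope sketch using the mid-range figures from the table:

```python
# Rough runtime estimates from the benchmark table above (mid-range rates).
def estimated_hours(num_images: int, images_per_second: float) -> float:
    return num_images / images_per_second / 3600


# e.g. 50,000 pages in Gundam mode:
print(f"L4:   {estimated_hours(50_000, 2.5):.1f} h")   # ~5.6 h at ~2.5 img/s
print(f"A100: {estimated_hours(50_000, 10.0):.1f} h")  # ~1.4 h at ~10 img/s
```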
## Example Workflows

### 1. Process Historical Documents

```bash
hf jobs run --flavor l40sx1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python main.py \
  historical-scans \
  historical-text \
  --resolution-mode large \
  --shuffle
```

### 2. Extract Tables from Reports

```bash
hf jobs run --flavor a10g-large --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python main.py \
  financial-reports \
  extracted-tables \
  --resolution-mode gundam \
  --prompt "<image>\n<|grounding|>Convert the document to markdown."
```

### 3. Multi-language Documents

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python main.py \
  multilingual-docs \
  ocr-output \
  --resolution-mode base
```
## Related Resources

- Model: [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
- vLLM: [vllm-project/vllm](https://github.com/vllm-project/vllm)
- HF Jobs: Documentation
## License

MIT License - See model card for details

## Acknowledgments

- DeepSeek AI for the OCR model
- vLLM team for the inference engine
- Hugging Face for Jobs infrastructure

Built with ❤️ using vLLM and DeepSeek-OCR