---
title: DeepSeek-OCR
emoji: πŸ“„
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
tags:
- ocr
- vision-language-model
- document-processing
- vllm
- deepseek
license: mit
---
# DeepSeek-OCR with vLLM
High-performance document OCR using [DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR) with vLLM for efficient batch processing.
## πŸš€ Quick Start with HuggingFace Jobs
Process any image dataset without needing your own GPU:
```bash
# Basic usage (Gundam mode - adaptive resolution)
hf jobs run --flavor l4x1 \
--secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python process_dataset.py \
input-dataset \
output-dataset
# Quick test with 10 samples
hf jobs run --flavor l4x1 \
--secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python process_dataset.py \
your-input-dataset \
your-output-dataset \
--max-samples 10
```
That's it! The script will:
- βœ… Process images from your dataset
- βœ… Add OCR results as a new `markdown` column
- βœ… Push results to a new dataset with automatic documentation
- πŸ“Š View results at: `https://huggingface.co/datasets/[your-output-dataset]`
## πŸ“‹ Features
### Model Capabilities
- πŸ“ **LaTeX equations** - Mathematical formulas preserved in LaTeX format
- πŸ“Š **Tables** - Extracted and formatted as HTML/markdown
- πŸ“ **Document structure** - Headers, lists, and formatting maintained
- πŸ–ΌοΈ **Image grounding** - Spatial layout and bounding box information
- πŸ” **Complex layouts** - Multi-column and hierarchical structures
- 🌍 **Multilingual** - Supports multiple languages
### Performance
- ⚑ **vLLM AsyncEngine** - Optimized for throughput (~2500 tokens/s on A100)
- 🎯 **Multiple resolution modes** - Choose speed vs quality
- πŸ”₯ **Large context** - Up to 8K tokens
- πŸ’ͺ **Batch optimized** - Efficient async processing
## πŸŽ›οΈ Resolution Modes
| Mode | Resolution | Vision Tokens | Best For |
|------|-----------|---------------|----------|
| `tiny` | 512Γ—512 | 64 | Fast testing, simple documents |
| `small` | 640Γ—640 | 100 | Balanced speed/quality |
| `base` | 1024Γ—1024 | 256 | High quality documents |
| `large` | 1280Γ—1280 | 400 | Maximum quality, detailed docs |
| `gundam` | Dynamic | Adaptive | Large documents, best overall |
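The presets above can be captured as a small lookup table. The `base_size`/`image_size`/`crop_mode` values below are assumptions for every mode except `gundam`, whose values appear in the `inference_info` example later in this README; the vision-token counts come straight from the table.

```python
# Sketch of the resolution presets as a lookup table.
# Only gundam's base/image/crop values are documented in this README;
# the rest are assumptions mirroring the table above.
RESOLUTION_MODES = {
    "tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False, "vision_tokens": 64},
    "small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False, "vision_tokens": 100},
    "base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False, "vision_tokens": 256},
    "large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False, "vision_tokens": 400},
    "gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True},  # adaptive tiling
}

def preset(mode: str) -> dict:
    """Return the settings for a resolution mode, falling back to gundam."""
    return RESOLUTION_MODES.get(mode, RESOLUTION_MODES["gundam"])
```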
## πŸ’» Usage Examples
### Basic Processing
```bash
# Default (Gundam mode)
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python process_dataset.py \
my-images-dataset \
ocr-results
```
### Fast Processing for Testing
```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python process_dataset.py \
large-dataset \
test-output \
--max-samples 100
```
### Random Sampling
```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python process_dataset.py \
ordered-dataset \
random-sample \
--max-samples 50 \
--shuffle \
--seed 42
```
### Custom Image Column
```bash
hf jobs run --flavor a10g-large --secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python process_dataset.py \
davanstrien/ufo-ColPali \
ufo-ocr \
--image-column image
```
### Private Output Dataset
```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python process_dataset.py \
private-input \
private-output \
--private
```
## πŸ“ Command-Line Options
### Required Arguments
| Argument | Description |
|----------|-------------|
| `input_dataset` | Input dataset ID from Hugging Face Hub |
| `output_dataset` | Output dataset ID for Hugging Face Hub |
### Optional Arguments
| Option | Default | Description |
|--------|---------|-------------|
| `--image-column` | `image` | Column containing images |
| `--model` | `deepseek-ai/DeepSeek-OCR` | Model to use |
| `--resolution-mode` | `gundam` | Resolution preset (tiny/small/base/large/gundam) |
| `--max-model-len` | `8192` | Maximum model context length |
| `--max-tokens` | `8192` | Maximum tokens to generate |
| `--gpu-memory-utilization` | `0.75` | GPU memory usage (0.0-1.0) |
| `--prompt` | `<image>\n<\|grounding\|>Convert...` | Custom prompt |
| `--hf-token` | - | Hugging Face API token (or use env var) |
| `--split` | `train` | Dataset split to process |
| `--max-samples` | None | Limit samples (for testing) |
| `--private` | False | Make output dataset private |
| `--shuffle` | False | Shuffle dataset before processing |
| `--seed` | `42` | Random seed for shuffling |
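The table above maps directly onto an `argparse` interface. The sketch below mirrors those options with their documented defaults; it is illustrative, not the script's actual parser.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the CLI surface described in the option tables above."""
    p = argparse.ArgumentParser(description="OCR a Hub image dataset with DeepSeek-OCR")
    # Required positional arguments
    p.add_argument("input_dataset", help="Input dataset ID on the Hugging Face Hub")
    p.add_argument("output_dataset", help="Output dataset ID on the Hugging Face Hub")
    # Optional arguments, defaults as documented
    p.add_argument("--image-column", default="image")
    p.add_argument("--model", default="deepseek-ai/DeepSeek-OCR")
    p.add_argument("--resolution-mode", default="gundam",
                   choices=["tiny", "small", "base", "large", "gundam"])
    p.add_argument("--max-model-len", type=int, default=8192)
    p.add_argument("--max-tokens", type=int, default=8192)
    p.add_argument("--gpu-memory-utilization", type=float, default=0.75)
    p.add_argument("--split", default="train")
    p.add_argument("--max-samples", type=int, default=None)
    p.add_argument("--private", action="store_true")
    p.add_argument("--shuffle", action="store_true")
    p.add_argument("--seed", type=int, default=42)
    return p
```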
## πŸ“Š Output Format
The script adds two new columns to your dataset:
1. **`markdown`** - The OCR text in markdown format
2. **`inference_info`** - JSON metadata about the processing
### Inference Info Structure
```json
[
{
"column_name": "markdown",
"model_id": "deepseek-ai/DeepSeek-OCR",
"processing_date": "2025-10-21T12:00:00",
"resolution_mode": "gundam",
"base_size": 1024,
"image_size": 640,
"crop_mode": true,
"prompt": "<image>\n<|grounding|>Convert the document to markdown.",
"max_tokens": 8192,
"gpu_memory_utilization": 0.75,
"max_model_len": 8192,
"script": "main.py",
"script_version": "1.0.0",
"space_url": "https://huggingface.co/spaces/davanstrien/deepseek-ocr",
"implementation": "vllm-async (optimized)"
}
]
```
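Because `inference_info` is stored as a JSON-encoded list, downstream consumers need to parse it per row. A minimal sketch, using a row shaped like the structure above:

```python
import json

def get_processing_metadata(row: dict) -> dict:
    """Parse the inference_info column (a JSON-encoded list) and return
    the entry that describes the `markdown` column."""
    entries = json.loads(row["inference_info"])
    return next(e for e in entries if e["column_name"] == "markdown")

# Example row shaped like the structure above (abbreviated):
row = {"inference_info": json.dumps([{
    "column_name": "markdown",
    "model_id": "deepseek-ai/DeepSeek-OCR",
    "resolution_mode": "gundam",
}])}
meta = get_processing_metadata(row)
```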
## πŸ”§ Technical Details
### Architecture
- **Model**: DeepSeek-OCR (3B parameters, vision encoder paired with a DeepSeek-3B-MoE decoder)
- **Inference Engine**: vLLM 0.8.5 with AsyncEngine
- **Image Preprocessing**: Custom dynamic tiling based on aspect ratio
- **Vision Encoders**: Custom CLIP + SAM encoders
- **Context Length**: Up to 8K tokens
- **Optimization**: Flash Attention 2.7.3, async batch processing
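The dynamic tiling step picks a grid of crops whose aspect ratio matches the page before encoding. The toy sketch below illustrates the idea only; it is not DeepSeek-OCR's actual preprocessing, and the tile size and cap are assumptions.

```python
def plan_tiles(width: int, height: int, max_tiles: int = 9) -> tuple[int, int]:
    """Toy sketch of aspect-ratio tiling: choose a (cols, rows) grid,
    capped at max_tiles crops, whose shape is closest to the page's
    aspect ratio. Illustrative only -- not the model's real algorithm."""
    target = width / height
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue  # respect the tile budget
            err = abs(cols / rows - target)
            if err < best_err:
                best_err, best = err, (cols, rows)
    return best
```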
### Hardware Requirements
- **Minimum**: L4 GPU (24GB VRAM) - `--flavor l4x1`
- **Recommended**: L40S/A10G (48GB VRAM) - `--flavor l40sx1` or `--flavor a10g-large`
- **Maximum Performance**: A100 (40GB+ VRAM) - `--flavor a100-large`
### Speed Benchmarks
| GPU | Resolution | Speed | Notes |
|-----|-----------|-------|-------|
| L4 | Tiny | ~5-8 img/s | Good for testing |
| L4 | Gundam | ~2-3 img/s | Balanced |
| A100 | Gundam | ~8-12 img/s | Production speed |
| A100 | Large | ~5-7 img/s | Maximum quality |
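The throughput figures above support back-of-envelope job planning; a small helper, assuming steady throughput (real speed varies with page content):

```python
def estimate_minutes(num_images: int, images_per_second: float) -> float:
    """Back-of-envelope wall-clock estimate from a throughput figure."""
    return num_images / images_per_second / 60

# e.g. 10,000 pages on an A100 in gundam mode at ~10 img/s
# takes roughly estimate_minutes(10_000, 10) minutes.
```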
## πŸ“š Example Workflows
### 1. Process Historical Documents
```bash
hf jobs run --flavor l40sx1 --secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python main.py \
historical-scans \
historical-text \
--resolution-mode large \
--shuffle
```
### 2. Extract Tables from Reports
```bash
hf jobs run --flavor a10g-large --secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python main.py \
financial-reports \
extracted-tables \
--resolution-mode gundam \
--prompt "<image>\n<|grounding|>Convert the document to markdown."
```
### 3. Multi-language Documents
```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python main.py \
multilingual-docs \
ocr-output \
--resolution-mode base
```
## πŸ”— Related Resources
- **Model**: [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
- **vLLM**: [vllm-project/vllm](https://github.com/vllm-project/vllm)
- **HF Jobs**: [Documentation](https://huggingface.co/docs/huggingface_hub/en/guides/jobs)
## πŸ“„ License
MIT License - see the [model card](https://huggingface.co/deepseek-ai/DeepSeek-OCR) for details
## πŸ™ Acknowledgments
- DeepSeek AI for the OCR model
- vLLM team for the inference engine
- Hugging Face for Jobs infrastructure
---
Built with ❀️ using [vLLM](https://github.com/vllm-project/vllm) and [DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)