---
title: DeepSeek-OCR
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
tags:
  - ocr
  - vision-language-model
  - document-processing
  - vllm
  - deepseek
license: mit
---

# DeepSeek-OCR with vLLM

High-performance document OCR using [DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR) with vLLM for efficient batch processing.

## 🚀 Quick Start with Hugging Face Jobs

Process any image dataset without needing your own GPU:

```bash
# Basic usage (Gundam mode - adaptive resolution)
hf jobs run --flavor l4x1 \
    --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python process_dataset.py \
    input-dataset \
    output-dataset

# Quick test with 10 samples
hf jobs run --flavor l4x1 \
    --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python process_dataset.py \
    your-input-dataset \
    your-output-dataset \
    --max-samples 10
```

That's it! The script will:
- ✅ Process images from your dataset
- ✅ Add OCR results as a new `markdown` column
- ✅ Push results to a new dataset with automatic documentation
- 📊 View results at: `https://huggingface.co/datasets/[your-output-dataset]`, or load them locally as shown below
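
Once the job finishes, the output dataset can be pulled back with the `datasets` library. A minimal sketch (the dataset ID below is a placeholder for whatever output name you chose):

```python
# Load the output dataset and preview the OCR results.
# "your-output-dataset" is a placeholder for the output dataset ID used above.
from datasets import load_dataset

ds = load_dataset("your-output-dataset", split="train")
print(ds.column_names)          # original columns plus the new "markdown" column
print(ds[0]["markdown"][:500])  # preview the OCR text for the first image
```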

## 📋 Features

### Model Capabilities

- 📐 **LaTeX equations** - Mathematical formulas preserved in LaTeX format
- 📊 **Tables** - Extracted and formatted as HTML/markdown
- 📝 **Document structure** - Headers, lists, and formatting maintained
- 🖼️ **Image grounding** - Spatial layout and bounding box information
- 🔍 **Complex layouts** - Multi-column and hierarchical structures
- 🌍 **Multilingual** - Supports multiple languages

### Performance

- ⚡ **vLLM AsyncEngine** - Optimized for throughput (~2500 tokens/s on A100)
- 🎯 **Multiple resolution modes** - Choose speed vs quality
- 🔥 **Large context** - Up to 8K tokens
- 💪 **Batch optimized** - Efficient async processing

## 🎛️ Resolution Modes

| Mode | Resolution | Vision Tokens | Best For |
|------|-----------|---------------|----------|
| `tiny` | 512Γ—512 | 64 | Fast testing, simple documents |
| `small` | 640Γ—640 | 100 | Balanced speed/quality |
| `base` | 1024Γ—1024 | 256 | High quality documents |
| `large` | 1280Γ—1280 | 400 | Maximum quality, detailed docs |
| `gundam` | Dynamic | Adaptive | Large documents, best overall |

## 💻 Usage Examples

### Basic Processing

```bash
# Default (Gundam mode)
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python process_dataset.py \
    my-images-dataset \
    ocr-results
```

### Fast Processing for Testing

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python process_dataset.py \
    large-dataset \
    test-output \
    --max-samples 100
```

### Random Sampling

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python process_dataset.py \
    ordered-dataset \
    random-sample \
    --max-samples 50 \
    --shuffle \
    --seed 42
```

### Custom Image Column

```bash
hf jobs run --flavor a10g-large --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python process_dataset.py \
    davanstrien/ufo-ColPali \
    ufo-ocr \
    --image-column image
```

### Private Output Dataset

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python process_dataset.py \
    private-input \
    private-output \
    --private
```

## 📝 Command-Line Options

### Required Arguments

| Argument | Description |
|----------|-------------|
| `input_dataset` | Input dataset ID from Hugging Face Hub |
| `output_dataset` | Output dataset ID for Hugging Face Hub |

### Optional Arguments

| Option | Default | Description |
|--------|---------|-------------|
| `--image-column` | `image` | Column containing images |
| `--model` | `deepseek-ai/DeepSeek-OCR` | Model to use |
| `--resolution-mode` | `gundam` | Resolution preset (tiny/small/base/large/gundam) |
| `--max-model-len` | `8192` | Maximum model context length |
| `--max-tokens` | `8192` | Maximum tokens to generate |
| `--gpu-memory-utilization` | `0.75` | GPU memory usage (0.0-1.0) |
| `--prompt` | `<image>\n<\|grounding\|>Convert...` | Custom prompt |
| `--hf-token` | - | Hugging Face API token (or use env var) |
| `--split` | `train` | Dataset split to process |
| `--max-samples` | None | Limit samples (for testing) |
| `--private` | False | Make output dataset private |
| `--shuffle` | False | Shuffle dataset before processing |
| `--seed` | `42` | Random seed for shuffling |

## 📊 Output Format

The script adds two new columns to your dataset:

1. **`markdown`** - The OCR text in markdown format
2. **`inference_info`** - JSON metadata about the processing

### Inference Info Structure

```json
[
  {
    "column_name": "markdown",
    "model_id": "deepseek-ai/DeepSeek-OCR",
    "processing_date": "2025-10-21T12:00:00",
    "resolution_mode": "gundam",
    "base_size": 1024,
    "image_size": 640,
    "crop_mode": true,
    "prompt": "<image>\n<|grounding|>Convert the document to markdown.",
    "max_tokens": 8192,
    "gpu_memory_utilization": 0.75,
    "max_model_len": 8192,
    "script": "main.py",
    "script_version": "1.0.0",
    "space_url": "https://huggingface.co/spaces/davanstrien/deepseek-ocr",
    "implementation": "vllm-async (optimized)"
  }
]
```
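
To read this metadata back after a run, parse the column with the standard `json` module. A minimal sketch, assuming each row stores the list above serialized as a JSON string (the dataset ID is again a placeholder):

```python
import json

from datasets import load_dataset

# Assumption: "inference_info" holds the JSON list shown above as a string.
ds = load_dataset("your-output-dataset", split="train")
info = json.loads(ds[0]["inference_info"])
for entry in info:
    print(entry["column_name"], entry["model_id"], entry["resolution_mode"])
```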

## 🔧 Technical Details

### Architecture

- **Model**: DeepSeek-OCR (3B parameters, DeepSeek-3B-MoE decoder)
- **Inference Engine**: vLLM 0.8.5 with AsyncEngine
- **Image Preprocessing**: Custom dynamic tiling based on aspect ratio
- **Vision Encoders**: Custom CLIP + SAM encoders
- **Context Length**: Up to 8K tokens
- **Optimization**: Flash Attention 2.7.3, async batch processing
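
The Space itself drives vLLM's AsyncEngine for throughput. As orientation only, the sketch below shows the simpler synchronous vLLM offline pattern for the same model and prompt; it is not the Space's actual code, and the exact multimodal input handling (and any extra DeepSeek-OCR model code required) may differ between vLLM versions.

```python
# Hedged sketch: generic synchronous vLLM multimodal generation with the
# DeepSeek-OCR checkpoint. Not the Space's actual implementation.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    max_model_len=8192,
    gpu_memory_utilization=0.75,
    trust_remote_code=True,  # assumption: custom model code loaded from the Hub
)
image = Image.open("page.png").convert("RGB")  # hypothetical local test image
params = SamplingParams(temperature=0.0, max_tokens=8192)
outputs = llm.generate(
    {
        "prompt": "<image>\n<|grounding|>Convert the document to markdown.",
        "multi_modal_data": {"image": image},
    },
    params,
)
print(outputs[0].outputs[0].text)
```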

### Hardware Requirements

- **Minimum**: L4 GPU (24GB VRAM) - `--flavor l4x1`
- **Recommended**: L40S (48GB VRAM) or A10G (24GB VRAM) - `--flavor l40sx1` or `--flavor a10g-large`
- **Maximum Performance**: A100 (40GB+ VRAM) - `--flavor a100-large`

### Speed Benchmarks

| GPU | Resolution | Speed | Notes |
|-----|-----------|-------|-------|
| L4 | Tiny | ~5-8 img/s | Good for testing |
| L4 | Gundam | ~2-3 img/s | Balanced |
| A100 | Gundam | ~8-12 img/s | Production speed |
| A100 | Large | ~5-7 img/s | Maximum quality |
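
These throughput figures also make it easy to estimate runtime before launching a job. A rough back-of-the-envelope sketch (actual speed varies with document complexity and resolution mode):

```python
# Rough runtime estimate based on the benchmark table above (approximate figures).
num_images = 10_000
imgs_per_second = 2.5  # L4 in Gundam mode (~2-3 img/s)
hours = num_images / imgs_per_second / 3600
print(f"~{hours:.1f} hours for {num_images:,} images")  # -> ~1.1 hours
```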

## 📚 Example Workflows

### 1. Process Historical Documents

```bash
hf jobs run --flavor l40sx1 --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python main.py \
    historical-scans \
    historical-text \
    --resolution-mode large \
    --shuffle
```

### 2. Extract Tables from Reports

```bash
hf jobs run --flavor a10g-large --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python main.py \
    financial-reports \
    extracted-tables \
    --resolution-mode gundam \
    --prompt "<image>\n<|grounding|>Convert the document to markdown."
```

### 3. Multi-language Documents

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
    hf.co/spaces/davanstrien/deepseek-ocr \
    python main.py \
    multilingual-docs \
    ocr-output \
    --resolution-mode base
```

## 🔗 Related Resources

- **Model**: [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
- **vLLM**: [vllm-project/vllm](https://github.com/vllm-project/vllm)
- **HF Jobs**: [Documentation](https://huggingface.co/docs/huggingface_hub/en/guides/jobs)

## 📄 License

MIT License - See model card for details

## 🙏 Acknowledgments

- DeepSeek AI for the OCR model
- vLLM team for the inference engine
- Hugging Face for Jobs infrastructure

---

Built with ❤️ using [vLLM](https://github.com/vllm-project/vllm) and [DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)