---
title: DeepSeek-OCR
emoji: πŸ“„
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
tags:
- ocr
- vision-language-model
- document-processing
- vllm
- deepseek
license: mit
---
# DeepSeek-OCR with vLLM
High-performance document OCR using [DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR) with vLLM for efficient batch processing.
## πŸš€ Quick Start with HuggingFace Jobs
Process any image dataset without needing your own GPU:
```bash
# Basic usage (Gundam mode - adaptive resolution)
hf jobs run --flavor l4x1 \
--secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python process_dataset.py \
input-dataset \
output-dataset
# Quick test with 10 samples
hf jobs run --flavor l4x1 \
--secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python process_dataset.py \
your-input-dataset \
your-output-dataset \
--max-samples 10
```
That's it! The script will:
- βœ… Process images from your dataset
- βœ… Add OCR results as a new `markdown` column
- βœ… Push results to a new dataset with automatic documentation
- πŸ“Š View results at: `https://huggingface.co/datasets/[your-output-dataset]`
## πŸ“‹ Features
### Model Capabilities
- πŸ“ **LaTeX equations** - Mathematical formulas preserved in LaTeX format
- πŸ“Š **Tables** - Extracted and formatted as HTML/markdown
- πŸ“ **Document structure** - Headers, lists, and formatting maintained
- πŸ–ΌοΈ **Image grounding** - Spatial layout and bounding box information
- πŸ” **Complex layouts** - Multi-column and hierarchical structures
- 🌍 **Multilingual** - Supports multiple languages
### Performance
- ⚑ **vLLM AsyncEngine** - Optimized for throughput (~2500 tokens/s on A100)
- 🎯 **Multiple resolution modes** - Choose speed vs quality
- πŸ”₯ **Large context** - Up to 8K tokens
- πŸ’ͺ **Batch optimized** - Efficient async processing
## πŸŽ›οΈ Resolution Modes
| Mode | Resolution | Vision Tokens | Best For |
|------|-----------|---------------|----------|
| `tiny` | 512Γ—512 | 64 | Fast testing, simple documents |
| `small` | 640Γ—640 | 100 | Balanced speed/quality |
| `base` | 1024Γ—1024 | 256 | High quality documents |
| `large` | 1280Γ—1280 | 400 | Maximum quality, detailed docs |
| `gundam` | Dynamic | Adaptive | Large documents, best overall |
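The presets above can be captured as a small lookup table. The `base_size`/`image_size`/`crop_mode` values below are assumptions for every mode except `gundam`, whose values appear in the `inference_info` example later in this README; the vision-token counts come straight from the table.

```python
# Sketch of the resolution presets as a lookup table.
# Only gundam's base/image/crop values are documented in this README;
# the rest are assumptions mirroring the table above.
RESOLUTION_MODES = {
    "tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False, "vision_tokens": 64},
    "small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False, "vision_tokens": 100},
    "base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False, "vision_tokens": 256},
    "large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False, "vision_tokens": 400},
    "gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True},  # adaptive tiling
}

def preset(mode: str) -> dict:
    """Return the settings for a resolution mode, falling back to gundam."""
    return RESOLUTION_MODES.get(mode, RESOLUTION_MODES["gundam"])
```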
## πŸ’» Usage Examples
### Basic Processing
```bash
# Default (Gundam mode)
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python process_dataset.py \
my-images-dataset \
ocr-results
```
### Fast Processing for Testing
```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python process_dataset.py \
large-dataset \
test-output \
--max-samples 100
```
### Random Sampling
```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python process_dataset.py \
ordered-dataset \
random-sample \
--max-samples 50 \
--shuffle \
--seed 42
```
### Custom Image Column
```bash
hf jobs run --flavor a10g-large --secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python process_dataset.py \
davanstrien/ufo-ColPali \
ufo-ocr \
--image-column image
```
### Private Output Dataset
```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python process_dataset.py \
private-input \
private-output \
--private
```
## πŸ“ Command-Line Options
### Required Arguments
| Argument | Description |
|----------|-------------|
| `input_dataset` | Input dataset ID from Hugging Face Hub |
| `output_dataset` | Output dataset ID for Hugging Face Hub |
### Optional Arguments
| Option | Default | Description |
|--------|---------|-------------|
| `--image-column` | `image` | Column containing images |
| `--model` | `deepseek-ai/DeepSeek-OCR` | Model to use |
| `--resolution-mode` | `gundam` | Resolution preset (tiny/small/base/large/gundam) |
| `--max-model-len` | `8192` | Maximum model context length |
| `--max-tokens` | `8192` | Maximum tokens to generate |
| `--gpu-memory-utilization` | `0.75` | GPU memory usage (0.0-1.0) |
| `--prompt` | `<image>\n<\|grounding\|>Convert...` | Custom prompt |
| `--hf-token` | - | Hugging Face API token (or use env var) |
| `--split` | `train` | Dataset split to process |
| `--max-samples` | None | Limit samples (for testing) |
| `--private` | False | Make output dataset private |
| `--shuffle` | False | Shuffle dataset before processing |
| `--seed` | `42` | Random seed for shuffling |
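The table above maps directly onto an `argparse` interface. The sketch below mirrors those options with their documented defaults; it is illustrative, not the script's actual parser.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the CLI surface described in the option tables above."""
    p = argparse.ArgumentParser(description="OCR a Hub image dataset with DeepSeek-OCR")
    # Required positional arguments
    p.add_argument("input_dataset", help="Input dataset ID on the Hugging Face Hub")
    p.add_argument("output_dataset", help="Output dataset ID on the Hugging Face Hub")
    # Optional arguments, defaults as documented
    p.add_argument("--image-column", default="image")
    p.add_argument("--model", default="deepseek-ai/DeepSeek-OCR")
    p.add_argument("--resolution-mode", default="gundam",
                   choices=["tiny", "small", "base", "large", "gundam"])
    p.add_argument("--max-model-len", type=int, default=8192)
    p.add_argument("--max-tokens", type=int, default=8192)
    p.add_argument("--gpu-memory-utilization", type=float, default=0.75)
    p.add_argument("--split", default="train")
    p.add_argument("--max-samples", type=int, default=None)
    p.add_argument("--private", action="store_true")
    p.add_argument("--shuffle", action="store_true")
    p.add_argument("--seed", type=int, default=42)
    return p
```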
## πŸ“Š Output Format
The script adds two new columns to your dataset:
1. **`markdown`** - The OCR text in markdown format
2. **`inference_info`** - JSON metadata about the processing
### Inference Info Structure
```json
[
{
"column_name": "markdown",
"model_id": "deepseek-ai/DeepSeek-OCR",
"processing_date": "2025-10-21T12:00:00",
"resolution_mode": "gundam",
"base_size": 1024,
"image_size": 640,
"crop_mode": true,
"prompt": "<image>\n<|grounding|>Convert the document to markdown.",
"max_tokens": 8192,
"gpu_memory_utilization": 0.75,
"max_model_len": 8192,
"script": "main.py",
"script_version": "1.0.0",
"space_url": "https://huggingface.co/spaces/davanstrien/deepseek-ocr",
"implementation": "vllm-async (optimized)"
}
]
```
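Because `inference_info` is stored as a JSON-encoded list, downstream consumers need to parse it per row. A minimal sketch, using a row shaped like the structure above:

```python
import json

def get_processing_metadata(row: dict) -> dict:
    """Parse the inference_info column (a JSON-encoded list) and return
    the entry that describes the `markdown` column."""
    entries = json.loads(row["inference_info"])
    return next(e for e in entries if e["column_name"] == "markdown")

# Example row shaped like the structure above (abbreviated):
row = {"inference_info": json.dumps([{
    "column_name": "markdown",
    "model_id": "deepseek-ai/DeepSeek-OCR",
    "resolution_mode": "gundam",
}])}
meta = get_processing_metadata(row)
```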
## πŸ”§ Technical Details
### Architecture
- **Model**: DeepSeek-OCR (3B parameters, vision encoder paired with a DeepSeek-3B-MoE decoder)
- **Inference Engine**: vLLM 0.8.5 with AsyncEngine
- **Image Preprocessing**: Custom dynamic tiling based on aspect ratio
- **Vision Encoders**: Custom CLIP + SAM encoders
- **Context Length**: Up to 8K tokens
- **Optimization**: Flash Attention 2.7.3, async batch processing
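The dynamic tiling step picks a grid of crops whose aspect ratio matches the page before encoding. The toy sketch below illustrates the idea only; it is not DeepSeek-OCR's actual preprocessing, and the tile size and cap are assumptions.

```python
def plan_tiles(width: int, height: int, max_tiles: int = 9) -> tuple[int, int]:
    """Toy sketch of aspect-ratio tiling: choose a (cols, rows) grid,
    capped at max_tiles crops, whose shape is closest to the page's
    aspect ratio. Illustrative only -- not the model's real algorithm."""
    target = width / height
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue  # respect the tile budget
            err = abs(cols / rows - target)
            if err < best_err:
                best_err, best = err, (cols, rows)
    return best
```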
### Hardware Requirements
- **Minimum**: L4 GPU (24GB VRAM) - `--flavor l4x1`
- **Recommended**: L40S/A10G (48GB VRAM) - `--flavor l40sx1` or `--flavor a10g-large`
- **Maximum Performance**: A100 (40GB+ VRAM) - `--flavor a100-large`
### Speed Benchmarks
| GPU | Resolution | Speed | Notes |
|-----|-----------|-------|-------|
| L4 | Tiny | ~5-8 img/s | Good for testing |
| L4 | Gundam | ~2-3 img/s | Balanced |
| A100 | Gundam | ~8-12 img/s | Production speed |
| A100 | Large | ~5-7 img/s | Maximum quality |
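The throughput figures above support back-of-envelope job planning; a small helper, assuming steady throughput (real speed varies with page content):

```python
def estimate_minutes(num_images: int, images_per_second: float) -> float:
    """Back-of-envelope wall-clock estimate from a throughput figure."""
    return num_images / images_per_second / 60

# e.g. 10,000 pages on an A100 in gundam mode at ~10 img/s
# takes roughly estimate_minutes(10_000, 10) minutes.
```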
## πŸ“š Example Workflows
### 1. Process Historical Documents
```bash
hf jobs run --flavor l40sx1 --secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python main.py \
historical-scans \
historical-text \
--resolution-mode large \
--shuffle
```
### 2. Extract Tables from Reports
```bash
hf jobs run --flavor a10g-large --secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python main.py \
financial-reports \
extracted-tables \
--resolution-mode gundam \
--prompt "<image>\n<|grounding|>Convert the document to markdown."
```
### 3. Multi-language Documents
```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
hf.co/spaces/davanstrien/deepseek-ocr \
python main.py \
multilingual-docs \
ocr-output \
--resolution-mode base
```
## πŸ”— Related Resources
- **Model**: [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
- **vLLM**: [vllm-project/vllm](https://github.com/vllm-project/vllm)
- **HF Jobs**: [Documentation](https://huggingface.co/docs/huggingface_hub/en/guides/jobs)
## πŸ“„ License
MIT License - see the [model card](https://huggingface.co/deepseek-ai/DeepSeek-OCR) for details
## πŸ™ Acknowledgments
- DeepSeek AI for the OCR model
- vLLM team for the inference engine
- Hugging Face for Jobs infrastructure
---
Built with ❀️ using [vLLM](https://github.com/vllm-project/vllm) and [DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)