---
title: DeepSeek-OCR
emoji: π
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
tags:
  - ocr
  - vision-language-model
  - document-processing
  - vllm
  - deepseek
license: mit
---

# DeepSeek-OCR with vLLM

High-performance document OCR using [DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR) with vLLM for efficient batch processing.

## Quick Start with Hugging Face Jobs

Process any image dataset without needing your own GPU:

```bash
# Basic usage (Gundam mode - adaptive resolution)
hf jobs run --flavor l4x1 \
  --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  input-dataset \
  output-dataset

# Quick test with 10 samples
hf jobs run --flavor l4x1 \
  --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  your-input-dataset \
  your-output-dataset \
  --max-samples 10
```

That's it! The script will:

- Process images from your dataset
- Add OCR results as a new `markdown` column
- Push results to a new dataset with automatic documentation
- View results at: `https://huggingface.co/datasets/[your-output-dataset]`
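
Once the job finishes, the output is a regular Hub dataset, so you can inspect it with the `datasets` library. A minimal sketch (the dataset ID below is a placeholder for whatever output dataset you passed to the job):

```python
from datasets import load_dataset

# Placeholder ID - replace with the output dataset you passed to the job
ds = load_dataset("your-username/your-output-dataset", split="train")

# The OCR text is stored in the `markdown` column added by the script
print(ds.column_names)
print(ds[0]["markdown"][:500])
```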

## Features

### Model Capabilities

- **LaTeX equations** - Mathematical formulas preserved in LaTeX format
- **Tables** - Extracted and formatted as HTML/markdown
- **Document structure** - Headers, lists, and formatting maintained
- **Image grounding** - Spatial layout and bounding box information
- **Complex layouts** - Multi-column and hierarchical structures
- **Multilingual** - Supports multiple languages

### Performance

- **vLLM AsyncEngine** - Optimized for throughput (~2500 tokens/s on A100)
- **Multiple resolution modes** - Choose speed vs quality
- **Large context** - Up to 8K tokens
- **Batch optimized** - Efficient async processing

## Resolution Modes

| Mode | Resolution | Vision Tokens | Best For |
|------|-----------|---------------|----------|
| `tiny` | 512×512 | 64 | Fast testing, simple documents |
| `small` | 640×640 | 100 | Balanced speed/quality |
| `base` | 1024×1024 | 256 | High quality documents |
| `large` | 1280×1280 | 400 | Maximum quality, detailed docs |
| `gundam` | Dynamic | Adaptive | Large documents, best overall |
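
As a rough way to compare the fixed modes, you can estimate the per-image vision-token budget straight from the table above. This is a back-of-the-envelope sketch only: it ignores prompt and generated text tokens, and `gundam` is adaptive so it has no fixed count.

```python
# Fixed vision-token counts taken from the table above (gundam is adaptive, so excluded)
VISION_TOKENS = {"tiny": 64, "small": 100, "base": 256, "large": 400}

def vision_token_budget(num_images: int, mode: str) -> int:
    """Rough total vision tokens for a dataset, ignoring prompt and output tokens."""
    return num_images * VISION_TOKENS[mode]

for mode in VISION_TOKENS:
    print(f"{mode}: ~{vision_token_budget(10_000, mode):,} vision tokens for 10k images")
```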

## Usage Examples

### Basic Processing

```bash
# Default (Gundam mode)
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  my-images-dataset \
  ocr-results
```

### Fast Processing for Testing

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  large-dataset \
  test-output \
  --max-samples 100
```

### Random Sampling

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  ordered-dataset \
  random-sample \
  --max-samples 50 \
  --shuffle \
  --seed 42
```

### Custom Image Column

```bash
hf jobs run --flavor a10g-large --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  davanstrien/ufo-ColPali \
  ufo-ocr \
  --image-column image
```

### Private Output Dataset

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python process_dataset.py \
  private-input \
  private-output \
  --private
```

## Command-Line Options

### Required Arguments

| Argument | Description |
|----------|-------------|
| `input_dataset` | Input dataset ID on the Hugging Face Hub |
| `output_dataset` | Output dataset ID on the Hugging Face Hub |

### Optional Arguments

| Option | Default | Description |
|--------|---------|-------------|
| `--image-column` | `image` | Column containing images |
| `--model` | `deepseek-ai/DeepSeek-OCR` | Model to use |
| `--resolution-mode` | `gundam` | Resolution preset (tiny/small/base/large/gundam) |
| `--max-model-len` | `8192` | Maximum model context length |
| `--max-tokens` | `8192` | Maximum tokens to generate |
| `--gpu-memory-utilization` | `0.75` | Fraction of GPU memory to use (0.0-1.0) |
| `--prompt` | `<image>\n<\|grounding\|>Convert...` | Custom prompt |
| `--hf-token` | - | Hugging Face API token (or use the `HF_TOKEN` env var) |
| `--split` | `train` | Dataset split to process |
| `--max-samples` | None | Limit number of samples (for testing) |
| `--private` | False | Make the output dataset private |
| `--shuffle` | False | Shuffle the dataset before processing |
| `--seed` | `42` | Random seed for shuffling |

## Output Format

The script adds two new columns to your dataset:

1. **`markdown`** - The OCR text in markdown format
2. **`inference_info`** - JSON metadata about the processing

### Inference Info Structure

```json
[
  {
    "column_name": "markdown",
    "model_id": "deepseek-ai/DeepSeek-OCR",
    "processing_date": "2025-10-21T12:00:00",
    "resolution_mode": "gundam",
    "base_size": 1024,
    "image_size": 640,
    "crop_mode": true,
    "prompt": "<image>\n<|grounding|>Convert the document to markdown.",
    "max_tokens": 8192,
    "gpu_memory_utilization": 0.75,
    "max_model_len": 8192,
    "script": "main.py",
    "script_version": "1.0.0",
    "space_url": "https://huggingface.co/spaces/davanstrien/deepseek-ocr",
    "implementation": "vllm-async (optimized)"
  }
]
```
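
To read this metadata back from the output dataset, something like the following works. This is a sketch under two assumptions: the dataset ID is a placeholder, and `inference_info` is stored as a JSON string matching the structure shown above.

```python
import json

from datasets import load_dataset

# Placeholder ID - replace with your own output dataset
ds = load_dataset("your-username/your-output-dataset", split="train")

# Assumption: `inference_info` is a JSON-encoded list of processing records
records = json.loads(ds[0]["inference_info"])
for record in records:
    print(record["model_id"], record["resolution_mode"], record["processing_date"])
```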

## Technical Details

### Architecture

- **Model**: DeepSeek-OCR (3B parameters)
- **Inference Engine**: vLLM 0.8.5 with AsyncEngine
- **Image Preprocessing**: Custom dynamic tiling based on aspect ratio
- **Vision Encoders**: Custom CLIP + SAM encoders
- **Context Length**: Up to 8K tokens
- **Optimization**: Flash Attention 2.7.3, async batch processing

### Hardware Requirements

- **Minimum**: L4 GPU (24GB VRAM) - `--flavor l4x1`
- **Recommended**: L40S (48GB VRAM) or A10G (24GB VRAM) - `--flavor l40sx1` or `--flavor a10g-large`
- **Maximum Performance**: A100 (40GB+ VRAM) - `--flavor a100-large`

### Speed Benchmarks

| GPU | Resolution | Speed | Notes |
|-----|-----------|-------|-------|
| L4 | Tiny | ~5-8 img/s | Good for testing |
| L4 | Gundam | ~2-3 img/s | Balanced |
| A100 | Gundam | ~8-12 img/s | Production speed |
| A100 | Large | ~5-7 img/s | Maximum quality |
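
As a rough planning figure from the rates above, a 10,000-image dataset in Gundam mode works out to roughly 1-1.5 hours on an L4 (~2-3 img/s) or about 15-20 minutes on an A100 (~8-12 img/s); actual throughput varies with document size and output length.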

## Example Workflows

### 1. Process Historical Documents

```bash
hf jobs run --flavor l40sx1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python main.py \
  historical-scans \
  historical-text \
  --resolution-mode large \
  --shuffle
```

### 2. Extract Tables from Reports

```bash
hf jobs run --flavor a10g-large --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python main.py \
  financial-reports \
  extracted-tables \
  --resolution-mode gundam \
  --prompt "<image>\n<|grounding|>Convert the document to markdown."
```

### 3. Multi-language Documents

```bash
hf jobs run --flavor l4x1 --secrets HF_TOKEN \
  hf.co/spaces/davanstrien/deepseek-ocr \
  python main.py \
  multilingual-docs \
  ocr-output \
  --resolution-mode base
```

## Related Resources

- **Model**: [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
- **vLLM**: [vllm-project/vllm](https://github.com/vllm-project/vllm)
- **HF Jobs**: [Documentation](https://huggingface.co/docs/huggingface_hub/en/guides/jobs)

## License

MIT License - see the model card for details.

## Acknowledgments

- DeepSeek AI for the OCR model
- vLLM team for the inference engine
- Hugging Face for the Jobs infrastructure

---

Built with ❤️ using [vLLM](https://github.com/vllm-project/vllm) and [DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)