DeepSeek-OCR is a new open-source, vision-language OCR model from DeepSeek-AI (the same lab behind the DeepSeek-V and DeepSeek-R series).
It's built to read complex, real-world documents (screenshots, PDFs, forms, tables, and handwritten or noisy text) and output clean, structured Markdown.
---
⚙️ Core capabilities
Multimodal (Vision + Language):
Uses a hybrid vision encoder + causal text decoder to "see" layouts and generate text like a language model, rather than just classifying characters.
Markdown output:
Instead of raw text, it structures output with Markdown syntax (headings, bullet lists, tables, and inline formatting), which makes the results ideal for direct use in notebooks or LLM pipelines.
PDF-aware:
Includes a built-in PDF runner that automatically slices pages into tiles, processes each region, and reassembles multi-page outputs (a rough page-by-page sketch appears further down).
Adaptive tiling ("crop_mode"):
Automatically splits large pages into overlapping tiles for better recognition of dense, small fonts (the "Gundam mode" mentioned in their docs); see the usage sketch after this list.
Vision backbone:
Based on DeepSeek-V2's VL-encoder (≈3B parameters) trained on massive document and scene-text corpora.
Handles resolutions up to 1280 × 1280 px and dynamically scales lower.
Language head:
Uses the same causal decoder family as DeepSeek-V2, fine-tuned for text reconstruction, so it can reason about table alignment, code blocks, and list structures.
Open and MIT-licensed:
Weights and inference code are fully open under the MIT license, allowing integration into other projects or retraining for domain-specific OCR.
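Putting the pieces above together, here is a minimal usage sketch: image in, Markdown out, with adaptive tiling turned on. It assumes the standard transformers loading path with trust_remote_code and an infer()-style helper like the one described for the released inference code; the exact argument names (image_file, output_path, crop_mode) and the prompt string are assumptions, so check the model card before copying this.

```python
# Minimal sketch: one document image in, structured Markdown out.
# The infer() helper and its keyword arguments are assumptions based on the
# post's description of the released inference code; verify against the repo.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()

# Illustrative prompt: ask for Markdown rather than a flat text dump.
prompt = "<image>\nConvert the document to markdown."

# crop_mode=True enables the adaptive tiling ("Gundam mode") for dense pages.
result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="scanned_invoice.png",  # hypothetical input image
    output_path="./ocr_output",
    crop_mode=True,
)
print(result)
```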
---
🔍 What's new about its approach
Traditional OCR (e.g., Tesseract, PaddleOCR) → detects and classifies glyphs.
DeepSeek-OCR → interprets the entire document as a multimodal input and generates structured text the way a language model would.
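The repo's built-in PDF runner (mentioned above) does the multi-page plumbing for you, but the idea is easy to mirror by hand. A rough sketch, assuming pdf2image (which needs poppler installed) for page rasterization and the same hedged infer() call as in the snippet above:

```python
# Rough do-it-yourself page-by-page PDF OCR. pdf2image/convert_from_path is a
# real library call; the infer() signature is the same assumption as above.
from pdf2image import convert_from_path
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().cuda()

pages = convert_from_path("report.pdf", dpi=200)  # hypothetical input PDF

markdown_pages = []
for i, page in enumerate(pages):
    page_path = f"page_{i:03d}.png"
    page.save(page_path)                      # rasterize one page to an image
    md = model.infer(
        tokenizer,
        prompt="<image>\nConvert the document to markdown.",
        image_file=page_path,
        output_path="./ocr_output",
        crop_mode=True,
    )
    markdown_pages.append(md)

# Stitch the per-page Markdown back into a single document.
with open("report.md", "w", encoding="utf-8") as f:
    f.write("\n\n---\n\n".join(markdown_pages))
```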
Google Colab notebook running DeepSeek-OCR:
https://colab.research.google.com/drive/1Fjzv3UYNoOt28HpM0RMUc8kG34EFgvuu?usp=sharing
The model (Hugging Face):
deepseek-ai/DeepSeek-OCR
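Since the weights are MIT-licensed and openly hosted, pulling them for local or offline use is a one-liner with huggingface_hub (the local_dir below is just an example path):

```python
# Download the open weights locally for offline use or fine-tuning.
from huggingface_hub import snapshot_download

snapshot_download("deepseek-ai/DeepSeek-OCR", local_dir="./DeepSeek-OCR")
```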