DeepSeek-OCR is a new open-source, vision-language OCR model from DeepSeek-AI (the same lab behind the DeepSeek-V and DeepSeek-R series).
It's built to read complex, real-world documents (screenshots, PDFs, forms, tables, and handwritten or noisy text) and output clean, structured Markdown.
---
⚙️ Core capabilities
Multimodal (Vision + Language):
Uses a hybrid vision encoder + causal text decoder to "see" layouts and generate text like a language model, rather than just classifying characters.
Markdown output:
Instead of raw text, it structures output with Markdown syntax (headings, bullet lists, tables, and inline formatting), which makes the results ideal for direct use in notebooks or LLM pipelines.
PDF-aware:
Includes a built-in PDF runner that automatically slices pages into tiles, processes each region, and reassembles multi-page outputs (a rough page-by-page sketch appears further down).
Adaptive tiling ("crop_mode"):
Automatically splits large pages into overlapping tiles for better recognition of dense, small fonts (the "Gundam mode" mentioned in their docs); see the usage sketch after this list.
Vision backbone:
Based on DeepSeek-V2's VL-encoder (≈3B parameters) trained on massive document and scene-text corpora.
Handles resolutions up to 1280 × 1280 px and dynamically scales lower.
Language head:
Uses the same causal decoder family as DeepSeek-V2, fine-tuned for text reconstruction, so it can reason about table alignment, code blocks, and list structures.
Open and MIT-licensed:
Weights and inference code are fully open under the MIT license, allowing integration into other projects or retraining for domain-specific OCR.
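Putting the pieces above together, here is a minimal usage sketch: image in, Markdown out, with adaptive tiling turned on. It assumes the standard transformers loading path with trust_remote_code and an infer()-style helper like the one described for the released inference code; the exact argument names (image_file, output_path, crop_mode) and the prompt string are assumptions, so check the model card before copying this.

```python
# Minimal sketch: one document image in, structured Markdown out.
# The infer() helper and its keyword arguments are assumptions based on the
# post's description of the released inference code; verify against the repo.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()

# Illustrative prompt: ask for Markdown rather than a flat text dump.
prompt = "<image>\nConvert the document to markdown."

# crop_mode=True enables the adaptive tiling ("Gundam mode") for dense pages.
result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="scanned_invoice.png",  # hypothetical input image
    output_path="./ocr_output",
    crop_mode=True,
)
print(result)
```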
---
🔍 What's new about its approach
Traditional OCR (e.g., Tesseract, PaddleOCR) → detects and classifies glyphs.
DeepSeek-OCR → interprets the entire document as a multimodal input and generates structured text the way a language model would.
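The repo's built-in PDF runner (mentioned above) does the multi-page plumbing for you, but the idea is easy to mirror by hand. A rough sketch, assuming pdf2image (which needs poppler installed) for page rasterization and the same hedged infer() call as in the snippet above:

```python
# Rough do-it-yourself page-by-page PDF OCR. pdf2image/convert_from_path is a
# real library call; the infer() signature is the same assumption as above.
from pdf2image import convert_from_path
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().cuda()

pages = convert_from_path("report.pdf", dpi=200)  # hypothetical input PDF

markdown_pages = []
for i, page in enumerate(pages):
    page_path = f"page_{i:03d}.png"
    page.save(page_path)                      # rasterize one page to an image
    md = model.infer(
        tokenizer,
        prompt="<image>\nConvert the document to markdown.",
        image_file=page_path,
        output_path="./ocr_output",
        crop_mode=True,
    )
    markdown_pages.append(md)

# Stitch the per-page Markdown back into a single document.
with open("report.md", "w", encoding="utf-8") as f:
    f.write("\n\n---\n\n".join(markdown_pages))
```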
Google Colab notebook running DeepSeek-OCR:
https://colab.research.google.com/drive/1Fjzv3UYNoOt28HpM0RMUc8kG34EFgvuu?usp=sharing
The model (Hugging Face):
deepseek-ai/DeepSeek-OCR
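Since the weights are MIT-licensed and openly hosted, pulling them for local or offline use is a one-liner with huggingface_hub (the local_dir below is just an example path):

```python
# Download the open weights locally for offline use or fine-tuning.
from huggingface_hub import snapshot_download

snapshot_download("deepseek-ai/DeepSeek-OCR", local_dir="./DeepSeek-OCR")
```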