deepseek-ai/DeepSeek-OCR is out! 🔥 my take ⤵️
> pretty insane that it can parse and re-render charts in HTML
> it concatenates CLIP and SAM features, so better grounding
> very strong performance per vision token
> covers 100 languages
DeepSeek-OCR is a new open-source, vision-language OCR model from DeepSeek-AI (the same lab behind the DeepSeek-V and DeepSeek-R series). It’s built to read complex, real-world documents — screenshots, PDFs, forms, tables, and handwritten or noisy text — and output clean, structured Markdown.
---
⚙️ Core capabilities
Multimodal (Vision + Language): Uses a hybrid vision encoder + causal text decoder to “see” layouts and generate text like a language model rather than just classifying characters.
Markdown output: Instead of raw text, it structures output with Markdown syntax — headings, bullet lists, tables, and inline formatting — which makes the results ideal for direct use in notebooks or LLM pipelines.
PDF-aware: Includes a built-in PDF runner that automatically slices pages into tiles, processes each region, and re-assembles multi-page outputs.
Adaptive tiling (“crop_mode”): Automatically splits large pages into overlapping tiles for better recognition of dense, small fonts (the “Gundam mode” mentioned in their docs); see the usage sketch after this list.
Vision backbone: A hybrid encoder that concatenates SAM and CLIP features (the grounding advantage noted above), trained on massive document and scene-text corpora. It handles resolutions up to 1280 × 1280 px and dynamically scales lower.
Language head: A causal decoder in the ≈3B-parameter range from the DeepSeek MoE family, fine-tuned for text reconstruction, so it can reason about table alignment, code blocks, and list structures.
Open and MIT-licensed: Weights and inference code are fully open under the MIT license, allowing integration into other projects or retraining for domain-specific OCR.
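Here is a minimal loading-and-inference sketch, assuming the Hugging Face transformers path with the repo's custom remote code. The `infer` helper and its arguments follow the pattern shown on the model card, so treat the exact signature, prompt string, and file paths as assumptions to verify against the current README:

```python
# Hedged sketch: run DeepSeek-OCR on a single page via transformers + remote code.
# The infer() helper and its argument names come from the model card's example and
# may change; "page.png" and "out/" are hypothetical paths.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

result = model.infer(
    tokenizer,
    prompt="<image>\nConvert the document to markdown.",  # ask the decoder for Markdown output
    image_file="page.png",    # hypothetical input image
    output_path="out/",       # hypothetical output directory
    crop_mode=True,           # adaptive tiling ("Gundam mode") for dense pages
)
```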
---
🆕 What’s new about its approach
Traditional OCR (e.g., Tesseract, PaddleOCR) → detects and classifies glyphs. DeepSeek-OCR → interprets the entire document as a multimodal input, reading layout and content together and generating structured Markdown rather than a character-by-character transcription.
Okay this is insane... WebGPU-accelerated semantic video tracking, powered by DINOv3 and Transformers.js! 🤯 Demo (+ source code): webml-community/DINOv3-video-tracking
This will revolutionize AI-powered video editors... which can now run 100% locally in your browser, no server inference required (costs $0)! 😍
How does it work? 🤔
1️⃣ Generate and cache image features for each frame
2️⃣ Create a list of embeddings for the selected patch(es)
3️⃣ Compute the cosine similarity between each patch and the selected patch(es)
4️⃣ Highlight the patches whose score is above some threshold
... et voilà! 🥳
You can also make selections across frames to improve temporal consistency! This is super useful if the object changes its appearance slightly throughout the video.
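To make the steps concrete, here is a rough Python analogue of the browser pipeline (the actual demo runs DINOv3 through Transformers.js on WebGPU); the checkpoint id, the 0.6 threshold, and the token layout (CLS + register tokens before the patch tokens) are assumptions worth double-checking:

```python
# Hedged sketch of the 4-step patch-similarity tracking described above.
import torch
import torch.nn.functional as F
from transformers import AutoImageProcessor, AutoModel

MODEL_ID = "facebook/dinov3-vits16-pretrain-lvd1689m"  # assumed DINOv3 checkpoint id
processor = AutoImageProcessor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def patch_embeddings(frame):
    """Step 1: compute (and, in a real app, cache) unit-norm patch features for one frame."""
    inputs = processor(images=frame, return_tensors="pt")
    tokens = model(**inputs).last_hidden_state[0]                      # (special + patches, dim)
    num_special = 1 + getattr(model.config, "num_register_tokens", 0)  # CLS + register tokens
    return F.normalize(tokens[num_special:], dim=-1)                   # keep patch tokens only

def track(frames, ref_frame_idx, ref_patch_indices, threshold=0.6):
    feats = [patch_embeddings(f) for f in frames]          # step 1: features for every frame
    refs = feats[ref_frame_idx][ref_patch_indices]         # step 2: embeddings of selected patches
    masks = []
    for f in feats:
        sims = f @ refs.T                                  # step 3: cosine similarity (unit-norm dot)
        masks.append(sims.max(dim=-1).values > threshold)  # step 4: keep patches above the threshold
    return masks  # one boolean patch mask per frame, ready to highlight
```

The cross-frame trick from the post maps onto step 2: concatenate reference embeddings gathered from several frames before scoring.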
Liquid just released two VLMs, at 450M and 1.6B params!
They're super fast and leverage SigLIP2 NaFlex encoders to handle native resolutions without distortion, making them ideal for on-device deployment in constrained environments like phones.
They're available today on Hugging Face, with inference and fine-tuning Colab notebooks.
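A quick, hedged inference sketch using the standard transformers image-text-to-text chat-template flow; the checkpoint id and the example image URL below are assumptions (the post doesn't name the repos), so check the model cards and the official notebooks for the exact names:

```python
# Hedged sketch: single-image chat with one of the small Liquid VLMs.
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "LiquidAI/LFM2-VL-450M"  # assumed hub id for the 450M variant
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},  # hypothetical image URL
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```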