deepseek-ai/DeepSeek-OCR is out! 🔥 my take ⤵️
> pretty insane that it can parse and re-render charts in HTML
> it concatenates CLIP and SAM features, so better grounding
> very strong performance per vision token
> covers 100 languages
DeepSeek-OCR is a new open-source, vision-language OCR model from DeepSeek-AI (the same lab behind the DeepSeek-V and DeepSeek-R series). It’s built to read complex, real-world documents — screenshots, PDFs, forms, tables, and handwritten or noisy text — and output clean, structured Markdown.
---
⚙️ Core capabilities
Multimodal (Vision + Language): Uses a hybrid vision encoder + causal text decoder to “see” layouts and generate text like a language model rather than just classifying characters.
Markdown output: Instead of raw text, it structures output with Markdown syntax — headings, bullet lists, tables, and inline formatting — which makes the results ideal for direct use in notebooks or LLM pipelines.
PDF-aware: Includes a built-in PDF runner that automatically slices pages into tiles, processes each region, and re-assembles multi-page outputs.
Adaptive tiling (“crop_mode”): Automatically splits large pages into overlapping tiles for better recognition of dense, small fonts (the “Gundam mode” mentioned in their docs); see the usage sketch after this list.
Vision backbone: A hybrid encoder that concatenates SAM and CLIP features (the grounding advantage noted above), trained on massive document and scene-text corpora. It handles resolutions up to 1280 × 1280 px and dynamically scales lower.
Language head: A causal decoder in the ≈3B-parameter range from the DeepSeek MoE family, fine-tuned for text reconstruction, so it can reason about table alignment, code blocks, and list structures.
Open and MIT-licensed: Weights and inference code are fully open under the MIT license, allowing integration into other projects or retraining for domain-specific OCR.
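Here is a minimal loading-and-inference sketch, assuming the Hugging Face transformers path with the repo's custom remote code. The `infer` helper and its arguments follow the pattern shown on the model card, so treat the exact signature, prompt string, and file paths as assumptions to verify against the current README:

```python
# Hedged sketch: run DeepSeek-OCR on a single page via transformers + remote code.
# The infer() helper and its argument names come from the model card's example and
# may change; "page.png" and "out/" are hypothetical paths.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

result = model.infer(
    tokenizer,
    prompt="<image>\nConvert the document to markdown.",  # ask the decoder for Markdown output
    image_file="page.png",    # hypothetical input image
    output_path="out/",       # hypothetical output directory
    crop_mode=True,           # adaptive tiling ("Gundam mode") for dense pages
)
```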
---
🆕 What’s new about its approach
Traditional OCR (e.g., Tesseract, PaddleOCR) → detects and classifies glyphs. DeepSeek-OCR → interprets the entire document as a multimodal input, reading layout and content together and generating structured Markdown rather than a character-by-character transcription.
Okay this is insane... WebGPU-accelerated semantic video tracking, powered by DINOv3 and Transformers.js! 🤯 Demo (+ source code): webml-community/DINOv3-video-tracking
This will revolutionize AI-powered video editors... which can now run 100% locally in your browser, no server inference required (costs $0)! 😍
How does it work? 🤔
1️⃣ Generate and cache image features for each frame
2️⃣ Create a list of embeddings for the selected patch(es)
3️⃣ Compute the cosine similarity between each patch and the selected patch(es)
4️⃣ Highlight the patches whose score is above some threshold
... et voilà! 🥳
You can also make selections across frames to improve temporal consistency! This is super useful if the object changes its appearance slightly throughout the video.
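To make the steps concrete, here is a rough Python analogue of the browser pipeline (the actual demo runs DINOv3 through Transformers.js on WebGPU); the checkpoint id, the 0.6 threshold, and the token layout (CLS + register tokens before the patch tokens) are assumptions worth double-checking:

```python
# Hedged sketch of the 4-step patch-similarity tracking described above.
import torch
import torch.nn.functional as F
from transformers import AutoImageProcessor, AutoModel

MODEL_ID = "facebook/dinov3-vits16-pretrain-lvd1689m"  # assumed DINOv3 checkpoint id
processor = AutoImageProcessor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def patch_embeddings(frame):
    """Step 1: compute (and, in a real app, cache) unit-norm patch features for one frame."""
    inputs = processor(images=frame, return_tensors="pt")
    tokens = model(**inputs).last_hidden_state[0]                      # (special + patches, dim)
    num_special = 1 + getattr(model.config, "num_register_tokens", 0)  # CLS + register tokens
    return F.normalize(tokens[num_special:], dim=-1)                   # keep patch tokens only

def track(frames, ref_frame_idx, ref_patch_indices, threshold=0.6):
    feats = [patch_embeddings(f) for f in frames]          # step 1: features for every frame
    refs = feats[ref_frame_idx][ref_patch_indices]         # step 2: embeddings of selected patches
    masks = []
    for f in feats:
        sims = f @ refs.T                                  # step 3: cosine similarity (unit-norm dot)
        masks.append(sims.max(dim=-1).values > threshold)  # step 4: keep patches above the threshold
    return masks  # one boolean patch mask per frame, ready to highlight
```

The cross-frame trick from the post maps onto step 2: concatenate reference embeddings gathered from several frames before scoring.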
Liquid just released two VLMs, at 450M and 1.6B params!
They're super fast and leverage SigLIP2 NaFlex encoders to handle native resolutions without distortion, making them ideal for on-device deployment in constrained environments like phones.
They're available today on Hugging Face, with inference and fine-tuning Colab notebooks.
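A quick, hedged inference sketch using the standard transformers image-text-to-text chat-template flow; the checkpoint id and the example image URL below are assumptions (the post doesn't name the repos), so check the model cards and the official notebooks for the exact names:

```python
# Hedged sketch: single-image chat with one of the small Liquid VLMs.
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "LiquidAI/LFM2-VL-450M"  # assumed hub id for the 450M variant
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},  # hypothetical image URL
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```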