> Quality ≈ 3–4B dense, yet faster than Qwen3-1.7B
> MoE designed to run on phones/laptops (llama.cpp / vLLM)
> Pre-trained on 12T tokens → strong math/code/IF
Okay this is insane... WebGPU-accelerated semantic video tracking, powered by DINOv3 and Transformers.js! 🤯 Demo (+ source code): webml-community/DINOv3-video-tracking
This will revolutionize AI-powered video editors... which can now run 100% locally in your browser, no server inference required (costs $0)!
How does it work? 🤔
1️⃣ Generate and cache image features for each frame
2️⃣ Create a list of embeddings for selected patch(es)
3️⃣ Compute cosine similarity between each patch and the selected patch(es)
4️⃣ Highlight those whose score is above some threshold
... et voilà! 🥳
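Here's a minimal sketch of those four steps with Transformers.js (not the demo's exact code). The model id, the number of prefix tokens to skip (1 CLS + 4 register tokens), and the 0.6 threshold are all assumptions to adjust for your checkpoint:

```js
import { AutoImageProcessor, AutoModel, RawImage } from "@huggingface/transformers";

// Assumed model id -- substitute the DINOv3 ONNX checkpoint you actually use.
const MODEL_ID = "onnx-community/dinov3-vits16-pretrain-lvd1689m-ONNX";

const processor = await AutoImageProcessor.from_pretrained(MODEL_ID);
const model = await AutoModel.from_pretrained(MODEL_ID, { device: "webgpu" });

// Step 1: compute patch embeddings for one frame (cache the result per frame).
async function getPatchEmbeddings(url) {
  const image = await RawImage.read(url);
  const inputs = await processor(image);
  const { last_hidden_state } = await model(inputs); // [1, numTokens, dim]
  const [, numTokens, dim] = last_hidden_state.dims;
  const data = last_hidden_state.data; // flat Float32Array

  // Skip non-patch tokens; 5 assumes 1 CLS + 4 register tokens (model-dependent).
  const NUM_PREFIX_TOKENS = 5;
  const patches = [];
  for (let i = NUM_PREFIX_TOKENS; i < numTokens; ++i) {
    patches.push(data.slice(i * dim, (i + 1) * dim));
  }
  return patches;
}

// Cosine similarity between two embedding vectors.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; ++i) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Steps 2-4: mark every patch in a frame whose best similarity to any
// user-selected patch embedding clears the threshold.
function matchPatches(framePatches, selectedEmbeddings, threshold = 0.6) {
  return framePatches.map((patch) => {
    const score = Math.max(...selectedEmbeddings.map((s) => cosineSimilarity(patch, s)));
    return score >= threshold;
  });
}
```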
You can also make selections across frames to improve temporal consistency! This is super useful if the object changes its appearance slightly throughout the video.
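Building on the sketch above, cross-frame selection just means the reference list accumulates embeddings from several frames, so the reference set covers the object's appearance changes over time (`userSelections` is a hypothetical structure holding the user's picks):

```js
// Gather reference embeddings from selections made on multiple frames.
const selectedEmbeddings = [];
for (const { frameUrl, patchIndex } of userSelections) {
  const patches = await getPatchEmbeddings(frameUrl); // cached per frame in practice
  selectedEmbeddings.push(patches[patchIndex]);
}
// matchPatches() now scores each patch against every selection across frames.
```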
Liquid just released two VLMs, at 450M and 1.6B params!
They're super fast and leverage SigLIP2 NaFlex encoders to handle native resolutions without distortion, making them ideal for on-device deployment in constrained environments like phones.
They're available today on Hugging Face, with inference and fine-tuning Colab notebooks.