MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer Paper • 2509.16197 • Published Sep 19 • 54
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing Paper • 2509.22186 • Published Sep 26 • 127
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer Paper • 2509.16197 • Published Sep 19 • 54
ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks Paper • 2503.06885 • Published Mar 10 • 4
ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks Paper • 2503.06885 • Published Mar 10 • 4 • 3
OminiControl: Minimal and Universal Control for Diffusion Transformer Paper • 2411.15098 • Published Nov 22, 2024 • 61
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation Paper • 2411.13281 • Published Nov 20, 2024 • 21
leon-se/Aria-sequential_mlp-FP8-dynamic Image-Text-to-Text • 25B • Updated Dec 29, 2024 • 2 • 6
Allegro: Open the Black Box of Commercial-Level Video Generation Model Paper • 2410.15458 • Published Oct 20, 2024 • 40
Aria: An Open Multimodal Native Mixture-of-Experts Model Paper • 2410.05993 • Published Oct 8, 2024 • 111
Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison Paper • 1910.11006 • Published Oct 24, 2019
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning Paper • 2311.18799 • Published Nov 30, 2023 • 1
Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions Paper • 2401.01827 • Published Jan 3, 2024 • 18