Scope: Selective Cross-modal Orchestration of Visual Perception Experts Paper • 2510.12974 • Published 14 days ago
Unifying Autoregressive and Diffusion-Based Sequence Generation Paper • 2504.06416 • Published Apr 8 • 3
VeritasFi: An Adaptable, Multi-tiered RAG Framework for Multi-modal Financial Question Answering Paper • 2510.10828 • Published 16 days ago • 1
Improving GUI Grounding with Explicit Position-to-Coordinate Mapping Paper • 2510.03230 • Published 25 days ago • 3
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks Paper • 2412.04626 • Published Dec 5, 2024 • 14
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding Paper • 2502.01341 • Published Feb 3 • 39
InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation Paper • 2407.06423 • Published Jul 8, 2024
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction Paper • 2503.15661 • Published Mar 19 • 2
StarFlow: Generating Structured Workflow Outputs From Sketch Images Paper • 2503.21889 • Published Mar 27 • 2
Augmenting LLM Reasoning with Dynamic Notes Writing for Complex QA Paper • 2505.16293 • Published May 22 • 2
Rendering-Aware Reinforcement Learning for Vector Graphics Generation Paper • 2505.20793 • Published May 27 • 12
AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs Paper • 2509.08031 • Published Sep 9 • 21
WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation Paper • 2508.16763 • Published Aug 22 • 2