MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs • Paper • 2508.18264 • Published Aug 25 • 25
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries • Paper • 2508.15760 • Published Aug 21 • 46
CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives • Paper • 2504.10823 • Published Apr 15 • 15
ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists • Paper • 2506.01241 • Published Jun 2 • 9
VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts • Paper • 2505.09701 • Published May 14 • 2
MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges? • Paper • 2504.09702 • Published Apr 13 • 18