When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity Paper • 2509.20293 • Published Sep 24 • 7
Thinking While Listening: Simple Test Time Scaling For Audio Classification Paper • 2509.19676 • Published Sep 24 • 4
MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning Paper • 2509.21113 • Published Sep 25 • 5
Behind RoPE: How Does Causal Mask Encode Positional Information? Paper • 2509.21042 • Published Sep 25 • 8
ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning Paper • 2509.21070 • Published Sep 25 • 9
Residual Off-Policy RL for Finetuning Behavior Cloning Policies Paper • 2509.19301 • Published Sep 23 • 18
VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models Paper • 2509.19803 • Published Sep 24 • 117