Yi Cui

onekq

https://onekq.ai

AI & ML interests

Benchmark, Code Generation Model

Recent Activity

replied to their post 3 days ago

Context rot is such a catchy phrase, but the problem was identified 2+ years ago under the name attention decay (https://huggingface.co/papers/2307.03172). I spotted the same problem in coding tasks and documented it in my book (https://www.amazon.com/dp/9999331130). Why did this problem become hot again? Because many of us thought long-context models had solved it, which is not true. We were misled by benchmarks. Most long-context benchmarks are built around the QA scenario, i.e. "finding a needle in a haystack". But in agentic scenarios, the model needs to find EVERYTHING in the haystack, and it just can't afford enough attention for that challenge.
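To make the gap between the two settings concrete, here is a toy sketch. It is not taken from the linked paper or the book: the function names are hypothetical, and the assumption that each fact is recalled independently with the same probability is a deliberate simplification.

```python
def needle_score(found: set[str], needle: str) -> float:
    """QA-style scoring: full credit if the single target fact was retrieved."""
    return 1.0 if needle in found else 0.0


def find_all_score(found: set[str], required: set[str]) -> float:
    """Agentic-style scoring: credit only if every required fact was retrieved."""
    return 1.0 if required <= found else 0.0


def expected_find_all_success(p: float, k: int) -> float:
    """Chance of recalling all k facts, ASSUMING each one is recalled
    independently with probability p (an illustrative assumption only)."""
    return p ** k


if __name__ == "__main__":
    required = {"fact_a", "fact_b", "fact_c"}
    found = {"fact_a", "fact_b"}  # the model missed one fact

    print(needle_score(found, "fact_a"))     # the needle-style check passes
    print(find_all_score(found, required))   # the find-everything check fails

    # Per-fact recall of 99% looks solved on a single-needle benchmark,
    # yet the chance of getting all 50 facts right is much lower.
    print(expected_find_all_success(0.99, 50))
```

Under these assumptions, 99% per-fact recall drops to roughly 60% when 50 facts must all be found, which is one way a benchmark built on single needles can overstate how well a model copes with agentic workloads.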

authored a paper 6 months ago

Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation

Paper • 2505.09027 • Published May 13

authored 3 papers about 1 year ago

A Case Study of Web App Coding with OpenAI Reasoning Models

Paper • 2409.13773 • Published Sep 19, 2024 • 6

WebApp1K: A Practical Code-Generation Benchmark for Web App Development

Paper • 2408.00019 • Published Jul 30, 2024 • 1

Insights from Benchmarking Frontier Language Models on Web App Code Generation

Paper • 2409.05177 • Published Sep 8, 2024 • 7