I'm thinking about an open-source project for a context compressor (algorithmic, with at most a small on-premise model) for agent builders. Does this make sense? If so, what should it look like?
Why did this problem become hot again? Because many of us thought long-context models had solved it, which turned out not to be true.
Here we were misled by benchmarks. Most long-context benchmarks are built around the QA scenario, i.e. "finding a needle in a haystack". But in agentic scenarios, the model needs to find EVERYTHING in the haystack, and simply can't afford enough attention for that challenge.
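To make the question concrete, here is one way the core API could look: take the agent's message history plus a token budget, return a compressed history. This is a minimal sketch with hypothetical names (Message, compress) and made-up heuristics, not a reference design:

```python
from dataclasses import dataclass, replace

@dataclass
class Message:
    role: str      # "system" | "user" | "assistant" | "tool"
    content: str

def est_tokens(msgs: list[Message]) -> int:
    # Crude heuristic: ~4 characters per token on English text.
    return sum(len(m.content) for m in msgs) // 4

def compress(history: list[Message], budget_tokens: int) -> list[Message]:
    """Shrink an agent transcript to fit budget_tokens (hypothetical API)."""
    keep_tail = 4  # keep the most recent turns verbatim
    head = history[:1] if history and history[0].role == "system" else []
    body = history[len(head):]
    tail, middle = body[-keep_tail:], body[:-keep_tail]

    # Pass 1: clip bulky tool outputs, which dominate agent transcripts.
    clipped = []
    for m in middle:
        if m.role == "tool" and len(m.content) > 1500:
            m = replace(m, content=m.content[:1000] + "\n...[truncated]...")
        clipped.append(m)

    # Pass 2: drop the oldest middle turns until the budget is met.
    # (This is where a small on-premise model could summarize instead of dropping.)
    while clipped and est_tokens(head + clipped + tail) > budget_tokens:
        clipped.pop(0)
    return head + clipped + tail
```

The point of an algorithmic-first design: clipping and dropping are deterministic and auditable; the optional small model only enters at the summarization step.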
I'm a developer at heart. As a developer, the majority of your time is spent running things and hopping around environments, e.g. IDE, cloud, GitHub. These environments all happen to have full-featured bash support, a perfect sandbox for the CLI form factor.
The paradigm change AI brought to the developer world is nothing short of meteoric, but it is also an exception. Lots of efforts are trying to generalize the momentum to the next area(s). I won't bet on them.
I am on the model layer and focus on atomic tasks, so I don't get involved in product discussions. But this provocative article stirred the community quite a bit. A case in point is Claude Code, which happens to be my biggest productivity revolution since ChatGPT.
RAG predates TUIs and agents, so to be fair it's quite an achievement that it has survived the AI evolution. But I feel it is overshadowed by context engineering in the agent era. How does everyone feel about this?
I tried to test the DeepSeek OCR model on a diagram-to-SQL task, i.e. visualize a SQL schema as an E-R diagram, then combine the diagram with a natural-language question as the prompt. The model outputs a SQL query, but an unusable one. The multimodal model (DeepSeek VL) performs better, but the good old coding LLM is far better.
So this model is still, and is meant to be, an OCR model. It does compress long context in a new way, but it will have to be trained for the other tasks where long context is applied. OCR itself doesn't need long context.
TLDR: lots of work will have to be done to make this mainstream.
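If you want to reproduce this kind of test, a minimal harness could look like the sketch below. It is not my exact setup: graphviz stands in for the E-R rendering, an OpenAI-compatible vision endpoint stands in for the model under test, the model name is a placeholder, and foreign-key edges are omitted for brevity.

```python
import base64
from graphviz import Digraph
from openai import OpenAI

def render_er_diagram(tables: dict[str, list[str]], path: str = "schema") -> str:
    """Render a toy E-R diagram: one record node per table listing its columns."""
    g = Digraph(format="png")
    for table, columns in tables.items():
        g.node(table, f"{table}|{'|'.join(columns)}", shape="record")
    return g.render(path)  # returns e.g. "schema.png"

def ask(image_path: str, question: str) -> str:
    """Send the diagram plus a question to any OpenAI-compatible vision endpoint."""
    client = OpenAI()  # endpoint and credentials left to the environment
    img = base64.b64encode(open(image_path, "rb").read()).decode()
    resp = client.chat.completions.create(
        model="MODEL_UNDER_TEST",  # placeholder
        messages=[{"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img}"}},
            {"type": "text", "text": f"{question}\nAnswer with a single SQL query."},
        ]}],
    )
    return resp.choices[0].message.content

diagram = render_er_diagram({
    "users":  ["id", "name"],
    "orders": ["id", "user_id", "total"],
})
print(ask(diagram, "What is the total order value per user name?"))
```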
I wrote this article to explain the difference between vision tokens and text tokens. They are apples and oranges, but they are also the source of DeepSeek OCR's compression efficiency (don't forget Glyph by THUDM!).
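A back-of-envelope comparison shows why the two token types compress so differently. All numbers below are illustrative, not from the article; exact values depend on the tokenizer, rendering resolution, and model.

```python
# A ~1000-word page as text tokens vs. as vision tokens.
page_words    = 1000
text_tokens   = int(page_words * 1.3)      # ~1.3 BPE tokens per English word

img_side      = 1024                       # page rendered at 1024x1024 pixels
patch         = 16                         # ViT-style patch size
raw_patches   = (img_side // patch) ** 2   # 4096 raw patches
downsample    = 16                         # conv compressor folds 16 patches into 1
vision_tokens = raw_patches // downsample  # 256 vision tokens

print(text_tokens, vision_tokens, f"{text_tokens / vision_tokens:.1f}x")
# -> 1300 256 5.1x: the same page, several times fewer tokens
```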
WebApp1K measures one of the oldest and simplest kinds of tasks, one that predates ChatGPT. It is code completion; you can also consider it a translation task mapping a test spec into code. It requires no conversation, no RL, and no reasoning (though reasoning sometimes helps).
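To make the task shape concrete: the model receives a test spec and must emit code that passes it, nothing more. WebApp1K itself uses React apps tested with Jest; below is the same idea as a simplified Python analogy, not an actual benchmark case.

```python
# The prompt is essentially a test spec like this...
TEST_SPEC = '''
def test_cart_total():
    cart = Cart()
    cart.add("apple", price=2, qty=3)
    assert cart.total() == 6
'''

# ...and the model must emit code that makes it pass. No conversation,
# no tool use, no multi-turn repair: pure spec-to-code translation.
class Cart:
    def __init__(self):
        self.items = []

    def add(self, name, price, qty):
        self.items.append((name, price, qty))

    def total(self):
        return sum(price * qty for _, price, qty in self.items)
```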
I don't think this task is on the roadmap of top labs. Otherwise you can't explain why Claude 4 posts the same 70+ score on SWE-bench, which is way more challenging than this benchmark.
Nor do I encourage model builders to optimize towards my benchmark; topping its leaderboard wouldn't be too hard in itself. I just argue that we're still in a very early phase.
What I witness now is still the same pattern: generic models dropping, strategically optimized towards famous benchmarks. Meanwhile, agent builders (top labs and startups alike) painfully prompt these models to follow their expectations, and pray they won't drift overnight.