I'm thinking about an open-source project for a context compressor (algorithmic, with at most a small on-premise model) for agent builders. Does this make sense? If so, what should it look like?
Why did this problem become hot again? Because many of us thought long-context models had solved it, which turned out not to be true.
Here we were misled by benchmarks. Most long-context benchmarks are built around the QA scenario, i.e. "finding a needle in a haystack". But in agentic scenarios, the model needs to find EVERYTHING in the haystack, and simply can't afford enough attention for that challenge.
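To make the question concrete, here is one way the core API could look: take the agent's message history plus a token budget, return a compressed history. This is a minimal sketch with hypothetical names (Message, compress) and made-up heuristics, not a reference design:

```python
from dataclasses import dataclass, replace

@dataclass
class Message:
    role: str      # "system" | "user" | "assistant" | "tool"
    content: str

def est_tokens(msgs: list[Message]) -> int:
    # Crude heuristic: ~4 characters per token on English text.
    return sum(len(m.content) for m in msgs) // 4

def compress(history: list[Message], budget_tokens: int) -> list[Message]:
    """Shrink an agent transcript to fit budget_tokens (hypothetical API)."""
    keep_tail = 4  # keep the most recent turns verbatim
    head = history[:1] if history and history[0].role == "system" else []
    body = history[len(head):]
    tail, middle = body[-keep_tail:], body[:-keep_tail]

    # Pass 1: clip bulky tool outputs, which dominate agent transcripts.
    clipped = []
    for m in middle:
        if m.role == "tool" and len(m.content) > 1500:
            m = replace(m, content=m.content[:1000] + "\n...[truncated]...")
        clipped.append(m)

    # Pass 2: drop the oldest middle turns until the budget is met.
    # (This is where a small on-premise model could summarize instead of dropping.)
    while clipped and est_tokens(head + clipped + tail) > budget_tokens:
        clipped.pop(0)
    return head + clipped + tail
```

The point of an algorithmic-first design: clipping and dropping are deterministic and auditable; the optional small model only enters at the summarization step.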
I'm a developer at heart. As a developer, the majority of your time is spent running things and hopping around environments, e.g. IDE, cloud, GitHub. These environments all happen to have full-featured bash support, a perfect sandbox for the CLI form factor.
The paradigm change AI brought to the developer world is nothing short of meteoric, but it is also an exception. Lots of efforts are trying to generalize the momentum to the next area(s). I won't bet on them.
I am on the model layer and focus on atomic tasks, so I don't get involved in product discussions. But this provocative article stirred the community quite a bit. A case in point is Claude Code, which happens to be my biggest productivity revolution since ChatGPT.
RAG predates TUIs and agents, so to be fair it's quite an achievement that it has survived the AI evolution. But I feel it is overshadowed by context engineering in the agent era. How does everyone feel about this?
I tried to test the DeepSeek OCR model on a diagram-to-SQL task, i.e. visualize a SQL schema as an E-R diagram, then combine the diagram with a natural-language question as the prompt. The model outputs a SQL query, but an unusable one. The multimodal model (DeepSeek VL) performs better, but the good old coding LLM is far better.
So this model is still, and is meant to be, an OCR model. It does compress long context in a new way, but it will have to be trained for the other tasks where long context is applied. OCR itself doesn't need long context.
TLDR: lots of work will have to be done to make this mainstream.
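If you want to reproduce this kind of test, a minimal harness could look like the sketch below. It is not my exact setup: graphviz stands in for the E-R rendering, an OpenAI-compatible vision endpoint stands in for the model under test, the model name is a placeholder, and foreign-key edges are omitted for brevity.

```python
import base64
from graphviz import Digraph
from openai import OpenAI

def render_er_diagram(tables: dict[str, list[str]], path: str = "schema") -> str:
    """Render a toy E-R diagram: one record node per table listing its columns."""
    g = Digraph(format="png")
    for table, columns in tables.items():
        g.node(table, f"{table}|{'|'.join(columns)}", shape="record")
    return g.render(path)  # returns e.g. "schema.png"

def ask(image_path: str, question: str) -> str:
    """Send the diagram plus a question to any OpenAI-compatible vision endpoint."""
    client = OpenAI()  # endpoint and credentials left to the environment
    img = base64.b64encode(open(image_path, "rb").read()).decode()
    resp = client.chat.completions.create(
        model="MODEL_UNDER_TEST",  # placeholder
        messages=[{"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img}"}},
            {"type": "text", "text": f"{question}\nAnswer with a single SQL query."},
        ]}],
    )
    return resp.choices[0].message.content

diagram = render_er_diagram({
    "users":  ["id", "name"],
    "orders": ["id", "user_id", "total"],
})
print(ask(diagram, "What is the total order value per user name?"))
```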
I wrote this article to explain the difference between vision tokens and text tokens. They are apples and oranges, but they are also the source of DeepSeek OCR's compression efficiency (don't forget Glyph by THUDM!).
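A back-of-envelope comparison shows why the two token types compress so differently. All numbers below are illustrative, not from the article; exact values depend on the tokenizer, rendering resolution, and model.

```python
# A ~1000-word page as text tokens vs. as vision tokens.
page_words    = 1000
text_tokens   = int(page_words * 1.3)      # ~1.3 BPE tokens per English word

img_side      = 1024                       # page rendered at 1024x1024 pixels
patch         = 16                         # ViT-style patch size
raw_patches   = (img_side // patch) ** 2   # 4096 raw patches
downsample    = 16                         # conv compressor folds 16 patches into 1
vision_tokens = raw_patches // downsample  # 256 vision tokens

print(text_tokens, vision_tokens, f"{text_tokens / vision_tokens:.1f}x")
# -> 1300 256 5.1x: the same page, several times fewer tokens
```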
WebApp1K measures one of the oldest and simplest kinds of tasks, one that predates ChatGPT. It is code completion; you can also consider it a translation task mapping a test spec into code. It requires no conversation, no RL, and no reasoning (though reasoning sometimes helps).
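To make the task shape concrete: the model receives a test spec and must emit code that passes it, nothing more. WebApp1K itself uses React apps tested with Jest; below is the same idea as a simplified Python analogy, not an actual benchmark case.

```python
# The prompt is essentially a test spec like this...
TEST_SPEC = '''
def test_cart_total():
    cart = Cart()
    cart.add("apple", price=2, qty=3)
    assert cart.total() == 6
'''

# ...and the model must emit code that makes it pass. No conversation,
# no tool use, no multi-turn repair: pure spec-to-code translation.
class Cart:
    def __init__(self):
        self.items = []

    def add(self, name, price, qty):
        self.items.append((name, price, qty))

    def total(self):
        return sum(price * qty for _, price, qty in self.items)
```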
I don't think this task is on the roadmap of top labs. Otherwise you can't explain why Claude 4 posts the same 70+ score on SWE-bench, which is way more challenging than this benchmark.
Nor do I encourage model builders to optimize towards my benchmark; topping its leaderboard wouldn't be too hard in itself. I just argue that we're still in a very early phase.
What I witness now is still the same pattern: generic models dropping, strategically optimized towards famous benchmarks. Meanwhile, agent builders (top labs and startups alike) painfully prompt these models to follow their expectations, and pray they won't drift overnight.