Don’t presume this study has anything to do with programming. They measured an agent’s ability to search long conversations, not code.
> We evaluate on a 116-question representative subset of the LongMemEval benchmark (Wu et al., 2025), which tests an agent’s ability to answer questions over long conversations spanning multiple sessions.
Combining regex filtering with semantic ranking using multi-vector embeddings has yielded good results for me. I use ColGREP from the LightOn team as a daily driver - https://github.com/lightonai/next-plaid/blob/main/colgrep/RE...
I recently watched the new Palantir + Kirkland & Ellis fund formation platform demo, and I was surprised to see how effective the union of structured data was in an agent harness. We're used to dealing with flat files, and the comparison here is between basic ways of searching what are essentially long strings. With Palantir's "Ontology" graph framework, though, I think Kirkland is going to be able to achieve some exceptional and differentiating outcomes in legal tech. The whole idea assumes that they've got great structured data already, and perhaps that's the real valuable unknown, but giving an agent those tools is super powerful.
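For anyone wondering what "regex filter, then semantic rank" looks like in practice, here is a minimal sketch. It is not ColGREP's API: `embed`, `cosine` and `filter_then_rank` are hypothetical names, and the bag-of-words vectors are only a stand-in for the multi-vector embeddings mentioned above.

```python
import math
import re
from collections import Counter


def embed(text):
    # Toy "embedding": token counts. A real setup would call an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def filter_then_rank(docs, pattern, query):
    # Step 1: the regex acts as a hard filter on candidates.
    regex = re.compile(pattern, re.IGNORECASE)
    candidates = [d for d in docs if regex.search(d)]
    # Step 2: the survivors are ordered by similarity to the query.
    q = embed(query)
    return sorted(candidates, key=lambda d: cosine(q, embed(d)), reverse=True)


docs = [
    "def parse_config(path): load the yaml config file",
    "def parse_args(): read command line flags",
    "class ConfigError(Exception): raised when config is invalid",
]
results = filter_then_rank(docs, r"\bdef parse_", "load config file")
print(results[0])
```

The regex keeps precision high (only `def parse_...` lines survive), and the ranking step decides which of the survivors the agent sees first.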
I wrote about it[1] and came away with a different view on both Palantir and the future of agentic workflows personally.
[1] sorry, LinkedIn: https://www.linkedin.com/pulse/fund-managements-killer-app-d...
This paper's title oversells it. Like, what is chronos? Which embedding model was used, which reranker, how was the reranking done, and why is chronos so much better than Claude Code?
This is a surprising result. With structured inputs like source code, I’d expect grep to outperform semantic search, but natural language’s errors and inconsistencies seem to leave so many cracks for information to fall through.
This paper is based on quality, so I don't think it should be that surprising if you take loops into consideration. What the agent finds in the first pass can help it formulate the next grep if needed.
If you are truly bitter-lesson pilled - give the agent all the tools and let it decide which to use.
- regex (grep)
- hybrid search (bm25+vector)
this X vs Y is uninteresting when the answer can be both.
Both is usually the right answer, since you can use LLMs to do query expansion and effectively increase the recall performance of your retrieval algo
Exactly this, and this tool called qmd is what I use for the hybrid search portion. It also uses local LLMs to provide summaries on your own markdown data too. My agents use both depending on what type of search they are doing, and both provide good results.
https://github.com/tobi/qmd
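A rough sketch of the "use both" idea: run each retriever, then merge the ranked lists. Nobody in this thread says which fusion method they use, so reciprocal rank fusion (RRF) is swapped in here as one common choice. The function name, the session ids and `k=60` are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranked list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every doc it returns.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from three retrievers over the same sessions.
grep_hits = ["s3", "s1"]
bm25_hits = ["s1", "s2", "s3"]
vector_hits = ["s2", "s1", "s4"]

fused = reciprocal_rank_fusion([grep_hits, bm25_hits, vector_hits])
print(fused)
```

Query expansion fits the same shape: each LLM-generated variant of the query produces one more ranked list to pass in.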
That assumes that the agent knows which one is better. And to bake in which one is better via post-training would require a study like this to establish where each one works well
I’ve got a custom ultra-high-performance streaming semantic search I exposed as a tool, and the RL bias in Claude is almost insurmountable without copious and consistent steering. Codex will follow instructions and use the tools I ask it to, but for god's sake, between Claude asking to take a nap because it’s getting late in the session and it regressing to RL-biased tools like grep, it’s maddening. When I can get it to use my compositional tools, tool calls drop from like 20-50 to 3-4, but it’s almost impossible to steer.
It will only use tools it was trained on? What's the benefit of giving it all the tools?
I'm still disappointed that AI can't use ctags. It's used for finding strings and patterns, it's right there.
> I'm still disappointed that ai can't use ctags,
What do you mean by this? Do you mean not automatically build the index?
it inspects a project, finds the ctags files, then goes on to use grep.
Tangential: I have a hook that rewrites grep to rg, but lately I wonder if this is actually wasteful, as the model is so biased to grep. Is there a way to shim/alias, perhaps?
My CLI does something close to this:
https://github.com/gitsense/gsc-cli
`gsc grep` is just an alias for `gsc rg`, mostly because agents are much more likely to reach for “grep” than “rg”.
It works pretty well, but it is not a perfect drop-in replacement. `grep` and `ripgrep` differ in a few details, especially around glob/wildcard behaviour and flags. What I found works is to not use `grep` in search examples, and have the CLI spit out an error message for the AI saying this is `ripgrep`, so it needs to use `ripgrep` syntax.
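To make the "error message for the AI" idea concrete, here is a minimal sketch. This is not how `gsc` is implemented; `GREP_ONLY_HINTS` and `ripgrep_hints` are hypothetical names, and the table only covers a few grep flags that behave differently in ripgrep.

```python
# grep flags that mean something else (or do not exist) in ripgrep.
GREP_ONLY_HINTS = {
    "-r": "rg searches recursively by default; in rg, -r means --replace",
    "-E": "rg's default regex already supports +, ? and | unescaped; "
          "in rg, -E means --encoding",
    "--include": "use -g/--glob GLOB instead of --include",
    "--exclude": "use -g '!GLOB' instead of --exclude",
}


def ripgrep_hints(args):
    """Return one hint per grep-style flag found in args.

    Simplification: combined short flags such as -rn are not split apart.
    """
    hints = []
    for arg in args:
        flag = arg.split("=", 1)[0]  # "--include=*.py" -> "--include"
        if flag in GREP_ONLY_HINTS:
            hints.append(f"{flag}: {GREP_ONLY_HINTS[flag]}")
    return hints


for hint in ripgrep_hints(["-r", "TODO", "--include=*.py"]):
    print("this is ripgrep, not grep ->", hint)
```

A wrapper could print these hints and exit non-zero, so the agent retries with ripgrep syntax rather than getting silently different results.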
If performance is the concern, ugrep will get you most of the way there relative to gnu grep, and should be fully grep compatible in terms of syntax:
https://github.com/Genivia/ugrep#aliases
Claude Code may ship with ugrep already.
Many harnesses are doing this already, "Grep" is the tool name, ripgrep is the implementation
It depends on if it is using Grep the harness tool or Grep from the bash tool
I see it using the Bash tool infrequently though sometimes Grep. I'm on Claude Code for now due to subscription lock-in, been contemplating moving to pi though
My experience here (also a Claude user) is that the model uses different tools in different contexts. I see rg more on frontend work and grep more on backend work. I imagine it defaults to the tool it has more learning around within the context it's working in, and since for the most part it's six of one, half a dozen of the other, you'll see environment-specific usage of these tools in Claude for now. I imagine it'll eventually standardize, but we're early yet on such things.
If you'd told me a decade ago I'd finally learn some sed in 26 because I'd want to understand what the AI was doing I'd have told you you were crazy . . .
Why do you have subscription lock-in? Even if you pay for a yearly subscription, Anthropic will refund you pro rata if you cancel early.
I've been on the lookout for any harness that properly secures a protocol to the LLM, but they're all just "here's some tools, hopefully you don't use bash for everything".
I have always used traditional grep to search codebases. It serves me better than an IDE when there’re lots of scattered and frequent queries.
grep’s design is surprisingly winning, exceeding expectations to this day.
you might be interested in https://github.com/boyter/cs
pretty fast and neat project to search code interactively with a lot of optimizations on finding the right thing
Is <blank> the only ML paper title?
Feels important, but I wish they also had compared against something like MeiliSearch or Algolia.
100%, there's even Typesense, open source Algolia, which can do hybrid search and a number of other fancy things
I'm currently working on a markdown kb / search tool for my agents, in part built on TS
I'm curious to see what patterns it's grepping.
Surely 'strings' would be even better?
This has been posted before, but a dead-simple pattern that helps enormously with steering the model to the right code area is a DESIGN.md that it creates, updates, and references periodically.
What does it contain?