19 points | by PranoyP 8 hours ago
13 comments
I appreciate the details shared in this paper, but it'd be great if they open-sourced their implementation!
Curious whether the behaviour-driven testing could be done by another LLM agent (or a group of agents), with one LLM agent testing another. Could that lead to a self-improving loop?
A powerful move beyond benchmarks — this paper redefines LLM evaluation through realistic, behavior-driven testing.
Very interesting work.
Excellent work
Interesting
Nice Work
Nice work
Great work
interesting