These benchmarks have even the small models absolutely demolishing Sonnet-3.5, which doesn't reflect my subjective experience.
It still seems to me that these models are 'dumb' and often don't understand what I'm asking, whereas Claude's intuition is much stronger.
To me, R1 14B even feels weaker than Qwen 2.5 14B.
Primary use-case is web technology / coding. Maybe I'm prompting it incorrectly?
There is a frustrating gap between benchmarks and real-world ability.
o1 or even o3 might be able to crack academic-level math problems, but I still wouldn't trust it to correctly fill out a McDonald's application using a PDF of my resume and a calendar of my availability.
A lot of that has to do with certainty. The GPTs and Claudes will be replacing graduate-level research assistant jobs, and other jobs that are very high skill but have soft success criteria, long before they replace travel agents, whose work is low skill but has very hard criteria for success.
The reasoning models are much better suited to questions that have answers and a conclusion to arrive at, i.e. exactly what benchmarks ask, rather than "make me a todo list app" or whatever.
It’s a bit like how you get instruct-tuned models and chat-tuned ones. Neither is really worse than the other; they're just aimed at different uses.
These benchmarks are mostly focused on math, which benefits a lot from improved CoT and is also less sensitive to the "reduced knowledge" of a smaller model.
Vibes are important in this case...
OpenAI was caught gaming benchmarks recently with FrontierMath. Just (yet another) sign that benchmarks are very flawed and everyone is training on them.
So I would not put too much weight on how the models are doing on benchmarks.
Was OpenAI demonstrated to have cheated, or just that they had access to the benchmarks and it can’t be proven they didn’t cheat? (Which is hard to do in any case).
Last I saw FrontierMath said they had a holdback set of problems specifically to ensure investors with access couldn’t cheat[1]. Or did that turn out to be a lie?
1. https://www.reddit.com/r/singularity/comments/1i4n0r5/this_i...
OpenAI funded the FrontierMath dataset and hid this relationship. Looks very fishy.
Please don’t spread misinformation. OpenAI didn’t cheat on FrontierMath. They have access to the questions, same as MMLU, MATHD, GPQA, ARC-AGI and pretty much every eval. Sure, we could all be lying, but it would be pretty self-defeating, horrible for employee morale, and quickly discovered.
Living in a world where consequences no longer matter, this doesn't say much to me. :(
Whether you suspect OpenAI is gaming benchmarks or not, it is plainly false to assert they were caught gaming benchmarks.
They were caught hiding a financial relationship with a benchmark owner while making that benchmark a centerpiece of their advertisements.
Epoch is also very vague about how much access to the dataset OpenAI had, but it's clear that they had more access than the general public: https://www.lesswrong.com/posts/8ZgLYwBmB3vLavjKE/some-lesso...
So yes, OpenAI was caught gaming it.
Where can we read some genuine non-cherrypicked conversations with this model?
You should be able to run it locally w/ something like Ollama. It's been a while since I tinkered with the local LLM tools, but 1.5B is tiny, so it should run at a decent clip even on just your CPU.
`ollama run hf.co/bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF:F16` and you're off to the races.
There are official Ollama models, so there's no need to use HF:
https://ollama.com/library/deepseek-r1/tags
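If you'd rather collect your own non-cherrypicked transcripts than eyeball them in the terminal, here is a rough Python sketch that sends a prompt to a locally running Ollama server and separates the model's reasoning from its final answer. Assumptions, not confirmed by anything in this thread: the server listens on Ollama's default `localhost:11434`, the model tag is `deepseek-r1:1.5b` (check the tags page above for what is actually published), and the model wraps its reasoning in `<think>...</think>` tags. The helper names `build_request`, `split_reasoning` and `ask` are made up for illustration.

```python
import json
import re
import urllib.request

# Assumption: a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Assumption: a tag from the official library; verify it on the tags page.
MODEL = "deepseek-r1:1.5b"


def build_request(prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build a non-streaming generate request for the local server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def split_reasoning(text: str) -> tuple[str, str]:
    """Split '<think>...</think>answer' into (reasoning, answer).

    If no think block is present, the whole text is treated as the answer.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


def ask(prompt: str) -> tuple[str, str]:
    """Send one prompt and return (reasoning, answer). Needs a running server."""
    with urllib.request.urlopen(build_request(prompt), timeout=300) as resp:
        body = json.load(resp)
    return split_reasoning(body["response"])
```

With the server up and the model pulled, something like `reasoning, answer = ask("Write a CSS rule that centers a div.")` gives you both parts, so you can log the full exchange for every prompt you try instead of only the good ones.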