> we uncover that RL-trained models excel at low k (e.g., pass@1) but are consistently outperformed by base models at high k (e.g., pass@256).
This is a weak argument. I think I get what they are trying to say, but let's take this to the extreme, say pass@10^10^100. Just like a group of monkeys could write Shakespeare if given enough time, a completely random model could probably outperform an RL-trained model at pass@10^10^100. Would we then say the random model can reason too?
Of course the correct reasoning trace will be in the base model's distribution, just like any other well-formed, coherent paragraph. Kind of makes me think, maybe sampling efficiency _is_ intelligence?
If this were just the effect you mention, you would not expect the base model to actually surpass the RL model, though. Plus, the k values they test are much smaller than that.
I think it's a very interesting and meaningful study.
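To put rough numbers on the monkeys-with-typewriters point, here is a minimal sketch of the standard per-problem pass@k formula, 1 - (1 - p)^k, with purely made-up per-sample success probabilities. A near-random sampler needs an astronomically large k before its pass@k approaches 1, so a crossover at k in the hundreds is not the infinite-monkeys effect.

```python
# Illustrative only: the per-sample success probabilities are invented.
# For independent samples with per-sample success probability p:
#   pass@k = 1 - (1 - p)^k

def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k

rl_model = 0.60        # hypothetical: often right on the first try
base_model = 0.02      # hypothetical: rarely right, but with real probability mass
random_sampler = 1e-9  # "monkeys": essentially never right per sample

for k in (1, 256, 10**6):
    print(f"k={k:>7}: RL={pass_at_k(rl_model, k):.3f}  "
          f"base={pass_at_k(base_model, k):.3f}  "
          f"random={pass_at_k(random_sampler, k):.2e}")
```

At k=256 even a 2% per-sample success rate is all but guaranteed to have hit the answer at least once, while the near-random sampler is still around 10^-7; the extreme-k argument only bites at values of k nobody actually evaluates.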
I don't like papers that ask a question in the title, so here's the answer:
"RL boosts sampling efficiency but reduces the reasoning capacity boundary."
Perhaps better to put it like this: given one or a few attempts, RL-trained models beat non-RL models; given many attempts, non-RL models come up with better answers.
My gut feeling when using DeepSeek is that its performance is a lot smoother; the responses feel more robust and less brittle.
At least with DeepSeekMath (trained with the same RL technique as the later R1), they noted similar things in their paper, in the "Why RL Works?" section. In his video review of the DeepSeekMath paper, Yannic Kilcher goes over that section and points to basically the same limitations as the HN submission paper; the segment starts around the 1:04:00 mark and ends with this:
> the Improvement is attributed to boosting the correct response from Top K rather than the enhancement of fundamental capabilities this is something that we've come to learn in a lot of different ways from like reinforcement learning on language models or even supervised fine-tuning is that what's happening most likely is more that the capabilities of doing all of these things are already present in the underlying pre-trained language model

https://www.youtube.com/watch?v=bAWV_yrqx4w&t=1h4m

From the paper:
> 5.2.2. Why RL Works?
>
> In this paper, we conduct reinforcement learning based on a subset of instruction tuning data, and it achieves significant performance enhancement upon the instruction tuning model. To further explain why reinforcement learning works. We evaluate the Pass@K and Maj@K accuracy of the Instruct and RL models on two benchmarks. As shown in Figure 7, RL enhances Maj@K’s performance but not Pass@K. These findings indicate that RL enhances the model’s overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities. Similarly, (Wang et al., 2023a) identified a misalignment problem in reasoning tasks within the SFT model, showing that the reasoning performance of SFT models can be improved through a series of preference alignment strategies (Song et al., 2023; Wang et al., 2023a; Yuan et al., 2023b).
In the video he reads into this that these methods alone may not at all get us over the data wall and are still fundamentally limited by the distribution of the base model they augment.
Thanks for sharing. I had trouble reading the transcript, so here is Claude's cleaned up version and summary:
Here's the condensed and formatted transcription in a single paragraph:

This is the last thing I want to highlight, this section on why RL works. Here they evaluate different things - they evaluate specifically pass at K and maj at K. Maj at K is like majority voting, so what you do is you have a model, you have a question, and you output not just one output but an ordered set. So you give your top 20 answers - 0 is your best answer that the model wants to give most, then the second most answer, third most answer, and so on. They could all be correct, just different reformulations of the same answer or different derivations stated in different ways. What you're interested in is how many of the top K results are correct - that's the pass at K. And if you did majority voting on the top K, how often would you be correct then? There's a slight difference, and that slight difference is actually made more drastic by reinforcement learning. They say, "As shown in figure 7, reinforcement learning enhances majority at K performance but not pass at K." These findings indicate that reinforcement learning enhances the model's overall performance by rendering the output distribution more robust. In other words, it seems that the improvement is attributed to boosting the correct response from Top K rather than the enhancement of fundamental capabilities. This is something we've come to learn in many different ways from reinforcement learning on language models or even supervised fine-tuning - what's happening most likely is that the capabilities of doing all of these things are already present in the underlying pre-trained language model.

Summary: Reinforcement learning improves language model performance not by enhancing fundamental capabilities but by making the output distribution more robust, effectively boosting correct responses within the top results rather than improving the model's inherent abilities.
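For anyone who wants the distinction concrete, here is a minimal sketch of how pass@k and maj@k are typically scored on a single problem; the sampled answers and the reference answer below are made up.

```python
from collections import Counter

def pass_at_k(samples, reference):
    """pass@k: at least one of the k sampled answers matches the reference."""
    return any(s == reference for s in samples)

def maj_at_k(samples, reference):
    """maj@k: the most frequent answer among the k samples matches the reference."""
    top_answer, _count = Counter(samples).most_common(1)[0]
    return top_answer == reference

# Hypothetical k=8 samples for a problem whose true answer is "42".
sharp = ["41", "42", "17", "42", "42", "13", "42", "7"]
print(pass_at_k(sharp, "42"), maj_at_k(sharp, "42"))      # True True

# A flatter distribution can still pass@k while failing maj@k.
diffuse = ["41", "13", "42", "17", "9", "13", "13", "7"]
print(pass_at_k(diffuse, "42"), maj_at_k(diffuse, "42"))  # True False
```

RL sharpening the output distribution is what moves a model from the second case toward the first, which is the Maj@K gain (without a Pass@K gain) that the DeepSeekMath quote describes.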
I don't know a lot about this, but it seems like if the sampling performance were adequate, external checks like theorem verification would work to get "over the data wall."
There have already been good results there with DeepMind's math Olympiad work. I think the LLM portion there was only used to translate from informal to formal statements during the training process, and in the final evaluation they still used a manual translation to a formal description. The solver itself was transformer-based and RL-trained, but I think it did not start from any language-model base; it was still able to learn a distribution helpful for solving the problems from RL, a verifier, and light scaffolding of the tree search alone.
I'm pretty sure RL causes catastrophic forgetting of its base knowledge and that's why o3 hallucinates so much more.
If you mess around with trained weights you're going to delete some base knowledge, at least the knowledge that is outside of the tasks you RL on.
A solution could be to mix RL training with foundational knowledge training, so the LLM can refresh its memory and not forget things.
Hallucinations usually happen when a model never knew the answer, not when it forgot something.
I think this is definitely not true in the case of catastrophic forgetting from fine-tuning. And with other related types of forgetting, such as from model abliteration, there are often extreme increases in hallucination.
The InstructGPT paper also showed that RLHF made hallucination worse (though with more user data rejecting common hallucinations, instruction tuning and RLHF may reduce the specific hallucinations users reject).
Some mention of that here: https://huyenchip.com/2023/05/02/rlhf.html#rlhf_and_hallucin...
RL might be making hallucinations worse, that’s true. Why do you think RL is causing catastrophic forgetting? Are there factual knowledge benchmarks showing it for o3 or o4-mini?
RL constrains the space of possible output token sequences to what is likely to lead to the correct answer. So we are inherently making a trade-off to reduce variance. A non-RL model will have higher variance, so given enough attempts, it will come up with some correct answers that an RL model can't.
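A toy version of that trade-off, with per-problem success probabilities that are entirely invented: the RL model is very reliable on a subset of problems and hopeless on the rest, while the base model is weak but has nonzero probability everywhere.

```python
# Toy illustration of the variance trade-off; all numbers are made up.

def pass_at_k(p: float, k: int) -> float:
    return 1.0 - (1.0 - p) ** k

# Per-problem success probabilities over a hypothetical 10-problem benchmark.
rl_probs   = [0.9] * 6 + [0.0] * 4   # sharp: nails 6 problems, never solves the other 4
base_probs = [0.3] * 6 + [0.05] * 4  # diffuse: unreliable, but nonzero on every problem

for k in (1, 8, 64, 256):
    rl   = sum(pass_at_k(p, k) for p in rl_probs) / len(rl_probs)
    base = sum(pass_at_k(p, k) for p in base_probs) / len(base_probs)
    print(f"k={k:>3}  RL pass@k={rl:.2f}  base pass@k={base:.2f}")
```

With these numbers the RL model wins at k=1 but plateaus at 0.60, while the base model keeps climbing toward 1.0 because no problem has zero probability mass under it, which is the shape of the crossover the paper reports.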
They write "We manually inspect CoT validity to ensure correct answers stem from valid reasoning, not lucky guesses." but the example answer they show at the end only gets the correct number due to two errors canceling out. The model calculates 195+367+562+900 and gets 1924 instead of 2024, and also turns -437 - 2*234 into -805 instead of -905, but in total 1924-805 = 2024-905 = 1119 and from there the remaining steps are correct again.
It would be interesting to know how much of the sampling efficiency improvement from reinforcement learning is due to being better at basic arithmetic (something which could also be achieved by giving the model access to a calculator tool) and how much is due to choosing the correct approach for solving the problem more often.
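For what it's worth, the two slips in the example quoted above really do cancel exactly:

```python
# The correct intermediate values vs. the model's (from the example above).
correct_sum  = 195 + 367 + 562 + 900  # = 2024, the model wrote 1924
correct_term = -437 - 2 * 234         # = -905, the model wrote -805

print(correct_sum + correct_term)  # 1119, the correct running total
print(1924 + (-805))               # 1119, the model's total: the errors cancel
```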
I'm a bit skeptical of this until it's proven that they're getting the right answers in the right ways. It could be that base models are just more random and when given 200 guesses out of 1000 possible answers tend to distribute them more evenly, bringing up the pass@k number.
They should try again with higher temperature on the RL model to introduce more variance.
‘Crucially, all correct solutions from RL-trained models already exist in the base model's distribution, proving RLVR enhances sampling efficiency, not reasoning capacity, while inadvertently shrinking the solution space.’ — wouldn't any kind of RL fail to converge, or even to progress at all, if the solution weren't to be found in the base model distribution? The way training is set up, the models absolutely need to be able to find right solutions in a reasonable time; otherwise there wouldn't be any training signal.
That depends a bit on the length of the RL training and the distribution of problems you're training on. You're correct that RL won't get any "traction" (via positive rewards) on problems where good behavior isn't already in the model's behavior distribution.
However, if you're training on many problems, it's possible in principle that if you have traction on _any_ of the problems, then the learning signal you get from success on those problems will have a positive effect on the model's behavior on other problems. Ie, the learning that you do on problems where the model is already producing positive reward behavior will nudge the model towards producing positive reward behavior on problems where it wasn't previously doing so.
This is an interesting scenario: do you know of any documented examples?
Offhand, I don't know any specific examples for LLMs. In general though, if you google something like "automated curriculum design for reinforcement learning", you should find some relevant references.
Some straightforward scenarios are in, eg, robotics where one can design sequences of increasingly difficult instances of a task like moving objects from one storage bin to another. The basic idea is that the agent would have no reward or learning signal if it jumped straight into the full version of the task, so you let it develop competence on simpler variants and gradually increase difficulty until the agent can get useful learning signal on the full task.
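As a concrete (if very stripped-down) picture of that bin-to-bin example, here is a toy curriculum loop. The "policy" is just a scalar skill value and the success model is invented, but it shows the mechanism: the agent only gets reward, and therefore learning signal, at difficulties it can already sometimes solve, so difficulty ramps with competence.

```python
import random

def success_rate(skill: float, difficulty: float, episodes: int = 200) -> float:
    """Toy environment: solvable only when skill is within reach of difficulty."""
    p = max(0.0, min(1.0, 1.0 + skill - difficulty))
    return sum(random.random() < p for _ in range(episodes)) / episodes

def train_step(skill: float, difficulty: float) -> float:
    """Pretend RL update: skill only improves if some episodes earn reward."""
    return skill + 0.1 if success_rate(skill, difficulty) > 0.0 else skill

skill = 0.0
for difficulty in (0.5, 1.0, 1.5, 2.0):           # curriculum: easy -> hard
    while success_rate(skill, difficulty) < 0.8:  # promote at ~80% success
        skill = train_step(skill, difficulty)
    print(f"reached difficulty {difficulty} with skill {skill:.1f}")

# Jumping straight to difficulty 2.0 from skill 0.0 would give zero reward
# forever, i.e. no learning signal -- which is the point of the curriculum.
```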
If you don't know the answer to a problem, you're not going to be able to repeat sampling until it is correct. Random strings will saturate all benchmarks at k=infinity if tested this way.
>Our key finding is that all reasoning paths in the RLVR model are already present in the base model.
This is a really good observation. It means that you don't need to RL the full model. You merely need to RL a few LoRAs or maybe a small Mamba model appended to the final layer.
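If that holds, something like the following would be enough to make only a small set of adapter weights trainable before running the RL loop. This is a minimal sketch assuming the Hugging Face transformers and peft libraries; the model name and target modules are chosen purely for illustration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; substitute whatever you intend to RL.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")

lora_config = LoraConfig(
    r=16,                                # rank of the low-rank update
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"], # only adapt the attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# Hand `model` to whatever RLVR loop (PPO/GRPO) you would otherwise run on the
# full model; only the LoRA weights receive gradient updates.
```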
This 100% tracks with my experience.
Also fun stuff many don't know: if you run a regular model's chat template with a reasoning-tuned model, it can go back to acting like the base model, with no "thinking" process.
"Reasoning" models are not any better than non reasoning models. It's a parlor trick, and benchmarks which claimed it wasn't are bad.
> If you run a regular model's chat template with a reasoning-tuned model, it can go back to acting like the base model, with no "thinking" process.
Well, of course. They've been "fine-tuned" with specific chat templates. Remove those and the fine-tune doesn't take precedence anymore. That's expected behaviour I'd say.
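For anyone who wants to try it, transformers lets you override a tokenizer's chat template at call time, so you can render the same conversation with the model's own template or with a plain, generic one. The model name and the toy template below are purely illustrative.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
messages = [{"role": "user", "content": "What is 17 * 23?"}]

# The model's own template, with the special tokens its fine-tuning expects.
native = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# A plain, generic template with none of that structure.
plain_template = (
    "{% for m in messages %}{{ m['role'] }}: {{ m['content'] }}\n{% endfor %}assistant: "
)
plain = tok.apply_chat_template(messages, chat_template=plain_template, tokenize=False)

print(native)
print(plain)
```

Whether the model then actually skips its "thinking" output is an empirical question; the point is just that the template, not the weights alone, carries a lot of the reasoning-mode behavior.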
> "Reasoning" models are not any better than non reasoning models. It's a parlor trick, and benchmarks which claimed it wasn't are bad.
All of them? Including the closed ones, never public? I highly doubt that.