The paper claims that Qwen3-4B achieved 89.2% best-arm selection, attributing this to superior "probabilistic reasoning". But this is a 2-armed bandit: a random-guess baseline would select the best arm about 50% of the time across the 500 runs of 25 iterations each. An 89% rate is suspiciously high and suggests something else is going on, such as prompt bias or the model pattern-matching rather than reasoning.
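To make the baseline concrete, here is a minimal sketch of the random-guess comparison. It assumes "best-arm selection" means the fraction of runs whose final pick is the best arm; the seed and run count are illustrative, matching the paper's 500 runs.

```python
import random

# Sketch of a random-policy baseline for a 2-armed bandit.
# Assumption: the metric is the fraction of 500 runs whose final
# choice lands on the best arm (arm 0 here, by convention).
random.seed(0)
runs = 500
hits = sum(random.randrange(2) == 0 for _ in range(runs))
rate = hits / runs
print(f"random baseline best-arm rate: {rate:.1%}")
```

With 500 runs the baseline lands near 50% (standard deviation ~2.2 points), which is what makes the 89.2% figure worth scrutinizing rather than accepting as reasoning.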
When they increase from 2 to 5 arms, Qwen3-4B drops from 89% to 6.5% accuracy. Note that 6.5% is well below the 20% random baseline for a 5-armed task: genuine probabilistic reasoning should degrade gracefully toward chance as arms are added, not fall beneath it. Below-chance performance points to a systematic bias, not just a harder problem.
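The below-chance point can be checked with a one-sample proportion test. This is my own sketch, assuming 500 independent runs in the 5-arm condition and a 20% chance baseline (the paper's numbers, my test choice).

```python
import math

# Normal-approximation z-test: is 6.5% over 500 runs significantly
# below the 20% chance rate of a 5-armed bandit?
n = 500          # runs in the 5-arm condition (from the paper)
p0 = 0.20        # chance rate with 5 arms
observed = 0.065 # reported best-arm rate
z = (observed - p0) / math.sqrt(p0 * (1 - p0) / n)
print(f"z = {z:.1f}")
```

The statistic comes out far below any conventional threshold (|z| > 7), so "the task got harder" cannot explain the result on its own; something is pushing the model away from the best arm.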
The "overthinking" explanation is hand-wavy: no evidence is presented and no chain of reasoning is given. It reads as a post-hoc story to explain unexpected results.
There is no discussion of variance, confidence intervals, or statistical significance. With 500 runs per condition, these are straightforward to compute.
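For instance, a Wilson score interval for the headline number takes a few lines. This sketch assumes the 89.2% figure is a binomial proportion over 500 independent runs (the paper does not say either way).

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Assumption: 89.2% best-arm selection over 500 independent runs.
lo, hi = wilson_ci(round(0.892 * 500), 500)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
```

The interval is roughly [0.86, 0.92], i.e. about a ±3-point band. That the authors could have reported this, and did not, is itself worth flagging in review.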
Does an 89% accuracy on a binary choice task strike anyone else as implausibly high for the mechanism being claimed?