# Reddit r/MachineLearning Post
## Title (must start with tag): [R] Linguistic RL: 3B Models Exceed 100B Performance Through Self-Reflection (86% vs 81%)
## Post Body:
*TL;DR*: We taught tiny models (3B/1.5B) to beat Claude 3.5 Haiku (100B) by having Claude "journal" about its mistakes, then training small models on the learned strategy. Cost: <$10. Student exceeds teacher.
---
## Results
| Model | Size | Baseline | After LRL+LoRA | Improvement |
|-------|------|----------|----------------|-------------|
| *Qwen2.5-3B* | 3B | 12% | *86.0%* | *+74pp* |
| *Qwen2.5-1.5B* | 1.5B | ~8% | *82.7%* | *+75pp* |
| Claude 3.5 Haiku (teacher) | ~100B | 81.3% | 84.0% (LRL only, no LoRA) | +2.7pp |
Both students *outperformed the teacher they learned from*, a model 33-67× their size.
---
## How It Works
*Step 1: Teacher Self-Improvement ("Linguistic RL")*
Give Claude a problem → it solves it → tell it whether it was correct → ask it to reflect:
``` "What did I miss? How can I improve?" ```
Through pure self-reflection (no gradients!), Claude writes journal entries like:
``` "I was only checking adjacent meetings. I need to check ALL overlaps to find the maximum simultaneous conflicts." ```
Accuracy improves from 81.3% to 84.0% just from thinking about its mistakes.
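To make the loop concrete, here's a minimal sketch against the Anthropic messages API. The prompt wording and the `train_problems` / `is_correct` helpers are illustrative assumptions, not the repo's exact code:

```python
# Minimal sketch of the Linguistic RL loop (illustrative, not the repo's exact code).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-5-haiku-20241022"

def ask(prompt: str) -> str:
    """Single-turn call to the teacher model; returns its text reply."""
    msg = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

journal = []  # the "learning" lives in natural language, not in gradients
for problem, expected in train_problems:  # train_problems: assumed list of (text, answer) pairs
    notes = "\n".join(journal[-5:])  # carry recent reflections into the next attempt
    answer = ask(f"Lessons so far:\n{notes}\n\nSolve this scheduling problem:\n{problem}")
    if not is_correct(answer, expected):  # is_correct: assumed domain-specific checker
        journal.append(ask(
            f"Problem:\n{problem}\nYour answer:\n{answer}\nCorrect answer:\n{expected}\n\n"
            "What did I miss? How can I improve? Reply with one concise lesson."
        ))
```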
*Step 2: Extract Strategy*
Pull out Claude's learned solving strategy as a natural-language curriculum.
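Continuing the sketch above (same hedges apply), extraction can be a single distillation prompt over the accumulated journal:

```python
# Illustrative: distill the journal into a reusable natural-language curriculum.
curriculum = ask(
    "Here are your reflections from solving scheduling problems:\n\n"
    + "\n".join(journal)
    + "\n\nDistill them into a step-by-step solving strategy a smaller model could follow."
)
```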
*Step 3: Train Student with LoRA*
Fine-tune small model (3B/1.5B) on examples showing:
- Problem
- Claude's strategic thinking
- Answer
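A hedged sketch of this step with Hugging Face `transformers` + `peft` is below; the hyperparameters, target modules, and `train_examples` format are assumptions, not the validated configs shipped in the repo:

```python
# Illustrative LoRA fine-tune of the student on (problem, strategy, answer) text.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)  # only the small adapter matrices are trained

def to_features(example):
    # Each example pairs the problem with the teacher's strategic thinking and the answer.
    text = (f"Problem:\n{example['problem']}\n\n"
            f"Strategy:\n{example['strategy']}\n\n"
            f"Answer: {example['answer']}")
    return tokenizer(text, truncation=True, max_length=1024)

dataset = [to_features(e) for e in train_examples]  # train_examples: assumed list of dicts
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # pads and sets labels

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qwen3b-lrl-lora", num_train_epochs=3,
                           per_device_train_batch_size=2, learning_rate=2e-4, bf16=True),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```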
*Result*: the 3B model learns the O(n log n) sweep line algorithm and achieves 96% on easy problems.
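For reference, the algorithm the student ends up expressing is the classic endpoint sweep; a generic sketch (not the student's generated solution) looks like this:

```python
def max_simultaneous_meetings(meetings):
    """Classic O(n log n) sweep line: maximum number of overlapping intervals."""
    events = []
    for start, end in meetings:
        events.append((start, 1))   # a meeting begins
        events.append((end, -1))    # a meeting ends
    # Sort by time, processing ends before starts at equal times so that
    # back-to-back meetings don't count as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    best = current = 0
    for _, delta in events:
        current += delta
        best = max(best, current)
    return best

print(max_simultaneous_meetings([(0, 2), (1.5, 3), (4, 5)]))  # -> 2
```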
---
## Why This Matters
*Economics*
- Training: <$10 in API calls
- Inference: free forever (runs locally)
- 100-1000× cheaper than API deployment
*Science*
- 67× compression (100B → 1.5B) with a performance gain
- Learned algorithmic reasoning, not pattern matching
- Students exceed teacher = knowledge is compressible
*Safety*
- Human-readable learning process
- Can audit what was learned
- No black-box distillation
*Democratization*
- Frontier capabilities on consumer hardware
- One-time extraction, infinite reuse
- Fully open source
---
## Code & Reproducibility
- Published to Zenodo: [DOI 10.5281/zenodo.17585532](https://zenodo.org/records/17585532)
- GitHub: https://github.com/DRawson5570/linguistic-rl-scheduling-expe...
- Fixed seeds, full logs, complete configs
- Universal framework - adapt to any domain
*Quick start:*
```bash
git clone https://github.com/DRawson5570/linguistic-rl-scheduling-expe...
cd validated_results_qwen3b_claude35haiku
pip install transformers torch peft anthropic
python run_validation.py
```
Requirements: 12GB GPU, Anthropic API key (~$5)
---
## Framework
We built a universal pipeline - works for any domain:
```python
from framework import run_knowledge_transfer

results = run_knowledge_transfer(
    domain=YourCustomDomain(),
    teacher_model="claude-3-5-haiku-20241022",
    student_model="Qwen/Qwen2.5-3B-Instruct",
)
```
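The `domain` object supplies problem generation and answer checking. The class below is a purely hypothetical sketch of what that might look like; the real interface is defined in the repo:

```python
# Hypothetical domain sketch (reuses the sweep line helper above); see the repo for the actual interface.
import random

class MeetingSchedulingDomain:
    def generate_problem(self):
        """Return a (problem_text, expected_answer) pair for one scheduling instance."""
        meetings = []
        for _ in range(6):
            start = random.randint(0, 20)
            meetings.append((start, start + random.randint(1, 5)))
        question = f"Given meetings {meetings}, how many overlap at the busiest moment?"
        return question, max_simultaneous_meetings(meetings)

    def check(self, model_answer: str, expected: int) -> bool:
        """Loose check: does the expected count appear in the model's reply?"""
        return str(expected) in model_answer
```
---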
## Open Questions
1. *How small can we go?* Testing 1.5B → 0.5B compression
2. *What knowledge compresses well?* Algorithmic vs. factual vs. creative reasoning
3. *Recursive teaching?* Can students become teachers?
4. *Safety implications?* More auditable than weight distillation?
---
## Links
- Paper: https://zenodo.org/records/17585532
- Code: https://github.com/DRawson5570/linguistic-rl-scheduling-expe...
- 3B Results: [validated_results_qwen3b_claude35haiku/](https://github.com/DRawson5570/linguistic-rl-scheduling-expe...)
- 1.5B Results: [validated_results_qwen1.5b_claude35haiku/](https://github.com/DRawson5570/linguistic-rl-scheduling-expe...)