DeepSeek-R1 is absolutely wild, and so are we! 🚀 Just 5 days after DeepSeek-R1's release, we've replicated its pure reinforcement learning approach on math reasoning: no reward models, no supervised fine-tuning, starting from a base model. The results are striking:

🧠 A 7B model + 8K MATH examples for verification + reinforcement learning = "aha moment"
🌟 Long chain-of-thought and self-reflection emerge naturally
🔥 Record math performance:
✅ 33.3% on AIME
✅ 62.5% on AMC
✅ 77.2% on MATH
📈 Outperforms Qwen2.5-Math-7B-Instruct and matches strong methods like PRIME and rStar-MATH, despite using over 50x less data and just the simple PPO algorithm!

We proudly open-source our complete training code and methodology to the research community. By sharing these resources, we hope our simple reinforcement learning recipe serves as an inspiration for future work on reinforcement learning.

🔬 Check out more details and our findings in our blog: https://lnkd.in/gfK9pgtt
🖥️ The training code can be found at: https://lnkd.in/gmwwfigt
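The "no reward model" part of a recipe like this usually comes down to a rule-based verifiable reward: check the model's final answer against the reference and assign a fixed score. Here is a minimal sketch of that idea; the function names, the `\boxed{...}` answer convention, and the +1/-1 reward values are illustrative assumptions, not the exact code from the linked repository.

```python
def extract_boxed_answer(text: str):
    """Return the contents of the last \\boxed{...} in a model response,
    tracking nested braces with a simple depth counter (illustrative helper)."""
    marker = r"\boxed{"
    idx = text.rfind(marker)
    if idx == -1:
        return None
    depth = 0
    start = idx + len(marker)
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            if depth == 0:
                return text[start:i].strip()
            depth -= 1
    return None  # unbalanced braces: treat as no answer


def rule_based_reward(response: str, gold_answer: str) -> float:
    """Binary verifiable reward for RL training: +1 if the final boxed
    answer matches the reference string, -1 otherwise. No learned reward
    model is involved; the scale is an assumption for this sketch."""
    pred = extract_boxed_answer(response)
    if pred is None:
        return -1.0
    return 1.0 if pred == gold_answer.strip() else -1.0
```

In a PPO loop, this scalar would be the episode-level reward for each sampled solution; real implementations typically add answer normalization (e.g. comparing `\frac{1}{2}` with `0.5`) on top of exact string matching.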
Interesting
Everyone can tune their own LLM.
The explanation and the code are very clear and easy to understand. Thanks
Awesome Qian Liu and congrats on your new job!