Qian Liu’s Post

DeepSeek-R1 is absolutely wild, and so are we! 🚀

Just 5 days after DeepSeek-R1, we've replicated its pure reinforcement learning magic on math reasoning: no reward models, no supervised fine-tuning, starting directly from a base model. The results are mind-blowing:

🧠 A 7B model + 8K MATH examples for verification + reinforcement learning = "aha moment"
🌟 Long chain-of-thought and self-reflection emerge naturally
🔥 Record math performance:
✅ 33.3% on AIME
✅ 62.5% on AMC
✅ 77.2% on MATH

📈 Outperforms Qwen2.5-Math-7B-Instruct and matches strong methods like PRIME and rStar-MATH, despite using >50x less data and just the simple PPO algorithm!

We proudly open-source our complete training code and methodology to the research community. By sharing these resources, we hope our simple reinforcement learning recipe serves as an inspiration for future work in reinforcement learning.

🔬 More details and findings in our blog: https://lnkd.in/gfK9pgtt
🖥️ Training code: https://lnkd.in/gmwwfigt
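Since the recipe uses no learned reward model, the RL signal comes from rule-based verification of the final answer against the ground truth. A minimal sketch of such an outcome reward is below; the function names and the `\boxed{}` extraction heuristic are illustrative assumptions, not the exact parser from the released code.

```python
def extract_final_answer(completion: str):
    """Pull the last \\boxed{...} answer out of a model completion.

    Math benchmarks like MATH conventionally wrap the final answer in
    \\boxed{}; this brace-matching walk is a common heuristic (an
    assumption here, not the project's exact implementation).
    """
    marker = r"\boxed{"
    start = completion.rfind(marker)
    if start == -1:
        return None
    # Walk forward to find the matching closing brace.
    i = start + len(marker)
    begin, depth = i, 1
    while i < len(completion) and depth > 0:
        if completion[i] == "{":
            depth += 1
        elif completion[i] == "}":
            depth -= 1
        i += 1
    return completion[begin:i - 1] if depth == 0 else None


def rule_based_reward(completion: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 if the extracted final answer matches
    the ground truth exactly, else 0.0. No reward model is involved."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0
```

In a PPO loop, this scalar would be attached to the end of each sampled completion; because the reward checks only the verifiable final answer, long chain-of-thought is free to emerge as whatever intermediate reasoning helps the policy land on correct answers.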

7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient | Notion

hkust-nlp.notion.site

Zexin Yan

applied AI/ML System Software Engineer

2w

Very helpful as

Zaheen E Muktadi Syed

Graduate Research Assistant @University of Central Florida || Ex Huawei

3w

Interesting

Xiaobing S.

Research Scientist (NLP, LLM, Safety)

3w

Everyone can tune their own LLM.

Mahatma Kawa

Machine Learning Engineer at BRI | LLM Practitioner | Robotics Enthusiast | Experienced IoT Engineer

3w

The explanation and the code are very clear and easy to understand. Thanks

Awesome, Qian Liu, and congrats on your new job!
