Alessandro Cappelli’s Post

Alessandro Cappelli

Cofounder & Research Scientist

📜 A new blog post explaining the fundamentals of RL for LLMs: the perfect place to build an intuitive understanding of PPO and RLHF from first principles. 👇👇

Pre-trained LLMs are often compared to Lovecraft's Shoggoth, a monstrous, shape-shifting beast. PPO is the algorithm that brought it under control. Take a look! 🔗 https://lnkd.in/eXqDvrUV
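As a quick reminder of the objective the post is about, here is the standard PPO clipped surrogate (Schulman et al., 2017) in generic notation; the blog's own derivation and notation may differ, and in RLHF a KL penalty toward a reference policy is typically added on top:

```latex
% Standard PPO clipped surrogate objective; \hat{A}_t is an advantage estimate,
% \epsilon the clipping range, and r_t(\theta) the policy probability ratio.
L^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[
      \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
                  \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} .
```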

Giorgio Angelotti

ML Consultant @ Vesuvius Challenge | PhD in AI | Physicist | Data Scientist

4mo

Sir, nice blog post. I have a silly question. RL is a framework for obtaining a (sub-)optimal policy for a sequential decision-making problem under uncertainty. Usually, after the agent takes an action, an environment changes the state of the system in a possibly stochastic way. In what you are describing, the next state of the system is dictated deterministically by the action (the next token) produced by the agent (the LLM). Therefore, wouldn't the problem be better formalized as a multi-armed bandit? Using the general RL formalism would make more sense if you were to train on full conversations rather than just prompt-answer pairs; in that case, the "environment" could be the human user, who changes the state stochastically with their prompts. Does this make sense to you?
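A minimal sketch of the distinction the comment draws, contrasting a token-level formulation with a contextual-bandit view; the names (TokenLevelMDP, ContextualBandit, reward_model) are hypothetical and chosen only for this illustration, not taken from the blog post:

```python
# Illustration of the two formalizations discussed in the comment above.
from dataclasses import dataclass
from typing import List, Tuple


def reward_model(tokens: List[str]) -> float:
    """Stand-in for a learned reward model: here it simply prefers short answers."""
    return -float(len(tokens))


@dataclass
class TokenLevelMDP:
    """Token-level view: each generated token is one action.

    The 'environment' transition is deterministic: the next state is just the
    current state (prompt + tokens so far) with the chosen token appended.
    Reward is sparse, given only once the whole answer is finished.
    """
    prompt: List[str]

    def reset(self) -> List[str]:
        return list(self.prompt)

    def step(self, state: List[str], token: str, done: bool) -> Tuple[List[str], float]:
        next_state = state + [token]                       # deterministic transition
        reward = reward_model(next_state) if done else 0.0  # terminal reward only
        return next_state, reward


@dataclass
class ContextualBandit:
    """Bandit view: the whole completion is a single action ('arm') chosen for a
    given context (the prompt); there are no intermediate states."""
    prompt: List[str]

    def pull_arm(self, completion: List[str]) -> float:
        return reward_model(self.prompt + completion)


if __name__ == "__main__":
    prompt = ["What", "is", "RL", "?"]
    completion = ["A", "framework", "."]

    # Token-level rollout: deterministic state growth, reward only at the end.
    mdp = TokenLevelMDP(prompt)
    state, total = mdp.reset(), 0.0
    for i, tok in enumerate(completion):
        state, r = mdp.step(state, tok, done=(i == len(completion) - 1))
        total += r
    print("token-level return:", total)

    # Bandit view: the same completion treated as one action.
    bandit = ContextualBandit(prompt)
    print("bandit reward:     ", bandit.pull_arm(completion))
```

In the token-level view the transition is trivially deterministic and all the uncertainty sits in the terminal reward, which is what motivates reading single prompt-answer training as a (contextual) bandit; multi-turn training on full conversations would reintroduce a genuinely stochastic environment, namely the user.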

