Tero Keski-Valkama’s Post

To get the elephant out of the way first: the graph says no such thing. The y-axis is "attention on RL (reinforcement learning)", not any indicator of the strength or intelligence of the models.

I am now seeing a lot of posts claiming that "RLHF is barely RL" and that we should make the models "more RL". This is partly true but subtly incorrect. What we want is more search, not sparse scalar rewards. By search I obviously mean the algorithmic sense, not the "search engine" sense: search is about exploring the solution space to find satisfactory or best possible solutions. RL means using search to optimize sparse scalar reward feedback.

RL in the classical sense doesn't scale, because any reasonable reward signal isn't informative enough to learn about the world. That's why RL systems generally rely on all sorts of hacks, from proxy reward functions, like intrinsic rewards and differentiable critics, to world models that learn by self-supervision. Over the history of RL, we have actually done less and less RL and more and more self-supervised learning, to get more information and training feedback from the world. We have also swapped naive rewards for depictions of goal states (for controllability) or for preferences (e.g. DPO, which sidesteps the impossible challenge of defining a good reward signal).

May I also point out how terribly bad an idea it is to give a single reward objective to a highly intelligent entity? It will make a huge mess, no matter how smart you think you were when you set it to be rewarded by profit or by closed sales. It will not only make a mess; it will also lead to a monomaniacal specialist AI, not a generalist one. It will be weak and highly exploitable by AIs not encumbered by the same dysfunctions.

What we need is more search over open-ended, non-imitative objectives, not "more RL".
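A toy Python sketch of the information-starvation point, assuming an invented bit-string task: with a reward that only fires at the goal, greedy local search degenerates into a random walk, while a denser, self-supervised-style signal guides the same search straight to the solution.

import random

def hill_climb(score, n_bits=20, steps=2000, seed=0):
    # Greedy local search: flip one random bit, keep the flip whenever
    # the score does not decrease.
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n_bits)]
    for _ in range(steps):
        i = rng.randrange(n_bits)
        y = x[:]
        y[i] ^= 1
        if score(y) >= score(x):
            x = y
    return x

TARGET = [1] * 20
sparse = lambda x: 1.0 if x == TARGET else 0.0            # fires only at the goal
dense = lambda x: sum(a == b for a, b in zip(x, TARGET))  # per-bit feedback

# Under the sparse reward every comparison is 0 >= 0, so the search is
# blind; under the dense signal the identical search solves the task.
print(sum(hill_climb(sparse)), "of 20 bits correct under the sparse reward")
print(sum(hill_climb(dense)), "of 20 bits correct under dense feedback")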
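In the same spirit, here is a minimal sketch of one of the proxy signals mentioned above, a curiosity-style intrinsic reward: the agent is rewarded with the prediction error of a forward model that is itself trained by self-supervision on every transition. The linear model below is a deliberately tiny stand-in for a real world model.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))  # toy linear forward model: s_next ~ W @ s

def intrinsic_reward(s, s_next, lr=0.01):
    # Proxy reward = squared prediction error of the forward model.
    # The model updates by self-supervision on the observed transition,
    # so every step yields dense feedback with no hand-designed reward.
    global W
    err = s_next - W @ s
    W += lr * np.outer(err, s)  # one SGD step on the squared error
    return float(err @ err)     # transitions the model cannot predict score high

Novel transitions earn a high reward; as the model learns them, the reward fades, pushing the agent to keep exploring.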
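And for the preference route: DPO (Direct Preference Optimization, Rafailov et al., 2023) replaces the scalar reward with pairwise preferences. A minimal sketch of the per-pair loss, taking sequence log-probabilities under the trained policy and a frozen reference model (the variable names here are illustrative, not from any particular library):

import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # DPO per-pair loss: -log sigmoid(beta * margin), where the margin is
    # how much more the policy prefers the chosen answer over the rejected
    # one, relative to the reference model. No reward function is ever defined.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))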
