4 - ChatGPT - Optimizing Language Models For Dialogue


To create a reward model for reinforcement learning, we needed to collect comparison data, which consisted of two or more model responses ranked by quality. To collect this data, we took conversations that AI trainers had with the chatbot. We randomly selected a model-written message, sampled several alternative completions, and had AI trainers rank them. Using these reward models, we can fine-tune the model using Proximal Policy Optimization. We performed several iterations of this process.
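To make the comparison-and-reward step concrete, below is a minimal sketch of how ranked completions can be turned into a reward model with a pairwise ranking loss. Everything in it, including the tiny RewardModel stand-in, the toy token data, and the hyperparameters, is an illustrative assumption rather than OpenAI's actual code or architecture.

# Illustrative sketch only (PyTorch): a reward model trained on comparison data
# with a pairwise ranking loss. The model, data, and hyperparameters are
# placeholders, not OpenAI's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Tiny stand-in for a transformer language model with a scalar reward head."""
    def __init__(self, vocab_size=50257, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.reward_head = nn.Linear(d_model, 1)

    def forward(self, token_ids):                          # (batch, seq_len)
        hidden, _ = self.encoder(self.embed(token_ids))    # (batch, seq_len, d_model)
        return self.reward_head(hidden[:, -1]).squeeze(-1) # one scalar reward per sequence

def pairwise_loss(reward_chosen, reward_rejected):
    # Push the reward of the trainer-preferred completion above the
    # less-preferred one: -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Toy comparison batch: token ids for a preferred and a rejected completion of
# the same prompt; real batches would come from the trainers' rankings.
chosen = torch.randint(0, 50257, (4, 32))
rejected = torch.randint(0, 50257, (4, 32))

loss = pairwise_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()

In the full pipeline described above, the trained reward model would then score sampled responses from the dialogue model, and Proximal Policy Optimization would update that model against those scores, with the whole collect-train-optimize loop repeated over several iterations.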

ChatGPT is fine-tuned from a model in the GPT-3.5 series, which finished training in early 2022. You can learn more about the 3.5 series here. ChatGPT and GPT-3.5 were trained on an Azure AI supercomputing infrastructure.

Limitations
ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers.
Fixing this issue is challenging, as: (1) during RL training, there’s currently no source
of truth; (2) training the model to be more cautious causes it to decline questions that
it can answer correctly; and (3) supervised training misleads the model because the
ideal answer depends on what the model knows, rather than what the human
demonstrator knows.
