# [Example] Policy model as its own reward model #270
Merged

Commits (9):

- 294b696 initialize trainable_ruler (hiyuchang)
- 437489e fix some bugs (hiyuchang)
- e1c8529 add results for trainable ruler (hiyuchang)
- 4fa54b2 Merge branch 'main' into feat/trainable_ruler (hiyuchang)
- 712e62c fix readme and logger (hiyuchang)
- a329dc3 update figures and fix comments (hiyuchang)
- d192294 fix task_id type (hiyuchang)
- 3fb6097 fix a typo (hiyuchang)
- fced450 fix test (hiyuchang)
# Policy Model as Its Own Reward Model

Ref: ART's RULER; Kimi-K2.

This example shows an implementation of training a policy model as its own reward model with GRPO, inspired by ART's [RULER](https://art.openpipe.ai/fundamentals/ruler) and Kimi's [K2](https://moonshotai.github.io/Kimi-K2/).

We simulate a scenario where only a fraction (`PROBABILITY_GROUND_TRUTH_AVAILABLE = 0.2`) of tasks have ground-truth answers. We optimize two objectives jointly: one for response generation, the other for RULER-reward generation.
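For intuition, here is a minimal, self-contained sketch of this reward routing. All function names, the placeholder scoring logic, and the format check are illustrative assumptions, not the actual `math_trainable_ruler_workflow` implementation:

```python
import random
from typing import List

# Toy sketch of the mixed-reward routing described above. Every name here is
# illustrative; the actual logic lives in `math_trainable_ruler_workflow`.
PROBABILITY_GROUND_TRUTH_AVAILABLE = 0.2  # fraction of tasks with a ground-truth answer


def rule_based_reward(response: str, ground_truth: str) -> float:
    """Gold reward when the ground truth is known: accuracy plus a format bonus."""
    accuracy = 1.0 if response.strip() == ground_truth.strip() else 0.0
    format_bonus = 0.1 if response.strip() else 0.0  # placeholder format check
    return accuracy + format_bonus


def ruler_scores(prompt: str, responses: List[str]) -> List[float]:
    """Placeholder for the policy model scoring its own group of responses."""
    return [0.5] * len(responses)  # in the real workflow these come from an LLM call


def assign_rewards(prompt: str, responses: List[str], ground_truth: str) -> List[float]:
    """Route a group of responses to rule-based or RULER-style rewards."""
    if random.random() < PROBABILITY_GROUND_TRUTH_AVAILABLE:
        # Simulated "ground truth available": use the rule-based gold reward.
        return [rule_based_reward(r, ground_truth) for r in responses]
    # Otherwise the policy model judges its own group of responses, RULER-style.
    return ruler_scores(prompt, responses)


print(assign_rewards("1+1=?", ["2", "3"], ground_truth="2"))
```

The point of the routing is that the same policy model produces both kinds of experiences within one training run, so both objectives can be optimized jointly.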
## Configurations and Metrics

The config file is located in [`gsm8k_ruler.yaml`](gsm8k_ruler.yaml).

Some key configs in this example are:

* `default_workflow_type`: set to `math_trainable_ruler_workflow`.
* `std_threshold` for the GRPO advantage: set to a small value so that groups of experiences with identical rewards are filtered out (e.g., when RULER fails to return valid scores, all rewards in the group are set to zero); see the sketch after this list.
* `sync_style`: set to `dynamic_by_explorer`, since the filtering of experiences means the number of experiences per step can vary.
* `train_batch_size`: set to 960; note that one explore step can generate more than 96 * 8 = 768 experiences.
* `lr`: set to a small value (2e-6) for stability, as the rewards can be noisy.
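As a toy illustration of the `std_threshold` behavior (this mirrors the idea only; it is not the framework's actual advantage function):

```python
import statistics
from typing import List

STD_THRESHOLD = 0.0001  # same value as std_threshold in gsm8k_ruler.yaml


def keep_group(rewards: List[float]) -> bool:
    """Keep a GRPO group only if its rewards have non-negligible spread.

    When RULER fails and every reward in a group is set to zero, the group's
    standard deviation is zero, so it carries no learning signal and is dropped.
    """
    return statistics.pstdev(rewards) > STD_THRESHOLD


print(keep_group([0.0] * 8))                                  # False: degenerate group, filtered out
print(keep_group([1.0, 0.0, 1.0, 0.0, 0.5, 0.0, 1.0, 0.0]))   # True: useful learning signal
```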
Some important metrics to pay attention to are:

* `reward`: the reward calculated either by rule or by RULER.
* `gold_reward`: the sum of `accuracy_reward` and `format_reward`, computed by rule with the ground truth.
* `judge_success`: whether RULER successfully returns a valid score (a coarse estimate, since it mixes the two types of experiences).
* `reward_for_judger`: the reward for the LLM acting as a RULER reward model, calculated from the mean absolute error (MAE) distance to the gold scores (see the sketch below).
* `eval_accuracy`: accuracy on the evaluation set (the ultimate metric for the success of RL).
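A rough illustration of `reward_for_judger` under an assumed formulation (the sign and scaling in the actual workflow may differ); here the MAE is negated so that a judge whose scores track the gold scores earns a higher reward:

```python
from typing import List


def judger_reward(ruler_scores: List[float], gold_scores: List[float]) -> float:
    """Negative mean absolute error between the model's RULER scores and the gold scores."""
    assert len(ruler_scores) == len(gold_scores) and gold_scores
    mae = sum(abs(r - g) for r, g in zip(ruler_scores, gold_scores)) / len(gold_scores)
    return -mae


# A judge whose scores track the gold scores gets a higher (less negative) reward.
print(judger_reward([0.9, 0.1, 0.8], [1.0, 0.0, 1.0]))  # ≈ -0.133
print(judger_reward([0.0, 1.0, 0.0], [1.0, 0.0, 1.0]))  # -1.0
```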
## Results

We show the results below:

![](../../docs/sphinx_doc/assets/trainable_ruler_reward.png)

![](../../docs/sphinx_doc/assets/trainable_ruler_gold_reward.png)

![](../../docs/sphinx_doc/assets/trainable_ruler_judge_success.png)

![](../../docs/sphinx_doc/assets/trainable_ruler_reward_for_judge.png)

![](../../docs/sphinx_doc/assets/trainable_ruler_eval.png)

You may compare the above results with [the RULER example](../../examples/grpo_gsm8k_ruler/README.md), which uses Qwen2.5-32B-Instruct as the LLM judge (`auxiliary_models`).

## Potential improvements

As this is a toy example, we may consider some further improvements, such as automatically balancing the number of samples for the two objectives, or their loss weights. We also plan to test this approach in broader scenarios, e.g., cross-domain transfer of the model's critic capability.
The config file, `gsm8k_ruler.yaml`:
```yaml
project: "Trinity-RFT-gsm8k-trainable-ruler"
name: "qwen2.5-1.5B-gsm8k-trainable-ruler"
checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}
algorithm:
  algorithm_type: grpo
  advantage_fn_args:
    std_threshold: 0.0001  # effectively zero
  repeat_times: 8
model:
  model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen2.5-1.5B-Instruct}
  max_prompt_tokens: 12288
  max_response_tokens: 12288
  max_model_len: 16000  # slightly smaller than ppo_max_token_len_per_gpu (16384)
cluster:
  node_num: 1
  gpu_per_node: 8
buffer:
  total_epochs: 1
  batch_size: 96
  train_batch_size: 960
  explorer_input:
    taskset:
      name: gsm8k
      storage_type: file
      path: 'openai/gsm8k'
      subset_name: 'main'
      split: 'train'
      format:
        prompt_key: 'question'
        response_key: 'answer'
      rollout_args:
        temperature: 1.0
    eval_tasksets:
      - name: gsm8k-eval
        storage_type: file
        path: 'openai/gsm8k'
        subset_name: 'main'
        split: 'test'
        format:
          prompt_key: 'question'
          response_key: 'answer'
    default_workflow_type: 'math_trainable_ruler_workflow'
  trainer_input:
    experience_buffer:
      name: gsm8k_buffer
      storage_type: queue
explorer:
  eval_interval: 10
  runner_num: 32
  rollout_model:
    engine_type: vllm_async
    engine_num: 4
    tensor_parallel_size: 1
    enable_prefix_caching: false
    enforce_eager: true
    dtype: bfloat16
    seed: 42
synchronizer:
  sync_style: dynamic_by_explorer
  sync_method: 'nccl'
  sync_interval: 5
  sync_timeout: 3600
trainer:
  save_interval: 100
  trainer_config:
    actor_rollout_ref:
      model:
        use_remove_padding: true
      actor:
        use_dynamic_bsz: true
        ppo_max_token_len_per_gpu: 16384
        ulysses_sequence_parallel_size: 1
        optim:
          lr: 2e-6
      ref:
        log_prob_use_dynamic_bsz: ${trainer.trainer_config.actor_rollout_ref.actor.use_dynamic_bsz}
        log_prob_max_token_len_per_gpu: ${trainer.trainer_config.actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
        ulysses_sequence_parallel_size: ${trainer.trainer_config.actor_rollout_ref.actor.ulysses_sequence_parallel_size}  # sp size
```
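The `${oc.env:VAR,default}` entries are OmegaConf environment-variable interpolations with fallback defaults, and the `${trainer...}` entries reference other keys in the same file. A quick way to check how the file resolves (assuming it is saved as `gsm8k_ruler.yaml` and `omegaconf` is installed; Trinity-RFT itself may load configs differently):

```python
import os
from omegaconf import OmegaConf

# Optional overrides; without them the defaults after the comma are used.
os.environ.setdefault("TRINITY_MODEL_PATH", "Qwen/Qwen2.5-1.5B-Instruct")
os.environ.setdefault("TRINITY_CHECKPOINT_ROOT_DIR", "./checkpoints")

cfg = OmegaConf.load("gsm8k_ruler.yaml")
# resolve=True expands ${oc.env:...} and the ${trainer...} cross-references.
print(OmegaConf.to_yaml(cfg, resolve=True))
print(cfg.buffer.train_batch_size)  # 960
```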