[Example] Policy model as its own reward model #270
Conversation
Summary of Changes
Hello @hiyuchang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request integrates a novel approach for training large language models where the policy model itself learns to act as a reward model. This is achieved through a new example and a dedicated workflow that supports joint optimization of both response generation and reward prediction, particularly useful in environments with limited ground truth data. The changes also include minor improvements to existing RULER-related functionalities and better handling of experience metadata.
Highlights
- New Trainable RULER Example: Introduces an example demonstrating how a policy model can be trained to serve as its own reward model, inspired by ART's RULER and Kimi-k2.
- MathTrainableRULERWorkflow: Adds a new workflow that enables joint optimization for response generation and RULER-reward generation, handling scenarios with partial ground truth availability.
- RULER Workflow Robustness: Enhances the existing MathRULERWorkflow by improving prompt formatting and adding validation for the length of parsed scores from the judger's response.
- Experience ID Handling: Updates the workflow_runner to allow workflows to explicitly set experience task IDs, preventing unintended overwrites.
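The joint-optimization idea in the highlights, where the same policy model both generates responses and RULER-scores them, can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual API: `model.generate`, `parse_scores`, and `run_step` are invented names, and the score format is an assumption.

```python
# Hypothetical sketch of trainable RULER: one model generates candidate
# responses and then judges them, producing its own rewards for RL.
from dataclasses import dataclass
from typing import List

@dataclass
class Experience:
    response: str
    reward: float

def parse_scores(judge_text: str, expected: int) -> List[float]:
    """Parse a comma-separated score list and validate its length,
    mirroring the length check the PR adds to MathRULERWorkflow."""
    scores = [float(s) for s in judge_text.split(",")]
    if len(scores) != expected:
        raise ValueError(f"expected {expected} scores, got {len(scores)}")
    return scores

def run_step(model, task: str, n: int = 4) -> List[Experience]:
    # 1) The policy model samples n candidate responses.
    responses = [model.generate(task) for _ in range(n)]
    # 2) The same model is prompted to judge all candidates at once.
    judge_prompt = task + "\n" + "\n".join(
        f"[{i}] {r}" for i, r in enumerate(responses)
    )
    scores = parse_scores(model.generate(judge_prompt), expected=n)
    # 3) Both the responses and the judging step can be trained jointly.
    return [Experience(r, s) for r, s in zip(responses, scores)]
```

The key property is that the judging call is itself a policy-model generation, so it can receive gradient updates alongside response generation.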
/unittest-module-common
Code Review
This pull request introduces an interesting new feature where the policy model also acts as its own reward model, inspired by RULER and K2. The implementation includes a new example, a new workflow, and necessary adjustments to support it. The code is generally well-structured. My review focuses on a critical type mismatch that could cause runtime errors, along with a couple of minor documentation improvements for clarity and correctness. Overall, a great addition once the critical issue is addressed.
Pull Request Overview
This pull request implements a policy model serving as its own reward model using a trainable RULER approach, allowing the same model to generate responses and evaluate them for reinforcement learning.
Key changes:
- Modified ID fields from integer to string/integer union to support extended task identifiers
- Created a new trainable RULER workflow where the policy model judges its own outputs
- Added example configuration and documentation for the new workflow
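The ID-field change in the first bullet can be illustrated with a minimal sketch. The field names follow the review table below (`batch`, `task` on an `EID` class), but the real class in `trinity/common/experience.py` may differ:

```python
# Minimal illustration of widening EID fields from int to Union[int, str],
# so workflows can set extended string task identifiers. Defaults move
# from 0 to "" per the workflow.py change noted in the table.
from dataclasses import dataclass
from typing import Union

@dataclass
class EID:
    batch: Union[int, str] = ""  # previously int, default 0
    task: Union[int, str] = ""   # previously int, default 0

# Both plain integers and structured string IDs are now valid:
legacy = EID(batch=3, task=7)
extended = EID(batch="run-2", task="gsm8k/q42")
```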
Reviewed Changes
Copilot reviewed 10 out of 15 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| trinity/explorer/workflow_runner.py | Added check to preserve existing task IDs when set by workflows |
| trinity/common/workflows/workflow.py | Changed default batch_id and task_id from 0 to empty string |
| trinity/common/workflows/math_trainable_ruler_workflow.py | New workflow implementing policy model as its own reward model |
| trinity/common/workflows/math_ruler_workflow.py | Fixed template string bug and added validation for score list length |
| trinity/common/workflows/__init__.py | Added import and export for new trainable RULER workflow |
| trinity/common/experience.py | Changed EID batch and task fields to Union[int, str] type |
| examples/grpo_gsm8k_trainable_ruler/gsm8k_ruler.yaml | Configuration file for trainable RULER example |
| examples/grpo_gsm8k_trainable_ruler/README.md | Documentation for the new trainable RULER approach |
| examples/grpo_gsm8k_ruler/README.md | Minor formatting cleanup |
| docs/sphinx_doc/source/tutorial/faq.md | Fixed incomplete sentence in FAQ |
/unittest-all
/unittest-module-common
Description
As the title says: this PR adds an example where the policy model serves as its own reward model.
Checklist
Please check the following items before the code is ready for review.