[Example] Policy model as its own reward model #270
Conversation
Summary of Changes
Hello @hiyuchang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request integrates a novel approach for training large language models where the policy model itself learns to act as a reward model. This is achieved through a new example and a dedicated workflow that supports joint optimization of both response generation and reward prediction, particularly useful in environments with limited ground truth data. The changes also include minor improvements to existing RULER-related functionalities and better handling of experience metadata.
Highlights
- New Trainable RULER Example: Introduces an example demonstrating how a policy model can be trained to serve as its own reward model, inspired by ART's RULER and Kimi-k2.
- MathTrainableRULERWorkflow: Adds a new workflow that enables joint optimization for response generation and RULER-reward generation, handling scenarios with partial ground truth availability.
- RULER Workflow Robustness: Enhances the existing MathRULERWorkflow by improving prompt formatting and adding validation for the length of parsed scores from the judger's response.
- Experience ID Handling: Updates the workflow_runner to allow workflows to explicitly set experience task IDs, preventing unintended overwrites.
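The joint-optimization idea in the highlights, where the same policy model both generates responses and RULER-scores them, can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual API: `model.generate`, `parse_scores`, and `run_step` are invented names, and the score format is an assumption.

```python
# Hypothetical sketch of trainable RULER: one model generates candidate
# responses and then judges them, producing its own rewards for RL.
from dataclasses import dataclass
from typing import List

@dataclass
class Experience:
    response: str
    reward: float

def parse_scores(judge_text: str, expected: int) -> List[float]:
    """Parse a comma-separated score list and validate its length,
    mirroring the length check the PR adds to MathRULERWorkflow."""
    scores = [float(s) for s in judge_text.split(",")]
    if len(scores) != expected:
        raise ValueError(f"expected {expected} scores, got {len(scores)}")
    return scores

def run_step(model, task: str, n: int = 4) -> List[Experience]:
    # 1) The policy model samples n candidate responses.
    responses = [model.generate(task) for _ in range(n)]
    # 2) The same model is prompted to judge all candidates at once.
    judge_prompt = task + "\n" + "\n".join(
        f"[{i}] {r}" for i, r in enumerate(responses)
    )
    scores = parse_scores(model.generate(judge_prompt), expected=n)
    # 3) Both the responses and the judging step can be trained jointly.
    return [Experience(r, s) for r, s in zip(responses, scores)]
```

The key property is that the judging call is itself a policy-model generation, so it can receive gradient updates alongside response generation.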
/unittest-module-common
Code Review
This pull request introduces an interesting new feature where the policy model also acts as its own reward model, inspired by RULER and K2. The implementation includes a new example, a new workflow, and necessary adjustments to support it. The code is generally well-structured. My review focuses on a critical type mismatch that could cause runtime errors, along with a couple of minor documentation improvements for clarity and correctness. Overall, a great addition once the critical issue is addressed.
Pull Request Overview
This pull request implements a policy model serving as its own reward model using a trainable RULER approach, allowing the same model to generate responses and evaluate them for reinforcement learning.
Key changes:
- Modified ID fields from integer to string/integer union to support extended task identifiers
- Created a new trainable RULER workflow where the policy model judges its own outputs
- Added example configuration and documentation for the new workflow
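The ID-field change in the first bullet can be illustrated with a minimal sketch. The field names follow the review table below (`batch`, `task` on an `EID` class), but the real class in `trinity/common/experience.py` may differ:

```python
# Minimal illustration of widening EID fields from int to Union[int, str],
# so workflows can set extended string task identifiers. Defaults move
# from 0 to "" per the workflow.py change noted in the table.
from dataclasses import dataclass
from typing import Union

@dataclass
class EID:
    batch: Union[int, str] = ""  # previously int, default 0
    task: Union[int, str] = ""   # previously int, default 0

# Both plain integers and structured string IDs are now valid:
legacy = EID(batch=3, task=7)
extended = EID(batch="run-2", task="gsm8k/q42")
```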
Reviewed Changes
Copilot reviewed 10 out of 15 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| trinity/explorer/workflow_runner.py | Added check to preserve existing task IDs when set by workflows |
| trinity/common/workflows/workflow.py | Changed default batch_id and task_id from 0 to empty string |
| trinity/common/workflows/math_trainable_ruler_workflow.py | New workflow implementing policy model as its own reward model |
| trinity/common/workflows/math_ruler_workflow.py | Fixed template string bug and added validation for score list length |
| trinity/common/workflows/__init__.py | Added import and export for new trainable RULER workflow |
| trinity/common/experience.py | Changed EID batch and task fields to Union[int, str] type |
| examples/grpo_gsm8k_trainable_ruler/gsm8k_ruler.yaml | Configuration file for trainable RULER example |
| examples/grpo_gsm8k_trainable_ruler/README.md | Documentation for the new trainable RULER approach |
| examples/grpo_gsm8k_ruler/README.md | Minor formatting cleanup |
| docs/sphinx_doc/source/tutorial/faq.md | Fixed incomplete sentence in FAQ |
/unittest-all
/unittest-module-common
Description
As the title says: this PR adds an example where the policy model serves as its own reward model.
Checklist
Please check the following items before the code is ready for review.