Cleaning up the old-version data processor #213

HYLcool · 2025-08-21T12:48:00Z

Description

As the title says. And updating some docs about data processor.

Checklist

Please check the following items before code is ready to be reviewed.

Code has passed all tests
Docstrings have been added/updated in Google Style
Documentation has been updated
Code is ready for review

gemini-code-assist

Summary of Changes

Hello @HYLcool, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on a significant cleanup and refactoring of the data processing module. The primary goal is to remove an outdated version of the data processor and its associated components, streamlining the codebase. Concurrently, the documentation has been updated to accurately reflect the current unified data processor architecture, which leverages Data-Juicer for various data handling tasks. This change improves code maintainability and clarity by eliminating deprecated functionalities and providing up-to-date guidance.

Highlights

Major Refactoring of Data Processing Module: The pull request undertakes a significant refactoring of the data processing module, removing several older components such as DataActiveIterator, DataTaskParser, DataCleaner, DataSynthesizer, RftDataset, and their associated test files. This indicates a strategic shift towards a more streamlined and potentially unified data processing architecture.
Documentation Updates: Documentation has been updated to reflect the current state of the data processor. This includes a new 'Overview' section in example_data_functionalities.md explaining Trinity-RFT's unified data processor and its integration with Data-Juicer, as well as clarifications on operator concepts in trinity_programming_guide.md.
Configuration and Constant Streamlining: The configuration and constant definitions have been streamlined. Specifically, RewardShapingConfig and DataPipelineConfig dataclasses were removed from trinity/common/config.py, and the DataProcessorPipelineType enum values in trinity/common/constants.py were simplified for better consistency.
Code Modernization and Cleanup: Existing code that interacts with the data processing pipeline has been updated to align with the new architecture. This includes changes in DataJuicerOperator and TaskPipeline to utilize the simplified DataProcessorPipelineType enum, ensuring compatibility with the removed older data processor components.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request effectively cleans up the old data processor, replacing it with a more streamlined, service-based architecture centered around DataJuicerOperator. The removal of numerous legacy files significantly simplifies the codebase, and the accompanying documentation updates are helpful. My review includes a couple of suggestions to enhance the documentation's clarity and long-term stability.

docs/sphinx_doc/source/tutorial/example_data_functionalities.md

docs/sphinx_doc/source/tutorial/trinity_programming_guide.md

* update the link to the data processor arch

pan-x-c · 2025-08-22T02:35:25Z

/unittest-all

github-actions · 2025-08-22T03:07:55Z

Summary

Tests 📝	Passed ✅	Failed ❌	Skipped ⏭️	Other ❓	Flaky 🍂	Duration ⏱️
106	106	0	0	0	0	1.9s

Tests

Test Name	Status	Duration
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_duplicate_grpo	✅	1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_grpo_advantage	✅	1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_grpo_correct_bias	✅	1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_grpo_reward_std	✅	1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_step_wise_grpo_advantage	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_dpo_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_gspo_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_mix_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_opmd_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_ppo_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_sft_policy_loss	✅	1ms
tests/buffer/experience_pipeline_test.py::TestExperiencePipeline::test_experience_pipeline	✅	11ms
tests/buffer/file_test.py::TestFileBuffer::test_file_buffer	✅	2ms
tests/buffer/file_test.py::TestFileBuffer::test_file_reader	✅	1ms
tests/buffer/file_test.py::TestFileBuffer::test_file_writer	✅	2ms
tests/buffer/queue_test.py::TestQueueBuffer::test_priority_queue_buffer_reuse	✅	7ms
tests/buffer/queue_test.py::TestQueueBuffer::test_priority_queue_capacity	✅	3ms
tests/buffer/queue_test.py::TestQueueBuffer::test_queue_buffer_0_queue	✅	4ms
tests/buffer/queue_test.py::TestQueueBuffer::test_queue_buffer_1_priority_queue	✅	4ms
tests/buffer/queue_test.py::TestQueueBuffer::test_queue_buffer_capacity	✅	5ms
tests/buffer/reward_shaping_mapper_test.py::TestRewardShapingMapper::test_basic_usage	✅	1ms
tests/buffer/sql_test.py::TestSQLBuffer::test_create_sql_buffer	✅	4ms
tests/common/config_test.py::TestConfig::test_all_examples_are_valid	✅	1ms
tests/common/config_test.py::TestConfig::test_config_flatten	✅	1ms
tests/common/config_test.py::TestConfig::test_continue_from_checkpoint_is_valid	✅	1ms
tests/common/config_test.py::TestConfig::test_load_default_config	✅	5ms
tests/common/experience_test.py::TestEID::test_eid_properties	✅	1ms
tests/common/experience_test.py::TestExperience::test_action_mask_and_logprobs_type	✅	1ms
tests/common/experience_test.py::TestExperience::test_assertions	✅	1ms
tests/common/experience_test.py::TestExperience::test_dpo_experience	✅	1ms
tests/common/experience_test.py::TestExperience::test_gather	✅	1ms
tests/common/experience_test.py::TestExperience::test_hf_datasets_conversion	✅	1ms
tests/common/experience_test.py::TestExperience::test_multi_turn_experience	✅	1ms
tests/common/experience_test.py::TestExperience::test_serialize_deserialize	✅	1ms
tests/common/experience_test.py::TestExperience::test_single_turn_experience	✅	1ms
tests/common/experience_test.py::TestExperience::test_to_dict	✅	1ms
tests/common/experience_test.py::TestExperienceConversion::test_batch_conversion	✅	1ms
tests/common/experience_test.py::TestExperienceConversion::test_dpo_experience_batch_conversion	✅	1ms
tests/common/experience_test.py::TestExperienceConversion::test_experience_model_experience_conversion	✅	1ms
tests/common/experience_test.py::TestExperienceConversion::test_gather_experiences_with_custom_fields	✅	1ms
tests/common/experience_test.py::TestExperienceConversion::test_multiturn_experience_batch_converstion	✅	1ms
tests/common/vllm_test.py::ModelWrapperTest_0::test_generate	✅	37ms
tests/common/vllm_test.py::ModelWrapperTest_1::test_generate	✅	17ms
tests/common/vllm_test.py::ModelWrapperTest_2::test_generate	✅	16ms
tests/common/vllm_test.py::ModelWrapperTest_3::test_generate	✅	54ms
tests/common/vllm_test.py::ModelWrapperTest_4::test_generate	✅	47ms
tests/common/vllm_test.py::ModelWrapperTest_5::test_generate	✅	35ms
tests/common/vllm_test.py::ModelWrapperTest_6::test_generate	✅	47ms
tests/common/vllm_test.py::TestAPIServer::test_api	✅	23ms
tests/common/vllm_test.py::TestTokenizer::test_assistant_token_mask	✅	1ms
tests/common/vllm_test.py::TestAPIServerToolCall_0_deepseek_r1::test_api_tool_calls	✅	21ms
tests/common/vllm_test.py::TestAPIServerToolCall_1::test_api_tool_calls	✅	19ms
tests/explorer/explorer_test.py::BaseExplorerCase::test_explorer	✅	1ms
tests/explorer/explorer_test.py::TestExplorerCountdownEval::test_explorer	✅	65ms
tests/explorer/explorer_test.py::TestExplorerCountdownNoEval::test_explorer	✅	69ms
tests/explorer/explorer_test.py::TestExplorerGSM8k::test_explorer	✅	198ms
tests/explorer/scheduler_test.py::SchedulerTest::test_concurrent_operations	✅	4ms
tests/explorer/scheduler_test.py::SchedulerTest::test_get_results	✅	19ms
tests/explorer/scheduler_test.py::SchedulerTest::test_multi_step_execution	✅	4ms
tests/explorer/scheduler_test.py::SchedulerTest::test_non_repeatable_workflow	✅	4ms
tests/explorer/scheduler_test.py::SchedulerTest::test_scheduler_all_methods	✅	14ms
tests/explorer/scheduler_test.py::SchedulerTest::test_scheduler_restart_after_stop	✅	7ms
tests/explorer/scheduler_test.py::SchedulerTest::test_split_tasks	✅	7ms
tests/explorer/scheduler_test.py::SchedulerTest::test_stepwise_experience_eid	✅	4ms
tests/explorer/scheduler_test.py::SchedulerTest::test_wait_all	✅	7ms
tests/explorer/scheduler_test.py::SchedulerTest::test_wait_all_timeout_with_multi_batch	✅	13ms
tests/explorer/step_wise_workflow_test.py::WorkflowTest::test_reward_propagation_workflow	✅	1ms
tests/explorer/step_wise_workflow_test.py::WorkflowTest::test_step_wise_reward_workflow	✅	1ms
tests/explorer/step_wise_workflow_test.py::WorkflowTest::test_workflows_raise_error	✅	1ms
tests/explorer/step_wise_workflow_test.py::WorkflowTest::test_workflows_stop_at_max_env_steps	✅	1ms
tests/explorer/workflow_test.py::WorkflowTest::test_gsm8k_workflow	✅	1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_boxed_workflow	✅	1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_complex_workflow	✅	1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_eval_workflow	✅	1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_fraction_workflow	✅	1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_workflow	✅	1ms
tests/explorer/workflow_test.py::WorkflowTest::test_rm_gallery_workflow	✅	1ms
tests/explorer/workflow_test.py::WorkflowTest::test_workflow_repeatable	✅	1ms
tests/explorer/workflow_test.py::WorkflowTest::test_workflow_resettable	✅	1ms
tests/manager/synchronizer_test.py::TestSynchronizerExit::test_synchronizer	✅	28ms
tests/manager/synchronizer_test.py::TestStateDictBasedSynchronizer_0::test_synchronizer	✅	59ms
tests/manager/synchronizer_test.py::TestStateDictBasedSynchronizer_1::test_synchronizer	✅	64ms
tests/manager/synchronizer_test.py::TestStateDictBasedSynchronizer_2::test_synchronizer	✅	95ms
tests/manager/synchronizer_test.py::TestStateDictBasedSynchronizer_3::test_synchronizer	✅	88ms
tests/manager/synchronizer_test.py::TestNCCLBasedSynchronizer_0::test_synchronizer	✅	51ms
tests/manager/synchronizer_test.py::TestNCCLBasedSynchronizer_1::test_synchronizer	✅	52ms
tests/service/data_juicer_test.py::TestDataJuicer::test_config	✅	1ms
tests/service/data_juicer_test.py::TestDataJuicer::test_server_start	✅	21ms
tests/service/data_juicer_test.py::TestDataJuicerExperiencePipeline::test_data_juicer_operators	✅	21ms
tests/service/data_juicer_test.py::TestDataJuicerTaskPipeline::test_data_juicer_task_pipeline	✅	14ms
tests/trainer/trainer_test.py::TestTrainerCountdown::test_trainer	✅	134ms
tests/trainer/trainer_test.py::TestStepAheadAsyncRL::test_trainer	✅	55ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_0_fsdp::test_trainer	✅	44ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_1_fsdp2::test_trainer	✅	45ms
tests/trainer/trainer_test.py::TestTrainerSFTWarmupGSM8K::test_trainer	✅	51ms
tests/trainer/trainer_test.py::TestTrainerDPO::test_trainer	✅	32ms
tests/trainer/trainer_test.py::TestTrainerSFT::test_trainer	✅	29ms
tests/trainer/trainer_test.py::TestFullyAsyncMode::test_fully_async_mode_0_queue	✅	70ms
tests/trainer/trainer_test.py::TestFullyAsyncMode::test_fully_async_mode_1_priority_queue	✅	67ms
tests/trainer/trainer_test.py::TestTrainerMIX::test_trainer	✅	62ms
tests/utils/eval_utils_test.py::TestMathEvalUtils::test_extract_answer	✅	1ms
tests/utils/eval_utils_test.py::TestMathEvalUtils::test_verify_math_answer	✅	1ms
tests/utils/eval_utils_test.py::TestEvalUtils::test_is_equiv	✅	1ms
tests/utils/plugin_test.py::TestPluginLoader::test_load_plugins_local	✅	1ms
tests/utils/plugin_test.py::TestPluginLoader::test_load_plugins_remote	✅	5ms
tests/utils/plugin_test.py::TestPluginLoader::test_passing_custom_class	✅	3ms

Github Test Reporter by CTRF 💚

Copilot

Pull Request Overview

This PR removes the old-version data processor components and updates documentation. The cleanup involves removing outdated data processing modules that are no longer needed, including the data server, processors, controllers, and related test files.

Removed old data processing server and pipeline components
Deleted legacy data processor classes and their tests
Updated documentation to reflect current data processing architecture
Cleaned up unused configuration classes and constants

Reviewed Changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
trinity/data/server.py	Removed Flask-based data processing server
trinity/data/readme.md	Removed outdated data engine documentation
trinity/data/processors/*.py	Removed legacy data processor classes (synthesizer, annotator, cleaner, base)
trinity/data/core/*.py	Removed old core data classes (formatter, dataset, comparator)
trinity/data/controllers/*.py	Removed task parser and active iterator controllers
trinity/common/constants.py	Removed DataProcessorPipelineType enum
trinity/common/config.py	Removed data pipeline configuration classes
tests/data/	Removed all test files for legacy data processing components
docs/	Updated documentation to reflect current data processing architecture

Comments suppressed due to low confidence (3)

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

HYLcool added 2 commits August 21, 2025 19:52

- clear old-version data module

3c2057e

* updating docs

1c5db1a

HYLcool requested review from pan-x-c and yxdyc August 21, 2025 12:48

HYLcool self-assigned this Aug 21, 2025

HYLcool added the documentation Improvements or additions to documentation label Aug 21, 2025

gemini-code-assist bot reviewed Aug 21, 2025

View reviewed changes

docs/sphinx_doc/source/tutorial/example_data_functionalities.md Outdated Show resolved Hide resolved

docs/sphinx_doc/source/tutorial/trinity_programming_guide.md Outdated Show resolved Hide resolved

HYLcool added 2 commits August 21, 2025 20:49

* modified according to the gemini's comments

9d50786

* remove pipeline type enum

b22c5a3

* update the link to the data processor arch

pan-x-c approved these changes Aug 22, 2025

View reviewed changes

pan-x-c requested a review from Copilot August 22, 2025 03:10

Copilot AI reviewed Aug 22, 2025

View reviewed changes

pan-x-c merged commit 61d10e7 into modelscope:main Aug 22, 2025
2 checks passed

yaochaorui pushed a commit to yaochaorui/Trinity-RFT that referenced this pull request Aug 27, 2025

Cleaning up the old-version data processor (modelscope#213)

04504a0

HYLcool deleted the dev/data_code_clean branch September 3, 2025 11:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Cleaning up the old-version data processor #213

Cleaning up the old-version data processor #213

Uh oh!

HYLcool commented Aug 21, 2025 •

edited

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Uh oh!

pan-x-c commented Aug 22, 2025

Uh oh!

github-actions bot commented Aug 22, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Cleaning up the old-version data processor #213

Cleaning up the old-version data processor #213

Uh oh!

Conversation

HYLcool commented Aug 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

pan-x-c commented Aug 22, 2025

Uh oh!

github-actions bot commented Aug 22, 2025

Summary

Tests

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

HYLcool commented Aug 21, 2025 •

edited

Loading