
Conversation

@chenyushuo (Collaborator) commented Sep 23, 2025

Description

  1. Separate save_state_dict from save_checkpoint.
  2. Support default_hdfs_dir (see the illustrative sketch below).
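
For illustration only, here is a minimal sketch of how a configured default_hdfs_dir could serve as a fallback destination when no explicit HDFS path is given for a checkpoint. The function and parameter names (resolve_hdfs_dir, explicit_hdfs_dir) are hypothetical and not taken from this PR.

```python
# Hypothetical sketch: resolve the HDFS destination for a checkpoint,
# preferring an explicit path and falling back to default_hdfs_dir.
import os
from typing import Optional


def resolve_hdfs_dir(
    explicit_hdfs_dir: Optional[str],
    default_hdfs_dir: Optional[str],
    global_step: int,
) -> Optional[str]:
    base = explicit_hdfs_dir or default_hdfs_dir
    if base is None:
        return None  # no HDFS upload requested
    return os.path.join(base, f"global_step_{global_step}")


# resolve_hdfs_dir(None, "hdfs://ns/ckpts/run1", 100)
# -> "hdfs://ns/ckpts/run1/global_step_100"
```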

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has passed all tests
  • Docstrings have been added/updated in Google Style
  • Documentation has been updated
  • Code is ready for review

@gemini-code-assist (Contributor)

Summary of Changes

Hello @chenyushuo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the FSDP checkpoint saving mechanism to provide a clearer distinction between saving a model's state dictionary and saving a full checkpoint. It introduces modular, threaded helper methods for saving individual components (model, optimizer, extra states) and updates the public API in the worker and trainer modules to leverage this improved structure. This change aims to enhance the maintainability and clarity of the checkpointing system.

Highlights

  • Separation of Concerns: The core functionality of saving a model's state dictionary (save_state_dict) has been explicitly separated from the more comprehensive checkpoint saving process (save_checkpoint).
  • Granular Saving Methods: New private methods (_save_model, _save_optimizer, _save_extra_state) have been introduced in fsdp_checkpoint_manager.py to handle the threaded saving of individual components (model, optimizer, and additional states like the LR scheduler and RNG); see the sketch after this list.
  • Refactored Checkpoint Manager: The save_checkpoint method in fsdp_checkpoint_manager.py now delegates to these new granular saving methods, streamlining its logic and removing the model_state_dict_only parameter's direct influence on internal saving decisions.
  • Updated Worker API: The fsdp_workers.py file now exposes a new save_state_dict method and a refactored save_checkpoint method, both utilizing the updated checkpoint manager and incorporating LoRA-specific saving logic.
  • Trainer Integration: The verl_trainer.py has been updated to call the new save_state_dict method on the actor rollout worker group when saving only the state dictionary, aligning with the refactored API.
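
The following is a purely illustrative sketch of the structure described in these highlights: granular helpers that each write one component on a background thread, a save_state_dict that saves only the model, and a save_checkpoint that delegates to all three. The class name, method bodies, and file names are assumptions rather than the PR's actual code; in particular, gathering sharded FSDP state dicts (state-dict contexts, rank-0-only writes, etc.) is omitted to keep the shape visible.

```python
# Illustrative sketch only; not the code added by this PR.
import os
import threading

import torch


class FSDPCheckpointManagerSketch:
    def __init__(self, model, optimizer=None, lr_scheduler=None):
        self.model = model
        self.optimizer = optimizer
        self.lr_scheduler = lr_scheduler

    def _save_model(self, local_path: str) -> threading.Thread:
        # Write the model state dict on a background thread.
        state = self.model.state_dict()
        thread = threading.Thread(
            target=torch.save, args=(state, os.path.join(local_path, "model.pt"))
        )
        thread.start()
        return thread

    def _save_optimizer(self, local_path: str) -> threading.Thread:
        # Write the optimizer state dict on a background thread.
        state = self.optimizer.state_dict()
        thread = threading.Thread(
            target=torch.save, args=(state, os.path.join(local_path, "optim.pt"))
        )
        thread.start()
        return thread

    def _save_extra_state(self, local_path: str) -> threading.Thread:
        # Write additional state (LR scheduler, RNG) on a background thread.
        extra = {
            "lr_scheduler": self.lr_scheduler.state_dict() if self.lr_scheduler else None,
            "rng": torch.get_rng_state(),
        }
        thread = threading.Thread(
            target=torch.save, args=(extra, os.path.join(local_path, "extra_state.pt"))
        )
        thread.start()
        return thread

    def save_state_dict(self, local_path: str) -> None:
        # Save only the model weights.
        os.makedirs(local_path, exist_ok=True)
        self._save_model(local_path).join()

    def save_checkpoint(self, local_path: str) -> None:
        # Save model, optimizer, and extra state by delegating to the helpers.
        os.makedirs(local_path, exist_ok=True)
        threads = [
            self._save_model(local_path),
            self._save_optimizer(local_path),
            self._save_extra_state(local_path),
        ]
        for thread in threads:
            thread.join()
```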

@gemini-code-assist (Contributor) left a comment

Code Review

This pull request refactors the FSDP checkpoint saving logic by separating save_state_dict from save_checkpoint, which is a good step towards improving modularity. The use of threading for I/O operations and the extraction of helper methods like _save_model are positive changes. However, the review identified a critical issue: a TypeError will occur at runtime because of a method-signature mismatch introduced by the refactoring. I have also made several medium-severity recommendations to improve code clarity and maintainability: removing an unused parameter, updating an outdated docstring, and reducing code duplication in the new saving methods.
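
Purely as an illustration of the duplication point above, the three granular methods could delegate to one shared helper; the helper name _save_on_thread is hypothetical and not code from this PR.

```python
# Hypothetical shared helper: serialize an object to local_path/filename on a
# background thread, returning the thread so callers can join() it later.
import os
import threading

import torch


def _save_on_thread(obj, local_path: str, filename: str) -> threading.Thread:
    thread = threading.Thread(
        target=torch.save, args=(obj, os.path.join(local_path, filename))
    )
    thread.start()
    return thread

# _save_model / _save_optimizer / _save_extra_state could then each become a
# one-liner such as:
#     return _save_on_thread(self.model.state_dict(), local_path, "model.pt")
```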

@chenyushuo (Collaborator, Author)

/unittest-module-trainer

@chenyushuo (Collaborator, Author)

/unittest-module-trainer

@chenyushuo (Collaborator, Author)

/gemini review

@gemini-code-assist (Contributor) left a comment


Code Review

This pull request refactors the checkpoint saving mechanism, primarily by separating save_state_dict from save_checkpoint. It is a significant change that introduces a new CheckpointMonitor actor to manage asynchronous checkpointing and state dictionary saving, which is a good architectural improvement for decoupling components. The changes span the trainer, worker, and checkpoint manager implementations for both the FSDP and Megatron strategies.
My review has identified a critical race condition in the new CheckpointMonitor actor that could lead to inconsistent state reporting. I've also found a few medium to high severity issues related to overly broad exception handling and incorrect log messages. Addressing these issues will improve the robustness and debuggability of the new checkpointing system.
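To make the race-condition concern concrete, here is a minimal sketch of an actor that tracks in-flight asynchronous saves. The class and method names are hypothetical and this is not the implementation added by this PR; it only illustrates that when such bookkeeping lives inside a single (non-async) Ray actor, method calls are processed one at a time, so the counter updates cannot interleave.

```python
# Hypothetical sketch of an actor that tracks asynchronous checkpoint saves.
import ray

ray.init(ignore_reinit_error=True)


@ray.remote
class CheckpointMonitorSketch:
    def __init__(self):
        self._in_flight = 0          # saves started but not yet finished
        self._last_saved_step = -1   # newest step known to be fully persisted

    def register_save(self) -> None:
        # Called right before a worker starts writing a checkpoint.
        self._in_flight += 1

    def finish_save(self, step: int) -> None:
        # Called after a worker's checkpoint write has completed.
        self._in_flight -= 1
        self._last_saved_step = max(self._last_saved_step, step)

    def is_idle(self) -> bool:
        # True only when no save is in progress.
        return self._in_flight == 0

    def last_saved_step(self) -> int:
        return self._last_saved_step


# A plain (non-async) Ray actor executes its method calls one at a time, so the
# bookkeeping updates above are serialized and cannot interleave with each other.
monitor = CheckpointMonitorSketch.remote()
ray.get(monitor.register_save.remote())
ray.get(monitor.finish_save.remote(step=10))
assert ray.get(monitor.is_idle.remote())
```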

@github-actions

Summary

Tests 📝: 18 · Passed ✅: 16 · Failed ❌: 1 · Skipped ⏭️: 1 · Other ❓: 0 · Flaky 🍂: 0 · Duration ⏱️: 1.8s

Failed Tests

❌ tests/trainer/trainer_test.py::TestTrainerMIX::test_trainer: the test failed in the call phase due to an assertion error

Skipped

Tests Status
tests/trainer/trainer_test.py::TestTrainerMultiModal::test_trainer skipped ⏭️

Tests

Test Name Status Flaky Duration
tests/trainer/trainer_test.py::TestTrainerCountdown_0_fsdp::test_trainer 139ms
tests/trainer/trainer_test.py::TestTrainerCountdown_1_megatron::test_trainer 305ms
tests/trainer/trainer_test.py::TestStepAheadAsyncRL::test_trainer 66ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_0_fsdp::test_trainer 54ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_1_fsdp2::test_trainer 54ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_2_fsdp::test_trainer 56ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_3_fsdp2::test_trainer 66ms
tests/trainer/trainer_test.py::TestTrainerSFTWarmupGSM8K::test_trainer 100ms
tests/trainer/trainer_test.py::TestTrainerDPO::test_trainer 40ms
tests/trainer/trainer_test.py::TestTrainerSFT::test_trainer 36ms
tests/trainer/trainer_test.py::TestTrainerToolsSFT::test_trainer_tools 36ms
tests/trainer/trainer_test.py::TestFullyAsyncMode_0_fsdp::test_fully_async_mode 83ms
tests/trainer/trainer_test.py::TestFullyAsyncMode_1_fsdp::test_fully_async_mode 80ms
tests/trainer/trainer_test.py::TestFullyAsyncMode_2_megatron::test_fully_async_mode 165ms
tests/trainer/trainer_test.py::TestTrainerCheckpointSave_0_fsdp::test_trainer 101ms
tests/trainer/trainer_test.py::TestTrainerCheckpointSave_1_megatron::test_trainer 381ms
tests/trainer/trainer_test.py::TestTrainerMIX::test_trainer 1ms
tests/trainer/trainer_test.py::TestTrainerMultiModal::test_trainer ⏭️ 1ms

Github Test Reporter by CTRF 💚

@chenyushuo (Collaborator, Author)

/unittest-module-trainer

@chenyushuo changed the title from "[WIP] Refactor fsdp checkpoint save" to "Refactor fsdp checkpoint save" on Sep 26, 2025
@chenyushuo changed the title from "Refactor fsdp checkpoint save" to "Refactor checkpoint save" on Sep 26, 2025
@github-actions

Summary

Tests 📝: 18 · Passed ✅: 17 · Failed ❌: 0 · Skipped ⏭️: 1 · Other ❓: 0 · Flaky 🍂: 0 · Duration ⏱️: 1.9s

Skipped

Tests Status
tests/trainer/trainer_test.py::TestTrainerMultiModal::test_trainer skipped ⏭️

Tests

Test Name Status Flaky Duration
tests/trainer/trainer_test.py::TestTrainerCountdown_0_fsdp::test_trainer 152ms
tests/trainer/trainer_test.py::TestTrainerCountdown_1_megatron::test_trainer 324ms
tests/trainer/trainer_test.py::TestStepAheadAsyncRL::test_trainer 60ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_0_fsdp::test_trainer 54ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_1_fsdp2::test_trainer 50ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_2_fsdp::test_trainer 57ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_3_fsdp2::test_trainer 62ms
tests/trainer/trainer_test.py::TestTrainerSFTWarmupGSM8K::test_trainer 100ms
tests/trainer/trainer_test.py::TestTrainerDPO::test_trainer 39ms
tests/trainer/trainer_test.py::TestTrainerSFT::test_trainer 36ms
tests/trainer/trainer_test.py::TestTrainerToolsSFT::test_trainer_tools 36ms
tests/trainer/trainer_test.py::TestFullyAsyncMode_0_fsdp::test_fully_async_mode 79ms
tests/trainer/trainer_test.py::TestFullyAsyncMode_1_fsdp::test_fully_async_mode 80ms
tests/trainer/trainer_test.py::TestFullyAsyncMode_2_megatron::test_fully_async_mode 164ms
tests/trainer/trainer_test.py::TestTrainerCheckpointSave_0_fsdp::test_trainer 125ms
tests/trainer/trainer_test.py::TestTrainerCheckpointSave_1_megatron::test_trainer 427ms
tests/trainer/trainer_test.py::TestTrainerMIX::test_trainer 55ms
tests/trainer/trainer_test.py::TestTrainerMultiModal::test_trainer ⏭️ 1ms

Github Test Reporter by CTRF 💚

@pan-x-c (Collaborator) commented Sep 26, 2025

/unittest-module-manager

@github-actions

Summary

Tests 📝: 7 · Passed ✅: 7 · Failed ❌: 0 · Skipped ⏭️: 0 · Other ❓: 0 · Flaky 🍂: 0 · Duration ⏱️: 568ms

Tests

Test Name Status Flaky Duration
tests/manager/synchronizer_test.py::TestSynchronizerExit::test_synchronizer 32ms
tests/manager/synchronizer_test.py::TestStateDictBasedSynchronizer_0::test_synchronizer 74ms
tests/manager/synchronizer_test.py::TestStateDictBasedSynchronizer_1::test_synchronizer 79ms
tests/manager/synchronizer_test.py::TestStateDictBasedSynchronizer_2::test_synchronizer 115ms
tests/manager/synchronizer_test.py::TestStateDictBasedSynchronizer_3::test_synchronizer 121ms
tests/manager/synchronizer_test.py::TestNCCLBasedSynchronizer_0::test_synchronizer 70ms
tests/manager/synchronizer_test.py::TestNCCLBasedSynchronizer_1::test_synchronizer 71ms

Github Test Reporter by CTRF 💚

@chenyushuo (Collaborator, Author)

/unittest-module-trainer

@chenyushuo (Collaborator, Author)

/unittest-module-trainer

@github-actions

Summary

Tests 📝: 18 · Passed ✅: 17 · Failed ❌: 0 · Skipped ⏭️: 1 · Other ❓: 0 · Flaky 🍂: 0 · Duration ⏱️: 1.8s

Skipped

Tests Status
tests/trainer/trainer_test.py::TestTrainerMultiModal::test_trainer skipped ⏭️

Tests

Test Name Status Flaky Duration
tests/trainer/trainer_test.py::TestTrainerCountdown_0_fsdp::test_trainer 147ms
tests/trainer/trainer_test.py::TestTrainerCountdown_1_megatron::test_trainer 323ms
tests/trainer/trainer_test.py::TestStepAheadAsyncRL::test_trainer 63ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_0_fsdp::test_trainer 53ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_1_fsdp2::test_trainer 52ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_2_fsdp::test_trainer 56ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_3_fsdp2::test_trainer 62ms
tests/trainer/trainer_test.py::TestTrainerSFTWarmupGSM8K::test_trainer 103ms
tests/trainer/trainer_test.py::TestTrainerDPO::test_trainer 39ms
tests/trainer/trainer_test.py::TestTrainerSFT::test_trainer 35ms
tests/trainer/trainer_test.py::TestTrainerToolsSFT::test_trainer_tools 36ms
tests/trainer/trainer_test.py::TestFullyAsyncMode_0_fsdp::test_fully_async_mode 84ms
tests/trainer/trainer_test.py::TestFullyAsyncMode_1_fsdp::test_fully_async_mode 80ms
tests/trainer/trainer_test.py::TestFullyAsyncMode_2_megatron::test_fully_async_mode 172ms
tests/trainer/trainer_test.py::TestTrainerCheckpointSave_0_fsdp::test_trainer 110ms
tests/trainer/trainer_test.py::TestTrainerCheckpointSave_1_megatron::test_trainer 342ms
tests/trainer/trainer_test.py::TestTrainerMIX::test_trainer 57ms
tests/trainer/trainer_test.py::TestTrainerMultiModal::test_trainer ⏭️ 1ms

Github Test Reporter by CTRF 💚

@pan-x-c merged commit 7e01668 into modelscope:main on Sep 26, 2025
1 check passed