
Conversation

@chenyushuo (Collaborator) commented Sep 23, 2025

Description

  1. Separate save_state_dict from save_checkpoint.
  2. Support default_hdfs_dir (see the illustrative sketch below).
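
For illustration only, here is a minimal sketch of how a configured default_hdfs_dir could serve as a fallback destination when no explicit HDFS path is given for a checkpoint. The function and parameter names (resolve_hdfs_dir, explicit_hdfs_dir) are hypothetical and not taken from this PR.

```python
# Hypothetical sketch: resolve the HDFS destination for a checkpoint,
# preferring an explicit path and falling back to default_hdfs_dir.
import os
from typing import Optional


def resolve_hdfs_dir(
    explicit_hdfs_dir: Optional[str],
    default_hdfs_dir: Optional[str],
    global_step: int,
) -> Optional[str]:
    base = explicit_hdfs_dir or default_hdfs_dir
    if base is None:
        return None  # no HDFS upload requested
    return os.path.join(base, f"global_step_{global_step}")


# resolve_hdfs_dir(None, "hdfs://ns/ckpts/run1", 100)
# -> "hdfs://ns/ckpts/run1/global_step_100"
```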

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has passed all tests
  • Docstrings have been added/updated in Google Style
  • Documentation has been updated
  • Code is ready for review

@gemini-code-assist (Contributor)

Summary of Changes

Hello @chenyushuo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the FSDP checkpoint saving mechanism to provide a clearer distinction between saving a model's state dictionary and saving a full checkpoint. It introduces modular, threaded helper methods for saving individual components (model, optimizer, extra states) and updates the public API in the worker and trainer modules to leverage this improved structure. This change aims to enhance the maintainability and clarity of the checkpointing system.

Highlights

  • Separation of Concerns: The core functionality of saving a model's state dictionary (save_state_dict) has been explicitly separated from the more comprehensive checkpoint saving process (save_checkpoint).
  • Granular Saving Methods: New private methods (_save_model, _save_optimizer, _save_extra_state) have been introduced in fsdp_checkpoint_manager.py to handle the threaded saving of individual components (model, optimizer, and additional states like the LR scheduler and RNG); see the sketch after this list.
  • Refactored Checkpoint Manager: The save_checkpoint method in fsdp_checkpoint_manager.py now delegates to these new granular saving methods, streamlining its logic and removing the model_state_dict_only parameter's direct influence on internal saving decisions.
  • Updated Worker API: The fsdp_workers.py file now exposes a new save_state_dict method and a refactored save_checkpoint method, both utilizing the updated checkpoint manager and incorporating LoRA-specific saving logic.
  • Trainer Integration: The verl_trainer.py has been updated to call the new save_state_dict method on the actor rollout worker group when saving only the state dictionary, aligning with the refactored API.
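
The following is a purely illustrative sketch of the structure described in these highlights: granular helpers that each write one component on a background thread, a save_state_dict that saves only the model, and a save_checkpoint that delegates to all three. The class name, method bodies, and file names are assumptions rather than the PR's actual code; in particular, gathering sharded FSDP state dicts (state-dict contexts, rank-0-only writes, etc.) is omitted to keep the shape visible.

```python
# Illustrative sketch only; not the code added by this PR.
import os
import threading

import torch


class FSDPCheckpointManagerSketch:
    def __init__(self, model, optimizer=None, lr_scheduler=None):
        self.model = model
        self.optimizer = optimizer
        self.lr_scheduler = lr_scheduler

    def _save_model(self, local_path: str) -> threading.Thread:
        # Write the model state dict on a background thread.
        state = self.model.state_dict()
        thread = threading.Thread(
            target=torch.save, args=(state, os.path.join(local_path, "model.pt"))
        )
        thread.start()
        return thread

    def _save_optimizer(self, local_path: str) -> threading.Thread:
        # Write the optimizer state dict on a background thread.
        state = self.optimizer.state_dict()
        thread = threading.Thread(
            target=torch.save, args=(state, os.path.join(local_path, "optim.pt"))
        )
        thread.start()
        return thread

    def _save_extra_state(self, local_path: str) -> threading.Thread:
        # Write additional state (LR scheduler, RNG) on a background thread.
        extra = {
            "lr_scheduler": self.lr_scheduler.state_dict() if self.lr_scheduler else None,
            "rng": torch.get_rng_state(),
        }
        thread = threading.Thread(
            target=torch.save, args=(extra, os.path.join(local_path, "extra_state.pt"))
        )
        thread.start()
        return thread

    def save_state_dict(self, local_path: str) -> None:
        # Save only the model weights.
        os.makedirs(local_path, exist_ok=True)
        self._save_model(local_path).join()

    def save_checkpoint(self, local_path: str) -> None:
        # Save model, optimizer, and extra state by delegating to the helpers.
        os.makedirs(local_path, exist_ok=True)
        threads = [
            self._save_model(local_path),
            self._save_optimizer(local_path),
            self._save_extra_state(local_path),
        ]
        for thread in threads:
            thread.join()
```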

@gemini-code-assist (Contributor) left a comment

Code Review

This pull request refactors the FSDP checkpoint saving logic by separating save_state_dict from save_checkpoint, which is a good step towards improving modularity. The use of threading for I/O operations and the extraction of helper methods like _save_model are positive changes. However, the review identified a critical issue: a TypeError will occur at runtime because of a method-signature mismatch introduced by the refactoring. I have also made several medium-severity recommendations to improve code clarity and maintainability: removing an unused parameter, updating an outdated docstring, and reducing code duplication in the new saving methods.
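
Purely as an illustration of the duplication point above, the three granular methods could delegate to one shared helper; the helper name _save_on_thread is hypothetical and not code from this PR.

```python
# Hypothetical shared helper: serialize an object to local_path/filename on a
# background thread, returning the thread so callers can join() it later.
import os
import threading

import torch


def _save_on_thread(obj, local_path: str, filename: str) -> threading.Thread:
    thread = threading.Thread(
        target=torch.save, args=(obj, os.path.join(local_path, filename))
    )
    thread.start()
    return thread

# _save_model / _save_optimizer / _save_extra_state could then each become a
# one-liner such as:
#     return _save_on_thread(self.model.state_dict(), local_path, "model.pt")
```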

@chenyushuo (Collaborator, Author)

/unittest-module-trainer

@chenyushuo (Collaborator, Author)

/unittest-module-trainer

@chenyushuo (Collaborator, Author)

/gemini review

@gemini-code-assist (Contributor) left a comment


Code Review

This pull request refactors the checkpoint saving mechanism, primarily by separating save_state_dict from save_checkpoint. It is a significant change that introduces a new CheckpointMonitor actor to manage asynchronous checkpointing and state dictionary saving, which is a good architectural improvement for decoupling components. The changes span the trainer, worker, and checkpoint manager implementations for both the FSDP and Megatron strategies.
My review has identified a critical race condition in the new CheckpointMonitor actor that could lead to inconsistent state reporting. I've also found a few medium to high severity issues related to overly broad exception handling and incorrect log messages. Addressing these issues will improve the robustness and debuggability of the new checkpointing system.
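To make the race-condition concern concrete, here is a minimal sketch of an actor that tracks in-flight asynchronous saves. The class and method names are hypothetical and this is not the implementation added by this PR; it only illustrates that when such bookkeeping lives inside a single (non-async) Ray actor, method calls are processed one at a time, so the counter updates cannot interleave.

```python
# Hypothetical sketch of an actor that tracks asynchronous checkpoint saves.
import ray

ray.init(ignore_reinit_error=True)


@ray.remote
class CheckpointMonitorSketch:
    def __init__(self):
        self._in_flight = 0          # saves started but not yet finished
        self._last_saved_step = -1   # newest step known to be fully persisted

    def register_save(self) -> None:
        # Called right before a worker starts writing a checkpoint.
        self._in_flight += 1

    def finish_save(self, step: int) -> None:
        # Called after a worker's checkpoint write has completed.
        self._in_flight -= 1
        self._last_saved_step = max(self._last_saved_step, step)

    def is_idle(self) -> bool:
        # True only when no save is in progress.
        return self._in_flight == 0

    def last_saved_step(self) -> int:
        return self._last_saved_step


# A plain (non-async) Ray actor executes its method calls one at a time, so the
# bookkeeping updates above are serialized and cannot interleave with each other.
monitor = CheckpointMonitorSketch.remote()
ray.get(monitor.register_save.remote())
ray.get(monitor.finish_save.remote(step=10))
assert ray.get(monitor.is_idle.remote())
```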

@github-actions

Summary

Tests 📝: 18 · Passed ✅: 16 · Failed ❌: 1 · Skipped ⏭️: 1 · Other ❓: 0 · Flaky 🍂: 0 · Duration ⏱️: 1.8s

Failed Tests

❌ tests/trainer/trainer_test.py::TestTrainerMIX::test_trainer: the test failed in the call phase due to an assertion error

Skipped

Tests Status
tests/trainer/trainer_test.py::TestTrainerMultiModal::test_trainer skipped ⏭️

Tests

Test Name Status Flaky Duration
tests/trainer/trainer_test.py::TestTrainerCountdown_0_fsdp::test_trainer 139ms
tests/trainer/trainer_test.py::TestTrainerCountdown_1_megatron::test_trainer 305ms
tests/trainer/trainer_test.py::TestStepAheadAsyncRL::test_trainer 66ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_0_fsdp::test_trainer 54ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_1_fsdp2::test_trainer 54ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_2_fsdp::test_trainer 56ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_3_fsdp2::test_trainer 66ms
tests/trainer/trainer_test.py::TestTrainerSFTWarmupGSM8K::test_trainer 100ms
tests/trainer/trainer_test.py::TestTrainerDPO::test_trainer 40ms
tests/trainer/trainer_test.py::TestTrainerSFT::test_trainer 36ms
tests/trainer/trainer_test.py::TestTrainerToolsSFT::test_trainer_tools 36ms
tests/trainer/trainer_test.py::TestFullyAsyncMode_0_fsdp::test_fully_async_mode 83ms
tests/trainer/trainer_test.py::TestFullyAsyncMode_1_fsdp::test_fully_async_mode 80ms
tests/trainer/trainer_test.py::TestFullyAsyncMode_2_megatron::test_fully_async_mode 165ms
tests/trainer/trainer_test.py::TestTrainerCheckpointSave_0_fsdp::test_trainer 101ms
tests/trainer/trainer_test.py::TestTrainerCheckpointSave_1_megatron::test_trainer 381ms
tests/trainer/trainer_test.py::TestTrainerMIX::test_trainer 1ms
tests/trainer/trainer_test.py::TestTrainerMultiModal::test_trainer ⏭️ 1ms

Github Test Reporter by CTRF 💚

@chenyushuo (Collaborator, Author)

/unittest-module-trainer

@chenyushuo changed the title from "[WIP] Refactor fsdp checkpoint save" to "Refactor fsdp checkpoint save" on Sep 26, 2025
@chenyushuo changed the title from "Refactor fsdp checkpoint save" to "Refactor checkpoint save" on Sep 26, 2025
@github-actions

Summary

Tests 📝: 18 · Passed ✅: 17 · Failed ❌: 0 · Skipped ⏭️: 1 · Other ❓: 0 · Flaky 🍂: 0 · Duration ⏱️: 1.9s

Skipped

Tests Status
tests/trainer/trainer_test.py::TestTrainerMultiModal::test_trainer skipped ⏭️

Tests

Test Name Status Flaky Duration
tests/trainer/trainer_test.py::TestTrainerCountdown_0_fsdp::test_trainer 152ms
tests/trainer/trainer_test.py::TestTrainerCountdown_1_megatron::test_trainer 324ms
tests/trainer/trainer_test.py::TestStepAheadAsyncRL::test_trainer 60ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_0_fsdp::test_trainer 54ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_1_fsdp2::test_trainer 50ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_2_fsdp::test_trainer 57ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_3_fsdp2::test_trainer 62ms
tests/trainer/trainer_test.py::TestTrainerSFTWarmupGSM8K::test_trainer 100ms
tests/trainer/trainer_test.py::TestTrainerDPO::test_trainer 39ms
tests/trainer/trainer_test.py::TestTrainerSFT::test_trainer 36ms
tests/trainer/trainer_test.py::TestTrainerToolsSFT::test_trainer_tools 36ms
tests/trainer/trainer_test.py::TestFullyAsyncMode_0_fsdp::test_fully_async_mode 79ms
tests/trainer/trainer_test.py::TestFullyAsyncMode_1_fsdp::test_fully_async_mode 80ms
tests/trainer/trainer_test.py::TestFullyAsyncMode_2_megatron::test_fully_async_mode 164ms
tests/trainer/trainer_test.py::TestTrainerCheckpointSave_0_fsdp::test_trainer 125ms
tests/trainer/trainer_test.py::TestTrainerCheckpointSave_1_megatron::test_trainer 427ms
tests/trainer/trainer_test.py::TestTrainerMIX::test_trainer 55ms
tests/trainer/trainer_test.py::TestTrainerMultiModal::test_trainer ⏭️ 1ms

Github Test Reporter by CTRF 💚

@pan-x-c (Collaborator) commented Sep 26, 2025

/unittest-module-manager

@github-actions

Summary

Tests 📝: 7 · Passed ✅: 7 · Failed ❌: 0 · Skipped ⏭️: 0 · Other ❓: 0 · Flaky 🍂: 0 · Duration ⏱️: 568ms

Tests

Test Name Status Flaky Duration
tests/manager/synchronizer_test.py::TestSynchronizerExit::test_synchronizer 32ms
tests/manager/synchronizer_test.py::TestStateDictBasedSynchronizer_0::test_synchronizer 74ms
tests/manager/synchronizer_test.py::TestStateDictBasedSynchronizer_1::test_synchronizer 79ms
tests/manager/synchronizer_test.py::TestStateDictBasedSynchronizer_2::test_synchronizer 115ms
tests/manager/synchronizer_test.py::TestStateDictBasedSynchronizer_3::test_synchronizer 121ms
tests/manager/synchronizer_test.py::TestNCCLBasedSynchronizer_0::test_synchronizer 70ms
tests/manager/synchronizer_test.py::TestNCCLBasedSynchronizer_1::test_synchronizer 71ms

Github Test Reporter by CTRF 💚

@chenyushuo (Collaborator, Author)

/unittest-module-trainer

@chenyushuo (Collaborator, Author)

/unittest-module-trainer

@github-actions

Summary

Tests 📝: 18 · Passed ✅: 17 · Failed ❌: 0 · Skipped ⏭️: 1 · Other ❓: 0 · Flaky 🍂: 0 · Duration ⏱️: 1.8s

Skipped

Tests Status
tests/trainer/trainer_test.py::TestTrainerMultiModal::test_trainer skipped ⏭️

Tests

Test Name Status Flaky Duration
tests/trainer/trainer_test.py::TestTrainerCountdown_0_fsdp::test_trainer 147ms
tests/trainer/trainer_test.py::TestTrainerCountdown_1_megatron::test_trainer 323ms
tests/trainer/trainer_test.py::TestStepAheadAsyncRL::test_trainer 63ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_0_fsdp::test_trainer 53ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_1_fsdp2::test_trainer 52ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_2_fsdp::test_trainer 56ms
tests/trainer/trainer_test.py::TestTrainerGSM8K_3_fsdp2::test_trainer 62ms
tests/trainer/trainer_test.py::TestTrainerSFTWarmupGSM8K::test_trainer 103ms
tests/trainer/trainer_test.py::TestTrainerDPO::test_trainer 39ms
tests/trainer/trainer_test.py::TestTrainerSFT::test_trainer 35ms
tests/trainer/trainer_test.py::TestTrainerToolsSFT::test_trainer_tools 36ms
tests/trainer/trainer_test.py::TestFullyAsyncMode_0_fsdp::test_fully_async_mode 84ms
tests/trainer/trainer_test.py::TestFullyAsyncMode_1_fsdp::test_fully_async_mode 80ms
tests/trainer/trainer_test.py::TestFullyAsyncMode_2_megatron::test_fully_async_mode 172ms
tests/trainer/trainer_test.py::TestTrainerCheckpointSave_0_fsdp::test_trainer 110ms
tests/trainer/trainer_test.py::TestTrainerCheckpointSave_1_megatron::test_trainer 342ms
tests/trainer/trainer_test.py::TestTrainerMIX::test_trainer 57ms
tests/trainer/trainer_test.py::TestTrainerMultiModal::test_trainer ⏭️ 1ms

Github Test Reporter by CTRF 💚

@pan-x-c merged commit 7e01668 into modelscope:main on Sep 26, 2025
1 check passed