Refactor checkpoint save #299
Summary of Changes
Hello @chenyushuo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request refactors the FSDP checkpoint saving mechanism to provide a clearer distinction between saving a model's state dictionary and saving a full checkpoint. It introduces modular, threaded helper methods for saving individual components (model, optimizer, extra states) and updates the public API in the worker and trainer modules to leverage this improved structure. This change aims to enhance the maintainability and clarity of the checkpointing system.
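For readers who have not opened the diff, the split described in the summary can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the PR's code: the class `ThreadedSaver`, the helper `_save_component`, the `wait` method, and the JSON file layout are all assumptions made for the example. Only the names `save_state_dict` and `save_checkpoint` come from the PR.

```python
import json
import tempfile
import threading
from pathlib import Path


class ThreadedSaver:
    """Hypothetical stand-in for a checkpoint manager; not the PR's class."""

    def __init__(self, root):
        self.root = Path(root)
        self._threads = []

    def _save_component(self, name, state):
        # Each component is written by its own thread so the caller
        # is not blocked on file I/O.
        def _write():
            path = self.root / f"{name}.json"
            path.write_text(json.dumps(state))

        thread = threading.Thread(target=_write, name=f"save-{name}")
        thread.start()
        self._threads.append(thread)

    def save_state_dict(self, model_state):
        # Model weights only.
        self._save_component("model", model_state)

    def save_checkpoint(self, model_state, optimizer_state, extra_state):
        # Full checkpoint: model, optimizer, and extra states.
        self.save_state_dict(model_state)
        self._save_component("optimizer", optimizer_state)
        self._save_component("extra_state", extra_state)

    def wait(self):
        # Join all outstanding writer threads before the files are read back.
        for thread in self._threads:
            thread.join()
        self._threads.clear()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        saver = ThreadedSaver(tmp)
        saver.save_checkpoint({"w": [1.0, 2.0]}, {"lr": 0.1}, {"step": 3})
        saver.wait()
        return sorted(p.name for p in Path(tmp).iterdir())
```

In this sketch `save_checkpoint` reuses `save_state_dict` for the model part, so the two public entry points share one code path for the weights.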
Code Review
This pull request refactors the FSDP checkpoint saving logic by separating save_state_dict from save_checkpoint, which is a good step towards improving modularity. The use of threading for I/O operations and the extraction of helper methods like _save_model are positive changes. However, the review identified a critical issue: a TypeError will occur at runtime due to a mismatch in method signatures after the refactoring. I have also included several medium-severity recommendations to enhance code clarity and maintainability by removing an unused parameter, updating an outdated docstring, and suggesting a refactoring to reduce code duplication in the new saving methods.
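The review does not quote the mismatched call, so the following is only a generic sketch of how this class of error can be caught before runtime. The function body and the keyword `model_state_dict_only` are hypothetical and chosen for illustration; `inspect.signature(...).bind(...)` is the standard-library call that raises `TypeError` when the arguments do not fit the signature.

```python
import inspect


def save_checkpoint(local_path, global_step):
    """Hypothetical refactored method that no longer accepts an extra flag."""
    return f"{local_path}@{global_step}"


def call_is_compatible(func, *args, **kwargs):
    # Signature.bind raises TypeError when the arguments do not match
    # the function's parameters, without actually calling the function.
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True
```

A check like `call_is_compatible(save_checkpoint, "/tmp/ckpt", 10, model_state_dict_only=True)` returns `False` here, which mirrors a caller that was not updated after a parameter was removed.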
/unittest-module-trainer
/unittest-module-trainer
/gemini review
Code Review
This pull request refactors the checkpoint saving mechanism, primarily by separating save_state_dict from save_checkpoint. This is a significant change that introduces a new CheckpointMonitor actor to manage asynchronous checkpointing and state dictionary saving, which is a good architectural improvement for decoupling components. The changes span across trainer, worker, and checkpoint manager implementations for both FSDP and Megatron strategies.
My review has identified a critical race condition in the new CheckpointMonitor actor that could lead to inconsistent state reporting. I've also found a few medium to high severity issues related to overly broad exception handling and incorrect log messages. Addressing these issues will improve the robustness and debuggability of the new checkpointing system.
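The review text does not say where the race occurs, so the sketch below shows only one common pattern for this kind of bug: several workers reporting completion to a shared monitor. It uses plain threads rather than an actor framework, and the class `CheckpointProgress` with its methods is hypothetical, not the PR's `CheckpointMonitor`. The point is that the counter update and the "latest complete step" update happen under a single lock.

```python
import threading


class CheckpointProgress:
    """Hypothetical monitor; the PR's CheckpointMonitor is not reproduced here."""

    def __init__(self, expected_workers):
        self._expected = expected_workers
        self._done = {}  # step -> number of workers that finished saving
        self._latest_step = -1
        self._lock = threading.Lock()

    def report_finished(self, step):
        # The read-modify-write on the counter and the update of the latest
        # step happen under one lock, so concurrent reports cannot interleave.
        with self._lock:
            self._done[step] = self._done.get(step, 0) + 1
            if self._done[step] == self._expected and step > self._latest_step:
                self._latest_step = step

    def latest_complete_step(self):
        with self._lock:
            return self._latest_step


def demo(num_workers=8, steps=(1, 2, 3)):
    monitor = CheckpointProgress(num_workers)
    threads = [
        threading.Thread(target=monitor.report_finished, args=(step,))
        for step in steps
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return monitor.latest_complete_step()
```

A step is reported as complete only after every expected worker has reported it, regardless of the order in which the reports arrive.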
Test report (Github Test Reporter by CTRF): Summary, Failed Tests, Skipped, Tests (tables not preserved)
/unittest-module-trainer
Test report (Github Test Reporter by CTRF): Summary, Skipped, Tests (tables not preserved)
/unittest-module-manager
Test report (Github Test Reporter by CTRF): Summary, Tests (tables not preserved)
/unittest-module-trainer
/unittest-module-trainer
Test report (Github Test Reporter by CTRF): Summary, Skipped, Tests (tables not preserved)
Description
Separate save_state_dict from save_checkpoint.
default_hdfs_dir.
Checklist
Please check the following items before code is ready to be reviewed.