-
Notifications
You must be signed in to change notification settings - Fork 47
Refactor Storage #227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Storage #227
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @pan-x-c, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request, marked as Work In Progress, focuses on a significant refactoring of the buffer storage related modules. The primary goal is to enhance the modularity, maintainability, and extensibility of the data storage and formatting components within the system. This involves a major reorganization of internal classes and configurations to provide a more generic and flexible approach to handling various data schemas and storage types.
Highlights
- Consolidated Data Formatters: The separate SFT and DPO formatter classes (e.g., SFTMessagesFormatter, SFTPlaintextFormatter) have been consolidated into unified SFTFormatter and DPOFormatter classes, which now handle different prompt types internally. A new FORMATTER registry has been introduced for better management.
- Reorganized Storage Modules: The ray_wrapper module has been removed, and its functionalities for SQL, File, and Queue storage have been moved into dedicated trinity/buffer/storages submodules, improving code organization and separation of concerns.
- Generic Schema Type for Storage: The algorithm_type configuration in StorageConfig has been replaced with a more generic schema_type, allowing storage buffers to be defined based on the data schema (e.g., "experience", "sft", "dpo") rather than being tied to a specific algorithm.
- Updated SQL Schema Definitions: SQLAlchemy models for data storage have been modernized to directly store serialized Experience objects and utilize a new SQL_SCHEMA registry for dynamic table creation, enhancing data integrity and flexibility.
- Refined Experience Data Model: The Experience class has been updated to use message lists (chosen_messages, rejected_messages) for DPO-specific data, providing a more structured representation of multi-turn conversations.
- Simplified Task Type Configuration: The TaskType enum for distinguishing between training and evaluation tasks has been replaced with a simpler boolean is_eval flag in StorageConfig.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a significant and well-structured refactoring of the buffer storage modules. The changes improve modularity by separating different storage implementations (File, Queue, SQL) into a dedicated storages directory and unifying data formatters. This greatly enhances code organization and clarity. While the overall direction is excellent, I've identified a few critical issues related to changes in data handling logic that could lead to unexpected behavior or bugs. Specifically, there are concerns about the removal of deepcopy for configuration objects, a fundamental change in the SQL buffer's reading strategy, and missing data in Experience objects created from DPO data. Addressing these points will help ensure the refactoring is robust and functionally correct.
|
/unittest-all |
Summary
Failed Tests
Tests
Github Test Reporter by CTRF 💚 |
|
/unittest-all |
Summary
Tests
Github Test Reporter by CTRF 💚 |
|
/unittest-all |
Summary
Tests
Github Test Reporter by CTRF 💚 |
|
/unittest-buffer |
Summary
Tests
Github Test Reporter by CTRF 💚 |
|
/unittest-all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the storage layer by introducing a dedicated buffer.storage module and separating the storage of Task and Experience data. The refactoring includes renaming storage classes, simplifying SQL schemas, and moving retry configuration parameters.
- Introduces
buffer.storagemodule withSQLStorage,QueueStorage, andFileStorageclasses - Separates Task storage (Explorer input) from Experience storage (Trainer input) with dedicated SQL schemas
- Moves retry configuration from
BufferConfigtoStorageConfigfor better organization
Reviewed Changes
Copilot reviewed 72 out of 76 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| trinity/manager/config_manager.py | Moves retry configuration fields to experience buffer storage config |
| trinity/common/workflows/workflow.py | Sets default value for Task.workflow field |
| trinity/common/experience.py | Reorders prompt_length field and updates DPO experience field names |
| trinity/common/constants.py | Removes TaskType enum and adds MLFLOW monitor type |
| trinity/common/config.py | Updates storage configuration structure and field names |
| trinity/buffer/storage/sql.py | New SQL storage implementation with separate Task and Experience storage classes |
| trinity/buffer/storage/queue.py | New queue storage implementation moved from ray_wrapper |
| trinity/buffer/storage/file.py | New file storage implementation moved from ray_wrapper |
| trinity/buffer/schema/sql_schema.py | Simplified SQL schema models for different data types |
| trinity/buffer/schema/formatter.py | Refactored formatters with unified structure for different data types |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
Summary
Tests
Github Test Reporter by CTRF 💚 |
Description
buffer.storagemodule, and rename the originalbuffer.DBWrapper,buffer.QueueWrapperandbuffer.FileWrapperintobuffer.storage.SQLStorage,buffer.storage.QueueStorageandbuffer.storage.FileStorage.TaskandExperiencedata.Taskstorage is dedicated to Explorer input, andExperiencestorage is dedicated to Trainer input.TaskandExperience. (Incompatible with the old version)FormatterofTaskandExperiencedata. NowFormatteris mainly responsible for converting JSON format data intoTaskorExperienceinstance.max_retry_intervalandmax_retry_timesfromBufferConfigintoStorageConfig. (Incompatible with the old version)Checklist
Please check the following items before code is ready to be reviewed.