-
Notifications
You must be signed in to change notification settings - Fork 47
Add taskset scheduler #326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Summary of ChangesHello @chenyushuo, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the system's data handling capabilities by introducing a sophisticated "TasksetScheduler". This new component allows for the management and sequential processing of multiple datasets (tasksets), each with independent progress tracking and configurable shuffling. The changes streamline the data pipeline, improve reproducibility through seed-based shuffling, and lay the groundwork for more complex training and evaluation workflows. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a TasksetScheduler to manage multiple tasksets and adds shuffling capabilities to the FileReader. The changes are a good step towards more flexible data handling. However, I've found several critical issues related to state management and incorrect method calls that could lead to runtime errors and data loss on restart. Specifically, the explorer state saving/loading mechanism is broken, and a method call for seeding the shuffle is incorrect. I've also identified some potential index-out-of-bounds errors and missing type hints. Please address these points to ensure the stability and correctness of the new features.
|
/unittest-all |
|
/unittest-all |
lingzhq
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions.
|
/unittest-all |
|
/unittest-all |
Summary
Failed Tests
Skipped
Tests
Github Test Reporter by CTRF 💚 |
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new TasksetScheduler to manage and sample from multiple datasets (tasksets) using various selection strategies. The changes are extensive, adding new components like data selectors, a pass rate calculator, and updating configuration handling to support multiple tasksets. The implementation is generally solid, but there are a few areas for improvement regarding code correctness, performance, and robustness. I've identified a potential AttributeError and a ZeroDivisionError, suggested performance improvements, and pointed out some code clarity issues.
|
/unittest-all |
1 similar comment
|
/unittest-all |
|
/unittest-all |
|
/unittest-all |
Summary
Skipped
Tests
Github Test Reporter by CTRF 💚 |
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a significant and well-designed feature: a taskset scheduler and a data selector system. This enhancement enables sophisticated data sampling strategies, such as multi-task and curriculum learning, by allowing intelligent, adaptive sampling from multiple datasets. The implementation is well-structured, separating the concerns of scheduling, selection, and difficulty estimation. The addition of comprehensive documentation and tests is also a major plus. My review focuses on a few areas where the implementation could be made more robust, particularly in handling configurations from multiple tasksets to avoid hardcoding assumptions.
|
/unittest-module-buffer |
Summary
Tests
Github Test Reporter by CTRF 💚 |
Description
TasksetSchedulerto enable multi-taskset scheduling.Selectorclass which allows users to determine how to selector task from single taskset.DifficultyBasedSelectorwhich sample tasks by difficulty level._HFBatchReaderandExperiencePipelineto support the functionality ofSelector.Checklist
Please check the following items before code is ready to be reviewed.