Support Multi-Modal LLM #234
Conversation
Summary of Changes
Hello @hiyuchang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the framework by integrating multi-modal capabilities into its Large Language Model (LLM) operations. The primary goal is to allow models to process and generate content that combines text with visual data, such as images and videos. This involves extending core data structures to accommodate multi-modal inputs, introducing new API methods for multi-modal chat and generation, and updating the training infrastructure to properly handle these new data types. The changes lay the groundwork for developing and training advanced multi-modal LLMs within the system.
Highlights
- Multi-Modal LLM Support: This PR introduces comprehensive support for multi-modal Large Language Models (LLMs) within the framework, enabling the processing and generation of responses that incorporate both text and visual data (images and videos).
- New Multi-Modal API Endpoints: New `chat_mm` and `generate_mm` methods have been added to the `ModelWrapper` and specifically implemented for `vllm_model`, allowing users to interact with multi-modal LLMs through chat and generation interfaces (see the first sketch after this list).
- Enhanced Data Structures for Multi-Modality: The core `Experience` data structure has been extended to store `multi_modal_data` and `multi_modal_inputs`, ensuring that multi-modal information is properly captured and propagated throughout the system, from data loading to training (see the second sketch after this list).
- Multi-Modal Workflow Integration: A new `SimpleMMWorkflow` has been introduced, demonstrating how multi-modal data can be integrated into existing task workflows, particularly for tasks involving both text and images.
- Trainer Updates for Multi-Modal Data: The training pipeline, specifically the `verl_trainer`, has been updated to correctly handle and process multi-modal inputs, including the dynamic calculation of `position_ids` for models like Qwen2VL (see the third sketch after this list).
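To make the new surface concrete, here is a rough usage sketch of the `chat_mm` and `generate_mm` entry points. Only the method names come from this PR; the message layout, keyword arguments, and the pre-built `model_wrapper` object are assumptions for illustration, not the framework's actual API.

```python
# Hypothetical usage of the new multi-modal methods. Only the names
# chat_mm / generate_mm come from the PR; the message schema, keyword
# arguments, and model_wrapper construction are illustrative guesses.
from PIL import Image

# model_wrapper is assumed to be an already-initialized ModelWrapper
# backed by vllm_model (construction omitted here).
image = Image.open("chart.png")  # any local image file

# Chat-style call: the user turn carries both text and the raw image,
# which the wrapper forwards to the underlying vLLM engine.
messages = [
    {"role": "user", "content": "Describe the chart in this image.", "images": [image]}
]
chat_response = model_wrapper.chat_mm(messages)

# Generation-style call: a plain prompt plus the accompanying image(s).
completion = model_wrapper.generate_mm(
    prompts=["Caption this image:"],
    images=[image],
)
```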
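In the same spirit, a minimal sketch of how the extended `Experience` record might carry visual data next to token ids. The field names `multi_modal_data` and `multi_modal_inputs` come from the PR; the constructor shape, the other fields, and the dictionary keys shown are assumptions.

```python
# Sketch of an Experience carrying visual data alongside token ids. Only
# multi_modal_data / multi_modal_inputs are named by this PR; every other
# field, key, and tensor shape below is an illustrative assumption.
import torch
from PIL import Image

image = Image.open("question_figure.png")

experience = Experience(  # the framework's rollout record, extended by this PR
    tokens=torch.tensor([101, 2054, 2003, 1996, 3437, 102]),
    prompt_length=6,
    # Raw media handed to the rollout engine (e.g. vLLM) at generation time.
    multi_modal_data={"image": [image]},
    # Preprocessed tensors (e.g. from a Hugging Face processor) that the
    # trainer later feeds into the model's forward pass.
    multi_modal_inputs={
        "pixel_values": torch.randn(1024, 1176),
        "image_grid_thw": torch.tensor([[1, 32, 32]]),
    },
)
```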
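Finally, a hedged sketch of the trainer-side branching for `position_ids`: Qwen2VL-style models need rotary position indices that are aware of the visual token grid, while text-only models keep ordinary cumulative positions. The helper below is hypothetical; only the fact that `verl_trainer` now computes `position_ids` dynamically for models like Qwen2VL comes from the PR.

```python
import torch

def build_position_ids(input_ids, attention_mask, model_type, image_grid_thw=None):
    """Hypothetical helper illustrating the branching described in the PR."""
    if model_type == "qwen2_vl" and image_grid_thw is not None:
        # In the actual trainer this branch would delegate to the model's
        # multi-modal rope-index routine (not reproduced here): Qwen2VL's
        # position ids span temporal / height / width axes rather than a
        # single flat sequence index.
        raise NotImplementedError("delegate to the model's rope-index helper")
    # Text-only fallback: positions count valid tokens under the mask.
    return torch.clamp(attention_mask.long().cumsum(dim=-1) - 1, min=0)
```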
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Code Review
This pull request adds foundational support for multi-modal models, including new methods for handling image and video data in experiences and workflows. The changes are extensive, touching upon configuration, data structures, model wrappers, and the training pipeline. My review focuses on improving code correctness, performance, and maintainability. Key feedback includes fixing a potential crash in experience creation, addressing performance bottlenecks in multi-modal data processing, and correcting logic in the new VLM and workflow implementations. Overall, this is a great step towards multi-modal capabilities, and the following suggestions aim to solidify this new functionality.
/unittest-module-common

/unittest-module-trainer

[CTRF test report: Summary, Failed Tests, Tests (GitHub Test Reporter by CTRF 💚)]
/unittest-module-common

[CTRF test report: Summary, Tests (GitHub Test Reporter by CTRF 💚)]
/unittest-module-trainer

[CTRF test report: Summary, Skipped, Tests (GitHub Test Reporter by CTRF 💚)]
Description
- `vllm_model.chat_mm` and `vllm_model.generate_mm`
- `min_pixels` and `max_pixels` for the processor (see the sketch after this list)
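The `min_pixels` and `max_pixels` knobs presumably bound how many visual tokens the processor produces per image. Below is a small sketch of how such limits are typically passed to a Hugging Face processor; the Qwen2-VL checkpoint name and the exact pixel budgets are illustrative choices, not values taken from this PR.

```python
# Sketch of bounding the vision processor's resolution with min_pixels /
# max_pixels; the checkpoint name and budgets are illustrative, not from the PR.
from transformers import AutoProcessor

min_pixels = 256 * 28 * 28    # floor on the per-image pixel budget
max_pixels = 1280 * 28 * 28   # ceiling, keeps large images from blowing up memory

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```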
Checklist
Please check the following items before the code is ready to be reviewed.