
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning


Shengbang Tong1,2,∗,†, David Fan1, Jiachen Zhu1,2,∗, Yunyang Xiong3, Xinlei Chen1, Koustuv Sinha1,
Michael Rabbat1, Yann LeCun1,2, Saining Xie2, Zhuang Liu1,†
1 FAIR, Meta, 2 New York University, 3 Meta Reality Labs
∗ Work done at Meta, † Corresponding authors

arXiv:2412.14164v1 [[Link]] 18 Dec 2024

In this work, we propose Visual-Predictive Instruction Tuning (VPiT), a simple and effective extension
to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified
autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to
predict discrete text tokens and continuous visual tokens from any input sequence of image and text
data curated in an instruction-following format. Our empirical investigation reveals several intriguing
properties of VPiT: (1) visual generation ability emerges as a natural byproduct of improved visual
understanding, and can be unlocked efficiently with a small amount of generation data; (2) while
we find understanding and generation to be mutually beneficial, understanding data contributes to
both capabilities more effectively than generation data. Building upon these findings, we train our
MetaMorph model and achieve competitive performance on both visual understanding and generation.
In visual generation, MetaMorph can leverage the world knowledge and reasoning abilities gained from
LLM pretraining, and overcome common failure modes exhibited by other generation models. Our
results suggest that LLMs may have strong “prior” vision capabilities that can be efficiently adapted
to both visual understanding and generation with a relatively simple instruction tuning process.

Date: December 19, 2024


Correspondence: st5087@[Link], zhuangl@[Link]
Project Page: [Link]/metamorph

1 Introduction
Multimodal Large Language Models (MLLMs) have advanced considerably in visual understanding, progressing
from basic image captioning to complex visual inferences (Alayrac et al., 2022; Liu et al., 2023; Dai et al.,
2024). These models process multimodal inputs—primarily images and language—and generate text tokens.
Multimodal LLMs often leverage a pretrained vision encoder (Dosovitskiy et al., 2021; Radford et al., 2021),
a pretrained language model (Touvron et al., 2023; AI@Meta, 2024), and align these modalities through
connectors such as MLP (Liu et al., 2023, 2024a) or cross-attention modules (Alayrac et al., 2022; Dai et al.,
2024). Among MLLM training methods, visual instruction tuning (Liu et al., 2023) has become widely
used (Wang et al., 2024a; Agrawal et al., 2024). It treats output embeddings of pretrained vision encoders as
continuous-valued “visual tokens” and directly feeds them as inputs to pretrained LLMs.
One benefit of visual instruction tuning is that it is data and compute efficient. A pretrained LLM can be
repurposed as a Multimodal LLM by instruction tuning with modest compute and data on the order of millions
of image-text question-answer pairs (Tong et al., 2024a; Li et al., 2024a). The effectiveness of visual instruction
tuning indicates that LLMs already possess a considerable amount of inherent visual knowledge which allows
them to efficiently learn and develop visual understanding during the instruction tuning process (Zhou et al.,
2024a). Inspired by this, we investigate whether LLMs can also be finetuned to generate visual information
with comparable efficiency and effectiveness.
Current attempts toward “unified” models—models capable of both multimodal understanding and generation—
often treat visual generation as an orthogonal capability to visual understanding. They tend to require
substantial changes to the original MLLM architecture and significant multimodal pretraining and/or finetuning.
Designing such methods is challenging, and past research takes different approaches including tokenizing

[Figure 1 image omitted. Panel titles: VPiT, Inference, Examples. Diagram labels: Vision Encoder, Adapter, Autoregressive Model, Projector, Text Head, Vision Head, Adapted Diffusion Model. Example conversation: User: “Generate an image of the animal resulting from a monarch caterpillar's metamorphosis”. Model: “Here’s the generated image based on your request: <image_start><image_end>”. User: “What’s the animal in this image?” Model: “The animal in the image is a butterfly.”]

Figure 1 VPiT Training, Inference, and Examples of MetaMorph. Left: In Visual-Predictive Instruction Tuning (VPiT),
we finetune a pretrained LLM to generate both text and visual tokens using separate text and vision heads. Middle:
During inference, the model accepts an arbitrary input sequence of image(s) and text and outputs discrete text tokens
and continuous visual tokens. These visual tokens can be visualized via a separately finetuned diffusion model, which
is trained to condition on the pretrained vision encoder’s output. Right: An example conversation from MetaMorph
trained with VPiT. Here, the model implicitly solves a visual puzzle in order to generate the visual tokens of a butterfly.
The conversation continues with new user questions as the model continues to autoregressively process vision and text
tokens, independent of the diffusion-based visualization.

visual inputs into discrete tokens (Wu et al., 2024b; Team, 2024; Liu et al., 2024c), incorporating diffusion
objectives (Xie et al., 2024; Zhou et al., 2024b), and decoupling vision into separate understanding and
generation modes (Wu et al., 2024a). For example, approaches like LWM (Liu et al., 2024c), Show-o (Xie
et al., 2024), and Chameleon (Team, 2024) require billions of image-text pairs (Schuhmann et al., 2022; Gadre
et al., 2024) for extensive pretraining and finetuning.
In this work, we propose Visual-Predictive Instruction Tuning (VPiT)—a simple extension to visual instruction
tuning which builds upon the existing paradigm of passing continuous visual tokens as input to the LLM. VPiT
trains an LLM to output both continuous visual tokens and discrete text tokens in the finetuning stage. The
model takes pretrained vision encoder embeddings as well as text tokens as input, and outputs a combination
of text tokens and continuous visual tokens. To visualize the generated visual tokens, we finetune a diffusion
model to map the embeddings back into pixel space (see Figure 1 for an example). This framework allows us
to study the synergy between visual understanding, visual generation, and pretrained LLMs, which leads to
several intriguing findings outlined below.
First, we show that the ability to predict visual tokens emerges from understanding visual inputs and requires
minimal additional training. Similar to visual instruction tuning, VPiT efficiently and effectively morphs an
LLM into a “unified” model that understands and generates multimodal tokens. When trained jointly with
sufficient visual understanding data, this process requires as few as 200k additional visual generation samples.
We further establish that the abilities to understand and generate visual tokens are intrinsically linked and
asymmetrical. Specifically, increasing understanding data improves visual understanding (measured by higher
VQA scores) and generation performance (measured by lower FID scores). Conversely, increasing generation
data enhances generation quality and also contributes to stronger visual understanding—but to a lesser degree.
Importantly, our findings highlight an asymmetry in how training each ability impacts the model’s overall
vision performance: understanding-centric training substantially outperforms generation-centric training in
improving both visual understanding and generation.
Building upon these findings, we train a unified model called MetaMorph to predict multimodal tokens with
VPiT. We leverage diverse data sources ranging from common visual question answering datasets to pure
image and video data without text annotations. MetaMorph achieves competitive performance on both visual
understanding and visual generation benchmarks. Furthermore, we show this unified modeling approach
allows models to leverage the power of LLMs. For instance, MetaMorph can extract knowledge from the
pretrained LLM when generating visual tokens. More surprisingly, we observe that MetaMorph can implicitly
perform reasoning steps before generating visual tokens—e.g. when prompted with “the animal resulting from
a monarch caterpillar’s metamorphosis”, MetaMorph successfully generates an image of a butterfly (Figure 1).
Our results suggest that 1) training a unified model with instruction tuning is feasible, and 2) LLMs have
strong pre-existing visual capabilities which can be activated using significantly fewer samples compared

to extensive pretraining. These insights shed light on the development of mixed-modality models. As the
community continues to improve visual understanding in Multimodal LLMs (Tong et al., 2024a; Wang et al.,
2024a; Li et al., 2024a) by advancing base LLMs, instruction tuning techniques, and data, we highlight that
these efforts may also implicitly lead to models that are better at visual generation.

2 Visual-Predictive Instruction Tuning


Visual instruction tuning as introduced by LLaVA (Liu et al., 2023) demonstrates that LLMs can be taught
to understand visual inputs. This is achieved by finetuning on million-scale data. The success of late-fusion
instruction tuning suggests that LLMs may already possess innate visual understanding ability. This ability
simply needs to be unlocked through lightweight finetuning. Analogously, we hypothesize that LLMs already
possess a degree of innate visual generation ability which just needs to be unlocked with lightweight finetuning.
Motivated by this, we present Visual-Predictive Instruction Tuning (VPiT, Figure 1)—a simple design which
extends existing instruction tuning methods to additionally generate visual tokens rather than text alone.
We use the same architecture and next-token prediction paradigm to unlock visual generation capabilities
without bells and whistles. We take a pretrained LLM and finetune it to predict both discrete text tokens
and continuous visual tokens. The visual tokens can be visualized with an adapted diffusion model.

2.1 From Unimodal to Multimodal Next-Token Prediction


The standard instruction tuning setup consists of an input sequence of conversation rounds (Wei et al., 2022a;
Taori et al., 2023): $(P_i, R_i)_{i=1}^{N}$, where $P_i$ and $R_i$ represent the prompt and response for the $i$-th round of
conversation, respectively. The model is trained to generate responses based on the prompt. VPiT adds the
following mechanisms to a standard instruction tuning setup to unlock visual understanding and generation.

Tokenizing multimodal data. We extend Pi and Ri to include both text and images. To integrate visual data
into a pretrained LLM, we process data closely following visual instruction tuning (Liu et al., 2023):
• Text Data: Text is tokenized into discrete tokens with a standard tokenizer used by the LLM.
• Visual Data: Images are encoded with a pretrained vision encoder such as SigLIP (Zhai et al., 2023).
The output is continuous visual tokens which are then interpolated to m = 64 tokens. To pass the visual
tokens as input to the LLM, we apply a trainable projection layer to align the dimensions with the LLM.
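The visual-token path described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text only says the encoder output is "interpolated" to m = 64 tokens, so the adaptive average pooling used here, the 27 x 27 input grid, the toy embedding sizes, and the random projection weights are all assumptions made for the example.

```python
import random

def adaptive_avg_pool_tokens(tokens, in_side, out_side):
    """Average-pool an in_side x in_side grid of token vectors (row-major)
    down to an out_side x out_side grid. Assumed stand-in for 'interpolation'."""
    dim = len(tokens[0])
    pooled = []
    for oy in range(out_side):
        y0 = (oy * in_side) // out_side                 # floor of bin start
        y1 = -((-(oy + 1) * in_side) // out_side)       # ceil of bin end
        for ox in range(out_side):
            x0 = (ox * in_side) // out_side
            x1 = -((-(ox + 1) * in_side) // out_side)
            cell = [tokens[y * in_side + x]
                    for y in range(y0, y1) for x in range(x0, x1)]
            pooled.append([sum(v[k] for v in cell) / len(cell) for k in range(dim)])
    return pooled

def project(tokens, weight, bias):
    """Linear projection; weight has shape (out_dim, in_dim)."""
    return [[sum(w * x for w, x in zip(row, tok)) + b
             for row, b in zip(weight, bias)] for tok in tokens]

random.seed(0)
IN_SIDE, OUT_SIDE = 27, 8          # hypothetical grid sizes: 729 tokens -> 64 tokens
ENC_DIM, LLM_DIM = 4, 6            # toy dimensions, far smaller than real models

encoder_tokens = [[random.gauss(0, 1) for _ in range(ENC_DIM)]
                  for _ in range(IN_SIDE * IN_SIDE)]
pooled = adaptive_avg_pool_tokens(encoder_tokens, IN_SIDE, OUT_SIDE)

# Trainable projection layer (randomly initialised here) aligning with the LLM width.
weight = [[random.gauss(0, 0.02) for _ in range(ENC_DIM)] for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM
llm_inputs = project(pooled, weight, bias)
```

After this step, `llm_inputs` holds 64 vectors of the LLM's width that can be placed in the input sequence alongside the embedded text tokens.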

Model architecture. We take a pretrained LLM and finetune it to process arbitrary sequences of text and
visual tokens (detailed next in Section 2.2). We keep the original LLM head for text prediction, and attach a
separate vision head to the LLM for predicting visual tokens, i.e., the output tokens generated by the vision
encoder when processing images. The vision head is a projection layer that projects from the LLM’s dimension
to the vision encoder’s dimension. All response tokens can then be trained and predicted autoregressively,
with prompt tokens as context.
Unlike conventional visual instruction tuning, in VPiT, visual tokens are also outputs of the LLM—not just
inputs. To make the LLM aware of the presence of visual tokens, we introduce special tokens ⟨image_start⟩
and ⟨image_end⟩ to indicate the boundaries of visual token sequences and when to use the vision head.

Loss functions. The language head outputs a probability distribution over the vocabulary and is trained with
cross-entropy loss for next-token prediction. Visual prediction uses cosine similarity loss between the LLM’s
predicted visual tokens and those from the vision encoder. Consistent with instruction tuning practices, the
model only makes predictions and incurs loss on response tokens.
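A small sketch of the two loss terms may help make this concrete. It assumes the cosine similarity loss takes the form 1 - cos(prediction, target) and that the text and vision terms are averaged separately and then summed with equal weight; the source states neither detail, and the function and field names are hypothetical.

```python
import math

def cross_entropy(logits, target_id):
    """Next-token cross-entropy for one position (numerically stable log-softmax)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_id]

def cosine_loss(pred, target):
    """Assumed form of the visual loss: 1 - cosine similarity."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / norm

def vpit_loss(positions):
    """positions: dicts with keys 'kind' ('text' or 'vision'), 'is_response',
    'pred' (logits or predicted embedding) and 'target' (token id or encoder embedding).
    Only response positions contribute, as in standard instruction tuning."""
    text_losses, vision_losses = [], []
    for pos in positions:
        if not pos["is_response"]:
            continue  # prompt tokens serve as context only
        if pos["kind"] == "text":
            text_losses.append(cross_entropy(pos["pred"], pos["target"]))
        else:
            vision_losses.append(cosine_loss(pos["pred"], pos["target"]))

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    # Equal weighting of the two terms is an assumption of this sketch.
    return mean(text_losses) + mean(vision_losses)

toy_sequence = [
    {"kind": "text", "is_response": False, "pred": [5.0, 0.0, 0.0, 0.0], "target": 3},
    {"kind": "text", "is_response": True, "pred": [0.0, 0.0, 0.0, 0.0], "target": 1},
    {"kind": "vision", "is_response": True, "pred": [1.0, 0.0], "target": [2.0, 0.0]},
]
loss = vpit_loss(toy_sequence)
```

In the toy sequence the prompt position is skipped, the uniform text prediction contributes log 4, and the perfectly aligned visual prediction contributes zero.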

2.2 Using Broad Types of Data


Because VPiT enables the model to predict both text and visual tokens in its responses, it allows the use of
a broader range of training data. Traditional visual instruction tuning, on the other hand, primarily relies
on question-and-answer pairs. The majority of our dataset is publicly available, and we categorize it into
three major categories below. This categorization enables us to systematically study the model, as detailed in

Section 3 and Section 4. All data types are formatted as instruction tuning style prompt & response pairs.
See further details in Appendix C.2.
1. Visual Understanding Data: This category includes data that takes image(s) or video as input and outputs
text responses. See Figure 1 for an example. We use:
• ImageQA: Cambrian-7M (Tong et al., 2024a). The model answers questions based on input image(s).
Pi ∈ {⟨visual tokens⟩, ⟨text prompt⟩}
Ri ∈ {⟨text response⟩}

• VideoQA: VideoStar (Zohar et al., 2024) and ShareVideo (Zhang et al., 2024). The model answers
questions based on the input video. For videos in VideoQA, we process frames at 1 FPS.
Pi ∈ {⟨visual tokens⟩, · · · , ⟨visual tokens⟩, ⟨text prompt⟩}
Ri ∈ {⟨text response⟩}
2. Visual Generation Data: MetaCLIP (Xu et al., 2024). The model predicts visual tokens based on an image
description. We use at most 5 million pairs and curate the data into question-answering formats.
Pi ∈ {⟨text prompt⟩}
Ri ∈ {⟨text response⟩, ⟨visual tokens⟩}

We prompt the model to generate visual tokens with instructions like “Generate an image of...”. The
text responses are “Here is an image based on your request...”. See Figure 1 for an example.
3. Other Visual Data: This category includes data that requires the model to predict visual tokens given
interleaved input visual tokens and text tokens. We use:
• Video Data: SomethingSomethingV2 (Goyal et al., 2017b) and HowTo100M (Miech et al., 2019).
The model predicts frames in a sequential order. We design different question-answer pairs to
probe into the video, such as asking about future frames, past frames, and reordering frames.
Pi ∈ {⟨visual tokens⟩, · · · , ⟨visual tokens⟩, ⟨text prompt⟩}
Ri ∈ {⟨visual tokens⟩, · · · , ⟨visual tokens⟩}

• Visual Thinking Data: Visualization-of-Thought (Shao et al., 2024) and VStar (Wu and Xie, 2024).
The model predicts multimodal tokens in its response before addressing problems. For instance, it
predicts a zoomed-in view of an image before generating textual responses.
Pi ∈ {⟨visual tokens⟩, ⟨text prompt⟩}
Ri ∈ {⟨text response⟩, ⟨visual tokens⟩, ⟨text response⟩}
In the response, the model will output “I will think about it visually”, followed by visual tokens
representing a zoomed-in segment of the image, and then proceed to answer the question.
• Image-to-Image Data: InstructPix2Pix (Brooks et al., 2023) and Aurora (Krojer et al., 2024). The
model generates a transformed image conditioned on a text description and an input image.
Pi ∈ {⟨visual tokens⟩, ⟨text prompt⟩}
Ri ∈ {⟨visual tokens⟩}
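The prompt and response layouts listed above can be summarized in one small data structure. The sketch below is illustrative only: the `Round` class, the segment encoding, and the placeholder payloads such as `frame_0` are hypothetical and are not taken from the released data pipeline. The quoted strings are the examples given in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A segment is (modality, payload); modality is "text" or "visual".
Segment = Tuple[str, str]

@dataclass
class Round:
    prompt: List[Segment]
    response: List[Segment]

def modalities(segments: List[Segment]) -> List[str]:
    """Return the modality sequence, e.g. ['text', 'visual']."""
    return [kind for kind, _ in segments]

def needs_vision_head(round_: Round) -> bool:
    """True when the response contains visual tokens to be predicted."""
    return "visual" in modalities(round_.response)

EXAMPLES = {
    "image_qa": Round(
        prompt=[("visual", "image_0"), ("text", "What's the animal in this image?")],
        response=[("text", "The animal in the image is a butterfly.")],
    ),
    "video_qa": Round(
        prompt=[("visual", "frame_0"), ("visual", "frame_1"), ("text", "<question>")],
        response=[("text", "<answer>")],
    ),
    "visual_generation": Round(
        prompt=[("text", "Generate an image of ...")],
        response=[("text", "Here is an image based on your request..."),
                  ("visual", "image_0")],
    ),
    "video_data": Round(
        prompt=[("visual", "frame_0"), ("visual", "frame_1"), ("text", "<question>")],
        response=[("visual", "frame_2"), ("visual", "frame_3")],
    ),
    "visual_thinking": Round(
        prompt=[("visual", "image_0"), ("text", "<question>")],
        response=[("text", "I will think about it visually"),
                  ("visual", "zoomed_in_crop"), ("text", "<answer>")],
    ),
    "image_to_image": Round(
        prompt=[("visual", "image_0"), ("text", "<edit instruction>")],
        response=[("visual", "edited_image")],
    ),
}

generation_like = sorted(k for k, r in EXAMPLES.items() if needs_vision_head(r))
```

Only the two understanding categories have text-only responses; every other category supervises the vision head on at least one response segment.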

2.3 Mapping Tokens to Images through Diffusion


Because models trained with VPiT learn to predict continuous visual tokens, we need to map the predicted
tokens back into pixel space. We leverage the concept of a “Diffusion Autoencoder” (Bordes et al., 2022;
Preechakul et al., 2022; Pan et al., 2024b; Koh et al., 2024; Li et al., 2024c) in which the diffusion model
can be adapted to condition on image embeddings rather than text embeddings. Specifically, we finetune an
existing diffusion model to condition on outputs from the vision encoder using held-out training data.
At inference time, if the tag token ⟨image_start⟩ is generated, the model begins outputting visual tokens until
⟨image_end⟩. We then plug the generated visual tokens into the diffusion model to visualize the prediction in
pixel space. We use standard latent diffusion model training procedures. Details on the hyperparameters and
training setup are provided in Appendix A.2.
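The inference procedure can be sketched as a decoding loop that switches heads on the special tokens. Everything here is a toy stand-in: `ToyModel` replays a scripted answer instead of running an LLM, and the loop assumes each image spans a fixed 64 visual tokens (the m = 64 from Section 2.1) before ⟨image_end⟩ is appended, which the text does not state explicitly.

```python
IMAGE_START, IMAGE_END, EOS = "<image_start>", "<image_end>", "<eos>"
NUM_VISUAL_TOKENS = 64  # assumed fixed image length

class ToyModel:
    """Hypothetical stand-in for a VPiT-tuned LLM with a text head and a vision head."""

    def __init__(self, script):
        self.script = list(script)

    def text_head(self, context):
        # A real model would sample from the vocabulary distribution.
        return self.script.pop(0)

    def vision_head(self, context):
        # A real model would return a continuous embedding in the encoder's space.
        return [0.0, 1.0]

def generate(model, prompt, max_steps=256):
    """Autoregressively produce text tokens and continuous visual tokens."""
    context = list(prompt)
    text_out, images = [], []
    for _ in range(max_steps):
        token = model.text_head(context)
        context.append(token)
        if token == EOS:
            break
        text_out.append(token)
        if token == IMAGE_START:
            # Switch to the vision head until the image is complete.
            visual = []
            for _ in range(NUM_VISUAL_TOKENS):
                embedding = model.vision_head(context)
                visual.append(embedding)
                context.append(embedding)
            context.append(IMAGE_END)
            text_out.append(IMAGE_END)
            images.append(visual)
    return text_out, images

model = ToyModel(["Here", "is", "the", "image", ":", IMAGE_START, EOS])
text_out, images = generate(model, prompt=["Generate", "an", "image", "of", "..."])
# Each entry of `images` would then condition the finetuned diffusion model
# to render pixels; that step is outside this sketch.
```

Because the visual tokens stay in the context, the conversation can continue after an image is produced without depending on the diffusion-based visualization.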

3 Findings on Unlocking Visual Generation
We study the following questions about the effects and synergy of visual understanding and generation, under
our VPiT framework:
§3.1 Can visual generation be unlocked through lightweight tuning, or does it require extensive data?
§3.2 Are visual understanding and generation mutually beneficial or orthogonal?
§3.3 How much does more visual understanding or generation data contribute to understanding and generation
quality?
§3.4 Which visual understanding tasks correlate the most with generation performance?

Evaluation settings. We use 9 ImageQA benchmarks (MMBench, Seed, VStar, MMVP, MMMU, ChartQA,
TextVQA, ScienceQA, RealWorldQA) to evaluate different aspects of the model. For image generation, we
use the finetuned diffusion model to visualize generated visual tokens and measure FID score (lower is better)
and CLIP score (higher is better) on the COCO-30K dataset. Unless otherwise specified, we use LLaMA-3
8B (AI@Meta, 2024) / SigLIP ViT-SO400M-14@384 (Zhai et al., 2023) as the pretrained LLM / vision
encoder. We also study the effect of different LLMs in Section 3.2. We use instruction tuned versions of
the LLMs. We pretrain the adapter between the vision encoder and the LLM following visual instruction
tuning (Liu et al., 2023, 2024a). For experiments in this section, we provide training details in Appendix A
and include the full results in Appendix B.
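For readers unfamiliar with the two generation metrics, their usual definitions are recalled here. The symbols are not from the source: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of Inception features of real and generated images, and $E_{\text{img}}$, $E_{\text{txt}}$ denote CLIP's image and text encoders. The scaling and clipping applied to the CLIP score vary between implementations, so it is written only up to proportionality.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right),
\qquad
\mathrm{CLIPScore}(I, c) \propto \cos\!\big( E_{\text{img}}(I),\; E_{\text{txt}}(c) \big)
```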

3.1 Visual Generation Can Be Unlocked Efficiently by Joint Training with Visual Understanding
We start by investigating the number of image-text samples required to teach a language model to generate
high-quality visual tokens. To this end, we randomly sample {1k, 5k, 10k, 50k, 200k, 1M, 3M, 5M} image-text
pairs from our generation data (MetaCLIP dataset (Xu et al., 2024)). We explore two settings: (1) finetuning
the LLM using only visual generation data, and (2) jointly training visual generation with visual understanding
and the remaining data types described in Section 2.2.
In Figure 2, we see that training solely on visual generation performs significantly worse than joint training
with all other data. Even with over 3 million image-text pairs, the model struggles to generate high-quality visual
images (∼ 40 FID score), and performance remains inferior to joint training with 5 million pairs. This suggests
that training solely on visual generation data is significantly less sample efficient. This finding aligns with a
prior study (Zhang et al., 2023) which also suggests that LLMs cannot be easily tuned to generate visual
tokens when trained with only generation data. In contrast, joint training with other datasets substantially
improves generation performance. The model generates effective visual tokens with just 5k generation data,
and performance stabilizes around 200k samples. This indicates that visual generation is not an orthogonal
capability but rather an ability that benefits from other tasks and emerges more effectively with joint training.


[Figure 2 plot omitted. x-axis: Number of Generation Data (Log Scale); y-axis: FID Score. Legend: Generation Data Alone; Joint Training with All Other Data.]

Figure 2 Generation-only training vs. Joint training with other data. Training solely on generation data results in inferior performance. Joint training with additional data enables visual generation with only 5k generation data and yields high-quality outputs with 200k generation data.

[Figure 3 plot omitted. Settings compared: Generation Only; Generation + Image-to-Image; Generation + Visual Thinking; Generation + Pure Video; Generation + VideoQA; Generation + ImageQA; All data. Metrics: FID Score and CLIP Score.]

Figure 3 Impact of different data types on visual generation. The baseline of training on only visual generation data is red; Joint training with other data is yellow; Joint training with visual understanding data is green; and all data is blue. Joint training with additional data improves the baseline, with visual understanding tasks contributing the most to enhancing visual generation.

[Figure 4 plots omitted. Panels: FID Score and CLIP Score against VQA Accuracy, for 200k generation data jointly trained with varying amounts of VQA data.]

Figure 4 VQA Performance vs. Generation Performance with generation data controlled at 200k. Increasing understanding data improves VQA and generation performance.

[Figure 5 plots omitted. Panels: VQA Accuracy against FID Score and CLIP Score, for 1M VQA data jointly trained with varying amounts of generation data.]

Figure 5 Generation Performance vs. VQA Performance with VQA data controlled at 1M. Increasing generation data improves generation and VQA performance.

To better understand how each type of data contributes to visual generation, we conduct a controlled
experiment using 200k visual generation data, joint training individually with each data type defined in
Section 2.2. We also compare them with training all the data together. We show results in Figure 3. While
all data types enhance the model’s visual generation, the degree of improvement varies. Visual understanding
data, such as ImageQA and VideoQA, significantly boost the model’s visual generation capabilities, even
when the amount of generation data is kept constant at 200k. This indicates a strong link between the ability
to understand visual content and generate visual tokens. Additionally, combining all data types in training
further improves performance, suggesting that the benefits from different data types can be additive.

Finding 1: The ability to generate visual tokens can be unlocked with significantly less generation data
when the model is jointly trained with visual understanding data, in contrast to training only on
generation data.

3.2 Visual Understanding and Generation are Mutually Beneficial

More understanding data leads to better understanding and generation. Building upon findings from the previous
subsection, we perform a controlled experiment to investigate how visual understanding ability correlates with
visual generation ability. We ablate our model using a fixed set of 200k generation data while varying VQA
data from 1M to 7M samples from Cambrian-7M to develop different levels of visual understanding. The
results presented in Figure 4 indicate that stronger VQA ability correlates with better generation performance.

More generation data leads to better understanding and generation. Here, we investigate the reverse direction:
does enhancing the model’s visual generation capability also relate to higher VQA performance? To explore
this, we conduct a controlled experiment using 1M fixed VQA samples as the baseline for understanding. We
then vary the amount of generation data ({200k, 500k, 1M, 2M, 3M, 4M}) to adjust generation capacity while
joint training with the fixed 1M VQA data. We present results in Figure 5. Within the 1M VQA setting,
stronger generation ability is correlated with improved VQA performance. This implies that increasing the
amount of generation data not only enhances generation but also positively impacts VQA performance.

This synergy scales across different LLMs. We examine whether the findings transfer across various LLM
backbones. Using a data composition of 7M VQA samples and 1M generation data, we train VPiT on
LLaMA-3 8B, LLaMA-3.1 8B, and LLaMA-3 70B. Figure 6 shows the scaling behavior across different LLMs.

Finding 2: Visual understanding and generation are synergistic. Increasing data for either capability
enhances both simultaneously.

[Figure 6 plots omitted. Panels: FID Score and CLIP Score against VQA Accuracy for each LLaMA backbone.]

Figure 6 Comparison between different language backbones. We jointly train 7M VQA and 1M Generation data on different language backbones (LLaMA-3 8B, LLaMA-3.1 8B, LLaMA-3 70B). We observe that the synergy between understanding and generation transfers across LLMs.

[Figure 7 heatmaps omitted. Panels: Average VQA Score, FID Score, CLIP Score. x-axis: VQA Data; y-axis: Generation Data.]

Figure 7 Heatmap visualization of Average VQA Score, FID Score, and CLIP Score across varying amounts of VQA data and generation data. Darker colors indicate better performance. Increasing VQA data is more effective for improving both understanding and generation capabilities.

3.3 Understanding Data Contributes More


We investigate whether understanding and generation data contribute equally. Here, we jointly train different
scales of VQA data {1M, 4M, 7M} and generation data {200k, 500k, 1M, 2M, 3M, 4M}. Figure 7 summarizes
these findings, with the x-axis representing VQA data, and the y-axis representing generation data. Results
are visualized on heatmaps using darker colors for better performance.
The results indicate that increasing VQA data yields the most significant improvements in all three metrics.
When VQA data is relatively low (1M), increases in generation data lead to noticeable improvements, as
reflected by the gradual darkening in the plot. However, as the VQA data scales up (from 1M to 4M to 7M),
the impact of VQA data becomes more pronounced, demonstrated by a sharp color transition in the heatmap.
Ultimately, with 7M VQA data, increases in generation data contribute minimally. These results demonstrate
the critical role of understanding data in enhancing both understanding and generation performance.

Finding 3: While increasing data improves performance overall, the impact of visual understanding
data is significantly higher than the impact of visual generation data.

3.4 Certain Understanding Tasks Correlate More with Generation Performance


Given the diverse nature of understanding tasks such as OCR, Vision-Centric tasks, and Knowledge-based
tasks, we investigate which tasks most strongly correlate with generation ability. Inspired by Cambrian-1, we
categorize VQA tasks into five groups: General, Text&Chart, High-Resolution, Knowledge, and Vision-Centric
VQA. Using the results from our earlier experiments, which jointly train various VQA data scales with different
amounts of generation data, we plot each benchmark’s VQA performance against generation performance in
Figure 8. We also calculate the Pearson correlation (ρ) between VQA scores and FID/CLIP Scores.
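For reference, the Pearson correlation used in this analysis can be computed as below. The input numbers are synthetic placeholders chosen to make the arithmetic obvious; they are not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Synthetic, illustrative values only: one point per training configuration.
vqa_scores = [1.0, 2.0, 3.0, 4.0]
fid_scores = [8.0, 6.0, 4.0, 2.0]    # lower FID is better
clip_scores = [2.0, 4.0, 6.0, 8.0]   # higher CLIP score is better

rho_fid = pearson(vqa_scores, fid_scores)
rho_clip = pearson(vqa_scores, clip_scores)
```

Since lower FID is better, a strongly negative correlation between VQA score and FID carries the same meaning as a strongly positive correlation between VQA score and CLIP score: better understanding goes with better generation.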

[Figure 8 plots omitted. One column of subplots per benchmark: SEED, MMBench, RealWorldQA, MMVP, ChartQA, TextVQA, VStar, ScienceQA, MMMU. Rows: CLIP Score and FID Score.]

Figure 8 Correlation analysis between generation and various understanding benchmarks. Results are collected by joint
training different amounts of VQA data combined with varying quantities of generation data. Each subplot shows the
correlation (ρ) with a fitted regression line. Stars represent data points. We analyze General VQA, Vision-Centric
VQA, Text&Chart VQA, High-Resolution VQA, and Knowledge VQA. For most tasks, generation performance and
VQA performance are strongly correlated: higher VQA performance indicates better generation and vice versa. Only
knowledge-intensive and high-resolution VQA tasks exhibit weaker correlations with generation performance.

Figure 8 shows that General, Vision-Centric, and Text&Chart VQA tasks strongly correlate with generation
performance, each with a Pearson correlation coefficient (ρ) above 0.85. High-Resolution VQA exhibits
moderate correlation, with ρ around 0.7. In contrast, Knowledge VQA tasks, such as MMMU, show weak
correlation with generation performance. These findings suggest that generation ability aligns more closely
with the model’s vision capabilities rather than knowledge-specific tasks.

Finding 4: General, vision-centric, and text understanding VQA tasks exhibit strong correlations with
visual generation, whereas knowledge-based VQA tasks do not.

4 MetaMorph Model
Based on the insights in Section 3, we train our unified model, MetaMorph, based on LLaMA-3.1 8B (AI@Meta,
2024), using VPiT with the data curated in Section 2.2. We present our experimental results in three parts:
quantitative performance (Section 4.1), evidence of MetaMorph leveraging LLM knowledge in visual generation
(Section 4.2), and implicit reasoning skills in multimodal contexts (Section 4.3).

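Column groups: Image QA (MMBench-EN through TextVQA), Video QA (MV-Bench), Generation (COCO FID).

| Method | Base LLM | MMBench-EN | SEED | RealWorldQA | MMVP | SQA | MMMU | VStar | ChartQA | TextVQA | MV-Bench | COCO (FID) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Visual Understanding Only | | | | | | | | | | | | |
| GPT-4V∗ | - | 75.8 | 69.1 | 61.4 | 50.0 | 75.7 | 56.8 | 55.0 | 78.5 | 78.0 | 43.5 | - |
| Visual Generation Only | | | | | | | | | | | | |
| Stable Diffusion 1.5∗ | - | - | - | - | - | - | - | - | - | - | - | 9.6 |
| Dalle 2∗ | - | - | - | - | - | - | - | - | - | - | - | 10.4 |
| Imagen∗ | - | - | - | - | - | - | - | - | - | - | - | 7.3 |
| Unified Models | | | | | | | | | | | | |
| EMU-3∗ | - | 58.5 | 68.2 | 57.4 | 36.6† | 89.2 | 31.6 | 51.8† | 68.6 | 64.7 | - | 12.8 |
| Janus∗ | DeepSeek 1.3B | 69.4 | 63.7 | - | - | - | 30.5 | - | - | - | - | 8.5 |
| VILA-U256† | LLaMA-2 7B | 66.6 | 57.1 | 46.6 | 22.0 | 67.1 | 32.2 | 38.7 | 11.4 | 48.3∗ | 40.8 | 19.6 |
| Transfusion∗ | - | - | - | - | - | - | - | - | - | - | - | 6.7 |
| Chameleon-7B† | - | 35.7 | 27.2 | 19.6 | 0.0 | 50.3 | 28.4 | 37.1 | 0.0 | 0.0 | - | 26.7∗ |
| MetaMorph (ours) | LLaMA-3.1 8B | 75.2 | 71.8 | 58.3 | 48.3 | 83.2 | 41.8 | 44.0 | 37.1 | 60.5 | 48.8 | 11.8 |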

Table 1 Comparison of MetaMorph with other unified models. MetaMorph offers competitive performance compared to
other leading unified models. Models in gray are understanding-only or generation-only. Unified models without a
base LLM are trained from scratch. ∗ We use numbers reported in original papers. † We obtain results using official
open-sourced model weights.

4.1 Competitive Performance in Understanding and Generation


We compare MetaMorph with other unified models and summarize results in Table 1. Since these models are
trained on different datasets and base LLMs (or pretrained from scratch), an apples-to-apples comparison
is difficult. Nevertheless, MetaMorph demonstrates competitive performance and outperforms other unified
models on most benchmarks—even when prior models may have been trained on more data. Compared to
models trained from scratch, such as EMU-3 (Wang et al., 2024b) and Chameleon (Team, 2024), MetaMorph
leverages the strengths of the latest pretrained LLMs and achieves competitive understanding and generation
performance. MetaMorph highlights that unified models can be developed effectively from pretrained LLMs.

[Figure 9: image grid. Rows: Ground Truth / Examples; SD-3.5 8B; Janus; MetaMorph. Left panel, "Professional Knowledge", prompts: “Chhogori”, “Oncilla”, “view of Chizarira”. Right panel, "Addressing Semantic Nuances", prompts: “A bookshelf with few books”, “A bookshelf with many books”, “A glass without water”, “A glass filled with water”, “Slightly tall building”, “Very tall building”.]

Figure 9 Examples of MetaMorph leveraging LLMs to generate visual tokens. Left: MetaMorph can leverage knowledge from
the LLM to generate visual tokens for professional terms that need domain-specific understanding. Right: MetaMorph
also avoids common mistakes seen in T2I models that condition on text embeddings (e.g., Stable Diffusion-3.5 8B).

4.2 MetaMorph can Leverage LLM Knowledge for Visual Generation


MetaMorph effectively leverages the world knowledge embedded in pre-trained LLMs. We show examples on
the left side of Figure 9. We prompt the model to generate concepts requiring non-trivial and specialized
knowledge. Examples include “Chhogori” (the world’s second-highest mountain), “Oncilla” (a small wildcat
from South America), and “Chizarira” (an isolated wilderness area in Zimbabwe).
MetaMorph successfully translates domain-specific knowledge into accurate visual tokens, thereby displaying
the ability to leverage world knowledge from LLMs. In contrast, the latest Text-to-Image (T2I) model, Stable
Diffusion-3.5 8B, struggles to generate the correct concept despite producing high-quality images. This issue
may stem from the text embedding models it uses, CLIP (Radford et al., 2021) and T5 (Roberts et al.,
2019), which fail to properly encode these specialized terms (Yuksekgonul et al., 2022).
On the right side of Figure 9, we demonstrate how MetaMorph handles common semantic challenges more
effectively than text embedding models such as CLIP and T5. These challenges include negation and
subjectivity, using prompts with common failure patterns identified in Multimon (Tong et al., 2024b).
MetaMorph differentiates semantic nuances such as “slightly” versus “very”, “few” versus “many”, and “without”
versus “with”, which are common failures in existing text-to-image systems.

4.3 Reasoning in Multimodal Generation


In Figure 10, we present examples where the model generates images in response to puzzle prompts such as
“The national flag of the country where Yellowstone National Park is located”. For each puzzle, we directly use
the prompt “Generate an image of {puzzle}”, without any Chain-of-Thought (CoT) (Wei et al., 2022b) hints
in the prompt. MetaMorph generates the correct image from prompts that require multi-step reasoning.
For example, when answering the question “A musical instrument, this instrument is often played by the
scientist who formulated the theory of special relativity”, the model needs to implicitly complete three reasoning
steps: it identifies Albert Einstein as the scientist who formulated the theory of special relativity, recognizes that
his preferred instrument is the violin, and then directly generates correct visual tokens—a violin—without
explicitly separating these steps during the generation process. This result implies that MetaMorph implicitly
solves the puzzle and generates correct visual tokens immediately following the prompt. These results align
with the findings in Physics of LLMs (Ye et al., 2024; Allen-Zhu, 2024), where the authors suggest that LLMs
precompute reasoning graphs before autoregressively generating subsequent tokens. Here, we demonstrate
that this capability transfers to the unified multimodal model setting even when decoding visual tokens.

[Figure 10: image grid. Columns: Prompt; SD3.5-8B; Janus; MetaMorph; Step-by-Step Logic Chain (For Reference); Solution Examples. Prompts and reference logic chains:]
• “The national flag of the country where Yellowstone National Park is located”: Yellowstone National Park’s Location ➟ America ➟ American Flag
• “The flower celebrated in spring festivals in the country where sushi originated”: Sushi’s Origin ➟ Japan ➟ Flower in Spring Festivals in Japan ➟ Cherry Blossom (Sakura)
• “The large mammal that shares its name with a constellation often visible in the night sky and associated with the northern part of the world”: Constellation Associated with the Northern Sky ➟ Ursa Major ➟ Ursa (Latin for ‘Bear’) ➟ Large Mammal Named ‘Bear’
• “A musical instrument, this instrument is often played by the scientist who formulated the theory of special relativity”: Scientist Who Formulated Special Relativity ➟ Albert Einstein ➟ Instrument Often Played by Einstein ➟ Violin
• “The animal associated with having (2+7) lives”: 2+7 ➟ 9 ➟ Animal Believed to Have 9 Lives ➟ Cat

Figure 10 Examples of MetaMorph solving reasoning problems in visual generation. We design puzzles that require multi-step
reasoning. We include reference logic chains needed to solve the puzzles, and reference solution examples. When
prompting each model, we directly feed in the puzzle without any CoT hints or logic chains. MetaMorph has the
ability to implicitly solve these puzzles and generate the correct image without explicitly creating or processing a logic
chain. It demonstrates that the implicit reasoning skills in text-only LLMs can transfer to unified multimodal models.

5 Related Work

Instruction tuning and visual instruction tuning. Instruction tuning (Wei et al., 2022a; Taori et al., 2023) finetunes
a pretrained LLM to learn the format and style of interaction. This process helps the model to effectively
convey the knowledge and capabilities acquired during pretraining (Zhou et al., 2024a). LLaVA (Liu et al.,
2023) extends instruction tuning into the multimodal domain. Since then, different lines of work focus on
improving data curation (Chen et al., 2023; Laurençon et al., 2024a,b), visual representation (Tong et al.,
2024a; Kar et al., 2025; Chen et al., 2024b), and instruction tuning strategies (Gao et al., 2024; Liu et al.,
2024b). Using only a few million multimodal instruction tuning data, this line of research (Liu et al., 2024b;
Tong et al., 2024a; Li et al., 2024a) has enabled open-source MLLMs to reach performance levels comparable
to those of proprietary models (OpenAI, 2024; Anthropic, 2024) on a number of benchmarks (Liu et al., 2024d;
Yue et al., 2024a,b) and applications (Zhai et al., 2024; Pan et al., 2024a).

From Multimodal LLMs to unified models. Recent efforts to construct unified models have primarily relied on
either extensive pretraining or heavy fine-tuning on billion-scale datasets. Some studies also use continuous
embeddings for predicting visual tokens, integrating visual regression losses (Sun et al., 2024b,a) or leveraging
diffusion-based methods (Dong et al., 2024). Other approaches (Lu et al., 2022a; Aghajanyan et al., 2022;
Team, 2024; Wu et al., 2024b; Liu et al., 2024c; Wang et al., 2024b; Lu et al., 2024) tokenize multimodal
data into discrete tokens, which are then trained using autoregressive transformers. Recent research has
also explored hybrid strategies that combine autoregressive and diffusion objectives (Zhou et al., 2024b; Xie
et al., 2024). Different from previous studies, we demonstrate that unified models can be effectively trained
in low-data regimes during instruction tuning, while also providing insights into the reciprocal relationship
between visual understanding and visual generation.

6 Discussion
In this work, we propose VPiT—a simple yet effective extension to visual instruction tuning—that enables
LLMs to predict multimodal tokens. VPiT unlocks the use of a more diverse range of instruction tuning data
than just visual question answering, such as text-to-image and pure image and video data. Through controlled
experiments, we find that visual generation ability emerges as a natural byproduct of improved visual understanding
and requires modest additional generation data. In addition, we find that while visual understanding
and generation are mutually beneficial, adding more visual understanding data disproportionately improves
overall performance compared to adding more generation data.
Leveraging these insights, we train MetaMorph by finetuning LLaMA-3.1 8B with VPiT. With a simple
training process, MetaMorph achieves competitive performance in both visual understanding and generation.
Qualitative evaluation of our model shows that MetaMorph can leverage world knowledge and reasoning
abilities of the base LLM during visual generation. For example, it can perform multimodal tasks that typically
require multiple steps of reasoning, such as generating images of specialized proper nouns (“Chhogori”) or
solving visual puzzles (“generate an image of the animal resulting from a monarch caterpillar’s metamorphosis”).
This indicates that LLMs already possess a degree of “prior” visual knowledge which can be activated with
only minimal instruction tuning with VPiT. Overall, LLMs may have a similar representation space as unified
and multi-functional models (Huh et al., 2024). We hope the insights from this work inspire more exploration
toward developing LLMs for general intelligence.

References
Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko,
Mandar Joshi, Gargi Ghosh, Mike Lewis, et al. Cm3: A causal masked multimodal model of the internet. arXiv
preprint arXiv:2201.07520, 2022.
Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Devendra Chaplot, Jessica Chudnovsky, Saurabh Garg,
Theophile Gervet, Soham Ghosh, Amélie Héliou, Paul Jacob, et al. Pixtral 12b. arXiv preprint arXiv:2410.07073,
2024.
AI@Meta. Llama 3 model card. 2024.
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch,
Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS,
2022.
Zeyuan Allen-Zhu. ICML 2024 Tutorial: Physics of Language Models, 2024. Project page: [Link]
com/.
Anthropic. Claude, 2024.
Jimmy Lei Ba, Jamie Kiros, and Geoffrey E. Hinton. Layer normalization. In NeurIPS, 2016.
Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and
Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. In TMLR, 2024.
Florian Bordes, Randall Balestriero, and Pascal Vincent. High fidelity visualization of what your self-supervised
representation knows about. In TMLR, 2022.
Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions.
In CVPR, 2023.
Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v:
Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.
Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu
Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions. In NeurIPS,
2024a.
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo,
Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source
suites. arXiv preprint arXiv:2404.16821, 2024b.

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N
Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In
NeurIPS, 2024.
Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu
Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. In ICLR, 2024.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa
Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers
for image recognition at scale. In ICLR, 2021.
Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten,
Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal
datasets. In NeurIPS, 2024.
Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin,
Peng Jin, et al. Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. arXiv
preprint arXiv:2402.05935, 2024.
Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, and Ying Shan. Planting a seed of vision in large language model.
arXiv preprint arXiv:2307.08041, 2023.
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim,
Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The “something something” video
database for learning and evaluating visual common sense. In ICCV, 2017a.
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating
the role of image understanding in visual question answering. In CVPR, 2017b.
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation
metric for image captioning. In EMNLP, 2021.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a
two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017.
Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. In ICML,
2024.
Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, and Federico Tombari. Brave:
Broadening the visual encoding of vision-language models. In ECCV, 2025.
Jing Yu Koh, Daniel Fried, and Russ R Salakhutdinov. Generating images with multimodal language models. In
NeurIPS, 2024.
Benno Krojer, Dheeraj Vattikonda, Luis Lara, Varun Jampani, Eva Portelance, Christopher Pal, and Siva Reddy.
Learning action and reasoning-centric image editing from videos and simulations. In NeurIPS, 2024.
Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang,
Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved
image-text documents. Advances in Neural Information Processing Systems, 36, 2024a.
Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language
models? arXiv preprint arXiv:2405.02246, 2024b.
Yann LeCun. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 62(1):1–62,
2022.
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and
Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024a.
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al.
Mvbench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024b.
Tianhong Li, Dina Katabi, and Kaiming He. Return of unconditional generation: A self-supervised representation
generation method. In NeurIPS, 2024c.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence
Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR,
2024a.
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved
reasoning, ocr, and world knowledge, 2024b.
Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with
ringattention. arXiv preprint arXiv:2402.08268, 2024c.
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui
He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In ECCV, 2024d.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified
model for vision, language, and multi-modal tasks. In ICLR, 2022a.
Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha
Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. In CVPR,
2024.
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark,
and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In
NeurIPS, 2022b.
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question
answering about charts with visual and logical reasoning. In ACL, 2022.
Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah,
Xianzhi Du, Futang Peng, Floris Weers, et al. Mm1: Methods, analysis & insights from multimodal llm pre-training.
arXiv preprint arXiv:2403.09611, 2024.
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m:
Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019.
OpenAI. gpt4o, 2024.
Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Autonomous evaluation and
refinement of digital agents. In COLM, 2024a.
Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, and Furu Wei. Kosmos-g: Generating images in
context with multimodal large language models. In ICLR, 2024b.
Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders:
Toward a meaningful and decodable representation. In CVPR, 2022.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda
Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision.
In ICML, 2021.
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training
trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage
and Analysis, pages 1–16. IEEE, 2020.
Adam Roberts, Colin Raffel, Katherine Lee, Michael Matena, Noam Shazeer, Peter J Liu, Sharan Narang, Wei Li, and
Yanqi Zhou. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2019.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image
synthesis with latent diffusion models. In CVPR, 2022.
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo
Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for
training next generation image-text models. In NeurIPS, 2022.

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual
cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought
reasoning. In NeurIPS, 2024.
Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning
with reading comprehension, 2020.
Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun
Huang, and Xinlong Wang. Generative multimodal models are in-context learners. In CVPR, 2024a.
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun
Huang, and Xinlong Wang. Generative pretraining in multimodality. In ICLR, 2024b.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B.
Hashimoto. Alpaca: A strong, replicable instruction-following model, 2023.
Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang,
Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal
llms. In NeurIPS, 2024a.
Shengbang Tong, Erik Jones, and Jacob Steinhardt. Mass-producing failures of multimodal systems with language
models. In NeurIPS, 2024b.
Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the
visual shortcomings of multimodal llms. In CVPR, 2024c.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov,
Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. LLaMA 2: Open foundation and fine-tuned chat models.
2023.
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin
Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint
arXiv:2409.12191, 2024a.
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang,
Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024b.
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and
Quoc V Le. Finetuned language models are zero-shot learners. In ICLR, 2022a.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al.
Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022b.
Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu,
Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv
preprint arXiv:2410.13848, 2024a.
Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In CVPR, 2024.
Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu
Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint
arXiv:2409.04429, 2024b.
xAI. grok, 2024.
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie
Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding
and generation. arXiv preprint arXiv:2408.12528, 2024.
Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh,
Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data. In ICLR, 2024.
Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. Physics of Language Models: Part 2.1, Grade-School
Math and the Hidden Reasoning Process. ArXiv e-prints, abs/2407.20311, 2024. Full version available at http:
//[Link]/abs/2407.20311.

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming
Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark
for expert agi. In CVPR, 2024a.
Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao
Yu, Ge Zhang, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. arXiv
preprint arXiv:2409.02813, 2024b.
Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language
models behave like bags-of-words, and what to do about it? In ICLR, 2022.
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training.
In ICCV, 2023.
Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun,
Yi Ma, et al. Fine-tuning large vision-language models as decision-making agents via reinforcement learning. In
NeurIPS, 2024.
Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander
Hauptmann, Yonatan Bisk, et al. Direct preference optimization of video large multimodal models from language
model reward. arXiv preprint arXiv:2404.01258, 2024.
Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, and Alexander Toshev. Pre-trained language models do
not help auto-regressive text-to-image generation. In EMNLP, 2023.
Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili
Yu, et al. Lima: Less is more for alignment. In NeurIPS, 2024a.
Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma,
Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal
model. arXiv preprint arXiv:2408.11039, 2024b.
Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, and Serena Yeung-levy. Video-star: Self-training enables
video instruction tuning with any supervision. arXiv preprint arXiv:2407.06189, 2024.

Appendix
A Training Details and Hyperparameters

A.1 MetaMorph Training


We follow the training recipe outlined in prior studies (Tong et al., 2024a; McKinzie et al., 2024), using a
two-stage training approach. First, we pretrain a two-layer MLP with a GELU activation (Hendrycks and
Gimpel, 2016) as the adapter between the visual tokens and the LLM. We train this adapter on Cambrian
adapter data while excluding all data points sourced from LAION (Schuhmann et al., 2022). Next, we finetune
the entire model, excluding the vision backbone, using the instruction tuning data described in Section 2.2
and detailed in Appendix C.
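The two stages described above can be summarized as a plan of which components receive gradient updates. This is a minimal sketch, not the authors' training code: the component names are hypothetical, and keeping the LLM frozen during adapter pretraining is an assumption, since the text only says that the adapter is pretrained first.

```python
from dataclasses import dataclass

# Hypothetical component names; the text mentions an adapter, an LLM, and a
# vision backbone, but gives no concrete module identifiers.
COMPONENTS = frozenset({"vision_backbone", "adapter", "llm"})


@dataclass(frozen=True)
class StagePlan:
    name: str
    trainable: frozenset

    @property
    def frozen_components(self) -> frozenset:
        # Everything that is not trainable in this stage stays frozen.
        return COMPONENTS - self.trainable


def make_stage(name, trainable):
    trainable = frozenset(trainable)
    unknown = trainable - COMPONENTS
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    return StagePlan(name, trainable)


# Stage 1: pretrain the adapter (assumption: the LLM is frozen here).
ADAPTER_PRETRAINING = make_stage("adapter_pretraining", {"adapter"})
# Stage 2: finetune the entire model except the vision backbone.
INSTRUCTION_TUNING = make_stage("instruction_tuning", {"adapter", "llm"})

for stage in (ADAPTER_PRETRAINING, INSTRUCTION_TUNING):
    print(stage.name, "trains", sorted(stage.trainable),
          "freezes", sorted(stage.frozen_components))
```

In a real framework the same plan would typically be applied by toggling gradient tracking per module; that step is omitted here to keep the sketch dependency-free.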
We use DeepSpeed (Rajbhandari et al., 2020) ZeRO-3 to train our model on H100 GPUs. Detailed training
hyperparameters for all experiments are provided in Table 2. We train for one epoch in all experiments.

| Experiment | Backbone LLM | Adapter data | Instruction tuning data | Adapter lr | Adapter wd | Adapter bs | Instruction tuning lr | Instruction tuning wd | Instruction tuning bs |
|---|---|---|---|---|---|---|---|---|---|
| Section 3 (LLaMA-3 8B) | LLaMA-3 8B | Cambrian Adapter Data∗ | Section 3 Experiment Setting | 4.90e-5 | 0.0 | 768 | 6.93e-5 | 0 | 1536 |
| Section 3 (LLaMA-3.1 8B) | LLaMA-3.1 8B | Cambrian Adapter Data∗ | Section 3 Experiment Setting | 4.90e-5 | 0.0 | 768 | 6.93e-5 | 0 | 1536 |
| Section 3 (LLaMA-3 70B) | LLaMA-3 70B | Cambrian Adapter Data∗ | Section 3 Experiment Setting | 4.90e-5 | 0.0 | 768 | 4.90e-5 | 0 | 768 |
| MetaMorph | LLaMA-3.1 8B | Cambrian Adapter Data∗ | All Data from Section 2.2 | 4.90e-5 | 0.0 | 768 | 6.93e-5 | 0 | 1536 |

Table 2 Implementation details and hyperparameters for all experiments. ∗ We exclude data points in LAION (Schuhmann
et al., 2022) from Cambrian adapter data.
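The hyperparameters in Table 2 can be written down as a small configuration, sketched below. The text does not say how the learning rates were chosen; as an observation only, the 8B instruction tuning value (6.93e-5 at batch size 1536) matches what square-root batch-size scaling would give from 4.90e-5 at batch size 768. The class and function names are illustrative.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StageHyperparams:
    lr: float
    weight_decay: float
    batch_size: int


# Values copied from Table 2.
ADAPTER = StageHyperparams(lr=4.90e-5, weight_decay=0.0, batch_size=768)
INSTRUCTION_TUNING_8B = StageHyperparams(lr=6.93e-5, weight_decay=0.0, batch_size=1536)
INSTRUCTION_TUNING_70B = StageHyperparams(lr=4.90e-5, weight_decay=0.0, batch_size=768)


def sqrt_scaled_lr(base_lr, base_batch_size, new_batch_size):
    # Square-root scaling rule: an assumption used only to check the numbers,
    # not a procedure stated in the text.
    return base_lr * math.sqrt(new_batch_size / base_batch_size)


predicted = sqrt_scaled_lr(ADAPTER.lr, ADAPTER.batch_size,
                           INSTRUCTION_TUNING_8B.batch_size)
print(f"predicted {predicted:.3e}, reported {INSTRUCTION_TUNING_8B.lr:.3e}")
```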

A.2 Diffusion Visualizer Training


We leverage pretrained diffusion models such as Stable Diffusion 1.5 (Rombach et al., 2022). We use a 2-layer
MLP projector to align the SigLIP embedding dimension with the cross-attention dimension in the pretrained
diffusion model. The first layer applies a linear transformation to map the input dimension to 2048, followed
by layer normalization (Ba et al., 2016) and a ReLU activation. The second layer reduces the 2048-dimensional
features to the output dimension through a linear transformation, followed by a final layernorm.
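The projector described above can be sketched in plain Python as follows. Only the layer order and the hidden width of 2048 come from the text; the input and output sizes in the demo are toy values, the random initialization is arbitrary, and the layer normalization here has no learnable scale or shift, which is a simplification.

```python
import math
import random

HIDDEN_DIM = 2048  # hidden width stated in the text


def init_linear(in_dim, out_dim, rng):
    # Arbitrary uniform initialization, for illustration only.
    bound = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-bound, bound) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear(x, layer):
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def layer_norm(x, eps=1e-5):
    # Normalize to zero mean and unit variance (no learnable affine parameters).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def relu(x):
    return [max(0.0, v) for v in x]


def make_projector(in_dim, out_dim, seed=0):
    rng = random.Random(seed)
    return init_linear(in_dim, HIDDEN_DIM, rng), init_linear(HIDDEN_DIM, out_dim, rng)


def project(x, projector):
    first, second = projector
    # Layer 1: linear -> layer norm -> ReLU.
    hidden = relu(layer_norm(linear(x, first)))
    # Layer 2: linear -> final layer norm.
    return layer_norm(linear(hidden, second))


# Toy sizes: 16 stands in for the SigLIP embedding dimension and 8 for the
# cross-attention dimension of the diffusion model.
projector = make_projector(in_dim=16, out_dim=8)
embedding = [math.sin(i) for i in range(16)]
projected = project(embedding, projector)
print(len(projected))
```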
We set the batch size to 2112. The learning rate schedule begins with a logarithmic warm-up over the first
2000 steps, gradually increasing from zero to a peak value of 1.1e-5. After this warm-up phase, the learning
rate decreases linearly over the next 12000 steps until reaching zero. We use the AdamW (Loshchilov, 2019)
optimizer to train our model, with β parameters (0.9, 0.999). We apply a weight decay of 0.01.
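The schedule described above can be sketched as a function of the training step. The text says the warm-up is logarithmic but does not give its formula, so the `log1p` form below is an assumption; the peak value, the 2000 warm-up steps, and the 12000 linear decay steps are taken from the text.

```python
import math

PEAK_LR = 1.1e-5
WARMUP_STEPS = 2000
DECAY_STEPS = 12000


def learning_rate(step):
    if step < 0:
        raise ValueError("step must be non-negative")
    if step <= WARMUP_STEPS:
        # Assumed logarithmic warm-up: 0 at step 0, PEAK_LR at step WARMUP_STEPS.
        return PEAK_LR * math.log1p(step) / math.log1p(WARMUP_STEPS)
    # Linear decay from PEAK_LR to zero over the next DECAY_STEPS steps.
    progress = (step - WARMUP_STEPS) / DECAY_STEPS
    return PEAK_LR * max(0.0, 1.0 - progress)


for step in (0, 100, 2000, 8000, 14000):
    print(step, learning_rate(step))
```

With this form the learning rate rises quickly in the earliest steps and flattens as it approaches the peak, then falls in a straight line to zero at step 14000.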
During diffusion training, we freeze the VAE encoder and the SigLIP encoder, training only the projector and
the diffusion U-Net. The CFG level is set to 0.7. This is because we start with a pretrained diffusion model
and aim to transform the conditioning from CLIP text to SigLIP image embeddings. A higher CFG level
ensures the model maintains high image quality while gradually adapting to the new conditioning in the
remaining fraction. Empirically, this approach achieves the best balance between adaptation and image quality.
For the training datasets, since we finetune the diffusion model to condition on SigLIP image embeddings,
training this model does not require text descriptions for conditioning. Instead, we use images curated through
MetaCLIP (Xu et al., 2024) and train this diffusion model to visualize the visual tokens generated by
MetaMorph.

A.3 Evaluation Benchmarks


For evaluation, we use nine ImageQA, one VideoQA and two generation benchmarks:
• MMBench (Liu et al., 2024d): A comprehensive benchmark spanning 20 multimodal ability dimensions.
• Seed (Ge et al., 2023): A benchmark focusing on visual tasks for multimodal understanding, consisting of
19k multiple choice questions with accurate human annotations.
• V*STAR (Wu and Xie, 2024): A VQA benchmark designed for testing details in high-resolution images.

| Loss | AVG | MMBenchEN | SEED | RealworldQA | MMVP | SQA | MMMU | VStar | ChartQA | TextVQA |
|---|---|---|---|---|---|---|---|---|---|---|
| None (VQA Only) | 55.50 | 73.11 | 69.96 | 55.69 | 41.33 | 80.39 | 37.29 | 46.60 | 35.16 | 59.96 |
| L1 Loss | 53.83 | 72.17 | 69.28 | 57.25 | 34.67 | 79.00 | 34.00 | 45.55 | 32.40 | 60.17 |
| Cosine Sim | 55.93 | 73.78 | 71.36 | 55.03 | 44.00 | 79.83 | 35.29 | 47.64 | 36.60 | 59.79 |

Table 3 Comparison of different loss functions. Training with cosine similarity loss enables the model to effectively utilize
non-VQA data, which in turn enhances its visual understanding.

• MMVP (Tong et al., 2024c): A benchmark for evaluating “CLIP-Blind” pairs in Vision Language Models.
• MMMU (Yue et al., 2024a): A benchmark designed to evaluate multimodal models on extensive
multi-discipline tasks requiring college-level subject knowledge and deliberate reasoning.
• ChartQA (Masry et al., 2022): A large-scale benchmark involving visual and logical reasoning over charts.
• TextVQA (Sidorov et al., 2020): A benchmark designed to evaluate models’ ability to read and reason
about text in images to answer questions.
• ScienceQA (Lu et al., 2022b): A multimodal benchmark for answering science-related questions requiring
integration of visual and textual data.
• RealWorldQA (xAI, 2024): A benchmark focused on real-world multimodal reasoning tasks.
• MV-Bench (Li et al., 2024b): A comprehensive video understanding benchmark covering 20 challenging
video tasks that cannot be effectively solved with a single frame.
• FID Score (Heusel et al., 2017): A metric for evaluating the quality of generated images by comparing
their feature distributions with real images.
• CLIP Score (Hessel et al., 2021): A benchmark metric that uses CLIP embeddings to measure alignment
between generated images and their corresponding text descriptions.
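The two generation metrics at the end of this list can be illustrated with a simplified sketch. FID is the Fréchet distance between Gaussians fitted to real and generated image features, $\lVert\mu_r-\mu_g\rVert^2 + \mathrm{Tr}(\Sigma_r+\Sigma_g-2(\Sigma_r\Sigma_g)^{1/2})$. The code below assumes diagonal covariances, which reduces the trace term to a sum of squared differences of standard deviations; real FID uses full covariances of Inception features. The CLIPScore function follows the $w\cdot\max(\cos, 0)$ form with $w = 2.5$ from Hessel et al.; whether the evaluation here uses that exact scaling is not stated in the text.

```python
import math
from statistics import fmean, pstdev


def diagonal_fid(real_feats, gen_feats):
    # Simplified Frechet distance assuming diagonal covariance matrices.
    total = 0.0
    for d in range(len(real_feats[0])):
        real = [f[d] for f in real_feats]
        gen = [f[d] for f in gen_feats]
        total += (fmean(real) - fmean(gen)) ** 2
        total += (pstdev(real) - pstdev(gen)) ** 2
    return total


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def clip_score(image_emb, text_emb, w=2.5):
    # Rescaled, clipped cosine similarity between image and text embeddings.
    return w * max(cosine(image_emb, text_emb), 0.0)


# Toy feature sets: the generated set is the real set shifted by 1 in the
# first dimension, so only the mean term contributes.
real = [[0.0, 0.0], [2.0, 2.0]]
generated = [[1.0, 0.0], [3.0, 2.0]]
print(diagonal_fid(real, generated))
print(clip_score([1.0, 0.0], [1.0, 0.0]))
```

Lower FID indicates generated features closer to the real distribution, while a higher CLIP score indicates closer image-text alignment.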

B Ablation Studies on Visual Prediction Objective


We compare our approach to the commonly used L1 regression loss, which has been widely adopted in
contrastive self-supervised learning methods (LeCun, 2022; Bardes et al., 2024). For this comparison, we
train MetaMorph, based on LLaMA-3 8B, using datasets described in Section 2.2. We highlight that cosine
similarity and L1 loss influence the embedding outputs differently: cosine similarity enforces normalization,
while L1 loss does not. This discrepancy in output normalization prevents a direct and fair comparison in
terms of generation performance. Consequently, our analysis focuses exclusively on VQA performance.
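The normalization difference can be illustrated with a minimal sketch in plain Python. The function names `cosine_loss` and `l1_loss` are hypothetical, and writing the cosine objective as `1 - cos` is an assumption made for illustration; it is not necessarily the exact form used in training.

```python
import math

def cosine_loss(pred, target):
    """1 - cosine similarity; depends only on direction, not on vector norm."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm_p * norm_t)

def l1_loss(pred, target):
    """Mean absolute error; sensitive to the scale of the prediction."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

# Toy 3-dimensional "visual embedding" (illustrative values only).
target = [0.6, 0.8, 0.0]
pred = [0.5, 0.9, 0.1]
scaled_pred = [10.0 * p for p in pred]  # same direction, 10x the norm

print(cosine_loss(pred, target), cosine_loss(scaled_pred, target))
print(l1_loss(pred, target), l1_loss(scaled_pred, target))
```

Rescaling the prediction leaves the cosine loss unchanged but inflates the L1 loss, so models trained with the two objectives produce embeddings on different scales.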
In Table 3, we compare models trained using L1 loss and cosine similarity loss. Our analysis reveals that
training with cosine similarity results in better average performance and outperforms L1 loss on most
benchmarks. Notably, these vision loss functions affect only tasks requiring visual predictions and do not
directly influence VQA tasks, as the VQA training data does not include image token responses. This
improvement is potentially because training with cosine similarity enhances visual generation, which in turn
contributes to better visual understanding.
To further investigate, we compare our method, which incorporates a broader range of non-VQA data alongside
Cambrian-7M, with a baseline trained exclusively on Cambrian-7M. The results show that combining a
broader dataset with cosine similarity loss leads to better performance across multiple benchmarks. This
finding reinforces our earlier observations in Section 3: enhancing visual generation capabilities contributes to
improved visual understanding, highlighting the benefits of leveraging non-VQA data.

• ImageQA (44.0%): Cambrian-7M (Tong et al., 2024a) (7067.0 K)
• Generation (31.1%): MetaCLIP (Xu et al., 2024) (5000.0 K)
• VideoQA (9.9%): VideoStar (Zohar et al., 2024) (1055.0 K); ShareVideo (Zhang et al., 2024) (540.0 K)
• Pure Video (8.8%): HowTo100M (Miech et al., 2019) (1193.0 K); SmthSmthV2 (Goyal et al., 2017a) (220.0 K)
• Visual Thinking (3.2%): VisualCoT (Shao et al., 2024) (361.0 K); VStar (Wu and Xie, 2024) (148.0 K)
• Image-to-Image (3.0%): InstructPix2Pix (Brooks et al., 2023) (313.0 K); Aurora (Krojer et al., 2024) (169.0 K)

Figure 11 Data composition. Left: The inner circle shows the distribution of MetaMorph data. Right: All the data
sources and categories in the MetaMorph data.

C Data

C.1 Data Composition


We summarize the categorization of data and the number of samples for each source in Figure 11. This
diverse dataset is curated to showcase that an LLM can be finetuned across a variety of tasks, where each
task contributes to and enhances the performance of others, as discussed in Section 3.1.

C.2 Data Preprocessing


As discussed in Section 2.2, we use a wide range of data, spanning from visual question answering tasks to
unlabeled video data. Here, we detail the preprocessing steps applied to each data source to convert them
into instruction-tuning-style QA conversations.

ImageQA. We use Cambrian-7M (Tong et al., 2024a), a dataset already curated in instruction tuning format.
An example entry is shown below:

Example from ImageQA

Prompt:
<image_start><image><image_end> What is the animal in the image?
Response:
It is a burmilla cat.

VideoQA. We use VideoStar (Zohar et al., 2024) and ShareVideo (Chen et al., 2024a), both curated in an
instruction tuning format. For each video, we extract frames at a rate of one frame per second and input
these frames into the LLM. An example QA entry for an 8-second video is structured as follows:

Example from VideoQA

Prompt:
<image_start><image><image_end><image_start><image><image_end><image_start><image><image_end>
<image_start><image><image_end><image_start><image><image_end><image_start><image><image_end>
<image_start><image><image_end><image_start><image><image_end>
What’s the color of the dog in this video? (a) white (b) yellow (c) black Please only answer a single letter
and nothing else
Response:
b
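The frame-to-placeholder mapping for VideoQA can be sketched as follows. This is a minimal illustration, assuming one `<image_start><image><image_end>` placeholder per sampled frame; the helper name `build_video_prompt` and its parameters are hypothetical and do not describe the actual data pipeline.

```python
IMAGE_PLACEHOLDER = "<image_start><image><image_end>"

def build_video_prompt(duration_seconds: int, question: str, fps: int = 1) -> str:
    """Build a VideoQA prompt with one image placeholder per sampled frame."""
    if duration_seconds <= 0 or fps <= 0:
        raise ValueError("duration_seconds and fps must be positive")
    num_frames = duration_seconds * fps  # 1 frame per second by default
    return IMAGE_PLACEHOLDER * num_frames + "\n" + question

question = (
    "What's the color of the dog in this video? (a) white (b) yellow (c) black "
    "Please only answer a single letter and nothing else"
)
prompt = build_video_prompt(8, question)
print(prompt.count("<image>"))  # number of frame placeholders for an 8-second video
```

For the 8-second example above, sampling at one frame per second gives eight placeholders followed by the question.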

Generation data. We use image-text pairs in MetaCLIP (Xu et al., 2024). The original data consists of
images paired with corresponding text descriptions. We add system prompts and define answering formats,
transforming the image-text pairs into question-answer formats suitable for instruction tuning.

Example from Generation data

Prompt:
Generate an image of a puppy.
Response:
Here is an image based on your request: <image_start><image><image_end>.

Unlike in ImageQA and VideoQA, we require the model to predict the visual tokens in the response.
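The conversion from an image-text pair to an instruction-tuning sample can be sketched as below. The prompt and response templates are copied from the example above; the function name `caption_to_generation_sample`, the dictionary keys, and the `predict_visual_tokens` flag are hypothetical and serve only to illustrate the idea.

```python
IMAGE_PLACEHOLDER = "<image_start><image><image_end>"

def caption_to_generation_sample(caption: str) -> dict:
    """Turn a caption from an image-text pair into a prompt/response sample."""
    caption = caption.strip().rstrip(".")
    return {
        "prompt": f"Generate an image of {caption}.",
        "response": f"Here is an image based on your request: {IMAGE_PLACEHOLDER}.",
        # Assumed marker: the image in the response is a prediction target,
        # unlike ImageQA/VideoQA where images appear only in the prompt.
        "predict_visual_tokens": True,
    }

sample = caption_to_generation_sample("a puppy")
print(sample["prompt"])
print(sample["response"])
```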

Visual Thinking Data. We explore incorporating vision as part of the model’s reasoning process to enhance its
answers. As a preliminary step, we experiment with the VisualCoT (Shao et al., 2024) and
VStar (Wu and Xie, 2024) datasets. Originally, these datasets were designed to teach models how to utilize
external tools and APIs, such as segmentation or zoom-in cropping. In this work, we aim to integrate these
visual skills directly into the model’s inference steps. We use system prompts, such as “think visually before
you answer the question”, to activate this visual reasoning mode. Here is an example:

Example from Visual thinking data

Prompt:
Think visually before you answer the question: what is the color of the hat in the image
<image_start><image><image_end>.
Response:
Let’s consider this visually: <image_start><image><image_end> black.

In this example, the model first predicts a zoomed-in view of the key region in the image necessary for
answering the question, and then it provides the answer.

Image to image data. The model is required to predict a transformed image based on a given text description.
We use Aurora (Krojer et al., 2024) and InstructPix2Pix (Brooks et al., 2023) datasets to train this capability.
An example is presented below:

Example from Image to image data

Prompt:
<image_start><image><image_end> Make this a rainy day.
Response:
<image_start><image><image_end>.

Here, the model is expected to predict the visual tokens of the transformed image as per the specified prompt.

Pure video data. We explore commonly used open-source video datasets in instruction tuning: Something-
SomethingV2 (Goyal et al., 2017a) and HowTo100M (Miech et al., 2019). We design the following tasks from
the pure video:
1) Forward Frame Prediction. In this task, the model is presented with the initial frame of a video sequence
and must predict the subsequent frames at fixed time intervals. An example is presented below:

Example of Forward Frame Prediction

Prompt:
<image_start><image><image_end> Can you predict what happens in the next 3 frames, each 5 seconds
apart?
Response:
<image_start><image><image_end><image_start><image><image_end><image_start><image><image_end>

2) Partial Sequence Completion. This task requires the model to complete a video sequence when given only
a subset of frames while maintaining temporal coherence:

Example of Partial Sequence Completion

Prompt:
<image_start><image><image_end><image_start><image><image_end><image_start><image><image_end>
Can you predict the 2 missing frames in this 5-second-interval sequence?
Response:
<image_start><image><image_end><image_start><image><image_end>

3) Reverse Temporal Prediction. This task challenges the model to reconstruct the preceding frames given the
final frame of a sequence:

Example of Reverse Temporal Prediction

Prompt:
<image_start><image><image_end> Work backwards to predict the previous 4 frames, each 5 seconds
apart.
Response:
<image_start><image><image_end><image_start><image><image_end>
<image_start><image><image_end><image_start><image><image_end>

4) Temporal Sequence Reordering. In this task, the model receives a shuffled sequence of video frames and
must reconstruct their correct temporal order:

Example of Temporal Sequence Reordering

Prompt:
<image_start><image><image_end><image_start><image><image_end>
<image_start><image><image_end><image_start><image><image_end>
Arrange these frames in their correct temporal sequence.
Response:
<image_start><image><image_end><image_start><image><image_end>
<image_start><image><image_end><image_start><image><image_end>

Each task is designed to train the model’s temporal understanding and visual reasoning capabilities.
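The four tasks can be sketched as simple transformations of a list of frames sampled at a fixed interval. This is a minimal illustration: frames are represented by string identifiers, all function names and dictionary keys are hypothetical, and returning the reverse-prediction targets in chronological order is an assumption, since the ordering is not specified above.

```python
import random
from typing import Dict, List, Sequence

def forward_prediction(frames: Sequence[str], n_future: int, interval_s: int) -> Dict:
    """Task 1: show the first frame, predict the next n_future frames."""
    return {
        "input": list(frames[:1]),
        "text": f"Can you predict what happens in the next {n_future} frames, "
                f"each {interval_s} seconds apart?",
        "target": list(frames[1:1 + n_future]),
    }

def partial_completion(frames: Sequence[str], missing: List[int], interval_s: int) -> Dict:
    """Task 2: hide the frames at the given indices and ask for them."""
    hidden = sorted(set(missing))
    return {
        "input": [f for i, f in enumerate(frames) if i not in hidden],
        "text": f"Can you predict the {len(hidden)} missing frames in this "
                f"{interval_s}-second-interval sequence?",
        "target": [frames[i] for i in hidden],
    }

def reverse_prediction(frames: Sequence[str], n_past: int, interval_s: int) -> Dict:
    """Task 3: show the final frame, predict the n_past frames before it."""
    return {
        "input": list(frames[-1:]),
        "text": f"Work backwards to predict the previous {n_past} frames, "
                f"each {interval_s} seconds apart.",
        # Assumption: targets are listed in chronological order.
        "target": list(frames[-1 - n_past:-1]),
    }

def reordering(frames: Sequence[str], rng: random.Random) -> Dict:
    """Task 4: shuffle the frames and ask for the original order."""
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return {
        "input": shuffled,
        "text": "Arrange these frames in their correct temporal sequence.",
        "target": list(frames),
    }

# Five frames sampled 5 seconds apart from a hypothetical clip.
frames = ["f00s", "f05s", "f10s", "f15s", "f20s"]
forward = forward_prediction(frames, n_future=3, interval_s=5)
partial = partial_completion(frames, missing=[1, 3], interval_s=5)
reverse = reverse_prediction(frames, n_past=4, interval_s=5)
reorder = reordering(frames[:4], random.Random(0))
```

In each sample, every frame in `input` would be rendered as a placeholder in the prompt, and every frame in `target` as visual tokens the model must predict in the response.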

C.3 Potential Image Leakage in Testing Data


When selecting data sources, we carefully choose those that do not overlap with the testing sets of our
evaluation data, such as COCO (Lin et al., 2014). However, given that the data used in Section 2.2 is
composed of numerous sources, some degree of data leakage may be inevitable. As discussed and analyzed in
prior work (Tong et al., 2024a), even when image overlap occurs, it does not necessarily imply that the exact
image-question pairs have been encountered during training. Unlike traditional unimodal computer vision
research, where an image alone constitutes a data point, the multimodal paradigm treats each image-text
(question-answer) pair as a distinct and unique data point.

Joint Training with Other Data   # of Generation Data   FID Score
Yes                              1k                     68.5
No                               1k                     115.0
Yes                              5k                     19.2
No                               5k                     116.4
Yes                              10k                    18.7
No                               10k                    111.0
Yes                              50k                    17.1
No                               50k                    111.8
Yes                              200k                   15.2
No                               200k                   110.7
Yes                              200k                   14.7
No                               200k                   93.7
Yes                              1M                     14.4
No                               1M                     52.8
Yes                              3M                     15.1
No                               3M                     39.2
Yes                              5M                     14.3
No                               5M                     27.7

Table 4 Results of training solely on generation data vs. joint training with additional data. These results correspond to
Figure 2. Joint training with additional data significantly improves generation performance. At 5,000 samples, the
model begins to generate reasonably accurate visual tokens, indicating that visual generation is an ability unlocked
through the learning of other tasks.

Joint Training Data   Data Type                   FID Score   CLIP Score
None                  -                           110.5       5.7
Image-to-Image        Other Visual Data           97.5        6.4
Visual Thinking       Other Visual Data           93.5        6.5
Pure Video            Other Visual Data           84.7        8.1
VideoQA               Visual Understanding Data   26.5        16.1
ImageQA               Visual Understanding Data   18.9        22.0

Table 5 Impact of joint training 200k generation data with different data types. These results correspond to Figure 3. Among
the data types analyzed, joint training with visual understanding data has the most significant impact on enhancing
visual generation performance.

D Generating Visual Tokens


Here, we include the quantitative results of all the experiments in Section 3.

D.1 Results of Samples Needed to Unlock Visual Generation


Table 4 presents the quantitative results corresponding to Figure 2, which examines generation performance
under two conditions: training exclusively on generation data and joint training with all other data described
in Section 2.2. The results demonstrate that the model can develop the ability for visual generation with a
relatively modest amount of data when trained jointly with understanding tasks. In contrast, teaching this
skill in isolation requires a substantially larger dataset.
In Table 5, we present the quantitative results corresponding to Figure 3, which investigates the impact of
joint training on generation data in combination with various types of data outlined in Section 2.2. The results
show that joint training with visual understanding data (specifically ImageQA and VideoQA) provides the
most significant improvement in visual generation performance.

D.2 Results of Joint Training with Different Understanding and Generation Data


Data Composition                     Image QA                                                                          Generation
# of VQA Data  # of Generation Data  Average  MMBenchEN  SEED  RealworldQA  MMVP  SQA   MMMU  VStar  ChartQA  TextVQA  FID Score  CLIP Score
1M             200k                  46.4     60.0       62.2  50.3         24.0  80.0  38.4  37.4   16.4     48.8     28.3       15.2
1M             500k                  48.2     66.4       63.2  50.8         24.3  80.4  39.9  38.7   18.2     51.6     28.1       15.9
1M             1M                    49.1     70.1       65.2  52.2         21.3  80.0  39.5  38.7   20.4     54.6     27.3       16.5
1M             2M                    49.9     67.8       66.0  50.2         30.3  80.2  38.9  39.0   21.8     54.8     23.1       17.8
1M             3M                    51.1     71.3       67.1  55.4         33.0  79.5  38.8  37.4   22.7     55.0     21.1       21.1
1M             4M                    51.4     71.1       66.9  52.4         31.0  80.5  39.8  41.1   24.0     56.0     18.4       22.3
4M             200k                  53.8     73.1       68.8  55.0         34.7  81.2  38.5  44.0   29.5     59.2     21.4       20.5
4M             500k                  53.3     73.0       69.9  55.3         32.7  80.6  40.2  39.3   29.6     58.9     16.0       24.8
4M             1M                    54.2     73.8       69.6  54.9         33.3  82.1  36.6  45.6   32.4     59.9     16.0       24.8
4M             2M                    53.8     72.8       70.3  55.2         37.3  80.8  36.8  44.0   31.2     56.2     15.6       24.7
4M             3M                    54.3     71.8       70.1  57.7         36.0  81.0  38.0  42.9   32.6     59.0     16.1       24.8
4M             4M                    54.4     75.2       69.9  56.0         37.3  81.4  38.1  40.8   31.6     59.3     15.3       25.5
7M             200k                  55.8     73.1       70.3  55.6         42.0  81.0  40.8  44.0   35.2     60.6     18.2       22.3
7M             500k                  55.6     74.4       70.6  56.2         38.7  81.9  37.9  44.0   36.0     60.5     15.2       25.5
7M             1M                    55.8     74.3       70.3  56.3         42.7  81.3  36.6  44.5   35.8     60.6     14.5       26.6
7M             2M                    55.4     73.9       71.1  56.9         40.0  81.6  35.9  42.4   35.4     61.6     14.8       27.1
7M             3M                    55.6     74.2       71.0  57.3         38.0  81.1  40.1  43.5   35.0     60.2     14.2       27.5
7M             4M                    56.2     75.4       70.4  55.4         44.0  80.4  39.6  45.0   35.2     60.2     14.9       26.3

Table 6 Full results of joint training on varying amounts of VQA data (1M, 4M, 7M) and generation data (200k, 500k, 1M, 2M, 3M,
4M). These results correspond to Figure 4, Figure 5, Figure 7, and Figure 8, which analyze how different combinations
of understanding and generation data impact the model’s visual understanding and generation performance.

                Image QA                                                                          Generation
Pretrained LLM  Average  MMBenchEN  SEED  RealworldQA  MMVP  SQA   MMMU  VStar  ChartQA  TextVQA  FID Score  CLIP Score
LLaMA-3 8B      55.8     74.3       70.3  56.3         42.7  81.3  36.6  44.5   35.8     60.6     14.5       26.6
LLaMA-3.1 8B    56.7     75.8       70.2  56.2         44.7  81.9  41.2  43.4   36.0     61.3     13.2       27.1
LLaMA-3 70B     60.7     80.7       72.6  58.3         48.7  87.8  48.9  47.1   37.4     65.0     13.8       26.8

Table 7 Full results of training on different LLMs. We train 7M VQA data and 1M generation data on different LLM
backbones (LLaMA-3 8B, LLaMA-3.1 8B, and LLaMA-3 70B) and measure understanding and generation performance.

In Table 6, we present the numerical results of joint training with varying scales of understanding data (1M,
4M, 7M) and generation data (200k, 500k, 1M, 2M, 3M, 4M). These findings demonstrate that increasing the
amount of understanding data yields more substantial improvements in both understanding tasks (e.g., VQA
performance) and generation tasks (e.g., FID scores and CLIP scores) compared to increasing the amount of
generation data. These results, consistent with our analysis in Section 3.2 and Section 3.3, highlight that
understanding data play a more pivotal role in enhancing performance across both task types.
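To make this comparison concrete, the short sketch below contrasts two scaling directions starting from the 1M VQA / 200k generation setting in Table 6. It assumes that the first score column of Table 6 is the average VQA score and the second-to-last column is the FID score; the dictionary `table6` and the helper `delta` are hypothetical names used only for illustration.

```python
# (VQA data, generation data) -> (average VQA score, FID score), read from Table 6.
table6 = {
    ("1M", "200k"): (46.4, 28.3),
    ("7M", "200k"): (55.8, 18.2),  # scale understanding data only
    ("1M", "4M"): (51.4, 18.4),    # scale generation data only
}

def delta(start, end):
    """Change in (average VQA score, FID) when moving from start to end."""
    avg0, fid0 = table6[start]
    avg1, fid1 = table6[end]
    return round(avg1 - avg0, 1), round(fid1 - fid0, 1)

more_understanding = delta(("1M", "200k"), ("7M", "200k"))
more_generation = delta(("1M", "200k"), ("1M", "4M"))
print("1M -> 7M VQA data:          ", more_understanding)
print("200k -> 4M generation data: ", more_generation)
```

Under these assumptions, scaling understanding data raises the average VQA score by 9.4 points, compared with 5.0 points from scaling generation data, while the FID reductions are of similar size (10.1 versus 9.9).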

D.3 Results of Training on Different LLMs


We present the results of training with 7M VQA data and 1M generation data across various LLM backbones,
including LLaMA-3 8B, LLaMA-3.1 8B, and LLaMA-3 70B. As shown in Table 7, which corresponds to
the results in Figure 6, we observe that stronger LLM backbones lead to improvements in both visual
understanding and visual generation. These findings further support the conclusion that visual understanding
and generation are reciprocal processes, where advancements in one drive enhancements in the other.

E More Examples of MetaMorph


We provide additional examples of MetaMorph in Figure 12 and Figure 13. These examples illustrate how
MetaMorph extends beyond the capabilities of typical MLLMs by leveraging learned skills to perform novel
tasks such as visual reasoning and visual transformation. In Figure 12, when prompted with the question
“What is the type of hat?”, MetaMorph first generates visual tokens related to hats and then answers correctly
with “top hat”. The model also demonstrates the ability to perform image transformations, such as creating
a cartoon version of an image or altering it to appear as daytime. Additionally, we showcase examples of
MetaMorph solving implicit puzzles, such as interpreting “a rearrangement of the letters in the word ‘tca’”
before generating the corresponding visual tokens of cats.

Prompt:
Make a child cartoon from it
Response:
<image_start><image_end>

Prompt:
How many birds are in the image and why?
Response:
There is only one bird in the image. The bird is captured in a single frame, and its reflection in the water
creates the illusion of two birds.

Prompt:
Reflect on the visual details: What's the type of hat that animal wear in the image?
Response:
Let's imagine the critical detail: <image_start><image_end> The animal is wearing a top hat.

Prompt:
Generate an image of the pet animal whose name is a rearrangement of the letters in the word 'tca'
Response:
Here's the generated image based on your request: <image_start><image_end>

Prompt:
What's the animal in this image?
Response:
The animal in the image is a domestic cat, specifically a tabby cat. Tabbies are characterized by their
distinctive coat patterns, which include stripes, dots, and swirling patterns. The markings on this cat's face,
with the 'M' shape on its forehead and the stripes on its ears, are typical of a tabby pattern.

Prompt:
Is the coke in the ads regular or zerosugar?
Response:
The coke in the ads is zero sugar.

Figure 12 Examples of MetaMorph (I). We showcase examples of MetaMorph’s capabilities: transforming images based
on prompts (top-left), answering challenging questions (top-right), integrating visual tokens into reasoning processes
(bottom-left), implicitly solving puzzles (bottom-right), and answering tricky video QA questions (bottom).

Prompt:
What’s this city?
Response:
The city in the image is New York City, specifically the Lower Manhattan area, which is home to the iconic
World Trade Center complex. The presence of the One World Trade Center, which is the tallest building in
the image, is a distinctive feature.
Prompt:
Can you turn it to view during the daytime?
Response:
<image_start><image_end>

Prompt:
Generate an image of a puppy
Response:
Here's the generated image based on your request: <image_start><image_end>

Prompt:
Generate an image of Kagu
Response:
Here's the generated image based on your request: <image_start><image_end>

Figure 13 Examples of MetaMorph (II). We showcase more examples of MetaMorph’s capabilities: answering questions and
transforming images in one conversation (left), generating images (top-right), and leveraging knowledge in LLMs to
generate rare concepts (bottom-right).
