MetaMorph: Unified Visual Understanding
arXiv:2412.14164v1, 18 Dec 2024

In this work, we propose Visual-Predictive Instruction Tuning (VPiT)—a simple and effective extension
to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified
autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to
predict discrete text tokens and continuous visual tokens from any input sequence of image and text
data curated in an instruction-following format. Our empirical investigation reveals several intriguing
properties of VPiT: (1) visual generation ability emerges as a natural byproduct of improved visual
understanding, and can be unlocked efficiently with a small amount of generation data; (2) while
we find understanding and generation to be mutually beneficial, understanding data contributes to
both capabilities more effectively than generation data. Building upon these findings, we train our
MetaMorph model and achieve competitive performance on both visual understanding and generation.
In visual generation, MetaMorph can leverage the world knowledge and reasoning abilities gained from
LLM pretraining, and overcome common failure modes exhibited by other generation models. Our
results suggest that LLMs may have strong “prior” vision capabilities that can be efficiently adapted
to both visual understanding and generation with a relatively simple instruction tuning process.
1 Introduction
Multimodal Large Language Models (MLLMs) have advanced considerably in visual understanding, progressing
from basic image captioning to complex visual inferences (Alayrac et al., 2022; Liu et al., 2023; Dai et al.,
2024). These models process multimodal inputs—primarily images and language—and generate text tokens.
Multimodal LLMs often leverage a pretrained vision encoder (Dosovitskiy et al., 2021; Radford et al., 2021),
a pretrained language model (Touvron et al., 2023; AI@Meta, 2024), and align these modalities through
connectors such as MLP (Liu et al., 2023, 2024a) or cross-attention modules (Alayrac et al., 2022; Dai et al.,
2024). Among MLLM training methods, visual instruction tuning (Liu et al., 2023) has become widely
used (Wang et al., 2024a; Agrawal et al., 2024). It treats output embeddings of pretrained vision encoders as
continuous-valued “visual tokens” and directly feeds them as inputs to pretrained LLMs.
One benefit of visual instruction tuning is that it is data and compute efficient. A pretrained LLM can be
repurposed as a Multimodal LLM by instruction tuning with modest compute and data on the order of millions
of image-text question-answer pairs (Tong et al., 2024a; Li et al., 2024a). The effectiveness of visual instruction
tuning indicates that LLMs already possess a considerable amount of inherent visual knowledge which allows
them to efficiently learn and develop visual understanding during the instruction tuning process (Zhou et al.,
2024a). Inspired by this, we investigate whether LLMs can also be finetuned to generate visual information
with comparable efficiency and effectiveness.
Current attempts toward “unified” models—models capable of both multimodal understanding and generation—
often treat visual generation as an orthogonal capability to visual understanding. They tend to require
substantial changes to the original MLLM architecture and significant multimodal pretraining and/or finetuning.
Designing such methods is challenging, and past research takes different approaches including tokenizing
Figure 1 VPiT Training, Inference, and Examples of MetaMorph. Left: In Visual-Predictive Instruction Tuning (VPiT),
we finetune a pretrained LLM to generate both text and visual tokens using separate text and vision heads. Middle:
During inference, the model accepts an arbitrary input sequence of image(s) and text and outputs discrete text tokens
and continuous visual tokens. These visual tokens can be visualized via a separately finetuned diffusion model, which
is trained to condition on the pretrained vision encoder’s output. Right: An example conversation from MetaMorph
trained with VPiT. Here, the model implicitly solves a visual puzzle in order to generate the visual tokens of a butterfly.
The conversation continues with new user questions as the model continues to autoregressively process vision and text
tokens, independent of the diffusion-based visualization.
visual inputs into discrete tokens (Wu et al., 2024b; Team, 2024; Liu et al., 2024c), incorporating diffusion
objectives (Xie et al., 2024; Zhou et al., 2024b), and decoupling vision into separate understanding and
generation modes (Wu et al., 2024a). For example, approaches like LWM (Liu et al., 2024c), Show-o (Xie
et al., 2024), and Chameleon (Team, 2024) require billions of image-text pairs (Schuhmann et al., 2022; Gadre
et al., 2024) for extensive pretraining and finetuning.
In this work, we propose Visual-Predictive Instruction Tuning (VPiT)—a simple extension to visual instruction
tuning which builds upon the existing paradigm of passing continuous visual tokens as input to the LLM. VPiT
trains an LLM to output both continuous visual tokens and discrete text tokens in the finetuning stage. The
model takes pretrained vision encoder embeddings as well as text tokens as input, and outputs a combination
of text tokens and continuous visual tokens. To visualize the generated visual tokens, we finetune a diffusion
model to map the embeddings back into pixel space (see Figure 1 for an example). This framework allows us
to study the synergy between visual understanding, visual generation, and pretrained LLMs, which leads to
several intriguing findings outlined below.
First, we show that the ability to predict visual tokens emerges from understanding visual inputs and requires
minimal additional training. Similar to visual instruction tuning, VPiT efficiently and effectively morphs an
LLM into a “unified” model that understands and generates multimodal tokens. When trained jointly with
sufficient visual understanding data, this process requires as little as 200k additional visual generation data.
We further establish that the abilities to understand and generate visual tokens are intrinsically linked and
asymmetrical. Specifically, increasing understanding data improves visual understanding (measured by higher
VQA scores) and generation performance (measured by lower FID scores). Conversely, increasing generation
data enhances generation quality and also contributes to stronger visual understanding—but to a lesser degree.
Importantly, our findings highlight an asymmetry in how training each ability impacts the model’s overall
vision performance: understanding-centric training substantially outperforms generation-centric training in
improving both visual understanding and generation.
Building upon these findings, we train a unified model called MetaMorph to predict multimodal tokens with
VPiT. We leverage diverse data sources ranging from common visual question answering datasets to pure
image and video data without text annotations. MetaMorph achieves competitive performance on both visual
understanding and visual generation benchmarks. Furthermore, we show this unified modeling approach
allows models to leverage the power of LLMs. For instance, MetaMorph can extract knowledge from the
pretrained LLM when generating visual tokens. More surprisingly, we observe that MetaMorph can implicitly
perform reasoning steps before generating visual tokens—e.g. when prompted with “the animal resulting from
a monarch caterpillar’s metamorphosis”, MetaMorph successfully generates an image of a butterfly (Figure 1).
Our results suggest that 1) training a unified model with instruction tuning is feasible, and 2) LLMs have
strong pre-existing visual capabilities which can be activated using significantly fewer samples compared
to extensive pretraining. These insights shed light on the development of mixed-modality models. As the
community continues to improve visual understanding in Multimodal LLMs (Tong et al., 2024a; Wang et al.,
2024a; Li et al., 2024a) by advancing base LLMs, instruction tuning techniques, and data, we highlight that
these efforts may also implicitly lead to models that are better at visual generation.
Tokenizing multimodal data. We extend Pi and Ri to include both text and images. To integrate visual data
into a pretrained LLM, we process data closely following visual instruction tuning (Liu et al., 2023):
• Text Data: Text is tokenized into discrete tokens with a standard tokenizer used by the LLM.
• Visual Data: Images are encoded with a pretrained vision encoder such as SigLIP (Zhai et al., 2023).
The output is a sequence of continuous visual tokens, which are then interpolated to m = 64 tokens. To pass
the visual tokens as input to the LLM, we apply a trainable projection layer to align their dimension with the LLM's.
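The interpolation-and-projection step can be sketched as follows. This is a minimal sketch under assumptions the text does not pin down: we assume linear interpolation over the token axis, and the widths (1152 for the encoder, 4096 for the LLM) are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def to_visual_tokens(encoder_out: torch.Tensor, projector: nn.Linear, m: int = 64) -> torch.Tensor:
    """Interpolate continuous encoder tokens to a fixed length m, then project to the LLM width.

    encoder_out: (batch, n_tokens, enc_dim) embeddings from a pretrained vision
    encoder such as SigLIP. Interpolation runs over the token axis.
    """
    x = encoder_out.transpose(1, 2)                              # (batch, enc_dim, n_tokens)
    x = F.interpolate(x, size=m, mode="linear", align_corners=False)
    x = x.transpose(1, 2)                                        # (batch, m, enc_dim)
    return projector(x)                                          # (batch, m, llm_dim)

# Illustrative sizes: a 729-patch encoder output mapped to 64 tokens of LLM width 4096.
projector = nn.Linear(1152, 4096)
tokens = to_visual_tokens(torch.randn(2, 729, 1152), projector)
```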
Model architecture. We take a pretrained LLM and finetune it to process arbitrary sequences of text and
visual tokens (detailed next in Section 2.2). We keep the original LLM head for text prediction, and attach a
separate vision head to the LLM for predicting visual tokens, i.e., the output tokens generated by the vision
encoder when processing images. The vision head is a projection layer that projects from the LLM’s dimension
to the vision encoder’s dimension. All response tokens can then be trained and predicted autoregressively,
with prompt tokens as context.
Unlike conventional visual instruction tuning, in VPiT, visual tokens are also outputs of the LLM—not just
inputs. To make the LLM aware of the presence of visual tokens, we introduce special tokens ⟨image_start⟩
and ⟨image_end⟩ to indicate the boundaries of visual token sequences and when to use the vision head.
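The two output heads described above can be sketched as a pair of linear projections over the LLM's hidden states. This is a hedged sketch with toy widths (in MetaMorph the LLM width is LLaMA-3 8B's and the vision width is SigLIP's); in the real model, the ⟨image_start⟩/⟨image_end⟩ tokens determine which head's output is used at each position.

```python
import torch
import torch.nn as nn

class DualHead(nn.Module):
    """Text head (vocabulary logits) and vision head (continuous visual tokens)
    attached to the same LLM hidden states."""
    def __init__(self, llm_dim: int, vocab_size: int, vision_dim: int):
        super().__init__()
        self.text_head = nn.Linear(llm_dim, vocab_size)    # discrete next-token logits
        self.vision_head = nn.Linear(llm_dim, vision_dim)  # continuous visual tokens

    def forward(self, hidden: torch.Tensor):
        # Both heads are computed; downstream logic selects per position based on
        # whether the position falls inside an <image_start>...<image_end> span.
        return self.text_head(hidden), self.vision_head(hidden)

# Toy widths for illustration only.
heads = DualHead(512, 1000, 1152)
logits, vis = heads(torch.randn(1, 10, 512))
```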
Loss functions. The language head outputs a probability distribution over the vocabulary and is trained with
cross-entropy loss for next-token prediction. Visual prediction uses cosine similarity loss between the LLM’s
predicted visual tokens and those from the vision encoder. Consistent with instruction tuning practices, the
model only makes predictions and incurs loss on response tokens.
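A minimal sketch of the combined objective, assuming per-position masks that are 1 on response tokens and 0 on prompt tokens. The `vpit_loss` name and the unweighted sum of the two terms are our own choices; the paper does not specify a loss balance.

```python
import torch
import torch.nn.functional as F

def vpit_loss(text_logits, text_targets, text_mask,
              pred_visual, target_visual, visual_mask):
    """Cross-entropy on text response positions plus cosine-similarity loss on
    visual response positions; prompt positions are masked out of both terms."""
    ce = F.cross_entropy(
        text_logits.flatten(0, 1), text_targets.flatten(), reduction="none")
    ce = (ce * text_mask.flatten()).sum() / text_mask.sum().clamp(min=1)

    # Cosine loss: 1 - cos(predicted visual token, vision-encoder target token).
    cos = 1.0 - F.cosine_similarity(pred_visual, target_visual, dim=-1)
    cos = (cos * visual_mask).sum() / visual_mask.sum().clamp(min=1)
    return ce + cos

# Toy usage: batch of 2, 5 text positions over a 100-word vocabulary,
# 4 visual positions of dimension 64, all positions treated as response tokens.
loss = vpit_loss(torch.randn(2, 5, 100), torch.randint(0, 100, (2, 5)), torch.ones(2, 5),
                 torch.randn(2, 4, 64), torch.randn(2, 4, 64), torch.ones(2, 4))
```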
Section 3 and Section 4. All data types are formatted as instruction tuning style prompt & response pairs.
See further details in Appendix C.2.
1. Visual Understanding Data: This category includes data that takes image(s) or video as input and outputs
text responses. See Figure 1 for an example. We use:
• ImageQA: Cambrian-7M (Tong et al., 2024a). The model answers questions based on input image(s).
Pi ∈ {⟨visual tokens⟩, ⟨text prompt⟩}
Ri ∈ {⟨text response⟩}
• VideoQA: VideoStar (Zohar et al., 2024) and ShareVideo (Zhang et al., 2024). The model answers
questions based on the input video. For videos in VideoQA, we process frames at 1 FPS.
Pi ∈ {⟨visual tokens⟩, · · · , ⟨visual tokens⟩, ⟨text prompt⟩}
Ri ∈ {⟨text response⟩}
2. Visual Generation Data: MetaCLIP (Xu et al., 2024). The model predicts visual tokens based on an image
description. We use at most 5 million pairs and curate the data into a question-answering format.
Pi ∈ {⟨text prompt⟩}
Ri ∈ {⟨text response⟩, ⟨visual tokens⟩}
We prompt the model to generate visual tokens with instructions like “Generate an image of...”. The
text responses are “Here is an image based on your request...”. See Figure 1 for an example.
3. Other Visual Data: This category includes data that requires the model to predict visual tokens given
interleaved input visual tokens and text tokens. We use:
• Video Data: SomethingSomethingV2 (Goyal et al., 2017b) and HowTo100M (Miech et al., 2019).
The model predicts frames in sequential order. We design different question-answer pairs that
probe the video, such as asking about future frames, past frames, or reordering frames.
Pi ∈ {⟨visual tokens⟩, · · · , ⟨visual tokens⟩, ⟨text prompt⟩}
Ri ∈ {⟨visual tokens⟩, · · · , ⟨visual tokens⟩}
• Visual Thinking Data: Visualization-of-Thought (Shao et al., 2024) and VStar (Wu and Xie, 2024).
The model predicts multimodal tokens in its response before addressing problems. For instance, it
predicts a zoomed-in view of an image before generating textual responses.
Pi ∈ {⟨visual tokens⟩, ⟨text prompt⟩}
Ri ∈ {⟨text response⟩, ⟨visual tokens⟩, ⟨text response⟩}
In the response, the model will output “I will think about it visually”, followed by visual tokens
representing a zoomed-in segment of the image, and then proceed to answer the question.
• Image-to-Image Data: InstructPix2Pix (Brooks et al., 2023) and Aurora (Krojer et al., 2024). The
model generates a transformed image conditioned on a text description and an input image.
Pi ∈ {⟨visual tokens⟩, ⟨text prompt⟩}
Ri ∈ {⟨visual tokens⟩}
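As a hedged sketch of how one of these data types is curated, the visual generation samples above might be assembled as follows. The `format_generation_sample` helper is hypothetical, the templates paraphrase the instructions quoted in the text, and `{visual tokens}` stands in for the span of continuous visual tokens between the special boundary tokens.

```python
def format_generation_sample(caption: str) -> dict:
    """Curate an (image, caption) pair into an instruction-style prompt/response,
    in the spirit of the visual generation data described above."""
    return {
        "prompt": f"Generate an image of {caption}.",
        # The continuous visual tokens of the target image would occupy the span
        # between the <image_start> and <image_end> boundary tokens.
        "response": "Here is an image based on your request: "
                    "<image_start>{visual tokens}<image_end>",
    }

sample = format_generation_sample("a butterfly")
```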
3 Findings on Unlocking Visual Generation
We study the following questions about the effects and synergy of visual understanding and generation, under
our VPiT framework:
§3.1 Can visual generation be unlocked through lightweight tuning, or does it require extensive data?
§3.2 Are visual understanding and generation mutually beneficial or orthogonal?
§3.3 How much does more visual understanding or generation data contribute to understanding and generation
quality?
§3.4 Which visual understanding tasks correlate the most with generation performance?
Evaluation settings. We use 9 ImageQA benchmarks (MMBench, Seed, VStar, MMVP, MMMU, ChartQA,
TextVQA, ScienceQA, RealWorldQA) to evaluate different aspects of the model. For image generation, we
use the finetuned diffusion model to visualize generated visual tokens and measure FID score (lower is better)
and CLIP score (higher is better) on the COCO-30K dataset. Unless otherwise specified, we use LLaMA-3
8B (AI@Meta, 2024) / SigLIP ViT-SO400M-14@384 (Zhai et al., 2023) as the pretrained LLM / vision
encoder. We also study the effect of different LLMs in Section 3.2. We use instruction-tuned versions of
the LLMs. We pretrain the adapter between the vision encoder and the LLM following visual instruction
tuning (Liu et al., 2023, 2024a). For experiments in this section, we provide training details in Appendix A
and include the full results in Appendix B.
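For reference, FID compares the Gaussian statistics (mean and covariance) of two feature sets, in practice extracted by an Inception network. The sketch below implements the standard textbook formula, not the paper's evaluation code, and assumes the features are already computed.

```python
import numpy as np
from scipy import linalg

def fid(feat_a: np.ndarray, feat_b: np.ndarray) -> float:
    """Frechet Inception Distance between two feature sets of shape (n_samples, dim):
    ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^{1/2})."""
    mu_a, mu_b = feat_a.mean(0), feat_b.mean(0)
    s_a = np.cov(feat_a, rowvar=False)
    s_b = np.cov(feat_b, rowvar=False)
    covmean = linalg.sqrtm(s_a @ s_b)      # matrix square root of the covariance product
    if np.iscomplexobj(covmean):           # discard tiny imaginary numerical residue
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(s_a + s_b - 2.0 * covmean))

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 8))
```

Identical feature sets score near zero; shifting one set by a constant raises the score by the squared mean distance.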
3.1 Visual Generation Can Be Unlocked Efficiently by Joint Training with Visual Understanding
We start by investigating the number of image-text samples required to teach a language model to generate
high-quality visual tokens. To this end, we randomly sample {1k, 5k, 10k, 50k, 200k, 1M, 3M, 5M} image-text
pairs from our generation data (MetaCLIP dataset (Xu et al., 2024)). We explore two settings: (1) finetuning
the LLM using only visual generation data, and (2) jointly training visual generation with visual understanding
and the rest of data types described in Section 2.2.
In Figure 2, we see that training solely on visual generation performs significantly worse than joint training
with all other data. Even with over 3 million image-text pairs, the model struggles to generate high-quality
images (∼40 FID), and performance remains inferior to joint training even at 5 million pairs. This suggests
that training solely on visual generation data is significantly less sample efficient. This finding aligns with a
prior study (Zhang et al., 2023) which also suggests that LLMs cannot be easily tuned to generate visual
tokens when trained with only generation data. In contrast, joint training with other datasets substantially
improves generation performance. The model generates effective visual tokens with just 5k generation data,
and performance stabilizes around 200k samples. This indicates that visual generation is not an orthogonal
capability but rather an ability that benefits from other tasks and emerges more effectively with joint training.
[Figure 2 plot residue: FID (y-axis) across generation data scales; legend entries include “Generation Only”.]