
Advanced Multimodal Machine Learning

Lecture 12.1: Multimodal Fusion and New Directions
Louis-Philippe Morency

* Original version co-developed with Tadas Baltrusaitis

1
Lecture Objectives

▪ Recap: multimodal fusion
▪ Kernel methods for fusion
  ▪ Multiple Kernel Learning
▪ New directions in multimodal machine learning
  ▪ Representation
  ▪ Alignment
  ▪ Translation
  ▪ Fusion
  ▪ Co-Learning
▪ Recap of multimodal challenges
Quick Recap:
Multimodal Fusion
3
Multimodal fusion

▪ Process of joining information from two or more modalities to perform a prediction
▪ Examples
  ▪ Audio-visual speech recognition
  ▪ Audio-visual emotion recognition
  ▪ Multimodal biometrics
  ▪ Speaker identification and diarization
  ▪ Visual/Media question answering
Multimodal Fusion

▪ Two major types
  ▪ Model-free
    ▪ Early, late, hybrid
  ▪ Model-based
    ▪ Neural networks
    ▪ Graphical models
    ▪ Kernel methods

[Figure: a prediction produced by a "fancy algorithm" fusing Modality 1, Modality 2, and Modality 3]

Graphical Model: Learning Multimodal Structure

▪ Modality-private structure
  • Internal grouping of observations
▪ Modality-shared structure
  • Interaction and synchrony

[Figure: latent-variable model for sentiment y over audio observations x_A1 … x_A5 with hidden states h_A1 … h_A5; example input: "We saw the yellow dog"]

6
Multi-view Latent Variable Discriminative Models

▪ Modality-private structure
  • Internal grouping of observations
▪ Modality-shared structure
  • Interaction and synchrony

[Figure: sentiment y over audio observations x_A1 … x_A5 and visual observations x_V1 … x_V5, with hidden states h_A1 … h_A5 and h_V1 … h_V5; example input: "We saw the yellow dog"]

p(y | x_A, x_V; θ) = Σ_{h_A, h_V} p(y, h_A, h_V | x_A, x_V; θ)

➢ Approximate inference using loopy belief propagation
7
Multimodal Fusion:
Multiple Kernel Learning

8
What is a Kernel function?

▪ A kernel function acts as a similarity metric between data points:

  K(x_i, x_j) = φ(x_i)ᵀ φ(x_j) = ⟨φ(x_i), φ(x_j)⟩, where φ: D → Z

▪ The kernel function performs an inner product in the feature-map space φ
  ▪ The inner product (a generalization of the dot product) is often denoted ⟨·,·⟩ in SVM papers
▪ x ∈ ℝ^D (but not necessarily), while φ(x) can be in any space – the same, higher-, lower-, or even infinite-dimensional
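As a concrete check of this idea (an illustrative sketch, not from the slides; the feature map and data points are my own choices), the short NumPy example below shows that a degree-2 polynomial kernel returns exactly the inner product of an explicit quadratic feature map φ, without constructing φ(x) in the general case:

import numpy as np

def phi(x):
    # Explicit quadratic feature map for 2-D inputs (an illustrative choice):
    # phi(x) = [x1^2, x2^2, sqrt(2)*x1*x2]
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(xi, xj):
    # Homogeneous polynomial kernel of degree 2: K(xi, xj) = (xi . xj)^2
    return np.dot(xi, xj) ** 2

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])

# The same value computed two ways: explicitly in feature space vs. via the kernel.
print(np.dot(phi(xi), phi(xj)))   # 1.0 (up to floating point)
print(poly_kernel(xi, xj))        # 1.0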
Non-linearly separable data

[Figure: data that is not linearly separable in the original space becomes linearly separable after mapping]

▪ We want to map our data to a linearly separable space
▪ Instead of x, we want φ(x), in a separable space (φ(x) is a feature map)
▪ What if φ(x) is much higher dimensional? We do not want to learn more parameters, and the mapping could become very expensive
Radial Basis Function Kernel (RBF)

▪ Arguably the most popular SVM kernel

  K(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) )

▪ φ(x) = ?
  ▪ It is infinite-dimensional and fairly involved; there is no easy way to actually perform the mapping to this space, but we know what an inner product looks like in it
▪ σ = ?
  ▪ A hyperparameter
  ▪ With a really low sigma the model becomes close to a KNN approach (potentially very expensive)
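A minimal sketch of the RBF Gram matrix, assuming scikit-learn is available; note that scikit-learn parameterizes the same kernel with gamma = 1/(2σ²):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.RandomState(0).randn(5, 3)          # 5 toy points in R^3
sigma = 1.5

# Gram matrix from the definition: K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_manual = np.exp(-sq_dists / (2 * sigma**2))

# Same kernel via scikit-learn, with gamma = 1 / (2 sigma^2)
K_sklearn = rbf_kernel(X, X, gamma=1.0 / (2 * sigma**2))
print(np.allclose(K_manual, K_sklearn))           # True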
Some other kernels

▪ Other kernels exist
  ▪ Histogram intersection kernel – good for histogram features
  ▪ String kernels – specifically for text and sentence features
  ▪ Proximity distribution kernel
  ▪ (Spatial) pyramid matching kernel
Kernel CCA

▪ If we recall CCA, it used only inner products in its definitions when dealing with data, which means we can again use kernels
▪ We can now map into a high-dimensional non-linear space instead

[Lai et al. 2000]


Different properties of different signals

How do we deal with heterogeneous or multimodal data?


▪ The data of interest is not in a joint space, so the appropriate kernels for each modality might be different

Multiple Kernel Learning (MKL) is a way to address this

▪ MKL was popular for image classification and retrieval before deep learning approaches came around (winner of the 2010 VOC challenge and the ImageCLEF 2011 challenge)
▪ MKL fell slightly out of favor when deep learning approaches became popular
▪ It is still useful when large datasets are not available
Multiple Kernel Learning

▪ Instead of providing a single kernel and validating which one works, optimize over a family of kernels (or different families for different modalities)
▪ Works well for unimodal and multimodal data; very little adaptation is needed

[Lanckriet 2004]
MKL in Unimodal Case

▪ Pick a family of kernels and learn which kernels are important for the classification task
  ▪ For example, a set of RBF and polynomial kernels

[Figure: base kernels combined into a single learned kernel K]
MKL in Multimodal/Multiview Case

▪ Pick a family of kernels for each modality and learn which kernels are important for the classification task
▪ The inputs do not need to be different modalities; often we use different views of the same modality (HOG, SIFT, etc.)

[Figure: per-modality base kernels combined into a single learned kernel K]
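True MKL learns the kernel weights jointly with the classifier (e.g., via the optimization of Lanckriet et al.); the sketch below, which is not from the slides, only illustrates the ingredients in the multimodal case: one Gram matrix per modality, combined as a convex combination and fed to an SVM with a precomputed kernel. The features, labels, and weights beta are made up, and beta is fixed by hand rather than learned:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel
from sklearn.svm import SVC

rng = np.random.RandomState(0)
n = 100
X_audio  = rng.randn(n, 20)        # hypothetical audio features
X_visual = rng.randn(n, 50)        # hypothetical visual features
y = rng.randint(0, 2, size=n)      # toy labels

# One (or several) base kernels per modality.
K_audio  = rbf_kernel(X_audio, gamma=0.05)
K_visual = polynomial_kernel(X_visual, degree=2)

# MKL would learn these weights jointly with the SVM; here they are fixed.
beta = np.array([0.6, 0.4])
K_combined = beta[0] * K_audio + beta[1] * K_visual

clf = SVC(kernel="precomputed").fit(K_combined, y)
print(clf.score(K_combined, y))    # training accuracy on the toy data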
New Directions:
Representation
18
Representation 1: Hash Function Learning

▪ We talked about coordinated representations, but mostly enforced “simple” coordination
▪ We can make embeddings more suitable for retrieval
  ▪ Enforce a Hamming space (a binary n-bit space)

[Cao et al. Deep visual-semantic hashing for cross-modal retrieval, KDD 2016]
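This is not the paper's training procedure (deep visual-semantic hashing learns the binary codes end-to-end); the sketch below only shows why a Hamming space helps retrieval: once embeddings are binarized, cross-modal nearest neighbors reduce to cheap bit comparisons. All embeddings here are random placeholders:

import numpy as np

rng = np.random.RandomState(0)
image_emb = rng.randn(1000, 64)    # hypothetical coordinated image embeddings
text_emb  = rng.randn(1, 64)       # one hypothetical text query embedding

# Binarize into a 64-bit Hamming space (a learned hashing layer would do this).
img_codes  = image_emb > 0
query_code = text_emb > 0

# Hamming distance = number of differing bits; in practice an XOR + popcount.
hamming = (img_codes != query_code).sum(axis=1)
top5 = np.argsort(hamming)[:5]
print(top5, hamming[top5])         # indices of the 5 closest images to the query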
Representation 2: Order-Embeddings

▪ We talked about coordinated representations, but mostly enforced “simple” coordination
▪ Can we take it further?
  ▪ Replaces symmetric similarity
  ▪ Enforces approximate structure when training the embedding (see the order-violation sketch below)


[Vendrov et al. Order-embeddings of images and language, ICLR 2016]
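In Vendrov et al.'s formulation, symmetric similarity is replaced by an asymmetric order-violation penalty, E(x, y) = ||max(0, y − x)||², which is zero exactly when x dominates y coordinate-wise. The NumPy sketch below (variable names and numbers are my own, and which modality plays which role is only indicative) shows the penalty and the max-margin loss built on it:

import numpy as np

def order_violation(x, y):
    # E(x, y) = ||max(0, y - x)||^2: zero iff x >= y in every coordinate,
    # so the measure is asymmetric, unlike cosine or Euclidean similarity.
    return np.square(np.maximum(0.0, y - x)).sum()

def margin_loss(x_pos, y_pos, x_neg, y_neg, margin=0.05):
    # Max-margin training loss over a positive pair and a contrastive negative pair.
    return order_violation(x_pos, y_pos) + max(0.0, margin - order_violation(x_neg, y_neg))

# Embeddings are constrained to be non-negative (e.g. via an absolute value).
caption = np.array([0.1, 0.2, 0.0])
image   = np.array([0.3, 0.5, 0.4])       # dominates `caption` coordinate-wise
print(order_violation(image, caption))    # 0.0   -> order satisfied
print(order_violation(caption, image))    # ~0.29 -> order violated in the other direction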
Representation 3: Hierarchical Multimodal LSTM

▪ Attempts to model region-based representations using phrases rather than simply an overview sentence for the image
▪ Uses these region-based phrases to hierarchically build sentences

Niu, Zhenxing, et al. "Hierarchical multimodal lstm for dense visual-semantic embedding." Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017.
Representation 3: Hierarchical Multimodal LSTM

[Figure: the multimodal embedding space and the HM-LSTM architecture]

Niu, Zhenxing, et al. "Hierarchical multimodal lstm for dense visual-semantic embedding." Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017.
Representation 4: Multimodal VAE (MVAE)

▪ Introduces a multimodal variational autoencoder (MVAE) with a new training paradigm that learns a joint distribution and is robust to missing data

[Wu, Mike, and Noah Goodman. “Multimodal Generative Models for Scalable Weakly-Supervised Learning.”, NIPS 2018]

23
Representation 4: Multimodal VAE (MVAE)

▪ Transforms unimodal datasets into “multimodal” problems by treating labels as a second modality

[Figure: samples from the prior z ~ p(z) compared with samples conditioned on the label modality, z ~ p(z | x₂ = 5) and z ~ p(z | x₂ = "ankle boot")]

[Wu, Mike, and Noah Goodman. “Multimodal Generative Models for Scalable Weakly-Supervised Learning.”, NIPS 2018]

24
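The MVAE fuses the per-modality inference networks with a product of Gaussian experts (PoE), which is what makes it usable when modalities are missing. Below is a minimal NumPy sketch of just the PoE combination step; the encoders, decoders, and training objective are omitted, and all numbers are made up:

import numpy as np

def product_of_experts(mus, logvars):
    # A product of Gaussians is Gaussian with precision = sum of precisions
    # and a precision-weighted mean; each modality contributes one expert.
    precisions = np.exp(-np.asarray(logvars))            # 1 / var_i
    joint_var = 1.0 / precisions.sum(axis=0)
    joint_mu = joint_var * (np.asarray(mus) * precisions).sum(axis=0)
    return joint_mu, np.log(joint_var)

# Prior expert N(0, I) plus two hypothetical per-modality encoder outputs.
mus     = [np.zeros(4), np.array([1.0, 0.5, -0.2, 0.0]), np.array([0.8, 0.4, 0.1, -0.3])]
logvars = [np.zeros(4), np.full(4, -1.0),                np.full(4, -0.5)]
print(product_of_experts(mus, logvars))

# A missing modality is handled by simply dropping its expert.
print(product_of_experts(mus[:2], logvars[:2]))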
Representation 5: Multilingual Representations

Goal: map an image and its descriptions (not translations) in both languages close to each other.

[Gella et al. "Image Pivoting for Learning Multilingual Multimodal Representations", ACL 2017]
New Directions:
Alignment
26
Alignment 1: Books to scripts/movies

▪ Aligning very different modalities
  ▪ Books to scripts/movies
▪ Hand-crafted, similarity-based approach

[Tapaswi et al. Book2Movie: Aligning Video Scenes with Book Chapters, CVPR 2015]
Alignment 2: Books to scripts/movies

▪ Aligning very different modalities
  ▪ Books to scripts/movies
▪ Supervision-based approach

[Zhu et al. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books, ICCV 2015]
Alignment 3: Spot-The-Diff

▪ ‘Spot-the-diff’: a new task and a dataset for succinctly describing all the differences between two similar images
▪ Proposes a new model that captures visual salience through a latent alignment between clusters of differing pixels and output sentences

[Jhamtani and Berg-Kirkpatrick. Learning to Describe Differences Between Pairs of Similar Images, EMNLP 2018]
Alignment 4: Textual Grounding

[Yeh, Raymond, et al. “Interpretable and globally optimal prediction for textual grounding using image concepts.”, NIPS 2017]
Alignment 4: Textual Grounding

▪ Formulates the bounding-box prediction as an energy minimization
▪ The energy function is defined as a linear combination of a set of “image concepts” φ_c(x, w_r) ∈ ℝ^(W×H)

[Figure: word priors]

[Yeh, Raymond, et al. “Interpretable and globally optimal prediction for textual grounding using image concepts.”, NIPS 2017]

31
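To show only the formulation (a box score that is a linear combination of concept-map mass inside the box), here is a brute-force sketch; the paper instead performs an exact, efficient search, and the maps, weights, and grid step below are arbitrary placeholders:

import numpy as np

def box_energy(concept_maps, weights, box):
    # Energy of a box = negative weighted sum of concept-map responses inside it.
    # concept_maps: (C, H, W) spatial maps; weights: (C,) derived from the query phrase.
    y0, x0, y1, x1 = box
    inside = concept_maps[:, y0:y1, x0:x1].sum(axis=(1, 2))
    return -float(weights @ inside)

def ground_phrase(concept_maps, weights, H, W, step=8):
    # Exhaustive search over axis-aligned boxes for the minimum-energy one.
    best, best_e = None, np.inf
    for y0 in range(0, H, step):
        for x0 in range(0, W, step):
            for y1 in range(y0 + step, H + 1, step):
                for x1 in range(x0 + step, W + 1, step):
                    e = box_energy(concept_maps, weights, (y0, x0, y1, x1))
                    if e < best_e:
                        best, best_e = (y0, x0, y1, x1), e
    return best, best_e

C, H, W = 3, 32, 32
maps = np.random.RandomState(0).rand(C, H, W)   # hypothetical concept maps
w = np.array([1.0, 0.2, -0.5])                  # hypothetical weights for one phrase
print(ground_phrase(maps, w, H, W))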
Alignment 4: Textual Grounding

▪ Word-word relationship: cos(w_s, w_s′), where w_s = [w_{s,1}; w_{s,2}; …; w_{s,|C|}]

[Yeh, Raymond, et al. “Interpretable and globally optimal prediction for textual grounding using image concepts.”, NIPS 2017]

32
Alignment 5: Comprehensive Image Captions

▪ Merging attention from the text and visual modalities for image captioning
▪ Strikes a balance between details (visually driven) and coverage of objects (text/topic driven)

[Liu et al. simNet: Stepwise Image-Topic Merging Network for Generating Detailed and Comprehensive Image Captions, 2018]
Alignment 5: Comprehensive Image Captions

▪ Merging attention from the text and visual modalities for image captioning

[Liu et al. simNet: Stepwise Image-Topic Merging Network for Generating Detailed and Comprehensive Image Captions, 2018]
New Directions:
Fusion
35
Fusion 1a: Multi-Head Attention for AVSR
[Figure: multi-head attention module fusing the audio and visual streams]

Afouras, Triantafyllos, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Deep audio-visual speech recognition." arXiv preprint arXiv:1809.02108 (Sept 2018).
Fusion 1b: Fusion with Multiple Attentions

▪ Modeling human communication – sentiment, emotions, speaker traits

[Figure: language LSTM, vision LSTM, and acoustic LSTM fused with multiple attentions]

[Zadeh et al., Human Communication Decoder Network for Human Communication Comprehension, AAAI 2018]

37
Fusion 2: Memory-Based Fusion

[Zadeh et al., Memory Fusion Network for Multi-view Sequential Learning, AAAI 2018]

38
Fusion 3: Relational Questions

▪ Aims to improve relational reasoning for Visual Question Answering
▪ Current deep learning architectures are unable to capture such reasoning capabilities on their own
▪ Proposes a Relation Network (RN) that augments CNNs for better reasoning

Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P., & Lillicrap, T. (2017). A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems (pp. 4967-4976).
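A Relation Network composes a prediction as RN(O) = f_φ( Σ_{i,j} g_θ(o_i, o_j, q) ): a shared MLP g_θ scores every pair of objects conditioned on the question embedding q, and a second MLP f_φ maps the summed pair representations to answer logits. The sketch below uses random, untrained weights purely to show the wiring; all sizes are arbitrary:

import numpy as np

rng = np.random.RandomState(0)
relu = lambda x: np.maximum(0.0, x)

def make_mlp(sizes):
    # A tiny random MLP (weights only), enough to illustrate the computation.
    Ws = [rng.randn(a, b) * 0.1 for a, b in zip(sizes[:-1], sizes[1:])]
    def forward(x):
        for W in Ws[:-1]:
            x = relu(x @ W)
        return x @ Ws[-1]
    return forward

n_objects, d_obj, d_q = 6, 8, 16
objects = rng.randn(n_objects, d_obj)   # e.g. CNN feature-map cells
q = rng.randn(d_q)                      # e.g. an LSTM question embedding

g_theta = make_mlp([2 * d_obj + d_q, 32, 32])   # scores one (o_i, o_j, q) triple
f_phi   = make_mlp([32, 32, 10])                # aggregated relations -> answer logits

pair_sum = np.zeros(32)
for i in range(n_objects):
    for j in range(n_objects):
        pair_sum += g_theta(np.concatenate([objects[i], objects[j], q]))

print(f_phi(pair_sum).shape)   # (10,) hypothetical answer vocabulary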
Fusion 3: Relational Questions
Fusion 4: Structured Prediction

▪ Scene-graph prediction: the output structure is invariant to specific permutations
▪ The paper describes a model that satisfies the permutation-invariance property and achieves state-of-the-art results on the competitive Visual Genome benchmark

[Herzig et al. Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction, NIPS 2018]
Fusion 5: Recurrent Multimodal Interaction

▪ The same LSTM module strides across the entire image / feature map of the image, acting as a convolutional kernel

[Figure: an LSTM encodes the expression "giraffe on right" while mLSTM modules stride over the input image to produce a segmentation mask]

[Liu et al. Recurrent Multimodal Interaction for Referring Image Segmentation, 2017]
New Directions:
Translation
43
Translation 1: Visually indicated sounds

▪ Sound generation!

[Owens et al. Visually indicated sounds, CVPR, 2016]


Translation 2: The Sound of Pixels

▪ Proposes a system that learns to localize the sound sources in a video and separate the input audio into a set of components coming from each object, by leveraging unlabeled videos

[Zhao, Hang, et al. “The sound of pixels.”, ECCV 2018] https://youtu.be/2eVDLEQlKD0

45
Translation 2: The Sound of Pixels

▪ Trained in a self-supervised manner by learning to separate the sound source of a video from the audio mixture of multiple videos, conditioned on the visual input associated with it

[Zhao, Hang, et al. “The sound of pixels.”, ECCV 2018]

46
Translation 3: Learning-by-asking (LBA)

• An agent interactively learns by asking questions to an oracle
• Standard VQA training has a fixed dataset of questions
• In LBA the agent has the potential to learn more quickly by asking “good” questions (like a bright student in a class)
[Misra et al. "Learning by Asking Questions", CVPR 2018]
Translation 3: Learning-by-asking (LBA)

Training:
• Given the input image, the model decides what questions to ask
• Answers are obtained from a human-supervised oracle

Testing:
• LBA is evaluated exactly like VQA
[Misra et al. "Learning by Asking Questions", CVPR 2018]
Translation 4: Navigation

▪ Goal prediction
  ▪ Highlight the goal location by generating a probability distribution over the environment's panoramic image
▪ Interpretability
  ▪ Explicit goal-prediction modeling makes the approach more interpretable

[Misra et al. Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction, EMNLP 2018]
Translation 4: Navigation

▪ The paper proposes to decompose instruction execution into: 1) goal prediction and 2) action generation

[Misra et al. Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction, EMNLP 2018]
Translation 5: Explanations for VQA and ACT

Pointing and Justification Architecture

▪ Answering model: predicts an answer given the image and the question
▪ Multimodal explanation model: generates visual and textual explanations given the answer, question, and image

[Park et al. "Multimodal Explanations: Justifying Decisions and Pointing to the Evidence", CVPR 2018]
Translation 5: Explanations for VQA and ACT
[Figure: VQA-X and ACT-X examples]

[Park et al. "Multimodal Explanations: Justifying Decisions and Pointing to the Evidence", CVPR 2018]
New Directions:
Co-Learning
53
Co-learning 1: Regularizing with Skeleton Seqs

▪ Better unimodal representation by regularizing using a different modality

Non-parallel data!

[B. Mahasseni and S. Todorovic, “Regularizing Long Short Term Memory with 3D Human-Skeleton Sequences for Action Recognition,” in CVPR, 2016]
Co-Learning 2: Multimodal Cyclic Translation

[Figure: an encoder-decoder translates the verbal modality (spoken language, e.g. "Today was a great day!") into the visual modality; a cyclic co-learning loss shapes the joint representation, which is used to predict sentiment]

[Paul Pu Liang*, Hai Pham*, et al., “Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities”, AAAI 2019]
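A heavily simplified, linear sketch of a cyclic-translation objective in the spirit of this paper: translate the verbal modality into the visual one, translate back, and predict sentiment from the intermediate representation. The real model uses sequence-to-sequence networks; every weight and feature below is a random placeholder:

import numpy as np

rng = np.random.RandomState(0)
d_verbal, d_visual, d_rep = 300, 35, 64

# Toy linear stand-ins for the translation networks (random, untrained weights).
W_enc  = rng.randn(d_verbal, d_rep) * 0.05     # verbal -> joint representation
W_dec  = rng.randn(d_rep, d_visual) * 0.05     # representation -> visual (forward translation)
W_back = rng.randn(d_visual, d_verbal) * 0.05  # visual -> verbal (cyclic reconstruction)
W_pred = rng.randn(d_rep, 1) * 0.05            # representation -> sentiment score

x_verbal = rng.randn(d_verbal)   # e.g. features of "Today was a great day!"
x_visual = rng.randn(d_visual)   # paired visual features (needed at training time only)
y_sent   = np.array([0.7])       # toy sentiment label

z          = x_verbal @ W_enc    # intermediate joint representation
visual_hat = z @ W_dec           # verbal -> visual translation
verbal_hat = visual_hat @ W_back # visual -> verbal, closing the cycle

loss = (np.mean((visual_hat - x_visual) ** 2)     # translation loss
        + np.mean((verbal_hat - x_verbal) ** 2)   # cyclic consistency loss
        + np.mean((z @ W_pred - y_sent) ** 2))    # sentiment prediction loss
print(loss)   # at test time only the verbal input is needed: (x_verbal @ W_enc) @ W_pred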
Co-learning 3: Taskonomy

Zamir, Amir R., et al. "Taskonomy: Disentangling Task Transfer Learning." Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
Co-learning 4: Associative Multichannel Autoencoder

▪ Learning representations through fusion and translation
▪ Uses associated word prediction to address data sparsity

[Wang et al. Associative Multichannel Autoencoder for Multimodal Word Representation, 2018]
Co-learning 5: Grounding Semantics in Olfactory Perception

▪ Grounding language in vision, sound, and smell

[Kiela et al., Grounding Semantics in Olfactory Perception, ACL-IJCNLP 2015]
Multimodal machine
learning recap
59
[Figure: overview of multimodal machine learning – language, visual, and acoustic input modalities, each a sequence t_1 … t_n, feeding a prediction such as the caption "Big dog on the beach"]

60
Taxonomy of Multimodal Research [ https://arxiv.org/abs/1705.09406 ]

Representation
  ▪ Joint
    o Neural networks
    o Graphical models
    o Sequential
  ▪ Coordinated
    o Similarity
    o Structured

Translation
  ▪ Example-based
    o Retrieval
    o Combination
  ▪ Model-based
    o Grammar-based
    o Encoder-decoder
    o Online prediction

Alignment
  ▪ Explicit
    o Unsupervised
    o Supervised
  ▪ Implicit
    o Graphical models
    o Neural networks

Fusion
  ▪ Model agnostic
    o Early fusion
    o Late fusion
    o Hybrid fusion
  ▪ Model-based
    o Kernel-based
    o Graphical models
    o Neural networks

Co-learning
  ▪ Parallel data
    o Co-training
    o Transfer learning
  ▪ Non-parallel data
    o Zero-shot learning
    o Concept grounding
    o Transfer learning
  ▪ Hybrid data
    o Bridging
Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, Multimodal Machine Learning: A Survey and Taxonomy
Core Challenge 1: Representation

Definition: Learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of multiple modalities.

A Joint representations: [Figure: a single representation built on top of Modality 1 and Modality 2]
B Coordinated representations: [Figure: Repres. 1 over Modality 1 and Repres. 2 over Modality 2, coordinated with each other]

62
Core Challenge 2: Alignment

Definition: Identify the direct relations between (sub)elements from two or more different modalities.

A Explicit alignment
  The goal is to directly find correspondences between elements of different modalities

B Implicit alignment
  Uses internally latent alignment of modalities in order to better solve a different problem

[Figure: elements t_1 … t_n of Modality 1 matched by a "fancy algorithm" to elements t_1 … t_n of Modality 2]

63
Core Challenge 3: Fusion

Definition: To join information from two or more modalities to perform a prediction task.

[Figure: a prediction produced by a "fancy algorithm" fusing Modality 1, Modality 2, and Modality 3]

64
Core Challenge 4: Translation

Definition: Process of changing data from one modality to another, where the
translation relationship can often be open-ended or subjective.

A Example-based B Model-driven

65
Challenge 5 – Co-learning

▪ How can one modality help learning in another modality?
  ▪ One modality may have more resources
  ▪ Bootstrapping or domain adaptation
  ▪ Zero-shot learning
▪ How to alternate between modalities during learning?
  ▪ Co-training (term introduced by Avrim Blum and Tom Mitchell from CMU)
  ▪ Transfer learning

[Figure: in each setting, Modality 2 helps Modality 1 during training before a prediction is made]
