Lecture 12.1: Multimodal Fusion
Lecture Objectives
Multi-view Latent Variable Discriminative Models
Modality-private structure
• Internal grouping of observations
Modality-shared structure
• Interaction and synchrony
(Figure: graphical model predicting a sentiment label $y$ from acoustic observations $x^A_1,\dots,x^A_5$ and visual observations $x^V_1,\dots,x^V_5$ through latent chains $h^A_1,\dots,h^A_5$ and $h^V_1,\dots,h^V_5$)
$p(y \mid \mathbf{x}^A, \mathbf{x}^V; \boldsymbol{\theta}) = \sum_{\mathbf{h}^A, \mathbf{h}^V} p(y, \mathbf{h}^A, \mathbf{h}^V \mid \mathbf{x}^A, \mathbf{x}^V; \boldsymbol{\theta})$
➢ Approximate inference using loopy belief propagation
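A toy numerical illustration of this marginalization (an illustrative example, not the model from the slide): enumerate the discrete latent states and sum the joint conditional probabilities; the actual model performs this inference approximately with loopy belief propagation.

```python
import itertools
import numpy as np

# Toy illustration only: assume we are handed a joint conditional table
# p(y, h_A, h_V | x_A, x_V) with 3 labels, 4 acoustic and 4 visual latent states.
rng = np.random.default_rng(0)
joint = rng.random((3, 4, 4))
joint /= joint.sum()                      # normalize to a valid distribution

# p(y | x_A, x_V) = sum over h_A, h_V of p(y, h_A, h_V | x_A, x_V)
p_y = np.zeros(3)
for h_a, h_v in itertools.product(range(4), range(4)):
    p_y += joint[:, h_a, h_v]

print(p_y, p_y.sum())                     # marginal over labels, sums to 1
```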
Multimodal Fusion:
Multiple Kernel Learning
What is a Kernel function?
▪ Recall that CCA uses only inner products between data points in its definitions, which means we can again apply the kernel trick
▪ Instead of providing a single kernel and validating which one works best, optimize over a family of kernels (or different families for different modalities); a code sketch follows the multimodal/multiview slide below
▪ Works well for both unimodal and multimodal data; very little adaptation is needed
[Lanckriet 2004]
MKL in Unimodal Case
MKL in Multimodal/Multiview Case
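A minimal sketch of the multimodal MKL idea, assuming hypothetical per-modality feature matrices X_audio and X_video: build one kernel per modality and feed a weighted combination to a kernel SVM. Proper MKL (e.g., [Lanckriet 2004]) learns the kernel weights jointly with the classifier; here they are fixed for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, linear_kernel

# Hypothetical per-modality features for the same n samples.
rng = np.random.default_rng(0)
n = 200
X_audio = rng.normal(size=(n, 40))        # e.g., acoustic descriptors
X_video = rng.normal(size=(n, 128))       # e.g., visual descriptors
y = rng.integers(0, 2, size=n)

# One kernel per modality (different kernel families are allowed per modality).
K_audio = rbf_kernel(X_audio, gamma=0.05)
K_video = linear_kernel(X_video)

# Fixed convex combination for illustration; real MKL optimizes these weights.
beta = np.array([0.6, 0.4])
K_combined = beta[0] * K_audio + beta[1] * K_video

clf = SVC(kernel="precomputed").fit(K_combined, y)
print("train accuracy:", clf.score(K_combined, y))
```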
New Directions:
Representation
Representation 1: Hash Function Learning
[Cao et al. Deep visual-semantic hashing for cross-modal retrieval, KDD 2016]
Representation 2: Order-Embeddings
[Vendrov et al., “Order-Embeddings of Images and Language”, ICLR 2016]
Representation 4: Multimodal VAE (MVAE)
[Wu and Goodman, “Multimodal Generative Models for Scalable Weakly-Supervised Learning”, NIPS 2018]
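The MVAE combines unimodal Gaussian posteriors with a product of experts, which for Gaussians reduces to a precision-weighted average. A minimal sketch of that combination step (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def product_of_gaussian_experts(mus, logvars):
    """Combine Gaussian experts N(mu_i, var_i) by multiplying their densities.

    The result is Gaussian with precision equal to the sum of the experts'
    precisions and mean equal to the precision-weighted average of their means.
    """
    precisions = [np.exp(-lv) for lv in logvars]      # 1 / var_i
    total_precision = sum(precisions)
    var = 1.0 / total_precision
    mu = var * sum(p * m for p, m in zip(precisions, mus))
    return mu, np.log(var)

# Example: a prior expert N(0, 1) plus two unimodal inference networks' outputs.
mu, logvar = product_of_gaussian_experts(
    mus=[np.zeros(16), np.full(16, 0.5), np.full(16, -0.2)],
    logvars=[np.zeros(16), np.full(16, -1.0), np.full(16, -0.5)],
)
print(mu[:3], np.exp(logvar)[:3])
```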
Representation 5: Multilingual Representations
New Directions:
Alignment
Alignment 4: Textual Grounding
Word priors
[Yeh et al., “Interpretable and Globally Optimal Prediction for Textual Grounding Using Image Concepts”, NIPS 2017]
Alignment 4: Textual Grounding
Word-word relationship
$\cos(w_s, w_{s'})$, where $w_s = [w_{s,1}; w_{s,2}; \ldots; w_{s,|C|}]$
[Yeh et al., “Interpretable and Globally Optimal Prediction for Textual Grounding Using Image Concepts”, NIPS 2017]
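A small sketch of the word-word relationship term: each word s is represented by the vector of its weights over the |C| image concepts, and relatedness between words s and s' is their cosine similarity (the vectors below are toy values, not learned priors).

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two concept-weight vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy per-concept weight vectors for two words over |C| = 5 image concepts.
w_s  = np.array([0.7, 0.1, 0.0, 0.2, 0.0])   # e.g., "dog"
w_s2 = np.array([0.6, 0.2, 0.1, 0.1, 0.0])   # e.g., "puppy"
print(cosine(w_s, w_s2))                      # close to 1 for related words
```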
Alignment 5: Comprehensive Image Captions
▪ Strike a balance between details (visually driven) and coverage of objects (text/topic driven)
[Liu et al., “simNet: Stepwise Image-Topic Merging Network for Generating Detailed and Comprehensive Image Captions”, 2018]
New Directions:
Fusion
Fusion 1a: Multi-Head Attention for AVSR
Multi-head Attention
[Afouras et al., “Deep Audio-Visual Speech Recognition”, arXiv:1809.02108, 2018]
Fusion 1b: Fusion with Multiple Attentions
(Figure: separate Language, Vision, and Acoustic LSTMs whose outputs are combined with multiple attention modules)
[Zadeh et al., Human Communication Decoder Network for Human Communication Comprehension, AAAI 2018]
Fusion 2: Memory-Based Fusion
[Zadeh et al., Memory Fusion Network for Multi-view Sequential Learning, AAAI 2018]
Fusion 3: Relational Questions
(Figure: input image and referring expression “giraffe on right” for referring image segmentation)
[Liu et al. Recurrent Multimodal Interaction for Referring Image Segmentation, 2017]
New Directions:
Translation
Translation 1: Visually indicated sounds
▪ Sound generation!
Translation 2: The Sound of Pixels
Translation 3: Learning-by-asking (LBA)
Training:
• Given an input image, the model decides what questions to ask
• Answers are obtained from a human-supervised oracle
Testing:
• LBA is evaluated exactly like VQA
[Misra et al. "Learning by Asking Questions", CVPR 2018]
Translation 4: Navigation
▪ Goal prediction
▪ Highlight the goal location by generating a probability distribution over the environment's panoramic image (a code sketch follows below)
▪ Interpretability
▪ Explicit goal prediction modeling makes the approach more interpretable
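A minimal sketch of the goal-prediction idea (architecture details are assumptions, not the paper's model): score every location of the panoramic feature map and apply a softmax over all spatial positions, yielding an interpretable probability distribution over where the goal is.

```python
import torch
import torch.nn as nn

class GoalPredictor(nn.Module):
    """Predict a probability distribution over panorama locations."""

    def __init__(self, in_channels=64):
        super().__init__()
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)  # per-location score

    def forward(self, feat_map):
        # feat_map: (batch, C, H, W) features of the panoramic image
        b, _, h, w = feat_map.shape
        logits = self.score(feat_map).view(b, h * w)
        probs = torch.softmax(logits, dim=-1)        # sums to 1 over all locations
        return probs.view(b, h, w)

pred = GoalPredictor()
heatmap = pred(torch.randn(2, 64, 16, 96))
print(heatmap.shape, heatmap.sum(dim=(1, 2)))        # each map sums to ~1
```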
(Figure: encoder-decoder cyclic translation between modalities for sentiment prediction)
[Paul Pu Liang*, Hai Pham*, et al., “Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities”, AAAI 2019]
New Directions:
Co-learning
Co-learning 3: Taskonomy
[Zamir et al., “Taskonomy: Disentangling Task Transfer Learning”, CVPR 2018]
Co-learning 4: Associative Multichannel Autoencoder
(Figure: associative multichannel autoencoder over language, visual, and acoustic input modalities)
Taxonomy of Multimodal Research [https://arxiv.org/abs/1705.09406]
Core Challenge 2: Alignment
(Figure: alignment between elements of two sequences; panel B: implicit alignment)
Core Challenge 3: Fusion
(Figure: multiple modalities combined by a “fancy algorithm” to produce a prediction)
Core Challenge 4: Translation
Definition: Process of changing data from one modality to another, where the
translation relationship can often be open-ended or subjective.
(Figure: A. Example-based translation, B. Model-driven translation)
Core Challenge 5: Co-learning