1) The document proposes the Polysemous Instance Embedding Network (PIE-Net), which computes multiple representations per instance to handle ambiguous inputs such as polysemous words and images containing multiple objects.
2) PIE-Net extracts K embeddings for each instance: a multi-head self-attention module computes K attention-weighted summaries of the local features, and each summary is fused with the instance's global feature through residual learning (a sketch follows below).
3) The approach ties two PIE-Nets, one per modality, in a multiple-instance learning framework called Polysemous Visual-Semantic Embedding (PVSE), which addresses partial associations between images/videos and text by requiring only the best-matching pair among the K x K candidate embedding combinations to agree.