Polysemous Visual-Semantic Embedding
for Cross-Modal Retrieval
Ruijie Quan 2019/07/13
MOTIVATION
Most current methods learn injective embedding functions that map an instance to a single point in the shared space.
Drawback:
They cannot effectively handle polysemous instances, even though individual instances and their cross-modal associations are often ambiguous in real-world scenarios.
CONTRIBUTIONS
 Contributions:
1. Introduce Polysemous Instance Embedding Networks (PIE-Nets) that compute
multiple and diverse representations of an instance by combining global context
with locally-guided features via multi-head self-attention and residual learning.
2. Tackle a more challenging case of video-text retrieval.
3. Introduce MRW (My Reaction When), a new dataset of 50K video-sentence pairs collected from social media.
INTRODUCTION
Injective embedding can be problematic:
1. Injective embedding can suffer when there is ambiguity in individual instances, e.g., polysemous words and images containing multiple objects.
2. Partial cross-modal association: a text sentence may describe only certain regions of an image, and a video may contain extra frames not described by its associated sentence.
INTRODUCTION
Address the above issues by:
1. Formulating instance embedding as a one-to-many mapping task.
2. Optimizing the mapping functions to be robust to ambiguous instances and partial cross-modal associations.
APPROACH
To address the issue of ambiguous instances
Propose a novel one-to-many instance embedding model, the Polysemous Instance Embedding Network (PIE-Net), which:
 extracts K embeddings of each instance by combining global and local information of its input;
 obtains K locally-guided representations by attending to different parts of an input instance (e.g., regions, frames, words) using a multi-head self-attention module;
 combines each locally-guided representation with the global representation via residual learning to avoid learning redundant information;
 regularizes the K locally-guided representations to be diverse (see the sketch after this list).
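The sketch below is a minimal PyTorch illustration of such a one-to-many embedding head. The structured K-head self-attention (two-layer score network with tanh), the residual fusion, the l2 normalization, and all names (PIENet, feat_dim, embed_dim, num_embeds, attn_hidden) are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal PIE-Net-style sketch (assumed attention form and dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PIENet(nn.Module):
    def __init__(self, feat_dim, embed_dim, num_embeds=4, attn_hidden=512):
        super().__init__()
        self.K = num_embeds
        # K attention heads over local features (regions / frames / words)
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, attn_hidden),
            nn.Tanh(),
            nn.Linear(attn_hidden, num_embeds),
        )
        self.fc_local = nn.Linear(feat_dim, embed_dim)

    def forward(self, global_feat, local_feats):
        # global_feat: (B, embed_dim); local_feats: (B, L, feat_dim)
        scores = self.attn(local_feats)                              # (B, L, K)
        alpha = F.softmax(scores, dim=1)                             # attend over the L positions
        attended = torch.einsum('blk,bld->bkd', alpha, local_feats)  # K locally-guided features
        local_emb = self.fc_local(attended)                          # (B, K, embed_dim)
        fused = global_feat.unsqueeze(1) + local_emb                 # residual fusion with global
        return F.normalize(fused, dim=-1), alpha                     # alpha reused by the diversity loss
```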
APPROACH
To address the partial association issue
Polysemous Visual-Semantic Embedding (PVSE): tie up two PIE-Nets (one per modality) and train the model in the multiple-instance learning (MIL) framework.
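As an illustration of MIL-style matching between the two PIE-Net outputs, the sketch below scores every pair among the K visual and K textual embeddings and keeps the best match; the function name and tensor shapes are assumptions.

```python
# Hedged sketch: cross-modal similarity as the best of the K x K embedding pairs.
import torch

def mil_similarity(v_embeds, t_embeds):
    # v_embeds: (B, K, D) video/image embeddings, t_embeds: (B, K, D) sentence embeddings,
    # both l2-normalized, so the dot product is a cosine similarity.
    sims = torch.einsum('ikd,jld->ijkl', v_embeds, t_embeds)  # (B, B, K, K) all pairwise scores
    return sims.flatten(2).max(dim=-1).values                 # (B, B): best pair per (i, j)
```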
APPROACH
1. Modality-Specific Feature Encoder

Image Encoder (ResNet-152):
 local features: the feature map before the final average pooling layer
 global features: apply average pooling and feed the output to one fully-connected layer

Video Encoder (ResNet-152):
 local features: the 2048-dim outputs from the final average pooling layer
 global features: feed Ψ(x) into a bidirectional GRU (bi-GRU) with H hidden units and take the final hidden states

Sentence Encoder (GloVe pretrained on the CommonCrawl dataset):
 local features: the L 300-dim word vectors
 global features: feed them into a bi-GRU with H hidden units and take the final hidden states
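For concreteness, here is an illustrative PyTorch sketch of the image branch only, following the description above: the local features are the feature map before the last average pooling, and the global feature is the pooled vector passed through one fully-connected layer. The torchvision weights argument, embed_dim, and the class name are assumptions.

```python
# Illustrative image-branch encoder (assumed names and embedding size).
import torch.nn as nn
import torchvision.models as models

class ImageEncoder(nn.Module):
    def __init__(self, embed_dim=1024):
        super().__init__()
        cnn = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])  # drop avgpool and fc
        self.fc_global = nn.Linear(2048, embed_dim)

    def forward(self, images):
        fmap = self.backbone(images)                         # (B, 2048, 7, 7): local features
        local = fmap.flatten(2).transpose(1, 2)               # (B, 49, 2048) for the attention module
        global_feat = self.fc_global(fmap.mean(dim=(2, 3)))   # average pool + one fc layer
        return global_feat, local
```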
APPROACH
2. Local Feature Transformer (multi-head self-attention)
3. Feature Fusion with Residual Learning
APPROACH
4. Optimization and Inference
Training combines three objectives: an MIL loss, a diversity loss, and a domain discrepancy loss (see the sketch below).
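The slide presents the loss formulas as figures; the sketch below shows one common way to instantiate each term and is an assumption rather than the paper's exact equations: a hinge-based triplet loss scoring only the best of the K x K pairs (via mil_similarity above), a penalty that decorrelates the K attention maps, and a simple moment-matching discrepancy between the two embedding distributions. The margin and function names are illustrative.

```python
# Hedged sketches of the three objectives (illustrative forms, not the paper's exact equations).
import torch

def mil_triplet_loss(sim, margin=0.1):
    # sim: (B, B) similarity matrix from mil_similarity(); the diagonal holds positive pairs.
    pos = sim.diag().unsqueeze(1)                                        # (B, 1)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_s = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)      # sentence-retrieval hinge
    cost_v = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)  # video/image-retrieval hinge
    # hardest negative per query, as in max-of-hinges triplet losses
    return cost_s.max(dim=1).values.mean() + cost_v.max(dim=0).values.mean()

def diversity_loss(alpha):
    # alpha: (B, L, K) attention maps; penalize overlap between the K heads.
    gram = torch.einsum('blk,blm->bkm', alpha, alpha)                    # (B, K, K)
    eye = torch.eye(alpha.size(-1), device=alpha.device)
    return ((gram - eye) ** 2).sum(dim=(1, 2)).mean()

def domain_discrepancy_loss(v_embeds, t_embeds):
    # crude moment matching between the visual and textual embedding distributions
    return ((v_embeds.mean(dim=(0, 1)) - t_embeds.mean(dim=(0, 1))) ** 2).sum()
```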
MRW DATASET
EXPERIMENTS
The number of embeddings K:
a significant improvement from K = 0 to K = 1, which shows the effectiveness of the Local Feature Transformer.
Global vs. locally-guided features:
simply concatenating the two features (no residual learning) hurts performance, which shows the importance of balancing global and local information in the final embedding.
EXPERIMENTS
Sensitivity analysis on different loss weights:
The results show that both loss terms are important. Overall, the model is not very sensitive to the two relative weight terms.
Thank you for your attention.
