1) The document proposes the Polysemous Instance Embedding Network (PIE-Net), which computes multiple representations per instance to handle ambiguous inputs such as polysemous words and images containing multiple objects.
2) PIE-Net extracts K embeddings for each instance: a multi-head self-attention module computes K attention-weighted summaries of the local features, and each summary is fused with the instance's global feature through residual learning (a sketch follows below).
3) The approach ties two PIE-Nets, one per modality, in a multiple-instance learning framework called Polysemous Visual-Semantic Embedding (PVSE), which addresses partial associations between images/videos and text by requiring only the best-matching pair among the K x K candidate embedding combinations to agree.