
Which Emoji Talks Best for My Picture?

Anurag Illendula, Department of Mathematics, IIT Kharagpur, Kharagpur, India ([email protected])
KV Manohar, Department of Mathematics, IIT Kharagpur, Kharagpur, India ([email protected])
Manish Reddy Yedulla, Department of Engineering Science, IIT Hyderabad, Hyderabad, India ([email protected])

arXiv:1808.08891v1 [cs.CV] 27 Aug 2018

Abstract—Emojis have evolved as a complementary way of expressing emotion on social media platforms, where posts are mostly composed of text and images. To increase the expressiveness of their posts, users associate relevant emojis with them. Incorporating domain knowledge has improved machine understanding of text. In this paper, we investigate whether domain knowledge for emojis can improve the accuracy of the emoji recommendation task for multimedia posts composed of an image and text. Our emoji recommendation system suggests accurate emojis by exploiting both the visual and the textual content of social media posts, as well as domain knowledge from EmojiNet. Experimental results using pre-trained image classifiers and pre-trained word embedding models on a Twitter dataset show that our results outperform the current state of the art by 9.6%. We also present a user study evaluation of our recommendation system on a set of images chosen from the MSCOCO dataset.

Index Terms—Emoji Understanding, Image Classification, Emoji Recommendation, Domain Knowledge

[Fig. 1. Usage of the most frequently occurring emojis on Instagram. Statistics extracted from https://www.quintly.com/blog/2017/01/instagram-emoji-study-higher-interactions]

I. INTRODUCTION
The word emoji originated in the Japanese language, from "e" (meaning picture) and "moji" (meaning character). Emojis are considered to be the 21st-century transformation of emoticons. They were initially developed in 1999 as a 12x12 pixel grid by Shigetaka Kurita, as part of a Japanese team working on an early version of a mobile internet platform; present-day emoji images, however, can be scaled to unlimited resolution with the help of vector graphics1. The use of emojis in social media has seen a rapid increase over the last few years, as they have become a way to add tone and non-verbal context to daily communication. According to the latest statistics released by Twitter, tweets with image links get 2x the engagement rate of those without2. The study of emoji usage on social media platforms has become an exciting research field: Instagram reported that emoji usage on its platform increased by 10% in a single month after the release of the emoji keyboard on iOS devices in 2011, and that more than 50% of all captions and comments include an emoji or two3. Considering the extensive usage of images on its platform, Instagram has also started analyzing the profiles of users who use emojis, and reported that 53% of users use emojis in their posts4. Figure 1 shows the usage of different emojis on Instagram. Another important factor driving emoji usage in social media is that, images being more expressive than text, a message carrying a single small emoji can be more effective than a purely textual one. According to the latest statistics released by Emojipedia, there are 2666 emojis, which are further divided into different subcategories. In 2016, most major search engines, namely Google and Bing, supported emoji search5, and in 2017 Twitter also enabled users to search for tweets using an emoji as a keyword6.

Because of limited linguistic coverage, manually annotated patterns were earlier used as external knowledge concepts to enhance NLP systems. With the advent of advanced knowledge base construction, large amounts of semantic and syntactic information have become available, helping researchers improve the performance of many NLP systems, notably word embedding models [6] and other prediction and classification tasks [48], [49]. In recent years, several noteworthy large, cross-domain knowledge graphs have been developed.

1 https://cnn.it/2MitfM2
2 https://bit.ly/1u31GbO
3 https://engt.co/2JFJlxz
4 https://bit.ly/2EnbxSE
5 https://selnd.com/2t4vjyk
6 https://bit.ly/2sUVWGy
Researchers [42] have also worked on incrementally populating knowledge graphs from unstructured data, which encompasses the problems of extraction, cleaning, and integration. This research on advanced knowledge graphs has enabled many researchers [6], [41], [49] to leverage external domain knowledge in Natural Language Processing (NLP) systems to improve machine understanding. External knowledge has also been effective in improving the accuracy of emoji understanding tasks, including but not limited to emoji similarity [47] and emoji sense disambiguation [46]. In this paper, we investigate whether external knowledge from EmojiNet can enhance the performance of the emoji recommendation task in the context of images.

[Fig. 2. A seashore image with the user description "To Some it's just water. To me, it's where I regain my Sanity…!!". The emoji in the description is used to symbolize a "water wave" at the sea.]

Image classification is one of the fundamental challenges in the field of computer vision, and there has been significant progress in the field with the emergence of Convolutional Neural Networks (CNNs) [26]. Deep convolutional neural networks have led to a series of breakthroughs in various image processing tasks, including but not limited to image classification [26], object detection [15], and semantic segmentation [40]. Deep CNNs integrate low- and high-level features and classify in an end-to-end, multi-layer fashion, and the "levels" of the features can be enriched by increasing the number of stacked layers [50]. All current state-of-the-art techniques for computer vision and natural language processing rely heavily on labelled data. The current state of the art in image classification includes Deep Residual Networks [19], which add shortcut connections between the stacked layers of a deep CNN and learn residual representations. As our emoji recommendation task requires us to classify the image effectively, we use deep residual networks, the state-of-the-art image classifier, to achieve better results for emoji recommendation.

With the rapid growth of emojis, they are used not only with text but also in the context of images, to provide additional contextual clues about what is depicted in an image. Consider the image shown in Figure 2. The user who posted it tends to use an emoji which relates to one or more of the entities that can be used to describe a seashore. In this example we see the emoji "water wave", which is used to symbolize a water wave at sea. We hypothesize that having access to the images posted on social media can help recommend an emoji that can be used in the description of the image.

In this paper, we present an approach which combines visual concepts, user descriptions, and external knowledge concepts from EmojiNet (the only machine-readable inventory which enables computers to understand emojis) to predict an emoji in the context of an image. We evaluate our approach on a Twitter dataset crawled using the Twitter API, and also on a set of manually annotated images from the MSCOCO dataset. We plan to release the complete dataset upon the acceptance of the paper to help other researchers.

Most current natural language processing and computer vision tasks involving text or image processing rely on manually annotated data. To create a gold-standard dataset for evaluating our approach, we asked human annotators who are knowledgeable about the usage of emojis to select, from the complete set of emojis, the emoji they would use in the context of the image and its description. We label each image with the emoji selected the most times by the annotators. Section 3 further explains the creation of our evaluation datasets. We also compare our accuracy with previous state-of-the-art approaches for emoji prediction in the context of images; experimental results show that our model outperforms the previous state-of-the-art Image2Emoji models developed by Cappallo et al. and Barbieri et al. [3], [10].

In the rest of this paper, we first discuss the related work done by other researchers in Section 2. In Section 3, we discuss the creation of the evaluation datasets, and in Section 4 we discuss our model architecture. In Section 5 we conduct extensive experiments to evaluate our approach. Finally, we discuss the observed results and conclude in Sections 6 and 7, respectively. The source code and annotated dataset will be made available upon the acceptance of the paper to help other researchers.

II. RELATED WORK

Content-Based Information Retrieval (CBIR) is the historic line of research in multimedia. This task usually deals with retrieving the images in a dataset that are most similar to a query image. Bag-of-words representations [51] have seen a sustained line of research in this task and have been effective up to million-sized image datasets. This representation first describes an image with a set of local descriptors, such as the Scale Invariant Feature Transform (SIFT) [31], and then aggregates these descriptors into a single vector that collects the overall statistics of the so-called "visual words". Recently, another step towards CBIR has been achieved by VLAD [28]. This information retrieval task eventually led to research on image classification, one of the fundamental challenges in computer vision.
Convolutional Neural Networks (CNNs) [26] have shown promising results in image classification. Research on shortcut connections has been an emerging topic since the development of multi-layer perceptrons, which showed promising results for image processing tasks; generally, multiple layers have been connected through shortcut connections using gated functions [21]. In image classification, the depth of the network, i.e., the number of layers within it, is of crucial importance, as noted by Simonyan et al. [43]. Increasing the depth can, however, have an adverse effect on the image classification task: most notably, the vanishing/exploding gradients problem [16] and the degradation problem [19]. These problems are overcome by the shortcut connections and residual representations introduced in Residual Networks [20], which won 1st place in the ImageNet 2015 image classification competition7. Residual Networks and their extensions, which consist of many residual units, have been shown to achieve state-of-the-art accuracy for image classification on datasets such as ImageNet [38].

Prior work on emoji prediction in text analytics has been done by Barbieri et al. [3], [4], and emoji prediction for images has been done by Cappallo et al. [10], [11]. Barbieri et al. [4] built models for emoji prediction in text messages, especially tweets, using state-of-the-art NLP techniques, and also studied emoji prediction for images, where they combined both visual and textual features [3]. Their results proved that visual features can help a model predict emojis accurately on multimedia datasets. Cappallo et al. built an emoji recommendation system in the context of an image, using emoji names as external knowledge concepts for emoji prediction. This recommendation system relies on a state-of-the-art image classification model to classify images and on a word embedding model to represent words in a low-dimensional vector space.

The idea of the Semantic Web is that of publishing and querying knowledge on the Web in a semantically structured approach; it was introduced to the wider audience by Berners-Lee [5]. According to his vision, the web of documents must be extended so that the relationships between entities can be represented. This vision has led to the development of many structured knowledge bases in which different entities are linked by relationships. WordNet [35], Freebase [8], and YAGO [44] are some of the manually constructed knowledge bases that deal with textual knowledge. Berners-Lee's vision eventually also led to semi-automatically or automatically constructed knowledge graphs like DBpedia [2] and NELL [36]. Several researchers have attempted to embed these symbolic representations into continuous space, which helps in statistical learning approaches [9]. The extensive use of emojis in social media was identified early, and this led to the development of the first emoji sense inventory, EmojiNet, by Wijeratne et al. [46].

The recent past has seen a rapid increase in the number of researchers using external domain knowledge to improve the accuracy of many NLP and image processing tasks [6], [41], [49], the reason being that external knowledge helps the machine understand the topics at hand, which further aids machine understanding. EmojiNet, the most extensive emoji sense inventory, developed by Wijeratne et al. [46], made vast amounts of linguistic knowledge available, ranging from emoji sense labels to emoji sense definitions (textual descriptions which explain the contexts of use of different emojis). Recent research has shown that EmojiNet improved the accuracy of the emoji similarity [47] and emoji sense disambiguation [46] tasks. In this paper, we leverage external knowledge concepts from EmojiNet to enhance the accuracy of the emoji prediction task in the context of images.

Embeddings capture the semantics of a word and the syntactic information about its usage in different contexts. Many researchers have worked on building word embedding models to represent words in a low-dimensional vector space; word2vec [34] and GloVe [37] have been the most popular word embedding models, but the FastText word embedding model [7] has been even more effective in social NLP systems, as it can learn sub-word information. Many natural language processing tasks rely on learning word representations in a finite-dimensional vector space. Barbieri et al. [4] and Eisner et al. [12] have done prior work on learning emoji representations using the traditional approaches (CBOW and skip-gram models). Recent research showed that semantic embeddings are more efficient than embeddings learned using the traditional approaches, as they inherit semantic and syntactic knowledge, and semantic embeddings have shown great success in similarity and analogical reasoning tasks [6]. Wijeratne et al. [47] have worked on learning semantic representations of emojis using knowledge concepts from EmojiNet, and these embeddings have improved the results of emoji similarity and other natural language processing tasks [27]. Recent research by Seyednezhad et al. [39] and Fede et al. [13] has shown that emoji co-occurrence is one of the important features for understanding the contexts in which multiple emojis are used. Illendula et al. [22] have worked on learning emoji representations using an emoji co-occurrence network graph and a state-of-the-art network embedding model; these embeddings outperformed the previous state-of-the-art accuracies for the sentiment analysis task.

In this paper, we present an emoji recommendation system which effectively recommends an emoji in the context of an image. We use the ResNet [19] image classifier, the state-of-the-art image classifier; word representations trained on a corpus of MSCOCO image descriptions [30]; Google News pre-trained word embeddings; and a FastText [23] model, which can learn sub-word information. We use the bag-of-words model developed by Wijeratne et al. [47] to learn the emoji embeddings that serve as external knowledge concepts. We report the results observed when considering different emoji knowledge concepts from EmojiNet, namely emoji names, emoji senses, and emoji sense definitions, and two different word embedding models, in Section 5. We also present the results obtained by our user study on the MSCOCO dataset in Section 5.

7 https://bit.ly/2y4J8Cz
III. DATASET CREATION

A. Twitter Dataset

We extracted tweets using the Twitter streaming API, geo-localized in the United States of America, considering each emoji from the list of 2389 emojis in EmojiNet8 as a search keyword, one at a time. We then filtered the tweets, keeping only those embedded with an image, and further filtered the dataset to tweets with exactly one emoji embedded in them, since our model cannot learn the contexts of use of multiple emojis at the same time. During the filtration process, we also ensured that each tweet has a textual description. We extracted 27136 tweets covering 1079 distinct emojis; the distribution of the number of tweets in the dataset is shown in Figure 3. We consider the emoji embedded in the description as the label for the tweet, and we use our model to get a set of emoji recommendations in the context of the image and its textual description. We then evaluate our emoji recommendation model and report the results in Section 5.

[Fig. 3. Distribution of the number of images in the Twitter dataset]
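The tweet filter just described (an image present, a textual description present, and exactly one EmojiNet emoji) can be sketched as follows. This is an illustrative reconstruction rather than the authors' released code: the field layout follows the classic Twitter API media entities, and the emoji list file is a hypothetical export of EmojiNet's 2389 emojis.

```python
import re

# Hypothetical export of the 2389 EmojiNet emojis, one per line.
EMOJIS = set(open("emojinet_emojis.txt", encoding="utf-8").read().split())
# Longest-first alternation so multi-codepoint emojis match as a single unit.
EMOJI_RE = re.compile("|".join(re.escape(e) for e in sorted(EMOJIS, key=len, reverse=True)))

def to_example(tweet):
    """Return an (image, description, label) example, or None if the tweet fails the filter."""
    text = tweet.get("text", "")
    media = tweet.get("entities", {}).get("media", [])
    photos = [m for m in media if m.get("type") == "photo"]
    found = EMOJI_RE.findall(text)
    description = EMOJI_RE.sub("", text).strip()
    if not photos or len(found) != 1 or not description:
        return None  # needs an image, exactly one emoji, and residual text
    return {"image_url": photos[0]["media_url"],
            "description": description,
            "label": found[0]}  # the embedded emoji is the ground-truth label

def build_dataset(raw_tweets):
    return [ex for ex in map(to_example, raw_tweets) if ex is not None]
```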
B. User Annotation

We also evaluate our model on a set of 600 images from the MSCOCO 2017 validation dataset9 [29] which belong to different classes10 listed in the ImageNet image classification competition. We ensured that our evaluation dataset does not include multiple images of the same category, as this would lead to biased results; this filtration also allows us to verify the accuracy of our approach on different classes of images. These images in the MSCOCO dataset are each associated with a set of five descriptions which explain the context of the image. We asked three annotators who are knowledgeable about the contexts of use of emojis to annotate each image and its textual description with an emoji from the complete set of 2389 emojis listed in EmojiNet. The annotators are undergraduate students, two from the Indian Institute of Technology Kharagpur and one from the Indian Institute of Technology Hyderabad, aged between 18 and 23 (two male, one female). The annotators were shown an image with the complete set of descriptions and asked to select the emoji they would use to increase their expressiveness. Each image in our evaluation dataset is annotated by all three annotators, and we take the emoji selected the most times by the annotators as the emoji predicted in the context of the image. We use this annotated dataset to evaluate our model for emoji recommendation and report our results in Section 5.

IV. MODEL

A. Pre-Training

We extracted the captions corresponding to each image of the MSCOCO validation dataset and trained a FastText [23] word embedding model on them to learn word representations in a finite-dimensional vector space. We also used the pre-trained Google News11 word embedding model trained using word2vec [17] and a pre-trained FastText model trained on the Wikipedia corpus [33] to evaluate our model. We make use of EmojiNet, which gathers knowledge concepts for 2389 emojis. Specifically, EmojiNet provides a set of words (also called senses), their POS tags, and their sense definitions; it maps 12,904 sense definitions to 2,389 emojis. We learn the emoji representations from these external knowledge concepts using the approach discussed by Wijeratne et al. [47]: the word vectors of all words in an emoji definition are averaged to form a single 300-dimensional vector. As noted by Kenter et al. [25], the vector mean (or average) also adjusts for the word embedding bias that could arise from certain emoji definitions having considerably more words than others. Figure 4 illustrates the embedding model used to learn emoji representations from emoji knowledge concepts. We apply standard pre-processing techniques, which include removing stop words, removing articles, and lemmatizing each word of the emoji sense definitions, to obtain another set of knowledge concepts, referred to as processed emoji sense definitions in later sections of the paper. We learn three types of emoji embeddings, Emoji_Embeddings_Senses, Emoji_Embeddings_Descriptions, and Emoji_Embeddings_Processed_Descriptions, using emoji senses, emoji sense definitions, and processed emoji sense definitions respectively. Our approach is evaluated using each of these emoji embeddings as knowledge concepts.
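The two pre-training steps above, training FastText on MSCOCO captions and turning sense definitions into processed sense definitions, could look roughly as follows. The sketch assumes gensim and NLTK, which the paper does not name; the function names and parameters are illustrative.

```python
from gensim.models import FastText
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def train_caption_fasttext(captions, dim=300):
    """Train a sub-word-aware FastText model on tokenized MSCOCO captions."""
    sentences = [caption.lower().split() for caption in captions]
    return FastText(sentences=sentences, vector_size=dim, window=5, min_count=2, sg=1)

STOP_WORDS = set(stopwords.words("english")) | {"a", "an", "the"}  # stop words and articles
LEMMATIZER = WordNetLemmatizer()

def process_definition(definition):
    """Produce a 'processed emoji sense definition': keep alphabetic tokens,
    drop stop words and articles, and lemmatize what remains."""
    tokens = [t for t in definition.lower().split() if t.isalpha()]
    return [LEMMATIZER.lemmatize(t) for t in tokens if t not in STOP_WORDS]
```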
Emoji_Embeddings_Senses: The emoji sense forms are a list of the different senses an emoji takes in different contexts. EmojiNet lists "love", "face", "beloved", "dear", "adorable", etc. as sense forms for the emoji "face blowing a kiss". The emoji embedding in this case is the vector average of the word embeddings, as described in Figure 4. The equation corresponding to the calculation of the emoji embedding in this case is given below, where \vec{V}_i represents the word embedding of word W_i and n represents the number of distinct sense forms:

    Emoji_Embeddings_Senses = (1/n) \sum_i \vec{V}_i    (1)

8 https://bit.ly/2JDX0F0
9 https://bit.ly/2JHIhZX
10 https://bit.ly/2mYUfDd
11 https://bit.ly/1R9Wsqr
Emoji_Embeddings_Descriptions: The emoji definitions are textual descriptions which explain the context of use of an emoji. EmojiNet lists "Love is a variety of different feelings, states, and attitudes that ranges from interpersonal affection to pleasure." and "An intense feeling of affection and care towards another person." as some of the definitions for the emoji "face blowing a kiss". The equation corresponding to the emoji embedding is given below, where C_i represents the number of occurrences of word W_i and \vec{V}_i is the word embedding of W_i:

    Emoji_Embedding = (\sum_i C_i \vec{V}_i) / (\sum_i C_i)    (2)

[Fig. 4. Generation of emoji embeddings: the word vectors of the words in an emoji's knowledge concepts, weighted by their word counts, are combined by the count-weighted average of Eq. (2) into the final emoji embedding.]
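Both equations reduce to a count-weighted average of word vectors, so a single helper covers the senses case (each distinct sense form appears once) and the definitions case (words weighted by their frequency). A minimal sketch, with word_vectors standing in for whichever pre-trained embedding model is used:

```python
import numpy as np

def bag_of_words_embedding(words, word_vectors, dim=300):
    """Eqs. (1) and (2): average the vectors of the given words; passing each
    word as many times as it occurs (C_i) yields the count-weighted mean."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Sense forms (distinct words) give Eq. (1); definition tokens (with repeats) give Eq. (2).
# emoji_vec = bag_of_words_embedding(["love", "face", "beloved", "dear"], word_vectors)
```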
B. Model Architecture

[Fig. 5. Model architecture: a CNN image classifier produces class probabilities for the input image, which are combined with a bag-of-words semantic embedding of the textual description; the result is scored against emoji embeddings learned from EmojiNet knowledge concepts to produce emoji prediction scores.]

We use a pre-trained ResNet-152 [19] image classifier, a 152-layer residual network for image classification. ResNet-152 predicts the probability that an image belongs to a particular class. We replace each class label with its corresponding word embedding, learned using the word embedding models discussed earlier, and call this the class embedding [1]. Many researchers [3], [14] have worked on combining textual and visual features to improve the accuracy of multimedia tasks in the fields of NLP and image processing. Using the probabilities predicted by the ResNet-152 classifier, we calculate an image embedding which combines the textual features and embeds the image into the same embedding space as words [24]. This image embedding lets us visualize an image in the same vector space as words. Let W_i and \vec{C}_i denote a class label and the word embedding of that class label respectively, and let P_i denote the probability associated with the class. We compute the image embedding as

    Image_Embedding = \sum_i P_i \vec{C}_i    (3)

We hypothesize that this image embedding captures the context of the image through the word representations of the image classes, which further helps us predict an emoji in the context of the image. Barbieri et al. [3] noted that the image caption helps in understanding the context of use of an image, so we use the image caption as an additional feature to learn the image representation. We use the same bag-of-words model illustrated in Figure 4 (the approach used to calculate the emoji representations) to calculate the representation of the image caption in a low-dimensional vector space. We combine the caption embedding and the image embedding by vector addition, as both representations are embedded in the same vector space [18], and we refer to the combined embedding as the image embedding in later sections. Consider Figure 6, where we plot the emojis and the image embedding calculated using our approach in the same vector space. We calculated the 300-dimensional emoji representations using knowledge concepts from EmojiNet and the image embedding using our approach, using the pre-trained FastText word embedding model trained on the Wiki corpus12 [33] for the word representations. Since one cannot visualize 300-dimensional vectors, we use the t-SNE visualization [32] to project the 300-dimensional representations into a two-dimensional vector space. We observe that the emojis most similar to the context of the image lie closer to the image than other emojis. This justifies that the image embedding helps in the emoji recommendation task, and adds a strong argument for combining both visual and textual features for emoji scoring. Each emoji is scored according to the similarity between the image embedding (which captures the context of the image) and the emoji embedding (which captures the context of use of the emoji), using cosine similarity as the distance measure. We refer to this task as emoji scoring in later sections of the paper.

12 https://bit.ly/2FMTB4N
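Putting Eq. (3) and the scoring step together: the sketch below weights each class embedding by its classifier probability, adds the caption's bag-of-words vector, and ranks emojis by cosine similarity. The dictionary-based interfaces are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def image_embedding(class_probs, class_embeddings, caption_vec):
    """Eq. (3) plus caption fusion: sum of class embeddings weighted by the
    classifier's probabilities, then vector addition of the caption embedding."""
    visual = np.zeros_like(caption_vec)
    for label, p in class_probs.items():           # P_i from the ResNet-152 softmax
        if label in class_embeddings:
            visual += p * class_embeddings[label]  # P_i * C_i
    return visual + caption_vec

def emoji_scoring(image_vec, emoji_embeddings, top_k=5):
    """Emoji scoring: rank emojis by cosine similarity to the image embedding."""
    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    ranked = sorted(emoji_embeddings,
                    key=lambda e: cosine(image_vec, emoji_embeddings[e]),
                    reverse=True)
    return ranked[:top_k]
```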
[Fig. 6. The image embedding (image + textual description) and the emoji embeddings projected into the same vector space, which lets us calculate the similarity between entities. The curve groups the emojis which are closest, and most similar, to the context of the image.]

We evaluate our model using the two emoji embeddings as knowledge concepts and report our results on some images extracted from MSCOCO in Table 4. We observe that the emojis recommended by our model are in context with the image.

V. EXPERIMENTS

A. Twitter Dataset

We use our emoji recommendation model to predict, by emoji scoring, the emoji which can be used in the context of an image, and we calculate the number of tweets where the actual emoji label used in the tweet is the emoji predicted by our model. Table 1 reports the percentage of tweets in which the emoji label is the emoji predicted by our model. We considered the image embedding as the visual feature (V) and the combination of the image embedding and the textual embedding as the combined visual and textual feature (V + T) to evaluate our model. We evaluated it on three different word embedding models, namely the Google News word embedding model, FastText trained on MSCOCO descriptions, and FastText trained on the entire Wikipedia corpus [33], and using four external knowledge concepts, namely emoji names, emoji sense forms, emoji sense definitions, and processed emoji sense definitions. All the results in this paper report the number of tweets where the emoji used in the tweet is among the emojis recommended by our model.
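The evaluation metric used throughout is a hit rate: the share of posts whose ground-truth emoji appears among the model's top-k recommendations. A small sketch of that loop, with recommend standing in for the scoring pipeline above:

```python
def hit_at_k(examples, recommend, k=1):
    """Fraction of examples whose true emoji is among the top-k recommendations."""
    hits = 0
    for ex in examples:
        top = recommend(ex["image_url"], ex["description"], top_k=k)
        hits += ex["label"] in top
    return hits / len(examples)

# Hypothetical usage over the Twitter dataset:
# top1 = hit_at_k(dataset, recommend, k=1)   # Table I setting
# top3 = hit_at_k(dataset, recommend, k=3)   # Table III setting
```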
TABLE I
Percentage of tweets in which the emoji used in the tweet is the emoji recommended by our model. These accuracies are observed when the top 20 most frequently occurring emojis in the Twitter dataset are considered for emoji scoring. V = visual features; V+T = combined visual and textual features.

                                      Emoji Names        Emoji Sense Definitions
    Word Embedding Model              V       V+T        V        V+T
    Google News Word Embeddings      29.9%   31.2%      40.1%    40.9%
    FastText trained on MSCOCO       31.8%   32.9%      41.8%    43.9%
    FastText trained on Wiki Corpus  32.3%   34.8%      42.3%    45.1%

B. User Study

As discussed in Section 3, each image in the MSCOCO dataset is annotated by three annotators. We observed a high accuracy score on the MSCOCO dataset because the descriptions listed in MSCOCO explain the context of each image more effectively than user descriptions on social media platforms, which helped our model capture the context of the image from the textual descriptions. Table 2 reports the number of images where the emoji label selected the most times by the three annotators is the emoji recommended by our model. We also report the number of images where the emoji label is one of the top-3 emojis predicted by our model. We used Cohen's kappa coefficient (κ) to measure the inter-rater agreement, which is 0.664, a good agreement value (0.6 < κ < 0.8).
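Cohen's kappa is defined for pairs of raters; with three annotators, one plausible reading of the reported 0.664 is a mean of pairwise kappas, sketched below. The aggregation choice is our assumption and is not stated in the paper.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def mean_pairwise_kappa(annotations):
    """Mean pairwise Cohen's kappa for a list of per-annotator label sequences."""
    pairs = list(combinations(range(len(annotations)), 2))
    scores = [cohen_kappa_score(annotations[i], annotations[j]) for i, j in pairs]
    return sum(scores) / len(scores)

# annotations = [labels_by_annotator_1, labels_by_annotator_2, labels_by_annotator_3]
```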
TABLE II
Number of images where the user-annotated emoji label belongs to the set of emoji recommendations by our model.

             Emoji    Emoji     Emoji          Processed
             Names    Senses    Definitions    Emoji Definitions
    top-1    148      311       217            356
    top-3    224      386       278            426

VI. DISCUSSION

To further demonstrate the effectiveness of the proposed method, we compare it with the state-of-the-art Image2Emoji models for emoji prediction. Table 4 summarizes the results obtained by our model and by the Image2Emoji model [10]. The third and fourth columns report the top 5 emojis, arranged according to their score, when processed emoji sense definitions and emoji senses, respectively, are used as external knowledge concepts; the last column reports the emojis predicted by the Image2Emoji model. Consider the set of emojis predicted in the context of the 4th image in Table 4: the emojis predicted by the Image2Emoji model do not closely relate to the context of the image, whereas the emojis predicted using processed emoji definitions as external knowledge concepts are more relevant. Further, the recommendations obtained with processed emoji sense definitions as external knowledge concepts are more relevant than the other sets of predictions, because the processed emoji sense definitions explain the context of use of an emoji.

Table 1 and Table 3 report the accuracies observed when considering, respectively, the top 20 most frequent emojis and the complete set of 1089 emojis present in the Twitter dataset, scored according to their relevance to the context of the image. Barbieri et al. [3] reported an accuracy of 35.5% when they considered the top-20 most frequent emojis as labels for emoji prediction. Our model outperforms this, achieving an accuracy of 45.1% when the top-20 most frequently used emojis in the Twitter dataset are considered as labels for emoji recommendation.

TABLE III
Percentage of tweets in which the emoji used is one of the top-3 emojis recommended by our model. These accuracies are observed when the complete set of emojis in the Twitter dataset (1089 emojis) is considered for emoji scoring. V = visual features; V+T = combined visual and textual features.

                                       Emoji Names       Emoji Sense Forms    Emoji Sense Definitions    Processed Sense Definitions
    Word Embedding Model               V       V+T       V        V+T         V        V+T               V        V+T
    Google News Word Embeddings      4.08%    5.72%    14.76%   18.49%       8.46%   12.03%            19.21%   21.23%
    FastText on MSCOCO Descriptions  4.71%    6.63%    15.34%   18.24%       9.46%   12.89%            19.62%   22.04%
    FastText on Wikipedia Corpus     5.45%    7.02%    15.61%   17.89%       9.02%   13.43%            19.45%   22.49%

We observed an accuracy of 22.49% (our model predicted the correct emoji label for 6102 of the 27136 images) when processed emoji sense definitions are considered as external knowledge concepts and 1089 emojis are used as labels for emoji scoring. This further demonstrates the effectiveness of the proposed approach with a large set of 1089 emoji labels. It can also be noted that, in most cases, FastText word embeddings trained on the Wikipedia corpus resulted in the highest accuracies, as the FastText model has been shown to yield high accuracies in most NLP tasks compared to other word embedding models [33].

VII. CONCLUSION AND FUTURE WORK

In this paper, we introduced a knowledge-enabled emoji recommendation system which helps users select the emoji that talks best for an image or picture, using domain knowledge from EmojiNet. Experimental results show that our results outperform the previous and current state-of-the-art Image2Emoji models [3], [10]. Table 4 reports some of the interesting emoji recommendations obtained using the various knowledge concepts for emoji scoring. The accuracy of our model is highest when processed emoji sense definitions are used as the external knowledge concepts for emoji scoring. We plan to extend our work by introducing deep learning models for emoji prediction in the context of images. Venugopalan et al. [45] used linguistic knowledge from large text corpora to generate natural language descriptions of videos. Using this as a reference, we plan to extend our work in the future by building models which can summarize a video into a sequence of meaningful emojis that conveys the same visual content, using existing domain knowledge for emojis.

ACKNOWLEDGEMENT

We are grateful to Swati Padhee, Sanjaya Wijeratne, and Dr. Amit P. Sheth for thought-provoking discussions on the topic. We acknowledge support from the Indian Institute of Technology Kharagpur and the Indian Institute of Technology Hyderabad. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Indian Institute of Technology Kharagpur or the Indian Institute of Technology Hyderabad.
TABLE IV
Top 5 emojis predicted in the context of an image using different emoji knowledge concepts (processed sense definitions, emoji senses, and emoji names). The image and predicted-emoji columns are pictorial in the original paper; only the text descriptions are reproduced here.

    1. A person looks down at something while sitting on a bike
    2. The dog is playing with his toy in the grass
    3. A tennis player in action on the court
    4. Cup of coffee with dessert items on a wooden grained table
REFERENCES

[1] Z Akata, F Perronnin, Z Harchaoui, and C Schmid. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7), 2016.
[2] S Auer, C Bizer, G Kobilarov, J Lehmann, R Cyganiak, and Z Ives. DBpedia: A nucleus for a web of open data. In The Semantic Web. Springer, 2007.
[3] F Barbieri, M Ballesteros, F Ronzano, and H Saggion. Multimodal emoji prediction. arXiv preprint arXiv:1803.02392, 2018.
[4] F Barbieri, M Ballesteros, and H Saggion. Are emojis predictable? arXiv preprint arXiv:1702.07285, 2017.
[5] T Berners-Lee, J Hendler, and O Lassila. The semantic web. Scientific American, 2001.
[6] J Bian, B Gao, and T Liu. Knowledge-powered deep learning for word embedding. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2014.
[7] P Bojanowski, E Grave, A Joulin, and T Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.
[8] K Bollacker, C Evans, P Paritosh, T Sturge, and J Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 2008.
[9] A Bordes, J Weston, R Collobert, and Y Bengio. Learning structured embeddings of knowledge bases. In AAAI, 2011.
[10] S Cappallo, T Mensink, and C GM Snoek. Image2Emoji: Zero-shot emoji prediction for visual media. In Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 2015.
[11] S Cappallo, S Svetlichnaya, P Garrigues, T Mensink, and C G Snoek. The new modality: Emoji challenges in prediction, anticipation, and retrieval. arXiv preprint arXiv:1801.10253, 2018.
[12] B Eisner, T Rocktäschel, I Augenstein, M Bošnjak, and S Riedel. emoji2vec: Learning emoji representations from their description. arXiv preprint arXiv:1609.08359, 2016.
[13] H Fede, I Herrera, SM M Seyednezhad, and R Menezes. Representing emoji usage using directed networks: A Twitter case study. In International Workshop on Complex Networks and their Applications. Springer, 2017.
[14] D Galanopoulos, M Dojchinovski, K Chandramouli, T Kliegr, and V Mezaris. Multimodal fusion: Combining visual and textual cues for concept detection in video. In Multimedia Data Mining and Analytics: Disruptive Innovation, 2015.
[15] R B Girshick, J Donahue, T Darrell, and J Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013.
[16] X Glorot and Y Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research. PMLR, 2010.
[17] Y Goldberg and O Levy. word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.
[18] Y-I Ha, J Kim, D Won, M Cha, and J Joo. Characterizing clickbaits on Instagram. In Proceedings of the 2018 ICWSM, 2018.
[19] K He, X Zhang, S Ren, and J Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[20] K He, X Zhang, S Ren, and J Sun. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027, 2016.
[21] S Hochreiter and J Schmidhuber. Long short-term memory. Neural Computation, 1997.
[22] A Illendula and M R Yedulla. Learning emoji embeddings using emoji co-occurrence network graph. arXiv preprint arXiv:1806.07785, 2018.
[23] A Joulin, E Grave, P Bojanowski, M Douze, H Jégou, and T Mikolov. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.
[24] A Karpathy and L Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
[25] T Kenter, A Borisov, and M de Rijke. Siamese CBOW: Optimizing word embeddings for sentence representations. arXiv preprint arXiv:1606.04640, 2016.
[26] A Krizhevsky, I Sutskever, and G E Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 2017.
[27] U Kursuncu, M Gaur, U Lokala, A Illendula, K Thirunarayan, R Daniulaityte, A Sheth, and I B Arpinar. "What's ur type?" Contextualized classification of user types in marijuana-related communications using compositional multiview embedding. arXiv preprint arXiv:1806.06813, 2018.
[28] Q Li, Q Peng, and C Yan. Multiple VLAD encoding of CNNs for image classification. Computing in Science & Engineering, 2018.
[29] T Lin, M Maire, S Belongie, J Hays, P Perona, D Ramanan, P Dollár, and C L Zitnick. Microsoft COCO: Common objects in context. In ECCV. Springer, 2014.
[30] T-Y Lin, M Maire, S J Belongie, L D Bourdev, R B Girshick, J Hays, P Perona, D Ramanan, P Dollár, and C L Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.
[31] D G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004.
[32] L van der Maaten and G Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2008.
[33] T Mikolov, E Grave, P Bojanowski, C Puhrsch, and A Joulin. Advances in pre-training distributed word representations. In LREC, 2018.
[34] T Mikolov, I Sutskever, K Chen, G S Corrado, and J Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 2013.
[35] G A Miller. WordNet: a lexical database for English. Communications of the ACM, 1995.
[36] T Mitchell, W Cohen, E Hruschka, P Talukdar, B Yang, J Betteridge, A Carlson, B Dalvi, M Gardner, and B Kisiel. Never-ending learning. Communications of the ACM, 2018.
[37] J Pennington, R Socher, and C Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[38] O Russakovsky, J Deng, Z Huang, A C Berg, and L Fei-Fei. Detecting avocados to zucchinis: what have we done, and where are we going? In ICCV, 2013.
[39] SM M Seyednezhad and R Menezes. Understanding subject-based emoji usage using network science. In Workshop on Complex Networks CompleNet. Springer, 2017.
[40] E Shelhamer, J Long, and T Darrell. Fully convolutional networks for semantic segmentation. CoRR, abs/1605.06211, 2016.
[41] A Sheth, S Perera, S Wijeratne, and K Thirunarayan. Knowledge will propel machine understanding of content: extrapolating from current examples. In Web Intelligence 2017, 2017.
[42] J Shin, S Wu, F Wang, C De Sa, C Zhang, and C Ré. Incremental knowledge base construction using DeepDive. Proceedings of the VLDB Endowment, 8(11), 2015.
[43] K Simonyan and A Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[44] F M Suchanek, G Kasneci, and G Weikum. YAGO: a core of semantic knowledge. In WWW. ACM, 2007.
[45] S Venugopalan, L A Hendricks, R Mooney, and K Saenko. Improving LSTM-based video description with linguistic knowledge mined from text. arXiv preprint arXiv:1604.01729, 2016.
[46] S Wijeratne, L Balasuriya, A Sheth, and D Doran. EmojiNet: An open service and API for emoji sense discovery. In ICWSM, Montreal, Canada, May 2017.
[47] S Wijeratne, L Balasuriya, A Sheth, and D Doran. A semantics-based measure of emoji similarity. In WI, 2017.
[48] C Xing, W Wu, Y Wu, J Liu, Y Huang, M Zhou, and W Ma. Topic aware neural response generation. In AAAI, volume 17, 2017.
[49] B Yang and T Mitchell. Leveraging knowledge bases in LSTMs for improving machine reading. In ACL, volume 1, 2017.
[50] M D Zeiler and R Fergus. Visualizing and understanding convolutional networks. In ECCV. Springer, 2014.
[51] Y Zhang, R Jin, and Z Zhou. Understanding bag-of-words model: a statistical framework. International Journal of Machine Learning and Cybernetics, 2010.
