A Practical Guide to Hybrid Natural Language Processing
Combining Neural Models and Knowledge Graphs for NLP
Jose Manuel Gomez-Perez • Ronald Denaux • Andres Garcia-Silva
Jose Manuel Gomez-Perez, Expert System, Madrid, Spain
Ronald Denaux, Expert System, Madrid, Spain
Andres Garcia-Silva, Expert System, Madrid, Spain
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To family and friends and all who made this
book possible in one way or another.
Foreword
In 2005, Wikipedia was still a young website, and most of the public was just starting
to become aware of it. The Wikipedia community sent out calls for the very first global
meet-up of contributors, named Wikimania. Markus Krotzsch and I were both early
contributors and had just started as Ph.D. students working on the Semantic Web.
We wanted to go to Wikimania and meet those people we knew only online.
We sat down and thought about what kind of idea to submit to Wikimania. The
most obvious one was to combine Wikipedia with Semantic Web technologies. But
what would that mean?
Wikipedia’s power lies in the ease of editing, and in the idea that anyone
can contribute to it. It lies in its community, and in the rules and processes the
community had set up. The power of the Semantic Web was to publish machine-
readable data on the Web and to allow agents to combine the data from many
different sources. Our idea was to enable Wikipedia communities to create content
that is more machine-readable and can participate in the Semantic Web.
Our talk was set up for the first session on the first day, and we used the talk to
start a conversation that would continue for years. The talk led to the creation of
Semantic MediaWiki, an extension to the MediaWiki software powering Wikipedia
and other wikis, which has found use in numerous places, such as NASA, the
Museum of Modern Art, the US Intelligence community, General Electric, and
many more. The talk also eventually led to the creation of Wikidata, but it took
many years to get there.
Fig. 1 Map showing the 100 largest cities with a female mayor. Data from Wikidata, map from
OpenStreetMap. Link: https://w.wiki/GbC
In the talk, we proposed a system that would allow us to answer questions using
Wikipedia’s content. Our example question was: what are the world’s largest cities
with a female mayor?
Wikipedia, at that point, already had all the relevant data, but it was spread out
through many articles. One could, given enough time, comb through the articles of
all the largest cities, check who the mayor is, and start keeping a spreadsheet of
potential answers, and finally clean it up and produce the end result.
Today, with the availability of Wikidata, we can get answers to that question (see
Fig. 1) and many others within seconds. Wikidata is a large knowledge base that
anyone can edit and that has, as of early 2020, collected nearly a billion statements
about more than 75 million topics of interest.
But, although we finally have a system that allows all of us to ask these questions
and get beautiful visualizations as answers, there is still a high barrier to a wide
range of people actually benefiting from this data: it requires writing queries in
the query language SPARQL, a rare skill set. Can we do better?
Most of the knowledge in the world is encoded not in knowledge graphs but
in natural language. Natural language is also the most powerful user interface we
know.
The 2010s saw neural models created through deep learning applied to tasks in
natural language processing become hugely popular. Whether generation, summa-
rization, question answering, or translation—undoubtedly the poster child of this
querying and answer generation. The two technologies support and enhance each
other. This book shows practical ways to combine them and unlock new capabilities
to engage and empower users.
Enjoy the book, and use your newly acquired knowledge wisely!
Preface
who may be looking for ways to leverage structured knowledge bases to optimize
results in downstream NLP tasks.
Readers from industry and academia in the above-mentioned communities will
thus find in this book a practical resource to hybrid NLP. Throughout the book, we
show how to leverage complementary representations stemming from the analysis
of unstructured text corpora as well as the entities and relations described explicitly
in a knowledge graph, integrate such representations, and use the resulting features
to effectively solve different NLP tasks in different domains. In these pages, the
reader will have access to actual executable code with examples, exercises, and real-
world applications in key domains like disinformation analysis and machine reading
comprehension of scientific literature.
In writing this book, we did not seek to provide an exhaustive account of current
NLP approaches, techniques, and toolkits, either knowledge-based, neural, or based
on other forms of machine learning. We consider this to be sufficiently well covered in
the literature. Instead, we chose to focus on the main building blocks that the reader
actually needs to be aware of in order to assimilate and apply the main ideas of
this book. Indeed, all the chapters are self-contained and the average reader should
not encounter major difficulties in their comprehension. As a result, you have in
your hands a compact yet insightful handbook focused on the main challenge of
reconciling knowledge-based and neural approaches to NLP. We hope you will
enjoy it.
This book provides readers with a principled yet practical guide to hybrid
approaches to natural language processing involving a combination of neural
methods and knowledge graphs. The book addresses a number of questions related
to hybrid NLP systems, including:
• How can neural methods extend previously captured knowledge explicitly
represented as knowledge graphs in cost-efficient and practical ways and vice
versa?
• What are the main building blocks and techniques enabling a hybrid approach to
NLP that combines neural and knowledge-based approaches?
• How can neural and structured, knowledge-based representations be seamlessly
integrated?
• Can this hybrid approach result in better knowledge graphs and neural represen-
tations?
• How can the quality of the resulting hybrid representations be inspected and
evaluated?
• What is the impact on the performance of NLP tasks, the processing of other data
modalities, like images or diagrams, and their interplay?
To this purpose, the book first introduces the main building blocks and then
describes how they can be intertwined, supporting the effective implementation of
real-life NLP applications. To illustrate the ideas described in the book, we include a
comprehensive set of experiments and exercises involving different algorithms over
a selection of domains and corpora in several NLP tasks.
of Vecsigrafo, called Transigrafo. This part of the chapter is also illustrated using a
notebook.
Chapter 7: Quality Evaluation discusses several evaluation methods that
provide insight into the quality of the hybrid representations learnt by Vecsigrafo.
To this purpose, we will use a notebook that illustrates the different techniques
entailed. In this chapter, we also study how such representations compare against
lexical and semantic embeddings produced by other algorithms.
Chapter 8: Capturing Lexical, Grammatical, and Semantic Information
with Vecsigrafo. Building hybrid systems that leverage both text corpora and a
knowledge graph requires generating embeddings for the items represented in the
graph, such as concepts, which are linked to the words and expressions in the
corpus singled out through some tokenization strategy. In this chapter and associated
notebook, we investigate different tokenization strategies and how these may impact
the resulting lexical, grammatical, and semantic embeddings in Vecsigrafo.
Chapter 9: Aligning Embedding Spaces and Applications for Knowledge
Graphs presents several approaches to align the vector spaces learned from different
sources, possibly in different languages. We discuss various applications such as
multi-linguality and multi-modality, which we also illustrate in an accompanying
notebook. The techniques for vector space alignment are particularly relevant in
hybrid settings, as they can provide a basis for knowledge graph interlinking and
cross-lingual applications.
Chapter 10: A Hybrid Approach to Fake News and Disinformation Analysis.
In this chapter and corresponding notebooks, we start looking at how we can apply
hybrid representations in the context of specific NLP tasks and how this improves
the performance of such tasks. In particular, we will see how to use and adapt deep
learning architectures to take into account hybrid knowledge sources to classify
documents which in this case may contain misinformation.
Chapter 11: Jointly Learning Text and Visual Information in the Scientific
Domain. In this chapter and its notebook, we motivate the application of hybrid
techniques to NLP in the scientific domain. This chapter will guide the reader to
implement state-of-the-art techniques that relate both text and visual information,
enrich the resulting features with pre-trained knowledge graph embeddings, and use
the resulting features in a series of transfer learning tasks, ranging from figure and
caption classification to multiple-choice question answering over the text and diagrams
of 6th grade science questions.
Chapter 12: Looking Into the Future of Natural Language Processing
provides final thoughts and guidelines on the topics covered in this book. It also
anticipates some of the future developments in hybrid natural language processing in order to
help professionals and researchers configure a path of ongoing training, promising
research fields, and areas of industrial application. This chapter includes feedback
from experts in areas related to this book, who were asked about their particular
vision, foreseeable barriers, and next steps.
Materials
All the examples and exercises proposed in the book are available as executable
Jupyter notebooks in our GitHub repository.1 All the notebooks are ready to be
run on Google Colaboratory or, if the reader so prefers, in a local environment.
The book also leverages experience and feedback acquired through our tutorial
on Hybrid Techniques for Knowledge-based NLP,2 initiated at K-CAP'17 and
continued at ISWC'18 and K-CAP'19. The current version of the tutorial is
available online and the reader is encouraged to use it in combination with the
book to consolidate the knowledge acquired in the different chapters with executable
examples, exercises, and real-world applications.
The field addressed by this book is tremendously dynamic. Much of the relevant
bibliography in critical and related areas like neural language models has erupted
in the last few months, configuring a thriving field that is taking shape as we write
these lines. New and groundbreaking contributions are therefore expected to appear
during the preparation of this book, which will be studied and incorporated and may
even motivate future editions. For this reason, resources like the above-mentioned
tutorial on Hybrid Techniques for Knowledge-based NLP and others like Graham
Neubig et al.’s Concepts in Neural Networks for NLP6 are of particular importance.
This book does not seek to provide an exhaustive survey on previous work in
NLP. Although we provide the necessary pointers to the relevant bibliography in
each of the areas we discuss, we have purposefully kept it succinct and focused.
Related books that will provide the reader with a rich background in relevant areas
for this book include the following.
Manning and Schutze’s Foundations of Statistical Natural Language Processing
[114] and Jurafsky and Martin’s Speech and Language Processing [88] provide
excellent coverage of statistical approaches to natural language processing and their
applications, as well as introduce how (semi)structured knowledge representations
and resources like WordNet and FrameNet [12] can play a role in the NLP pipeline.
More recently, a number of books have covered the field with special emphasis on
neural approaches. Eisenstein’s Introduction to Natural Language Processing [51]
1 https://github.com/hybridnlp/tutorial.
2 https://hybridnlp.expertsystemlab.com/tutorial.
3 K-CAP'17, 9th International Conference on Knowledge Capture (https://www.k-cap2017.org).
4 ISWC'18, 17th International Semantic Web Conference (https://iswc2018.semanticweb.org).
5 K-CAP'19, 10th International Conference on Knowledge Capture (https://www.k-cap.org/2019).
6 https://github.com/neulab/nn4nlp-concepts.
Part I
Preliminaries and Building Blocks
Chapter 1
Hybrid Natural Language Processing:
An Introduction
The history of artificial intelligence can be seen as a quest for the perfect com-
bination of reasoning accuracy and the ability to capture knowledge in machine-
actionable formats. Early AI systems developed during the ’70s like MYCIN [169]
already proved it was possible to effectively emulate human reasoning in tasks
like classification or diagnosis through artificial means. However, the acquisition
of expert knowledge from humans soon proved to be a challenging task, resulting in
what was known ever after as the knowledge acquisition bottleneck [58].
Knowledge acquisition eventually became a modeling activity rather than a
task focused on extracting knowledge from the mind of an expert, in an attempt
to address this challenge and work at the so-called knowledge level [124]. The
modeling approach and working at the knowledge level facilitates focusing on
what an AI agent knows and what its goals are as an abstraction separate from
implementation details. Along the knowledge level path came ontologies, semantic
networks, and eventually knowledge graphs, which provide rich, expressive, and
actionable descriptions of the domain of interest and support logical explanations of
reasoning outcomes.
As Chris Welty put it in his foreword to Exploiting Linked Data and Knowledge
Graphs in Large Organizations [135]: “Knowledge Graphs are Everywhere! Most
major IT companies or, more accurately, most major information companies
including Bloomberg, NY Times, Google, Microsoft, Facebook, Twitter, and many
more, have significant graphs and invested in their curation. Not because any of
these graphs is their business, but because using this knowledge helps them in
their business.” However, knowledge graphs can be costly to produce and scale
since a considerable amount of well-trained human labor is needed to curate high-
quality knowledge in the required formats. Furthermore, the design decisions made
by knowledge engineers can also have an impact in terms of depth, breadth, and
focus, which may result in biased and/or brittle knowledge representations, hence
requiring continuous supervision. Ambitious research programs in the history of
AI, like CYC [101] and Halo [72],1 invested large amounts of effort to produce
highly curated knowledge bases either by knowledge engineers or by subject matter
experts, and all of them had to face challenges like the ones mentioned above.
In parallel, the last decade has witnessed a noticeable shift from knowledge-
based, human-engineered methods to data-driven and particularly neural methods
due to the increasing availability of raw data and more effective architectures,
facilitating the training of increasingly performant models. Areas of AI like
computer vision soon leveraged the advantages brought about by this new scenario.
The natural language processing (NLP) community has also embraced this trend,
with remarkably good results.
Relatively recent breakthroughs in the field of distributional semantics and word
embeddings [16] have proved to be particularly successful at capturing the meaning
of words in document corpora as vectors in a dense, low-dimensional space. Much
research has been put into the development of more and more effective means to pro-
duce word embeddings, ranging from algorithms like word2vec [121], GloVe [138],
or fastText [23] to neural language models capable of producing context-specific word
representations at a scale never seen before, like ELMo [139], BERT [44], and
XLNet [195]. Paraphrasing Welty: “Embeddings are Everywhere!”. Among many
applications, they have proved to be useful in terms of similarity, analogy, and
relatedness, as well as in most NLP tasks, including, e.g. classification [96], question
answering [95, 137, 162], or machine translation [11, 32, 89, 175].
Certainly, word embeddings and language models are pushing the boundaries of
NLP at an unprecedented speed. However, the complexity and size of language mod-
els2 imply that training usually requires large computational resources, often out of
reach for most researchers outside of the big corporate laboratories.
1 Halo eventually motivated the current project Aristo (https://allenai.org/aristo).
2 At the time of writing this introduction, BERT, with over 200 million parameters, looks small compared to new neural language models like Salesforce's CTRL [94] (1.6 billion parameters) or Google's Text-to-Text Transfer Transformer (T5) [149], which can exceed 11 billion parameters in its largest configuration.
1.2 Combining Knowledge Graphs and Neural Approaches for NLP
Many argue [46, 166, 168] that knowledge graphs can enhance both expressivity
and reasoning power in machine learning architectures and advocate for a hybrid
approach that leverages the best of both worlds. From a practical viewpoint, the
advantages are particularly evident, e.g. in situations where there may be a lack of
sufficient training data. For example, text data can be augmented using a knowledge
graph by expanding the corpus based on hypernymy, synonymy, and other relations
represented explicitly in the graph.
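As a concrete illustration, the sketch below uses WordNet, via NLTK, to generate synonym-based variants of a training sentence. The whitespace tokenization, replacement probability, and function names are illustrative assumptions, not the augmentation pipeline used later in this book.

```python
# A minimal sketch of knowledge-driven data augmentation: new training
# sentences are generated by replacing words with synonyms drawn from
# WordNet. Requires running nltk.download('wordnet') once beforehand.
import random
from nltk.corpus import wordnet as wn

def augment(sentence, p=0.3):
    augmented = []
    for tok in sentence.split():
        synsets = wn.synsets(tok)
        if synsets and random.random() < p:
            # pick an alternative lemma (synonym) from the first synset
            lemmas = [l.name().replace('_', ' ') for l in synsets[0].lemmas()]
            alternatives = [l for l in lemmas if l.lower() != tok.lower()]
            augmented.append(random.choice(alternatives) if alternatives else tok)
        else:
            augmented.append(tok)
    return ' '.join(augmented)

print(augment('the film received excellent reviews'))
```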
Neural-symbolic integration [10], as a field of research, addresses fundamental
problems related to building a technical bridge between both paradigms in AI.
Recently, this discussion has been rekindled in the context of the Semantic Web [78].
In the area of NLP, a similar discussion is also gaining traction. Among the benefits
hybrid approaches that combine neural and knowledge-based methods can entail,
models that learn to represent textual and knowledge base entities and relations in
the same continuous latent space for tasks like knowledge base completion have been
shown to perform joint inferences involving text and knowledge bases with high
accuracy [151]. Others [180] follow up to capture the compositional structure of
textual relations and jointly optimize entity, knowledge base, and textual relation
representations. More generally, knowledge graphs can contribute to train more
expressive NLP models that are able to learn the meaning of words by linking
them to explicitly represented concepts [29, 112] in a knowledge graph and, in
addition, jointly learning the representations of both words and concepts in a single
embedding space [39].
Focused on the pragmatics of language, another line of research [6, 20] argues
that effectively capturing meaning requires not only taking into account the form
of the text but also other aspects like the particular context in which such form is
used or the intent and cognitive state of the speaker, suggesting that the text needs
to be enriched with additional information in order to actually convey the required
meaning. Accordingly, for NLP to scale beyond partial, task-specific solutions, it
must be informed by what is known about how humans use the language and their
understanding of the world.
General-purpose knowledge graphs [7, 183] in combination with domain-specific
structured resources, as well as lexical databases like WordNet [59], which group
words based on their semantics and define relations between such concepts,
seem well suited to such purpose, too. Other resources like ConceptNet [173] or
ATOMIC [156], focused on modeling commonsense, have shown that it is actually
possible to capture human understanding in knowledge graphs and train neural
models based on such graphs that exhibit commonsense reasoning capabilities in
NLP tasks.
In this book we take a step further in the combination of knowledge graphs and
neural methods for NLP. Doing so requires that we address a number of questions.
Such questions include the following: (1) how can neural methods extend previously
captured knowledge explicitly represented as knowledge graphs in cost-efficient
and practical ways and vice versa; (2) what are the main building blocks and
techniques enabling such a hybrid approach to NLP; (3) how can knowledge-based
and neural representations be seamlessly integrated; (4) how can the quality of the
resulting hybrid representations be inspected and evaluated; (5) how can a hybrid
approach result in higher quality structured and neural representations compared to
the individual scenario; and (6) how does this impact the performance of NLP
tasks, as well as the processing and interplay with other data modalities, like visual
data, in tasks related to machine comprehension. Next, we try to provide answers to
these questions.
Chapter 2
Word, Sense, and Graph Embeddings
2.1 Introduction
As discussed in the previous chapter, one of the main questions addressed by this
book deals with how to represent in a shared vector space both words and their
meaning and how doing this can be beneficial in NLP. In this chapter, we first
describe the origins of word embeddings as distributed word representations in
Sect. 2.2. Then, in Sect. 2.3, we describe a variety of methods that have been
proposed to learn word embeddings and shortly describe the type of knowledge
they seem to capture (and where they have shortcomings) and how we evaluate
such embeddings. We then move on to describing, in a similar manner, sense and
concept embeddings derived from text in Sect. 2.4. Finally, we describe a third type
of embeddings, derived from knowledge graphs, in Sect. 2.5.
Learning word embeddings1 has a relatively long history [16], with earlier works
focused on deriving embeddings from co-occurrence matrices and more recent work
focusing on training models to predict words based on their context [15]. Both
approaches are roughly equivalent as long as design choices and hyperparameter
optimization are taken into account [103].
Most of the recent work in this area focused on learning embeddings for indi-
vidual words based on a large corpus and was triggered by the word2vec algorithm
proposed in [121] which provided an efficient way to learn word embeddings by
predicting words based on their context words2 and using negative sampling. More
recent improvements on this family of algorithms [118] also take into account (1)
subword information by learning embeddings for three to six character n-grams,
(2) multi-words by pre-processing the corpus and combining n-grams of words
with high mutual information like “New_York_City,” and (3) learning a weighting
scheme (rather than predefining it) to give more weight to context words depending
on their relative position to the center word.3 These advances are available via the
fastText implementation and pre-trained embeddings.
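As an illustration of this family of algorithms, the following sketch trains skip-gram embeddings with negative sampling using the gensim library. The toy corpus and the gensim 4.x parameter names are assumptions, and this is not the Swivel-based setup used later in the book.

```python
# A minimal sketch: skip-gram word embeddings with negative sampling via gensim.
from gensim.models import Word2Vec

corpus = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'chased', 'the', 'cat'],
]
model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # embedding dimensionality
    window=5,         # context window around the center word
    sg=1,             # skip-gram rather than CBOW
    negative=5,       # number of negative samples per positive pair
    min_count=1,
)
print(model.wv['cat'][:5])           # first dimensions of the vector for "cat"
print(model.wv.most_similar('cat'))  # nearest neighbours in the embedding space
```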
Algorithms based on word co-occurrences are also available. GloVe [138] and
Swivel [164] are two algorithms which learn embeddings directly from a sparse
co-occurrence matrix that can be derived from a corpus; they do this by calculating
relational probabilities between words based on their co-occurrence and total counts
in the corpus.
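The sketch below builds the kind of word-word co-occurrence matrix these count-based methods start from, using a toy corpus and a symmetric window of two tokens; real implementations shard such sparse matrices over large corpora.

```python
# A toy sketch of the co-occurrence counts consumed by methods like GloVe or Swivel.
from collections import Counter

corpus = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
          ['the', 'dog', 'sat', 'on', 'the', 'rug']]
window = 2
cooc = Counter()
for sentence in corpus:
    for i, center in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                cooc[(center, sentence[j])] += 1

print(cooc[('cat', 'sat')])  # how often "sat" appears within 2 tokens of "cat"
```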
These approaches have been shown to learn lexical and semantic relations.
However, since they stay at the level of words, they suffer from issues regarding
word ambiguity. Since most words are polysemous, the learned embeddings must
either try to capture the meanings of the different senses or encode only the meaning
of the most frequent sense. Conversely, the resulting embedding space
only provides a single embedding for each word, which makes it difficult to derive an
embedding for a concept based on the various words that can be used to refer to
that concept.
Vecsigrafo, which will be described in detail in Chap. 6, provides extensions that
can be applied to both word2vec-style algorithms and to co-occurrence algorithms.
In this book, we focus on the latter and show how such extensions can be applied to
Swivel. We pick Swivel for ease of implementation, which has proved very useful
for illustration purposes in a practical handbook like this. However, applying the
Vecsigrafo approach to GloVe and the standard word2vec implementations should
be straightforward. Applying it to other approaches like fastText is a priori more
complicated, especially when taking into account subword information, since words
can be subdivided into character n-grams, but concepts cannot.
Another way that has been proposed recently to deal with problems like
polysemy of word embeddings is to use language models to learn contextual
embeddings. In such cases, the corpus is used to train a model which can be used
to compute embeddings for words based on a specific context like a sentence or
a paragraph. The main difference is that words do not have a unique embedding;
instead, the embedding of a word depends on the words surrounding it. In Chap. 3
we will explore this type of word embeddings, including currently popular language
modeling approaches such as ELMo [139], GPT [146], and BERT [44].
Word embeddings are evaluated using intrinsic methods [158] that assess
whether the embedding space actually encodes the distributional context of words,
and extrinsic methods where they are evaluated according to the performance of a
downstream task. Analogical reasoning [121] and word similarity [150] are often
used as intrinsic evaluation methods. The analogy task4 relies on word relations
of the form a:a* :: b:b* (i.e., a is to a* as b is to b*), and the goal is to try to predict b*
given the other variables by operating on the corresponding word vectors.
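The following sketch runs one such analogy query with pre-trained GloVe vectors through gensim; the downloaded model name is an assumption, and any set of pre-trained vectors exposed as gensim KeyedVectors would work.

```python
# A sketch of the analogy task: solve man : king :: woman : ? by vector arithmetic.
import gensim.downloader as api

vectors = api.load('glove-wiki-gigaword-100')  # downloads pre-trained GloVe vectors
result = vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(result)  # typically [('queen', ...)]
```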
A few approaches have been proposed to produce sense and concept embeddings
from corpora. One approach is to generate sense embeddings [82],
whereby the corpus is disambiguated using Babelfy and then word2vec is applied
over the disambiguated version of the corpus. Since plain word2vec is applied,
only vectors for senses are generated. Jointly learning both words and senses was
proposed by Chen et al. [31] and Rothe et al. [153] via multi-step approaches where
the system first learns word embeddings, then applies disambiguation based on
WordNet and then learns the joint embeddings. While this addresses ambiguity
of individual words, the resulting embeddings focus on synonymous word-sense
pairs,6 rather than on knowledge graph concepts.
Another approach for learning embeddings for concepts based on a corpus
without requiring word-sense disambiguation is NASARI [29], which uses lexical
specificity to learn concept embeddings from Wikipedia subcorpora. These embed-
dings have as their dimensions the lexical specificity of words in the subcorpus,
hence they are sparse and harder to apply than low-dimensional embeddings such
as those produced by word2vec. For this reason, NASARI also proposes to generate
“embedded vectors” which are weighted averaged vectors from a conventional
word2vec embedding space. This approach only works for Wikipedia and BabelNet,
since you need a way to create a subcorpus that is relevant to entities in the
knowledge base.
Finally, SW2V (Senses and Words to Vectors) [112] proposes a lightweight
word-disambiguation algorithm and extends the continuous bag-of-words archi-
tecture of word2vec to take into account both words and senses. Vecsigrafo
adopts a similar approach, albeit with various differences, including the use of
an industry-grade disambiguator, a learning algorithm implemented as a variation
of correlation-based algorithms, and consideration of the distance of context words
and concepts to the center word. In terms of evaluation, Mancini et al. [112]
report results for 2 word similarity datasets while Vecsigrafo is substantiated by an
extensive analysis on 14 datasets and different corpus sizes. Vecsigrafo also takes
into account the inter-agreement between different vector spaces, as a measure of
how similar two vector spaces are based on the predicted distances between a set of
word pairs, in order to assess the quality of the resulting embeddings.
5 https://aclweb.org/aclwiki/Similarity_(State_of_the_art).
6 E.g. word-sense pairs appleN2 and Malus_pumila N1 have separate embeddings, but the concept
for apple tree they represent has no embedding.
2.5 Knowledge Graph Embeddings
Knowledge graphs have been useful to solve a wide variety of natural language pro-
cessing tasks such as semantic parsing, named entity disambiguation, information
extraction, and question answering among others. A knowledge graph is a multi-
relational graph that includes entities (nodes) and a set of relation types (edges).
Several approaches have been proposed to create concept embeddings directly from
these representations [129, 188].
Let E be a set of entities and R a set of relations; a triple of the form (h, r, t)
(head, relation, tail) denotes a fact, such that h, t ∈ E and r ∈ R. Facts are stored in
a knowledge base as a collection of triples D+ = {(h, r, t)}. Embeddings allow
translating these symbolic representations into a format that simplifies manipulation
while preserving the inherent structure. Usually, knowledge graphs adhere to some
deterministic rules, e.g. constraints or transitivity. However, there are also some
latent features that reflect inner statistical properties, such as the tendency of some
entities to be related by similar characteristics (homophily) or entities that can be
divided into different groups (block structure). These latent features may emerge
through the application of statistical relational learning techniques [129].
Generally, in a typical knowledge graph embedding model, entities are rep-
resented by vectors in a continuous vector space and relations are taken as
operations in the same space that can be represented by vectors, matrices, or tensors,
among others. Many algorithms that try to learn these representations have been
proposed. Since knowledge graphs are typically incomplete, one of the main
applications of such concept embeddings is usually graph completion. Knowledge
graph completion (KGC) has been proposed to improve knowledge graphs by filling
in its missing connections.
Several algorithms have been implemented to tackle this task [125]. The general
approach is the following: given a triple (h, r, t), these models assign it a score
f(h, r, t) reflecting its plausibility. The aim of the learning process is to choose a scoring
function such that the score of a correct triple will be higher than the score of an incorrect
one, usually a corrupted triple.
A family of graph embedding algorithms is that of translational models, which use
distance-based scoring functions. These models basically use vector translations to
represent the relations. One of the most representative algorithms in this group is the
TransE model [25], inspired by the word2vec skip-gram. This model represents both
entities and relations as vectors in the same space. A relation r is represented here
as a translation in the embedding space such that h + r ≈ t or t + r ≈ h when
(h, r, t) holds. Since TransE presents some disadvantages in dealing with relations
between more than two entities (e.g., 1-to-N, N-to-1, or N-to-N), some model
improvements have been proposed. TransH [190] basically uses vector projections
to represent relations. TransR [106] uses projections and represents the relations
in a separate vector space. TransD [54] simplifies TransR by assigning two vectors
to both entities and relations and avoids matrix-vector multiplication operations.
TransM [57] assigns different weights to each triple based on its relational
mapping property, that is, the different number of heads and tails. Finally, the
most recent and efficient translational models are TorusE [48], which solves a TransE
regularization problem, TranSparse [30], which uses adaptive sparse matrices, and
TranSparse-DT [85], which extends TranSparse with a dynamic translation.
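The numpy sketch below illustrates the core of the translational idea with a TransE-style score and a margin-based loss over a correct and a corrupted triple; the random vectors and their dimensionality are placeholders, not trained embeddings.

```python
# A minimal sketch of TransE-style scoring: a triple (h, r, t) is plausible when
# h + r is close to t, and training pushes correct triples to score better than
# corrupted ones via a margin-based loss.
import numpy as np

dim = 50
rng = np.random.default_rng(0)
h, r, t = (rng.normal(size=dim) for _ in range(3))
t_corrupt = rng.normal(size=dim)  # corrupted tail sampled at random

def score(h, r, t):
    # smaller distance = more plausible triple
    return np.linalg.norm(h + r - t, ord=1)

margin = 1.0
loss = max(0.0, margin + score(h, r, t) - score(h, r, t_corrupt))
print(score(h, r, t), score(h, r, t_corrupt), loss)
```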
Another group is formed by bi-linear models, which employ bi-linear scoring
functions to capture the latent semantics of entity vectors. RESCAL [131] is the
first bi-linear algorithm implemented and it represents each entity as a vector
and each relation as a matrix which models pairwise interactions between latent
factors. DISTMULT [193] is a RESCAL simplification in which the relation matrix
is substituted with a diagonal matrix, such that the score function reduces the
number of parameters per relation. SimplE [90] allows learning two different
embeddings for each entity when encountered either as head or tail. An extension of
DISTMULT is ComplEx [181], where the embeddings are represented as complex
values to better model asymmetric relations: h, r, t values lie in a complex space and
the scoring function is not symmetric. ComplEx-N3 [99] extends ComplEx with
weighted nuclear 3-norm, and RotatE [174] represents entities as complex vectors
and relations as rotations from the source entity to the target entity in a complex
vector space.
There is also a wide variety of algorithms implemented using neural network
architectures. Models that implement recurrent neural networks to learn relation-
ships are IRN [165] and PTransE-RNN [105]. A model inspired by holographic
models of associative memory is HolE [130] (holographic embeddings). These
models develop a memory system inspired by holography and use convolutions
and correlations to store and retrieve information [61]. HolE tries to combine
the expressive power of tensor product, e.g. RESCAL, and the lightness and
simplicity of TransE. It is a compositional model that uses circular correlation of
vectors to represent pairs of entities and takes into account asymmetrical relations.
The SME [24] (semantic matching energy) algorithm represents both entities and
relations using vectors, modeling the relation types and the entity types in the
same way. Then, a relation is combined respectively with its head and its tail.
Finally, the dot product of these combinations returns the score of a fact. The NTN
model [171] (neural tensor network) replaces a standard linear layer with a bi-
linear tensor layer able to relate entity vectors. A scored function computes how
likely a relation between two entities is. A special case of NTN is the SLM (single-
layer model) that connects the entity vectors implicitly through the non-linearity
of a standard, single-layer neural network when tensor is set to 0. ProjE [167] is a
two-layer neural network formed by a combined layer and a projection layer. It is
a simplified version of NTN in which combination operators as diagonal matrices
allow to combine an entity embedding matrix with a relation embedding matrix.
Given an entity embeddings and a relation embeddings, the output of this model is
a candidate-entity matrix that returns top-ranked candidates to complete the triple.
There are also some models based on convolutional neural networks. ConvE [43]
is a neural link prediction model that applies convolution operations over 2D shaped
embeddings in order to extract more feature interactions. ConvKB [126], instead,
can be viewed as an extension of TransE to further model global relationships
among same dimensional entries of the entity and relation embeddings. Each triple
in this model is represented as a 3-column matrix fed to a convolution layer where
multiple filters are operated to generate different feature maps. Conv-TransE [163]
and CapsE [128] are extensions of ConvE and ConvKB, respectively. The former
is designed to also take into account translational characteristics between entities
and relations, while the latter adds a capsule network layer on top of the convolution layer.
Several recent studies have shown that enriching the triple knowledge with
additional information present in the graph can improve the strength of the models.
Relation paths between entities and neighborhood information are two examples.
The first refers to a sequence of linked relations, and the second refers to modeling
an entity as a relation-specific mixture of its neighborhood in the graph. Relation
paths can improve the performance of models such as TransE and SME [110].
Neighborhood information has also been used in TransENMM [127] to improve
TransE and in R-GCN [157] to deal with highly multi-relational data, in which
a DISTMULT decoder takes an R-GCN encoder input to produce a score for every
potential edge in the graph.
Research that tries to adapt language models to knowledge graph embeddings
has been presented in RDF2Vec [152] and KG-BERT [196]. Language models are
probabilistic models able to predict a sequence of words in a given language and try
to encode both grammatical and semantic information. In RDF2Vec, the algorithm
transforms the graph data into sequences of entities in order to consider them as
sentences. To do that, the graph is divided into subgraphs using different techniques.
In the end, these sentences are used to train the neural language model to represent
each entity in the RDF graph as a vector of numerical values in a latent feature
space. The language model used in this research is word2vec. KG-BERT, instead,
uses the BERT language model to represent both entities and relations
through their names or textual descriptions. Therefore, this model is also able to take
into account extra textual information in addition to that encoded by the triples.
This algorithm turns knowledge graph completion into a sequence classification
problem.
Among the most recent models in this research field are CrossE [199],
GRank [49], and TuckER [14]. For each entity and relation, CrossE creates a
general embedding that stores high-level properties, as well as multiple
interaction embeddings. These embeddings are derived from an interaction matrix
which encodes specific properties and information about crossover interactions,
that is, bidirectional effects between entities and relations. To overcome the non-
interpretability of how knowledge graph embeddings encode information, GRank
proposes a different approach that utilizes graph patterns. This model constructs an
entity ranking system for each graph pattern and then evaluates them using a ranking
measure. TuckER, instead, is a linear model based on the Tucker decomposition of the
binary tensor representation of knowledge graph triples. This algorithm models
entities and relations with two different matrices.
Although these approaches are really interesting, they all have the same drawback:
they encode the knowledge (including biases) explicitly contained in the source
knowledge graph, which is typically already a condensed and filtered version of
the real-world data. Even large knowledge graphs only provide a fraction of the
data that can be gleaned from raw datasets such as Wikipedia and other web-
based text corpora, i.e. these embeddings cannot learn from raw data as it appears
in real-world documents. Approaches like KnowBert [140] and Vecsigrafo [39],
on the other hand, combine corpora and knowledge graphs, showing evidence
of improved perplexity, ability to recall facts and downstream performance on
relationship extraction, entity typing, and word-sense disambiguation.
Tables 2.1 and 2.2 report the results of the most popular KGE models on the
widespread datasets used in the scientific community.
Table 2.2 KGE models benchmarking (FB15K-237 [180] and WN18RR [43])

Model               Architecture        FB15K-237                   WN18RR
                                        MR     Hits@10  MRR         MR     Hits@10  MRR
DISTMULT [193]      Bi-linear           –      41.9     0.241       –      49.0     0.430
ComplEx [181]       Bi-linear           –      42.8     0.247       –      51.0     0.440
ConvE [43]          Convolutional-NN    –      50.1     0.325       –      50.2     0.430
RotatE [174]        Bi-linear           –      48.0     0.297       –      –        –
R-GCN [157]         Convolutional-NN    –      –        –           –      41.7     0.248
Conv-TransE [163]   Convolutional-NN    –      51.0     0.330       –      52.0     0.460
ConvKB [126]        Convolutional-NN    257    51.7     0.396       2554   52.5     0.248
CapsE [128]         Convolutional-NN    303    59.3     0.523       719    56.0     0.415
CrossE [199]        ML-based            –      47.4     0.299       –      –        –
GRank [49]          ML-based            –      48.9     0.322       –      53.9     0.470
KG-BERT [196]       Transformer         153    42.0     –           97     52.4     –
TuckER [14]         ML-based            –      54.4     0.358       –      52.6     0.470
2.6 Conclusion

Chapter 3
Understanding Word Embeddings and Language Models

Abstract Early word embedding algorithms like word2vec and GloVe generate
static distributional representations for words regardless of the context and the sense
in which the word is used in a given sentence, offering poor modeling of ambiguous
words and lacking coverage for out-of-vocabulary words. Hence, a new wave of
algorithms based on training language models, such as Open AI GPT and BERT,
has been proposed to generate contextual word embeddings that take word
constituents as input, allowing them to generate representations for out-of-vocabulary
words by combining the word pieces. Recently, fine-tuning pre-trained language models
that have been trained on large corpora has consistently advanced the state of the art
for many NLP tasks.
3.1 Introduction
GPT [146], or BERT [44]. Such approaches generate contextual word embeddings
by relying on word constituents at different degrees of granularity and using a
language model as learning objective. Since their emergence, language models have
constantly improved the state of the art in most NLP benchmarks. In this chapter we
characterize the main approaches to generate contextual word embeddings through
neural language models. In doing so, we will also describe how language models
have generalized the use of transfer learning in natural language processing.
A statistical model of language uses the chain rule to calculate joint probabilities
over word sequences:

P(w1, . . . , wn) = P(w1) P(w2 | w1) · · · P(wn | w1, . . . , wn−1)
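As a toy illustration of this factorization, the following sketch scores a three-word sentence with a bigram approximation, in which each word is conditioned only on its predecessor; the probability tables are made up for the example.

```python
# A toy bigram language model illustrating the chain-rule factorization above.
p_first = {'the': 0.6, 'a': 0.4}
p_next = {('the', 'cat'): 0.3, ('cat', 'sat'): 0.5}

def sentence_probability(words):
    prob = p_first[words[0]]
    for prev, cur in zip(words, words[1:]):
        prob *= p_next.get((prev, cur), 1e-6)  # unseen bigrams get a small floor
    return prob

print(sentence_probability(['the', 'cat', 'sat']))  # 0.6 * 0.3 * 0.5 = 0.09
```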
Learning a neural language model [21] is an unsupervised task where the model
is trained to predict the next word in a sequence given some previous words, i.e.
the model calculates the probability of each word in the vocabulary being
the next word in the sequence. Neural language models have been implemented
as feed-forward networks [21] and as LSTM architectures in [81, ULMFiT] and
[139, ELMo]. Nevertheless, recurrent networks, including LSTMs, are inherently
sequential, which hampers parallelization over the training data, a desired feature at
longer sequence lengths, as memory constraints limit batching across examples
[182]. Thus, LSTMs were replaced in [146, Open AI GPT] and [44, BERT ] by
transformer architectures [182], which are not recurrent and rely on a self-attention
mechanism to extract global dependencies between input and output sequences.
Neural language models are usually learnt from high-quality, grammatically
correct and curated text corpora, such as Wikipedia (ULMFiT), BookCorpus (Open
AI GPT), a combination of Wikipedia and BookCorpus (BERT), or news text (ELMo).
To overcome the out-of-vocabulary (OOV) problem, these approaches use different
representations based on characters (ELMo), byte-pair encoding [160] (GPT), and
word pieces [159] (BERT).
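The following sketch shows how a subword vocabulary sidesteps the OOV problem, using the WordPiece tokenizer shipped with the Hugging Face transformers library; the checkpoint name is the standard English BERT base model, and the second example word is simply assumed to be out of vocabulary.

```python
# A sketch of WordPiece tokenization: an unseen word is split into known pieces.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize('embeddings'))   # e.g. ['em', '##bed', '##ding', '##s']
print(tokenizer.tokenize('vecsigrafo'))   # an out-of-vocabulary word split into pieces
```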
The main purpose of using transfer learning [136] is to avoid building task-specific
models from scratch by leveraging knowledge previously acquired by training
another task. While word embeddings are learnt from large corpora, their use in
neural models to solve specific tasks is limited to the input layer. So, in practice
a task-specific neural model is built almost from scratch, since the rest of the model
parameters, which are typically initialized at random, need to be optimized for the
task at hand, requiring large amounts of data to produce a high-performance model.
A step forward towards transfer learning in NLP was ELMo's contextual
representations [139], which can be fine-tuned against domain data by
adjusting a linear combination of the internal representations of the model. However,
the need for specific architectures to solve different tasks still remains. Recent work
based on transformers to learn language models [44, 146] has shown evidence that
transferring internal self-attention blocks along with shallow feed-forward networks
is sufficient to advance the state of the art in different evaluation tasks, showing task-
specific architectures are no longer necessary.
3.3.1 ELMo
ELMo [139] (embeddings from language models) learns contextual word embed-
dings as a function of the entire input sentence. Contextual embeddings allow
dealing with, e.g. word polysemy, a feature that traditional word embedding
approaches did not support. ELMo trains a 2-layer BiLSTM with character convo-
lutions to learn a bidirectional language model from a large text corpus. Then, deep
contextualized word embeddings are generated as a linear function of the BiLSTM
hidden states.
The bidirectional language model is actually two language models: one forward
language model to process the input sentence and predict the next word and one
backward language model that runs the input sequence in reverse, predicting the
previous token given the future tokens.
ELMo embeddings can be used in a downstream model by collapsing all the
internal layers into a single vector. It is also possible to fine-tune the model on a
downstream task, computing a task-specific weighting of all the internal layers.
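The numpy sketch below mimics this collapse of ELMo's internal layers into a single contextual embedding per token through a softmax-normalized, task-specific weighting plus a global scale; the random matrices stand in for the actual layer activations of a six-token sentence.

```python
# A sketch of a task-specific "scalar mix" over ELMo layers.
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.normal(size=(6, 1024)) for _ in range(3)]  # (tokens, dim) per layer

s = np.array([0.2, 0.5, 0.3])          # learnable scalars, one per layer
weights = np.exp(s) / np.exp(s).sum()  # softmax normalisation
gamma = 1.0                            # learnable global scale

elmo_embeddings = gamma * sum(w * layer for w, layer in zip(weights, layers))
print(elmo_embeddings.shape)  # (6, 1024): one contextual vector per token
```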
3.3.2 GPT
The Generative Pre-trained Transformer (GPT) [146], as well as its sequel GPT-2,
trained on even larger corpora, is a neural language model that can be fine-
tuned for a specific task by applying task-dependent transformations to the input,
requiring minimal changes to the model architecture. GPT is first pre-trained in an
unsupervised stage to learn a language model on a large corpus of text using a multi-
layer transformer decoder [107]. Then, in a supervised stage, the model is fine-tuned
to adjust the parameters to the target task. GPT processes the text sequences from
left to right and hence each token can only attend to previous tokens in the self-
attention layer. Fine-tuning GPT for different evaluation tasks achieves better results
than using task-specific architectures, showing that the latter are no longer required.
During supervised fine-tuning a linear layer is added on top of the transformer to
learn a classifier. Thus, the task dataset is assumed to be a sequence of input tokens
along with a label. The only new parameters are the linear layer parameters, while
the transformer parameters are just adjusted. For tasks other than text classification,
the input is transformed into an ordered sequence that the pre-trained model can
process.
3.3.3 BERT

3.4 Fine-Tuning Pre-trained Language Models for Bot Detection

1 https://nlp.stanford.edu/projects/glove/pre-process-twitter.rb.
2 https://nlp.stanford.edu/projects/glove/.
3 https://data.world/jaredfern/urban-dictionary-embedding.
4 https://fasttext.cc/docs/en/english-vectors.html.
Contextualized Embeddings
In addition to static pre-trained embeddings we use dynamically generated embed-
dings using ELMo. ELMo embeddings were generated from our dataset; however,
none of the trainable parameters, i.e. linear combination weights, was modified in
the process. Due to the high dimensionality of such embeddings (dim=1024) and to
prevent memory errors, we reduced the sequence size used in the classifiers to 50.
Learned and Pre-trained Embeddings
Another option to improve the classifiers is to allow the neural architecture to
dynamically adjust the embeddings, or part of them, in the learning process. To do so, we
generate randomly initialized 300-dimensional embeddings and set them as trainable.
In addition, we concatenate these random and trainable embeddings to the pre-
trained and ELMo embeddings, which were not modified in the learning process.
In this round of experiments we always use pre-processing since in the previous
sections this option always improved the results.
Task-Specific Neural Architectures
We train binary text classifiers using convolutional neural networks and bidirectional
long short-term memory networks.
Convolutional Neural Network
For the bot detection task we use convolutional neural networks (CNNs) inspired
by Kim’s work [96], who showed how this architecture achieved good performance
in several sentence classification tasks, and other reports like [197] with equally
good results in NLP tasks. Our architecture uses 3 convolutional layers and a fully
connected layer. Each convolutional layer has 128 filters of size 5, ReLU was used as
the activation function, and max-pooling was applied in each layer. The fully connected
layer uses softmax as activation function to predict the probability of each message
being written by a bot or a human. All the experiments reported henceforth use a
vocabulary size of 20K tokens, sequence size 200, learning rate 0.001, 5 epochs, 128
batch size, static embeddings unless otherwise stated, and 10-fold cross-validation.
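A Keras sketch of this CNN classifier, using the hyperparameters stated above, could look as follows; the embedding dimension and the use of tf.keras are assumptions, since the original experiments may differ in implementation details.

```python
# A sketch of the CNN text classifier: 3 convolutional layers with 128 filters
# of size 5, ReLU, max-pooling, and a softmax output over bot vs. human.
import tensorflow as tf
from tensorflow.keras import layers

cnn = tf.keras.Sequential([
    layers.Embedding(input_dim=20000, output_dim=300),  # token ids -> vectors
    layers.Conv1D(128, 5, activation='relu'),
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 5, activation='relu'),
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 5, activation='relu'),
    layers.GlobalMaxPooling1D(),
    layers.Dense(2, activation='softmax'),  # P(bot), P(human)
])
cnn.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
            loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# cnn.fit(x_train, y_train, epochs=5, batch_size=128) on padded sequences of 200 token ids
```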
First we train the CNN classifier on our dataset using pre-trained embeddings and
compare them with randomly generated embeddings. In addition, we pre-process
our dataset using the same pre-processing script5 that was applied when learning
the GloVe Twitter embeddings.
Bidirectional Long Short-Term Memory Networks
In addition to CNNs we test long short-term memory (LSTM) networks [79], a
neural architecture that is also often used in NLP tasks [197]. LSTM networks are
sequential networks that are able to learn long-term dependencies. In our exper-
iments we use a bidirectional LSTM that processes the sequence of text forward
and backward to learn the model. The architecture of the BiLSTM comprises an
embedding layer, the BiLSTM layer with 50 processing cells, and a fully connected
layer that uses softmax as activation function to predict the probability of each
5 https://nlp.stanford.edu/projects/glove/pre-process-twitter.rb.
24 3 Understanding Word Embeddings and Language Models
message being written by a bot or a human. The rest of hyperparameters are set
with the same values that we use for the CNN experiments.
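A companion Keras sketch of the BiLSTM classifier, under the same assumptions about the embedding dimension, is shown below.

```python
# A sketch of the BiLSTM classifier: embedding layer, bidirectional LSTM with
# 50 units, and a softmax output over bot vs. human.
import tensorflow as tf
from tensorflow.keras import layers

bilstm = tf.keras.Sequential([
    layers.Embedding(input_dim=20000, output_dim=300),
    layers.Bidirectional(layers.LSTM(50)),
    layers.Dense(2, activation='softmax'),
])
bilstm.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
               loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```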
Pre-trained Language Models
We fine-tuned the following language models to carry out the bot detection
classification task: ULMFiT,6 Open AI GPT,7 and BERT.8 In all cases we use the
default hyperparameters:
• BERT base: 3 epochs, batch size of 32, and a learning rate of 2e−5
• Open AI GPT: 3 epochs, batch size of 8, and a learning rate of 6.25e−5
• ULMFiT: 2 epochs for the language model fine-tuning and 3 epochs for the
classifier, batch size of 32, and a variable learning rate.
At the end of this chapter we present a Jupyter notebook where we fine-tune
BERT for the bot detection task.
Evaluation results, presented in the figure below (Fig. 3.1), show that fine-tuning
language models yields overall better results than training specific neural architec-
tures that are fed with a mixture of: (1) pre-trained contextualized word embeddings
(ELMo), (2) pre-trained context-independent word embeddings learnt from Com-
mon Crawl (fastText), Twitter (GloVe), and Urban Dictionary (word2vec), plus
embeddings optimized by the neural network in the learning process.
Fine-tuning GPT on the non-pre-processed dataset yielded the best classifier in
terms of F-measure, followed by BERT. For these two approaches, the pre-processed
dataset yielded classifiers with a lower F-score, although overall these classifiers
were better than all the other tested approaches. ULMFiT [81], another pre-trained
language model approach, yielded the best classifier after GPT and BERT, after pre-
processing the dataset.
Next in the ranking appears a BiLSTM classifier learnt from the pre-processed
dataset that uses a concatenation of dynamically adjusted embeddings in the training
process plus contextualized embeddings generated by ELMo. Another version
of this classifier using in addition fastText embeddings from Common Crawl
performed slightly worse. The performance of CNN-based classifiers was lower,
in general. Similarly to BiLSTMs, the best classifier was learnt from trainable
embeddings and ELMo embeddings. Adding other pre-trained embeddings did not
improve performance. In general, pre-trained embeddings and their combinations
produced the least performing classifiers. Performance improves when pre-trained
6 https://docs.fast.ai/text.html#Fine-tuning-a-language-model.
7 https://github.com/tingkai-zhang/pytorch-openai-transformer_clas.
8 https://github.com/google-research/bert#fine-tuning-with-bert.
Fig. 3.1 Experiment results of the bot detection classification task. Metric reported: F-measure.
The asterisk in the classifier name indicates that the pre-processed version of the dataset was used
to learn or fine-tune the classifier. LM stands for language model.
Fine-tuning BERT requires incorporating just one additional output layer, so a
minimal number of parameters need to be learned from scratch. To fine-tune BERT
for a sequence classification task, the transformer output for the CLS token is used
as the sequence representation and connected to a one-layer feed-forward network
that predicts the classification labels. All the BERT parameters and the FF network
are fine-tuned jointly to maximize the log probability of the correct label.
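A sketch of this setup with the Hugging Face transformers library is shown below; the toy input, the label encoding, and the single backward step are illustrative assumptions, and the notebook relies on the library's own fine-tuning script instead.

```python
# A minimal sketch of BERT fine-tuning for sequence classification:
# BertForSequenceClassification adds one linear layer on top of the [CLS]
# representation, and all parameters are updated jointly.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                      num_labels=2)

inputs = tokenizer(['example tweet text'], padding=True, truncation=True,
                   return_tensors='pt')
labels = torch.tensor([1])  # illustrative encoding: 1 = bot, 0 = human

outputs = model(**inputs, labels=labels)
loss = outputs.loss   # cross-entropy over the two labels
loss.backward()       # gradients flow through the head and all of BERT
```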
In the notebook we provide a complete version of the dataset (large) and a reduced
one (small) to be able to run the notebook within the time frame, since fine-tuning
BERT on the large version takes more than 5 h on a regular GPU.
• Large: 500K train and 100K test labeled tweets at ./09_BERT/Large_Dataset/
• Small: 1K train and 100 test labeled tweets at ./09_BERT/Small_Dataset/
Let us download the datasets and the models from Google Drive and then
decompress the file.
The environment variable DATA_DIR holds the path to the dataset that is going
to be used on the rest of the notebook. The reader can choose the large or small
version.
The dataset is in the TSV format expected by the transformers library. We can use the
pandas library to load the data and visualize an excerpt of the dataset:
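A sketch of such a loading step is shown below; the train.tsv file name and the absence of a header row are assumptions about the dataset layout, and DATA_DIR is the environment variable defined above.

```python
# Load and inspect the training split of the bot detection dataset.
import os
import pandas as pd

train = pd.read_csv(os.path.join(os.environ['DATA_DIR'], 'train.tsv'),
                    sep='\t', header=None)
print(train.shape)
print(train.head())
```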
Running the following script, it is possible to fine-tune and evaluate the
model. During evaluation, the classification of the tweets in the test set is saved in
"predictions.txt," which we will use later.
The most relevant parameters of the script are:
• model type: the model that we are going to use, in this case BERT.
• model name or path: the name of the model or path storing a specific model.
• task name: the task that we want to perform, in this case CoLA because we want
to do classification.
• output dir: the directory in which it stores the fine-tuned model.
The reader can try to change the parameters and see how it affects performance.
This process is slow even though we reduced the size of the dataset. The expected
duration in its current configuration is around 1 min.
The binary classifier is evaluated using the MCC (Matthews correlation coefficient)
score. This score measures how accurately the algorithm performs on both positive
and negative predictions. MCC values range from −1 to 1, with 0 corresponding to
the random case, −1 being the worst value, and +1 the best value. With the small
dataset the expected MCC is 0.24. On the other hand, if
trained on the larger dataset, MCC will increase to around 0.70.
Let us compute the traditional evaluation metrics (accuracy, precision, recall, and
f-measure) of our fine-tuned model to see how it performs on the test set.
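A sketch of this computation with scikit-learn is shown below. How the gold labels and predictions.txt are actually formatted depends on the notebook, so the loading step here is only an assumption.

```python
# Sketch: compare gold test labels against the predictions saved during evaluation.
import pandas as pd
from sklearn.metrics import (accuracy_score, matthews_corrcoef,
                             precision_recall_fscore_support)

# Assumed formats: test tsv with the gold label in the second column,
# predictions.txt with one predicted label per line.
gold = pd.read_csv("./09_BERT/Small_Dataset/dev.tsv", sep="\t", header=None)[1].astype(int)
pred = pd.read_csv("./bot_detection_model/predictions.txt", header=None)[0].astype(int)

precision, recall, f1, _ = precision_recall_fscore_support(gold, pred, average="binary")
print("accuracy ", accuracy_score(gold, pred))
print("precision", precision, "recall", recall, "f1", f1)
print("MCC      ", matthews_corrcoef(gold, pred))
```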
The reader should see an accuracy of 62% and an f1-score of 62% which is a
good result considering the size of the dataset used in the notebook. The full model
fine-tuned against the complete dataset produces the following metrics:
We take some random examples from the test set and use the model to predict
whether each tweet was generated by a bot or not.
3.5 Conclusion
In this chapter we have seen that neural language models trained on large corpora
have overtaken traditional pre-trained word embeddings in NLP tasks. Language
models introduce a major shift in the way that word embeddings are used in
neural networks. Before them, pre-trained embeddings were used at the input of
task-specific models that were trained from scratch, requiring a large amount of
labeled data to achieve good learning results. Language models, on the other hand,
only require adjusting their internal representations through fine-tuning. The main
benefit is that the amount of data necessary to fine-tune the language model for a
specific task is considerably smaller than what is needed to train a task model from scratch.
In the remainder of this book we will leverage this observation, drilling down
on the application of language models in combination with structured knowledge
graphs across different NLP tasks. In Chap. 6 we will use language models to
derive concept-level embeddings from an annotated corpus and the WordNet graph.
In a different scenario, Chap. 10 will illustrate how to apply language models to
semantically compare sentences as part of a semantic search engine for fact-checked
claims and build a knowledge graph with them.
Chapter 4
Capturing Meaning from Text as Word Embeddings
Abstract This chapter provides a hands-on guide to learn word embeddings from
text corpora. To this purpose we choose Swivel, whose extension is the basis for the
Vecsigrafo algorithm, which will be described in Chap. 6. As introduced in Chap. 2,
word embedding algorithms like Swivel are not contextual, i.e. they do not provide
different representations for the different meanings a polysemous word may have.
As we will see in the subsequent chapters of the book, this can be addressed in a
variety of ways. For the purpose of this chapter, we focus on a basic way to represent
words using embeddings.
4.1 Introduction
1 https://github.com/tensorflow/models/tree/master/research/swivel.
2 https://colab.research.google.com/github/hybridnlp/tutorial/blob/master/
01_capturing_word_embeddings.ipynb.
First, let us download a corpus into our environment. We will use a small sample of
the UMBC corpus that has been pre-tokenized and that we have included as part of
our GitHub3 repository. We will clone the repository so we have access to it
from this environment.
The dataset comes as a zip file, so we unzip it by executing the following cell.
We also define a variable pointing to the corpus file:
The reader can inspect the file using the %less command to print the whole
input file at the bottom of the screen. It will be quicker to just print a few lines:
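The cells boil down to the following sketch in plain Python; the repository layout and corpus file name inside it are assumptions, so adjust the paths to whatever the cloned repository actually contains.

```python
# Sketch: clone the tutorial repository, unzip the sample corpus, and print a few lines.
import itertools
import subprocess
import zipfile

subprocess.run(["git", "clone", "https://github.com/hybridnlp/tutorial.git"], check=True)

corpus_zip = "tutorial/datasamples/umbc_tok.zip"     # hypothetical path inside the repo
with zipfile.ZipFile(corpus_zip) as zf:
    zf.extractall("tutorial/datasamples/")

corpus_path = "tutorial/datasamples/umbc_tok.txt"    # variable pointing to the corpus file
with open(corpus_path, encoding="utf-8") as f:
    for line in itertools.islice(f, 3):              # quicker than %less-ing the whole file
        print(line.rstrip())
```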
The output above shows that the input text has already been pre-processed. All
the words have been converted to lowercase (this will avoid having two separate
words for The and the) and punctuation marks have been separated from words.
This avoids failed tokenizations producing “words” such as “staff.” or “grimm,” as in the
example above, which would otherwise be added to our vocabulary.
3 https://github.com/hybridnlp/tutorial.
4 https://pypi.org/project/word2vec.
4.4 Generate Co-occurrence Matrix Using Swivel prep
We see that, first, the algorithm determined the vocabulary V, i.e. the list of
words for which an embedding will be generated. Since the corpus is fairly small, so
is the vocabulary, which consists of only about 5K words (large corpora can result
in vocabularies with millions of words).
The co-occurrence matrix is a sparse matrix of |V| × |V| elements. Swivel uses
shards to create submatrices of S × S elements, where S is the shard size specified above.
In this case, we have 100 submatrices.
5 https://github.com/stanfordnlp/GloVe.
6 https://fasttext.cc.
All this information is stored in the output folder we specified above. It consists
of 100 files, one per shard/submatrix and a few additional files:
The prep step does the following: (a) it uses a basic white-space tokenization
to get sequences of tokens; (b) in a first pass through the corpus, it counts all tokens
and keeps only those that have a minimum frequency (5) in the corpus, truncating
the result to a multiple of the shard_size; the tokens that are kept form the
vocabulary of size v = |V|; (c) in a second pass through the corpus, it uses a
sliding window to count co-occurrences between the focus token and the context
tokens (similar to word2vec). The result is a sparse co-occurrence matrix of size
v × v. For easier storage and manipulation, Swivel uses sharding to split the co-
occurrence matrix into submatrices of size s × s, where s is the shard_size. The
sharded co-occurrence submatrices are stored as protobuf7 files.
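As an illustration, the prep step can be launched roughly as follows; the flag names reflect the Swivel prep.py script as we remember it, so verify them against the version in your checkout.

```python
# Sketch: build the sharded co-occurrence matrix with Swivel's prep step.
import subprocess

corpus_path = "tutorial/datasamples/umbc_tok.txt"   # hypothetical corpus file from before
subprocess.run([
    "python", "swivel/prep.py",
    "--input", corpus_path,
    "--output_dir", "coocs",        # folder that will hold the protobuf shards
    "--shard_size", "512",          # S: the size of each submatrix
], check=True)
```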
With the sharded co-occurrence matrix it is now possible to learn embeddings. The
input is the folder with the co-occurrence matrix (protobuf files with the sparse
matrix); the submatrix_rows and submatrix_columns parameters need to be the same
size as the shard_size used in the prep step.
This should take a few minutes, depending on the machine. The result is a list of
files in the specified output folder, including checkpoints of the model and tsv files
for the column and row embeddings.
One thing missing from the output folder is a file with just the vocabulary, which
we will need later on. We copy this file from the folder with the co-occurrence
matrix.
The tsv files are easy to inspect, but they take too much space and they are slow to
load since we need to convert the different values to floats and pack them as vectors.
7 https://developers.google.com/protocol-buffers/.
4.6 Read Stored Binary Embeddings and Inspect Them
Swivel offers a utility to convert the tsv files into a binary format. At the same
time it combines the column and row embeddings into a single space (it simply adds
the two vectors for each word in the vocabulary).
This adds the vocab.txt and vecs.bin to the folder with the vectors, which
the reader can verify using the following command:
Swivel provides the vecs library which implements the basic Vecs class. It accepts
a vocab_file and a file for the binary serialization of the vectors (vecs.bin).
We can then load the existing vectors. We assume the reader managed to generate
the embeddings by following the tutorial up to this point. Note that, due to the random
initialization of weights during the training step, results may differ from the
ones presented below.
The cells above should display results similar to those in Table 4.1 (for words
california and conference).
Note that the vocabulary only has single-word expressions, i.e. compound words are
not present:
A common way to work around this issue is to use the average vector of the two
individual words (of course this only works if both words are in the vocabulary):
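The following sketch illustrates both the inspection and the averaging workaround; it assumes the swivel sources are importable and that the Vecs class exposes neighbors and lookup helpers as used in the notebook, which you should double-check against your copy of the code.

```python
# Sketch: load the binary embeddings, inspect neighbors, and approximate a
# compound ("united states") by averaging the vectors of its parts.
import numpy as np
from vecs import Vecs          # swivel/vecs.py, assumed to be on the Python path

vecs = Vecs("coocs/vocab.txt", "vecs/vecs.bin")   # hypothetical output locations

print(vecs.neighbors("california")[:5])           # nearest neighbors of a single word

parts = [vecs.lookup("united"), vecs.lookup("states")]
if all(p is not None for p in parts):
    avg = sum(parts) / len(parts)                 # stand-in vector for "united states"
    other = vecs.lookup("america")
    if other is not None:
        cos = float(np.dot(avg, other) / (np.linalg.norm(avg) * np.linalg.norm(other)))
        print("cosine(united+states, america) =", round(cos, 3))
```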
The reader can try generating new embeddings using a small Gutenberg corpus
that is provided as part of the NLTK library. It consists of a few public-domain
works published as part of the Project Gutenberg.8
First, we download the dataset into our environment:
8 https://www.gutenberg.org.
As can be seen, the corpus consists of various books, one per file. Most word2vec
implementations require the corpus to be passed as a single text file. We can issue a few
commands to do this by concatenating all the txt files in the folder into a single
all.txt file, which we will use later on.
A couple of the files are encoded using iso-8859-1 or binary encodings, which
will cause trouble later on, so we rename them to avoid including them in our
corpus.
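One way to do all of this from Python is sketched below; it relies on the NLTK copy of the Gutenberg corpus and simply skips files that fail to decode, which is one way of handling the problematic encodings mentioned above.

```python
# Sketch: fetch the NLTK Gutenberg sample and concatenate the books into all.txt.
import nltk
from nltk.corpus import gutenberg

nltk.download("gutenberg")

with open("all.txt", "w", encoding="utf-8") as out:
    for fileid in gutenberg.fileids():            # one file per book
        try:
            out.write(gutenberg.raw(fileid) + "\n")
        except UnicodeDecodeError:                # skip problematic encodings
            continue
```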
Run the steps described above to generate embeddings for the Gutenberg dataset.
Use methods similar to the ones shown above to get a feeling for whether the
generated embeddings have captured interesting relations between words.
4.8 Conclusion
If you followed the instructions in this chapter (or executed the accompanying
notebook) you should be able to apply Swivel to learn word embeddings from
any text corpus. You should also now be able to load the embeddings and explore
the learned vector space. Based on the explorations described in this chapter, you
will have seen that the learned embeddings seem to have captured some semantic
similarity notions for the words. In the following chapters you will see how to
learn embeddings from knowledge graphs as well as how to modify Swivel to learn
not only word embeddings, but also embeddings for the concepts associated with
individual words. In Chap. 7 you will also learn more principled ways to measure
how good the learned embeddings are.
Chapter 5
Capturing Knowledge Graph Embeddings
5.1 Introduction
Word embeddings aim at capturing the meaning of words based on very large
corpora; however, there are decades of experience and approaches that have tried
to capture this meaning by structuring knowledge into semantic nets, ontologies,
and graphs. Table 5.1 provides a high-level overview of how neural and symbolic
approaches address the different dimensions involved in capturing meaning.
Table 5.1 Capturing different dimensions of meaning through neural and symbolic approaches

Dimension        | Neural                          | Symbolic
Representation   | Vectors                         | Symbols (URIs)
Input            | Large corpora                   | Human editors (knowledge engineers)
Interpretability | Linked to model and data splits | Requires understanding of the schema
Alignability     | Parallel (annotated) corpora    | Heuristics and manual
Compositionality | Combine vectors                 | Merge graphs
Extensibility    | Fixed vocabulary, word pieces   | Node interlinking
Certainty        | Probability distribution        | Exact
Debugability     | Fix training data               | Edit graph
In recent years, many new approaches have been proposed to derive neural
representations for existing knowledge graphs. Think of this as trying to capture
the knowledge encoded in the KG in a form that is easier to use in deep learning
models.
• TransE tries to assign an embedding to nodes and relations so that h + r is close
to t, where h and t are nodes in the graph and r is an edge (see the scoring sketch
after this list). In the RDF world, this is simply an RDF triple, where h is the
subject, r is the property, and t is the object of the triple.
• HolE is a variant of TransE, but uses a different operator (circular correlation) to
represent pairs of entities.
• RDF2Vec applies word2vec to random walks on an RDF graph (essentially paths
or sequences of nodes in the graph).
• Graph convolutions apply convolutional operations on graphs to learn the
embeddings.
• Neural message passing merges two strands of research on KG embeddings:
recurrent and convolutional approaches.
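To make the TransE intuition concrete, the minimal sketch below scores a triple by the distance between h + r and t. It only illustrates the scoring function, not the training procedure, and the toy vectors are random rather than learned.

```python
# Minimal TransE-style scoring: a triple (h, r, t) is plausible when h + r ≈ t.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
entity_emb = {"madrid": rng.normal(size=dim), "spain": rng.normal(size=dim)}
relation_emb = {"capital_of": rng.normal(size=dim)}

def transe_score(h, r, t):
    """Negative L2 distance between h + r and t; higher means more plausible."""
    return -float(np.linalg.norm(entity_emb[h] + relation_emb[r] - entity_emb[t]))

print(transe_score("madrid", "capital_of", "spain"))
```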
For additional information, Nickel et al. [129] compile a review of relational
machine learning methods for knowledge graphs, while Nguyen [125] provides an overview
of embedding models of entities and relations for knowledge base completion.
There are several useful libraries for training many of the existing KGE
algorithms. Next, we enumerate some of the most popular ones in Python:
5.3 Creating Embeddings for WordNet
In this section, we go through the steps of generating word and concept embeddings
using WordNet, a lexico-semantic knowledge graph.
1. Choose (or implement) a KG embedding algorithm.
2. Convert the KG into the format required by the KG embedding algorithm.
3. Train the model.
4. Evaluate/inspect results.
1 https://github.com/Sujit-O/pykg2vec.
2 https://github.com/tensorflow/tensorflow.
3 https://github.com/facebookresearch/PyTorch-BigGraph.
4 https://github.com/pytorch/pytorch.
5 https://github.com/dmlc/dgl.
6 https://github.com/apache/incubator-mxnet.
7 https://github.com/Accenture/AmpliGraph.
8 https://github.com/mnick/holographic-embeddings.
Executing the previous cell should produce a lot of output as the project is built.
Towards the end you should see something like:
which we can install on the local environment by using pip, the Python package
manager.
Now that skge is installed in this environment, we are ready to clone the
holographic-embeddings repository, which will enable us to train HolE embed-
dings.
If you want, you can browse the contents of this repo on GitHub or execute the
following to see how you can start training embeddings for the wn18 knowledge
graph, a WordNet 3.0 subset in which synsets appearing in fewer than 15 triples and
relation types appearing in fewer than 5000 triples are filtered out [24]. In the
following sections we will go into more detail about how to train embeddings, so
there is no need to actually execute this training just yet.
9 https://github.com/mnick/scikit-kge.
You should see a section on the bottom of the screen with the contents of the
run_hole_wn18.sh file. The main execution is:
This shows that WordNet wn18 has been represented as a graph of 40943 nodes
(which we assume correspond to the synsets) interlinked using 18 relation types.
The full set of relations has been split into 141K triples for training and 5K triples
each for testing and validation.
5.3.2.2 Converting WordNet 3.0 into the Required Input Format from Scratch
It will be useful to have experience converting your KG into the required input
format. Hence, rather than simply reusing the wn18.bin input file, we will
generate our own directly from the NLTK WordNet API.10
First we need to download WordNet:
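With NLTK this is a single download call:

```python
import nltk

nltk.download("wordnet")   # fetches the WordNet data used by nltk.corpus.wordnet
```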
Now that we have the KG, we can use the WordNet API to explore the graph. Refer
to the how-to document for a more in-depth overview; here we only show a few
methods that will be needed to generate our input file.
The main nodes in WordNet are called synsets (synonym sets). These correspond
roughly to concepts. You can find all the synsets related to a word like this:
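For example (the query word here is our own choice):

```python
from nltk.corpus import wordnet as wn

print(wn.synsets("california"))   # all synsets that have "california" among their lemmas
```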
The output from the cell above shows how synsets are identified by the NLTK
WordNet API. They have the form lemma.pos.nn, e.g. Synset('california.n.01').
As far as we are aware, this is a format chosen by the implementors of the NLTK
WordNet API and other APIs may choose diverging ways to refer to synsets. You
can get a list of all the synsets as follows (we only show the first 5):
Similarly, you can also get a list of all the lemma names (again we only show 5):
For a given synset, you can find related synsets or lemmas by calling the
functions for each relation type. Below we provide a few examples for the
first sense of the adjective adaxial. In the first example, we see that this synset belongs
to the topic domain biology.n.01, which is again a synset. In the second
example, we see that it has two lemmas, which are relative to the synset. In the
third example, we retrieve the lemmas in a form that is not relative to the synset,
which is the one we will use later on.
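The three lookups can be reproduced roughly as follows; selecting the first synset returned for "adaxial" is our assumption about how the notebook picks the sense.

```python
from nltk.corpus import wordnet as wn

adaxial = wn.synsets("adaxial")[0]   # first sense of the adjective "adaxial"

print(adaxial.topic_domains())       # e.g. [Synset('biology.n.01')]
print(adaxial.lemmas())              # Lemma objects, expressed relative to the synset
print(adaxial.lemma_names())         # plain lemma strings, not tied to the synset
```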
The main nodes in WordNet are the synsets; however, lemmas can also be
considered to be nodes in the graph. Hence, you need to decide which nodes to
include. Since we are interested in capturing as much information as can be provided
by WordNet, we will include both synsets and lemmas.
WordNet defines a large number of relations between synsets and lemmas. Again,
you can decide to include all or just some of these. One particularity of WordNet is
that many relations are defined twice: e.g. hypernym and hyponym are the exact
same relation, but in reverse order. Since this is not really providing additional
information, we only include such relations once. The following cell defines all the
relations we will be taking into account. We represent these as Python dictionaries,
where the keys are the name of the relation and the values are functions that accept
a head entity and produce a list of tail entities for that specific relation:
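A sketch of such a dictionary is shown below. The particular selection of relations is illustrative rather than the notebook's exact list, but each entry follows the described pattern: a relation name mapped to a function from a head synset to its list of tails.

```python
from nltk.corpus import wordnet as wn

# Relations between synsets: name -> function(head synset) -> list of tail synsets.
synset_relations = {
    "hypernym": lambda s: s.hypernyms(),
    "part_meronym": lambda s: s.part_meronyms(),
    "member_meronym": lambda s: s.member_meronyms(),
    "entailment": lambda s: s.entailments(),
    "similar_to": lambda s: s.similar_tos(),
    "topic_domain": lambda s: s.topic_domains(),
}

# Relation linking a synset to its lemma names, which we also treat as graph nodes.
lemma_relations = {
    "lemma": lambda s: s.lemma_names(),
}
```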
Triple Generation
We are now ready to generate triples by using the WordNet API. Recall that skge
requires triples of the form (head_id, tail_id, rel_id); hence, we will
need to have some way of mapping entity (synset and lemma) names and relations
types to unique ids. We therefore assume we will have an entity_id_map and
a rel_id_map, which will map the entity name (or relation type) to an id. The
following two cells implement functions which will iterate through the synsets and
relations to generate the triples:
Now that we have methods for generating lists of triples, we can generate the input
dictionary and serialize it. We need to:
1. create our lists of entities and relations,
2. derive a map from entity and relation names to ids,
3. generate the triples,
4. split the triples into training, validation, and test subsets, and
5. write the Python dict to a serialized file.
We implement this in the following method:
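A condensed sketch of these five steps is shown below. The dictionary keys (entities, relations, train_subs, valid_subs, test_subs) are an assumption based on the wn18.bin file used earlier, and the relation dictionaries come from the previous sketch, so treat the whole block as illustrative rather than as the notebook's exact implementation.

```python
# Sketch: generate (head_id, tail_id, rel_id) triples from WordNet and pickle them
# in a wn18-style dictionary that skge can consume. Key names are assumptions.
import pickle
import random
from nltk.corpus import wordnet as wn

synset_relations = {"hypernym": lambda s: s.hypernyms()}   # extend with the dict above
lemma_relations = {"lemma": lambda s: s.lemma_names()}

def build_dataset(out_path="wn30.bin", valid_frac=0.02, test_frac=0.02):
    synsets = list(wn.all_synsets())
    lemmas = sorted({l for s in synsets for l in s.lemma_names()})
    entities = [s.name() for s in synsets] + lemmas
    relations = list(synset_relations) + list(lemma_relations)
    entity_id = {e: i for i, e in enumerate(entities)}
    rel_id = {r: i for i, r in enumerate(relations)}

    triples = []
    for s in synsets:
        for rel, fn in synset_relations.items():
            triples += [(entity_id[s.name()], entity_id[t.name()], rel_id[rel]) for t in fn(s)]
        for rel, fn in lemma_relations.items():
            triples += [(entity_id[s.name()], entity_id[t], rel_id[rel]) for t in fn(s)]

    random.shuffle(triples)
    n_valid, n_test = int(len(triples) * valid_frac), int(len(triples) * test_frac)
    data = {"entities": entities, "relations": relations,
            "valid_subs": triples[:n_valid],
            "test_subs": triples[n_valid:n_valid + n_test],
            "train_subs": triples[n_valid + n_test:]}
    with open(out_path, "wb") as f:
        pickle.dump(data, f)

build_dataset()
```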
Generate wn30.bin
Now we are ready to generate the wn30.bin file which we can feed to the HolE
algorithm implementation.
Notice that the resulting dataset now contains 265K entities, compared to 41K in
wn18 (to be fair, only 118K of the entities are synsets).
Now, we will use our WordNet 3.0 dataset to learn embeddings for both synsets and
lemmas. Since this is fairly slow, we only train for 2 epochs, which can take up to
10 min. (In the exercises at the end of this chapter, we provide a link to download
pre-computed embeddings which have been trained for 500 epochs.)
Now that we have trained the model, we can retrieve the embeddings for the entities
and inspect them.
The output file is again a pickled serialization of a Python dictionary. It contains the
model itself and results for the test and validation runs as well as execution times.
Unfortunately, skge does not provide methods for exploring the embedding
space (KG embedding libraries are more geared towards prediction of relations).
Therefore, we will convert the embeddings into an easier to explore format. We first
convert them into a pair of files for the vectors and the vocabulary and we will then
use the swivel library to explore the results.
First, we read the list of entities, which will be our vocabulary, i.e. the names of
synsets and lemmas for which we have embeddings.
Next, we generate a vocab file and a tsv file where each line contains the word
and a list of d numbers.
Now that we have these files, we can use swivel, which we used in the first
notebook to inspect the embeddings. First, download the tutorial materials and
swivel if necessary, although you may already have it on your environment if
you previously executed the first notebook of this tutorial.
Next, we can load the vectors using swivel’s Vecs class, which provides easy
inspection of neighbors.
As you can see, the embeddings do not look very good at the moment. In part this
is due to the fact that we only trained the model for 2 epochs. We have pre-calculated
a set of HolE embeddings trained for 500 epochs, which you can download and inspect as
part of an optional exercise below. Results for these are much better:
One thing to notice here is that all of the top 10 closely related entities for
california.n.01 are also synsets. Similarly for lemma california, the
most closely related entities are also lemmas, although some synsets also made it
into the top 10 neighbors. This may indicate a tendency of HolE to keep lemmas
close to other lemmas and synsets close to other synsets. In general, choices about
how nodes in the KG are related will affect how their embeddings are interrelated.
5.4 Exercises
If you have a KG of your own, you can adapt the code shown above to generate a
graph representation as expected by skge and you can train your embeddings in
this way. Popular KGs are Freebase and DBpedia.
We have used code similar to the one shown above to train embeddings for 500
epochs using HolE. You can execute the following cells to download and explore
these embeddings. The embeddings are about 142MB, so downloading them may
take a few minutes.
The downloaded tar contains a tsv.bin and a vocab file like the one we
created above. We can use it to load the vectors using swivel’s Vecs:
Now you are ready to start exploring. The only thing to notice is that we have
added the prefix lem_ to all lemmas and wn31_ to all synsets, as shown in the
following examples:
5.5 Conclusion
If you followed the instructions in this chapter, you will have trained a model with
concept and word embeddings derived from WordNet, a well-known knowledge
graph that encodes knowledge about words and their senses. You will also have
learned about the pre-processing needed to select which parts of the KG you want
to use to train embeddings, as well as how to export the graph into a format that
most KG embedding algorithms expect. Finally, you will have explored the learned
embeddings and briefly seen how they compare to word embeddings learned from
text corpora in the previous chapter. The main advantage of KG embeddings is that
they already encode knowledge about the desired conceptual level. On the other
hand, the main issue with KG embeddings is that you need a KG to be available
for your domain of choice. In the next chapter we see that it is also possible to learn
concept embeddings without a KG, by modifying how word embeddings are learned
from text.
Part II
Combining Neural Architectures and Knowledge Graphs
Chapter 6
Building Hybrid Representations from Text Corpora, Knowledge Graphs, and Language Models
6.1 Introduction
In the previous chapters we saw models which were capable of learning word
embeddings from text or learning concept embeddings from knowledge graphs.
In this chapter we look at hybrid approaches that aim to merge the best of both
worlds. In the first half of the chapter we introduce and provide hands-on experience
with Vecsigrafo, an extension of the Swivel algorithm to learn word and concept
embeddings based on a disambiguated text corpus. We start by introducing the
required terminology and notation in Sect. 6.2. Then we provide a conceptual
intuition about how Vecsigrafo works and continue with a formal definition of the
algorithm. We also describe how Vecsigrafo is implemented (Sect. 6.4) and provide
practical Sects. 6.5 and 6.6 to learn embeddings from a sample corpus and explore
the results. In the second half of this chapter, in Sect. 6.7, we take a step further
and show how to apply transformers and neural language models to generate an
analogous representation of Vecsigrafo, called Transigrafo.
Let T be the set of tokens that can occur in text after some tokenization is applied;
this means tokens may include words (“running”), punctuation marks (“;”), multi-
word expressions (“United States of America”), or combinations of words with
punctuation marks (“However,”, “–”). Let L be the set of lemmas: base forms of
words (i.e., without morphological or conjugational variations). Note that L ⊂ T .1
We also use the term lexical entry—or simply word—to refer to a token or a
lemma. Let C be the set of concept identifiers in some knowledge graph, we use the
term semantic entry—or simply concept—to refer to elements in C.
Let V ⊂ T ∪ C be the set of lexical and semantic entries for which we want to
derive embeddings, also called the vocabulary. A corpus is a sequence of tokens
t_i ∈ T; we follow and extend the definition of context around a token (used in,
e.g., word2vec, GloVe, and Swivel) as a W-sized sliding window over the sequence
of tokens. Therefore we say that tokens t_{i−W}, ..., t_{i−1}, t_{i+1}, ..., t_{i+W} are in the
context of center token t_i at position i in the corpus.
Each context can be represented as a collection of center-context pairs of the
form (t_i, t_j), where t_i ∈ T and t_j ∈ T. We extend this to take into account lemmas
and concepts: let D be the collection of center-context entry pairs (x_i, x_j) observed
in a corpus, where x_i ∈ V and x_j ∈ V.2 We use the notation #(x_i, x_j) to refer to the
number of times the center entry x_i co-occurred with context entry x_j in D. We also
define p(x_i, x_j) as the set of positions in the corpus where x_i is the center word and
x_j is a context word. Similarly, #(x) is the number of times entry x occurred in D as
a center word.
Finally, let d be the dimension of the vector space, so that each entry x in
V has two corresponding vectors x_C, x_F ∈ R^d, which correspond to the vector
representation of x as a context or as a center entry, respectively.
So as to build models that use both bottom-up (corpus-based) embeddings and top-
down structured knowledge (in a graph), we generate embeddings that share the
same vocabulary as the knowledge graphs. This means generating embeddings for
knowledge items represented in the knowledge graph such as concepts and surface
forms (words and expressions) associated with the concepts in it. In RDF, this would
typically mean values for rdfs:label properties, or words and expressions
1 We assume lemmatization correctly strips away punctuation marks (e.g., the lemma of “However,” is
however).
2 In general, different vocabularies could be used for the center and context entries; however, here we assume both vocabularies are equal; hence, we do not
make a distinction.
3 https://www.w3.org/2016/05/ontolex.
4 Sensigrafo, Expert System’s knowledge graph: https://www.expertsystem.com/products/cogito-
cognitive-technology/semantic-technology/knowledge-graph.
context window [103], whereby co-occurrence counts are weighted according to the
distance between the center and the context entry using the harmonic function. More
formally, we use
$$\#_\delta(x_i, x_j) \;=\; \sum_{c \in p(x_i, x_j)} \frac{W - \delta_c(x_i, x_j) + 1}{W} \qquad (6.1)$$
where $\delta_c(x_i, x_j)$ is the distance, in token positions, between the center entry $x_i$ and
the context entry $x_j$ in a particular context at position $c$ in the corpus, and $W$ is the
window size as presented in Sect. 6.2.
In standard word embedding algorithms, there is only one sequence of tokens;
hence $1 \le \delta_c(x_i, x_j) \le W$. In our case we have three aligned sequences: tokens,
lemmas, and concepts. Hence $\delta_c(x_i, x_j)$ may also be 0, e.g. when $x_i$ is a lemma and
$x_j$ is its disambiguated concept. Therefore, in this work, we use a slightly modified
version:
$$\delta_c(x_i, x_j) = \begin{cases} \delta_c(x_i, x_j) & \text{if } \delta_c(x_i, x_j) > 0 \\ 1 & \text{if } \delta_c(x_i, x_j) = 0 \end{cases} \qquad (6.2)$$
This gives us $\#_\delta(x_i, x_j)$ and the corresponding co-occurrence matrix $M$, to which we apply
the training phase of a slightly modified version of the Swivel algorithm to learn the
embeddings for the vocabulary. The original Swivel loss function is given by
$$L_S = \begin{cases} L_1 & \text{if } \#(x_i, x_j) > 0 \\ L_0 & \text{if } \#(x_i, x_j) = 0 \end{cases}$$
where
$$L_1 = \frac{1}{2}\,\#(x_i, x_j)^{1/2} \left( x_i^{\top} x_j - \log \frac{\#(x_i, x_j)\,|D|}{\#(x_i)\,\#(x_j)} \right)^{2}$$
$$L_0 = \log\!\left[ 1 + \exp\!\left( x_i^{\top} x_j - \log \frac{|D|}{\#(x_i)\,\#(x_j)} \right) \right]$$
Our modifications include using $\#_\delta(x_i, x_j)$ instead of the default co-occurrence count,
and the addition of a vector regularization term as suggested by Duong et al. [47]
(their Eq. 3), which aims to reduce the distance between the column and row (i.e.,
center and context) vectors for all vocabulary elements, i.e.
$$L = L_S + \gamma \sum_{x \in V} \left\lVert x_F - x_C \right\rVert_2^2 \qquad (6.3)$$
This modification is useful for our purposes, since the row and column vocabu-
laries are the same, whereas in the general case the sum or the average of both
vectors would usually be used to produce the final embeddings.
6.4 Implementation
Although any knowledge graph, natural language processing toolkit, and WSD
algorithm can be used to build a Vecsigrafo, our original implementation used
Expert System’s proprietary Cogito5 pipeline to tokenize and disambiguate the
corpora. Cogito is based on a knowledge graph called Sensigrafo, which is similar
to WordNet, but larger and tightly coupled to the Cogito disambiguator (i.e., the
disambiguator uses intricate heuristic rules based on lexical, domain, and semantic
rules to do its job). Sensigrafo contains about 400K lemmas and 300K concepts
(called syncons in Sensigrafo) interlinked via 61 relation types, which render almost
3M links. Sensigrafo also provides a glossa—a human readable textual definition—
for each concept, which is only intended for facilitating the inspection and curation
of the knowledge graph.
As part of the work that led to the creation of the first Vecsigrafo, we studied
the effect of applying alternative disambiguation algorithms6 and compared them
with Cogito. We implemented three disambiguation methods: (1) the shallow con-
nectivity disambiguation (scd) algorithm introduced by [112], which essentially
chooses the candidate concepts that are better connected to other candidate concepts
according to the underlying knowledge graph; (2) the most frequent disambiguation
(mostfreqd), which chooses the most frequent concept associated with each
lemma encountered in the corpus; and (3) the random concept candidate disam-
biguation (rndd), which selects a random concept for each lemma encountered in
the corpus. Note that rndd is not completely random, it still assigns a plausible
concept to each lemma, since the choice is made out of the set of all concepts
associated with the lemma.
To implement our approach, we also extended the matrix construction phase of
the Swivel [164] algorithm7 to generate a co-occurrence matrix which can include
both lexical and semantic entries as part of the vocabulary. Table 6.1 provides
an example of different tokenizations and disambiguations for a context window
derived from the same original text. To get a feeling for concepts and the effect of
different disambiguators, notice that Cogito assigns concept #82073 (with glossa
appropriate for a condition or occasion and synonyms suitable, right) to “proper,”
while scd and mostfreqd have assigned concept #91937 (with glossa marked
by suitability or rightness or appropriateness and synonyms kosher). The rndd
disambiguation has assigned an incorrect concept #189906 from mathematics (with
glossa distinguished from a weaker relation by excluding. . .).
Table 6.1 also introduces the notation we will use throughout the remainder of the
book to identify embedding variations. We will use t to refer to plain text tokens and
assume Cogito-based tokenization; if a different tokenization is meant, we will add a
suffix as in the table to show that Swivel tokenization has been used. Similarly, we
Table 6.1 Example tokenizations for the first window of size W = 3 for sentence: “With regard
to enforcement, proper agreements must also be concluded with the Eastern European countries...”
t_f With regard to enforcement proper agreements also concluded eastern european
l_f With regard to enforcement proper agreement also conclude eastern European
s_f en#216081 en#4652 en#82073 en#191513 en#192047 en#150286 en#85866 en#98025
g_f PRE NOU ADJ NOU ADV VERB ADJ ADJ
First we show standard Swivel tokenization; next, the standard Cogito tokenization with sequences for plain
text, lemmas, syncons, and grammar type; next, alternative disambiguation syncons for the same tokenization.
Finally, we show the Cogito tokenization after applying filtering.
use l to refer to lemmas and s to refer to concept identifiers (we assume syncons
since most of our experiments use Sensigrafo, although in some cases this may
refer to other concept identifiers in other knowledge graphs such as BabelNet). As
described above, the source sequences may be combined, which in this book means
combinations ts or ls. Finally, we use the suffix _f to show that the original sequence
was filtered based on grammar type information as described above.
We use a transcribed Jupyter notebook to illustrate with real code snippets how
to actually generate a Vecsigrafo based on a subset of the UMBC corpus.8 The
notebook, which follows the procedure described in Sect. 6.3 and depicted in
Fig. 6.1, is available and executable online.9 In addition to reading this section,
we encourage the reader to run the live notebook and do the exercises it includes in
order to better understand what we discuss in the following lines:
As described in Sect. 6.4, the main difference with standard Swivel is that we use
word-sense disambiguation as a pre-processing step to identify the lemmas and
8 https://ebiquity.umbc.edu/resource/html/id/351/UMBC-webbase-corpus.
9 https://colab.research.google.com/github/hybridnlp/tutorial/blob/master/03_vecsigrafo.ipynb.
concepts entailed in the text, while Swivel simply uses white-space tokenization.
Therefore, each “token” in the resulting sequences is composed of a lemma and an
optional concept identifier.
6.5.1.1 Disambiguators
6.5.1.2 Tokenizations
When applying a disambiguator, the tokens are no longer (groups of) words. Each
token can contain different types of information. We generally keep the following
token information:
• t: text, the original text (possibly normalized, i.e., lowercased).
• l: lemma, the lemma form of the word, without morphological or conjugational
information.
• g: grammar: the grammar type.
• s: syncon (or synset in the case of WordNet) identifier.
10 https://www.expertsystem.com/products/cogito-cognitive-technology/semantic-technology/
disambiguation.
11 https://github.com/hybridnlp/tutorial.
the Vecsigrafo embeddings. Execute the following cell to clone the repository, unzip
the sample corpus, and print its first line:
You should see, among others, the first line in the corpus, starting with:
The previous line shows the format we are using to represent the tokenized
corpus. We use white space to separate the tokens and have URL-encoded each
token to avoid mixing up tokens. Since this format is hard to read, we provide a
library to inspect the lines in an easy manner. Execute the following cell to display
the first line in the corpus as a table, as shown below:
We filter some of the words and only keep the lemmas and the syncon ids and
encode them into the next sequence of disambiguated tokens:
For the standard Swivel prep, we can simply call prep using the !python com-
mand. In this case we have the toked_corpus which contains the disambiguated
sequences as shown above. The output will be a set of sharded co-occurrence
submatrices as explained in the notebook for creating word vectors.
We set the shard_size to 512 since the corpus is quite small. For larger
corpora we could use the standard value of 4096.
For the joint-subtoken prep step, next we describe the steps that need to be
executed to implement a similar pipeline. Note that we use a Java implementation
that is not open-source yet, as it is still tied to proprietary code. Currently we are
working on refactoring the code so that Cogito subtokens are just a special case.
Until then, in our GitHub repository we provide pre-computed co-occurrence files.
First, we ran our implementation of subtoken prep on the corpus. Please note:
• We are only including lemma and synset information, i.e., we are not including
plain text and grammar information.
• Furthermore, we are filtering the corpus by: (1) removing any tokens related to
punctuation marks (PNT), auxiliary verbs (AUX), and articles (ART), since we
think these do not contribute much to the semantics of words; (2) replacing tokens
with grammar types ENT (entities) and NPH (proper names) with generic variants
grammar#ENT and grammar#NPH, respectively.
The rationale of the second point is that, depending on the input corpus, names
of people or organizations may appear a few times, but may be filtered out if they
do not appear enough times. This ensures such tokens are kept in the vocabulary
and contribute to the embeddings of words nearby. The main disadvantage is that
we will not have some proper names in our final vocabulary.
We have included the output of this process as part of the GitHub repository.
Next, we unzip this folder to inspect the results:
The previous cell extracts the pre-computed co-occurrence shards and defines a
variable precomp_coocs_path that points to the folder where these shards are
stored.
Next, we print the first 10 elements of the vocabulary to see the format that we
are using to represent the lemmas and synsets:
As the output above shows, the vocabulary we get with subtoken prep is
smaller (5.6K elements instead of over 8K) and it contains individual lemmas and
synsets (it also contains special elements grammar#ENT and grammar#NPH, as
described above). More importantly, the co-occurrence counts reflect
the fact that certain lemmas co-occur more frequently with certain other lemmas
and synsets, which should be taken into account when learning embedding repre-
sentations.
With the sharded co-occurrence matrices created in the previous section it is now
possible to learn embeddings by calling the swivel.py script. This launches
a TensorFlow application based on various parameters (most of which are self-
explanatory):
• input_base_path: The folder with the co-occurrence matrix (protobuf files
with the sparse matrix) generated above.
• submatrix_rows and submatrix_columns need to be the same size as the
shard_size used in the prep step.
• num_epochs: The number of times to go through the input data (all the co-
occurrences in the shards). We have found that for large corpora, the learning
algorithm converges after a few epochs, while for smaller corpora you need a
larger number of epochs.
Execute the following cell to generate embeddings for the pre-computed co-
occurrences:
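A hypothetical invocation mirroring these parameters is sketched below; the flag names follow the swivel.py script as described above, but double-check them against the copy of Swivel in your environment.

```python
# Sketch: learn Vecsigrafo embeddings from the pre-computed co-occurrence shards.
import subprocess

subprocess.run([
    "python", "swivel/swivel.py",
    "--input_base_path", "precomp_umbc_coocs",    # folder with the protobuf shards
    "--output_base_path", "vecsigrafo_vecs",      # checkpoints and tsv embeddings go here
    "--submatrix_rows", "512",                    # must match the shard_size used in prep
    "--submatrix_cols", "512",
    "--num_epochs", "40",                         # small corpora need more epochs
], check=True)
```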
This will take a few minutes, depending on your machine. The result is a list of
files in the specified output folder, including:
• The TensorFlow graph, which defines the architecture of the model being trained.
As we have seen in previous notebooks, the tsv files are easy to inspect, but they
take too much space and they are slow to load since we need to convert the different
values to floats and pack them as vectors. Swivel offers a utility to convert the tsv
files into a binary format. At the same time, it combines the column and row
embeddings into a single space, simply adding the two vectors for each word in
the vocabulary.
This adds the vocab.txt and vecs.bin to the folder with the vectors:
As in previous notebooks, we can now use Swivel to inspect the vectors using the
Vecs class. It accepts a vocab_file and a file for the binary serialization of the
vectors (vecs.bin).
Now we can load existing vectors. In this example we load some pre-computed
embeddings, but feel free to use the embeddings you computed by following the
steps above. Note, however, that due to the random initialization of weights during the
training step, your results may differ.
Next, let us define a basic method for printing the k-nearest neighbors for a given
word and use such method on a few lemmas and synsets in the vocabulary:
Note that using the Vecsigrafo approach gets us very different results than when
using standard Swivel. The results now include concepts (synsets), besides just
words. Without further information, this makes interpreting the results harder since
we now only have the concept identifier. However, we can search for these concepts
in the underlying knowledge graph (WordNet in this case) to explore the semantic
network and get further information.
Of course, the results produced in this example may not be very good since these
have been derived from a very small corpus (5K lines from UMBC). In the exercise
below, we encourage you to download and inspect pre-computed embeddings based
on the full UMBC corpus.
6.6 Exercise: Explore a Pre-computed Vecsigrafo
The data only includes the tsv version of the vectors, so we need to convert these
to the binary format that Swivel uses. And for that, we also need a vocab.txt
file, which we can derive from the tsv as follows:
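Deriving the vocabulary amounts to keeping the first column of the tsv, as in the following sketch (the file names are placeholders for the downloaded embeddings):

```python
# Sketch: extract the vocabulary (first tsv column) into vocab.txt.
tsv_path = "vecsigrafo_umbc/vecs.tsv"      # hypothetical name of the downloaded tsv
vocab_path = "vecsigrafo_umbc/vocab.txt"

with open(tsv_path, encoding="utf-8") as tsv, \
     open(vocab_path, "w", encoding="utf-8") as vocab:
    for line in tsv:
        vocab.write(line.split("\t", 1)[0] + "\n")
```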
As we can see, the embeddings have a vocabulary of just under 1.5M entries,
56K of which are synsets and most of the rest are lemmas. Next, we convert the
tsv into Swivel’s binary format. This can take a couple of minutes.
Fig. 6.3 Pipeline of the Language Modeling Makes Sense Algorithm (LMMS)
12 SemCor is manually disambiguated against the WordNet knowledge graph. Other lexico-
semantic graphs like Sensigrafo are equally valid, as long as the corpus is also disambiguated
against them. General-purpose graphs like DBpedia or domain-specific ones could also be used by
minimally adapting the algorithm.
extend the coverage provided by such embeddings to the senses in the graph that do
not appear in SemCor, and (3) evaluate the resulting sense embeddings. Subsequent
stages of the LMMS algorithm focus on leveraging WordNet glosses and lemma
information to optimize the quality of the resulting sense embeddings.
We also introduce some extensions to the original LMMS algorithm, which are
useful to produce a Transigrafo. Such extensions include the following:
• We added a transformers back-end based on the Hugging Face transformers
library13 in order to enable experimentation with transformer architectures other
than BERT, such as XLNet, XLM, and RoBERTa. Besides the obvious advantage
of model independence, this also allows optimizing training performance: padding
sequences to BERT-style 512 word-piece tokens is no longer required when a
different model is used in the back-end.
• We implemented SentenceEncoder, a generalization of bert-as-service that
encodes sentences using the transformers back-end. SentenceEncoder allows
extracting various types of embeddings from a single execution of a batch of
sequences.
• We adopt a rolling cosine similarity measure during training in order to
monitor how the embeddings representing each sense converge to their optimal
representation.
Next, we go through the sequence of steps necessary to produce a Transigrafo
and the actual code implementing each step. The complete notebook is available
and executable online.14
6.7.1 Setup
First, we clone the LMMS GitHub repository and change the current directory to
LMMS.
13 https://github.com/huggingface/transformers.
14 https://colab.research.google.com/github/hybridnlp/tutorial/blob/master/
03a_LMMS_Transigrafo.ipynb.
Then, we import and download the nltk interface for WordNet and install the
transformers library, needed to execute the LMMS scripts.
15 https://www.sketchengine.eu/semcor-annotated-corpus.
16 https://www.sketchengine.eu/brown-corpus.
We use a transformers back-end17 to train the model with the SemCor corpus. After
training, we will have the following files in the output folder:
• semcor.4-8.vecs.txt contains the embedding learnt for each sense.
• semcor.4-8.counts.txt keeps track, for each sense, of its frequency in the training
corpus.
• semcor.4-8.rolling_cosims.txt records for each sense the sequence of cosine
similarity values between the current average and the one resulting from also
considering the next occurrence in the training corpus.
• lmms_config.json keeps a record of the parameters used during training.
17 The BERT-as-a-service alternative is not possible in Google Colaboratory, which is used throughout this book.
Next, we propagate the embeddings learnt during the previous training phase
throughout the WordNet graph in order to also calculate embeddings for the senses
that do not appear explicitly in the training corpus. There are three main levels at
which such extension is performed:
1. Synset Level: For any sense without an embedding, we assign the embeddings
of sibling senses, i.e. those that share the same synset.
2. Hypernym Level: After the synset-based expansion, those senses that do not
have an associate embedding yet are assigned the average embedding of their
hypernyms.
3. Lexname Level: After the hypernym-based expansion, top-level categories in
WordNet called lexnames are assigned the average embedding of all their available
underlying senses. Any senses that still do not have an associated embedding after
the previous two steps are assigned the embedding of their lexname.
The new embeddings resulting from the different extensions described above are
saved in an additional file: semcor_ext.3-512.vecs.npz.
The evaluation of the resulting embeddings is performed through the Java Scorer
script provided by the authors of the LMMS paper.
The evaluation framework comes with five different test sets: senseval2, senseval3,
semeval2007, semeval2013, and semeval2015. There is also an additional test
set that contains all the previous datasets (“all”). Next, we evaluate the
previously calculated embeddings against semeval2015.
Precision, recall, and f1 scores are shown (around 75% in our implementation).
We can also compare these results with the ones from the original LMMS paper
in Table 6.2. The “LMMS 1024” row shows the results we are reproducing here.
The “LMMS 2048” row shows some additional improvement, obtained through
concatenation with the embeddings resulting from glosses and lemmas.
To this purpose, we define two functions, based on the k-NN algorithm, which
allow obtaining the neighbors of a sense embedding in the Transigrafo vector space.
Executing the following cells in our notebook will display the 3-neighbors
of some selected sample senses: “long%3:00:02::”, “be%2:42:03::”, and
“review%2:31:00::”.
One of the enhancements we added over the standard LMMS algorithm and its
implementation is a rolling cosine similarity measure calculated during the process
of learning the embeddings. For each sense, this measure collects the cosine similarity
between the average vector computed so far and the embedding of the current occurrence of the sense
in the training corpus. This value should converge to the average distance of the
sense occurrences with respect to their mean. Peaks and valleys in the expected
convergence may indicate issues, e.g. derived from a corpus that may be too small,
with underrepresented senses. Next, we include the implementation of our rolling
cosine similarity measure.
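A self-contained sketch of the idea is shown below; it is a simplification of the notebook's implementation and uses random toy vectors, but it captures how the rolling statistic is computed for one sense.

```python
# Sketch: rolling cosine similarity between the running average embedding of a sense
# and each new contextual embedding of that sense seen during training.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rolling_cosims(occurrence_vectors):
    """Compare every new occurrence with the average of all previous occurrences."""
    cosims, running_sum = [], None
    for n, vec in enumerate(occurrence_vectors):
        if n > 0:
            cosims.append(cosine(running_sum / n, vec))
        running_sum = vec if running_sum is None else running_sum + vec
    return cosims

# Toy example: five contextual embeddings of the same sense.
rng = np.random.default_rng(42)
print(rolling_cosims([rng.normal(size=16) for _ in range(5)]))
```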
As we can see in Fig. 6.4, most of the senses in the SemCor corpus (20K out of 33K)
only occur once or twice in the corpus. Only about 500 senses occur 100 times or
more. This indicates that the corpus is probably too small, providing little signal
to learn the task at hand. Therefore, future work in this area will require investing
more effort in the generation of larger disambiguated corpora, following different
approaches such as crowdsourcing.
(Plot for Fig. 6.4: histogram of SemCor sense frequencies, i.e. number of senses per occurrence count from 0 to 100, shown on a logarithmic and on a linear scale.)
As shown in Figs. 6.5, 6.6, 6.7, and 6.8, even the embeddings of the most
frequent senses are quite unstable according to the rolling cosine similarity measure.
This pattern becomes even more obvious with lower frequency senses and their
lemmas.18
18 Note the point of these figures is not to be able to see the pattern of each lemma, but rather that
all the lemmas are quite unstable. We invite the reader to adapt the code in the notebook to plot a
single lemma.
(Plot for Fig. 6.5: rolling cosine similarity vs. sense occurrence in corpus for six frequent senses: night%1:28:00::, appear%2:39:00::, need%2:42:00::, 1%5:00:00:cardinal:00, cause%2:36:00::, and state%1:15:01::, each with n=100 occurrences and mean similarity between 0.74 and 0.85.)
Fig. 6.5 Rolling cosine similarity for the most frequent senses in SemCor
(Plot for Fig. 6.6: rolling cosine similarity vs. sense occurrence in corpus for sponsor%2:40:00::, a_couple_of%5:00:00:few:00, take%2:31:01::, promise%2:32:00::, record%1:10:03::, and approximately%4:02:00::, each with n=25 occurrences and mean similarity between 0.67 and 0.81.)
Fig. 6.6 Rolling cosine similarity for some less frequent senses, but still frequent enough to be
visible in the plot
Fig. 6.7 Rolling cosine similarity for the senses corresponding to lemma be
Fig. 6.8 Rolling cosine similarity for other senses of lemma be with lower frequency. The pattern
of instability identified in Fig. 6.7 is even more acute due to the lower frequency of appearance of
these senses in the corpus
The values obtained in the rolling similarity measure clearly indicate issues with
the size and variety of the SemCor corpus, which can be addressed by creating
new larger disambiguated corpora. In addition to an informed comparison between
Vecsigrafo and Transigrafo, how to create such corpora, required to learn a more
robust Transigrafo model, is also a research topic that will need to be addressed
in the future. To this purpose, possible approaches may entail higher automation
(SemCor is manually annotated, which surely limited its possible size) and crowd-
sourcing, given the volume of disambiguated text that may be needed, as well
as advanced tooling that enables knowledge engineers to curate the disambiguated
corpora, following a semi-automatic approach where expert intervention is still
needed.
Another key aspect that requires further research and improvement for a prac-
tical application of Transigrafo includes multi-lingualism. Currently, most neural
language models are only available in a handful of languages in addition to
English and, though useful, multi-lingual BERT (M-BERT) is also limited [143].
However, we expect that the solution to this issue will naturally unfold with
the normal evolution of the deployment of neural language models among NLP
researchers and practitioners. Indeed, this is happening as we write these lines,
with BERT-based models already available in well-represented languages like
French (CamemBERT [115]), German,19 Italian (ALBERTo [144]), Spanish
(BETO20 ), or Dutch (BERTje [184] and RobBERT [38]). Extending such coverage
to underrepresented languages will require interesting and challenging research.
19 https://deepset.ai/german-bert.
20 https://github.com/dccuchile/beto.
6.8 Conclusion
This chapter presented two approaches for learning hybrid word and concept
embeddings from large text corpora and knowledge graphs. The first approach
requires disambiguating the training corpus prior to training. The second approach
assumes that recent transformer-based language models somehow encode concept-
level embeddings in some layers of the contextual embedding stack and derives the
concept-level embeddings from examples in a pre-annotated corpus. By following
the practical sections you will have been able to train embeddings based on both
approaches and explore their results. We also discussed some of the properties of
the representations resulting from both approaches. In the next chapter we look
more thoroughly at different ways of evaluating word and concept embeddings, with
special focus on corpus-based knowledge graph embeddings like Vecsigrafo.
Chapter 7
Quality Evaluation
Abstract In the previous chapters we have discussed various methods for gen-
erating embeddings for both words and concepts. Once you have applied some
embedding learning mechanism, you may wonder how good these embeddings are.
In this chapter we look at methods for assessing the quality of the learned embed-
dings: from visualizations to intrinsic evaluations like predicting alignment with
human-rated word similarity and extrinsic evaluations based on downstream tasks.
As in the previous chapters, we provide hands-on practical sections for gaining
experience in applying evaluation methods. We also discuss the methodology and
results used for a real-world evaluation of Vecsigrafo compared to various other
methods, which gives a sense of how thorough real-world evaluations can be
performed.
7.1 Introduction
In the previous chapters, you have already seen several methods to generate
embeddings for words and concepts. At the end of the day, having embeddings is not
an end goal in and of itself; their purpose is to encode information in a way that
makes it useful for downstream tasks. Indeed, in the subsequent
chapters, you will see various applications of these embeddings for various text
and knowledge graph tasks such as classification, KG interlinking, and analysis of
fake news and disinformation. Since different methods for generating embeddings
encode different information, some embeddings can be more suitable for some
tasks than others. Therefore, once you have applied some embedding learning
mechanism—or once you have found some pre-calculated embeddings online—you
may wonder how good these embeddings are. In this chapter, we look at various
generic ways to assess the quality of embeddings.
First, we will provide an overview of evaluation methods in Sect. 7.2 and then,
in Sects. 7.3 and 7.4, we will practice evaluating some of the simple embeddings we
have calculated in the previous chapters; this will give you practical experience in
evaluating embeddings. Finally, in Sect. 7.5, we describe how we have applied some
of these methods to evaluate a full version of Vecsigrafo embeddings. This last part
does not include practical exercises, but will give you an idea of how real-world
embeddings can be evaluated and compared to other state-of-the-art embeddings.
1 https://projector.tensorflow.org/.
2 https://github.com/uber-research/parallax.
Schnabel et al. [158] provide a good overview of methods and introduce terminology
to refer to different types of evaluations. Baroni et al. [15] focused mostly on
intrinsic evaluations and showed that predictive models like word2vec produce
better results than count models (based on co-occurrence counting). Finally, Levy
et al. [103] studied how various implementation or optimization “details” used
in predictive models, which were not needed or used in count models, affect the
performance of the resulting embeddings. Examples of such details are negative
sampling, dynamic context windows, subsampling, and vector normalization. The
paper shows that once such details are taken into account, the difference between
count and predictive models is actually not that large.
Intrinsic evaluations are those in which embeddings are used to perform relatively simple, word-related tasks.
Schnabel et al. distinguish between:
• Absolute Intrinsic: You have a (human annotated) gold standard for a particular
task and use the embeddings to make predictions.
• Comparative Intrinsic: You use the embedding space to present predictions
to humans, who then rate them. Mostly used when there is no gold standard
available.
The tasks used in intrinsic evaluation are the following:
• Relatedness: How well do embeddings capture human-perceived word similarity? Datasets typically consist of triples: two words and a similarity score (e.g., between 0.0 and 1.0). Several such datasets are available, although the interpretation of "word similarity" can vary between them.
• Synonym Detection: Given a word and a set of options, can embeddings select the correct synonym? Datasets are n-tuples where the first word is the input word and the other n-1 words are the options; only one of the options is a synonym.
• Analogy: Do embeddings encode relations between words? Datasets are 4-
tuples: the first two words define the relation, the third word is the source of
4 http://projector.tensorflow.org.
7.3 Practice: Evaluating Word and Concept Embeddings 95
the query, and the fourth word is the solution. Good embeddings should predict an embedding close to the solution word (see the sketch after this list).
• Categorization: Can embeddings be clustered into hand-annotated categories?
Datasets are word–category pairs. Standard clustering algorithms can then be
used to generate k-clusters and the purity of the clusters can be computed.
• Selectional Preference: Can embeddings predict whether a noun–verb pair is
more likely to represent a verb–subject or a verb–object relation? For example,
people-eat is more likely to be found as a verb–subject.
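To make the analogy task concrete, the following minimal sketch (not taken from the accompanying notebooks) applies the usual vector-offset method. It assumes the embeddings are held in a Python dict of L2-normalized numpy vectors and that each analogy question is a 4-tuple of words:

import numpy as np

def predict_analogy(emb, a, b, c, topn=1):
    """Return the topn words closest to emb[b] - emb[a] + emb[c]."""
    target = emb[b] - emb[a] + emb[c]
    target /= np.linalg.norm(target)
    scores = {w: float(np.dot(v, target))
              for w, v in emb.items()
              if w not in (a, b, c)}  # exclude the query words themselves
    return sorted(scores, key=scores.get, reverse=True)[:topn]

def analogy_accuracy(emb, quadruples):
    """Fraction of covered 4-tuples whose fourth word is ranked first."""
    covered = [q for q in quadruples if all(w in emb for w in q)]
    hits = sum(1 for a, b, c, d in covered
               if predict_analogy(emb, a, b, c)[0] == d)
    return (hits / len(covered) if covered else 0.0), len(covered)

As with the other intrinsic tasks, both the accuracy and the number of covered questions should be reported, since low coverage can make the accuracy misleading.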
Swivel comes with an eval.mk script that downloads and unzips various relatedness and analogy datasets. The script also compiles an analogy executable. It assumes you have a Unix environment and tools such as wget, tar, unzip, and egrep, as well as make and a C++ compiler.
For convenience, we have included various relatedness datasets as part of this
repo in datasamples/relatedness. We assume you have generated vectors
as part of previous notebooks, which we will test here.
You can use Swivel's wordsim.py to produce metrics for the k-cap embeddings we produced in previous notebooks:
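The actual notebook cell invokes Swivel's wordsim.py; since that invocation is not reproduced here, the following is a minimal, equivalent computation. It assumes the embeddings are loaded into a dict of numpy vectors and that each dataset is a tab-separated file with word1, word2, and a human score per line:

import csv
import numpy as np
from scipy.stats import spearmanr

def evaluate_relatedness(emb, dataset_path, delimiter='\t'):
    """Return Spearman's rho against the human scores and the pair coverage."""
    human, predicted, total = [], [], 0
    with open(dataset_path) as f:
        for w1, w2, score in csv.reader(f, delimiter=delimiter):
            total += 1
            if w1 in emb and w2 in emb:
                v1, v2 = emb[w1], emb[w2]
                cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
                predicted.append(float(cos))
                human.append(float(score))
    rho, _ = spearmanr(human, predicted)
    return rho, (len(human) / total if total else 0.0)

Running this (or wordsim.py itself) for each file in datasamples/relatedness yields a correlation and a coverage figure per dataset, which is what we discuss next.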
The numbers show that both embedding spaces only have a small coverage of
the evaluation datasets. Furthermore, the correlation score achieved is in the range
of 0.07–0.22, which is very poor, but expected given the size of the corpus.
For comparison, state-of-the-art results are in the range of 0.65–0.8.
Intrinsic evaluations are the most direct way of evaluating (word) embeddings.
Pros
• They provide a single objective metric that enables easy comparison between
different embeddings.
• There are several readily available evaluation datasets (for English).
• If you have an existing, manually crafted knowledge graph, you can generate
your own evaluation datasets.
Cons
• Evaluation datasets are small and can be biased in terms of word selection and
annotation.
• You need to take coverage into account (besides final metric).
• Existing datasets only support English words (few datasets in other languages,
few compound words, few concepts).
• Tasks are low level and thus somewhat artificial: people care about document
classification, but not about word categories or word similarities.
Word prediction can be seen as a task for intrinsic evaluation; however, the task is very close to the original training task used to derive the embeddings in the first place.
Recall that predictive models (such as word2vec) try to minimize the distance
between a word embedding and the embeddings of the context words (and that over
a whole corpus) (Fig. 7.1).
This means that, if we have a test corpus, we can use the embeddings to try to
predict words based on their contexts. Assuming the test corpus and the training
corpus contain similar language we should expect better embeddings to produce
better predictions on average.
A major advantage of this approach is that we do not need human annotation.
Also, we can reuse the tokenization pipeline used for training to produce similar
tokens as those in our embedding space. For example, we can use word-sense
disambiguation to generate a test corpus including lemmas and concepts.
The algorithm in pseudo-code is:
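The pseudo-code itself is not reproduced here; a minimal Python rendition of the procedure, assuming emb maps tokens to numpy vectors, corpus is a list of tokenized sentences, and a plain (unweighted) average of the context embeddings is used, could look as follows:

import numpy as np
from collections import defaultdict

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_prediction_eval(emb, corpus, window=5):
    similarities = defaultdict(list)  # focus_word -> cosine sims of its predictions
    for tokens in corpus:
        for i, focus_word in enumerate(tokens):
            if focus_word not in emb:
                continue
            lo, hi = max(0, i - window), i + window + 1
            context = [t for j, t in enumerate(tokens[lo:hi], start=lo)
                       if j != i and t in emb]
            if not context:
                continue
            # predict the focus word as the average of its context embeddings
            prediction = np.mean([emb[t] for t in context], axis=0)
            similarities[focus_word].append(cosine(prediction, emb[focus_word]))
    all_sims = [s for sims in similarities.values() for s in sims]
    return float(np.mean(all_sims)), similarities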
The result is a single number that tells you how far the prediction embedding
was from the actual word embedding over the whole test corpus. When using cosine
similarity this should be a number between −1 and 1.
We can also use the intermediate similarities dictionary to plot diagrams
which can provide further insight. For example, random embeddings result in a plot
depicted in Fig. 7.2.
The horizontal axis is the rank of the focus_word, sorted by frequency
in the training corpus. (For example, frequent words such as be and the would be
close to the origin, while infrequent words would be towards the end of the axis.)
The plot shows that, when words have random embeddings, on average the
distance between the prediction for each word and the word embedding is close
to 0.
These plots can be useful for detecting implementation bugs. For example, when
we were implementing the CogitoPrep utility for counting co-occurrences for
lemmas and concepts, we generated the following plot depicted in Fig. 7.3.
This showed that we were learning to predict frequent words and some non-
frequent words, but that we were not learning most non-frequent words correctly.
After fixing the bug, we got the plot shown in Fig. 7.4.
This shows that now we were able to learn embeddings that improved word
prediction across the whole vocabulary. But it also showed that prediction for the
most frequent words lagged behind more uncommon words.
After applying some vector normalization techniques to Swivel and re-centering
the vectors (we noticed that the centroid of all the vocabulary embeddings was not
the origin), we got the plot shown in Fig. 7.5, which shows better overall prediction.
In extrinsic evaluations, we have a more complex task we are interested in (e.g., text
classification, text translation, image captioning), whereby we can use embeddings
as a way to represent words (or tokens). Assuming we have (1) a model architecture
and (2) a corpus for training and evaluation (for which the embeddings provide
adequate coverage), we can then train the model using different embeddings and
evaluate its overall performance. The idea is that better embeddings will make it
easier for the model to learn the overall task.
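As a minimal illustration of this methodology (not the model architectures used later in this book), the sketch below represents each document as the average of its token embeddings and compares the cross-validated accuracy of the same simple classifier across several embedding spaces:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def doc_vector(tokens, emb, dim):
    """Average the embeddings of the known tokens in a document."""
    vectors = [emb[t] for t in tokens if t in emb]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def compare_embeddings(emb_spaces, docs, labels, folds=5):
    """emb_spaces: name -> {token: vector}; docs: tokenized documents."""
    scores = {}
    for name, emb in emb_spaces.items():
        dim = len(next(iter(emb.values())))
        X = np.vstack([doc_vector(d, emb, dim) for d in docs])
        clf = LogisticRegression(max_iter=1000)
        scores[name] = cross_val_score(clf, X, labels, cv=folds).mean()
    return scores

The embedding spaces whose document representations yield higher accuracy can be considered more useful for this particular task and corpus.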
7.4 Practice 2: Assessing Relational Knowledge Captured by Embeddings
In this practical section, we use the embrela library to study whether various embedding spaces capture certain lexico-semantic relations from WordNet. The approach behind embrela is described by Denaux and Gomez-Perez in [41].
The main idea here is that word/concept embeddings seem to capture many
lexico-semantic relations. However, evaluation methods like word similarity and
word analogy have several drawbacks. Also, if you have an existing KG with
relations that matter to you, you want to know how well word/concept embeddings
capture those relations. The embrela pipeline is designed to help you do this by:
(1) generating word/concept pair datasets from a KG, (2) creating and evaluating
classifiers based on the embedding space(s), and (3) providing guidelines on how to
analyze the evaluation results.
There are pitfalls to all three of these steps, which the embrela pipeline takes into account in order to avoid concluding incorrectly that embeddings capture relational information when, in fact, the generated dataset may be biased or the evaluation results may not be statistically significantly better than random guesses.
The overall pipeline is depicted in Fig. 7.6.
Fig. 7.6 embrela pipeline for assessing the lexico-semantic relational knowledge captured in an
embedding space
We put the main Python module in the main working folder to make importing submodules a bit easier.
Instead of generating datasets from scratch, which can be done by using the
embrela_prj/wnet-rel-pair-extractor, in this section we will use a
set of pre-generated datasets extracted from WordNet.
This should download and unzip the file containing the datasets. We can load the
metadata of the generated relations as follows:
which gives us a pandas DataFrame with metadata about the datasets. You can
print it by issuing command:
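The original notebook cell is not reproduced here; a minimal pandas sketch, assuming the unzipped datasets include a tab-separated metadata file (the path and file name below are hypothetical), would be:

import pandas as pd

# Hypothetical path; the notebook downloads and unzips the pre-generated
# WordNet relation datasets, which include a tab-separated metadata file.
rels_df = pd.read_csv('wnet_rel_datasets/metadata.tsv', sep='\t')
print(rels_df.head())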
This will print a table with various fields. Here we only print a small part of that
table:
The previous step should have printed a table with 27 rows, one for each generated dataset. All of the datasets are lem2lem, i.e. the sources and targets are both lemmas. In the paper we also consider pairs of types lem2syn, syn2syn, and lem2pos, where syn is a synset (or syncon) and pos is a part-of-speech tag.
The remaining columns in the table tell us:
• The name of the relation.
• cnt: the number of positive examples extracted from the KG, WordNet 3 in this
case.
• The file name where we can find both the positive and negative examples.
Note that each dataset will have about twice as many lines as the positive cnt, since we aim to build balanced datasets. For example, for the entailment relation we have 1519 positive pairs, but 3039 lines in total. Further inspection shows that each file is a tab-separated-value file with columns source, target, label, and comment.
We will also use random vectors as another baseline and to filter out biased
relations. We use the vocabulary of words and syncons used in the K-CAP’19 paper,
which was derived from disambiguating the UN corpus using Cogito.
The training phase expects to get a map of vector space ids to VecPairLoader
instances, which will take care of mapping source and target words in the
generated datasets into the appropriate embeddings. Here we define the data loaders
to use. Uncomment others if you want to use other embedding spaces.
Now that we have the datasets and the embeddings, we are ready to train some
models. This step is highly configurable, but in this notebook we will:
• Only train models with the nn3 architecture (i.e., with three fully connected
layers).
• Only train models for a couple of (the 27) relations to keep execution short.
• Only train three models per embedding/relation/architecture combination.
• Apply input perturbation as explained in the paper, which shifts both source and target embeddings by the same amount.
The trained models and evaluation results will be written to an output folder.
Even with this restricted setup, this step can take 5–10 min on current Google
Colaboratory environments.
Executing the previous commands will result in a long list of outputs as the program iterates through the relations, training and evaluating models that try to predict each relation. The results are stored in learn_results, a list of trained models along with model metadata and evaluation results gathered during training and validation. It is a good idea to write these results to disk:
The previous step will have generated a directory structure in the specified output dir. See the embrela README for more details about the folder structure and the generated files.
Now that we have trained (and evaluated) models for the selected datasets, we can
load the results and analyze them.
7.4.5.1 Load and Aggregate Evaluation Results for the Trained Models
First we show how to load and aggregate evaluation data from disk once you have
gone through the step of learning the models and you have a train result folder
structure as described above. If you skipped that part, below we will load pre-
aggregated results.
We will read the results from disk, aggregate the results, and put them into a
pandas DataFrame for easier analysis:
This will print a sample of the table. As you can see, the data read from the tsv
has the same columns as we would get by aggregating the results read from disk.
Most of the columns in the DataFrame are aggregated values, but for further analysis
it is useful to combine fields and relation metadata.
First we add column rel_name_type, which combines the rel_name and
rel_type:
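The exact notebook expression may differ, but assuming the aggregated DataFrame already contains rel_name and rel_type columns (as described in the surrounding text), the combination amounts to a single assignment in pandas:

def add_rel_name_type(df):
    """Combine the rel_name and rel_type columns into a single rel_name_type key."""
    # the '_' separator is illustrative
    df['rel_name_type'] = df['rel_name'].astype(str) + '_' + df['rel_type'].astype(str)
    return df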
The embrela project defines a utility script relutil which calculates the
relation type based on the rows in the DataFrame:
Now we start merging the relation dataset metadata into the aggregated
DataFrame. We first add a field positive_examples:
ranges could be due to chance with 95% probability because several models trained
on datasets with random pairs achieve values within these ranges.
The output shows that models trained on datasets of random pairs obtain values for the f1 metric with μ^f1_rand = 0.407 and σ^f1_rand = 0.123. Taking μ ± 2σ as the 95% interval, this gives us τ^f1min_biased = 0.161 and τ^f1max_biased = 0.653. Any results within this range have a 95% probability of being the result of chance and are not necessarily due to the used embeddings encoding relational knowledge about the relation being predicted.
We define two filters to detect biased relations and test results:
this imbalance during prediction, but that do not reflect the knowledge encoded in the embeddings. To detect these, we look at the results of models trained on random embeddings (i.e., models m_{δr, s_rand, t}). We say that a relation dataset δr is biased if μ^f1_{δr, s_rand} is outside of the [τ^f1min_biased, τ^f1max_biased] range. The rationale is that even with random embeddings, such models were able to perform outside of the 95% baseline ranges.
We can store the delta between the model and the random predictions as a field
in our DataFrame:
We also express this delta in terms of σ and store it in another column:
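The column names below are illustrative rather than those used by embrela; the sketch stores the delta with respect to the random-embedding baseline and the same delta expressed in units of the baseline's standard deviation:

def add_random_deltas(df, rand_f1_mean, rand_f1_std):
    """rand_f1_mean/std come from the models trained with random embeddings."""
    df['f1_delta_rand'] = df['f1_mean'] - rand_f1_mean
    df['f1_delta_rand_sigmas'] = df['f1_delta_rand'] / rand_f1_std
    return df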
We can use these columns to find which embeddings resulted in models with statistically better results than their random counterparts. This will show a table with 65 rows and 8 columns, listing the embeddings with the highest improvement compared to the random embeddings.
In this practical part we used the embrela library to assess how well various word/concept embeddings capture relational knowledge from WordNet. The pipeline is straightforward, but it takes into account a variety of pitfalls that can occur during dataset creation, training, and interpretation of the classification results.
7.5 Case Study: Evaluating and Comparing Vecsigrafo Embeddings
In the previous two sections, you had the opportunity to gain hands-on experience
in evaluating embeddings. To keep the practical sessions short we typically used
small embedding spaces or only used a few evaluation datasets. Based on what
you have learned, you should be able to edit the accompanying notebooks to use
your own embeddings or use alternative evaluation datasets. To further illustrate
how real-world embeddings can be evaluated, in this section we present an example
of an evaluation we performed to assess the quality of Vecsigrafo embeddings. As
you will see, we applied a variety of methodologies to gather metrics that helped
us to study how Vecsigrafo embeddings compared to other embedding learning
mechanisms. Most of this evaluation was done at an academic level with the aim of
publishing an academic paper. Note that for practitioners (e.g., in business settings)
it may be sufficient to only perform a subset of the evaluations we present below.
7.5.1.1 Embeddings
Table 7.3 shows an overview of the embeddings used during the evaluations. We
used five main methods to generate these. In general, we tried to use embeddings
with 300 dimensions, although in some cases we had to deviate. In general, as can
be seen in the table, when the vocabulary size is small (due to corpus size and
tokenization), we required a larger number of epochs to let the learning algorithm
converge.
• For the Vecsigrafo-based embeddings, the corpus was first tokenized and word-sense disambiguated using Cogito. We explored two basic tokenization variants. The first is lemma-concept with filtered tokens ("ls filtered"), whereby we only keep lemmas and concept ids for the corpus. Lemmatization uses the known lemmas in Sensigrafo to combine compound words into a single token. The filtering step removes various tokens based on their grammar type, such as articles and punctuation marks.
6 https://github.com/tensorflow/models/tree/master/research/swivel.
7 https://github.com/stanfordnlp/GloVe.
8 https://github.com/facebookresearch/fastText.
9 https://github.com/mnick/holographic-embeddings.
10 https://github.com/bxshi/ProjE.
11 http://nlp.stanford.edu/data/glove.840B.300d.zip.
12 https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.vec.
13 http://lcl.uniroma1.it/nasari/files/NASARIembed+UMBC_w2v.zip.
Table 7.4 shows the Spearman correlation scores for the 14 word similarity datasets
and the various embeddings generated based on the UN corpus. The last column
in the table shows the average coverage of the pairs for each dataset. Since the
UN corpus is medium sized and focused on specific domains, many words are not
included in the learned embeddings, hence the scores are only calculated based on
a subset of the pairs.
Table 7.6 shows the results for the embeddings trained on larger corpora and
directly on the Sensigrafo. We have not included results for vectors trained with
NASARI (concept-based) and SW2V on UMBC, since these perform considerably
worse than the remaining embeddings (e.g., NASARI scored 0.487 on MEN-TR-3k and SW2V scored 0.209 for the same dataset; see Table 7.5 for the overall average scores). We have also not included word2vec on UMBC since it does not achieve the best score for any of the reported datasets; however, overall it performs a bit better than Swivel but worse than Vecsigrafo (for example, it achieves a score of 0.737 for MEN-TR-3k).
Table 7.5 shows the aggregate results. Some of the word similarity datasets overlap (SIMLEX-999 and WS-353-ALL were split into their subsets, and MC-30 is a subset of RG-65), while other datasets (RW-STANFORD, SEMEVAL17, VERB-143, and MTurk-287) contain non-lemmatized words (plurals and conjugated verb forms), which penalize embeddings that use some form of lemmatization during tokenization; therefore, we take the average Spearman score over the remaining datasets. We discuss the lessons we can extract from these results in Sect. 7.5.2 (Table 7.6).
Table 7.4 Spearman correlations for word similarity datasets and UN-based embeddings
Dataset ft glove swivel swivel l f vecsi ls f vecsi ls f c vecsi ts vecsi ts c avgperc
MC-30 0.602 0.431 0.531 0.572 0.527 0.405 0.481 0.684 82.5
MEN-TR-3k 0.535 0.383 0.509 0.603 0.642 0.525 0.558 0.562 82.0
MTurk-287 0.607 0.438 0.519 0.559 0.608 0.578 0.500 0.540 69.3
MTurk-771 0.473 0.398 0.416 0.539 0.599 0.497 0.520 0.520 94.6
RG-65 0.502 0.378 0.443 0.585 0.614 0.441 0.515 0.664 74.6
RW-STANFORD 0.492 0.263 0.356 0.444 0.503 0.439 0.419 0.353 49.2
SEMEVAL17 0.541 0.395 0.490 0.595 0.635 0.508 0.573 0.610 63.0
SIMLEX-999 0.308 0.253 0.226 0.303 0.382 0.349 0.288 0.369 96.1
SIMLEX-999-Adj 0.532 0.267 0.307 0.490 0.601 0.559 0.490 0.532 96.6
SIMLEX-999-Nou 0.286 0.272 0.258 0.337 0.394 0.325 0.292 0.384 94.7
SIMLEX-999-Ver 0.253 0.193 0.109 0.186 0.287 0.288 0.196 0.219 100.0
SIMVERB3500 0.233 0.164 0.155 0.231 0.306 0.328 0.197 0.318 94.4
VERB-143 0.382 0.226 0.116 0.162 0.085 -0.089 0.234 0.019 76.2
WS-353-ALL 0.545 0.468 0.516 0.537 0.588 0.404 0.502 0.532 91.9
WS-353-REL 0.469 0.434 0.465 0.478 0.516 0.359 0.447 0.469 93.4
WS-353-SIM 0.656 0.553 0.629 0.642 0.699 0.454 0.619 0.617 91.5
YP-130 0.432 0.350 0.383 0.456 0.546 0.514 0.402 0.521 96.7
The column names refer to the method used to train the embeddings, the tokenization of the corpus
(lemma, syncon, and/or text and whether the tokens were filtered), and whether concept-based
word similarity was used instead of the usual word-based similarity. Bold values are either the best
results, or results worth highlighting, in which case they are further discussed in the text
WS-353-ALL 0.380 0.643 0.597 0.732 0.743 0.493 0.692 0.708 0.685 0.738 98.5
WS-353-REL 0.258 0.539 0.445 0.668 0.702 0.407 0.652 0.649 0.609 0.688 98.2
WS-353-SIM 0.504 0.726 0.748 0.782 0.805 0.615 0.765 0.775 0.767 0.803 99.1
YP-130 0.315 0.550 0.736 0.533 0.562 0.334 0.422 0.610 0.661 0.571 98.3
Bold values are either the best results, or results worth highlighting, in which case they are further discussed in the text
The word similarity datasets are typically used to assess the correlation between
the similarity of word pairs assigned by embeddings and a gold standard defined
by human annotators. However, we can also use the word similarity datasets to
assess how similar two embedding spaces are. We do this by collecting all the
similarity scores predicted for all the pairs in the various datasets and calculating the
Spearman’s ρ metric between the various embedding spaces. We present the results
in Fig. 7.7; darker colors represent higher inter-agreement between embeddings. For
example, we see that wiki17 ft has high inter-agreement with ft and very low
with HolE c. We discuss these results in Sects. 7.5.2.1 and 7.5.2.2.
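A minimal sketch of this inter-agreement computation, assuming two embedding spaces stored as dicts of numpy vectors and a list of word pairs collected from the similarity datasets, is shown below:

import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def inter_agreement(emb_a, emb_b, word_pairs):
    """Spearman's rho between the similarity scores two spaces assign to the same pairs."""
    sims_a, sims_b = [], []
    for w1, w2 in word_pairs:
        if all(w in emb_a and w in emb_b for w in (w1, w2)):
            sims_a.append(cosine(emb_a[w1], emb_a[w2]))
            sims_b.append(cosine(emb_b[w1], emb_b[w2]))
    rho, _ = spearmanr(sims_a, sims_b)
    return rho  # a value close to 1 means the two spaces rank pairs similarly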
One of the disadvantages of word similarity (and relatedness) datasets is that they
only provide a single metric per dataset. In [40] we introduced word prediction
plots, a way to visualize the quality of embeddings by performing a task that is very
similar to the loss objective of word2vec. Given a test corpus (ideally different from
the corpus used to train the embeddings), iterate through the sequence of tokens
using a context window. For each center word, take the (weighted) average of the
embeddings for the context tokens and compare it to the embedding for the center
word using cosine similarity. If the cosine similarity is close to 1, this essentially
correctly predicts the center word based on its context. By aggregating all such
cosine similarities for all tokens in the corpus we can (1) plot the average cosine
similarity for each term in the vocabulary that appears in the test corpus and (2) get
an overall score for the test corpus by calculating the (weighted by token frequency)
average over all the words in the vocabulary.
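A possible way to produce such plots is to reuse the per-word similarities gathered by the evaluation loop sketched earlier in this chapter and plot their averages against the frequency rank of each word; the snippet below is an illustrative sketch rather than the exact tooling used for the figures:

import numpy as np
import matplotlib.pyplot as plt

def word_prediction_plot(similarities, ranked_vocab):
    """similarities: word -> list of cosine sims; ranked_vocab: most frequent first."""
    xs, ys = [], []
    for rank, word in enumerate(ranked_vocab):
        if word in similarities:
            xs.append(rank)
            ys.append(np.mean(similarities[word]))
    plt.scatter(xs, ys, s=2)
    plt.xlabel('vocabulary rank (by training-corpus frequency)')
    plt.ylabel('average cosine similarity of predictions')
    plt.ylim(-1, 1)
    plt.show()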
Table 7.7 provides an overview of the test corpora we have chosen to generate
word and concept prediction scores and plots. The corpora are:
• webtext [108] is a topic-diverse corpus of contemporary text fragments (support
fora, movie scripts, ads) from publicly accessible websites, popular as training
data for NLP applications.
• NLTK Gutenberg selections14 contain a sample of public-domain literary texts
by well-known authors (Shakespeare, Jane Austen, Walt Whitman, etc.) from
Project Gutenberg.
• Europarl-10k: We have created a test dataset based on the Europarl [98] v7
dataset. We used the English file that has been parallelized with Spanish, removed
the empty lines, and kept only the first 10K lines. We expect Europarl to be
relatively similar to the UN corpus since they both provide transcriptions of
proceedings in similar domains.
14 https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/gutenberg.zip.
Fig. 7.7 Inter-embedding agreement for the word similarity datasets in the same order as
Table 7.5. Embeddings that do not mention a corpus were trained on Wikipedia 2018
Figure 7.8 shows the word prediction plots for various embeddings and the three
test corpora. Table 7.8 shows (1) the token coverage relative to the embedding vocabulary (i.e., the percentage of the embedding vocabulary found in the tokenized test corpus); (2) the weighted average score, i.e. the average cosine similarity per prediction made (since frequent words are predicted more often, this may skew the overall result if infrequent words have worse predictions); and (3) the "token average" score, i.e. the average of the per-token average scores. The latter gives an indication of how likely it is to predict a token (word or concept) given its context when the token is selected from the embedding vocabulary at random, i.e. without taking into account its frequency in general texts. As with previous results, we will draw
conclusions about these results in Sect. 7.5.2.
Word (and concept) similarity and prediction tasks are good for getting a sense
of the embedding quality. However, ultimately the relevant quality metric for
embeddings is whether they can be used to improve the performance of systems
that perform more complex tasks such as document categorization or knowledge
graph completion. For this reason we include an evaluation for predicting specific
types of relations in a knowledge graph between pairs of words, following recent
work in the area [102, 180, 191]. At Expert System, such a system would help our
team of knowledge engineers and linguists to curate the Sensigrafo.
To minimize introducing bias, rather than using Sensigrafo as our knowledge
graph, we have chosen to use WordNet since we have not used it to train HolE
embeddings and it is different from Sensigrafo (hence any knowledge used during
disambiguation should not affect the results). For this experiment, we chose the
following relations:
Fig. 7.8 Word and concept prediction plots. The horizontal axis contains the word ids sorted by
frequency on the training corpus; although different embeddings have different vocabulary sizes,
we have fixed the plotted vocabulary size to 2M tokens to facilitate comparison. Since HolE is not trained on a corpus, the frequencies are unknown and its vocabulary is sorted alphabetically.
The vertical axis contains the average cosine similarity between the weighted context vector and
the center word or concept
• verb group, relating similar verbs to each other, e.g. “shift”-“change” and
“keep”-“prevent.”
• entailment, which describes entailment relations between verbs, e.g. “peak”-
“go up” and “tally”-“count.”
Datasets
We built a dataset for each relation by (1) starting with the vocabulary of UN vecsi ls f (the smallest vocabulary among the embeddings we are studying) and looking up all the synsets in WordNet for the lemmas. Then we (2) searched for all the connections to other synsets using the selected relations, which gives us a list of positive examples. Finally, (3) we generated negative pairs based on the list of positive examples for the same relation (this negative switching strategy has been recommended in order to avoid models simply memorizing words associated with positive pairs [103]). This resulted in a dataset of 3039 entailment pairs (1519 positives) and 9889 verb group pairs (4944 positives).
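One possible implementation of this negative switching strategy, assuming the positive examples are given as a list of (source, target) tuples and that enough non-positive recombinations exist, is sketched below:

import random

def negative_switch(positive_pairs, seed=42):
    """Build negatives by recombining sources and targets from positive pairs."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    sources = [s for s, _ in positive_pairs]
    targets = [t for _, t in positive_pairs]
    negatives = set()
    while len(negatives) < len(positives):
        pair = (rng.choice(sources), rng.choice(targets))
        if pair not in positives:  # keep only combinations not seen as positives
            negatives.add(pair)
    return list(negatives)

Because every word in a negative pair also appears in some positive pair, a model cannot succeed simply by memorizing which words occur in positive examples.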
Training
Next, we trained a neural net with two fully connected hidden layers on each dataset, using a 90% training, 5% validation, 5% test split. The neural nets received as their input the concatenated embeddings for the input pairs (if the input verb was a multi-word expression like "go up," we took the average embedding of the constituent words when using word embeddings rather than lemma embeddings). Therefore, for embeddings with 300 dimensions, the input layer had 600 nodes, while the two hidden layers had 750 and 400 nodes. The output layer has two one-hot-encoded nodes. For the HolE embeddings, the input layer had 300 nodes and the hidden layers had 400 and 150 nodes. We used dropout (0.5) between the hidden layers and an Adam optimizer to train the models for 12 epochs on the verb group dataset and 24 epochs on the entailment dataset. Also, to further prevent the neural net from memorizing particular words, we add a random perturbation factor to each input embedding; the idea is that the model should learn to categorize the input based on the difference between the pair of word embeddings. Since different embedding spaces have different value ranges, the perturbation takes into account the minimum and maximum values of the original embeddings.
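A Keras sketch of this classifier for 300-dimensional embeddings is shown below; the experiments were run with separate training code, so details such as the hidden-layer activation, the dropout placement, and the perturbation scale are illustrative assumptions:

import numpy as np
from tensorflow.keras import layers, models

def build_relation_classifier(emb_dim=300):
    model = models.Sequential([
        # input: concatenated source and target embeddings
        layers.Dense(750, activation='relu', input_shape=(2 * emb_dim,)),
        layers.Dropout(0.5),
        layers.Dense(400, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(2, activation='softmax'),  # one-hot positive/negative output
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

def perturb(X, emb_min, emb_max, scale=0.05, seed=0):
    """Add uniform noise scaled to the embedding value range (illustrative)."""
    rng = np.random.default_rng(seed)
    return X + scale * rng.uniform(emb_min, emb_max, size=X.shape)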
Results
Table 7.9 shows the results for models trained with various embeddings: cc glove, wiki ft,15 HolE, UN vecsi ls f, and wiki vecsi ls f. Since constructing
such datasets is not straightforward [103], we also include a set of random
embeddings. The idea is that, if the dataset is well constructed, models trained
with the random embeddings should have an accuracy of 0.5, since no relational
information should be encoded in the random embeddings (as opposed to the trained
embeddings).
The main finding was that the Vecsigrafo-based embeddings learnt from the
medium-sized UN corpus outperform the rest at the prediction of both target
relations. Surprisingly, the Vecsigrafo UN embeddings also outperformed the
Wikipedia-based embeddings; a possible explanation for this is the greater
15 For GloVe and fastText only the optimal results, based on the larger corpus (cc, wiki), are shown.
7.5.2 Discussion
From Tables 7.4 and 7.5 we can draw the conclusion that, for the UN corpus (a
medium-sized corpus):
• Co-training lemmas and concepts produces better embeddings than training
them using conventional word embedding methods. In particular we see that:
ρ_vecsi ls f > ρ_swivel l f ≳ ρ_ft ≈ ρ_vecsi ts ≈ ρ_swivel > ρ_glove, where > means that the difference is statistically significant (t-test, p < 0.01), ≳ means slightly significant (p < 0.05), and ≈ means the difference is not statistically significant.
As in the ablation study, we see that for the same tokenization strategy,
adding concepts significantly improves the quality of the word embeddings. The
comparative study furthermore shows that just lemmatizing and filtering achieves
a similar quality as that of fastText (which also performs pre-processing and uses
subword information as discussed in Sect. 2.3).
For larger corpora such as Wikipedia and UMBC:
• There is no statistically significant difference between fastText, Vecsigrafo, or SW2V. Similarly, GloVe performs at roughly the same level as these other
the average prediction accuracy drops and variance increases. This pattern is
particularly clear for GloVe trained on CommonCrawl. The same pattern applies
for wiki glove; however, the plot shows that for most words (except the most
frequent ones) these embeddings barely perform better than random (average
cosine similarity is close to 0). This suggests that there is an issue with the
default hyperparameters, or that GloVe requires a much higher number of epochs
compared to other algorithms (note that we initially trained most of the embeddings for 8 epochs, but due to poor performance we increased this to 25 epochs for the GloVe embeddings for Wikipedia presented here).
• fastText produces very consistent results: prediction quality does not change
depending on word frequency.
• word2vec applied to UMBC has a pattern in between that of fastText and
GloVe. It shows a high variance in prediction results, especially for very high-frequency words, and a linearly declining performance as words become less frequent.
• Swivel with standard tokenization also shows mostly consistent predictions;
however, very frequent words show a higher variance in prediction quality which
is almost the opposite of GloVe: some high-frequency words tend to have a poor
prediction score, but the average score for less frequent words tends to be higher.
The same pattern applies to Vecsigrafo (based on Swivel), although it is less
clear for wiki vecsi ls. Due to the relatively small vocabulary sizes for the
studied Vecsigrafos trained on the UN corpus, it is hard to identify a learning
pattern when normalizing the vocabulary to 2M words.
By comparing the word prediction results between wiki swivel and the three
Vecsigrafo-based embeddings we can see a few counter-intuitive results.
• First, on average word prediction quality decreases when using Vecsigrafo, which is
surprising (especially since word embedding quality improves significantly based
on the word similarity results as discussed above). One possible reason for this
is that the context vector for Vecsigrafo-based predictions will typically be the
average of twice as many context tokens (since it will include both lemmas and
concepts). However, the results for UN vecsi ts would suffer from the same
issue, but this is not the case. In fact, UN vecsi ts performs as well as wiki
swivel at this task.
• Second, both UN-based Vecsigrafo embeddings outperform the wiki-based Vec-
sigrafo embedding for this task. When comparing UN vecsi ls f and wiki
vecsi ls, we see that, due to the smaller vocabulary size, the UN-based embeddings had to perform fewer predictions for fewer tokens; hence it may be that less frequent words introduce noise when performing word prediction. Further studies are needed in order to explain these results. For now, the results indicate that,
for the word prediction task, Vecsigrafo embeddings based on smaller corpora
outperform those trained on larger corpora. This is especially relevant for tasks
such as Vecsigrafo-based disambiguation, for which standard word embeddings
would not be useful.
Table 7.5 shows that for KG-based embeddings, the lemma embeddings (HolE
500e) perform poorly, while the concept-based similarity embeddings perform
relatively well (HolE c 500e). However, the concept embeddings learned using
HolE perform significantly worse than those based on the top-performing word
embedding methods (fastText on wiki and GloVe on CommonCrawl) and concept-
embedding methods (SW2V and Vecsigrafo). This result supports our hypothesis
that corpus-based concept embeddings improve on graph-based embeddings, since they can refine the concept representations by taking into account tacit knowledge from the training corpus, which is not explicitly captured in a knowledge graph.
7.6 Conclusion
This chapter was all about ways to evaluate the quality of your embeddings. We saw
that there is a plethora of methods that can be used to evaluate embeddings with
various trade-offs. Some methods are easier to apply since datasets already exist
and results for various pre-trained embeddings are readily available for comparison.
Other methods require more effort and interpreting the results may not be as easy.
Finally, we saw that, depending on what you want to do with your embeddings, you
may decide to focus on one type of evaluation over another.
The overview of existing evaluation methods combined with practical exercises
for applying a variety of these methods and the case study should provide you with
plenty of ideas on how to evaluate a range of embeddings for your particular use
cases.
Chapter 8
Capturing Lexical, Grammatical,
and Semantic Information
with Vecsigrafo
8.1 Introduction
Table 8.1 Tokenizations for the first window of size W = 3 for the sentence: With regard to
breathing in the locus coeruleus, . . .
Context t i−3 t i−2 t i−1 ti t i+1 t i+2 t i+3
t With regard to breathing in the locus coeruleus
sf With regard to breathing in the locus coeruleus ∅
l With regard to breathe ∅ ∅ locus coeruleus ∅
c en#216081 en#76230 ∅ ∅ en#101470452 ∅
g PRE VER PRE ART NOU PNT
First we show basic space-based tokenization. Next, sequences for surface forms, lemmas,
syncons, and grammar type
1 Note that in the example, concept information is linked to a knowledge graph through a unique
identifier.
Table 8.2 Multi-word expressions (mwe) vs. single-word expressions (swe) per grammar type
(PoS) in SciGraph
Part-of-speech Example #Single-word expressions #Multi-word expressions
ADJ single+dose 104.7 0.514
ADV in+situ 21 1.98
CON even+though 34.6 3.46
ENT august+2007 1.5 1.24
NOU humid+climate 216.9 15.24
NPH Ivor+Lewis 0.917 0.389
NPR dorsal+ganglia 22.8 5.22
PRO other+than 13.89 0.005
VER take+place 69.39 0.755
Number of swe and mwe occurrences are in millions
case of grammar types like adverbs (ADV), nouns (NOU), noun phrases (NPH),
proper names (NPR), or entities (ENT), while others like conjunctions (CON) or
pronouns (PRO) are domain-independent.
We quantify the impact of each individual type of linguistic annotation and their
possible combinations on the resulting embeddings following different approaches.
First, intrinsically, through word similarity and analogical reasoning tasks. In this
regard, we observe that word similarity and analogy benchmarks are of little help
in domains which, like science, have highly specialized vocabularies, highlighting
the need for domain-specific benchmarks. Second, we also evaluate extrinsically in
two downstream tasks: a word prediction task against scientific corpora extracted
from SciGraph and Semantic Scholar [3], as well as general-purpose corpora
like Wikipedia and UMBC [74], and a classification task based on the SciGraph
taxonomy of scientific disciplines.
Next, we address different ways to capture various types of linguistic and seman-
tic information in a single embedding space (Sect. 8.2) and report experimental
results based on that (Sect. 8.3). Some interesting findings include the following.
Embeddings learnt from corpora enriched with sense information consistently
improve word similarity and analogical reasoning, while the latter additionally
tends to benefit from surface form embeddings. Other results show the impact of
using part-of-speech information to generate embeddings in the target domain, with
significant improvements in precision for the classification task while penalizing
word prediction.
8.2 Approach
The method used is based on Vecsigrafo [39], which allows jointly learning word,
lemma, and concept embeddings using different tokenization strategies, as well as
linguistic annotations.
As introduced in Chap. 6, Vecsigrafo [39] is a method that jointly learns word and
concept embeddings using a text corpus and a knowledge graph. Like word2vec,
it uses a sliding window over the corpus to predict pairs of center and context
words. Unlike word2vec, which relies on a simple tokenization strategy based on
space-separated tokens (t), in Vecsigrafo the aim is to learn embeddings also for
linguistic annotations, including surface forms (sf ∈ SF ), lemmas (l ∈ L), grammar
information like part-of-speech (g ∈ G) and knowledge graph concepts (c ∈ C).
To derive such additional information from a text corpus, appropriate tokenization
strategies and a word-sense disambiguation pipeline are required.
In contrast to simple tokenization strategies based on space separation, surface
forms are the result of a grammatical analysis where one or more tokens can be
grouped using the part-of-speech information, e.g. in noun or verb phrases. Surface
forms can include multi-word expressions that refer to concepts, e.g. “Cancer
Research Center.” A lemma is the base form of a word, e.g. surface forms “is,”
“are,” “were” all have lemma “be.”
Due to the linguistic annotations that are included in addition to the original text, the word2vec approach is no longer directly applicable. The annotated text is a sequence of words and annotations, hence sliding a context window around a central word must take both words and annotations into account. For this reason, Vecsigrafo extends the Swivel [164] algorithm: it first uses the sliding window to count co-occurrences between pairs of linguistic annotations in the corpus, which are then used to estimate the pointwise mutual information (PMI) between each pair.
The algorithm then tries to predict all the pairs as accurately as possible. In the end, Vecsigrafo produces an embedding space {(x, e) : x ∈ SF ∪ L ∪ G ∪ C, e ∈ R^n}. That is, for each entry x, it learns an embedding of dimension n, such that the dot product of two vectors predicts the PMI of the two entries.
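The following sketch shows, in simplified form, the statistic that the learned vectors are trained to approximate; Swivel additionally handles unobserved co-occurrences and distinguishes focus from context vectors, which we omit here:

import numpy as np

def pmi_matrix(cooc):
    """cooc: square matrix of co-occurrence counts between vocabulary entries."""
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)  # marginal count of each focus entry
    col = cooc.sum(axis=0, keepdims=True)  # marginal count of each context entry
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.log(cooc * total / (row * col))  # -inf where cooc == 0

# After training, for entries i and j with vectors e_i and e_j, the model aims at
#   np.dot(e_i, e_j) ≈ pmi[i, j]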
8.3 Evaluation
Next, we describe the corpus, embeddings learnt from the linguistic annotations,
and the experiments carried out to get insights on their usefulness in the different
evaluation tasks. We present a comprehensive experimental study on the quality of
embeddings generated from a scientific corpus using additional, explicit annotations
with linguistic and semantic information extracted from the text. To this purpose,
we follow the approach proposed by Vecsigrafo. However, in contrast to the
experiments described in [39], which used general domain corpora, here we
rely on domain-specific text from scientific publications and include the use of
grammatical information (part-of-speech) in the experimentation. We cover all the
possible combinations of the resulting embeddings, including those related to space-
separated tokens, surface forms, lemmas, concepts, and grammatical information.
Finally, we add an extrinsic evaluation task on scientific text classification to further
assess the quality of the embeddings.
8.3.1 Dataset
The scientific corpus used in the following experiments is extracted from Sci-
Graph [73], a knowledge graph for scientific publications. From it, we extract titles
and abstracts of articles and book chapters published between 2001 and 2017. The
resulting corpus consists roughly of 3.2 million publications, 1.4 million distinct
words, and 700 million tokens.
Next, we use Cogito, Expert System’s NLP suite, to parse and disambiguate the
text and add the linguistic annotations associated with each token or group of tokens.
To this purpose, Cogito relies on its own knowledge graph, Sensigrafo. Note that
we could have used any other NLP toolkit as long as the generation of the linguistic
annotations used in this work were supported.
Table 8.3 Account of token types and embeddings learnt from the text
in the title and abstracts from research articles and book chapters in
English published between 2001 and 2017 and available in SciGraph
Embeddings for space-separated tokens (t) were learnt using Swivel, which is the reference algorithm for Vecsigrafo. For the remaining linguistic annotations (sf, l, g, c), as well as their combinations of two and three elements, we learnt embeddings using Vecsigrafo. The difference between the number of distinct annotations and the number of actual embeddings is due to filtering.
Following previous findings [39], we filtered out elements with grammar type article, punctuation mark, or auxiliary verb, and generalized tokens with grammar type entity or person proper noun, replacing the original token with the special tokens grammar#ENT and grammar#NPH, respectively.
In Table 8.4 we report the 16 datasets used in the word similarity evaluation task.
During training of the Vecsigrafo embeddings we tested them at each step of the
learning process against these datasets. We present in Table 8.5 the similarity results
calculated for the Vecsigrafo embeddings in a subset of 11 datasets, although the
average reported in the last column is for the 16 datasets. Note that we only applied
concept-similarity in those combinations where sf or l were not present.
From the data we observe clear evidence that embeddings learnt from linguistic annotations, individually or jointly, perform better in the similarity task than token embeddings. An individual analysis of each linguistic annotation shows that c embeddings outperform l embeddings, which in turn are superior to sf embeddings. Note that most of the similarity datasets in Table 8.4 contain mainly nouns and lemmatized verbs. Thus, semantic and lemma embeddings are better representations for these datasets, since they condense different words and morphological variations into one representation, while surface forms may generate several embeddings for the different morphological variations of words, which might not be necessary for this task and can hamper the similarity estimation.
Regarding Vecsigrafo embeddings jointly learnt for two or more linguistic anno-
tations, semantic embeddings c improve the results in all the combinations where
they were used. That is, concept embeddings enhance the lexical representations
based on surface forms and the grouped morphological variations represented by
lemmas in the similarity task. On the other hand, grammatical information g seems
to hamper the performance of Vecsigrafo embeddings, except when they are jointly
learnt with the semantic embeddings c. If we focus the analysis on the top five results
in Table 8.5, l and c embeddings, jointly or individually learnt, are the common
elements in the embedding combinations that perform best in the similarity task. The
performance of embeddings from surface forms sf, on the other hand, and related
combinations is at the bottom of the table. In general, embeddings based on l and c,
individually or jointly, seem to be better correlated to the human notion of similarity
than embeddings based on l and g.
To summarize, from the results obtained in the similarity task, embeddings seem
to benefit from the additional semantic information contributed by the linguistic
annotations. Lexical variations presented at the level of surface forms are better
representations in this regard than space-separated tokens, while the conflation of
these variations in a base form at the lemma level and their link to an explicit
concept increase performance in the prediction of word relatedness. Jointly learnt
embeddings for surface forms, lemmas, and concepts achieve the best overall
similarity results.
The analysis in [39] evaluates, in the same setting, Vecsigrafo embeddings for lemmas and concepts learnt from a 2018 dump of Wikipedia, achieving a Spearman's rho of 0.566, which contrasts with the 0.380 achieved by the same
embeddings learnt from the SciGraph corpus. These results may not be directly
comparable due to the difference of corpora size (3B tokens in Wikipedia vs. 700M
in SciGraph). Another factor is the general-purpose nature of Wikipedia, while
SciGraph is specialized in the scientific domain.
Table 8.5 Average Spearman’s rho for a subset of all the similarity datasets and the final learnt Vecsigrafo and Swivel embeddings
Embed. MC_30 MEN_TR_3k MTurk_771 RG_65 RW_STANFORD SEMEVAL17 SIMLEX_999 SIMVERB3500 WS_353_ALL YP_130 Average ↓
sf_l_c 0.6701 0.6119 0.4872 0.6145 0.2456 0.4241 0.2773 0.1514 0.4915 0.4472 0.4064
c 0.6214 0.6029 0.4512 0.5340 0.2161 0.5343 0.2791 0.2557 0.4762 0.4898 0.3883
l_g_c 0.5964 0.6312 0.5103 0.5206 0.1916 0.4535 0.2903 0.1745 0.4167 0.4576 0.3826
l_c 0.5709 0.5787 0.4355 0.5607 0.1888 0.4912 0.2627 0.2378 0.4462 0.5180 0.3807
l 0.6494 0.6093 0.4908 0.6972 0.2002 0.4554 0.2509 0.1297 0.4410 0.3324 0.3763
g_c 0.4458 0.5663 0.4046 0.4985 0.1803 0.5280 0.3040 0.2637 0.3957 0.5322 0.3671
sf_l 0.5390 0.5853 0.4229 0.5448 0.2139 0.3870 0.2493 0.1374 0.4670 0.3627 0.3652
sf_c 0.5960 0.5603 0.4430 0.4788 0.2013 0.4958 0.2722 0.2433 0.3957 0.4835 0.3607
sf_g_c 0.5319 0.6199 0.4565 0.4690 0.2315 0.3938 0.2789 0.1452 0.4201 0.2663 0.3559
sf 0.5150 0.5819 0.4387 0.4395 0.2325 0.3855 0.2673 0.1524 0.4370 0.3630 0.3538
sf_l_g 0.5096 0.5645 0.4008 0.3268 0.2077 0.3427 0.2376 0.1100 0.3909 0.3179 0.3220
l_g 0.5359 0.5551 0.4238 0.5230 0.1648 0.3791 0.2139 0.1204 0.3087 0.3853 0.3057
sf_g 0.5016 0.5239 0.3950 0.3448 0.1681 0.3469 0.1979 0.1028 0.3699 0.3322 0.2990
t 0.1204 0.3269 0.1984 0.1546 0.0650 0.2417 0.1179 0.0466 0.1923 0.2523 0.1656
The results are sorted in descending order on the average of all datasets in the last column. Bold values indicate the largest correlation
For the analogy task we use the Google analogy test set2 that contains 19,544
question pairs (8869 semantic and 10,675 syntactic questions) and 14 types of
relations (9 syntactic and 5 semantic). The accuracy of the embeddings in the
analogy task is reported in Table 8.6.
Similarly to the word similarity results, in this task the linguistic annotations also generate better embeddings than space-separated tokens. However, in this case surface forms are more relevant than lemmas. Surface form embeddings achieve high accuracy in both semantic and syntactic analogies, whereas lemma embedding accuracy is high for semantic analogies and very low for syntactic analogies. Here, embeddings generated taking into account the morphological variations of words are better than those using the base forms. For example, some syntactic analogies actually require working at the surface form level, e.g., "bright-brighter cold-colder", since the lemmas of brighter and colder are bright and cold, respectively, and therefore the lemma embedding space has no representation for brighter and colder.
Jointly learning sf and c embeddings achieves the highest accuracy across the board, and in fact these two linguistic annotations are in the top 3 results along with g or l. As in word similarity, semantic embeddings c improve every combination in which they are used. If we focus on the semantic analogies, we can see that l, g, and c reach the highest accuracy. Nevertheless, the performance of this combination is
very poor on syntactic analogies, given that l embeddings do not include the morphological variations of words that are heavily represented in the syntactic analogies. In general, the worst results are obtained when the combinations do not include c.
In Shazeer et al. [164], Swivel embeddings learnt from Wikipedia and Gigaword achieved an analogy accuracy of 0.739 on the same dataset used in these experiments, while the best result reported in our analysis is 0.129. As in the similarity evaluation, the smaller size of the SciGraph corpus and its domain specificity seem to hamper the results in these generic evaluation tasks.
We have selected 17,500 papers from Semantic Scholar, which are not in the
SciGraph corpus, as the unseen text on which we try to predict the embeddings for
certain words (or linguistic annotations) according to the context. We applied the
same annotation pipeline to the test corpus so that we can use the embeddings learnt
from SciGraph. As baselines we used pre-trained embeddings learnt for linguistic
annotations that we derived from generic, i.e. non-scientific, corpora: a dump of
Wikipedia from January 2018 containing 2.89B tokens and UMBC [74], a web-
based corpus of 2.95B tokens.
The overall results are shown in Table 8.7. We also plot the full results for some of
the prediction tasks in Fig. 8.1. The results show a clear pattern: embeddings learnt
from linguistic annotations significantly outperform the plain space-separated token
embeddings (t). That is, the cosine similarity between the predicted embedding for
the linguistic annotation on Semantic Scholar and the actual embeddings learnt from
SciGraph is higher. Recall that the predicted embedding is calculated by averaging
the embeddings of the words or their annotations in the context window.
In general, embeddings learnt from linguistic annotations on SciGraph are better
at predicting embeddings in the Semantic Scholar corpus. For these embeddings it
was easier to predict surface forms embeddings, followed by lemma and concept
embeddings. We assume this is because sf annotations contain more information,
since they keep morphological information. Similarly, concept information may
be more difficult to predict due to possible disambiguation errors or because the
specific lemma used to refer to a concept may still provide useful information for
the prediction task.
Jointly learning embeddings for sf and other annotations (c or l) produces sf
embeddings which are slightly harder to predict than when trained on their own.
However, jointly learning l and c embeddings produces better results; i.e. jointly
learning lemmas and concepts has a synergistic effect.
When comparing to the baselines, we see that SciGraph-based embeddings out-
perform both the Wikipedia and UMBC-based embeddings. The out-of-vocabulary
(oov) numbers provide an explanation: both baselines produce embeddings which
miss many terms in the test corpus but which are included in the SciGraph
embeddings. Wikipedia misses 116K lemmas, UMBC 93K, but SciGraph only
misses 81K of the lemmas in the test corpus. Wikipedia misses most concepts
(2.2K), followed by SciGraph (1.5K) and UMBC (1.1K); however, despite missing
more concepts, the SciGraph-based embeddings outperform both baselines.
Manual inspection of the missing words shows that UMBC is missing nearly 14K
lemmas (mostly for scientific terms) which the SciGraph-based embeddings contain,
such as “negative bias temperature instability,” “supragranular,” “QHD.” Inversely,
the SciGraph vocabulary is missing nearly 7K lemmas for generic entities such as
“Jim Collins,” “Anderson School of Management,” but also misspellings. However,
most of the missing words (around 42K) are in neither UMBC nor SciGraph and
include very specific metrics (“40 kilopascal”), acronyms (“DB620”), and named
entities (“Victoria Institute of Forensic Medicine”). We observe a similar pattern
when comparing missing concepts.
(Figure 8.1 panels, from top to bottom: scigraph_sf, scigraph_lc-pred-l, scigraph_c, umbc_lc_pred_l, and scigraph_text_t.)
Fig. 8.1 Embedding prediction plots. The horizontal axis aligns with the terms in the vocabulary
sorted by frequency. The vertical axis is the average cosine similarity (which can range from 0 to 1)
between the predicted embedding on the Semantic Scholar corpus and the actual learnt embedding
from SciGraph. According to this task, the closer to 1 the better. Blank spaces indicate lack of coverage
Regarding precision, the top three classifiers are learnt from different combinations of sf, l, and g, indicating that, precision-wise, grammatical information is more relevant than semantic information (c). However, note that the common linguistic element in the first 6 classifiers is g, even when combining it with c, and in general removing g embeddings produced the least precise classifiers. This means that grammar information makes the difference regardless of the other linguistic elements, although the influence of g is enhanced when used in combination with sf and l. Note that the precision of the top five classifiers is better than that of the upper bound classifier (Table 8.8), where the embeddings were learnt at classification training time, even though the linguistic-based embeddings were not learnt for this specific purpose.
The recall analysis shows a different picture, since grammar embeddings g do not seem to have a decisive role in classifier performance recall-wise, while c gains more relevance. sf and l help in learning the best classifier. The combination of c and l seems to benefit recall, as seen in 3 of the top 4 classifiers; in contrast, when concepts are combined with sf the recall is lower. The fact that l embeddings are more directly related to c than sf embeddings are seems to make a difference in the recall analysis when these three elements are involved. In general, g-based embedding combinations generate classifiers with lower recall.
Finally, the f-measure data shows more heterogeneous results since, by definition, it is the harmonic mean of precision and recall, and hence the combinations that generate the best f-measure need both high precision and high recall. The sf and l combination is at the top (best recall), followed by its combination with g (best precision). c appears in positions 3 to 6 in the ranking; however, when combined solely with sf or g, the f-measure results are the worst. g, on the other hand, ranks high in f-measure when combined with at least two other elements, while combining it with a single linguistic annotation type results in the worst-performing classifiers.
8.4 Discussion
Although the different evaluation tasks assess the embeddings from different perspectives, in all of them we have observed that the embeddings learnt from linguistic annotations outperform space-separated token embeddings. In the intrinsic evaluation tasks, the embeddings learnt from linguistic annotations perform differently mainly due to the vocabulary used in the evaluation datasets. For example, in the analogical reasoning datasets, syntactic analogies are more strongly represented than semantic analogies. Also, most of the syntactic analogies include morphological variations of the words, which are better covered by surface form embeddings than by lemma embeddings, where morphological variations are conflated into a single base form. For example, comparative and superlative analogies like cheap-cheaper and cold-colder require morphological variations of the adjective, obtained by adding the -er and -(e)st suffixes, and although some adjectives and adverbs produce irregular forms, the only ones included in the dataset are good-better, bad-worse, good-best, and bad-worst.
In contrast, in the similarity datasets (see Table 8.4) most of the word pairs available for evaluation are either nouns, named entities, or lemmatized verbs—only the Verb-143 dataset contains conjugated verbs. Thus, most of the word pairs are non-inflected word forms, and therefore concept and lemma embeddings are the representations best suited for this task, which is in line with the evaluation results reported in Table 8.5.
Remarkably, in the analogy task, surface form embeddings jointly learnt with concept embeddings achieve the highest performance, and in the similarity task jointly learning concept and lemma embeddings also improves the performance of the lemma embeddings. Therefore, including concept embeddings in the learning process of other embeddings generally helps to learn better distributional representations.
The word prediction task provides additional insight in the sense that embeddings
learnt from linguistic features seem to better capture the distributional context, as
evaluated against unseen text. Surface form and lemma embeddings were the easiest
to predict.
Finally, in the text classification task, single embeddings from lemma and surface form annotations were more useful than space-separated token embeddings. Nevertheless, in this task concept embeddings perform worst, mainly due to their low coverage of the overall vocabulary, indicating that the knowledge graph used to annotate the text possibly provides limited coverage of the domain. We also show that jointly learning lemma and surface form embeddings helps to train the best classifier in terms of f-measure and recall. Furthermore, adding grammar embeddings produced the best overall precision.
Next, we illustrate with actual code some of the experiments performed in this
chapter. The notebook is available online3 as part of the tutorial accompanying this
book.
8.5 Practice: Classifying Scientific Literature Using Surface Forms
Here we show a practical exercise of how to use Vecsigrafo and a surface form-based
tokenization strategy to classify scientific literature from Springer Nature SciGraph.
Articles are classified in one or more of the 22 first-level categories in SciGraph.
Previously we have extracted from SciGraph the papers published in 2011. For each
paper we consider only the text in the title and the abstract.
3 https://colab.research.google.com/github/hybridnlp/tutorial/blob/master/09_classification_of_scholarly_communications.ipynb.
Unzip the content and set the variables that point to the data and embeddings.
To speed up the classifier learning process, we take a sample of the whole dataset. If you want to use the whole dataset, please comment out the second-to-last line below.
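A minimal sketch of this sampling step could look as follows; the file name and column names are assumptions rather than the exact ones used in the tutorial notebook.

```python
# Hypothetical loading and sampling of the SciGraph 2011 papers (title + abstract).
import pandas as pd

papers = pd.read_csv("scigraph_2011_papers.csv")      # assumed file name
papers = papers.sample(frac=0.1, random_state=42)     # comment this line out to use the whole dataset
texts = (papers["title"] + ". " + papers["abstract"]).tolist()
labels = papers["categories"].tolist()
```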
Glance at the vocabulary gathered by the tokenizer. Note that surface forms of multi-word expressions use the + symbol to concatenate the individual words.
The embedding layer is a matrix with the embeddings of all the vocabulary words. In other words, the rows correspond to the words in the vocabulary and the columns to the embedding dimensions.
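As an illustration, an embedding layer of this kind could be built in Keras as follows; sizes and variable names are illustrative, and the notebook uses its own helper functions to load the actual Vecsigrafo vectors.

```python
# Build an embedding layer whose weight matrix has one row per vocabulary word.
import numpy as np
from tensorflow.keras.layers import Embedding

vocab_size, dim = 50000, 300                        # illustrative sizes
embedding_matrix = np.random.rand(vocab_size, dim)  # stand-in for the loaded Vecsigrafo vectors
embedding_layer = Embedding(input_dim=vocab_size,
                            output_dim=dim,
                            weights=[embedding_matrix],
                            trainable=False)        # keep the pre-trained knowledge fixed
```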
8.6 Conclusion
In this chapter we have seen that embeddings learnt from linguistic annotations capture lexical, grammatical, and semantic information that space-separated token embeddings miss. In particular, jointly learning lemma and surface form embeddings helps to train the best classifier and, if grammar embeddings are also used, then the highest precision is also achieved.
In the future we will use linguistic annotations to enhance neural language models like BERT and assess whether such annotations can also help to learn better language models, improving performance when fine-tuned for various downstream tasks. In this regard, one path to explore is extending current transformer-based architectures used to learn language models so that not only word pieces but also other kinds of linguistic annotations, like the ones discussed in this chapter, can be ingested and their effect propagated across the model.
Chapter 9
Aligning Embedding Spaces
and Applications for Knowledge Graphs
9.1 Introduction
In the real world, NLP can often be used to extract value by analyzing texts in highly specific domains and settings, e.g. scientific papers on economics, computer science,
or literature in English and Spanish. In order to build effective NLP pipelines,
you may find yourself trying to combine embedding spaces with different but
overlapping vocabularies. For example, some embedding spaces may be derived
from generic but large corpora and only include words in the individual languages
(English and Spanish). Other embeddings may be derived from smaller corpora
which have been annotated with concepts, hence these embeddings include lemmas
as well as concepts. Finally, other embeddings may have been derived directly from
custom knowledge graphs for the specific domains. In many cases, you may even
have separate knowledge graphs for different languages as they may have been built
by different teams to address different goals in different markets (e.g., United States
vs Latin America).
Since maintaining separate embedding spaces and KGs is expensive, ideally you
want to reduce the number of embedding spaces and KGs or at least align them with
each other so that downstream systems, e.g. a classifier or a rule-engine, that depend
on a particular embedding space or KG can be reused.
In the symbolic world, KG curation and interlinking has been an active area of
research for decades [52]. In the machine learning world, recent techniques have
also been proposed. In this chapter we will briefly look at some of these techniques.
In Sect. 9.2 we start by providing a brief overview of existing approaches that have
been proposed; this section also provides an overview of how these techniques
can be applied to a variety of problems. Then, in Sect. 9.3 we look at some basic
embedding space alignment techniques and apply these to find a mapping between
word embeddings in two different languages. We wrap this chapter with a practical
exercise for finding alignments between old and modern English in Sect. 9.4.
In one of the original word2vec papers [120], Mikolov et al. already proposed ways
for aligning embeddings learned from corpora in different languages.1 This initial
technique relied on the assumption that the learned embedding spaces can be aligned
via a linear transformation. Unfortunately, the geometry of embedding spaces is
non-linear. Among others, the resulting spaces have issues such as hubness [45],
which means that there are clusters of many words in the vocabulary which are very
close to each other, i.e. the vectors are not evenly distributed across the space. It is
unclear how embedding learning algorithms can be manipulated to avoid such areas
or whether this is desirable.
Even if we managed to avoid geometrical issues with the embedding spaces,
differences in the vocabularies and the underlying training data would result in
words being assigned different areas in the space. Consider the words cat in English and gato in Spanish: although both primarily refer to felines, in Spanish the word is also commonly used to refer to a mechanical jack for lifting heavy objects. Therefore, at the word level, the embeddings for these words must occupy different areas in the space, since we want the Spanish word to be close to other words related to felines, but also close to words related to mechanical instruments (which is not the case for the English word). This is why, in recent years, there have been many works looking at non-linear alignment of pre-computed embedding spaces [37, 70].
Of course, deep learning systems are really good at finding non-linear functions;
therefore, it is also possible to tackle this alignment problem using standard machine
learning. In Sect. 9.3.2 we look at how this can be implemented.
Most of the work in this area has traditionally focused on aligning multi-lingual
word embeddings. However, the same techniques can be applied to a multitude of
alignment problems. In the area of hybrid neural/symbolic systems, it is interesting
to consider how embeddings can be used to improve knowledge graphs. As we
have seen in previous chapters, KGs are crucial in knowledge management and
complex information systems [135], but they are expensive to build and maintain.
This involves work finding whether there are new concepts that should be added to
the KG (including deciding what the correct place should be in the graph, or whether
there are already concepts which are related or even the same). Such KG curation
also includes finding errors in the KG due to human input error or automatically
derived knowledge. In the next subsections we provide an overview of work being
done in these areas.
The most straightforward alignment between two embedding spaces can be achieved
by using a translation matrix, as shown in [120]. Basically, a translation matrix W
is such that z=Wx, where z is a vector belonging to the target vector space and x is
the equivalent in the source.
To calculate the translation matrix, you need a dictionary that provides mappings for a subset of your vocabularies. You can then use standard linear algebra routines to calculate the pseudo-inverse. For optimal results, it is recommended to use very large corpora. If this is not possible and you have to use smaller corpora, it is a good idea to use parallel corpora so that the same words are encoded in similar ways.
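As an illustration, the translation matrix can be obtained as a least-squares (pseudo-inverse) solution in numpy; the sketch below uses stand-in data and is not the notebook's exact code.

```python
# Fit W such that z ≈ Wx for the seed dictionary pairs.
import numpy as np

d = 300
X = np.random.rand(450, d)   # stand-in: English vectors for the seed dictionary pairs
Z = np.random.rand(450, d)   # stand-in: vectors of their Spanish translations

# Least-squares solution of X @ W_t = Z (W_t is the transpose of W in z = Wx).
W_t, *_ = np.linalg.lstsq(X, Z, rcond=None)

def translate(x):
    """Map a source-space (row) vector into the target space."""
    return x @ W_t
```

With W_t in hand, a translated vector can be compared against the target-space vectors using cosine similarity, as in the examples that follow.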
In the following example, we use pre-trained embeddings for the most frequent
5K lemmas in the United Nations parallel corpus [202].
We first get the tutorial code and import the libraries we will be using.
Besides the embeddings for English and Spanish, we also provide a dictionary
that was generated automatically to map 1K English lemmas into Spanish.
This will print the first five lines in the dictionary file:
We can see from the reported numbers that some English lemmas were mapped
to the same Spanish lemma. Let us inspect some of the entries in the dictionary:
However, since the dictionary was generated automatically, it may be the case that some of its entries are not in the English or Spanish vocabularies, so we keep only the entries whose ids appear in the respective vecs:
From the 1K dictionary entries, only 477 pairs were in both the English and the Spanish vecs. In order to verify that the translation works, we split these into 450 pairs that we use to calculate the translation matrix and keep the remaining 27 for testing:
Before calculating the translation matrix, let us verify that we need one. We chose
three example words:
• conocimiento and proporcionar are in the training set,
• tema is in the test set.
For each word, we get:
• the five Spanish neighbors for the English vector,
• the five Spanish neighbors for the Spanish translation according to the dictionary.
This should output tables for the three search words which translate based on the
dictionary to knowledge, supply, and theme. The tables should be similar to:
Clearly, simply using the Spanish vector in the English space does not work. Let
us get the matrices:
As we can see, the translation matrix is just a 300 × 300 matrix. Now that we have the translation matrix, let us inspect the example words to see how it performs:
The output with the three updated tables for the query words is shown below. For reasons of space we have omitted the first column, as it was shown above. The next columns are the same as above: the results of looking up the Spanish vector directly in the English embedding space and the results of looking up the translation according to the dictionary. The final two columns are the results obtained using the translation matrix:
As we can see, the results provided by the translation matrix are similar to what
we would get if we had a perfect dictionary between the two languages. Note that
these results also apply to words which are not in the seed vocabulary, like tema.
Feel free to explore with other words.
The linear alignment seems to work OK for this set of embeddings. In our
experience, when dealing with larger vocabularies (and vocabularies mixing lemmas
and concepts), this approach does not scale, since the number of parameters is
limited to the d × d translation matrix.
For such cases it is possible to follow the same approach, but instead of deriving
a pseudo-inverse matrix, we train a neural network to learn a non-linear translation
function. The non-linearities can be introduced by using activation functions such
as ReLUs. See Towards a Vecsigrafo: Portable Semantics in Knowledge-based Text
Analytics [40] for more details.
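A minimal sketch of such a non-linear mapper is shown below, assuming 300-D source and target spaces; this is an illustration only, not the exact architecture used in [40].

```python
# A small feed-forward mapper with a ReLU non-linearity, trained to map
# source-space vectors onto their target-space counterparts.
import torch
import torch.nn as nn

mapper = nn.Sequential(
    nn.Linear(300, 512),
    nn.ReLU(),
    nn.Linear(512, 300),
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-3)

def train_step(x_batch, z_batch):
    """One gradient step pushing mapper(x) towards the target-space vectors z."""
    optimizer.zero_grad()
    loss = loss_fn(mapper(x_batch), z_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```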
Instead of using simple neural networks, there are also libraries which attempt to
learn mappings using mathematical and statistical analysis. In particular the MUSE
library by Facebook AI2 is worth trying out, especially since it provides both
supervised and unsupervised methods with and without a seed dictionary.
9.4 Exercise: Find Correspondences Between Old and Modern English
The purpose of this exercise is to use two Vecsigrafos, one built on UMBC and WordNet and another one produced by directly running Swivel against a corpus of Shakespeare's complete works, to try to find correlations between old and modern English, e.g. “thou” -> “you”, “dost” -> “do”, “raiment” -> “clothing”. For example, you can try to pick a set of 100 words in the “ye olde” English corpus and see how they correlate to UMBC over WordNet.
Next, we prepare the embeddings from the Shakespeare corpus and load a UMBC
Vecsigrafo, which will provide the two vector spaces to correlate.
First, we download the corpus into our environment. We will use the Shakespeare’s
complete works corpus, published as part of Project Gutenberg and publicly
available.3 If you have not cloned the tutorial yet, you can do so now:
2 https://github.com/facebookresearch/MUSE.
3 http://www.gutenberg.org.
The output will show messages from TensorFlow and the progress of learning the embeddings. Next, we check the contents of the “vec” directory. It should contain checkpoints of the model plus .tsv files for the column and row embeddings.
Finally, you can inspect the generated files with the following command:
Define a basic method to print the k nearest neighbors for a given word:
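A simple version of such a method could look as follows; the tutorial's own implementation may differ, and here vocab and vectors are assumed to be a list of terms and a matching numpy matrix of row embeddings.

```python
# Return the k most cosine-similar vocabulary terms to `word`.
import numpy as np

def k_neighbors(word, vocab, vectors, k=10):
    idx = vocab.index(word)
    v = vectors[idx]
    sims = vectors @ v / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(v) + 1e-9)
    ranked = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in ranked if i != idx][:k]
```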
Next, you can adapt the steps presented in Sect. 10.4.3 to load Vecsigrafo embed-
dings trained on the UMBC corpus and see how they compare to the embeddings
trained on the Shakespeare corpus in the previous section (Sect. 9.4.2).
Either follow the instructions given in Sect. 9.3.1 to find a linear alignment between the two embedding spaces or attempt to use the MUSE library. The goal is to find correlations between terms in old English extracted from the Shakespeare corpus and terms in modern English extracted from UMBC. If you choose to try a linear alignment, you will need to generate a dictionary relating pairs of lemmas between the two vocabularies and use it to produce a pair of translation matrices to transform vectors from one vector space to the other. Then apply the k_neighbors method to identify the correlations.
This exercise proposes the use of Shakespeare's complete works and UMBC to provide the student with embeddings that can be exploited for different operations between the two vector spaces. In particular, we propose to identify terms and their correlations across such spaces. If you want to contribute, we encourage you to submit your solution as a pull request for the corresponding tutorial notebook on GitHub.4
9.5 Conclusion
This chapter presented the issue of dealing with multiple, overlapping embedding
spaces. In practice, you will often need to build applications that combine models and results from different embedding spaces, which means that sooner or later you may have to deal with this issue. We have presented a few methods to deal with this problem and provided practical exercises. We also
highlighted specific applications of such embedding alignment approaches for more
general problems such as knowledge graph completion, multi-linguality, and multi-
modality.
4 https://github.com/hybridnlp/tutorial/blob/master/06_shakespeare_exercise.ipynb.
Part III
Applications
Chapter 10
A Hybrid Approach to Disinformation
Analysis
Abstract Disinformation and fake news are complex and important problems
where natural language processing can play an important role in helping people
navigate online content. In this chapter, we provide various practical tutorials
where we apply several of the hybrid NLP techniques involving neural models and
knowledge graphs introduced in earlier chapters to build prototypes that solve some
of the pressing issues posed by disinformation.
10.1 Introduction
In this chapter we will build a prototype for a real-world application in the context
of disinformation analysis. The prototype we will build shows how deep learning
approaches for NLP and knowledge graphs can be combined to benefit from the
best of both machine learning and symbolic approaches.
Among other things, we will see that injecting concept embeddings into simple
deep learning models for text classification can improve such models. Similarly, we
will show that the output of deep learning classifiers for identifying misinforming
texts can be used as input for propagating such signals in social knowledge graphs.
This chapter has four main sections:
• Section 10.2 provides an overview of the area of disinformation detection and a
high-level idea of what we want to build in the rest of the chapter,
• in Sect. 10.3 we build a database of claims that have been manually fact-checked
as well as a neural index in order to, given a query sentence, find similar claims
that have been previously fact-checked,
• in Sect. 10.4 we use hybrid word/concept embeddings to build a model for
detecting documents which show signs of using deceptive language, and
• Section 10.5 shows how we can combine information provided by human and
machine-learning based annotations and the structure of a knowledge graph to
propagate credibility scores. This allows us to estimate the credibility of nodes
for which we do not yet have direct evidence.
10.2 Disinformation Detection
Disinformation and misinformation are, at the time of writing, hot research topics. Although disinformation has always been a prevalent phenomenon in societies, it has gained further impact in the era of decentralized and social media. While
disinformation used to require control of mass communication media, nowadays
anyone with a social media account is able to spread (mis)information. Furthermore,
it is much harder to control the spread of misinformation, since the threshold to
spread messages has been greatly reduced. In this section we provide a summary of
what has been published in academic research related to disinformation and discuss
how it has impacted our design for what we will build in the remaining sections of
this chapter.
To start, it is useful to have a precise definition of disinformation to inform our
design. Fallis [56] considers various philosophical aspects such as the relation (and differences) between disinforming and lying. The analysis identifies that, when speaking about disinformation, it is crucial to take into account the intention of the source as well as the communication process (besides the information being disseminated as such). The author arrives at the following formal definition of disinformation. You disinform X if and only if:
• You disseminate information I
• You believe that p is false
• You foresee that X is likely to infer from the content of information I that p
• p is false
• It is reasonable for X to infer from the content of information I that p
While this seems to be a very good formal definition of disinformation, in
practice, it poses several technical difficulties. The main difficulty is that current
NLP systems only have limited support for proposition-level analysis of the content: i.e. NLP systems can identify actors, entities, emotions, topics, and themes from the content of the information, but offer very limited extraction of propositions. Even when propositions are extracted, current systems have trouble understanding negations within
propositions, which makes it very hard to determine whether the proposition is true
or not. For these reasons, when developing our disinformation detection module, we
use a weakened form of detection: we assign a disinformation score to a document
which represents the likelihood that the document contains disinformation. This
opens the possibility to use statistical methods while not having to deal with
propositional level knowledge.
For the majority of the nodes in this graph, we will not have a reliable estimate of their credibility. Therefore, we need a method for propagating the knowledge that we do have to other nodes in the graph. In Sect. 10.5 we will implement a prototype for such a subcomponent.
10.3 Application: Build a Database of Claims
In this section we will build a semantic search engine for fact-checked claims using
BERT. The overall approach will be to:
1. create a fine-tuned version of BERT that is able to produce claim embeddings
in such a way that semantically similar claims are close to each other in the
embedding space and
2. use the resulting BERT claim encoder to create an index for a dataset of fact-
checked claims that can be used to find claims.
As explained above, we want to train a deep learning model that is capable of:
• Given a claim c, producing an embedding v_c for that claim in such a way that:
• if c1 and c2 are semantically similar (e.g., they are paraphrases of each other), then f_dist(v_c1, v_c2) ≈ 0 for some distance function f_dist.
1 https://aclweb.org/aclwiki/SemEval_Portal.
2 http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark.
3 https://gluebenchmark.com/tasks.
We will use BERT as a starting point, since it is the current state of the art in deep
learning architectures for NLP tasks and is a representative of a Transformer-based
deep learning model. The advantage of using BERT is that it has already been pre-
trained on a large corpus, so we only need to fine-tune it on the STS-B dataset.
We will use the Huggingface PyTorch-Transformers4 library as an interface to
the BERT model. We can install it on our environment, as follows:
4 https://github.com/huggingface/PyTorch-transformers.
This will install the required library and its dependencies. Next, we can import
various libraries:
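A minimal setup sketch is shown below, assuming the pytorch-transformers package (since renamed to transformers) has already been installed with pip; bert-base-uncased is one of the standard pre-trained checkpoints.

```python
# Load the pre-trained BERT tokenizer and model.
import torch
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # switched back to training mode later, when fine-tuning
```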
Now that we have the BERT tokenizer and model, we can pass it a sentence, but we need to define which output of BERT we want to use as the sentence embedding. We have several options:
• Input sequences are prepended with a special token [CLS], which is meant to be used for classification of the sequence.
• We can combine the final layer of contextual embeddings, e.g. by concatenating or pooling them (taking the sum or average).
• We can use any combination of layers (e.g., the final four layers).
Also, since the model and tokenizer need to be used together, we define a tok_model dict that we can pass to our helper functions. We will split the implementation into the following methods:
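Building on the tokenizer and model loaded above, one possible implementation is to mean-pool the final layer of contextual embeddings; the notebook splits this into several helper methods and may pool differently (e.g., using the [CLS] token).

```python
# Keep tokenizer and model together, and encode a sentence into a single vector.
tok_model = {"tokenizer": tokenizer, "model": model}

def encode_sentence(sentence, tok_model):
    tokenizer, model = tok_model["tokenizer"], tok_model["model"]
    tokens = ["[CLS]"] + tokenizer.tokenize(sentence) + ["[SEP]"]
    input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
    with torch.no_grad():
        last_hidden = model(input_ids)[0]      # (1, seq_len, hidden_size)
    return last_hidden.mean(dim=1).squeeze(0)  # mean pooling -> (hidden_size,)
```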
5 https://github.com/google/sentencepiece.
6 https://huggingface.co/transformers/pretrained_models.html.
The pre-trained BERT model is optimized to predict masked tokens or the next
sentence in a pair of sentences. This means that we cannot expect the pre-trained
BERT to perform well in our task of semantic similarity. Therefore, we need to
fine-tune the model. In PyTorch, we can do this by defining a PyTorch Module as
follows:
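A sketch of what such a module might look like is given below; the notebook's actual class and pooling choice may differ.

```python
# Wrap the pre-trained BertModel so its parameters can be fine-tuned end to end.
import torch.nn as nn

class BertSentenceEncoder(nn.Module):
    """Encodes a batch of token-id sequences into one vector per sentence."""
    def __init__(self, bert):
        super().__init__()
        self.bert = bert

    def forward(self, input_ids):
        hidden = self.bert(input_ids)[0]  # (batch, seq_len, hidden_size)
        return hidden.mean(dim=1)         # (batch, hidden_size)
```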
We are now ready to define the main training loop. This is a pretty standard loop for
PyTorch. The main thing here is that we:
• iterate over batches of the STS-B dataset and produce encodings for both
sentences,
• then we calculate the cosine similarity between the two encodings and map that
onto a predicted similarity score in a range between 0 and 1, and
• we use the STS-B value (normalized to the same range) to define a loss and train
the model.
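A condensed sketch of such a loop, building on the encoder sketched above, is shown below; data loading, batching, and the dev-set evaluation are omitted, and the hyperparameters are illustrative.

```python
# One training epoch over STS-B sentence pairs.
import torch
import torch.nn.functional as F

encoder = BertSentenceEncoder(model)          # model loaded earlier
optimizer = torch.optim.Adam(encoder.parameters(), lr=2e-5)

def train_epoch(batches):
    encoder.train()
    for ids_a, ids_b, gold in batches:            # gold STS-B scores normalized to [0, 1]
        emb_a, emb_b = encoder(ids_a), encoder(ids_b)
        cos = F.cosine_similarity(emb_a, emb_b)   # values in [-1, 1]
        pred = (cos + 1.0) / 2.0                  # map onto [0, 1]
        loss = F.mse_loss(pred, gold)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```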
7 https://PyTorch.org/docs/stable/data.html?highlight=dataset#torch.utils.data.Dataset.
8 https://PyTorch.org/docs/stable/data.html#torch.utils.data.DataLoader.
We are now ready to train the model by defining the data loaders:
To train the model we need to create the optimizer. In the next lines we also start the training (this can take about 10 min on a regular GPU):
Note that before training, we validate using the dev part of the dataset and achieve r_pearson = 0.1906, which is what the pre-trained BERT produces. This shows that the default BERT embeddings are not very semantic, or at least not well aligned with what humans regard as semantic similarity. The fine-tuned model should achieve an r_pearson score close to 0.8, which is much better aligned with human ratings.
Now that we have a model for producing semantic embeddings of sentences, we can
create a simple semantic index and define methods to populate and query it.
Our semantic index is simply a Python dict with the fields sent_encoder, our semantic encoder, and sent2emb, a dict from each sentence to its embedding.
And a method to iterate over all the STS-B items in one of the DataFrames we
loaded at the beginning of the section:
To explore the newly populated dataset, we can define a method to find the top k
elements in the index for a given sentence:
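A simplified stand-in for these pieces is sketched below; sent_encoder can be any callable that maps a sentence to an embedding, e.g. the fine-tuned encoder above.

```python
# Build, populate, and query a toy semantic index.
import torch.nn.functional as F

def new_index(sent_encoder):
    return {"sent_encoder": sent_encoder, "sent2emb": {}}

def add_sentences(index, sentences):
    for s in sentences:
        index["sent2emb"][s] = index["sent_encoder"](s)

def top_k(index, query, k=5):
    q = index["sent_encoder"](query)
    scored = [(float(F.cosine_similarity(q.unsqueeze(0), emb.unsqueeze(0))), s)
              for s, emb in index["sent2emb"].items()]
    return sorted(scored, reverse=True)[:k]
```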
In the remainder of this section, we will explore the dataset using some examples.
We start by querying for a sentence related to news about traffic accidents in China:
cosim Sentence
0.993376 ‘Around 100 dead or injured’ after China earthquake
0.992287 Hundreds dead or injured in China quake\n
0.990573 Floods leave six dead in Philippines
0.989853 At least 28 people die in Chinese coal mine explosion\n
0.989653 Heavy rains leave 18 dead in Philippines\n
Let us explore another example on the topic of the economic output of the USA:
cosim Sentence
0.9973 North American markets grabbed early gains Monday morning,. . .
0.9969 North American markets finished mixed in directionless trading Monday . . .
0.9966 S. Korean economic growth falls to near 3-year low\n
0.9963 The blue-chip Dow Jones industrial average .DJI climbed 164 points, . . .
0.9962 That took the benchmark 10-year note US10YT=RR down 9/32, its yield . . .
So the results on STS-B dev seem OK. Now, let us create an index for a dataset of
checked facts from Data Commons’ Fact Check.9 First, let us download the dataset:
9 https://www.datacommons.org/factcheck/download#research-data.
We can also define a method to convert the nested Python dict into a pandas
DataFrame. We are not interested in all the data in the json feed, so we only
populate a few columns.
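A sketch of such a conversion is shown below; the exact JSON paths are assumptions about the feed layout, so treat them as illustrative.

```python
# Flatten the ClaimReview feed into a pandas DataFrame with a few columns.
import pandas as pd

def feed_to_dataframe(feed):
    rows = []
    for element in feed.get("dataFeedElement", []):
        for review in element.get("item", []):
            rows.append({
                "claimReviewed": review.get("claimReviewed"),
                "reviewed_by": review.get("author", {}).get("name"),
                "rating": review.get("reviewRating", {}).get("alternateName"),
            })
    return pd.DataFrame(rows)
```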
claimReviewed | reviewed_by
“Sumber daya yang sebelumnya dikuasai asing, berhasil…” (Indonesian) | Tempo.co
The push by Assembly Democrats seeking Americans with… | PolitiFact
The EU sends Northern Ireland €500 million a year | Fact Check NI
A claim that herdsmen walked into the terminal… | DUBAWA
The datafeed contains claims in many different languages, and since our model only
works for English, we should only take into account English claims. Unfortunately,
the feed does not include a language tag, so we need to filter the feed.
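One possible way to keep only the English claims is to use a language-identification package such as langdetect; this is an assumption, and the notebook may use a different tool.

```python
# Filter the DataFrame, keeping rows whose claim text is detected as English.
from langdetect import detect

def keep_english(df):
    def is_english(text):
        try:
            return detect(str(text)) == "en"
        except Exception:
            return False
    return df[df["claimReviewed"].apply(is_english)]
```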
Now we can use this method to explore the neighboring claims in the index for query sentences. First, we try to find claims related to a claim about Brexit:
The results of our run are depicted in Table 10.4. The numerical value for the predictions is what the model outputs, i.e. the range is between 0 (not similar at all) and 1 (semantically very similar). In this case, we see that a related, but narrower, claim was found with a semantic similarity score of 0.76. The other results are below 0.7 and are not about Brexit at all.
Next, we use a query claim related to Northern Ireland and EU contributions:
Table 10.4 Claim search results for “Most people in UK now want Brexit”
Claim | Pred | True?
77% of young people in the UK don’t want Brexit | 0.766 | Inaccurate. Polls for Great Britain show support from those aged 18–24 for remaining in the EU at between 57% and 71%; for Northern Ireland, betwe…
A claim that says Nigeria’s Independent National Electoral Commission [INEC] ban phones at polling stations | 0.681 | The claim that INEC has banned the use of phones and cameras at polling stations is NOT ENTIRELY FALSE. While you are not banned from going to the…
The DUP at no point has ever agreed to establish an Irish Language Act with the UK government, with the Irish government, with Sinn Féin or anybod… | 0.636 | Accurate. The St Andrew’s Agreement committed the UK Government to an Irish Language Act, but subsequent legislation compelled the Northern Irelan…
Says there could be a potential mass shooting at a Walmart nearby | 0.636 | It’s a widespread hoax message
Claim video claiming Muslims protesting in Kashmir after Eid prayers against article 370 dissolution | 0.630 | FALSE
Table 10.5 Claim search results for “Northern Ireland receives yearly half a billion pounds from the European Union”
Claim | Pred | True?
The EU sends Northern Ireland €500 million a year | 0.752 | ACCURATE WITH CONSIDERATION. The €500 million figure quoted by the SDLP is substantiated by European Commission figures for EU regional funding of…
Northern Ireland is a net contributor to the EU | 0.747 | This claim is false, as we estimate that Northern Ireland was a net recipient of £74 million in the 2014/15 financial year. Others have claimed th…
Arlene Foster, the leader of the Democratic Unionist Party, said that the party delivered “an extra billion pounds” for Northern Ireland | 0.736 | ACCURATE. The £1bn is specific to the jurisdiction of Northern Ireland and is in addition to funding pledged as a result of the Stormont House Agr…
Northern Ireland were once net contributors of revenue to HM Treasury | 0.732 | True, up until the 1930s. But data show that Northern Ireland has run a fiscal deficit since 1966. The most recent figure, from 2013–2014, is a subv…
The results are shown in Table 10.5. In this case we see that the first two matches are on topic, with scores above 0.74. Notice that the only words that appear both
in the query and the result for the top result are “Northern Ireland.” The rest of the
top 5 is still about money and Northern Ireland, but no longer relate to the EU, even
though the similarity score is still in the range of [0.73, 0.74].
Let us explore a final example: State hacking of digital devices.
Table 10.6 Claim search results for “The state can hack into any digital device”
Claim | Pred | True?
Claim: All computers can now be monitored by government agencies | 0.704 | Fact Crescendo Rating: True
Claim unrelated image from a random FB profile used to recirculate an old incident | 0.641 | FALSE
EVMs hacked by JIO network | 0.641 | Fact Crescendo Rating: False
A video of Mark Zuckerberg shows him talking about controlling “billions of people’s stolen data” to control the future | 0.633 | Pants on Fire
The results are shown in Table 10.6. In this final example, we see that one claim
has score above 0.7 and is again a related claim. The other results are somewhat
related, but not directly relevant to assess the query claim.
In this practical section we saw how transformer-based models like BERT can be
used to create a neural index for finding semantically similar claims. The examples
shown above provide an indication of how well these models work. Although
in many cases the model is able to find semantically similar sentences, simply
having a cosine similarity does not provide enough information about whether the
most similar claims found are truly a paraphrasing of the query sentence or not.
Fortunately, the ClaimReview format provides a rich context which can help us
to collect more information about the credibility of the claims and documents. In
the next sections we will see how to analyze longer documents to get a prediction
about whether they contain deceptive language (Sect. 10.4) and how we can combine
different signals about the credibility of a document (typically obtained from
human-annotated claims or machine learning models) and the knowledge graph
structure provided by ClaimReviews to estimate the credibility of other nodes
in the graph (Sect. 10.5).
10.4 Application: Fake News and Deceptive Language Detection
In this section, we will look at how we can use hybrid embeddings in the context
of NLP tasks. In particular, we will see how to use and adapt deep learning
architectures to take into account hybrid knowledge sources to classify documents.
First, we will introduce a basic pipeline for training a deep learning model to
perform text classification.
As a first dataset, we will use the deceptive opinion spam dataset.10 See the
exercises below for a couple of more challenging datasets on fake news detection.
This corpus contains:
• 400 truthful positive reviews from TripAdvisor
• 400 deceptive positive reviews from Mechanical Turk
• 400 truthful negative reviews from Expedia, Hotels.com, Orbitz, Priceline,
TripAdvisor, and Yelp
• 400 deceptive negative reviews from Mechanical Turk
The dataset is described in more detail in the papers by Ott et al. [133, 134]. For
convenience, we have included the dataset as part of our GitHub tutorial repository.
The last two lines show that the dataset is distributed as a comma-separated-value
file with various fields. For our purposes, we are only interested in fields:
• deceptive: this can be either truthful or deceptive
• text: the plain text of the review
The other fields: hotel (name), polarity (positive or negative), and source
(where the review comes from) are not relevant for us in this practical section.
Let us first load the dataset in a format that is easier to feed into a text
classification model. What we need is an object with fields:
• texts: an array of texts
• categories: an array of textual tags (e.g., truthful or deceptive)
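A hedged sketch of this loading step is shown below, using the column names described above; the CSV path inside the tutorial repository is an assumption.

```python
# Load the deceptive opinion spam dataset into the simple structure described above.
import pandas as pd

df = pd.read_csv("deceptive-opinion.csv")    # assumed path within the tutorial repo
raw_hotel_ds = {
    "texts": df["text"].tolist(),
    "categories": df["deceptive"].tolist(),  # values are "truthful" or "deceptive"
}
```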
10 http://myleott.com/op-spam.html.
The previous cell has in fact loaded two versions of the dataset:
• raw_hotel_ds contains the actual texts as originally published and
• raw_hotel_wnscd_ds provides the WordNet disambiguated tlgs tok-
enization (see Sect. 6.5 on Vecsigrafo for more details about this format).
This is needed because we do not have a Python method to automatically
disambiguate a text using WordNet, so we provide this disambiguated version as
part of the GitHub repo for this tutorial.
Cleaning the raw text often produces better results; we can do this as follows:
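A simple cleaning routine is sketched below as an illustration only; the notebook's own cleaning rules may differ.

```python
# Lowercase, strip punctuation, and collapse whitespace in each review.
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s']", " ", text)  # drop punctuation and other symbols
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

clean_texts = [clean_text(t) for t in raw_hotel_ds["texts"]]
```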
As we said above, the raw datasets consist of texts, categories, and tags. There are different ways to process the texts before passing them to a deep learning architecture, but typically they involve:
• Tokenization: how to split each document into basic forms which can be represented as vectors. In this section we will use tokenizations which result in words and synsets, but there are also architectures that accept character-level tokens or character n-grams.
• Indexing of the text: in this step, the tokenized text is compared against a vocabulary, i.e. a list of words (if no vocabulary is provided, one can be created from the text), so that a unique integer identifier can be assigned to each token. You need this because tokens will then be represented as embeddings, i.e. vectors in a matrix, and the identifier tells you which row in the matrix corresponds to which token in the vocabulary.
The clsion library, included in the tutorial GitHub repo, already provides
various indexing methods for text classification datasets. In the next cell we apply
simple indexing, which uses white-space tokenization and creates a vocabulary
based on the input dataset.
Since the vocabulary was created based on the dataset, all tokens in the dataset are
also in the vocabulary. In the next sections, we will see examples where embeddings
are provided during indexing.
The following cell prints a couple of characteristics of the indexed dataset:
The output allows us to see that the vocabulary is quite small (about 11K words). By default, it specifies that the vocabulary embeddings should be of dimension 150, but no pre-trained vectors are provided. This means the model will assign random embeddings to the 11K words.
Under the hood, the library creates a bidirectional LSTM model as requested (the library can also create other model architectures such as convolutional NNs).
Since our dataset is fairly small, we do not need a very deep model. A fairly
simple bidirectional LSTM should be sufficient. The generated model will consist
of the following layers:
• The input layer: a tensor of shape (l, ), where l is the number of tokens for each document. This lets us pass the model any number of input documents, as long as they all have the same number of tokens.
• The embedding layer: converts each input document (a sequence of word ids) into a sequence of embeddings. Since we are not yet using pre-computed embeddings, these will be generated at random and trained with the rest of the parameters in the model.
• The lstm layers: one or more bidirectional LSTMs. Explaining these in detail is out of the scope of this tutorial. Suffice it to say, each layer goes through each embedding in the sequence and produces a new embedding, taking into account the preceding and following embeddings. The final layer produces a single embedding, which represents the full document.
• The dense layer: a fully connected neural network that maps the output embedding of the final LSTM layer to a vector of 2 dimensions, which can be compared to the manually labeled tag.
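A rough Keras sketch of such an architecture is shown below; the layer sizes are illustrative, and the clsion library builds its own equivalent model.

```python
# Embedding -> stacked bidirectional LSTMs -> 2-way softmax classifier.
from tensorflow.keras import layers, models

max_len, vocab_size, emb_dim = 500, 11000, 150   # illustrative sizes

bilstm = models.Sequential([
    layers.Embedding(vocab_size, emb_dim, input_length=max_len),  # random, trainable embeddings
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(64)),        # final layer -> one embedding per document
    layers.Dense(2, activation="softmax"),        # truthful vs deceptive
])
bilstm.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```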
Finally, we can run our experiment using the n_cross_val method. If you do not have an environment with a GPU, this can be a bit slow, so we only train the model once. (In practice, model results may vary due to random
initializations, so it is usually a good idea to run the same model several times to get
an average evaluation metric and an idea of how stable the model is.)
The first element of the result is a DataFrame containing test results and a record
of the used parameters. You can inspect it by executing:
10.4.1.4 Discussion
Bidirectional LSTMs are really good at learning patterns in text and were one of the most used architectures before transformer-based architectures were introduced. However, this way of training a model will tend to overfit the training dataset. Since our dataset is fairly small and narrow (it only contains texts about hotel reviews), we should not expect this model to be able to detect fake reviews about other products or services. Similarly, we should not expect this model to be applicable to detecting other types of deceptive texts such as fake news.
The reason why such a model is very tied to the training dataset is that even the
vocabulary is derived from the dataset: it will be biased towards words (and senses
of those words) related to hotel reviews. Vocabulary about other products, services,
and topics cannot be learned from the input dataset.
Furthermore, since no pre-trained embeddings were used, the model had to learn
the embedding weights from scratch based on the signal provided by the “deceptive”
tags. It did not have an opportunity to learn more generic relations between words
from a wider corpus.
In this section we use embeddings learned using HolE and trained on WordNet 3.0.
As we have seen in the previous chapters, in particular Chap. 5, such embeddings
capture the relations specified in the WordNet knowledge graph. As such, synset
embeddings tend to encode useful knowledge. However, lemma embeddings tend
to be of poorer quality when learned from the KG (compared to learning them from
large corpora of text).
Execute the following cell to download and unpack the embeddings. If you recently
executed previous notebooks as part of this tutorial, you may still have these in your
environment.
Now that we have the WordNet HolE embedding in the right format, we can
explore some of the “words” in the vocabulary:
As in the previous case (see Sect. 10.4.1.2), we need to tokenize the raw
dataset. However, since we now have access to the WordNet HolE embeddings,
it makes sense to use the WordNet disambiguated version of the text (i.e.,
raw_hotel_wnscd_ds). The clsion library already provides a method
index_ds_wnet to perform tokenization and indexing using the expected
WordNet encoding for synsets.
The above produces an ls tokenization of the input text, which means that
each original token is mapped to both a lemma and a synset. The model will
then use both of these to map each token to the concatenation of the lemma and
synset embedding. Since the WordNet HolE embeddings have 150 dimensions, each token will be represented by a 300-dimensional embedding (the concatenation of the lemma and synset embeddings).
We define the experiment using this new dataset as follows. The main change is that we do not want the embedding layer to be trainable, since we want to maintain the knowledge learned via HolE from WordNet. The model should only train the LSTM and dense layers to predict whether the input text is deceptive or not.
10.4.2.5 Discussion
Although the model performs worse than the csim version, we can expect the
model to be applicable to closely related domains. The hope is that, even if words
did not appear in the training dataset, the model will be able to exploit embedding
similarities learned from WordNet to generalize the “deceptive” classification.
If you executed previous notebooks, you may already have the embedding in your
environment.
Since the embeddings were distributed as tsv files, we can use the
load_tsv_embeddings method. Training models with all 1.4M vocab
elements requires a lot of RAM, so we limit ourselves to only the first 250K
vocab elements (these are the most frequent lemmas and synsets in UMBC).
We use the concat_embs method, which will go through the vocabularies of both
input embeddings and concatenate them. Missing embeddings from one vocabulary
will be mapped to the zero vector. Note that since wnHolE_emb has dimension
150 and wn_vecsi_umbc_emb has dimension 160, the resulting embedding
will have dimension 310. (Besides concatenation, you could also experiment with
other merging operations such as summation, subtraction, or averaging of the
embeddings.)
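The following is a simplified stand-in for concat_embs; the vocabulary and vector structures are assumptions.

```python
# Concatenate two embedding spaces over the union of their vocabularies,
# filling missing entries with zero vectors.
import numpy as np

def concat_embeddings(vocab_a, vecs_a, vocab_b, vecs_b):
    dim_a, dim_b = vecs_a.shape[1], vecs_b.shape[1]
    index_a = {w: i for i, w in enumerate(vocab_a)}
    index_b = {w: i for i, w in enumerate(vocab_b)}
    merged_vocab = list(vocab_a) + [w for w in vocab_b if w not in index_a]
    merged = np.zeros((len(merged_vocab), dim_a + dim_b))
    for i, w in enumerate(merged_vocab):
        if w in index_a:
            merged[i, :dim_a] = vecs_a[index_a[w]]
        if w in index_b:
            merged[i, dim_a:] = vecs_b[index_b[w]]
    return merged_vocab, merged
```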
In this section we have shown how to use different types of embeddings as part
of a deep learning text classification pipeline. We have not performed detailed
experiments on the WordNet-based embeddings used in this notebook and, because
the dataset is fairly small, the results can have quite a bit of variance depending on
the initialization parameters. However, we have performed studies based on Cogito-
based embeddings. The tables below show some of our results:
The first set of results corresponds to experiment 1 above. We trained the embeddings but explored various tokenization strategies.
As discussed above, this approach produces the best test results, but the trained
models are very specific to the training dataset. The current practice when using
BiLSTMs or similar architectures is therefore to use pre-trained word-embeddings
(although this is now being replaced by simply fine-tuning transformer-based
architectures, which we leave as an exercise). fastText embeddings tend to yield
the best performance. We got the following results:
Next, we tried using HolE embeddings trained on Sensigrafo 14.2, which had very poor results:
Next we tried Vecsigrafo trained on both Wikipedia and UMBC, either using
only lemmas, only syncons, or both lemmas and syncons. Using both lemmas and
syncons always proved better.
We have not yet tried applying more recent contextual embeddings, but based
on results reported elsewhere we assume these should produce very good results.
We encourage you to use the Huggingface transformer library introduced back in
Sect. 3.4.2.1 to fine-tune BERT for this task.
In this practical section, we looked at what is involved in creating a model
to detect deceptive language. We saw that combining embeddings from different
sources can improve the performance of the model. We also saw that when training
such models, there is always a consideration about how well the model will perform
with data that is different from the training set.
In Sect. 10.3 we created a database of claims which have already been reviewed
by human fact-checkers. The main drawback of that database is that it is quite
limited; therefore, we said that it would be useful to have more automated ways
of figuring out whether a document should be trusted or not; in this section, we have
now created a model which does this. The presented model is just one of the many
automated models that can be implemented to estimate the credibility of a textual
document (see Sect. 10.2 for other methods). Assuming we are able to implement
or use such models that others have implemented, the next problem is: how can we
combine automatic estimates and our database of human-annotated claims? In the
next section we look at a way to do this.
10.5 Propagating Disinformation Scores Through a Knowledge Graph
In the previous sections we have seen how to build a classifier to detect deceptive
language in text, and we saw that it is not easy to build such a model in a way that
generalizes well. We also saw that we can build deep learning models to determine
whether two sentences are semantically similar. In this notebook, we look at how
we can use a knowledge graph of entities and claims to assign credibility values for
all entities based on a limited number of human-rated credibility scores.
In this section we will use the kg-disinfo library11 to obtain an estimation
of (dis)credibility, based on knowledge graphs (KGs). Given a KG and a few
seed nodes which contain (lack of) credibility scores (represented as a numerical
value), the system uses a metric propagation algorithm to estimate the (lack
of) credibility of neighboring nodes (for which no previous score is available).
The used knowledge graph is created from the Data Commons claimReview
datafeed.12 kg-disinfo implements a credibility score propagation algorithm
called appleseed, first published in [201].
11 https://github.com/rdenaux/kg-disinfo.
12 https://github.com/rdenaux/kg-disinfo.
13 https://storage.googleapis.com/datacommons-feeds/claimreview/latest/data.json.
14 https://json-ld.org.
In this section we will only use a few of the entities in the graph, namely those
accessible through paths:
To create the required knowledge graph, we need to know how to specify the graph
and injections15 in the format that kg-disinfo expects. The main points are:
• The graph is just a list of weighted and directed edges between nodes.
• The injections assign initial values to some nodes for the metric that should
be propagated. In our case these are discredibility scores.
• The weights of edges control how the injected values propagate between nodes
in the graph. You can define different weights depending on the nodes.
At the schema level, the graph and propagation weights could look as follows:
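Purely as an illustration (the exact field names expected by kg-disinfo may differ; see its README), a schema-level specification along these lines conveys the idea of weighted edges plus initial injections.

```python
# Hypothetical schema-level graph: weighted, directed edges plus discredibility injections.
schema_graph = {
    "edges": [
        # (source, target, weight): how strongly discredibility propagates along this relation
        ("claimReview_altName", "claim", 0.9),
        ("claim", "article", 0.7),
        ("article", "publisher_domain", 0.5),
        ("claim", "author", 0.5),
    ],
}
injections = {
    "claimReview_altName": 1.0,  # a node with a known discredibility score
}
```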
15 https://github.com/rdenaux/kg-disinfo#specifying-graphs-and-injections.
Note that weights were arbitrarily defined by us, taking into account how much
we think a certain relation between two nodes should propagate a discredibility
value. In reality, the schema of the graph may be implicit; when you convert the
Data Commons ClaimReview dataset into this format, you will only see instance
level relations between nodes in the graph, as we show in the next section.
Once we processed the Data Commons claimReview datafeed, the knowledge graph
in JSON notation is something like this:
We can see that there are some cases where the discredibility is 0.0 (“Error”: 0.0) but should be 1.0; this is because the given rating might be incorrectly encoded in the source dataset. In some other cases (“We Explain the Research”: 0.5), we cannot determine the discredibility score, and in that case we assign 0.5 by default. Therefore these ratings should not be treated as ground truth (someone could have created a false claimReview), but as an estimation of credibility.
To better understand how the propagation works, we will first demonstrate it at the schema level. To run it, we need to get:
• the code for the tutorial, as it contains the KG data we will be using, and
• the kg-disinfo distribution, which is a Java JAR file.
This should print a figure like that shown in Fig. 10.2. The idea is that the
“claimReview_altName” node has an injection of a discredibility score of 1.0.
This score is then propagated through the neighbors when running the kg-disinfo
application. The scores after propagation are shown in Fig. 10.3.
Now it is time to run kg-disinfo with the Data Commons claimReview knowledge graph to see how the discredibility scores propagate through the graph.
The output printed by the program should allow you to see that it does not take too much time, since we created the knowledge graph with only a small part of the datafeed (100 out of more than 2000 claimReviews).
The command outputs SVG files to visualize the graph. However, even for a relatively small graph of 100 claimReviews, the graph is difficult to inspect visually without tooling that allows filtering and zooming. We encourage you to load the generated files in your browser or another SVG viewer. If you are following this section in a Jupyter environment or on Google Colaboratory, you can inspect the SVG using the following commands:
If you do that, you should be able to see that our knowledge graph has subgraphs
that are not connected with other subgraphs, and that the only nodes with some
color are those with an initial discredibility score. These correspond to nodes of
type claimReview_altNames.
You can similarly display the graph after propagation:
You should be able to see that there are some nodes that previously did not have
a score of discredibility with an estimation. This does not only happen with the
claims, but also with authors, articles, and publisher domains.
Naturally, we can also get the estimated numerical numbers for each node.
Based on the knowledge graph discredibility propagation, one of the least credible nodes, besides the claimReview_altName nodes, is “Donald J. Trump.”
The algorithm distributes the initial injections throughout the graph, but the sum of all scores at the end of the propagation is the same as the sum of the initial injections. This means the initial scores are diluted, which makes interpreting the final scores difficult. A final normalization step could help with this.
At the moment the algorithm only propagates discredibility scores. However, we often also have a confidence value associated with each score. It would be good to have a way to propagate this confidence information as well.
At the moment, we can only propagate discredibility scores between a range of
0.0 and 1.0. Arguably, a more natural way of thinking about credibility is as a range
between −1 (not credible) and 1 (fully credible).
Finally, it is worth noting that credibility injections can come from a variety of
sources. ClaimReviews come from reputable fact-checker organizations, but alter-
native sources can also be considered. For example, there are various sites/services
which provide reputation scores for internet domains.
10.6 Conclusion
Chapter 11
Jointly Learning Text and Visual Information in the Scientific Domain
11.1 Introduction
This is almost the last stop of our journey towards hybrid natural language
processing. The readers that made it this far are now familiar with the concept
of Vecsigrafo and its evolution based on neural language models, how to apply
such representations in actual applications like disinformation analysis, and how
to inspect their quality. In this chapter we make things a step more complex and
propose to deal not only with text but also with other modalities of information,
including images, figures, and diagrams. Here, too, we show the advantages of
hybrid approaches involving not only neural representations but also knowledge
graphs and show the reader how to master them for different tasks in this context.
To this purpose, as in Chap. 8 we focus on the scientific domain, lexically
complex and full of information that can be presented in many forms, not only text.
Previous analysis [72] produced an inventory of the different types of knowledge
More details on the work described in this chapter can be found in [69], including
a complete qualitative study on the activation of the model that shows evidence of
improved textual and visual semantic discrimination over the equivalent uni-modal
cases. All the code and data, including the corpora extracted from the different
datasets, are also available in GitHub.1 Practical examples, including code to
train the FCC model, a qualitative inspection, a comparison with image-sentence
matching approaches, and application in classification and question answering, are
shown in Sect. 11.8.
The main idea of this task is to learn the correspondence between scientific figures
and their captions as they appear in a scientific publication. The information
captured in the caption explains the corresponding figure in natural language,
providing guidance to identify the key features of the figure and vice versa. By
seeing a figure and reading the textual description in its caption the aim is to learn
representations that capture, e.g. what it means that two plots are similar or what
gravity looks like.
In essence, FCC is a binary classification task that receives a figure and a caption
and determines whether they correspond or not. For training, positive pairs are
actual figures and their captions from a collection of scientific publications. Negative
pairs are extracted from combinations of figures and any other randomly selected
captions. The network is then made to learn text and visual features from scratch,
without additional labeled data.
The two-branch neural architecture (Fig. 11.1) proposed for the FCC task is very
simple. It has three main parts: the vision and language subnetworks, respectively,
extracting visual and text features, and a fusion subnetwork that takes the resulting
features from the visual and text blocks and uses them to evaluate figure-caption
correspondence.
The vision subnetwork follows a VGG-style [170] design, with four blocks
of conv+conv+pool layers. Based on [96], the language subnetwork has three
convolutional blocks and a 300-D embedding layer at the input, with a maximum
sequence length of 1000 tokens.2 Vision and language subnetworks each produce a
512-D vector in the last convolutional block. The fusion subnetwork calculates
the element-wise product of the 512-D visual and text feature vectors into a
single vector r to produce a 2-way classification output (correspond or not). The
probability of each choice is the softmax of r, i.e. ŷ = softmax(r) ∈ ℝ². During
training, the negative log probability of the correct choice is minimized.
1 https://github.com/hybridNLP/look_read_and_enrich.
2 Note that the architecture of both vision and language subnetworks can be replaced by others,
including (bi)LSTMs or based on neural language models. In this case CNNs were chosen for
simplicity and ease of inspection.
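The fusion step can be sketched with the Keras functional API as follows (a simplified illustration of the description above, not the exact published code; the input names stand for the 512-D outputs of the two branches):

```python
from tensorflow.keras import Input, Model, layers

vision_features = Input(shape=(512,), name="vision_features")
text_features = Input(shape=(512,), name="text_features")

# Element-wise product of the two 512-D feature vectors.
r = layers.Multiply(name="fusion")([vision_features, text_features])
# 2-way classification: the figure and caption correspond or not.
y_hat = layers.Dense(2, activation="softmax", name="correspondence")(r)

fusion_net = Model([vision_features, text_features], y_hat)
# Categorical cross-entropy minimizes the negative log probability of the
# correct choice, as described above.
fusion_net.compile(optimizer="adam", loss="categorical_crossentropy",
                   metrics=["accuracy"])
```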
Now, let V be the vocabulary of words from a collection of documents D. Also, let
L be the lemmas of such words, i.e. base forms without morphological or conjugational
variations, and C the concepts (or senses) in a knowledge graph. Each word wk in
V, e.g. made, has one lemma lk (make) and may be linked to one or more concepts
ck in C (create or produce something).
For each word wk, the FCC task learns a d-dimensional embedding wk, which can be
combined with pre-trained word (w̄k), lemma (lk), and concept (ck) embeddings
to produce a single vector tk. If no pre-trained knowledge is transferred from an
external source, then tk = wk. Note that D was previously lemmatized and disambiguated
against the knowledge graph in order to select the right pre-trained lemma
and concept embeddings for each particular occurrence of wk. Equation (11.1)
shows the different combinations of learnt and pre-trained embeddings considered:
(a) learnt word embeddings only, (b) learnt and pre-trained word embeddings, and
(c) learnt word embeddings and pre-trained semantic embeddings, including both
lemmas and concepts, in line with recent findings [39].
$$
t_k =
\begin{cases}
w_k & \text{(a)}\\
{[\,w_k;\ \bar{w}_k\,]} & \text{(b)}\\
{[\,w_k;\ l_k;\ c_k\,]} & \text{(c)}
\end{cases}
\qquad (11.1)
$$
The dimensionality of tk is fixed to 300, i.e. the size of each sub-vector
in configurations (a), (b), and (c) is 300, 150, and 100, respectively. Training used
tenfold cross-validation and Adam optimization [97] with learning rate 10⁻⁴ and weight
decay 10⁻⁵. The network was implemented in Keras and TensorFlow, with batch
size 32. The number of positive and negative cases is balanced within the batches.
Knowledge graph embedding approaches like HolE [130] and Vecsigrafo [39]
can be used to learn semantic embeddings that enrich the pre-trained FCC features.
In contrast to Vecsigrafo, which requires both a text corpus and a knowledge graph,
HolE follows a graph-based approach where embeddings are learnt exclusively
from the knowledge graph. As Sect. 11.4 will show, this gives Vecsigrafo a certain
advantage in the FCC task. Following up on previous work [39], the knowledge
graph used here is also Sensigrafo, which underlies Expert System's Cogito NLP
platform. As introduced earlier in the book, Sensigrafo is conceptually similar
to WordNet. Cogito was used to disambiguate the text corpora prior to training
Vecsigrafo. All the semantic (lemma and concept) embeddings produced with HolE
or Vecsigrafo are 100-D.
Next, we compare the actual FCC task against two supervised baselines and
inspect the resulting features from a qualitative point of view. Then, we situate the
task of learning the correspondence between scientific figures and captions in the
more general context of image-sentence matching in order to illustrate the additional
complexity compared to natural images.
11.3 Datasets
3 https://www.springernature.com/gp/researchers/scigraph.
Wikipedia. The January 2018 English Wikipedia dump is used as one of the corpora
on which to train Vecsigrafo. As opposed to SciGraph or SemScholar, which are specific
to the scientific domain, Wikipedia is a source of general-purpose information.
Flickr30K and COCO, as image-sentence matching benchmarks.
The method is evaluated in the task it was trained to solve: determining whether
a figure and a caption correspond. We also compare the performance of the FCC
task against two supervised baselines, training them on a classification task against
the SciGraph taxonomy. For such baselines we first train the vision and language
networks independently and then combine them. The feature extraction parts of both
networks are the same as described in Sect. 11.2. On top of them, a fully connected
layer with 128 neurons and ReLU activation and a softmax layer with as many
neurons as target classes are attached.
The direct combination baseline computes the figure-caption correspondence
through the scalar product between the softmax outputs of both networks. If it
exceeds a threshold, which was heuristically fixed at 0.325, the result is positive.
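In pseudo-numpy terms, the direct combination check reduces to the following sketch (the probability vectors come from the two independently trained classifiers over the same SciGraph categories; names are illustrative):

```python
import numpy as np

def correspond(p_vision: np.ndarray, p_text: np.ndarray,
               threshold: float = 0.325) -> bool:
    # The scalar product between the two softmax outputs measures their agreement.
    return float(np.dot(p_vision, p_text)) > threshold
```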
The supervised pre-training baseline freezes the weights of the feature extraction
trunks from the two trained networks, assembles them in the FCC architecture as
shown in Sect. 11.2, and trains the FCC task on the fully connected layers. While
direct combination provides a notion of the agreement between the two branches,
supervised pre-training is the most similar supervised approach to our method.
Table 11.1 shows the results of the FCC task and the supervised baselines. FCCk
denotes the corpus and word representation used to train the FCC task. Accvgg
shows the accuracy after replacing our visual branch with pre-trained VGG16
features learnt on ImageNet. This provides an estimate of how specific to the
scientific domain scientific figures, and therefore the resulting visual features, are
compared to natural images. As the table shows, the results obtained using
pre-trained visual features are clearly worse in general (only slightly better in
FCC3 ), suggesting that the visual information contained in scientific figures indeed
differs from natural images.
The FCC network was trained on two different scientific corpora: SciGraph
(FCC1−5 ) and SemScholar (FCC6−7 ). Both FCC1 and FCC6 learnt their own word
representations without transfer of any pre-trained knowledge. Even in its most basic
form our approach substantially improves over the supervised baselines, confirming
that the visual and language branches learn from each other and also that figure-
caption correspondence is an effective source of free supervision.
Adding pre-trained knowledge at the input layer of the language subnetwork
provides an additional boost, particularly with lemma and concept embeddings
from Vecsigrafo (FCC5 ). Vecsigrafo clearly outperformed HolE (FCC3 ), which
was also beaten by pre-trained fastText [23] word embeddings (FCC2 ) trained on
SemScholar.
Although the size of Wikipedia is almost triple that of our SemScholar corpus,
training Vecsigrafo on the latter resulted in better FCC accuracy (FCC4 vs. FCC5 ),
suggesting that domain relevance is more significant than sheer volume, in line with
previous findings [62]. Training FCC on SemScholar, much larger than SciGraph,
further improves accuracy, as shown in FCC6 and FCC7 .
We put the FCC task in the context of the more general problem of image-sentence
matching through a bidirectional retrieval task where images are sought given a
text query and vice versa. While Table 11.2 focuses on natural images datasets
(Flickr30K and COCO), Table 11.3 shows results on scientific datasets (SciGraph
and SemScholar) rich in scientific figures and diagrams. The selected baselines
(Embedding network, 2WayNet, VSE++, and DSVE-loc) report results obtained
on the Flickr30K and COCO datasets, also included in Table 11.2. Performance is
measured in recall at k (Rk), with k = {1, 5, 10}. From the baselines, DSVE-loc was
successfully reproduced using the code made available by the authors,4 and trained
on SciGraph and SemScholar.
The FCC task was trained on all the datasets, both in a totally unsupervised way
and with pre-trained semantic embeddings, with focus on the best FCC configura-
tion. As shown in Table 11.1, such configuration leverages semantic enrichment
using Vecsigrafo (indicated with subscript vec). The bidirectional retrieval task
is then run using the resulting text and visual features. Further experimentation
included pre-trained VGG16 visual features extracted from ImageNet (subscript
vgg), with more than 14 million hand-annotated images. Following common
4 https://github.com/technicolor-research/dsve-loc.
practice in image-sentence matching, the splits are 1000 samples for test and the
rest for training.
We can see a marked division between the results obtained on natural images
datasets (Table 11.2) and those focused on scientific figures (Table 11.3). In the
former case, VSE++ and DSVE-loc clearly beat all the other approaches. In contrast,
FCC performs poorly on such datasets although results are ameliorated with pre-
trained visual features (“FCCvgg ” and “FCCvgg-vec ”). Interestingly, the situation
changes radically with the scientific datasets. While the recall of DSVE-loc drops
dramatically in SciGraph, and even more in SemScholar, FCC shows the opposite
behavior in both figure and caption retrieval. Using visual features enriched with
pre-trained semantic embeddings from Vecsigrafo during training of the FCC task
further improves recall in the bidirectional retrieval task.
Unlike in Flickr30K and COCO, replacing the FCC visual features with pre-
trained ones from ImageNet brings us little benefit in SciGraph and even less in
SemScholar, where the combination of FCC and Vecsigrafo (“FCCvec ”) obtains
the best results across the board. This and the extremely poor performance of the
best image-sentence matching baseline (DSVE-loc) in the scientific datasets shows
evidence that dealing with scientific figures is considerably more complex than
natural images. Indeed, the best results in figure-caption correspondence (“FCCvec ”
in SemScholar) are still far from the SoA in image-sentence matching (DSVE-loc
in COCO).
Next, the visual and text features learnt in the FCC task are put to the test in two
different transfer learning settings: (1) classification of scientific figures and cap-
tions according to a given taxonomy and (2) multi-modal machine comprehension
for question answering given a context of text, figures, and images.
11.6 Caption and Figure Classification
We evaluate the language and visual representations emerging from FCC in the
context of two classification tasks that aim to identify the scientific field an arbitrary
text fragment (a caption) or a figure belong to, according to the SciGraph taxonomy.
The latter is a particularly hard task due to the whimsical nature of the figures that
appear in our corpus: figure and diagram layout is arbitrary; charts, e.g. bar and pie
charts, are used to showcase data in any field from health to engineering; figures and
natural images appear indistinctly, etc. Also, note that only the actual figure is used,
not the text fragment where it is mentioned in the paper.
The study picks the text and visual features that produced the best FCC results
with and without pre-trained semantic embeddings (see Table 11.1) and uses the
language and vision subnetworks presented in Sect. 11.2 to train our classifiers on
SciGraph in two different scenarios. First, only the fully connected and softmax
layers are fine-tuned, freezing the text and visual weights (non-trainable in the
table). Second, all the parameters are fine-tuned in both networks (trainable). In both
cases, a baseline using the same networks initialized with random weights without
FCC training is used for comparison. In doing so, the first, non-trainable scenario
seeks to quantify the information contributed by the FCC features, while training
from scratch on the target corpus should provide an upper bound for figure and
caption classification. Additionally, for figure classification, the baseline freezes the
pre-trained VGG16 model. Training uses tenfold cross-validation and Adam. For
the caption classification task, the learning rate is 10⁻³ and the batch size 128. In figure
classification, a learning rate of 10⁻⁴, weight decay of 10⁻⁵, and batch size of 32 are selected.
The results in Table 11.4 show that using FCC features amply beats the
baselines, including the upper bound (training from scratch on SciGraph). The
delta is particularly noticeable in the non-trainable case for both caption and
figure classification and is considerably increased in FCC7, which uses pre-trained
semantic embeddings. This includes both the random and VGG baselines.
We leverage the TQA dataset and the baselines in [93] to evaluate the features
learnt by the FCC task in a multi-modal machine comprehension scenario. We
study how the FCC model, which was not originally trained for this task, performs
against state-of-the-art models specifically trained for diagram question answering
and textual reading comprehension in a very challenging dataset. We also study
how pre-trained semantic embeddings impact the TQA task: first, by enriching
the visual features learnt in the FCC task as shown in Sect. 11.2 and then by using
pre-trained semantic embeddings to enrich word representations in the TQA corpus.
We focus on multiple-choice questions, which represent 73% of all the questions
in the dataset. Table 11.5 shows the performance of the FCC model against the
results reported in [93] for five TQA baselines: random, BiDAF (focused on text
machine comprehension), text only (T QA1 , based on MemoryNet), text+image
(T QA2 , VQA), and text+diagrams (T QA3 , DSDP-NET). The T QA1 and T QA2
architectures were successfully reproduced. The latter was also adapted.5 Then, the
visual features in T QA2 were replaced with those learnt by the FCC visual subnet-
work both in a completely unsupervised way (FCC6 in Table 11.1) and with pre-
trained semantic embeddings (FCC7 ), resulting in T QA4 and T QA5 , respectively.
While T QA1−5 used no pre-trained embeddings at all, T QA6−10 were trained
including pre-trained Vecsigrafo semantic embeddings. Unlike FCC, where concatenation
was used to combine pre-trained lemma and concept embeddings with
the word embeddings learnt by the task, in the case of TQA element-wise addition
proved to work best.
Following the recommendations in [93], the TQA corpus was pre-processed to:
(1) consider knowledge from previous lessons in the textbook in addition to the
lesson of the question at hand and (2) address challenges like long question contexts
with a large lexicon. In both text and diagram MC, applying the Pareto principle
to reduce the maximum token sequence length in the text of each question, their
answers and context improved accuracy considerably. This optimized the amount of
text to consider for each question, improving the signal to noise ratio. Finally, the
most relevant paragraphs for each question were obtained through tf–idf.6 The mod-
els were trained using tenfold cross-validation, Adam, learning rate 10⁻² and batch
size 128. Text MC also used 0.5 dropout and recurrent dropout in the LSTM layers.
Fitting multi-modal sources into a single memory, the use of visual FCC
features clearly outperforms all the TQA baselines in diagram MC. Enhancing
word representation with pre-trained semantic embeddings during training of the
TQA task provides an additional boost that results in the highest accuracies for both
text MC and diagram MC. These are remarkably good results since, according to
the authors of the TQA dataset [93], most diagram questions in it would normally
require a specific rich diagram parse.
5 While VGG19 produces a 7-by-7 grid of 512-D image patch vectors, our visual subnetwork
08_scientific_information_management.ipynb.
11.8 Practice with Figure-Caption Correspondence
1. Training and evaluation of the FCC task, used to jointly learn text and
visual features from scientific bibliography. This includes the use of pre-
trained embeddings from a knowledge graph learnt through the Vecsigrafo
approach.
2. A qualitative analysis of the resulting features so that you can see by yourself
the information captured by them.
3. A comparison between state-of-the-art algorithms used in image-sentence
matching and the FCC task. This will allow you to better understand the
limitations of applying state-of-the-art image-sentence matching techniques to
the scientific domain.
4. Textual and visual classification over the SciGraph taxonomy using figures
and their captions as input.
5. Multi-modal question answering involving text but also diagrams over the
TQA dataset.
In addition to this notebook, all the related code and data used for the experimen-
tation in the paper can be found in GitHub.
The datasets we use for training and evaluation are SciGraph, Semantic Scholar,
and TQA, as described in Sect. 11.3. For practical reasons, in the notebook the size
of the dataset is limited to 400 figures and captions to train the FCC task. As can be
expected, this is not sufficient to train a performant model. So, when necessary we
use the weights resulting from training over the whole corpus.
Next, we clone the repository with the datasets and other materials.
The figures and captions are structured as json files. Let us take a look at them:
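A minimal way to peek at them could look as follows (the file name and field names are assumptions for illustration and may differ from the repository's actual schema):

```python
import json

with open("scigraph_sample.json") as f:
    samples = json.load(f)

# Print the figure path and the first characters of the caption for a few samples.
for sample in samples[:3]:
    print(sample["figure"], "->", sample["caption"][:80], "...")
```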
As introduced in Sect. 11.2, FCC is a binary classification task that receives a figure
and a caption and determines whether they correspond or not. The positive pairs are
actual figures and their captions from a collection of scientific publications. Negative
pairs are extracted from combinations of figures and other captions selected at random.
Fig. 11.2 Sample image obtained from the FCC datasets, in this case SciGraph
To illustrate the training of the FCC task, we focus on the SciGraph corpus. First,
we save in a list all the figures and their captions. For the text part, we keep not
only the tokens but also their associated WordNet synsets resulting from word-sense
disambiguation. This is a necessary step to enrich the text features with the semantic
(lemma and synset) embeddings of each token (Fig. 11.2).
As in previous notebooks, we use Cogito to tokenize the text and WordNet as the
knowledge graph used for word-sense disambiguation. In the original paper [69]
Sensigrafo, Expert System’s proprietary knowledge graph, was used instead.
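As a rough, publicly available approximation of that step (not the notebook's Cogito pipeline, and with different disambiguation quality), NLTK's Lesk algorithm can assign a WordNet synset to each token:

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

caption = "The phase portrait depicts trajectories converging to a stable equilibrium point"
tokens = word_tokenize(caption)
# lesk() returns a WordNet synset, or None when no sense can be assigned.
synsets = [lesk(tokens, token) for token in tokens]
```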
Once we have our three lists with figure paths, caption tokens, and caption
synsets, we transform them into tensors.
First, we create numpy arrays from the figures using the PIL library:
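A sketch of this step, with an illustrative path list standing in for the figure paths collected earlier:

```python
import numpy as np
from PIL import Image

figure_paths = ["figures/fig_001.png"]  # list built in the previous step (illustrative path)

def load_figure(path: str) -> np.ndarray:
    # Resize to the 224x224x3 input expected by the vision subnetwork.
    img = Image.open(path).convert("RGB").resize((224, 224))
    return np.asarray(img, dtype="float32") / 255.0  # scale pixel values to [0, 1]

figures = np.stack([load_figure(path) for path in figure_paths])
```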
Then, we:
1. create two Keras tokenizers, one for caption tokens and another for their synsets,
2. create word indexes for both modalities, and
3. transform the captions into sequence arrays with a length of 1000 tokens each.
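These three steps can be sketched with the Keras preprocessing utilities (the toy captions below are placeholders for the lists built earlier):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 1000
caption_tokens = [["the", "phase", "portrait", "depicts", "trajectories"]]  # toy example
caption_synsets = [["phase.n.01", "portrayal.n.01", "trajectory.n.01"]]     # toy example

token_tokenizer, synset_tokenizer = Tokenizer(), Tokenizer()
token_tokenizer.fit_on_texts(caption_tokens)
synset_tokenizer.fit_on_texts(caption_synsets)

token_seqs = pad_sequences(token_tokenizer.texts_to_sequences(caption_tokens),
                           maxlen=MAX_LEN)
synset_seqs = pad_sequences(synset_tokenizer.texts_to_sequences(caption_synsets),
                            maxlen=MAX_LEN)
```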
As described in Eq. (11.1), for each word wk in the vocabulary, the FCC network
learns an embedding wk that can be combined with pre-trained word (w̄k), lemma
(lk), and concept (ck) embeddings to produce a single vector tk. If no pre-trained
knowledge is transferred from an external source, then tk = wk. The options shown
in Eq. (11.1) include:
(a) Only word embeddings learnt by the network upon training over the corpus.
(b) Learnt and pre-trained word embeddings.
(c) Learnt word embeddings and pre-trained semantic (lemma and concept) embed-
dings.
For conciseness, here we focus on option (c), where word, lemma, and concept
embeddings are concatenated into a 300-dimensional vector (100 dimensions each for
words, lemmas, and concepts). All the .tsv files of the embeddings are also
available in GitHub.
First we extract the embeddings from the .tsv and put them in two dictionaries
(embeddings_index_tokens and embeddings_index_synsets):
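A sketch of this extraction, assuming one term per line followed by its tab-separated vector components (the file names are illustrative):

```python
import numpy as np

def load_tsv_embeddings(path: str) -> dict:
    index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            index[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return index

embeddings_index_tokens = load_tsv_embeddings("lemma_embeddings.tsv")
embeddings_index_synsets = load_tsv_embeddings("synset_embeddings.tsv")
```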
Now we take every word and its disambiguated synset and fetch their pre-trained
embeddings from the dictionaries to build the embeddings matrices for the model.
In case of out-of-vocabulary words we use an array of zeros for that token during
the training of the FCC task.
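The matrix construction can be sketched as follows (word_index comes from the Keras tokenizers created above; the dimension follows the 100-D semantic embeddings):

```python
import numpy as np

def build_embedding_matrix(word_index: dict, embeddings_index: dict,
                           dim: int = 100) -> np.ndarray:
    # Row 0 is reserved by the Keras tokenizer; out-of-vocabulary rows stay at zero.
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = vector[:dim]
    return matrix
```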
As shown in Sect. 11.2, we use a two-branch neural architecture with three main
parts: the vision and language subnetworks, respectively, extracting visual and
text features, and a fusion subnetwork to evaluate figure-caption correspondence
(Fig. 11.3).
The vision subnetwork follows a VGG-style [170] design, with 3 × 3 convolu-
tional filters, 2 × 2 max-pooling layers with stride 2 and no padding. It contains four
blocks of conv+conv+pool layers, where inside each block the two convolutional
layers have the same number of filters, while consecutive blocks have doubling
number of filters (64, 128, 256, 512). The input layer receives 224 × 224 × 3
images. The final layer produces a 512-D vector after 28 × 28 max-pooling. Each
convolutional layer is followed by batch normalization [83] and ReLU layers.
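A simplified Keras sketch of this vision subnetwork (not the repository's exact code; the final pooling is collapsed into a global max-pooling for brevity):

```python
from tensorflow.keras import Input, Model, layers

def conv_block(x, filters):
    # Two 3x3 convolutions with the same number of filters, each followed by
    # batch normalization and ReLU, then a 2x2 max-pooling with stride 2.
    for _ in range(2):
        x = layers.Conv2D(filters, (3, 3), padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    return layers.MaxPooling2D((2, 2), strides=2)(x)

image = Input(shape=(224, 224, 3))
x = image
for filters in (64, 128, 256, 512):  # doubling number of filters per block
    x = conv_block(x, filters)
visual_features = layers.GlobalMaxPooling2D()(x)  # 512-D feature vector
vision_subnetwork = Model(image, visual_features)
```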
FCC is evaluated in the task it was trained to solve: determining whether a figure
and a caption correspond. Next, we define some hyperparameters:
The next generator introduces 32 inputs for every batch, balanced in terms of
positive and negative cases within the batches:
• 16 positive cases with a figure and its correct caption.
• 16 negative cases with a randomly selected caption for the same figure.
For each word in the caption, three text sub-vectors are considered: token, lemma,
and synset.
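A possible implementation of such a balanced generator (a sketch; the array names refer to the tensors built earlier):

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

def fcc_batch_generator(figures, token_seqs, synset_seqs, batch_size=32):
    n = len(figures)
    half = batch_size // 2
    while True:
        idx = np.random.choice(n, half, replace=False)
        # Negative captions: shift each index by a random non-zero offset so the
        # selected caption never belongs to the paired figure.
        neg_idx = (idx + np.random.randint(1, n, size=half)) % n
        figs = np.concatenate([figures[idx], figures[idx]])
        toks = np.concatenate([token_seqs[idx], token_seqs[neg_idx]])
        syns = np.concatenate([synset_seqs[idx], synset_seqs[neg_idx]])
        labels = to_categorical(np.concatenate([np.ones(half), np.zeros(half)]), 2)
        yield [figs, toks, syns], labels
```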
Fig. 11.3 Diagram of the complete basic architecture of the FCC model
Before we train the model, we randomly select the train and test indexes. We
choose a 90% train size and a 10% test size of the whole dataset.
Finally, we train the FCC model. With the corpus and hyperparameters proposed
in the notebook, training should take around 2 min. Given the very small subset of
data we have selected as input, the results will very likely be rather far from those
reported in the paper.
Due to the limited size of the training set used in this notebook, the results (around
50% accuracy) are merely testimonial.
Once you have trained the FCC task, you will have three files with the resulting
text and visual features, along with the weights of the model. Such features can be
used for the transfer learning tasks proposed in the notebook.
However, since they were learnt from a very small amount of data, their usefulness
in transfer learning tasks is limited. To overcome this, in the remainder of the
notebook we use FCC features that were previously learnt from a larger SciGraph
selection (82K figures and captions). You are invited to experiment with different
alternatives and check the results.
Now, we inspect the features learnt by the FCC task to gain a deeper understanding
of the syntactic and semantic patterns captured for figure and caption representation
(Figs. 11.4, 11.5, 11.6, 11.7, 11.8, and 11.9).
Vision features
The analysis was carried out on an unconstrained variety of charts, diagrams, and
natural images from SciGraph, without filtering by figure type or scientific field. To
obtain a representative sample of what the FCC network learns, we focus on the 512-
D vector resulting from the last convolutional block before the fusion subnetwork.
Let us pick the features with the most significant activation over the whole dataset,
select the figures that activate them most, and show their heatmaps. We prioritize
features whose maximum activation is high relative to their average activation:
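In sketch form, and assuming the last-block activations have been collected into an (n_figures, 512) array, the selection reduces to:

```python
import numpy as np

activations = np.random.rand(100, 512)  # placeholder; use the real activations here

# Ratio between maximum and average activation per feature.
ratio = activations.max(axis=0) / (activations.mean(axis=0) + 1e-8)
top_feature = int(np.argmax(ratio))
top_figure = int(np.argmax(activations[:, top_feature]))
print(f"Feature {top_feature} is most activated by figure {top_figure}")
```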
Fig. 11.5 Sample image with activation. Lighter means higher activation, focusing on the arrows
in the lower half of the image
Text features
Similar to the visual case, we select the feature from the last block of the language
subnetwork with the highest activation and show its heatmap.
The following parameters control the number of features, and the number of captions
per feature, that you want to analyze.
Fig. 11.7 Sample figure with activation. Activation is stronger on the curves and the characters
identifying each plot. Also, strong activation appears on the right edge of the image
Fig. 11.8 Attention on each word of a sample caption (darker means higher activation): “The
Aliev–Panfilov model with α = 0.01, γ = 0.002, b = 0.15, c = 8, μ₁ = 0.2, μ₂ = 0.3. The phase
portrait depicts trajectories for distinct initial values ϕ₀ and r₀ (filled circles) converging to a
stable equilibrium point (top). Non-oscillatory normalized time plot of the non-dimensional action
potential ϕ and the recovery variable r (bottom)”
Fig. 11.9 Attention on each word of a sample caption (darker means higher activation): “Relative
protein levels of ubiquitin-protein conjugates in M. quadriceps femoris of rats fed either a control
diet (0 mg carnitine/kg diet; Control) or a diet supplemented with 1250 mg carnitine/kg diet
(Carnitine). (a) A representative immunoblot specific to ubiquitin is shown for three animals
per group; immunoblots for the other animals revealed similar results. Reversible staining of
nitrocellulose membranes with Ponceau S revealed equal loading of protein. (b) Bars represent
data from densitometric analysis and represent means ± SD (n = 6/group); bars are expressed
relative to the protein level of the control group (=1.00). * indicates a significant difference to the
control group (P < 0.05)”
Next, we illustrate how to run the bidirectional retrieval (caption -> figure and figure
-> caption) tasks against a small SciGraph sample (40 figures and captions). Of
course, the results will differ from those in the previous tables in Sect. 11.5. Actually,
as you will see, they are much better because the search space is much smaller.
From the baselines, we successfully reproduced DSVE-loc, using the code made
available by the authors, and trained it on SciGraph. Since downloading and
installing the materials to run the DSVE-loc testing takes approximately 15 min,
we will skip this step but you can run it on your own.
First, we clone the original DSVE-loc repository and the necessary materials
indicated there:
Next, we download a DSVE-loc model pre-trained offline over the overall 82K
SciGraph samples and a random selection of 40 new SciGraph samples in DSVE-
loc-compatible format for evaluation:
Now, we run the bidirectional retrieval task. The results are two arrays with
values of recall @ 1, 5, and 10. The first array contains figure retrieval (given
a caption, obtain the corresponding figure) results, while the second contains the
caption retrieval results.
3. Compare it with the scores obtained by the rest of the captions (or figures, if the
task is figure retrieval): if fewer than 10 captions/figures have a better score than
the correct one, we count one more recall@10 point. Similarly for recall@5
(fewer than 5) and recall@1 (none).
The final count is divided by the number of samples in the test split, producing
the final recall values.
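The counting described above can be sketched as follows, assuming a square score matrix in which the correct match for query i is candidate i:

```python
import numpy as np

def recall_at_k(scores: np.ndarray, ks=(1, 5, 10)) -> dict:
    n = scores.shape[0]
    correct = scores[np.arange(n), np.arange(n)]
    # Number of candidates scoring strictly better than the correct one per query.
    better = (scores > correct[:, None]).sum(axis=1)
    # The correct match is in the top k whenever fewer than k candidates beat it.
    return {k: float((better < k).mean()) for k in ks}
```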
We take the pre-trained text and visual features resulting from training the FCC
task and use the architecture of the language and vision subnetworks to train our
classifiers against the SciGraph taxonomy. Note that here we fine-tune the whole
model on a subset of 400 samples from SciGraph, hence the results will differ from
those reported in Sect. 11.6.
The first step is to obtain the categories of each figure and caption from the
dataset file:
For this task, we train the model for five epochs with a batch size of 128:
Then we take the model with the weights already loaded (modelCaptions)
and add two fully connected layers to classify the inputs into the five different
categories that we previously selected (health, engineering, math, biology, and
computer science).
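A sketch of that classification head on top of the pre-loaded language subnetwork (modelCaptions is the model mentioned above; layer sizes other than the 5-way output are assumptions):

```python
from tensorflow.keras import Model, layers

caption_features = modelCaptions.output  # text features from the FCC language branch
x = layers.Dense(128, activation="relu")(caption_features)
predictions = layers.Dense(5, activation="softmax")(x)  # the five SciGraph categories

caption_classifier = Model(modelCaptions.input, predictions)
caption_classifier.compile(optimizer="adam",
                           loss="categorical_crossentropy",
                           metrics=["accuracy"])
```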
Finally, we train the model with the same train and test split of the dataset used for
the FCC task. This will take approximately 10 s.
In this case, we train the model for six epochs with a batch size of 32:
Then we load the weights from the FCC task and we add two fully connected
layers to classify the inputs into the five different categories.
Finally, we train the model with the same train and test split of the dataset used for
the FCC task. It may take around 45 s.
We leverage the TQA dataset8 developed by Kembhavi et al. to evaluate the textual
and visual features learnt by the FCC task in a multi-modal machine comprehension
scenario involving multiple-choice question answering over text and diagrams.
Next, we go through the different steps to train and evaluate a TQA model using
FCC features and semantic embeddings (TQA10).
In this notebook, we focus on a subset of 401 diagram questions from the TQA
corpus. First we save in a list all the images and text (tokens and synsets) of this
subset. The TQA dataset comprises six types of textual information (paragraph,
question, answer A, answer B, answer C, and answer D) and two types of visual
information (paragraph image and question image). Also, we save in a list the
correct answer of each question as a one-hot vector (Fig. 11.10).
8 http://vuchallenge.org/tqa.html.
Let us take a look at the corpus. You can change the index variable to retrieve
different samples as you like.
Next, we transform these lists into tensors. For images, we transform them into
numpy arrays and then extract their features with our pre-trained model:
Next, we transform the captions into sequence arrays with a different length for
each one. We apply the Pareto principle to select the best possible maximum length,
covering 80% of the tokens in the dataset for each type of text information.
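One simple reading of this criterion is to take, for each type of textual input, the 80th percentile of its sequence lengths as the maximum length (a sketch, not necessarily the notebook's exact computation; the example variable is hypothetical):

```python
import numpy as np

def pareto_max_length(sequences, coverage: float = 0.8) -> int:
    lengths = np.array([len(seq) for seq in sequences])
    return int(np.percentile(lengths, coverage * 100))

# Example: max_len_questions = pareto_max_length(question_token_lists)
```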
We adapt the text+image baseline, replacing the visual features with those learnt by
the FCC visual subnetwork and including pre-trained Vecsigrafo semantic embed-
dings, as we did in the previous experiments. In this case, we use element-wise
addition to combine pre-trained lemma and concept embeddings with the word
embeddings learnt by the network.
First, we load the embedding matrices as we did for the FCC task:
Since our TQA model is rather large,9 we use our lre_aux library (also available
in our GitHub) to load it. To compile the model we use Adam optimization with
learning rate 0.01. No dropout is used in this experiment.
In this case, we train our model for five epochs, with a batch size of 128 multiple-choice
questions:
9 The exact number of parameters in our TQA model is as follows. Total parameters: 1,287,668,
Finally, we train the model and we are ready to answer the TQA questions!
11.9 Conclusion
Semantic enrichment using knowledge graphs provided the largest performance boost,
with results generally beyond the state of the art.
Down the road of hybrid methods for natural language processing (and computer
vision, as shown herein), it will be interesting to further the study of the interplay
between the semantic concepts explicitly represented in different knowledge graphs,
contextualized embeddings, e.g. from SciBERT [19], and the text and visual features
learnt in the FCC task. Additionally, charting the knowledge captured in such
features is still a pending task at the moment of writing these lines.
Chapter 12
Looking into the Future of Natural
Language Processing
Abstract It has been a long journey, from theory to methods to code. We hope the
book took you painlessly through experiments and real NLP exercises, as much
as this is possible, and that you enjoyed it as much as we did. Now, it is time
to wrap up. Here, we provide guidelines for future directions in hybrid natural
language processing and share our final remarks, additional thoughts, and vision.
As a bonus, we also include the personal view of a selection of experts on topics
related to hybrid natural language processing. We asked them to comment on their
vision, foreseeable barriers to achieve such vision, and ways to navigate towards
it, including opportunities and challenges in promising research fields and areas
of industrial application. Now it is up to you. Hopefully this book gave you the
necessary tools to build powerful NLP systems. Use them!
1 For a collection of examples using the GPT-2 language model, see: https://thegradient.pub/gpt2-
and-the-nature-of-intelligence/.
12.1 Final Remarks, Thoughts and Vision
Neural NLP systems have recently managed to pass the Grade 8 New York Regents
Science Exams, scoring over 90% on non-diagram, multiple-choice questions, where
3 years before the best systems scored less than 60%.2 However, this still falls short
with certain types of questions involving discrete reasoning, mathematical knowledge,
or machine reading comprehension.
We believe that the combination of neural and knowledge-based approaches will
be key for the next leap forward in NLP, addressing such challenges through reliably
represented meaning that supports reasoning. Knowledge graphs have a key role to
play in this regard. They already are a valuable asset to structure information in
expressive, machine-actionable ways in many organizations public and corporate.
Plus, the increasing adoption of knowledge graph embeddings is having a noticeable
impact in the development of new and more effective ways to create, curate, query,
and align knowledge graphs (Chap. 9), accelerating take up. Similar to the impact
word embeddings (Chaps. 2 and 4) had on capturing information from text, the
ability to represent knowledge graphs in a dense vector space (Chap. 5) has paved
the ground to exploit structured knowledge in neural NLP architectures.
Under this light, consuming structured knowledge graphs in neural NLP architec-
tures emerges as a feasible, possibly even straightforward approach in the mid-term,
with tangible benefits. As shown throughout the book, approaches like HolE and
Vecsigrafo (Chap. 6) prove that the information explicitly represented in a knowl-
edge graph can be used to obtain improved accuracies across different NLP tasks.
Vecsigrafo produces semantic embeddings that, unlike previous knowledge graph
embedding algorithms, not only learn from the graph but also from text corpora,
increasing domain coverage beyond the actual scope of the graph. Further, based on
different strategies (Chap. 8) to process the training set, co-training word, lemma,
and concept embeddings improves the quality of the resulting representations
beyond what can be learnt individually by state-of-the-art word and knowledge
graph embedding algorithms.
Not unlike language models, Vecsigrafo provides embeddings that are linked
to the disambiguated sense of each word and are therefore contextual. Beyond
Vecsigrafo, approaches like LMMS and extensions like Transigrafo (also discussed
in Chap. 6) suggest the advantages of combining language models and existing
knowledge graphs. Such approaches have shown clear practical advantages derived
from the integration of language models and knowledge graphs in NLP tasks like
word-sense disambiguation, including higher accuracy and reduced training time
and hardware infrastructure.
We find particularly interesting how the combined use of language models and
knowledge graphs opens the door to NLP architectures where the linguistic and
knowledge representation aspects can be completely decoupled from each other,
hence potentially allowing for parallel development by independent, possibly unre-
lated teams. While the language model captures the essence of human language and
how sentences are constructed, the knowledge graph contains a human-engineered
conceptualization of the entities and relations in the target domain. Language model
2 https://www.nytimes.com/2019/09/04/technology/artificial-intelligence-aristo-passed-test.html.
and knowledge graph can therefore be seen as the main modules that machines
need in order to understand text in a way that resembles human understanding.
The resulting models can in turn be extended with other modules dealing with, e.g.
mathematical or symbolic reasoning. We are very excited about the breakthroughs
that this direction of research may bring about in the near future of AI and NLP. We
foresee a near future where fine-tuning language models using knowledge graphs is
possible, infusing a particular vision and semantics over a domain, as captured by
the knowledge graph, in the representations of the language model.
For convenience, most of the knowledge graphs we have used in this book
are lexico-semantic databases like WordNet or Sensigrafo. However, applying the
approaches described herein to other types of graphs like DBpedia or Wikidata, as
well as domain-specific or proprietary graphs, is equally feasible and we are looking
forward to seeing the results this may entail for different NLP tasks and domains. We
are also eager to see future applications based on corporate graphs and to measure
their impact on actual NLP tasks.
Our research has demonstrated the benefits of hybrid NLP systems in areas
like disinformation analysis (Chap. 10) and multi-modal machine reading compre-
hension for question answering (Chap. 11), where hybrid approaches have proved
to outperform the previous state of the art. Means to evaluate the quality of the
resulting representations have also been provided in the book (Chap. 7). Probably
more important than performance increase is how the use of structured knowledge
can contribute to increase the expressivity of the model. And the other way around:
make use of the concepts and relations in a knowledge graph to provide an
explicitly represented grounding of the meaning captured in the model. Creating
such grounding in ways that can be exploited, e.g. for an enhanced ability to interpret
the model, will deserve specific research in the coming years.
We expect further impact in areas like result explainability and bias detection,
where knowledge graphs shall contribute decisively to interpret the outcomes of
NLP models. Capturing commonsense and reasoning with it is another key area
in language understanding. In both cases, we have only started to glimpse the
challenges that lie ahead. However, based on commonsense knowledge graphs like
ConceptNet and ATOMIC, neural models can acquire commonsense capabilities
and reason about previously unseen events. On the other hand, the successful
adoption of this type of resource may also transform the way in which knowledge
graphs are represented; for example, ATOMIC represents commonsense
statements as text sentences and not in some logical form, as most knowledge
graphs typically do nowadays.
We are living in exciting times that span across different branches of AI, resulting
in an increasing number of opportunities for discussion and research in current and
future challenges in NLP and many other areas. We hope this book has helped you
acquire the necessary knowledge to combine neural networks and knowledge graphs
to successfully solve more and more challenging NLP problems.
12.2 What Is Next? Feedback from the Community
Next, we share the feedback obtained from experts about the future of hybrid
NLP systems involving the combination of data-driven neural architectures and
structured knowledge graphs. Let us see, in lexicographic order, what they have to
say:
Agnieszka Lawrynowicz (Poznan University of Technology): I strongly sup-
port the ideas and solutions presented in the book. Currently, the most popular
methods in NLP work on multi-gigabyte raw data and therefore have too much
input, which can sometimes be inconsistent, contradictory or incorrect, while
knowledge graphs have a very tight representation of the most important facts,
which is easier to curate, and therefore a much smaller chance that the knowledge
modeled there is incorrect. I think that joint embeddings of text and knowledge
graphs is the right way to go.
One of the barriers to realize the scenario envisioned in this book may be
human resources due to traditionally separate communities working on symbolic
approaches and statistical ones that have grown largely independently, developing
their own methodologies, tools, etc. However, this is changing lately with the
growing adoption of statistical approaches including deep learning. Another barrier
is that neural approaches require large computational resources even in a hybrid
setting. Also, much of the work has been done on word embeddings using only
one context of the word, and there is a challenge in handling multi-sense
words and polysemous embeddings. Interestingly, this challenge is also present in open
domain knowledge graphs and ontologies, where some concepts may have slightly
different meanings depending on the domain; one way to solve this is micro-theories,
which tackle only sub-domains.
example, in cases when we have a domain knowledge graph and we want to use
unstructured data from the open domain. Then, wanting to link a resource from the
knowledge graph to some mention in the text, one needs to deal with ambiguity.
I see some promising directions in few-shot learning and transfer learning, i.e.
in learning from a small amount of training examples and trying to reuse already
acquired prior knowledge. Another trend is explainable AI (XAI). XAI approaches
are especially important for machine learning from unstructured data such as
text where black-box models are produced and when numerical, distributional
representations are used. An interesting next step may also be to employ natural
language explanations to support both more effective machine learning processes
and generating understandable explanations. I also think that combining knowledge
graphs and neural-based natural language processing could be better exploited in
the future in natural language inference.
Frank van Harmelen (Vrije Universiteit Amsterdam): The integration of
symbolic and neural methods is increasingly seen as crucial for the next generation
of breakthroughs in AI research and applications. This book is therefore very
important and timely, and its hands-on approach in the form of a “practical guide”
will make it all the more accessible and useful for researchers and practitioners
alike.
learning and transfer learning. Dealing with a large number of languages, dialects,
tasks, and domains, we are bound to be creative and come up with solutions that
exploit disparate and complementary data and knowledge sources, and I believe that
only through such hybrid approaches we can advance our current methods across
a wide spectrum of language varieties and domains. This book will serve as an
extremely useful and very pragmatic entry point for anyone interested in tapping
into the mysteries of hybrid NLP.
Jacob Eisenstein and Yuval Pinter (Google Research and Georgia Institute
of Technology): Recent work has shown that knowledge can be acquired from raw
text by training neural networks to “fill in the blanks.” A now classic example is
BERT, which is a deep self-attention network that is trained to reconstruct texts in
which 15% of tokens are masked. Remarkably, BERT and related models are able
to learn many things about language and the world, such as:
• Subjects and verbs must agree in English [66].
• Doughnuts are something you can eat.
• Dante was born in Florence [141].
• Tim Cook is the CEO of Apple.
Each of these facts has its own epistemological status: subject-verb agreement
is a constant in standard English, but there are dialects with different syntactic
constraints [71]; it is possible (though sad) to imagine a world in which doughnuts
are used for something other than food; there is one particular “Dante” who was born in
one particular “Florence,” but the statement is incorrect if referring to Dante Smith
or Florence, New Jersey; Tim Cook is the CEO of Apple in 2020, but this was
not true in the past and will cease to be true at some point in the future. Because
neural language models are unaware of these distinctions, they are overconfident,
insufficiently adaptable to context, and unable to reason about counterfactuals or
historical situations—all potentially critical impediments to sensitive applications.
Future work in natural language processing should therefore develop hybrid
representations that enable such metacognitive capabilities. The integration of natu-
ral language processing with knowledge graphs seems like a promising approach
towards this goal, but there are several challenges towards realizing this vision.
From a technical perspective, the arbitrary and dynamic nature of graph structures
makes them difficult to integrate with state-of-the-art neural methods for natural
language processing, which emphasize high throughput computations on static
representations. From a modeling perspective, new approaches are needed to incor-
porate network-level generalizations such as “verbs tend to be derived from more
abstract nouns,” which have so far only been integrated through hand-engineered
features [142]. Most significantly, this research agenda requires benchmarks that
accurately measure progress towards systems that not only know what facts are
asserted in a collection of texts, but understand how those facts relate to each other
and to the external world.
Núria Bel (Universitat Pompeu Fabra): Natural language processing depends
on what has been called language resources. Language resources are data of the
language that is addressed and data selected for the domain the system is going
to address. NLP engines are generic, but they require information to process the
particular words and sentences of particular languages. This linguistic information
has been provided in the form of dictionaries and rule-based components. These
resources are structured knowledge representations. They are meant to feed engines
with representations that already capture abstractions to generalize over any com-
prehensive sample of a language.
The most striking fact in the recent success of deep neural approaches is that they
mostly work with language resources that are just raw text. They seem not to require
an intermediate representation that generalizes over the data. However, to achieve
the awesome accuracy that leaderboards of most NLP benchmarks consistently
show, they depend on very large amounts of raw texts. For instance, BERT has been
trained with three billion words, GPT-2 has been trained on 40GB of Internet text
(eight million web pages), and the latest neural conversational model to the moment
of writing this paragraph has been trained with 341GB of text (40B words).
This dependency on very large amounts of data can make it very difficult to
apply these deep learning engines to particular scenarios. Current deep systems are
not beating statistical or knowledge-based systems when only small datasets are
available, which is the case for many specific domains and languages. Therefore, the
hybridization of deep learning and knowledge-based approaches is a necessary step
to exploit the enormous success of deep learning for a large variety of applications
for which it would not be possible to gather enough data otherwise. Knowledge-
based techniques provide the generalization capability required to alleviate data
dependency in deep learning methods.
Oscar Corcho (Universidad Politécnica de Madrid): The world of natural lan-
guage processing has been rapidly changing in recent years, offering an enormous
set of possibilities to industry, governments, and academia to deal with the large
corpora of texts that they have all collectively accumulated or that are available for
free on the Web and at specialized document repositories. An important part of the
most recent advances is related to the usage of neural-based approaches, which have
demonstrated their capacity to perform better than more traditional approaches, and
most of the work of research groups and companies that are trying to make research
and apply NLP for a range of problems is moving towards that direction.
This is the case of our own work to fight corruption together with associations
of journalists and governmental agencies, where we are focusing on named entity
recognition over all sorts of documents, to better understand the roles that entities
play in documents such as contracts, so as to apply automatic anonymization
techniques for documents, as well as on our work focused on organizing large
multi-lingual corpora of texts (e.g., patent databases, scientific literature, project
proposals), detecting similarity among documents based on topics that are shared
among them. In all these cases, a smart combination of traditional and neural-based
approaches, together with the use of existing knowledge graphs, is key to obtain the
best results.
Paul Groth (Universiteit van Amsterdam): The combination of neural-based
deep learning techniques and massive knowledge graphs has become a central
component of modern systems that need to use and understand text. Knowledge
thus limiting the applicability and generalization ability of deep learning approaches
which typically are extremely data-hungry.
If we want machines to generalize faster, be adaptable to new domains, avoid
making mistakes that seem obvious to a human expert, as well as explain and justify
their decisions at a rational level that a human can understand, the incorporation
of domain knowledge is fundamental. Along these lines, the book by Jose Manuel
Gomez-Perez and colleagues is addressing a very timely and important challenge,
the one of how to design hybrid systems that benefit from the robustness and pattern
learning ability of learning approaches combined with the ability to represent and
model domain knowledge in the form of knowledge graphs. The ability to combine
these methods will ultimately decide about the applicability of artificial intelligence
systems in real-world tasks in which machines are required to become competent
partners supporting decision-making rather than pre-rational machines solving tasks
they do not understand and the solutions to which they cannot really explain because
they lack a corresponding conceptualization.
Richard Benjamins (Telefonica Data and AI Ambassador): This book is an
excellent example of the type of research we need more of: the combination of data-
driven AI with knowledge-driven AI to advance the state of the art. Current success
of AI is mostly due to progress in data-driven AI (deep learning), but there is a limit
to what this technology can achieve and new breakthroughs are required to make
the next step towards artificial general intelligence. Such breakthroughs could come
from a hybrid approach where both technologies mutually bootstrap each other.
Rudi Studer (Karlsruhe Institute of Technology): The tremendous boost of
AI in recent years is to a large extent based on the success of machine learning,
especially deep learning, in various application scenarios, like, e.g. classification
tasks or machine translation, where a huge amount of training data is available.
As a result, many people equate AI with (sub-symbolic) machine learning, thus
ignoring the field of symbolic knowledge modeling, nowadays most prominently
represented by the notion of knowledge graphs. Vice versa, people from the
knowledge modeling area are often not familiar with the recent developments in
the field of (sub-symbolic) machine learning.
Therefore, the book at hand is an urgently needed book to provide a bridge
between these still partially separate communities and to show the benefits of
integrating symbolic and sub-symbolic methods, here illustrated in the area of NLP.
Furthermore, being a more practically oriented text book, these integration aspects
are made accessible for practitioners and thus for a broad audience. I also appreciate
that approaches that still need a lot of progress, i.e. multi-linguality and multi-
modality, are being discussed as well.
I am absolutely convinced that hybrid approaches are very promising and are
needed not only in the area of NLP but also in other areas where pure sub-symbolic
machine learning approaches lack explainability, when you think, e.g., of medical
applications, or where pure KG approaches lack methods to make them more
complete or more up-to-date by machine learning.
Vinay Chaudhri (Stanford University): An important goal for knowledge
acquisition is to create explicit knowledge representations of a domain that matches
human understanding and enables precise reasoning with it. While in some domains
such as search, recommendation, translation, etc., human understanding and preci-
sion are not hard requirements, there are numerous domains where these require-
ments are essential. Some examples include knowledge of law for income tax
calculations, knowledge of a subject domain for teaching it to a student, knowledge
of a contract so that a computer can automatically execute it, etc. The creation
of an explicit representation of knowledge that is both human understandable and
enables accurate reasoning remains a primarily manual exercise. That is because
the current generation of natural language processing methods achieve scale by
either sacrificing precision or by sacrificing human understanding. As a result, they
work well for those problems where an explicit human understandable knowledge
representation is not required and/or there is no hard precision requirement on the
result of reasoning. There is a need for research that can leverage the automation
and scaling properties of natural language processing and yet does not sacrifice the
goals of precision and explicit human understandable knowledge representation.
We need to conduct experiments in multiple domains with the goal of producing
explicit human understandable knowledge representations that support precise
reasoning and yet leverage automation. There are situations where the output of natural
language processing is so noisy that the cost of correcting it is greater than
the cost of manual knowledge engineering. In other cases, the desired representation
is not externalized in the natural language text, and no amount of natural language
processing will yield the knowledge representation that one is seeking. We need
to better characterize the tasks and situations where the automation offered by natural
language processing leads to a net reduction in the cost of creating the representation.
We need benchmarks that can give us quantitative data on tasks such as extracting
key terms, extracting relations between key terms, and assembling those relations
into a global whole. As the end goal is for the representation
to be human understandable, we need to develop new methods for gathering human
input. Such methods may involve dedicated domain experts, crowd workers, or
automatic quality checking algorithms.
A major barrier to achieving this vision is the narrative in the research
community that knowledge engineering does not scale while natural
language processing does. Such claims are based on a false characterization of the
tasks addressed by natural language processing and fail to consider that some of
the most widely used resources, such as WordNet, were manually engineered.
The success of web-scale methods is crucially dependent on human input in
the form of hyperlinks, click data, or explicit user feedback. On the technological
front, there are three major challenges: advancing semantic parsing technology
to produce better quality output from natural language processing; developing high
quality knowledge engineered resources that can serve as training data for NLP;
and developing integrated development environments that incorporate automation,
human validation, and knowledge integration capabilities.