NLP Project
Summary of 7 Papers
The relationship between NLP and content-based information management has not been as
symbiotic as the relationship between NLP and MT. The content-based manipulation operations we
refer to include indexing and retrieval, categorization, classification, filtering, and so on. If we
broadly define information retrieval as the retrieval of textual information based on its
content, then we see that NLP tools and techniques have had little impact on the
current generation of information retrieval systems. Operational IR systems are predominantly
based on statistical measures of overlap between documents and queries, counting the
number of words or index terms the two have in common as part of some similarity
measure. The kind of NLP that has been developed for applications like MT has, until recently, had
little influence on information retrieval. The present book goes some way towards highlighting
the fact that NLP can and does have a greater role in information retrieval than many believe,
and many of the chapters report successful uses of NLP tools and techniques for IR applications.
Early Experiments:
Over the last 10 years, this author and his research group have tried a number of different ways
of applying NLP tools and techniques to information retrieval tasks, with varying degrees of
success. During the mid-1980s we developed and experimented with techniques for
parsing users' natural language queries; from the resulting parse trees we identified word-pair
and word-triple dependencies between query terms, which were then used as part of a
term-weighting approach to retrieval.
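As a rough sketch of that idea (not the original mid-1980s system; spaCy's dependency parser is assumed here as a modern stand-in), one can parse a query and collect head-dependent word pairs, which could then be weighted alongside single index terms:

```python
# Sketch: extract word-pair dependencies from a parsed query. spaCy and the
# "en_core_web_sm" pipeline are assumptions, not the original system.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with a parser

def dependency_pairs(query: str):
    """Return (head, dependent) lemma pairs for content words in the query."""
    doc = nlp(query)
    return [(tok.head.lemma_, tok.lemma_)
            for tok in doc
            if tok.head is not tok            # skip the root of the parse
            and not tok.is_stop and not tok.head.is_stop]

pairs = dependency_pairs("information retrieval systems for large text files")
print(pairs)  # these pairs could be weighted like ordinary index terms
```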
2. Med7: a transferable clinical natural language processing model for electronic health records
Materials and Methods
1. Data:
The annotated data set was sourced from the MIMIC-III (Medical Information Mart for Intensive
Care-III) electronic health records database [9] as part of Track 2 of the 2018 National
NLP Clinical Challenges (n2c2) Shared Task on extraction of drug-related concepts, including
adverse drug events (ADE) and reasons for prescription [31]. The data set comprised a
collection of discharge letters from the Intensive Care Unit (ICU) and contained very rich and
detailed information about medications used for treatment. The data set was randomly split
by the organizers into training and test sets of 303 and 202 documents, respectively.
2. Methods:
Text pre-processing: In order to compare the performance of the developed medication extraction
model on MIMIC-III (n2c2 2018) and UK-CRIS data, basic text cleaning and pre-processing steps
were taken to standardize texts. UK-CRIS notes that had been uploaded as scanned documents and
transformed into electronic texts via an optical character recognition (OCR) process were cleaned of
artefacts such as email addresses, non-ASCII characters, website URLs, and HTML or XML tags.
Additionally, standard escape sequences ('\t', '\n' and '\r') were removed and the offsets of
gold-annotated entities were adjusted accordingly.
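A minimal sketch of this kind of cleaning (the regular expressions are my assumptions, not the authors' exact rules); note that a real pipeline must also adjust gold-annotation offsets whenever a substitution changes the text length:

```python
# Illustrative cleaning steps matching the artefact types named above.
import re

def clean(text: str) -> str:
    text = re.sub(r"\S+@\S+", " ", text)                # email addresses
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # website URLs
    text = re.sub(r"<[^>]+>", " ", text)                # HTML/XML tags
    text = re.sub(r"[^\x00-\x7F]", " ", text)           # non-ASCII characters
    text = re.sub(r"[\t\n\r]", " ", text)               # escape sequences
    return text

print(clean("Contact me@x.org\tsee <b>https://example.com</b>\n"))
```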
Self-supervised learning: The main obstacle to developing an accurate information extraction model
is the lack of a sufficient amount of high-quality annotated data on which to train it. In contrast to
the large, publicly available, manually annotated data sets for computer vision [5, 6] and for various
natural language processing downstream tasks [32, 33, 34], manually annotated texts for clinical
concept extraction are quite rare [31]. The shortage of annotated clinical data is mainly due to
privacy concerns and the potential identification of personal medical information of patients.
Named entity recognition model: The task of locating concepts of interest in unstructured text and
classifying them into predefined categories (for example, drug names, dosages or frequency of
administration) is a sub-task of information extraction called named-entity recognition (NER).
There are various implementations of NER systems, ranging from rule-based string-matching
approaches [10] to complex Transformer models [2] and hybrid combinations of the two. In
this work the named-entity recognition model for extraction of medication information was
implemented in Python 3.7 using the spaCy open-source library for NLP tasks [38]. Although there
exist a good number of NLP libraries, such as NLTK [39], NLP4J [40], Stanford CoreNLP [41], Apache
OpenNLP and the recent open-source collection of Transformer-based models from Hugging Face
Inc. [42], the spaCy library is optimized for speed on CPUs, has an intuitive API, and easily integrates
with the active-learning-based annotation tool Prodigy [43]. The architecture of spaCy's NER model
is based on convolutional neural networks, with tokens represented as hashed Bloom embeddings
[44] of the prefix, suffix and lemma of individual words, augmented with a transition-based
chunking model [45]. We also experimented with various combinations of hyperparameters of the
neural network architecture: dropout rates, batch compounding, learning rate and regularization
schemes. We set aside 30 documents (10%) sampled at random from the training data as a
validation set.
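A condensed sketch of such a training setup, assuming the spaCy v2-style API in use at the time; the toy document, character offsets and hyperparameter values are illustrative only, and the seven labels follow Med7's categories:

```python
# Sketch of a spaCy v2-style NER training loop with dropout and batch
# compounding, as mentioned above. Not the authors' exact configuration.
import random
import spacy
from spacy.util import minibatch, compounding

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for label in ["DRUG", "STRENGTH", "DOSAGE", "DURATION",
              "FORM", "ROUTE", "FREQUENCY"]:   # Med7's seven categories
    ner.add_label(label)

# A single toy training example (offsets are character spans).
TRAIN_DATA = [
    ("Take aspirin 75 mg once daily for 3 weeks",
     {"entities": [(5, 12, "DRUG"), (13, 18, "STRENGTH"),
                   (19, 29, "FREQUENCY"), (30, 41, "DURATION")]}),
]

optimizer = nlp.begin_training()
for epoch in range(20):                         # epoch count is illustrative
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, drop=0.2, sgd=optimizer, losses=losses)
```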
Model training augmentation with bootstrapped noisy labels: Several recent lines of research have
demonstrated a clear benefit, in terms of higher accuracy and better generalization, of neural
networks trained with corrupted, noisy and synthetically augmented data [46, 47, 48, 49].
Training with data augmentation also alleviates the problem of learning from a limited amount of
manually annotated data. Similar to the idea presented in 'Snorkel' [50], we designed a number of
labelling functions (LF) by compiling a list of rules and keyword patterns for all seven named-entity
categories. Additionally, we exploited the 'sense2vec' approach [51], fine-tuned on the
entire MIMIC-III corpus, to bootstrap keywords and patterns. 'Sense2vec' is a more sophisticated
version of the 'word2vec' method [52] for representing words as vectors; its major improvement
over 'word2vec' is that it also learns from linguistic annotations of words, building sense
disambiguation into the embeddings.
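A minimal sketch of such labelling functions (the patterns and output format are illustrative assumptions, not the authors' actual rule set):

```python
# Snorkel-style labelling functions producing weak (noisy) entity spans.
import re

def lf_duration(text):
    """Weakly label spans like 'for 3 weeks' as DURATION."""
    return [(m.start(), m.end(), "DURATION")
            for m in re.finditer(r"\bfor \d+ (day|week|month)s?\b", text)]

def lf_strength(text):
    """Weakly label spans like '75 mg' as STRENGTH."""
    return [(m.start(), m.end(), "STRENGTH")
            for m in re.finditer(r"\b\d+(\.\d+)? ?(mg|mcg|g|ml)\b", text)]

note = "Continue aspirin 75 mg daily for 3 weeks."
noisy = [span for lf in (lf_duration, lf_strength) for span in lf(note)]
print(noisy)  # noisy spans that can augment the gold training data
```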
Model evaluation: In order to estimate the performance of the proposed named-entity recognition
model, we used the evaluation schema proposed in SemEval'13 and outlined in Appendix A. The
evaluation schema comprises a number of potential error categories produced by the model, and
the model performance metrics, such as precision and recall, were computed using expressions
A.1. Under this evaluation schema, a partial match required an exact match between the
gold-annotated and the predicted labels, while no restriction was imposed on the boundaries of
the tokens. The rationale behind this approach is evident from the ambiguity in gold-annotated
examples corresponding to the same concept. For example, both the sequence 'for 3 weeks' and
the sequence '3 weeks' were labelled as 'Duration'. In particular, 492 of 967 (51%) text spans
labelled as 'Duration' started with the word 'for'.
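Expressions A.1 are not reproduced in this summary, but the SemEval'13 schema conventionally computes precision and recall from counts of correct (COR), incorrect (INC), partial (PAR), missing (MIS) and spurious (SPU) predictions; a standard reconstruction is:

```latex
% Standard SemEval'13-style precision/recall (my reconstruction of A.1,
% which is not shown in the excerpt above):
P = \frac{\mathrm{COR} + 0.5\,\mathrm{PAR}}{\mathrm{COR} + \mathrm{INC} + \mathrm{PAR} + \mathrm{SPU}},
\qquad
R = \frac{\mathrm{COR} + 0.5\,\mathrm{PAR}}{\mathrm{COR} + \mathrm{INC} + \mathrm{PAR} + \mathrm{MIS}}
```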
3. A Proposed Conceptual Framework for a Representational Approach to Information Retrieval
Connections to Natural Language Processing
Lin et al. [2021b] argued that relevance, semantic equivalence, paraphrase, entailment, and a host of
other "sentence similarity" tasks are all closely related, even though the first is considered an IR problem
and the remainder are considered problems in NLP. What's the connection? Cast in terms of the
conceptual framework proposed in this paper, I argue that these problems all share the formalization
of the logical scoring model, but NLP researchers usually don't care about the physical retrieval model.
For example, supervised paraphrase detection is typically formalized as a "pointwise" estimation task of
the "paraphrase relation":

    P(Paraphrase = 1 | s1, s2) ≜ r(s1, s2)    (6)

That is, the task is to induce some scoring function based on training data that provides an estimate of
the likelihood that two texts (sentences in most cases) are paraphrases of each other. In the popular
transformer-based Sentence-BERT model [Reimers and Gurevych, 2019], the solution is formulated in a
bi-encoder design:

    r(s1, s2) ≜ φ(η(s1), η(s2))    (7)

which has exactly the same functional form as the logical scoring model in Eq. (1)! The main difference,
I argue, is that paraphrase detection for the most part does not care where the texts come from. In
other words, there isn't an explicitly defined physical retrieval model. In fact, comparing Sentence-BERT
with DPR, we can see that although the former focuses on sentence similarity tasks and the latter on
passage retrieval, the functional forms of the solutions are identical. Both are captured by the logical
scoring model in Eq. (1); the definitions of the encoders are also quite similar, both based on BERT, but
they extract the final representations in slightly different ways. Of course, since DPR was designed for a
question answering task, the complete solution requires defining a physical retrieval model, which is
not explicitly present in Sentence-BERT.

Pursuing these connections further, note that there are usage scenarios in which a logical scoring model
for paraphrase detection might require a physical retrieval model. Consider a community question
answering application [Srba and Bielikova, 2016], where the task is to retrieve from a knowledge base
of (question, answer) pairs the top-k questions that are the closest paraphrases of a user's question.
Here, there would be few substantive differences between a solution based on Sentence-BERT and one
based on DPR, just slightly different definitions of the encoders.

One immediate objection to this treatment is that relevance differs from semantic equivalence,
paraphrase, entailment, and other sentence similarity tasks in fundamental ways. For example, the
relations captured by sentence similarity tasks are often symmetric (with entailment being an obvious
exception), i.e., r(s1, s2) = r(s2, s1), while relevance clearly is not. Furthermore, queries are typically
much shorter than their relevant documents (and may not be well-formed natural language sentences),
whereas for sentence similarity tasks the inputs are usually of comparable length and represent
well-formed natural language.
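To make the bi-encoder form in Eq. (7) concrete, here is a minimal sketch using the sentence-transformers library; the model name and the use of cosine similarity as φ are illustrative assumptions (Sentence-BERT supports several pooling and scoring choices):

```python
# Minimal bi-encoder sketch of r(s1, s2) = phi(eta(s1), eta(s2)).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # eta: text -> dense vector

s1 = "How do I reset my password?"
s2 = "What is the procedure for changing my password?"

e1 = encoder.encode(s1, convert_to_tensor=True)
e2 = encoder.encode(s2, convert_to_tensor=True)

score = util.cos_sim(e1, e2)  # phi: cosine similarity between the vectors
print(float(score))           # higher scores suggest the texts are paraphrases
```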
The goal of this discussion is to illustrate that the conceptual framework proposed in this paper
establishes connections between information retrieval and natural language processing, with the hope
that these connections can lead to further synergies in the future. Lin et al. [2021b] (Chapter 5) argued
that until relatively recently, solutions to the text retrieval problem and sentence similarity tasks have
developed in relative isolation in the IR and NLP communities, respectively, despite the wealth of
connections. In fact, both communities have converged on similar solutions in terms of neural
architectures (in the pre-BERT days). The proposed conceptual framework here makes these
connections explicit, hopefully facilitating a two-way dialogue between the communities that will
benefit both.
4. Natural Language Processing and Information Retrieval
The explosive growth in the number of full-text, natural language documents that are available
electronically makes tools that assist users in finding documents of interest indispensable. Information
retrieval systems address this problem by matching query language statements (representing the user's
information need) against document surrogates. Intuitively, natural language processing techniques
should be able to improve the quality of the document surrogates and thus improve retrieval
performance. But to date, explicit linguistic processing of document or query text has afforded
essentially no benefit for general-purpose (i.e., not domain-specific) retrieval systems compared to
less expensive statistical techniques.
The question of statistical vs. NLP retrieval systems is miscast, however. It is not a question of either one
or the other, but rather a question of how accurate an approximation to explicit linguistic processing is
required for good retrieval performance. The techniques used by the statistical systems are based on
linguistic theory in that they are effective retrieval measures precisely because they capture important
aspects of the way natural language is used. Stemming is an approximation to morphological processing.
Finding frequently co-occurring word pairs is an approximation to finding collocations and other
compound structures. Similarity measures implicitly resolve word senses by capturing word forms used
in the same contexts. Current information retrieval research demonstrates that more accurate
approximations cannot yet be reliably exploited to improve retrieval.
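As a minimal sketch of that statistical style of retrieval (not any particular system; NLTK's Porter stemmer and scikit-learn's TfidfVectorizer are assumed), stemming folds word forms together before similarity is computed:

```python
# Statistical ranking with a Porter stemmer as a cheap approximation to
# morphological processing, as described above. Documents are illustrative.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stem = PorterStemmer().stem

def tokenize(text):
    return [stem(tok) for tok in text.lower().split()]

docs = ["retrieval systems rank documents by similarity",
        "morphological processing maps word forms to stems"]
query = ["ranking documents for retrieval"]

vec = TfidfVectorizer(tokenizer=tokenize)
D = vec.fit_transform(docs)
Q = vec.transform(query)
print(cosine_similarity(Q, D))  # ranked scores over the two documents
```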
So why should relatively crude approximations be sufficient? The task in information retrieval is to
produce a ranked list of documents in response to a query. There is no evidence that detailed meaning
structures are necessary to accomplish this task. Indeed, the IR literature suggests that such structures
are not required. For example, IR systems can successfully process documents whose contents have
been garbled in some way such as by being the output of OCR processing [24, 25] or the output of an
automatic speech recognizer [26]. There has even been some success in retrieving French documents
with English queries by simply treating English as misspelled French [27]. Instead, retrieval effectiveness
is strongly dependent on finding all possible (true) matches between documents and queries, and on an
appropriate balance in the weights among different aspects of the query. In this setting, processing that
would create better linguistic approximations must be essentially perfect to avoid causing more harm
than good.
This is not to say that current natural language processing technology is not useful. While information
retrieval has focused on retrieving documents as a practical necessity, users would much prefer systems
that are capable of more intuitive, meaning-based interaction. Current NLP technology may now make
these applications feasible, and research efforts to address appropriate tasks are underway. For
example, one way to support the user in information-intensive tasks is to provide summaries of the
documents rather than entire documents. A recent evaluation of summarization technology found
statistical approaches quite effective when the summaries were simple extracts of document texts [28],
but generating more cohesive abstracts will likely require more developed linguistic processing. Another
way to support the user is to generate actual answers. A first test of systems' ability to find short text
extracts that answer fact-seeking questions will occur in the "Question-Answering" track of TREC-8.
Determining the relationships that hold among words in a text is likely to be important in this task.
5. Natural Language Processing in Information Retrieval
Many Natural Language Processing (NLP) techniques, including stemming, part-of-speech tagging,
compound recognition, de-compounding, chunking, word sense disambiguation and others, have been
used in Information Retrieval (IR). The core IR task we are investigating here is document retrieval.
Several other IR tasks use very similar techniques, e.g., document clustering, filtering, new event
detection, and link detection, and they can be combined with NLP in a way similar to document
retrieval. NLP and IR are very different areas of research, and recent major conferences have only a
small number of papers investigating the use of NLP techniques for information retrieval. The three
conferences listed in Table 1 had 411 full papers in total; only 6 of them (1.5%) explicitly dealt with NLP
for retrieval. The percentage is slightly higher for conferences with a main focus on IR (SIGIR, ECIR: 2.0%)
than for conferences with a main focus on NLP (ACL: 1.0%). In most cases, researchers take
existing NLP components (stemmers, taggers, . . .), apply them to an IR data set and queries, and then
use standard IR techniques. This out-of-the-box use of NLP components that are not geared towards IR
might be one reason why NLP techniques are only moderately successful when compared to
state-of-the-art non-NLP retrieval techniques. The moderate success contradicts the intuition, shared
by a large number of researchers, that NLP should help IR. This article reviews the research on combining
the two areas and attempts to identify reasons why NLP has not brought a breakthrough to IR.
Stop words
Almost all IR applications remove stop words (function words, low-content words, very high frequency
words) before processing documents and queries. This usually increases system performance. But there
are many counter-examples that are handled poorly after stop word removal, e.g.:
1. To be or not to be
Adjusting the stop word list to the given task can significantly improve results (Farahat et al. 2003).
Creating stop word lists is not generally considered to be NLP, but NLP techniques can help to create
task-specific lists and to deal with examples like 1 above (see the sketch below).
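A minimal sketch of the failure mode in example 1, assuming scikit-learn's built-in English stop-word list (any standard list behaves similarly):

```python
# Blanket stop-word removal can delete an entire query.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

query = "to be or not to be"
kept = [w for w in query.split() if w not in ENGLISH_STOP_WORDS]
print(kept)  # [] - every term is a stop word, so the query becomes empty
```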
Stemming
Stemming is the task of mapping words to some base form. The two main methods are (1)
linguistic/dictionary-based stemming and (2) Porter-style stemming (Porter 1980). Method (1) has
higher stemming accuracy, but also higher implementation and processing costs and lower coverage.
Method (2) has lower accuracy, but also lower implementation and processing costs, and is usually
sufficient for IR. Stemming maps several terms onto one base form, which is then used as a term in
the vector space model. This means that, on average, it increases similarities between documents, or
between documents and queries, because they share an additional common term after stemming that
they did not share before. This typically increases recall, but sacrifices precision.
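A brief sketch contrasting the two methods, assuming NLTK's PorterStemmer for (2) and its WordNet lemmatizer as a stand-in for dictionary-based stemming (1):

```python
# (1) dictionary-based lemmatization vs (2) Porter-style suffix stripping.
from nltk.stem import PorterStemmer, WordNetLemmatizer

porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # requires the WordNet data package

for word in ["studies", "studying", "universities"]:
    print(word, porter.stem(word), lemmatizer.lemmatize(word, pos="n"))
# e.g. "studies" -> Porter gives "studi" (not a word), the lemma is "study";
# both conflate variants, which is what matters for the vector space model.
```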
Part-of-Speech Tagging
Part-of-speech tagging is the task of assigning a syntactic category to each word in a text, thereby
resolving some ambiguities. E.g., the tagger decides whether the word "ships" is used as a plural noun
or as a 3rd person singular present tense verb. A variety of techniques have been used, e.g., statistical
(Ratnaparkhi 1996, Brants 2000), memory-based (Daelemans et al. 1996), rule-based (Brill 1992) and
many more. The accuracies for small and medium-sized tag sets are usually in the middle or high 90s.
Kraaij and Pohlmann (1996) investigate the "success" of different parts of speech for retrieval. They
define a "successful term" as a query term that appears in a relevant document. For Dutch, they find
that 58% of the successful terms are nouns (including nominal compounds and proper names), 29% are
verbs, and 13% are adjectives. When looking at the query terms present in the highest number of
relevant documents, they find that 84% of these terms are nouns. This shows the higher importance
of nouns.
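A small sketch of the "ships" ambiguity, assuming NLTK's default tagger (the tags in the comments are the expected outputs, not guaranteed ones):

```python
# POS tagging disambiguates "ships" by sentence context.
from nltk import pos_tag, word_tokenize  # needs the standard NLTK tagger data

print(pos_tag(word_tokenize("The ships left the harbour.")))
# "ships" should be tagged NNS (plural noun)
print(pos_tag(word_tokenize("She ships the package today.")))
# "ships" should be tagged VBZ (3rd person singular present verb)
```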
And more…
6. Identifying Fake News on Social Networks Based on Natural Language Processing: Trends and Challenges
Fake News Definition:
The term "fake news" originally referred to false and often sensationalist information disseminated
under the guise of legitimate news. However, the term's use has evolved, and it is now considered
synonymous with the spread of false information on social media.
Fake news can be distinguished by the means employed to distort information. The news content can
be completely fake, entirely fabricated to deceive the consumer, or it can be misleading content that
distorts a particular topic. There is also imposter content that impersonates genuine sources when,
in fact, the sources are false. Other fraudulent characteristics of fake news are manipulated content,
such as headlines and images that do not accord with the content conveyed, and false context, where
legitimate elements and content are framed in a misleading setting.
Fake news also arises from different motives or intentions: intent to harm or discredit people or
institutions; profit motives, generating financial gains by increasing the placement and viewing of
online publications; intent to influence and manipulate public opinion; intent to promote discord; or,
simply, fun. All of these have been identified as motivations for the creation and dissemination of
fake news. Several concepts compete and overlap with the concept of fake news. A synthesis of these
related concepts, which are not considered fake news, is listed as follows [4,8,13,14]:
1. Satires and parodies, which embed humorous content using sarcasm and irony; their deceptive
character is readily identified;
2. Rumors, which do not originate from news events but are publicly accepted;
4. Spam, commonly described as unwanted messages, mainly e-mail; spam is any advertising
campaign that reaches readers via social media without being wanted;
5. Scams and hoaxes, which are motivated just for fun or to trick targeted individuals;
6. Clickbait, which uses thumbnail images or sensationalist headlines to convince users to
access and share dubious content; clickbait is more like a type of false advertising;
7. Misinformation, which is created involuntarily, without a specific origin or intention to mislead the
reader;
8. Disinformation, which consists of pieces of information created with the specific intention of
confusing the reader.
7. Natural Language Processing. Annual Review of Information Science and Technology, 37, pp. 51-89. ISSN 0066-4200
Information Retrieval
Information retrieval has been a major area of application of NLP, and consequently a number of
research projects dealing with the various applications of NLP in IR have taken place throughout the
world, resulting in a large volume of publications. Lewis and Sparck Jones (1996) comment that the
generic challenge for NLP in the field of IR is whether the necessary NLP of texts and queries is doable,
and the specific challenges are whether non-statistical and statistical data can be combined and whether
data about individual documents and whole files can be combined. They further comment that there are
major challenges in making the NLP technology operate effectively and efficiently, and also in conducting
appropriate evaluation tests to assess whether and how far the approach works in an environment of
interactive searching of large text files. Feldman (1999) suggests that in order to achieve success in IR,
NLP techniques should be applied in conjunction with other technologies, such as visualization,
intelligent agents and speech recognition.
Chandrasekar & Srinivas (1998) propose that coherent text contains significant latent information, such
as syntactic structure and patterns of language use, and this information could be used to improve the
performance of information retrieval systems. They describe a system, called Glean, that uses syntactic
information for effectively filtering irrelevant documents, and thereby improving the precision of
information retrieval systems.
Pirkola (2001) shows that languages vary significantly in their morphological properties. However, for
each language there are two variables that describe the morphological complexity, viz., index of
synthesis (IS) that describes the amount of affixation in an individual language, i.e., the average number
of morphemes per word in the language; and index of fusion (IF) that describes the ease with which two
morphemes can be separated in a language. Pirkola (2001) shows that calculation of the ISs and IFs in a
language is a relatively simple task, and once they have been established, they could be utilized fruitfully
in empirical IR research and system development.
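To make the index of synthesis concrete, here is an illustrative calculation (the example sentence and its morpheme segmentation are mine, not Pirkola's):

```latex
% Index of synthesis = average number of morphemes per word.
% Illustrative example: "dogs barked loudly"
%   dog+s, bark+ed, loud+ly  ->  6 morphemes over 3 words
\[
  \mathrm{IS} = \frac{\#\,\text{morphemes}}{\#\,\text{words}}
              = \frac{6}{3} = 2.0
\]
```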