NLP Introduction
Felipe Bravo-Marquez
http://web.stanford.edu/class/cs224n/
Natural Language Processing
• The amount of digitized textual data generated every day is huge (e.g., the Web, social media, medical records, digitized books).
• The need to translate, analyze, and manage this flood of words and text grows accordingly.
• Natural language processing (NLP) is the field of designing methods and
algorithms that take as input or produce as output unstructured, natural
language data. [Goldberg, 2017]
• Natural language processing is focused on the design and analysis of
computational algorithms and representations for processing natural human
language [Eisenstein, 2018].
NLP vs. Computational Linguistics
Natural language processing (NLP) develops methods for solving practical problems
involving language [Johnson, 2014].
• Automatic speech recognition.
• Machine translation.
• Information extraction from documents.
Computational linguistics (CL) studies the computational processes underlying
(human) language.
• How do we understand language?
• How do we produce language?
• How do we learn language?
• Most of the meetings and journals that host natural language processing
research bear the name “computational linguistics” (e.g., ACL, NAACL)
[Eisenstein, 2018].
• NLP and CL may be thought of as essentially synonymous.
• While there is substantial overlap, there is an important difference in focus.
• CL is essentially linguistics supported by computational methods (similar to
computational biology, computational astronomy).
• In linguistics, language is the object of study.
• NLP focuses on solving well-defined tasks involving human language (e.g.,
translation, query answering, holding conversations).
• Fundamental linguistic insights may be crucial for accomplishing these tasks, but
success is ultimately measured by whether and how well the job gets done
(according to an evaluation metric) [Eisenstein, 2018].
Linguistic levels of description
The field of linguistics includes subfields that concern themselves with different levels
or aspects of the structure of language, as well as subfields dedicated to studying how
linguistic structure interacts with human cognition and society [Bender, 2013].
1. Phonetics: The study of the sounds of human language.
2. Phonology: The study of sound systems in human languages.
3. Morphology: The study of the formation and internal structure of words.
4. Syntax: The study of the formation and internal structure of sentences.
5. Semantics: The study of the meaning of sentences.
6. Pragmatics: The study of the way sentences with their semantic meanings are
used for particular communicative goals.
Syntax
• Syntax studies the ways words combine to form phrases and sentences
[Johnson, 2014].
• Syntactic parsing helps identify who did what to whom, a key step in
understanding a sentence.
NLP and Machine Learning
• While we humans are great users of language, we are also very poor at formally
understanding and describing the rules that govern language.
• Understanding and producing language using computers is highly challenging.
• The best-known methods for dealing with language data rely on
supervised machine learning.
• Supervised machine learning: attempts to infer usage patterns and regularities
from a set of pre-annotated input and output pairs (a.k.a. the training dataset).
Training Dataset: CoNLL-2003 NER Data
https://www.clips.uantwerpen.be/conll2003/ner/
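An illustrative excerpt of the data format (one token per line with its part-of-speech tag, chunk tag, and named-entity tag; sentences are separated by blank lines):

U.N.      NNP  I-NP  I-ORG
official  NN   I-NP  O
Ekeus     NNP  I-NP  I-PER
heads     VBZ  I-VP  O
for       IN   I-PP  O
Baghdad   NNP  I-NP  I-LOC
.         .    O     O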
Example of NLP Task: Topic Classification
• Classify a document into one of four categories: Sports, Politics, Gossip, and
Economy.
• The words in the documents provide very strong hints.
• Which words provide what hints?
• Writing up rules for this task is rather challenging.
• However, human readers can easily categorize documents into their topics
(data annotation).
• A supervised machine learning algorithm can then come up with the patterns of
word usage that help categorize the documents (a minimal sketch follows below).
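A minimal sketch of this supervised workflow, assuming scikit-learn is available; the toy documents, labels, and test sentence are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented hand-annotated toy corpus (one document per topic).
docs = [
    "the team won the match in the last minute",
    "the senate passed the new budget bill",
    "the actor was seen dating a pop star",
    "stocks fell as inflation fears grew",
]
labels = ["Sports", "Politics", "Gossip", "Economy"]

# Bag-of-words features feeding a Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)

# With such a tiny corpus the prediction is illustrative only.
print(model.predict(["the striker scored two goals"]))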
Example 3: Sentiment Analysis
• Label tweets by sentiment and train a classifier on their tweet vectors.
• Use the trained classifier to classify target tweets by sentiment.
Tweet vectors (binary bag-of-words; in the original figure, e.g., w2 = “happy”, w3 = “good”, w4 = “grr”, w5 = “lol”):

     w1  w2  w3  w4  w5
t1    0   1   0   0   1
t2    0   0   1   0   1
t3    1   0   0   1   0
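A small sketch of how such binary tweet vectors can be built by hand; the toy tweets are invented to match the table above:

# Invented toy tweets; vocabulary order determines the w_i columns.
tweets = ["lol happy", "lol good", "grr bad"]
vocab = sorted({w for t in tweets for w in t.split()})

# 1 if the word occurs in the tweet, 0 otherwise (presence, not counts).
vectors = [[1 if w in t.split() else 0 for w in vocab] for t in tweets]

print(vocab)    # ['bad', 'good', 'grr', 'happy', 'lol']
print(vectors)  # [[0, 0, 0, 1, 1], [0, 1, 0, 0, 1], [1, 0, 1, 0, 0]]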
• Knowing about linguistic structure is important for feature design and error
analysis in NLP [Bender, 2013].
• Machine learning approaches to NLP require features that can describe and
generalize across particular instances of language use.
• Goal: guide the machine learning algorithm to find correlations between
language use and its target set of labels.
• Knowledge about linguistic structures can inform the design of features for
machine learning approaches to NLP.
Challenges in NLP
Annotating data is expensive and time-consuming. Two common strategies for obtaining labeled data at scale are distant supervision and crowdsourcing.
Distant Supervision
• Automatically label unlabeled data (e.g., tweets collected from the Twitter API) using a heuristic method.
• Emoticon-Annotation Approach (EAA): tweets with positive :) or negative :(
emoticons are labelled according to the polarity indicated by the
emoticon [Read, 2005].
• The emoticon is then removed from the content (a minimal sketch follows after this list).
• The same approach has been extended using hashtags #anger, and emojis.
• It is not trivial to find distant supervision techniques for all kinds of NLP problems.
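A minimal sketch of the emoticon-annotation heuristic, assuming plain-text tweets; the function name and emoticon lists are illustrative:

# Hypothetical emoticon lists; real systems use larger inventories.
POSITIVE = [":)", ":-)", ":D"]
NEGATIVE = [":(", ":-("]

def emoticon_label(tweet):
    # Returns (cleaned_tweet, label), or None if no or conflicting emoticons.
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:  # none found, or conflicting signals
        return None
    label = "positive" if has_pos else "negative"
    for e in POSITIVE + NEGATIVE:
        tweet = tweet.replace(e, "")
    return tweet.strip(), label

print(emoticon_label("just aced my exam :)"))  # ('just aced my exam', 'positive')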
Crowdsourcing
• Rely on services like Amazon Mechanical Turk or Crowdflower to ask the
crowd to annotate data.
• This can be expensive.
• It is hard to guarantee quality.
The Three Waves of NLP
NLP progress can be divided into three main waves: 1) rationalism, 2) empiricism, and
3) deep learning [Deng and Liu, 2018].
1950 - 1990 Rationalism: approaches endeavored to design hand-crafted rules to incorporate
knowledge and reasoning mechanisms into intelligent NLP systems (e.g., ELIZA
for simulating a Rogerian psychotherapist, MARGIE for structuring real-world
information into concept ontologies).
1991 - 2009 Empiricism: characterized by the exploitation of data corpora and of (shallow)
machine learning and statistical models (e.g., Naive Bayes, HMMs, IBM
translation models).
2010 - Deep Learning: feature engineering (considered a bottleneck) is replaced
with representation learning and/or deep neural networks (e.g.,
https://www.deepl.com/translator). A very influential paper in this
revolution: [Collobert et al., 2011].
Dates are approximate.
A fourth wave?
• Large Language Models (LLMs) like ChatGPT, GPT-4, Llama, and Bard are deep
neural networks trained on large corpora (hundreds of billions of tokens) with a
large parameter space (billions of parameters) to predict the next word from a fixed-size context.
• One of the most striking features of these models is their ability to perform few-shot,
one-shot, and zero-shot learning, often referred to as “in-context learning”.
• This implies they can acquire new tasks with minimal human-annotated
data, simply by being given an appropriate instruction or prompt (see the example below).
• Thus, despite being rooted in the deep learning paradigm, they introduce a
disruptive approach to NLP.
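A hypothetical few-shot prompt illustrating in-context learning (the task, examples, and wording are invented for illustration):

Classify the tweet as positive or negative.
Tweet: "I love this phone" -> positive
Tweet: "worst service ever" -> negative
Tweet: "what a great day" ->

The model is expected to continue with “positive”, acquiring the task from the two labeled examples in the prompt alone, without any parameter updates.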
Roadmap
In this course we will introduce modern concepts in natural
language processing based on statistical models (second
wave) and neural networks (third wave). The main concepts to
be covered are listed below:
1. Text classification.
2. Linear Models.
3. Naive Bayes.
4. Hidden Markov Models.
5. Neural Networks.
6. Word embeddings.
7. Convolutional Neural Networks (CNNs).
8. Recurrent Neural Networks: Elman, LSTMs, GRUs.
9. Attention.
10. Sequence-to-Sequence Models.
11. Transformers.
12. Large Language Models.
Questions?