
Applied Text Analysis 2

Natural Language Processing (NLP) is a field of artificial intelligence that allows computers to understand human language. NLP uses techniques from linguistics, computer science, and cognitive science to analyze and understand written or spoken language. It has applications such as automatic summarization, question answering, sentiment analysis, and information extraction from text. NLP involves various levels of linguistic analysis including morphology, part-of-speech tagging, syntax, and semantics. Computational methods for these tasks include rule-based, statistical, and machine learning approaches.

Natural-Language Processing. POS tagging.

Semantic tagging. Wmatrix5. Semantic word cloud

Applied text analysis 2


Natural-Language Processing (NLP)
• NLP is the area of Artificial Intelligence (AI) that allows computers to
understand human language in order to perform complex tasks on different
linguistic objects (e.g., speech, words, phrases, meaning).
• With these capabilities, computers can make sense of unstructured data,
which enables them to acquire knowledge that is implicit in language
(Bird et al., 2009; Eisenstein, 2019; Ghosh & Gunning, 2019).
• To do this, NLP combines models from linguistics, computer science, AI,
and the cognitive sciences to create intelligent systems capable of
understanding, analyzing, and extracting meaning from written (text) or
spoken human language (Jurafsky et al., 2014).
NLP capabilities in current technologies
• Systems that automatically answer questions written or spoken in natural
language, as in Apple's Siri (Atkinson & Andrade, 2013).
• Automatic summarization from one or more documents (Atkinson & Munoz, 2013).
• Automatic dialogues in human–computer interaction (Atkinson, 2007a; Wu et al.,
2019).
• Self-service systems in contact centers (Sankar et al., 2019).
• Spam categorization.
• Sentiment analysis of opinions or reviews about products and services.
• Information extraction from online documents in order to populate databases
(Atkinson et al., 2014).
• Grammar checkers and autofill prediction in word processors (like MS Word).
• Many others.
Levels of linguistic processing
NLP levels and tasks
• The full flow of NLP can be seen as a pipeline of levels and associated
tasks.

NLP follows a multilevel pipeline. Note that for text analysis purposes, several of these levels
(e.g., phonology, pragmatics) are usually not necessary.
Morphology
• The level of morphological analysis determines how words are constructed from their
smallest meaningful units, called morphemes. Morphological analysis is necessary
because a text can use different forms of a word (e.g., infect, infected), which
produces linguistic variability and therefore increases the dimensionality of a
text, obscuring the meaning shared by those forms (Bohnet et al., 2018).
• Morphological analysis:
• Lemmatization: Reduces words to their canonical dictionary form, better
known as their lemma. This requires knowing the grammatical function (e.g.,
verb, noun, adjective) of the word in order to resolve the inflection.
• Stemming: Reduces words to their stems, which do not need to be valid
dictionary forms. The stem is an equal or shorter form of the word, so
stemming is a reduction method that can generally be implemented
with algorithms based on morphological and/or heuristic rules.
Lemmatization and Stemming
• The table below shows the output of NLTK's Snowball stemmer and
spaCy's lemmatizer for the tokens in the sentence 'Analyzing text is
not that hard'.
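The contrast between stemming and lemmatization can be sketched in plain Python. This is a toy illustration only: the suffix list and the mini-dictionary are hypothetical, and it does not reproduce the Snowball algorithm or spaCy's lemmatizer. It only shows that a stemmer strips suffixes by rule (and may return non-words), while a lemmatizer looks up a dictionary form using the word's part of speech.

```python
# Toy illustration only: real stemmers (e.g., Snowball) and real lemmatizers
# use far richer rules and dictionaries than the hypothetical ones below.

SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical suffix list, longest first

# Hypothetical mini-dictionary keyed by (lowercased word, part of speech).
LEMMA_DICT = {
    ("analyzing", "VERB"): "analyze",
    ("is", "VERB"): "be",
    ("infected", "VERB"): "infect",
}


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form if known, else the lowercased word."""
    word = word.lower()
    return LEMMA_DICT.get((word, pos), word)


if __name__ == "__main__":
    for token, pos in [("Analyzing", "VERB"), ("text", "NOUN"), ("is", "VERB")]:
        print(token, stem(token), lemmatize(token, pos))
```

Note that the stem of "Analyzing" is not a dictionary word, whereas the lemma is; this is the practical difference between the two reductions.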
Lexicon
• The level of lexical analysis tries to understand the linguistic roles or functions of words, usually known
as their part-of-speech (POS).
• A basic requirement for lexical analysis is that the words of a text must be properly separated. For this,
the text must go through a task called tokenization, which breaks it down into individual
useful units or tokens. Usually this is done both to separate words (i.e., word tokenization) and to
separate sentences within a text (i.e., sentence tokenization).
• Example: Analyzing text is not that hard. = [“Analyzing”, “text”, “is”, “not”, “that”, “hard”, “.”]
• Once the input text is tokenized, we need methods that automatically determine the roles or POS of
each word in context - POS tagging. Examples of POS-type tags include N (noun), V (verb), DET
(determiner), ART (article), P (preposition), etc.
• Example: Analyzing text is not that hard. “Analyzing”: VERB, “text”: NOUN, “is”: VERB, “not”: ADV, “that”: ADV, “hard”: ADJ,
“.”: PUNCT
• Part-of-speech tags used in the Penn Treebank Project:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
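Word and sentence tokenization, as described above, can be approximated with regular expressions. The sketch below is a simplification under stated assumptions: the function names are hypothetical, and the sentence splitter treats every '.', '!' or '?' followed by whitespace as a boundary, so abbreviations such as "Dr." would be split incorrectly. Production tokenizers handle many more cases.

```python
import re


def word_tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation marks."""
    # \w+(?:'\w+)? matches a word, optionally with an internal apostrophe
    # (e.g., "don't"); [^\w\s] matches one punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def sent_tokenize(text: str) -> list[str]:
    """Split text into sentences after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


if __name__ == "__main__":
    print(word_tokenize("Analyzing text is not that hard."))
    print(sent_tokenize("Analyzing text is not that hard. It is fun!"))
```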
POS tagging
• CLAWS
• https://ucrel.lancs.ac.uk/claws/
• https://ucrel.lancs.ac.uk/annotation.html
• https://ucrel.lancs.ac.uk/claws1tags.html

• NN: singular common noun (boy, pencil ... )
• BEZ: (is)
• AT: singular article (a, an, every)
• JJ: general adjective (turquoise, happy ... )
• CC: co-ordinating conjunction (and, or, but, so, then, yet, only, for)
Computational approaches to POS tagging
• Rule-based methods: These use expert-defined rules to perform the tagging (e.g., “IF the
current tag is DET and… THEN the tag of the next word is N…”).
• Statistical methods: These are supervised approaches that require training texts, from which
the label probabilities for each word are estimated.
• Stochastic methods: These use supervised sequence prediction models based on Bayesian
probabilistic inference approaches. Usual techniques include Hidden Markov Models (HMM)
and a generalization called Conditional Random Fields (CRF). Given an input text and a set of
training texts, the method generates the most likely sequence of POS tags for
the words in that text (Baron, 2019).
• Machine-learning-based methods: These are sequence models trained with supervised
learning on a collection of texts annotated with the correct labels, and then used to
predict the best sequence of labels for an input text. The usual methods are based on recurrent
neural networks such as LSTM (Long Short-Term Memory) networks, which capture
the context surrounding a word to predict its POS role (Aggarwal, 2018).
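The first two approaches above can be combined in a minimal sketch: a statistical tagger that estimates the tag probabilities of each word from annotated sentences, with an expert-style rule as a fallback for unseen words. The training data, the rules, and all function names are hypothetical and chosen only for illustration; this is not how CLAWS or any production tagger works.

```python
from collections import Counter, defaultdict

# Hypothetical hand-annotated training data: sentences of (word, tag) pairs.
TRAIN = [
    [("Analyzing", "VERB"), ("text", "NOUN"), ("is", "VERB"), ("not", "ADV"),
     ("that", "ADV"), ("hard", "ADJ"), (".", "PUNCT")],
    [("That", "DET"), ("text", "NOUN"), ("is", "VERB"), ("hard", "ADJ"),
     (".", "PUNCT")],
    [("The", "DET"), ("text", "NOUN"), ("is", "VERB"), ("not", "ADV"),
     ("that", "DET"), ("text", "NOUN"), (".", "PUNCT")],
]


def train(sentences):
    """Count how often each (lowercased) word appears with each tag."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return counts


def tag_probabilities(counts, word):
    """Relative-frequency estimate of P(tag | word); empty if unseen."""
    tag_counts = counts.get(word.lower())
    if not tag_counts:
        return {}
    total = sum(tag_counts.values())
    return {tag: n / total for tag, n in tag_counts.items()}


def rule_based_guess(word):
    """Hypothetical suffix rules used only for words not seen in training."""
    if word.endswith("ing"):
        return "VERB"
    if word.endswith("ly"):
        return "ADV"
    return "NOUN"


def tag(counts, tokens):
    """Assign each token its most probable tag, ignoring context."""
    tagged = []
    for token in tokens:
        probs = tag_probabilities(counts, token)
        best = max(probs, key=probs.get) if probs else rule_based_guess(token)
        tagged.append((token, best))
    return tagged


if __name__ == "__main__":
    model = train(TRAIN)
    print(tag_probabilities(model, "that"))
    print(tag(model, ["The", "text", "is", "quickly", "changing"]))
```

Because this tagger ignores context, it always gives "that" its most frequent tag, even where another tag would be correct. Sequence models such as HMMs, CRFs, and LSTMs address this limitation by scoring whole tag sequences.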
Syntax
The syntactic analysis level tries to determine the structure and roles
connecting words in a sentence (i.e., grams) in order to generate a model for
the complete sentence. This relationship usually takes the form of a
grammatical or syntactic structure of the sentence, following a set of
language rules called a grammar.
Parsing refers to the process of determining the syntactic structure of a
text. A parser takes an input text and a set of grammar rules (i.e., a
grammar) and determines whether there is a valid language structure for that text.
Dependency Parsing
• Dependency grammars can be defined as grammars that establish directed relations between the words of
a sentence. In many cases, the verb is taken as the root of the sentence, so the other words are directly or
indirectly connected to the root verb through dependency relationships.
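A dependency analysis can be stored as a list of directed arcs from a head word to a dependent word. The sketch below uses a hand-written analysis of an example sentence; the sentence, the relation labels, and the helper names are assumptions made for illustration, not the output of a parser. It shows how the root verb is found and how every other word reaches it by following head links.

```python
from collections import defaultdict

# Hand-written (hypothetical) analysis of "The doctor examined the patient".
# Index 0 is an artificial ROOT node; arcs are (head, relation, dependent).
TOKENS = ["ROOT", "The", "doctor", "examined", "the", "patient"]
ARCS = [
    (2, "det", 1),    # doctor -> The
    (3, "nsubj", 2),  # examined -> doctor
    (0, "root", 3),   # ROOT -> examined
    (5, "det", 4),    # patient -> the
    (3, "obj", 5),    # examined -> patient
]


def root_word(tokens, arcs):
    """Return the word attached directly to the artificial ROOT node."""
    for head, _, dep in arcs:
        if head == 0:
            return tokens[dep]
    return None


def dependents(arcs):
    """Map each head index to its list of (relation, dependent index)."""
    tree = defaultdict(list)
    for head, rel, dep in arcs:
        tree[head].append((rel, dep))
    return tree


def path_to_root(arcs, index):
    """Follow head links from a word up to ROOT (index 0)."""
    head_of = {dep: head for head, _, dep in arcs}
    path = [index]
    while path[-1] != 0:
        path.append(head_of[path[-1]])
    return path


if __name__ == "__main__":
    print(root_word(TOKENS, ARCS))
    print([TOKENS[i] for i in path_to_root(ARCS, 4)])
```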
Constituency Parsing
Constituency (phrase structure) grammars model syntactic structures by
making use of abstract nodes associated with words and other abstract categories
(depending on the type of grammar) and undirected relations between them.
The parser takes a grammar and generates a syntax tree (parse tree), which
attempts to detect all relationships that match the grammatical rules for the entire
input text.
The successive application of these rules generates the parse tree, in which the
upper levels represent generating symbols (i.e., nonterminals) while the last level
(i.e., the tree leaves) represents the symbols or words in a vocabulary (i.e., terminals)
that must match the input.
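The idea of applying grammar rules until a tree covers the whole input can be sketched with a small CYK-style chart parser. The grammar and lexicon below are hypothetical and deliberately tiny (each word has one category and each pair of categories has at most one parent), so this is an illustration of the technique under those assumptions rather than a general parser.

```python
# Hypothetical toy grammar in Chomsky normal form.
# Binary rules: (left child, right child) -> parent nonterminal.
BINARY = {("NP", "VP"): "S", ("Det", "N"): "NP", ("V", "NP"): "VP"}
# Lexical rules: terminal (word) -> nonterminal.
LEXICON = {"the": "Det", "doctor": "N", "patient": "N", "examined": "V"}


def parse(tokens):
    """Return a parse tree rooted in S, or None if the grammar rejects it."""
    n = len(tokens)
    if n == 0:
        return None
    # chart[(i, j)] maps a nonterminal to a subtree covering tokens[i:j].
    chart = {}
    for i, token in enumerate(tokens):
        category = LEXICON.get(token.lower())
        chart[(i, i + 1)] = {category: (category, token)} if category else {}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = {}
            for k in range(i + 1, j):  # try every split point
                for left_cat, left in chart[(i, k)].items():
                    for right_cat, right in chart[(k, j)].items():
                        parent = BINARY.get((left_cat, right_cat))
                        if parent and parent not in cell:
                            cell[parent] = (parent, left, right)
            chart[(i, j)] = cell
    return chart[(0, n)].get("S")


def leaves(tree):
    """Read the terminals (words) off the bottom level of the tree."""
    if len(tree) == 2 and isinstance(tree[1], str):
        return [tree[1]]
    return [word for child in tree[1:] for word in leaves(child)]


if __name__ == "__main__":
    print(parse("the doctor examined the patient".split()))
```

The nested tuples mirror the description above: inner nodes carry nonterminals such as S, NP, and VP, and the leaves are the terminals, which must match the input words in order.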
Semantic analysis
• The level of semantic analysis determines the literal meaning of a
word or sentence. For this, semantics tries to identify the interactions
between individual meanings (words) in contexts given in a sentence.
As with the other linguistic levels, a semantic analysis must also solve
ambiguity problems as words or sentences can have multiple possible
interpretations or meanings.
• Semantic analysis can be approached from three perspectives: lexical semantics
(of words), sentence semantics, and complete text semantics (i.e.,
discourse).
Semantic tagging
• UCREL Semantic Analysis System (USAS)
• https://ucrel.lancs.ac.uk/usas/
• https://ucrel.lancs.ac.uk/usas/semtags.txt
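Semantic tagging can be pictured as looking up each word in a lexicon that maps it to a semantic field. The sketch below is a toy under stated assumptions: the lexicon is a hypothetical handful of entries that borrow USAS-style labels (E4.1 for happy/sad, G3 for warfare, Z99 for unmatched words), and it performs no disambiguation, which the real USAS system does. Counting the resulting tags gives the kind of semantic frequency list that tag clouds are built from.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon with USAS-style labels; not the real USAS lexicon.
TOY_LEXICON = {
    "happy": "E4.1+",  # happy/sad, positive pole
    "sad": "E4.1-",    # happy/sad, negative pole
    "army": "G3",      # warfare, defence and the army; weapons
    "war": "G3",
    "weapons": "G3",
}
UNMATCHED = "Z99"  # label for words that are not in the lexicon


def semantic_tag(text):
    """Tag each lowercased word with its lexicon label, or Z99 if unknown."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [(token, TOY_LEXICON.get(token, UNMATCHED)) for token in tokens]


def tag_frequencies(tagged):
    """Build a frequency list of semantic tags."""
    return Counter(tag for _, tag in tagged)


if __name__ == "__main__":
    tagged = semantic_tag("The army was happy")
    print(tagged)
    print(tag_frequencies(tagged).most_common())
```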
NLP: basic concepts
Text representation
Wmatrix
https://ucrel.lancs.ac.uk/wmatrix/
A case study
Former President Trump announces 2024 presidential bid
https://www.youtube.com/watch?v=8tSYwJ1_htE
POS tagging
https://ucrel.lancs.ac.uk/claws/format.html
Semantic tagging
https://ucrel.lancs.ac.uk/usas/
Frequency lists (semantic, POS)
Word cloud
Semantic tag clouds
Negative emotions (list)
Positive emotions
G3 (list)
Semantic word cloud
Former President Trump announces 2024 presidential bid
https://www.youtube.com/watch?v=8tSYwJ1_htE

• http://wordcloud.cs.arizona.edu/

Task
1) explain the functions / options of the program
2) analyze the text
UAM corpus tool
• Download UAMCorpusTool6
• http://www.corpustool.com/download.html
References
• Atkinson-Abutridy, J. (2022). Text Analytics: An Introduction to the Science and
Applications of Unstructured Information Analysis. Chapman & Hall.
• What is Text Analysis? https://monkeylearn.com/text-analysis/
• Rayson, P. (2009). Wmatrix: a web-based corpus processing
environment. Computing Department, Lancaster University.
http://ucrel.lancs.ac.uk/wmatrix/
• Text Mining and Analytics.
https://www.youtube.com/watch?v=Uqs0GewlMkQ&list=PLLssT5z_DsK8Xwnh_0bjN4KNT81bekvtt
