Applied Text Analysis 2
Semantic tagging. Wmatrix5. Semantic word cloud
NLP follows a multilevel pipeline. Note that for text analysis purposes, several of its levels are usually not
necessary (e.g., phonology, pragmatics).
Morphology
• The level of morphological analysis determines how words are constructed from their
smallest meaningful units, called morphemes. Morphological analysis is necessary
because a text can use different forms of a word (e.g., infect, infected), which
produces linguistic variability and therefore increases the dimensionality of the
text, obscuring the meaning shared by those forms (Bohnet et al., 2018).
• Morphological analysis:
• Lemmatization: Reduces words to their canonical dictionary form, better
known as their lemma. This requires knowing the grammatical function of the word
(e.g., verb, noun, adjective) in order to resolve the inflection.
• Stemming: Reduces words to their stems, which do not need to be words that
exist in a dictionary. The stem is an equal or shorter form of the word, so
stemming is a reduction method, which can generally be implemented
with algorithms based on morphological and/or heuristic rules.
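The difference between the two reductions can be sketched in a few lines of Python. This is a toy illustration only: the suffix list and the mini-dictionary are assumptions made up for the example, and real tools (such as the Snowball stemmer or a dictionary-based lemmatizer) use far richer rules and resources.

```python
# Toy illustration only: SUFFIXES and LEMMA_DICT are hypothetical and tiny.

SUFFIXES = ("ing", "ed", "es", "s")  # assumed, illustrative suffix list

# Hypothetical mini-dictionary mapping (word form, POS) to a lemma.
LEMMA_DICT = {
    ("infected", "VERB"): "infect",
    ("infecting", "VERB"): "infect",
    ("is", "VERB"): "be",
    ("studies", "NOUN"): "study",
    ("studies", "VERB"): "study",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str, pos: str) -> str:
    """Look up the dictionary form; fall back to the lower-cased word."""
    return LEMMA_DICT.get((word.lower(), pos), word.lower())


if __name__ == "__main__":
    for word, pos in [("infected", "VERB"), ("studies", "NOUN"), ("is", "VERB")]:
        print(f"{word:10} stem={toy_stem(word):8} lemma={toy_lemmatize(word, pos)}")
```

Note that the stemmer needs no POS information and may return a form that is not a dictionary word, whereas the lemmatizer depends on both the dictionary and the grammatical function passed in.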
Lemmatization and Stemming
• The table below shows the output of NLTK's Snowball stemmer and
spaCy's lemmatizer for the tokens in the sentence 'Analyzing text is
not that hard'.
Lexicon
• The level of lexical analysis tries to understand the linguistic roles or functions of words, usually known
as their part-of-speech (POS).
• A basic requirement for lexical analysis is that the words of a text must be properly separated. For this,
the text must go through a task called tokenization, which breaks it down into individual
useful units or tokens. Usually this is done both to separate words (i.e., word tokenization) and to
separate sentences within a text (i.e., sentence tokenization).
• Example: “Analyzing text is not that hard.” → [“Analyzing”, “text”, “is”, “not”, “that”, “hard”, “.”]
• Once the input text is tokenized, we need methods that automatically determine the roles or POS of
each word in context - POS tagging. Examples of POS-type tags include N (noun), V (verb), DET
(determiner), ART (article), P (preposition), etc.
• Example: “Analyzing text is not that hard.” → “Analyzing”: VERB, “text”: NOUN, “is”: VERB, “not”: ADV, “that”: ADV, “hard”: ADJ, “.”: PUNCT
• Part-of-speech tags used in the Penn Treebank Project
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
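The word and sentence tokenization described in this section can be sketched with two regular expressions. This is a minimal sketch under simplifying assumptions (it treats every ".", "!" or "?" followed by whitespace as a sentence boundary, so abbreviations such as "e.g." would be split incorrectly); the function names are chosen for the example and are not taken from any library.

```python
import re
from typing import List


def sentence_tokenize(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace (naive heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def word_tokenize(sentence: str) -> List[str]:
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


if __name__ == "__main__":
    text = "Analyzing text is not that hard. It is fun!"
    for sentence in sentence_tokenize(text):
        print(word_tokenize(sentence))
```

Applied to the example sentence from this section, `word_tokenize` yields the seven tokens listed above, with the final period as its own token.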
POS tagging
• CLAWS
• https://ucrel.lancs.ac.uk/claws/
• https://ucrel.lancs.ac.uk/annotation.html
• https://ucrel.lancs.ac.uk/claws1tags.html
Tag   Description
NN    singular common noun (boy, pencil ...)
BEZ   is
AT    singular article (a, an, every)
JJ    general adjective (turquoise, happy ...)
CC    co-ordinating conjunction (and, or, but, so, then, yet, only, for)
Computational approaches to POS tagging
• Rule-based methods: These use expert-defined rules to perform the tagging (e.g., “IF the
current tag is DET and… THEN the tag of the next word is N…”).
• Statistical methods: These are supervised approaches that require annotated training texts, from which
the tag probabilities for each word are estimated.
• Stochastic methods: These use supervised sequence prediction models based on probabilistic
inference. Usual techniques include Hidden Markov Models (HMM) and the related
Conditional Random Fields (CRF). Given an input text and a model estimated from
training texts, the method generates the most likely sequence of POS tags for
the words in that text (Baron, 2019).
• Machine-learning-based methods: These are sequence models trained with supervised
learning on a collection of texts annotated with the correct tags, and then used to
predict the best tag sequence for an input text. The usual methods are based on recurrent
neural networks (RNN), in particular Long Short-Term Memory (LSTM) networks, which capture
the context surrounding a word when predicting its POS (Aggarwal, 2018).
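As an illustration of the stochastic approach, the following sketch estimates a first-order HMM from a tiny hand-made tagged corpus and uses the Viterbi algorithm to find the most likely tag sequence. The corpus, the tag names, and the add-one smoothing are assumptions made for this example; production taggers are trained on large annotated corpora and use more careful smoothing.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training corpus of (word, tag) pairs.
TRAIN = [
    [("the", "DET"), ("boy", "N"), ("reads", "V"), ("the", "DET"), ("book", "N")],
    [("a", "DET"), ("girl", "N"), ("sees", "V"), ("the", "DET"), ("boy", "N")],
    [("the", "DET"), ("girl", "N"), ("reads", "V")],
]


def train(corpus):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    for sentence in corpus:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    tags = sorted(emit)
    vocab = {w for counts in emit.values() for w in counts}
    return trans, emit, tags, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit, tags, vocab):
    """Return the most likely tag sequence for a non-empty list of words."""
    v_size = len(vocab) + 1  # one extra slot for unseen words
    # best[i][t] = (log score of the best path ending in tag t, previous tag)
    best = [{}]
    for t in tags:
        score = log_prob(trans["<s>"], t, len(tags)) + log_prob(emit[t], words[0], v_size)
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            prev_tag, score = max(
                ((p, best[i - 1][p][0] + log_prob(trans[p], t, len(tags))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = (score + log_prob(emit[t], words[i], v_size), prev_tag)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


MODEL = train(TRAIN)

if __name__ == "__main__":
    sentence = ["a", "boy", "sees", "the", "girl"]
    print(list(zip(sentence, viterbi(sentence, *MODEL))))
```

The scores are kept in log space to avoid numerical underflow on longer sentences. A CRF or an LSTM tagger would replace the count-based probabilities with learned scores while keeping the same idea of choosing the best tag sequence as a whole.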
Syntax
The syntactic analysis level tries to determine the structure and roles connecting words in a
sentence (i.e., grams) in order to generate a model for the complete sentence. This
relationship usually takes the form of a grammatical or syntactic structure of the sentence,
which follows a set of language rules called a grammar.
Parsing refers to the process of determining the syntactic structure of a text, and the
program that does this is called a parser. It takes an input text and a set of grammar rules
(i.e., a grammar) and determines whether there is a valid language structure for that text.
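The idea of a parser that takes a text and a grammar and decides whether a valid structure exists can be sketched with the CYK recognition algorithm. The grammar and lexicon below are hypothetical toy examples written in Chomsky normal form; they are assumptions for illustration and cover only a handful of words.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form: (left child, right child) -> parents.
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("DET", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}

# Hypothetical toy lexicon: word -> possible POS categories.
LEXICON = {
    "the": {"DET"},
    "a": {"DET"},
    "boy": {"N"},
    "book": {"N"},
    "reads": {"V"},
}


def cyk_recognize(words, start="S"):
    """Return True if the toy grammar derives `words` from the start symbol."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the categories that derive words[i..j] (inclusive).
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICON.get(word.lower(), ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two children
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY_RULES.get((left, right), set())
    return start in table[0][n - 1]


if __name__ == "__main__":
    print(cyk_recognize("the boy reads a book".split()))
    print(cyk_recognize("boy the reads".split()))
```

This sketch only answers whether a valid structure exists. A full parser would also store back-pointers in the table so that the syntactic tree itself can be recovered.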
Dependency Parsing
• Dependency grammars can be defined as grammars that establish directed relations between the words of
a sentence. In many cases, the verb is taken as the root of the sentence, so the other words are directly or
indirectly connected to the root verb through dependency relations.
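A dependency analysis can be stored as one head link per word, which makes the "directly or indirectly connected to the root verb" property easy to check. The sentence, the arcs, and the relation labels below are hand-written assumptions for illustration and are not the output of any parser.

```python
from typing import List, NamedTuple


class Arc(NamedTuple):
    head: int      # index of the head word, or -1 for the root
    relation: str  # dependency label (illustrative names)


# Hypothetical hand-annotated analysis of "the boy reads a book".
WORDS = ["the", "boy", "reads", "a", "book"]
ARCS = [
    Arc(1, "det"),     # the   <- boy
    Arc(2, "nsubj"),   # boy   <- reads
    Arc(-1, "root"),   # reads is the root verb
    Arc(4, "det"),     # a     <- book
    Arc(2, "obj"),     # book  <- reads
]


def find_root(arcs: List[Arc]) -> int:
    """Return the index of the single word that has no head."""
    roots = [i for i, arc in enumerate(arcs) if arc.head == -1]
    if len(roots) != 1:
        raise ValueError("a dependency tree needs exactly one root")
    return roots[0]


def dependents(arcs: List[Arc], head: int) -> List[int]:
    """Return the indices of the words directly attached to `head`."""
    return [i for i, arc in enumerate(arcs) if arc.head == head]


def path_to_root(arcs: List[Arc], index: int) -> List[int]:
    """Follow head links from a word up to the root."""
    path = [index]
    while arcs[path[-1]].head != -1:
        path.append(arcs[path[-1]].head)
    return path


if __name__ == "__main__":
    root = find_root(ARCS)
    print("root:", WORDS[root])
    print("direct dependents:", [WORDS[i] for i in dependents(ARCS, root)])
    print("path from 'a':", [WORDS[i] for i in path_to_root(ARCS, 3)])
```

Here "boy" and "book" depend directly on the root verb "reads", while the determiners reach it indirectly through their nouns.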
Constituency Parsing
POS
Word cloud
Semantic tag clouds
Negative emotions (list)
Positive emotions
G3 (list)
Semantic word cloud
• Former President Trump announces 2024 presidential bid
https://www.youtube.com/watch?v=8tSYwJ1_htE
• http://wordcloud.cs.arizona.edu/
Task
1) explain the functions / options of the program
2) analyze the text
UAM corpus tool
• Download UAMCorpusTool6
• http://www.corpustool.com/download.html
References
• Atkinson-Abutridy, J. (2022). Text Analytics: An Introduction to the Science and
Applications of Unstructured Information Analysis. Chapman & Hall.
• What is Text Analysis? https://monkeylearn.com/text-analysis/
• Rayson, P. (2009). Wmatrix: a web-based corpus processing
environment. Computing Department, Lancaster University.
http://ucrel.lancs.ac.uk/wmatrix/
• Text Mining and Analytics.
https://www.youtube.com/watch?v=Uqs0GewlMkQ&list=PLLssT5z_DsK8Xwnh_0bjN4KNT81bekvtt