Applied Text Analysis 2

Natural Language Processing (NLP) is a field of artificial intelligence that allows computers to understand human language. NLP uses techniques from linguistics, computer science, and cognitive science to analyze and understand written or spoken language. It has applications such as automatic summarization, question answering, sentiment analysis, and information extraction from text. NLP involves various levels of linguistic analysis including morphology, part-of-speech tagging, syntax, and semantics. Computational methods for these tasks include rule-based, statistical, and machine learning approaches.

Uploaded by

Таня Брода

Natural-Language Processing. POS tagging. Semantic tagging. Wmatrix5. Semantic word cloud

Applied text analysis 2

Natural-Language Processing (NLP)
• NLP is the area of Artificial Intelligence (AI) that allows computers to
understand human language in order to perform complex tasks on different
linguistic objects (e.g., speech, words, phrases, meaning).
• With these capabilities, computers can make sense of unstructured data,
which enables them to acquire knowledge that’s
implicit in language (Bird et al., 2009; Eisenstein, 2019; Ghosh &
Gunning, 2019).
• To do this, NLP combines models from linguistics, computer science, AI,
and cognitive science to create intelligent systems capable of
understanding, analyzing, and extracting meaning from written (text) or
spoken human language (Jurafsky et al., 2014).
NLP capabilities in current technologies
• Systems that automatically answer questions written or spoken in natural language
(as in Apple’s Siri) (Atkinson & Andrade, 2013).
• Automatic summarization from one or more documents (Atkinson & Munoz, 2013).
• Automatic dialogues in human–computer interaction (Atkinson, 2007a; Wu et al.,
2019).
• Self-service systems in contact centers (Sankar et al., 2019).
• Spam categorization.
• Sentiment analysis of opinions or reviews about products and services.
• Information extraction from online documents in order to populate databases
(Atkinson et al., 2014).
• Grammar checkers and autofill prediction in word processors (like MS Word).
• Many others.
Levels of linguistic processing
NLP levels and tasks
• The full flow of NLP can be seen as a pipeline of levels and associated
tasks

NLP follows a multilevel pipeline. Note that for text analytics purposes, several of the levels are usually not
necessary (e.g., phonology, pragmatics).
Morphology
• The level of morphological analysis determines how words are constructed from their
smallest meaningful units, called morphemes. The analysis of morphology is necessary
because a text can use different forms of a word (e.g., infect, infected, etc.), which could
produce too much linguistic variability and, therefore, increase the dimensionality of a
text, obscuring the real meaning of the individual word (Bohnet et al., 2018).
• Morphological analysis:
• Lemmatization: Reduces words to their canonical dictionary form, better
known as their lemma. This requires knowing the grammatical function (e.g.,
verb, noun, adjective) of the word in order to resolve the inflection.
• Stemming: Reduces words to their stems, which need not be valid dictionary
words. The stem is an equal or shorter form of the
word, so stemming is a reduction method, which can generally be implemented
with algorithms based on morphological and/or heuristic rules.
Lemmatization and Stemming
• The table below shows the output of NLTK's Snowball Stemmer and
spaCy's lemmatizer for the tokens in the sentence 'Analyzing text is
not that hard'.
[Table not recovered from the slide: stem and lemma for each token]
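The contrast between the two reductions can be sketched in plain Python. This is a toy illustration only: the suffix list and the small lemma table are assumptions made up for the example, not the rules used by NLTK's Snowball Stemmer or spaCy's lemmatizer.

```python
# Toy illustration only: hypothetical suffix rules and a hand-made lemma table.
SUFFIXES = ("ing", "ed", "es", "s")  # assumed list, checked in this order

# (word, POS) -> lemma; the POS is needed to resolve the inflection.
LEMMAS = {
    ("analyzing", "VERB"): "analyze",
    ("is", "VERB"): "be",
    ("infected", "VERB"): "infect",
}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemma(word, pos):
    """Look up the dictionary form; fall back to the lowercased word."""
    return LEMMAS.get((word.lower(), pos), word.lower())
```

Here `toy_stem("Analyzing")` gives the non-word stem `analyz`, while `toy_lemma("Analyzing", "VERB")` gives the dictionary form `analyze`. The stemmer needs only the string; the lemmatizer also needs the part of speech.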
Lexicon
• The level of lexical analysis tries to understand the linguistic roles or functions of words, usually known
as their part-of-speech (POS).
• A basic requirement for lexical analysis is that the words of a text must be properly separated. For this,
the text must go through a task called tokenization, which breaks it down into individual
useful units or tokens. Usually this is done both to separate words (i.e., word tokenization) and to
separate sentences within a text (i.e., sentence tokenization).
• Example: Analyzing text is not that hard. = [“Analyzing”, “text”, “is”, “not”, “that”, “hard”, “.”]
• Once the input text is tokenized, we need methods that automatically determine the roles or POS of
each word in context - POS tagging. Examples of POS-type tags include N (noun), V (verb), DET
(determiner), ART (article), P (preposition), etc.
• Example: Analyzing text is not that hard. “Analyzing”: VERB, “text”: NOUN, “is”: VERB, “not”: ADV, “that”: ADV, “hard”: ADJ,
“.”: PUNCT
• Part-of-speech tags used in the Penn Treebank Project
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
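Word and sentence tokenization can be approximated with two regular expressions. This is a minimal sketch using only Python's standard library; the function names are made up for the example, and real tokenizers handle many cases (abbreviations, contractions, URLs) that this heuristic does not.

```python
import re


def sentence_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)
```

Applied to the example sentence, `word_tokenize("Analyzing text is not that hard.")` yields the seven tokens listed above, with the final period as its own token. An abbreviation such as "Dr." would wrongly end a sentence under this rule, which is one reason production tokenizers are more elaborate.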
POS tagging
• CLAWS
• https://ucrel.lancs.ac.uk/claws/
• https://ucrel.lancs.ac.uk/annotation.html
• https://ucrel.lancs.ac.uk/claws1tags.html

• NN: singular common noun (boy, pencil ...)
• BEZ: is
• AT: singular article (a, an, every)
• JJ: general adjective (turquoise, happy ...)
• CC: co-ordinating conjunction (and, or, but, so, then, yet, only, for)
Computational approaches to POS tagging
• Rule-based methods: These use expert-defined rules to perform the tagging (e.g., “IF the
current tag is DET and… THEN the tag of the next word is N…”).
• Statistical methods: These are supervised approaches that require training texts, from which
the tag probabilities for each word are estimated.
• Stochastic methods: These use supervised sequence prediction models based on probabilistic
inference. Common techniques include Hidden Markov Models (HMMs) and Conditional Random
Fields (CRFs), a related discriminative sequence model. Given an input text and a model
estimated from training texts, the method generates the most likely sequence of POS tags for
the words in that text (Baron, 2019).
• Machine-learning-based methods: These are sequence models trained with supervised
learning on a collection of texts annotated with the correct tags, which then
predict the best sequence of tags for an input text. The usual methods are based on recurrent
neural networks, in particular LSTM (Long Short-Term Memory) networks, which capture
the context surrounding a word to predict its POS role (Aggarwal, 2018).
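The HMM approach can be illustrated with a small Viterbi decoder. This is a sketch under stated assumptions: the three-tag tagset and all probabilities below are hand-set for the example, whereas a real tagger would estimate them from annotated training texts.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for words under a first-order HMM."""
    # best[i][t] = (probability of the best path ending in tag t at position i,
    #               previous tag on that path)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            column[t] = max(
                (best[i - 1][p][0] * trans_p[p][t] * emit_p[t].get(words[i], 0.0), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Hypothetical, hand-set parameters (not estimated from any corpus).
TAGS = ["DET", "N", "V"]
START = {"DET": 0.8, "N": 0.1, "V": 0.1}
TRANS = {
    "DET": {"DET": 0.0, "N": 0.9, "V": 0.1},
    "N": {"DET": 0.1, "N": 0.2, "V": 0.7},
    "V": {"DET": 0.5, "N": 0.3, "V": 0.2},
}
EMIT = {
    "DET": {"the": 1.0},
    "N": {"dog": 0.6, "barks": 0.4},
    "V": {"barks": 0.7, "dog": 0.3},
}
```

With these numbers, `viterbi(["the", "dog", "barks"], TAGS, START, TRANS, EMIT)` selects DET, N, V: although "barks" could be a noun, the transition from N to V makes the verb reading more probable in this context.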
Syntax
The syntactic analysis level tries to
determine the structure and roles
connecting words in a sentence (i.e.,
grams) in order to generate a model for
the complete sentence. This
relationship usually takes the form of a
grammatical or syntactic structure of
the sentence, following certain
language rules called a grammar.
Parsing refers to the process of
determining the syntactic structure of a
text. It is carried out by a parser, which takes an input text
and a set of grammar rules (i.e., a
grammar) and determines whether there’s a
valid language structure for that text.
Dependency Parsing
• Dependency grammars can be defined as grammars that establish directed relations between the words of
sentences. In many cases, the verb is taken as the root of a sentence, so the other words are directly or
indirectly connected to the root verb through dependency relationships.
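A dependency analysis can be stored as one head index per token. The sketch below assumes a hand-annotated sentence (not the output of any parser) and hypothetical helper names; it shows how the root verb and the directed relations can be read off that representation.

```python
# Each token stores the index of its head; the root's head is None.
# Hand-annotated example, assumed for illustration.
TOKENS = ["The", "cat", "chased", "the", "mouse"]
HEADS = [1, 2, None, 4, 2]
LABELS = ["det", "nsubj", "root", "det", "obj"]


def root_index(heads):
    """Return the index of the single token that has no head."""
    roots = [i for i, h in enumerate(heads) if h is None]
    if len(roots) != 1:
        raise ValueError("a dependency tree needs exactly one root")
    return roots[0]


def dependents(heads, head):
    """Indices of the tokens directly attached to the given head."""
    return [i for i, h in enumerate(heads) if h == head]


def path_to_root(heads, i):
    """Follow head links from token i up to the root."""
    path = [i]
    while heads[path[-1]] is not None:
        path.append(heads[path[-1]])
    return path
```

In this example the root is the verb "chased"; "cat" and "mouse" depend on it directly, while "The" reaches it indirectly through "cat".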
Constituency Parsing
Constituency (phrase structure)
grammars model syntactic structures by
making use of abstract nodes associated
with words and other abstract categories
(depending on the type of grammar) and
undirected relations between them.
The parser takes a grammar and generates
a syntax tree or parse tree structure, which
attempts to detect all relationships that
match the grammatical rules for the entire
input text.
The successive application of these rules
generates a syntactic tree structure (i.e.,
parse tree) in which the upper levels
represent generating symbols (i.e.,
nonterminals) while the last level (i.e., tree
leaves) represents the symbols or words in
a vocabulary (i.e., terminals) that must
match the input.
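The idea of checking an input against grammar rules can be sketched with a CYK recognizer. The tiny grammar below is an assumption made up for the example and is written in Chomsky normal form (every rule produces either two nonterminals or one word). The function only decides whether a valid structure exists; it does not build the parse tree.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form.
# Binary rules: (left child, right child) -> possible parents.
BINARY = {
    ("NP", "VP"): {"S"},
    ("DET", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
# Lexical rules: word (terminal) -> possible nonterminals.
LEXICON = {"the": {"DET"}, "cat": {"N"}, "mouse": {"N"}, "chased": {"V"}}


def cyk_recognize(words, start="S"):
    """Return True if the grammar can derive the word sequence from start."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICON.get(word.lower(), ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two children
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY.get((left, right), set())
    return start in table[0][n - 1]
```

For "the cat chased the mouse" the table fills bottom-up: the words (terminals) receive DET, N and V, pairs combine into NP and VP, and the top cell receives S, so the sentence is accepted. A scrambled order such as "cat the chased" leaves the top cell empty and is rejected.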
Semantic analysis
• The level of semantic analysis determines the literal meaning of a
word or sentence. For this, semantics tries to identify the interactions
between individual meanings (words) in contexts given in a sentence.
As with the other linguistic levels, a semantic analysis must also solve
ambiguity problems as words or sentences can have multiple possible
interpretations or meanings.
• Semantic analysis can be approached from three perspectives: lexical semantics
(of words), sentence semantics, and complete text semantics (i.e.,
discourse).
Semantic tagging
• UCREL Semantic Analysis System (USAS)
• https://ucrel.lancs.ac.uk/usas/
• https://ucrel.lancs.ac.uk/usas/semtags.txt
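Semantic tagging can be pictured as looking each token up in a lexicon of semantic field codes and then counting the codes. The sketch below is not the USAS implementation: the mini-lexicon and the word-to-tag assignments are assumptions for illustration, written in the style of the USAS tag codes, and no disambiguation is attempted.

```python
from collections import Counter

# Hypothetical mini-lexicon; codes are written in the style of the USAS tagset.
TOY_LEXICON = {
    "president": "G1.1",  # government
    "army": "G3",         # warfare, defence and the army
    "happy": "E4.1+",     # happy
    "sad": "E4.1-",       # sad
}
UNMATCHED = "Z99"  # code used here for words missing from the lexicon


def semantic_tag(tokens):
    """Pair every token with its lexicon tag, or UNMATCHED if it is unknown."""
    return [(tok, TOY_LEXICON.get(tok.lower(), UNMATCHED)) for tok in tokens]


def tag_frequencies(tagged):
    """Count how often each semantic tag occurs in a tagged text."""
    return Counter(tag for _, tag in tagged)
```

Counting tags rather than words is what makes a semantic frequency list or a semantic tag cloud possible: different words that share a semantic field contribute to the same count.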
NLP: basic concepts
Text representation
Wmatrix
https://ucrel.lancs.ac.uk/wmatrix/
A case study
Former President Trump announces 2024 presidential bid
https://www.youtube.com/watch?v=8tSYwJ1_htE
POS tagging
https://ucrel.lancs.ac.uk/claws/format.html
Semantic tagging
https://ucrel.lancs.ac.uk/usas/
Frequency lists (semantic and POS)
Word cloud
Semantic tag clouds
Negative emotions (list)
Positive emotions
G3 (list)
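A frequency list and the sizing of a word cloud can be approximated in a few lines. This sketch uses only the standard library; the function names, the font-size range and the linear scaling are assumptions for the example and do not describe how Wmatrix builds its lists or clouds.

```python
import re
from collections import Counter


def frequency_list(text, top_n=10):
    """Return the top_n (word, count) pairs, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)


def cloud_sizes(freqs, min_size=10, max_size=40):
    """Map each count linearly to a font size between min_size and max_size."""
    if not freqs:
        return {}
    counts = [count for _, count in freqs]
    lo, hi = min(counts), max(counts)
    scale = (max_size - min_size) / (hi - lo) if hi > lo else 0
    return {word: round(min_size + (count - lo) * scale) for word, count in freqs}
```

For a made-up sample such as "we will win and we will build and we will lead", the most frequent words ("we", "will") receive the largest size and words that occur once receive the smallest. The same two steps work on semantic tags instead of words.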
Semantic word cloud
Former President Trump announces 2024 presidential bid
https://www.youtube.com/watch?v=8tSYwJ1_htE

• http://wordcloud.cs.arizona.edu/

Task
1) explain the functions / options of the program
2) analyze the text
UAM corpus tool
• Download UAMCorpusTool6
• http://www.corpustool.com/download.html
References
• Atkinson-Abutridy, J. (2022). Text Analytics: An Introduction to the Science and
Applications of Unstructured Information Analysis. Chapman & Hall.
• What is Text Analysis? https://monkeylearn.com/text-analysis/
• Rayson, P. (2009). Wmatrix: a web-based corpus processing
environment. Computing Department, Lancaster University.
http://ucrel.lancs.ac.uk/wmatrix/
• Text Mining and Analytics. https://www.youtube.com/watch?v=Uqs0GewlMkQ&list=PLLssT5z_DsK8Xwnh_0bjN4KNT81bekvtt
