Text Analysis

Uploaded by

priteshbari
Text Analysis

Dr Nitsa Herzog

Nitsa Herzog Text Analysis 1


What is Text Analysis

• The process of computationally retrieving information from text, such as
books, articles, emails, speeches, and social media posts.
• Text analysis or text analytics refers to the representation, processing, and
modelling of textual data to derive useful insights.
• Text analysis uses natural language processing (NLP) to transform the free
(unstructured) text in documents and databases into normalized,
structured data suitable for analysis or to drive machine learning (ML)
algorithms.
• Text analysis often suffers from the curse of high dimensionality, where
each word is a dimension.



Natural Language Processing

• Natural language processing (NLP) combines computational linguistics,
machine learning, and deep learning models to process human language.
• Computational linguistics is the science of understanding and constructing
human language models with computers and software tools.



NLP Business Applications
• Text to speech: Converting text into speech data, then reproducing the
text as natural-sounding speech
• Chatbots: Helping chatbots understand and respond to customer inquiries
• Urgency detection: Analyzing language to prioritize tasks
• Natural language understanding: Converting speech to text and analyzing
its intent
• Autocorrect: Detecting and removing text errors and suggesting
alternatives
• Sentiment analysis: Revealing the perceptions people have of your goods
and services and those of your competitors
• Speech recognition: Powering applications that understand users’ voices
and decipher their meaning
NLP Steps
• Lexical analysis
• Syntactic analysis
• Semantic analysis
• Discourse integration
• Pragmatic analysis



NLP Steps – Lexical Analysis
• Lexicon describes the understandable vocabulary that makes up a language.
• Lexical analysis deciphers and segments language into units—or lexemes—like
paragraphs, sentences, phrases, and words.
• NLP algorithms categorize words into parts of speech (POS) and split lexemes into
morphemes—meaningful language units that you can’t further divide.
Example
When performing a lexical analysis on the paragraph, the analysis isolates and
divides the first sentence into lexeme phrases, like “the understandable
vocabulary that makes up a language.”
This analysis further divides the phrase into word lexemes, like “vocabulary” and
“language,” categorizing both as noun POS.
Then, the analysis derives free morphemes, like “words,” “vocabulary,” and
“understand-,” and bound morphemes, like “-able.”
NLP Steps – Syntactic Analysis
• Syntax describes how a language’s words and phrases are arranged to
form sentences. Syntactic analysis checks word arrangements for
proper grammar.
Example
The sentence “Dave wrote the paper” passes a syntactic analysis check because it’s grammatically
correct. Conversely, a syntactic analysis categorizes a sentence like “Dave do jumps” as syntactically
incorrect.



NLP Steps – Semantic Analysis
• Semantics describes the meaning of words, phrases, sentences, and
paragraphs.
• Semantic analysis attempts to understand the literal meaning of
individual language selections, not syntactic correctness.
• However, a semantic analysis doesn’t check language data before and
after a selection to clarify its meaning.
Example
“Manhattan calls out to Dave” passes a syntactic analysis because it’s a grammatically
correct sentence.
However, it fails a semantic analysis, because Manhattan is a place (and can’t literally call out
to people), and the sentence’s meaning doesn’t make sense.
NLP Steps – Discourse Integration
• Discourse describes communication between two or more individuals.
• Discourse integration analyzes prior words and sentences to
understand the meaning of ambiguous language.
Example
If one sentence reads, “Manhattan speaks to all its people,” and the following sentence reads, “It
calls out to Dave,” discourse integration checks the first sentence for context to understand that “It”
in the latter sentence refers to Manhattan.



NLP Steps – Pragmatic Analysis
• Pragmatics describes the interpretation of language’s intended
meaning.
• Pragmatic analysis attempts to derive language’s intended—not
literal—meaning.
Example
A pragmatic analysis can uncover the intended meaning of “Manhattan speaks to all its
people.”
Methods like neural networks assess the context to understand that the sentence isn’t
literal, and most people won’t interpret it as such.
A pragmatic analysis deduces that this sentence is a metaphor for how people emotionally
connect with places.
Natural Language Processing Approach

Natural Language Understanding – analyses the syntactic structure of
language and derives semantic meaning.
Examples
○ Speech Recognition
○ Named Entity Recognition
○ Text Classification

Natural Language Generation – produces natural written or spoken
language from structured and unstructured data.
Examples
○ Text Generation (a college essay written by PaLM or GPT)
○ Speech Generation (found in virtual assistants)
○ Question Answering


Natural Language Generation: Humor
Berkeley School of Information (2020):
“Our analysis suggests that the state-of-the-art models perform well for
classifying jokes, but quality is uneven when generating. Combined with
what we think is a Clever Hans effect in our classification results, we
think deep learning does not "understand" humour but does pick up on
the patterns of jokes and puns.”



Natural Language Generation: Humor
Screengrab from Google I/O 2022 Keynote with Alphabet CEO Sundar Pichai
explaining the PaLM AI model and its ability to understand jokes.
Google shows how PaLM understands a novel joke not found on the internet.



What is Text Mining
• Text mining is a component of text analysis.
• Text Mining is the process of
transforming unstructured text into a
structured format to identify
meaningful patterns and new insights.
• Text mining discovers relationships and specific
patterns in large text collections.
• The text to be mined can be collected from the web using
web scrapers or web crawlers.



Text Mining and Web Search
Text mining is different from traditional web search.
• In search, the user is typically looking for something already known
and written by someone else; the difficulty is sifting out the material
that is currently irrelevant to their needs in order to find it.
• In text mining, the goal is to discover information that is not yet
known and so has not been written down.



Text Mining and Information Extraction
Text mining is different from information extraction.
• Information Extraction (IE) is about getting facts out of unstructured
information
• There are programs that can, with reasonable accuracy, extract information
from text such as names of people, organisations, locations and so on, and
find relations between them (e.g. John works for the BBC)
• IE is a major component of text mining, but it doesn't tell the whole story
For example,
In a criminal investigation, finding the facts (names of witnesses, alibis for the night of the
murder) is like the IE component. Text mining is similar to the process of deducing
who could or could not have committed the murder.
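
The "John works for the BBC" relation can be illustrated with a very small pattern-based extractor. This is only a sketch: the regular expression and the function name `extract_employment` are assumptions made for this example, and real IE systems rely on trained named-entity and relation models rather than a single hand-written pattern.

```python
import re

# Hypothetical pattern: a capitalised name, the phrase "works for",
# an optional "the", and a capitalised organisation name.
PATTERN = re.compile(r"\b([A-Z][a-z]+) works for (?:the )?([A-Z][A-Za-z]+)")

def extract_employment(text):
    """Return (person, organisation) pairs matched by the toy pattern."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]

facts = extract_employment("John works for the BBC. Mary works for Google.")
print(facts)  # [('John', 'BBC'), ('Mary', 'Google')]
```

The extracted pairs are the "facts"; deciding what they imply is left to the later text mining stages.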
Text Mining Stages
• Document selection and filtering (IR – information retrieval techniques)
Involves identifying and retrieving potentially relevant documents
from a large set (e.g., the web) to reduce the search space.
Standard or semantically-enhanced IR techniques can be used for this.
• Document pre-processing (NLP – natural language processing techniques)
Involves cleaning and preparing the documents, e.g. removal of
extraneous information, error correction, spelling normalisation,
tokenisation, Part-of-speech (POS) tagging, etc.
• Document processing (NLP / ML / statistical techniques)
Consists of information extraction (Named entity recognition (NER),
relation/event recognition, etc.) and potentially opinion mining.



Challenges in Text Mining

Data collection is “free text” – Data is not well-organized
• Semi-structured (web-pages, server logs, social networks APIs) or
unstructured (texts, news articles, emails)

Natural language text contains ambiguities on many levels
• Lexical, syntactic, semantic, and pragmatic

Learning techniques for processing text typically need annotated
training examples
• Expensive to acquire at scale



What is a Corpus
A corpus (plural: corpora) is a large collection of texts used
for various purposes in Natural Language Processing.

Example Corpora in NLP


Text Preprocessing



Text Preprocessing (2017)
The first step in text analysis is preprocessing (cleaning) the
corpus:

● Tokenize: parse documents into smaller units, such as words or phrases
● Remove stop words (e.g., a, the, and, etc.) and punctuation
● Use stemming and lemmatisation: standardize words with similar meaning
● Normalize: convert to lowercase (carefully: e.g., US vs. us)

! Preprocessing should be customized based on the type of corpus.
! Tweets should be preprocessed differently than academic texts.
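
The steps above can be sketched with the standard library alone. The stop-word list, the `keep_case` parameter, and the function name `preprocess` are assumptions made for this illustration; a real pipeline would normally use a fuller stop-word list and a proper tokenizer.

```python
import string

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "is", "to"}

def preprocess(text, keep_case=frozenset()):
    """Split on whitespace, strip punctuation, lowercase, drop stop words.

    Words listed in keep_case (e.g. "US") are not lowercased, which
    addresses the "US vs. us" caveat.
    """
    tokens = []
    for raw in text.split():
        word = raw.strip(string.punctuation)
        if not word:
            continue
        if word not in keep_case:
            word = word.lower()
        if word in STOP_WORDS:
            continue
        tokens.append(word)
    return tokens

print(preprocess("The US team and the rest of us are in Paris.", keep_case={"US"}))
# ['US', 'team', 'rest', 'us', 'are', 'paris']
```

Passing a different stop-word set or `keep_case` set is one simple way to customize the cleaning per corpus type.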



Text Preprocessing (2023)
The first step in text analysis is preprocessing (cleaning) the
corpus:

● Lower casing
● Removal of Punctuations
● Removal of Stopwords
● Removal of Frequent words
● Removal of Rare words
● Stemming
● Lemmatization
● Removal of emojis (pictogram, logogram, ideogram, or smiley embedded
in text and used in electronic messages and web pages – Wikipedia)
● Removal of emoticons (short for "emotion icon", is a
pictorial representation of a facial expression using
characters, usually punctuation marks, numbers, and
letters, to express a person's feelings, mood, or reaction
without needing to describe it in detail – Wikipedia)
● Conversion of emoticons to words
● Conversion of emojis to words
● Removal of URLs
● Removal of HTML tags
● Chat word conversions
● Spelling correction
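
A few of the newer steps (removal of HTML tags and URLs, conversion of emoticons to words, chat word conversion) can be sketched with regular expressions. The patterns and the two small lookup tables below are assumptions chosen for illustration; they are deliberately simple and will not cover every URL or tag form.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                # naive HTML tag pattern
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # naive URL pattern
EMOTICONS = {":)": "smile", ":(": "sad"}       # hypothetical mapping
CHAT_WORDS = {"brb": "be right back", "imo": "in my opinion"}  # hypothetical

def clean(text):
    """Strip tags and URLs, then expand emoticons and chat words."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    words = []
    for w in text.split():
        if w in EMOTICONS:
            words.append(EMOTICONS[w])
        else:
            words.append(CHAT_WORDS.get(w.lower(), w))
    return " ".join(words)

print(clean("<p>imo this is great :) see https://example.com/page</p>"))
# in my opinion this is great smile see
```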
Part-of-Speech (POS) Tagging
POS tagging aims to build a model whose input is a sentence and whose
output is a sequence of tags, one per word.

Example:
• he saw a fox
Each tag marks the POS for the corresponding word, such as:
• PRP VBD DT NN (according to the Penn Treebank POS tags)
The four words are mapped to
• pronoun (personal), verb (past tense), determiner,
and noun (singular).
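
The input/output shape of a tagger can be shown with a toy lookup table. This is not how practical taggers work (they use statistical or neural models that take context into account); the `LEXICON` dictionary and the default tag are assumptions made only to reproduce the example sentence.

```python
# Toy lexicon mapping words to Penn Treebank tags (illustrative only).
LEXICON = {"he": "PRP", "saw": "VBD", "a": "DT", "fox": "NN"}

def tag(sentence, default="NN"):
    """Return one (word, tag) pair per whitespace-separated word."""
    return [(w, LEXICON.get(w.lower(), default)) for w in sentence.split()]

print(tag("he saw a fox"))
# [('he', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('fox', 'NN')]
```

A lookup table cannot tell the verb "saw" from the noun "saw", which is the kind of ambiguity a trained tagger resolves from the surrounding words.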



Stemming & Lemmatization
A well-known rule-based stemming algorithm is Porter’s stemming
algorithm.
• Goal: standardize words with a similar meaning
• Stemming reduces words to their base, or root, form
• Lemmatization makes words grammatically comparable (e.g., am, are,
is → be)
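
The difference between the two ideas can be sketched as follows. Note that `naive_stem` is a crude suffix stripper written for this example and is not Porter's algorithm; the suffix list and the small `LEMMAS` table are likewise assumptions.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order
LEMMAS = {"am": "be", "are": "be", "is": "be", "was": "be", "were": "be"}

def naive_stem(word):
    """Strip the first matching suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Look the word up in a dictionary of known lemmas."""
    return LEMMAS.get(word, word)

print([naive_stem(w) for w in ["jumping", "jumped", "jumps"]])  # ['jump', 'jump', 'jump']
print([lemmatize(w) for w in ["am", "are", "is"]])              # ['be', 'be', 'be']
```

Stemming works by rule on the surface form, so it cannot map "is" to "be"; lemmatization needs vocabulary knowledge, here reduced to a lookup table.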



Normalization – Case
Folding and Removals
Examples:
• make all words lowercase
• remove any punctuation
• remove unwanted tags



Normalization - Tokenization
Split up a document into tokens
• Common tokens
• Words: e.g., “hello”, “blue”, “no”, “laptop”, etc.
• Punctuation: e.g., . , “ ‘ ! ? etc.
• Other tokens
• Replace very uncommon words with an unknown token: <UNKNOWN>
• End sentences (or sentence-like structures) with a stop token: <STOP>
• Replace all numbers with a single token: e.g., 100 → <NUM>
• Replace common words (“a”, “the”, etc.) with <SWRD>
Example
“The dog ran in the park joyously!” →
“<SWRD>”, “dog”, “ran”, “<SWRD>”, “<SWRD>”, “park”, “<UNKNOWN>”, “!”,
“<STOP>”
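
The token replacement rules can be sketched in a few lines. The stop-word set and the known vocabulary below are assumptions picked so that the example sentence produces the sequence shown; in practice the vocabulary would come from corpus frequency counts.

```python
import re

STOP_WORDS = {"a", "the", "in"}   # assumed common words
VOCAB = {"dog", "ran", "park"}    # assumed known vocabulary

def tokenize(sentence):
    """Split into word, number and punctuation tokens, then apply the special tokens."""
    tokens = []
    for tok in re.findall(r"\d+|\w+|[^\w\s]", sentence):
        low = tok.lower()
        if tok.isdigit():
            tokens.append("<NUM>")
        elif low in STOP_WORDS:
            tokens.append("<SWRD>")
        elif low in VOCAB or not tok.isalnum():
            tokens.append(low)          # known word or punctuation mark
        else:
            tokens.append("<UNKNOWN>")  # word outside the vocabulary
    tokens.append("<STOP>")
    return tokens

print(tokenize("The dog ran in the park joyously!"))
# ['<SWRD>', 'dog', 'ran', '<SWRD>', '<SWRD>', 'park', '<UNKNOWN>', '!', '<STOP>']
```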



Text Modelling



Text Modelling
• Text modelling is based on topic modelling
• Topic modelling is a type of statistical
modelling that uses unsupervised Machine
Learning to identify clusters or groups of
similar words within a body of text.
• Topic modelling in NLP is a set of
algorithms that can be used to automatically
summarise a large corpus of texts.
• Text modelling can be represented by a
vector of counts or features for each
distinct word.



Bag-of-Words (BOW) Model
Represents a corpus as an unordered set of
word counts, ignoring stop words.

Example
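
As a stand-in illustration, a bag-of-words can be built with `collections.Counter`. The sample text, the stop-word set, and the function name `bag_of_words` are assumptions for this sketch.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "on", "and"}  # illustrative stop-word list

def bag_of_words(text):
    """Count lowercased words, ignoring stop words and edge punctuation."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in STOP_WORDS)

bow = bag_of_words("The cat sat on the mat. The cat slept.")
print(bow.most_common())
# [('cat', 2), ('sat', 1), ('mat', 1), ('slept', 1)]
```

The counter keeps only which words occurred and how often; their positions in the text are discarded.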



Bag-of-Words (BOW)
Model
BOW is represented by Term
Frequency (TF).
Term frequency represents the
weight of each term in a
document, and it is proportional to
the number of occurrences of the
term in that document.
The figure shows the 50 most
frequent words and the number of
occurrences from Shakespeare’s
Hamlet.
The frequency of a word is
roughly inversely proportional to its rank in
the frequency table (Zipf's law).
Word Clouds

Visualizes words in a document with sizes proportional to how
frequently the words are used



BOW – Final Notes
• Bag-of-words takes quite a naïve approach, as order plays
an important role in the semantics of text.
• With bag-of-words, many texts with different meanings
are combined into one form.

For example,
The texts “a dog bites a man” and “a man bites a dog” have very
different meanings, but they share the same representation with
bag-of-words.

• The bag-of-words technique oversimplifies the problem,
but it is still considered a good approach to start with and
is widely used for text analysis.
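
The two example sentences can be checked directly. This is a minimal sketch using whitespace splitting; the helper name `bow` is an assumption.

```python
from collections import Counter

def bow(text):
    """Bag-of-words as a word-count mapping (no stop-word removal here)."""
    return Counter(text.lower().split())

first = bow("a dog bites a man")
second = bow("a man bites a dog")
print(first == second)  # True: identical counts, different meanings
```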



Morphological Features
• Besides extracting the terms, their morphological
features may need to be included.
• The morphological features specify additional
information about the terms, which may include root
words, affixes, part-of-speech tags, named entities,
or intonation (variations of spoken pitch).
• The features from this step contribute to the
downstream analysis in classification or sentiment
analysis.



N-gram Model
N-grams are contiguous sequences of N words, symbols, or tokens in a document
(corpus).
In technical terms, they can be defined as the neighbouring sequences of items in a
document.
Unlike BOW, N-grams preserve local word order.
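
A sliding window over the token list is enough to produce N-grams. The sketch below (the function name `ngrams` is an assumption) shows that bigrams distinguish two sentences made of the same words in a different order.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = ngrams("a dog bites a man".split(), 2)
b = ngrams("a man bites a dog".split(), 2)
print(a)  # [('a', 'dog'), ('dog', 'bites'), ('bites', 'a'), ('a', 'man')]
print(b)  # [('a', 'man'), ('man', 'bites'), ('bites', 'a'), ('a', 'dog')]
print(set(a) == set(b))  # False
```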



TF-IDF: Term frequency-Inverse document frequency

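Term frequency
tf(t, d) = (number of times term t occurs in document d) / (total number of terms in d)

Inverse document frequency
idf(t, D) = log10(N / df(t)), where N is the number of documents in the corpus D
and df(t) is the number of documents that contain t

(These definitions are the ones used in the worked example that follows.)

tfidf(t, d, D) = tf(t, d) * idf(t, D)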

Term frequency (TF) is how common a word is, and inverse document frequency (IDF) is how
unique or rare a word is.
Useful for decreasing the weight of common, low-information words.
TF-IDF: Example
Consider a document containing 100 words wherein the word “apple”
appears 5 times. The term frequency (i.e., TF) for apple is then (5 / 100)
= 0.05.
Now, assume we have 10 million documents, and the word “apple”
appears in 1000 of these. Then, the inverse document frequency (i.e.,
IDF) is calculated with a base-10 logarithm as log(10,000,000 / 1,000) = 4.
Thus, the TF-IDF weight is the product of these quantities: 0.05 * 4 =
0.20.
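
The arithmetic above can be reproduced in a few lines. The function names `tf` and `idf` are chosen for this sketch, and the base-10 logarithm is what makes the IDF come out as exactly 4.

```python
import math

def tf(term_count, total_terms):
    """Term frequency: share of the document taken up by the term."""
    return term_count / total_terms

def idf(n_documents, documents_with_term):
    """Inverse document frequency with a base-10 logarithm."""
    return math.log10(n_documents / documents_with_term)

tf_apple = tf(5, 100)                # 0.05
idf_apple = idf(10_000_000, 1_000)   # 4.0
score = tf_apple * idf_apple
print(f"{tf_apple:.2f} * {idf_apple:.0f} = {score:.2f}")  # 0.05 * 4 = 0.20
```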



TF, DF and IDF: Example (from Brown corpus’s news category)
• TFIDF gives higher scores to words that
appear more often in a
document but occur less often
across all documents in the
corpus.
• Note that TFIDF applies to a term
in a specific document, so the
same term will likely receive
different TFIDF scores in different
documents (because the TF
values may differ).
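
The point that one term gets different scores in different documents can be seen on a tiny made-up corpus. The three documents below are invented for illustration and have nothing to do with the Brown corpus figures; the function name `tfidf` is also an assumption.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenized.
docs = {
    "d1": "apple pie with apple sauce".split(),
    "d2": "apple and banana smoothie".split(),
    "d3": "banana bread recipe".split(),
}

def tfidf(term, doc_id):
    """TF-IDF of a term in one document of the toy corpus."""
    tokens = docs[doc_id]
    tf = Counter(tokens)[term] / len(tokens)
    df = sum(1 for toks in docs.values() if term in toks)
    if df == 0:
        return 0.0  # term never occurs in the corpus
    return tf * math.log10(len(docs) / df)

for doc_id in docs:
    print(doc_id, round(tfidf("apple", doc_id), 4))
```

Here "apple" has the same IDF everywhere, so the differences between d1, d2 and d3 come entirely from the term frequency within each document.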



TF-IDF (Example)
• TFIDF can be used to highlight
the informative words in the
reviews.
• The figure shows a subset of the
reviews in which each word with
a larger font size corresponds to
a higher TFIDF value.
• Each review is considered a
document.



Review Categorization with LDA
Topic models such as LDA can categorize the reviews into topics.

The topics of 5-star reviews

Review Categorization with LDA

The topics of 1-star reviews

Text Analysis and Classification


Document
Classification
• Separates papers according to
the authors and topics
• Performs text analysis of the
paper
• word frequency,
• distribution,
• patterns,
• and meaning.



ML Methods in Text Analysis

Supervised
• Classic ML (the most popular: Naive Bayes (NB), Support
Vector Machine (SVM))

Unsupervised
• Clustering
• Deep Learning (the most popular: Convolutional
Neural Network (CNN), Recurrent Neural Network (RNN))



Document
Clustering
Topic modelling: assign topics
(politics, sports, fashion, etc.)
to documents (e.g., articles or
web pages)

Biomedical articles clustering. Source:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0018029
Spam Detection (example)



Spam Detection (cont.)
• An email that contains the words hello and friend, but not money and
password:
[Calculation lost in extraction; the two resulting class scores are 0.0024 and 0.06831.]



Spam Detection (cont.)
• An email that contains the words hello, money, and password:
[Calculation lost in extraction; the two resulting class scores are 0.00336 and 0.0010692.]
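
The comparison of two class scores can be sketched as a small Naive Bayes classifier over word presence and absence. Every number below (the class priors and per-word probabilities) is invented for this illustration and is not taken from the lecture's table, so the scores will not match the figures on the slides; the label "ham" for non-spam is also just a naming choice.

```python
# Hypothetical class priors and P(word present | class); not the lecture's values.
PRIOR = {"spam": 0.4, "ham": 0.6}
P_WORD = {
    "spam": {"hello": 0.3, "friend": 0.2, "money": 0.7, "password": 0.6},
    "ham":  {"hello": 0.6, "friend": 0.5, "money": 0.1, "password": 0.1},
}

def score(label, words_present):
    """Prior times P(word | label) for present words and 1 - P for absent ones."""
    s = PRIOR[label]
    for word, p in P_WORD[label].items():
        s *= p if word in words_present else (1 - p)
    return s

def classify(words_present):
    """Pick the class with the larger (unnormalised) score."""
    return max(PRIOR, key=lambda label: score(label, words_present))

print(classify({"hello", "friend"}))             # ham
print(classify({"hello", "money", "password"}))  # spam
```

The scores are unnormalised, which is sufficient for choosing the more likely class; dividing each by their sum would turn them into posterior probabilities.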



End of Lecture



References
• https://cs.brown.edu/courses/cs100/lectures/lecture24.pdf
• https://www.cs.virginia.edu/~hw5x/Course/TextMining-2019Spring/_site/lectures/
• https://walshbr.com/textanalysiscoursebook/assets/introduction-to-text-analysis.pdf
• https://arxiv.org/pdf/2307.10169.pdf
• https://www.twilio.com/blog/nlp-steps
• https://hdsr.mitpress.mit.edu/pub/wi9yky5c/release/3
• https://www.elastic.co/what-is/large-language-models
• https://spotintelligence.com/2022/12/16/sentiment-analysis-tools-in-python/
• https://realpython.com/python-nltk-sentiment-analysis/