Text Analytics
Advanced Analytical Theory and Methods: Text Analysis
Learning Objectives
• Describe text mining and understand the need for
text mining
• Differentiate between text mining and data mining
• Understand the different application areas for text
mining
• Know the process of carrying out a text mining
project
• Understand the different methods to introduce
structure to text‐based data
Motivation
Text Mining Concepts
• An estimated 85-90 percent of all corporate data is in some kind of unstructured form (e.g., text)
• Unstructured corporate data is doubling in size every 18 months
• Tapping into these information sources is not optional but a necessity for staying competitive
• What is text mining?
• A semi‐automated process of extracting knowledge
from unstructured data sources
• a.k.a. text data mining or knowledge discovery in
textual databases
Introduction
• Also called Text Analytics
• Text mining forms a core step in text analysis
• Examples of text data:
News articles
Literature
E-mail
Web pages
Server logs
Social network API firehoses
Call center transcripts
Data sources with formats (figure)
Data Mining versus Text Mining
• Both seek novel and useful patterns
• Both are semi-automated processes
• The difference is the nature of the data:
• Structured versus unstructured data
• Structured data: in databases
• Unstructured data: Word documents, PDF files, text excerpts, XML files, and so on
• Text mining: first impose structure on the data, then mine the structured data
Text Mining Concepts
• The benefits of text mining are obvious, especially in text-rich data environments
• e.g., law (court orders), academic research (research articles), finance (quarterly reports), medicine (discharge summaries), biology (molecular interactions), technology (patent files), marketing (customer comments), etc.
• Electronic communication records (e.g., email)
• Spam filtering
• Email prioritization and categorization
• Automatic response generation
Text Mining
• Typically falls into one of two categories
• Analysis of text: I have a bunch of text I am interested in, tell me something
about it
• E.g. sentiment analysis, “buzz” searches
• Retrieval: There is a large corpus of text documents, and I want the one
closest to a specified query
• E.g. web search, library catalogs, legal and medical precedent studies
Text Mining: Analysis
• Which words occur most often?
• Which words are most surprising?
• Which words help define the document?
• What are the interesting text phrases?
Text Mining: Retrieval
• Find k objects in the corpus of documents which are most similar to
my query.
• Can be viewed as “interactive” data mining ‐ query not specified a
priori.
• Main problems of text retrieval:
• What does “similar” mean?
• How do I know if I have the right documents?
• How can I incorporate user feedback?
Text Mining Process
Context diagram for the text mining process (figure): the process operates under constraints (software/hardware limitations, privacy issues, linguistic limitations) and relies on mechanisms (domain expertise, tools and techniques).
Text Mining Process
The three-step text mining process (figure): (1) establish the corpus, (2) create the term-document matrix, (3) extract knowledge.
Text Mining Process
• Step 1: Establish the corpus
• Collect all relevant unstructured data (e.g., textual documents, XML
files, emails, Web pages, short notes, voice recordings…)
• Digitize, standardize the collection (e.g., all in ASCII text files)
• Place the collection in a common place (e.g., in a flat file, or in a directory
as separate files)
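• A minimal R sketch of Step 1 using the tm package (introduced later in this deck); the directory path is a hypothetical placeholder:

library(tm)
# read every .txt file from a (hypothetical) directory into a corpus
corpus <- Corpus(DirSource("./my_documents", pattern = "\\.txt$"))
length(corpus)      # number of documents collected
inspect(corpus[1])  # look at the first document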
Text Mining Process
• Step 2: Create the Term–by–Document Matrix
Text Mining Process
• Step 2: Create the Term-by-Document Matrix (TDM), cont.
• Should all terms be included?
• Stop words, include words
• Synonyms, homonyms
• Stemming
• What is the best representation of the indices (the values in the cells)?
• Raw counts; binary frequencies; log frequencies
• Inverse document frequency
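• In R's tm package, these representations correspond to weighting options; a small sketch, assuming corpus_1 is the cleaned corpus built in the tm slides later in this deck:

library(tm)
tdm_counts <- TermDocumentMatrix(corpus_1)                                            # raw counts
tdm_bin    <- TermDocumentMatrix(corpus_1, control = list(weighting = weightBin))    # binary
tdm_tfidf  <- TermDocumentMatrix(corpus_1, control = list(weighting = weightTfIdf))  # TF x IDF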
Text Mining Process
• Step 2: Create the Term-by-Document Matrix (TDM), cont.
• The TDM is a sparse matrix. How can we reduce its dimensionality?
• Manual: a domain expert goes through it
• Eliminate terms with very few occurrences in very few documents (?)
• Transform the matrix using singular value decomposition (SVD)
• SVD is similar to principal component analysis
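• A minimal sketch of the SVD route in R, assuming tdm is a TermDocumentMatrix over a reasonably large corpus; keeping only the first k singular vectors yields a dense, low-dimensional document representation:

X <- as.matrix(tdm)   # dense term-by-document matrix
s <- svd(X)           # X = U %*% diag(d) %*% t(V)
k <- 2                # number of latent dimensions to keep
docs_k <- diag(s$d[1:k], k) %*% t(s$v[, 1:k])  # documents in k-dimensional space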
Text Mining Process
• Step 3: Extract patterns/knowledge
• Classification (text categorization)
• Clustering (natural groupings of text)
• Improve search recall
• Improve search precision
• Scatter/gather
• Query‐specific clustering
• Association
• Trend Analysis (…)
Text Retrieval: Challenges
• Calculating similarity is not obvious: what is the distance between two sentences or queries?
• Evaluating retrieval is hard: what is the "right" answer? (no ground truth)
• Users can query things you have not seen before, e.g., misspelled, foreign, or new terms
• The goal (score function) differs from classification/regression: we are not trying to model all of the data, just to get the best results for a given user
• Words can hide semantic content
• Synonymy: a keyword T does not appear anywhere in the document, even though the document is closely related to T, e.g., data mining
• Polysemy: the same keyword may mean different things in different contexts, e.g., mining
Basic Measures for Text Retrieval
• Precision: the percentage of retrieved documents that are
in fact relevant to the query (i.e., “correct” responses)
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
• Recall: the percentage of documents that are relevant to
the query and were, in fact, retrieved
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
Precision vs. Recall

                          Truth: Relevant   Truth: Not Relevant
Algorithm: Relevant             TP                 FP
Algorithm: Not Relevant         FN                 TN

• We've been here before!
• Precision = TP/(TP+FP)
• Recall = TP/(TP+FN)
• Trade-off:
• If the algorithm is 'picky': precision high, recall low
• If the algorithm is 'relaxed': precision low, recall high
• BUT: recall is often hard, if not impossible, to calculate
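• A toy calculation in R with made-up counts (purely illustrative numbers):

tp <- 40; fp <- 10; fn <- 20   # invented confusion-matrix counts
tp / (tp + fp)   # precision = 0.8
tp / (tp + fn)   # recall ~ 0.67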
Precision Recall Curves
• If we have a labelled training set, we can calculate recall.
• For any given number of returned documents, we can plot a point for
precision vs. recall. (similar to thresholds in ROC curves)
• Different retrieval algorithms might have very different curves ‐ hard
to tell which is “best”
Term / document matrix
• The most common form of representation in text mining is the term-document matrix
• Term: typically a single word, but could be a word phrase like "data mining"
• Document: a generic term meaning a collection of text to be retrieved
• Can be large: terms often number 50k or more, and documents can number in the billions (www)
• Can be binary, or use counts
Term document matrix
Example: 10 documents, 6 terms (database, SQL, index, regression, likelihood, linear); the full count matrix appears on the weighting slide below.
• Each document is now just a vector of terms, sometimes boolean
Term document matrix
• We have lost all semantic content
• Be careful constructing your term list!
• Not all words are created equal!
• Words that are the same should be treated the same!
• Stop Words
• Stemming
Stop words
• Many of the most frequently used words in English are worthless for retrieval and text mining; these are called stop words
• the, of, and, to, …
• Typically about 400 to 500 such words
• For an application, an additional domain-specific stop word list may be constructed
• Why do we need to remove stop words?
• To reduce indexing (or data) file size
• stop words account for 20-30% of total word counts
• To improve efficiency
• stop words are not useful for searching or text mining
• stop words always have a large number of hits
Stemming
• Techniques used to find the root/stem of a word, e.g.:
• user, users, used, using → stem: use
• engineering, engineered, engineer → stem: engineer
Usefulness
• improving effectiveness of retrieval and text mining
• matching similar words
• reducing indexing size
• combining words with the same root may reduce indexing size by as much as 40-50%
Basic stemming methods
• remove endings
• if a word ends with a consonant other than s, followed by an s, then delete the s
• if a word ends in es, drop the s
• if a word ends in ing, delete the ing unless the remaining word consists of only one letter or is "th"
• if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter
• …
• transform words
• if a word ends with "ies" but not "eies" or "aies", then change "ies" to "y"
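• A toy R sketch of a few of these rules (not a production stemmer; the tm package's stemDocument uses the Porter stemmer instead):

stem_word <- function(w) {
  # "ies" -> "y", unless preceded by e or a
  if (grepl("ies$", w) && !grepl("[ea]ies$", w)) return(sub("ies$", "y", w))
  # drop "ing" unless the remainder is a single letter or "th"
  if (grepl("ing$", w)) {
    rest <- sub("ing$", "", w)
    if (nchar(rest) > 1 && rest != "th") return(rest) else return(w)
  }
  # ends in "es": drop the s
  if (grepl("es$", w)) return(sub("s$", "", w))
  # ends with a consonant other than s, followed by s: drop the s
  if (grepl("[b-df-hj-np-rtv-z]s$", w)) return(sub("s$", "", w))
  w
}
sapply(c("flies", "using", "users", "tables"), stem_word)   # "fly" "us" "user" "table"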
Feature Selection
• The performance of text classification algorithms can be optimized by selecting only a subset of the discriminative terms
• even after stemming and stop word removal
• Greedy search (a sketch of the Gini scoring follows below)
• Start from the full set and delete one term at a time
• Find the least important variable
• Can use the Gini index for this in a classification problem
• Often performance does not degrade even with orders-of-magnitude reductions
• Chakrabarti, Chapter 5: patent data: 9,600 patents in communication, electricity, and electronics
• Only 140 out of 20,000 terms were needed for classification!
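• A minimal R sketch of the Gini-index step, assuming x is a 0/1 vector of term presence across documents and y holds the class labels (illustrative names, not from the slides):

gini <- function(p) 1 - sum(p^2)   # Gini impurity of a class-proportion vector

term_gini <- function(x, y) {
  # weighted impurity of the split induced by the term: present vs. absent
  g1 <- gini(prop.table(table(y[x == 1])))
  g0 <- gini(prop.table(table(y[x == 0])))
  mean(x == 1) * g1 + mean(x == 0) * g0   # lower = more discriminative term
}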
Distances in TD matrices
• Given a term-document matrix representation, we can now define distances between documents (or terms!)
• Elements of the matrix can be 0/1 or term frequencies (sometimes normalized)
• Can use Euclidean or cosine distance
• Cosine distance is based on the angle between the two vectors
• Not intuitive, but it has been proven to work well
• If two docs point in the same direction, dc = 1; if they have nothing in common, dc = 0
• We can calculate cosine and Euclidean distances for this matrix (see the sketch below)
• What would you want the distances to look like?
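• A short R sketch using rows of the example count matrix from the next slide:

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
d1 <- c(24, 21, 9, 0, 0, 3)    # D1: database-heavy document
d2 <- c(32, 10, 5, 0, 3, 0)    # D2: also database-heavy
d9 <- c(1, 0, 0, 34, 27, 25)   # D9: regression-heavy
cosine_sim(d1, d2)   # ~0.90: similar documents
cosine_sim(d1, d9)   # ~0.06: almost nothing in common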
Weighting in TD space
• Not all terms are of equal importance
• e.g., "David" is less important than "Beckham"
• If a term occurs frequently in many documents, it has less discriminatory power
• One way to correct for this is inverse document frequency (IDF): IDF = log(N/Nj)
• Term importance = Term Frequency (TF) x IDF
• Nj = number of docs containing the term; N = total number of docs
• A term is "important" if it has a high TF and/or a high IDF
• TF x IDF is a common measure of term importance
Raw term counts (10 documents x 6 terms):

      Database  SQL  Index  Regression  Likelihood  Linear
D1        24    21     9        0           0          3
D2        32    10     5        0           3          0
D3        12    16     5        0           0          0
D4         6     7     2        0           0          0
D5        43    31    20        0           3          0
D6         2     0     0       18           7          6
D7         0     0     1       32          12          0
D8         3     0     0       22           4          4
D9         1     0     0       34          27         25
D10        6     0     0       17           4         23

TF x IDF weighted matrix (first five documents):

      Database  SQL   Index  Regression  Likelihood  Linear
D1      2.53    14.6   4.6       0           0         2.1
D2      3.3      6.7   2.6       0           1.0       0
D3      1.3     11.1   2.6       0           0         0
D4      0.7      4.9   1.0       0           0         0
D5      4.5     21.5  10.2       0           1.0       0
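• A short R sketch that reproduces the weighted matrix from the counts; the values above match IDF = log(N/Nj) with the natural log:

X <- rbind(c(24, 21, 9, 0, 0, 3),  c(32, 10, 5, 0, 3, 0),  c(12, 16, 5, 0, 0, 0),
           c(6, 7, 2, 0, 0, 0),    c(43, 31, 20, 0, 3, 0), c(2, 0, 0, 18, 7, 6),
           c(0, 0, 1, 32, 12, 0),  c(3, 0, 0, 22, 4, 4),   c(1, 0, 0, 34, 27, 25),
           c(6, 0, 0, 17, 4, 23))
idf <- log(nrow(X) / colSums(X > 0))   # Nj = docs containing each term
W <- sweep(X, 2, idf, "*")             # TF x IDF
round(W[1:5, ], 1)                     # reproduces the first five weighted rows above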
Queries
• A query is a representation of the user's information needs
• Normally a list of words
• Once we have a TD matrix, a query can be represented as a vector in the same space
• "Database Index" = (1, 0, 1, 0, 0, 0)
• A query can also be a simple question in natural language
• Calculate the cosine distance between the query and the TF x IDF version of the TD matrix
• This returns a ranked vector of documents (see the sketch below)
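• Continuing the earlier sketches (cosine_sim and the weighted matrix W), ranking documents for the query "database index":

q <- c(1, 0, 1, 0, 0, 0)                             # query as a term vector
scores <- apply(W, 1, function(d) cosine_sim(q, d))  # similarity to each document
order(scores, decreasing = TRUE)                     # ranked documents, best match first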
Document Clustering
• We can also do clustering, i.e., unsupervised learning of docs
• Automatically group related documents based on their content
• Requires no training sets or predetermined taxonomies
• Major steps
• Preprocessing
• Remove stop words, stem, feature extraction, lexical analysis, …
• Hierarchical clustering
• Compute similarities, apply clustering algorithms, …
• Slicing
• Fan-out controls; flatten the tree to the desired number of levels
• Like all clustering examples, success is relative
Document Clustering
• To Cluster:
• Can use LSI
• Another model: Latent Dirichlet Allocation (LDA)
• LDA is a generative probabilistic model of a corpus. Documents are represented as random
mixtures over latent topics, where a topic is characterized by a distribution over words.
• LDA:
• Three concepts: words, topics, and documents
• Documents are a collection of words and have a probability distribution over
topics
• Topics have a probability distribution over words
• Fully Bayesian Model
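• A minimal sketch with the topicmodels package (an assumption; the slides do not name a tool), fitting a two-topic LDA to a document-term matrix over a multi-document corpus:

library(topicmodels)     # assumed package; provides LDA()
# dtm: a DocumentTermMatrix built with tm over many documents (assumed to exist)
fit <- LDA(dtm, k = 2)   # fit a 2-topic model
terms(fit, 5)            # top 5 words per topic
topics(fit)              # most likely topic for each document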
Text Mining: Helpful Data
• WordNet
• Corpora
Text Mining ‐ Other Topics
• Part-of-Speech Tagging
• Assign grammatical tags to words (verb, noun, etc.)
• Helps in understanding documents; typically uses Hidden Markov Models
• Named Entity Classification
• Classification task: can we automatically detect proper nouns and tag them?
• "Mr. Jones" is a person; "Madison" is a town.
• Helps with disambiguation: "Spears"
Text Mining ‐ Other Topics
• Sentiment Analysis
• Automatically determine tone in text: positive, negative or neutral
• Typically uses collections of good and bad words
• “While the traditional media is slowly starting to take John McCain’s straight talking image
with increasingly large grains of salt, his base isn’t quite ready to give up on their favorite son.
Jonathan Alter’s bizarre defense of McCain after he was caught telling an outright lie,
perfectly captures that reluctance[.]”
• Often fit using Naïve Bayes
• There are sentiment word lists out there:
• See http://neuro.imm.dtu.dk/wiki/Text_sentiment_analysis
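• A toy word-list scorer in R; the word lists here are stand-ins for a real sentiment lexicon such as the one linked above:

positive <- c("good", "great", "excellent")   # stand-in lexicon, not a real list
negative <- c("bad", "bizarre", "lie")
score_sentiment <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  sum(words %in% positive) - sum(words %in% negative)   # >0 positive, <0 negative
}
score_sentiment("a bizarre defense of an outright lie")   # -2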
Text Mining ‐ Other Topics
• Summarizing text: Word Clouds
• Takes text as input, finds the most interesting words, and displays them graphically
• Blogs do this
• Wordle.net
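• A short sketch with the wordcloud package (an assumption), drawing term frequencies from the TermDocumentMatrix built in the tm slides below:

library(wordcloud)      # assumed package
library(slam)
freq <- row_sums(tdm)   # total frequency of each term
wordcloud(names(freq), freq, min.freq = 1)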
Text Mining Applications
• Marketing applications
• Enables better CRM
• Security applications
• ECHELON, OASIS
• Deception detection (…)
• Medicine and biology
• Literature‐based gene identification (…)
• Academic applications
• Research stream analysis
A bag of words
A matrix

A generic term-document matrix: rows are documents, columns are words, and entry xij is the weight of word j in document i:

        x11  x12  ...  x1n
        x21  x22  ...  x2n
        ...
        xm1  xm2  ...  xmn
# Example matrix syntax: three ways to build the same sparse 4 x 2 matrix
A = matrix(c(1, rep(0, 6), 2), nrow = 4)   # dense: fills column-wise; (1,1)=1, (4,2)=2

library(slam)   # simple triplet representation: (row, column, value)
S = simple_triplet_matrix(c(1, 4), c(1, 2), c(1, 2))

library(Matrix)   # Matrix package's sparse representation
M = sparseMatrix(i = c(1, 4), j = c(1, 2), x = c(1, 2))
tm package
library(tm)   # load the tm package

# txt holds the raw text, here the opening of A Tale of Two Cities:
txt <- "it was the best of times, it was the worst of times, it was the age of
wisdom, it was the age of foolishness, it was the epoch of belief, it was
the epoch of incredulity, it was the season of light, it was the season of
darkness, it was the spring of hope, it was the winter of despair, we
had everything before us, we had nothing before us, we were all going
direct to heaven, we were all going direct the other way- in short, the
period was so far like the present period, that some of its noisiest
authorities insisted on its being received, for good or for evil, in the
superlative degree of comparison only."

corpus_1 <- Corpus(VectorSource(txt))   # creates a 'corpus' from a character vector
Stopwords

library(tm)
corpus_1 <- Corpus(VectorSource(txt))
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))   # the step this slide shows

Removing the stop words ("it", "was", "the", "of", "we", …) strips the passage down to its content-bearing words.
Punctuation
library(tm)
corpus_1 <- Corpus(VectorSource(txt))
corpus_1 <- tm_map(corpus_1, removePunctuation)   # the step this slide shows
Stemming
library(tm)
corpus_1 <- Corpus(VectorSource(txt))
corpus_1 <- tm_map(corpus_1, stemDocument)   # Porter stemming (requires the SnowballC package)
Cleanup
library(tm)
corpus_1 <- Corpus(VectorSource(txt))
# all cleanup steps together: stop words, punctuation, stemming
corpus_1 <- tm_map(corpus_1, removeWords, stopwords("english"))
corpus_1 <- tm_map(corpus_1, removePunctuation)
corpus_1 <- tm_map(corpus_1, stemDocument)

The cleaned, stemmed text:

best time worst time age wisdom age foolish epoch belief epoch incredul season light season dark spring hope winter despair everyth us noth us go direct heaven go direct way short period far like present period noisiest author insist receiv good evil superl degre comparison

tdm <- TermDocumentMatrix(corpus_1)   # build the term-document matrix
class(tdm)
[1] "TermDocumentMatrix" "simple_triplet_matrix"
dim(tdm)
[1] 35 1
Ngrams
library(tm)
library(RWeka)   # provides NGramTokenizer and Weka_control

# tokenizer that emits all 1- to 4-grams
four_gram_tokeniser <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 4))
}

tdm_4gram <- TermDocumentMatrix(corpus_1, control = list(tokenize = four_gram_tokeniser))
dim(tdm_4gram)
[1] 163 1
Text Mining Applications‐ Mining for Lies
• Deception detection
• A difficult problem
• If detection is limited to text only, the problem is even more difficult
• The study
• analyzed text-based testimonies of persons of interest at military bases
• used only text-based features (cues)
Text Mining Applications
Mining for Lies
Category        Example cues
Quantity        Verb count, noun-phrase count, ...
Complexity      Avg. number of clauses, sentence length, ...
Uncertainty     Modifiers, modal verbs, ...
Nonimmediacy    Passive voice, objectification, ...
Expressivity    Emotiveness
Diversity       Lexical diversity, redundancy, ...
Informality     Typographical error ratio
Specificity     Spatiotemporal and perceptual information, ...
Affect          Positive affect, negative affect, etc.
Text Mining Applications
Mining for Lies
• 371 usable statements were generated
• 31 features were used
• Different feature selection methods were used
• 10-fold cross-validation was used
• Results (overall % accuracy):
• Logistic regression: 67.28
• Decision trees: 71.60
• Neural networks: 73.46
Text Mining Applications
(gene/protein interaction identification)
Figure: layered annotation of the example sentence "...expression of Bcl-2 is correlated with insufficient white blood cell death and activation of p53." Each word is mapped to a word ID, a part-of-speech tag (NN, IN, VBZ, JJ, CC, ...), a shallow-parse chunk (NP, PP), and, where applicable, an ontology concept ID (e.g., D007962, D016923, D001773).
Text Mining Tools
• Commercial Software Tools
• SPSS PASW Text Miner
• SAS Enterprise Miner
• Statistica Data Miner
• ClearForest, …
• Free Software Tools
• RapidMiner
• GATE
• Spy‐EM, …