01_Introduction to Text Analytics_part2
Pilsung Kang
School of Industrial Management Engineering
Korea University
AGENDA
✓ Alphabetical list of free/public domain datasets with text data for use in Natural
Language Processing (NLP)
▪ https://github.com/niderhoff/nlp-datasets
✓ 25 Open Datasets for Deep Learning Every Data Scientist Must Work With
▪ https://www.analyticsvidhya.com/blog/2018/03/comprehensive-collection-deep-learning-datasets/
Text Preprocessing Level 0: Text
• Remove unnecessary information from the collected data
(Figure) 100 most common words in the Oxford English Corpus: http://en.wikipedia.org/wiki/Most_common_words_in_English
(Figure) Word frequency distribution in Wikipedia (Zipf's law): http://upload.wikimedia.org/wikipedia/commons/b/b9/Wikipedia-n-zipf.png
Text Preprocessing Level 2: Token
• Stop-words
✓ Words that do not carry any information
▪ They mainly play a functional (grammatical) role
▪ They are usually removed to help machine learning algorithms perform better (see the sketch below)
http://eprints.pascal-network.org/archive/00000017/01/Tutorial_Marko.pdf
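As a minimal illustration (assuming scikit-learn's built-in English stop-word list; the slides do not prescribe a particular library or list), stop-words can be filtered out before building features:

```python
# Minimal sketch: removing English stop-words from a token list.
# ENGLISH_STOP_WORDS is scikit-learn's built-in frozenset of stop-words.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

tokens = ["the", "machine", "learns", "from", "the", "data"]
content_tokens = [t for t in tokens if t.lower() not in ENGLISH_STOP_WORDS]
print(content_tokens)  # ['machine', 'learns', 'data']
```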
Text Preprocessing Level 2: Token
• Stemming
✓ Different forms of the same word are usually problematic for text data analysis because they have different spellings but similar meanings
▪ learns, learned, learning, … (see the stemming sketch below)
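A minimal sketch of stemming with NLTK's Porter stemmer (one common choice; the slides do not mandate a specific stemmer):

```python
# Minimal sketch: reducing inflected forms to a common stem with NLTK.
# Assumes NLTK is installed (pip install nltk).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["learns", "learned", "learning"]:
    print(word, "->", stemmer.stem(word))  # all three map to "learn"
```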
✓ Example: bag-of-words term counts for two sentences (S1: "John likes to watch movies. Mary likes too.", S2: "John also likes to watch football games.")

Word       S1  S2
John        1   1
likes       2   1
to          1   1
watch       1   1
movies      1   0
also        0   1
football    0   1
games       0   1
Mary        1   0
too         1   0
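The same counts can be reproduced with a vectorizer; below is a sketch assuming scikit-learn and the two example sentences above:

```python
# Minimal sketch: building the bag-of-words matrix above with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["John likes to watch movies. Mary likes too.",
        "John also likes to watch football games."]
vec = CountVectorizer()        # default tokenization and lowercasing
X = vec.fit_transform(docs)    # 2 x V sparse matrix of term counts
print(vec.get_feature_names_out())
print(X.toarray())             # rows correspond to S1 and S2
```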
Text Transformation
• Word Weighting
✓ Each word is represented as a separate variable with a numeric weight
✓ Term frequency–inverse document frequency (TF-IDF)
▪ tf(w): term frequency (number of occurrences of word w in the target document)
▪ df(w): document frequency (number of documents containing word w)
▪ N: total number of documents in the corpus
$$\text{TF-IDF}(w) = tf(w) \times \log \frac{N}{df(w)}$$
A word receives a higher weight when it appears many times in the target document (high tf) and when it appears in fewer documents overall (high idf).
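A minimal sketch of the weighting above in plain Python (the toy corpus is illustrative; the natural log is used, though any base only changes weights by a constant factor):

```python
# Minimal sketch: TF-IDF(w) = tf(w) * log(N / df(w)) over a toy corpus.
import math

docs = [["text", "mining", "text"], ["data", "mining"], ["text", "analytics"]]
N = len(docs)  # total number of documents

def tf_idf(word, doc):
    tf = doc.count(word)                      # occurrences in this document
    df = sum(1 for d in docs if word in d)    # documents containing the word
    return tf * math.log(N / df)

print(tf_idf("text", docs[0]))       # high: frequent here, in 2 of 3 docs
print(tf_idf("analytics", docs[2]))  # high idf: appears in only 1 of 3 docs
```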
Text Transformation
• Word weighting (cont'd): example of term counts in six Shakespeare plays
Term        Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony                       157             73            0       0        0        0
Brutus                         4            157            0       1        0        0
Caesar                       232            227            0       2        1        1
Calpurnia                      0             10            0       0        0        0
Cleopatra                     57              0            0       0        0        0
mercy                          2              0            3       5        5        1
worser                         2              0            1       1        1        0
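As a worked computation on this table (a sketch that treats the counts above as raw term frequencies, with N = 6 plays):

```python
# Minimal sketch: TF-IDF weights for the Shakespeare term-count table (N = 6).
import math

counts = {
    "Antony":    [157, 73, 0, 0, 0, 0],
    "Brutus":    [4, 157, 0, 1, 0, 0],
    "Caesar":    [232, 227, 0, 2, 1, 1],
    "Calpurnia": [0, 10, 0, 0, 0, 0],
    "Cleopatra": [57, 0, 0, 0, 0, 0],
    "mercy":     [2, 0, 3, 5, 5, 1],
    "worser":    [2, 0, 1, 1, 1, 0],
}
N = 6
for term, tfs in counts.items():
    df = sum(1 for tf in tfs if tf > 0)                    # document frequency
    print(term, [round(tf * math.log(N / df), 2) for tf in tfs])
# "Caesar" (df = 5) gets a low idf; "Calpurnia" (df = 1) gets a high one.
```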
Text Transformation
• Pre-trained Word Models
✓ Pre-trained word vectors, such as GloVe, can be used directly as numeric word representations
http://nlp.stanford.edu/projects/glove/
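A sketch of loading pre-trained GloVe vectors through gensim's downloader (gensim and the specific model name below are assumptions for illustration, not something the slides specify):

```python
# Minimal sketch: using pre-trained GloVe vectors as word representations.
# Assumes gensim is installed; the model (~130 MB) downloads on first use.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # 100-dimensional GloVe vectors
print(glove["movie"].shape)                   # (100,)
print(glove.most_similar("movie", topn=3))    # nearest neighbors in vector space
```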
Feature Selection/Extraction
• Feature subset selection
✓ Common scoring measures for selecting individual terms:
▪ Information gain
▪ Cross-entropy
▪ Mutual information
▪ Weight of evidence
▪ Odds ratio
▪ Frequency
Feature Selection/Extraction
• Feature subset extraction
✓ Feature extraction: construct a smaller set of variables that preserves the information in the original data by combining the original variables in a linear/non-linear form
✓ Latent Semantic Analysis (LSA), which is based on the singular value decomposition (SVD), is commonly used
✓ Step 1) Compute a rank-k (truncated) SVD of the m-by-n term-document matrix D:
$$D \approx D_k = U_k \Sigma_k V_k^T \quad (U_k: m \times k,\; \Sigma_k: k \times k,\; V_k^T: k \times n)$$
✓ Step 2) Multiply by the transpose of U_k; since the columns of U_k are orthonormal, this leaves a compact k-by-n document representation:
$$U_k^T D_k = U_k^T U_k \Sigma_k V_k^T = \Sigma_k V_k^T$$
✓ Step 3) Apply data mining algorithms to the matrix obtained in Step 2) (see the numpy sketch below)
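A minimal numpy sketch of these three steps (the 4-by-3 toy matrix and k = 2 are illustrative assumptions):

```python
# Minimal sketch: LSA via truncated SVD of a term-document matrix D (m x n).
import numpy as np

D = np.array([[2., 0., 1.],   # toy 4-term x 3-document count matrix
              [1., 1., 0.],
              [0., 2., 1.],
              [1., 0., 2.]])
k = 2
U, s, Vt = np.linalg.svd(D, full_matrices=False)
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]  # top-k singular triplets

Dk = Uk @ Sk @ Vtk         # Step 1: rank-k approximation of D
reduced = Uk.T @ Dk        # Step 2: equals Sk @ Vtk, a k x n representation
print(np.allclose(reduced, Sk @ Vtk))  # True; feed `reduced` to Step 3
```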
Feature Selection/Extraction
• LSA
https://www.quora.com/How-is-LSA-used-for-a-text-document-clustering
Feature Selection/Extraction
• LSA example
✓ Data: 41,717 abstracts of research projects funded by the National Science Foundation (NSF) between 1990 and 2003
✓ Top 10 positive and negative keywords for each SVD dimension
Feature Selection/Extraction
• LSA example
✓ Visualize the projects in the reduced 2-D space
Feature Selection/Extraction
• Topic Modeling as a Feature Extractor
✓ Latent Dirichlet Allocation (LDA)
(Figure) (a) Per-document topic proportions ($\theta_d$); (b) per-topic word distributions ($\phi_k$)
✓ The similarity between two documents can then be measured by the cosine similarity of their feature vectors:
$$Sim(D_1, D_2) = \frac{\sum_i x_{1i}\, x_{2i}}{\sqrt{\sum_j x_{1j}^2}\; \sqrt{\sum_k x_{2k}^2}}$$
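A one-function numpy sketch of this measure (the two vectors are illustrative feature vectors, e.g. topic proportions or TF-IDF weights):

```python
# Minimal sketch: cosine similarity between two document feature vectors.
import numpy as np

def cosine_sim(x1, x2):
    return float(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))

d1 = np.array([0.7, 0.2, 0.1])   # e.g., per-document topic proportions
d2 = np.array([0.6, 0.3, 0.1])
print(cosine_sim(d1, d2))         # close to 1.0 for similar topic mixtures
```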
Learning Task 1: Classification
• Document categorization (classification)
✓ Automatically classify a document into one of the pre-defined categories
(Figure) Labeled documents are used to train a classifier, which then assigns a category to unlabeled documents
Learning Task 1: Classification
• Spam filtering
✓ Pipeline: raw data → features → model
▪ Domain knowledge-based phrases (“Free money”, “!!!”)
▪ Meta-data (sender, mailing list, etc.)
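A compact scikit-learn sketch of such a classifier (the four-message training set and the TF-IDF + Naive Bayes combination are illustrative choices, not the slides' prescription):

```python
# Minimal sketch: spam filtering as supervised text classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

mails = ["Free money!!! Click now", "Meeting moved to 3pm",
         "You won a free prize", "Project report attached"]
labels = ["spam", "ham", "spam", "ham"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(mails, labels)
print(clf.predict(["free money now"]))  # expected: ['spam']
```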
Learning Task 1: Classification
• Sentiment Analysis
http://www.crowdsource.com/solutions/content-moderation/sentiment-analysis/
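A small sketch using NLTK's lexicon-based VADER scorer (chosen here only for brevity; the following slides showcase the tree-structured model of Socher et al.):

```python
# Minimal sketch: lexicon-based sentiment scoring with NLTK's VADER.
# Assumes: pip install nltk, then nltk.download("vader_lexicon") once.
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("This movie was absolutely wonderful!"))
print(sia.polarity_scores("The plot was dull and the acting was worse."))
# Each result holds neg/neu/pos proportions and a compound score in [-1, 1].
```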
Learning Task 1: Classification (Socher et al., 2013)
• Sentiment Analysis
✓ Sentiment Treebank @ Stanford NLP Lab
Learning Task 2: Clustering
• Document Clustering & Visualization
✓ Have a top-level view of the topics in the corpora
✓ See the relationships between topics
✓ Better understand what is going on
• 8,850 articles from 11 journals over the most recent 10 years
• 21,434 terms after preprocessing
• 50 topics from Latent Dirichlet Allocation (LDA)
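A condensed scikit-learn sketch of this pipeline (the toy corpus, 2 topics, and KMeans clustering stand in for the 11-journal corpus and 50 LDA topics above):

```python
# Minimal sketch: LDA topic proportions as features, then document clustering.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

docs = ["neural networks for image recognition",
        "deep learning improves speech recognition",
        "stock market prediction with time series",
        "forecasting financial markets and risk"]

X = CountVectorizer().fit_transform(docs)
theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(theta))
# Documents with similar topic mixtures land in the same cluster.
```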
Learning Task 2: Clustering
• Document Clustering & Visualization
(Figure) Keyword association and journal/topic clustering
Learning Task 2: Clustering
• Document Clustering & Visualization
✓ ThemeScape: content maps built from Thomson Innovation full-text patent data
Learning Task 3: Information Extraction/Retrieval (Yang et al., 2013)
• Information extraction/retrieval
✓ Find useful information from text databases
✓ Example: question answering
https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/Super_Bowl_50.html?model=r-net+%20(ensemble)%20(Microsoft%20Research%20Asia)&version=1.1
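A short sketch of extractive question answering in this SQuAD style, using the Hugging Face transformers pipeline (the library and its default QA model are assumptions; the slide itself only links to the SQuAD leaderboard):

```python
# Minimal sketch: extractive QA, answering from a given context passage.
# Assumes: pip install transformers (a default QA model downloads on first use).
from transformers import pipeline

qa = pipeline("question-answering")
result = qa(question="Which teams played in Super Bowl 50?",
            context="Super Bowl 50 was played between the Carolina Panthers "
                    "and the Denver Broncos at Levi's Stadium.")
print(result["answer"])  # a text span copied from the context
```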
Learning Task 3: Information Extraction/Retrieval
• Topic Modeling
✓ A suite of algorithms that aim to discover and annotate large archives of documents with thematic information
✓ Statistical methods that analyze the words of the original texts to discover
▪ the themes that run through them
▪ how the themes are connected to each other
▪ how they change over time
https://dhs.stanford.edu/algorithmic-literacy/my-definition-of-topic-modeling/
Learning Task 3: Information Extraction/Retrieval
• Latent Dirichlet Allocation (LDA)