Lecture 1: Introduction to Text Analytics

Pilsung Kang
School of Industrial Management Engineering
Korea University
AGENDA

01 Text Analytics: Overview


02 TA Process 1: Collection & Preprocessing
03 TA Process 2: Transformation
04 TA Process 3: Dimensionality Reduction
05 TA Process 4: Learning & Evaluation
Decide What to Mine Witte (2006)

• From a wide range of text sources…



Decide What to Mine
• Open datasets/repositories
✓ The best 25 datasets for NLP (2018.06.07)
▪ https://gengo.ai/datasets/the-best-25-datasets-for-natural-language-processing/

✓ Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP)
▪ https://github.com/niderhoff/nlp-datasets

✓ 50 Free Machine Learning Datasets: Natural Language Processing
▪ https://blog.cambridgespark.com/50-free-machine-learning-datasets-natural-language-processing-d88fb9c5c8da

✓ 25 Open Datasets for Deep Learning Every Data Scientist Must Work With
▪ https://www.analyticsvidhya.com/blog/2018/03/comprehensive-collection-deep-learning-datasets/
Text Preprocessing Level 0: Text
• Remove unnecessary information from the collected data
• Remove figures, advertisements, HTML syntax, hyperlinks, etc.
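A minimal, stdlib-only sketch (not from the slides) of this cleaning step: it strips HTML tags and drops script/style blocks from a collected page before further processing. The sample markup is made up.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text while skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip > 0:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip == 0:
            self.parts.append(data)

raw = "<html><body><h1>Breaking news</h1><script>ads();</script><p>Stocks <a href='#'>rallied</a> today.</p></body></html>"
parser = TextExtractor()
parser.feed(raw)
print(" ".join(p.strip() for p in parser.parts if p.strip()))
# -> Breaking news Stocks rallied today.
```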
Text Preprocessing Level 0: Text
• Do not remove meta-data, which contains significant information on the text
✓ Ex) Newspaper article: author, date, category, language, newspaper, etc.

• Meta-data can be used for further analysis
✓ Target class of a document
✓ Time-series analysis
Text Preprocessing Level 1: Sentence
• Correct sentence boundary detection is also important
✓ For many downstream analysis tasks
▪ POS taggers maximize probabilities of tags within a sentence
▪ Summarization systems rely on correct detection of sentence boundaries
Text Preprocessing Level 1: Sentence
• Sentence Splitting (page link)
Text Preprocessing Level 1: Sentence
• Sentence Splitting with Rule-based Model
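The original slide's figure is not reproduced here. As a stand-in, the sketch below shows a naive rule-based splitter: a regex boundary rule plus a small abbreviation list (both illustrative, not from the slides), where the abbreviation rule prevents a false split after "Dr.".

```python
import re

# Naive rule: a sentence ends at ., ! or ? followed by whitespace and a capital letter.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# Illustrative exception list: do not split right after these abbreviations.
ABBREVIATIONS = ("Dr.", "Mr.", "Mrs.", "Prof.", "e.g.", "i.e.")

def split_sentences(text):
    spans = NAIVE_BOUNDARY.split(text)
    sentences = []
    for span in spans:
        # If the previous span ends with a known abbreviation, glue the pieces back together.
        if sentences and sentences[-1].endswith(ABBREVIATIONS):
            sentences[-1] += " " + span
        else:
            sentences.append(span)
    return sentences

print(split_sentences("Dr. Smith joined Korea University. He teaches text analytics."))
# ['Dr. Smith joined Korea University.', 'He teaches text analytics.']
```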
Text Preprocessing Level 2: Token
• Extracting meaningful (worth analyzing) tokens (words, numbers, spaces, etc.) from a text is not an easy task.
✓ Is "John's sick" one token or two?
▪ If one → problems in parsing (where is the verb?)
▪ If two → what do we do with "John's house"?

✓ What to do with hyphens?


▪ database vs. data-base vs. data base

✓ What to do with “C++”, “A/C”, “:-)”, “…”, “ㅋㅋㅋㅋㅋㅋㅋㅋ”?


✓ Some languages do not use whitespace (e.g., Chinese)

• Consistent tokenization is important for all later processing steps.
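A small illustration (not from the slides) of how two deliberately simple tokenization rules handle the cases above; neither rule is satisfactory on its own, which is exactly why consistency matters.

```python
import re

text = "John's data-base crashed... :-) C++ is fine."

# Rule A: split on whitespace only -- punctuation stays glued to the tokens.
print(text.split())
# ["John's", 'data-base', 'crashed...', ':-)', 'C++', 'is', 'fine.']

# Rule B: keep runs of word characters only -- clitics are split and symbols are lost.
print(re.findall(r"\w+", text))
# ['John', 's', 'data', 'base', 'crashed', 'C', 'is', 'fine']
```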


Text Preprocessing Level 2: Token
• Power distribution in word frequencies
✓ It is not true that more frequently appearing words (tokens) are more important for text mining tasks.

[Figures] 100 common words in the Oxford English Corpus; word frequency distribution in Wikipedia
http://en.wikipedia.org/wiki/Most_common_words_in_English
http://upload.wikimedia.org/wikipedia/commons/b/b9/Wikipedia-n-zipf.png
Text Preprocessing Level 2: Token
• Stop-words
✓ Words that do not carry any information
▪ Mainly a functional role
▪ Usually removed to help machine learning algorithms perform better

✓ Natural-language dependent
▪ English: a, about, above, across, after, again, against, all, also, etc.
▪ Korean: …습니다, …로서(써), …를, etc.

[Original text] [After removing stop words]

http://eprints.pascal-network.org/archive/00000017/01/Tutorial_Marko.pdf
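A toy sketch (not from the slides) of stop-word removal; the stop-word set below is a tiny illustrative sample, whereas a real system would use a full language-specific list (e.g., nltk.corpus.stopwords for English or a curated particle/ending list for Korean).

```python
# Tiny illustrative stop-word list; real lists contain hundreds of entries.
STOP_WORDS = {"a", "about", "above", "across", "after", "again", "against",
              "all", "also", "the", "is", "of", "to", "and", "in"}

tokens = ["the", "cat", "is", "on", "the", "mat", "and", "all", "is", "well"]
content_tokens = [t for t in tokens if t not in STOP_WORDS]
print(content_tokens)   # ['cat', 'on', 'mat', 'well']
```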
Text Preprocessing Level 2: Token
• Stemming
✓ Different forms of the same word are usually problematic for text data analysis, because they have different spellings but similar meanings
▪ Learns, learned, learning, …

✓ Stemming is the process of transforming a word into its stem (normalized form)
▪ In English: Porter2 stemmer (http://snowball.tartarus.org/algorithms/english/stemmer.html)
▪ In Korean: the KKMA morphological analyzer, 꼬꼬마 형태소 분석기 (http://kkma.snu.ac.kr/documents/)
Text Preprocessing Level 2: Token
• Lemmatization
✓ While stemming just finds some base form, which does not even need to be a word in the language, lemmatization finds the actual root (lemma) of a word

Word          Stemming    Lemmatization
Love          Lov         Love
Loves         Lov         Love
Loved         Lov         Love
Loving        Lov         Love
Innovation    Innovat     Innovation
Innovations   Innovat     Innovation
Innovate      Innovat     Innovate
Innovates     Innovat     Innovate
Innovative    Innovat     Innovative
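For reference, a short sketch using NLTK's Porter stemmer and WordNet lemmatizer (assumes pip install nltk and nltk.download("wordnet") have been run); the exact outputs may differ from the Porter2/KKMA results in the table above, since a different stemmer is used here.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["loves", "loved", "loving", "innovation", "innovations", "innovative"]:
    stem = stemmer.stem(word)
    # WordNet lemmatization needs a POS hint; pos="v" treats the word as a verb.
    lemma = lemmatizer.lemmatize(word, pos="v")
    print(f"{word:12s} stem={stem:10s} lemma={lemma}")
```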


AGENDA

01 Text Mining Overview


02 TM Process 1: Collection & Preprocessing
03 TM Process 2: Transformation
04 TM Process 3: Dimensionality Reduction
05 TM Process 4: Learning & Evaluation
Text Transformation
• Document representation
✓ Bag-of-words: a simplifying representation for documents in which a text is represented as a vector over an unordered collection of its words
S1: John likes to watch movies. Mary likes too.
S2: John also likes to watch football games.

Word S1 S2
John 1 1
Likes 2 1
To 1 1
Watch 1 1
Movies 1 0
Also 0 1
Football 0 1
Games 0 1
Mary 1 0
too 1 0
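A minimal sketch (not from the slides) that reproduces the count table above with scikit-learn's CountVectorizer; note that it lowercases the text and orders the vocabulary alphabetically.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["John likes to watch movies. Mary likes too.",
        "John also likes to watch football games."]

vectorizer = CountVectorizer()          # lowercases and tokenizes on word boundaries
X = vectorizer.fit_transform(docs)      # 2 x V sparse matrix of word counts

for word, s1, s2 in zip(vectorizer.get_feature_names_out(), *X.toarray()):
    print(f"{word:10s} {s1} {s2}")      # e.g. likes -> 2 1, movies -> 1 0
```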
Text Transformation
• Word Weighting
✓ Each word is represented as a separate variable with a numeric weight
✓ Term frequency–inverse document frequency (TF-IDF)
▪ tf(w): term frequency (number of occurrences of the word in a document)
▪ df(w): document frequency (number of documents containing the word)

TF-IDF(w) = tf(w) × log(N / df(w)),  where N is the total number of documents

▪ The word is more important if it appears several times in the target document (tf) and if it appears in fewer documents (idf)
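A small sketch of the weighting formula above; the natural logarithm is used here, although other bases (2 or 10) only rescale the weights. Library implementations such as scikit-learn's TfidfVectorizer use slightly different (smoothed) variants of the same idea.

```python
import math

def tf_idf(tf_w, df_w, n_docs):
    """TF-IDF(w) = tf(w) * log(N / df(w)), as defined on this slide."""
    return tf_w * math.log(n_docs / df_w)

# A word occurring 3 times in the target document and appearing in 2 of 10 documents:
print(tf_idf(tf_w=3, df_w=2, n_docs=10))   # 3 * ln(5) ≈ 4.83
```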
Text Transformation
• Word weighting (cont'd): example

Term counts per play:
            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      157                    73              0             0        0         0
Brutus      4                      157             0             1        0         0
Caesar      232                    227             0             2        1         1
Calpurnia   0                      10              0             0        0         0
Cleopatra   57                     0               0             0        0         0
mercy       2                      0               3             5        5         1
worser      2                      0               1             1        1         0

Word weights:
            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      5.25                   3.18            0             0        0         0
Brutus      1.21                   6.1             0             1        0         0
Caesar      8.59                   2.54            0             1.51     0.25      0.35
Calpurnia   0                      1.54            0             0        0         0
Cleopatra   2.85                   0               0             0        0         0
mercy       1.51                   0               1.9           0.12     5.25      0.88
worser      1.37                   0               0.11          4.15     0.25      1.95
Text Transformation
• One-hot-vector representation
✓ The simplest and most intuitive representation

✓ Yields a vector representation, but similarities between words cannot be preserved
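A small numeric illustration (not from the slides) of this limitation: with one-hot vectors, every pair of distinct words has cosine similarity 0, even closely related words.

```python
import numpy as np

vocab = ["movie", "film", "football"]
one_hot = np.eye(len(vocab))            # each word is a distinct standard basis vector

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(one_hot[0], one_hot[1]))   # 0.0 -- "movie" vs. "film"
print(cosine(one_hot[0], one_hot[2]))   # 0.0 -- "movie" vs. "football"
```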
Text Transformation
• Word vectors: distributed representation
✓ A parameterized function mapping words in some language to vectors of a certain dimension

• An interesting feature of word embeddings
✓ Semantic relationships between words can be preserved
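A sketch (not from the slides) using gensim's downloadable pre-trained GloVe vectors; the model name "glove-wiki-gigaword-50" and the example words are illustrative, and the first call downloads the vectors (~66 MB).

```python
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")   # 50-dimensional pre-trained GloVe vectors

# Related words receive high similarity, unlike with one-hot vectors.
print(glove.similarity("movie", "film"))

# Semantic relationships are (approximately) preserved as vector arithmetic.
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```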
Text Transformation
• Word vectors: distributed representation

http://nlp.stanford.edu/projects/glove/
Text Transformation
• Pre-trained Word Models

• Pre-trained Language Models


AGENDA

01 Text Mining Overview


02 TM Process 1: Collection & Preprocessing
03 TM Process 2: Transformation
04 TM Process 3: Dimensionality Reduction
05 TM Process 4: Learning & Evaluation
Feature Selection/Extraction
• Feature subset selection
✓ Select only the best features for further analysis
▪ The most frequent
▪ The most informative relative to all class values, …

✓ Scoring methods for an individual feature (for supervised learning tasks); see the sketch after this list

▪ Information gain:

▪ Cross-entropy:

▪ Mutual information:

▪ Weight of evidence:

▪ Odds ratio:

▪ Frequency:
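The original score definitions are not reproduced above. As one concrete example, here is a sketch (not from the slides) of information gain for a binary "term occurs / does not occur" feature against document class labels, using the standard entropy-based definition; the toy data is made up.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def information_gain(term_present, labels):
    """IG(C; w) = H(C) - sum over v in {present, absent} of P(w=v) * H(C | w=v)."""
    labels = np.asarray(labels)
    term_present = np.asarray(term_present, dtype=bool)
    classes = np.unique(labels)

    def class_dist(mask):
        return np.array([(labels[mask] == c).mean() for c in classes])

    ig = entropy(np.array([(labels == c).mean() for c in classes]))   # H(C)
    for mask in (term_present, ~term_present):
        if mask.any():
            ig -= mask.mean() * entropy(class_dist(mask))
    return ig

# 6 documents: does the term occur? / class label (1 = positive class, 0 = negative class)
occurs = [1, 1, 1, 0, 0, 0]
labels = [1, 1, 1, 0, 0, 1]
print(information_gain(occurs, labels))   # higher value -> more informative term
```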
Feature Selection/Extraction
• Feature subset extraction
✓ Feature extraction: construct a set of variables that preserves the information of the original data by combining the original variables in a linear/non-linear form
✓ Latent Semantic Analysis (LSA), which is based on singular value decomposition (SVD), is commonly used

▪ A rectangular matrix A (m × n) can be decomposed as A = UΣVᵀ, where U is m × r, Σ is r × r, and Vᵀ is r × n

▪ The singular vectors in U and V are orthonormal: UᵀU = VᵀV = I

▪ The singular values in Σ are non-negative and sorted in descending order


Feature Selection/Extraction Lee (2010)

• SVD in Text Mining (Latent Semantic Analysis/Indexing)

✓ Step 1) Using SVD, a term-document matrix D is reduced to Dk:  D ≈ Dk = Uk Σk Vkᵀ

✓ Step 2) Multiply by the transpose of the matrix Uk:  Ukᵀ Dk = Σk Vkᵀ

✓ Apply data mining algorithms to the matrix obtained in Step 2 (a sketch follows below)
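A numeric sketch (not from the slides) of the two steps with NumPy; the toy term-document matrix and the choice k = 2 are illustrative.

```python
import numpy as np

# Toy term-document matrix D (m terms x n documents); the counts are made up.
D = np.array([[2, 1, 0, 0],
              [1, 2, 0, 1],
              [0, 0, 3, 1],
              [0, 1, 2, 2]], dtype=float)

k = 2                                           # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(D, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Step 1) rank-k approximation:  D ≈ D_k = U_k Σ_k V_kᵀ
D_k = U_k @ S_k @ Vt_k

# Step 2) project the documents into the latent space:  U_kᵀ D_k = Σ_k V_kᵀ  (k x n)
docs_reduced = U_k.T @ D_k
print(np.allclose(docs_reduced, S_k @ Vt_k))    # True
print(docs_reduced.shape)                       # (2, 4): each column is a document
```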
Feature Selection/Extraction
• LSA

https://www.quora.com/How-is-LSA-used-for-a-text-document-clustering
Feature Selection/Extraction
• LSA example
✓ Data: 41,717 abstracts of research projects funded by the National Science Foundation (NSF) between 1990 and 2003
✓ Top 10 positive and negative keywords for each SVD
Feature Selection/Extraction
• LSA example
✓ Visualize the projects in the reduced 2-D space
Feature Selection/Extraction
• Topic Modeling as a Feature Extractor
✓ Latent Dirichlet Allocation (LDA)

• Words (w) are generated from particular topics z

• The topic proportions (θ) of a document are determined by a Dirichlet distribution with parameter α
• Only w can actually be observed; θ, z, and Φ are hidden (latent) values
• Document generation process: first draw the topic proportions θ, then generate the words w based on them
Feature Selection/Extraction
• Topic Modeling as a Feature Extractor
✓ Two outputs of topic modeling
▪ Per-document topic proportion
▪ Per-topic word distribution

(a) Per-document topic proportions (θ_d):
        Topic 1   Topic 2   Topic 3   …   Topic K   Sum
Doc 1   0.20      0.50      0.10      …   0.10      1
Doc 2   0.50      0.02      0.01      …   0.40      1
Doc 3   0.05      0.12      0.48      …   0.15      1
…       …         …         …         …   …         1
Doc N   0.14      0.25      0.33      …   0.14      1

(b) Per-topic word distributions (φ_k):
         Topic 1   Topic 2   Topic 3   …   Topic K
word 1   0.01      0.05      0.05      …   0.10
word 2   0.02      0.02      0.01      …   0.03
word 3   0.05      0.12      0.08      …   0.02
…        …         …         …         …   …
word V   0.04      0.01      0.03      …   0.07
Sum      1         1         1         1   1
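A sketch (not from the slides) that produces both outputs with scikit-learn's LatentDirichletAllocation on a made-up four-document corpus: fit_transform returns the per-document topic proportions, and normalizing components_ gives the per-topic word distributions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stocks fell as markets reacted to interest rates",
        "the team won the game with a late goal",
        "central bank raised interest rates again",
        "players trained hard before the final game"]

counts = CountVectorizer(stop_words="english").fit(docs)
X = counts.transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)        # per-document topic proportions (N x K), rows sum to 1
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # per-topic word dists (K x V)

print(theta.round(2))
vocab = counts.get_feature_names_out()
for k, topic in enumerate(phi):
    top = topic.argsort()[::-1][:3]                 # indices of the 3 most probable words
    print(f"Topic {k}:", [vocab[i] for i in top])
```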
Feature Selection/Extraction
• Document to vector (Doc2Vec)
✓ A natural extension of word2vec
✓ Use a distributed representation for each document
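A sketch (not from the slides) using gensim's Doc2Vec on a made-up three-document corpus; the corpus and parameters are purely illustrative.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = ["john likes to watch movies",
          "john also likes to watch football games",
          "mary enjoys romantic films"]

# Each document gets an integer tag whose vector is learned alongside the word vectors.
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(corpus)]

model = Doc2Vec(tagged, vector_size=20, min_count=1, epochs=50)

print(model.dv[0])                                  # learned 20-dimensional vector for document 0
print(model.dv.most_similar(positive=[0], topn=2))  # documents closest to document 0
```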
AGENDA

01 Text Mining Overview


02 TM Process 1: Collection & Preprocessing
03 TM Process 2: Transformation
04 TM Process 3: Dimensionality Reduction
05 TM Process 4: Learning & Evaluation
Similarity Between Documents
• Document similarity
✓ Use the cosine similarity rather than Euclidean distance
▪ Which two documents are more similar?

Doc. Word 1 Word 2 Word 3


Document 1 1 1 1
Document 2 3 3 3
Document 3 0 2 0

Sim(D1, D2) = ( Σi x1i · x2i ) / ( √(Σj x1j²) · √(Σk x2k²) )
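A quick check (not from the slides) of the table above: cosine similarity treats Document 1 and Document 2 as identical in direction, while Euclidean distance would instead rank Document 3 as the closer one.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

d1, d2, d3 = np.array([1, 1, 1]), np.array([3, 3, 3]), np.array([0, 2, 0])

print(cosine(d1, d2))                                     # 1.0   (same direction)
print(cosine(d1, d3))                                     # ~0.577
print(np.linalg.norm(d1 - d2), np.linalg.norm(d1 - d3))   # ~3.46 vs. ~1.73 (Euclidean disagrees)
```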
Learning Task 1: Classification
• Document categorization (classification)
✓ Automatically classify a document into one of the pre-defined categories

[Figure] Machine learning algorithms trained on labeled documents assign a category to each unlabeled document
Learning Task 1: Classification
• Spam filtering: Raw Data → Features → Model
✓ Vector space model (bag of words)
✓ Domain knowledge-based phrases ("Free money", "!!!")
✓ Meta-data (sender, mailing list, etc.)
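A toy sketch (not from the slides) of a bag-of-words spam classifier with scikit-learn; the messages, labels, and the choice of TF-IDF features plus multinomial naive Bayes are illustrative, and a real filter would add the phrase and meta-data features listed above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = ["Free money!!! Click now", "Meeting moved to 3pm",
            "You won a free prize, claim now!!!", "Lecture notes attached"]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words (TF-IDF) features feeding a naive Bayes classifier.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(messages, labels)

print(clf.predict(["free prize now!!!", "see attached lecture notes"]))  # likely ['spam', 'ham']
```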
Learning Task 1: Classification
• Sentiment Analysis

http://www.crowdsource.com/solutions/content-moderation/sentiment-analysis/
Learning Task 1: Classification Socher et al. (2013)

• Sentiment Analysis
✓ Sentiment tree bank @Stanford NLP Lab
Learning Task 2: Clustering
• Document Clustering & Visualization
✓ Have a top level view of the topics in the corpora
✓ See relationships between topics
✓ Understand better what’s going on

Raw Data:
• 8,850 articles from 11 journals over the most recent 10 years
• 21,434 terms after preprocessing

Features:
• 50 topics from Latent Dirichlet Allocation (LDA)
Learning Task 2: Clustering
• Document Clustering & Visualization
[Figures] Keyword association; journal/topic clustering
Learning Task 2: Clustering
• Document Clustering & Visualization
✓ ThemeScape: content maps from Thomson Innovation full-text patent data
Learning Task 3: Information Extraction/Retrieval
Yang et al. (2013)

• Information extraction/retrieval
✓ Find useful information from text databases
✓ Examples: Question Answering

https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/Super_Bowl_50.html?model=r-net+%20(ensemble)%20(Microsoft%20Research%20Asia)&version=1.1
Learning Task 3: Information Extraction/Retrieval
• Topic Modeling
✓ A suite of algorithms that aim to
discover and annotate large archives of
documents with thematic information
✓ Statistical methods that analyze the
words of the original texts to discover
▪ the themes that run through them
▪ how themes are connected to each
other
▪ how they change over time

https://dhs.stanford.edu/algorithmic-literacy/my-definition-of-topic-modeling/
Learning Task 3: Information Extraction/Retrieval
• Latent Dirichlet Allocation (LDA)

• Words (w) are statistically generated from the topics Z


• The topic proportions for a document (θd) are determined by a Dirichlet distribution with parameter α
• We can only observe w; θ, z, and Φ are latent variables (hidden, cannot be observed)
• Document generation process: (1) draw the topic proportions θ, (2) generate the words w based on θ
