Introduction To Text Mining

This document introduces text mining and provides an overview of the topic from an instructor of a text mining course. It discusses what text mining is, examples of text mining applications, challenges in text mining, and techniques that will be covered in the course like document categorization and topic modeling.

Uploaded by

Rudra Gandhi

Introduction to Text Mining

Hongning Wang
CS@UVa
Who Am I?
• Hongning Wang
– Assistant professor in CS@UVa since August 2014
– Research areas
• Information retrieval
• Data mining
• Machine learning

CS@UVa CS6501: Text Mining 2




What Am I Doing at UVa?
• Sentiment analysis with topic modeling



What Am I Doing at UVa?
• Interactive online recommendation
– Modeling recommendation as a two-party game
– [Figure: labels “Strategy?” and “Goal:”]
– Challenges: 1) unknown preference; 2) feedback is acquired on the fly, and it is not free!


What Am I Doing at UVa?
• Yahoo frontpage news recommendation

– 18,882 users, 188,384 articles, and 9,984,879 logged events segmented into 1,123,583 sessions.


What Am I Doing at UVa?
• Personalization techniques raise serious public
concerns about privacy infringement
No means for users to opt out of data collection!


What Am I Doing at UVa?
• Privacy-preserving personalization
Stronger privacy guarantee than k-anonymity



What about you?
• Why did you choose this course?
• Anything specific you want me to know?
• What type of text data do you often
encounter in your projects?
• What kind of knowledge do you want to
extract from it?



What is “Text Mining”?
• “Text mining, also referred to as text data
mining, roughly equivalent to text analytics,
refers to the process of deriving high-quality
information from text.” - Wikipedia
• “Another way to view text data mining is as a
process of exploratory data analysis that
leads to heretofore unknown information, or
to answers for questions for which the answer
is not currently known.” - Hearst, 1999



Two different definitions of mining
• Goal-oriented (effectiveness driven)
– Any process that generates useful results that are non-
obvious is called “mining”.
– Keywords: “useful” + “non-obvious”
– Data isn’t necessarily massive
• Method-oriented (efficiency driven)
– Any process that involves extracting information from
massive data is called “mining”
– Keywords: “massive” + “pattern”
– Patterns aren’t necessarily useful



Knowledge discovery from text data
• IBM’s Watson wins at Jeopardy! - 2011



An overview of Watson



What is inside Watson?
• “Watson had access to 200 million pages of
structured and unstructured content consuming
four terabytes of disk storage including the full
text of Wikipedia” – PC World
• “The sources of information for Watson include
encyclopedias, dictionaries, thesauri, newswire
articles, and literary works. Watson also used
databases, taxonomies, and ontologies.
Specifically, DBPedia, WordNet, and Yago were
used.” – AI Magazine



What is inside Watson?
• DeepQA system
– “Watson's main innovation was not in the creation
of a new algorithm for this operation but rather its
ability to quickly execute hundreds of proven
language analysis algorithms simultaneously to
find the correct answer.” – New York Times
– The DeepQA Research Team



Text mining around us
• Sentiment analysis



Text mining around us
• Document summarization



Text mining around us
• Movie recommendation



Text mining around us
• Restaurant/hotel recommendation



Text mining around us
• News recommendation



Text mining around us
• Text analytics in financial services



Text mining around us
• Text analytics in healthcare



How to perform text mining?
• As computer scientists, we view it as
– Text Mining = Data Mining + Text Data



Text mining vs. NLP, IR, DM…
• How does it relate to data mining in general?
• How does it relate to computational
linguistics?
• How does it relate to information retrieval?
                  Finding Patterns           Finding “Nuggets”
                                             Novel                      Non-Novel
Non-textual data  General data-mining        Exploratory data analysis  Database queries
Textual data      Computational Linguistics  Text Mining                Information retrieval

Text mining in general
• Access: serve for IR applications; filter information
• Mining: sub-area of DM research; discover knowledge
• Organization: based on NLP/ML techniques; add structure/annotations

Challenges in text mining
• Data collection is “free text”
– Data is not well-organized
• Semi-structured or unstructured
– Natural language text contains ambiguities on many levels
• Lexical, syntactic, semantic, and pragmatic
– Learning techniques for processing text typically need
annotated training examples
• Expensive to acquire at scale
• What to mine?



Text mining problems we will solve
• Lexical semantics and word senses
– Identifying which sense of a word (i.e. meaning) is
used in a sentence, when the word has multiple
meanings
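
One way to make the word-sense problem concrete is a simplified Lesk-style heuristic: pick the sense whose dictionary gloss shares the most words with the sentence. The sketch below is only an illustration under stated assumptions. The sense inventory `SENSES`, its glosses, the stopword list, and the function names are all made up for this example and are not part of the course material.

```python
# Hypothetical two-sense inventory for "bank"; the glosses are written for this example.
SENSES = {
    "bank": {
        "financial institution": "an institution that accepts deposits and lends money",
        "river side": "the sloping land beside a river or stream of water",
    }
}

# Small assumed stopword list so that function words do not count as overlap.
STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "to", "in", "on", "by", "at", "i", "my"}


def content_words(text):
    """Return the lowercase non-stopword tokens of a text as a set."""
    return {w.strip(".,!?").lower() for w in text.split()} - STOPWORDS


def disambiguate(word, sentence):
    """Pick the sense whose gloss overlaps most with the sentence context."""
    context = content_words(sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("bank", "I need to deposit money at the bank"))
print(disambiguate("bank", "We sat on the bank of the river"))
```

Word overlap is a weak signal (for instance, "deposit" does not match "deposits" here), which is one reason practical systems use richer context representations.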



Text mining problems we will solve
• Document categorization
– Adding structures to the text corpus
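
As a rough illustration of document categorization, the sketch below trains a multinomial Naive Bayes classifier with add-one smoothing on four toy documents. The training data, labels, and function names are assumptions made for this example; a real system would use a proper tokenizer and far more data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Collect per-class document counts and word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-posterior under the trained model."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior: fraction of training documents with this label.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set, invented for illustration only.
training = [
    ("the team won the game", "sports"),
    ("a great match and a late goal", "sports"),
    ("stocks fell as the market closed", "finance"),
    ("the bank raised interest rates", "finance"),
]
model = train_naive_bayes(training)
print(classify("the market and interest rates", model))
print(classify("the team scored a goal", model))
```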



Text mining problems we will solve
• Text clustering
– Identifying structures in the text corpus
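
Clustering methods generally rely on some measure of how similar two documents are. The sketch below shows one common choice, cosine similarity over bag-of-words counts. It is a building block rather than a full clustering algorithm, and the example documents and function names are assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a document as a mapping from word to its count."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy documents, invented for illustration only.
docs = [
    "stock market prices fall",
    "stock market prices rise",
    "football team wins match",
]
vectors = [bag_of_words(d) for d in docs]
print(cosine_similarity(vectors[0], vectors[1]))  # documents sharing three of four words
print(cosine_similarity(vectors[0], vectors[2]))  # documents with no words in common
```

The first two documents score much higher with each other than with the third, which is the kind of structure a clustering algorithm would then group on.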



Text mining problems we will solve
• Topic modeling
– Identifying structures in the text corpus



Text mining problems we will solve
• Social media and social network analysis
– Exploring additional structure in the text corpus



We will also briefly cover
• Natural language processing pipeline
– Tokenization
• “Studying text mining is fun!” -> “studying” + “text” +
“mining” + “is” + “fun” + “!”
– Part-of-speech tagging
• “Studying text mining is fun!” ->
– Dependency parsing
• “Studying text mining is fun!” ->
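
The tokenization step above can be sketched with a single regular expression. This is a minimal illustration that assumes a token is either a run of word characters or a single punctuation mark, and that tokens are lowercased; the pattern and the function name are choices made for this example, and real tokenizers handle many more cases (contractions, URLs, numbers).

```python
import re

# Assumed rule: a token is a run of word characters or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into lowercase word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("Studying text mining is fun!"))
# ['studying', 'text', 'mining', 'is', 'fun', '!']
```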



We will also briefly cover
• Machine learning techniques
– Supervised methods
• Naïve Bayes, k Nearest Neighbors, Logistic Regression
– Unsupervised methods
• K-Means, hierarchical clustering, topic models
– Semi-supervised methods
• Expectation Maximization



Text mining in the era of Big Data
• Huge in size
– Google processes 5.13B queries/day (2013)
– Twitter receives 340M tweets/day (2012)
– Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
– eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• 80% of data is unstructured (IBM, 2010)
• Slide callout: “640K ought to be enough for anybody.”


Scalability is crucial
• Large scale text processing techniques
– MapReduce framework
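
To give a feel for the MapReduce programming model, the sketch below simulates its three stages (map, shuffle, reduce) for word counting inside a single Python process. It is not a distributed implementation, and the function names are hypothetical; a real framework would run the map and reduce calls in parallel across machines and handle the grouping step itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: sum all counts emitted for one word."""
    return key, sum(values)


def word_count(documents):
    """Run the three stages in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["text mining is fun", "mining text data"])
print(counts)
```

Because each map call looks at one document and each reduce call looks at one key, both stages can be spread over many workers, which is what makes the model attractive for large text collections.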



State-of-the-art solutions
• Apache Spark (spark.apache.org)
– In-memory MapReduce
• Specialized for machine learning algorithms
– Speed
• Up to 100x faster than Hadoop MapReduce in memory, or
10x faster on disk (as reported by the Spark project).


State-of-the-art solutions
• Apache Spark (spark.apache.org)
– In-memory MapReduce
• Specialized for machine learning algorithms
– Generality
• Combine SQL, streaming, and complex analytics



State-of-the-art solutions
• GraphLab (graphlab.com)
– Graph-based, high performance, distributed
computation framework



State-of-the-art solutions
• GraphLab (graphlab.com)
– Specialized for sparse data with local
dependencies for iterative algorithms



Text mining in the era of Big Data

• Human: big data producer and consumer
– Human-generated data: text data and behavior data
– As data producer, challenges: 1. unstructured data; 2. rich semantic
– As knowledge consumer, challenges: 1. implicit feedback; 2. diverse and dynamic
• Components shown in the diagram: knowledge discovery, knowledge service system

Textbooks
• Introduction to Information Retrieval.
Christopher D. Manning, Prabhakar Raghavan,
and Hinrich Schuetze, Cambridge University
Press, 2007.
• Speech and Language Processing. Daniel
Jurafsky and James H. Martin, Pearson Education,
2000.
• Mining Text Data. Charu C. Aggarwal and
ChengXiang Zhai, Springer, 2012.



What to read?
• Text Mining draws on the following areas (representative venues listed where the slide gives them)
– Algorithms: Machine Learning, Pattern Recognition, Statistics, Optimization (ICML, NIPS, UAI)
– Data Mining (KDD, ICDM, SDM)
– NLP (ACL, EMNLP, COLING)
– Information Retrieval (SIGIR, WWW, WSDM, CIKM)
– Applications: Web Applications, Bioinformatics…
– Library & Info Science
• Find more resources on the course website

Welcome to the class of “Text Mining”!
