Decision Support and Business Intelligence Systems
(9th Ed., Prentice Hall)
Chapter 7:
Text and Web Mining
Learning Objectives
Describe text mining and understand the need for text mining
Differentiate between text mining, Web mining, and data mining
Understand the different application areas for text mining
Know the process of carrying out a text mining project
Understand the different methods to introduce structure to text-based data
-2 Copyright © 2011 Pearson Education, Inc. Publishing as Prentice Hall
Learning Objectives
Describe Web mining, its objectives, and its benefits
Understand the three different branches of Web mining
Web content mining
Web structure mining
Web usage mining
Understand the applications of these three mining paradigms
Problem description
Proposed solution
Results
Dream of the AI community: to have algorithms that are capable of automatically reading and obtaining knowledge from text
Natural Language Processing (NLP)
WordNet
A laboriously hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets
A major resource for NLP
Needs automation to be completed
Sentiment Analysis
A technique used to detect favorable and unfavorable opinions toward specific products and services
See Application Case 7.3 for a CRM application
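The idea of detecting favorable and unfavorable opinions can be illustrated with a minimal lexicon-based sketch. The word lists `POSITIVE` and `NEGATIVE` and the helper names are hypothetical, chosen only for illustration; this is not the method used in the application case, and practical systems rely on much larger lexicons or trained classifiers.

```python
import re

# Hypothetical mini-lexicons (assumption); real lexicons are far larger.
POSITIVE = {"good", "great", "excellent", "love", "reliable", "fast"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "slow", "broken"}


def sentiment_score(text: str) -> int:
    """Return (number of positive words) minus (number of negative words)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    return pos - neg


def classify(text: str) -> str:
    """Map the score to a coarse opinion label."""
    score = sentiment_score(text)
    if score > 0:
        return "favorable"
    if score < 0:
        return "unfavorable"
    return "neutral"


reviews = [
    "Great phone, fast and reliable.",
    "Terrible service and a broken screen.",
    "It arrived on Tuesday.",
]
labels = [classify(r) for r in reviews]
for review, label in zip(reviews, labels):
    print(f"{label:12s} {review}")
```

Counting lexicon hits ignores negation ("not good") and sarcasm, which is one reason sentiment analysis is treated as an NLP problem and not as simple keyword matching.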
NLP Task Categories
Information retrieval
Information extraction
Named-entity recognition
Question answering
Automatic summarization
Natural language generation and understanding
Machine translation
Foreign language reading and writing
Speech recognition
Text proofing
Optical character recognition
Text Mining Applications
Marketing applications
Enables better CRM
Security applications
ECHELON, OASIS
Deception detection (…)
Medicine and biology
Literature-based gene identification (…)
Academic applications
Research stream analysis
Text Mining Applications
Application Case 7.4: Mining for Lies
Deception detection
A difficult problem
If detection is limited to only text, then the problem is even more difficult
The study
analyzed text-based testimonies of persons of interest at military bases
used only text-based features (cues)
Text Mining Applications
Application Case 7.4: Mining for Lies
[Figure: text-based deception detection process. Boxes:]
Statements Transcribed for Processing
Statements Labeled as Truthful or Deceptive By Law Enforcement
Text Processing Software Generated Quantified Cues
Cues Extracted & Selected
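What "quantified cues" might look like can be sketched as follows. The four cues computed here (word count, average word length, first-person pronoun ratio, lexical diversity) and the `FIRST_PERSON` word list are assumptions for illustration; the slides do not list the cues the study actually used.

```python
import re

# Assumed cue word list, for illustration only.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}


def quantify_cues(statement: str) -> dict:
    """Turn a transcribed statement into a few numeric text-based cues."""
    words = re.findall(r"[a-z']+", statement.lower())
    n = len(words)
    if n == 0:
        return {
            "word_count": 0,
            "avg_word_length": 0.0,
            "first_person_ratio": 0.0,
            "lexical_diversity": 0.0,
        }
    return {
        "word_count": n,
        "avg_word_length": sum(len(w) for w in words) / n,
        "first_person_ratio": sum(w in FIRST_PERSON for w in words) / n,
        "lexical_diversity": len(set(words)) / n,  # unique words / all words
    }


cues = quantify_cues("I left my office and I went home")
print(cues)
```

Each statement becomes one row of numeric features; together with the truthful/deceptive label, such rows could then be fed to a classification model.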
[Figure: multilevel annotation of a sentence]
Ontology: D007962, D016923, D001773
Sentence: ...expression of Bcl-2 is correlated with insufficient white blood cell death and activation of p53.
Word: 185 8 51112 9 23017 27 5874 2791 8952 1623 5632 17 8252 8 2523
POS: NN IN NN IN VBZ IN JJ JJ NN NN NN CC NN IN NN
Shallow Parse: NP PP NP NP PP NP NP PP NP
A0
Domain expertise
Tools and techniques
Text Mining Process
Feedback Feedback
The inputs to the process include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.
The output of Task 1 is a collection of documents in some digitized format for computer processing
The output of Task 2 is a flat file called a term-document matrix, where the cells are populated with the term frequencies
The output of Task 3 is a number of problem-specific classification, association, and clustering models and visualizations
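[Table: term-document matrix (rows: documents, columns: terms, cells: term frequencies). Nonzero term frequencies per document, term columns not shown:]
Document 2: 1
Document 3: 3, 1
Document 4: 1
Document 5: 2, 1
Document 6: 1, 1
...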
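A term-document matrix of the kind described above can be built in a few lines. This is a minimal sketch: the three sample documents and the `tokenize` helper are invented for illustration, and steps a real project would normally apply (stop-word removal, stemming, weighting) are left out.

```python
import re
from collections import Counter

# Invented sample documents (assumption, not from the source).
docs = {
    "Document 1": "investment risk in project management",
    "Document 2": "software engineering project",
    "Document 3": "risk risk risk management",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Term frequencies per document.
counts = {name: Counter(tokenize(text)) for name, text in docs.items()}

# Column order: all distinct terms, sorted alphabetically.
terms = sorted(set().union(*counts.values()))

# One row per document, one column per term; a Counter returns 0 for absent terms.
matrix = [[counts[name][t] for t in terms] for name in docs]

print(" " * 12 + " ".join(f"{t:>11s}" for t in terms))
for name, row in zip(docs, matrix):
    print(f"{name:12s}" + " ".join(f"{v:11d}" for v in row))
```

Most cells are zero even in this tiny example, which is typical: term-document matrices are sparse, so larger projects usually store them in a sparse format and reduce the number of terms before modeling.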
[Figure: number of articles per year (x-axis: YEAR, 1994 to 2001), one panel per cluster]
[Figure: number of articles per journal (y-axis: No of Articles, 0 to 100; x-axis: JOURNAL, with ISR, JMIS, MISQ), one panel per cluster, CLUSTER 1 through CLUSTER 9]
Web Mining
Web Analytics
Voice of Customer
Customer Experience Management
Questions / comments…