Data Mining: Ying Liu, Prof., Ph.D.
2017/9/19 2
Useful Information
Textbook and References
Textbook
Data Mining, Concepts and Techniques.
Jiawei Han and Micheline Kamber,
Morgan Kaufmann, 2011.
References
Research papers. To be announced in
class.
Prerequisites
Data Structure
Algorithm
Database
Programming: C/C++ (preferred), Java
A Mini Survey
Grading Scheme
Assignments (40%)
2 homework assignments
Course Project (50%)
One group project (2-3 students per group)
Develop an algorithm and hand in a project report
Present in class
Evaluated on technical innovation, performance,
thoroughness of the work, and clarity of
presentation
Attendance (10%)
About the Project
Start early
It takes time to understand and think
Discuss with me
Maybe I can give some suggestions or ideas
Implement concretely
Understand the pros and cons
Think creatively
Why Take This Course?
Data mining is hot
Solve many interesting problems in real
applications, e.g. business management, market
analysis, science exploration
Turn raw data into knowledge
Promising in research of many disciplines
Job market for data miners: many well-paid positions
Syllabus (Tentative)
Introduction
Data warehouse
Data pre-processing
Association Rules
Classification
Clustering
Sequence Mining
Applications
Big Data Mining
Project Discussion & Demo
Objectives of This Course
Policies
No Plagiarism!
What Motivated Data Mining?
The explosive growth of data
Data collection and data availability
Computer hardware and software have developed dramatically
The amount of data collected and stored doubles or triples
each year, while CPU speed increased 15% per year (until
2003)
What Motivated Data Mining?
Business World
Tremendous amounts of data being collected and
stored
E-commerce
Transactions
Stocks
Credit card transactions
Strong competitive pressure to extract and
use the knowledge hidden in the data to
provide customized CRM
What Motivated Data Mining?
Scientific World
Tremendous amounts of data being collected
and stored
Remote sensing
Bioinformatics (Microarrays)
Scientific simulation
Scientists need strong data analysis
to assist research, such as
classification, segmentation, etc.
What Motivated Data Mining?
What is Data Mining?
What is Data Mining?
Cross Disciplines
Databases
Machine learning: decision tree, Bayesian classifier,
etc.
Statistics: regression, etc.
Neural networks
Why Not Traditional Data Analysis?
Tremendous amount of data
Algorithms must be highly scalable
to handle, for example, terabytes of data
High-dimensionality of data
DNA sequences may have tens of
thousands of dimensions
Why Not Traditional Data Analysis?
High complexity of data
Data streams and sensor data
Time-series data, sequence data
Graphs, social networks
Spatial, temporal, multimedia, text
and Web data
New and sophisticated
applications
Why Not Traditional Data Analysis?
Database vs. data mining
Database: storage-oriented; provides simple queries
Data mining: discovers knowledge from data in databases
Data warehouse vs. data mining
Data warehouse: subject-oriented; a multidimensional view of data; operations to access summarized data
Data mining: advanced data analysis tools
Statistical algorithms vs. data mining
Statistical algorithms: based on many hypotheses; find patterns in a small number of samples
Data mining: fewer hypotheses; finds patterns in a large number of samples; abnormal patterns
Characteristics of Data Mining
Massive dataset
Automatically searching for interesting
patterns from historical data
Fast
Scalable
Update easily
Practical
Decision support
Exercises
What Kinds of Patterns?
Association rules
Detect sets of attributes or items that frequently co-occur
in many database records, and the rules among them
Ex. 1: Market Basket Analysis and
Management
Where does the data come from?
supermarket transactions, membership cards,
discount coupons, customer complaint calls
Cross-marketing analysis
What products were often purchased together?
Purchase recommendation, cross selling
What are the subsequent purchases after buying a
given product?
Target-marketing
What types of customers buy what products?
Catalog design
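The cross-marketing question "What products were often purchased together?" can be sketched as a pair count over baskets. This is only an illustration: the baskets, item names, and the helper pair_counts are made up for the example and do not come from a real dataset.

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket baskets; the item names are illustrative only.
baskets = [
    {"milk", "bread", "beer"},
    {"milk", "bread"},
    {"bread", "diaper", "beer"},
    {"milk", "bread", "diaper"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair one canonical order, e.g. ("bread", "milk").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

Counting every pair is quadratic in basket size, so it only suits small baskets; it is meant to show the question being asked, not a scalable method.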
What Kinds of Patterns?
Classification
Build a model of the classes from a training dataset,
then assign each new record to one of several
predefined classes
[Figure: Decision Tree. The root node tests Income>$40K; branches are labeled Yes/No and leaves carry Yes/No class labels.]
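A minimal sketch of the two classification steps (build a model on training data, then assign a new record to a class), assuming a one-level decision tree on income. The training records and the names train_stump and classify are hypothetical; a real decision tree would test several attributes.

```python
# Hypothetical training records: (annual income in dollars, class label).
training = [
    (25_000, "No"), (32_000, "No"), (38_000, "No"),
    (45_000, "Yes"), (60_000, "Yes"), (80_000, "Yes"),
]

def train_stump(records):
    """Pick the income threshold that misclassifies the fewest training records.

    The model predicts "Yes" when income > threshold, otherwise "No".
    """
    incomes = sorted(income for income, _ in records)
    # Candidate thresholds: midpoints between consecutive income values.
    candidates = [(a + b) / 2 for a, b in zip(incomes, incomes[1:])]

    def errors(threshold):
        return sum((income > threshold) != (label == "Yes")
                   for income, label in records)

    return min(candidates, key=errors)

def classify(income, threshold):
    """Assign a new record to one of the two predefined classes."""
    return "Yes" if income > threshold else "No"

threshold = train_stump(training)
print(threshold, classify(50_000, threshold), classify(30_000, threshold))
```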
Ex. 2: Credit Scoring
Where does the data come from?
credit card transactions, credit card payments,
loan payments, demographic data
Predict the probability of bankruptcy or
charge-off
Reduce the credit risk to the banks
Increase the profitability of the banks
What Kinds of Patterns?
Clustering
Partition the dataset into groups such that similarity
is high within a group (intra-group) and low
between groups (inter-group)
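One way to check the clustering goal is to compare the average distance inside groups with the average distance between groups. The sketch below assumes Euclidean distance as the (dis)similarity measure and uses made-up 2-D points that are already assigned to two groups; it evaluates a partition and does not compute one.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points already assigned to two groups.
clusters = {
    "A": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)],
}

def mean_intra_distance(clusters):
    """Average distance between pairs of points that share a group."""
    pairs = [dist(p, q)
             for pts in clusters.values()
             for p, q in combinations(pts, 2)]
    return sum(pairs) / len(pairs)

def mean_inter_distance(clusters):
    """Average distance between pairs of points from different groups."""
    pairs = [dist(p, q)
             for (_, pts_a), (_, pts_b) in combinations(clusters.items(), 2)
             for p in pts_a
             for q in pts_b]
    return sum(pairs) / len(pairs)

# A good partition has a small intra-group and a large inter-group distance.
print(mean_intra_distance(clusters), mean_inter_distance(clusters))
```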
Ex. 3: Scientific Simulation
Cosmological simulation
Simulate the formation of the galaxy
An enormous number of particles at each evolution stage,
beyond the capability of humans to analyze
What Kinds of Patterns?
Frequent sequence
Given a set of sequences, find the complete set of
frequent subsequences
Example (in time order): buy a PC -> buy an ink printer -> buy ink cartridges -> buy a new CPU
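The basic building block of frequent-sequence mining is counting how many sequences contain a candidate pattern as a subsequence. The sketch below shows only that support count, not a full miner; the purchase histories are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in the same order
    (not necessarily adjacent)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so order is respected.
    return all(item in it for item in pattern)

def sequence_support(pattern, sequences):
    """Number of sequences that contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sequences)

# Hypothetical purchase histories, one list per customer, in time order.
histories = [
    ["PC", "printer", "ink cartridge", "CPU"],
    ["PC", "ink cartridge"],
    ["printer", "PC", "CPU"],
]

print(sequence_support(["PC", "CPU"], histories))
```

A pattern is called frequent when its support reaches a user-chosen minimum; finding the complete set requires enumerating candidate patterns on top of this count.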
What Kinds of Patterns?
Outliers/Anomalies
Given a set of n objects, and k, the number of
expected anomalies, find the top k objects that are
considerably dissimilar or inconsistent with the
remaining data
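A minimal sketch of the top-k formulation, assuming "dissimilar" means a large average Euclidean distance to all other objects. That scoring rule is one assumption among many possible outlier definitions, and the points are made up.

```python
from math import dist

def top_k_outliers(points, k):
    """Return the k points with the largest average distance to all others."""
    def avg_distance(i):
        others = [dist(points[i], q) for j, q in enumerate(points) if j != i]
        return sum(others) / len(others)

    ranked = sorted(range(len(points)), key=avg_distance, reverse=True)
    return [points[i] for i in ranked[:k]]

# Hypothetical data: a tight group plus one far-away point.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2), (9.0, 9.0)]
print(top_k_outliers(points, 1))
```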
On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional
database
Advanced database applications
Data streams
Spatial data
Text database
Multimedia data
Time-series
Bio-medical data
Network traffic data
Relational Databases
Structured data
Tables, records, attributes
Accessed by queries, SQL
Online transactional processing (OLTP)
Insert a student, Ying Liu, into the class "Introduction
to Data Mining", Fall 2014
Name      Time         Course                       Score  Room
Ying Liu  Fall 2014    Introduction to Data Mining  90     002
Tom       Fall 2014    Math                         85     001
Merlisa   Spring 2014  Compiler                     70     001
George    Fall 2014    Graphics                     92     001
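The OLTP insert can be sketched with Python's built-in sqlite3 module. The table name enrollment and the column names are assumptions chosen to match the table header; an in-memory database is used so the example leaves nothing behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE enrollment "
    "(name TEXT, time TEXT, course TEXT, score INTEGER, room TEXT)"
)
rows = [
    ("Tom", "Fall 2014", "Math", 85, "001"),
    ("Merlisa", "Spring 2014", "Compiler", 70, "001"),
    ("George", "Fall 2014", "Graphics", 92, "001"),
]
conn.executemany("INSERT INTO enrollment VALUES (?, ?, ?, ?, ?)", rows)

# The OLTP-style operation: insert one student record as a transaction.
with conn:  # commits on success, rolls back on error
    conn.execute(
        "INSERT INTO enrollment VALUES (?, ?, ?, ?, ?)",
        ("Ying Liu", "Fall 2014", "Introduction to Data Mining", 90, "002"),
    )

# A simple query of the kind a relational database answers directly.
fall_2014 = conn.execute(
    "SELECT name FROM enrollment WHERE time = ? ORDER BY name", ("Fall 2014",)
).fetchall()
print(fall_2014)
```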
Data Warehouses
A subject-oriented, integrated, cleaned collection of
data in support of management's decision-making
process
Data from multiple databases
Consistency checking in data warehouses
Data warehouses can answer OLAP queries
efficiently
Online analytical processing (OLAP)
Find the average class score of Ying Liu in the last 3 years,
grouped by semesters
Many patterns are summarizations of data
Roll-up, drill-down
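The OLAP query above (average score grouped by semester) and a roll-up to the year level can be sketched in plain Python. The rows and the helper average_by are hypothetical; a real warehouse would precompute such summaries rather than scan raw rows.

```python
from collections import defaultdict

# Hypothetical fact rows: (student, year, semester, score).
scores = [
    ("Ying Liu", 2013, "Fall", 88),
    ("Ying Liu", 2014, "Spring", 84),
    ("Ying Liu", 2014, "Spring", 90),
    ("Ying Liu", 2014, "Fall", 90),
    ("Tom", 2014, "Fall", 85),
]

def average_by(rows, key):
    """Group rows by key(row) and average the score in each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[3])
    return {k: sum(v) / len(v) for k, v in groups.items()}

ying = [r for r in scores if r[0] == "Ying Liu"]
by_semester = average_by(ying, key=lambda r: (r[1], r[2]))  # finer level
by_year = average_by(ying, key=lambda r: r[1])              # roll-up to year
print(by_semester)
print(by_year)
```

Going from by_year back to by_semester is the corresponding drill-down.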
Data Warehouses
Transactional Databases
I = {x1, ..., xn} is the set of items
An itemset is a subset of I
A transaction is a tuple (tid, X)
Transaction ID tid
Itemset X
A transactional database is a set of transactions
Tid Itemset
T100 Milk, bread, beer, diaper
T200 Beer, cook, fish, potato, orange, apple
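The definitions above map directly onto a small data structure: a list of (tid, X) tuples. The sketch below adds the usual support and confidence measures for a rule. T300 is an extra transaction added purely for illustration, and the function names are assumptions.

```python
# A transactional database as a list of (tid, itemset) tuples.
# T100 and T200 mirror the table above; T300 is added for illustration only.
database = [
    ("T100", frozenset({"milk", "bread", "beer", "diaper"})),
    ("T200", frozenset({"beer", "cook", "fish", "potato", "orange", "apple"})),
    ("T300", frozenset({"milk", "bread", "apple"})),
]

def support(itemset, database):
    """Fraction of transactions whose itemset contains every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for _tid, items in database if itemset <= items)
    return hits / len(database)

def confidence(antecedent, consequent, database):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(antecedent, database)
    if base == 0:
        raise ValueError("antecedent never occurs in the database")
    return support(set(antecedent) | set(consequent), database) / base

print(support({"milk", "bread"}, database))
print(confidence({"milk"}, {"bread"}, database))
```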
Spatial Data
Spatial information
Geographic databases (map)
VLSI chip design databases
Satellite/remote sensing image
databases
Medical image database
Spatial patterns
Find characteristics of homes
near a given location
Change in trend of
metropolitan poverty rates
based on distances from major
highways
Time Series
A sequence of values that change over time
Sequences of stock price at every 5 minutes
Daily temperature
Power supply
Electrocardiogram
Typical operations
Similarity search
Trend analysis
Periodic pattern discovery
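Similarity search on a time series can be sketched as sliding a short query over the series and keeping the closest window. Euclidean distance is an assumption here (other measures are common), and the prices are made up.

```python
from math import dist

def best_match(series, query):
    """Slide `query` over `series`; return (start index, distance) of the
    window with the smallest Euclidean distance to the query."""
    m = len(query)
    windows = ((i, dist(series[i:i + m], query))
               for i in range(len(series) - m + 1))
    return min(windows, key=lambda pair: pair[1])

# Hypothetical stock prices sampled every 5 minutes.
prices = [10.0, 10.2, 10.1, 10.8, 11.5, 11.4, 10.9, 10.3]
query = [10.7, 11.5, 11.5]

start, distance = best_match(prices, query)
print(start, distance)
```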
Text Databases & Multimedia Databases
Data Streams
Data arrive continuously in
multiple, rapid, time-varying, possibly
unpredictable and unbounded streams
Dynamically changing patterns, high volume,
infinite, quick response, no re-scan
Many applications
Stock exchange, network monitoring,
telecommunications data management, web
application, sensor networks, etc.
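The "no re-scan" constraint means each item can be examined once and then discarded. As an illustration, the sketch below keeps a running mean and variance with Welford's one-pass update; the class name RunningStats and the readings are hypothetical.

```python
class RunningStats:
    """Mean and variance of a stream, updated one item at a time (Welford's
    method) without storing or re-scanning earlier items."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the items seen so far."""
        return self._m2 / self.n if self.n else 0.0

stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # hypothetical sensor readings
    stats.update(reading)
print(stats.mean, stats.variance)
```

Memory use stays constant no matter how long the stream runs, which is the property stream algorithms need.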
Biomedical Data
Bio-sequences
DNA: very long sequences of nucleotides
Similarity search
Identify sequential patterns that play roles in
various diseases
Association analysis: co-occurring gene
sequences
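A common first step when looking for recurring patterns in a nucleotide sequence is counting its length-k substrings (k-mers). The sketch below assumes that approach for illustration; the sequence is a made-up toy string.

```python
from collections import Counter

def kmer_counts(dna, k):
    """Count every length-k substring (k-mer) of a nucleotide string."""
    return Counter(dna[i:i + k] for i in range(len(dna) - k + 1))

# Hypothetical short sequence; real DNA sequences are far longer.
dna = "ATGCGATGCAATGC"
counts = kmer_counts(dna, 3)
print(counts.most_common(2))
```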
World-Wide Web
The WWW is a huge, widely distributed, global
information service center for
Information services: news, advertisements, consumer
information, financial management, education, government,
e-commerce, etc.
Hyper-link information
Access and usage information
WWW provides rich sources for data mining
Challenges
Too huge for effective data warehousing and data mining
Too complex and heterogeneous: no standards and structure
World-Wide Web
Web Usage: Logs and IP packet header streams
Mine Weblog records to discover user accessing patterns of
Web pages
Web Content
Extract knowledge from Web documents; automatic
categorization
Web Structure
Identify interesting graph patterns among different Web
pages
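Web usage mining can be sketched as counting page-to-page transitions per client in a Weblog. The records below are hypothetical and already parsed into (client IP, timestamp, page) tuples; real logs need parsing and session splitting first.

```python
from collections import Counter, defaultdict

# Hypothetical, already-parsed Weblog records: (client IP, timestamp, page).
log = [
    ("10.0.0.1", 1, "/home"), ("10.0.0.2", 2, "/home"),
    ("10.0.0.1", 3, "/courses"), ("10.0.0.2", 4, "/courses"),
    ("10.0.0.1", 5, "/courses/data-mining"), ("10.0.0.2", 6, "/contact"),
]

def transition_counts(log):
    """Count page -> next-page transitions within each client's visit order."""
    visits = defaultdict(list)
    for ip, _ts, page in sorted(log, key=lambda rec: rec[1]):
        visits[ip].append(page)
    counts = Counter()
    for pages in visits.values():
        counts.update(zip(pages, pages[1:]))
    return counts

transitions = transition_counts(log)
print(transitions.most_common(1))
```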
Graph
Internet graph
Citation graph
Friendship graph
Protein interaction graph
Knowledge Discovery (KDD) Process
Data mining: core of the knowledge discovery process
[Figure: KDD process stages, in order: Data Cleaning and Integration, Data Warehouse, Selection and Transformation, Data Mining, Pattern Evaluation]
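The stages of the KDD process can be sketched as a chain of small functions. Everything here is a toy assumption: the (customer, item) records, the function names, and the use of simple item counting as the mining step.

```python
from collections import Counter

def clean_and_integrate(*sources):
    """Merge records from several sources, dropping incomplete and duplicate ones."""
    seen, merged = set(), []
    for source in sources:
        for record in source:
            if None in record or record in seen:
                continue
            seen.add(record)
            merged.append(record)
    return merged

def select_and_transform(records):
    """Keep only the field to be mined (here: the purchased item)."""
    return [item for _customer, item in records]

def mine(items):
    """The data mining step: count how often each item occurs."""
    return Counter(items)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet an interestingness threshold."""
    return {p: c for p, c in patterns.items() if c >= min_count}

# Hypothetical (customer, item) records from two source databases.
store_a = [("c1", "milk"), ("c2", "bread"), ("c3", None)]
store_b = [("c2", "bread"), ("c4", "milk"), ("c5", "beer")]

warehouse = clean_and_integrate(store_a, store_b)
result = evaluate(mine(select_and_transform(warehouse)), min_count=2)
print(result)
```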
Find All and Only Interesting Patterns?
Find all the interesting patterns: Completeness
Can a data mining system find all the interesting patterns? Do
we need to find all of the interesting patterns?
Heuristic vs. exhaustive search
Search for only interesting patterns: a challenging
optimization problem
Can a data mining system find only the interesting patterns?
Approaches
First generate all the patterns and then filter out the uninteresting
ones
Guide and constrain the discovery process
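The two approaches can be contrasted on frequent itemsets, taking "interesting" to mean "occurs in at least MIN_COUNT transactions". The first function generates every itemset and filters afterwards; the second uses Apriori-style level-wise pruning, a technique chosen here only as an example of constraining the search. Transactions and names are hypothetical.

```python
from itertools import combinations

# Hypothetical transactions and threshold.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
MIN_COUNT = 2

def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def generate_then_filter(transactions, min_count):
    """Approach 1: enumerate all itemsets, then drop the infrequent ones."""
    items = sorted(set().union(*transactions))
    candidates = [frozenset(c)
                  for r in range(1, len(items) + 1)
                  for c in combinations(items, r)]
    return {c for c in candidates if count(c, transactions) >= min_count}

def constrained_search(transactions, min_count):
    """Approach 2: grow itemsets level by level, extending only frequent ones."""
    items = sorted(set().union(*transactions))
    level = {frozenset([i]) for i in items
             if count(frozenset([i]), transactions) >= min_count}
    frequent = set()
    while level:
        frequent |= level
        size = len(next(iter(level))) + 1
        # Join frequent itemsets of the current size into candidates one larger.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        level = {c for c in candidates if count(c, transactions) >= min_count}
    return frequent

print(sorted(sorted(s) for s in constrained_search(transactions, MIN_COUNT)))
```

Both functions return the same itemsets on this data, but the first examines all 2^n - 1 candidates while the second never extends an itemset that is already infrequent.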
Research Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse
data types, e.g., Web, graph, bio, stream
Performance: efficiency, effectiveness, and
scalability
Parallel, distributed and incremental mining methods
Pattern evaluation: the interestingness problem
Handling noise and incomplete data
Incorporation of background knowledge
Research Issues in Data Mining
User interaction
Data mining query languages
Expression and visualization of data mining results
Applications and social impacts
Domain-specific data mining
Protection of data security, integrity, and privacy
Important Resources