Data Mining: Ying Liu, Prof., PH.D

This document outlines a syllabus for a data mining course taught by Professor Ying Liu. The course covers topics such as data preprocessing, association rule mining, classification, clustering, sequence mining, and applications of data mining. Students will complete assignments, a course project involving developing an algorithm, and will be evaluated based on their attendance, assignments, and project. The goal is to introduce students to principles and algorithms of data mining and enhance their independent research capabilities.

Uploaded by

Hiểu Nguyễn

Data Mining

Ying Liu, Prof., Ph.D

School of Computer and Control

University of Chinese Academy of Sciences
The Key Lab of Big Data Mining and Knowledge Management
Welcome

Instructor: Ying Liu

Computer Engineering, Ph.D
Northwestern University, USA
Research interests
data mining, high performance computing, etc.
Email: [email protected]

2017/9/19 2
Useful Information

Class: Tuesday 8:30 - 12:00, S104

Textbook and References
Textbook
Data Mining: Concepts and Techniques.
Jiawei Han and Micheline Kamber,
Morgan Kaufmann, 2011.

References
Research papers. To be announced in
class.

Prerequisites

Data Structure
Algorithm
Database
Programming: C/C++ (preferred), Java

A Mini Survey

How many of you majored in computer science?
How many of you have taken a machine learning course before?
How many of you have taken a statistics course before?
How many of you have taken a database course before?

Grading Scheme
Assignments (40%)
2 homework assignments
Course Project (50%)
One group project (2-3 students per group)
Develop an algorithm and hand in a project report
Present in class
Evaluated on technical innovation, performance, thoroughness of the work, and clarity of presentation
Attendance (10%)
About the Project

Choose a topic from a list of selected topics

Read through some related research papers
and fully understand them
Implement and experimentally evaluate the
major method
Identify pros and cons
Improve the method's effectiveness or efficiency, then implement and experimentally evaluate your improvement (a plus)
Write a technical report
How to Do a Good Project?

Start early
It takes time to understand and think
Discuss with me
Maybe I can give some suggestions or ideas
Implement concretely
Understand the pros and cons
Think creatively

Why Take This Course?
Data mining is a hot topic
It solves many interesting problems in real applications, e.g., business management, market analysis, and scientific exploration
It turns raw data into knowledge
It is promising for research in many disciplines
Job market for data miners: many well-paid positions

Data Mining is very useful!

Syllabus (Tentative)
Introduction
Data warehouse
Data pre-processing
Association Rules
Classification
Clustering
Sequence Mining
Applications
Big Data Mining
Project Discussion & Demo
Objectives of This Course

Introduce the motivation of data mining

Outline principles, major algorithms
Introduce applications
Introduce advanced topics
Enhance independent research capability

Policies

Students are expected to attend all classes

No late homework will be accepted
All work must be your own effort (individual assignments) or that of your approved team (group assignments)

No Plagiarism!

What Motivated Data Mining?
The explosive growth of data
Data collection and data availability
Computer hardware and software have developed dramatically
The amount of data collected and stored doubles or triples each year, whereas CPU speed increased by 15% per year (until 2003)

Many types of databases

Object-oriented, spatial, time-series, text,
multimedia, Web

What Motivated Data Mining
Business World
Tremendous amounts of data are being collected and stored
E-commerce
Transactions
Stocks
Credit card transactions
Strong competitive pressure to extract and
use the knowledge hidden in the data to
provide customized CRM

What Motivated Data Mining
Scientific World
Tremendous amounts of data are being collected and stored
Remote sensing
Bioinformatics (Microarrays)
Scientific simulation
Scientists need strong data analysis
to assist research, such as
classification, segmentation, etc.

What Motivated Data Mining?

We are drowning in data, but starving for knowledge!
Data rich, knowledge poor
Decision makers and domain experts have biases and make errors
Automated analysis of massive data sets is needed

What is Data Mining?

Data mining: discover valid, novel, useful, and understandable patterns in massive datasets

What is Data Mining?
Cross Disciplines
Databases
Machine learning: decision tree, Bayesian classifier,
etc.
Statistics: regression, etc.
Neural networks

Why Not Traditional Data Analysis?
Tremendous amount of data
Algorithms must be highly scalable to handle data at the scale of terabytes

High-dimensionality of data
DNA sequences may have tens of
thousands of dimensions

Why Not Traditional Data Analysis?
High complexity of data
Data streams and sensor data
Time-series data, sequence data
Graphs, social networks
Spatial, temporal, multimedia, text
and Web data
New and sophisticated
applications

Why Not Traditional Data Analysis?
Database
Storage-oriented
Provides simple queries
Data mining, by contrast: discovers knowledge from data in databases
Data warehouse
Subject-oriented
A multidimensional view of data
Operations to access summarized data
Data mining, by contrast: advanced data analysis tools
Statistical algorithms
Based on many hypotheses
Find patterns in a small number of samples
Data mining, by contrast: fewer hypotheses; finds patterns in a large number of samples; abnormal patterns
Characteristics of Data Mining

Massive dataset
Automatically searching for interesting
patterns from historical data
Fast
Scalable
Update easily
Practical
Decision support

Exercises

1. Could you present an application of data mining in the business domain?

2. Could you present an application of data mining in the scientific domain?

What Kinds of Patterns?
Association rules
Detect sets of attributes or items that frequently co-occur in many database records, and the rules among them

On Thursdays, between 4 and 11 pm, customers often purchase diapers and beer together!
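
A rule such as "diapers, then beer" is usually judged by how often the items co-occur. The sketch below computes two common measures, support and confidence, on a made-up list of baskets. The data and the function names are assumptions for illustration, not course material.

```python
# Toy baskets, invented for illustration (not data from the course).
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

# Rule: diapers -> beer
rule_support = support({"diapers", "beer"}, transactions)          # 2 of 5 baskets
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)  # 2 of 3 diaper baskets
```

Here the rule holds in 2 of the 5 baskets, and in 2 of the 3 baskets that contain diapers.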

Ex. 1: Market Basket Analysis and Management
Where does the data come from?
supermarket transactions, membership cards,
discount coupons, customer complaint calls
Cross-marketing analysis
What products were often purchased together?
Purchase recommendation, cross selling
What are the subsequent purchases after buying a
given product?
Target-marketing
What types of customers buy what products?
Catalog design

What Kinds of Patterns?
Classification
Build a model of the classes from a training dataset, then assign each new record to one of several predefined classes

Decision tree
Income > $40K?
  Yes: Debt < 10% of Income?
    Yes: Good credit risk
    No: Bad credit risk
  No: Debt = 0?
    Yes: Good credit risk
    No: Bad credit risk

Rule 1: if (Income <= $40K) and (Debt = 0) then good
Rule 2: if (Income > $40K) and (Debt < 10% of Income) then good
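
The tree and its two rules can be read as nested conditions. The sketch below hard-codes that tree as a function. The name `credit_risk` and its numeric arguments are assumptions for illustration, and a real classifier would learn the thresholds from training data rather than have them written in by hand.

```python
def credit_risk(income, debt):
    """Classify an applicant by walking the tree from the root to a leaf.

    `income` and `debt` are plain numbers in the same currency unit.
    """
    if income > 40_000:
        # Higher-income branch: debt below 10% of income is rated good.
        return "good" if debt < 0.10 * income else "bad"
    # Lower-income branch: only debt-free applicants are rated good.
    return "good" if debt == 0 else "bad"
```

For example, `credit_risk(50_000, 4_000)` follows rule 2 and `credit_risk(30_000, 0)` follows rule 1. Every other path ends in the bad-risk leaf.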

Ex. 2: Credit Scoring
Where does the data come from?
credit card transactions, credit card payments,
loan payments, demographic data
Predict the probability of bankruptcy or charge-off
Reduce the credit risk to the banks
Increase the profitability of the banks

What Kinds of Patterns?

Clustering
Partition the dataset into groups such that elements within a group are highly similar to one another (high intra-group similarity) and dissimilar to elements of other groups (low inter-group similarity)
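
One common way to obtain such a partition is k-means, which the slide does not name. It is used here only as a familiar example. The sketch runs it on made-up one-dimensional points with hand-picked starting centers, both of which are assumptions.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd-style k-means on 1-D data, starting from the given centers.

    Returns the final centers and the groups from the last assignment step.
    """
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its old center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, groups = kmeans_1d(points, centers=[0.0, 5.0])
```

The points fall into two tight groups that are far apart from each other, which is the high intra-group and low inter-group similarity described above.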

Ex. 3: Scientific Simulation
Cosmological simulation
Simulate the formation of the galaxy
An enormous number of particles at each evolution stage, beyond the capability of human beings to analyze

What Kinds of Patterns?
Frequent sequence
Given a set of sequences, find the complete set of
frequent subsequences

Buy a PC -> buy an ink printer -> buy ink cartridges -> buy a new CPU (in time order)

Marketing strategy: recommend a new CPU to the customer 9 months after the first purchase
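
Counting how many customers follow a purchase pattern comes down to a subsequence test. The sketch below is a simplified version in which each step is a single item. The purchase histories and function names are invented for illustration.

```python
def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in `sequence` in order, with gaps allowed."""
    it = iter(sequence)
    # `item in it` advances the iterator, so later items must come later.
    return all(item in it for item in pattern)

def sequence_support(pattern, sequences):
    """Number of sequences that contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sequences)

# Hypothetical purchase histories, one list per customer, oldest first.
histories = [
    ["PC", "printer", "ink cartridge", "CPU"],
    ["PC", "ink cartridge", "CPU"],
    ["printer", "PC", "CPU"],
    ["PC", "printer"],
]

pc_then_cpu = sequence_support(["PC", "CPU"], histories)
printer_then_pc = sequence_support(["printer", "PC"], histories)
```

A pattern is called frequent when its support reaches a chosen minimum threshold. Finding the complete set of frequent subsequences efficiently takes more than this brute-force count.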

What Kinds of Patterns?
Outliers/Anomalies
Given a set of n objects, and k, the number of
expected anomalies, find the top k objects that are
considerably dissimilar or inconsistent with the
remaining data

Anomalies may be valuable!
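
A minimal sketch of the top-k formulation, assuming one-dimensional data and using mean absolute distance to all other objects as the dissimilarity score. The score, the readings, and the function name are assumptions chosen for brevity. Other scores are possible.

```python
def top_k_outliers(values, k):
    """Return the k values with the largest mean absolute distance to the rest."""
    def mean_distance(i):
        others = values[:i] + values[i + 1:]
        return sum(abs(values[i] - v) for v in others) / len(others)

    # Rank indices from most to least dissimilar, then keep the first k.
    ranked = sorted(range(len(values)), key=mean_distance, reverse=True)
    return [values[i] for i in ranked[:k]]

readings = [10.1, 9.8, 10.3, 55.0, 10.0, -20.0, 9.9]
outliers = top_k_outliers(readings, k=2)
```

This compares every pair of objects, so it is only practical for small n.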

Exercises
1. Please present an example where data mining is crucial to the success of a business. What data mining techniques does the business use (what kinds of patterns are mined)?

On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional
database
Advanced database applications
Data streams
Spatial data
Text database
Multimedia data
Time-series
Bio-medical data
Network traffic data
Relational Databases
Structured data
Tables, records, attributes
Accessed by queries, SQL
Online transactional processing (OLTP)
Insert a student, Ying Liu, into the class Introduction to Data Mining, fall 2014

Name      Time         Course                       Score  Room
Ying Liu  Fall 2014    Introduction to Data Mining  90     002
Tom       Fall 2014    Math                         85     001
Merlisa   Spring 2014  Compiler                     70     001
George    Fall 2014    Graphics                     92     001
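
The insert above is a typical OLTP operation: a short transaction that touches a single record. A minimal sketch with Python's built-in `sqlite3` module follows. The table name `enrollment` and its column names are assumptions, and the rows are the ones in the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE enrollment (name TEXT, term TEXT, course TEXT, score INTEGER, room TEXT)"
)
# Rows that are already present.
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?, ?, ?)",
    [
        ("Tom", "Fall 2014", "Math", 85, "001"),
        ("Merlisa", "Spring 2014", "Compiler", 70, "001"),
        ("George", "Fall 2014", "Graphics", 92, "001"),
    ],
)
# The OLTP operation: one short transaction that inserts a single record.
with conn:
    conn.execute(
        "INSERT INTO enrollment VALUES (?, ?, ?, ?, ?)",
        ("Ying Liu", "Fall 2014", "Introduction to Data Mining", 90, "002"),
    )
# A simple query of the kind a relational database answers directly.
rows = conn.execute(
    "SELECT name, score FROM enrollment WHERE term = ? ORDER BY name", ("Fall 2014",)
).fetchall()
```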
Data Warehouses
A subject-oriented, integrated, cleaned collection of data in support of management's decision-making process
Data from multiple databases
Consistency checking in data warehouses
Data warehouses can answer OLAP queries
efficiently
Online analytical processing (OLAP)
Find the average class score of Ying Liu in the last 3 years,
grouped by semesters
Many patterns are summarizations of data
Roll-up, drill-down
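
The OLAP query above is an aggregate grouped along one dimension. The sketch below expresses it as SQL run through Python's `sqlite3` module on a few made-up rows. The table and column names are assumptions, and the three-year date filter is omitted to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, semester TEXT, course TEXT, score REAL)")
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?, ?)",
    [
        ("Ying Liu", "Fall 2013", "Databases", 80),
        ("Ying Liu", "Fall 2013", "Algorithms", 90),
        ("Ying Liu", "Fall 2014", "Introduction to Data Mining", 90),
        ("Tom", "Fall 2014", "Math", 85),
    ],
)
# Summarize one student's scores, grouped by semester.
averages = conn.execute(
    "SELECT semester, AVG(score) FROM scores WHERE name = ? "
    "GROUP BY semester ORDER BY semester",
    ("Ying Liu",),
).fetchall()
```

Grouping by a coarser unit such as year would be a roll-up. Grouping by course within each semester would be a drill-down.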
Transactional Databases
I = {x1, ..., xn} is the set of items
An itemset is a subset of I
A transaction is a tuple (tid, X)
Transaction ID tid
Itemset X
A transactional database is a set of transactions

Tid Itemset
T100 Milk, bread, beer, diaper
T200 Beer, cook, fish, potato, orange, apple
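
These definitions map directly onto sets and tuples. The sketch below stores each transaction as a (tid, X) pair and counts how many transactions contain each itemset of a given size. T100 and T200 are the rows above. T300 and the function name are assumptions added so that some itemset occurs more than once.

```python
from collections import Counter
from itertools import combinations

# (tid, X) pairs; T100 and T200 come from the table above, T300 is made up.
database = [
    ("T100", {"milk", "bread", "beer", "diaper"}),
    ("T200", {"beer", "cook", "fish", "potato", "orange", "apple"}),
    ("T300", {"milk", "bread", "apple"}),
]

def count_itemsets(database, size):
    """Count, for every itemset of the given size, the transactions containing it."""
    counts = Counter()
    for _tid, items in database:
        # Sorting gives each itemset one canonical tuple form.
        for itemset in combinations(sorted(items), size):
            counts[itemset] += 1
    return counts

pair_counts = count_itemsets(database, size=2)
frequent_pairs = {pair for pair, n in pair_counts.items() if n >= 2}
```

Enumerating every subset this way grows very quickly with the number of items per transaction, so it only suits small examples.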

Spatial Data
Spatial information
Geographic databases (map)
VLSI chip design databases
Satellite/remote sensing image
databases
Medical image database
Spatial patterns
Find characteristics of homes
near a given location
Change in trend of
metropolitan poverty rates
based on distances from major
highways
Time Series
A sequence of values that change over time
Stock prices sampled every 5 minutes
Daily temperature
Power supply
Electrocardiogram
Typical operations
Similarity search
Trend analysis
Periodic pattern discovery
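
Trend analysis, one of the typical operations listed above, can start with something as simple as a moving average. The sketch below smooths a short series of made-up daily temperatures. The window size and the data are assumptions.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values (a simple trend smoother)."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical daily temperatures in degrees Celsius.
temps = [20, 22, 21, 25, 27, 26, 30]
trend = moving_average(temps, window=3)
```

The smoothed series rises steadily even though the raw readings go up and down from day to day.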

Text Databases & Multimedia Databases

HTML web documents

XML documents
Digital libraries
Annotated multimedia databases
Image, audio and video data
Typical operations
Similarity-based pattern matching
Image classification

Data Streams
Data arrive continuously in multiple, rapid, time-varying, possibly unpredictable and unbounded streams
Characteristics: dynamically changing patterns, high volume, potentially infinite length, quick response required, no re-scan
Many applications
Stock exchanges, network monitoring, telecommunications data management, web applications, sensor networks, etc.
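
Because a stream cannot be re-scanned, summaries have to be updated one element at a time in constant memory. The sketch below keeps a running mean and variance with Welford's online update. The class name and the sample values are assumptions for illustration.

```python
class RunningStats:
    """Mean and variance maintained in one pass, with O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new stream element into the summary."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.n if self.n else 0.0

stats = RunningStats()
for tick in [10.0, 12.0, 11.0, 13.0]:  # stand-in for an unbounded stream
    stats.update(tick)
```

Each element is seen once and then discarded. The summary is available at any moment, which suits the quick-response requirement.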

Biomedical Data
Bio-sequences
DNA: very long sequences of nucleotides
Similarity search
Identify sequential patterns that play roles in
various diseases
Association analysis: co-occurring gene
sequences

World-Wide Web
The WWW is a huge, widely distributed, global information service center for
Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
Hyper-link information
Access and usage information
WWW provides rich sources for data mining
Challenges
Too huge for effective data warehousing and data mining
Too complex and heterogeneous: no standards and structure

World-Wide Web
Web Usage: logs and IP packet header streams
Mine Web log records to discover users' access patterns for Web pages
Web Content
Extract knowledge from Web documents, automatic categorization
Web Structure
Identify interesting graph patterns among different Web pages

Graph
Examples: Internet graph, citation graph, friendship graph, protein interaction graph
Knowledge Discovery (KDD) Process
Data mining: the core of the knowledge discovery process
Steps, from raw data to knowledge:
Databases and flat files -> Data cleaning and integration -> Data warehouse -> Selection and transformation -> Data mining -> Pattern evaluation
Key Steps in KDD Process
Learning the application domain
relevant prior knowledge and goals of application
Creating a target data resource
Data cleaning and preprocessing (may take 60% of the effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant
representation
Choosing the mining algorithm(s) to search for patterns of
interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
Are All the Discovered Patterns Interesting?
Data mining may generate thousands of patterns; not all of them are interesting
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on
new or test data with some degree of certainty, potentially useful, novel,
or validates some hypothesis that a user seeks to confirm

Objective vs. subjective interestingness measures

Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.

Find All and Only Interesting Patterns?
Find all the interesting patterns: Completeness
Can a data mining system find all the interesting patterns? Do
we need to find all of the interesting patterns?
Heuristic vs. exhaustive search
Search for only the interesting patterns: an optimization problem, and a challenging one
Can a data mining system find only the interesting patterns?
Approaches
First generate all the patterns and then filter out the uninteresting
ones
Guide and constrain the discovery process

Research Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse
data types, e.g., Web, graph, bio, stream
Performance: efficiency, effectiveness, and
scalability
Parallel, distributed and incremental mining methods
Pattern evaluation: the interestingness problem
Handling noise and incomplete data
Incorporation of background knowledge

Research Issues in Data Mining
User interaction
Data mining query languages
Expression and visualization of data mining results
Applications and social impacts
Domain-specific data mining
Protection of data security, integrity, and privacy

Important Resources

Data mining conferences

ACM SIGKDD, IEEE ICDM, SIAM DM, PKDD,
PAKDD
Database conferences
ACM SIGMOD, VLDB, ACM PODS, IEEE ICDE,
EDBT, ICDT
Important journals
Data Mining and Knowledge Discovery
IEEE Transactions on Knowledge and Data
Engineering
Knowledge and Information Systems