
Data Mining: Ying Liu, Prof., PH.D

This document outlines a syllabus for a data mining course taught by Professor Ying Liu. The course covers topics such as data preprocessing, association rule mining, classification, clustering, sequence mining, and applications of data mining. Students will complete assignments, a course project involving developing an algorithm, and will be evaluated based on their attendance, assignments, and project. The goal is to introduce students to principles and algorithms of data mining and enhance their independent research capabilities.

Uploaded by

Hiểu Nguyễn

Data Mining

Ying Liu, Prof., Ph.D

School of Computer and Control


University of Chinese Academy of Sciences
The Key Lab of Big Data Mining and Knowledge Management
Welcome

Instructor: Ying Liu


Computer Engineering, Ph.D
Northwestern University, USA
Research interests
data mining, high performance computing, etc.
Email: [email protected]

2017/9/19 2
Useful Information

Class: Tuesday 8:30 - 12:00, S104

Textbook and References
Textbook
Data Mining, Concepts and Techniques.
Jiawei Han and Micheline Kamber,
Morgan Kaufmann, 2011.

References
Research papers. To be announced in
class.

Prerequisites

Data Structure
Algorithm
Database
Programming: C/C++ (preferred), Java

A Mini Survey

How many people majored in computer science?
How many people have taken machine learning courses before?
How many people have taken statistics courses before?
How many people have taken database courses before?

Grading Scheme
Assignments (40%)
2 homework assignments
Course Project (50%)
One group project (2-3 students per group)
Develop an algorithm and hand in a project report
Present in class
Evaluated on technical innovation, performance,
thoroughness of the work, and clarity of
presentation
Attendance (10%)
About the Project

Choose a topic from a list of selected topics


Read through some related research papers
and fully understand them
Implement and experimentally evaluate the
major method
Identify pros and cons
Improve the method in effectiveness or
efficiency, implement and experimentally
evaluate your improvement (plus)
Write a technical report
How to Do a Good Project?

Start early
It takes time to understand and think
Discuss with me
Maybe I can give some suggestions or ideas
Implement concretely
Understand the pros and cons
Think creatively

Why Take This Course?
Data mining is hot
Solve many interesting problems in real
applications, e.g. business management, market
analysis, science exploration
Turn raw data into knowledge
Promising in research of many disciplines
Data miner job market: many well-paid positions

Data Mining is very useful!

Syllabus (Tentative)
Introduction
Data warehouse
Data pre-processing
Association Rules
Classification
Clustering
Sequence Mining
Applications
Big Data Mining
Project Discussion & Demo
Objectives of This Course

Introduce the motivation of data mining


Outline principles, major algorithms
Introduce applications
Introduce advanced topics
Enhance independent research capability

Policies

Students are expected to attend all classes


No late homework will be accepted
All work must be efforts of your own
(individual assignment) or of your approved
team (group assignment)

No Plagiarism!

What Motivated Data Mining?
The explosive growth of data
Data collection and data availability
Computer hardware and software have developed dramatically
The amount of data collected and stored doubles or triples
per year, whereas CPU speed increased 15% per year (until
2003)

Many types of databases


Object-oriented, spatial, time-series, text,
multimedia, Web

What Motivated Data Mining?
Business World
Tremendous amounts of data are being collected and
stored
E-commerce
Transactions
Stocks
Credit card transactions
Strong competitive pressure to extract and
use the knowledge hidden in the data to
provide customized CRM

What Motivated Data Mining?
Scientific World
Tremendous amounts of data are being collected
and stored
Remote sensing
Bioinformatics (Microarrays)
Scientific simulation
Scientists need strong data analysis
to assist research, such as
classification, segmentation, etc.

What Motivated Data Mining?

We are drowning in data, but starving for knowledge!
Data rich, knowledge poor
Decision makers and domain experts have biases and make
errors
Automated analysis of massive data sets

What is Data Mining?

Data mining: discover valid, novel, useful,
and understandable patterns in massive
datasets

What is Data Mining?
Cross Disciplines
Databases
Machine learning: decision tree, Bayesian classifier,
etc.
Statistics: regression, etc.
Neural networks

Why Not Traditional Data Analysis?
Tremendous amount of data
Algorithms must be highly scalable
to handle data volumes such as terabytes

High-dimensionality of data
DNA sequences may have tens of
thousands of dimensions

Why Not Traditional Data Analysis?
High complexity of data
Data streams and sensor data
Time-series data, sequence data
Graphs, social networks
Spatial, temporal, multimedia, text
and Web data
New and sophisticated
applications

Why Not Traditional Data Analysis?
Database vs. data mining
Database: storage-oriented; provides simple queries
Data mining: discovers knowledge from data in databases
Data warehouse vs. data mining
Data warehouse: subject-oriented; a multidimensional view of data; operations to access summarized data
Data mining: advanced data analysis tools
Statistical algorithms vs. data mining
Statistical algorithms: based on many hypotheses; find patterns in a small number of samples
Data mining: fewer hypotheses; finds patterns in a large number of samples; abnormal patterns
Characteristics of Data Mining

Massive dataset
Automatically searching for interesting
patterns from historical data
Fast
Scalable
Update easily
Practical
Decision support

Exercises

1. Could you present an application of data mining in the business domain?

2. Could you present an application of data mining in the scientific domain?

What Kinds of Patterns?
Association rules
Detect sets of attributes or items that frequently co-occur
in many database records, and the rules among them

On Thursdays between 4 and 11 pm, customers
often purchase diapers and beer together!
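
As an illustration of how such a rule is usually scored, here is a minimal Python sketch that computes the support and confidence of "diapers -> beer". The transactions are made-up toy data, and the function names `support` and `confidence` are chosen for this example only; they do not come from the lecture.

```python
# Toy transactions (hypothetical data, not from the lecture).
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Support of {diapers, beer}: 2 of the 5 transactions contain both items.
print(support({"diapers", "beer"}, transactions))
# Confidence of diapers -> beer: 2 of the 3 diaper transactions also have beer.
print(confidence({"diapers"}, {"beer"}, transactions))
```

A rule is typically reported only when both values exceed user-chosen thresholds.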

Ex. 1: Market Basket Analysis and
Management
Where does the data come from?
supermarket transactions, membership cards,
discount coupons, customer complaint calls
Cross-marketing analysis
What products were often purchased together?
Purchase recommendation, cross selling
What are the subsequent purchases after buying a
given product?
Target-marketing
What types of customers buy what products
Catalog design

What Kinds of Patterns?
Classification
Build a model of the classes from a training dataset,
then assign each new record to one of several
predefined classes

Decision Tree
Income > $40K?
    Yes: Debt < 10% of Income?
        Yes: Good Credit Risks
        No: Bad Credit Risks
    No: Debt = 0%?
        Yes: Good Credit Risks
        No: Bad Credit Risks

Rule 1: if (Income <= $40K) and (Debt = 0) then good
Rule 2: if (Income > $40K) and (Debt < 10% of Income) then good
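
The decision tree on this slide can be written directly as code. The sketch below assumes the tree reads as follows: the root tests Income > $40K; the "yes" branch tests Debt < 10% of Income; the "no" branch tests Debt = 0. The function name `credit_risk` and the numeric inputs are hypothetical. A real classifier would learn these thresholds from training data instead of hard-coding them.

```python
def credit_risk(income, debt):
    """Classify an applicant as 'good' or 'bad' by walking the slide's tree.

    `income` and `debt` are amounts in dollars (an assumption of this sketch).
    """
    if income > 40_000:
        # Rule 2: higher income is good when debt is under 10% of income.
        return "good" if debt < 0.10 * income else "bad"
    # Rule 1: lower income is good only with zero debt.
    return "good" if debt == 0 else "bad"


print(credit_risk(50_000, 4_000))  # Income > 40K, debt below 10% of income
print(credit_risk(30_000, 100))    # Income <= 40K, non-zero debt
```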

Ex. 2: Credit Scoring
Where does the data come from?
credit card transactions, credit card payments,
loan payments, demographic data
Predict the probability of bankruptcy or charge-off
Reduce the credit risk to the banks
Increase the profitability of the banks

What Kinds of Patterns?

Clustering
Partition the dataset into groups such that elements
in the same group are highly similar (high intra-group similarity)
and elements in different groups are dissimilar (low inter-group similarity)
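
To make the two similarity notions concrete, the following sketch measures a given grouping by its average distance within groups and between groups. Distance is used here as the inverse of similarity. The points, the group labels and the helper names are assumptions made for this example; this code evaluates a partition and is not a clustering algorithm itself.

```python
from itertools import combinations
from math import dist

# A hypothetical partition of six 2-D points into two groups.
clusters = {
    "A": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)],
}


def mean_intra_distance(clusters):
    """Average distance between pairs of points in the same group."""
    d = [dist(p, q)
         for points in clusters.values()
         for p, q in combinations(points, 2)]
    return sum(d) / len(d)


def mean_inter_distance(clusters):
    """Average distance between pairs of points in different groups."""
    d = [dist(p, q)
         for (_, a), (_, b) in combinations(clusters.items(), 2)
         for p in a
         for q in b]
    return sum(d) / len(d)


# A good partition has a small intra-group and a large inter-group distance.
print(mean_intra_distance(clusters), mean_inter_distance(clusters))
```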

Ex. 3: Scientific Simulation
Cosmological simulation
Simulate the formation of the galaxy
An enormous number of particles at each evolution stage,
beyond the capability of human beings to analyze

What Kinds of Patterns?
Frequent sequence
Given a set of sequences, find the complete set of
frequent subsequences

Buy a PC -> Buy an ink printer -> Buy an ink cartridge -> Buy a new CPU (ordered by time)

Marketing strategy: recommend a new CPU
to the customer 9 months after the first purchase
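
The core test behind frequent-sequence mining is whether a pattern occurs, in order but not necessarily contiguously, inside a customer's purchase sequence. The sketch below counts how many sequences contain a pattern. The purchase histories and function names are hypothetical. A complete miner would also enumerate candidate patterns, which is omitted here.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in the same order."""
    it = iter(sequence)
    # `item in it` advances the iterator, so the order is respected.
    return all(item in it for item in pattern)


def sequence_support(pattern, sequences):
    """Fraction of sequences that contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sequences) / len(sequences)


# Hypothetical purchase histories, each ordered by time.
purchases = [
    ["PC", "printer", "ink cartridge", "CPU"],
    ["PC", "ink cartridge", "CPU"],
    ["printer", "PC", "mouse"],
    ["PC", "monitor", "CPU"],
]

print(sequence_support(["PC", "CPU"], purchases))      # 3 of 4 sequences
print(sequence_support(["PC", "printer"], purchases))  # 1 of 4 sequences
```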

What Kinds of Patterns?
Outliers/Anomalies
Given a set of n objects, and k, the number of
expected anomalies, find the top k objects that are
considerably dissimilar or inconsistent with the
remaining data

Anomalies may be valuable!
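
One simple way to realize the "top k dissimilar objects" definition is to score each object by its average distance to all the others and keep the k highest scores. This scoring rule is an assumption chosen for illustration, since the slide does not prescribe one. The sensor readings and the name `top_k_outliers` are also made up.

```python
def top_k_outliers(values, k):
    """Return the k values with the largest mean distance to all other values."""
    def score(i):
        others = (abs(values[i] - v) for j, v in enumerate(values) if j != i)
        return sum(others) / (len(values) - 1)

    ranked = sorted(range(len(values)), key=score, reverse=True)
    return [values[i] for i in ranked[:k]]


# Hypothetical sensor readings with two obvious anomalies.
readings = [10.1, 9.8, 10.3, 55.0, 10.0, 9.9, -20.0]
print(top_k_outliers(readings, 2))
```

This brute-force scoring is quadratic in n, so it only suits small datasets.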


Exercises
1. Please present an example where data
mining is crucial to the success of a
business. What data mining techniques does
the business use (what kinds of patterns
are mined)?

On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional
database
Advanced database applications
Data streams
Spatial data
Text database
Multimedia data
Time-series
Bio-medical data
Network traffic data
Relational Databases
Structured data
Tables, records, attributes
Accessed by queries, SQL
Online transactional processing (OLTP)
Example: insert the student Ying Liu into the class Introduction
to Data Mining, Fall 2014
Name      Time         Course                       Score  Room
Ying Liu  Fall 2014    Introduction to Data Mining  90     002
Tom       Fall 2014    Math                         85     001
Merlisa   Spring 2014  Compiler                     70     001
George    Fall 2014    Graphics                     92     001
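
The OLTP example above can be tried with Python's built-in `sqlite3` module. In this sketch the table name `enrollment` and the column names are assumptions, and the rows are the ones shown on the slide. The insert runs inside a transaction, which is the typical unit of work in OLTP.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE enrollment "
    "(name TEXT, term TEXT, course TEXT, score INTEGER, room TEXT)"
)
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?, ?, ?)",
    [
        ("Tom", "Fall 2014", "Math", 85, "001"),
        ("Merlisa", "Spring 2014", "Compiler", 70, "001"),
        ("George", "Fall 2014", "Graphics", 92, "001"),
    ],
)

# OLTP-style operation: one short transaction that inserts a single record.
with conn:
    conn.execute(
        "INSERT INTO enrollment VALUES (?, ?, ?, ?, ?)",
        ("Ying Liu", "Fall 2014", "Introduction to Data Mining", 90, "002"),
    )

# A simple lookup query over the structured records.
result = conn.execute(
    "SELECT name, score FROM enrollment WHERE course = ?",
    ("Introduction to Data Mining",),
).fetchall()
print(result)
```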
Data Warehouses
A subject-oriented, integrated, cleaned collection of
data in support of management's decision-making
process
Data from multiple databases
Consistency checking in data warehouses
Data warehouses can answer OLAP queries
efficiently
Online analytical processing (OLAP)
Find the average class score of Ying Liu over the last 3 years,
grouped by semester
Many patterns are summarizations of data
Roll-up, drill-down
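
The OLAP query above is an aggregation, and roll-up simply aggregates at a coarser level. The sketch below uses `sqlite3` as a stand-in for a data warehouse. The table `scores`, its columns and every row are invented for illustration, and "last 3 years" is approximated by the assumed cutoff `year >= 2012`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scores "
    "(name TEXT, year INTEGER, semester TEXT, course TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?, ?, ?)",
    [  # hypothetical rows
        ("Ying Liu", 2014, "Fall", "Data Mining", 90),
        ("Ying Liu", 2014, "Fall", "Algorithms", 80),
        ("Ying Liu", 2014, "Spring", "Compiler", 70),
        ("Ying Liu", 2013, "Fall", "Database", 88),
        ("Tom", 2014, "Fall", "Math", 85),
    ],
)

# Drill-down level: average score per (year, semester).
per_semester = conn.execute(
    "SELECT year, semester, AVG(score) FROM scores "
    "WHERE name = ? AND year >= ? "
    "GROUP BY year, semester ORDER BY year, semester",
    ("Ying Liu", 2012),
).fetchall()

# Roll-up: the same measure summarized per year only.
per_year = conn.execute(
    "SELECT year, AVG(score) FROM scores "
    "WHERE name = ? AND year >= ? GROUP BY year ORDER BY year",
    ("Ying Liu", 2012),
).fetchall()

print(per_semester)
print(per_year)
```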
Transactional Databases
I = {x1, ..., xn} is the set of items
An itemset is a subset of I
A transaction is a tuple (tid, X)
Transaction ID tid
Itemset X
A transactional database is a set of transactions

Tid Itemset
T100 Milk, bread, beer, diaper
T200 Beer, cook, fish, potato, orange, apple
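
These definitions map directly onto a small data structure. In the sketch below a transaction is a `(tid, itemset)` tuple and the database is a list of transactions. The two rows are copied from the table above. The class name `Transaction` and the helper `covering_tids` are assumptions of this example.

```python
from typing import FrozenSet, NamedTuple


class Transaction(NamedTuple):
    tid: str
    itemset: FrozenSet[str]


# The transactional database from the table above.
database = [
    Transaction("T100", frozenset({"milk", "bread", "beer", "diaper"})),
    Transaction("T200", frozenset({"beer", "cook", "fish", "potato", "orange", "apple"})),
]

# I: the set of all items that occur in the database.
items = frozenset().union(*(t.itemset for t in database))


def covering_tids(itemset, database):
    """IDs of the transactions whose itemset contains `itemset`."""
    itemset = frozenset(itemset)
    return [t.tid for t in database if itemset <= t.itemset]


print(len(items))
print(covering_tids({"beer"}, database))
print(covering_tids({"beer", "milk"}, database))
```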

Spatial Data
Spatial information
Geographic databases (map)
VLSI chip design databases
Satellite/remote sensing image
databases
Medical image database
Spatial patterns
Find characteristics of homes
near a given location
Change in trend of
metropolitan poverty rates
based on distances from major
highways
Time Series
A sequence of values that change over time
Stock prices sampled every 5 minutes
Daily temperature
Power supply
Electrocardiogram
Typical operations
Similarity search
Trend analysis
Periodic pattern discovery
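
Two of these operations can be sketched in a few lines: a moving average as a basic form of trend analysis, and a brute-force similarity search that finds where a query pattern best matches the series under Euclidean distance. Both are simple illustrative choices, not methods named in the lecture, and the temperature values are made up.

```python
def moving_average(series, window):
    """Smooth `series` by averaging each run of `window` consecutive values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


def best_match(query, series):
    """Offset of the subsequence of `series` closest to `query`."""
    n = len(query)

    def squared_distance(offset):
        return sum((series[offset + i] - query[i]) ** 2 for i in range(n))

    return min(range(len(series) - n + 1), key=squared_distance)


temps = [20, 21, 23, 22, 25, 27, 26, 28]  # hypothetical daily temperatures
print(moving_average(temps, 3))
print(best_match([25, 27, 26], temps))
```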

Text Databases & Multimedia Databases

HTML web documents


XML documents
Digital libraries
Annotated multimedia databases
Image, audio and video data
Typical operations
Similarity-based pattern matching
Image classification

Data Streams
Data arrive continuously in
multiple, rapid, time-varying, possibly
unpredictable and unbounded streams
Dynamically changing patterns, high volume,
infinite, quick response, no re-scan
Many applications
Stock exchange, network monitoring,
telecommunications data management, web
application, sensor networks, etc.
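
The "no re-scan" constraint means each element can be inspected once and then discarded, so summaries must be maintained incrementally. As one illustration, the sketch below keeps a running mean and variance in constant memory using Welford's online update. The class name `RunningStats` and the sample stream are assumptions of this example.

```python
class RunningStats:
    """One-pass mean and population variance over a stream (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Consume one stream element; the element is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for an unbounded stream
    stats.update(value)
print(stats.mean, stats.variance)
```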

Biomedical Data
Bio-sequences
DNA: very long sequences of nucleotides
Similarity search
Identify sequential patterns that play roles in
various diseases
Association analysis: co-occurring gene
sequences

World-Wide Web
The WWW is a huge, widely distributed, global
information service center for
Information services: news, advertisements, consumer
information, financial management, education, government,
e-commerce, etc.
Hyper-link information
Access and usage information
WWW provides rich sources for data mining
Challenges
Too huge for effective data warehousing and data mining
Too complex and heterogeneous: no standards and structure

World-Wide Web
Web Usage: Logs and IP packet header streams
Mine Weblog records to discover user access patterns of
Web pages
Web Content
Extract knowledge from Web documents, automatic
categorization
Web Structure
Identify interesting graph patterns among different Web
pages

Graph
Internet graph
Citation graph
Friendship graph
Protein interaction graph

Knowledge Discovery (KDD) Process
Data mining: the core of the knowledge discovery process
Stages, from raw data to knowledge:
Databases, flat files
Data Cleaning and Integration
Data Warehouse
Selection and Transformation
Data Mining
Pattern Evaluation


Key Steps in KDD Process
Learning the application domain
relevant prior knowledge and goals of application
Creating a target data resource
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant
representation
Choosing the mining algorithm(s) to search for patterns of
interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
Are All the Discovered Patterns
Interesting?
Data mining may generate thousands of patterns: Not
all of them are interesting
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on
new or test data with some degree of certainty, potentially useful, novel,
or validates some hypothesis that a user seeks to confirm

Objective vs. subjective interestingness measures


Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
Subjective: based on the user's beliefs about the data, e.g., unexpectedness,
novelty, actionability, etc.

Find All and Only Interesting Patterns?
Find all the interesting patterns: Completeness
Can a data mining system find all the interesting patterns? Do
we need to find all of the interesting patterns?
Heuristic vs. exhaustive search
Search for only interesting patterns: an optimization
problem (challenging)
Can a data mining system find only the interesting patterns?
Approaches
First generate all the patterns and then filter out the uninteresting
ones
Guide and constrain the discovery process
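
The second approach can be illustrated with the well-known Apriori idea for frequent itemsets. The slide does not name this technique, so it is brought in here only as an example. Instead of generating every itemset and filtering afterwards, the search extends only itemsets that are already frequent, because any superset of an infrequent itemset must itself be infrequent. The toy baskets, the `min_count` threshold and the function name are assumptions of this sketch.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Level-wise search that never expands an infrequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):
        return sum(itemset <= t for t in transactions)

    all_items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in all_items
             if count(frozenset([i])) >= min_count]
    result = {}
    while level:
        for itemset in level:
            result[itemset] = count(itemset)
        # Candidates: unions of two frequent itemsets that add exactly one item.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # Constraint: keep a candidate only if all its subsets one item smaller
        # are frequent, then confirm its own count against the data.
        level = [c for c in candidates
                 if all(frozenset(s) in result
                        for s in combinations(c, len(c) - 1))
                 and count(c) >= min_count]
    return result


baskets = [  # hypothetical data
    {"milk", "bread", "beer"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "bread", "beer", "diaper"},
]
found = frequent_itemsets(baskets, min_count=2)
print(len(found))
```

Here "diaper" appears only once, so no itemset containing it is ever generated, whereas the generate-then-filter approach would enumerate all of them first.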

Research Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse
data types, e.g., Web, graph, bio, stream
Performance: efficiency, effectiveness, and
scalability
Parallel, distributed and incremental mining methods
Pattern evaluation: the interestingness problem
Handling noise and incomplete data
Incorporation of background knowledge

Research Issues in Data Mining
User interaction
Data mining query languages
Expression and visualization of data mining results
Applications and social impacts
Domain-specific data mining
Protection of data security, integrity, and privacy

Important Resources

Data mining conferences


ACM SIGKDD, IEEE ICDM, SIAM DM, PKDD,
PAKDD
Database conferences
ACM SIGMOD, VLDB, ACM PODS, IEEE ICDE,
EDBT, ICDT
Important journals
Data Mining and Knowledge Discovery
IEEE Transactions on Knowledge and Data
Engineering
Knowledge and Information Systems