01 Intro
TO DATA MINING
Chapter 1. Introduction
Yu Su, CSE@The Ohio State University
Slides adapted from UIUC CS412 by Prof. Jiawei Han and OSU CSE5243 by
Prof. Huan Sun
CSE 5243. Course Page & Schedule
¨ Class Homepage:
https://ysu1989.github.io/courses/sp20/cse5243/
¨ Class Schedule:
9:35-10:55 AM, Wed/Fri, Caldwell Lab 171
¨ Office hours:
¤ Instructor: Yu Su @ DL783, Fri 11:00am-12:15pm (right after class)
First week: No office hours
CSE 5243. Textbook
¨ Recommended but not required
¨ (Primary) Jiawei Han, Micheline Kamber and Jian Pei, Data
Mining: Concepts and Techniques (3rd ed), 2011
¤ More resources:
https://wiki.illinois.edu/wiki/display/cs412/2.+Course+Syllabus+and+Schedule
¨ (Primary) Pang-Ning Tan, Michael Steinbach, and Vipin Kumar,
Introduction to Data Mining, 2006
¨ (Supplementary) Mohammed J. Zaki and Wagner Meira, Jr.,
Data Mining and Analysis: Fundamental Concepts and Algorithms, 2014
¨ (Supplementary) Jure Leskovec, Anand Rajaraman, Jeff Ullman,
Mining of Massive Datasets
¤ More resources: http://www.mmds.org/
CSE 5243. Course Work and Grading
Videos
Chapter 1. Introduction
¨ What is Data Mining?
¨ Why Data Mining?
¨ A Multi-Dimensional View of Data Mining
¨ What Kinds of Data Can Be Mined?
¨ What Kinds of Patterns Can Be Mined?
¨ What Kinds of Technologies Are Used?
¨ What Kinds of Applications Are Targeted?
¨ Major Issues in Data Mining
¨ A Brief History of Data Mining and Data Mining Society
¨ Summary
What is Data Mining?
¨ Data mining (knowledge discovery from data, KDD)
¤ Extraction of interesting (non-trivial, implicit, previously
unknown, and potentially useful) patterns or knowledge
from huge amounts of data
¨ Alternative names
¤ Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
One of the best conferences to publish your research work:
SIGKDD (check resources)
Knowledge Discovery (KDD) Process
¨ (Narrow view) Data mining plays an essential role in the
knowledge discovery process
¨ (Broad view) Data mining is the knowledge discovery process
[Figure: KDD process, bottom to top: Databases → Data Cleaning and
Data Integration → Task-relevant Data → Data Mining → Pattern Evaluation]
Example: A Web Mining Framework
¨ Web mining usually involves
¤ Data crawling and cleaning
¤ Data integration from multiple sources
¤ (Optional) Warehousing the data
¤ (Optional) Data cube construction
¤ Data selection for data mining
¤ Data mining
¤ Presentation of the mining results
¤ Patterns and knowledge to be used or stored in a
knowledge base
KDD Process: A View from ML and Statistics
[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]
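The three stages of this view (pre-processing, mining, post-processing) can be sketched as three small functions chained together. This is only an illustration: the function names `preprocess`, `mine`, and `postprocess`, the raw transaction strings, and the `min_count` threshold are all assumptions made for the example, not part of the course material.

```python
from collections import Counter

def preprocess(raw_records):
    """Pre-processing: normalize case, strip whitespace, drop empty records."""
    cleaned = []
    for record in raw_records:
        items = {item.strip().lower() for item in record.split(",") if item.strip()}
        if items:
            cleaned.append(items)
    return cleaned

def mine(transactions):
    """Mining: count how many transactions contain each item."""
    counts = Counter()
    for items in transactions:
        counts.update(items)
    return counts

def postprocess(counts, min_count):
    """Post-processing: keep items meeting the threshold, most frequent first."""
    kept = [(item, n) for item, n in counts.items() if n >= min_count]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical raw input with some noise (a blank record, mixed case).
raw = ["Milk, Bread", "bread,  eggs", "", "MILK,bread", "eggs"]
result = postprocess(mine(preprocess(raw)), min_count=2)
print(result)
```

In practice each stage is far richer (feature selection, normalization, pattern evaluation, visualization), but the shape of the pipeline is the same.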
Why Data Mining?
¨ The Explosive Growth of Data: from terabytes to petabytes
¤ Data collection and data availability
n Automated data collection tools, database systems, Web,
computerized society
¤ Major sources of data
n Business: Web, e-commerce, transactions, stocks, …
n Science: Remote sensing, bioinformatics, scientific simulation, …
n Society and everyone: news, digital cameras, YouTube
“How much data is generated each day?” – World Economic Forum
¨ We are drowning in data, but starving for knowledge!
¨ “Necessity is the mother of invention”: data mining is the automated
analysis of massive data sets
Multi-Dimensional View of Data Mining
¨ Data to be mined
¤ Database data (extended-relational, object-oriented, heterogeneous), data
warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text
and web, multi-media, graphs & social and information networks
¨ Knowledge to be mined (or: Data mining functions)
¤ Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, …
¤ Descriptive vs. predictive data mining
¤ Multiple/integrated functions and mining at multiple levels
¨ Techniques utilized
¤ Data warehousing (OLAP), machine learning, statistics, pattern recognition,
visualization, high-performance computing, etc.
¨ Applications adapted
¤ Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market
analysis, text mining, Web mining, etc.
Data Mining: On What Kinds of Data?
¨ Database-oriented data sets and applications
¤ Relational database, data warehouse, transactional database
¤ Object-relational databases, Heterogeneous databases and legacy
databases
Question 2: What project have you done so far that you think is most relevant to
Data Mining?
• Not necessarily research project; can be your course project or any hackathon
event you participated in.
Data Mining Functions: Pattern Discovery
¨ Frequent patterns
¤ What items do you frequently purchase together on Amazon?
¨ Association and Correlation Analysis
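As a rough illustration of frequent patterns and association analysis, the sketch below counts item pairs that appear together in shopping baskets and estimates how often one item accompanies another. The baskets, the `min_support` threshold, and the helper names `frequent_pairs` and `confidence` are hypothetical, chosen only for this example; real miners such as Apriori or FP-growth scale far beyond pair counting.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; each basket is a set of purchased items.
baskets = [
    {"laptop", "mouse", "bag"},
    {"laptop", "mouse"},
    {"mouse", "pad"},
    {"laptop", "bag"},
    {"laptop", "mouse", "pad"},
]

def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs bought together often enough."""
    pair_counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(items), 2):
            pair_counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}

def confidence(transactions, antecedent, consequent):
    """Fraction of baskets containing the antecedent that also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    return sum(consequent in t for t in with_antecedent) / len(with_antecedent)

pairs = frequent_pairs(baskets, min_support=0.4)
print(pairs)
print(confidence(baskets, "laptop", "mouse"))
```

Support measures how common a pattern is overall; confidence measures how reliably one item implies another.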
Data Mining Functions: Classification
¨ Classification and label prediction
¤ Construct models (functions) based on some training examples
¤ Describe and distinguish classes or concepts for future prediction
n Ex. 1. Classify countries based on (climate)
n Ex. 2. Classify cars based on (gas mileage)
¤ Predict some unknown class labels
¨ Typical methods
¤ Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
¨ Typical applications:
¤ Credit card fraud detection, direct marketing, classifying stars,
diseases, web pages, …
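To make "construct a model from training examples, then predict unknown labels" concrete, here is a minimal sketch of a one-level decision tree (a decision stump) that classifies cars by gas mileage. The training data, the labels "low" and "high", and the function names are assumptions for illustration only.

```python
# Hypothetical training examples: (miles per gallon, class label).
training = [(15, "low"), (18, "low"), (22, "low"),
            (30, "high"), (35, "high"), (41, "high")]

def predict(threshold, mpg):
    """Apply the learned rule: above the threshold means 'high' mileage."""
    return "high" if mpg > threshold else "low"

def train_stump(examples):
    """Pick the mpg threshold that misclassifies the fewest training examples."""
    values = sorted(mpg for mpg, _ in examples)
    # Candidate thresholds are midpoints between consecutive sorted values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(predict(threshold, mpg) != label for mpg, label in examples)

    return min(candidates, key=errors)

threshold = train_stump(training)
print(threshold, predict(threshold, 27), predict(threshold, 20))
```

A full decision tree repeats this kind of split recursively over many attributes; the stump shows only the first step.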
Data Mining Functions: Cluster Analysis
¨ Unsupervised learning (i.e., Class
label is unknown)
¨ Group data to form new
categories (i.e., clusters), e.g.,
cluster houses to find distribution
patterns
¨ Principle: Maximizing intra-class
similarity & minimizing inter-class
similarity
¨ Many methods and applications
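One of the many clustering methods is k-means, sketched below as an example of grouping unlabeled points so that members of a cluster are close to each other. The choice of k-means, the house coordinates, and the hand-picked initial centroids are assumptions for illustration; the slide does not prescribe a particular method.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical house locations (x, y); initial centroids chosen by hand.
houses = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(houses, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

No class labels are given to the algorithm; the two groups emerge from distances alone, which is what makes this unsupervised.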
Data Mining Functions: Outlier Analysis
¨ Outlier analysis
¤ Outlier: A data object that does not comply with the
general behavior of the data
¤ Noise or exception?―One person’s garbage could
be another person’s treasure
¤ Methods: by-product of clustering or regression
analysis, …
¤ Useful in fraud detection, rare events analysis
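A very simple way to flag objects that "do not comply with the general behavior of the data" is a z-score test, sketched below. Note that this is a swapped-in statistical technique rather than the clustering- or regression-based methods named above, and the transaction amounts and the threshold of 2.0 are assumptions made for the example.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical card transaction amounts; one is far from the rest.
amounts = [20, 22, 19, 21, 20, 23, 18, 250]
outliers = zscore_outliers(amounts)
print(outliers)
```

Whether the flagged value is noise to discard or a fraud case worth investigating depends on the application, which is the point of the "garbage or treasure" remark.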
Data Mining Functions: Time and Ordering:
Sequential Pattern, Trend and Evolution Analysis
¨ Sequence, trend and evolution analysis
¤ Trend, time-series, and deviation analysis
Data Mining Functions: Structure and
Network Analysis
¨ Graph mining
¤ Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
¨ Information network analysis
¤ Social networks: actors (objects, nodes) and relationships (edges)
n e.g., author networks in CS, terrorist networks
¤ Multiple heterogeneous networks
n A person could be in multiple information networks: friends, family,
classmates, …
¤ Knowledge graphs: knowledge backbone of AI systems
¨ Web mining
¤ Web is a big information network: from PageRank to Google
¤ Analysis of Web information networks
n Web community discovery, opinion mining, usage mining, …
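Treating the Web as an information network, PageRank scores a page by the scores of the pages linking to it. The sketch below is a basic power-iteration version under simplifying assumptions: every page has at least one outgoing link, the damping factor is the commonly used 0.85, and the four-page graph is made up for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}.
    Assumes every page has at least one outgoing link."""
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        # Each page splits its damped rank evenly among its outgoing links.
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web graph.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page C ranks highest because three pages link to it, while D, which nothing links to, keeps only the baseline share.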
Evaluation of Knowledge
¨ Is all mined knowledge interesting?
¤ One can mine tremendous amounts of “patterns”
¤ Some may fit only certain dimension space (time, location,
…)
¤ Some may not be representative, may be transient, …
¨ Evaluation of mined knowledge → directly mine only interesting
knowledge?
¤ Descriptive vs. predictive
¤ Coverage
¤ Typicality vs. novelty
¤ Accuracy
¤ Timeliness
¤ …
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the center, drawing on Machine Learning, Pattern
Recognition, Statistics, Algorithm, Database Technology, and
High-Performance Computing]
Why Confluence of Multiple Disciplines?
¨ Tremendous amount of data
¤ Algorithms must be scalable to handle big data
¨ High-dimensionality of data
¤ Microarray data may have tens of thousands of dimensions
Major Issues in Data Mining (1)
¨ Mining Methodology
¤ Mining various and new kinds of knowledge
¤ Mining knowledge in multi-dimensional space
¤ Data mining: An interdisciplinary effort
¤ Boosting the power of discovery in a networked environment
¤ Handling noise, uncertainty, and incompleteness of data
¤ Pattern evaluation and pattern- or constraint-guided mining
¨ User Interaction & Human-Machine Collaboration
¤ Interactive mining
¤ Incorporation of background knowledge
¤ Presentation and visualization of data mining results
Major Issues in Data Mining (2)
¨ Efficiency and Scalability
¤ Efficiency and scalability of data mining algorithms
¤ Parallel, distributed, stream, and incremental mining methods
¨ Diversity of data types
¤ Handling complex types of data
¤ Mining dynamic, networked, and global data repositories
¨ Data mining and society
¤ Social impacts of data mining
¤ Privacy-preserving data mining
A Brief History of Data Mining Society
¨ 1989 IJCAI Workshop on Knowledge Discovery in Databases
¤ Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W.
Frawley, 1991)
¨ 1991-1994 Workshops on Knowledge Discovery in Databases
¤ Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
¨ 1995-1998 International Conferences on Knowledge Discovery in
Databases and Data Mining (KDD’95-98)
¤ Journal of Data Mining and Knowledge Discovery (1997)
¨ ACM SIGKDD conferences since 1998 and SIGKDD Explorations
¨ More conferences on data mining
¤ PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
(2001), WSDM (2008), etc.
¨ ACM Transactions on KDD (2007)
Conferences and Journals on Data Mining
¨ KDD Conferences
¤ ACM SIGKDD Int. Conf. on
¨ Other related conferences
¤ DB conferences: ACM SIGMOD,
¨ Web and IR
¤ Conferences: SIGIR, WWW, CIKM, etc.
¤ Journals: WWW: Internet and Web Information Systems
¨ Statistics
¤ Conferences: Joint Stat. Meeting, etc.
¤ Journals: Annals of statistics, etc.
¨ Visualization
¤ Conference proceedings: CHI, ACM-SIGGraph, etc.
¤ Journals: IEEE Trans. visualization and computer graphics, etc.
Future of Data Science
https://www.youtube.com/watch?v=hxXIJnjC_HI (Future of Data Science @ Stanford)
DataFest
Hackathon
Conduct research in labs
¨ Major issues in data mining
Recommended Reference Books
¨ Charu C. Aggarwal, Data Mining: The Textbook, Springer, 2015
¨ E. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011
¨ R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed.,
Wiley-Interscience, 2000
¨ U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining
and Knowledge Discovery, Morgan Kaufmann, 2001
¨ J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan
Kaufmann, 3rd ed., 2011
¨ T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning:
Data Mining, Inference, and Prediction, 2nd ed., Springer, 2009
¨ T. M. Mitchell, Machine Learning, McGraw Hill, 1997
¨ P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
(2nd ed. 2016)
¨ I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005
¨ Mohammed J. Zaki and Wagner Meira Jr., Data Mining and Analysis:
Fundamental Concepts and Algorithms, 2014
Future of Data Science
Figure from: https://www.datasciencecentral.com/profiles/blogs/difference-of-data-science-machine-learning-and-data-mining