Data Mining:

Concepts and
Techniques
(3rd ed.)

— Chapter 1 —

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
Why Data Mining?
 The explosive growth of data: from terabytes to petabytes
 Data collection and data availability
 Automated data collection tools, database systems, Web, computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
Evolution of Database Technology
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web databases
 2000s:
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information systems
What Is Data Mining?
 Data mining (knowledge discovery from data)
 Extraction of interesting patterns or knowledge from huge amounts of data
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Knowledge Discovery (KDD) Process
 This is a view from typical database systems and data warehousing communities
 Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process flow, from bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Knowledge Discovery (KDD) Process
 The knowledge discovery process is an iterative sequence of the following steps:
 Data cleaning (to remove noise and inconsistent data)
 Data integration (where multiple data sources may be combined)
 Data selection (where data relevant to the analysis task are retrieved from the database)
 Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
 Data mining (an essential process where intelligent methods are applied to extract data patterns)
 Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
 Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
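The seven steps above can be sketched as a chain of small functions. This is only a toy illustration: the records, the function names, and the "interestingness" score (total quantity sold) are all assumptions made for the example, not part of the original slides.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [{"customer": "c1", "item": "milk", "qty": 2},
            {"customer": "c2", "item": "bread", "qty": None}]
source_b = [{"customer": "c1", "item": "bread", "qty": 1},
            {"customer": "c3", "item": "milk", "qty": 3},
            {"customer": "c3", "item": "tv", "qty": 1}]

def clean(records):
    """Data cleaning: drop records with a missing quantity."""
    return [r for r in records if r["qty"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for src in sources for r in src]

def select(records, items):
    """Data selection: keep only records relevant to the task."""
    return [r for r in records if r["item"] in items]

def transform(records):
    """Data transformation: aggregate quantity per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["qty"]
    return totals

def mine(totals, min_qty):
    """Data mining (toy): keep items whose total quantity reaches min_qty."""
    return {item: q for item, q in totals.items() if q >= min_qty}

def evaluate(patterns):
    """Pattern evaluation: rank patterns by total quantity, highest first."""
    return sorted(patterns.items(), key=lambda kv: kv[1], reverse=True)

def present(ranked):
    """Knowledge presentation: render the ranked patterns as text."""
    return [f"{item}: total quantity {qty}" for item, qty in ranked]

records = integrate(clean(source_a), clean(source_b))
task_data = select(records, items={"milk", "bread"})
report = present(evaluate(mine(transform(task_data), min_qty=1)))
print(report)
```

In practice the process is iterative, so the output of pattern evaluation would often feed back into selection or transformation rather than run once in a straight line.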
Data Mining
 Data mining is the process of discovering interesting patterns and knowledge from large amounts of data
Example: A Web Mining Framework
 Web mining usually involves
 Data cleaning
 Data integration from multiple sources
 Warehousing the data
 Data cube construction
 Data selection for data mining
 Data mining
 Presentation of the mining results
 Patterns and knowledge to be used or stored in a knowledge base
Data Mining in Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases from bottom to top. Layers from top to bottom:]
 Decision Making
 Data Presentation: Visualization Techniques
 Data Mining: Information Discovery
 Data Exploration: Statistical Summary, Querying, and Reporting
 Data Preprocessing/Integration, Data Warehouses
 Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
 Roles shown beside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA
KDD Process: A Typical View from ML and Statistics
[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]
 Data pre-processing: data integration, normalization, feature selection, dimension reduction
 Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
 Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
 This is a view from typical machine learning and statistics communities
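Normalization is one of the pre-processing steps named in this view. A minimal sketch of min-max normalization follows; the function name, the default target range, and the income values are assumptions chosen for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

incomes = [12000, 73600, 98000]  # hypothetical attribute values
normalized = min_max_normalize(incomes)
print(normalized)
```

Rescaling like this keeps an attribute with a large numeric range from dominating distance-based methods such as clustering.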
Example: Medical Data Mining
 Health care and medical data mining often adopt this view from statistics and machine learning
 Preprocessing of the data (including feature extraction and dimension reduction)
 Classification and/or clustering processes
 Post-processing for presentation
Data Mining Models and Tasks
Multi-Dimensional View of Data Mining
 Data to be mined
 Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
 Knowledge to be mined (or: Data mining functions)
 Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
 Descriptive vs. predictive data mining
 Techniques utilized
 Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Multi-Dimensional View of Data Mining
Data characterization
 Data characterization is a summarization of the general characteristics or features of a target class of data.
 The output of data characterization can be presented in various forms. Examples include pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs.
 A crosstab is a table that summarizes the relationship between two categorical variables.
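As a small illustration of a crosstab, the sketch below counts how often each pair of category values occurs. The sales records and the `crosstab` helper are hypothetical, introduced only for this example.

```python
from collections import Counter

# Hypothetical records: (region, item_type) for each sale.
sales = [("east", "tv"), ("east", "phone"), ("west", "tv"), ("west", "tv"),
         ("east", "tv"), ("west", "phone"), ("west", "tv")]

def crosstab(pairs):
    """Count co-occurrences of two categorical variables.

    Returns the row labels, the column labels, and a matrix of counts.
    """
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

rows, cols, table = crosstab(sales)
print("region", *cols)
for label, row in zip(rows, table):
    print(label, *row)
```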


Data discrimination
 Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes.
 The target and contrasting classes can be specified by a user, and the corresponding data objects can be retrieved through database queries.
Multi-Dimensional View of Data Mining
Predictive Analytics
 Predictive analytics helps an organization anticipate what might happen next: it predicts the future based on the data currently available.
Descriptive Analytics
 Descriptive analytics helps an organization understand what has happened in the past by analyzing the data that have been stored.
Data Mining: On What Kinds of Data?
 Database-oriented data sets and applications
 Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-sequences)
 Structure data, graphs, social networks and multi-linked data
 Object-relational databases
 Heterogeneous databases and legacy databases
 Spatial data and spatiotemporal data
 Multimedia database
 Text databases
 The World-Wide Web
Data Warehouse
 A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
 Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing.
 A data warehouse is usually modeled by a multidimensional data structure, called a data cube.
 Each dimension of the data cube corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure, such as a count or a sum.
 A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.
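To make the idea of precomputed summaries concrete, here is a minimal sketch that materializes every aggregate of a tiny fact table. The dimensions, the sales figures, and the use of "*" to mark an aggregated-away dimension are assumptions for this example; real systems materialize cuboids far more selectively.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (time, item, location, sales_amount).
facts = [("Q1", "tv", "Vancouver", 400),
         ("Q1", "phone", "Vancouver", 150),
         ("Q2", "tv", "Toronto", 300),
         ("Q2", "tv", "Vancouver", 250)]
DIMENSIONS = ("time", "item", "location")

def build_cube(facts):
    """Precompute sum(sales_amount) for every combination of kept dimensions."""
    cube = defaultdict(int)
    n = len(DIMENSIONS)
    for *dims, amount in facts:
        for k in range(n + 1):
            for kept in combinations(range(n), k):
                # "*" marks a dimension that has been aggregated away.
                cell = tuple(dims[i] if i in kept else "*" for i in range(n))
                cube[cell] += amount
    return cube

cube = build_cube(facts)
print(cube[("*", "*", "*")])           # grand total over all dimensions
print(cube[("*", "tv", "*")])          # all tv sales
print(cube[("Q2", "*", "Vancouver")])  # Q2 sales in Vancouver, any item
```

Once the cube is built, each summary query is a single dictionary lookup, which is the "fast access" the slide refers to.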
Typical framework of a Data Warehouse
Example of Data Cube
Data Mining Function: (1) Generalization
 Information integration and data warehouse construction
 Data cleaning, transformation, integration, and multidimensional data model
 Data cube technology
 Scalable methods for computing (i.e., materializing) multidimensional aggregates
 OLAP (online analytical processing)
 Multidimensional concept description: characterization and discrimination
 Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Data Mining Function: (2) Association and Correlation Analysis
 Frequent patterns (or frequent itemsets)
 Frequent patterns are patterns that occur frequently in data.
 A frequent itemset typically refers to a set of items that often appear together in a transactional data set.
 What items are frequently purchased together in your Walmart?
 Example: milk and bread, which are frequently bought together in grocery stores by many customers.
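A brute-force sketch of finding frequent itemsets is shown below. It simply counts every itemset of up to two items, which is fine for a toy example but is not one of the scalable algorithms the field relies on. The transactions and the 60% minimum support are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [{"milk", "bread", "eggs"},
                {"milk", "bread"},
                {"bread", "butter"},
                {"milk", "bread", "butter"},
                {"milk", "eggs"}]

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items
    whose support (fraction of transactions containing them) is at least min_support."""
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            for itemset in combinations(sorted(t), k):
                counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

frequent = frequent_itemsets(transactions, min_support=0.6)
for itemset, sup in sorted(frequent.items()):
    print(itemset, sup)
```

With these transactions, milk and bread each appear in four of five baskets and appear together in three, so the pair meets the 60% threshold.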
Data Mining Function: (2) Association and Correlation Analysis
 Association, correlation
 A typical association rule:
buys(X, “computer”) → buys(X, “software”) [support = 1%, confidence = 50%]
where X is a customer. A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well. A support of 1% means that 1% of all the transactions under analysis show that computer and software are purchased together. In short form: computer → software [1%, 50%]
 Diaper → Beer [0.5%, 75%] (support, confidence)
 Are strongly associated items also strongly correlated?
 How to mine such patterns and rules efficiently in large data sets?
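The support and confidence figures in the rule above can be computed directly from transaction data. The sketch below builds a hypothetical set of 100 transactions, chosen so that the rule computer → software comes out at 1% support and 50% confidence; the helper names are assumptions for this example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical data: 100 transactions, 2 contain a computer, 1 of those also software.
transactions = ([{"computer", "software"}] * 1
                + [{"computer"}] * 1
                + [{"printer"}] * 98)

s = support(transactions, {"computer", "software"})
c = confidence(transactions, {"computer"}, {"software"})
print(f"support = {s:.0%}, confidence = {c:.0%}")
```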
Data Mining Function: (3) Classification
 Classification and label prediction
 Construct models (functions) based on some training examples
 Describe and distinguish classes or concepts for future prediction
 E.g., classify countries based on (climate), or classify cars based on (gas mileage)
 Predict some unknown class labels
 Typical methods
 Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
 Typical applications:
 Credit card fraud detection, direct marketing, classifying
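As a minimal sketch of building a model from training examples, the code below learns a one-level decision tree (a "decision stump") that classifies cars by gas mileage. The training data, the labels "low" and "high", and the function names are assumptions made for this example.

```python
# Hypothetical training examples: (miles_per_gallon, class_label).
training = [(18, "low"), (21, "low"), (24, "low"),
            (31, "high"), (35, "high"), (42, "high")]

def predict(threshold, mpg):
    """Classify a car as "high" mileage if it exceeds the threshold."""
    return "high" if mpg > threshold else "low"

def train_stump(examples):
    """Pick the threshold that makes the fewest errors on the training examples."""
    values = sorted(v for v, _ in examples)
    # Candidate thresholds: midpoints between consecutive sorted values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(predict(threshold, v) != label for v, label in examples)

    return min(candidates, key=errors)

threshold = train_stump(training)
print(threshold)
print(predict(threshold, 29), predict(threshold, 20))
```

The learned threshold is then used to predict labels for cars that were not in the training set, which is the "predict some unknown class labels" step.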
Data Mining Function: (4) Cluster Analysis
 Unsupervised learning (i.e., class label is unknown)
 Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
 Principle: maximizing intra-class similarity & minimizing inter-class similarity
 Many methods and applications
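One of the many clustering methods is k-means. The sketch below runs it on one-dimensional data; the house prices, the starting centers, and the fixed iteration count are assumptions for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Simple k-means on numbers: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

house_prices = [100, 110, 120, 300, 310, 320]  # hypothetical, in thousands
centers, clusters = kmeans_1d(house_prices, centers=[100, 320])
print(centers)
print(clusters)
```

No class labels are supplied: the two groups emerge only from the similarity of the prices, which is what makes the method unsupervised.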
Data Mining Function: (5) Outlier Analysis
 Outlier analysis
 Outlier: a data object that does not comply with the general behavior of the data
 Noise or exception? One person’s garbage could be another person’s treasure
 Methods: by-product of clustering or regression analysis, …
 Useful in fraud detection, rare events analysis
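A very simple statistical way to flag outliers is the z-score test sketched below. Note that this is a different technique from the clustering-based and regression-based methods named above; it is used here only because it fits in a few lines. The transaction amounts and the threshold of 2 standard deviations are assumptions.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing can be an outlier.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

amounts = [20, 22, 19, 21, 23, 20, 500]  # hypothetical card transactions
outliers = zscore_outliers(amounts)
print(outliers)
```

Whether the flagged value is noise to discard or a fraud case worth investigating depends on the application, which is the point of the "garbage or treasure" remark.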
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
 Sequence, trend and evolution analysis
 Trend, time-series, and deviation analysis: e.g., regression and value prediction
 Sequential pattern mining
 e.g., first buy digital camera, then buy large SD memory cards
 Periodicity analysis
 Motifs and biological sequence analysis
 Approximate and consecutive motifs
 Similarity-based analysis
 Mining data streams
 Ordered, time-varying, potentially infinite data streams
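The camera-then-memory-card example can be checked against purchase histories by testing whether a pattern occurs in order within each customer's sequence. The purchase data and helper names below are hypothetical; this only measures the support of one given pattern and does not discover patterns on its own.

```python
def contains_in_order(sequence, pattern):
    """True if the items of pattern appear in sequence in the same order
    (other items may occur in between)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so order is enforced.
    return all(item in remaining for item in pattern)

def sequence_support(sequences, pattern):
    """Fraction of sequences that contain the pattern in order."""
    return sum(contains_in_order(s, pattern) for s in sequences) / len(sequences)

# Hypothetical purchase histories, one list per customer, in time order.
purchases = [["digital camera", "tripod", "SD card"],
             ["SD card", "digital camera"],
             ["digital camera", "SD card"],
             ["laptop", "mouse"]]

sup = sequence_support(purchases, ["digital camera", "SD card"])
print(sup)
```

The second customer bought both items but in the opposite order, so that history does not count toward the pattern.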
Structure and Network Analysis
 Graph mining
 Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
 Information network analysis
 Social networks: actors (objects, nodes) and relationships (edges)
 e.g., author networks in CS, terrorist networks
 Multiple heterogeneous networks
 A person could be in multiple information networks: friends, family, classmates, …
 Links carry a lot of semantic information: link mining
 Web mining
 The Web is a big information network: from PageRank to Google
 Analysis of Web information networks
 Web community discovery, opinion mining, usage mining
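To show how links alone can carry ranking information, here is a simplified sketch of PageRank computed by repeated updates. The four-page link graph, the damping factor of 0.85, and the iteration count are assumptions for illustration; it also assumes every linked page appears as a key in the dictionary.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: `links` maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C is linked to by three other pages and ends up with the highest score, while D, which nothing links to, keeps only the baseline share.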
Evaluation of Knowledge
 Is all mined knowledge interesting?
 One can mine a tremendous amount of “patterns” and knowledge
 Some may fit only a certain dimension space (time, location, …)
 Some may not be representative, may be transient, …
 Evaluation of mined knowledge → directly mine only interesting knowledge?
 Descriptive vs. predictive
 Coverage
 Typicality vs. novelty
 Accuracy
 Timeliness
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by the contributing disciplines: Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, High-Performance Computing]
Why Confluence of Multiple Disciplines?
 Tremendous amount of data
 Algorithms must be highly scalable to handle data volumes such as terabytes
 High dimensionality of data
 Microarray data may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structure data, graphs, social networks and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications
Applications of Data Mining
 Web page analysis: from web page classification, clustering
to PageRank & HITS algorithms
 Collaborative analysis & recommender systems
 Basket data analysis to targeted marketing
 Biological and medical data analysis: classification, cluster
analysis (microarray data analysis), biological sequence
analysis, biological network analysis
 Data mining and software engineering (e.g., IEEE Computer,
Aug. 2009 issue)
 From major dedicated data mining systems/tools (e.g., SAS,
MS SQL-Server Analysis Manager, Oracle Data Mining Tools)
to invisible data mining
Major Issues in Data Mining (1)
 Mining Methodology
 Mining various and new kinds of knowledge
 Mining knowledge in multi-dimensional space
 Data mining: An interdisciplinary effort
 Boosting the power of discovery in a networked
environment
 Handling noise, uncertainty, and incompleteness of data
 Pattern evaluation and pattern- or constraint-guided
mining
 User Interaction
 Interactive mining
 Incorporation of background knowledge
 Presentation and visualization of data mining results
Major Issues in Data Mining (2)
 Efficiency and Scalability
 Efficiency and scalability of data mining algorithms
 Parallel, distributed, stream, and incremental mining methods
 Diversity of data types
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
 Data mining and society
 Social impacts of data mining
 Privacy-preserving data mining
 Invisible data mining
A Brief History of Data Mining
Society
 1989 IJCAI Workshop on Knowledge Discovery in Databases
 Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W.
Frawley, 1991)
 1991-1994 Workshops on Knowledge Discovery in Databases
 Advances in Knowledge Discovery and Data Mining (U. Fayyad,
G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
 1995-1998 International Conferences on Knowledge Discovery in
Databases and Data Mining (KDD’95-98)
 Journal of Data Mining and Knowledge Discovery (1997)
 ACM SIGKDD conferences since 1998 and SIGKDD Explorations
 More conferences on data mining
 PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE)
ICDM (2001), etc.
 ACM Transactions on KDD starting in 2007
Summary
 Data mining: discovering interesting patterns and knowledge from massive amounts of data
 A natural evolution of database technology, in great demand, with wide applications
 A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
 Mining can be performed on a variety of data
 Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
 Data mining technologies and applications
 Major issues in data mining
