1intro - Data Mining
1960s:
Data collection, database creation, Primitive File
Processing
1970s – early 1980s:
Relational Database Systems, Data Modeling tools: ERD,
Indexing & accessing Methods, SQL Query, Query
processing & optimization, Transactions, concurrency
control & recovery, OLTP
Mid-1980s - present:
Extended Relational, Object-Relational Data Models and
application-oriented DBMS (multimedia, spatial,
scientific, engineering, etc.)
Evolution of Database Technology (Contd.)
Alternative names:
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
What is not data mining?
(Deductive) query processing.
Expert systems or small ML/statistical programs
What is not Data Mining?
– Look up a phone number in a phone directory
What is Data Mining?
– Certain names are more prevalent in certain US locations
(O’Brien, O’Rourke, O’Reilly… in Boston area)
[Figure: the KDD process. Labels recovered from the slide: Databases, Data Cleaning, Data Integration, Data Warehouse, Data Selection, Pattern Evaluation]
Steps of a KDD Process
Learning the application domain:
relevant prior knowledge and goals of application
Data Cleaning: To remove noise and inconsistent data
Data Integration: Where multiple data sources may be combined
Data Selection: Where data relevant to the analysis task are
retrieved from the database
Data Transformation: Where data are transformed into forms
appropriate for mining by performing summary/aggregation.
Data Mining: an essential process where intelligent methods are
applied in order to extract data patterns
Choosing functions of data mining & mining algorithms
summarization, classification, regression, association,
clustering.
Pattern evaluation: Identify truly interesting patterns representing
knowledge
Knowledge representation: Visualization and knowledge
representation techniques are used to present the mined knowledge
to the user
Use of discovered knowledge
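The KDD steps above can be sketched as a small pipeline. This is only an illustrative sketch, not a prescribed implementation: the two source lists, the function names, and the 0.5 support threshold are all assumptions made for the example, and the "mining" step is reduced to counting item pairs that occur together.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records from two sources (illustrative data only).
store_a = [
    {"txn": 1, "items": ["bread", "milk"]},
    {"txn": 2, "items": ["bread", "butter", "milk"]},
    {"txn": 3, "items": None},                 # noisy record: missing items
]
store_b = [
    {"txn": 4, "items": ["Bread", "Butter"]},  # inconsistent capitalisation
    {"txn": 5, "items": ["milk", "bread"]},
]

def clean(records):
    """Data cleaning: drop records with missing items, normalise case."""
    return [
        {"txn": r["txn"], "items": [i.lower() for i in r["items"]]}
        for r in records
        if r["items"]
    ]

def integrate(*sources):
    """Data integration: combine multiple sources into one collection."""
    return [r for source in sources for r in source]

def select(records):
    """Data selection: keep only the attribute relevant to the task."""
    return [r["items"] for r in records]

def transform(baskets):
    """Data transformation: represent each basket as a sorted tuple of unique items."""
    return [tuple(sorted(set(b))) for b in baskets]

def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(basket, 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support=0.5):
    """Pattern evaluation: keep pairs whose support meets the threshold."""
    return {
        pair: count / n_baskets
        for pair, count in pair_counts.items()
        if count / n_baskets >= min_support
    }

def run_kdd():
    """Chain the steps in the order listed on the slide."""
    baskets = transform(select(integrate(clean(store_a), clean(store_b))))
    return evaluate(mine(baskets), len(baskets))

if __name__ == "__main__":
    # Knowledge presentation: show the surviving patterns to the user.
    for pair, support in sorted(run_kdd().items()):
        print(pair, round(support, 2))
```

Each function corresponds to one step, so a step can be replaced (for example, a different mining function) without touching the others.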
Data Mining and Business Intelligence
[Figure: pyramid of increasing potential to support business decisions. Labels recovered from the slide: Making Decisions (End User); Data Exploration (Statistical Analysis, Querying and Reporting); Pattern Evaluation; Data Sources (Databases, Data Warehouse, World Wide Web)]
Major components in the Architecture
of a Typical Data Mining System
• Database, Data warehouse, WWW, or other
information repository:
– This is one or a set of databases, data warehouses,
spreadsheets. Data cleaning and data integration
techniques may be performed on the data.
• Database or data warehouse server:
– Responsible for fetching the relevant data, based on
the user’s data mining request.
• Knowledge base:
– This is the domain knowledge that is used to guide
the search or evaluate the interestingness of resulting
patterns. Such knowledge can include concept
hierarchies, used to organize attributes or attribute
values into different levels of abstraction.
• Data Mining Engine:
– Consists of a set of functional modules for tasks such as
characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier and
evolution analysis
• Pattern Evaluation Module:
– This component employs interestingness measures and
interacts with the data mining modules so as to focus the
search toward interesting patterns.
• User Interface:
– Communicates between users and data mining system.
– Allows user to interact with the system by specifying a data
mining query or task, to provide information to help focus
the search, and to visualize the patterns in different forms.
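Two of the components above, the knowledge base and the pattern evaluation module, can be illustrated with a short sketch. It assumes a hypothetical concept hierarchy for a location attribute (city, then state, then country) and a toy interestingness measure (share of all records); the names PARENT, generalize, roll_up and interesting are made up for this example and do not come from any particular system.

```python
# Hypothetical concept hierarchy for a "location" attribute:
# city -> state -> country. The entries are illustrative only.
PARENT = {
    "Boston": "Massachusetts",
    "Cambridge": "Massachusetts",
    "Chicago": "Illinois",
    "Massachusetts": "USA",
    "Illinois": "USA",
}

def generalize(value, levels=1):
    """Climb the hierarchy by `levels` steps, stopping at the top."""
    for _ in range(levels):
        if value not in PARENT:
            break
        value = PARENT[value]
    return value

def roll_up(values, levels=1):
    """Count attribute values after generalising each one."""
    counts = {}
    for v in values:
        key = generalize(v, levels)
        counts[key] = counts.get(key, 0) + 1
    return counts

def interesting(counts, min_share=0.5):
    """Toy interestingness measure: keep groups holding at least
    `min_share` of all records."""
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items() if c / total >= min_share}

cities = ["Boston", "Cambridge", "Chicago", "Boston"]
by_state = roll_up(cities, levels=1)      # one level of abstraction up
by_country = roll_up(cities, levels=2)    # two levels of abstraction up
focus = interesting(by_state)             # patterns passed on to the user
```

The hierarchy plays the role of domain knowledge that organizes attribute values into levels of abstraction, and the threshold in interesting stands in for whatever measure a real system would use to focus the search.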
Data Mining: On What Kind of Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational databases
Time-series data and temporal data
WWW
Relational Databases
1. Concept/Class Description:
Characterization and Discrimination
2. Mining Frequent Patterns, Associations
and correlations
3. Classification and Prediction
4. Cluster Analysis
5. Outlier Analysis
6. Evolution Analysis
1 Concept/Class Description
Concept/Class description:
Data can be associated with classes or concepts.
Descriptions of a class or concept in summarized,
concise & precise terms are called class/concept
descriptions.
These descriptions can be derived via
Data Characterization
Data Discrimination
Data Characterization
Cluster analysis
Clustering analyzes data objects without
consulting a known class label.
Class label is unknown: Group data to form
new classes, e.g., cluster houses to find
distribution patterns
Clustering is based on the principle of maximizing
the intra-class similarity and minimizing the
interclass similarity
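As an illustration of grouping objects without class labels, here is a minimal sketch of k-means, one common clustering technique (the slides do not name a specific algorithm, so this choice is an assumption). The house coordinates and the initial centroids are made-up values; similarity is taken to be closeness in Euclidean distance.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                # Mean of each coordinate over the cluster members.
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical house locations (x, y); no class labels are given.
houses = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(houses, centroids=[houses[0], houses[3]])
```

Points that end up in the same cluster are close to each other (high intra-class similarity) and far from the other cluster's centroid (low interclass similarity).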
5 Outlier Analysis
Outlier analysis
Outlier: a data object that does not comply with the
general behavior of the data
It can be considered as noise or exception but is
quite useful in fraud detection, rare events analysis
For example,
Uncover fraudulent usage of credit cards by detecting
purchases of extremely large amounts for a given
account number in comparison to regular charges
incurred by the same account.
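The credit card example can be sketched as follows. This is one possible approach, not the method the slides prescribe: it compares each charge with the account's median charge and scales the difference by the median absolute deviation (MAD). The function name, the threshold of 5, and the charge amounts are assumptions chosen for illustration.

```python
from statistics import median

def flag_large_charges(charges, threshold=5.0):
    """Return charges far above the account's typical amount.

    Uses the median and the median absolute deviation (MAD), which are
    distorted less by the outliers themselves than mean and std. dev.
    """
    typical = median(charges)
    mad = median(abs(c - typical) for c in charges)
    if mad == 0:
        # More than half the charges are identical: anything larger stands out.
        return [c for c in charges if c > typical]
    return [c for c in charges if (c - typical) / mad > threshold]

# Hypothetical charges on a single account.
account_charges = [42.0, 18.5, 60.0, 25.0, 37.5, 2400.0, 51.0]
suspicious = flag_large_charges(account_charges)
```

Only unusually large charges are flagged, which matches the example of detecting extremely large purchases relative to the same account's regular charges.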
6 Evolution Analysis
[Figure: data mining as a confluence of multiple disciplines: Database Technology, Statistics, Machine Learning, Visualization, Information Science, Other Disciplines]
Why Confluence of Multiple Disciplines?
• Tremendous amount of data
– Algorithms must be highly scalable to handle volumes
such as terabytes of data
• High dimensionality of data
– Micro-array data may have tens of thousands of dimensions
• High complexity of data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structured data, graphs, social networks and multi-linked
data
– Heterogeneous databases and legacy databases
Data Mining: Classification Schemes
General functionality
• Descriptive data mining: describes data in a concise and
summative manner and presents interesting general
properties of the data.
Databases to be mined
Relational, transactional, object-oriented, object-
relational, active, spatial, time-series, text, multi-media,
heterogeneous, legacy, WWW, etc.
Knowledge to be mined
Characterization, discrimination, association,
classification, clustering, trend, deviation and outlier
analysis, etc.
Multiple/integrated functions and mining at multiple
levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization, neural network, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, DNA mining,
stock market analysis, Web mining, Weblog analysis, etc.
Data Mining Task Primitives
No coupling:
The DM system will not utilize any function of a DB or DW
system. It may fetch data from a particular source,
process the data using a data mining algorithm and store
the mining results in another file.
Loose coupling:
DM system will use some facilities of a DB or DW
system, fetching data from a data repository managed
by these systems, performing data mining, and then
storing results either in a file or at a designated place
in a database or data warehouse.
An Integration of Data Mining Systems with
DBMS or Data Warehouse System
Semi-tight coupling:
Besides linking a DM system to a DB/DW system, efficient
implementations of a few essential data mining primitives
can be provided in DB/DW system. Some frequently used
intermediate mining results can be precomputed and stored
in the DB/DW system.
Tight coupling:
The DM system is smoothly integrated into the DB/DW system.
The DM subsystem is treated as one functional component of
an information system.
Data Mining queries & functions are optimized based on
mining query analysis, data structures, indexing schemes
and query processing methods of DB or DW system.
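The coupling schemes above can be made concrete with a small sketch of loose coupling, using Python's built-in sqlite3 module as a stand-in for the DB system. The table names purchases and item_counts, the sample rows, and the function name are assumptions for the example; the "mining" is reduced to counting items.

```python
import sqlite3
from collections import Counter

def mine_item_counts(conn):
    """Loose coupling: the database stores and serves the data,
    while the mining itself runs in application code."""
    rows = conn.execute("SELECT item FROM purchases").fetchall()
    counts = Counter(item for (item,) in rows)
    # Store the results at a designated place in the same database.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS item_counts (item TEXT PRIMARY KEY, n INTEGER)")
    conn.executemany(
        "INSERT OR REPLACE INTO item_counts VALUES (?, ?)", counts.items())
    conn.commit()
    return counts

# Hypothetical in-memory database with a few purchase records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (txn INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, "bread"), (1, "milk"), (2, "bread"), (3, "butter")],
)

mine_item_counts(conn)
stored = dict(conn.execute("SELECT item, n FROM item_counts").fetchall())

# For contrast: the same aggregation pushed into the database engine,
# a step in the direction of the tighter coupling schemes.
in_db = dict(conn.execute(
    "SELECT item, COUNT(*) FROM purchases GROUP BY item").fetchall())
```

In the loosely coupled version every row leaves the database before being counted, whereas the GROUP BY query lets the database use its own query processing, which is the kind of benefit the semi-tight and tight schemes aim for.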
Major Issues in Data Mining