
1intro - Data Mining

The document discusses data mining concepts including what data mining is, why it is useful, the data mining process, types of data it can be applied to, and typical system architectures. Data mining aims to extract useful patterns from large amounts of data and can help with applications like customer analysis and fraud detection.

Uploaded by

Ansh Surti

Data Mining:

Concepts and Techniques


— Chapter 3 —
Chapter 3. Introduction

 Motivation: Why data mining?


 What is data mining?
 Data Mining: On what kind of data?
 Data mining functionality
 Are all the patterns interesting?
 Classification of data mining systems
 Major issues in data mining
Motivation: Why Data Mining

 Data explosion problem


 Automated data collection tools and mature database
technology lead to tremendous amounts of data stored
in databases, data warehouses and other information
repositories

 We are drowning in data, but starving for knowledge!


 Lots of data is being collected and warehoused
 Web data, e-commerce
 purchases at department/grocery stores
 Bank/Credit Card transactions
Motivation: Why Data Mining

 Solution: Data warehousing and data mining


 Data warehousing and on-line analytical processing
 Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases
Evolution of Database Technology

 1960s:
 Data collection, database creation, primitive file
processing
 1970s – early 1980s:
 Relational database systems, data modeling tools (ERD),
indexing and access methods, SQL, query
processing and optimization, transactions, concurrency
control and recovery, OLTP
 Mid-1980s - present:
 Extended Relational, Object-Relational Data Models and
application-oriented DBMS (multimedia, spatial,
scientific, engineering, etc.)
Evolution of Database Technology (Cont.)

 Late 1980s – present:


 Data warehouse & OLAP, data mining & knowledge
discovery (classification, clustering, association),
advanced data mining applications (text mining,
intrusion detection, etc.)
 1990s - present:
 Web-based databases (XML-based Database system)
& its integration with information retrieval
What Is Data Mining?

 Data mining (knowledge discovery in databases):
 Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information or patterns from data in large
databases
 Process of semi-automatically analyzing
large databases to find patterns that are:
 valid: hold on new data with some certainty
 novel: non-obvious to the system
 useful: should be possible to act on the item
 understandable: humans should be able to
interpret the pattern
What Is Data Mining?

 Alternative names:
 Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
 What is not data mining?
 (Deductive) query processing.
 Expert systems or small ML/statistical programs
 What is not data mining?
– Looking up a phone number in a phone directory
– Querying a Web search engine for information about “Amazon”
 What is data mining?
– Finding that certain names are more prevalent in certain US
locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Grouping together similar documents returned by a search engine
according to their context (e.g. Amazon rainforest, Amazon.com)
Why Data Mining? — Potential
Applications

 Database analysis and decision support


 Market analysis and management
 target marketing, customer relation management,
market basket analysis, cross selling, market
segmentation
 Risk analysis and management
 Forecasting, customer retention, improved underwriting,
quality control, competitive analysis
 Fraud detection and management
 Other Applications
 Text mining (news group, email, documents) and Web
analysis.
 Intelligent query answering
Data Mining: A KDD Process

 Data mining: the core of the knowledge discovery process.
[Figure: the KDD pipeline, bottom to top: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]
Steps of a KDD Process
 Learning the application domain:
 relevant prior knowledge and goals of application
 Data Cleaning: to remove noise and inconsistent data
 Data Integration: where multiple data sources may be combined
 Data Selection: where data relevant to the analysis task are
retrieved from the database
 Data Transformation: where data are transformed into forms
appropriate for mining by performing summary/aggregation operations.
 Data Mining: an essential process where intelligent methods are
applied in order to extract data patterns
 Choosing functions of data mining & mining algorithms
 summarization, classification, regression, association,
clustering.
 Pattern evaluation: Identify truly interesting patterns representing
knowledge
 Knowledge representation: Visualization and knowledge
representation techniques are used to present the mined knowledge
to the user
 Use of discovered knowledge
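
The steps above can be sketched as a chain of small functions. This is only a toy illustration: the record layout, the two sources (store_a, store_b), the function names, and the threshold-based "mining" step are all assumptions made for the example, not part of any particular system.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"id": 1, "item": "printer", "price": 120.0},
           {"id": 2, "item": "computer", "price": None}]
store_b = [{"id": 3, "item": "computer", "price": 900.0},
           {"id": 4, "item": "printer", "price": 110.0}]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: aggregate the total price per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["price"]
    return dict(totals)

def mine(totals, min_total):
    """Data mining (toy stand-in): keep items whose total meets a threshold."""
    return {item: t for item, t in totals.items() if t >= min_total}

data = integrate(clean(store_a), clean(store_b))
task_relevant = select(data, ["item", "price"])
totals = transform(task_relevant)
patterns = mine(totals, min_total=500.0)
print(patterns)
```

Pattern evaluation and knowledge presentation would follow as further stages that consume `patterns`.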
Data Mining and Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases toward the top. Layers from top to bottom:]
 Making Decisions
 Data Presentation: Visualization Techniques
 Data Mining: Information Discovery
 Data Exploration: Statistical Analysis, Querying and Reporting
 Data Warehouses / Data Marts: OLAP, MDA
 Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
[Roles shown alongside the pyramid, top to bottom: End User, Business Analyst, Data Analyst, DBA]
Architecture of a Typical Data
Mining System

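[Figure: layered architecture, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning & data integration; data sources (Databases, Data Warehouse, World Wide Web). A Knowledge base is shown alongside the upper layers.]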
Major components in the Architecture
of a Typical Data Mining System
• Database, data warehouse, WWW, or other
information repository:
– This is one or a set of databases, data warehouses,
or spreadsheets. Data cleaning and data integration
techniques may be performed on the data.
• Database or data warehouse server:
– Responsible for fetching the relevant data, based on
the user’s data mining request.
• Knowledge base:
– This is the domain knowledge that is used to guide
the search or evaluate the interestingness of resulting
patterns. Such knowledge can include concept
hierarchies, used to organize attributes or attribute
values into different levels of abstraction.
Major components in the Architecture
of a Typical Data Mining System
• Data Mining Engine:
– Consists of a set of functional modules for tasks such as
characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis and
evolution analysis
• Pattern Evaluation Module:
– This component employs interestingness measures and
interacts with the data mining modules so as to focus the
search toward interesting patterns.
• User Interface:
– Communicates between users and data mining system.
– Allows the user to interact with the system by specifying a data
mining query or task, to provide information that helps focus
the search, and to visualize the patterns in different forms.
Data Mining: On What Kind of
Data?

 Relational databases
 Data warehouses
 Transactional databases
 Advanced DB and information repositories
 Object-oriented and object-relational databases
 Time-series data and temporal data

 Spatial & Spatiotemporal databases

 Text databases and multimedia databases

 Heterogeneous and legacy databases

 WWW
Relational Databases

• A relational database is a collection of tables,
each of which is assigned a unique name.
• Each table consists of a set of attributes (columns
or fields) and usually stores a large set of tuples
(records or rows).
• Each tuple in a relational table represents an
object identified by a unique key and described by
a set of attribute values.
• A semantic data model, such as an entity-
relationship (ER) data model, is often constructed
for relational databases.
• An ER data model represents the database as a
set of entities and their relationships.
Data Warehouse
• A data warehouse is a repository of information
collected from multiple sources, stored under a
unified schema, and that usually resides at a
single site.
• Data warehouses are constructed via a process
of data cleaning, data integration, data
transformation, data loading, and periodic data
refreshing.
Transactional databases
• A transactional database consists of a file where each record
represents a transaction.
• A transaction typically includes a unique transaction identity
number (trans ID) and a list of the items making up the
transaction
trans ID list of item IDs
T100 I1, I3, I8, I16
T200 I2, I8
• Suppose you would like to dig deeper into the data by asking,
“Which items sold well together?”
• This kind of market basket data analysis would enable you to
bundle groups of items together as a strategy for maximizing
sales.
• For example, given the knowledge that printers are commonly
purchased together with computers, you could offer an
expensive model of printers at a discount to customers buying
selected computers, in the hopes of selling more of the
expensive printers.
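
As a rough illustration of the "which items sold well together?" question, the sketch below counts how often each pair of items appears in the same transaction. T100 and T200 come from the table above; T300 is a hypothetical extra transaction added so that one pair occurs more than once.

```python
from collections import Counter
from itertools import combinations

transactions = {
    "T100": ["I1", "I3", "I8", "I16"],
    "T200": ["I2", "I8"],
    "T300": ["I1", "I8"],  # hypothetical, not in the original table
}

# Count every unordered pair of distinct items that share a transaction.
pair_counts = Counter()
for items in transactions.values():
    for pair in combinations(sorted(set(items)), 2):
        pair_counts[pair] += 1

# The most frequent pair is a candidate for bundling.
print(pair_counts.most_common(1))
```

Counting pairs this way is only feasible for small item sets; frequent-pattern mining algorithms exist precisely to avoid enumerating every combination.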
Object Relational Databases

• Object-relational databases are
constructed based on an object-relational
data model.
• This model extends the relational model
by providing a rich data type for handling
complex objects and object orientation.
• The object-relational data model inherits
the essential concepts of object-oriented
databases, where, in general terms, each
entity is considered as an object.
Object Relational Databases

• Data and code relating to an object are encapsulated into
a single unit. Each object has associated with it the
following:
• A set of variables that describe the objects. These
correspond to attributes in the entity-relationship and
relational models.
• A set of messages that the object can use to communicate
with other objects, or with the rest of the database system.
• A set of methods, where each method holds the code to
implement a message. Upon receiving a message, the
method returns a value in response.
• For instance, the method for the message
get_photo(employee) will retrieve and return a photo of
the given employee object.
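
The variables/messages/methods triple can be mimicked in an ordinary object-oriented language. The sketch below is an analogy only, not how an object-relational DBMS implements it; the Employee class, its fields, and the send helper are hypothetical names chosen for the example.

```python
class Employee:
    """An object: variables plus the methods that implement its messages."""

    def __init__(self, name, photo):
        # Variables describing the object (attributes in the ER/relational sense).
        self.name = name
        self.photo = photo

    def get_photo(self):
        # Method holding the code that implements the get_photo message.
        return self.photo


def send(obj, message, *args):
    """Deliver a message to an object by looking up the method of that name."""
    method = getattr(obj, message)
    return method(*args)


employee = Employee("J. Smith", b"<photo bytes>")
print(send(employee, "get_photo"))
```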
Object Relational Databases
• Objects that share a common set of properties can be
grouped into an object class.
• Each object is an instance of its class. Object classes
can be organized into class/subclass hierarchies so that
each class represents properties that are common to
objects in that class.
• For data mining in object-relational systems, techniques
need to be developed for handling complex object
structures, complex data types, class and subclass
hierarchies, property inheritance, and methods and
procedures.
Temporal Databases, Sequence
Databases, and Time-Series Databases

• A temporal database typically stores relational data that
include time-related attributes.
• These attributes may involve several timestamps, each
having different semantics.
• A sequence database stores sequences of ordered
events, with or without a concrete notion of time.
Examples include customer shopping sequences, Web
click streams, and biological sequences.
• A time-series database stores sequences of values or
events obtained over repeated measurements of time
(e.g., hourly, daily, weekly).
• Examples include data collected from the stock
exchange, inventory control, and the observation of
natural phenomena (like temperature and wind).
Spatial Databases and
Spatiotemporal Databases
• Spatial databases contain spatial-related information.
Examples include geographic (map) databases, very
large-scale integration (VLSI) or computer-aided design
databases, and medical and satellite image databases.
• Spatial data may be represented in raster format,
consisting of n-dimensional bit maps or pixel maps.
• For example, a 2-D satellite image may be represented
as raster data, where each pixel registers the rainfall in a
given area.
• Maps can be represented in vector format, where roads,
bridges, buildings, and lakes are represented as unions
or overlays of basic geometric constructs, such as
points, lines, polygons, and the partitions and networks
formed by these components.
Spatial Databases and
Spatiotemporal Databases
• Geographic databases have numerous
applications, ranging from forestry and ecology
planning to providing public service information
regarding the location of telephone and electric
cables, pipes, and sewage systems.
• Geographic databases are commonly used in
vehicle navigation and dispatching systems.
• An example of such a system for taxis would
store a city map with information regarding one-
way streets, suggested routes for moving from
region A to region B during rush hour, and the
location of restaurants and hospitals, as well as
the current location of each driver.
Spatial Databases and
Spatiotemporal Databases

• A spatial database that stores spatial
objects that change with time is called a
spatiotemporal database.
Text Databases and Multimedia
Databases
• Text databases are databases that contain word
descriptions for objects.
• These word descriptions are usually not simple keywords
but rather long sentences or paragraphs, such as product
specifications, error or bug reports, warning messages,
summary reports, notes, or other documents. Text
databases may be highly unstructured (such as some Web
pages on the World Wide Web).
• Some text databases may be somewhat structured, that is,
semistructured (such as e-mail messages and many
HTML/XML Web pages), whereas others are relatively well
structured (such as library catalogue databases).
• By mining text data, one may uncover general and concise
descriptions of the text documents, keyword or content
associations, as well as the clustering behavior of text
objects.
Text Databases and Multimedia
Databases
• Multimedia databases store image, audio, and video
data. They are used in applications such as picture
content-based retrieval, voice-mail systems, video-on-
demand systems, the World Wide Web, and speech-
based user interfaces that recognize spoken commands.
• Multimedia databases must support large objects,
because data objects such as video can require
gigabytes of storage. Specialized storage and search
techniques are also required.
• Because video and audio data require real-time retrieval
at a steady and predetermined rate in order to avoid
picture or sound gaps and system buffer overflows, such
data are referred to as continuous-media data.
Heterogeneous Databases and
Legacy Databases
• A heterogeneous database consists of a
set of interconnected, autonomous
component databases. The components
communicate in order to exchange
information and answer queries.
• Objects in one component database may
differ greatly from objects in other
component databases, making it difficult to
assimilate their semantics into the overall
heterogeneous database.
Heterogeneous Databases and
Legacy Databases
• A legacy database is a group of heterogeneous
databases that combines different kinds of data
systems, such as relational or object-oriented
databases, hierarchical databases, network
databases, spreadsheets, multimedia
databases, or file systems.
• The heterogeneous databases in a legacy
database may be connected by intra or inter-
computer networks.
• Information exchange across such databases is
difficult because it would require precise
transformation rules from one representation to
another, considering diverse semantics.
Data Streams
• Many applications involve the generation and analysis of a
new kind of data, called stream data, where data flow in and
out of an observation platform (or window) dynamically.
• Such data streams have the following unique features: huge
or possibly infinite volume, dynamically changing, flowing in
and out in a fixed order, allowing only one or a small number
of scans, and demanding fast (often real-time) response time.
• Mining data streams involves the efficient discovery of
general patterns and dynamic changes within stream data.
• For example, we may like to detect intrusions of a computer
network based on the anomaly of message flow, which may
be discovered by clustering data streams, dynamic
construction of stream models, or comparing the current
frequent patterns with those at a certain previous time.
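
A minimal sketch of one-pass stream processing, assuming the stream is a sequence of message counts per second: only a fixed-size window is kept in memory, and a value is flagged when it greatly exceeds the window average. The function name, window size, and factor are illustrative choices, and this simple rule stands in for the clustering and stream-model techniques mentioned above.

```python
from collections import deque

def detect_bursts(stream, window_size=5, factor=3.0):
    """Scan the stream once, flagging positions that exceed factor x window mean."""
    window = deque(maxlen=window_size)  # old values fall out automatically
    alerts = []
    for position, value in enumerate(stream):
        if len(window) == window_size:
            baseline = sum(window) / window_size
            if value > factor * baseline:
                alerts.append(position)
        window.append(value)
    return alerts

# Hypothetical message rates; the spike at position 6 mimics an intrusion.
messages_per_second = [10, 12, 11, 9, 10, 11, 95, 12, 10]
print(detect_bursts(messages_per_second))
```

Because the function only iterates once and never stores more than `window_size` values, it also works on a generator that yields values as they arrive.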
The World Wide Web
• The World Wide Web and its associated distributed
information services, such as Yahoo!, Google, America
Online, and AltaVista, provide rich, worldwide, on-line
information services, where data objects are linked together to
facilitate interactive access.
• Users seeking information of interest traverse from one object
via links to another.
• Such systems provide ample opportunities and challenges for
data mining.
• For example, understanding user access patterns will not only
help improve system design (by providing efficient access
between highly correlated objects), but also lead to better
marketing decisions (e.g., by placing advertisements in
frequently visited documents, or by providing better
customer/user classification and behavior analysis).
• Capturing user access patterns in such distributed information
environments is called Web usage mining (or Weblog mining).
Data Mining Functionalities

1. Concept/Class Description:
Characterization and Discrimination
2. Mining Frequent Patterns, Associations
and correlations
3. Classification and Prediction
4. Cluster Analysis
5. Outlier Analysis
6. Evolution Analysis
1 Concept/Class Description

 Concept/Class description:
 Data can be associated with classes or
concepts.
 Descriptions of a class or concept in
summarized, concise & precise terms are
called class/concept descriptions.
 These descriptions can be derived via
 Data Characterization
 Data Discrimination
Data Characterization

• It is a summarization of the general characteristics or
features of a target class of data.
• Methods for data characterization
– Data warehousing & OLAP
– Attribute-oriented induction technique
• The output of data characterization can be presented as
pie charts, bar charts, curves, multidimensional cubes and
multidimensional tables, or in rule form, referred to as
characteristic rules.
• Example: Produce a description summarizing the characteristics
of customers who spend more than $1000 a year at the AllElectronics
store.
– The result could be a general profile of the customers, such as
30-40 years old, with excellent credit ratings, etc.
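
The AllElectronics example can be sketched as follows, assuming a small hypothetical customer table. Ages are generalized to ten-year groups (a simple concept hierarchy), and the most common value of each attribute in the target class forms the profile. The field names and data are invented for illustration.

```python
from collections import Counter

# Hypothetical customer records.
customers = [
    {"age": 34, "credit": "excellent", "spend": 1500},
    {"age": 38, "credit": "excellent", "spend": 2200},
    {"age": 52, "credit": "fair", "spend": 300},
    {"age": 31, "credit": "good", "spend": 1200},
]

def age_group(age):
    """Generalize an exact age to a ten-year band, e.g. 34 -> '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

# Target class: customers who spend more than $1000 a year.
target = [c for c in customers if c["spend"] > 1000]

# Summarize the target class by the most common generalized values.
profile = {
    "age_group": Counter(age_group(c["age"]) for c in target).most_common(1)[0][0],
    "credit": Counter(c["credit"] for c in target).most_common(1)[0][0],
}
print(profile)
```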
Data Discrimination

• It is a comparison of the general features of
target class data objects with the general
features of objects from one or a set of
contrasting classes.
• Example 1: Compare general features of software
products whose sales increased by 10% in the last year
with those whose sales decreased by at least 30%
during the same period.
• Example 2: Compare two groups of AllElectronics
customers, such as those who shop for computer
products more than two times a month versus those who
rarely shop for such products (i.e. less than three times a
year).
2 Mining Frequent Patterns,
Associations & Correlations
 Frequent patterns are patterns that occur frequently in
data.
 e.g. frequent itemsets, subsequences or substructures.
 Mining frequent patterns leads to the discovery of interesting
associations and correlations within data.
 Association (correlation and causality)
 Multi-dimensional vs. single-dimensional association
 age(X, “20..29”) ^ income(X, “20..29K”) => buys(X, “PC”)
[support = 2%, confidence = 60%]
 buys(X, “computer”) => buys(X, “software”) [1%, 75%]
3 Classification & Prediction

 Classification and Prediction


 Analyze Class-labeled data objects
 Finding models (functions) that describe and
distinguish classes or concepts for future prediction
 e.g., classify countries based on climate, or classify
cars based on gas mileage
 Presentation: decision-tree, classification rule,
neural network
 Prediction: Predict some unknown or missing
numerical values
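
As a toy version of "classify cars based on gas mileage", the sketch below learns a single threshold (a one-level decision tree, often called a decision stump) from class-labeled examples and then uses it to label new cars. The data, the two class names, and the function names are assumptions for illustration.

```python
def classify(threshold, mileage):
    """Apply the learned rule: at or above the threshold means 'economy'."""
    return "economy" if mileage >= threshold else "standard"

def train_stump(samples):
    """Pick the mileage threshold that misclassifies the fewest labeled samples."""
    values = sorted({mileage for mileage, _ in samples})
    # Candidate thresholds are the midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(classify(threshold, m) != label for m, label in samples)

    return min(candidates, key=errors)

# Hypothetical class-labeled training data: (miles per gallon, class label).
cars = [(18, "standard"), (22, "standard"), (35, "economy"), (40, "economy")]
threshold = train_stump(cars)
print(threshold, classify(threshold, 30))
```

The learned rule can be read directly as a classification rule, which is one of the presentation forms listed above.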
4 Cluster Analysis

 Cluster analysis
 Clustering analyzes data objects without
consulting a known class label.
 Class label is unknown: Group data to form
new classes, e.g., cluster houses to find
distribution patterns
 Clustering based on the principle: maximizing
the intra-class similarity and minimizing the
interclass similarity
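
One common way to realize this principle is k-means clustering, which is swapped in here purely as an example; the slides do not name a specific algorithm. The sketch clusters hypothetical house prices in one dimension, starting from two assumed initial centers.

```python
def kmeans_1d(values, centers, iterations=10):
    """Simple 1-D k-means: assign each value to its nearest center, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical house prices (in thousands); no class labels are given.
prices = [100, 110, 120, 400, 420, 450]
centers, clusters = kmeans_1d(prices, centers=[100, 450])
print(centers, clusters)
```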
5 Outlier Analysis

 Outlier analysis
 Outlier: a data object that does not comply with the
general behavior of the data
 It can be considered as noise or exception but is
quite useful in fraud detection, rare events analysis
 For example,
 Uncover fraudulent usage of credit cards by detecting
purchases of extremely large amounts for a given
account number in comparison to regular charges
incurred by the same account.
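
The credit card example can be sketched with a simple rule, assuming charges arrive as (account, amount) pairs: a purchase is flagged when it is several times larger than the median charge on the same account. The factor of 5 and the data are arbitrary illustrative choices, not a recommended fraud threshold.

```python
from statistics import median

def flag_large_purchases(charges, factor=5.0):
    """Flag charges that exceed factor x the median charge of their own account."""
    by_account = {}
    for account, amount in charges:
        by_account.setdefault(account, []).append(amount)

    flagged = []
    for account, amount in charges:
        typical = median(by_account[account])
        if amount > factor * typical:
            flagged.append((account, amount))
    return flagged

# Hypothetical charges: account B7 regularly spends a lot, A1 does not.
charges = [("A1", 40.0), ("A1", 55.0), ("A1", 60.0), ("A1", 2500.0),
           ("B7", 900.0), ("B7", 1100.0), ("B7", 1000.0)]
print(flag_large_purchases(charges))
```

Comparing each charge with its own account's history is what keeps the large but regular B7 charges from being flagged.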
6 Evolution Analysis

 Evolution analysis describes and models
regularities or trends for objects whose behavior
changes over time.
 Time-Series Analysis
 Trend and deviation: regression analysis
 Sequential pattern mining, periodicity analysis
 Similarity-based data analysis
 Example
 Find regularities in stock market data of the last several
years for the stocks of particular companies.
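
Trend analysis by regression can be sketched with an ordinary least-squares line fitted to equally spaced observations. The price series below is invented, and the time axis is assumed to be simply 0, 1, 2, and so on.

```python
def linear_trend(values):
    """Fit y = slope * t + intercept by least squares, with t = 0, 1, 2, ..."""
    n = len(values)
    ts = range(n)
    mean_t = sum(ts) / n
    mean_y = sum(values) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, values))
    sxx = sum((t - mean_t) ** 2 for t in ts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return slope, intercept

# Hypothetical yearly closing prices of one stock.
prices = [10.0, 12.0, 14.0, 16.0, 18.0]
slope, intercept = linear_trend(prices)
print(slope, intercept)
```

A positive slope indicates an upward trend; the residuals from the fitted line are the deviations that deviation analysis would examine.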
Are All the “Discovered” Patterns
Interesting?
 A data mining system/query may generate thousands of
patterns; not all of them are interesting.
 Suggested approach: Human-centered, query-based, focused
mining
 Interestingness measures: A pattern is interesting if it
is easily understood by humans, valid on new or test
data with some degree of certainty, potentially useful,
novel, or validates some hypothesis that a user seeks to
confirm
 Objective vs. subjective interestingness measures:
 Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
 Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, actionability, etc.
• Support (s):
The number of transactions that include all items in both
the {X} and {Y} parts of the rule, as a percentage of the
total number of transactions.
• It measures how frequently the collection of items occurs
together, as a fraction of all transactions; that is, the
fraction of transactions that contain both X and Y.
• Support reflects the usefulness of a pattern.
• Supp(A => B) = (number of tuples containing both A and B) /
(total number of tuples)
• Confidence (c):
The ratio of the number of transactions that include all
items in both {A} and {B} to the number of transactions
that include all items in {A}.
• Conf(X => Y) = Supp(X ∪ Y) / Supp(X)
• It measures how often the items in Y appear in transactions
that also contain the items in X.
• Confidence assesses the validity or trustworthiness of a
pattern; it is a certainty measure.
• Conf(A => B) = (number of tuples containing both A and B) /
(number of tuples containing A)
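
Both measures can be computed directly from a list of transactions. The sketch below follows the definitions above; the baskets are hypothetical and the function names are chosen for the example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Supp(A and B together) divided by Supp(A), for the rule A => B."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
]

# Rule: buys(computer) => buys(software)
print(support(baskets, {"computer", "software"}))
print(confidence(baskets, {"computer"}, {"software"}))
```

Here two of the four baskets contain both items, and two of the three baskets with a computer also contain software.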
Can We Find All and Only Interesting
Patterns?

 Find all the interesting patterns: Completeness


 Can a data mining system find all the interesting
patterns?
 Association vs. classification vs. clustering
 Search for only interesting patterns: Optimization
 Can a data mining system find only the interesting
patterns?
 Approaches
 First generate all the patterns and then filter out the
uninteresting ones.
 Generate only the interesting patterns—mining query
optimization
Data Mining: Confluence of
Multiple Disciplines

Database
Statistics
Technology

Machine
Learning
Data Mining Visualization

Information Other
Science Disciplines
Why Confluence of Multiple
Disciplines?
• Tremendous amount of data
– Algorithms must be highly scalable to handle data volumes
such as terabytes
• High dimensionality of data
– Micro-array data may have tens of thousands of dimensions
• High complexity of data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structure data, graphs, social networks and multi-linked data
– Heterogeneous databases and legacy databases
Data Mining: Classification Schemes

 General functionality
• Descriptive data mining - which describes data in a
concise and summative manner and presents interesting general
properties of the data.

• Predictive data mining - which analyzes data in order to
construct one or a set of models and attempts to predict the
behavior of new data sets.

 Different views, different classifications


 Kinds of databases to be mined
 Kinds of knowledge to be discovered
 Kinds of techniques utilized
 Kinds of applications adapted
A Multi-Dimensional View of Data
Mining Classification

 Databases to be mined
 Relational, transactional, object-oriented, object-
relational, active, spatial, time-series, text, multi-media,
heterogeneous, legacy, WWW, etc.
 Knowledge to be mined
 Characterization, discrimination, association,
classification, clustering, trend, deviation and outlier
analysis, etc.
 Multiple/integrated functions and mining at multiple
levels
 Techniques utilized
 Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization, neural network, etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, DNA mining,
stock market analysis, Web mining, Weblog analysis, etc.
Data Mining Task Primitives

 A data mining task can be specified in the form of a
data mining query, which is input to the data mining
system.
 A data mining query is defined in terms of data
mining task primitives.
 The set of task-relevant data to be mined
 The kind of knowledge to be mined
 The background knowledge to be used in the discovery
process
 The interestingness measures and thresholds for pattern
evaluation
 The expected representation for visualizing the discovered
patterns.
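
The five primitives can be pictured as the fields of a single query object. The sketch below is a hypothetical data structure, not the syntax of any real data mining query language; the class name, field names, and example values are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class MiningQuery:
    """One field per task primitive listed above."""
    task_relevant_data: str                                    # which data to mine
    knowledge_kind: str                                        # e.g. "association"
    background_knowledge: dict = field(default_factory=dict)   # e.g. concept hierarchies
    interestingness: dict = field(default_factory=dict)        # measures and thresholds
    presentation: str = "table"                                # how to show the patterns

query = MiningQuery(
    task_relevant_data="sales transactions from the last year",
    knowledge_kind="association",
    background_knowledge={"age": ["20..29", "30..39"]},
    interestingness={"min_support": 0.02, "min_confidence": 0.60},
    presentation="rules",
)
print(query)
```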
An Integration of Data Mining Systems
with DBMS or Data Warehouse System

 No coupling :
 The DM system will not utilize any function of a DB or DW
system. It may fetch data from a particular source,
process the data using a data mining algorithm, and store
the mining results in another file.

 Loose coupling:
 DM system will use some facilities of a DB or DW
system, fetching data from a data repository managed
by these systems, performing data mining, and then
storing results either in a file or at a designated place
in a database or data warehouse.
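
Loose coupling can be sketched with Python's built-in sqlite3 module standing in for the DB system: the data are fetched with a query, mined outside the database, and the results are written back to a designated table. The table names (sales, item_counts) and the trivial item-counting "mining" step are assumptions for illustration.

```python
import sqlite3
from collections import Counter

# An in-memory database plays the role of the DB/DW system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (trans_id TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("T100", "I1"), ("T100", "I3"), ("T100", "I8"), ("T100", "I16"),
     ("T200", "I2"), ("T200", "I8")],
)

# 1. Fetch data using a facility of the DB system (a query).
rows = conn.execute("SELECT item FROM sales").fetchall()

# 2. Perform the mining outside the database (here: count item frequencies).
counts = Counter(item for (item,) in rows)

# 3. Store the results at a designated place in the database.
conn.execute("CREATE TABLE item_counts (item TEXT, n INTEGER)")
conn.executemany("INSERT INTO item_counts VALUES (?, ?)", counts.items())

stored = dict(conn.execute("SELECT item, n FROM item_counts"))
print(stored)
```

With no coupling, step 1 would read a flat file instead and step 3 would write to another file; with tighter coupling, the counting itself would move into the database.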
An Integration of Data Mining Systems with
DBMS or Data Warehouse System

 Semi-tight coupling:
 Besides linking a DM system to a DB/DW system, efficient
implementations of a few essential data mining primitives
can be provided in DB/DW system. Some frequently used
intermediate mining results can be precomputed and stored
in the DB/DW system.
 Tight coupling:
 DM system is smoothly integrated into the DB/DW system.
 The DM subsystem is treated as one functional component of
an information system.
 Data Mining queries & functions are optimized based on
mining query analysis, data structures, indexing schemes
and query processing methods of DB or DW system.
Major Issues in Data Mining

 Mining methodology and user interaction


 Mining different kinds of knowledge in databases
 Interactive mining of knowledge at multiple levels of
abstraction
 Incorporation of background knowledge
 Data mining query languages and ad-hoc data mining
 Expression and visualization of data mining results
 Handling noise and incomplete data
 Pattern evaluation: the interestingness problem
 Performance and scalability
 Efficiency and scalability of data mining algorithms
 Parallel, distributed and incremental mining methods
Major Issues in Data Mining

 Issues relating to the diversity of data types


 Handling relational and complex types of data
 Mining information from heterogeneous databases and
global information systems (WWW)
 Issues related to applications and social impacts
 Application of discovered knowledge
 Domain-specific data mining tools
 Intelligent query answering
 Process control and decision making
 Integration of the discovered knowledge with existing
knowledge: A knowledge fusion problem
 Protection of data security, integrity, and privacy
Applications of Data Mining

• Web page analysis: from web page
classification, clustering to PageRank &
HITS algorithms
• Collaborative analysis & recommender
systems
• Basket data analysis to targeted marketing
• Biological and medical data analysis:
classification, cluster analysis (microarray
data analysis), biological sequence
analysis, biological network analysis
• Data mining and software engineering
(e.g., IEEE Computer, Aug. 2009 issue)
• From major dedicated data mining
systems/tools (e.g., SAS, MS SQL-Server
Analysis Manager, Oracle Data Mining
Tools) to invisible data mining
Summary

 Data mining: discovering interesting patterns from large
amounts of data
 A natural evolution of database technology, in great
demand, with wide applications
 A KDD process includes data cleaning, data integration,
data selection, transformation, data mining, pattern
evaluation, and knowledge presentation
 Mining can be performed in a variety of information
repositories
 Data mining functionalities: characterization,
discrimination, association, classification, clustering,
outlier and trend analysis, etc.
 Classification of data mining systems
 Major issues in data mining
