Data Warehousing and Data Mining
Dr Seema Agarwal
Data Mining – UNIT II
• Introduction
• Data
• Types of Data
• Data Mining Functionalities
• Interestingness of Patterns
• Classification of Data Mining Systems
• Data Mining Task Primitives
• Integration of a Data Mining System with a Data Warehouse
• Issues
• Data Preprocessing
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes,
exabytes, etc.
– Data collection and data availability
• Automated data collection tools, database systems, Web,
computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific
simulation, …
• Society and everyone: news, digital cameras, …
We are drowning in data, but starving for knowledge!
Evolution of Database Technology
• 1960s:
– Data collection, database creation, IMS and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
– Data mining, data warehousing, multimedia databases, and Web databases
• 2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems
What Is Data Mining?
• Data mining: the extraction of interesting (non-trivial, implicit,
previously unknown, and potentially useful) patterns or knowledge
from huge amounts of data
Market Analysis and Management
• Cross-market analysis
– Associations/correlations between product sales, and
prediction based on such associations
• Customer profiling
– What types of customers buy what products
• Customer requirement analysis
– Identifying the best products for different
customers
– Predict what factors will attract new customers
Fraud Detection & Mining Unusual Patterns
Data
• A data set is a collection of data objects and their attributes.
An object is an entity:
e.g., in a sales database: customers, store items, sales;
in a university database: students, faculty, courses.
• An attribute is a property or characteristic of an
object; it is a data field.
– Examples: eye color of a person, height, etc.
– An attribute is also known as a variable, field,
characteristic, or feature.
• A collection of attributes describes an object.
– An object is also known as a record, point, case, sample,
entity, or instance.
Types of Attributes
There are different types of attributes.
• Nominal attributes are symbols or names of things.
Each value represents some kind of category, code, or
state, so nominal attributes are also referred to as categorical.
• Examples: eye color, pin codes
Eye color – black, brown, blue, grey
What about Age or IQ? (Both take numeric values, so they
are not nominal.)
Qualitative and Quantitative Data
Data can be divided into two broad types: qualitative and
quantitative.
[Figure: the KDD process: data cleaning and data integration move
data from databases into a data warehouse, from which task-relevant
data are selected for mining]
Steps of a KDD Process
• Learning the application domain
– Relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of the effort!)
– to remove noise or irrelevant data
• Data reduction and transformation
– Find useful features, dimensionality/variable reduction.
– by performing summary or aggregation operations
• Choosing functions of data mining
– Summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– Visualization, transformation, removing redundant patterns, etc.
– knowledge representation techniques are used to present the mined
knowledge to the user
[Figure: architecture of a typical data mining system: data cleaning,
data integration, and filtering move data from databases into a data
warehouse; a database or data warehouse server serves the system,
with pattern evaluation guided by a knowledge base]
Architecture of a typical data mining system
The architecture of a typical data mining system may have the
following major components:
1. Database, data warehouse, or other information repository. This
is one or a set of databases, data warehouses, spread sheets, or
other kinds of information repositories. Data cleaning and data
integration techniques may be performed on the data.
2. Database or data warehouse server. The database or data
warehouse server is responsible for fetching the relevant data,
based on the user's data mining request.
3. Knowledge base. This is the domain knowledge that is used to
guide the search, or evaluate the interestingness of resulting
patterns. Such knowledge can include concept hierarchies, used
to organize attributes or attribute values into different levels of
abstraction.
4. Data mining engine. This is essential to the data mining system and
ideally consists of a set of functional modules for tasks such as
characterization, association analysis, classification, and evolution
and deviation analysis.
Concept/class description
• Characterization and Discrimination
• Data can be associated with classes or concepts. For
example, in the AllElectronics store,
• classes of items for sale include computers and
printers,
• and concepts of customers include bigSpenders and
budgetSpenders.
• It can be useful to describe individual classes and
concepts in summarized, concise, and yet precise
terms.
Such descriptions of a class or a concept are called
class/concept descriptions.
• These descriptions can be derived via
(1) data characterization, by summarizing the
data of the class under study (often called
the target class) in general terms,
(2) data discrimination, by comparison of the
target class with one or a set of comparative
classes (often called the contrasting classes),
(3) both data characterization and
discrimination.
Data characterization
• Data characterization is a summarization of the general
characteristics or features of a target class of data.
• The data corresponding to the user-specified class are
typically collected by a database query and then run
through a summarization module to extract the
essence of the data at different levels of abstraction.
• The data cube-based OLAP roll-up operation can be used
to perform user-controlled data summarization along a
specified dimension.
• The output of data characterization can be presented in
various forms. Examples include pie charts, bar charts,
curves, multidimensional data cubes, and
multidimensional tables.
• For example, one may want to characterize the
OurVideoStore customers who regularly rent
more than 30 movies a year. With concept
hierarchies on the attributes describing the target
class, the attribute-oriented induction method
can be used, for example, to carry out data
summarization.
• When a data cube already contains a summarization of the
data, simple OLAP operations fit the purpose of data
characterization, as sketched below.
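To make this concrete, here is a minimal sketch of roll-up-style
summarization of a target class in Python with pandas; the table,
column names, and values are invented for illustration, not taken
from the slides.

import pandas as pd

# Hypothetical rentals table for the OurVideoStore example
rentals = pd.DataFrame({
    "customer": ["Amy", "Amy", "Bob", "Bob", "Cara", "Cara"],
    "city":     ["Pune", "Pune", "Pune", "Delhi", "Delhi", "Delhi"],
    "year":     [2023, 2024, 2023, 2024, 2023, 2024],
    "movies_rented": [35, 40, 12, 9, 31, 38],
})

# Target class: customers who rent more than 30 movies a year
target = rentals[rentals["movies_rented"] > 30]

# Summarize (roll up) the target class along the "city" dimension
print(target.groupby("city")["movies_rented"].agg(["count", "mean"]))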
Data discrimination
• Data discrimination produces what are
called discriminant rules and is basically the
comparison of the general features of objects
between two classes referred to as the target
class and the contrasting class.
• For example, one may want to compare the
general characteristics of the customers who
rented more than 30 movies in the last year with
those who rented fewer than 5.
• The techniques used for data discrimination are
very similar to the techniques used for data
characterization with the exception that data
discrimination results include comparative
measures.
Association analysis
• Association analysis is the discovery of what are
commonly called association rules.
• It studies the frequency of items occurring
together in transactional databases, and based on
a threshold called support, identifies the frequent
item sets.
• Another threshold, confidence, which is the
conditional probability that an item appears in a
transaction when another item appears, is used
to pinpoint association rules.
• Association analysis is commonly used for market
basket analysis.
• Association rules may also involve more than one attribute
or predicate (e.g., age, income, and buys).
• Suppose, as a marketing manager of
AllElectronics, you would like to determine which
items are frequently purchased together within
the same transactions. An example of such a rule
is
• contains(T, "computer") => contains(T, "software")
[support = 1%, confidence = 50%]
• meaning that if a transaction T contains
"computer", there is a 50% chance that it
contains "software" as well, and 1% of all of the
transactions contain both.
• For example, it is important for the OurVideoStore manager
to know what games are often rented together, or whether
there is a relationship between renting a certain type of
game and buying popcorn or pop.
• RentType(X, "game") AND Age(X, "13-19") ->
Buys(X, "pop") [s=2% ,c=55%]
would indicate that 2% of the transactions
considered are of customers aged between 13
and 19 who are renting a game and buying a
pop, and that there is a confidence of 55%
that teenage customers who rent a game also
buy pop.
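As a minimal sketch, support and confidence can be computed directly
from raw transactions; the transactions below are invented for
illustration.

# Hypothetical transactions: each is a set of items
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer", "software"},
    {"computer", "software"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item in itemset
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / n

def confidence(antecedent, consequent):
    # Conditional probability that consequent appears given antecedent
    return support(set(antecedent) | set(consequent)) / support(antecedent)

# Rule: contains(T, "computer") => contains(T, "software")
print(support({"computer", "software"}))       # support of the rule
print(confidence({"computer"}, {"software"}))  # confidence of the rule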
Use basket analysis to plan
• Marketing or advertising strategies
• Design store layout
– Keep items bought together in close proximity
– Place hardware and software at opposite ends so that
customers walk a long way and notice other
items, like security systems
– Plan sale items, e.g., a sale on printers
• For association analysis (see the sketch below):
– minimum threshold values of support and
confidence are set by domain experts
– the mining algorithm is run
– the resulting rules are examined
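One way this workflow might look in code, using the third-party
mlxtend library; choosing mlxtend, the basket contents, and the
threshold values are all my assumptions for illustration, not
something the slides prescribe.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical one-hot encoded baskets (True = item in the basket)
baskets = pd.DataFrame({
    "game":    [True, True, False, True, True],
    "pop":     [True, True, False, False, True],
    "popcorn": [False, True, True, False, False],
})

# Thresholds as fixed by domain experts (values assumed here)
frequent = apriori(baskets, min_support=0.02, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.55)

# Examine the resulting rules
print(rules[["antecedents", "consequents", "support", "confidence"]])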
Classification and Prediction
• Classification is the process of finding a set of models (or
functions) which describe and distinguish data classes or
concepts, for the purpose of being able to use the model to
predict the class of objects whose class label is unknown.
• Classification approaches normally use a training set where all
objects are already associated with known class labels. The
classification algorithm learns from the training set and builds a
model. The model is used to classify new objects.
• The derived model is based on the analysis of a set of training
data (i.e., data objects whose class label is known).
• The derived model may be represented in various forms, such as
classification (IF-THEN) rules, decision trees, mathematical
formulae, or neural networks.
Classification vs. Prediction
• Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on the training set
and the values (class labels) in a classifying attribute and uses it
in classifying new data
• Prediction:
– models continuous-valued functions, i.e., predicts unknown or
missing values
• Typical Applications
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis
Classification—A Two-Step Process
• Model construction: describing a set of predetermined
classes
– Each tuple is assumed to belong to a predefined class, as determined
by the class label attribute (supervised learning)
– The set of tuples used for model construction: training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying previously unseen objects
– Estimate accuracy of the model using a test set
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set, otherwise over-fitting will
occur
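A compact sketch of the two-step process using scikit-learn's
decision tree classifier; the feature matrix, labels, and split
parameters are invented for illustration.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical data: [years_employed, income_in_thousands] -> credit risk
X = [[1, 20], [3, 45], [10, 90], [2, 25], [8, 80], [15, 120],
     [1, 18], [6, 60], [12, 100], [4, 40]]
y = ["high", "high", "low", "high", "low", "low",
     "high", "low", "low", "high"]

# Step 1: model construction on the training set (supervised learning)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage: estimate accuracy on the independent test set,
# then classify a previously unseen object
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("unseen object:", model.predict([[5, 55]]))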
Classification Process: Model Construction and Use
[Figure: training data feed the classification algorithms, which build
a classifier; the classifier is then applied to testing data and to
unseen data, e.g., the tuple (Jeff, Professor, 4)]
Interestingness of Patterns
• Objective vs. subjective interestingness measures
– Objective: based on statistics and structures of patterns,
e.g., support, confidence, etc.
– Subjective: based on the user's belief in the data, e.g.,
unexpectedness, novelty.
Data Mining: Classification Schemes
Data Mining: Classification
• Data to be mined
– Relational, data warehouse, transactional, stream,
object-oriented/relational, active, spatial, time-series, text,
multimedia, heterogeneous, WWW
• Knowledge to be mined
– Characterization, discrimination, association, classification,
clustering, trend/deviation etc.
– based on the granularity or levels of abstraction of the knowledge
mined, including generalized knowledge (at a high level of
abstraction), primitive-level knowledge (at a raw data level)
– knowledge at multiple levels (considering several levels of
abstraction).
– An advanced data mining system should facilitate the discovery
of knowledge at multiple levels of abstraction.
Data Mining: Classification
• Techniques utilized
– according to the degree of user interaction involved
(e.g., autonomous systems, interactive exploratory
systems, query-driven systems),
– methods of data analysis employed -Database-
oriented, data warehouse (OLAP), machine learning,
statistics, visualization, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis,
bio-data mining, stock market analysis, Web mining,
etc.
Major Issues in Data Mining
• Mining methodology
– Mining different kinds of knowledge from diverse data
types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
• Performance issues
– Efficiency and scalability of data mining algorithms
– Parallel, distributed and incremental mining methods
• Diverse database types
Major Issues in Data Mining
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels
of abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data
mining
– Protection of data security, integrity, and privacy
Data Preprocessing
• Data in the real world is dirty: lots of potentially incorrect data, e.g.,
faulty instruments, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially for
numerical data
1. Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time
of entry
– history or changes of the data were not registered
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
– a global constant : e.g., “unknown”
– the attribute mean: for example, suppose that the average income of
AllElectronics customers is $28,000. Use this value to replace the missing
value for income.
– the attribute mean for all samples belonging to the same class: For example, if
classifying customers according to credit risk, replace the missing value with the
average income value for customers in the same credit risk category as that of the
given tuple.
– the most probable value: inference-based methods such as a Bayesian
formula or a decision tree; using the other customer attributes in your data
set, you may construct a decision tree to predict the missing values for
income (sketched below)
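The automatic fill-in strategies above can be sketched with pandas;
the column names, values, and credit-risk grouping are assumptions
for illustration.

import pandas as pd

# Hypothetical customer table with missing income values
df = pd.DataFrame({
    "income":      [28000.0, None, 35000.0, None, 50000.0, 31000.0],
    "credit_risk": ["low", "low", "high", "high", "high", "low"],
})

# A global constant such as "unknown"
df["income_const"] = df["income"].astype("object").fillna("unknown")

# The attribute mean over all samples
df["income_mean"] = df["income"].fillna(df["income"].mean())

# The attribute mean within the same class (credit risk category)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean"))

print(df)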
2. Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal
with possible outliers)
Simple Discretization Methods: Binning
* Sorted data for price (in Rs): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
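A small sketch reproducing the bins above in plain Python (the bin
depth of 4 comes from the slide's partition):

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: replace each value by its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: snap each value to the nearer boundary
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]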
Clustering
• Outliers may be detected by clustering, where
similar values are organized into groups or
“clusters”.
• Intuitively, values which fall outside of the set of
clusters may be considered outliers.
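One way to realize this idea, sketched with scikit-learn's DBSCAN,
which labels points that fall in no cluster as -1; the values and
parameter settings are invented for illustration.

import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 1-D values with one obvious outlier (reshaped for sklearn)
values = np.array([4, 5, 5, 6, 21, 22, 23, 95]).reshape(-1, 1)

# Points within eps of at least min_samples neighbours form clusters
labels = DBSCAN(eps=3, min_samples=2).fit_predict(values)

# Values labelled -1 fall outside every cluster: treat them as outliers
print(values[labels == -1].ravel())  # expected: [95]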
Combined computer and human
inspection
• Outliers may be identified through a combination of
computer and human inspection.
• For example, an information-theoretic measure was
used to help identify outlier patterns in a handwritten
character database for classification.
• The measure's value reflected the "surprise" content
of the predicted character label with respect to the
known label.
• Outlier patterns may be informative or "garbage".
Patterns whose surprise content is above a threshold
are output to a list.
• A human can then sort through the patterns in the list
to identify the actual garbage ones.
Regression
• Data can be smoothed by fitting the data to a
function, such as with regression.
• Linear regression involves finding the "best" line
to fit two variables, so that one variable can be
used to predict the other.
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple linear regression is an extension of linear
regression, where more than two variables are
involved and the data are fit to a
multidimensional surface.
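A minimal least-squares line fit in NumPy; the data points are
invented for illustration.

import numpy as np

# Hypothetical paired observations (x is used to predict y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.9, 4.1, 6.2, 7.8, 10.1])

# Fit the "best" line y = w*x + b by the least-squares method
w, b = np.polyfit(x, y, deg=1)

# Smooth: replace the noisy y values with points on the fitted line
print("slope:", w, "intercept:", b)
print("smoothed:", w * x + b)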
[Figure: regression analysis: a least-squares line fitted through
(x, y) data points]