Data Mining 5 Units Notes
Objective: On successful completion of the course, students should have understood data mining techniques and the concepts and design of data warehousing.
UNIT I
Introduction – What is Data mining – Data Warehouses – Data Mining Functionalities – Basic Data mining tasks –
Data Mining Issues – Social Implications of Data Mining – Applications and Trends in Data Mining.
UNIT II
Data Preprocessing: Why Preprocess the Data? – Data Cleaning – Data Integration and Transformation – Data
Reduction – Data Cube Aggregation – Attribute Subset Selection. Classification: Introduction – Statistical-based
algorithms – Bayesian classification – Distance-based algorithms – Decision tree-based algorithms – ID3.
UNIT III
Clustering: Introduction - Hierarchical algorithms – Partitional algorithms – Minimum spanning tree – K-Means
Clustering - Nearest Neighbour algorithm. Association Rules: What is an association rule? – Methods to discover
an association rule–APRIORI algorithm – Partitioning algorithm .
UNIT IV
Data Warehousing: An introduction – characteristics of a data warehouse – Data marts – other aspects of data marts.
Online Analytical Processing: OLTP & OLAP systems.
UNIT V
Developing a data warehouse : Why and how to build a data warehouse – Data warehouse architectural strategies
and organizational issues – Design consideration – Data content – meta data – distribution of data – tools for data
warehousing – Performance considerations
TEXT BOOKS
1. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2006. (Unit I – Chapter 1: 1.2, 1.4; Chapter 11: 11.1) (Unit II – Chapter 2: 2.1, 2.3, 2.4, 2.5.1, 2.5.2)
2. Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education, 2003. (Unit I – Chapter 1: 1.1, 1.3, 1.5) (Unit II – Chapter 4: 4.1, 4.2, 4.3, 4.4) (Unit III – Chapter 5: 5.1, 5.4, 5.5.1, 5.5.3, 5.5.4; Chapter 6: 6.1, 6.3)
3. C.S.R. Prabhu, "Data Warehousing: Concepts, Techniques, Products & Applications", PHI, Second Edition. (Unit IV & V)
REFERENCES:
1. Pieter Adriaans, Dolf Zantinge, "Data Mining", Pearson Education, 1998.
*****
Swami Dayananda College of Arts & Science, Manjakkudi
DATA MINING & WAREHOUSING
UNIT I
INTRODUCTION : WHAT IS DATA MINING?
Definition
⚫ Data mining refers to extracting or mining knowledge from large amounts of data.
⚫ In simple words, data mining is defined as a process used to extract usable data from
a larger set of raw data.
Having defined data mining, we can look at its related terms:
Database
-Database is an organized collection of data, generally stored and accessed
electronically from a computer system .
DBMS
-A Database Management System is software that interacts with end users,
applications, and the database itself to capture and analyze the data.
Data warehouse
- a large store of data accumulated from a wide range of sources within a company
and used to guide management decisions.
OLTP
-Online Transaction Processing is a class of software programs capable of
supporting transaction-oriented applications on the internet, e.g., log files, online
banking.
KDD
⚫ Many people treat data mining as a synonym for another popularly used term,
Knowledge Discovery from Data, or KDD. KDD consists of the following steps:
1. Data Cleaning
2. Data Integration
3. Data Selection
4. Data Transformation
5. Data Mining
6. Pattern Evaluation
7. Knowledge Presentation
⚫ Data Cleaning
-To remove noise and inconsistent data.
⚫ Data Integration
-Where multiple data sources may be combined.
⚫ Data Selection
-Where data relevant to the analysis task are retrieved from the database.
⚫ Data Transformation
-Where data are transformed or consolidated into forms appropriate for mining
by performing summary or aggregation operations.
⚫ Data Mining
-An essential process where intelligent methods are applied in order to extract
data patterns.
⚫ Pattern Evaluation
-To identify the truly interesting patterns representing knowledge.
⚫ Knowledge Presentation
-Where visualization and knowledge representation techniques are used to
present the mined knowledge to users.
The architecture of a typical data mining system may have the following major components:
⚫ Knowledge base
⚫ User interface
-This module communicates between users and the data mining system,
allowing the user to interact with the system by specifying a data
mining query or task, providing information to help focus the search,
and performing exploratory data mining based on the intermediate
data mining results.
DATA WAREHOUSES
A data warehouse is a repository of information collected from multiple sources,
stored under a unified schema, and usually residing at a single site.
Data warehouses are constructed via a process of data cleaning, data integration,
data transformation, data loading, and periodic data refreshing.
Framework for construction and use of a data warehouse
Data warehouses provide information from a historical perspective and are typically
summarized. For example, rather than storing the detailed information of each transaction in a
supermarket, the warehouse stores a summarization of the transactions based on item sales.
A data warehouse is usually updated periodically, so it does not contain the most
current information.
A data warehouse is usually modeled by a multidimensional database structure,
where each dimension corresponds to an attribute or a set of attributes in the schema.
The actual physical structure of a data warehouse may be a relational data store or a
multidimensional data cube.
Concept/Class Description: Characterization and Discrimination
A decision tree is a flow-chart-like tree structure, where each node denotes a test on
an attribute value, each branch represents an outcome of the test, and tree leaves represent
classes or class distributions. Decision trees can easily be converted to classification rules
Cluster Analysis
Classification and prediction analyze class-labeled data objects, whereas
clustering analyzes data objects without consulting a known class label.
In general, the class labels are not present in the training data simply because they
are not known to begin with; clustering can be used to generate such labels.
The objects are clustered or grouped based on the principle of maximizing the intra-
class similarity and minimizing the inter-class similarity.
Outlier Analysis
A database may contain data objects that do not comply with the general behavior or
model of the data. These data objects are outliers. Most data mining methods discard
outliers as noise or exceptions.
However, in some applications such as fraud detection, the rare events can be more
interesting than the more regularly occurring ones. The analysis of outlier data is referred
to as outlier mining.
Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose
behavior changes over time. Although this may include characterization, discrimination,
association and correlation analysis, classification, prediction, or clustering of time related
data, distinct features of such an analysis include time-series data analysis, sequence or
periodicity pattern matching, and similarity-based data analysis.
Predictive data mining tasks come up with a model from the available data set that is
helpful in predicting unknown or future values of another data set of interest. A medical
practitioner trying to diagnose a disease based on the medical test results of a patient can be
considered as a predictive data mining task.
Descriptive data mining tasks usually find patterns describing the data and come up
with new, significant information from the available data set. A retailer trying to identify
products that are purchased together can be considered as a descriptive data mining task.
a) Classification
Classification derives a model to determine the class of an object based on its
attributes. A collection of records will be available, each record with a set of attributes.
One of the attributes will be the class attribute, and the goal of the classification task is to
assign a class attribute to a new set of records as accurately as possible.
Example: An airport security screening station is used to determine if passengers are
potential terrorists or criminals. To do this, the face of each passenger is scanned and its basic
pattern (distance between eyes, size and shape of mouth, shape of head, etc.) is identified.
This pattern is compared to entries in a database to see if it matches any patterns that are
associated with known offenders.
b) Regression
The regression task is similar to classification. The main difference is that the
predictable attribute is a continuous number. Regression techniques have been widely
studied for centuries in the field of statistics. Linear regression and logistic regression are
the most popular regression methods. Other regression techniques include regression trees
and neural networks. Regression tasks can solve many business problems.
Example : predict wind velocities based on past temperature, air pressure, and
humidity.
c) Time Series Analysis
Time series is a sequence of events where the next event is determined by one or
more of the preceding events. Time series reflects the process being measured and there
are certain components that affect the behavior of a process. Time series analysis includes
methods to analyze time-series data in order to extract useful patterns, trends, rules and
statistics.
Example: Stock market prediction is an important application of time-series
analysis. A person is trying to determine whether to purchase stock of companies
X, Y, and Z. For a period of one month he charts the daily stock price for each company.
Based on this he makes his decision.
d) Prediction
Prediction task predicts the possible values of missing or future data. Prediction
involves developing a model based on the available data and this model is used in
predicting future values of a new data set of interest.
e) Association Rules
Association discovers the association or connection among a set of items.
Association identifies the relationships between objects. Association analysis is used for
commodity management, advertising, catalog design, direct marketing etc.
Example: A retailer can identify the products that customers normally purchase
together, or even find the customers who respond to the promotion of the same kind of
products. If a retailer finds that bread and jam are mostly bought together, he can put bread
on sale to promote the sale of jam.
f) Sequence Discovery
Sequence analysis or Sequence discovery is used to determine sequential patterns in
data. These patterns are based on a time sequence of actions. These patterns are similar to
associations in that data are found to be related, but the relationship is based on time.
In market basket analysis the items are purchased at the same time, whereas in
sequence discovery the items are purchased over time in some order.
Example: Most people who purchase a CD player may be found to purchase CDs
within one week, then speakers, and then a home theater system.
g) Clustering
Clustering is used to identify data objects that are similar to one another. The
similarity can be decided based on a number of factors like purchase behavior,
responsiveness to certain actions, geographical locations and so on.
Example : An insurance company can cluster its customers based on age, residence,
income etc. This group information will be helpful to understand the customers better and
hence provide better customized services.
h) Summarization
Summarization is the generalization of data. A set of relevant data is summarized,
which results in a smaller set that gives aggregated information about the data.
Example: The shopping done by a customer can be summarized into total products,
total spending, offers used, etc. Such high level summarized information can be useful for
sales or customer relationship team for detailed customer and purchase behavior analysis.
Data can be summarized in different abstraction levels and from different angles.
1. Human interaction: In data mining, interfaces may be needed with both domain
and technical experts. Technical experts are used to formulate the queries and assist in
interpreting the results.
2. Overfitting: When a model is generated from a given database state, it is
desirable that the model also fit future database states. Overfitting occurs when the model
does not fit future states.
3. Outliers : There are often many data entries that do not fit nicely into the derived
model. If a model is developed that includes these outliers, then the model may not behave
well for data that are not outliers.
5. Visualization of results: To easily view and understand the output of data mining
algorithms, visualization of the results is helpful.
6. Large datasets: The massive datasets associated with data mining create
problems when applying algorithms designed for small datasets. Many modeling
applications grow exponentially with the dataset size and thus are too inefficient for larger
datasets.
9. Missing data: During the KDD phase, missing data may be replaced with
estimates; this may lead to invalid results.
10. Irrelevant data: Some attributes in the database might not be of interest to the
data mining task being developed.
12. Changing data: Databases cannot be assumed to be static, but most data mining
algorithms require a static database. This requires that the algorithm be completely rerun
anytime the database changes.
13. Integration: The KDD process is not currently integrated into normal data
processing activities. It may be treated as a special, unusual, or one-time need. This makes
it inefficient, ineffective, and not general enough to be used on an ongoing basis.
14. Application: Determining the intended use for the information obtained from
the data mining function is a challenge. Indeed, how business executives can effectively
use the output is sometimes considered the more difficult part, not the running of the
algorithms themselves.
Data mining is used in a vast array of areas, and numerous commercial data mining
systems are available. Still, data mining is a relatively young discipline with wide and
diverse applications, and there is a nontrivial gap between the general principles of data
mining and application-specific, effective data mining tools.
Financial Data Analysis
The financial data in the banking and financial industry is generally reliable and of high
quality, which facilitates systematic data analysis and data mining. Some of the typical cases
are as follows −
Design and construction of data warehouses for multidimensional data analysis and
data mining.
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.
Retail Industry
Data mining has great application in the retail industry because the industry collects
large amounts of data on sales, customer purchasing history, goods transportation,
consumption and services. It is natural that the quantity of data collected will continue to
expand rapidly because of the increasing ease, availability and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and
trends, which leads to improved quality of customer service and good customer retention
and satisfaction. Here is a list of examples of data mining in the retail industry −
Design and Construction of data warehouses based on the benefits of data mining.
Multidimensional analysis of sales, customers, products, time and region.
Analysis of effectiveness of sales campaigns.
Customer Retention.
Product recommendation and cross-referencing of items.
Biological Data Analysis
In recent times, we have seen tremendous growth in fields of biology such as genomics,
proteomics, functional genomics and biomedical research. Biological data mining is a very
important part of bioinformatics. Following are the aspects in which data mining contributes to
biological data analysis −
Semantic integration of heterogeneous, distributed genomic and proteomic
databases.
Alignment, indexing, similarity search and comparative analysis of multiple
nucleotide sequences.
Discovery of structural patterns and analysis of genetic networks and protein
pathways.
Association and path analysis.
Visualization tools in genetic data analysis.
Scientific Applications
Huge amounts of data have been collected from scientific domains such as geosciences,
astronomy, etc. Large data sets are also being generated by fast numerical simulations in
fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, etc.
Following are the applications of data mining in the field of scientific applications −
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become a
major issue. The increased usage of the internet, together with the availability of tools and
tricks for intruding into and attacking networks, has prompted intrusion detection to become
a critical component of network administration.
Here is the list of areas in which data mining technology may be applied for intrusion
detection −
Development of data mining algorithm for intrusion detection.
Association and correlation analysis, aggregation to help select and build
discriminating attributes.
Analysis of Stream data.
Distributed data mining.
Visualization and query tools
Data mining concepts are still evolving, and here are the latest trends that we get to see in
this field −
Application Exploration.
Scalable and interactive data mining methods.
Integration of data mining with database systems, data warehouse systems and web
database systems.
Standardization of data mining query language.
Visual data mining.
New methods for mining complex types of data.
Biological data mining.
Data mining and software engineering.
Web mining.
Distributed data mining.
Real time data mining.
Multi database data mining.
Privacy protection and information security in data mining.
UNIT-II
DATA PREPROCESSING
1. Preprocessing
Real-world databases are highly susceptible to noisy, missing, and
inconsistent data due to their typically huge size (often several gigabytes or more)
and their likely origin from multiple, heterogeneous sources. Low-quality data will
lead to low-quality mining results, so the data must be preprocessed first.
Data Preprocessing Techniques
* Data cleaning can be applied to remove noise and correct inconsistencies in the data.
* Data integration merges data from multiple sources into a coherent data store, such as a
data warehouse.
* Data reduction can reduce the data size by aggregating, eliminating redundant
features, or clustering, for instance. These techniques are not mutually exclusive; they
may work together.
* Data transformations, such as normalization, may be applied.
Need for preprocessing
Incomplete, noisy and inconsistent data are commonplace properties of large real-world
databases and data warehouses.
Incomplete data can occur for a number of reasons:
Attributes of interest may not always be available
Relevant data may not be recorded due to misunderstanding, or because of equipment
malfunctions.
Data that were inconsistent with other recorded data may have been deleted.
Missing data, particularly for tuples with missing values for some attributes, may
need to be inferred.
The data collection instruments used may be faulty.
There may have been human or computer errors occurring at data entry.
Errors in data transmission can also occur.
There may be technology limitations, such as limited buffer size for coordinating
synchronized data transfer and consumption.
Data cleaning routines work to "clean" the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Data integration is the process of integrating multiple databases, data cubes, or files. Yet
some attributes representing a given concept may have different names in different databases,
causing inconsistencies and redundancies.
Data transformation operations, such as normalization and aggregation, are
additional data preprocessing procedures that contribute toward the success of
the mining process.
Data reduction obtains a reduced representation of data set that is much smaller in
volume, yet produces the same(or almost the same) analytical results.
2. DATA CLEANING
Real-world data tend to be incomplete, noisy, and inconsistent. Data
cleaning (or data cleansing) routines attempt to fill in missing values, smooth out
noise while identifying outliers and correct inconsistencies in the data.
Missing Values
Many tuples have no recorded value for several attributes, such as customer
income, so we can fill in the missing values for these attributes.
The following methods are useful for filling in missing values:
1. Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective unless the tuple
contains several attributes with missing values. It is especially poor when the
percentage of missing values per attribute varies considerably.
2. Fill in the missing values manually: This approach is time-consuming and may not
be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like "unknown" or −∞.
4. Use the attribute mean to fill in the missing value: For example, suppose that the
average income of customers is $56,000. Use this value to replace the missing value for
income.
5. Use the most probable value to fill in the missing value: This may be determined
with regression, inference-based tools using a Bayesian formalism, or decision tree
induction. For example, using the other customer attributes in the data set, a decision
tree is constructed to predict the missing value for income.
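As a rough illustration of method 4 above, the mean-fill strategy can be sketched in a few lines of Python (the income figures are hypothetical, chosen so that the known values average to the $56,000 used in the example):

```python
# Hypothetical customer incomes; None marks a missing value (assumed data).
records = [48000, 52000, None, 61000, None, 63000]

# Method 4: replace each missing value with the mean of the known values.
known = [v for v in records if v is not None]
mean_income = sum(known) / len(known)   # (48000+52000+61000+63000)/4 = 56000.0
filled = [v if v is not None else mean_income for v in records]
```

A refinement closer to method 5 would fill each value per customer segment rather than with one global mean.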
Noisy Data
Noise is a random error or variance in a measured variable. Noise is removed
using data smoothing techniques.
Binning: Binning methods smooth a sorted data value by consulting its
"neighborhood", that is, the values around it. The sorted values are distributed into a
number of "buckets" or "bins". Because binning methods consult the neighborhood
of values, they perform local smoothing.
Sorted data for price (in dollars): 3, 7, 14, 19, 23, 24, 31, 33, 38.
Example 1: Partition into (equal-frequency) bins:
Bin 1: 3, 7, 14
Bin 2: 19, 23, 24
Bin 3: 31, 33, 38
In the above method the data for price are first sorted and then partitioned into
equal- frequency bins of size 3.
Smoothing by bin means:
Bin 1: 8,8,8
Bin 2: 22,22,22
Bin 3: 34,34,34
In the smoothing by bin means method, each value in a bin is replaced by the mean value
of the bin. For example, the mean of the values 3, 7 and 14 in Bin 1 is 8 [(3+7+14)/3].
Smoothing by bin boundaries:
Bin 1: 3,3,14
Bin 2: 19,24,24
Bin 3: 31,31,38
In smoothing by bin boundaries, the minimum and maximum values in a given bin are
identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
In general, the larger the width, the greater the effect of the smoothing. Alternatively,
bins may be equal-width, where the interval range of values in each bin is constant.
Example 2: Remove the noise in the following data using smoothing techniques:
8, 4, 9, 21, 25, 24, 29, 26, 28, 15, 21, 34
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into equal-frequency (equi-depth) bins:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
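Example 2 can be reproduced with a short Python sketch (a minimal illustration with our own helper names, not a library implementation; the bin means are rounded, as in the worked answers above):

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    s = sorted(values)
    size = len(s) // n_bins
    return [s[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the nearer of the bin's min and max."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

data = [8, 4, 9, 21, 25, 24, 29, 26, 28, 15, 21, 34]  # prices from Example 2
bins = equal_frequency_bins(data, 3)
```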
Regression: Data can be smoothed by fitting the data to a function, such as with
regression. Linear regression involves finding the "best" line to fit two attributes
(or variables), so that one attribute can be used to predict the other. Multiple linear
regression is an extension of linear regression, where more than two attributes are
involved and the data are fit to a multidimensional surface.
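A minimal least-squares fit of such a "best" line can be written with the standard library alone (the sample points are hypothetical and lie exactly on y = 2x + 1, so the fit recovers those coefficients):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b over paired observations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Hypothetical points lying exactly on y = 2x + 1.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```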
Clustering: Outliers may be detected by clustering, where similar values are
organized into groups, or "clusters." Intuitively, values that fall outside of the
set of clusters may be considered outliers.
3. Data Integration
Data mining often requires data integration - the merging of data from multiple
data stores into a coherent data store, as in data warehousing. These sources may
include multiple databases, data cubes, or flat files.
Issues in Data Integration
a) Schema integration & object matching.
b) Redundancy.
c) Detection & Resolution of data value conflict
a) Schema Integration & Object Matching
Schema integration and object matching can be tricky because the same entity
can be represented in different forms in different tables. This is referred to as the
entity identification problem. Metadata can be used to help avoid errors in schema
integration. The metadata may also be used to help transform the data.
b) Redundancy:
Redundancy is another important issue. An attribute (such as annual revenue,
for instance) may be redundant if it can be "derived" from another attribute or set
of attributes. Inconsistencies in attribute or dimension naming can also cause
redundancies in the resulting data set. Some redundancies can be detected by
correlation analysis and covariance analysis.
For nominal data, we use the χ² (chi-square) test.
For numeric attributes, we can use the correlation coefficient and covariance.
χ² correlation analysis for nominal data:
For nominal data, a correlation relationship between two attributes, A and
B, can be discovered by a χ² (chi-square) test. Suppose A has c distinct values,
namely a1, a2, a3, ..., ac, and B has r distinct values, namely b1, b2, b3, ..., br.
The data tuples can be summarized in a contingency table.
The χ² value is computed as

χ² = Σ(i=1..c) Σ(j=1..r) (o_ij − e_ij)² / e_ij

where o_ij is the observed frequency of the joint event (Ai, Bj) and e_ij is the
expected frequency of (Ai, Bj), which can be computed as

e_ij = (count(A = a_i) × count(B = b_j)) / n
For example:

              Male    Female   Total
Fiction        250       200     450
Non_fiction     50      1000    1050
Total          300      1200    1500
The covariance of two numeric attributes A and B is

Cov(A, B) = (1/n) Σ(i=1..n) (a_i − Ā)(b_i − B̄)

where Ā and B̄ are the respective mean values of A and B.
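The χ² test on the fiction/gender table above can be sketched in a short Python fragment (a minimal illustration; the variable names are our own). The result, about 507.94, far exceeds the critical χ² value of 10.828 for 1 degree of freedom at the 0.001 significance level, suggesting that gender and reading preference are strongly correlated:

```python
# Observed counts from the table above: rows = fiction/non_fiction, cols = male/female.
observed = [[250, 200], [50, 1000]]

n = sum(sum(row) for row in observed)              # 1500 tuples in total
row_totals = [sum(row) for row in observed]        # 450, 1050
col_totals = [sum(col) for col in zip(*observed)]  # 300, 1200

# e_ij = count(A = a_i) * count(B = b_j) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))
# degrees of freedom = (r - 1) * (c - 1) = 1 for this 2x2 table
```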
4. Data Reduction:
Data reduction obtains a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results.
Why data reduction? A database or data warehouse may store terabytes of data, and
complex data analysis may take a very long time to run on the complete data set.
Data reduction strategies:
4.1. Data cube aggregation
4.2. Attribute subset selection
4.3. Numerosity reduction (e.g., fit data into models)
4.4. Dimensionality reduction (data compression)
Numerosity Reduction:
Numerosity reduction is used to reduce the data volume by choosing
alternative, smaller forms of the data representation.
Techniques for numerosity reduction:
Parametric - In this model, only the data parameters need to be
stored instead of the actual data, e.g., log-linear models,
regression.
Parametric model
1. Regression
Linear regression
In linear regression, the data are modeled to fit a straight line.
Nonparametric Model
1. Histograms
A histogram for an attribute A partitions the data distribution
of A into disjoint subsets, or buckets. If each bucket represents only
a single attribute-value/frequency pair, the buckets are called
singleton buckets.
Ex: The following data are a list of prices of commonly sold items at
AllElectronics. The numbers have been sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
2. Clustering
Clustering techniques consider data tuples as objects. They
partition the objects into groups, or clusters, so that objects within a
cluster are similar to one another and dissimilar to objects in other
clusters. Similarity is defined in terms of how close the objects are
in space, based on a distance function. The quality of a cluster may
be represented by its diameter, the maximum distance between any
two objects in the cluster. Centroid distance is an alternative
measure of cluster quality, defined as the average distance of
each cluster object from the cluster centroid.
3. Sampling:
Sampling can be used as a data reduction technique because
it allows a large data set to be represented by a much smaller
random sample (or subset) of the data. Suppose that a large data set
D contains N tuples. One possible sample is a Simple Random
Sample Without Replacement (SRSWOR) of size n: this is created
by drawing n of the N tuples from D (n < N), where the
probability of drawing any tuple in D is 1/N, i.e., all tuples are
equally likely to be sampled.
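The sampling scheme above can be sketched with Python's random module (a toy data set of N = 100 tuples; the helper names and sizes are our own). random.sample draws without replacement, matching SRSWOR; a with-replacement variant is included for contrast:

```python
import random

def srs_wor(data, n):
    """Simple random sample without replacement: n distinct tuples from data."""
    return random.sample(data, n)

def srs_wr(data, n):
    """Simple random sample with replacement: a tuple may be drawn repeatedly."""
    return [random.choice(data) for _ in range(n)]

random.seed(0)            # fixed seed so the sketch is repeatable
D = list(range(1, 101))   # a toy data set of N = 100 tuples
sample = srs_wor(D, 8)
```

Note that a with-replacement sample may be larger than the data set itself, while a without-replacement sample cannot.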
Dimensionality Reduction:
In dimensionality reduction, data encoding or transformations are
applied so as to obtain a reduced or "compressed" representation of the original
data.
Dimension Reduction Types
Lossless - If the original data can be reconstructed from the
compressed data without any loss of information, the data reduction is called lossless.
Lossy - If the original data can be reconstructed from the compressed
data only with loss of information, the data reduction is called lossy.
Effective methods for lossy dimensionality reduction:
a) Wavelet transforms
b) Principal components analysis.
a) Wavelet transforms:
The discrete wavelet transform (DWT) is a linear signal
processing technique that, when applied to a data vector, transforms
it to a numerically different vector of wavelet coefficients. The
two vectors are of the same length. When applying this technique
to data reduction, we consider each tuple as an n-dimensional data
vector, that is, X = (x1, x2, ..., xn), depicting n
measurements made on the tuple from n database attributes.
For example, all wavelet coefficients larger than some user-
specified threshold can be retained; all other coefficients are set to
0. The resulting data representation is therefore very sparse, so
operations that can take advantage of data sparsity are
computationally very fast if performed in wavelet space.
In the above figure, Y1 and Y2 are the new axes (principal components) for
the given set of data originally mapped to the axes X1 and X2. This
information helps identify groups or patterns within the data. The
axes are sorted such that the first axis shows the most variance
among the data, the second axis shows the next highest variance,
and so on.
The size of the data can be reduced by eliminating the weaker components.
Advantages of PCA
PCA is computationally inexpensive.
Multidimensional data of more than two dimensions can be
handled by reducing the problem to two dimensions.
Principal components may be used as inputs to multiple regression and cluster
analysis.
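The PCA procedure described above can be sketched with NumPy (assuming NumPy is available; the two-dimensional synthetic data set is our own, stretched along one axis so that the first principal component clearly carries the most variance):

```python
import numpy as np

# Toy 2-D data stretched mainly along one direction (assumed, for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

order = np.argsort(eigvals)[::-1]       # sort components by decreasing variance
components = eigvecs[:, order]
Y = Xc @ components                     # Y's first column carries the most variance
```

Dropping the later (weaker) columns of Y is exactly the size reduction step mentioned above.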
3. Aggregation, where summary or aggregation operations are applied to
the data. For example, the daily sales data may be aggregated so as to
compute monthly and annual total amounts. This step is typically used in
constructing a data cube for data analysis at multiple abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a
smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age)
are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels
(e.g., youth, adult, senior). The labels, in turn, can be recursively organized
into higher-level concepts, resulting in a concept hierarchy for the numeric
attribute. Figure 3.12 shows a concept hierarchy for the attribute price.
More than one concept hierarchy can be defined for the same attribute to
accommodate the needs of various users.
6. Concept hierarchy generation for nominal data, where attributes such
as street can be generalized to higher-level concepts, like city or country.
Many hierarchies for nominal attributes are implicit within the database
schema and can be automatically defined at the schema definition level.
b) Z-Score Normalization
The values for an attribute, A, are normalized based on the mean (i.e.,
average) and standard deviation of A. A value, v, of A is normalized to
v' by computing
v' = (v − Ā) / σA
where Ā and σA are the mean and standard deviation, respectively, of
attribute A. This method is useful when the actual minimum and maximum
of A are unknown, or when there are outliers that dominate min-max
normalization.
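A minimal sketch of z-score normalization in Python (the income values are invented for the example):

```python
from math import sqrt

def z_score_normalize(values):
    # v' = (v - mean) / std_dev: the normalized attribute has mean 0
    # and standard deviation 1, regardless of the original scale.
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

incomes = [30000, 36000, 44000, 54000, 66000]
normalized = z_score_normalize(incomes)
```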
Data Discretization
a) Discretization by binning:
Binning is a top-down splitting technique based on a specified
number of bins. For example, attribute values can be
discretized by applying equal-width or equal-frequency
binning, and then replacing each bin value by
the bin mean or median, as in smoothing by bin means or
smoothing by bin medians, respectively. These techniques
can be applied recursively to the resulting partitions to
generate concept hierarchies.
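A small sketch of equal-width binning with smoothing by bin means (the price values are invented for the example; equal-frequency binning would partition by count instead of by range):

```python
def smooth_by_bin_means(values, num_bins):
    # Equal-width binning: split the value range into num_bins intervals
    # of equal size, then replace every value by the mean of its bin.
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    bins = [[] for _ in range(num_bins)]
    for v in values:
        idx = min(int((v - lo) / width), num_bins - 1)  # clamp the max value
        bins[idx].append(v)
    means = [sum(b) / len(b) if b else None for b in bins]
    return [means[min(int((v - lo) / width), num_bins - 1)] for v in values]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, num_bins=3)
```

Replacing each bin value by the bin median instead of the mean gives smoothing by bin medians.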
UNIT - III
CLUSTERING
A cluster is a group of objects that belong to the same class. In other words, similar
objects are grouped in one cluster and dissimilar objects are grouped in another
cluster.
What is Clustering?
Points to Remember
In the field of biology, it can be used to derive plant and animal taxonomies,
categorize genes with similar functionalities and gain insight into structures
inherent to populations.
Clustering also helps in identification of areas of similar land use in an earth
observation database. It also helps in the identification of groups of houses
in a city according to house type, value, and geographic location.
Clustering also helps in classifying documents on the web for information
discovery.
Clustering is also used in outlier detection applications such as detection of
credit card fraud.
As a data mining function, cluster analysis serves as a tool to gain insight
into the distribution of data to observe characteristics of each cluster.
The following points throw light on why clustering is required in data mining −
Clustering Methods
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of n objects, and the partitioning method
constructs k partitions of the data, where k ≤ n. Each partition represents a
cluster. That is, it classifies the data into k groups, which together satisfy the
following requirements: each group must contain at least one object, and each
object must belong to exactly one group.
Points to remember
For a given number of partitions (say k), the partitioning method will create
an initial partitioning.
Then it uses the iterative relocation technique to improve the partitioning by
moving objects from one group to another.
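The iterative relocation idea can be sketched with a minimal k-means in Python (the data points are invented, and picking the first k points as initial centroids is a simplifying assumption; real implementations choose them more carefully):

```python
from math import dist  # Euclidean distance, Python 3.8+

def k_means(points, k, iterations=20):
    # Start from an initial partitioning, then repeatedly relocate:
    # reassign each object to its nearest centroid and recompute centroids.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # move the centroid to the mean of its members
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return clusters, centroids

data = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
clusters, centroids = k_means(data, k=2)
```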
Hierarchical Methods
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with
each object forming a separate group. It keeps on merging the objects or groups
that are close to one another. It keeps doing so until all of the groups are merged
into one or until the termination condition holds.
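As a rough sketch (not from the notes), bottom-up merging with a single-link distance can be written as follows; the points and the target cluster count are invented for the example:

```python
from math import dist  # Euclidean distance, Python 3.8+

def agglomerative(points, target_clusters=1):
    # Bottom-up: each object starts as its own group; repeatedly merge
    # the two closest groups (single-link: distance between their closest
    # members) until all groups are merged into one, or until the
    # termination condition (a target cluster count) holds.
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # a merge is never undone
    return clusters

groups = agglomerative([(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)],
                       target_clusters=3)
```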
Divisive Approach
This approach is also known as the top-down approach. In this, we start with
all of the objects in the same cluster. In each iteration, a cluster is split up
into smaller clusters. This is done until each object is in a cluster of its own or
the termination condition holds. This method is rigid, i.e., once a merging or
splitting is done, it can never be undone.
Here are the two approaches that are used to improve the quality of hierarchical
clustering −
Density-based Method
This method is based on the notion of density. The basic idea is to continue
growing the given cluster as long as the density in the neighborhood exceeds some
threshold, i.e., for each data point within a given cluster, the radius of a given
cluster has to contain at least a minimum number of points.
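The density idea can be sketched with a simplified, DBSCAN-like procedure in Python (the points, radius eps, and min_pts threshold are invented for the example; a full DBSCAN also distinguishes border points and noise):

```python
from math import dist  # Euclidean distance, Python 3.8+

def density_clusters(points, eps, min_pts):
    # Grow each cluster as long as the density in the neighborhood exceeds
    # the threshold: a point is "dense" if its radius-eps neighborhood
    # contains at least min_pts points (itself included).
    def neighbors(p):
        return [q for q in points if dist(p, q) <= eps]

    visited, clusters = set(), []
    for p in points:
        if p in visited or len(neighbors(p)) < min_pts:
            continue  # not dense enough to seed a cluster
        cluster, frontier = [], [p]
        while frontier:
            q = frontier.pop()
            if q in visited:
                continue
            visited.add(q)
            cluster.append(q)
            if len(neighbors(q)) >= min_pts:  # keep growing while dense
                frontier.extend(neighbors(q))
        clusters.append(cluster)
    return clusters

data = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (20, 20)]
found = density_clusters(data, eps=1.5, min_pts=3)
```

The isolated point (20, 20) never seeds or joins a cluster, which is why density-based methods are also useful for outlier detection.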
Grid-based Method
In this, the objects together form a grid. The object space is quantized into
finite number of cells that form a grid structure.
Advantages
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of
the data for a given model. This method locates the clusters by clustering the
density function. It reflects the spatial distribution of the data points.
Constraint-based Method
The financial data in the banking and financial industry is generally reliable and
of high quality, which facilitates systematic data analysis and data mining. Some
of the typical cases are as follows −
Retail Industry
Data mining has great application in the retail industry because it collects
large amounts of data on sales, customer purchasing history, goods
transportation, consumption, and services. It is natural that the quantity of data
collected will continue to expand rapidly because of the increasing ease,
availability, and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns
and trends, which leads to improved quality of customer service and good
customer retention and satisfaction. Here is a list of examples of data mining in
the retail industry −
Customer Retention.
Product recommendation and cross-referencing of items.
Telecommunication Industry
Data mining is a very important part of bioinformatics. The following are the
aspects in which data mining contributes to biological data analysis −
Intrusion Detection
There are many data mining system products and domain-specific data
mining applications. New data mining systems and applications are continually
being added to the existing ones. Also, efforts are being made to standardize
data mining languages.
Data Types − The data mining system may handle formatted text, record-
based data, and relational data. The data could also be in ASCII text,
relational database data or data warehouse data. Therefore, we should check what
exact format the data mining system can handle.
System Issues − We must consider the compatibility of a data mining
system with different operating systems. One data mining system may run
on only one operating system or on several. There are also data mining
systems that provide web-based user interfaces and allow XML data as
input.
Data Sources − Data sources refer to the data formats on which the data
mining system will operate. Some data mining systems may work only on
ASCII text files, while others work on multiple relational sources. The data
mining system should also support ODBC connections or OLE DB for
ODBC connections.
Data Mining functions and methodologies − There are some data mining
systems that provide only one data mining function, such as classification,
while others provide multiple data mining functions, such as concept
description, discovery-driven OLAP analysis, association mining, linkage
analysis, statistical analysis, classification, prediction, clustering, outlier
analysis, similarity search, etc.
Coupling data mining with databases or data warehouse systems − Data
mining systems need to be coupled with a database or a data warehouse
system. The coupled components are integrated into a uniform information
processing environment. Here are the types of coupling listed below −
o No coupling
o Loose Coupling
o Semi tight Coupling
o Tight Coupling
Data mining concepts are still evolving and here are the latest trends that we get to
see in this field −
Application Exploration.
Scalable and interactive data mining methods.
Integration of data mining with database systems, data warehouse systems
and web database systems.
Standardization of data mining query language.
Visual data mining.
UNIT - IV
DATA WAREHOUSING
INTRODUCTION
A data warehouse is kept separate from operational databases due to the following
reasons:
An operational database is constructed for well-known tasks and workloads
such as searching particular records, indexing, etc. In contrast, data
warehouse queries are often complex and they present a general form of
data.
Operational databases support concurrent processing of multiple
transactions. Concurrency control and recovery mechanisms are required for
operational databases to ensure robustness and consistency of the database.
An operational database query allows read and modify operations, while
an OLAP query needs only read-only access to stored data.
An operational database maintains current data. On the other hand, a data
warehouse maintains historical data.
Note: A data warehouse does not require transaction processing, recovery, and
concurrency controls, because it is physically stored separately from the
operational database.
Financial services
Banking services
Consumer goods
Retail sectors
Controlled manufacturing
Query-driven Approach
Update-driven Approach
Query-Driven Approach
Disadvantages
Update-Driven Approach
Advantages
The following are the functions of data warehouse tools and utilities:
Note: Data cleaning and data transformation are important steps in improving the
quality of data and data mining results.
Star Schema
There is a fact table at the center. It contains the keys to each of the four
dimensions.
The fact table also contains the attributes, namely dollars sold and units sold.
Note: Each dimension has only one dimension table, and each table holds a set of
attributes. For example, the location dimension table contains the attribute set
{location_key, street, city, province_or_state, country}. This constraint may cause
data redundancy. For example, "Vancouver" and "Victoria" are both cities in
the Canadian province of British Columbia. The entries for such cities may cause
data redundancy along the attributes province_or_state and country.
Snowflake Schema
Now the item dimension table contains the attributes item_key, item_name,
type, brand, and supplier_key.
The supplier_key is linked to the supplier dimension table. The supplier
dimension table contains the attributes supplier_key and supplier_type.
Schema Definition
The star schema that we have discussed can be defined using Data Mining
Query Language (DMQL) as follows:
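The DMQL listing itself did not survive the page break in these notes. The following is a sketch reconstructed from the star schema described above (a central fact table with dollars_sold and units_sold, and the four dimensions time, item, branch, and location); attribute lists not stated in these notes are assumptions following the textbook's sales example:

```
define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
```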
Bottom Tier - The bottom tier of the architecture is the data warehouse
database server. It is the relational database system. We use back-end
tools and utilities to feed data into the bottom tier. These back-end tools and
utilities perform the extract, clean, load, and refresh functions.
Middle Tier - In the middle tier, we have the OLAP Server that can be
implemented in either of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational
database management system. The ROLAP maps the operations on
multidimensional data to standard relational operations.
o By Multidimensional OLAP (MOLAP) model, which directly
implements the multidimensional data and operations.
Top-Tier - This tier is the front-end client layer. This layer holds the query
tools and reporting tools, analysis tools and data mining tools.
From the perspective of data warehouse architecture, we have the following data
warehouse models:
Virtual Warehouse
Data mart
Enterprise Warehouse
Virtual Warehouse
Data Mart
In other words, we can claim that data marts contain data specific to a particular
group. For example, the marketing data mart may contain data related to items,
customers, and sales. Data marts are confined to subjects.
Enterprise Warehouse
Data warehouses contain huge volumes of data. OLAP servers demand that
decision support queries be answered in the order of seconds. Therefore, it is
crucial for data warehouse systems to support highly efficient cube computation
techniques, access methods, and query processing techniques. In this section, we
present an overview of methods for the efficient implementation of data
warehouse systems.
Suppose you would like to create a data cube for AllElectronics sales that
contains the following: city, item, year, and sales in dollars. You would like to be
able to analyze the data, with queries such as the following:
“Compute the sum of sales, grouping by city and item.”
“Compute the sum of sales, grouping by city.”
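These two queries can be sketched in Python over a toy fact table; the sales records below are invented, AllElectronics-style data:

```python
from collections import defaultdict

# Hypothetical sales records: (city, item, year, sales_in_dollars).
sales = [
    ("Vancouver", "TV", 2004, 1000),
    ("Vancouver", "TV", 2005, 1200),
    ("Vancouver", "phone", 2004, 600),
    ("Victoria", "TV", 2004, 800),
    ("Victoria", "phone", 2005, 400),
]

# "Compute the sum of sales, grouping by city and item."
by_city_item = defaultdict(int)
for city, item, year, dollars in sales:
    by_city_item[(city, item)] += dollars

# "Compute the sum of sales, grouping by city."
# (This is a coarser group-by: an aggregation of the one above.)
by_city = defaultdict(int)
for city, item, year, dollars in sales:
    by_city[city] += dollars
```

A cube computation materializes such group-bys (cuboids) ahead of time so OLAP queries can be answered in seconds.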
UNIT - V
OLAP- Need - Categorization of OLAP Operations
Relational OLAP
ROLAP servers are placed between the relational back-end server and the client
front-end tools. To store and manage warehouse data, ROLAP uses a relational
or extended-relational DBMS.
Multidimensional OLAP
Specialized SQL servers provide advanced query language and query processing
support for SQL queries over star and snowflake schemas in a read-only
environment.
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss
OLAP operations in multidimensional data.
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up
Drill-down
Drill-down is performed by stepping down a concept hierarchy for the dimension time.
Initially the concept hierarchy was "day < month < quarter < year."
On drilling down, the time dimension is descended from the level of quarter to the
level of month.
When drill-down is performed, one or more dimensions from the data cube are added.
It navigates the data from less detailed data to highly detailed data.
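Climbing and descending the time hierarchy can be sketched in Python; the monthly figures and the month-to-quarter mapping are invented for the example:

```python
from collections import defaultdict

# Hypothetical monthly sales; the concept hierarchy is month < quarter < year.
monthly_sales = {"Jan": 100, "Feb": 120, "Mar": 90,
                 "Apr": 110, "May": 95, "Jun": 130}
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
                    "Apr": "Q2", "May": "Q2", "Jun": "Q2"}

# Roll-up: climb the hierarchy from month to quarter by aggregating.
quarterly_sales = defaultdict(int)
for month, amount in monthly_sales.items():
    quarterly_sales[month_to_quarter[month]] += amount

# Drill-down is the reverse navigation: descend from quarter back to the
# more detailed month level (the detailed data must already be stored).
q1_detail = {m: v for m, v in monthly_sales.items()
             if month_to_quarter[m] == "Q1"}
```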
Slice
The slice operation selects one particular dimension from a given cube and
provides a new sub-cube. Consider the following diagram that shows how slice
works.
Here Slice is performed for the dimension "time" using the criterion time = "Q1".
It will form a new sub-cube by selecting one or more dimensions.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider
the following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three
dimensions.
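Slice and dice can both be sketched as selections over a toy cube in Python; the cities, items, and values below are invented for the example:

```python
# A toy sales cube as a dict keyed by (time, location, item) -> units sold.
cube = {
    ("Q1", "Toronto", "mobile"): 440, ("Q1", "Vancouver", "mobile"): 605,
    ("Q1", "Toronto", "modem"): 100,  ("Q2", "Toronto", "mobile"): 380,
    ("Q2", "Vancouver", "modem"): 95, ("Q3", "Toronto", "mobile"): 310,
}

# Slice: fix ONE dimension (time = "Q1") to obtain a 2-D sub-cube.
slice_q1 = {(loc, item): v for (t, loc, item), v in cube.items() if t == "Q1"}

# Dice: select on TWO OR MORE dimensions, e.g.
# (time in {Q1, Q2}) and (location = "Toronto") and (item = "mobile").
dice = {k: v for k, v in cube.items()
        if k[0] in ("Q1", "Q2") and k[1] == "Toronto" and k[2] == "mobile"}
```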
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in
order to provide an alternative presentation of the data. Consider the following
diagram that shows the pivot operation.
OLAP
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional
views of data. With multidimensional data stores, the storage utilization may be
low if the data set is sparse. Therefore, many MOLAP servers use two levels of
data storage representation to handle dense and sparse data sets.
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher
scalability of ROLAP and the faster computation of MOLAP. HOLAP servers
allow storing large volumes of detailed data. The aggregations are stored
separately in a MOLAP store.
OLAP vs. OLTP
S. No Data Warehouse (OLAP) Operational Database (OLTP)