DWM Unit II
Data Mining: Introduction
Data mining is one of the most useful techniques that help entrepreneurs,
researchers, and individuals to extract valuable information from huge sets of
data. Data mining is also called Knowledge Discovery in Database (KDD).
The knowledge discovery process includes Data cleaning, Data integration,
Data selection, Data transformation, Data mining, Pattern evaluation, and
Knowledge presentation.
Below we describe 5 factors we consider critical for
the success of Data mining:
Clear business goals the company aims to achieve using Data mining
Relevancy of the data sources to avoid duplicates and unimportant results
Completeness of the data to ensure all the essential information is covered
Applicability of the Data analysis results to meet the goals specified
Customer engagement and bottom line growth as the indicators of data
mining success
Data Mining Applications
Data mining is primarily used by organizations with intense consumer demands, such as retail, communications, financial, and marketing companies, to determine prices, consumer preferences, product positioning, and the impact on sales, customer satisfaction, and corporate profits. Data mining enables a retailer to use point-of-sale records of customer purchases to develop products and promotions that help the organization attract customers.
KDD Process in Data Mining
Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
Cleaning in case of missing values.
Cleaning noisy data, where noise is a random or variance error.
Cleaning with data discrepancy detection and data transformation tools.
Data Integration: Data integration is defined as combining heterogeneous data from multiple sources into a common source (data warehouse).
Data integration using data migration tools.
Data integration using data synchronization tools.
Data integration using the ETL (Extract-Transform-Load) process.
Data Selection: Data selection is defined as the process where data relevant to
the analysis is decided and retrieved from the data collection.
Data Transformation: Data transformation is defined as the process of transforming the data into the appropriate form required by the mining procedure. Data transformation is a two-step process:
Data Mapping: Assigning elements from source base to destination to capture
transformations.
Code generation: Creation of the actual transformation program.
Data Mining: Data mining is defined as the application of intelligent techniques to extract potentially useful patterns.
Transforms task-relevant data into patterns.
Decides the purpose of the model, e.g. classification or characterization.
Pattern Evaluation: Pattern evaluation is defined as identifying interesting patterns representing knowledge, based on given interestingness measures.
Finds the interestingness score of each pattern.
Uses summarization and visualization to make the data understandable by the user.
Knowledge representation: Knowledge representation is defined as a technique which utilizes visualization tools to represent data mining results.
Generate reports.
Generate tables.
Generate discriminant rules, classification rules, characterization rules, etc.
Type of Data that can be mined
Different kinds of data can be mined. Some examples are mentioned below.
Spatial Databases
Flat Files
Relational Databases
Transactional Databases
Multimedia Databases
Data Warehouse
World Wide Web(WWW)
Time Series Databases
Spatial Database
A spatial database is a suitable way to store geographical information.
A spatial database stores the data in the form of coordinates, lines, different shapes, etc.
Maps, global positioning, etc. are famous applications of spatial databases.
Flat Files
Flat files are in binary or text form and have a structure that can be easily extracted by data mining algorithms.
Relational Databases
A relational database is an organized collection of related data. This organization is in the form of tables with rows and columns. Different kinds of schemas are used in relational databases; the physical schema and the logical schema are the best known.
In the physical schema, we define the structure of the tables.
In the logical schema, we define the different kinds of relationships among the tables.
The standard API of relational databases is Structured Query Language (SQL).
Transactional Databases
A transactional database is an organized collection of data that is organized by timestamps, for example by date, to represent the transactions in the database. A transactional database must have the capability to roll back any transaction. It is most commonly used in ATMs.
Object databases, ATMs, banking, and distributed systems are famous applications of transactional databases.
Multimedia Databases
Multimedia databases are databases that can store the following:
Video
Images
Audio
Text, etc.
Multimedia databases can be stored in object-oriented databases.
E-book databases, video website databases, news website databases, etc. are famous applications of multimedia databases.
Data Warehouse
A data warehouse is the collection of data that is collected and integrated from one or more
sources. Later this data can be mined for business decision making.
Three famous types of a data warehouse are mentioned below;
Virtual Warehouse
Data Mart
Enterprise data warehouse
Business decision making and Data mining are very useful applications of the data warehouse.
WWW
WWW stands for World Wide Web. The WWW is a collection of documents and resources and can contain different kinds of data, such as video, audio, and text.
Each resource can be identified by a Uniform Resource Locator (URL) and accessed through a web browser.
Online tools, online video, images, and text searching sites are the famous
applications of WWW.
Time-series Databases
Time-series databases are databases that can store data such as stock exchange data. Graphite and eXtremeDB are famous examples of time-series databases.
Major issues in Data Mining
Data mining is not an easy task, as the algorithms used can get very complex and data is not always
available at one place. It needs to be integrated from various heterogeneous data sources. These
factors also create some issues. Here we will discuss the major issues regarding −
• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues
Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −
• Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to
be interactive because it allows users to focus the search for patterns, providing and refining data mining
requests based on the returned results.
• Incorporation of background knowledge − To guide discovery process and to express the discovered
patterns, the background knowledge can be used. Background knowledge may be used to express the
discovered patterns not only in concise terms but at multiple levels of abstraction.
• Data mining query languages and ad hoc data mining − Data Mining Query language that allows the
user to describe ad hoc mining tasks, should be integrated with a data warehouse query language and
optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
• Handling noisy or incomplete data − The data cleaning methods are required to handle the noise and
incomplete objects while mining the data regularities. If the data cleaning methods are not there then the
accuracy of the discovered patterns will be poor.
• Pattern evaluation − The patterns discovered should be interesting; patterns that merely represent common knowledge or lack novelty are a problem.
Performance Issues
There can be performance-related issues such as follows −
• Efficiency and scalability of data mining algorithms − In order to effectively extract the
information from huge amount of data in databases, data mining algorithm must be efficient
and scalable.
• Parallel, distributed, and incremental mining algorithms − The factors such as huge size
of databases, wide distribution of data, and complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions which are further processed in parallel, and the results from the partitions are then merged. Incremental algorithms update the mined knowledge without mining the whole data again from scratch.
Data Quality: Why do we preprocess the data?
Many characteristics act as deciding factors for data quality, such as incompleteness and inconsistency, which are common properties of large real-world databases. Factors used for data quality assessment are:
• Accuracy: There are many possible reasons for flawed or inaccurate data, e.g. incorrect attribute values caused by human or computer errors.
• Completeness: Incomplete data can occur for various reasons; attributes of interest, such as customer information for sales and transaction data, may not always be available.
• Consistency: Incorrect data can also result from inconsistencies in naming conventions or data codes, or from inconsistent input field formats. Duplicate tuples also need to be cleaned.
• Timeliness: It also affects the quality of the data. At the end of the month, several sales representatives fail to file their sales records on time, and several corrections and adjustments flow in after the end of the month, so the data stored in the database is incomplete for a time after each month.
• Believability: It reflects how much users trust the data.
• Interpretability: It reflects how easily users can understand the data.
Example of an attribute
In this example, RollNo, Name, and Result are attributes of the object named student.
RollNo  Name   Result
1       Ali    Pass
2       Akram  Fail
Types Of attributes
Binary
Nominal
Ordinal Attributes
Numeric
Interval-scaled
Ratio-scaled
Nominal Attributes
Nominal data is in categorical (alphabetical) form rather than integer form. Nominal attributes are qualitative attributes.
Examples of Nominal attributes
In this example, States and Colors are the attributes, and New, Pending, Working, Complete, Finish and Black, Brown, White, Red are their values.
Attribute         Value
Categorical data  Lecturer, Assistant Professor, Professor
States            New, Pending, Working, Complete, Finish
Colors            Black, Brown, White, Red
Binary Attributes
Binary data have only two values/states. For example, here HIV detected can
be only Yes or No.
Binary Attributes are Qualitative Attributes.
Examples of Binary Attributes
Attribute     Value
HIV detected  Yes, No
Result        Pass, Fail
Binary attributes are of two types:
Symmetric binary
Asymmetric binary
Examples of Symmetric data
Both values are equally important. For example, if we have open admission to our university, then it does not matter whether you are a male or a female.
Example:
Attribute  Value
Gender     Male, Female
Discrete Attributes
Discrete data have a finite number of values. They can be in numerical form and can also be in categorical form. Discrete attributes are quantitative attributes.
Examples of Discrete Data
Attribute    Value
Profession   Teacher, Businessman, Peon, etc.
Postal Code  42200, 42300, etc.
Continuous Attributes
Continuous data technically have an infinite number of possible values. Continuous data are of float type; there can be many numbers between 1 and 2. These attributes are quantitative attributes.
Example of Continuous Attributes
Attribute  Value
Height     5.4…, 6.5…, etc.
Weight     50.09…, etc.
Basic Statistical Descriptions of Data
Mean, Median, and Mode in data mining

What is the mean?
The mean is the average of the numbers.
Mean = sum of all values / total number of values
Example: 3, 5, 6, 9, 8
Mean = (3 + 5 + 6 + 9 + 8) / 5 = 6.2

What is the median?
The median is the middle value among all values.
How to calculate the median for an odd number of values?
Example: 3, 6, 6, 8, 9
The middle value is 6, so Median = 6.
How to calculate the median for an even number of values?
Example: 9, 8, 5, 6, 3, 4
Arrange the values in order: 3, 4, 5, 6, 8, 9
Add the two middle values and calculate their mean: Median = (5 + 6) / 2 = 5.5

What is the mode?
The mode is the most frequently occurring value.
How to calculate the mode?
Example: 3, 6, 6, 8, 9
The value 6 occurs most often, so Mode = 6.
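These calculations are easy to verify in code. A minimal Python sketch using the built-in statistics module and the example values above:

```python
import statistics

values = [3, 5, 6, 9, 8]
print(statistics.mean(values))    # 6.2

even = [9, 8, 5, 6, 3, 4]
print(statistics.median(even))    # 5.5 (mean of the two middle values, 5 and 6)

odd = [3, 6, 6, 8, 9]
print(statistics.median(odd))     # 6, the middle value
print(statistics.mode(odd))       # 6, the most frequent value
```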
How to find quartiles of even length data set?
Example:
Data = 8, 5, 2, 4, 8, 9, 5,7
Step 1:
First of all, arrange the values in order
After ordering the values:
Data = 2, 4, 5, 5, 7, 8, 8, 9
Step 2:
For dividing this data into four equal parts, we need three quartiles.
Q1: Lower quartile
Q2: Median of the data set
Q3: Upper quartile
Step 3:
Find the median of the data set and label it as Q2.
Data = 2, 4, ♦ 5, 5, ♦ 7, 8 ♦ 8, 9
Minimum: 2
Q1 = (4 + 5) / 2 = 4.5 (lower quartile)
Q2 = (5 + 7) / 2 = 6 (middle quartile, the median)
Q3 = (8 + 8) / 2 = 8 (upper quartile)
Maximum: 9
Interquartile Range (IQR) = Q3 − Q1 = 8 − 4.5 = 3.5
An outlier is usually a value that lies more than 1.5 × IQR below Q1 or above Q3.
1.5 × IQR = 1.5 × 3.5 = 5.25
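A minimal Python sketch of the same quartile and IQR calculation, assuming the simple convention used above (each quartile is the median of one half of the sorted data) rather than an interpolating quantile method:

```python
import statistics

data = sorted([8, 5, 2, 4, 8, 9, 5, 7])    # [2, 4, 5, 5, 7, 8, 8, 9]
half = len(data) // 2

q1 = statistics.median(data[:half])         # 4.5, median of the lower half
q2 = statistics.median(data)                # 6,   median of the whole data set
q3 = statistics.median(data[half:])         # 8,   median of the upper half

iqr = q3 - q1                               # 3.5
lower_fence = q1 - 1.5 * iqr                # values below this are treated as outliers
upper_fence = q3 + 1.5 * iqr                # values above this are treated as outliers
print(q1, q2, q3, iqr, lower_fence, upper_fence)
```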
Variance and standard deviation of data in data mining
What is data variance and standard deviation?
Different values in the data set can be spread away from the mean. Variance tells us how far the values are from the mean.
Standard deviation is the square root of the variance.
Low standard deviation
A low standard deviation tells us that fewer values are far away from the mean.
High standard deviation
A high standard deviation tells us that more values are far away from the mean.
How to calculate the variance and standard deviation of population data?
marks: 8, 10, 15, 20
Mean = (8 + 10 + 15 + 20) / 4 = 13.25
Variance = ((8 − 13.25)² + (10 − 13.25)² + (15 − 13.25)² + (20 − 13.25)²) / 4 ≈ 21.69
Standard deviation = √21.69 ≈ 4.66
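The same population variance and standard deviation can be computed with the statistics module (pvariance and pstdev divide by N, the population formulas used above); a minimal sketch:

```python
import statistics

marks = [8, 10, 15, 20]
mean = statistics.mean(marks)            # 13.25
variance = statistics.pvariance(marks)   # ~21.69 (population variance)
std_dev = statistics.pstdev(marks)       # ~4.66  (square root of the variance)
print(mean, variance, std_dev)
```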
Measuring data Similarity and dissimilarity
Distance or similarity measures are essential in solving many pattern recognition
problems such as classification and clustering. Various distance/similarity measures
are available in the literature to compare two data distributions. As the name suggests, a similarity measure quantifies how close two distributions are. For multivariate data, more complex summary methods have been developed to answer this question.
Similarity Measure
A numerical measure of how alike two data objects are; it often falls between 0 (no similarity) and 1 (complete similarity).
Dissimilarity Measure
A numerical measure of how different two data objects are; it ranges from 0 (objects are alike) upward, and the upper limit (objects are completely different) varies by measure.
Proximity
refers to a similarity or dissimilarity
We consider similarity and dissimilarity in many places in data mining.
Similarity measure
is a numerical measure of how alike two data objects are.
higher when objects are more alike.
often falls in the range [0,1]
Similarity might be used to identify
duplicate data that may have differences due to typos.
equivalent instances from different data sets. E.g. names and/or addresses that are the same but have misspellings.
groups of data that are very close (clusters)
Dissimilarity measure
is a numerical measure of how different two data objects are
lower when objects are more alike
minimum dissimilarity is often 0, while the upper limit varies depending on how much the objects can vary.
Dissimilarity might be used to identify
outliers
interesting exceptions, e.g. credit card fraud
boundaries to clusters
Proximity refers to either a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
Here, p and q are the attribute values for two data objects.
Distance, such as the Euclidean distance, is a dissimilarity measure and has some well-known properties:
Common Properties of Dissimilarity Measures
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q,
2. d(p, q) = d(q,p) for all p and q,
3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r, where d(p, q) is the distance (dissimilarity) between points (data
objects), p and q.
A distance that satisfies these properties is called a metric. Several common distance measures (for example the Euclidean, Manhattan, and Minkowski distances) are used to compare multivariate data; here we assume that the attributes are all continuous.
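As an illustration of a distance measure that satisfies the metric properties, a minimal sketch of the Euclidean distance between two data objects with continuous attributes (the attribute values are made-up examples):

```python
import math

def euclidean_distance(p, q):
    """Dissimilarity between two objects described by continuous attributes."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]
print(euclidean_distance(p, q))  # 5.0
print(euclidean_distance(p, p))  # 0.0 -> d(p, p) = 0, as required of a metric
```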
Common Properties of Similarity Measures
Similarities also have some well-known properties:
1. s(p, q) = 1 (or the maximum similarity) only if p = q,
2. s(p, q) = s(q, p) for all p and q (symmetry),
where s(p, q) is the similarity between data objects p and q.
Example: Calculate the simple matching coefficient and the Jaccard coefficient for the following binary data.
Given data:
p = 1000000000
q = 0000001001
The frequency table is:
        q = 1   q = 0
p = 1     0       1
p = 0     2       7
• Simple matching coefficient = (0 + 7) / (0 + 1 + 2 + 7) = 0.7
• Jaccard coefficient = 0 / (0 + 1 + 2) = 0
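A minimal Python sketch that reproduces these two coefficients from the binary vectors p and q:

```python
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

# Frequency counts of the four combinations
f11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)  # 0
f10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)  # 1
f01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)  # 2
f00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)  # 7

smc = (f11 + f00) / (f11 + f10 + f01 + f00)   # (0 + 7) / 10 = 0.7
jaccard = f11 / (f11 + f10 + f01)             # 0 / 3 = 0.0
print(smc, jaccard)
```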
DATA PREPROCESSING
Data preprocessing is the process of transforming raw data into an understandable format.
It is also an important step in data mining as we cannot work with raw data.
The quality of the data should be checked before applying machine learning or data
mining algorithms.
Inconsistent data: e.g. Age = "5 years", Birthday = "06/06/1990", Current Year = "2017".
Intentional error: Sometimes an application assigns a default value to an attribute automatically, e.g. some applications set gender = "male" by default.
How to Handle incomplete/Missing Data?
Ignore the tuple
Fill in the missing value manually
Fill in the value automatically by:
using the attribute mean
using a constant value, if any constant value is defined
using the most probable value, obtained with a Bayesian formula or a decision tree
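A minimal sketch of the "fill with the attribute mean" strategy using pandas, on a small made-up table with one missing Age value (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data with a missing Age value
df = pd.DataFrame({"Name": ["Ali", "Akram", "Sara"],
                   "Age": [20, None, 24]})

# Fill the missing value automatically with the attribute (column) mean
df["Age"] = df["Age"].fillna(df["Age"].mean())   # the missing Age becomes 22.0
print(df)
```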
How to Handle Noisy Data?
Binning
Regression
Clustering
Combined computer and human inspection.
What is Binning?
Binning is a technique in which we first sort the data and then partition it into equal-frequency bins. For example:
Bin 1: 2, 3, 6, 8
Bin 2: 14, 16, 18, 24
Bin 3: 26, 28, 30, 32
Types of binning:
There are many types of binning. Some of them are as follows;
Smooth by getting the bin means
Smooth by getting the bin median
Smooth by getting the bin boundaries, etc.
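A minimal sketch of equal-frequency binning followed by smoothing with the bin means, using the three bins shown above:

```python
data = sorted([2, 3, 6, 8, 14, 16, 18, 24, 26, 28, 30, 32])
bin_size = 4   # equal-frequency bins of 4 values each

bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]
# bins = [[2, 3, 6, 8], [14, 16, 18, 24], [26, 28, 30, 32]]

# Smoothing by bin means: every value is replaced by the mean of its bin
smoothed = [[sum(b) / len(b)] * len(b) for b in bins]
print(smoothed)   # [[4.75, 4.75, 4.75, 4.75], [18.0, ...], [29.0, ...]]
```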
For example, if we know the mobile number, then we can determine the mobile network or SIM provider, so the mobile-network dimension can be dropped. When we reduce dimensions, we remove or combine attribute dimensions in such a way that the significant characteristics of the original dataset, which is going to be prepared for data mining, are not lost.
Curse of Dimensionality
“The Curse is an offensive word or phrase used to express anger or annoyance”.
The curse of dimensionality is a condition that occurs when we want to classify, organize, and analyze high-dimensional data.
When the number of dimensions increases, the distance between two independent points increases and their similarity decreases.
This problem results in more errors in our final results after data mining. When we are working on data, especially big data, there is a very large number of data points, so a lot of dimensions are possibly there.
In this case, it is practically impossible to get the desired results, and even if we suppose it is possible, the results will be inefficient.
If we reduce the dimensions, then it can be easier and more convenient to collect the data.
Data is not collected only for data mining.
Data accumulates at a high speed.
Data preprocessing is an important task for better and more effective data mining.
Dimensionality reduction is an effective approach to collecting less, but still useful, data.
Dimensionality reduction is very helpful for projecting high-dimensional data onto 2D or 3D visualizations.
Dimensionality reduction is helpful for efficient storage and retrieval of the data and promotes the concept of data compression.
Dimensionality reduction has a positive effect on query accuracy through noise removal.
Dimensionality reduction reduces computation time: it shortens the time required for performing the same computations.
Dimensionality reduction is helpful to remove redundant features.
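As an illustration only, a minimal sketch of dimensionality reduction with PCA from scikit-learn, projecting a tiny made-up 3-dimensional dataset onto 2 dimensions (the data values are hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 5 objects described by 3 correlated attributes
X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.4],
              [1.9, 2.2, 1.1],
              [3.1, 3.0, 1.5]])

# Reduce from 3 dimensions to 2 while keeping most of the variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (5, 2)
print(pca.explained_variance_ratio_)   # fraction of the variance kept by each component
```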
Application of Dimensionality Reduction
Text mining
Image retrieval
Microarray data analysis
Protein classification
Face and image recognition
Intrusion detection
Customer relationship management
Handwritten digit recognition
Data compression: In this technique, large volumes of data are compressed, i.e. the number of bits used to store the data is reduced. This can be done using lossy or lossless compression. In lossy compression, the quality of the data is compromised for more compression. In lossless compression, the quality of the data is not compromised in exchange for a higher compression level.
Numerosity reduction : This technique reduces the volume of data by
choosing smaller forms for data representation. Numerosity reduction can be
done using histograms, clustering or sampling of data. Numerosity reduction
is necessary as processing the entire data set is expensive and time
consuming.
Data Transformation:
The change made in the format or the structure of the data is called data
transformation. This step can be simple or complex based on the
requirements. There are some methods in data transformation.
Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in identifying the important features of the dataset. By smoothing we can detect even a small change that helps in prediction.
Aggregation: In this method, the data is stored and presented in the form of a summary. Data from multiple sources is integrated into a data analysis description. This is an important step since the accuracy of the results depends on the quantity and quality of the data. When the quality and the quantity of the data are good, the results are more relevant.
Discretization: The continuous data here is split into intervals. Discretization
reduces the data size. For example, rather than specifying the class time, we
can set an interval like (3 pm-5 pm, 6 pm-8 pm).
Normalization: It is the method of scaling the data so that it can be
represented in a smaller range. Example ranging from -1.0 to 1.0.
Normalizing your data is an essential part of machine learning. You might have an amazing dataset with many great features, but if you forget to normalize, one of those features might completely dominate the others; it is as if you were throwing away almost all of your information. Normalization solves this problem. The following techniques are used to normalize data:
Min-max normalization
Z-score normalization
Decimal Scaling Normalization
Standard Deviation normalization
Min Max Normalization in data mining
Min Max is a data normalization technique like Z score, decimal scaling, and normalization
with standard deviation. It helps to normalize the data. It will scale the data between 0 and 1. This
normalization helps us to understand the data easily.
For example, if I ask you to tell me the difference between 200 and 1000, it is a little more confusing than when I ask you to tell me the difference between 0.2 and 1.
Min-Max normalization formula:
v' = ((v − Min) / (Max − Min)) × (newMax − newMin) + newMin
Example data (marks): 8, 10, 15, 20
Min: the minimum value of the given attribute; here Min = 8
Max: the maximum value of the given attribute; here Max = 20
v: the respective value of the attribute; here v1 = 8, v2 = 10, v3 = 15, v4 = 20
newMax: 1
newMin: 0
marks   marks after Min-Max normalization
8       0
10      0.16
15      0.58
20      1
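A minimal sketch of Min-Max normalization in Python applied to the same marks (newMin = 0, newMax = 1):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    v_min, v_max = min(values), max(values)
    return [(v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min
            for v in values]

marks = [8, 10, 15, 20]
print(min_max_normalize(marks))   # [0.0, 0.166..., 0.583..., 1.0]
```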
Z-Score Normalization – (Data Mining)
Z-score normalization helps in the normalization of data. If we normalize the data into a simpler form with the help of Z-score normalization, then it becomes very easy for us to understand.
Z-Score formula:
v' = (v − mean) / standard deviation
How to calculate the Z-score of the following data?
marks: 8, 10, 15, 20
Mean = 13.25
Standard deviation ≈ 4.66
marks   Z-score
8       −1.13
10      −0.70
15      0.38
20      1.45
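A minimal sketch of Z-score normalization for the same marks, using the population standard deviation as in the example above:

```python
import statistics

marks = [8, 10, 15, 20]
mean = statistics.mean(marks)    # 13.25
std = statistics.pstdev(marks)   # ~4.66 (population standard deviation)

z_scores = [(v - mean) / std for v in marks]
print([round(z, 2) for z in z_scores])   # [-1.13, -0.7, 0.38, 1.45]
```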
Decimal Scaling Normalization
Decimal scaling is a data normalization technique like Z-score, Min-Max, and normalization with standard deviation. In this technique, we move the decimal point of the values of the attribute. This movement of the decimal point depends on the maximum absolute value among all values of the attribute.
Decimal Scaling Formula
A value vi of attribute A can be normalized with the following formula:
Normalized value vi' = vi / 10^j
where j is the smallest integer such that the maximum of |vi'| is less than 1.
Example of Decimal scaling
CGPA   Formula   CGPA normalized after decimal scaling
2      2 / 10    0.2
3      3 / 10    0.3
We check the maximum value of our attribute CGPA. Here the maximum value is 3, so we can convert the values to decimals by dividing by 10. Why 10?
We count the number of digits in the maximum value and then use a 1 followed by that many zeros.
Here 3 is the maximum value and it has only one digit, so we put one zero after the one, giving 10.
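A minimal sketch of decimal scaling normalization, where j is taken as the number of digits in the maximum absolute value, so that every normalized value falls below 1:

```python
def decimal_scaling(values):
    """Divide every value by 10^j, where j is the number of digits in the maximum value."""
    max_abs = max(abs(v) for v in values)
    j = len(str(int(max_abs)))        # e.g. maximum 3 -> 1 digit -> divide by 10
    return [v / (10 ** j) for v in values]

cgpa = [2, 3]
print(decimal_scaling(cgpa))          # [0.2, 0.3]
```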
Standard Deviation normalization
Different values in the data set can be spread away from the mean. Variance tells us how far the values are from the mean.
Standard deviation is the square root of the variance.
A high standard deviation tells us that more values are far away from the mean.
A low standard deviation tells us that fewer values are far away from the mean.
Data discretization in data mining
Data discretization converts a large number of data values into a smaller number of values, so that data evaluation and data management become much easier.
Data discretization example
For example, an attribute age with many distinct values can be discretized into a smaller number of age intervals. Another example is website visitor data: individual visitors can be discretized (grouped) by country.
What are some famous techniques of data discretization?
Histogram analysis: A histogram is a plot used to present the underlying frequency distribution of a set of continuous data. The histogram helps in inspecting the data, for example for checking the shape of the distribution, outliers, and skewness.
Binning: Binning is a data smoothing technique and it helps to group a huge number of continuous values into a smaller number of bins. For example, if we have data about a group of students, we can arrange their marks into a smaller number of intervals by making bins of grades: one bin for grade A, one for grade B, one for C, one for D, and one for the F grade.
Cluster analysis: Cluster analysis is commonly known as clustering. Clustering is the task of grouping similar objects into one group, commonly called a cluster, while different objects are placed in different clusters.
Correlation analysis
Decision tree analysis
Equal-width partitioning
Equal-depth partitioning
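For the equal-width and equal-depth partitioning techniques listed above, a minimal sketch on a small made-up set of marks:

```python
marks = [5, 12, 18, 25, 33, 41, 47, 55, 62]
num_bins = 3

# Equal-width partitioning: every interval has the same width
width = (max(marks) - min(marks)) / num_bins                    # (62 - 5) / 3 = 19
boundaries = [min(marks) + i * width for i in range(num_bins + 1)]
print(boundaries)          # [5.0, 24.0, 43.0, 62.0]

# Equal-depth (equal-frequency) partitioning: every bin gets the same number of values
depth = len(marks) // num_bins
bins = [sorted(marks)[i:i + depth] for i in range(0, len(marks), depth)]
print(bins)                # [[5, 12, 18], [25, 33, 41], [47, 55, 62]]
```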
Data discretization and concept hierarchy generation
A concept hierarchy represents a sequence of mappings from a set of specialized, low-level concepts to more general, higher-level concepts, and similarly from higher-level concepts back down to low-level ones. In other words, we can have top-down mapping and bottom-up mapping.
Let’s see an example of a concept hierarchy for the dimension location.
Each city can be mapped with the country to which the given city belongs. For
example, Mianwali can be mapped to Pakistan and Pakistan can be mapped to
Asia.
Top-down mapping
Top-down mapping starts from the top with general concepts and moves to the
bottom to the specialized concepts.
Bottom-up mapping
Bottom-up mapping starts from the Bottom with specialized concepts and moves
to the top to the generalized concepts.
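A minimal sketch of a concept hierarchy for the location dimension as nested Python dictionaries, with bottom-up mapping done by a simple lookup (the list of cities and countries is only an example):

```python
# Top-down view: continent -> country -> cities
location_hierarchy = {
    "Asia": {
        "Pakistan": ["Mianwali", "Lahore"],
        "India": ["Delhi", "Mumbai"],
    }
}

# Bottom-up mapping: city -> (country, continent)
city_to_higher = {
    city: (country, continent)
    for continent, countries in location_hierarchy.items()
    for country, cities in countries.items()
    for city in cities
}

print(city_to_higher["Mianwali"])   # ('Pakistan', 'Asia')
```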
Data discretization and binarization in data mining
What is the difference between discretization and binarization in data science?
Data Discretization in data mining is the process that is used to transform the continuous
attributes.
Data Binarization in data mining is used to transform both the discrete and continuous attributes
into binary attributes.
Implementation - Data preprocessing steps