
Data Warehousing and Data Mining

III Year / V Semester


Dr. Anandkumar R, Assistant Professor, Department of IT, SMVEC
UNIT II
 UNIT II – DATA MINING
 Data Mining: Introduction – Kinds of Data and Patterns – Major
issues in data mining – Data Objects and attribute types – Statistical
description of data – Measuring data similarity and dissimilarity.
 Data preprocessing: Overview – Data cleaning – Data integration –
Data reduction – Data transformation and discretization

2
Data Mining: Introduction
 Data mining is one of the most useful techniques for helping entrepreneurs,
researchers, and individuals extract valuable information from huge sets of
data. Data mining is also called Knowledge Discovery in Databases (KDD).
 The knowledge discovery process includes Data cleaning, Data integration,
Data selection, Data transformation, Data mining, Pattern evaluation, and
Knowledge presentation.

 What is Data Mining?


 Data Mining is the process of extracting information from huge sets of data
to identify patterns and trends that allow a business to make data-driven
decisions.
3
Advantages of Data Mining
 The Data Mining technique enables organizations to obtain knowledge-based
data.
 Data mining enables organizations to make lucrative modifications in operation
and production.
 Compared with other statistical data applications, data mining is cost-efficient.
 Data Mining helps the decision-making process of an organization.
 It facilitates the automated discovery of hidden patterns as well as the
prediction of trends and behaviors.
 It can be introduced into new systems as well as existing platforms.
 It is a quick process that makes it easy for new users to analyze enormous
amounts of data in a short time.
4
Disadvantages of Data Mining
 There is a probability that organizations may sell useful customer data
to other organizations for money. According to some reports, American Express has
sold its customers' credit card purchase data to other organizations.
 Much data mining analytics software is difficult to operate and requires advanced
training to work with.
 Different data mining instruments operate in distinct ways due to the different
algorithms used in their design. Therefore, the selection of the right data
mining tools is a very challenging task.
 Data mining techniques are not 100% precise, so they may lead to severe
consequences in certain conditions.

5
Below we describe 5 factors we consider critical for
the success of Data mining:
 Clear business goals the company aims to achieve using Data mining
 Relevancy of the data sources to avoid duplicates and unimportant results
 Completeness of the data to ensure all the essential information is covered
 Applicability of the Data analysis results to meet the goals specified
 Customer engagement and bottom line growth as the indicators of data
mining success

6
Data Mining Applications
 Data Mining is primarily used by
organizations with intense consumer
demands, such as retail, communication,
financial, and marketing companies, to
determine prices, consumer
preferences, product positioning, and
the impact on sales, customer satisfaction,
and corporate profits. Data mining
enables a retailer to use point-of-sale
records of customer purchases to
develop products and promotions that
help the organization attract
customers.
7
KDD Process in Data Mining

 Why do we need Data Mining?

The volume of information we have to handle is increasing every day, coming from business
transactions, scientific data, sensor data, pictures, videos, etc. So, we need a
system capable of extracting the essence of the information available and
automatically generating reports, views, or summaries of the data for better decision-making.
 Why is Data Mining used in Business?
Data mining is used in business to make better managerial decisions by:
Automatic summarization of data
 Extracting essence of information stored.
 Discovering patterns in raw data.
 Data Mining, also known as Knowledge Discovery in Databases, refers to the
nontrivial extraction of implicit, previously unknown, and potentially useful
information from data stored in databases.
8
Steps Involved in KDD Process:

9
 Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant
data from the collection.
 Cleaning in case of missing values.
 Cleaning noisy data, where noise is a random error or variance in a measured variable.
 Cleaning with data discrepancy detection and data transformation tools.
 Data Integration: Data integration is defined as heterogeneous data from
multiple sources combined into a common source (Data Warehouse).
 Data integration using Data Migration tools.
 Data integration using Data Synchronization tools.
 Data integration using the ETL (Extract-Transform-Load) process.

10
 Data Selection: Data selection is defined as the process where data relevant to
the analysis is decided and retrieved from the data collection.
 Data Transformation: Data Transformation is defined as the process of
transforming data into appropriate form required by mining procedure. Data
Transformation is a two step process:
 Data Mapping: Assigning elements from source base to destination to capture
transformations.
 Code generation: Creation of the actual transformation program.

11
 Data Mining: Data mining is defined as the application of intelligent techniques to extract
potentially useful patterns.
 Transforms task-relevant data into patterns.
 Decides the purpose of the model, using classification or characterization.
 Pattern Evaluation: Pattern Evaluation is defined as identifying interesting
patterns representing knowledge, based on given interestingness measures.
 Find interestingness score of each pattern.
 Uses summarization and Visualization to make data understandable by user.
 Knowledge representation: Knowledge representation is defined as technique which
utilizes visualization tools to represent data mining results.
 Generate reports.
 Generate tables.
 Generate discriminant rules, classification rules, characterization rules, etc.
12
Type of Data that can be mined
 Different kinds of data can be mined. Some examples are mentioned
below.
 Spatial Databases
 Flat Files
 Relational Databases
 Transactional Databases
 Multimedia Databases
 Data Warehouse
 World Wide Web(WWW)
 Time Series Databases

13
Spatial Database
 A Spatial Database is a suitable way to store geographical information.
 Spatial Databases store the data in the form of coordinates, lines, different
shapes, etc.
 Maps, global positioning, etc. are famous applications of Spatial Databases.

Flat files
 Flat files are in binary or text form, with a structure that can be
easily extracted by data mining algorithms.

14
Relational Databases
 A Relational Database is an organized collection of related data. This organization
is in the form of tables with rows and columns. Different kinds of schemas are used in
relational databases; the physical schema and the logical schema are the most common.
 In a physical schema, we define the structure of tables.
 In a logical schema, we define the different kinds of relationships among tables.
 The standard API of the relational database is Structured Query Language (SQL).
Transactional Databases
 A transactional database is an organized collection of data that is organized by
timestamps, e.g., by date, to represent the transactions in the database.
Transactional databases must have the capability to roll back any
transaction. They are most commonly used in ATM machines.
 Object databases, ATM machines, banking, and distributed systems are very
famous applications of transactional databases.
15
Multimedia Databases
 Multimedia databases are databases that can store the following:
 Video
 Images
 Audio
 Text, etc.
 Multimedia Databases can be built on Object-Oriented Databases.
 E-book databases, video website databases, news website databases, etc. are famous applications
of Multimedia Databases.
Data Warehouse
 A data warehouse is a collection of data that is collected and integrated from one or more
sources. Later this data can be mined for business decision making.
 Three famous types of data warehouse are mentioned below:
 Virtual Warehouse
 Data Mart
 Enterprise Data Warehouse
 Business decision making and data mining are very useful applications of the data warehouse.
16
WWW
 WWW stands for World Wide Web. The WWW is a collection of documents and
resources and can contain different kinds of data like video, audio, text, etc.
Each document or resource is identified by a Uniform Resource Locator (URL) and
accessed through web browsers.
 Online tools, online video, and image and text search sites are famous
applications of the WWW.

Time-series Databases
 Time-series databases are databases that can store time-stamped data such as stock
exchange data. Graphite and eXtremeDB are famous examples of time-series
databases.
17
Major issues in Data Mining
 Data mining is not an easy task, as the algorithms used can get very complex and data is not always
available in one place. It needs to be integrated from various heterogeneous data sources. These
factors also create some issues. Here we will discuss the major issues regarding −
• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues
 The following sections describe the major issues.

18
Mining Methodology and User Interaction Issues
 It refers to the following kinds of issues −
• Mining different kinds of knowledge in databases − Different users may be interested in different kinds
of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to
be interactive because it allows users to focus the search for patterns, providing and refining data mining
requests based on the returned results.
• Incorporation of background knowledge − To guide discovery process and to express the discovered
patterns, the background knowledge can be used. Background knowledge may be used to express the
discovered patterns not only in concise terms but at multiple levels of abstraction.
• Data mining query languages and ad hoc data mining − Data Mining Query language that allows the
user to describe ad hoc mining tasks, should be integrated with a data warehouse query language and
optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results − Once the patterns are discovered, they need to be
expressed in high-level languages and visual representations. These representations should be easily
understandable.
• Handling noisy or incomplete data − The data cleaning methods are required to handle the noise and
incomplete objects while mining the data regularities. If the data cleaning methods are not there then the
accuracy of the discovered patterns will be poor.
• Pattern evaluation − The patterns discovered should be interesting; patterns that merely represent
common knowledge or lack novelty are of little use.
19
Performance Issues
 There can be performance-related issues such as follows −
• Efficiency and scalability of data mining algorithms − In order to effectively extract the
information from huge amount of data in databases, data mining algorithm must be efficient
and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as the huge size
of databases, wide distribution of data, and complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the
data into partitions, which are further processed in parallel. Then the results from the
partitions are merged. Incremental algorithms update the mining results without mining the data
again from scratch.

Diverse Data Types Issues


• Handling of relational and complex types of data − The database may contain complex
data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one
system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems −
The data is available at different data sources on a LAN or WAN. These data sources may be
structured, semi-structured, or unstructured. Therefore mining knowledge from them adds
challenges to data mining.
20
Data Objects and Attribute Types
 Data: Data is a collection of data objects and their attributes.
 An attribute is an object’s property or characteristic. For example, a person’s hair
colour, air humidity, etc.
• An attribute set defines an object. The object is also referred to as a record, an
instance, or an entity.
Different types of attributes or data types:
1. Nominal Attribute:
Nominal attributes only provide enough information to differentiate one object from another, such as
Student Roll No. or the sex of a person.
2. Ordinal Attribute:
The ordinal attribute value provides sufficient information to order the objects, such as rankings or grades.
3. Binary Attribute:
These have only the values 0 and 1, where 0 is the absence of a feature and 1 is its presence.
4. Numeric Attribute:
It is quantitative: the quantity can be measured and represented in integer or real values. Numeric
attributes are of two types:
Interval-scaled attribute:
It is measured on a scale of equal-size units. These attributes allow us to compare and order values, such
as temperature in °C or °F, but they have no true zero point.
Ratio-scaled attribute:
It is measured on a scale with a true zero point, so both differences and ratios of values are meaningful,
such as weight or monetary quantities.
21
Types of attributes

22
Data Quality: Why do we preprocess the data?
Many characteristics act as deciding factors for data quality, such as incompleteness and
inconsistency, which are common properties of big databases in the real world. Factors
used for data quality assessment are:
• Accuracy: There are many possible reasons for flawed or inaccurate data, i.e., having
incorrect attribute values, which could be due to human or computer errors.
• Completeness: Incomplete data can occur for several reasons; attributes of interest, such as
customer information for sales and transaction data, may not always be available.
• Consistency: Incorrect data can also result from inconsistencies in naming conventions or data
codes, or from inconsistent input field formats. Duplicate tuples also need cleaning.
• Timeliness: Timeliness also affects the quality of the data. At the end of the month, several sales
representatives may fail to file their sales records on time. There are also several corrections and
adjustments that flow in after the end of the month. The data stored in the database is then
incomplete for a time after each month.
• Believability: It is reflective of how much users trust the data.
• Interpretability: It is a reflection of how easy the users can understand the data.

23
Example of attribute
 In this example, RollNo, Name, and Result are attributes of the object named
student.
 RollNo   Name    Result
 1        Ali     Pass
 2        Akram   Fail
 Types Of attributes
 Binary
 Nominal
 Ordinal Attributes
 Numeric
 Interval-scaled
 Ratio-scaled

24
 Nominal Attributes
 Nominal data is in the form of names or labels, not numbers. Nominal Attributes
are Qualitative Attributes.
 Examples of Nominal attributes
 In this example, states and colors are the attributes, and New, Pending, Working,
Complete, Finish and Black, Brown, White, and Red are the values.
 Attribute          Value
 Categorical data   Lecturer, Assistant Professor, Professor
 States             New, Pending, Working, Complete, Finish
 Colors             Black, Brown, White, Red

25
 Binary Attributes
 Binary data have only two values/states. For example, here HIV detected can
be only Yes or No.
Binary Attributes are Qualitative Attributes.
 Examples of Binary Attributes
 Attribute Value
 HIV detected Yes, No
 Result Pass, Fail
 The binary attribute is of two types;
 Symmetric binary
 Asymmetric binary
26
 Examples of Symmetric data
 Both values are equally important. For example, if we have open admission to our
university, then it does not matter whether you are a male or a female.
 Example:
 Attribute Value
 Gender Male, Female

 Examples of Asymmetric data


 Both values are not equally important. For example, HIV detected is more important than
HIV not detected. If a patient has HIV and we ignore it, the result can be death; but if a person
does not have HIV and we ignore it, then there is no special issue or risk.
 Example
 Attribute Value
 HIV detected Yes, No
 Result Pass, Fail
27
 Ordinal Attributes
 All values have a meaningful order. For example, Grade A means the highest marks, B means
marks are less than A, C means marks are less than grades A and B, and so on. Ordinal
Attributes are Qualitative Attributes.
 Examples of Ordinal Attributes
 Attribute Value
 Grade A, B, C, D, F
 BPS- Basic pay scale 16, 17, 18

 Discrete Attributes
 Discrete data have a finite or countable set of values. They can be in numerical form and can
also be in categorical form.
 Examples of Discrete Data
 Attribute     Value
 Profession    Teacher, Businessman, Peon, etc.
 Postal Code   42200, 42300, etc.
28
 Continuous Attributes
 Continuous data technically have an infinite number of possible values.
 Continuous data is of float type; there can be many numbers between 1
and 2. These attributes are Quantitative Attributes.
 Examples of Continuous Attributes
 Attribute   Value
 Height      5.4..., 6.5..., etc.
 Weight      50.09..., etc.

29
Basic Statistical Descriptions of Data
Mean, Median, Mode in data mining

 What is mean?
 Mean is the average of the numbers.
 Example:
 3, 5, 6, 9, 8
 Mean = sum of all values / total number of values
 Mean = (3 + 5 + 6 + 9 + 8) / 5 = 6.2

 What is Median?
 Median is the middle value among all values.
 How to calculate the median for an odd number of values?
 Example:
 9, 8, 5, 6, 3
 Arrange the values in order: 3, 5, 6, 8, 9
 Median = 6
 How to calculate the median for an even number of values?
 Example:
 9, 8, 5, 6, 3, 4
 Arrange the values in order: 3, 4, 5, 6, 8, 9
 Add the 2 middle values and calculate their mean.
 Median = (5 + 6) / 2 = 5.5

 What is Mode?
 The mode is the most frequently occurring value.
 How to calculate the mode?
 Example:
 3, 6, 6, 8, 9
 Mode = 6 (because 6 occurs 2 times and all other values occur only once).
 What is quartile?
 Quartile means four equal groups.
 How to find quartiles of odd length data set?
 Example:
 Data = 8, 5, 2, 4, 8, 9, 5
 Step 1:
 First of all, arrange the values in order.
 After ordering the values:
 Data = 2, 4, 5, 5, 8, 8, 9
 Step 2:
 For dividing this data into four equal parts, we need three quartiles.
 Q1: Lower quartile
 Q2: Median of the data set
 Q3: Upper quartile
 Step 3:
 Find the median of the data set and label it as Q2.
 Data = 2, 4, 5, 5, 8, 8, 9
 Q1: 4 – Lower quartile
 Q2: 5 – Middle quartile
 Q3: 8 – Upper quartile
 Inter Quartile Range= Q3 – Q1
=8–4
=4
What is an Outlier?
 An outlier is a data value that lies far away from the overall pattern of the data.

How to find outliers?
 A value is usually treated as an outlier if it lies more than 1.5 * IQR below Q1 or above Q3.
 1.5 * IQR = 1.5 * 4 = 6
 So values below Q1 - 6 = -2 or above Q3 + 6 = 14 would be outliers here.
 Population size:
 Population size is the total number of values in the data.

 How to find quartiles of even length data set?
 Example:
 Data = 8, 5, 2, 4, 8, 9, 5,7
 Step 1:
 First of all, arrange the values in order
 After ordering the values:
 Data = 2, 4, 5, 5, 7, 8, 8, 9
 Step 2:
 For dividing this data into four equal parts, we need three quartiles.
 Q1: Lower quartile
 Q2: Median of the data set
 Q3: Upper quartile
 Step 3:
 Find the median of the data set and label it as Q2.
 Data = 2, 4 ♦ 5, 5 ♦ 7, 8 ♦ 8, 9
 Minimum: 2
 Q1: 4 + 5 / 2 = 4.5 Lower quartile
 Q2: 5+ 7 / 2 = 6 Middle quartile
 Q3: 8 + 8 / 2 = 8 Upper quartile
 Maximum: 9
 Inter Quartile Range= Q3 – Q1
 = 8 – 4.5
 = 3.5
 A value is usually treated as an outlier if it lies more than 1.5 * IQR below Q1 or above Q3.
 1.5 * IQR = 1.5 * 3.5 = 5.25
 So values below Q1 - 5.25 = -0.75 or above Q3 + 5.25 = 13.25 would be outliers here.
 Variance and standard deviation of data in data mining
 What is data variance and standard deviation?
 Different values in the data set can be spread here and there around the
mean. Variance tells us how far away the values are from the mean.
 Standard deviation is the square root of the variance.
 Low standard deviation
 Low standard deviation tells us that fewer numbers are far away from the
mean.
 High standard deviation
 High standard deviation tells us that more numbers are far away from the
mean.
 How to calculate the variance and standard deviation of population data?
 Example attribute: marks = 8, 10, 15, 20

 Mean = (8 + 10 + 15 + 20) / 4 = 13.25
 Variance = ((8 - 13.25)² + (10 - 13.25)² + (15 - 13.25)² + (20 - 13.25)²) / 4 ≈ 21.69
 Standard deviation = √21.69 ≈ 4.66
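The same figures can be verified with NumPy (a minimal sketch; np.var and np.std use the population definitions by default):

import numpy as np

marks = np.array([8, 10, 15, 20])
print(marks.mean())   # 13.25
print(np.var(marks))  # 21.6875  (population variance)
print(np.std(marks))  # 4.657... (population standard deviation)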
Measuring data Similarity and dissimilarity
 Distance or similarity measures are essential in solving many pattern recognition
problems such as classification and clustering. Various distance/similarity measures
are available in the literature to compare two data distributions. As the names
suggest, a similarity measure shows how close two distributions are. For multivariate data,
complex summary methods have been developed to answer this question.
 Similarity Measure
 A numerical measure of how alike two data objects are; it often falls between 0 (no similarity) and 1
(complete similarity).
 Dissimilarity Measure
 A numerical measure of how different two data objects are; it ranges from 0 (objects are alike) to an
upper limit that varies by measure, possibly ∞ (objects are completely different).
 Proximity
 refers to a similarity or dissimilarity
37
 We consider similarity and dissimilarity in many places in data mining.
 Similarity measure
 is a numerical measure of how alike two data objects are.
 higher when objects are more alike.
 often falls in the range [0,1]
 Similarity might be used to identify
 duplicate data that may have differences due to typos.
 equivalent instances from different data sets. E.g. names and/or addresses that are the same but have misspellings.
 groups of data that are very close (clusters)
 Dissimilarity measure
 is a numerical measure of how different two data objects are
 lower when objects are more alike
 minimum dissimilarity is often 0, while the upper limit varies depending on how much variation the data can exhibit
 Dissimilarity might be used to identify
 outliers
 interesting exceptions, e.g. credit card fraud
 boundaries to clusters
 Proximity refers to either a similarity or dissimilarity
38
 Similarity/Dissimilarity for Simple Attributes
 Here, p and q are the attribute values for two data objects.
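For simple single-valued attributes, the per-type measures are commonly summarized as follows (a sketch following standard data mining texts):

Attribute Type      Dissimilarity                          Similarity
Nominal             d = 0 if p = q, d = 1 if p ≠ q         s = 1 if p = q, s = 0 if p ≠ q
Ordinal             d = |p - q| / (n - 1), with the        s = 1 - d
                    values mapped to integers 0 to n-1
Interval or Ratio   d = |p - q|                            s = -d, or s = 1 / (1 + d)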

Distance, such as the Euclidean distance, is a dissimilarity measure and has some well-known properties:
Common Properties of Dissimilarity Measures
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q,
2. d(p, q) = d(q,p) for all p and q,
3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r, where d(p, q) is the distance (dissimilarity) between points (data
objects), p and q.
 A distance that satisfies these properties is called a metric. Following is a list of several common distance
measures to compare multivariate data. We will assume that the attributes are all continuous.
39
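The most common of these are special cases of the Minkowski distance; as a brief sketch of the standard formulas:

\[ d(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^h \right)^{1/h} \quad \text{(Minkowski distance of order } h\text{)} \]

h = 1 gives the Manhattan (city-block) distance, \( d(p, q) = \sum_{i=1}^{n} |p_i - q_i| \), and h = 2 gives the Euclidean distance, \( d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \).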
40
Common Properties of Similarity Measures
 Similarities have some well-known properties:

1. s(p, q) = 1 (or maximum similarity) only if p = q,


2. s(p, q) = s(q, p) for all p and q, where s(p, q) is the similarity between data
objects, p and q.
Similarity Between Two Binary Variables
 The above similarity and distance measures are appropriate for continuous
variables. However, for binary variables a different approach is necessary. Counting the
matches between two binary vectors p and q gives a 2x2 contingency table:

         q=1    q=0
 p=1     n1,1   n1,0
 p=0     n0,1   n0,0

 Simple matching coefficient = (n1,1 + n0,0) / (n1,1 + n1,0 + n0,1 + n0,0)
 Jaccard coefficient = n1,1 / (n1,1 + n1,0 + n0,1)   (the 0-0 matches are ignored)
41
Example:
Given data:
 p = 1 0 0 0 0 0 0 0 0 0
 q = 0 0 0 0 0 0 1 0 0 1
 The frequency table is:
         q=1   q=0
 p=1     0     1
 p=0     2     7
 Calculate the Simple matching coefficient and the Jaccard coefficient.
• Simple matching coefficient = (0 + 7) / (0 + 1 + 2 + 7) = 0.7.
• Jaccard coefficient = 0 / (0 + 1 + 2) = 0.
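A minimal sketch reproducing these two coefficients in Python:

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

n11 = sum(a == 1 and b == 1 for a, b in zip(p, q))  # both 1
n10 = sum(a == 1 and b == 0 for a, b in zip(p, q))  # p only
n01 = sum(a == 0 and b == 1 for a, b in zip(p, q))  # q only
n00 = sum(a == 0 and b == 0 for a, b in zip(p, q))  # both 0

print((n11 + n00) / (n11 + n10 + n01 + n00))  # SMC = 0.7
print(n11 / (n11 + n10 + n01))                # Jaccard = 0.0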

42
DATA PREPROCESSING
 Data preprocessing is the process of transforming raw data into an understandable format.
 It is also an important step in data mining as we cannot work with raw data.
 The quality of the data should be checked before applying machine learning or data
mining algorithms.

Why is Data preprocessing important?


 Preprocessing of data is mainly to check the data quality. The quality can be checked by
the following
 Accuracy: To check whether the data entered is correct or not.
 Completeness: To check whether the data is available or not recorded.
 Consistency: To check whether the same data is kept consistently in all the places where it
is stored.
 Timeliness: The data should be updated correctly.
 Believability: The data should be trustable.
 Interpretability: The understandability of the data.
Tasks in data preprocessing
 Data Cleaning: It is also known as scrubbing. This task involves filling of missing values,
smoothing or removing noisy data and outliers along with resolving inconsistencies.
 Data Integration: This task involves integrating data from multiple sources such as
databases (relational and non-relational), data cubes, files, etc. The data sources can be
homogeneous or heterogeneous. The data obtained from the sources can be structured,
unstructured or semi-structured in format.
 Data Transformation: This involves normalization and aggregation of data according to
the needs of the data set.
 Data Reduction: During this step data is reduced. The number of records or the number
of attributes or dimensions can be reduced. Reduction is performed by keeping in mind
that reduced data should produce the same results as original data.
 Data Discretization: It is considered as a part of data reduction. The numerical attributes
are replaced with nominal ones.
What is meant by data cleaning?
Data cleaning is the process of cleaning dirty data. Data is mostly not clean:
much of it can be incorrect for a large number of reasons, such as hardware
error/failure, network error, or human error. So it is compulsory to clean the
data before mining.
What are the importance and benefits of data cleaning?
1. Data cleaning removes major errors.
2. Data cleaning ensures happier customers, more sales, and more accurate
decisions.
3. Data cleaning removes inconsistencies that are most likely to occur when
multiple sources of data are stored in one data set.
4. Data cleaning makes the data set more efficient, more reliable, and more
accurate.
Sources of Missing Values
 There are many sources of missing data. Let’s see some major sources of missing data.
 The user forgot to fill in a field.
 It can be a programming error.
 Data can be lost when transferring data manually from a legacy database.

Dirty data          Examples
Incomplete data     salary = " "
Inconsistent data   Age = "5 years", Birthday = "06/06/1990", Current Year = "2017"
Noisy data          Salary = "-5000", Name = "123"
Intentional error   Some applications assign a default value to an attribute, e.g., gender = "male" by default.
How to Handle Incomplete/Missing Data?
 Ignore the tuple.
 Fill in the missing value manually.
 Fill in the values automatically by:
 Using the attribute mean (see the sketch below).
 Using a constant value, if one is specified.
 Using the most probable value, found via a Bayesian formula or a decision tree.
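For example, filling missing values with the attribute mean can be done in pandas (a minimal sketch; the salary column and its values are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [3000, np.nan, 5000, 4000]})
df["salary"] = df["salary"].fillna(df["salary"].mean())
print(df)  # the missing salary becomes 4000.0, the column mean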
How to Handle Noisy Data?
 Binning
 Regression
 Clustering
 Combined computer and human inspection.
What is Binning?
 Binning is a technique in which we first sort the data and then partition it into
equal-frequency bins.
Bin 1 2, 3, 6, 8
Bin 2 14,16,18,24
Bin 3 26,28,30,32

Types of binning:
 There are many types of binning. Some of them are as follows:
 Smoothing by bin means
 Smoothing by bin medians
 Smoothing by bin boundaries, etc.

For example, smoothing the bins above by bin means replaces every value with its bin's mean:
Bin 1 4.75, 4.75, 4.75, 4.75
Bin 2 18, 18, 18, 18
Bin 3 29, 29, 29, 29
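A minimal sketch of equal-frequency binning with smoothing by bin means, using the values from the tables above:

import numpy as np

values = np.array([2, 3, 6, 8, 14, 16, 18, 24, 26, 28, 30, 32])  # already sorted
bins = np.split(values, 3)                 # three equal-frequency bins

for b in bins:
    print(np.full(len(b), b.mean()))       # replace each value with its bin mean
# [4.75 4.75 4.75 4.75]
# [18. 18. 18. 18.]
# [29. 29. 29. 29.]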
Data Integration
 In this step, a coherent data source is prepared. This is done by collecting
and integrating data from multiple sources like databases, legacy systems,
flat files, data cubes etc.
 Data is like garbage. You’d better know what you are going to do with it
before you collect it. — Mark Twain
Issues in Data Integration
 Schema Integration: Metadata (i.e. the schema) from different sources
may not be compatible. This leads to the entity identification
problem. Example: Consider two data sources R and S. Customer id in R
is represented as cust_id and in S is represented as c_id. They mean the
same thing and represent the same thing, but have different names, which leads
to integration problems. Detecting and resolving these conflicts is very important
for a coherent data source.
 Data value conflicts: The values, metrics, or representations of the same
data may be different for the same real-world entity in different data
sources. This leads to different representations of the same data, different
scales, etc. Example: Weight in data source R is represented in kilograms
and in source S is represented in grams. To resolve this, data
representations should be made consistent and conversions should be
performed accordingly.
 Redundant data: Duplicate attributes or tuples may occur as a result of
integrating data from various sources. This may also lead to
inconsistencies. These redundancies or inconsistencies may be reduced by
careful integration of data from multiple sources. This will help in
improving the mining speed and quality. Also, correlation analysis can
be performed to detect redundant data.
Data Reduction
 If the data is very large, data reduction is performed. Sometimes, it is also
performed to find the most suitable subset of attributes from a large
number of attributes. This is known as dimensionality reduction. Data
reduction also involves reducing the number of attribute values and/or the
number of tuples. Various data reduction techniques are:
 Data cube aggregation: In this technique the data is reduced by applying
OLAP operations like slice, dice or rollup. It uses the smallest level
necessary to solve the problem.
 Dimensionality reduction: The data attributes or dimensions are reduced.
Not all attributes are required for data mining. The most suitable subset of
attributes is selected by using techniques like forward selection,
backward elimination, decision tree induction, or a combination of forward
selection and backward elimination.
 Dimensionality reduction is considered a significant task in data mining applications.
 For example, suppose you have a dataset with a lot of
dimensions (features or columns in your database).

 In this example, we can see that if we know the mobile number, then we can derive the
mobile network or SIM provider, so the mobile-network dimension can be removed. When we
reduce dimensions, we combine or drop attributes in such a way that the reduced data does not
lose the significant characteristics of the original dataset that is going to be ready for data
mining.
Curse of Dimensionality
 “The Curse is an offensive word or phrase used to express anger or annoyance”.
 The curse of dimensionality is a condition that occurs when we want to classify, organize, and
analyze the high dimensional data.
 When the number of dimensions increases, the distance between two independent points
increases and similarity decreases.
 This problem results in more errors in our final results after data mining. When we are working
on data, especially big data, there are a very large number of data points, so a lot of
dimensions are possibly there.
 In this case, it is practically impossible to get the wanted results, and even if we suppose it is
possible, the results will be inefficient.
 If we reduce the dimensions, then it can be easier and more convenient to collect the data.
 Data is not collected only for data mining.
 Data accumulates at a good speed.
 Data preprocessing is an important task for better and more effective data mining.
 Dimensionality reduction is an effective approach to collecting less, but more useful, data.
 Dimensionality Reduction is very helpful in the projection of high-dimensional data
onto 2D or 3D visualizations (see the sketch after this list).
 Dimensionality Reduction is helpful for efficient storage and retrieval of the data and
promotes the concept of data compression.
 Dimensionality Reduction has a positive effect on query accuracy through noise
removal.
 Dimensionality Reduction reduces computation time: it shortens the time required for
performing the same computations.
 Dimensionality Reduction is helpful in removing redundant features.
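As an illustration of projecting high-dimensional data down for 2D visualization, here is a minimal sketch using PCA from scikit-learn (PCA is one widely used projection technique; the random data here is purely synthetic):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 records with 10 attributes

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)           # project onto the top 2 components
print(X_2d.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)  # variance retained by each component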
Application of Dimensionality Reduction
 Text mining
 Image retrieval
 Microarray data analysis
 Protein classification
 Face and image recognition
 Intrusion detection
 Customer relationship management
 Handwritten digit recognition
 Data compression: In this technique, large volumes of data are compressed, i.e.,
the number of bits used to store the data is reduced. This can be done by using
lossy or lossless compression. In lossy compression, some quality of the data is
sacrificed for a higher compression level. In lossless compression, the data can be
reconstructed exactly, though the achievable compression level is lower.
 Numerosity reduction: This technique reduces the volume of data by
choosing smaller forms for data representation. Numerosity reduction can be
done using histograms, clustering, or sampling of data. Numerosity reduction
is necessary as processing the entire data set is expensive and time-
consuming.
Data Transformation:
 The change made in the format or the structure of the data is called data
transformation. This step can be simple or complex based on the
requirements. There are some methods in data transformation.
 Smoothing: With the help of algorithms, we can remove noise from the
dataset, which helps in identifying the important features of the dataset. By
smoothing, we can detect even a simple change that helps in prediction.
 Aggregation: In this method, the data is stored and presented in the form
of a summary. Data from multiple sources is integrated and summarized
for data analysis. This is an important step since the
accuracy of the results depends on the quantity and quality of the data. When
the quality and the quantity of the data are good, the results are more
relevant.
 Discretization: The continuous data here is split into intervals. Discretization
reduces the data size. For example, rather than specifying the class time, we
can set an interval like (3 pm-5 pm, 6 pm-8 pm).
 Normalization: It is the method of scaling the data so that it can be
represented in a smaller range. Example ranging from -1.0 to 1.0.
 Normalizing your data is an essential part of machine learning. You might
have an amazing dataset with many great features, but if you forget to
normalize, one of those features might completely dominate the others, and it is
as if you are throwing away almost all of your information. Normalizing solves
this problem. The following techniques are used to
normalize:
 Min-max normalization
 Z-score normalization
 Decimal Scaling Normalization
 Standard Deviation normalization
Min Max Normalization in data mining
 Min-Max is a data normalization technique, like Z-score, decimal scaling, and normalization
with standard deviation. It scales the data into a fixed range, typically between 0 and 1, which makes
the data easier to understand and compare.
 For example, telling the difference between 200 and 1000 is more confusing than telling
the difference between 0.2 and 1.

 Min-Max normalization formula:
 v' = ((v - Min) / (Max - Min)) * (newMax - newMin) + newMin
 Example attribute: marks = 8, 10, 15, 20

Min:
 The minimum value of the given attribute. Here Min is 8.

Max:
 The maximum value of the given attribute. Here Max is 20.

V:
 V is the respective value of the attribute. For example, here V1=8, V2=10, V3=15, and V4=20.

newMax:
 1

newMin:
 0
marks    marks after Min-Max normalization
8        0
10       0.17
15       0.58
20       1
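A minimal sketch of min-max normalization for the marks example, scaling to [0, 1]:

marks = [8, 10, 15, 20]
lo, hi = min(marks), max(marks)
normalized = [(v - lo) / (hi - lo) for v in marks]
print([round(v, 2) for v in normalized])  # [0.0, 0.17, 0.58, 1.0]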
Z-Score Normalization – (Data Mining)
 Z-Score normalization rescales values by how many standard deviations they lie from the
mean, which puts attributes with different scales into a common, easy-to-compare form.
 Z-Score formula:
 v' = (v - Mean) / StandardDeviation
How to calculate the Z-Score of the following data?
 Example attribute: marks = 8, 10, 15, 20

 Mean = 13.25
 Standard deviation ≈ 4.66

marks    marks after z-score normalization
8        -1.13
10       -0.70
15       0.38
20       1.45
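A minimal sketch of z-score normalization for the same marks, using the population standard deviation:

import numpy as np

marks = np.array([8, 10, 15, 20])
z = (marks - marks.mean()) / marks.std()  # std() is the population std by default
print(np.round(z, 2))                     # [-1.13 -0.7   0.38  1.45]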
Decimal Scaling Normalization
 Decimal scaling is a data normalization technique, like Z-score, Min-Max, and normalization
with standard deviation. In this technique, we move the decimal point of the values of the attribute. How
far the decimal point moves depends on the maximum absolute value among all values of the attribute.
 Decimal Scaling Formula
 A value v of attribute A can be normalized by the following formula:
 v' = v / 10^j
 where j is the smallest integer such that max(|v'|) < 1.
 Example of Decimal scaling

CGPA   Formula   CGPA normalized after decimal scaling
2      2/10      0.2
3      3/10      0.3

 We check the maximum value of our attribute CGPA. Here the maximum value is 3, so we can
convert the values to decimals by dividing by 10. Why 10? We count the digits in the maximum value
and use a 1 followed by that many zeros. Here the maximum value 3 has only one digit, so we put one
zero after the one, giving 10.
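A minimal sketch of decimal scaling: divide by the smallest power of 10 that brings the largest absolute value below 1:

import math

def decimal_scale(values):
    m = max(abs(v) for v in values)
    j = math.floor(math.log10(m)) + 1  # smallest j such that m / 10**j < 1
    return [v / 10 ** j for v in values]

print(decimal_scale([2, 3]))        # [0.2, 0.3]   (divide by 10)
print(decimal_scale([200, 999]))    # [0.2, 0.999] (divide by 1000)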
Standard Deviation normalization
 Different values in the data set can be spread here and there around the mean. Variance tells us
how far away the values are from the mean.
 Standard deviation is the square root of the variance.
 A high standard deviation tells us that more values lie far away from the mean.
 A low standard deviation tells us that fewer values lie far away from the mean.
Data discretization in data mining
Data discretization converts a large number of data values into a smaller number of intervals or labels, so
that data evaluation and data management become much easier.
Data discretization example
 We have an attribute age with the following values.

Table: Before discretization
Age: 10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75

Table: After discretization
Attribute              Age                       Age                       Age
Values                 10, 11, 13, 14, 17, 19   30, 31, 32, 38, 40, 42    70, 72, 73, 75
After Discretization   Young                    Mature                    Old

 Another example is website visitor data, which can be discretized into countries.
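A minimal sketch reproducing the age discretization above with pandas (the exact bin edges are an assumption for this example):

import pandas as pd

ages = pd.Series([10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75])
labels = pd.cut(ages, bins=[0, 29, 69, 120], labels=["Young", "Mature", "Old"])
print(labels.value_counts().sort_index())  # Young 6, Mature 6, Old 4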
What are some famous techniques of data discretization?
 Histogram analysis: A histogram is a plot used to present the underlying frequency
distribution of a set of continuous data. The histogram helps in inspecting the distribution of the
data, for example for normal distribution, outliers, and skewness.
 Binning: Binning is a data smoothing technique that helps to group a huge number of
continuous values into a smaller number of bins. For example, if we have data about a group
of students and we want to arrange their marks into a smaller number of intervals, we can
make bins of grades: one bin for grade A, one for grade B, one for C, one for D, and
one for F.
 Correlation analysis: Correlation-based discretization (for example, the ChiMerge technique)
merges neighboring intervals whose class distributions are statistically similar.
 Cluster analysis: Cluster analysis is commonly known as clustering. Clustering is the task
of grouping similar objects into one group, commonly called a cluster. All dissimilar objects are
placed in different clusters.
 Decision tree analysis
 Equal width partitioning
 Equal depth partitioning (contrasted with equal width in the sketch below)
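Equal width and equal depth partitioning can be contrasted in pandas (a minimal sketch with arbitrary sample values):

import pandas as pd

values = pd.Series([2, 3, 6, 8, 14, 16, 18, 24, 26, 28, 30, 32])

equal_width = pd.cut(values, bins=3)   # bins spanning equal value ranges
equal_depth = pd.qcut(values, q=3)     # bins holding (roughly) equal counts

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())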
Data discretization and concept hierarchy generation
 A concept hierarchy represents a sequence of mappings between a set of specialized,
low-level concepts and more general, higher-level concepts. The mapping can run in
either direction: top-down (general to specialized) or bottom-up (specialized to
general).
 Let’s see an example of a concept hierarchy for the dimension location.
 Each city can be mapped with the country to which the given city belongs. For
example, Mianwali can be mapped to Pakistan and Pakistan can be mapped to
Asia.
 Top-down mapping
 Top-down mapping starts from the top with general concepts and moves to the
bottom to the specialized concepts.
 Bottom-up mapping
 Bottom-up mapping starts from the Bottom with specialized concepts and moves
to the top to the generalized concepts.
Data discretization and binarization in data mining
 What is the difference between discretization and binarization in data science?
 Data Discretization in data mining is the process used to transform continuous
attributes into discrete ones.
 Data Binarization in data mining is used to transform both discrete and continuous attributes
into binary attributes.
Implementation - Data preprocessing steps

Step 1: Import libraries and the dataset


 import io
 import numpy as np
 import pandas as pd
 from google.colab import files
 # Upload the CSV file in Colab and load it into a DataFrame
 uploaded = files.upload()
 df = pd.read_csv(io.BytesIO(uploaded['Datasets.csv']))
 print(df)
Step 2: Extracting independent variable:
 The variables in a study of a cause-and-effect relationship are called
the independent and dependent variables.
 The independent variable is the cause. Its value is independent of other
variables in your study.
 The dependent variable is the effect. Its value depends on changes in the
independent variable.
 x = df.iloc[:, :-1].values   # all rows, all columns except the last
 x

Step 3: Extracting dependent variable:


 y = df.iloc[:, 3].values   # all rows, the fourth column (index 3)
 y
 [:, :] literally means [all rows, all columns].
 Indexing in python starts from 0 when you go from the first element to the
last, but it starts from -1 when you start from the last element.
 So, when you do [:, -1] it means you are taking all the rows and only the last
column. -1 represents the last column.
 When you do [:, :-1], it means you are taking all the rows and all the columns
except the last column.
Step 4: Filling the missing values with the mean value of the attribute
 from sklearn.impute import SimpleImputer
 imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
 imputer = imputer.fit(x[:, 1:3])   # learn the column means on columns 1 and 2
 x[:, 1:3] = imputer.transform(x[:, 1:3])   # replace NaNs with those means
 x
Step 5: Encoding the country variable (Transformation)
 Machine learning models use mathematical equations, so categorical data is
not accepted directly; we convert it into numerical form.
 from sklearn.preprocessing import LabelEncoder
 label_encoder_x = LabelEncoder()
 x[:, 0] = label_encoder_x.fit_transform(x[:, 0])   # e.g. country names -> 0, 1, 2, ...
 x
End of Unit II

76
