

4. Introduction to Data Mining

Data Mining

The process of extracting information from huge sets of data in order to identify patterns, trends, and useful knowledge that allows a business to take data-driven decisions is called data mining.
Data mining is one of the most useful techniques for helping entrepreneurs, researchers, and individuals extract valuable information from huge sets of data. Data mining is also called Knowledge Discovery in Databases (KDD).
Data mining is defined as extracting information from huge sets of data; in other words, it is the procedure of mining knowledge from data. The information or knowledge extracted in this way can be used for any of the following applications:

Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration

Knowledge Discovery in Database (KDD)

-The term Knowledge Discovery in Databases (KDD) refers to the overall process of discovering knowledge in data and the application of data mining methods.
-It draws on a wide variety of application domains, including Artificial Intelligence, Pattern Recognition, Machine Learning, Statistics, and Data Visualization.
-The main goal is to extract knowledge from large databases; this goal is achieved by using various data mining algorithms to identify useful patterns according to some predefined measures and thresholds.

KDD Process in Data Mining

The KDD process in data mining is a multi-step process that involves various stages to extract useful knowledge from large datasets. The following are the main steps involved in the KDD process:

Data Selection - The first step in the KDD process is identifying and selecting the
relevant data for analysis. This involves choosing the relevant data sources, such as
databases, data warehouses, and data streams, and determining which data is required
for the analysis.

Data Preprocessing - After selecting the data, the next step is data preprocessing. This step involves cleaning the data, removing outliers, and handling missing, inconsistent, or irrelevant values.

Data Transformation - Once the data is preprocessed, the next step is to transform it
into a format that data mining techniques can analyze.

Data Mining - This is the heart of the KDD process and involves applying various data
mining techniques to the transformed data to discover hidden patterns, trends,
relationships, and insights. A few of the most common data mining techniques include
clustering, classification, association rule mining, and anomaly detection.

Pattern Evaluation - After the data mining, the next step is to evaluate the discovered
patterns to determine their usefulness and relevance.

Knowledge Representation - This step involves representing the knowledge extracted from the data in a way humans can easily understand and use.

Deployment - The final step in the KDD process is to deploy the knowledge and insights gained from the data mining process to practical applications.
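The following is a minimal sketch of how these steps might look in code, assuming a pandas/scikit-learn workflow; the file name (customers.csv) and column names (age, income, churned) are purely illustrative and not part of any real dataset.

```python
# A minimal, illustrative KDD-style pipeline (assumed columns: age, income, churned).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Data selection: load the relevant data source and keep only relevant attributes.
data = pd.read_csv("customers.csv")            # hypothetical file
data = data[["age", "income", "churned"]]

# 2. Data preprocessing: remove records with missing values.
data = data.dropna()

# 3. Data transformation: scale numeric attributes into a form suited to mining.
X = StandardScaler().fit_transform(data[["age", "income"]])
y = data["churned"]

# 4. Data mining: apply a classification technique to the transformed data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# 5. Pattern evaluation: measure how useful the discovered model is.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 6./7. Knowledge representation and deployment: the fitted model and its rules
# can now be presented to analysts or embedded in an application.
```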

Advantages of KDD in Data Mining

● Helps in Decision Making - KDD can help make informed and data-driven
decisions by discovering hidden patterns, trends, and relationships in data that
might not be immediately apparent.
● Improves Business Performance - KDD can help organizations improve their
business performance by identifying areas for improvement, optimizing
processes, and reducing costs.
● Saves Time and Resources - KDD can help save time and resources by
automating the data analysis process and identifying the most relevant and
significant information or knowledge.
● Increases Efficiency - KDD can help organizations streamline their processes,
optimize their resources, and increase their overall efficiency.
● Fraud Detection - KDD can help detect fraud and identify fraudulent behavior by
analyzing patterns in data and identifying anomalies or unusual behavior.

Disadvantages of KDD in Data Mining

● Requires High-Quality Data - KDD relies on high-quality data to generate accurate and meaningful insights. If the data is incomplete, inconsistent, or of poor quality, it can lead to inaccurate, misleading results and flawed conclusions.
● Complexity - KDD is a complex and time-consuming process that requires
specialized skills and knowledge to perform effectively. The complexity can also
make interpreting and communicating the results challenging to non-experts.
● Privacy and Compliance Concerns - KDD can raise ethical concerns related to
privacy, compliance, bias, and discrimination. For example, data mining
techniques can extract sensitive information about individuals without their
consent or reinforce existing biases or stereotypes.
● High Cost - KDD can be expensive and requires specialized software, hardware, and skilled professionals to perform the analysis. The cost can be prohibitive for smaller organizations or those with limited resources.

Difference Between KDD and Data Mining

The difference between KDD and data mining is explained below, factor by factor.

Definition - The KDD process is a comprehensive process that includes multiple steps for extracting useful knowledge and insights from large datasets, whereas data mining is a subset of KDD that focuses primarily on finding patterns and relationships in data.

Steps involved - KDD includes steps such as data collection, cleaning, integration, selection, transformation, data mining, interpretation, and evaluation, whereas data mining includes steps such as data preprocessing, modeling, and data analysis.

Focus - KDD emphasizes the importance of domain expertise in interpreting and validating results, whereas data mining focuses on the use of computational algorithms to analyze data.

Techniques used - KDD uses data selection, cleaning, transformation, data mining, pattern evaluation, interpretation, knowledge representation, and data visualization, whereas data mining uses association rule mining, clustering, regression, classification, and dimensionality reduction.

Outputs - KDD produces knowledge bases, such as rules or models, that help organizations make informed decisions, whereas data mining produces a set of patterns, relationships, predictions, or insights to support decision-making or business understanding.

Architecture of Data Mining

1. Database, data warehouse, or other information repository

These are the information repositories; data cleaning and data integration techniques may be performed on this data.

2. Database or data warehouse server

It fetches the data, as per the user's request, which is needed for the data mining task.

3. Knowledge base

It is used to guide the search and helps in finding interesting and hidden patterns in the data.

4. Data mining engine

It performs the data mining tasks such as characterization, association, classification, cluster analysis, etc.

5. Pattern evaluation module

It is integrated with the mining module and helps in searching for only the interesting patterns.

6. Graphical user interface

This module communicates between the user and the data mining system, and allows the user to browse the data or data warehouse schemas.

What kind of data can be mined / Data Sources

-We can use any kind of data source for data mining.
-In the current era, data is stored in multiple forms such as tables, lists, numbers, text, graphs, web pages, etc.
-We can mine data from the following data sources:

1. Database data
2. Data warehouse data
3. Transactional data
4. Data streams
5. Ordered or sequence data
6. Graph or networked data
7. Spatial data
8. Text data
9. Multimedia data
10. Web data (WWW)
11. Flat files

1. Database data

-It contains simple data that is structured into tables of rows and columns.
-It is relational data, in which tables are interrelated with one another.
-Rows contain values and columns represent attributes.

2. Data warehouse data

-It contains a huge amount of data collected from multiple sources.
-A data warehouse stores this large amount of data at a single site.
-We can use data warehouse data for effective decision making.
-Data from a data warehouse can be structured into data cubes.
-The dimensions of a data cube represent attributes, and its cells store values.

3. Transactional data
-A transaction represents a single unit of operation.
-It includes customer purchase records, flight or train bookings, user clicks on websites, etc.
-Each transaction contains a specific transaction ID and its related values.

4. Data streams
-A data stream is a sequence of data transmitted continuously from a provider.
-It travels in packets from a sender to a receiver.

5. Ordered or sequence data

-It contains a sequence or list of data items.
-It captures ordered behaviour, such as a customer who purchased items one after another.

6. Graph or networked data

-It contains interconnected entities.
-Web pages are interconnected: from one page we can go to another page using hyperlinks.
-One mobile number is linked with many other numbers through phone calls.

7. Spatial data
-It contains geo-spatial data.
-It stores the geographic coordinates of a physical object as numeric values.
-It includes location, shape, and route data.

8. Text data
-It is raw data, such as text generated and stored by databases and applications.
-It may contain metadata (data about data), such as the date, time, day, and year of an operation.

9. Multimedia data
-It contains collections of audio, video, and graphics data.
-Today many databases contain multimedia data for effective use.

10. Web data (WWW)

-It contains web-based data.
-It contains web pages that display huge amounts of text and multimedia data.
-It also stores user metadata, such as website visitors and the number of clicks on particular links.

11. Flat files
-Flat files are data files in text or binary form with a structure that can be easily extracted by data mining algorithms.
-Data stored in flat files have no relationships or paths among themselves; for example, if a relational database is stored in flat files, there will be no relations between the tables.
-Flat files are described by a data dictionary, e.g., a CSV file, as in the small sketch below.
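As a small illustration, a flat file such as a CSV can be loaded directly and treated as rows of data objects; the file name below is hypothetical.

```python
import pandas as pd

# Read a flat (CSV) file; each row becomes a data object, each column an attribute.
df = pd.read_csv("sales.csv")   # hypothetical flat file
print(df.head())
```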

Major Issues in Data Mining

Data mining is not an easy task, as the algorithms used can get very complex, and data is not always available in one place; it needs to be integrated from various heterogeneous data sources. These factors create a number of issues, and the major ones are described below.

Mining Methodology and User Interaction Issues


It refers to the following kinds of issues −

Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.

Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.

Incorporation of background knowledge − Background knowledge is used to guide the discovery process and to indicate the patterns or trends observed in the process. It can also be used to express the discovered patterns or trends in brief and precise terms, and it can be represented at different levels of abstraction.

Data mining query languages and ad hoc data mining − A data mining query language should give the user access to describe ad hoc mining tasks, and it needs to be integrated with a data warehouse query language.

Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.

Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. If data cleaning methods are not available, the accuracy of the discovered patterns will be poor.

Pattern evaluation − The patterns discovered should be interesting; patterns that merely represent common knowledge or lack novelty are of little value, so interestingness measures are needed to evaluate them.

Performance Issues

There can be performance-related issues such as the following −

Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amount of data in databases, data mining algorithms must be efficient and scalable.

Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms incorporate database updates without mining the entire data again from scratch.

Diverse Data Types Issues

Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.

Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.

Data Objects and Attribute Types

Data Object

-A data object is an entity of a dataset. For example, in a banking database a customer is a data object; in a product-manufacturing company, products and customers are the objects.
-Data objects are typically described (characterized) by attributes.
-Data sets are made up of data objects.
-In a sales database, the objects may be customers, store items, and sales; in a medical database, the objects may be patients; in a university database, the objects may be students, professors, and courses.
-Data objects can also be referred to as samples, examples, instances, data points, or objects. If the data objects are stored in a database, they are data tuples.
-That is, the rows of a database correspond to the data objects, and the columns correspond to the attributes.

Data Attribute
-Data attributes are the specific characteristics or properties that describe individual data objects within a dataset.
-That means an attribute describes a feature of a data object.
-For example, for a bank account the attributes are account_number, customer_number, branch_id, etc.
-We need to differentiate between different types of attributes during data preprocessing. Firstly, we distinguish between qualitative and quantitative attributes:
1. Qualitative attributes, such as nominal, ordinal, and binary attributes.
2. Quantitative attributes, such as numeric, discrete, and continuous attributes.
-Data attributes have the following types:

1. Nominal attributes
2. Binary attributes
   a. Symmetric
   b. Asymmetric
3. Ordinal attributes
4. Numeric attributes
   a. Interval-scaled
   b. Ratio-scaled
5. Discrete attributes
6. Continuous attributes

1. Nominal attributes
-Nominal means "relating to names"; the value of a nominal attribute is a symbol or name of a thing rather than a number.
-Nominal values are in alphabetical (categorical) form, not integers. Nominal attributes are qualitative attributes.
-Examples of nominal attributes: hair colour (black, brown, blond), marital status (single, married, divorced), occupation.

2. Binary attributes

-A binary attribute is represented by two values, 0 or 1.
-The value 0 specifies that the attribute is absent (false).
-The value 1 specifies that the attribute is present (true).
-E.g., in a train-booking database, if a seat is reserved the booking status records 1, otherwise 0.

Symmetric binary: both values or states are considered equally important or interchangeable.

Asymmetric binary: the two values or states are not equally important or interchangeable.

3. Ordinal attributes
-All values have a meaningful order or ranking among them.
-A single ordinal attribute can take multiple ordered values, e.g., grade (A, B, C) or size (small, medium, large).

4. Numeric attributes

-A numeric attribute is measurable; it is a quantitative attribute represented by integer or real values.

-It has two subtypes: 1. interval-scaled attributes and 2. ratio-scaled attributes.

-An interval-scaled attribute is measured on a scale of equal-sized units; its values can be positive, zero, or negative. E.g., a temperature database contains temperature values.

-A ratio-scaled attribute has an inherent zero point, so one value can be described as a multiple of another, e.g., years of experience of an employee.

5. Discrete attributes
-A discrete attribute has a finite or countably infinite set of values.
-E.g., zip codes, profession, or the set of words in a collection of documents. Discrete attributes are sometimes represented as integer variables.

6. Continuous attributes

-If an attribute is not discrete, it is said to be a continuous attribute.
-Continuous data, unlike discrete data, can take on an infinite number of possible values within a given range. It is characterized by being able to assume any value within a specified interval, often including fractional or decimal values.
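A small, assumed example of how these attribute types might appear in a dataset (the column names and values are illustrative only):

```python
import pandas as pd

# Illustrative data objects with different attribute types.
df = pd.DataFrame({
    "hair_colour": ["black", "brown", "blond"],                  # nominal
    "seat_reserved": [1, 0, 1],                                  # binary (0 = no, 1 = yes)
    "grade": pd.Categorical(["C", "A", "B"],
                            categories=["C", "B", "A"],
                            ordered=True),                        # ordinal
    "temperature_c": [-5.0, 0.0, 25.5],                           # numeric, interval-scaled
    "years_experience": [2, 10, 4],                               # numeric, ratio-scaled (discrete)
})
print(df.dtypes)
```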

Data Preprocessing

-Data preprocessing is the process of transforming raw data into an understandable format.
-It is an important step in data mining, as we cannot work with raw data; the quality of the data should be checked before applying machine learning or data mining algorithms.
-If much irrelevant and redundant information is present, or the data is noisy and unreliable, then knowledge discovery during the training phase becomes more difficult.
-Data preparation and filtering steps can take a considerable amount of processing time. Data preprocessing includes cleaning, instance selection, normalization, transformation, feature extraction and selection, etc.

Why Preprocess The Data ?

-Inaccurate data gives wrong information.
-Inaccurate data may be generated while storing or handling data, whether by humans or machines.
-If data is incomplete or inconsistent, it is useless for proper decision making.
-If data is not stored in the database regularly, the data becomes incomplete.
-For business analysis and decision making, data should be complete, accurate, timely, and trusted.
-So we need to preprocess the data before using it for analysis.

Why is Data Preprocessing Important?


Preprocessing of data is done mainly to check the data quality. The quality can be checked in terms of the following:
● Accuracy: To check whether the data entered is correct or not.
● Completeness: To check whether the data is available or not recorded.
● Consistency: To check whether the same data is kept consistently in all the places where it appears.
● Timeliness: The data should be updated correctly.
● Believability: The data should be trustable.
● Interpretability: The understandability of the data.

Major Tasks in Data pre-processing

There are 5 major tasks in data preprocessing:

1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
5. Data discretization

1. Data cleaning

Data cleaning operations consist of the following tasks: filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
-Skip (ignore) the tuple that does not contain the required data.
-We can fill in the missing values manually.
-Use one constant value to fill in the missing values.
-We can fill in missing values with the mean calculated over the available values.
-By using outlier analysis, values falling outside the boundaries can be replaced with boundary values.

2. Data integration
-In the data integration phase we combine data from multiple sources.
-We merge data from different sources into a single coherent store.
-If data is redundant, i.e., copies of the same data are available in multiple sources, we remove such duplicates.

3. Data reduction
-We can reduce data in three ways:
1. Dimensionality reduction
2. Numerosity reduction
3. Data compression
-Using dimensionality reduction, we can easily remove identical or redundant attributes from the data.
-Using the numerosity reduction technique, we replace a large data volume with a much smaller representation of the data.
-Data compression techniques are used to store a large amount of data in a smaller amount of memory.
-By using sampling and aggregation methods, we can reduce a large amount of data to a small representative subset.

4. Data transformation
-In data transformation we represent our data in multiple forms.
-We can represent data in different charts so that users can easily understand it.
-We can group related data into clusters to make it easier to analyse.
-We can normalize the data so that values fall within a common range.

5. Data discretization
-Data discretization is part of data reduction; it replaces numerical attributes with nominal ones.
-It involves dividing continuous data into discrete categories or intervals. Discretization is often used in data mining and machine learning algorithms that require categorical data. Discretization can be achieved through techniques such as equal-width binning, equal-frequency binning, and clustering.

Data Cleaning

Data cleaning (also called data scrubbing or data cleansing) is the process of identifying and correcting or removing inaccurate, incomplete, irrelevant, or inconsistent data in a dataset. Data cleaning is a critical step in data mining, as it ensures that the data is accurate, complete, and consistent, improving the quality of the analysis and of the insights obtained from the data.

Why is data dirty?

1. Absence of data
2. Reused primary keys
3. Dummy values
4. Wrong entry of data

The data cleaning and staging process typically involves the following steps:
1. Parsing
Parsing is a process in which individual data elements are located and identified in the source systems and then isolated in the target files. For example, parsing a name into first name, middle name, and last name, or parsing an address into street name, city, state, and country.

2 Correcting
This is the next phase after parsing, in which individual data elements are corrected
using data algorithms and secondary data sources. For example, in the address
attribute replacing a vanity address and adding a zip code.

3. Standardizing
In standardizing process conversion routines are used to transform data into a
consistent format using both standard and custom business rules.
For example, addition of a prename, replacing a nickname and using a preferred street
name.

4. Matching
The matching process involves eliminating duplications by searching for and matching records within the parsed, corrected, and standardised data using standard business rules. For example, identification of similar names and addresses.

5. Consolidating
Consolidation involves merging the records into one representation by analysing and identifying relationships between matched records.

6. Data cleansing must deal with many types of possible errors

-Data can have many errors, such as missing or incorrect data, at a single source.
-When more than one source is involved, there is also the possibility of inconsistent and conflicting data.

7. Data staging
-Data staging is an interim step between data extraction and the remaining steps.
-Using different processes such as native interfaces, flat files, and FTP sessions, data is accumulated from asynchronous sources.
-After a certain predefined interval, data is loaded into the warehouse after the transformation process.
-No end-user access is available to the staging file.
-For data staging, an operational data store may be used.

Missing Values

-This involves searching for empty fields where values should occur.
-There are several techniques for dealing with missing data; choosing one of them depends on the problem domain and the goal of the data mining process.
-The following are different ways to handle missing values in databases (a short code sketch follows this list):

Handling Missing Values:

1. Ignore the tuple

This is usually done when many attributes are missing from the row (not just one). However, you will obviously get poor performance if the percentage of such rows is high.
For example, say we have a database of student enrollment data (age, SAT score, state of residence, etc.) and a column classifying each student's success in college as "Low", "Medium", or "High". If our goal is to build a model predicting a student's success in college, data rows that are missing the success column are not useful for predicting success, so they could very well be ignored and removed before running the algorithm.

2. Fill in the missing value manually

Missing values can be filled in manually, but this is not recommended when the dataset is big.

3. Use a global constant to fill in the missing value

A global constant value, such as "unknown", "N/A", or minus infinity, is used to fill in all the missing values.
For example, consider the student enrollment database again, and assume the state-of-residence attribute is missing for some students. Filling it in with some actual state does not really make sense, as opposed to using something like "N/A".

4. Use the attribute mean or median to fill in the missing value

For example, if the average income of a customer's family is X, you can use that value to replace missing income values in the customer sales database.

5. Use the attribute mean or median for all samples belonging to the same class

Say you have a car-pricing database that, among other things, classifies cars as "Luxury" or "Low budget", and you are dealing with missing values in the cost field. Replacing the missing cost of a luxury car with the average cost of all luxury cars is probably more accurate than the value you would get if you also factored in the low-budget cars.

6. Use a data mining algorithm to predict the most probable value

The value can be determined using regression, inference-based tools using Bayesian formalism, decision trees, clustering algorithms, etc.
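A brief pandas sketch of several of these strategies; the column names (income, state, segment) and values are assumed purely for illustration.

```python
import pandas as pd
import numpy as np

# Small assumed customer table with missing values.
df = pd.DataFrame({
    "income":  [50000, np.nan, 62000, np.nan, 48000],
    "state":   ["CA", None, "NY", "TX", None],
    "segment": ["luxury", "budget", "luxury", "budget", "luxury"],
})

# 1. Ignore (drop) tuples that contain missing values.
dropped = df.dropna()

# 3. Use a global constant for a categorical attribute.
df["state"] = df["state"].fillna("N/A")

# 4. Use the attribute mean for a numeric attribute.
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# 5. Use the mean of the same class (segment) instead of the overall mean.
df["income_class_filled"] = df["income"].fillna(
    df.groupby("segment")["income"].transform("mean"))

print(df)
```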

Noisy Data

Noisy data is meaningless data. The term is often used as a synonym for corrupt data. Noisy data can be caused by hardware failures and programming errors.

Handling Noisy Data:

1. Binning method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and each segment is handled separately. For example, all data in a segment can be replaced by its mean, or the boundary values of the segment can be used, as in the sketch below.
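A small sketch of equal-width binning with smoothing by bin means; the data values are assumed for illustration.

```python
import pandas as pd

# Sorted, noisy values (assumed example data).
values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Divide the data into 3 equal-width bins.
bins = pd.cut(values, bins=3)

# Smoothing by bin means: replace each value with the mean of its bin.
smoothed = values.groupby(bins).transform("mean")
print(pd.DataFrame({"original": values, "bin": bins, "smoothed": smoothed}))
```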

2. Regression:
Data smoothing can also be done by regression, a technique that conforms data values to a
function.
Linear regression involves finding the “best” line to fit two attributes (or variables) so that one
attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
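A short sketch of smoothing by linear regression, fitting one attribute so it can predict (and smooth) another; the attribute names and values are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed example: use 'years of experience' to smooth noisy 'salary' values.
experience = np.array([[1], [2], [3], [4], [5], [6]])
salary = np.array([30, 36, 33, 45, 48, 52])   # noisy observations (in thousands)

# Fit the "best" line and conform the data values to it.
model = LinearRegression().fit(experience, salary)
smoothed_salary = model.predict(experience)
print(smoothed_salary.round(1))
```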

3. Clustering and outlier analysis:

Clustering groups similar data into clusters. Outliers may be detected by clustering: where similar values are organized into groups, or "clusters", values that fall outside of the set of clusters may be considered outliers.
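A minimal sketch of clustering-based outlier detection. DBSCAN is used here as one possible choice of clustering algorithm because it labels points that do not belong to any cluster as noise; the data and the eps/min_samples settings are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Assumed 2-D data: two dense groups plus one point far from both.
X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1],
              [5, 5], [5.1, 4.9], [4.9, 5.2],
              [20, 20]])

# DBSCAN groups nearby points into clusters; points that fit no cluster get label -1.
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)
print("Outliers:", X[labels == -1])
```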

Data Integration
Data integration is the process of combining data from multiple sources into a single, coherent dataset. The data integration process is one of the main components of data management. Some problems have to be considered during data integration.

● Schema integration: integrating metadata (a set of data that describes other data) from different sources.
● Entity identification problem: identifying the same entities across multiple databases. For example, the system or the user should recognize that the student id in one database and the student name in another database belong to the same entity.
● Detecting and resolving data value conflicts: data taken from different databases may differ when merged; the attribute values from one database may differ from another database. For example, the date format may differ, such as "MM/DD/YYYY" versus "DD/MM/YYYY". A small integration sketch follows.
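A small sketch of integrating two sources that use different attribute names and date formats (the entity identification problem and a data value conflict); all table names, column names, and values are assumed.

```python
import pandas as pd

# Source 1 uses 'student_id' and MM/DD/YYYY dates.
enrolments = pd.DataFrame({
    "student_id": [101, 102],
    "enrol_date": ["08/15/2023", "09/01/2023"],
})

# Source 2 uses 'sid' and DD/MM/YYYY dates for the same entities.
results = pd.DataFrame({
    "sid": [101, 102],
    "exam_date": ["20/12/2023", "18/12/2023"],
    "grade": ["A", "B"],
})

# Resolve the value-format conflict (different date formats) ...
enrolments["enrol_date"] = pd.to_datetime(enrolments["enrol_date"], format="%m/%d/%Y")
results["exam_date"] = pd.to_datetime(results["exam_date"], format="%d/%m/%Y")

# ... and the entity identification problem (student_id == sid).
merged = enrolments.merge(results, left_on="student_id", right_on="sid")
print(merged)
```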

Data Reduction

Data reduction is a process that reduces the volume of the original data and represents it in a much smaller form, while maintaining the integrity of the data. This reduction also helps to reduce storage space.
The following are common techniques or methods of data reduction in data mining:
1. Data cube aggregation
2. Dimensionality reduction
3. Data compression
4. Numerosity reduction

1. Data Cube Aggregation

This technique is used to aggregate data in a simpler form. Data Cube Aggregation is a
multidimensional aggregation that uses aggregation at various levels of a data cube to
represent the original data set, thus achieving data reduction.
For example, suppose you have data on All Electronics sales per quarter for the years 2018 to 2022. If you want the annual sales per year, you just have to aggregate the sales per quarter for each year. In this way, aggregation provides the required data, which is much smaller in size, and thereby we achieve data reduction without losing any information needed for the analysis, as in the sketch below.
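A small sketch of this kind of aggregation with pandas, rolling assumed quarterly sales up to annual totals; the figures are illustrative only.

```python
import pandas as pd

# Assumed quarterly sales data (one slice of a data cube).
sales = pd.DataFrame({
    "year":    [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [200, 250, 220, 310, 230, 260, 240, 330],
})

# Aggregate quarters up to years: the result is much smaller, yet it keeps
# everything needed for an annual-sales analysis.
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)
```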

2. Dimensionality Reduction

Dimensionality reduction eliminates the attributes from the data set under consideration,
thereby reducing the volume of original data. It reduces data size as it eliminates
outdated or redundant features. Here are three methods of dimensionality reduction.
1. Wavelet Transform: In the wavelet transform, a data vector A is transformed into a numerically different data vector A' such that both A and A' are of the same length. This is useful for reducing data because the wavelet-transformed data can be truncated: the compressed data is obtained by retaining only a small fraction of the strongest wavelet coefficients. Wavelet transforms can be applied to data cubes, sparse data, or skewed data.
2. Principal Component Analysis: Suppose we have a data set to be analyzed that has tuples with n attributes. Principal component analysis identifies k orthogonal vectors (principal components, with k ≤ n) that can best be used to represent the data. In this way, the original data can be cast onto a much smaller space, and dimensionality reduction is achieved. Principal component analysis can be applied to sparse and skewed data. A short PCA sketch follows this list.
3. Attribute Subset Selection: A large data set has many attributes, some of which are irrelevant to data mining and some of which are redundant. Attribute subset selection reduces the data volume and dimensionality by eliminating such redundant and irrelevant attributes, while ensuring that we still get a good subset of the original attributes: the resulting probability distribution of the data is as close as possible to the original distribution obtained using all the attributes.
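A brief sketch of principal component analysis with scikit-learn, projecting an assumed data set of 4-attribute tuples onto k = 2 components; the numbers are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed data set: 6 tuples with n = 4 attributes.
X = np.array([
    [2.5, 2.4, 0.5, 1.0],
    [0.5, 0.7, 1.9, 2.1],
    [2.2, 2.9, 0.4, 0.9],
    [1.9, 2.2, 0.6, 1.1],
    [3.1, 3.0, 0.3, 0.8],
    [0.4, 0.6, 2.0, 2.2],
])

# Keep k = 2 principal components: the data is cast onto a smaller space.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                   # (6, 2)
print(pca.explained_variance_ratio_)     # how much information is retained
```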

3. Data Compression

Data compression in data mining, as the name suggests, simply compresses the data. Data compression employs modification, encoding, or conversion of the structure of the data in a way that consumes less space. Dimensionality reduction and numerosity reduction methods can also be considered forms of data compression.

This technique reduces the size of files using different encoding mechanisms, such as Huffman encoding and run-length encoding. We can divide it into two types based on the compression technique:
1. Lossless compression: data that can be restored exactly from its compressed form is said to be losslessly compressed.
2. Lossy compression: in contrast, when it is not possible to restore the original form exactly from the compressed form, the compression is lossy.
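As a tiny illustration of lossless compression, here is a run-length encoding sketch (the original string is restored exactly from the compressed form); the input text is assumed.

```python
from itertools import groupby

def run_length_encode(text: str) -> list[tuple[str, int]]:
    """Lossless compression: store each character together with its run length."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def run_length_decode(pairs: list[tuple[str, int]]) -> str:
    """The original data can be restored exactly from the compressed form."""
    return "".join(ch * count for ch, count in pairs)

encoded = run_length_encode("AAAABBBCCD")
print(encoded)                      # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print(run_length_decode(encoded))   # 'AAAABBBCCD'
```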

4. Numerosity Reduction

Numerosity reduction reduces the original data volume and represents the data in a much smaller form. It is a data reduction technique in data mining and data analysis whose main aim is to decrease the amount of data in a dataset while keeping the most important facts and patterns. Its goal is to make complicated and huge datasets simpler and more manageable, which allows more effective analysis and requires less computing power.
There are two types of this technique: parametric and non-parametric numerosity reduction.
1. Parametric reduction
The parametric numerosity reduction technique assumes that the data fits into a model, so only the model parameters need to be stored.
2. Non-parametric reduction
The non-parametric methods do not assume that the data fits a model.
The types of non-parametric data reduction methodology are:
- Histogram
- Clustering
- Sampling
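A short sketch of non-parametric numerosity reduction by simple random sampling; the data size, distribution, and sampling fraction are assumptions chosen only to show that key statistics are roughly preserved.

```python
import numpy as np
import pandas as pd

# Assumed large data set of 100,000 transaction amounts.
rng = np.random.default_rng(0)
big = pd.DataFrame({"amount": rng.exponential(scale=50, size=100_000)})

# Keep a 1% random sample: far fewer tuples, but similar summary statistics.
sample = big.sample(frac=0.01, random_state=0)
print(len(sample),
      round(big["amount"].mean(), 2),
      round(sample["amount"].mean(), 2))
```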

4. Data Transformation and Data Discretization

The change made to the format or the structure of the data is called data transformation. This step can be simple or complex based on the requirements. There are several methods of data transformation.
● Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in identifying the important features of the dataset. By smoothing, we can detect even a small change that helps in prediction.
● Aggregation: In this method, the data is stored and presented in the form of a summary. The data set, which may come from multiple sources, is integrated and described for data analysis. This is an important step, since the accuracy of the results depends on the quantity and quality of the data: when the quality and the quantity of the data are good, the results are more relevant.
● Normalization: This is the method of scaling the data so that it can be represented in a smaller range, for example from -1.0 to 1.0. A short normalization sketch is given below.
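A small sketch of min-max normalization, rescaling an assumed income column into the range 0.0 to 1.0 (the same idea applies to a -1.0 to 1.0 range); the values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"income": [12000, 35000, 58000, 73600, 98000]})

# Min-max normalization: (x - min) / (max - min) maps values into [0.0, 1.0].
df["income_norm"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min())
print(df)
```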

Data Discretization:

The continuous data here is split into intervals. Discretization reduces the data size. For example, rather than specifying an exact class time, we can state an interval such as (3 pm-5 pm) or (6 pm-8 pm).
We can understand this concept with the help of an example. Suppose we have an attribute Age with the given values:

Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

After discretization, the Age values are grouped as follows:
Child: 1, 5, 4, 9, 7
Young: 11, 14, 17, 13, 18, 19
Mature: 31, 33, 36, 42, 44, 46
Old: 70, 74, 77, 78
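The same discretization can be expressed with pandas; the bin edges below (10, 30, 60, 120) are assumed cut-off ages chosen to reproduce the grouping above.

```python
import pandas as pd

age = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
                 31, 33, 36, 42, 44, 46, 70, 74, 78, 77])

# Replace the numeric attribute with a nominal one via interval-based discretization.
labels = pd.cut(age, bins=[0, 10, 30, 60, 120],
                labels=["Child", "Young", "Mature", "Old"])
print(labels.value_counts())
```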
