DWM 4
Data Mining
The process of extracting information from huge sets of data in order to identify patterns, trends, and useful knowledge that allows a business to take data-driven decisions is called Data Mining.
Data mining is one of the most useful techniques that helps entrepreneurs, researchers, and individuals extract valuable information from huge sets of data. Data mining is also called Knowledge Discovery in Databases (KDD).
Data Mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is the procedure of mining knowledge from data. The information or knowledge so extracted can be used for any of the following applications:
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
-The process of discovering knowledge in data through the application of data mining methods is referred to as Knowledge Discovery in Databases (KDD).
-It spans a wide variety of application domains, including Artificial Intelligence, Pattern Recognition, Machine Learning, Statistics, and Data Visualization.
-The main goal is extracting knowledge from large databases; this goal is achieved by using various data mining algorithms to identify useful patterns according to some predefined measures and thresholds.
The KDD process in data mining is a multi-step process that involves various stages to extract useful knowledge from large datasets. The following are the main steps involved in the KDD process -
Data Selection - The first step in the KDD process is identifying and selecting the
relevant data for analysis. This involves choosing the relevant data sources, such as
databases, data warehouses, and data streams, and determining which data is required
for the analysis.
Data Preprocessing - After selecting the data, the next step is data preprocessing. This step involves cleaning the data, removing outliers, and handling missing, inconsistent, or irrelevant data.
Data Transformation - Once the data is preprocessed, the next step is to transform it
into a format that data mining techniques can analyze.
Data Mining - This is the heart of the KDD process and involves applying various data
mining techniques to the transformed data to discover hidden patterns, trends,
relationships, and insights. A few of the most common data mining techniques include
clustering, classification, association rule mining, and anomaly detection.
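As an illustration of the data mining step, the sketch below clusters a small, made-up customer table with k-means using scikit-learn; the columns and values are hypothetical, not from these notes.

```python
# Hypothetical sketch of the data mining step: segmenting customers with k-means.
import numpy as np
from sklearn.cluster import KMeans

# Toy transformed data: each row = (annual income in thousands, purchases per year)
X = np.array([
    [15.0, 2], [16.5, 3],
    [40.0, 12], [42.0, 10],
    [80.0, 30], [85.0, 28],
])

# Discover 3 hidden groups (customer segments) in the data
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(model.labels_)           # cluster id assigned to each customer
print(model.cluster_centers_)  # centre of each discovered segment
```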
Pattern Evaluation - After the data mining, the next step is to evaluate the discovered
patterns to determine their usefulness and relevance.
Deployment - The final step in the KDD process is to deploy the knowledge and insights gained from the data mining process to practical applications.
Advantages of KDD
● Helps in Decision Making - KDD can help make informed and data-driven
decisions by discovering hidden patterns, trends, and relationships in data that
might not be immediately apparent.
● Improves Business Performance - KDD can help organizations improve their
business performance by identifying areas for improvement, optimizing
processes, and reducing costs.
● Saves Time and Resources - KDD can help save time and resources by
automating the data analysis process and identifying the most relevant and
significant information or knowledge.
● Increases Efficiency - KDD can help organizations streamline their processes,
optimize their resources, and increase their overall efficiency.
● Fraud Detection - KDD can help detect fraud and identify fraudulent behavior by
analyzing patterns in data and identifying anomalies or unusual behavior.
The difference between KDD and data mining is explained in the table below.
Architecture of a Data Mining System
1. Database, data warehouse, or other information repository
These are information repositories. Data cleaning and data integration techniques may be performed on the data.
2. Database or data warehouse server
It fetches the relevant data, as per the user's requirement, that is needed for the data mining task.
3. Knowledge base
This is used to guide the search, and it helps to find interesting and hidden patterns in the data.
Pattern evaluation module
It is integrated with the mining module, and it helps in searching for only the interesting patterns.
User interface
This module is used to communicate between the user and the data mining system, and it allows users to browse databases or data warehouse schemas.
-We can use any kind of data source for data mining.
-In the current era, data is stored in multiple forms such as tables, lists, numbers, text, graphs, pages, etc.
-We can mine data from the following different data sources:
1.Database data
3 Transactional data
4. Data streams
7. Spatial data
8. Text data
9. Multimedia data
11.Flat Files
1. Database data
-It contains structured data organized into tables of rows and columns.
-It is relational data, in which the tables are interrelated with one another.
-Rows contain values (records) and columns represent attributes.
3. Transactional Data
-A transaction represents a single unit of operation.
-It contains customer purchase records, flight or train bookings, users' clicks on websites, etc.
-Each transaction has a specific transaction id and its related values.
4. Data Streams
-It is a sequence of data transmitted continuously from a provider.
-It travels in packets from sender to receiver.
7.Spatial data
-It contains geo-spatial data.
-It stores the geographic coordinates of a physical object as numeric values.
-It contains location, shape, and route data.
8.Text data
-It is raw data in textual form.
-It may be created by database operations or applications.
-It often contains metadata, i.e., data about data, such as the date, time, day, and year of an operation.
9.Multimedia data
-It contains collections of audio, video, and graphics data.
11.Flat Files
-Flat files are defined as data files in text or binary form with a structure that can be easily extracted by data mining algorithms.
-Data stored in flat files has no relationships or paths among the records; for example, if a relational database is stored in flat files, there will be no relations between the tables.
-Flat files are described by a data dictionary. E.g., CSV files.
Data mining is not an easy task, as the algorithms used can get very complex, and the data is not always available in one place; it needs to be integrated from various heterogeneous data sources. These factors also create some issues. The following are the major issues.
Data mining query languages and ad hoc data mining − A data mining query language should give the user access to describe ad hoc mining tasks, and it needs to be integrated with a data warehouse query language.
Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
Handling noisy or incomplete data − Data cleaning methods are required to handle the noise and incomplete objects while mining the data regularities. If data cleaning methods are not used, the accuracy of the discovered patterns will be poor.
Performance Issues
Data Object
-A data object represents an entity in a dataset; each object is described by a set of attributes. E.g., in a sales database, the objects may be customers, store items, or sales transactions.
Data Attribute
-Data attributes refer to the specific characteristics or properties that describe individual
data objects within a dataset.
-That means it describes the features of data objects.
-For example, in a bank account database the attributes are account_number, customer_number, branch_id, etc.
We need to differentiate between different types of attributes during data preprocessing. So firstly, we need to differentiate between qualitative and quantitative attributes.
1. Qualitative Attributes, such as Nominal, Ordinal, and Binary attributes.
2. Quantitative Attributes, such as Numeric, Discrete, and Continuous attributes.
-Data attributes have the following types:
1.Nominal Attributes
2.Binary Attributes
a.Symmetric Attribute
b.Asymmetric Attribute
3.Ordinal Attributes
4.Numeric
a.Interval-scaled
b.Ratio-scaled
5.Discrete Attribute
6.Continuous Attribute
1.Nominal Attributes
-A nominal attribute relates names to attribute values; the values are symbols or names of things.
-Values are in alphabetical (categorical) form and not integers. Nominal attributes are qualitative attributes.
Examples of nominal attributes: hair_colour = {black, brown, grey}, marital_status, occupation.
2.Binary Attributes
-A binary attribute is a nominal attribute with only two states or values: 0 or 1.
e.g., In a train booking database, if a seat is reserved then the booking status records 1, otherwise 0.
Symmetric binary: in a symmetric attribute, both values or states are considered equally important or interchangeable.
Asymmetric binary: an asymmetric attribute indicates that the two values or states are not equally important or interchangeable.
3.Ordinal Attributes
-All values have a meaningful order or ranking among them.
-A single attribute can take multiple ordered values, e.g., size = {small, medium, large}.
4. Numeric Attribute
Interval-scaled attributes are measured on a scale of equal-sized units; values can be positive, zero, or negative. E.g., a temperature database contains temperature values.
Ratio-scaled attributes are numeric attributes with an inherent zero point, so one value can be described as a multiple (ratio) of another. E.g., years of experience of an employee.
5. Discrete Attribute
-It has a finite or countably infinite set of values.
E.g., zip codes, profession, or the set of words in a collection of documents. Sometimes it is represented using integer variables.
6. Continuous Attribute
-It has real-number values, e.g., height, weight, or temperature, and is typically represented using floating-point variables.
Data Preprocessing
-Inaccurate data is generated during data storage or handling processes, whether by humans or machines.
-If data is incomplete or inconsistent, it is useless for proper decision making.
-If data is not stored in the database on time, the data becomes incomplete.
-For business analysis or business decision making, data should be complete, accurate, timely, and trusted.
-So we need to preprocess the data before using it for analysis. The main preprocessing steps are:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
5.Data Discretization
1. Data cleaning
Data cleaning operations include the following tasks: filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Missing values can be handled in the following ways:
-Skip (ignore) the tuple that is missing the data.
-Fill in the missing values manually.
-Use a constant value to fill in the missing value.
-Calculate the attribute mean and use it in place of the missing values.
-Use outlier analysis to replace values that fall outside the expected range with boundary values.
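A minimal sketch of these missing-value strategies using pandas is given below; the DataFrame, column names, and values are made up for illustration.

```python
# Hedged sketch: handling missing values with pandas (illustrative data).
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "income":      [45000, np.nan, 52000, np.nan, 61000],
    "city":        ["Pune", "Mumbai", None, "Pune", "Nagpur"],
})

dropped = df.dropna()                                     # skip tuples with missing data
df["city"] = df["city"].fillna("Unknown")                 # fill with a constant value
df["income"] = df["income"].fillna(df["income"].mean())   # fill with the attribute mean
print(df)
```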
2.Data integration
-In the data integration phase, we combine data from multiple sources.
-We can merge data from different sources.
-If data is redundant, i.e., a copy of the same data is available in multiple sources, such duplicate data is removed.
3. Data reduction
-We can reduce data in three ways:
1. Dimensionality reduction
2. Numerosity reduction
3. Data compression
-In the dimensionality reduction process, the data is divided into a number of pieces, and identical or redundant data in those pieces can easily be removed.
-By using the numerosity reduction technique, a large volume of data can be represented by a much smaller data set.
-The data compression technique is used to store a large amount of data in a small amount of memory.
-By using sampling and aggregation methods, a large amount of data can be reduced to a smaller representation.
4. Data transformation
-In data transformation, we can represent our data in multiple forms.
-We can represent our data in different charts so that users can easily understand it.
-We can group related data into clusters to improve readability.
-We can normalize the data into different (smaller) ranges.
5.Data Discretization
-It is a part of data reduction that replaces numerical attributes with nominal ones.
-This involves dividing continuous data into discrete categories or intervals.
-Discretization is often used in data mining and machine learning algorithms that require categorical data. Discretization can be achieved through techniques such as equal-width binning, equal-frequency binning, and clustering.
Data Cleaning
Data cleaning or data scrubbing, also known as data cleansing, is the process of identifying and correcting or removing inaccurate, incomplete, irrelevant, or inconsistent
data in a dataset. Data cleaning is a critical step in data mining as it ensures that the
data is accurate, complete, and consistent, improving the quality of analysis and
insights obtained from the data.
1. Parsing
Parsing locates and identifies individual data elements in the source data and isolates them so that they can be processed further.
2. Correcting
This is the next phase after parsing, in which individual data elements are corrected using data algorithms and secondary data sources. For example, in the address attribute, replacing a vanity address and adding a zip code.
3. Standardizing
In the standardizing process, conversion routines are used to transform data into a consistent format using both standard and custom business rules.
For example, addition of a prename, replacing a nickname and using a preferred street
name.
4.Matching
The matching process involves eliminating duplication by searching for and matching records within the parsed, corrected, and standardised data using standard business rules. For example, identification of similar names and addresses.
5.Consolidating
Consolidation involves merging matched records into one representation by analysing and identifying the relationships between the matched records.
7. Data staging
-Data staging is an interim step between data extraction and the remaining steps.
-Using different processes like native interfaces, flat files, and FTP sessions, data is accumulated from asynchronous sources.
-After a certain predefined interval, data is loaded into the warehouse after the transformation process.
-No end-user access is available to the staging file.
-For data staging, an operational data store may be used.
Missing Values-
-This involves searching for empty fields where values should occur.
-There are several techniques for dealing with missing data; choosing one of them depends on the problem domain and the goal of the data mining process.
-The following are different ways to handle missing values in databases:
1. Ignore the data row
This is usually done when many attributes are missing from the row (not just one). However, you'll obviously get poor performance if the percentage of such rows is high.
For example, let’s say we have a database of students enrolment data (age, SAT score, state of
residence, etc.) and a column classifying their success in college to “Low”, “Medium” and “High”.
Let's say our goal is to build a model predicting a student's success in college. Data rows that are missing the success column are not useful in predicting success, so they could very well be ignored and removed before running the algorithm.
4. Use the attribute mean
Let's say the average income of a family is X; you can use that value to replace missing income values in the customer sales database.
5. Use the attribute mean or median for all samples belonging to the same class
Let's say you have a car pricing DB that, among other things, classifies cars into "Luxury" and "Low budget" and you're dealing with missing values in the cost field. Replacing the missing cost of a luxury car with the average cost of all luxury cars is probably more accurate than the value you'd get if you factored in the low budget cars as well.
6. Use the most probable value to fill in the missing value
The value can be determined using regression, inference-based tools using a Bayesian formalism, decision trees, clustering algorithms, etc.
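As a small, hedged sketch of strategy 5, the snippet below fills a missing car cost with the mean cost of cars in the same class; the DataFrame and its values are made up.

```python
# Illustrative sketch: class-wise mean imputation with pandas.
import pandas as pd
import numpy as np

cars = pd.DataFrame({
    "segment": ["Luxury", "Luxury", "Low budget", "Low budget", "Luxury"],
    "cost":    [80000, np.nan, 12000, 13000, np.nan],
})

# Replace each missing cost with the mean cost of that car's segment
cars["cost"] = cars.groupby("segment")["cost"].transform(lambda s: s.fillna(s.mean()))
print(cars)
```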
Noisy Data
Noisy data is meaningless data. The term has often been used as a synonym for corrupt data. Noisy data can be caused by hardware failures, programming errors, etc. The following methods are used to smooth noisy data:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and then various methods are performed to complete the task. Each segment is handled separately. One can replace all data in a segment by its mean, or boundary values can be used to complete the task.
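A small sketch of the binning method is given below: values are sorted, split into equal-size bins, and smoothed by bin means. The numbers are illustrative only.

```python
# Illustrative sketch: smoothing by bin means with equal-size bins.
import numpy as np

data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))
bins = np.array_split(data, 3)                        # three equal-size segments

smoothed = [np.full(len(b), b.mean()) for b in bins]  # replace each value by its bin mean
print(np.concatenate(smoothed))
# -> [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```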
2. Regression:
Data smoothing can also be done by regression, a technique that conforms data values to a
function.
Linear regression involves finding the “best” line to fit two attributes (or variables) so that one
attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
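The sketch below illustrates smoothing by simple linear regression: a best-fit line is computed for two attributes and the predicted values are used as the smoothed ones. The data points are made up.

```python
# Illustrative sketch: data smoothing with a fitted regression line.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor attribute
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # noisy attribute to be smoothed

w, b = np.polyfit(x, y, deg=1)             # slope and intercept of the best-fit line
y_smooth = w * x + b                       # predicted (smoothed) values
print(round(w, 3), round(b, 3))
print(y_smooth)
```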
Data Integration
Data integration is the process of combining data from multiple sources into a single dataset. The data integration process is one of the main components of data management. There are some problems to be considered during data integration.
● Schema integration: Integrates metadata(a set of data that describes other data)
from different sources.
● Entity identification problem: Identifying entities from multiple databases. For example, the system or the user should know that the student_id of one database and the student_name of another database belong to the same entity.
● Detecting and resolving data value conflicts: The data taken from different databases may differ while merging. The attribute values from one database may differ from another database. For example, the date format may differ, like “MM/DD/YYYY” or “DD/MM/YYYY” (a small sketch follows this list).
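The sketch below illustrates these two problems on toy data: two sources key the same student entity with different column names and use different date formats, which are resolved before merging. All table names, columns, and values are hypothetical.

```python
# Illustrative sketch: entity identification and date-format resolution in pandas.
import pandas as pd

source_a = pd.DataFrame({
    "student_id": [101, 102],
    "enrolled":   ["03/15/2023", "07/01/2023"],   # MM/DD/YYYY
})
source_b = pd.DataFrame({
    "stud_no": [101, 102],
    "name":    ["Asha", "Ravi"],
    "joined":  ["15/03/2023", "01/07/2023"],      # DD/MM/YYYY
})

# Resolve the date-format conflict before merging
source_a["enrolled"] = pd.to_datetime(source_a["enrolled"], format="%m/%d/%Y")
source_b["joined"] = pd.to_datetime(source_b["joined"], format="%d/%m/%Y")

# Entity identification: student_id and stud_no refer to the same entity
merged = source_a.merge(source_b, left_on="student_id", right_on="stud_no")
print(merged)
```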
Data Reduction-
Data reduction is a mechanism that reduces the volume of the original data and represents it in a much smaller volume, while ensuring the integrity of the data.
This reduction also helps to reduce storage space.
The following are the main techniques or methods of data reduction in data mining:
1.Data Cube Aggregation
2.Dimensionality Reduction
3.Data Compression
4.Numerosity Reduction
1. Data Cube Aggregation
This technique is used to aggregate data in a simpler form. Data cube aggregation is a multidimensional aggregation that uses aggregation at various levels of a data cube to represent the original data set, thus achieving data reduction.
For example, suppose you have the data of All Electronics sales per quarter for the year
2018 to the year 2022. If you want to get the annual sale per year, you just have to
aggregate the sales per quarter for each year. In this way, aggregation provides you
with the required data, which is much smaller in size, and thereby we achieve data
reduction even without losing any data.
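A minimal sketch of this roll-up with pandas is given below; the sales figures are invented for illustration.

```python
# Illustrative sketch: rolling up quarterly sales to annual sales (data cube aggregation).
import pandas as pd

sales = pd.DataFrame({
    "year":    [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [200, 250, 300, 280, 220, 260, 310, 290],
})

# One row per year instead of four -> a much smaller representation
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)
```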
2.Dimensionality Reduction
Dimensionality reduction eliminates the attributes from the data set under consideration,
thereby reducing the volume of original data. It reduces data size as it eliminates
outdated or redundant features. Here are three methods of dimensionality reduction.
1. Wavelet Transform: In the wavelet transform, a data vector A is transformed into a numerically different data vector A' such that both A and A' are of the same length. It is useful in reducing data because the data obtained from the wavelet transform can be truncated: the compressed data is obtained by retaining only a small fraction of the strongest wavelet coefficients. The wavelet transform can be applied to data cubes, sparse data, or skewed data.
2. Principal Component Analysis: Suppose we have a data set to be analyzed that has tuples with n attributes. Principal component analysis searches for k orthogonal vectors (principal components, with k ≤ n) that can best be used to represent the data set.
In this way, the original data can be cast onto a much smaller space, and dimensionality reduction is achieved. Principal component analysis can be applied to sparse and skewed data. A small sketch is given after this list.
3. Attribute Subset Selection: A large data set has many attributes, some of which are irrelevant to data mining and some of which are redundant. Attribute subset selection reduces the data volume and dimensionality by eliminating such redundant and irrelevant attributes.
Attribute subset selection ensures that we still get a good subset of the original attributes after eliminating the unwanted ones: the resulting probability distribution of the data is as close as possible to the original distribution obtained using all the attributes.
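A minimal PCA sketch with scikit-learn is shown below, projecting tuples with n = 3 attributes onto k = 2 principal components; the data matrix is made up.

```python
# Illustrative sketch: dimensionality reduction with PCA.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 1.1],
    [1.9, 2.2, 0.9],
    [3.1, 3.0, 1.5],
])                                # 5 tuples, n = 3 attributes

pca = PCA(n_components=2)         # keep k = 2 principal components
X_reduced = pca.fit_transform(X)  # 5 tuples, now only 2 attributes
print(X_reduced.shape)            # (5, 2)
print(pca.explained_variance_ratio_)
```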
3. Data Compression
Data compression in data mining, as the name suggests, simply compresses the data.
This technique reduces the size of the files using different encoding mechanisms, such
as Huffman Encoding and run-length Encoding. We can divide it into two types based
on their compression techniques.
1. Lossless Compression: Data that can be restored exactly from its compressed form is said to use lossless compression.
2. Lossy Compression: In contrast, when it is not possible to restore the original data exactly from the compressed form, the compression is lossy.
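As a toy illustration of a lossless scheme, the run-length encoder below compresses repeated characters and restores the original string exactly; it is only a sketch, not how any particular library implements compression.

```python
# Illustrative sketch: run-length encoding (a simple lossless compression scheme).
def rle_encode(s: str) -> list[tuple[str, int]]:
    runs: list[tuple[str, int]] = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((ch, 1))               # start a new run
    return runs

def rle_decode(runs: list[tuple[str, int]]) -> str:
    return "".join(ch * count for ch, count in runs)

encoded = rle_encode("AAAABBBCCD")
print(encoded)                                 # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
assert rle_decode(encoded) == "AAAABBBCCD"     # lossless: the original is restored exactly
```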
4.Numerosity Reduction
The numerosity reduction technique reduces the original data volume and represents it in a much smaller form. Its main aim is to decrease the amount of data in a dataset while keeping the most important facts and patterns, which makes complicated and huge datasets simpler to manage, allows more effective analysis, and requires less computing power.
There are two types of this technique: parametric and non-parametric numerosity reduction.
1. Parametric Reduction
The parametric numerosity reduction technique assumes that the data fits a model; the model parameters are stored instead of the actual data.
2. Non-parametric Reduction
On the other hand, non-parametric methods do not assume that the data fits a model.
The types of Non-Parametric data reduction methodology are:
Histogram
Clustering
Sampling
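A small sketch of sampling, one of the non-parametric methods above, is given below: a large synthetic data set is represented by a much smaller simple random sample drawn without replacement.

```python
# Illustrative sketch: simple random sampling as numerosity reduction.
import numpy as np

rng = np.random.default_rng(seed=42)
data = rng.normal(loc=50, scale=10, size=100_000)     # large "original" data

sample = rng.choice(data, size=1_000, replace=False)  # much smaller representation
print(data.mean(), sample.mean())                     # the sample preserves the key statistic
```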
Data Transformation
The change made in the format or structure of the data is called data transformation. This step can be simple or complex based on the requirements. The following are some methods of data transformation.
● Smoothing: With the help of algorithms, we can remove noise from the dataset,
which helps in knowing the important features of the dataset. By smoothing, we
can find even a simple change that helps in prediction.
● Aggregation: In this method, the data is stored and presented in the form of a summary. Data collected from multiple sources is integrated and described as part of the data analysis. This is an important step since the accuracy of the results depends on the quantity and quality of the data: when the quality and quantity of the data are good, the results are more relevant.
● Normalization: It is the method of scaling the data so that it can be represented in a smaller range, for example from -1.0 to 1.0 (see the sketch at the end of this topic).
● Discretization: The continuous data here is split into intervals. Discretization reduces the data size. For example, rather than specifying the exact class time, we can use an interval like (3 pm-5 pm) or (6 pm-8 pm).
Now, we can understand this concept with the help of an example.
Suppose we have an attribute Age with the given values:
Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77
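A minimal sketch using these Age values is given below: min-max normalization scales them into [0, 1], and equal-width discretization splits them into three intervals; the interval labels are hypothetical.

```python
# Illustrative sketch: normalization and equal-width discretization of the Age values.
import pandas as pd

age = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18,
                 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77])

# Normalization: min-max scaling into the range [0, 1]
age_norm = (age - age.min()) / (age.max() - age.min())
print(age_norm.round(2).tolist())

# Discretization: three equal-width intervals with hypothetical category labels
age_bins = pd.cut(age, bins=3, labels=["young", "middle-aged", "senior"])
print(age_bins.value_counts())
```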