
Classification and Prediction in Data Mining

There are two forms of data analysis that can be used to extract models describing important
classes or predict future data trends. These two forms are as follows:
1. Classification
2. Prediction

What is Classification?
Classification is the task of identifying the category, or class label, of a new observation. First, a set of data is used as
training data: the input data and the corresponding class labels are given to the algorithm. So, the training data
set includes the input data and their associated class labels. Using the training dataset, the algorithm derives a
model, or classifier. The derived model can be a decision tree, a mathematical formula, or a neural network. In
classification, when unlabeled data is given to the model, it should find the class to which it belongs. The new
data provided to the model is called the test data set.

A simple example of classification is checking whether it is raining or not: the answer can be either
yes or no, so there is a fixed number of choices. Sometimes there are more than two classes to
choose from; that is called multiclass classification.

How does Classification Work?


The functioning of classification is often illustrated with the example of a bank loan application, where
the model must decide whether an applicant is likely to repay. There are two stages in the data
classification system: creating the classifier (model creation) and applying the classifier for classification.
1. Developing the Classifier or model creation: This stage is the learning stage, or the learning process.
The classification algorithm constructs the classifier in this stage. A classifier is constructed from a
training set composed of database records and their corresponding class labels. Each record in the
training set may also be referred to as a sample, an object, or a data point.
2. Applying the classifier for classification: The classifier is used for classification at this stage. The test data
are used here to estimate the accuracy of the classification algorithm. If the accuracy is deemed
sufficient, the classification rules can be applied to new data records. Applications include:
a. Sentiment Analysis: Sentiment analysis is highly helpful in social media monitoring; we
can use it to extract insights from social media. With advanced machine learning
algorithms, we can build sentiment analysis models that read and analyze even misspelled
words. Well-trained models provide consistently accurate results in a fraction of the time.
b. Document Classification: We can use document classification to organize the documents
into sections according to the content. Document classification refers to text classification;
we can classify the words in the entire document. And with the help of machine learning
classification algorithms, we can execute it automatically.
c. Image Classification: Image classification assigns an image to one of a set of trained
categories, such as a caption, a statistical value, or a theme. You can tag images to
train your model for the relevant categories by applying supervised learning algorithms.
d. Machine Learning Classification: It uses statistically demonstrable algorithmic rules to
execute analytical tasks that would take humans hundreds of hours to perform.
3. Data Classification Process: The data classification process can be broken into five steps:
a. Define the goals, strategy, workflows, and architecture of data classification.
b. Identify and classify the confidential data that is stored.
c. Apply labels to the data through data labelling.
d. Use the results to improve security and compliance.
e. Treat classification as a continuous process, since data keeps changing.

What is the Data Classification Lifecycle?


The data classification life cycle provides an excellent structure for controlling the flow of data in an
enterprise. Businesses need to account for data security and compliance at each stage. Data
classification makes this possible at every stage, from origin to deletion. The data life cycle has
the following stages:

1. Origin: Sensitive data is produced in various formats, including emails, Excel, Word, Google documents,
social media, and websites.
2. Role-based practice: Role-based security restrictions are applied to all sensitive data by tagging it
according to in-house security policies and compliance rules.
3. Storage: The collected data is stored with access controls and encryption.
4. Sharing: Data is continually distributed among agents, consumers, and co-workers from various devices
and platforms.
5. Archive: Here, data is eventually archived within the organization's storage systems.
6. Publication: Through publication, data can reach customers, who can then view and download
it in the form of dashboards.
What is Prediction?
Another process of data analysis is prediction. It is used to find a numerical output. As in
classification, the training dataset contains the inputs and the corresponding numerical output values. The
algorithm derives the model, or predictor, from the training dataset. The model should find a
numerical output when new data is given. Unlike classification, this method does not use a
class label; the model predicts a continuous-valued function or ordered value.
Regression is generally used for prediction. Predicting the value of a house based on facts
such as the number of rooms, the total area, and so on, is an example of prediction.
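The house-price example can be sketched with simple linear regression, the most basic form of the regression mentioned above. The room counts and prices below are invented for illustration; a real predictor would use many more attributes.

```python
# A minimal sketch of prediction via simple linear regression:
# fit price = a + b * rooms by least squares. All figures are invented.

def fit_line(xs, ys):
    """Return intercept a and slope b of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

rooms = [2, 3, 4, 5]
prices = [100, 150, 200, 250]     # in thousands, purely illustrative

a, b = fit_line(rooms, prices)
predicted = a + b * 6             # predict the price of a 6-room house
print(predicted)                  # 300.0: the output is a number, not a class
```

Note the contrast with classification: the model outputs a continuous value rather than choosing among a fixed set of labels.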

Classification and Prediction Issues


The major issue is preparing the data for Classification and Prediction. Preparing the data involves the
following activities, such as:

1. Data Cleaning: Data cleaning involves removing noise and treating missing values. Noise is removed
by applying smoothing techniques, and missing values are handled by replacing a missing value with
the most commonly occurring value for that attribute.
2. Relevance Analysis: The database may also contain irrelevant attributes. Correlation analysis is used to
determine whether any two given attributes are related.
3. Data Transformation and reduction: The data can be transformed by any of the following methods.
o Normalization: Normalization involves scaling all values of a given attribute so that they
fall within a small specified range. It is used when neural networks or methods involving
distance measurements are used in the learning step.
o Generalization: The data can also be transformed by generalizing it to a higher-level concept.
For this purpose, we can use concept hierarchies.
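As a sketch of the normalization step, min-max scaling is one common way to bring all values of an attribute into a small specified range; the income figures below are invented for illustration.

```python
# A sketch of min-max normalization: scale an attribute's values so they
# fall within [new_min, new_max] (by default [0, 1]). Values are invented.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the [new_min, new_max] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

incomes = [20000, 35000, 50000, 80000]
print(min_max_normalize(incomes))   # [0.0, 0.25, 0.5, 1.0]
```

Rescaling like this keeps attributes with large raw ranges (such as income) from dominating attributes with small ranges (such as age) in distance-based methods.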

Difference between Classification and Prediction


1. Classification is the process of identifying which category a new observation belongs to, based on a
training data set containing observations whose category membership is known. Prediction is the
process of identifying the missing or unavailable numerical data for a new observation.
2. In classification, the accuracy depends on finding the class label correctly. In prediction, the accuracy
depends on how well a given predictor can guess the value of a predicted attribute for new data.
3. In classification, the model is known as the classifier. In prediction, the model is known as the predictor.
4. A classifier is constructed to find categorical labels. A predictor is constructed to predict a
continuous-valued function or ordered value.
5. For example, grouping patients based on their medical records is classification. Predicting the correct
treatment for a particular disease for a person is prediction.
Decision tree
Decision trees are used on a wide range of problems. Decision tree cells are formed by splitting each dimension into a
number of evenly spaced partitions. A variable of interest, such as response rate, experience, or average order size,
is measured for each cell. New records are scored by determining which cell they belong to.
Decision trees use two techniques. A top-down approach recursively splits the data into smaller and smaller cells with
similar values of the target variable. The degree to which a cell contains similar values is known as the purity of the cell.
Each cell in the decision tree is treated independently, and a new split for a particular cell is found using an algorithm
that tests splits based on all available variables of interest. A decision tree is used for variable selection as well as for
building models or classifiers.
The other technique is a bottom-up approach, in which the decision tree uses the target variable to determine how the
input should be partitioned. The decision tree then breaks the data into segments using the split rules at each
step, and the rules for all segments taken together form the decision tree classifier. The rules can be expressed in
simple English, and then in a database query language to retrieve or score similar records. Decision-tree-based
models can be used for data mining tasks like classification, estimation, or prediction.
Decision Tree Induction
A decision tree is a supervised learning method used in data mining for classification and regression
tasks. It is a tree that helps us in decision-making. The decision tree creates classification or
regression models in the form of a tree structure. It separates a data set into smaller and smaller
subsets while, at the same time, the decision tree is incrementally developed. The final tree has
decision nodes and leaf nodes. A decision node has two or more branches; a leaf node represents a
classification or decision and cannot be split further. The uppermost decision node in a tree, which
corresponds to the best predictor, is called the root node. Decision trees can handle both categorical
and numerical data.
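A tree of decision nodes and leaf nodes, as described above, can be represented as a simple nested structure. The weather attributes and yes/no labels in the sketch below are invented for illustration.

```python
# A minimal sketch of a decision tree: decision nodes test an attribute,
# leaf nodes carry a class label. The tree and records are invented.

tree = {
    "attribute": "outlook",                              # root node
    "branches": {
        "sunny": {"label": "no"},                        # leaf node
        "rain": {"attribute": "windy",                   # decision node
                 "branches": {"yes": {"label": "no"},
                              "no": {"label": "yes"}}},
    },
}

def classify(node, record):
    """Walk the tree from the root until a leaf node is reached."""
    while "label" not in node:
        node = node["branches"][record[node["attribute"]]]
    return node["label"]

print(classify(tree, {"outlook": "sunny"}))                 # "no"
print(classify(tree, {"outlook": "rain", "windy": "no"}))   # "yes"
```

Scoring a new record is just a walk from the root node down to a leaf, following the branch that matches the record's attribute value at each decision node.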

Why are decision trees useful?


It enables us to analyze the possible consequences of a decision thoroughly.
It provides us with a framework to quantify the values of outcomes and the probabilities of achieving
them.
It helps us make the best decisions based on existing data and reasonable speculation.
Advantages of using decision trees:
A decision tree does not require scaling or standardization of the data.
Missing values in the data do not significantly affect the process of building a decision tree.
A decision tree model is intuitive and simple to explain to technical teams as well as stakeholders.
Compared to other algorithms, decision trees need less effort for data preparation during pre-
processing.
Decision tree Algorithm:
The decision tree algorithm may appear long, but it is quite simple. The basic algorithm works as
follows:
The algorithm takes three parameters: D, attribute_list, and Attribute_selection_method.
Generally, we refer to D as a data partition.
Initially, D is the entire set of training tuples and their associated class labels (the input training data).
The parameter attribute_list is the set of attributes describing the tuples.
Attribute_selection_method specifies a heuristic procedure for selecting the attribute that "best"
discriminates the given tuples according to class.
The Attribute_selection_method procedure applies an attribute selection measure.
Attribute Selection Measures
A number of criteria may be used to evaluate or select potential splits. Alternative splitting criteria lead to decision
trees that look quite different from each other but have similar performance, and different purity measures tend to
select different splits. An attribute selection measure selects the splitting criterion that provides the best separation
of a given partition D of class-labeled training tuples into individual classes. Splitting D into smaller partitions
according to the splitting criterion should ideally result in each partition being pure; the "best" splitting criterion is
the one that results in new partitions of the highest purity. The attribute selection measure provides a ranking for
each attribute, and the attribute with the best score is chosen as the splitting attribute for the given tuples.

Entropy:
Entropy is used to measure the disorder present in a given dataset. In simple terms, Entropy tells us
the impurity present in the dataset, and it is used to determine the best splits for partitioning the data.

Following is the mathematical formula for Entropy:

Entropy(D) = - Σ (i = 1 to n) p_i * log2(p_i)
In the above formula,

o Entropy (D) is the Entropy of the dataset "D".


o "n" is the dataset's number of distinct classes or categories.
o "pi" is the proportion of data points in class i to the total data points in "D".
To calculate the Entropy of a dataset, the formula sums the term p_i * log2(p_i) over each distinct
class. Since log2(p_i) is never positive for a proportion, the leading negative sign ensures the Entropy
is non-negative. The formula commonly uses logarithm base 2 (log2), making the unit of Entropy bits.

Information Gain:
Information Gain measures the change in Entropy after splitting the dataset on an attribute. It tells us
how much information an attribute provides about the class. We choose the feature that leads to the
greatest reduction in Entropy, as it provides the most information.

Information Gain can be calculated with the help of the following formula:

IG(D, A) = H(D) - Σ_v (|Dv| / |D|) * H(Dv)

or, written informally:

Information Gain = Entropy(D) - (weighted average) * Entropy(each subset)

In the above formula,

o "IG(D, A)" is the Information Gain of dataset "D" when split on attribute A.
o "H(D)" is the Entropy of dataset "D".
o "|Dv|" is the number of instances in the subset Dv, where attribute A takes value v.
o "|D|" is the total number of instances in dataset "D".
o "H(Dv)" is the Entropy of subset Dv.
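Putting the two measures together, Information Gain can be sketched in Python. The toy weather records below are invented; "outlook" splits them into two pure subsets, so its gain equals the full entropy of the dataset.

```python
# A sketch of IG(D, A) = H(D) - sum over v of (|Dv|/|D|) * H(Dv),
# using invented toy records.

from math import log2
from collections import Counter

def entropy(labels):
    """Return the entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy of the whole set minus the weighted entropy of each subset."""
    labels = [r[target] for r in records]
    n = len(records)
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

data = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rain", "play": "yes"},
    {"outlook": "rain", "play": "yes"},
]

# "outlook" yields two pure subsets, so its gain is the full H(D) = 1 bit.
print(information_gain(data, "outlook", "play"))   # 1.0
```

Decision tree induction repeats this calculation for every candidate attribute at each node and splits on the attribute with the highest gain.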
