Data Mining UNIT-2 Notes
1. Classification
2. Prediction
What is Classification?
Classification is the process of finding a model (a classifier) that assigns a new observation to one of a set of known categories (class labels), based on a training data set of observations whose category membership is known.
What is Prediction?
Prediction is the process of building a model (a predictor) that estimates a continuous-valued or missing numerical attribute for a new observation.
Classification and prediction methods can be compared and evaluated according to the following criteria:
o Accuracy: The accuracy of the classifier is its ability to predict the class label of new data correctly; the accuracy of the predictor is how well it can estimate the unknown value of the predicted attribute.
o Speed: The speed of the method depends on the computational cost of generating and using the classifier or predictor.
o Robustness: Robustness is the ability of the classifier or predictor to make correct predictions even when the incoming data are noisy or have missing values.
o Scalability: Scalability is the ability to construct the classifier or predictor efficiently as the amount of given data grows.
o Interpretability: Interpretability is how readily we can understand the reasoning behind the predictions or classifications made by the predictor or classifier.
| Classification | Prediction |
|---|---|
| Classification is the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known. | Prediction is the process of estimating missing or unavailable numerical values for a new observation. |
| In classification, the accuracy depends on finding the class label correctly. | In prediction, the accuracy depends on how well a given predictor can estimate the value of a predicted attribute for new data. |
| In classification, the model is known as the classifier. | In prediction, the model is known as the predictor. |
| A model or classifier is constructed to find categorical labels. | A model or predictor is constructed to predict a continuous-valued function or ordered value. |
| For example, grouping patients based on their medical records can be considered classification. | For example, we can think of prediction as predicting the correct treatment for a particular disease for a person. |
Rule-Based Classification
Bayesian classification
Classification by Backpropagation
K-NN Classifier
Fuzzy Logic
Decision Tree Induction:
Decision Tree Induction is the learning of decision trees from class-labelled training data, from which decision rules can be derived.
Decision trees are easy to interpret and do not require domain knowledge.
A rule-based classifier uses a set of IF-THEN rules for classification: the IF part of a rule is the rule antecedent (condition) and the THEN part is the rule consequent (the predicted class).
Source: Javatpoint
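As a rough illustration (not from the source), a rule-based classifier can be sketched as an ordered list of IF-THEN rules where the first rule whose condition matches the record assigns the class; the attribute names and rules below are made up for the example.
Code:
# Hypothetical rule-based classifier: each rule is (condition, class label)
def rule_based_classify(record, rules, default="unknown"):
    for condition, label in rules:
        if condition(record):      # IF part (rule antecedent)
            return label           # THEN part (rule consequent)
    return default                 # no rule covers the record

# Illustrative rules only (not from the source)
rules = [
    (lambda r: r["age"] == "youth" and r["student"] == "yes", "buys_computer = yes"),
    (lambda r: r["credit_rating"] == "excellent", "buys_computer = yes"),
]

print(rule_based_classify({"age": "youth", "student": "yes", "credit_rating": "fair"}, rules))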
Sigmoid Function
Code:
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real z to a value in (0, 1)
    return 1.0 / (1 + np.exp(-z))
Source: Wikipedia
The output of logistic regression must lie between 0 and 1. Because it can never exceed 1, its graph forms an "S"-shaped curve, which is an easy way to recognize the sigmoid, or logistic, function.
In logistic regression, the concept used is the threshold value. The threshold value defines how the probability is mapped to class 0 or class 1: values above the threshold tend towards 1, and values below the threshold tend towards 0.
Example:
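A minimal sketch (not from the source) of how the sigmoid output is turned into a class label using a threshold of 0.5; the weights, bias and input values below are made up for illustration.
Code:
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

# Made-up weights, bias and input (for illustration only)
w = np.array([0.8, -0.4])
b = 0.1
x = np.array([2.0, 1.5])

p = sigmoid(np.dot(w, x) + b)    # probability between 0 and 1
label = 1 if p >= 0.5 else 0     # threshold value of 0.5
print(p, label)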
o It maximizes the distance between means of two
classes.
o It minimizes the variance within the individual class.
In other words, LDA projects the data points of the two classes onto a new axis chosen so that their separation is maximized, as the sketch below illustrates.
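A minimal sketch of LDA as a dimensionality-reduction and classification step, assuming scikit-learn is available and using its built-in Iris data purely for illustration.
Code:
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Project the 4 original features onto 2 discriminant axes
lda = LinearDiscriminantAnalysis(n_components=2)
X_new = lda.fit_transform(X, y)

print(X_new.shape)       # (150, 2)
print(lda.score(X, y))   # accuracy of the fitted LDA classifier on the training data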
Why LDA?
o Logistic Regression is one of the most popular classification algorithms and performs well for binary classification, but it falls short for multi-class problems with well-separated classes, which LDA handles quite efficiently.
o LDA can also be used in data pre-processing to
reduce the number of features, just as PCA, which
reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In
Fisherfaces, LDA is used to extract useful data from
different faces. Coupled with eigenfaces, it produces
effective results.
o Face Recognition
Face recognition is a popular application of computer vision in which each face is represented as a combination of a large number of pixel values. Here, LDA is used to reduce the number of features to a manageable number before the classification step. It generates a new template in which each dimension is a linear combination of pixel values. If the linear combination is obtained using Fisher's linear discriminant, the result is called a Fisherface.
o Medical
In the medical field, LDA is widely used to classify a patient's disease as mild, moderate, or severe on the basis of various health parameters and the treatment the patient is undergoing. This classification helps doctors decide whether to increase or decrease the pace of the treatment.
o Customer Identification
LDA is also applied in customer identification: it helps identify and select the features that characterize the group of customers most likely to purchase a specific product, for example in a shopping mall.
o For Predictions
LDA can also be used for making predictions that support decision making. For example, the question "will you buy this product?" gives a predicted result in one of two possible classes: buying or not buying.
o In Learning
Nowadays, robots are trained to learn and talk so as to simulate human behaviour, which can also be treated as a classification problem. In this case, LDA builds similar groups on the basis of different parameters such as pitch, frequency, sound, and tune.
Entropy:
Entropy is a measure of the impurity (disorder) of a dataset and is calculated as

Entropy(S) = − Σ_i p_i · log2(p_i)

Where,
S is the dataset sample and p_i is the proportion of samples that belong to the ith category.
Information Gain:
Information gain measures the reduction in entropy or
variance that results from splitting a dataset based on a
specific property. It is used in decision tree algorithms to
determine the usefulness of a feature by partitioning the
dataset into more homogeneous subsets with respect to the
class labels or target variable. The higher the information
gain, the more valuable the feature is in predicting the target
variable.
The information gain of an attribute A, with respect to a dataset S, is calculated as follows:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

where A is the specific attribute or class label, Values(A) is the set of values A can take, and S_v is the subset of S for which A has value v. At each node, the attribute with the highest information gain is selected as the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created by splitting on the best attribute, and stop when the nodes can no longer be split (leaf nodes).
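A small sketch (not from the source) that evaluates these formulas on a toy set of class labels; the labels and the split are made up for the example.
Code:
from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, subsets):
    # Gain(S, A) = Entropy(S) - weighted entropy of the subsets after the split
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)

# Toy node: 6 "yes" and 4 "no" samples, split by a hypothetical attribute
labels = ["yes"] * 6 + ["no"] * 4
subsets = [["yes"] * 5 + ["no"], ["yes"] + ["no"] * 3]
print(entropy(labels))                      # about 0.971
print(information_gain(labels, subsets))    # about 0.256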
For a candidate split θ (a feature and a threshold) at node m with data Q_m containing n_m samples, the quality of the split is measured as

G(Q_m, θ) = (n_left / n_m) · H(Q_left(θ)) + (n_right / n_m) · H(Q_right(θ))

Here, H is the measure of impurity of the left and right subsets produced at node m. To select the split parameter, we can write:

θ* = argmin_θ G(Q_m, θ)
Example:
Python3
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from graphviz import Source

iris = load_iris()
X = iris.data[:, 2:]   # petal length and width
y = iris.target

# DecisionTreeClassifier trained with the entropy criterion
tree_clf = DecisionTreeClassifier(criterion='entropy', max_depth=2)
tree_clf.fit(X, y)

# Export the fitted tree to a .dot file and render it
export_graphviz(
    tree_clf,
    out_file="iris_tree.dot",
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)

with open("iris_tree.dot") as f:
    dot_graph = f.read()
Source(dot_graph)
Output:
Decision Tree Classifier
For regression trees, the impurity measure H at a node m is the mean squared error of the target values:

MSE(Q_m) = (1 / n_m) · Σ_{y ∈ Q_m} (y − ȳ_m)²

Here, MSE is the mean squared error and ȳ_m is the mean of the target values at node m.
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor, export_graphviz
from graphviz import Source

diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# DecisionTreeRegressor trained with the squared-error criterion
tree_reg = DecisionTreeRegressor(criterion='squared_error', max_depth=2)
tree_reg.fit(X, y)

# Export the fitted tree to a .dot file and render it
export_graphviz(
    tree_reg,
    out_file="diabetes_tree.dot",
    feature_names=diabetes.feature_names,
    rounded=True,
    filled=True
)

with open("diabetes_tree.dot") as f:
    dot_graph = f.read()
Source(dot_graph)
Output:
Decision Tree Regression
Decision trees perform classification without requiring much computation.
Decision trees are able to handle both continuous and categorical variables.
Decision trees provide a clear indication of which fields are most important for prediction or classification.
// Importing required headers
#include <cstdlib>
#include <iostream>
using namespace std;

int main()
{
    // Generating random data for classification
    int X[100][5];
    int t[100];
    srand(10);
    for (int i = 0; i < 100; i++) {
        for (int j = 0; j < 5; j++) {
            X[i][j] = rand() % 2;
        }
        t[i] = rand() % 2;
    }

    // Splitting data into train and test sets
    int X_train[70][5];
    int X_test[30][5];
    int t_train[70];
    int t_test[30];
    for (int i = 0; i < 70; i++) {
        for (int j = 0; j < 5; j++) {
            X_train[i][j] = X[i][j];
        }
        t_train[i] = t[i];
    }
    for (int i = 0; i < 30; i++) {
        for (int j = 0; j < 5; j++) {
            X_test[i][j] = X[i + 70][j];
        }
        t_test[i] = t[i + 70];
    }

    // Randomly predicting binary values for the test set
    int predicted_value[30];
    for (int i = 0; i < 30; i++) {
        predicted_value[i] = rand() % 2;
    }

    // Printing predicted binary values for the test set
    for (int i = 0; i < 30; i++) {
        cout << predicted_value[i] << " ";
    }
    cout << endl;

    // Calculating number of 0s and 1s in the train set
    int zeroes = 0;
    int ones = 0;
    for (int i = 0; i < 70; i++) {
        if (t_train[i] == 0) {
            zeroes += 1;
        }
        else {
            ones += 1;
        }
    }

    // Calculating Gini index of the train labels
    float val = 1 - ((zeroes / 70.0) * (zeroes / 70.0)
                     + (ones / 70.0) * (ones / 70.0));
    cout << "Gini : " << val << endl;

    // Calculating accuracy of the random predictions
    int match = 0;
    int UnMatch = 0;
    for (int i = 0; i < 30; i++) {
        if (predicted_value[i] == t_test[i]) {
            match += 1;
        }
        else {
            UnMatch += 1;
        }
    }
    float accuracy = match / 30.0;
    cout << "Accuracy is: " << accuracy << endl;

    return 0;
}
Output
1 1 0 0 1 0 1 0 1 0 0 1 1 0 0 1 0 1 1 1 0 0 0
1 1 0 0 0 1 0
Gini : 0.5
Accuracy is: 0.366667
Decision Tree Classification Algorithm
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in
entropy after the segmentation of a dataset based
on an attribute.
o It calculates how much information a feature
provides us about a class.
o According to the value of information gain, we split
the node and build the decision tree.
o A decision tree algorithm always tries to maximize
the value of information gain, and a node/attribute
having the highest information gain is split first. It
can be calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Where,
Entropy(S) = −P(yes)·log2 P(yes) − P(no)·log2 P(no), S is the total number of samples, and P(yes) and P(no) are the probabilities of the respective classes.
2. Gini Index:
o Gini index is a measure of impurity or purity used
while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm
uses the Gini index to create binary splits.
o Gini index can be calculated using the below
formula:
Gini Index = 1 − Σ_j P_j²
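A small sketch (not from the source) that evaluates the Gini index for a toy node; the class counts are made up for the example.
Code:
from collections import Counter

def gini_index(labels):
    # Gini = 1 - sum(P_j^2) over the class proportions
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# Toy node with 6 "yes" and 4 "no" samples
print(gini_index(["yes"] * 6 + ["no"] * 4))   # 0.48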
o Cost Complexity Pruning
o Reduced Error Pruning.
Steps will also remain the same, which are given below:
1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
5.
6. #importing datasets
7. data_set= pd.read_csv('user_data.csv')
8.
9. #Extracting Independent and dependent Variable
10. x= data_set.iloc[:, [2,3]].values
11. y= data_set.iloc[:, 4].values
12.
13. # Splitting the dataset into training and test set.
14. from sklearn.model_selection import train_test_split
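The listing above stops at the import of train_test_split; a minimal sketch (not from the source) of the remaining pre-processing and model-fitting steps, with the split ratio, random_state and scaling chosen only for illustration.
Code:
# Splitting the dataset into training and test sets (illustrative parameters)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature scaling (optional for trees, shown for consistency with the workflow)
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

# Fitting the Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)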
Output:
4. Test accuracy of the result (Creation of
Confusion matrix)
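The code for this step is not included above; a minimal sketch (not from the source), assuming the classifier and the test split from the previous step.
Code:
# Predicting the test set results
y_pred = classifier.predict(x_test)

# Creating the confusion matrix to test the accuracy of the result
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)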
Output:
8. mtp.xlim(x1.min(), x1.max())
9. mtp.ylim(x2.min(), x2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
12. c = ListedColormap(('purple', 'green'))(i), label = j)
13. mtp.title('Decision Tree Algorithm (Training set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()
Output:
The above output is completely different from that of the other classification models. It has both vertical and horizontal lines that split the dataset according to the age and estimated salary variables.
8. mtp.xlim(x1.min(), x1.max())
9. mtp.ylim(x2.min(), x2.max())
10. for i, j in enumerate(nm.unique(y_set)):
11. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
12. c = ListedColormap(('purple', 'green'))(i), label = j)
13. mtp.title('Decision Tree Algorithm(Test set)')
14. mtp.xlabel('Age')
15. mtp.ylabel('Estimated Salary')
16. mtp.legend()
17. mtp.show()
o There should be some actual values in the feature
variable of the dataset so that the classifier can
predict accurate results rather than a guessed result.
o The predictions from each tree must have very low
correlations.
<="" li="">
o It takes less training time as compared to other
algorithms.
o It predicts output with high accuracy, and it runs efficiently even for large datasets.
o It can also maintain accuracy when a large
proportion of data is missing.
1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
5.
6. #importing datasets
7. data_set= pd.read_csv('user_data.csv')
8.
9. #Extracting Independent and dependent Variable
10. x= data_set.iloc[:, [2,3]].values
11. y= data_set.iloc[:, 4].values
12.
13. # Splitting the dataset into training and test set.
14. from sklearn.model_selection import train_test_split
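The listing above again stops at the train_test_split import; a minimal sketch (not from the source) of the split and the Random Forest fitting step, using the parameters that appear in the output shown below (n_estimators=10, criterion='entropy'); the split ratio and random_state are chosen only for illustration.
Code:
# Splitting the dataset into training and test sets (illustrative parameters)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Fitting the Random Forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy')
classifier.fit(x_train, y_train)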
Output:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
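To use the fitted model, a short sketch (not from the source) of predicting the test set and checking the result, assuming the classifier and split from the previous step.
Code:
# Predicting the test set results with the Random Forest model
y_pred = classifier.predict(x_test)

# Checking the accuracy of the predictions
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))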
Output:
7. mtp.xlim(x1.min(), x1.max())
8. mtp.ylim(x2.min(), x2.max())
9. for i, j in enumerate(nm.unique(y_set)):
10. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
11. c = ListedColormap(('purple', 'green'))(i), label = j)
12. mtp.title('Random Forest Algorithm (Training set)')
13. mtp.xlabel('Age')
14. mtp.ylabel('Estimated Salary')
15. mtp.legend()
16. mtp.show()
Output:
The above image is the visualization result for the
Random Forest classifier working with the training set
result. It is very much similar to the Decision tree
classifier. Each data point corresponds to each user of
the user_data, and the purple and green regions are the
prediction regions. The purple region is classified for the
users who did not purchase the SUV car, and the green
region is for the users who purchased the SUV.
Output: