Data Mining Assignment 3

1. Data classification involves building a model from a training dataset that assigns class labels to new data based on patterns in the training data.
2. It is a two-step process: first, a classification model is constructed by analyzing a training dataset that includes data points with known class labels; then the model is used to predict the class labels of new, unlabeled data.
3. There are two main types of classification algorithms: statistical-based algorithms like regression and Bayesian classification, which use statistical techniques to analyze the training data, and machine learning-based algorithms like decision trees, which learn classification rules from the data to make predictions.

CLASSIFICATION

3.1. CLASSIFICATION:

Classification predicts categorical class labels (discrete or nominal). It constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data.

Data classification is a two-step process:

1. Model Construction: Describing a set of predetermined classes.
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
The set of tuples used for model construction is a training set.
The model is represented as classification rules, decision trees, or mathematical formulae.

2. Model Usage: For classifying future or unknown objects.
Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model.
The accuracy rate is the percentage of test set samples that are correctly classified by the model.
The test set is independent of the training set.
If the accuracy is acceptable, use the model to classify data tuples whose class labels are unknown.
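The two steps above can be sketched in pure Python. The "model" below is deliberately trivial (majority class label per income level), and the data and attribute names are hypothetical, chosen only to illustrate model construction, accuracy estimation, and model usage:

```python
from collections import Counter, defaultdict

# Step 1 - Model construction: build a model from tuples with known class labels.
training_set = [
    ("low", "risky"), ("low", "risky"), ("low", "safe"),
    ("medium", "safe"), ("medium", "safe"), ("high", "safe"),
]

votes = defaultdict(Counter)
for income, label in training_set:
    votes[income][label] += 1
# The learned "model": the majority class label for each income level.
model = {income: counts.most_common(1)[0][0] for income, counts in votes.items()}

# Step 2 - Model usage: estimate accuracy on an independent test set,
# then classify new, unlabeled data.
test_set = [("low", "risky"), ("high", "safe"), ("medium", "risky")]
accuracy = sum(model[x] == y for x, y in test_set) / len(test_set)

print(model)           # learned rule per income level
print(accuracy)        # fraction of test samples classified correctly
print(model["high"])   # predicted class for a new, unlabeled tuple
```

If the estimated accuracy is acceptable, the `model` dictionary is then applied to tuples whose class labels are unknown, exactly as in step 2 of the text.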
Learning is broadly classified into two types:

Supervised learning:
The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations. New data is classified based on the training set.

Unsupervised learning (clustering):
The class label of the training data is unknown. Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

Figure.3.1. Data Classification Process: (a) Learning - a classification algorithm analyzes training data (name, age, income, loan decision) and derives classification rules such as IF age = youth THEN loan_decision = risky, IF income = high THEN loan_decision = safe, IF age = middle_aged AND income = low THEN loan_decision = risky. (b) Classification - the rules are applied to new data (e.g., John Henry, middle_aged, low income) to predict the loan decision.
3.2. STATISTICAL-BASED ALGORITHMS

There are two types of statistical-based algorithms, which are as follows:

Regression - Regression issues deal with the estimation of an output value based on input values. When utilized for classification, the input values are values from the database and the output values define the classes. Regression can be used to solve classification issues, but it is also used for other applications, including forecasting. The elementary form of regression is simple linear regression, which includes only one predictor and a prediction.

Regression can be used to implement classification using two methods, which are as follows:

Division - The data are divided into regions based on class.
Prediction - Formulas are created to predict the output class's value.
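As a sketch of the prediction approach, the following fits a simple linear regression (one predictor, least squares) to 0/1 class codes and thresholds the predicted value at 0.5. The data points and the 0.5 cutoff are illustrative assumptions, not from the text:

```python
# Simple linear regression used as a classifier (illustrative sketch).
# Fit y = a*x + b by least squares to 0/1 class codes, then threshold at 0.5.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]   # single predictor (hypothetical values)
ys = [0, 0, 0, 1, 1, 1]               # class codes: 0 and 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

def classify(x):
    # The predicted output value defines the class (Prediction method).
    return 1 if a * x + b >= 0.5 else 0

print(classify(2.5), classify(7.5))   # low x -> class 0, high x -> class 1
```

The threshold effectively divides the predictor axis into two regions, so the same fitted line also illustrates the division method.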
Bayesian Classification - Statistical classifiers are used for the classification. Bayesian classification is based on Bayes' theorem. Bayesian classifiers exhibit high accuracy and speed when applied to large databases.
Bayes' Theorem -

Let X be a data tuple. In the Bayesian method, X is treated as "evidence." Let H be some hypothesis, such as that the data tuple X belongs to a particular class C. The probability P(H|X) is determined in order to classify the data. P(H|X) is the probability that the hypothesis H holds given the "evidence," i.e., the observed data tuple X. P(H|X) is the posterior probability of H conditioned on X.

For instance, suppose the data tuples are limited to users described by the attributes age and income, and that X is a 30-year-old user with Rs. 20,000 income. Assume that H is the hypothesis that the user will purchase a computer. Then P(H|X) reflects the probability that user X will purchase a computer given that the user's age and income are known.

P(H) is the prior probability of H. For instance, this is the probability that any given user will purchase a computer, regardless of age, income, or any other data. The posterior probability P(H|X) is based on more information than the prior probability P(H), which is independent of X.

Likewise, P(X|H) is the posterior probability of X conditioned on H. It is the probability that a user X is 30 years old and earns Rs. 20,000, given that the user will purchase a computer.

P(H), P(X|H), and P(X) can be measured from the given information. Bayes' theorem supports a method of computing the posterior probability P(H|X) from P(H), P(X|H), and P(X). It is given by:

P(H|X) = P(X|H) P(H) / P(X)
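A quick numeric sketch of the formula, using made-up probabilities loosely matching the computer-purchase example (the values are hypothetical, not from the text):

```python
# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h = 0.4          # prior: any user purchases a computer (assumed value)
p_x_given_h = 0.3  # 30-year-old with Rs. 20,000 income, among purchasers (assumed)
p_x = 0.2          # prior probability of that age/income profile (assumed)

p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)   # posterior probability that this user purchases
```

Note how the posterior combines the prior P(H) with how strongly the evidence X supports H, scaled by how common the evidence is overall.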

In Data Mining, the similarity measure is also a distance, with dimensions describing object features. That means if the distance between two data points is small, then there is a high degree of similarity between the objects, and vice versa. Similarity is subjective and depends heavily on the context and application. For example, similarity among vegetables can be determined from their taste, size, colour, etc.

3.3. THE DISTANCE-BASED ALGORITHMS IN DATA MINING

These algorithms are used to measure the distance between data objects and to calculate a similarity score. Distance measures play an important role in machine learning.

They provide the foundations for many popular and effective machine learning algorithms like KNN (K-Nearest Neighbours) for supervised learning and K-Means clustering for unsupervised learning. Different distance measures must be chosen and used depending on the types of data, so it is important to know how to implement and calculate a range of different popular distance measures and the intuitions for the resulting scores.

Distance measures play an important role in machine learning. The most commonly used distance measures in machine learning are as follows:
Hamming Distance
Euclidean Distance

Manhattan Distance
Minkowski Distance
Mahalanobis Distance
Cosine Distance
The most important thing is to know how to calculate each of these distance measures when implementing algorithms from scratch, and to have an intuition for what is being calculated when using algorithms that make use of these distance measures.
3.3.1. HAMMING DISTANCE

Hamming distance calculates the distance between two binary vectors, also referred to as binary strings or bitstrings. The binary strings most likely to be encountered arise when the user performs one-hot encoding on the categorical columns of data. For example, consider a set of data as follows:

Example set, after encoding:

Column    One-hot encoding
Red       [1, 0, 0]
Green     [0, 1, 0]
Blue      [0, 0, 1]

The distance between red and green could be calculated as the sum or the average number of bit differences between the two bit-strings. This is the Hamming distance.

1 1 0 1 1 1 0 0
1 1 1 1 0 1 1 0
---------------
0 0 1 0 1 0 1 0    Hamming distance = 3

For a one-hot encoded string, it might make more sense to compute the sum of the bit differences between the strings, where each positionwise difference is always a 0 or 1.

Hamming Distance = sum for i to N of abs(v1[i] - v2[i])

For bit-strings that may have many 1 bits, it is more common to calculate the average number of bit differences, giving a Hamming distance score between 0 (identical) and 1 (all different).

Hamming Distance = (sum for i to N of abs(v1[i] - v2[i])) / N
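Both formulas above translate directly to Python; the function names are ours, and the sample vectors come from the one-hot table and the bit-string example:

```python
# Hamming distance between two equal-length binary vectors:
# the sum (or, normalized, the mean) of positionwise bit differences.
def hamming(v1, v2):
    assert len(v1) == len(v2), "vectors must have equal length"
    return sum(abs(a - b) for a, b in zip(v1, v2))

def hamming_norm(v1, v2):
    # 0 means identical, 1 means all positions differ.
    return hamming(v1, v2) / len(v1)

red   = [1, 0, 0]   # one-hot encodings from the table above
green = [0, 1, 0]
print(hamming(red, green))   # 2 bit positions differ

bits_a = [1, 1, 0, 1, 1, 1, 0, 0]
bits_b = [1, 1, 1, 1, 0, 1, 1, 0]
print(hamming(bits_a, bits_b))        # 3, as in the worked example
print(hamming_norm(bits_a, bits_b))   # 3/8 = 0.375
```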

3.3.2. Euclidean Distance:

Euclidean distance is considered the traditional metric for problems with geometry. It can be simply explained as the ordinary distance between two points. It is one of the most used metrics in cluster analysis; one of the algorithms that uses this formula is K-means. Mathematically, it computes the root of the squared differences between the coordinates of two objects.

For two points (x1, y1) and (x2, y2):

d = sqrt((x2 - x1)^2 + (y2 - y1)^2)

Figure.3.2. Euclidean Distance
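The formula is a one-liner in Python; the function name and the sample points are illustrative:

```python
import math

# Euclidean distance: square root of the summed squared coordinate differences.
def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((1, 2), (4, 6)))   # sqrt(3^2 + 4^2) = 5.0
```

Written over zipped coordinate pairs, the same function also works for points with more than two dimensions.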

3.3.3. Manhattan Distance:

This determines the absolute difference between the pairs of coordinates.

Suppose we have two points P and Q. To determine the distance between these points, we simply calculate the perpendicular distances of the points from the X-axis and Y-axis.

In a plane with P at coordinate (x1, y1) and Q at (x2, y2):

Manhattan distance between P and Q = |x1 - x2| + |y1 - y2|

Figure.3.3. Manhattan Distance

Here the total length of the red line gives the Manhattan distance between the two points.
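In code, the axis-parallel path length is just the sum of absolute coordinate differences (function name and sample points are illustrative):

```python
# Manhattan distance: sum of absolute coordinate differences,
# i.e., the length of the axis-parallel path between P and Q.
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan((1, 2), (4, 6)))   # |1-4| + |2-6| = 7
```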

3.3.4. Jaccard Index:

The Jaccard index measures the similarity of two data sets as the intersection of those items divided by the union of the data items.

Jaccard coefficient:

J(A, B) = |A ∩ B| / |A ∪ B|

Figure.3.4. Jaccard Index
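Python's set operations map directly onto the formula; the item names are made up for illustration:

```python
# Jaccard coefficient J(A, B) = |A ∩ B| / |A ∪ B|.
# The Jaccard distance is commonly taken as 1 - J(A, B).
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

A = {"apple", "banana", "cherry"}
B = {"banana", "cherry", "date"}
print(jaccard(A, B))       # 2 shared items / 4 total items = 0.5
print(1 - jaccard(A, B))   # Jaccard distance
```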
3.3.5. Minkowski Distance:

It is the generalized form of the Euclidean and Manhattan distance measures. In an N-dimensional space, a point is represented as (x1, x2, ..., xN).

Consider two points P1 and P2:

P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)

Then, the Minkowski distance between P1 and P2 is given as:

D(P1, P2) = (|X1 - Y1|^p + |X2 - Y2|^p + ... + |XN - YN|^p)^(1/p)

When p = 2, the Minkowski distance is the same as the Euclidean distance.
When p = 1, the Minkowski distance is the same as the Manhattan distance.
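A short sketch confirming both special cases on the same pair of points (function name and points are illustrative):

```python
# Minkowski distance of order p; p=1 gives Manhattan, p=2 gives Euclidean.
def minkowski(p1, p2, p):
    return sum(abs(a - b) ** p for a, b in zip(p1, p2)) ** (1 / p)

P1, P2 = (1, 2), (4, 6)
print(minkowski(P1, P2, 1))   # 7.0 (Manhattan)
print(minkowski(P1, P2, 2))   # 5.0 (Euclidean)
```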

3.3.6. Cosine Index:

The cosine distance measure for clustering determines the cosine of the angle between two vectors, given by the following formula:

cos(theta) = (A . B) / (|A| |B|)

Here theta gives the angle between the two vectors, and A, B are n-dimensional vectors.

Figure.3.5. Cosine Distance - vectors A(x1, y1) and B(x2, y2) separated by angle theta.
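The formula above can be sketched as follows; the function name and sample vectors are illustrative:

```python
import math

# Cosine similarity cos(theta) = (A . B) / (|A| * |B|);
# the cosine distance is commonly taken as 1 - cos(theta).
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity((1, 0), (0, 1)))   # orthogonal vectors: 0.0
print(cosine_similarity((2, 2), (4, 4)))   # parallel vectors: ~1.0
```

Because it depends only on the angle and not on vector length, cosine similarity is a common choice when magnitudes (e.g., document lengths) should not matter.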
