Data Mining Assignment 3
3.1. CLASSIFICATION:
Classification predicts categorical class labels (discrete or nominal). It classifies the data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model in classifying new data.
Data classification is a two-step process: (a) learning and (b) classification, as illustrated in Figure 3.1.
[Figure 3.1. Data classification process: (a) Learning — a classification algorithm analyses the training data (name, age, income, loan_decision) and derives classification rules (e.g., "IF income = high THEN …"). (b) Classification — the rules are applied to new data, e.g. the tuple (John Henry, middle_aged, low) → loan decision?]
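As a minimal sketch of these two steps (assuming scikit-learn and a small made-up loan dataset in the spirit of Figure 3.1, not taken from it), a decision tree can be learned from labelled tuples and then used to classify a new tuple:

from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Illustrative training set: (age, income) -> loan_decision (the class label).
X_train = [["senior", "low"], ["middle_aged", "high"], ["youth", "low"], ["senior", "high"]]
y_train = ["safe", "safe", "risky", "safe"]

# Step (a) Learning: encode the categorical attributes and construct a model.
encoder = OrdinalEncoder()
model = DecisionTreeClassifier().fit(encoder.fit_transform(X_train), y_train)

# Step (b) Classification: use the learned model to classify new data.
X_new = encoder.transform([["middle_aged", "low"]])
print(model.predict(X_new))  # predicted loan decision for the new tuple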
3.2. STATISTICAL-BASED ALGORITHMS
> Prediction: formulas are created to predict the output class's value.
Bayesian Classification: statistical classifiers are used for classification; they can predict class membership probabilities.
P(H|X) is the posterior probability of H conditioned on X.
For instance, consider that the set of data tuples is limited to users defined by the attributes age and income, and that X is a 30-year-old user with Rs. 20,000 income. Assume that H is the hypothesis that the user will purchase a computer. Thus P(H|X) reflects the probability that user X will purchase a computer given that the user's age and income are known.
P(H) is the prior probability of H. For instance, this is the probability that any given user will purchase a computer, regardless of age or income.
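As a small numerical sketch (the probability values below are assumed purely for illustration), Bayes' theorem relates the posterior to the prior as P(H|X) = P(X|H) * P(H) / P(X):

# Assumed, illustrative probabilities for the computer-purchase example.
p_h = 0.30          # P(H): prior probability that any given user buys a computer
p_x_given_h = 0.20  # P(X|H): probability that a buyer is 30 years old with Rs. 20,000 income
p_x = 0.10          # P(X): probability of observing such a user at all

# Bayes' theorem gives the posterior probability for this particular user.
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)  # 0.6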
3.3. DISTANCE MEASURES
Distance measures provide the foundations for many popular and effective machine learning algorithms, such as KNN (K-Nearest Neighbours) for supervised learning and K-Means clustering for unsupervised learning.
Different distance measures must be chosen and used depending on the types of data, so it is important to know how to implement and calculate a range of different popular distance measures and the intuitions for the resulting scores. These include:
Manhattan Distance
Minkowski Distance
Mahalanobis Distance
Cosine Distance
Most important is knowing how to calculate each of these distance measures when implementing algorithms from scratch, and the intuition for what is being calculated when using algorithms that make use of these distance measures.
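As a rough sketch of computing the measures listed above in practice (assuming NumPy and SciPy, with made-up vectors), SciPy's distance module provides ready implementations:

import numpy as np
from scipy.spatial import distance

u = np.array([2.0, 4.0, 6.0])
v = np.array([1.0, 7.0, 3.0])

print(distance.cityblock(u, v))       # Manhattan distance
print(distance.minkowski(u, v, p=3))  # Minkowski distance with p = 3
print(distance.cosine(u, v))          # Cosine distance (1 - cosine similarity)

# Mahalanobis distance also needs the inverse covariance matrix of a data sample.
sample = np.array([[2.0, 4.0, 6.0],
                   [1.0, 7.0, 3.0],
                   [3.0, 5.0, 5.0],
                   [4.0, 6.0, 2.0]])
VI = np.linalg.inv(np.cov(sample, rowvar=False))
print(distance.mahalanobis(u, v, VI))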
3.3.1. HAMMING DISTANCE
Hamming distance calculates the distance between two binary vectors, also referred to as binary strings or bitstrings. Binary strings are most likely to be encountered when the user performs One-Hot Encoding on categorical columns of data. For example, consider a set as follows:
Column values: RED, GREEN, BLUE
After encoding: RED = [1, 0, 0], GREEN = [0, 1, 0], BLUE = [0, 0, 1]

The Hamming distance between two bitstrings is the number of positions in which they differ. For example, for the bitstrings
1 1 0 1 1 1 0 0
1 1 1 1 0 1 1 0
the differing positions are marked by 0 0 1 0 1 0 1 0, so the Hamming distance = 3.
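A minimal from-scratch sketch of this calculation (the function name is illustrative), applied to the two bitstrings above:

def hamming_distance(a, b):
    # Count the positions at which the two bitstrings differ.
    return sum(x != y for x, y in zip(a, b))

row1 = [1, 1, 0, 1, 1, 1, 0, 0]
row2 = [1, 1, 1, 1, 0, 1, 1, 0]
print(hamming_distance(row1, row2))  # 3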
[Figure: two points P1 (x1, y1) and P2 (x2, y2), with the coordinate differences x2 − x1 and y2 − y1 shown along the axes; the red path between them illustrates the Manhattan distance.]
3.3.3. MANHATTAN DISTANCE:
Here the total length of the red line gives the Manhattan distance between the two points.
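A minimal from-scratch sketch (illustrative function and points), where each |x_i − y_i| term corresponds to one axis-aligned segment of that red line:

def manhattan_distance(p1, p2):
    # Sum of absolute coordinate differences (city block distance).
    return sum(abs(a - b) for a, b in zip(p1, p2))

print(manhattan_distance((1, 2), (4, 6)))  # |1-4| + |2-6| = 7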
Jaccard coefficient = intersection / union = |A ∩ B| / |A ∪ B|
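A minimal sketch of the same formula on two illustrative sets:

def jaccard_coefficient(a, b):
    # |A intersection B| / |A union B|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard_coefficient({1, 2, 3}, {2, 3, 4}))  # 2 / 4 = 0.5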
Consider two points P1 and P2:
P1: (x1, x2, ..., xN)
P2: (y1, y2, ..., yN)
Then, the Minkowski distance between P1 and P2 is given as:
D(P1, P2) = (|x1 − y1|^p + |x2 − y2|^p + ... + |xN − yN|^p)^(1/p)
When p = 2, the Minkowski distance is the same as the Euclidean distance.
When p = 1, the Minkowski distance is the same as the Manhattan distance.
[Figure: straight-line distance d between points A(x1, y1) and B(x2, y2), illustrating the Euclidean (p = 2) case.]
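A minimal from-scratch sketch of the Minkowski formula above (illustrative points), showing that p = 1 reproduces the Manhattan distance and p = 2 the Euclidean distance:

def minkowski_distance(p1, p2, p):
    # (sum of |x_i - y_i|^p) raised to the power 1/p.
    return sum(abs(a - b) ** p for a, b in zip(p1, p2)) ** (1 / p)

print(minkowski_distance((1, 2), (4, 6), p=1))  # 7.0 (Manhattan)
print(minkowski_distance((1, 2), (4, 6), p=2))  # 5.0 (Euclidean)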