Classification
[Figure: using the model — the trained classifier is applied to testing data and then to unseen data; for the unseen tuple (Jeff, Professor, 4) it answers the query "Tenured?"]

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
Issues: data preparation
Data cleaning
◦ Preprocess data in order to reduce noise and handle missing values
Data transformation
◦ Generalize and/or normalize data

Issues: evaluating classification methods
Speed
◦ time to construct the model (training time)
Interpretability
◦ understanding and insight provided by the model
Other measures
◦ e.g., goodness of rules, such as decision tree size or compactness of classification rules
[Figure: example decision tree, with the root test "age?"]
09-10-2024
11
11
12
12
6
09-10-2024
13
13
14
14
7
09-10-2024
15
16
16
8
09-10-2024
17
17
18
9
09-10-2024
19
19
20
10
09-10-2024
21
C4.5 (a successor of ID3) uses gain ratio to overcome information gain's bias toward attributes with many values, by normalizing the information gain
$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right)$$
◦ GainRatio(A) = Gain(A)/SplitInfo(A)
The attribute with the maximum gain ratio is selected as the splitting attribute
For example, a three-way split of the 14 training tuples into partitions of sizes 4, 6, and 4 gives
$$\mathrm{SplitInfo}_A(D) = -\frac{4}{14}\log_2\!\left(\frac{4}{14}\right) - \frac{6}{14}\log_2\!\left(\frac{6}{14}\right) - \frac{4}{14}\log_2\!\left(\frac{4}{14}\right) = 1.557$$
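A minimal Python sketch of these formulas, assuming training tuples are represented as dicts keyed by attribute name (an assumption, since no data layout is given on these slides); the demo line only re-checks the 4/6/4 SplitInfo computation above:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_info(sizes, total):
    """SplitInfo_A(D): entropy of the partition sizes produced by the split."""
    return -sum((s / total) * math.log2(s / total) for s in sizes if s > 0)

def gain_ratio(rows, labels, attr):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D) for a categorical attribute."""
    total = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    # Gain(A) = Info(D) - weighted entropy of the partitions after the split.
    gain = entropy(labels) - sum(
        len(g) / total * entropy(g) for g in groups.values())
    si = split_info([len(g) for g in groups.values()], total)
    return gain / si if si > 0 else 0.0

# Re-check the 4/6/4 split of 14 tuples from the example above:
print(round(split_info([4, 6, 4], 14), 3))  # 1.557
```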
$$\mathrm{gini}(D) = 1 - \sum_{i=1}^{m} p_i^2$$
where $p_i$ is the probability that a tuple in D belongs to class $C_i$ and is estimated by $|C_{i,D}|/|D|$; the sum is computed over the m classes.
If a data set D is split on A into two subsets D1 and D2, the Gini index $\mathrm{gini}_A(D)$ is defined as
$$\mathrm{gini}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{gini}(D_2)$$
The attribute that provides the smallest gini_split(D) (equivalently, the largest reduction in impurity) is chosen to split the node; this requires enumerating all possible splitting points for each attribute.
Similarly, the Gini index values for splits on the remaining subsets are 0.458 (for the subsets {low, high} and {medium}) and 0.450 (for the subsets {medium, high} and {low}).
Therefore, the best binary split for attribute income is on {low, medium} and {high}, because it minimizes the Gini index.
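A sketch of the Gini computations, under the same rows-as-dicts assumption as above. The textbook's income table is not shown on these slides, so the usage example at the end uses made-up tuples, not the numbers quoted above:

```python
from collections import Counter
from itertools import combinations

def gini(labels):
    """gini(D) = 1 - sum of p_i^2 over the m classes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(rows, labels, attr, subset):
    """Weighted Gini index for the binary split: subset vs. its complement."""
    d1 = [l for r, l in zip(rows, labels) if r[attr] in subset]
    d2 = [l for r, l in zip(rows, labels) if r[attr] not in subset]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def best_binary_split(rows, labels, attr):
    """Enumerate every binary partition of the attribute's values and
    return the subset that minimizes the weighted Gini index."""
    values = sorted({r[attr] for r in rows})
    candidates = [set(c) for k in range(1, len(values))
                  for c in combinations(values, k)]
    return min(candidates, key=lambda s: gini_split(rows, labels, attr, s))

# Hypothetical 7-tuple demo (not the textbook data):
rows = [{"income": v} for v in
        ["low", "low", "medium", "medium", "medium", "high", "high"]]
labels = ["no", "no", "no", "yes", "yes", "yes", "yes"]
print(best_binary_split(rows, labels, "income"))  # {'low'} vs. {medium, high}
```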
Performance: A simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree classifiers
Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayes' Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$
Predict that X belongs to $C_i$ iff the probability $P(C_i \mid X)$ is the highest among all the $P(C_k \mid X)$ for the k classes
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$
where
$$P(X) = \sum_i P(X \mid C_i)\,P(C_i)$$
Since P(X) is constant for all classes, only the numerator needs to be maximized:
$$P(C_i \mid X) \propto P(X \mid C_i)\,P(C_i)$$
Example: classify the new tuple X = <Adam, M, 1.95m>.
Prior probabilities: P(short) = 4/15 = 0.267; P(medium) = 8/15 = 0.533; P(tall) = 3/15 = 0.2
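A runnable sketch of naïve Bayesian classification for this kind of query. The original 15-tuple height table is not reproduced on these slides, so the training data below is hypothetical (chosen only so the class priors match the 4/15, 8/15, 3/15 above); the continuous height attribute is modeled with a per-class Gaussian, which is one common choice:

```python
import math
from collections import Counter, defaultdict

# Hypothetical (gender, height, class) triples -- placeholders, not the
# textbook data, but with the same 4 short / 8 medium / 3 tall priors.
train = [
    ("F", 1.6, "short"),  ("M", 2.0, "tall"),    ("F", 1.9, "medium"),
    ("F", 1.88, "medium"), ("F", 1.7, "short"),  ("M", 1.85, "medium"),
    ("F", 1.6, "short"),  ("M", 1.7, "short"),   ("M", 2.2, "tall"),
    ("M", 2.1, "tall"),   ("F", 1.8, "medium"),  ("M", 1.95, "medium"),
    ("F", 1.9, "medium"), ("F", 1.8, "medium"),  ("F", 1.75, "medium"),
]

def fit(rows):
    """Estimate priors and per-class likelihoods: categorical gender
    counts plus a Gaussian (mean, variance) for height."""
    by_class = defaultdict(list)
    for g, h, c in rows:
        by_class[c].append((g, h))
    n = len(rows)
    model = {}
    for c, members in by_class.items():
        heights = [h for _, h in members]
        mu = sum(heights) / len(heights)
        var = sum((h - mu) ** 2 for h in heights) / len(heights)
        model[c] = {"prior": len(members) / n, "n": len(members),
                    "gender": Counter(g for g, _ in members),
                    "mu": mu, "var": var}
    return model

def posterior_scores(model, gender, height):
    """Unnormalized P(C_i|X) ∝ P(X|C_i) P(C_i), assuming attribute
    independence (a real implementation would add Laplace smoothing)."""
    scores = {}
    for c, m in model.items():
        p_gender = m["gender"][gender] / m["n"]
        p_height = (math.exp(-(height - m["mu"]) ** 2 / (2 * m["var"]))
                    / math.sqrt(2 * math.pi * m["var"]))
        scores[c] = m["prior"] * p_gender * p_height
    return scores

model = fit(train)
scores = posterior_scores(model, "M", 1.95)
print(max(scores, key=scores.get))  # 'medium' with this made-up data
```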
What Is Prediction?
(Numerical) prediction is similar to classification
◦ construct a model
◦ use the model to predict a continuous or ordered value for a given input
Prediction is different from classification
◦ Classification predicts categorical class labels
◦ Prediction models continuous-valued functions
Major method for prediction: regression
◦ model the relationship between one or more independent (predictor) variables and a dependent (response) variable
Regression analysis
◦ Linear and multiple regression
◦ Non-linear regression
◦ Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
Linear Regression
Linear regression: involves a response variable y and a single predictor variable x
$$y = w_0 + w_1 x$$
where $w_0$ (y-intercept) and $w_1$ (slope) are the regression coefficients
The coefficients are estimated by the method of least squares:
$$w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$
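A direct Python implementation of these closed-form estimates (the data points in the usage line are made up):

```python
def fit_simple_linear(xs, ys):
    """Closed-form least-squares estimates for y = w0 + w1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Toy usage with points lying exactly on y = 1 + 2x:
w0, w1 = fit_simple_linear([0, 1, 2, 3], [1, 3, 5, 7])
print(w0, w1)  # 1.0 2.0
```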
Regression
Assume data fits a predefined function.
Determine best values for regression coefficients c0,c1,…,cn.
Assume an error term ε: $y = c_0 + c_1 x_1 + \dots + c_n x_n + \varepsilon$
To minimize the error, the least mean squared error over the training set is used:
$$\min_{c_0,\dots,c_n} \sum_{i=1}^{|D|} \big( y_i - (c_0 + c_1 x_{i1} + \dots + c_n x_{in}) \big)^2$$
Taking derivatives with respect to each coefficient and setting them equal to zero, we obtain the least-squares estimates for the coefficients.
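In practice the resulting normal equations are solved numerically. A sketch with NumPy's least-squares solver, on made-up data generated near y = 1 + 2·x1 + 3·x2 (the coefficients are assumptions for the demo):

```python
import numpy as np

# Hypothetical data: 5 samples, 2 predictors.
X = np.array([[1., 2.], [2., 1.], [3., 4.], [4., 3.], [5., 5.]])
y = np.array([9.0, 8.1, 18.9, 18.0, 26.1])

# Prepend a column of ones so the intercept c0 is estimated too.
A = np.column_stack([np.ones(len(X)), X])

# np.linalg.lstsq performs the minimization above ("derivatives set to
# zero") in a numerically stable way.
coeffs, _res, _rank, _sv = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)  # approximately [c0, c1, c2] = [1, 2, 3]
```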
Division
Classifying a person as short or medium based on his/her height
Height: {1.6, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 1.8, 1.95, 1.9, 1.8, 1.75}
Prediction
Training pairs (height, class), where 0 = short and 1 = medium:
{ (1.6, 0), (1.9, 1), (1.88, 1), (1.7, 0), (1.85, 1), (1.6, 0), (1.7, 0), (1.8, 1), (1.95, 1), (1.9, 1), (1.8, 1), (1.75, 1) }
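A small sketch of this prediction-as-classification idea: fit the least-squares line from the previous section to these pairs, then threshold the predicted value at an assumed cutoff of 0.5 to recover a class label:

```python
pairs = [(1.6, 0), (1.9, 1), (1.88, 1), (1.7, 0), (1.85, 1), (1.6, 0),
         (1.7, 0), (1.8, 1), (1.95, 1), (1.9, 1), (1.8, 1), (1.75, 1)]
xs = [h for h, _ in pairs]
ys = [c for _, c in pairs]

# Least-squares fit of y = w0 + w1*x (same closed form as above).
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
w0 = y_bar - w1 * x_bar

def classify(height, cutoff=0.5):
    """'medium' if the regression output exceeds the (assumed) cutoff."""
    return "medium" if w0 + w1 * height > cutoff else "short"

print(classify(1.65), classify(1.9))  # short medium
```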
Confusion matrix
A tool for analysing how well a classifier can recognize tuples of different classes.
TP & TN indicate when the classifier is getting things right; FP & FN indicate when it is getting things wrong.
                 Predicted class
                 Yes                    No                     Total
Actual    Yes    True positive (TP)     False negative (FN)    P
class     No     False positive (FP)    True negative (TN)     N
          Total  P’                     N’                     P+N
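The slide listing the individual measures is not preserved here; the sketch below computes the standard ones that follow directly from these four counts (the 90/10/30/70 counts are made up for the demo):

```python
def evaluation_measures(tp, fn, fp, tn):
    """Standard measures derived from the confusion matrix counts."""
    p, n = tp + fn, fp + tn          # actual positives / negatives
    accuracy = (tp + tn) / (p + n)
    error = (fp + fn) / (p + n)
    precision = tp / (tp + fp)       # of predicted "yes", fraction correct
    recall = tp / p                  # sensitivity / true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "error": error,
            "precision": precision, "recall": recall, "f1": f1}

print(evaluation_measures(tp=90, fn=10, fp=30, tn=70))
# accuracy 0.8, error 0.2, precision 0.75, recall 0.9, f1 ~0.818
```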
Evaluation Measures
Robustness – the ability of the classifier to make correct predictions given noisy data or data with missing values
Scalability – the ability to construct the classifier efficiently given large amounts of data