
Machine Learning for Microeconometrics

Part 6: Classification and Unsupervised Learning

A. Colin Cameron
University of California - Davis

April 2024

Course Outline
1. Variable selection and cross validation
2. Shrinkage methods
   - ridge, lasso, elastic net
3. ML for causal inference using lasso
   - OLS with many controls, IV with many instruments
4. Other methods for prediction
   - nonparametric regression, principal components, splines
   - neural networks
   - regression trees, random forests, bagging, boosting
5. More ML for causal inference
   - ATE with heterogeneous effects and many controls.
6. Classification and unsupervised learning
   - classification (categorical y) and unsupervised learning (no y).

1. Introduction

To date we have considered supervised learning with a continuous outcome (or a count or binary outcome whose probabilities we model).
Now consider, very briefly, classification and unsupervised learning.
Classification is supervised learning with categorical y
   - the loss function is the number of misclassifications rather than MSE
   - traditional methods select the category with the highest predicted probability
   - some ML methods instead directly select the category.
In unsupervised learning there is no y, only x
   - principal components
   - k-means clustering.
A good reference is ISL2.


Overview

1. Classification (categorical y)
   1. Loss function
   2. Logit
   3. Local logit regression
   4. k-nearest neighbors
   5. Discriminant analysis
   6. Support vector machines
   7. Regression trees and random forests
   8. Neural networks
2. Unsupervised learning (no y)
   1. Principal components analysis
   2. Cluster analysis


1. Classification: Overview

Regression methods
   - predict probabilities based on the log-likelihood rather than MSE
   - assign to the class with the highest predicted probability (Bayes classifier)
     - in the binary case ŷ = 1 if p̂ ≥ 0.5 and ŷ = 0 if p̂ < 0.5
   - parametric: logistic regression, multinomial regression
   - nonparametric: local logit, nearest-neighbors logit.
Discriminant analysis
   - additionally assumes a normal distribution for the x's
   - predicts probabilities
   - uses Bayes theorem to get Pr[Y = k | X = x] and the Bayes classifier
   - used in many other social sciences.


1. Classification: Overview (continued)

Support vector classifiers and support vector machines
   - directly classify (no probabilities)
   - machine learning methods developed in the 1990s
   - are more nonlinear so may classify better
   - use separating hyperplanes of X and extensions.
Random forests
   - in the simplest case minimize the classification error rate rather than the MSE
   - in practice it is better to use the Gini index or entropy.
Neural networks
   - can work very well for complex classification such as images.


1.1 A Different Loss Function: Error Rate


Instead of MSE we use the error rate
   - the number of misclassifications:

     Error rate = (1/n) ∑_{i=1}^n 1[y_i ≠ ŷ_i],

     - where for K categories y_i = 0, ..., K−1 and ŷ_i = 0, ..., K−1
     - and the indicator 1[A] = 1 if event A happens and = 0 otherwise.
The test error rate is for the n_0 observations in the test sample:

     Ave(1[y_0 ≠ ŷ_0]) = (1/n_0) ∑_{i=1}^{n_0} 1[y_{0i} ≠ ŷ_{0i}].

Cross validation uses the number of misclassified observations, e.g. LOOCV is

     CV_(n) = (1/n) ∑_{i=1}^n Err_i = (1/n) ∑_{i=1}^n 1[y_i ≠ ŷ_{(i)}],

where ŷ_{(i)} is the prediction for observation i from the model fit with observation i omitted.

Classification Table

A classification table or confusion matrix is a K × K table of the counts of (y, ŷ).
In the 2 × 2 case with binary y = 1 or 0
   - sensitivity is the % of y = 1 with prediction ŷ = 1 (true positives)
   - specificity is the % of y = 0 with prediction ŷ = 0 (true negatives)
   - the receiver operating characteristic (ROC) curve plots sensitivity against 1 − specificity as the threshold for ŷ = 1 changes
     - given the tradeoff between sensitivity and specificity one may choose the preferred threshold (a Stata sketch follows).
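
A minimal Stata sketch (assuming a binary logit such as the one estimated later has just been fit; these are standard logit postestimation commands):

     * Sketch: ROC curve and threshold choice after a binary logit
     lroc                                  // ROC curve and area under the curve
     lsens                                 // sensitivity and specificity against the threshold
     estat classification, cutoff(0.6)     // classification table at a nondefault threshold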


Bayes Classifier
The Bayes classifier selects the most probable class
   - the following gives the theoretical justification.
Use 0-1 loss L(G, Ĝ(x)) = 1[y_i ≠ ŷ_i]
   - L(G, Ĝ(x)) is 0 on the diagonal of the K × K table and 1 elsewhere
   - where G is the actual category and Ĝ(x) is the predicted category.
Then minimize the expected prediction error

     EPE = E_{G,x}[ L(G, Ĝ(x)) ]
         = E_x [ ∑_{k=1}^K L(G_k, Ĝ(x)) Pr[G_k | x] ].

Minimize EPE pointwise (for each value of x):

     Ĝ(x) = arg min_{g∈G} ∑_{k=1}^K L(G_k, g) Pr[G_k | x]
          = arg min_{g∈G} [1 − Pr[g | x]]    given 0-1 loss
          = arg max_{g∈G} Pr[g | x].

So select the most probable class.

1.2 Logit

Directly model p(x) = Pr[y = 1 | x].
Logistic (logit) regression for the binary case obtains the MLE for

     ln[ p(x) / (1 − p(x)) ] = x'β.

Statisticians implement this using a statistical package for the class of generalized linear models (GLM)
   - logit is in the Bernoulli (or binomial) family with logistic link
   - logit is often the default.
The logit model is a linear (in x) classifier
   - ŷ = 1 if p̂(x) > 0.5
   - i.e. if x'β̂ > 0, since p̂(x) = Λ(x'β̂) = e^{x'β̂}/(1 + e^{x'β̂}) and Λ(0) = e^0/(1 + e^0) = 0.5.


Logit Example
Example considers supplementary health insurance for 65-90 year-olds.

. * Data for 65-90 year olds on supplementary insurance indicator and regressors
. use mus203mepsmedexp.dta, clear

. global xlist income educyr age female white hisp marry ///
> totchr phylim actlim hvgg

. describe suppins $xlist

              storage   display    value
variable name   type    format     label      variable label

suppins float %9.0g =1 if has supp priv insurance


income double %12.0g annual household income/1000
educyr double %12.0g Years of education
age double %12.0g Age
female double %12.0g =1 if female
white double %12.0g =1 if white
hisp double %12.0g =1 if Hispanic
marry double %12.0g =1 if married
totchr double %12.0g # of chronic problems
phylim double %12.0g =1 if has functional limitation
actlim double %12.0g =1 if has activity limitation
hvgg float %9.0g =1 if health status is excellent,
good or very good


Logit Example (continued)

Summary statistics
   - ȳ = 0.58, so not near the extreme of mostly y = 0 or mostly y = 1.

. * Summary statistics
. summarize suppins $xlist

Variable Obs Mean Std. Dev. Min Max

suppins 3,064 .5812663 .4934321 0 1


income 3,064 22.47472 22.53491 -1 312.46
educyr 3,064 11.77546 3.435878 0 17
age 3,064 74.17167 6.372938 65 90
female 3,064 .5796345 .4936982 0 1

white 3,064 .9742167 .1585141 0 1


hisp 3,064 .0848564 .2787134 0 1
marry 3,064 .5558094 .4969567 0 1
totchr 3,064 1.754243 1.307197 0 7
phylim 3,064 .4255875 .4945125 0 1

actlim 3,064 .2836162 .4508263 0 1


hvgg 3,064 .6054178 .4888406 0 1


Logit Example
Logit model coefficient estimates
   - marginal effects are ∂Pr[y = 1]/∂x_j = β_j Λ(x'β){1 − Λ(x'β)}; a margins sketch follows the output.

. * logit model
. logit suppins $xlist, nolog

Logistic regression Number of obs = 3,064


LR chi2(11) = 345.23
Prob > chi2 = 0.0000
Log likelihood = -1910.5353 Pseudo R2 = 0.0829

suppins Coef. Std. Err. z P>|z| [95% Conf. Interval]

income .0180677 .0025194 7.17 0.000 .0131298 .0230056


educyr .0776402 .0131951 5.88 0.000 .0517782 .1035022
age -.0265837 .006569 -4.05 0.000 -.0394586 -.0137088
female -.0946782 .0842343 -1.12 0.261 -.2597744 .070418
white .7438788 .2441096 3.05 0.002 .2654327 1.222325
hisp -.9319462 .1545418 -6.03 0.000 -1.234843 -.6290498
marry .3739621 .0859813 4.35 0.000 .205442 .5424823
totchr .0981018 .0321459 3.05 0.002 .0350971 .1611065
phylim .2318278 .1021466 2.27 0.023 .0316242 .4320315
actlim -.1836227 .1102917 -1.66 0.096 -.3997904 .0325449
hvgg .17946 .0811102 2.21 0.027 .0204868 .3384331
_cons -.1028233 .577563 -0.18 0.859 -1.234826 1.029179
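
As a brief sketch (not on the original slide), the average marginal effects implied by the formula above can be obtained with the standard postestimation command:

     * Sketch: average marginal effects after the logit above
     margins, dydx(*)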


Logit Example (continued)


Classification table computed manually
   - error rate = (737 + 347)/3064 = 1084/3064 = 0.354.

. * Classification table manually


. predict ph_logit
(option pr assumed; Pr(suppins))

. generate yh_logit = ph_logit >= 0.5

. generate err_logit = (suppins==0 & yh_logit==1) | (suppins==1 & yh_logit==0)

. summarize suppins ph_logit yh_logit err_logit

Variable Obs Mean Std. Dev. Min Max

suppins 3,064 .5812663 .4934321 0 1


ph_logit 3,064 .5812663 .1609388 .0900691 .9954118
yh_logit 3,064 .7085509 .4545041 0 1
err_logit 3,064 .3537859 .4782218 0 1

. tabulate suppins yh_logit

=1 if has
supp priv yh_logit
insurance 0 1 Total

0 546 737 1,283


1 347 1,434 1,781

Total 893 2,171 3,064


Logit Example (continued)


Classification table using the estat classification postestimation command
   - problem: the reversed ordering of the table makes it hard to compare with the other models given later.
. * Classification table
. estat classification

Logistic model for suppins

True
Classified D ~D Total

+ 1434 737 2171


- 347 546 893

Total 1781 1283 3064

Classified + if predicted Pr(D) >= .5


True D defined as suppins != 0

Sensitivity Pr( +| D) 80.52%


Specificity Pr( -|~D) 42.56%
Positive predictive value Pr( D| +) 66.05%
Negative predictive value Pr(~D| -) 61.14%

False + rate for true ~D Pr( +|~D) 57.44%


False - rate for true D Pr( -| D) 19.48%
False + rate for classified + Pr(~D| +) 33.95%
False - rate for classified - Pr( D| -) 38.86%

Correctly classified 64.62%


1.3 Nonparametric local logit regression

Extension of local linear regression to the logit model
   - replace the squared residual with the log density.
At x = x_0 maximize with respect to α_0 and β_0 the weighted logit log density

     ∑_{i=1}^n w_h(x_i − x_0) { y_i ln Λ(α_0 + (x_i − x_0)'β_0) + (1 − y_i) ln[1 − Λ(α_0 + (x_i − x_0)'β_0)] }.

Stata add-on command locreg in the ivqte package.


1.4 Nonparametric k-nearest neighbors

For each observation i consider the K neighboring observations with the closest x values and estimate Pr[Y = j] by the fraction of those K neighboring observations with y = j.
k-nearest neighbors (k-NN) for many classes
   - Pr[Y = j | x = x_0] = (1/K) ∑_{i∈N_0} 1[y_i = j]
   - where N_0 is the set of the K observations on x closest to x_0.
There are many measures of closeness
   - the default is Euclidean distance between observations i and j:
     { ∑_{a=1}^p (x_ia − x_ja)² }^{1/2}, where there are p regressors.
Obtain predicted probabilities
   - then assign to the class with the highest predicted probability (a Stata sketch follows).
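
A minimal Stata sketch of this (the classification table on the next slide appears to come from the discrim knn command; the options shown here are illustrative):

     * Hypothetical sketch: k-NN classification with K = 11 and Euclidean distance
     discrim knn $xlist, group(suppins) k(11) notable
     estat classtable, nototals nopercents            // resubstitution classification table
     estat classtable, looclass nototals nopercents   // leave-one-out classification table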


k-nearest neighbors example


Here use Euclidean distance and set K = 11
   - the results here do not use the looclass (leave-one-out) option
   - 584 + 394 = 978 are misclassified (versus logit 737 + 347 = 1084).

. * K-nn classification table with leave-one out cross validation not as good
. estat classtable, nototals nopercents // without LOOCV

Resubstitution classification table

Key

Number

Classified
True suppins 0 1

0 889 394

1 584 1,197

Priors 0.5000 0.5000


1.5 Linear Discriminant Analysis

Developed for classification problems such as "Is a skull Neanderthal or Homo Sapiens?" given various measures of the skull.
Discriminant analysis specifies a joint distribution for (Y, X).
Linear discriminant analysis with K categories
   - assume X | Y = k is N(μ_k, Σ) with density f_k(x) = Pr[X = x | Y = k]
     - note that only the mean of X varies with the category k
   - and let π_k = Pr[Y = k].
The desired Pr[Y = k | X = x] is obtained using Bayes theorem:

     Pr[Y = k | X = x] = Pr[Y = k & X = x] / Pr[X = x] = π_k f_k(x) / ∑_{j=1}^K π_j f_j(x).

Assign observation X = x to the class k with the largest Pr[Y = k | X = x].


Linear Discriminant Analysis (continued)

Upon simplification, assignment to the class k with the largest Pr[Y = k | X = x] is equivalent to choosing the class with the largest discriminant function

     δ_k(x) = x'Σ^{-1}μ_k − (1/2) μ_k'Σ^{-1}μ_k + ln π_k

   - use μ̂_k = x̄_k, Σ̂ = V̂ar[x_k], and π̂_k = (1/N) ∑_{i=1}^N 1[y_i = k].
Called linear discriminant analysis as δ_k(x) is linear in x
   - logit also gives separation linear in x.


Linear Discriminant Analysis Example


638 + 513 = 1151 are misclassified (versus logit 737 + 347 = 1084).

. * Linear discriminant analysis


. discrim lda $xlist, group(suppins) notable

. predict yh_lda
(option classification assumed; group classification)

. estat classtable, nototals nopercents

Resubstitution classification table

Key

Number

Classified
True suppins 0 1

0 770 513

1 638 1,143

Priors 0.5000 0.5000


Quadratic Discriminant Analysis

Quadratic discriminant analysis
   - additionally allows different variances, so X | Y = k is N(μ_k, Σ_k).
Upon simplification, the Bayes classifier assigns observation X = x to the class k with the largest

     δ_k(x) = −(1/2) x'Σ_k^{-1}x + x'Σ_k^{-1}μ_k − (1/2) μ_k'Σ_k^{-1}μ_k − (1/2) ln|Σ_k| + ln π_k

   - called quadratic discriminant analysis as δ_k(x) is quadratic in x.
Use rather than LDA only with a lot of data, as it requires estimating many more parameters.


Quadratic Discriminant Analysis Example


815 + 292 = 1107 are misclassified (versus logit 737 + 347 = 1084).

. * Quadratic discriminant analysis


. discrim qda $xlist, group(suppins) notable

. predict yh_qda
(option classification assumed; group classification)

. estat classtable, nototals nopercents

Resubstitution classification table

Key

Number

Classified
True suppins 0 1

0 468 815

1 292 1,489

Priors 0.5000 0.5000


LDA versus Logit

ESL ch. 4.4.5 compares linear discriminant analysis and logit
   - both have log odds ratio linear in X
   - LDA is a joint model of Y and X, whereas logit is a model of Y conditional on X
   - in the worst case logit, by ignoring the marginal distribution of X, has an asymptotic efficiency loss of about 30% in the error rate
   - if the X's are nonnormal (e.g. categorical) then LDA still does not do too badly.


ISL Figure 4.9: Linear and Quadratic Boundaries


LDA uses a linear boundary to classify and QDA a quadratic boundary.


1.6 Support Vector Classifier


Build on the LDA idea of a linear boundary to classify when K = 2.
Maximal margin classifier
   - classify using a separating hyperplane (a linear combination of X)
   - if perfect classification is possible then there are an infinite number of such hyperplanes
   - so use the separating hyperplane that is furthest from the training observations
   - this distance is called the maximal margin.
Support vector classifier
   - generalizes the maximal margin classifier to the nonseparable case
   - adds slack variables to allow some y's to be on the wrong side of the margin
   - max over β, ε of M (the margin: the distance from the separator to the training X's) subject to β'β = 1, y_i(β_0 + x_i'β) ≥ M(1 − ε_i), ε_i ≥ 0, and ∑_{i=1}^n ε_i ≤ C.


Support Vector Machines

The support vector classifier has a boundary linear in x_0
   - f(x_0) = β_0 + ∑_{i=1}^n α_i x_0'x_i, where x_0'x_i = ∑_{j=1}^p x_{0j} x_{ij}.
The support vector machine has nonlinear boundaries (a Stata sketch follows this list)
   - f(x_0) = β_0 + ∑_{i=1}^n α_i K(x_0, x_i), where K(·) is a kernel
   - polynomial kernel: K(x_0, x_i) = (1 + ∑_{j=1}^p x_{0j} x_{ij})^d
   - radial kernel: K(x_0, x_i) = exp(−γ ∑_{j=1}^p (x_{0j} − x_{ij})²).
Can extend to K > 2 classes (see ISL ch. 9.4)
   - one-versus-one or all-pairs approach
   - one-versus-all approach.
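
A minimal Stata sketch of the kernel choice using the svmachines add-on introduced below (option names follow the package documentation as I recall them; treat the exact syntax as an assumption):

     * Hypothetical sketch: support vector classifier versus machine via the kernel
     * (ins is the byte copy of suppins created in the example that follows)
     svmachines ins $xlist, kernel(linear)   // support vector classifier: linear boundary
     svmachines ins $xlist, kernel(rbf)      // support vector machine: radial kernel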


ISL Figure 9.9: Support Vector Machine


In this example a linear or quadratic classifier won't work, whereas SVM does.


Support Vector Machines Example


Use the Stata add-on svmachines (Guenther and Schonlau).
224 + 463 = 687 are misclassified (versus logit 737 + 347 = 1084).

. * Support vector machines - need y to be byte not float and matsize > n
. set matsize 3200

. global xlistshort income educyr age female marry totchr

. generate byte ins = suppins

. svmachines ins income

. svmachines ins $xlist

. predict yh_svm

. tabulate ins yh_svm

yh_svm
ins 0 1 Total

0 820 463 1,283


1 224 1,557 1,781

Total 1,044 2,020 3,064


Comparison of model predictions

The following compares the various in-sample category predictions (correlations with the actual suppins and with each other).
SVM does best (highest correlation with suppins), but these are in-sample predictions
   - especially for SVM we should use separate training and test samples; a sketch of such a split follows the table.

. * Compare various in-sample predictions


. correlate suppins yh_logit yh_knn yh_lda yh_qda yh_svm
(obs=3,064)

suppins yh_logit yh_knn yh_lda yh_qda yh_svm

suppins 1.0000
yh_logit 0.2505 1.0000
yh_knn 0.3604 0.3575 1.0000
yh_lda 0.2395 0.6955 0.3776 1.0000
yh_qda 0.2294 0.6926 0.2762 0.5850 1.0000
yh_svm 0.5344 0.3966 0.6011 0.3941 0.3206 1.0000
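
A minimal sketch of such a split using Stata's splitsample command (the 70/30 split, seed and variable names are illustrative):

     * Hypothetical sketch: hold out a test sample and compute the test error rate
     splitsample, generate(svar) split(0.7 0.3) rseed(10101)
     logit suppins $xlist if svar==1                 // fit on the training sample only
     predict ph_test if svar==2                      // predicted probabilities in the test sample
     generate yh_test = ph_test >= 0.5 if svar==2
     generate err_test = (suppins != yh_test) if svar==2
     summarize err_test                              // the mean is the test error rate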


1.7 Regression Trees and Random Forests


Regression trees, bagging, random forests and boosting can be used for categorical data.
Let p̂_mk be the proportion of training observations in region m that are from class k.
From ISL2 section 8.1.2, splits can be determined by

     Error rate:  1 − max_k (p̂_mk)
     Gini index:  ∑_{k=1}^K p̂_mk (1 − p̂_mk)
     Entropy:     −∑_{k=1}^K p̂_mk ln p̂_mk

The Stata user-written rforest command supports classification in addition to regression (a sketch follows this list).
The Stata user-written boost command applies to Gaussian (normal), logistic and Poisson regression
   - it uses as loss function for cross-validation the pseudo-R² = 1 − ln L(full model) / ln L(intercept-only model)
   - Matthias Schonlau (2005), The Stata Journal, 5(3), 330-354.
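
A minimal sketch using the user-written rforest command (ssc install rforest; the options shown are illustrative):

     * Hypothetical sketch: random-forest classification of the insurance indicator
     rforest suppins $xlist, type(class) iterations(500)
     predict yh_rf                  // predicted class
     tabulate suppins yh_rf         // in-sample classification table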

1.8 Neural Networks

Neural networks work very well for classification tasks such as images.
Indeed, neural nets were originally developed for such classification problems.
As the ability of neural nets improved they supplanted support vector machines for classification.


2. Unsupervised Learning

A challenging area: no y, only x.
An example is determining several types of individual based on responses to many psychological questions.
Principal components analysis.
Clustering methods
   - k-means clustering
   - hierarchical clustering.


2.1 Principal Components


Initially discussed in the section on dimension reduction.
For p regressors the goal is to find a few (m) linear combinations of X that explain a good fraction of the total variance

     ∑_{j=1}^p Var(X_j) = ∑_{j=1}^p (1/n) ∑_{i=1}^n x_ij²   for mean-0 X's.

Z_m = ∑_{j=1}^p φ_jm X_j, where ∑_{j=1}^p φ_jm² = 1 and the φ_jm are called factor loadings.
A useful statistic is the proportion of variance explained (PVE)
   - a scree plot is a plot of PVE_m against m
   - and a plot of the cumulative PVE of the first m components against m
   - choose m that explains a "sizable" amount of variance
   - ideally find interesting patterns with the first few components.
It is easier when PCA was used earlier in supervised learning, as then we observe Y and can treat m as a tuning parameter.
Stata pca command (a sketch follows).
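
A minimal Stata sketch (the regressor list is illustrative):

     * Sketch: principal components, proportion of variance explained, scree plot
     pca $xlist                     // eigenvalues give each component's share of total variance
     screeplot                      // scree plot of eigenvalues against component number
     predict pc1 pc2, score         // scores on the first two principal components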


2.2 Cluster Analysis: k-Means Clustering

The goal is to find homogeneous subgroups among the X.
k-means splits the data into K distinct clusters such that within-cluster variation is minimized.
Let W(C_k) be the measure of within-cluster variation
   - minimize over C_1, ..., C_K the sum ∑_{k=1}^K W(C_k)
   - with Euclidean distance, W(C_k) = (1/n_k) ∑_{i,i'∈C_k} ∑_{j=1}^p (x_ij − x_i'j)².
Finding the global minimum requires considering K^n partitions.
Instead use algorithm 10.1 (ISL p. 388), which finds a local optimum
   - run the algorithm multiple times with different seeds (a Stata sketch follows this list)
   - choose the optimum with the smallest ∑_{k=1}^K W(C_k).
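
A minimal Stata sketch of multiple random starts (the variables and number of starts are illustrative; Stata does not report ∑W(C_k) directly, so the runs would be compared by computing it, or by inspecting the cluster means, for each solution):

     * Hypothetical sketch: run k-means from several random starting values
     forvalues s = 1/5 {
         cluster kmeans x1 x2 z, k(3) name(km`s') start(krandom(`s'))
         tabstat x1 x2 z, by(km`s') stat(mean)   // cluster means for each run
     }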


ISL Figure 10.5

Data are (x1, x2) with K = 2, 3 and 4 clusters identified.


k-means Clustering Example


Use the same data as in the earlier principal components analysis example.

. * k-means clustering with defaults and three clusters


. use machlearn_part2_spline.dta, replace

. graph matrix x1 x2 z // matrix plot of the three variables

. cluster kmeans x1 x2 z, k(3) name(myclusters)

. tabstat x1 x2 z, by(myclusters) stat(mean)

Summary statistics: mean


by categories of: myclusters

myclusters x1 x2 z

1 .8750554 .503166 1.34776


2 -.8569585 -1.120344 -.5772717
3 .1691631 .6720648 -.3493614

Total .0301211 .0226274 .0664539


Hierarchical Clustering

Do not specify K.
Instead begin with n clusters (leaves) and combine clusters into branches up towards the trunk
   - represented by a dendrogram
   - eyeball it to decide the number of clusters.
Need a dissimilarity measure between clusters
   - four types of linkage: complete, average, single and centroid (a Stata sketch follows this list).
For any clustering method
   - unsupervised learning is a difficult problem
   - results can change a lot with small changes in method
   - clustering on subsets of the data can provide a sense of robustness.
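
A minimal Stata sketch (the variables, linkage and chosen number of clusters are illustrative):

     * Hypothetical sketch: hierarchical clustering with complete linkage
     cluster completelinkage x1 x2 z, name(hier)
     cluster dendrogram hier, cutnumber(20)          // dendrogram of the top 20 branches
     cluster generate grp3 = groups(3), name(hier)   // cut the tree into 3 clusters
     tabstat x1 x2 z, by(grp3) stat(mean)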


3. Conclusions
Guard against overfitting
   - use K-fold cross validation or penalty measures such as AIC.
Biased estimators can be better predictors
   - shrinkage towards zero, such as ridge and LASSO.
For flexible models popular choices are
   - neural nets
   - random forests.
Though which method is best varies with the application
   - and best of all are ensemble forecasts that combine different methods.
Machine learning methods can outperform nonparametric and semiparametric methods
   - so wherever econometricians use nonparametric and semiparametric regression in higher dimensional models it may be useful to use ML methods
   - though the underlying theory still relies on assumptions such as sparsity.

4. Software for Machine Learning


Many ML functions are in Python and R.
Stata 17 covers LASSO, ridge, elastic net, PCA, NP regression, series regression, splines, LDA and QDA, but add-ons are needed for neural networks (brain), random forests (rforest) and support vector machines (svmachines).
Stata has integration with Python
   - Giovanni Cerulli (2020), Machine Learning using Stata/Python, https://arxiv.org/pdf/2103.03122v1.pdf
     - the Stata add-ons r_ml_stata.ado and c_ml_stata.ado are Stata wrappers for tree, boosting, random forest, regularized multinomial, neural network, naive Bayes, nearest neighbor and support vector machine
     - https://sites.google.com/view/giovannicerulli/machine-learning-in-stata
To run R in Stata the user-written Rcall package integrates R within Stata
   - https://github.com/haghish/rcall

Some R Commands (possibly superseded)

Basic classification
   - logistic: glm() function
   - discriminant analysis: lda() and qda() functions in the MASS library
   - k nearest neighbors: knn() function in the class library.
Support vector machines
   - support vector classifier: svm(..., kernel="linear") in the e1071 library
   - support vector machine: svm(..., kernel="polynomial") or svm(..., kernel="radial") in the e1071 library
   - receiver operating characteristic curve: rocplot in the ROCR library.
Unsupervised learning
   - principal components analysis: prcomp()
   - k-means clustering: kmeans()
   - hierarchical clustering: hclust().


5. References

ISLR2: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (2021), An Introduction to Statistical Learning: with Applications in R, 2nd Ed., Springer.
   - Free PDF from https://www.statlearning.com/ and $40 softcover book via Springer MyCopy.

ISLP: Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani and Jonathan Taylor (2023), An Introduction to Statistical Learning: with Applications in Python, Springer.
   - Free PDF from https://www.statlearning.com/ and $40 softcover book via Springer MyCopy.

Geron2: Aurelien Geron (2019), Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, Second edition, O'Reilly.
   - An excellent book using Python for ML written by a computer scientist.


References (continued)

ESL: Trevor Hastie, Robert Tibshirani and Jerome Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer.
   - A more advanced treatment.
   - Free PDF and $40 softcover book at https://link.springer.com/book/10.1007/978-0-387-84858-7

Chapters 28.6.7-28.6.8, "Machine Learning for prediction and inference," in A. Colin Cameron and Pravin K. Trivedi (2023), Microeconometrics Using Stata, Second edition.
   - Covers classification and unsupervised learning only very briefly.

