
Generalized Additive Models

YIK LUN, KEI


This paper works through a lab from the book An Introduction to Statistical
Learning with Applications in R. All R code and comments below belong to the
book and its authors.

GAM using natural splines


library(ISLR)
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.12
library(splines)
attach(Wage)
gam1=lm(wage~ns(year, 4)+ns(age, 5) +education, data=Wage)
summary(gam1)
##
## Call:
## lm(formula = wage ~ ns(year, 4) + ns(age, 5) + education, data = Wage)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -120.513  -19.608   -3.583   14.112  214.535
##
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   46.949      4.704   9.980  < 2e-16 ***
## ns(year, 4)1                   8.625      3.466   2.488  0.01289 *
## ns(year, 4)2                   3.762      2.959   1.271  0.20369
## ns(year, 4)3                   8.127      4.211   1.930  0.05375 .
## ns(year, 4)4                   6.806      2.397   2.840  0.00455 **
## ns(age, 5)1                   45.170      4.193  10.771  < 2e-16 ***
## ns(age, 5)2                   38.450      5.076   7.575 4.78e-14 ***
## ns(age, 5)3                   34.239      4.383   7.813 7.69e-15 ***
## ns(age, 5)4                   48.678     10.572   4.605 4.31e-06 ***
## ns(age, 5)5                    6.557      8.367   0.784  0.43328
## education2. HS Grad           10.983      2.430   4.520 6.43e-06 ***
## education3. Some College      23.473      2.562   9.163  < 2e-16 ***
## education4. College Grad      38.314      2.547  15.042  < 2e-16 ***
## education5. Advanced Degree   62.554      2.761  22.654  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.16 on 2986 degrees of freedom
## Multiple R-squared:  0.293,  Adjusted R-squared:  0.2899
## F-statistic:  95.2 on 13 and 2986 DF,  p-value: < 2.2e-16
par(mfrow =c(1,3))
plot.gam(gam1,se=TRUE,col ="blue")

[Figure: three panels showing ns(year, 4) against year, ns(age, 5) against age, and the partial effect for education.]
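
The fit above can use lm() because ns() only expands a predictor into a fixed set of basis columns, which then enter an ordinary least squares regression. The following is a minimal sketch of that expansion; the age grid is a hypothetical stand-in for the Wage data, chosen only so that the block runs without the ISLR package.

```r
library(splines)

# Hypothetical grid of ages (an assumption, not the Wage data)
age_grid <- seq(18, 80, by = 1)

# ns() with df = 5 returns a matrix with one column per basis function
basis <- ns(age_grid, df = 5)

# One row per age value, one column per degree of freedom
basis_dim <- dim(basis)
print(basis_dim)
```

Each of the five columns receives its own coefficient in the lm() summary, which is why the output lists ns(age, 5)1 through ns(age, 5)5.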

GAM using smoothing splines with chosen degrees of freedom


gam.m3=gam(wage~s(year, 4) + s(age, 5)+education,data=Wage)
summary(gam.m3)
##
## Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -119.43  -19.70   -3.33   14.17  213.48
##
## (Dispersion Parameter for gaussian family taken to be 1235.69)
##
##     Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3689770 on 2986 degrees of freedom
## AIC: 29887.75
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
##              Df  Sum Sq Mean Sq F value    Pr(>F)
## s(year, 4)    1   27162   27162  21.981 2.877e-06 ***
## s(age, 5)     1  195338  195338 158.081 < 2.2e-16 ***
## education     4 1069726  267432 216.423 < 2.2e-16 ***
## Residuals  2986 3689770    1236
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
##             Npar Df Npar F  Pr(F)
## (Intercept)
## s(year, 4)        3  1.086 0.3537
## s(age, 5)         4 32.380 <2e-16 ***
## education
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

par(mfrow =c(1,3))
plot.gam(gam.m3,se=TRUE,col ="red")

[Figure: three panels showing s(year, 4) against year, s(age, 5) against age, and the partial effect for education.]

Model comparison with ANOVA: Model 2 is preferred
gam.m1=gam(wage~s(age ,5) +education ,data=Wage)
gam.m2=gam(wage~year+s(age ,5)+education ,data=Wage)
gam.m3=gam(wage~s(year, 4) + s(age, 5)+education,data=Wage)
anova(gam.m1, gam.m2 ,gam.m3,test="F")
## Analysis of Deviance Table
##
## Model 1: wage ~ s(age, 5) + education
## Model 2: wage ~ year + s(age, 5) + education
## Model 3: wage ~ s(year, 4) + s(age, 5) + education
##   Resid. Df Resid. Dev Df Deviance       F    Pr(>F)
## 1      2990    3711731
## 2      2989    3693842  1  17889.2 14.4771 0.0001447 ***
## 3      2986    3689770  3   4071.1  1.0982 0.3485661
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
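
The F column of this table can be checked by hand, which also shows why Model 2 is preferred: adding a linear year term gives a large F, while replacing it with a spline in year does not. The sketch below uses only the numbers printed in the table. It assumes the dispersion is estimated from the largest model (Model 3), which appears consistent with the reported values; results differ from the table in the later decimals because the printed deviances are rounded.

```r
# Values copied from the Analysis of Deviance Table
resid_dev <- c(3711731, 3693842, 3689770)   # Models 1, 2, 3
resid_df  <- c(2990, 2989, 2986)

# Assumed dispersion estimate: residual deviance of Model 3 over its df
scale <- resid_dev[3] / resid_df[3]

# Drop in deviance and in degrees of freedom between successive models
dev_drop <- -diff(resid_dev)
df_drop  <- -diff(resid_df)

# F statistic and upper-tail p-value for each comparison (2 vs 1, 3 vs 2)
f_stat <- (dev_drop / df_drop) / scale
p_val  <- pf(f_stat, df_drop, resid_df[3], lower.tail = FALSE)

print(round(f_stat, 2))
print(signif(p_val, 2))
```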

Prediction on training set


preds=predict(gam.m2,newdata =Wage)
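
One possible use of these predictions is a training-set error summary. The sketch below refits Model 2 so that it runs on its own and computes the root mean squared error; the name train_rmse is introduced here for illustration and is not part of the lab. Because it is measured on the data used for fitting, it is likely to understate the error on new data.

```r
library(ISLR)
library(gam)

# Refit Model 2 so the block is self-contained
gam.m2 <- gam(wage ~ year + s(age, 5) + education, data = Wage)
preds  <- predict(gam.m2, newdata = Wage)

# Root mean squared error on the training data
train_rmse <- sqrt(mean((Wage$wage - preds)^2))
print(train_rmse)
```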

GAM using local regression


gam.lo=gam(wage~s(year,df=4)+lo(age,span =0.7)+education,data=Wage)
summary(gam.lo)
##
## Call: gam(formula = wage ~ s(year, df = 4) + lo(age, span = 0.7) +
##     education, data = Wage)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max
## -116.997  -19.319   -3.753   14.121  214.445
##
## (Dispersion Parameter for gaussian family taken to be 1243.534)
##
##     Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3716672 on 2988.797 degrees of freedom
## AIC: 29903.95
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
##                         Df  Sum Sq Mean Sq F value    Pr(>F)
## s(year, df = 4)        1.0   25188   25188  20.255 7.037e-06 ***
## lo(age, span = 0.7)    1.0  195537  195537 157.243 < 2.2e-16 ***
## education              4.0 1101825  275456 221.511 < 2.2e-16 ***
## Residuals           2988.8 3716672    1244
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
##                     Npar Df Npar F  Pr(F)
## (Intercept)
## s(year, df = 4)         3.0  1.103 0.3464
## lo(age, span = 0.7)     1.2 88.835 <2e-16 ***
## education
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

par(mfrow=c(1,3))
plot.gam(gam.lo, se=TRUE, col="green")

[Figure: three panels showing s(year, df = 4) against year, lo(age, span = 0.7) against age, and the partial effect for education.]

GAM with interaction term


gam.lo.i=gam(wage~lo(year,age, span=0.5) + education,data=Wage)
summary(gam.lo.i)

##
## Call: gam(formula = wage ~ lo(year, age, span = 0.5) + education, data = Wage)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max
## -121.293  -19.659   -3.303   13.911  213.067
##
## (Dispersion Parameter for gaussian family taken to be 1234.897)
##
##     Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3688928 on 2987.235 degrees of freedom
## AIC: 29884.6
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
##                               Df  Sum Sq Mean Sq F value    Pr(>F)
## lo(year, age, span = 0.5)    2.0  217479  108740  88.056 < 2.2e-16 ***
## education                    4.0 1074786  268696 217.586 < 2.2e-16 ***
## Residuals                 2987.2 3688928    1235
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
##                           Npar Df Npar F     Pr(F)
## (Intercept)
## lo(year, age, span = 0.5)     5.8 23.227 < 2.2e-16 ***
## education
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

library(akima)
par(mfrow=c(1,2))
plot(gam.lo.i)

[Figure: two panels showing the fitted surface lo(year, age, span = 0.5) over year and age, and the partial effect for education.]

Logistic Regression GAM


gam.lr=gam(I(wage >250)~year+s(age ,df =5)+education,family =binomial ,data=Wage)
summary(gam.lr)
##
## Call: gam(formula = I(wage > 250) ~ year + s(age, df = 5) + education,
##     family = binomial, data = Wage)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max
## -0.58206 -0.26780 -0.12341 -0.08241  3.31242
##
## (Dispersion Parameter for binomial family taken to be 1)
##
##     Null Deviance: 730.5345 on 2999 degrees of freedom
## Residual Deviance: 602.4588 on 2989 degrees of freedom
## AIC: 624.4586
##
## Number of Local Scoring Iterations: 16
##
## Anova for Parametric Effects
##                  Df  Sum Sq Mean Sq F value  Pr(>F)
## year              1    0.48  0.4845  0.5995 0.43883
## s(age, df = 5)    1    3.83  3.8262  4.7345 0.02964 *
## education         4   65.81 16.4514 20.3569 < 2e-16 ***
## Residuals      2989 2415.55  0.8081
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
##                Npar Df Npar Chisq  P(Chi)
## (Intercept)
## year
## s(age, df = 5)       4     10.364 0.03472 *
## education
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

par(mfrow =c(1,3))
plot(gam.lr,se=T,col ="green")

[Figure: three panels showing the partial effect for year, s(age, df = 5) against age, and the partial effect for education.]
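
The panels for the logistic fit are drawn on the scale of the linear predictor (log-odds), so it can help to look at fitted probabilities as well. The sketch below refits the model so that it runs on its own and requests predictions with type = "response"; the name prob_high is introduced here for illustration only.

```r
library(ISLR)
library(gam)

# Refit the logistic GAM so the block is self-contained
gam.lr <- gam(I(wage > 250) ~ year + s(age, df = 5) + education,
              family = binomial, data = Wage)

# type = "response" returns fitted probabilities rather than log-odds
prob_high <- predict(gam.lr, newdata = Wage, type = "response")
print(summary(prob_high))
```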

table(education ,I(wage >250) )


##
## education            FALSE TRUE
##   1. < HS Grad         268    0
##   2. HS Grad           966    5
##   3. Some College      643    7
##   4. College Grad      663   22
##   5. Advanced Degree   381   45
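
Turning these counts into proportions makes the problem with the lowest education category explicit: its observed share of high earners is exactly zero. The sketch below re-enters the counts from the table by hand, so it does not need the Wage data; the object names are illustrative.

```r
# Counts copied from the table of education by I(wage > 250)
counts <- matrix(c(268,  0,
                   966,  5,
                   643,  7,
                   663, 22,
                   381, 45),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("1. < HS Grad", "2. HS Grad",
                                   "3. Some College", "4. College Grad",
                                   "5. Advanced Degree"),
                                 c("FALSE", "TRUE")))

# Share of each education group with wage > 250
prop_high <- counts[, "TRUE"] / rowSums(counts)
print(round(prop_high, 4))
```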

Remove < HS Grad since no one in this category has wage > 250
gam.lr.s=gam (I(wage >250)~year+s(age ,df=5)+education,family = binomial ,
data=Wage,subset =( education !="1. < HS Grad"))
summary(gam.lr.s)
##
## Call: gam(formula = I(wage > 250) ~ year + s(age, df = 5) + education,
##     family = binomial, data = Wage, subset = (education != "1. < HS Grad"))
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -0.5821 -0.2760 -0.1415 -0.1072  3.3124
##
## (Dispersion Parameter for binomial family taken to be 1)
##
##     Null Deviance: 715.5412 on 2731 degrees of freedom
## Residual Deviance: 602.4588 on 2722 degrees of freedom
## AIC: 622.4586
##
## Number of Local Scoring Iterations: 11
##
## Anova for Parametric Effects
##                  Df  Sum Sq Mean Sq F value    Pr(>F)
## year              1    0.48  0.4845  0.5459   0.46004
## s(age, df = 5)    1    3.83  3.8262  4.3116   0.03795 *
## education         3   65.80 21.9339 24.7166 8.933e-16 ***
## Residuals      2722 2415.55  0.8874
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
##                Npar Df Npar Chisq  P(Chi)
## (Intercept)
## year
## s(age, df = 5)       4     10.364 0.03472 *
## education
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

par(mfrow=c(1,3))
plot(gam.lr.s, se=T, col="green")

[Figure: three panels showing the partial effect for year, s(age, df = 5) against age, and the partial effect for education (2. HS Grad through 5. Advanced Degree).]

Reference:
James, Gareth, et al. An Introduction to Statistical Learning with
Applications in R. New York: Springer, 2013.
