An Introduction to Statistical Learning:

5.3 Lab: Cross-Validation and the Bootstrap

Load the ISLR package to get started:

library(ISLR)

5.3.1 The Validation Set Approach

In this exercise we use the Auto dataset, which contains 392 observations. We can use head() to see what the dataset looks like.

attach(Auto)
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

For this exercise, we first select a random sample of 196 out of 392 observations. We initialize the random number generator with a seed using set.seed() to ensure that repeated runs produce consistent results.

set.seed(1)
train <- sample(392, 196)
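
As a quick check (not part of the original lab), train holds 196 row indices drawn without replacement, and the negative index -train selects the remaining observations that will serve as the validation set:

length(train)              # 196 training indices
length(Auto$mpg[-train])   # 196 held-out validation observations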

We then estimate the effect of horsepower on mpg by fitting a linear regression model with lm() on the selected training subset.

lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)

Next, we calculate the mean squared error (MSE) for the 196 observations in the validation set. The training observations are excluded from the MSE calculation using the negative index -train.

mse <- mean((mpg - predict(lm.fit, Auto))[-train]^2)
mse
## [1] 26.14142

The estimated test MSE for the linear model is approximately 26.14. We can also fit higher-degree polynomials with the poly() function. First, let's try a quadratic model.

lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
mse2 <- mean((mpg - predict(lm.fit2, Auto))[-train]^2)
mse2
## [1] 19.82259

Quadratic regression performs better than the linear model and reduces the test MSE to approximately 19.82. Let's also try a cubic model.

lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
mse3 <- mean((mpg - predict(lm.fit3, Auto))[-train]^2)
mse3
## [1] 19.78252

We can fit these models on a different subset of training observations by initializing the random number generator with a different seed.

set.seed(2)
train <- sample(392, 196)
lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)

mean((mpg - predict(lm.fit, Auto))[-train]^2)
## [1] 23.29559
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
mean((mpg - predict(lm.fit2, Auto))[-train]^2)
## [1] 18.90124
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
mean((mpg - predict(lm.fit3, Auto))[-train]^2)
## [1] 19.2574

The estimated test errors differ slightly from those for our initial training sample, but the results are consistent with the previous findings: a quadratic model performs better than a linear model, and the cubic model offers no meaningful improvement. This split-to-split variability is a known drawback of the validation set approach, as the sketch below illustrates.
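
Here is a minimal sketch (not from the book) that repeats the comparison for polynomial degrees 1 through 3 across a few random splits; each row of val.mse corresponds to one seed.

val.mse <- matrix(NA, nrow = 3, ncol = 3,
                  dimnames = list(paste("seed", 1:3), paste("degree", 1:3)))
for (s in 1:3) {
    set.seed(s)
    train <- sample(392, 196)
    for (d in 1:3) {
        fit <- lm(mpg ~ poly(horsepower, d), data = Auto, subset = train)
        val.mse[s, d] <- mean((Auto$mpg - predict(fit, Auto))[-train]^2)
    }
}
val.mse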

5.3.2 Leave-One-Out Cross-Validation

The glm() function offers a generalization of the linear model, allowing for link functions and error distributions other than the Gaussian. If no family argument is supplied, glm() uses family = gaussian and fits a linear model identical to the one estimated with lm().

glm.fit <- glm(mpg ~ horsepower, data = Auto)
coef(glm.fit)
## (Intercept)  horsepower 
##  39.9358610  -0.1578447
lm.fit <- lm(mpg ~ horsepower, data = Auto)
coef(lm.fit)
## (Intercept)  horsepower 
##  39.9358610  -0.1578447
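
As a quick illustration of that default (a sketch, not part of the original lab), passing family = gaussian explicitly should produce the same coefficient estimates:

glm.gauss <- glm(mpg ~ horsepower, data = Auto, family = gaussian)
coef(glm.gauss)   # identical to the coefficients from glm() and lm() above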

The glm() function can be used together with cv.glm() from the boot package to estimate the cross-validation prediction error. By default, cv.glm() sets K equal to the number of observations, so it performs leave-one-out cross-validation (LOOCV).

library(boot)
glm.fit <- glm(mpg ~ horsepower, data = Auto)
cv.err <- cv.glm(Auto, glm.fit)
cv.err$delta
## [1] 24.23151 24.23114

The value returned by cv.glm() contains a delta vector with two components: the raw cross-validation estimate of prediction error and the adjusted cross-validation estimate, respectively. For a least-squares fit, the raw LOOCV estimate can also be computed from a single fit, as sketched below.
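
This is a minimal sketch (not part of the original lab) of the LOOCV shortcut for least squares (Equation 5.2 in the text): the estimate is obtained from one fit using the leverage values, without refitting the model n times.

h <- hatvalues(glm.fit)   # leverage of each observation from the single fit
loocv.shortcut <- mean(((Auto$mpg - fitted(glm.fit)) / (1 - h))^2)
loocv.shortcut            # should match the raw estimate cv.err$delta[1]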

We can repeat this process in a for() loop to compare the cross-validation errors of higher-order polynomials. The following example fits polynomials of degrees 1 through 5 and stores the LOOCV error of each fit in the cv.error vector.

cv.error <- rep(0, 5)
for (i in 1:5) {
    glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
    cv.error[i] <- cv.glm(Auto, glm.fit)$delta[1]
}
cv.error
## [1] 24.23151 19.24821 19.33498 19.42443 19.03321

5.3.3 k-Fold Cross-Validation

In addition to LOOCV, cv.glm() can also be used to run k-fold cross-validation. In the following example, we estimate the cross-validation error for polynomials of degrees 1 through 10 using 10-fold cross-validation (K = 10).

set.seed(17)
cv.error.10 <- rep(0, 10)
for (i in 1:10) {
    glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
    cv.error.10[i] <- cv.glm(Auto, glm.fit, K = 10)$delta[1]
}
cv.error.10
##  [1] 24.20520 19.18924 19.30662 19.33799 18.87911 19.02103 18.89609
##  [8] 19.71201 18.95140 19.50196

With both LOOCV and 10-fold cross-validation, the quadratic model yields a lower estimated test error than the linear model, while cubic and higher-order polynomials offer no significant further improvement. The plotting sketch below overlays the two sets of estimates.
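
A minimal plotting sketch (not part of the original lab), using the cv.error and cv.error.10 vectors computed above:

plot(1:10, cv.error.10, type = "b", col = "red",
     xlab = "Degree of polynomial", ylab = "Estimated CV error",
     ylim = range(c(cv.error, cv.error.10)))
lines(1:5, cv.error, type = "b", col = "blue")
legend("topright", legend = c("10-fold CV", "LOOCV"), col = c("red", "blue"), lty = 1)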

5.3.4 The Bootstrap

In order to perform a bootstrap analysis, we first create a function, alpha.fn(), for estimating α, the fraction of wealth invested in X that minimizes the variance of the portfolio αX + (1 - α)Y. In terms of sample moments, α = (Var(Y) - Cov(X, Y)) / (Var(X) + Var(Y) - 2 Cov(X, Y)), which is exactly what the function below computes for the observations selected by index.

alpha.fn <- function(data, index) {
    X <- data$X[index]
    Y <- data$Y[index]
    return((var(Y) - cov(X, Y))/(var(X) + var(Y) - 2 * cov(X, Y)))
}

The following example estimates α using observations 1 through 100 of the Portfolio dataset, which contains 100 observations in total.

alpha.fn(Portfolio, 1:100)
## [1] 0.5758321

A bootstrap sample can also be constructed with the sample() function, this time drawing 100 observations with replacement, and α re-estimated on it.

set.seed(1)
alpha.fn(Portfolio, sample(100, 100, replace = T))
## [1] 0.5963833
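
Before turning to boot(), here is a minimal sketch (not from the book) of repeating that resampling by hand: draw many bootstrap samples, recompute the estimate of α for each, and use the standard deviation of the estimates as the bootstrap standard error.

set.seed(1)
alpha.boot <- replicate(1000, alpha.fn(Portfolio, sample(100, 100, replace = TRUE)))
sd(alpha.boot)   # should be close to the std. error reported by boot() below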

Instead of manually repeating this procedure with different samples from our dataset, we can automate this process with the boot() function as shown below.

boot(Portfolio, alpha.fn, R = 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Portfolio, statistic = alpha.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##      original        bias    std. error
## t1* 0.5758321 -7.315422e-05  0.08861826
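
As an aside (a sketch, not part of the original lab), saving the boot() result to an object lets boot.ci() turn the same replicates into bootstrap confidence intervals for α, for example a percentile interval:

set.seed(1)
boot.out <- boot(Portfolio, alpha.fn, R = 1000)
boot.ci(boot.out, type = "perc")   # 95% percentile interval (conf = 0.95 by default)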

We can apply the same bootstrap approach to the Auto dataset by creating a function, boot.fn(), that fits a linear model and returns its coefficient estimates.

boot.fn <- function(data, index) 
  return(coef(lm(mpg ~ horsepower, data = data, subset = index)))
boot.fn(Auto, 1:392)
## (Intercept)  horsepower 
##  39.9358610  -0.1578447

We can run this manually on different samples from the dataset.

set.seed(1)
boot.fn(Auto, sample(392, 392, replace = T))
## (Intercept)  horsepower 
##  38.7387134  -0.1481952
boot.fn(Auto, sample(392, 392, replace = T))
## (Intercept)  horsepower 
##  40.0383086  -0.1596104

We can also automate this by using boot() to recompute the coefficient estimates on 1000 bootstrap replicates of the dataset.

boot(Auto, boot.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##       original      bias    std. error
## t1* 39.9358610  0.02972191 0.860007896
## t2* -0.1578447 -0.00030823 0.007404467

The summary() function can be used to compute formula-based standard errors for the regression coefficients.

summary(lm(mpg ~ horsepower, data = Auto))$coef
##               Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) 39.9358610 0.717498656  55.65984 1.220362e-187
## horsepower  -0.1578447 0.006445501 -24.48914  7.031989e-81

The bootstrap standard errors differ somewhat from the formula-based ones because the usual formulas assume the linear model is correct and estimate the noise variance from its residuals, whereas the bootstrap makes no such assumptions. Finally, we redefine the bootstrap function to use a quadratic model and compare the standard errors obtained from the bootstrap to those reported by summary(); because the quadratic model fits the data better, the two sets of estimates agree more closely.

boot.fn <- function(data, index) 
  coefficients(lm(mpg ~ horsepower + I(horsepower^2), data = data, subset = index))

set.seed(1)
boot(Auto, boot.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original        bias     std. error
## t1* 56.900099702  6.098115e-03 2.0944855842
## t2* -0.466189630 -1.777108e-04 0.0334123802
## t3*  0.001230536  1.324315e-06 0.0001208339
summary(lm(mpg ~ horsepower + I(horsepower^2), data = Auto))$coef
##                     Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept)     56.900099702 1.8004268063  31.60367 1.740911e-109
## horsepower      -0.466189630 0.0311246171 -14.97816  2.289429e-40
## I(horsepower^2)  0.001230536 0.0001220759  10.08009  2.196340e-21