Stats 401: Homework 2
Introduction
In this homework assignment, I’ll try to lead you through coding up algorithms for bootstrap, cross-validation, and tree-making from scratch. The hope is that by coding these algorithms by hand, you’ll gain a deeper understanding of how these tools work.
1) Bootstrap
Bootstrapping refers to the technique of resampling our sample with replacement. This is equivalent to creating infinite duplicates of our sample to create a “pseudo-population.” We then draw a random sample the same size as our original sample from the pseudo-population.
We often use bootstrap to get an estimate of the standard error (the standard deviation of the sampling distribution) of a statistic when we do not have access to the actual population.
In this example, we will try to estimate the standard error of the correlation of a sample. We will begin with a synthetic population.
set.seed(1) # for reproducibility
# x is 10,000 random normal values, centered at 100, sd 10, rounded to 1 decimal place
x <- round(rnorm(10^4, 100, 10), 1)
# y is created to be correlated with x, but with error
y <- x + round(rnorm(10^4, 0, 5), 1)
# the correlation between x and y in the population is 0.8986
cor(x, y)
## [1] 0.8986356
# plot(x, y) # optional if you want to see the plot
The population we created has a correlation of 0.8986. In real life, we would not know what the population
actually looks like, and this value of ρ would be unknown to us.
# we randomly sample 100 of these values to create our sample
samp <- sample(10^4, 100)
x_samp <- x[samp] # we subset x and y accordingly
y_samp <- y[samp]
cor(x_samp, y_samp) # the correlation in our sample is 0.8983
## [1] 0.8983157
We have a sample of 100 values. The correlation between x and y in our sample is 0.8983. Without knowing what the population looks like, we may wish to estimate the standard error of this value.
One way we can do this is via bootstrap resampling. To perform bootstrap resampling, we resample our sample with replacement. In our case, there are 100 observations in our sample. So we will sample the
integers 1:100 with replacement to choose which values will be in our bootstrap sample.
We subset our current sample accordingly and calculate the resulting correlation.
We repeat this many times (say, 10,000) and record the correlation for each bootstrap sample. This effectively builds up a bootstrap sampling distribution of the correlation. When we take the standard deviation of all those values, we have an estimate of the standard error.
# one instance
set.seed(1)
index <- sample(100, replace = TRUE)
x_boot <- x_samp[index]
y_boot <- y_samp[index]
cor(x_boot, y_boot)
## [1] 0.8761554
# your turn ...
# Perform 10,000 replicates of bootstrap
# For each replicate, calculate the resulting correlation, and store that value.
# So everyone gets the same results, make a for loop, and
# include the line set.seed(i) before you perform the sampling
R <- rep(NA, 10000)
for(i in 1:10000){
  # write your code here ...
  # ...
}
hist(R)
## Error in hist.default(R): 'x' must be numeric
sd(R)
## [1] NA
mean(R)
## [1] NA
# after performing the replicates, produce a histogram of the different observed correlations
# calculate the standard deviation of the correlation estimates
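For reference, here is one possible way to complete the loop (a sketch, not the only acceptable solution; the setup chunk is repeated so it runs on its own):

```r
# setup repeated from the earlier chunks so this sketch is self-contained
set.seed(1)
x <- round(rnorm(10^4, 100, 10), 1)
y <- x + round(rnorm(10^4, 0, 5), 1)
samp <- sample(10^4, 100)
x_samp <- x[samp]
y_samp <- y[samp]

R <- rep(NA, 10000)
for (i in 1:10000) {
  set.seed(i)                               # so everyone gets the same results
  index <- sample(100, replace = TRUE)      # resample positions 1:100 with replacement
  R[i] <- cor(x_samp[index], y_samp[index]) # correlation of this bootstrap sample
}
hist(R)  # the bootstrap distribution of the correlation
sd(R)    # bootstrap estimate of the standard error
mean(R)  # center of the bootstrap distribution
```

Note that the resampling happens on the index positions, not on x and y separately, so each bootstrap observation keeps its (x, y) pair intact.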
2) Cross-Validation
We can use cross-validation to evaluate the predictive performance of several competing models.
Again, we will manually implement cross-validation from scratch first, and then use the built-in function in R.
We will use the dataset ironslag from the package DAAG (a companion library for the textbook Data Analysis and Graphics in R).
The description of the data is as follows: The iron content of crushed blast-furnace slag can be determined by a chemical test at a laboratory or estimated by a cheaper, quicker magnetic test. These data were collected to investigate the extent to which the results of a chemical test of iron content can be predicted from a magnetic test of iron content, and the nature of the relationship between these quantities. [Hand, D.J., Daly, F., et al. (1993) A Handbook of Small Data Sets]
The ironslag data has 53 observations, each with two values - the measurement using the chemical test and the measurement from the magnetic test.
We can start by fitting a linear regression model Y = β0 + β1 X + ϵ . A quick look at the scatterplot seems to indicate that the data may not be linear.
# install.packages("DAAG") # if necessary
library(DAAG)
x <- seq(10, 40, .1) # a sequence used to plot lines
L1 <- lm(magnetic ~ chemical, data = ironslag)
plot(ironslag$chemical, ironslag$magnetic, main = "Linear fit", pch = 16)
yhat1 <- L1$coef[1] + L1$coef[2] * x
lines(x, yhat1, lwd = 2, col = "blue")
[Plot: “Linear fit” — scatterplot of ironslag$magnetic against ironslag$chemical with the fitted regression line]
In addition to the linear model, fit the following models that predict the magnetic measurement (Y) from the chemical measurement (X).
Quadratic: Y = β0 + β1 X + β2 X^2 + ϵ
Exponential: log(Y) = β0 + β1 X + ϵ, equivalent to Y = exp(β0 + β1 X + ϵ)
log-log: log(Y) = β0 + β1 log(X ) + ϵ
# I've started each of these for you.
# Your job is to create the plots with fitted lines.
L2 <- lm(magnetic ~ chemical + I(chemical^2), data = ironslag)
L3 <- lm(log(magnetic) ~ chemical, data = ironslag)
# when plotting the fitted line for this one, create estimates of log(y-hat) linearly
# then exponentiate log(y-hat)
L4 <- lm(log(magnetic) ~ log(chemical), data = ironslag)
# for this one, use plot(log(chemical), log(magnetic))
# the y-axis is now on the log-scale, so you can create and plot log(y-hat) directly
# just remember that you'll use log(x) rather than x directly
Leave-one-out Cross validation
We will first attempt leave-one-out cross validation. In LOOCV, we remove one data point from our data set. We fit the model to the remaining 52 data points. With this model, we make a prediction for the left-out point. We then compare that prediction to the actual value to calculate the squared error. Once we find the squared error for all 53 points, we can take the mean to get a cross-validation error estimate of that model.
To test out our four models, we will build a loop that will remove one point of data at a time. Thus, we will
make a for(i in 1:53) loop. For each iteration of the loop, we will fit the four models on the remaining 52 data points, and make a prediction for the remaining point.
# create vectors to store the validation errors for each model
# error_model1 <- rep(NA, 53)
# ...
error_model1 <- rep(NA, 53)
for(i in 1:53){
  # write a line to select the ith line in the data
  # store this line as the 'test' case
  # store the remaining as the 'training' data
  # fit the four models and calculate the prediction error for each one
  # hint: it will be in the form
  # model1 <- lm(magnetic ~ chemical, data = training)
  # fitted_value <- predict(model1, test_case)
  # error_model1[i] <- test_case_actual_value - fitted_value
  # ...
  # model2 <-
  # ...
  # ...
}
# once all of the errors have been calculated, find the mean squared error
# mean(error_model1^2)
mean(error_model1^2)
## [1] NA
Compare the sizes of the cross validation error to help you decide which model does the best job of predicting the test cases.
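Shown for the linear model only, the LOOCV loop can be filled in roughly like this (a sketch; the other three models follow the same pattern, but for L3 and L4 remember to exponentiate the prediction back to the original scale before computing the error):

```r
library(DAAG) # for the ironslag data

error_model1 <- rep(NA, 53)
for (i in 1:53) {
  test_case <- ironslag[i, ]   # the held-out observation
  training  <- ironslag[-i, ]  # the remaining 52 observations
  model1 <- lm(magnetic ~ chemical, data = training)
  fitted_value <- predict(model1, test_case)
  error_model1[i] <- test_case$magnetic - fitted_value
}
mean(error_model1^2) # LOOCV estimate of the mean squared prediction error
```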
Using the same form, perform 6-fold cross validation. I have already created the list of cases to include in each fold.
set.seed(1)
shuffled <- sample(53) # shuffles the values 1 to 53
cases <- list(
test1 = shuffled[ 1:9], # the first 9 go into test set #1
test2 = shuffled[10:18],
test3 = shuffled[19:27],
test4 = shuffled[28:36],
test5 = shuffled[37:45],
test6 = shuffled[46:53]
)
# There are 6 test sets, each with 9 cases, except test set #6, which has 8
# you can access each vector in the cases list via cases[[1]], cases[[2]], etc.
for(i in 1:6){
# write your code here .
# it will be quite similar to the code you wrote for LOOCV
}
Again, compare the sizes of the cross validation error to help you decide which model does the best job of predicting the test cases.
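One way to write the fold loop, again shown for the linear model only (a sketch; the fold assignments are repeated so the chunk runs on its own):

```r
library(DAAG) # for the ironslag data

# fold assignments repeated from above so this sketch is self-contained
set.seed(1)
shuffled <- sample(53)
cases <- list(test1 = shuffled[1:9],   test2 = shuffled[10:18],
              test3 = shuffled[19:27], test4 = shuffled[28:36],
              test5 = shuffled[37:45], test6 = shuffled[46:53])

error_model1 <- rep(NA, 53)
for (i in 1:6) {
  fold <- cases[[i]]             # row numbers held out in this fold
  training <- ironslag[-fold, ]  # fit on the other five folds
  model1 <- lm(magnetic ~ chemical, data = training)
  fitted_values <- predict(model1, ironslag[fold, ])
  error_model1[fold] <- ironslag$magnetic[fold] - fitted_values
}
mean(error_model1^2) # 6-fold CV estimate for model 1
```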
Cross-validation with R
Now that you have written your cross-validation script from scratch, we can use the built-in functions in R. The library boot has the function cv.glm(), which can be used to estimate cross-validation error.
To make use of cv.glm() on the linear models, we must first use glm() to fit a generalized linear model to our data. If you do not change the argument “family” in the function glm(), it will fit a linear model.
# library(boot)
L1 <- glm(magnetic ~ chemical, data = ironslag) # equivalent to lm(magnetic ~ chemical)
L1_loocv <- cv.glm(ironslag, L1)$delta
## Error in cv.glm(ironslag, L1): could not find function "cv.glm"
L1_6foldcv <- cv.glm(ironslag, L1, K = 6)$delta
## Error in cv.glm(ironslag, L1, K = 6): could not find function "cv.glm"
# find the LOOCV and 6-fold CV values for the other three models
Your LOOCV estimates from cv.glm() should match your estimates when you coded your algorithm from scratch. The 6-fold CV estimates will vary because row selection will vary due to random sampling.
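For the quadratic model, for instance, the calls look like this (a sketch; note that L3 and L4 have log(magnetic) as the response, so their delta values are on the log scale and are not directly comparable to the errors you computed on the original scale):

```r
library(boot) # provides cv.glm()
library(DAAG) # for the ironslag data

L2 <- glm(magnetic ~ chemical + I(chemical^2), data = ironslag)
L2_loocv   <- cv.glm(ironslag, L2)$delta        # LOOCV (the default, K = n)
L2_6foldcv <- cv.glm(ironslag, L2, K = 6)$delta # 6-fold CV
L2_loocv
```

The first element of delta is the raw cross-validation estimate; the second is a bias-adjusted version.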
3) Decision Trees from Scratch
We will start by writing an algorithm to make a decision tree from scratch.
A decision tree will consider all possible splits, and will make a split that results in the smallest possible sum of squared residuals (SSR). It then recursively applies splits onto the resulting partition. We will write a short loop that will consider all possible splits for a chosen variable, and keep track of the resulting SSR.
We look at the Auto dataset from the package ISLR2. We will try to model the variable mpg (gas mileage) based on the other variables. The variable origin is categorical, and the variable name cannot be used as a predictor. For this first example, we will just consider the variable displacement.
# install.packages("ISLR2") # install the package if necessary
library(ISLR2)
data(Auto)
# create empty vectors to store the resulting SSR and boundary after each split
total_ssr <- rep(NA, nrow(Auto) - 1)
boundaries <- rep(NA, nrow(Auto) - 1)
for(i in 1:(nrow(Auto) - 1)) {
  # we will go through every possible split, which can occur between any two observations
  val1 <- sort(Auto$displacement)[i]     # we sort the values of displacement and select the ith value
  val2 <- sort(Auto$displacement)[i + 1] # we also select the next value
  boundary <- mean(c(val1, val2))        # we take the mean to find our boundary value
  # split the mpg value data into two groups
  # 1) those that correspond to displacement being less than the boundary
  # 2) those that correspond to displacement being greater than the boundary
  # find the mean of both groups
  # calculate the SSR for each group
  # add the SSRs from both groups together and store the resulting value in total_ssr
  # also store the corresponding boundary value to the vector boundaries
}
plot(total_ssr, type = "l")
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Error in plot.window(...): need finite 'ylim' values
# find the boundary that minimizes total SSR after the split
# use the boundary value to split the data into Auto1 and Auto2
# repeat the above process on just Auto1 to find where to make another split
plot(total_ssr, type = "l")
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Error in plot.window(...): need finite 'ylim' values
(minimizer <- which(total_ssr == min(total_ssr)))
## integer(0)
(val <- sort(Auto$displacement)[minimizer])
## numeric(0)
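One possible completion of the split-search loop (a sketch; ties in displacement make some candidate boundaries redundant, which is harmless here):

```r
library(ISLR2)
data(Auto)

total_ssr  <- rep(NA, nrow(Auto) - 1)
boundaries <- rep(NA, nrow(Auto) - 1)
disp_sorted <- sort(Auto$displacement)
for (i in 1:(nrow(Auto) - 1)) {
  boundary <- mean(c(disp_sorted[i], disp_sorted[i + 1]))
  left  <- Auto$mpg[Auto$displacement <  boundary]  # group 1
  right <- Auto$mpg[Auto$displacement >= boundary]  # group 2
  # SSR within each group, measured around that group's own mean
  total_ssr[i]  <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
  boundaries[i] <- boundary
}
boundaries[which.min(total_ssr)] # the best first split on displacement
```

This boundary should agree with the first split rpart reports below (190.5).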
Compare the results of your partitioning code to the results of calling rpart() on the Auto data.
library(rpart)
rpart(mpg ~ displacement, data = Auto)
## n= 392
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 392 23818.9900 23.44592
##   2) displacement>=190.5 170 2210.1880 16.66000
##     4) displacement>=284.5 98  662.3563 14.70612 *
##     5) displacement< 284.5 72  664.4728 19.31944 *
##   3) displacement< 190.5 222 7785.9020 28.64234
##     6) displacement>=112.5 102 2046.9750 25.31863 *
##     7) displacement< 112.5 120 3654.3430 31.46750
##      14) displacement>=93.5 65 1338.6510 30.11692 *
##      15) displacement< 93.5 55 2057.0070 33.06364
##        30) displacement< 80.5 16  541.5144 29.13125 *
##        31) displacement>=80.5 39 1166.5690 34.67692 *
# ^^ building a tree, and considering partition only on the variable displacement
4) Decision Trees with R
Create a decision tree on the Auto data using all the numeric predictors with the function rpart(). Once you make the tree, use cv.tree() or plotcp() to see if pruning the tree will improve prediction performance.
Plot the resulting tree with the text labels.
We did not bother splitting the data between a training and test set. Still, use the resulting tree to make predictions on the data, to get an estimate of the mean squared error.
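A sketch of how this section might go (the choice to keep origin as a numeric code, rather than converting it to a factor, is one reasonable option and not the only acceptable answer):

```r
library(ISLR2)
library(rpart)
data(Auto)

# fit a tree on all predictors except name
full_tree <- rpart(mpg ~ . - name, data = Auto)
plotcp(full_tree)   # CV error versus complexity parameter, to judge pruning
plot(full_tree)     # draw the tree
text(full_tree)     # add the text labels

# predictions on the (same) data give an optimistic MSE estimate
preds <- predict(full_tree, newdata = Auto)
mean((Auto$mpg - preds)^2)
```

Because we predict on the training data, this MSE understates the true prediction error; a train/test split or cross-validation would give an honest estimate.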
5) Bagging and Random Forests
Before we try to use the Random Forest functions that are available in R, let’s see how bagging works.
With bagging, we perform bootstrap resampling on the data, and fit separate trees to each of our bootstrapped datasets. At the end, we use each tree to make a prediction, and we average the different predictions.
For our learning example, we will consider building a tree to predict mpg based on the variable displacement.
set.seed(1) # set seed for reproducibility
# The Auto dataset has 392 rows.
# we randomly sample 392 rows with replacement. This forms our first bootstrap sample
boot1 <- Auto[sample(nrow(Auto), replace = TRUE), ]
# We fit a tree to predict mpg from displacement, based on the data in boot1
tree1 <- rpart(mpg ~ displacement, data = boot1)
# our resulting tree:
tree1
## n= 392
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 392 24043.4800 23.61224
##   2) displacement>=190.5 175 2728.9290 16.88914
##     4) displacement>=284.5 91  493.2400 14.50000 *
##     5) displacement< 284.5 84 1153.5470 19.47738 *
##   3) displacement< 190.5 217 7025.4880 29.03410
##     6) displacement>=106 114 2203.3050 26.55000 *
##     7) displacement< 106 103 3340.1220 31.78350
##      14) displacement>=93.5 41  811.3649 29.87073 *
##      15) displacement< 93.5 62 2279.5550 33.04839
##        30) displacement< 82.5 23  710.3661 29.28696 *
##        31) displacement>=82.5 39 1051.8670 35.26667 *
# prediction for mpg for a vehicle with displacement 307
(p1 <- predict(tree1, newdata = Auto[1, ]))
## 1
## 14.5
# We repeat the process and create two additional trees
boot2 <- Auto[sample(nrow(Auto), replace = TRUE), ]
# tree2 <- ...
# (p2 <- ...)
# boot3 <-
# tree3 <-
# tree3
# (p3 <- ...)
# Based on this 'bag' of three trees, what is the predicted mpg of a vehicle
# with displacement 307?
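The remaining two trees follow the pattern of tree1 exactly, and the bagged prediction is just the average of the three individual predictions (a sketch; the setup is repeated from above so the chunk runs on its own):

```r
library(ISLR2)
library(rpart)
data(Auto)

set.seed(1)
boot1 <- Auto[sample(nrow(Auto), replace = TRUE), ]
tree1 <- rpart(mpg ~ displacement, data = boot1)
p1 <- predict(tree1, newdata = Auto[1, ])

boot2 <- Auto[sample(nrow(Auto), replace = TRUE), ]
tree2 <- rpart(mpg ~ displacement, data = boot2)
p2 <- predict(tree2, newdata = Auto[1, ])

boot3 <- Auto[sample(nrow(Auto), replace = TRUE), ]
tree3 <- rpart(mpg ~ displacement, data = boot3)
p3 <- predict(tree3, newdata = Auto[1, ])

mean(c(p1, p2, p3)) # bagged prediction for the displacement-307 vehicle
```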
To do something similar with the function randomForest(), we provide the formula of the tree we want to create and how many trees we want to use. In our case, we make three trees. randomForest() pretty much does the rest. Of course, in real life, we would probably never make a tree with only one predictor, and bagging with only three trees seems rather silly.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
set.seed(1)
bag_Auto <- randomForest(mpg ~ displacement, data = Auto, ntree = 3)
yhat.bag <- predict(bag_Auto, newdata = Auto[1, ])
yhat.bag
## 1
## 14.9
Random Forests
In Random Forests we restrict the variables that a tree can use to make a split at each node. This often
forces a tree to make a sub-optimal split. The result is that we get trees that are less correlated with each other.
Using randomForest(), we can impose this restriction by setting the argument mtry to a value smaller than the total number of variables available. In the Auto data, there are 7 predictors, so setting mtry to any value less than 7 will cause us to grow a Random Forest. We often select mtry to be p/3 or something similar.
# grow a random forest on the Auto data, using mtry = 3 .
# comment on the mean squared residuals compared to the MSE of the single decision tree
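A sketch of the forest (mtry = 3 as suggested; 500 trees is randomForest's default, and the formula choice here is one reasonable option):

```r
library(randomForest)
library(ISLR2)
data(Auto)

set.seed(1)
rf_Auto <- randomForest(mpg ~ . - name, data = Auto, mtry = 3)
rf_Auto$mse[rf_Auto$ntree] # out-of-bag mean of squared residuals
var(Auto$mpg)              # for scale: the variance of mpg itself
```

The out-of-bag MSE should come in well below the in-sample MSE of a single pruned tree, since averaging many decorrelated trees reduces variance.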
2022-10-25