MATH97287
BSc, MSci and MSc EXAMINATIONS (MATHEMATICS)
2021
Introduction to Statistical Learning
1. Suppose in a regression problem we have a response vector Y = (Y1, ..., Yn), an n × p full-rank data matrix X, and an arbitrary (but fixed) p-dimensional vector of parameters β, linked by the following model

Y = Xβ + ε,   (1)

where ε = (ε1, ..., εn) is a vector of independent and identically distributed errors with mean zero and variance σ². Let β̂ be the least squares estimator of β.
(a) Define what is meant by an unbiased linear estimator, β̃, of β. (1 mark)
(b) Suppose β̌ = CY is an unbiased linear estimator of β with C = (X^T X)^{-1} X^T + D, where D is a non-zero p × n matrix. You are given that E(β̌) = (I_p + DX)β, and so β̌ is unbiased if and only if DX = 0.
State and prove the Gauss–Markov theorem for unbiased linear estimators. (4 marks)
(c) Explain why it might be sensible to consider using biased estimators in some circumstances. (1 mark)
(d) This part of the question is about the ridge regression estimator:

β̂_ridge = (X^T X + λ I_p)^{-1} X^T Y,   (2)

which can be obtained as the solution of a penalised least squares problem.
(i) Give an example of a situation where you might use ridge regression. (1 mark)
(ii) Briefly explain (in about five sentences) the Bayesian formulation of ridge regression. State the relationship between the ridge parameter λ, the prior variance τ² of β, and the error variance σ² (do not work out the estimators explicitly).
Explain, in at most a few sentences, the main difference between the Bayesian formulation of ridge regression and that obtained by solving the ridge penalised least squares problem. (5 marks)
(iii) By looking at second derivatives or otherwise, show that the ridge regression estimator
in (2) is at a minimum of the ridge regression objective function. (3 marks)
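As an illustration of part (iii), the minimising property can be checked numerically. The sketch below uses a single hypothetical predictor (so that (2) reduces to a scalar formula) and compares the penalised objective at the ridge estimate with nearby values; the data are invented for illustration only.

```python
# Numerical sketch: for p = 1 the ridge estimator (2) reduces to
# beta_hat = sum(x*y) / (sum(x^2) + lambda). We check that it minimises
# the penalised least squares objective sum((y - x*b)^2) + lambda*b^2.

x = [1.0, 2.0, 3.0, 4.0]    # hypothetical predictor values
y = [1.2, 1.9, 3.2, 3.9]    # hypothetical responses
lam = 0.5                   # ridge parameter

def objective(b):
    return sum((yi - xi * b) ** 2 for xi, yi in zip(x, y)) + lam * b ** 2

beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi ** 2 for xi in x) + lam)

# the objective increases when we move away from beta_hat in either direction
for delta in (-0.1, -0.01, 0.01, 0.1):
    assert objective(beta_hat) < objective(beta_hat + delta)
```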
(e) The employment agency HardWorkerz decides to see whether the results of its applicant
testing programme can predict future salaries. HardWorkerz gave each of 67 applicants four tests Test1, Test2, Test3 and Test4 and collected information on their salaries three years later. The correlation matrix (to two decimal places) between the four Test variables
was
        Test1   Test2   Test3   Test4
[correlation matrix entries lost in extraction; visible values included 0.87 and 1.00]
The salary information was put into the variable Salary and, along with the variables Test1, Test2, Test3 and Test4, was put into a data frame called the.df. They fitted the
following linear model in R using the command
lm(Salary ~ Test1, data=the.df)
and obtained the following table of results
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
[coefficient rows lost in extraction]
They then fitted the model
Salary ~ Test1 + Test2 + Test3 + Test4
and obtained the following results:
[Plot: lasso coefficient profiles against L1 Norm]
Figure 1: Lasso plot from HardWorkerz data set.
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.59968   11.34036   0.317  0.75199
Test1         0.34228    0.12610   2.714  0.00859 **
Test2         0.20430    0.09458   2.160  0.03464 *
Test3         0.21527    0.22359   0.963  0.33940
[Test4 row lost in extraction]
The graphical results from using the glmnet function with the lasso option, and all other arguments set to defaults, based on the possibility of including any of the four explanatory variables, are shown in Figure 1. Interpret the results of these analyses. (5 marks)
(Total: 20 marks)
[Plot: eigenvalues against Sorted Eigenvalue Number]
Figure 2: Eigenvalues from cmdscale() function on eurodist.
2. (a) (i) Explain how the Miles algorithm for least squares monotone linear regression works.
(6 marks)
(ii) Apply the Miles algorithm to the following sequence: 3, 9, 17, 56, 8, 17, 7, 22, 10 and show your working. (2 marks)
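For context, the pooling idea can be sketched in code. The sketch below assumes that the Miles algorithm referred to in part (a) is the usual pool-adjacent-violators procedure for least squares monotone (non-decreasing) fitting; the function name monotone_fit is illustrative.

```python
# Sketch of least squares monotone regression by pooling adjacent
# violators (assumed here to be the procedure the question calls the
# Miles algorithm). Each block stores [sum, count]; whenever a block's
# mean exceeds the next block's mean, the two blocks are merged.

def monotone_fit(values):
    blocks = []                      # list of [total, count]
    for v in values:
        blocks.append([v, 1])
        # merge while the non-decreasing constraint is violated
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit

print(monotone_fit([3, 9, 17, 56, 8, 17, 7, 22, 10]))
# prints [3.0, 9.0, 17.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0]
```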
(b) The eurodist dataset in R contains the distances between 21 European cities in km. The
first four cities and their distances are shown:
Athens Barcelona Brussels Calais
Barcelona 3313
Brussels 2963 1318
Calais 3175 1326 204
...
Classical multidimensional scaling was carried out using the cmdscale() function in R and the corresponding eigenvalue plot was produced as in Figure 2.
(i) For the general case, show how the eigenvalues, such as those in Figure 2, arise from E, a matrix of Euclidean distances. [Hint: you are given that the Euclidean distances arise from some (unknown) n × p centred configuration X with corresponding inner product matrix B = XX^T, that the recovered configuration will be centred, and the formula e²_{m,ℓ} = b_{m,m} + b_{ℓ,ℓ} − 2b_{m,ℓ}.] (6 marks)
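The double-centring step that part (i) asks about can be checked numerically. The sketch below builds a small hypothetical centred configuration X, forms the squared distances from B = XX^T via the hinted formula, and recovers B by double centring (B is the matrix whose eigenvalues appear in plots like Figure 2).

```python
# Sketch with a hypothetical centred configuration: squared Euclidean
# distances satisfy e2[m][l] = b[m][m] + b[l][l] - 2*b[m][l], and double
# centring the matrix of squared distances recovers B = X X^T.

X = [[-1.0, 0.0], [0.5, 1.0], [0.5, -1.0]]   # centred: each column sums to zero
n = len(X)

B = [[sum(X[m][t] * X[l][t] for t in range(len(X[0]))) for l in range(n)]
     for m in range(n)]
E2 = [[B[m][m] + B[l][l] - 2 * B[m][l] for l in range(n)] for m in range(n)]

# double centring: subtract row means and column means, add back grand mean
row = [sum(E2[m]) / n for m in range(n)]
col = [sum(E2[m][l] for m in range(n)) / n for l in range(n)]
grand = sum(row) / n
B_rec = [[-0.5 * (E2[m][l] - row[m] - col[l] + grand) for l in range(n)]
         for m in range(n)]

assert all(abs(B_rec[m][l] - B[m][l]) < 1e-12 for m in range(n) for l in range(n))
```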
(ii) The red dotted line in Figure 2 indicates where one of the eigenvalues is precisely zero. In classical scaling, why is (at least) one of the eigenvalues always zero? (1 mark)
(iii) Using the eigenvalue plot in Figure 2 what dimensionality would you recommend that the scaling solution should be presented in? (1 mark)
(c) Jenks’ natural breaks optimization is the one-dimensional version of k-means clustering. Suppose you have four one-dimensional data points x1 = 1, x2 = 2 + ε, x3 = 3 and x4 = 4, where ε > 0 and ε is considerably smaller than 1. You are told that the true number of clusters is k = 2. Assume that one cluster centre is m1 = x1 = 1 and demonstrate that, depending on which point is initially used as the second cluster centre, Jenks’ algorithm can produce only two possible different allocations of points to clusters. List the possible clusters and the associated cluster means for each of these two different allocations. (4 marks)
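The behaviour described in part (c) can be simulated. The sketch below takes the hypothetical value ε = 0.1 and runs the standard one-dimensional k-means iteration from each possible choice of initial second centre; only two distinct allocations appear.

```python
# Sketch of one-dimensional k-means (k = 2) for the data in part (c),
# with epsilon = 0.1 chosen purely for illustration. m1 starts at x1 = 1
# and each remaining point is tried as the initial second centre.

eps = 0.1
points = [1.0, 2.0 + eps, 3.0, 4.0]

def kmeans_1d(m1, m2, data):
    while True:
        c1 = [p for p in data if abs(p - m1) <= abs(p - m2)]
        c2 = [p for p in data if abs(p - m1) > abs(p - m2)]
        new1, new2 = sum(c1) / len(c1), sum(c2) / len(c2)
        if (new1, new2) == (m1, m2):          # converged
            return c1, c2, m1, m2
        m1, m2 = new1, new2

# collect the distinct final first clusters over all initial second centres
allocations = {tuple(kmeans_1d(1.0, m2, points)[0]) for m2 in points[1:]}
print(allocations)   # only two distinct allocations arise
```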
(Total: 20 marks)
3. (a) Suppose we have a sample of n independent and identically distributed observations, X1, ..., Xn, from a probability density function f : ℝ → [0, ∞). Suppose we estimate f(x) using the kernel density estimator

f̂(x) = (nh)^{-1} Σ_{i=1}^{n} K{(x − X_i)/h},   (3)

where K(y) is the triangular kernel function defined by

K(y) = 1 − |y|, for y ∈ (−1, 1),   (4)

and K(y) = 0 otherwise. Suppose that f(x) is twice-differentiable with continuous second derivative and that all kernel functions below are symmetric about zero. You are given that the expectation of f̂(x) is

E{f̂(x)} = f(x) + C₃ h² f''(x) + o(h³),   (5)

where C₃ = ½ ∫_{−∞}^{∞} v² K(v) dv.
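The estimator (3) with the triangular kernel (4) can be written out directly; the sample below is hypothetical and chosen so the value at the evaluation point is easy to verify by hand.

```python
# Minimal sketch of the kernel density estimator (3) with the triangular
# kernel (4), evaluated on a small hypothetical sample.

def K(y):
    return 1.0 - abs(y) if -1.0 < y < 1.0 else 0.0

def f_hat(x, data, h):
    n = len(data)
    return sum(K((x - xi) / h) for xi in data) / (n * h)

sample = [0.0, 0.5, 1.0]
print(f_hat(0.5, sample, 1.0))   # (1/3)*(K(0.5) + K(0) + K(-0.5)) = 2/3
```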
This question is about deriving an expression for the variance of f̂(x).

(i) Show that

E{f̂(x)} = h^{-1} E[K{(x − X₁)/h}]   (6)

and

var{f̂(x)} = (nh²)^{-1} var[K{(x − X₁)/h}].   (7)

(1 mark)
(ii) Compute the value of C₃ for the triangular kernel function. (2 marks)
(iii) Show that, for the triangular kernel function,

E[K²{(x − X₁)/h}] = (2h/3) f(x) + (h³/30) f''(x) + o(h⁴),   (8)

(8 marks)

and thence

var{f̂(x)} = 2f(x)/(3nh) − f²(x)/n + O(h/n).   (9)

(3 marks)
(b) By properly balancing the squared bias and variance of the kernel density estimator above, the optimal bandwidth can be shown to be

h_opt = { f(x) ∫ K²(v) dv / (4n C₃² f''(x)²) }^{1/5}.   (10)

[You do not have to show this.]
Explain why the optimal bandwidth given in (10) is not of direct practical use. (1 mark)
(c) Let f : ℝ → [0, ∞) be a probability density function. Suppose that {ψ_{j,k}(x)}_{j,k∈ℤ} is an orthonormal wavelet basis of L²(ℝ), with associated father wavelets {φ_{j,k}(x)}_{j,k∈ℤ}, and that we can write, for some integer L > 0,

f(x) = Σ_{k∈ℤ} c_{L,k} φ_{L,k}(x) + Σ_{j∈J_L} Σ_{k∈ℤ} d_{j,k} ψ_{j,k}(x),   (11)

where J_L = {m ∈ ℤ : m ≥ L}, c_{j,k} = ∫_ℝ f(x) φ_{j,k}(x) dx and d_{j,k} = ∫_ℝ f(x) ψ_{j,k}(x) dx, where

φ_{j,k}(x) = 2^{j/2} φ(2^j x − k) and ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k),   (12)

for j ∈ J_L, k ∈ ℤ.
(i) Suppose that we obtain an independent and identically distributed sample X₁, ..., X_n from f(x). Explain why a reasonable estimator of d_{j,k} is d̂_{j,k} = n^{-1} Σ_{i=1}^{n} ψ_{j,k}(X_i). Let ĉ_{j,k} be the equivalent estimator of c_{j,k} using φ_{j,k} rather than ψ_{j,k}. (3 marks)
(ii) Suppose we want to compute d̂_{j,k} and ĉ_{j,k} up to some high-resolution (integer) level J > L. Explain why it is more computationally efficient to use the discrete wavelet transform than to calculate all of the d̂_{j,k} and ĉ_{j,k} directly using the formulae in part (i). (2 marks)
(Total: 20 marks)
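For intuition on the efficiency point in (ii), the cascade structure of the discrete wavelet transform can be sketched in the Haar case (chosen purely for illustration): each pass works on half as many scaling coefficients as the previous one, so the whole transform costs O(n), whereas evaluating every coefficient directly from its defining sum is far more expensive.

```python
# Sketch of the pyramid (cascade) algorithm, Haar case: scaling and
# detail coefficients at each coarser level come from pairwise sums and
# differences of the finer-level scaling coefficients.

import math

def haar_dwt(signal):
    c = list(signal)                 # scaling coefficients at finest level
    details = []
    while len(c) > 1:
        s = [(c[2 * k] + c[2 * k + 1]) / math.sqrt(2) for k in range(len(c) // 2)]
        d = [(c[2 * k] - c[2 * k + 1]) / math.sqrt(2) for k in range(len(c) // 2)]
        details.append(d)
        c = s
    return c[0], details

data = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 2.0, 2.0]
coarse, details = haar_dwt(data)

# orthonormality: total energy is preserved across the transform
energy = coarse ** 2 + sum(x ** 2 for d in details for x in d)
assert abs(energy - sum(v ** 2 for v in data)) < 1e-9
```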
4. (a) State three examples of additive basis expansions that are used in statistical learning. (1 mark)
(b) As part of a statistical learning project a researcher decides to use the circle_data() function to generate synthetic data. The circle_data() function depends on two arguments: an
inner limit ir and an outer limit or. A sample from the circle_data() function involves drawing two random variables, X₁, X₂, each from a uniform distribution on [−or, or]. Then a Bernoulli response Y is drawn according to a conditional probability mass function based on the realization X₁ = x₁, X₂ = x₂ as follows:

P(Y = 1 | x) = { 1,                       if ‖x‖ < ir,
               { 0,                       if ‖x‖ ≥ or,   (13)
               { (or − ‖x‖)/(or − ir),    otherwise,

where ‖x‖ = (x₁² + x₂²)^{1/2} is the usual two-norm.
(i) Draw a sketch diagram of the conditional probability function (13). (2 marks)
(ii) Linear discriminant analysis is a technique that draws a line in the (x₁, x₂) space and allocates y = 1 to points (x₁, x₂) on one side of the line and y = −1 to points on the other side. Why would linear discriminant analysis not be a useful method for constructing a classifier in this example? (1 mark)
(iii) Suppose it is decided to use a classification tree to classify circle data observations in a
supervised learning experiment. Explain what is meant by training and test sets and the
reason for their use in evaluating classifiers. (3 marks)
(iv) A regression tree is defined by partitioning the space (two-dimensional in the case of the circle data) into M disjoint regions R₁, ..., R_M and then forming the model

f(x) = Σ_{m=1}^{M} c_m I(x ∈ R_m),   (14)

where I(A) = 1 if A is true, and 0 otherwise.
Suppose we consider splitting on variable j at split point s (somewhere on X_j) and define the pair of half-planes

R₁(j, s) = {x | X_j < s} and R₂(j, s) = {x | X_j > s}.   (15)

Then we seek the splitting variable j and split point s that solve

min_{j,s} [ min_{c₁} Σ_{x_i ∈ R₁(j,s)} (y_i − c₁)² + min_{c₂} Σ_{x_i ∈ R₂(j,s)} (y_i − c₂)² ].   (16)

For a given j, s derive the inner minimisers in (16) over c₁ and c₂ and label them ĉ₁, ĉ₂. (2 marks)
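To illustrate the split search in (16): the inner minimisers are the region means, and after sorting on the split variable every candidate split point can be scored with running sums of y and y², so the whole scan over s is linear after the initial sort. A sketch with hypothetical data:

```python
# Sketch (hypothetical data) of the fast split search: for a left region
# of m points, sum((y - mean)^2) = sum(y^2) - (sum y)^2 / m, so running
# sums of y and y^2 score every split in O(1) each.

xs = [0.2, 1.5, 0.7, 3.1, 2.4, 2.9]
ys = [1.0, 1.2, 0.8, 3.0, 2.9, 3.2]

pairs = sorted(zip(xs, ys))          # sort once on the split variable
n = len(pairs)
tot = sum(y for _, y in pairs)
tot2 = sum(y * y for _, y in pairs)

best = None
s1 = s2 = 0.0                        # running sums of y and y^2 on the left
for i in range(n - 1):               # split between sorted points i and i+1
    s1 += pairs[i][1]
    s2 += pairs[i][1] ** 2
    m = i + 1
    sse = (s2 - s1 ** 2 / m) + ((tot2 - s2) - (tot - s1) ** 2 / (n - m))
    split = (pairs[i][0] + pairs[i + 1][0]) / 2
    if best is None or sse < best[0]:
        best = (sse, split)

print(best)                          # (smallest total SSE, best split point)
```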
(v) For a given variable j explain how to rapidly compute the best split point s for the following objective function:

Σ_{x_i ∈ R₁(j,s)} (y_i − ĉ₁)² + Σ_{x_i ∈ R₂(j,s)} (y_i − ĉ₂)².   (17)

(2 marks)
(vi) Suppose that we now choose to split variables into three regions with two split points s₁, s₂. Define a suitable new objective function and explain how to minimise it given a variable j. Give a reason why the option of splitting into more than two regions for each variable is not usually done. (3 marks)
(c) (i) Briefly explain the rationale behind the AdaBoost.M1 classifier. (2 marks)
(ii) The AdaBoost.M1 binary classifier can be written as

f(x) = Σ_{m=1}^{M} β_m G_m(x; γ_m),   (18)
where G_m(x; γ_m) is one of the classifiers and G₁(x; γ₁) is the initial ‘weak’ classifier, with G_m(x) ∈ {−1, 1} and γ_m the parameters of the model. The fitted model can be obtained by

{β_m, γ_m}_{m=1}^{M} = arg min Σ_{i=1}^{n} L{ y_i, Σ_{m=1}^{M} β_m G_m(x_i; γ_m) },   (19)

where L{y, f(x)} is the loss function between observation y and model f(x). We assume that the costs of misclassification are equal.
AdaBoost.M1 can be shown to use the loss function L{y, f(x)} = exp{−y f(x)}. Show that the population minimiser is given by

f*(x) = arg min_{f(x)} E_{Y|x}{ e^{−Y f(x)} } = ½ log[ P(Y = 1 | x) / P(Y = −1 | x) ],   (20)

one half of the log-odds ratio of P(Y = 1 | x). Explain why this means that the sign operator applied to f(x) from (18) is the appropriate function to use to build the operational classifier. (4 marks)
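For reference, the standard argument behind (20) can be sketched by writing p = P(Y = 1 | x) and minimising the conditional expected loss pointwise:

```latex
\begin{aligned}
E_{Y|x}\{e^{-Yf(x)}\} &= p\,e^{-f(x)} + (1-p)\,e^{f(x)},\\[2pt]
\frac{\partial}{\partial f(x)} E_{Y|x}\{e^{-Yf(x)}\}
  &= -p\,e^{-f(x)} + (1-p)\,e^{f(x)} = 0
  \;\Longrightarrow\; e^{2f(x)} = \frac{p}{1-p},\\[2pt]
f^*(x) &= \tfrac{1}{2}\log\frac{P(Y=1\mid x)}{P(Y=-1\mid x)}.
\end{aligned}
```

The second derivative p e^{−f(x)} + (1 − p) e^{f(x)} is strictly positive, so this stationary point is a minimum; and since f*(x) > 0 exactly when p > ½, thresholding f(x) at zero with the sign operator recovers the more probable class.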
(Total: 20 marks)
5. (a) This part considers the multiple linear regression model

Y = Xβ + ε,   (21)

where β is a p-vector of unknown parameters, X is an n × p matrix of explanatory variables, Y is an n-vector dependent variable and ε is an n-vector of errors with E(ε_i) = 0, var(ε_i) = σ², and the errors all assumed independent and identically distributed.
Suppose we decide to estimate the parameters β using ridge regression with ridge parameter λ. In this context, explain the concept of cross-validation as a technique for choosing a good value of λ and briefly discuss its pros and cons. Explain what k-fold cross-validation is. What effect might correlation between errors in the model have on the outcome of cross-validation?
What effect might outliers have on the outcome?
(10 marks)
Now suppose we have the model

y_i = f(t_i) + ε_i,   (22)

for i = 1, ..., n, where f is a smooth function f : ℝ → ℝ, {y_i} are observations taken at the points {t_i} and {ε_i} is a set of independent and identically distributed random variables with mean zero and variance σ². Define the full spline smoothing estimator of f(t) for the whole dataset to be f̂_λ(t).
In spline smoothing the ordinary cross-validated estimate of error is defined by

OCV(λ) = n^{-1} Σ_{i=1}^{n} { y_i − f̂_λ^{−i}(t_i) }².   (23)
(i) Explain what f̂_λ^{−i} is. (2 marks)
(ii) It can be shown that the full spline smoothing estimator can be written as

f̂_λ = H(λ) y,   (24)

where f̂_λ = {f̂_λ(t₁), ..., f̂_λ(t_n)}^T and y = (y₁, ..., y_n)^T are both n-vectors and H(λ) is the n × n (hat) matrix depending on λ.
Now define the augmented data set ỹ_i = (ỹ_{i,1}, ỹ_{i,2}, ..., ỹ_{i,n})^T for i = 1, ..., n, where

ỹ_{i,j} = { y_j,             for j ≠ i,
          { f̂_λ^{−i}(t_i),   for j = i,   (25)

for j = 1, ..., n. Let f̃_λ^{−i} be the spline smoothing estimator based on the ỹ_i augmented data set (of n points). You are given that

f̃_λ^{−i}(t_i) = f̂_λ^{−i}(t_i),   (26)

for all i = 1, ..., n.
Show that OCV(λ) can be written alternatively as

OCV(λ) = n^{-1} Σ_{i=1}^{n} [ { y_i − f̂_λ(t_i) } / { 1 − h_{i,i}(λ) } ]²,   (27)

where h_{i,j}(λ) = {H(λ)}_{i,j}. (6 marks)

(iii) What is the advantage of using formula (27) instead of (23)? (2 marks)
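The leave-one-out shortcut in (27) can be checked numerically for any linear smoother with a hat matrix as in (24). The sketch below uses one-variable ridge regression as a stand-in linear smoother (an assumption for illustration; the spline case rests on the same hat-matrix argument) and invented data.

```python
# Numerical sketch of the identity behind (27): for a linear smoother
# yhat = H y, the brute-force leave-one-out residual equals
# (y_i - yhat_i) / (1 - h_ii). Here the smoother is p = 1 ridge
# regression, so every quantity has a closed scalar form.

x = [0.5, 1.0, 1.5, 2.0, 3.0]   # hypothetical design points
y = [0.7, 0.9, 1.7, 2.1, 2.8]   # hypothetical responses
lam = 1.0                       # smoothing (ridge) parameter

sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))
beta = sxy / (sxx + lam)        # full-data fit

for i in range(len(x)):
    h_ii = x[i] ** 2 / (sxx + lam)                 # hat-matrix diagonal
    shortcut = (y[i] - x[i] * beta) / (1 - h_ii)   # formula behind (27)
    # refit with observation i deleted, then form the direct residual
    beta_loo = (sxy - x[i] * y[i]) / (sxx - x[i] ** 2 + lam)
    direct = y[i] - x[i] * beta_loo
    assert abs(shortcut - direct) < 1e-9
```

This is why (27) is attractive in practice: one fit of the full smoother gives every leave-one-out residual, instead of n separate refits.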
(Total: 20 marks)