MATH97287
BSc, MSci and MSc EXAMINATIONS (MATHEMATICS)
2021
Introduction to Statistical Learning
1. Suppose in a regression problem we have a response vector Y = (Y1, ..., Yn), an n × p full-rank data matrix X, and an arbitrary (but fixed) p-dimensional vector of parameters β, linked by the following model

Y = Xβ + ε,   (1)

where ε = (ε1, ..., εn) is a vector of independent and identically distributed errors with mean zero and variance σ². Let β̂ be the least squares estimator of β.
(a) Define what is meant by an unbiased linear estimator, β̃, of β. (1 mark)
(b) Suppose β̌ = CY is an unbiased linear estimator of β with C = (X^T X)^{-1} X^T + D, where D is a non-zero p × n matrix. You are given that E(β̌) = (I_p + DX)β, and so β̌ is unbiased if and only if DX = 0.
State and prove the Gauss–Markov theorem for unbiased linear estimators. (4 marks)
(c) Explain why it might be sensible to consider using biased estimators in some circumstances. (1 mark)
(d) This part of the question is about the ridge regression estimator:

β̂_ridge = (X^T X + λ I_p)^{-1} X^T Y,   (2)

which can be obtained as the solution of a penalised least squares problem.
(i) Give an example of a situation where you might use ridge regression. (1 mark)
(ii) Briefly explain (in about five sentences) the Bayesian formulation of ridge regression. State the relationship between the ridge parameter λ, the prior variance τ² of β, and the error variance σ² (do not work out the estimators explicitly).
Explain, in at most a few sentences, the main difference between the Bayesian formulation of ridge regression and that obtained by solving the ridge penalised least squares problem. (5 marks)
(iii) By looking at second derivatives or otherwise, show that the ridge regression estimator
in (2) is at a minimum of the ridge regression objective function. (3 marks)
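As an illustration of part (iii), the minimising property can be checked numerically. The sketch below uses a single hypothetical predictor (so that (2) reduces to a scalar formula) and compares the penalised objective at the ridge estimate with nearby values; the data are invented for illustration only.

```python
# Numerical sketch: for p = 1 the ridge estimator (2) reduces to
# beta_hat = sum(x*y) / (sum(x^2) + lambda). We check that it minimises
# the penalised least squares objective sum((y - x*b)^2) + lambda*b^2.

x = [1.0, 2.0, 3.0, 4.0]    # hypothetical predictor values
y = [1.2, 1.9, 3.2, 3.9]    # hypothetical responses
lam = 0.5                   # ridge parameter

def objective(b):
    return sum((yi - xi * b) ** 2 for xi, yi in zip(x, y)) + lam * b ** 2

beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi ** 2 for xi in x) + lam)

# the objective increases when we move away from beta_hat in either direction
for delta in (-0.1, -0.01, 0.01, 0.1):
    assert objective(beta_hat) < objective(beta_hat + delta)
```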
(e) The employment agency HardWorkerz decides to see whether the results of its applicant
testing programme can predict future salaries. HardWorkerz gave each of 67 applicants four tests Test1, Test2, Test3 and Test4 and collected information on their salaries three years later. The correlation matrix (to two decimal places) between the four Test variables
was
        Test1   Test2   Test3   Test4
[correlation matrix entries lost in extraction; visible values included 0.87 and 1.00]
The salary information was put into the variable Salary and, along with the variables Test1, Test2, Test3 and Test4, was put into a data frame called the.df. They fitted the
following linear model in R using the command
lm(Salary ~ Test1, data=the.df)
and obtained the following table of results
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
[coefficient rows lost in extraction]
They then fitted the model
Salary ~ Test1 + Test2 + Test3 + Test4
and obtained the following results:
[Plot: lasso coefficient profiles against L1 Norm]
Figure 1: Lasso plot from HardWorkerz data set.
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.59968   11.34036   0.317  0.75199
Test1         0.34228    0.12610   2.714  0.00859 **
Test2         0.20430    0.09458   2.160  0.03464 *
Test3         0.21527    0.22359   0.963  0.33940
[Test4 row lost in extraction]
The graphical results from using the glmnet function with the lasso option, and all other arguments set to defaults, based on the possibility of including any of the four explanatory variables, are shown in Figure 1. Interpret the results of these analyses. (5 marks)
(Total: 20 marks)
[Plot: eigenvalues against Sorted Eigenvalue Number]
Figure 2: Eigenvalues from cmdscale() function on eurodist.
2. (a) (i) Explain how the Miles algorithm for least squares monotone linear regression works.
(6 marks)
(ii) Apply the Miles algorithm to the following sequence: 3, 9, 17, 56, 8, 17, 7, 22, 10 and show your working. (2 marks)
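For context, the pooling idea can be sketched in code. The sketch below assumes that the Miles algorithm referred to in part (a) is the usual pool-adjacent-violators procedure for least squares monotone (non-decreasing) fitting; the function name monotone_fit is illustrative.

```python
# Sketch of least squares monotone regression by pooling adjacent
# violators (assumed here to be the procedure the question calls the
# Miles algorithm). Each block stores [sum, count]; whenever a block's
# mean exceeds the next block's mean, the two blocks are merged.

def monotone_fit(values):
    blocks = []                      # list of [total, count]
    for v in values:
        blocks.append([v, 1])
        # merge while the non-decreasing constraint is violated
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit

print(monotone_fit([3, 9, 17, 56, 8, 17, 7, 22, 10]))
# prints [3.0, 9.0, 17.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0]
```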
(b) The eurodist dataset in R contains the distances between 21 European cities in km. The
first four cities and their distances are shown:
Athens Barcelona Brussels Calais
Barcelona 3313
Brussels 2963 1318
Calais 3175 1326 204
...
Classical multidimensional scaling was carried out using the cmdscale() function in R and the corresponding eigenvalue plot was produced as in Figure 2.
(i) For the general case, show how the eigenvalues, such as those in Figure 2, arise from E, a matrix of Euclidean distances. [Hint: you are given that the Euclidean distances arise from some (unknown) n × p centred configuration X with corresponding inner product matrix B = XX^T, that the recovered configuration will be centred, and the formula e²_{m,ℓ} = b_{m,m} + b_{ℓ,ℓ} − 2b_{m,ℓ}.] (6 marks)
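The double-centring step that part (i) asks about can be checked numerically. The sketch below builds a small hypothetical centred configuration X, forms the squared distances from B = XX^T via the hinted formula, and recovers B by double centring (B is the matrix whose eigenvalues appear in plots like Figure 2).

```python
# Sketch with a hypothetical centred configuration: squared Euclidean
# distances satisfy e2[m][l] = b[m][m] + b[l][l] - 2*b[m][l], and double
# centring the matrix of squared distances recovers B = X X^T.

X = [[-1.0, 0.0], [0.5, 1.0], [0.5, -1.0]]   # centred: each column sums to zero
n = len(X)

B = [[sum(X[m][t] * X[l][t] for t in range(len(X[0]))) for l in range(n)]
     for m in range(n)]
E2 = [[B[m][m] + B[l][l] - 2 * B[m][l] for l in range(n)] for m in range(n)]

# double centring: subtract row means and column means, add back grand mean
row = [sum(E2[m]) / n for m in range(n)]
col = [sum(E2[m][l] for m in range(n)) / n for l in range(n)]
grand = sum(row) / n
B_rec = [[-0.5 * (E2[m][l] - row[m] - col[l] + grand) for l in range(n)]
         for m in range(n)]

assert all(abs(B_rec[m][l] - B[m][l]) < 1e-12 for m in range(n) for l in range(n))
```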
(ii) The red dotted line in Figure 2 indicates where one of the eigenvalues is precisely zero. In classical scaling, why is (at least) one of the eigenvalues always zero? (1 mark)
(iii) Using the eigenvalue plot in Figure 2 what dimensionality would you recommend that the scaling solution should be presented in? (1 mark)
(c) Jenks’ natural breaks optimization is the one-dimensional version of k-means clustering. Suppose you have four one-dimensional data points x1 = 1, x2 = 2 + ε, x3 = 3 and x4 = 4, where ε > 0 and ε is considerably smaller than 1. You are told that the true number of clusters is k = 2. Assume that one cluster centre is m1 = x1 = 1 and demonstrate that, depending on which point is initially used as the second cluster centre, Jenks’ algorithm can produce only two possible different allocations of points to clusters. List the possible clusters and the associated cluster means for each of these two different allocations. (4 marks)
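The behaviour described in part (c) can be simulated. The sketch below takes the hypothetical value ε = 0.1 and runs the standard one-dimensional k-means iteration from each possible choice of initial second centre; only two distinct allocations appear.

```python
# Sketch of one-dimensional k-means (k = 2) for the data in part (c),
# with epsilon = 0.1 chosen purely for illustration. m1 starts at x1 = 1
# and each remaining point is tried as the initial second centre.

eps = 0.1
points = [1.0, 2.0 + eps, 3.0, 4.0]

def kmeans_1d(m1, m2, data):
    while True:
        c1 = [p for p in data if abs(p - m1) <= abs(p - m2)]
        c2 = [p for p in data if abs(p - m1) > abs(p - m2)]
        new1, new2 = sum(c1) / len(c1), sum(c2) / len(c2)
        if (new1, new2) == (m1, m2):          # converged
            return c1, c2, m1, m2
        m1, m2 = new1, new2

# collect the distinct final first clusters over all initial second centres
allocations = {tuple(kmeans_1d(1.0, m2, points)[0]) for m2 in points[1:]}
print(allocations)   # only two distinct allocations arise
```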
(Total: 20 marks)
3. (a) Suppose we have a sample of n independent and identically distributed observations, X1, ..., Xn, from a probability density function f : ℝ → [0, ∞). Suppose we estimate f(x) using the kernel density estimator

f̂(x) = (nh)^{-1} Σ_{i=1}^{n} K{(x − X_i)/h},   (3)

where K(y) is the triangular kernel function defined by

K(y) = 1 − |y|, for y ∈ (−1, 1),   (4)

and K(y) = 0 otherwise. Suppose that f(x) is twice-differentiable with continuous second derivative and that all kernel functions below are symmetric about zero. You are given that the expectation of f̂(x) is

E{f̂(x)} = f(x) + C₃ h² f''(x) + o(h³),   (5)

where C₃ = ½ ∫_{−∞}^{∞} v² K(v) dv.
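The estimator (3) with the triangular kernel (4) can be written out directly; the sample below is hypothetical and chosen so the value at the evaluation point is easy to verify by hand.

```python
# Minimal sketch of the kernel density estimator (3) with the triangular
# kernel (4), evaluated on a small hypothetical sample.

def K(y):
    return 1.0 - abs(y) if -1.0 < y < 1.0 else 0.0

def f_hat(x, data, h):
    n = len(data)
    return sum(K((x - xi) / h) for xi in data) / (n * h)

sample = [0.0, 0.5, 1.0]
print(f_hat(0.5, sample, 1.0))   # (1/3)*(K(0.5) + K(0) + K(-0.5)) = 2/3
```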
This question is about deriving an expression for the variance of f̂(x).

(i) Show that

E{f̂(x)} = h^{-1} E[K{(x − X₁)/h}]   (6)

and

var{f̂(x)} = (nh²)^{-1} var[K{(x − X₁)/h}].   (7)

(1 mark)
(ii) Compute the value of C₃ for the triangular kernel function. (2 marks)
(iii) Show that, for the triangular kernel function,

E[K²{(x − X₁)/h}] = (2h/3) f(x) + (h³/30) f''(x) + o(h⁴),   (8)

(8 marks)

and thence

var{f̂(x)} = 2f(x)/(3nh) − f²(x)/n + O(h/n).   (9)

(3 marks)
(b) By properly balancing the squared bias and variance of the kernel density estimator above, the optimal bandwidth can be shown to be

h_opt = { f(x) ∫ K²(v) dv / (4n C₃² f''(x)²) }^{1/5}.   (10)

[You do not have to show this.]
Explain why the optimal bandwidth given in (10) is not of direct practical use. (1 mark)
(c) Let f : ℝ → [0, ∞) be a probability density function. Suppose that {ψ_{j,k}(x)}_{j,k∈ℤ} is an orthonormal wavelet basis of L²(ℝ), with associated father wavelets {φ_{j,k}(x)}_{j,k∈ℤ}, and that we can write, for some integer L > 0,

f(x) = Σ_{k∈ℤ} c_{L,k} φ_{L,k}(x) + Σ_{j∈J_L} Σ_{k∈ℤ} d_{j,k} ψ_{j,k}(x),   (11)

where J_L = {m ∈ ℤ : m ≥ L}, c_{j,k} = ∫_ℝ f(x) φ_{j,k}(x) dx and d_{j,k} = ∫_ℝ f(x) ψ_{j,k}(x) dx, where

φ_{j,k}(x) = 2^{j/2} φ(2^j x − k) and ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k),   (12)

for j ∈ J_L, k ∈ ℤ.
(i) Suppose that we obtain an independent and identically distributed sample X₁, ..., X_n from f(x). Explain why a reasonable estimator of d_{j,k} is d̂_{j,k} = n^{-1} Σ_{i=1}^{n} ψ_{j,k}(X_i). Let ĉ_{j,k} be the equivalent estimator of c_{j,k} using φ_{j,k} rather than ψ_{j,k}. (3 marks)
(ii) Suppose we want to compute d̂_{j,k} and ĉ_{j,k} up to some high-resolution (integer) level J > L. Explain why it is more computationally efficient to use the discrete wavelet transform than to calculate all of the d̂_{j,k} and ĉ_{j,k} directly using the formulae in part (i). (2 marks)
(Total: 20 marks)
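For intuition on the efficiency point in (ii), the cascade structure of the discrete wavelet transform can be sketched in the Haar case (chosen purely for illustration): each pass works on half as many scaling coefficients as the previous one, so the whole transform costs O(n), whereas evaluating every coefficient directly from its defining sum is far more expensive.

```python
# Sketch of the pyramid (cascade) algorithm, Haar case: scaling and
# detail coefficients at each coarser level come from pairwise sums and
# differences of the finer-level scaling coefficients.

import math

def haar_dwt(signal):
    c = list(signal)                 # scaling coefficients at finest level
    details = []
    while len(c) > 1:
        s = [(c[2 * k] + c[2 * k + 1]) / math.sqrt(2) for k in range(len(c) // 2)]
        d = [(c[2 * k] - c[2 * k + 1]) / math.sqrt(2) for k in range(len(c) // 2)]
        details.append(d)
        c = s
    return c[0], details

data = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 2.0, 2.0]
coarse, details = haar_dwt(data)

# orthonormality: total energy is preserved across the transform
energy = coarse ** 2 + sum(x ** 2 for d in details for x in d)
assert abs(energy - sum(v ** 2 for v in data)) < 1e-9
```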
4. (a) State three examples of additive basis expansions that are used in statistical learning. (1 mark)
(b) As part of a statistical learning project a researcher decides to use the circle_data() function to generate synthetic data. The circle_data() function depends on two arguments: an
inner limit ir and an outer limit or. A sample from the circle_data() function involves drawing two random variables, X₁, X₂, each from a uniform distribution on [−or, or]. Then a Bernoulli response Y is drawn according to a conditional probability mass function based on the realization X₁ = x₁, X₂ = x₂ as follows:

P(Y = 1 | x) = { 1,                       if ‖x‖ < ir,
               { 0,                       if ‖x‖ ≥ or,   (13)
               { (or − ‖x‖)/(or − ir),    otherwise,

where ‖x‖ = (x₁² + x₂²)^{1/2} is the usual two-norm.
(i) Draw a sketch diagram of the conditional probability function (13). (2 marks)
(ii) Linear discriminant analysis is a technique that draws a line in the (x₁, x₂) space and allocates y = 1 to points (x₁, x₂) on one side of the line and y = −1 to points on the other side. Why would linear discriminant analysis not be a useful method for constructing a classifier in this example? (1 mark)
(iii) Suppose it is decided to use a classification tree to classify circle data observations in a
supervised learning experiment. Explain what is meant by training and test sets and the
reason for their use in evaluating classifiers. (3 marks)
(iv) A regression tree is defined by partitioning the space (two-dimensional in the case of the circle data) into M disjoint regions R₁, ..., R_M and then forming the model

f(x) = Σ_{m=1}^{M} c_m I(x ∈ R_m),   (14)

where I(A) = 1 if A is true, and 0 otherwise.
Suppose we consider splitting on variable j at split point s (somewhere on X_j) and define the pair of half-planes

R₁(j, s) = {x | X_j < s} and R₂(j, s) = {x | X_j > s}.   (15)

Then we seek the splitting variable j and split point s that solve

min_{j,s} [ min_{c₁} Σ_{x_i ∈ R₁(j,s)} (y_i − c₁)² + min_{c₂} Σ_{x_i ∈ R₂(j,s)} (y_i − c₂)² ].   (16)

For a given j, s derive the inner minimisers in (16) over c₁ and c₂ and label them ĉ₁, ĉ₂. (2 marks)
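To illustrate the split search in (16): the inner minimisers are the region means, and after sorting on the split variable every candidate split point can be scored with running sums of y and y², so the whole scan over s is linear after the initial sort. A sketch with hypothetical data:

```python
# Sketch (hypothetical data) of the fast split search: for a left region
# of m points, sum((y - mean)^2) = sum(y^2) - (sum y)^2 / m, so running
# sums of y and y^2 score every split in O(1) each.

xs = [0.2, 1.5, 0.7, 3.1, 2.4, 2.9]
ys = [1.0, 1.2, 0.8, 3.0, 2.9, 3.2]

pairs = sorted(zip(xs, ys))          # sort once on the split variable
n = len(pairs)
tot = sum(y for _, y in pairs)
tot2 = sum(y * y for _, y in pairs)

best = None
s1 = s2 = 0.0                        # running sums of y and y^2 on the left
for i in range(n - 1):               # split between sorted points i and i+1
    s1 += pairs[i][1]
    s2 += pairs[i][1] ** 2
    m = i + 1
    sse = (s2 - s1 ** 2 / m) + ((tot2 - s2) - (tot - s1) ** 2 / (n - m))
    split = (pairs[i][0] + pairs[i + 1][0]) / 2
    if best is None or sse < best[0]:
        best = (sse, split)

print(best)                          # (smallest total SSE, best split point)
```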
(v) For a given variable j explain how to rapidly compute the best split point s for the following objective function:

Σ_{x_i ∈ R₁(j,s)} (y_i − ĉ₁)² + Σ_{x_i ∈ R₂(j,s)} (y_i − ĉ₂)².   (17)

(2 marks)
(vi) Suppose that we now choose to split variables into three regions with two split points s₁, s₂. Define a suitable new objective function and explain how to minimise it given a variable j. Give a reason why the option of splitting into more than two regions for each variable is not usually done. (3 marks)
(c) (i) Briefly explain the rationale behind the AdaBoost.M1 classifier. (2 marks)
(ii) The AdaBoost.M1 binary classifier can be written as

f(x) = Σ_{m=1}^{M} β_m G_m(x; γ_m),   (18)
where G_m(x; γ_m) is one of the classifiers and G₁(x; γ₁) is the initial ‘weak’ classifier, with G_m(x) ∈ {−1, 1} and γ_m the parameters of the model. The fitted model can be obtained by

{β_m, γ_m}_{m=1}^{M} = arg min Σ_{i=1}^{n} L{ y_i, Σ_{m=1}^{M} β_m G_m(x_i; γ_m) },   (19)

where L{y, f(x)} is the loss function between observation y and model f(x). We assume that the costs of misclassification are equal.
AdaBoost.M1 can be shown to use the loss function L{y, f(x)} = exp{−y f(x)}. Show that the population minimiser is given by

f*(x) = arg min_{f(x)} E_{Y|x}{ e^{−Y f(x)} } = ½ log[ P(Y = 1 | x) / P(Y = −1 | x) ],   (20)

one half of the log-odds ratio of P(Y = 1 | x). Explain why this means that the sign operator applied to f(x) from (18) is the appropriate function to use to build the operational classifier. (4 marks)
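For reference, the standard argument behind (20) can be sketched by writing p = P(Y = 1 | x) and minimising the conditional expected loss pointwise:

```latex
\begin{aligned}
E_{Y|x}\{e^{-Yf(x)}\} &= p\,e^{-f(x)} + (1-p)\,e^{f(x)},\\[2pt]
\frac{\partial}{\partial f(x)} E_{Y|x}\{e^{-Yf(x)}\}
  &= -p\,e^{-f(x)} + (1-p)\,e^{f(x)} = 0
  \;\Longrightarrow\; e^{2f(x)} = \frac{p}{1-p},\\[2pt]
f^*(x) &= \tfrac{1}{2}\log\frac{P(Y=1\mid x)}{P(Y=-1\mid x)}.
\end{aligned}
```

The second derivative p e^{−f(x)} + (1 − p) e^{f(x)} is strictly positive, so this stationary point is a minimum; and since f*(x) > 0 exactly when p > ½, thresholding f(x) at zero with the sign operator recovers the more probable class.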
(Total: 20 marks)
5. (a) This part considers the multiple linear regression model

Y = Xβ + ε,   (21)

where β is a p-vector of unknown parameters, X is an n × p matrix of explanatory variables, Y is an n-vector dependent variable and ε is an n-vector of errors with E(ε_i) = 0, var(ε_i) = σ², and the errors all assumed independent and identically distributed.
Suppose we decide to estimate the parameters β using ridge regression with ridge parameter λ. In this context, explain the concept of cross-validation as a technique for choosing a good value of λ and briefly discuss its pros and cons. Explain what k-fold cross-validation is. What effect might correlation between errors in the model have on the outcome of cross-validation?
What effect might outliers have on the outcome?
(10 marks)
Now suppose we have the model

y_i = f(t_i) + ε_i,   (22)

for i = 1, ..., n, where f is a smooth function f : ℝ → ℝ, {y_i} are observations taken at the points {t_i} and {ε_i} is a set of independent and identically distributed random variables with mean zero and variance σ². Define the full spline smoothing estimator of f(t) for the whole dataset to be f̂_λ(t).
In spline smoothing the ordinary cross-validated estimate of error is defined by

OCV(λ) = n^{-1} Σ_{i=1}^{n} { y_i − f̂_λ^{−i}(t_i) }².   (23)
(i) Explain what f̂_λ^{−i} is. (2 marks)
(ii) It can be shown that the full spline smoothing estimator can be written as

f̂_λ = H(λ) y,   (24)

where f̂_λ = {f̂_λ(t₁), ..., f̂_λ(t_n)}^T and y = (y₁, ..., y_n)^T are both n-vectors and H(λ) is the n × n (hat) matrix depending on λ.
Now define the augmented data set ỹ_i = (ỹ_{i,1}, ỹ_{i,2}, ..., ỹ_{i,n})^T for i = 1, ..., n, where

ỹ_{i,j} = { y_j,             for j ≠ i,
          { f̂_λ^{−i}(t_i),   for j = i,   (25)

for j = 1, ..., n. Let f̃_λ^{−i} be the spline smoothing estimator based on the ỹ_i augmented data set (of n points). You are given that

f̃_λ^{−i}(t_i) = f̂_λ^{−i}(t_i),   (26)

for all i = 1, ..., n.
Show that OCV(λ) can be written alternatively as

OCV(λ) = n^{-1} Σ_{i=1}^{n} [ { y_i − f̂_λ(t_i) } / { 1 − h_{i,i}(λ) } ]²,   (27)

where h_{i,j}(λ) = {H(λ)}_{i,j}. (6 marks)

(iii) What is the advantage of using formula (27) instead of (23)? (2 marks)
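The leave-one-out shortcut in (27) can be checked numerically for any linear smoother with a hat matrix as in (24). The sketch below uses one-variable ridge regression as a stand-in linear smoother (an assumption for illustration; the spline case rests on the same hat-matrix argument) and invented data.

```python
# Numerical sketch of the identity behind (27): for a linear smoother
# yhat = H y, the brute-force leave-one-out residual equals
# (y_i - yhat_i) / (1 - h_ii). Here the smoother is p = 1 ridge
# regression, so every quantity has a closed scalar form.

x = [0.5, 1.0, 1.5, 2.0, 3.0]   # hypothetical design points
y = [0.7, 0.9, 1.7, 2.1, 2.8]   # hypothetical responses
lam = 1.0                       # smoothing (ridge) parameter

sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))
beta = sxy / (sxx + lam)        # full-data fit

for i in range(len(x)):
    h_ii = x[i] ** 2 / (sxx + lam)                 # hat-matrix diagonal
    shortcut = (y[i] - x[i] * beta) / (1 - h_ii)   # formula behind (27)
    # refit with observation i deleted, then form the direct residual
    beta_loo = (sxy - x[i] * y[i]) / (sxx - x[i] ** 2 + lam)
    direct = y[i] - x[i] * beta_loo
    assert abs(shortcut - direct) < 1e-9
```

This is why (27) is attractive in practice: one fit of the full smoother gives every leave-one-out residual, instead of n separate refits.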
(Total: 20 marks)