
Homework 2

Advanced Topics in Statistical Learning, Spring 2024

Due Friday March 1

1 Properties of RKHS regression [32 points]

In this exercise, we will work out some facts about RKHS regression.

(a) First, let k : X × X → R be an arbitrary kernel. Recall that this means there exists a Hilbert space H with inner product ⟨·, ·⟩_H and a feature map φ : X → H, such that for any x, y ∈ X,

k(x, y) = ⟨φ(x), φ(y)⟩_H.

Prove that k is positive semidefinite, meaning that it is symmetric and, for any n ≥ 1 and x_1, . . . , x_n ∈ X, if we define a matrix K ∈ R^{n×n} to have entries K_ij = k(x_i, x_j), then                       [6 pts]

a^T K a ≥ 0,   for all a ∈ R^n.

Hint: express a^T K a in terms of the feature map φ.
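Not part of the proof, but a quick numerical sanity check can make the claim concrete. The sketch below (assuming Python with NumPy, and using a Gaussian kernel purely as an illustrative choice) builds K from random points and evaluates a^T K a for random vectors a, which should be nonnegative up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(x, y, bandwidth=1.0):
    # Example kernel (Gaussian/RBF); any valid kernel would do here.
    return np.exp(-np.sum((x - y) ** 2) / (2 * bandwidth ** 2))

n, d = 50, 3
X = rng.normal(size=(n, d))
K = np.array([[gauss_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

# Check symmetry, and that a^T K a >= 0 for random vectors a (up to float error).
assert np.allclose(K, K.T)
quad_forms = [a @ K @ a for a in rng.normal(size=(1000, n))]
print("min a^T K a over random draws:", min(quad_forms))
print("smallest eigenvalue of K:", np.linalg.eigvalsh(K).min())
```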

(b) Henceforth, suppose that k is the reproducing kernel for H (and hence H is an RKHS). Recall that this means the following two properties are satisfied:

(i) for any x ∈ X, the function k(·, x) is an element of H;

(ii) for any function f ∈ H and x ∈ X, it holds that ⟨f, k(·, x)⟩_H = f(x).

Let f be a function of the form

f = Σ_{j=1}^n β_j k(·, x_j),

for coefficients β_1, . . . , β_n ∈ R. Show that                     [4 pts]

‖f‖²_H = Σ_{i=1}^n Σ_{j=1}^n β_i β_j k(x_i, x_j) = β^T K β,

where β = (β_1, . . . , β_n) ∈ R^n and K is the kernel matrix from part (a).

(c) Let h be any function (in H) that is orthogonal (with respect to ⟨·, ·⟩_H) to the linear space of functions of the form in part (b). Prove that                   [6 pts]

h(x_i) = 0,   i = 1, . . . , n,

and

‖f + h‖_H ≥ ‖f‖_H,   with equality iff h = 0.

(d) Argue that for any λ > 0 and observed data (x_i, y_i), i = 1, . . . , n, the infinite-dimensional RKHS optimization problem

min_f  Σ_{i=1}^n (y_i − f(x_i))² + λ‖f‖²_H

(where the minimization is implicitly over f ∈ H) has a unique solution of the form in part (b), and we can rewrite it as     [4 pts]

min_{β ∈ R^n}  ‖Y − Kβ‖₂² + λ β^T K β,

for the response vector Y = (y_1, . . . , y_n) ∈ R^n and the matrix K ∈ R^{n×n} with entries K_ij = k(x_i, x_j).

For the uniqueness part, you may assume that k is a (strictly) positive definite kernel, so that K is a (strictly) positive definite matrix.

Hint: let g = f + h and use the results in part (c) to argue that g has a larger criterion value, unless h = 0. Use part (b) to complete the reduction to finite-dimensional form.
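For intuition (not required for the proof), here is a minimal numerical sketch, assuming the finite-dimensional form is the kernel ridge program min_β ‖Y − Kβ‖₂² + λβ^T K β with solution β̂ = (K + λI)^{−1} Y; it checks that the fitted values K β̂ agree with SY for the smoother matrix S = K(K + λI)^{−1} appearing in part (e). The Gaussian kernel and the data-generating model are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_gram(X, bandwidth=1.0):
    # Gram matrix K_ij = k(x_i, x_j) for a Gaussian kernel (illustrative choice).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

n, d, lam = 40, 2, 0.5
X = rng.normal(size=(n, d))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

K = rbf_gram(X)
beta_hat = np.linalg.solve(K + lam * np.eye(n), Y)   # beta_hat = (K + lam*I)^{-1} Y
fitted = K @ beta_hat                                # f_hat(x_i) = sum_j beta_j k(x_i, x_j)

S = K @ np.linalg.inv(K + lam * np.eye(n))           # smoother matrix from part (e)
print(np.allclose(fitted, S @ Y))                    # True: fitted values equal S Y
```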

(e) Finally, we establish a cool fact about leave-one-out cross-validation (LOOCV) in RKHS regression problems. Recall that in general, the LOOCV error of an estimator f̂ is defined as

CV(f̂) = (1/n) Σ_{i=1}^n (y_i − f̂^{−i}(x_i))²,

where f̂^{−i} is the estimator trained on all but the i-th pair (x_i, y_i). Prove the following shortcut formula for LOOCV with an RKHS regression estimator f̂:                         [12 pts]

CV(f̂) = (1/n) Σ_{i=1}^n ( (y_i − f̂(x_i)) / (1 − S_ii) )²,

where S = K(K + λI)^{−1} is the smoother matrix for the RKHS estimator (so that the vector of fitted values is given by Ŷ = SY).

Hint: prove that for each i,

y_i − f̂^{−i}(x_i) = (y_i − f̂(x_i)) / (1 − S_ii).

The desired result will follow by rearranging, squaring both sides, and summing over i = 1, . . . , n. There are different ways to establish the above display; one nice way is as follows. Consider solving the RKHS regression problem on

(x_1, y_1), . . . , (x_{i−1}, y_{i−1}), (x_{i+1}, y_{i+1}), . . . , (x_n, y_n),

and consider solving it on

(x_1, y_1), . . . , (x_{i−1}, y_{i−1}), (x_i, ỹ_i), (x_{i+1}, y_{i+1}), . . . , (x_n, y_n),

where ỹ_i = f̂^{−i}(x_i). Argue that these should produce the same solutions. Derive the zero-gradient condition for optimality (differentiate the criterion and set it equal to zero) for each problem, and use these to solve for ỹ_i = f̂^{−i}(x_i).
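A small numerical check of the shortcut formula (again assuming the kernel ridge form of the problem, with an illustrative Gaussian kernel): brute-force LOOCV by refitting on each leave-one-out dataset, then compare against (1/n) Σ_i ((y_i − f̂(x_i))/(1 − S_ii))².

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf_gram(A, B, bandwidth=1.0):
    # Cross-Gram matrix of a Gaussian kernel between point sets A and B.
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

n, d, lam = 30, 1, 0.3
X = rng.uniform(-2, 2, size=(n, d))
Y = np.cos(X[:, 0]) + 0.2 * rng.normal(size=n)

K = rbf_gram(X, X)
S = K @ np.linalg.inv(K + lam * np.eye(n))
resid = Y - S @ Y

# Brute-force LOOCV: refit on n-1 points, predict the held-out one.
loo = []
for i in range(n):
    keep = np.delete(np.arange(n), i)
    Ki = rbf_gram(X[keep], X[keep])
    beta = np.linalg.solve(Ki + lam * np.eye(n - 1), Y[keep])
    pred_i = rbf_gram(X[i:i+1], X[keep]) @ beta
    loo.append((Y[i] - pred_i.item()) ** 2)

shortcut = np.mean((resid / (1 - np.diag(S))) ** 2)
print(np.mean(loo), shortcut)   # the two quantities should agree up to numerical error
```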

2 Minimax lower bounds for Hölder classes [30 points]

Consider the d-dimensional Hölder ball

C^s(L; [0, 1]^d) = { f : [0, 1]^d → R : |D^α f(x) − D^α f(z)| ≤ L‖x − z‖₂ for all |α| = s − 1, and x, z ∈ [0, 1]^d },

where s ≥ 1 is an integer and L > 0 is a constant (not growing with n). Consider i.i.d. samples (x_i, y_i), i = 1, . . . , n, where each

y_i = f_0(x_i) + ε_i,   with ε_i ∼ N(0, σ²) independent of x_i, and f_0 ∈ C^s(L; [0, 1]^d) unknown.
Generalize the calculations given in the minimax theory lecture for the minimax lower bound over the Lipschitz ball to show that      [30 pts]

inf_{f̂} sup_{f_0 ∈ C^s(L; [0,1]^d)} E‖f̂ − f_0‖₂² ≥ c n^{−2s/(2s+d)},

for a constant c > 0 not depending on n, where the infimum is over all estimators f̂.

Make sure to write the full calculation from start to finish (even detailing the steps that may be analogous to those in lecture).

Hint: consider defining (adhering to the notation used in the lecture)

where K is any smooth function supported on the unit ℓ₂ ball {x : ‖x‖₂ ≤ 1}, with the following properties: K(0) = 1, 0 < ∫ K(x)² dx < ∞, and D^β K is 1-Lipschitz for all |β| < s (we can equivalently assume bounded derivatives: ‖D^β K‖₂ ≤ 1 for all |β| ≤ s). Then follow the calculations in the lecture, where we used Fano's method and the Varshamov-Gilbert lemma.
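For intuition only (the proof itself is analytic), one standard candidate for K is the smooth bump x ↦ exp(1 − 1/(1 − ‖x‖₂²)) on the open unit ball, set to 0 outside: it is infinitely differentiable, supported on the unit ℓ₂ ball, and satisfies K(0) = 1 and 0 < ∫ K² < ∞; the Lipschitz/derivative bounds may require rescaling by a constant, which does not affect the rate. A minimal sketch:

```python
import numpy as np

def bump(x):
    # Smooth compactly supported bump: K(x) = exp(1 - 1/(1 - ||x||_2^2)) on the
    # open unit ball, 0 outside; K(0) = 1. (Derivative bounds may need a
    # constant rescaling, which does not change the rate calculation.)
    x = np.atleast_1d(x)
    r2 = np.sum(x ** 2)
    if r2 >= 1.0:
        return 0.0
    return np.exp(1.0 - 1.0 / (1.0 - r2))

print(bump(np.zeros(2)))           # 1.0 at the origin
print(bump(np.array([0.5, 0.5])))  # strictly between 0 and 1 inside the ball
print(bump(np.array([1.0, 0.0])))  # 0.0 on and outside the unit sphere
```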

3 Sub-Gaussian maximal inequalities [24 points]

In this exercise, we will derive tail bounds on maxima of sub-Gaussian random variables X_i, i = 1, . . . , n. Suppose each X_i has mean zero and variance proxy σ². (We assume nothing about their dependence structure.)

(a) Prove that for any λ ∈ R,         [6 pts]

E[exp(λ max_{i=1,...,n} X_i)] ≤ n exp(λ²σ²/2).

(b) Rearrange the result in part (a), then choose a suitable value of λ to show that            [4 pts]

E[max_{i=1,...,n} X_i] ≤ σ√(2 log n).
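As a quick empirical companion (assuming the bound in part (b) takes the form E[max_i X_i] ≤ σ√(2 log n)), the sketch below compares a Monte Carlo estimate of E[max_i X_i] for i.i.d. N(0, σ²) variables, which are sub-Gaussian with variance proxy σ², against σ√(2 log n).

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, reps = 1.0, 5000

for n in [10, 100, 1000, 10000]:
    X = rng.normal(scale=sigma, size=(reps, n))
    emp = X.max(axis=1).mean()                 # Monte Carlo estimate of E[max_i X_i]
    bound = sigma * np.sqrt(2 * np.log(n))     # the claimed bound from part (b)
    print(n, round(emp, 3), round(bound, 3))
```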

(c) Prove that for any λ ≥ 0,          [4 pts]

P(max_{i=1,...,n} X_i ≥ λ) ≤ n e^{−λ²/(2σ²)}.

Hint: use the fact that P(X_i ≥ λ) ≤ e^{−λ²/(2σ²)}, for any λ ≥ 0, which you can view as a consequence of the tail bound for sub-Gaussian averages when n = 1.

(d) Reparametrize the result in part (c) to show that for any t > 0,           [2 pts]

P(max_{i=1,...,n} X_i ≥ σ√(2(log n + t))) ≤ e^{−t}.

(e) Now, we turn to the question of the role of dependence: do correlations between X_i, i = 1, . . . , n make their maximum stochastically larger or smaller? Conduct (and include the results of) a small simulation in order to inform your answer.     [8 pts]
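One possible simulation design for this part (equicorrelated Gaussians are just a convenient choice, not the required one): draw X = (X_1, . . . , X_n) with unit variances and pairwise correlation ρ, and compare summary statistics of max_i X_i across values of ρ.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 100, 20000

for rho in [0.0, 0.3, 0.6, 0.9]:
    # Equicorrelated Gaussian vector: X_i = sqrt(rho)*Z0 + sqrt(1 - rho)*Z_i,
    # which has unit variance and pairwise correlation rho.
    Z0 = rng.normal(size=(reps, 1))
    Z = rng.normal(size=(reps, n))
    X = np.sqrt(rho) * Z0 + np.sqrt(1 - rho) * Z
    M = X.max(axis=1)
    print(f"rho={rho}: mean max = {M.mean():.3f}, "
          f"90th pct = {np.quantile(M, 0.9):.3f}")
```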

4 Risk analysis for the constrained lasso [24 points]

This exercise explores simple in-sample and out-of-sample risk bounds for the lasso. Assume that we observe i.i.d. (x_i, y_i) ∈ [−M, M]^d × R, i = 1, . . . , n, where each

y_i = x_i^T β_0 + ε_i,

and each ε_i is sub-Gaussian with mean zero and variance proxy σ². Consider the constrained-form lasso estimator β̂, which solves

β̂ ∈ argmin_{β ∈ R^d : ‖β‖₁ ≤ t}  ‖Y − Xβ‖₂²,
where Y = (y_1, . . . , y_n) ∈ R^n is the response vector, X ∈ R^{n×d} is the predictor matrix (whose i-th row is x_i ∈ R^d), and t ≥ 0 is a tuning parameter.
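A minimal sketch of computing the estimator itself (not part of the proofs), assuming cvxpy is available; any solver for ℓ₁-constrained least squares would do, and the data-generating choices below are illustrative only.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(5)
n, d, M = 200, 50, 1.0
X = rng.uniform(-M, M, size=(n, d))
beta0 = np.zeros(d); beta0[:5] = 1.0
Y = X @ beta0 + rng.normal(size=n)

t = np.sum(np.abs(beta0))                       # tuning parameter t = ||beta0||_1
beta = cp.Variable(d)
problem = cp.Problem(cp.Minimize(cp.sum_squares(Y - X @ beta)),
                     [cp.norm1(beta) <= t])     # l1-constrained least squares
problem.solve()
beta_hat = beta.value
print(np.round(beta_hat[:8], 3))                # first few fitted coefficients
```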

(a) Prove that the lasso estimator, with t = ‖β_0‖₁, has in-sample risk satisfying   [6 pts]

where the expectation is taken over the training data (x_i, y_i), i = 1, . . . , n.

Hint: follow the strategy that we used in lecture to derive the "slow" rate for the constrained lasso. Note that this was for fixed X, so you will need to condition on X here. Then apply the result from Q3 part (b). You may use the fact that for ε = (ε_1, . . . , ε_n) ∈ R^n, a vector of i.i.d. sub-Gaussians with mean zero and variance proxy σ², and an arbitrary fixed vector a ∈ R^n, the random variable a^T ε is mean zero sub-Gaussian with variance proxy σ²‖a‖₂².

(b) For i.i.d. mean zero random variables Z_i, i = 1, . . . , n that lie almost surely in [a, b], prove that for any t ∈ R,     [4 pts]

E[exp(t · (1/n) Σ_{i=1}^n Z_i)] ≤ exp(t²(b − a)²/(8n)).

Hint: you may use the fact that each Z_i is sub-Gaussian with variance proxy (b − a)²/4, and the hint about linear combinations of sub-Gaussians from part (a).

(c) Let Σ denote the predictor covariance matrix, that is, Σ = E[x_0 x_0^T] for a draw x_0 from the predictor distribution. Let Σ̂ = X^T X / n be the empirical covariance matrix, and let V = Σ̂ − Σ. Prove that      [4 pts]

Hint: apply part (b) to the entries of V .

(d) Prove that the lasso estimator, with t = ‖β_0‖₁, has out-of-sample risk satisfying        [10 pts]

where the expectation is taken over the training data (x_i, y_i), i = 1, . . . , n and an independent draw x_0 from the predictor distribution.

Hint: first, argue that the in-sample risk and out-of-sample risk can be written as

E[(β̂ − β_0)^T Σ̂ (β̂ − β_0)]   and   E[(β̂ − β_0)^T Σ (β̂ − β_0)],

respectively. (Note that the expectations above are each taken with respect to the training samples (x_i, y_i), i = 1, . . . , n only; there is nothing else that is random.) Next, argue that

E[(β̂ − β_0)^T Σ (β̂ − β_0)] ≤ E[(β̂ − β_0)^T Σ̂ (β̂ − β_0)] + E[ ‖β̂ − β_0‖₁² · max_{j,k} |V_{jk}| ],

where recall V_{jk} = (Σ̂ − Σ)_{jk}. Then do a little bit more algebra to bound the right-hand side above and apply the previous parts of this question to conclude the result.

(e) The bound derived in this question for the out-of-sample risk is always larger than that for the in-sample risk (by nature of its construction). As a bonus, investigate: can the out-of-sample risk of the lasso be lower than the in-sample risk? Use a simulation, a pointer to an experiment or result in the literature, or any means of answering that you believe provides a convincing argument.
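One possible (illustrative, not definitive) simulation design for this bonus: repeatedly draw training sets, solve the constrained lasso, and compare Monte Carlo estimates of the in-sample risk (β̂ − β_0)^T Σ̂ (β̂ − β_0) and the out-of-sample risk (β̂ − β_0)^T Σ (β̂ − β_0). The sketch below assumes cvxpy and uses i.i.d. Unif[−M, M] coordinates for the predictors, so that Σ = (M²/3) I.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(6)
n, d, M, reps = 100, 20, 1.0, 50
beta0 = np.zeros(d); beta0[:3] = 1.0
t = np.sum(np.abs(beta0))
Sigma = (M ** 2 / 3) * np.eye(d)   # covariance of x0 with i.i.d. Unif[-M, M] coordinates

in_risk, out_risk = [], []
for _ in range(reps):
    X = rng.uniform(-M, M, size=(n, d))
    Y = X @ beta0 + rng.normal(size=n)
    b = cp.Variable(d)
    cp.Problem(cp.Minimize(cp.sum_squares(Y - X @ b)), [cp.norm1(b) <= t]).solve()
    delta = b.value - beta0
    in_risk.append(delta @ (X.T @ X / n) @ delta)   # in-sample: delta^T Sigma_hat delta
    out_risk.append(delta @ Sigma @ delta)          # out-of-sample: delta^T Sigma delta

print("in-sample risk (MC estimate):", np.mean(in_risk))
print("out-of-sample risk (MC estimate):", np.mean(out_risk))
```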