ARIN7101 Statistics in Artificial Intelligence
(2022 Fall)
Assignment 2, due on October 28
All numerical computation MUST be conducted in Python, and attach the Python code.
1. Question 1 (Regularization)
Consider the simple linear regression y_i = βᵀx_i + ε_i, ε_i ~iid N(0, σ_ε²), i = 1, . . . , n, where n is the number of samples, and the residual sum of squares (RSS) loss,

RSS(β) = Σ_{i=1}^n (y_i − βᵀx_i)² = (y − Xβ)ᵀ(y − Xβ).
(a) Under the assumption that XᵀX = diag(σ_1², . . . , σ_p²), where p is the number of covariates in X, derive the closed-form formula for the LASSO regression,

β̂_LASSO = argmin_β { RSS(β) + λ‖β‖₁ },

as a function of X, y, λ, and (σ_1², . . . , σ_p²) (do not include β̂_OLS in your final results).
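As a reminder (this is not part of the assignment's hints), the derivation typically passes through the soft-thresholding operator, i.e., the proximal map of the scaled ℓ1 penalty:

```latex
% Soft-thresholding: the univariate proximal map of t|\cdot|
S_t(z) \;=\; \operatorname*{arg\,min}_{b}\ \tfrac{1}{2}(b - z)^2 + t\,|b|
\;=\; \operatorname{sign}(z)\,\max(|z| - t,\, 0).
```

Because XᵀX is assumed diagonal, the LASSO objective separates into p univariate problems of this form.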
(b) The training and test datasets (CSV files provided with the assignment) store age, weight, height, and several body circumference measurements for 252 men. Use 'brozek' as the response variable (y) and the other variables as predictors (x) in the linear regression model.
Normalize the training and test datasets using the sample mean and variance estimated from the training dataset. Set the learning rate of the proximal gradient method to γ = 1e−4, with convergence criterion ε = 1e−7. Plot the estimated coefficients of the ridge regression and the LASSO regression, respectively, against the grid of λ values λ ∈ np.linspace(...).
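The proximal gradient iteration described above can be sketched as follows. This is a minimal illustration under my own naming (`soft_threshold`, `lasso_proximal_gradient`); data loading and the λ grid loop are omitted:

```python
import numpy as np

def soft_threshold(z, t):
    # elementwise soft-thresholding, the proximal map of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_proximal_gradient(X, y, lam, gamma=1e-4, eps=1e-7, max_iter=100_000):
    """Minimize (y - Xb)^T (y - Xb) + lam * ||b||_1 by proximal gradient."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        grad = -2.0 * X.T @ (y - X @ beta)            # gradient of the RSS term
        beta_new = soft_threshold(beta - gamma * grad, gamma * lam)
        if np.max(np.abs(beta_new - beta)) < eps:     # convergence criterion
            beta = beta_new
            break
        beta = beta_new
    return beta
```

Each iteration takes a gradient step on the smooth RSS term and then applies the ℓ1 proximal map with threshold γλ; ridge regression can reuse the same loop with the proximal step replaced by the closed-form shrinkage.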
(c) Given the LASSO regression results in (b), what is the range of λ for which exactly four predictors are included in the linear regression model? Which four predictors would you choose?
(d) Find the optimal λ on the same np.linspace(...) grid, i.e., the value that yields the lowest loss on the test dataset for the LASSO regression. Which predictors are included in the model at the optimal λ?
2. Question 2 (Gaussian Process)
Let f ~ GP(m(·), k(·, ·)). In real applications, the observed data y are usually noisy, i.e.,

y_i = f(x_i) + ε_i.

We assume white noise, i.e., ε_i ~iid N(0, σ²), and σ² is known.
(a) Given the observed y = (y_1, . . . , y_n)ᵀ and the corresponding X = (x_1, . . . , x_n)ᵀ, as well as new inputs X* = (x_1*, . . . , x_{n*}*)ᵀ, derive the joint distribution of

( y, f(X*) )ᵀ

and the conditional distribution

f(X*) | y.
(b) The entire Gaussian process can be integrated out to obtain the marginal likelihood p(y* | y, X*), i.e.,

p(y* | y, X*) = ∫ p(y* | f(X*)) p(f(X*) | y) df(X*),

where y_j* = f(x_j*) + ε_j, j = 1, . . . , n*.

Derive the distribution of y* | y, X*.
Hints:
The density function of a p-dimensional multivariate normal distribution with mean vector µ and covariance matrix Σ is

f(x | µ, Σ) = (2π)^{−p/2} |Σ|^{−1/2} exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) ).

And for z ~ N_p(µ, Σ), µ ~ N_p(µ₀, Σ₀), where Σ, Σ₀, and µ₀ are known, it holds that

p(z) = ∫ p(z | µ) p(µ) dµ ~ N_p(µ₀, Σ + Σ₀).
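One more standard fact, not among the hints above but routinely used for part (a), is the conditional distribution of a jointly Gaussian vector:

```latex
\begin{bmatrix} a \\ b \end{bmatrix}
\sim \mathcal{N}\!\left(
\begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix},
\begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}
\right)
\;\Longrightarrow\;
a \mid b \sim \mathcal{N}\!\left(
\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(b - \mu_b),\;
\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}
\right).
```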
(c) Consider an Ornstein–Uhlenbeck kernel k(x, x') = exp(−|x − x'| / l) (set metric='minkowski', p=1 in scipy.spatial.distance.cdist) and a mean-zero GP, i.e., m(·) = 0. Set σ² = 0.25 and l = 1. The observed data points (y_i, x_i) are stored in the training CSV file and the new inputs {x_j*} are stored in the test CSV file. By assigning a GP prior on f, draw the following line plots in Python (as shown in Tutorial 5):
(i) the mean and point-wise 95% Bayesian credible interval (shaded area) of f(X*) | y;
(ii) five realizations from f(X*) | y;
(iii) the mean and point-wise 95% Bayesian credible interval (shaded area) of y* | y;
(iv) five realizations from y* | y.
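The posterior quantities in (i)–(iv) all come from the same two formulas; a minimal sketch (the helper names `ou_kernel` and `gp_posterior` are mine, and file loading/plotting are omitted) might look like:

```python
import numpy as np

def ou_kernel(X1, X2, length=1.0):
    # Ornstein-Uhlenbeck kernel k(x, x') = exp(-|x - x'| / l) for 1-D inputs
    d = np.abs(X1[:, None] - X2[None, :])
    return np.exp(-d / length)

def gp_posterior(X, y, Xs, sigma2=0.25, length=1.0):
    """Posterior mean and covariance of f(X*) | y for a zero-mean GP
    observed with white noise of variance sigma2."""
    K = ou_kernel(X, X, length) + sigma2 * np.eye(len(X))  # Cov(y)
    Ks = ou_kernel(X, Xs, length)                          # Cov(y, f(X*))
    Kss = ou_kernel(Xs, Xs, length)                        # Cov(f(X*))
    alpha = np.linalg.solve(K, y)
    mean = Ks.T @ alpha
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mean, cov
```

For y* | y, add σ²I to the returned covariance before computing credible bands or drawing realizations.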
3. Question 3 (Dirichlet Process)
Let G ~ DP(α, G₀), where the base distribution is G₀ = N(µ₀, σ₀²) and we set µ₀ = 1, σ₀² = 4.
(a) For each α in (0.01, 0.1, 1, 10, 100, 1000), use the stick-breaking approach to draw 10 realizations from DP(α, G₀), stopping once the remaining stick length falls below ε = 10⁻⁸. For each α, plot the empirical cumulative distribution functions on [−5, 5] of the 10 realizations in one plot. Comment on your results.
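One stick-breaking draw can be sketched as below; the function name is my own, and the truncation rule is the ε-stopping criterion described above:

```python
import numpy as np

def stick_breaking_draw(alpha, mu0=1.0, sigma0=2.0, eps=1e-8, rng=None):
    """One realization of G ~ DP(alpha, N(mu0, sigma0^2)) via stick-breaking,
    truncated when the remaining stick length falls below eps."""
    rng = np.random.default_rng() if rng is None else rng
    weights, atoms, remaining = [], [], 1.0
    while remaining > eps:
        v = rng.beta(1.0, alpha)             # break a Beta(1, alpha) fraction
        weights.append(remaining * v)        # weight of the new atom
        atoms.append(rng.normal(mu0, sigma0))  # atom drawn from the base G0
        remaining *= 1.0 - v                 # stick left over for later atoms
    return np.array(weights), np.array(atoms)
```

The realization is the discrete distribution placing mass `weights[i]` at `atoms[i]`; its empirical CDF is a step function, which is what part (a) asks you to plot.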
(b) For each α in (0.01, 0.1, 1, 10, 100, 1000), generate a sequence (Y_1, . . . , Y_100) following the Chinese Restaurant Process (CRP) 5,000 times. For each sequence, count the number of clusters (unique values of Y_i); you can use the Counter class from the collections package. For each α, plot a barplot of the number of clusters. Comment on your results.
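A single CRP simulation can be sketched as follows (the function name is mine; repeating it 5,000 times and tallying with Counter is left to the loop in your script):

```python
import numpy as np

def crp_num_clusters(alpha, n=100, rng=None):
    """Seat n customers by the Chinese Restaurant Process; return the
    number of occupied tables (clusters)."""
    rng = np.random.default_rng() if rng is None else rng
    counts = []                      # customers at each existing table
    for i in range(n):
        # customer i joins table t with prob counts[t]/(i+alpha),
        # or opens a new table with prob alpha/(i+alpha)
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)         # open a new table
        else:
            counts[table] += 1
    return len(counts)
```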
(c) For the CRP, given n as the number of samples and α, prove that the probability mass function of the number of clusters k has the form

P(k | α, n) = c(n, k) n! α^k Γ(α) / Γ(α + n), k = 1, . . . , n,

where c(n, k) does not involve α. (Hints: Γ(a + 1)/Γ(a) = a; exchangeability.)
(d) In fact, we can place a Gamma prior on α, i.e., α ~ Gamma(a₀, b₀). Given the observations X_1, . . . , X_n, the posterior distribution of α has the form

π(α | X_1, . . . , X_n) = π(α | k, n) ∝ π(α) P(k | α, n)
  ∝ π(α) α^k Γ(α) / Γ(α + n)
  ∝ π(α) α^{k−1} (α + n) ∫₀¹ u^α (1 − u)^{n−1} du,

where k is the number of unique values among (X_1, . . . , X_n). Describe how you can draw posterior samples of α via a Gibbs sampler.
(Hints: π(α | k, n) can be viewed as the marginal distribution of p(α, η | k, n) ∝ π(α) α^{k−1} (α + n) η^α (1 − η)^{n−1}, obtained by integrating out η; η^α = e^{α ln η}; sampling from a mixture distribution.)
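The Gibbs sampler suggested by the hints alternates between the auxiliary variable η and α; one sweep in the standard Escobar–West form can be sketched as follows (the function name and the defaults a₀ = b₀ = 1 are my own choices):

```python
import numpy as np

def sample_alpha(alpha, k, n, a0=1.0, b0=1.0, rng=None):
    """One Escobar-West Gibbs update of the DP concentration alpha,
    given k clusters among n observations and a Gamma(a0, b0) prior."""
    rng = np.random.default_rng() if rng is None else rng
    eta = rng.beta(alpha + 1.0, n)        # auxiliary variable eta | alpha
    # alpha | eta is a two-component Gamma mixture; mixing odds for the
    # Gamma(a0 + k, b0 - ln eta) component versus Gamma(a0 + k - 1, .)
    odds = (a0 + k - 1.0) / (n * (b0 - np.log(eta)))
    pi = odds / (1.0 + odds)
    shape = a0 + k if rng.random() < pi else a0 + k - 1.0
    return rng.gamma(shape, 1.0 / (b0 - np.log(eta)))
```

Iterating this update (updating k as well, when α is embedded in a larger model) yields posterior samples of α.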
(e) Use the provided CSV dataset, which stores values of X_1, . . . , X_n ~iid G. For each α in (0.01, 0.1, 1, 10, 100, 1000), draw the cumulative distribution functions of the mean of G, of the data (empirical step-wise function), and of the mean of G | X_1, . . . , X_n. Comment on your results.
(f) Consider a Dirichlet process mixture model where
Xi ~ N (µi , σ 2 ), i = 1, . . . , n
µi ~ G, i = 1, . . . , n
G ~ DP (α, Go )
Here σ² is known. Derive the full conditional distribution of µ_i | µ_{−i}, X_i (see Page 17 of the Tutorial 6 notes). Describe how you can obtain posterior samples of (µ_1, . . . , µ_n).
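For orientation (this mirrors the standard Pólya-urn representation rather than the tutorial's exact notation), the full conditional mixes point masses at the other µ_j with a draw from the posterior under the base measure:

```latex
p(\mu_i \mid \mu_{-i}, X_i) \;\propto\;
\sum_{j \ne i} N(X_i \mid \mu_j, \sigma^2)\,\delta_{\mu_j}(\mu_i)
\;+\; \alpha\, q_i\, h_i(\mu_i),
\qquad
q_i = \int N(X_i \mid \mu, \sigma^2)\, dG_0(\mu),
```

where h_i(µ) ∝ N(X_i | µ, σ²) g₀(µ) is the posterior density of µ given X_i under the base measure G₀; a Gibbs sampler then cycles this update through i = 1, . . . , n.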