COMP3670/6670 Introduction to Machine Learning Semester 2, 2021 Final Exam
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
COMP3670/6670 Introduction to Machine Learning
Semester 2, 2021
Final Exam
![]()
• Section 1. Linear Algebra and Matrix Decomposition (13 points)
1. (6 points) Show that it is impossible to have a set of 3 orthogonal non-zero vectors {v1 , v2 , v3 } in R2 .
(Hint: Write v1 = [x1 ,y1]T and first assume that x1 = 0. Then, prove it for the case where x1
0, using the first case to help you.)
Solution. We write
v1 = [y(x)1(1)] v2 = [y(x)2(2)] v3 = [y(x)3(3)]
As the hint suggests, we first assume that x1 = 0. Then since v1 · v2 = 0, we have y1y2 = 0. Since y1 cannot be zero (otherwise v1 = 0) we must have y2 = 0. Since v1 · v3 = 0, we have x1x3 + y1y3 = y1y3 = 0, and y1
0, so y3 = 0. Since v2 · v3 = 0, we have x2x3 + y2y3 = x2x3 = 0. Now, y2 = 0, so x2
0 (else v2 = 0), so x3 = 0. We have that x3 = 0 and y3 = 0, so v3 = 0, a contradiction.
Now, if any of the x1 ,x2 ,x3 ,y1 ,y2 ,y3 were zero, this would also result in a contradiction in the same way, as by symmetry we can reorder the vectors so the zero was somewhere in the first vector, and since x1y1 + x2y2 = y1x1 + y2x2 , we could swap the x and y values in all vectors so that x1 = 0, from which the contradiction follows by above.
Now, assume that x1
0. v1 · v2 gives
v1 · v2 = 0
x1x2 + y1y2 = 0
![]()
2 =
and similarly
v1 · v3 = 0
x1x3 + y1y3 = 0
![]()
3 =
Then v2 · v3 = 0, so
x2x3 + y2y3 = 0
(
)(
) + y2y3 = 0
y 1(2)y2y3
y2y3 (
+ 1) = 0
Then either y2 = 0 (contradiction), or y3 = 0 (contradiction) or y北
+1 = 0, giving y 1(2) = −x1(2) .
Since the left hand side is non-negative, and the right hand side is non-positive, the only way to satisfy this equation is y1 = x1 = 0, a contradiction.
2. (7 points) Consider an arbitrary matrix A ∈ R2×2 A = [c(a) d(b)]
We would like A to satisfy the following properties:
(a) A is non-invertible.
(b) A is symmetric.
(c) All the entries of A are positive.
(d) A has a positive eigenvalue λ = p > 0.
Find the set of all matrices A (as a function of p) that satisfy all the above constraints.
Solution. A is symmetric, so b = c. We can eliminate c and consider only matricies of the
form
A = [b(a) d(b)] .
A is non-invertible, so ad − b2 = 0. Also, a,b,d > 0. We have that p > 0 is an eigenvalue, so we find all eigenvalues using the characteristic polynomial.
(a − λ)(d − λ) − b2 = 0
ad − aλ − dλ + λ2 − b2 = 0
−aλ − dλ + λ2 = 0
λ(λ − a − d) = 0
So λ = 0 or λ = a+d. Since 0 cannot be positive, we have that p = a+d, or d = p − a. We
can eliminate d.
A =
]
Finally, since ad − b2 = 0, we have a(p − a) = b2 , and hence b = ±“a(p − a). Since all entries need to be positive, we take the positive root, b = “a(p − a). For the square root to be defined (and not zero), we need a(p − a) > 0, which implies that 0 < a < p. Hence, the set of all possible A is given by
{[“a(
] : 0 < a < p}
as required.
• Section 2. Analytic Geometry and Vector Calculus (12 points)
1. (6 points) Find all matrices T ∈ R2×2 such that for any v ∈ R2 , vT v = (Tv)T v = (Tv)T Tv
Solution. Consider v · v = T(v) · v . Let v = [x,y]T . Then
v · v = T(v) · v
x2 + y2 = x(ax + by) + y(cx + dy)
x2 + y2 = ax2 + (b + c)xy + dy2
By choosing x = 0,y = 1, we obtain d = 1, and by choosing x = 1,y = 0 we obtain a = 1. Then, we choose x = 1 and leave y arbitrary for the moment. Substituting, we obtain
1 + y2 = 1 + (b + c)y + y2
from which we obtain (b+c)y = 0 for all y, so c = −b. We now consider the other equation
v · v = T(v) · T(v)
x2 + y2 = (ax + by)2 + (cx + dy)2
x2 + y2 = (a2 + c )x22 + (b2 + d )y22 + 2(ab + cd)xy
x2 + y2 = (1 + b )x22 + (1 + b )y22 + 2(b − b)xy
x2 + y2 = (1 + b2 )(x2 + y2 )
1 = 1 + b2
b = 0
So a = 1,b = 0,c = −b = 0,d = 1. So the only matrix that satisfies this property is the identity.
T = [0(1) 1(0)]
2. (6 points) Let 北, a ∈ Rn×1, and define f : Rn×1 → R as f(北) = 3exp(2北T A北)
where exp(x) := e北 , the exponential function. Compute ∇xf(北).
Solution. We apply the chain rule. Note that f(北) = g(h(北)), where g : R → R, g(h) = 3exp(2h) and h : Rn → R, h(北) = 北T A北. We have by lectures/assignment that ∇xh(x) = xT (A + AT), and by calculus we have that ∇hg(h) = 6exp(2h). Therefore
df dg dh
dx dh dx
• Section 3. Probability (15 points)
Consider the following scenario. I have two boxes, box A and box B. In their initial configuration, Box A contains 3 red balls and 3 white balls; Box B contains 6 red balls.
We swap the balls in boxes via the following procedure.
Procedure 1: We draw a ball xA uniformly at random from box A, and draw a ball xB uniformly at random from box B . We place ball xA into box B and place ball xB into box A .
Let RA and RB be random variables representing the number of red balls in box A and box B respectively.
1. (3 points) Describe the probability mass functions p(Ra = n) and p(Rb = n) for all integers n ≥ 0 for the initial configuration of the boxes.
Solution. Clearly, box A contains 3 red balls, so
p(RA = n) = {0(1)
and box B contains 6 red balls, so
![]()
![]()
{
n = 3
n
3
n = 6
n
6
2. (3 points) After performing Procedure 1, what is the new joint probability mass functions p(RA = na,RB = bn)?
Solution. We are guaranteed to move a red ball from box B into box A, it is equally likely that the ball xA chosen from A is red. So, we have a 0.5 probability of nothing changing. and a probability of 0.5 of adding a red ball to box A and loosing a red ball from box B .
Hence,
0.5
p(RA = na,RB = nb) =〈 0.5
(0
(na,nb) = (3, 6)
(na,nb) = (4, 5)
else
We now describe a new method of swapping balls between boxes.
Procedure 2: We draw a ball xA uniformly at random from box A, and place it in box B . Then, we draw a ball xB uniformly at random from box B, and place it in box A .
3. (2 points) Is Procedure 2 equivalent to Prodecure 1? Why or why not? Use no more than
3 sentences to explain your answer.
Solution. No, as we place xA into Box B before selecting a ball from box B . This means it is possible (though unlikely) that the ball drawn from B was the same ball moved there from A, so it is possible that xB is black (which was impossible in the previous case).
4. (3 points) We reset the experiment to the initial configuration, and then perform Procedure 2.
(a) Compute the probability xA is white.
(b) Compute the probability that xB is white.
Solution. Box A has the same number of red and white balls, so p(xA = white) = 1/2. Now, the outcome of xB depends on what xA was, so we use the sum and product rules.
p(xB = white) = p(xB = white|xA = white)p(xA = white)+p(xB = white|xA = red)p(xA = red)
If xA = white, then box B now has 6 red balls and 1 white ball, so p(xB = white|xA = white) = 1/7. If xA = red, then box B still contains only red balls, so p(xB = white|xA = red) = 0. Hence,
p(xB = white) = 1/7 × 1/2 + 0 × 1/2 = 1/14
5. (4 points) We reset the experiment to the initial configuration, and then perform Proce- dure 1. After that, we select either box A or box B by some uniformly random procedure (like flipping a fair coin), and draw a ball from the selected box. The ball drawn was white. What was the probability that box B was selected?
Solution.
p(Box = B|ball = w)
p(ball = w|Box = B)p(Box = B)
=
p(ball = w|Box = A)p(Box = A) + p(ball = w|Box = B)p(Box = B)
First, consider p(ball = w|Box = B). If box B were selected, there is a 1/2 chance of it containing 5 red balls out of 6, or 1/2 chance of 6 red balls out of 6. So,
p(ball = w|Box = B) = 1/2 × 5/6 + 1/2 × 1 = 11/12
If box A were selected, there is a 1/2 chance of it containing 3 red balls out of 6, or 1/2 chance of 4 red balls out of 6.
p(ball = w|Box = A) = 1/2 × 3/6 + 1/2 × 4/6 = 7/12
Hence.
p(Box = B|ball = w)
![]()
= = 11/18
• Section 4. Clustering and Gaussian Mixture Model (GMM) (16 points)
Suppose you have N data points in R: x1 ,x2 , ...,xn,xn+1 , ...,xN . Now you want to partition them into two clusters C1 and C2 . First, you assign x1 ,x2 , ...,xn into C1 and assign xn+1 , ...,xN into C2 . After that, you want to calculate the new cluster centers µ 1 and µ2 of C1 and C2 . For each cluster, your objective is to minimise the averaged squared distances between each data point and its assigned center.
1. (3 points) Calculate µ 1 and µ2 that achieve your objective (M-step). (Only showing result without the optimisation process will lead to 0 mark.) Hint: after the assignment, you can only use the first n points to calculate µ 1 , and the rest points to calculate µ2 .
Solution. We denote the two centers as µ 1 and µ2 , respectively. The loss function can be
written as:
n N
i=1 i=n+1
We then calculate the partial derivatives of L w.r.t µ 1 and µ2 . We have
![]()
=
工(n)(µ1 − xi),
i=n+1
We set both partial derivatives to 0. We can easily get
n N
i=1 i=n+1
2. (2 points) Are µ 1 and µ2 a global minimum solution to this M-step (calculating the means of clusters)? Use no more than 3 sentences to explain your answer.
Solution. Yes. Reason 1: the assignment strategy (first n points to C1 and the rest to C2) is given. Reason 2: the optimization of µ 1 does not depend on µ2 (does not assume given values of µ2 ), and the optimization of µ2 does not depend on µ 1 (does not assume given values of µ1 ).
Given µ 1 and µ2 calculated in Question 1 above, you now want to update the assignment (E- step). This time again you aim to minimise the sum of squared distances between each data point and its center.
3. (3 points) What is your optimal assignment strategy (you can only assign a data point to one cluster, i. e ., hard assignment)? Why is it optimal for your aim? (You can use whatever is helpful to illustrate, such as figures and maths. Only showing the result without a clear demonstration/proof process with lead to 0 mark)
Solution. We should assign each point to its closest center.
To prove, we only have to optimize the assignment of each individual data point, because the data points are independently and identically distributed (i.i.d). For data point xi , because of hard assignment, the loss can be written as,
Lh =
if xi is assigned to center 1
if xi is assigned to center 2.
Apparently, Lh is minimised when our assignment strategy allows L to always take the lesser between (xi − µ1 )2 and (xi − µ2 )2 . In other words, our assignment strategy should always satisfy the lesser between (xi− µ1 )2 and (xi− µ2 )2 is chosen. That is, xi should be assigned to the center it has a closer distance with.
4. (2 points) Is this assignment a global minimum solution to the E-step under hard assign- ment? Use no more than 3 sentences to explain your answer.
Solution. Yes. This is because µ 1 and µ2 are considered as given. Moreover, hard assign- ment is used. Under these conditions, the optimization of the assignment strategy gives a global minimum to the E-step.
Following the above questions, we do further explorations. The GMM performs soft assignment, i. e ., assigning a data point into multiple clusters, and this assignment is accompanied by respon- sibilities. Now, we want to explore a similar scheme in k-means. Specifically, we define
(xm− µ1 )2 (xm− µ2 )2
where rm2 denotes how likely xm belongs to C2, and rm1 denotes how likely xm belongs to C1 .
1. (2 point) Suppose xm is assigned to C1 under hard assignment, then under soft assignment, which cluster will bear a higher responsibility for xm? Use two sentences to explain.
Solution. If xm is assigned to C1 under hard assignment, it means (xm−µ1 )2 < (xm−µ2 )2 . Therefore, rm2 < rm1. So xm should be assigned to C1 with a higher responsibility.
2. (4 points) Given µ 1 and µ2 , when looking at a single data point xm, does soft assignment have a lower loss value than the hard assignment? Prove it. Hint: your loss function will be the sum of the squared distance from xm to each center multiplied by the “responsibility” rmk,k = 1, 2.
Solution. No.
Ls = rm1(xm− µ1 )2 + rm2(xm− µ2 )2 .
When (xm− µ1 )2 < (xm− µ2 )2 ,
Lh = (xm− µ1 )2 .
Lh− Ls = (xm− µ1 )2 − rm1(xm− µ1 )2 − rm2(xm− µ2 )2
(xm− µ1 )4 + (xm− µ1 )2 (xm− µ2 )2 − (xm− µ1 )4 − (xm− µ2 )4
=
(xm− µ1 )2 + (xm− µ2 )2
= ((xm− µ1 )2 − (xm− µ2 )2 )(xm− µ2 )2 < 0
(1)
(2)
(3)
Therefore, the soft assignment actually gives a higher loss value than hard assignment.
• Section 5. Linear Regression (13 points)
1. (4 points) In least squares linear regression, our empirical risk is
N
R =
工(yn− θT 北n)2 ,
n=1
where 北n ∈ Rd is a training sample, yn ∈ R is its label. θ contains the model parameters. Now we use a sigmoid function in this empirical risk, i. e .,
N
R =
工(yn− sigmoid(θT 北n))2 .
n=1
In no more than 4 sentences, state the reason why using sigmoid function in this way is not desirable.
Solution. The sigmoid function can only output values between 0 and 1, while the ground truth y could be other values. Moreover, the sigmoid function gives very similar values when θT 北 is very large (or very small) and thus will give similar loss values. This property will compromise model prediction performance under very large or very small θT 北.
2. Now we have four data points shown in the figure below. We use the least square error in regression.
(a) (b)
1) (3 points) In (a), draw the linear regression output and the first principal component. Are they of the same direction/orientation? Use no more than 3 sentences to explain why.
Solution. As shown in figure (a) below, the linear regression line should be y = 1. The first principal component is vertical. Students get full marks if the PCA result is vertical.
2) (3 points) In (b), draw the regression output, using language to describe when necessary.
Note: you can use any drawing software (PowerPoint, pdf editor, etc) to di- rectly do it on figure; otherwise, approximately draw the four points on your paper and then draw the regression/PCA lines.
Solution. As shown in figure (b) below, the linear regression output should be any straight line passing the point (2, 1) except the vertical one.
3. (3 points) In the lecture slides, to derive compact formula, we include a dummy feature x0 = 1 into the sample vector 北, and correspondingly θ includes the coefficient θ0 of this dummy feature. Now I want to remove this dummy feature and its coefficient from the
|
5 4 3 2 1 0 -1 -2 -3 -4 -5 |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2022-11-06