MA6529 STATISTICAL LEARNING 2020
MA6529/20
STATISTICAL LEARNING
SECTION A
The questions in this section will each be marked out of 10. Candidates may attempt all SIX questions but are advised that they cannot obtain a total of more than FIFTY MARKS on this section.
1. Two measurements were collected on each of 36 flea-beetles; 18 of the beetles were from a species called Chaetocnema concinna and the other 18 were from another species called Chaetocnema heikertingeri. The first variable, x1, consisted of the sum of the widths (in micrometers) of the first joints of the first two tarsi ("feet"); the second variable, x2, consisted of the corresponding sum for the second joints. The sample mean vectors of the two species, x̄1 and x̄2, the pooled sample covariance matrix S and its inverse S^-1 are given by
x̄1 = [entries not recoverable], x̄2 = [entries not recoverable],
S = [entries not recoverable], S^-1 = [entries not recoverable].
(a) It is of interest to know whether or not the population means of the two species are different. Use Hotelling's T^2 to test the null hypothesis of no difference. [ 7 marks ]
(b) What are the assumptions needed to use the test in part (a)? [ 3 marks ]
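For part (a), the two-sample statistic is T^2 = (n1 n2 / (n1 + n2)) (x̄1 − x̄2)^T S^-1 (x̄1 − x̄2), and under the null hypothesis (n1 + n2 − p − 1) / ((n1 + n2 − 2) p) × T^2 follows an F distribution with p and n1 + n2 − p − 1 degrees of freedom. The following is a minimal Python sketch of that calculation for p = 2. The mean vectors and covariance matrix in it are hypothetical placeholders, not the values from the question, and the helper names are illustrative.

```python
# Two-sample Hotelling's T^2 for p = 2 variables (pure Python, no dependencies).
# All numeric inputs at the bottom are HYPOTHETICAL, not the exam's values.

def inv2x2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def hotelling_t2(mean1, mean2, pooled_cov, n1, n2):
    """Return (T^2, F statistic, (df1, df2)) for the two-sample test."""
    p = len(mean1)
    diff = [u - v for u, v in zip(mean1, mean2)]
    s_inv = inv2x2(pooled_cov)
    # Quadratic form diff^T S^{-1} diff
    quad = sum(diff[i] * s_inv[i][j] * diff[j]
               for i in range(p) for j in range(p))
    t2 = (n1 * n2) / (n1 + n2) * quad
    df1, df2 = p, n1 + n2 - p - 1
    # Scale T^2 so that it can be compared with an F(df1, df2) critical value
    f_stat = df2 / ((n1 + n2 - 2) * p) * t2
    return t2, f_stat, (df1, df2)


# Hypothetical means and pooled covariance; only n1 = n2 = 18 comes from the question.
t2, f_stat, dfs = hotelling_t2([10.0, 5.0], [8.0, 6.0],
                               [[4.0, 1.0], [1.0, 2.0]], 18, 18)
print(f"T^2 = {t2:.3f}, F = {f_stat:.3f}, df = {dfs}")
```

The F statistic would then be compared with the upper critical value of the F distribution on (2, 33) degrees of freedom, taken from tables or statistical software.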
2. (a) A sample of customers were asked to score movies and each customer’s average scores for three different genres of movies (Action, Comedy and Romance) were calculated. The sample correlation matrix was
          Action   Comedy   Romance
Action     1        0.63    -0.58
Comedy     0.63     1       -0.34
Romance   -0.58    -0.34     1
Calculate the partial correlation coefficient between Comedy and Romance given Action. What does this indicate about the relationship between the average scores for these genres?
[ 5 marks ]
(b) Suppose that we have three random variables X1 , X2 and X3. Explain how the partial correlation coefficient between X1 and X2 given X3 can be calculated using linear regression. [ 5 marks ]
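Both parts of this question rest on the identity r_12.3 = (r_12 − r_13 r_23) / sqrt((1 − r_13^2)(1 − r_23^2)), which equals the ordinary correlation between the residuals of X1 regressed on X3 and the residuals of X2 regressed on X3. The Python sketch below applies the formula to the correlations in part (a) and then checks the regression route on a small data set. The data vectors x1, x2, x3 are hypothetical and serve only to illustrate the equivalence; the function names are illustrative as well.

```python
import math


def partial_corr(r_xy, r_xz, r_yz):
    """Partial correlation of X and Y given Z from pairwise correlations."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def corr(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def residuals(y, z):
    """Residuals from the least-squares regression of y on z (with intercept)."""
    n = len(y)
    my, mz = sum(y) / n, sum(z) / n
    slope = (sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
             / sum((zi - mz) ** 2 for zi in z))
    return [(yi - my) - slope * (zi - mz) for yi, zi in zip(y, z)]


# Part (a): Comedy and Romance given Action, using the matrix in the question.
r_partial = partial_corr(-0.34, 0.63, -0.58)
print(f"partial correlation (Comedy, Romance | Action) = {r_partial:.3f}")

# Part (b): the same quantity via regression residuals, on HYPOTHETICAL data.
x1 = [1.0, 2.0, 4.0, 3.0, 6.0]
x2 = [2.0, 1.0, 5.0, 4.0, 3.0]
x3 = [1.0, 3.0, 2.0, 5.0, 4.0]
via_residuals = corr(residuals(x1, x3), residuals(x2, x3))
via_formula = partial_corr(corr(x1, x2), corr(x1, x3), corr(x2, x3))
```

For the movie scores the partial correlation comes out close to zero (about 0.04), whereas the marginal correlation between Comedy and Romance is −0.34.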
3. Data were collected on 406 cars. We will consider five variables: engine displacement, horsepower, weight, acceleration and miles per gallon (MPG). The variables were divided into two groups: physical characteristics,
X1 = (Displacement, Horsepower, Weight),
and performance characteristics,
X2 = (Acceleration, MPG).
The data were analysed using canonical correlation analysis. The first two canonical correlation vectors were
a1 = (−0.262, 0.777, −0.021) and b1 = (−0.460, 0.715),
a2 = (0.500, 1.575, −2.274) and b2 = (−1.004, 0.841).
The canonical correlations were 0.88 and 0.63. Interpret these results. [ 10 marks ]
4. Consider the graph:
Answer the following questions:
(a) Is the graph complete? Justify your answer. [ 2 marks ]
(b) Are (X1, X2, X3) and (X3, X5, X6) paths? Justify your answer. [ 2 marks ]
(c) List the set of maximal cliques in the graph and use them to factorize the joint probability distribution of (X1, X2, X3, X4, X5, X6). [ 4 marks ]
(d) Provide the definition of a decomposable graph. [ 2 marks ]
5. Answer the following questions about mixture models:
(a) Provide the definition of mixture models. [ 2 marks ]
(b) Motivate the use of mixture models by using some examples. [ 2 marks ]
(c) Explain how to sample an observation from a mixture model. Write the R code for an example. [ 3 marks ]
(d) Explain how to complete the data in order to perform the EM algorithm. [ 3 marks ]
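Sampling from a mixture, as in part (c), is a two-step procedure: first draw a latent component label with probabilities equal to the mixing weights, then draw the observation from the selected component. The question asks for R; the sketch below shows the same two steps in Python with only the standard library. It assumes a two-component univariate Gaussian mixture whose weights, means and standard deviations are hypothetical.

```python
import random


def sample_mixture(weights, means, sds, rng):
    """Draw one observation from a univariate Gaussian mixture.

    Step 1: draw the latent label k with P(k) = weights[k].
    Step 2: draw x from component k, N(means[k], sds[k]^2).
    """
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    x = rng.gauss(means[k], sds[k])
    return k, x


rng = random.Random(1)  # fixed seed so the run is reproducible

# HYPOTHETICAL mixture: 0.3 * N(-2, 1^2) + 0.7 * N(3, 0.5^2)
weights, means, sds = [0.3, 0.7], [-2.0, 3.0], [1.0, 0.5]

draws = [sample_mixture(weights, means, sds, rng) for _ in range(5000)]
share_second = sum(1 for k, _ in draws if k == 1) / len(draws)
sample_mean = sum(x for _, x in draws) / len(draws)
print(f"share from component 2: {share_second:.3f}, sample mean: {sample_mean:.3f}")
```

Each returned pair (k, x) is one complete-data observation. In practice only x is observed, and the label k is the missing part that the EM algorithm in part (d) fills in through its posterior probabilities.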
6. (a) Explain what a distance matrix and a similarity matrix are.
(b) Explain how to transform a similarity matrix into a distance matrix.
(c) Explain in detail the model-based clustering approach.
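One standard transformation for part (b) maps a symmetric similarity matrix C to distances through d_ij = sqrt(c_ii − 2 c_ij + c_jj). The Python sketch below assumes that form; other transformations exist, and the similarity values used here are hypothetical.

```python
import math


def similarity_to_distance(sim):
    """Convert a symmetric similarity matrix to a distance matrix using
    d_ij = sqrt(c_ii - 2 * c_ij + c_jj)."""
    n = len(sim)
    return [
        # max(..., 0.0) guards against tiny negative values from rounding
        [math.sqrt(max(sim[i][i] - 2 * sim[i][j] + sim[j][j], 0.0))
         for j in range(n)]
        for i in range(n)
    ]


# HYPOTHETICAL similarity matrix with unit self-similarity
sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
dist = similarity_to_distance(sim)
for row in dist:
    print(["%.3f" % d for d in row])
```

The result has a zero diagonal and is symmetric, and more similar pairs receive smaller distances.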
SECTION B
These questions will each be marked out of 25. Candidates may not attempt more than TWO of the THREE questions.
7. (a) Suppose we have a dataset containing n observations and each of the observations is a p-dimensional vector x_i = [x_i1, ..., x_ip]^T (i = 1, ..., n). Describe how to obtain loadings and scores in principal component analysis for this dataset. Explain the geometric meaning of the loadings and the scores. [ 10 marks ]
(b) Suppose we denote the covariance matrix of the original data as SX. The eigenvectors of SX are columns of the matrix A and the eigenvalues of SX are the diagonal elements of the diagonal matrix Λ. Show that the principal component scores are uncorrelated. [ 5 marks ]
(c) For each of the 50 states in the United States, the dataset contains the number of arrests per 100,000 residents for each of three crimes: Assault, Murder, and Rape. We also record UrbanPop (the percent of the population in each state living in urban areas). The loadings of the first two principal components are
            PC1     PC2
Murder      0.54   -0.42
Assault     0.58   -0.19
UrbanPop    0.28    0.87
Rape        0.54    0.17
The eigenvalues of the correlation matrix are 2.48, 0.99, 0.36 and 0.17.
(i) Provide an interpretation of the first two principal components. [ 5 marks ]
(ii) Draw a scree plot and discuss the number of principal components that you would use. [ 5 marks ]
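For part (ii), the scree plot graphs each eigenvalue against its component number, and the share of variance explained by a component is its eigenvalue divided by the sum of all eigenvalues. A short Python sketch using the four eigenvalues stated in the question is given below; the text bar chart is only a rough stand-in for a proper plot.

```python
# Eigenvalues of the correlation matrix, as given in the question
eigenvalues = [2.48, 0.99, 0.36, 0.17]

total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]

# Running total of the proportions of variance explained
cumulative = []
running = 0.0
for prop in proportions:
    running += prop
    cumulative.append(running)

# Crude text scree plot: bar length is proportional to the eigenvalue
for k, (ev, prop, cum) in enumerate(zip(eigenvalues, proportions, cumulative), start=1):
    bar = "#" * round(20 * ev / eigenvalues[0])
    print(f"PC{k}: eigenvalue={ev:.2f}  proportion={prop:.3f}  cumulative={cum:.3f}  {bar}")
```

The first two components together account for roughly 87% of the total variance. Only the first eigenvalue exceeds 1, although the second, at 0.99, is borderline under that rule.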
8. In a study on diabetes, we aim to identify people with high risk of diabetes. The patient records containing eight variables were obtained for two classes of people: 45 records for healthy individuals (class 0) and 25 records for individuals with a high risk of diabetes (class 1).
(a) Assume that the two classes follow multivariate normal distributions with the same covariance matrix, MN(µ0, Σ) and MN(µ1, Σ). A new sample Xnew can then be classified using a linear discriminant function of the form
a^T (Xnew − b).
Derive a^T and b using the maximum likelihood discriminant rule. [ 15 marks ]
(b) To simplify the study, the original data were transformed using principal component analysis. The first two principal components are used in the analysis, instead of the original data. The sample mean vectors of the two classes are
x̄0 = (−0.40, −0.20)^T and x̄1 = (0.75, 0.36)^T.
The sample covariance matrices of the two classes are
S0 = [entries not recoverable] and S1 = [entries not recoverable].
Using the above information, show that the estimate of Σ in (a) is Σ̂ = [entries not recoverable]. [ 5 marks ]
(c) Given the information in (b) and that Σ̂^-1 = [entries not recoverable], calculate the linear discriminant function for the diabetes data. Classify the following person with feature vector (−0.45, 0.15)^T to one of the two classes. [ 5 marks ]
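With equal covariance matrices, the maximum likelihood rule gives a = Σ^-1 (µ1 − µ0) and b = (µ0 + µ1) / 2, and a new observation is allocated to class 1 when a^T (Xnew − b) > 0. The Python sketch below strings the steps of parts (b) and (c) together. The class means, class sizes and feature vector come from the question, but the two class covariance matrices are HYPOTHETICAL stand-ins, so the numerical output is not the answer to the question. The sketch also assumes the unbiased pooled estimator with divisor n0 + n1 − 2; the maximum likelihood estimator would divide by n0 + n1 instead.

```python
def inv2x2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def pooled_cov(s0, s1, n0, n1):
    """Pooled estimate ((n0 - 1) S0 + (n1 - 1) S1) / (n0 + n1 - 2)."""
    return [[((n0 - 1) * s0[i][j] + (n1 - 1) * s1[i][j]) / (n0 + n1 - 2)
             for j in range(2)] for i in range(2)]


def discriminant(mean0, mean1, sigma):
    """Return (a, b) with a = sigma^{-1} (mean1 - mean0), b = midpoint of the means."""
    sigma_inv = inv2x2(sigma)
    diff = [m1 - m0 for m0, m1 in zip(mean0, mean1)]
    a = [sum(sigma_inv[i][j] * diff[j] for j in range(2)) for i in range(2)]
    b = [(m0 + m1) / 2 for m0, m1 in zip(mean0, mean1)]
    return a, b


def score(x, a, b):
    """Linear discriminant a^T (x - b); positive values favour class 1."""
    return sum(ai * (xi - bi) for ai, xi, bi in zip(a, x, b))


# From the question: class means, class sizes and the new feature vector
mean0, mean1 = [-0.40, -0.20], [0.75, 0.36]
n0, n1 = 45, 25
x_new = [-0.45, 0.15]

# HYPOTHETICAL class covariance matrices (the question's matrices are not available)
s0 = [[1.0, 0.2], [0.2, 0.5]]
s1 = [[1.2, 0.1], [0.1, 0.6]]

sigma_hat = pooled_cov(s0, s1, n0, n1)
a, b = discriminant(mean0, mean1, sigma_hat)
value = score(x_new, a, b)
label = 1 if value > 0 else 0
print(f"a = {a}, b = {b}, score = {value:.4f}, class = {label}")
```

Only b, the midpoint of the two class means, is independent of the covariance matrices; a and the resulting score change once the actual matrices are substituted.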
9. Eight objects, labeled A, B, C, D, E, F, G and H, have measures of dissimilarity between them assessed as shown below.
      A     B     C     D     E     F     G     H
A     0    57   105    95   100    93    89    51
B    57     0   104    76    92    83    78    37
C   105   104     0    73    99   102   129   121
D    95    76    73     0    40    49    49    57
E   100    92    99    40     0    72    52    74
F    93    83   102    49    72     0    34    60
G    89    78   129    49    52    34     0    56
H    51    37   121    57    74    60    56     0
(a) Demonstrate the complete-link cluster analysis procedure by calculating the matrix showing dissimilarities between clusters in a solution with five clusters; that is, perform three iterations of the procedure of aggregating clusters. [ 13 marks ]
(b) State the main difference between the single-link cluster analysis and the complete-link cluster analysis. Illustrate this through the formation of six clusters in a single-link cluster analysis. [ 12 marks ]
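Agglomerative clustering repeatedly merges the two closest clusters; complete link measures the dissimilarity between two clusters by the largest pairwise dissimilarity, single link by the smallest. The Python sketch below runs both on the matrix from this question. The function name agglomerate and its return layout are illustrative choices, and ties, should any occur, are broken by taking the first pair encountered.

```python
from itertools import combinations

LABELS = "ABCDEFGH"
ROWS = [
    [0, 57, 105, 95, 100, 93, 89, 51],
    [57, 0, 104, 76, 92, 83, 78, 37],
    [105, 104, 0, 73, 99, 102, 129, 121],
    [95, 76, 73, 0, 40, 49, 49, 57],
    [100, 92, 99, 40, 0, 72, 52, 74],
    [93, 83, 102, 49, 72, 0, 34, 60],
    [89, 78, 129, 49, 52, 34, 0, 56],
    [51, 37, 121, 57, 74, 60, 56, 0],
]
# Dissimilarity lookup keyed by a pair of object labels
D = {(LABELS[i], LABELS[j]): ROWS[i][j] for i in range(8) for j in range(8)}


def agglomerate(n_clusters, link):
    """Merge clusters until n_clusters remain.

    link is max for complete link and min for single link.
    Returns (clusters, merge history, between-cluster dissimilarities).
    """
    clusters = [(x,) for x in LABELS]
    merges = []

    def dist(c1, c2):
        return link(D[u, v] for u in c1 for v in c2)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters under the chosen linkage
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append(("".join(c1), "".join(c2), dist(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]

    table = {("".join(c1), "".join(c2)): dist(c1, c2)
             for c1, c2 in combinations(clusters, 2)}
    return clusters, merges, table


complete_clusters, complete_merges, complete_table = agglomerate(5, max)
single_clusters, single_merges, single_table = agglomerate(6, min)

print("complete link merges:", complete_merges)
print("single link merges:  ", single_merges)
```

On this matrix the complete-link run merges F with G at 34, B with H at 37 and D with E at 40, leaving the clusters A, C, FG, BH and DE. The single-link run makes the same first two merges; the two methods differ in the updated dissimilarities, for example FG to BH is 83 under complete link and 56 under single link.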
2022-05-18