STAT3064/STAT5061 Semester 2, 2022 Assignment 3
Assignment Questions
1. (a) Consider the four linkages given in the lecture slides for agglomerative hierarchical clustering. Explain the main differences between them, including how they affect the clustering. When would you prefer to use the complete linkage and when the single linkage? Give reasons for your answers.
(b) Explain why k-means clustering in practice often results in different cluster arrangements when carried out multiple times, and suggest a way to ameliorate such differences while at the same time obtaining a good solution.
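The behaviour described in part (b) can be observed on simulated data. The following is a minimal sketch only: the toy data `X`, the seed and the choice of three clusters are assumptions made for illustration and are not part of the assignment data. It compares ten single-start runs of `kmeans` with one run that uses `nstart = 25`.

```r
set.seed(1)

# Hypothetical toy data: three Gaussian groups in two dimensions (150 x 2).
X <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 3), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2)
)

# Ten separate runs, each from a single random set of initial centres.
single_W <- sapply(1:10, function(i) {
  kmeans(X, centers = 3, nstart = 1)$tot.withinss
})

# One run that tries 25 random starts and keeps the arrangement
# with the smallest total within-cluster sum of squares.
multi <- kmeans(X, centers = 3, nstart = 25)

print(round(single_W, 2))
print(round(multi$tot.withinss, 2))
print(multi$size)
```

If the single-start values differ from one another, the runs have ended in different local optima; the `nstart` argument addresses this by retaining the best of several starts.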
2. Consider the Dow Jones returns which you will find in the Data Sets folder. Read the data into R and select all rows of the data but only the columns for day 1201 to day 2400 inclusive. Refer to these data as DJ1201. Your first entry should be for date 2 October 1995 and the last should be for 30 June 2000.
(Hint. You may find the code chunk below useful.)
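```{r}
library(dplyr)      # provides mutate(), mutate_if() and %>%
library(lubridate)  # provides as_date()

DJ1 = read.csv("Dow_Jones_returns/DJ30returns.csv")
DJuse = DJ1[-(1:5), ]   # drop the first five rows
head(DJuse)
# Parse column X as a date, drop X, then convert character columns to numeric
DJuse = mutate(DJuse, Date = as_date(X, format = "%d/%Om/%y")) %>%
  dplyr::select(-X) %>%
  mutate_if(is.character, as.numeric)
summary(DJuse)
```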
Use the raw data in this question; do not scale. Remember that days or dates do not make any sense as part of a PCA or clustering analysis of returns.
(a) Use the 30 stocks of DJ1201 as observations and answer the following. Do not show your R code; simply list the information required.
(Hint. You need to calculate the quantities in iii. to vi. below directly from the covariance matrix of the data with stocks as observations. Use the command rankMatrix in the library Matrix.)
i. What is the size of the data matrix?
ii. What is the size of the covariance matrix of the observations?
iii. What is the rank of the covariance matrix of these observations?
iv. What is the largest eigenvalue of the covariance matrix of part ii.?
v. What do you notice about the 30th eigenvalue of this matrix?
vi. Show a plot of the eigenvalues of the covariance matrix and comment on its shape.
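The quantities in part (a) can be explored on a small simulated matrix before working with DJ1201. The sketch below is illustrative only: the sizes `n` and `p` and the random data are assumptions, and the rank is counted from the eigenvalues with an ad hoc relative tolerance rather than with rankMatrix, which could be used in its place.

```r
set.seed(2)

# Hypothetical sizes, chosen small for illustration (not the DJ1201 sizes).
n <- 6    # number of observations (rows)
p <- 20   # number of variables (columns)
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Sample covariance matrix of the p variables: a p x p matrix.
S <- cov(X)

# Eigenvalues in decreasing order.
ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values

# Numerical rank: eigenvalues that are not negligible relative to the largest.
# The tolerance 1e-8 is an arbitrary choice for this sketch.
num_rank <- sum(ev > 1e-8 * ev[1])

print(dim(S))
print(num_rank)
print(signif(ev[1:(n + 1)], 3))
```

A call such as `plot(ev, type = "b")` would display the eigenvalues against their index.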
(b) Use the stocks as observations. Cluster the stocks using k-means with the Euclidean distance, k = 2 and nstart = 25. What is the size of each of the two clusters? What stocks belong to the smaller of the two clusters?
(c) Use the 1200 daily returns as observations. Cluster the daily returns using k-means with the Euclidean distance and nstart = 25. If you have problems with convergence use iter.max = 50. Carry out the following tasks:
i. For each k = 2, ..., 12 and for each of the e = 2, ..., k clusters, calculate the number of observations and display the results in a cluster table similar to that shown in Lecture 9.
ii. For k = 2, ..., 12 on the x-axis, show separate plots of the within-cluster variability W, the between-cluster variability B and the total sum of squares against the index k. Comment on the behaviour of the within-cluster variability W and the between-cluster variability B as k increases.
iii. Based on the calculations and graphs in parts i. and ii., state what you think is the right number of clusters for these data and give a reason for your choice. Comment on your results.
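The bookkeeping for part (c) might be organised as in the following sketch. It is a sketch under stated assumptions: the simulated matrix `X`, the reduced range k = 2, ..., 6 and the table layout (one row per k, cluster sizes sorted in decreasing order and padded with NA) are illustrative choices and need not match the layout used in Lecture 9.

```r
set.seed(3)

# Hypothetical data: 200 observations on 5 variables.
X <- matrix(rnorm(200 * 5), ncol = 5)
ks <- 2:6

# One k-means fit per value of k.
fits <- lapply(ks, function(k) {
  kmeans(X, centers = k, nstart = 25, iter.max = 50)
})

# Within-cluster, between-cluster and total sums of squares for each k.
ss <- data.frame(
  k     = ks,
  W     = sapply(fits, function(f) f$tot.withinss),
  B     = sapply(fits, function(f) f$betweenss),
  total = sapply(fits, function(f) f$totss)
)

# Cluster-size table: one row per k, unused columns filled with NA.
sizes <- t(sapply(fits, function(f) {
  s <- sort(f$size, decreasing = TRUE)
  c(s, rep(NA, max(ks) - length(s)))
}))
rownames(sizes) <- paste0("k=", ks)

print(ss)
print(sizes)
```

The columns of `ss` can then be plotted against `ss$k`, for example with `plot(ss$k, ss$W, type = "b")`.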
3. Use the 30 stocks of DJ1201 as observations and work with the raw data as in Q2.
(a) Calculate the first two principal components and show a PC1/PC2 score plot similar to those shown in Figures 10.11 and 10.12. Compare your score plot with those of these figures. What do you notice about the pattern or distribution of the 6 Tech stocks in your score plot? How does this pattern differ from the patterns in these figures with respect to the Tech stocks?
(b) Apply agglomerative hierarchical clustering to the stocks based on the complete linkage and the Euclidean distance. Show the dendrogram of this cluster analysis.
(c) Show a cluster table by levels up to 12 levels similar to Table 10.4. Comment on your dendrogram and your cluster/level table. Compare your figure and table with those of Figure 10.13 and Table 10.4.
(d) Now use the daily returns of the DJ1201 as observations. Apply agglomerative hierarchical clustering to the daily returns based on the complete linkage and the Euclidean distance. Show the dendrogram of the cluster analysis.
(e) Show a cluster table by levels up to level 12 similar to Table 10.5. Comment on your dendrogram and your cluster/level table. Compare your figure and table with those of Figure 10.14 and Table 10.5.
(f) Why do you think your results in parts (d) and (e) differ considerably from those of Figure 10.14 and Table 10.5?
(Hint. Only a short answer is required here, but you need more than stating that the data are different.)
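For the hierarchical clustering parts of this question, a cluster table by levels could be assembled as in the sketch below. The simulated data, the reduced range of levels and the reading of "level k" as the cut of the dendrogram into k clusters are all assumptions made for illustration; the layout of Tables 10.4 and 10.5 may differ.

```r
set.seed(4)

# Hypothetical data: 30 observations on 8 variables.
X <- matrix(rnorm(30 * 8), nrow = 30)

# Complete linkage on Euclidean distances between the rows of X.
hc <- hclust(dist(X, method = "euclidean"), method = "complete")

# For each level k, cut the tree into k clusters and record the cluster sizes.
n_clusters <- 2:6
tab <- t(sapply(n_clusters, function(k) {
  s <- sort(as.vector(table(cutree(hc, k = k))), decreasing = TRUE)
  c(s, rep(NA, max(n_clusters) - length(s)))
}))
rownames(tab) <- paste0("level ", n_clusters)

print(tab)
```

Calling `plot(hc)` draws the corresponding dendrogram.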
4. Consider the 13-dimensional wine recognition data of Example 4.6. The data are available in the Data Sets folder. Ignore the first column, which contains the class labels, and use the raw data for the calculations of the parts below. Note that the file is tsv (not csv). Read the data into R.
(a) For k = 2, ..., 10 calculate the cluster arrangement using nstart = 50. List the within-cluster variabilities and the between-cluster variabilities.
(b) Show the results of clustering up to k = 10 in a table similar to that of Table 10.4.
(c) Compare the cluster table of part (b) with the analogous table calculated as part of Q3 in Lab 8 (which refers to the k-means clustering of the scaled wine data). Comment.
(d) For the raw (so not scaled) data, calculate the cluster statistics WV, CH, the between-cluster variability B and the within-cluster variability, as in Chapter 10, for k ≤ 10, and plot the results of these statistics against the index k, shown on the x-axis.
(e) Based on the analyses in parts (b) to (d) select the number of clusters for these data. Give a reason for your choice.
(f) The wine data come from three different cultivars which may be regarded as the classes for these data. Use your results from parts (a) and (b) for k = 3 clusters and compare these results with the membership of the data to the cultivars. Show your results in an appropriate table and comment.
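Two of the calculations in this question can be sketched on simulated data. The code below is illustrative only: the data `X` and the stand-in class vector `cultivar` are hypothetical, and CH is taken to be the usual Calinski-Harabasz ratio (B / (k - 1)) / (W / (n - k)), which is an assumption about the definition used in Chapter 10. The statistic WV is not computed here.

```r
set.seed(5)

# Hypothetical stand-in for labelled data: three groups of 40 observations
# on 4 variables, with group means shifted by 4 units per group.
cultivar <- rep(1:3, each = 40)
X <- matrix(rnorm(120 * 4), ncol = 4) + 4 * cultivar
n <- nrow(X)

# Calinski-Harabasz statistic for each k (assumed definition, see above).
ks <- 2:6
ch <- sapply(ks, function(k) {
  f <- kmeans(X, centers = k, nstart = 50)
  (f$betweenss / (k - 1)) / (f$tot.withinss / (n - k))
})
names(ch) <- ks

# Cross-tabulate the known classes against the 3-cluster arrangement.
fit3 <- kmeans(X, centers = 3, nstart = 50)
confusion <- table(class = cultivar, cluster = fit3$cluster)

print(round(ch, 1))
print(confusion)
```

The cluster numbers returned by `kmeans` are arbitrary, so the rows and columns of the cross-tabulation need to be matched by their counts rather than by their labels.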
2022-10-06