STAT3064/STAT5061 Semester 2, 2022

Assignment 3

Assignment Questions

1. (a) Consider the four linkages given in the lecture slides for agglomerative hierarchical clustering. Explain the main differences between them, including how they affect the clustering. When would you prefer to use the complete linkage and when the single linkage? Give reasons for your answers.

(b) Explain why k-means clustering in practice often results in different cluster arrangements when carried out multiple times, and suggest a way to ameliorate such differences while at the same time obtaining a good solution.
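The behaviour described in part (b) can be observed with a small experiment. The following is a minimal sketch on simulated data, not on any course data set; the matrix `X`, the three Gaussian groups and the number of repetitions are assumptions made purely for illustration.

```r
set.seed(1)

# Simulated data (an assumption): three Gaussian groups in two dimensions
X <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
)

# Ten separate runs, each started from a single random set of centres
single <- sapply(1:10, function(i) {
  kmeans(X, centers = 3, nstart = 1)$tot.withinss
})

# One run that keeps the best of 25 random starts
multi <- kmeans(X, centers = 3, nstart = 25)

print(range(single))        # spread of W across the single-start runs
print(multi$tot.withinss)   # W of the best of the 25 starts
print(multi$size)           # cluster sizes of that solution
```

Comparing `range(single)` with `multi$tot.withinss` shows how much the within-cluster variability depends on the starting centres for a given data set.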

2. Consider the Dow Jones returns, which you will find in the Data Sets folder. Read the data into R and select all rows of the data but only the columns for day 1201 to day 2400 inclusive. Refer to these data as DJ1201. Your first entry should be for date 2 October 1995 and the last should be for 30 June 2000.

(Hint. You may find the code chunk below useful.)

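```{r}
library(dplyr)      # mutate(), mutate_if(), select() and the %>% pipe
library(lubridate)  # as_date()

DJ1 = read.csv("Dow_Jones_returns/DJ30returns.csv")
DJuse = DJ1[-(1:5), ]   # drop the first five rows
head(DJuse)

# Parse the date column X, drop it, then convert character columns to numeric
DJuse = mutate(DJuse, Date = as_date(X, format = "%d/%Om/%y")) %>%
  dplyr::select(-X) %>%
  mutate_if(is.character, as.numeric)
summary(DJuse)
```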

Use the raw data in this question; do not scale. Remember that days or dates do not make any sense as part of a PCA or clustering analysis of returns.

(a) Use the 30 stocks of DJ1201 as observations and answer the following. Do not show your R code; simply list the information required.

(Hint. You need to calculate the quantities in iii. to vi. below directly from the covariance matrix of the data with stocks as observations. Use the command rankMatrix in the library Matrix.)

i. What is the size of the data matrix?

ii. What is the size of the covariance matrix of the observations?

iii. What is the rank of the covariance matrix of these observations?

iv. What is the largest eigenvalue of the covariance matrix of part ii?

v. What do you notice about the 30th eigenvalue of this matrix?

vi. Show a plot of the eigenvalues of the covariance matrix and comment on its shape.
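The mechanics behind the hint can be sketched on simulated data. The dimensions `n` and `d` and the matrix `X` below are assumptions chosen only to give fewer observations than variables; they are not the DJ1201 data. Because the sample covariance matrix is computed from centred data, its rank can be at most n - 1.

```r
library(Matrix)  # rankMatrix()

set.seed(1)

# Simulated stand-in (an assumption): n observations on d variables, n < d
n <- 30
d <- 100
X <- matrix(rnorm(n * d), nrow = n, ncol = d)  # rows are observations

S <- cov(X)                      # d x d sample covariance matrix
r <- as.integer(rankMatrix(S))   # numerical rank
ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values

print(dim(S))
print(r)
print(ev[c(1, n - 1, n)])        # largest, (n-1)th and nth eigenvalues

# plot(ev, type = "b") would display the eigenvalues against their index
```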

(b) Use the stocks as observations. Cluster the stocks using k-means with the Euclidean distance, k = 2 and nstart = 25. What is the size of each of the two clusters? Which stocks belong to the smaller of the two clusters?

(c) Use the 1200 daily returns as observations. Cluster the daily returns using k-means with the Euclidean distance and nstart = 25. If you have problems with convergence, use iter.max = 50. Carry out the following tasks:

i. For each k = 2, ..., 12 and for each of the e = 2, ..., k clusters, calculate the number of observations and display the results in a cluster table similar to that shown in Lecture 9.

ii. For k = 2, ..., 12 on the x-axis, show separate plots of the within-cluster variability W, the between-cluster variability B and the total sum of squares against the index k. Comment on the behaviour of the within-cluster variability W and the between-cluster variability B as k increases.

iii. Based on the calculations and graphs in parts i. and ii., state what you think is the right number of clusters for these data and give a reason for your choice. Comment on your results.
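The quantities asked for in parts i. and ii. are all returned by `kmeans`. The sketch below loops over k on simulated data; the matrix `X`, its size and the layout of the table `sizes` are assumptions for illustration, and the table format of Lecture 9 may differ.

```r
set.seed(1)

# Simulated stand-in (an assumption): 200 observations on 5 variables
X <- matrix(rnorm(200 * 5), nrow = 200)

ks <- 2:12
fits <- lapply(ks, function(k) {
  kmeans(X, centers = k, nstart = 25, iter.max = 50)
})

# Within-cluster, between-cluster and total sums of squares for each k
W   <- sapply(fits, function(f) f$tot.withinss)
B   <- sapply(fits, function(f) f$betweenss)
Tot <- sapply(fits, function(f) f$totss)

# Cluster table: one row per k, cluster sizes in decreasing order, NA padding
sizes <- t(sapply(fits, function(f) {
  c(sort(f$size, decreasing = TRUE), rep(NA, 12 - length(f$size)))
}))
rownames(sizes) <- paste0("k=", ks)

print(sizes)
print(data.frame(k = ks, W = W, B = B, Total = Tot))
```

Calls such as `plot(ks, W, type = "b")` and `plot(ks, B, type = "b")` then give the separate plots against k.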

3. Use the 30 stocks of DJ1201 as observations and work with the raw data as in Q2.

(a) Calculate the first two principal components and show a PC1/PC2 score plot similar to those shown in Figures 10.11 and 10.12. Compare your score plot with those of these figures. What do you notice about the pattern or distribution of the 6 Tech stocks in your score plot? How does this pattern differ from the pattern in these figures with respect to the Tech stocks?
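One way to obtain PC scores from raw (centred but unscaled) data is `prcomp`. The sketch below uses simulated data; the matrix `X` and its row names are assumptions, and the course may compute the principal components differently.

```r
set.seed(1)

# Simulated stand-in (an assumption): 30 observations on 100 variables
X <- matrix(rnorm(30 * 100), nrow = 30,
            dimnames = list(paste0("obs", 1:30), NULL))

# Centre but do not scale, in line with working on the raw data
pc <- prcomp(X, center = TRUE, scale. = FALSE)

scores <- pc$x[, 1:2]                          # PC1 and PC2 scores
prop_var <- pc$sdev[1:2]^2 / sum(pc$sdev^2)    # proportion of variance

print(head(scores))
print(round(prop_var, 3))

# plot(scores, xlab = "PC1", ylab = "PC2") would draw the score plot
```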

(b) Apply agglomerative hierarchical clustering to the stocks based on the complete linkage and the Euclidean distance. Show the dendrogram of this cluster analysis.

(c) Show a cluster table by levels up to 12 levels similar to Table 10.4. Comment on your dendrogram and your cluster/level table. Compare your figure and table with those of Figure 10.13 and Table 10.4.
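Parts (b) and (c) combine `dist`, `hclust` and `cutree`. The following sketch runs on simulated data; the matrix `X` and the layout of `levels_tab` are assumptions, and the exact format of Table 10.4 may differ.

```r
set.seed(1)

# Simulated stand-in (an assumption): 30 observations on 10 variables
X <- matrix(rnorm(30 * 10), nrow = 30,
            dimnames = list(paste0("obs", 1:30), NULL))

# Complete linkage on Euclidean distances
hc <- hclust(dist(X, method = "euclidean"), method = "complete")

# Cluster sizes when the tree is cut into k = 2, ..., 12 clusters,
# sorted in decreasing order and padded with NA
levels_tab <- t(sapply(2:12, function(k) {
  s <- sort(as.integer(table(cutree(hc, k = k))), decreasing = TRUE)
  c(s, rep(NA, 12 - k))
}))
rownames(levels_tab) <- paste0("k=", 2:12)

print(levels_tab)

# plot(hc) would draw the dendrogram
```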

(d) Now use the daily returns of the DJ1201 as observations. Apply agglomerative hierarchical clustering to the daily returns based on the complete linkage and the Euclidean distance. Show the dendrogram of the cluster analysis.

(e) Show a cluster table by levels up to level 12 similar to Table 10.5. Comment on your dendrogram and your cluster/level table. Compare your figure and table with those of Figure 10.14 and Table 10.5.

(f) Why do you think your results in parts (d) and (e) differ considerably from those of Figure 10.14 and Table 10.5?

(Hint. Only a short answer is required here, but you need more than stating that the data are different.)

4. Consider the 13-dimensional wine recognition data of Example 4.6. The data are available in the Data Sets folder. Ignore the first column and use the raw data for the calculations of the parts below. (The first column contains the class labels.) Note that the file is tsv (not csv). Read the data into R.

(a) For k = 2, ..., 10 calculate the cluster arrangement using nstart = 50. List the within-cluster variabilities and the between-cluster variabilities.

(b) Show the results of clustering up to k = 10 in a table similar to that of Table 10.4.

(c) Compare the cluster table of part (b) with the analogous table calculated as part of Q3 in Lab 8 (which refers to the k-means clustering of the scaled wine data). Comment.

(d) For the raw (so not scaled) data, calculate the cluster statistics WV, CH, the between-cluster variability B and the within-cluster variability (as in Chapter 10) for k ≤ 10, and plot the results of these statistics against the index k, shown on the x-axis.
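As an illustration of one of these statistics, the sketch below computes CH from `kmeans` output using the usual Calinski-Harabasz form, CH(k) = [B(k)/(k - 1)] / [W(k)/(n - k)]. This form is an assumption here; check it against the definition in Chapter 10, which also defines WV (not reproduced in this sketch). The simulated matrix `X` is likewise an assumption.

```r
set.seed(1)

# Simulated stand-in (an assumption): three Gaussian groups in two dimensions
X <- rbind(
  matrix(rnorm(120, mean = 0),  ncol = 2),
  matrix(rnorm(120, mean = 5),  ncol = 2),
  matrix(rnorm(120, mean = 10), ncol = 2)
)
n <- nrow(X)
ks <- 2:10

# CH(k) = [B(k) / (k - 1)] / [W(k) / (n - k)]  (assumed standard form)
ch <- sapply(ks, function(k) {
  fit <- kmeans(X, centers = k, nstart = 50)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
})

best_k <- ks[which.max(ch)]   # k with the largest CH value

print(data.frame(k = ks, CH = round(ch, 1)))
print(best_k)

# plot(ks, ch, type = "b") would show CH against the index k
```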

(e) Based on the analyses in parts (b) to (d), select the number of clusters for these data. Give a reason for your choice.

(f) The wine data come from three different cultivars, which may be regarded as the classes for these data. Use your results from parts (a) and (b) for k = 3 clusters and compare these results with the membership of the data to the cultivars. Show your results in an appropriate table and comment.
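A comparison of cluster membership with known classes can be laid out as a cross-tabulation. The sketch below uses R's built-in `iris` data purely as a stand-in for the wine data (an assumption for illustration); `iris$Species` plays the role of the cultivar labels.

```r
set.seed(1)

# Stand-in data (an assumption): iris, with Species as the class label
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 50)

# Rows are the true classes, columns are the k-means cluster labels
tab <- table(Class = iris$Species, Cluster = fit$cluster)
print(tab)
```

The cluster labels 1, 2, 3 are arbitrary, so the table should be read by matching each class to the cluster that holds most of its members rather than by comparing label values.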