This assignment contains 12 questions, totalling 100 marks. Please write the answer to each question in the boxes directly underneath it (some of these are R Code boxes and some are R Markdown boxes, depending on the nature of the question). If you need to add text to a Code box, please use the commenting symbol (#) at the beginning of the line.

A few cells contain the text "DO NOT FILL IN THIS CELL". It is very important that you do not write anything into those cells, as this could break the validation and marking of your notebook.

You can make use of any existing R functions (including those in packages, as far as available), or R functions developed in the lectures and labs, unless stated otherwise.

We consider a data set known as the "Hidalgo issue of Mexico". This dataset contains the thickness of 485 stamps that were printed in a mixture of paper types in Mexico between 1872 and 1874. Please use the following code to read the data in and display a histogram:

In [  ]:

require(multimode)

data(stamps)

hist(stamps)

Question 1 (4 marks)

Produce, side-by-side, two additional histograms of the same data, one with about twice as many bins as the original histogram and one with about four times as many.

In [  ]:

# YOUR CODE HERE

Question 2 (4 marks)

Produce and display a kernel density estimate of the stamps data set, using the R function density (with its default options).

In [  ]:

# YOUR CODE HERE

Question 3 (3 marks)

By considering the component $bw of the fitted density object, identify the bandwidth used for the estimation of the density above. Assign this bandwidth value to an object h1.

In [  ]:

# YOUR CODE HERE

In [  ]:

# Do not fill in this cell

Question 4 (12 marks)

We are now interested in estimating the density parametrically, through a Gaussian mixture. We allow the K mixture components to have different standard deviations. Using the R function normalmixEM from the R package mixtools, fit Gaussian mixture models to the stamps data with K = 2, 3, and 4 components, respectively. Report all mixture weights, mean parameters, and estimated component standard deviations, as well as the resulting log-likelihoods.

It is recommended to set a seed to ensure reproducibility. Avoid reporting poor or non-convergent solutions.
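For reference, a Gaussian mixture with K components and component-specific standard deviations has a density of the following form. The symbols $\pi_k$, $\mu_k$, $\sigma_k$ and $\phi$ are notation assumed here, not taken from the assignment text:

$$f(x) = \sum_{k=1}^{K} \pi_k \, \phi(x \mid \mu_k, \sigma_k^2), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1,$$

where $\phi(\cdot \mid \mu, \sigma^2)$ denotes the normal density with mean $\mu$ and variance $\sigma^2$. In mixtools, the object returned by normalmixEM stores the corresponding estimates in its components lambda, mu and sigma, and the final log-likelihood in loglik.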

In [  ]:

# YOUR CODE HERE

Question 5 (12 marks)

Visualize the three fitted mixture distributions (side-by-side in a 1 x 3 split window).

In [  ]:

# YOUR CODE HERE

Question 6 (15 marks)

For the stamps data, carry out a bootstrap likelihood ratio test of $H_0: K=2$ versus $H_1: K=3$.

For this, you will first need a function that simulates data from a given mixture model. For this purpose, you can use any functions that were developed in lectures or labs, or that you find within R. The bootstrap routine itself MUST be implemented manually.

Provide a short conclusion summarizing your findings.
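The simulation step mentioned above could look roughly like the following minimal sketch. The helper name rmix and all parameter values are hypothetical and chosen purely for illustration; they are not estimates from the stamps data, and the sketch does not implement the bootstrap itself.

```r
# Hypothetical helper (not part of the assignment or of any package):
# draw n values from a Gaussian mixture with weights p, means mu and
# standard deviations sigma.
rmix <- function(n, p, mu, sigma) {
  # Draw a component label for each observation according to the weights.
  z <- sample(seq_along(p), size = n, replace = TRUE, prob = p)
  # Draw each observation from the normal distribution of its component.
  rnorm(n, mean = mu[z], sd = sigma[z])
}

set.seed(1)  # illustrative seed for reproducibility
# Illustrative two-component mixture (made-up parameters)
x <- rmix(500, p = c(0.3, 0.7), mu = c(0, 5), sigma = c(1, 0.5))
length(x)
```

The package mixtools also provides a ready-made simulator, rnormmix, which may be used instead of a hand-written helper.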

In [  ]:

# YOUR CODE HERE

Question 7 (15 marks)

Now carry out k-means clustering with 2, 3 and 4 clusters, respectively. Find the average silhouette width for each of the three clusterings and, based on these, give your judgement on the appropriate number of clusters for this data set. Give a statement which compares this result with that from the fitted mixture models.

In [  ]:

# YOUR CODE HERE

Question 8 (6 marks)

We return now to the problem of kernel density estimation.

A popular rule for automatic bandwidth selection is Silverman's rule of thumb, which is given by

$$h = 0.9 \times A \, n^{-1/5}$$

where $A = \min(s, \mathrm{IQR}/1.34)$, with $s$ being the sample standard deviation of the data, $\mathrm{IQR}$ the interquartile range, and $n$ the sample size.
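As a quick check of the arithmetic in this rule, consider the hypothetical values $s = 2$, $\mathrm{IQR} = 2.01$ and $n = 32$ (chosen for convenience, not taken from the stamps data):

$$A = \min(2,\; 2.01/1.34) = \min(2,\; 1.5) = 1.5, \qquad n^{-1/5} = 32^{-1/5} = 0.5, \qquad h = 0.9 \times 1.5 \times 0.5 = 0.675.$$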

Produce a function with name hsil which implements this rule. Then apply it to the stamps data set, and save the resulting bandwidth to an object h2.

In [  ]:

# YOUR CODE HERE

Question 9 (8 marks)

The bandwidth selection tool implemented in the cell above is based on the concept of "normal reference". Explain what this means, giving as much methodological insight as you are able to. You do not need to carry out actual derivations or computations.

YOUR ANSWER HERE

In [  ]:

# Do not fill in this cell

Question 10 (5 marks)

Another concept for bandwidth selection is that of a "critical bandwidth". The critical bandwidth h(k) is defined as the smallest bandwidth such that the estimated density has at most k modes. Find this bandwidth for k=2, and save the result into an object h3.

Hint: Use the function locmodes. Ignore any warning messages referring to unbounded support of the density.
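To illustrate the idea behind the definition above, the following minimal sketch counts the modes of a kernel density estimate at a few bandwidths. It uses simulated data, not the stamps data, and the helper count_modes is hypothetical (it is not part of multimode and does not replace locmodes); the bandwidth values are arbitrary.

```r
# Hypothetical helper: count the local maxima of a Gaussian kernel density
# estimate evaluated on the grid returned by density() (ties are ignored).
count_modes <- function(x, h) {
  d <- density(x, bw = h)   # Gaussian kernel with bandwidth h
  s <- sign(diff(d$y))      # +1 where the curve rises, -1 where it falls
  sum(diff(s) == -2)        # a rise followed directly by a fall is a mode
}

set.seed(1)  # illustrative seed for reproducibility
# Simulated bimodal sample (made-up parameters)
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 6, sd = 1))

# Number of modes found at a few illustrative bandwidths
modes <- sapply(c(0.1, 0.5, 5), function(h) count_modes(x, h))
modes
```

Small bandwidths tend to produce many spurious modes, while a sufficiently large bandwidth smooths the estimate down to a single mode. The critical bandwidth for k modes sits at the boundary where the mode count first drops to k.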

In [  ]:

# YOUR CODE HERE

In [  ]:

# Do not fill in this cell

Question 11 (6 marks)

Draw, side-by-side, two density plots using the bandwidths h2 and h3, respectively.

In [  ]:

# YOUR CODE HERE

Question 12 (10 marks)

Write an essay which explains how density modes relate to clustering. Your essay should explain, conceptually, how cluster centres and clusters can be identified through density modes, and also give some insight into the computations required to achieve this. Do you deem the densities plotted in the previous cell useful for modal clustering?

YOUR ANSWER HERE