MATH4068: Coursework 2021


MATH4068 students

• This coursework is ASSESSED and is worth 20% of the total module mark for MATH4068.

• Deadline: Coursework should be submitted via the coursework submission area on the Moodle page by Tuesday 18 May, 3pm. I strongly recommend you submit it before then. Do not spend more time on this project than it merits (it is only worth 20% of the module mark).

• Format: Please submit a single PDF or HTML file produced by R Markdown in RStudio.

• Report length: Keep your report concise. Aim to convey the important details in a way that is easy to follow. Think about your reader, and help them to understand the key points quickly. Avoid repetition and long print-outs of uninteresting numerical output.

• Please post any questions about the coursework on the Moodle discussion boards. This will ensure that all students receive the same level of support. I will not be meeting students 1-1 to discuss the coursework, or providing help by email or Teams. Please be careful not to ask anything on the discussion boards that reveals any part of your solution to other students.

You have been given an additional week to do the coursework this year (this ‘grace period’ is to compensate for the difficulties of remote learning). Note, however, that the deadline is strict: work handed in after the deadline will now receive a mark of 0. If your submission is one second late according to Moodle, it will be counted as late.


Plagiarism and Academic Misconduct

For all assessed coursework it is important that you submit your own work. Some information about plagiarism is given on the Moodle webpage.


Grading

The coursework will be marked out of 10:

• 5 marks for technical content, use of R, and appropriate methods

• 5 marks for presentation and interpretation of results.


Coursework

The file gap.csv is available on Moodle. It contains the GDP per capita and the life expectancy for 142 different countries, recorded every five years from 1952 to 2007. This data is from gapminder.org.

Load the data into R using the commands


gap.raw <- read.csv('gap.csv')

gap <- gap.raw

gap[,3:14] <- log(gap.raw[,3:14])


Note that for GDP per capita, it is best to work with log(GDP) when doing statistical analysis, as the values vary over several orders of magnitude between countries. For ease of plotting, it may be useful to split the data into two data frames, one containing GDP per capita, and the other life expectancy data.


gdp <- gap[,3:14]   # log(GDP per capita); gap[,3:14] was log-transformed above

years <- seq(1952, 2007, 5)

colnames(gdp) <- years

rownames(gdp) <- gap[,2]


lifeExp <- gap[,15:26]

colnames(lifeExp) <- years

rownames(lifeExp) <- gap[,2]


In this project, you will analyse this data using the methods we have looked at during the module.

• Begin by creating some basic exploratory data analysis plots, showing how GDP and life expectancy changed over the period covered by the data (1952 to 2007).
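As a rough illustration of one possible plot type, the sketch below draws one line per country over time. It uses a small made-up matrix called `toy` (a hypothetical stand-in, not the real data) laid out like the data frames above, with countries in rows and years in columns.

```r
# Hypothetical stand-in for lifeExp: 3 'countries' (rows) by 12 years (columns).
years <- seq(1952, 2007, 5)
set.seed(1)
toy <- t(sapply(c(40, 55, 68), function(start) start + cumsum(runif(12, 0, 1.5))))
rownames(toy) <- c("A", "B", "C")
colnames(toy) <- years

# matplot() draws one line per column, so transpose to get one line per country.
matplot(years, t(toy), type = "l", lty = 1, col = 1:3,
        xlab = "Year", ylab = "Life expectancy (toy values)")
legend("topleft", legend = rownames(toy), col = 1:3, lty = 1)
```

With 142 countries a legend becomes unreadable, so colouring lines by continent or plotting summaries per year may work better.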


Principal component analysis

• Carry out principal component analysis on the log(GDP) data and on the life-expectancy data, using your preferred choice of the sample covariance matrix S or the sample correlation matrix R.

• Calculate the proportion of variation explained by each of the principal components, and provide a scree plot. Discuss how many principal components you would choose to retain in each case.

• Look at the leading principal components for the log(GDP) and the life expectancy data, and provide an interpretation for each component you have chosen to retain.

• Provide scatter plots of combinations of the first three principal component scores, indicating on the plot the names of the countries. Colour the data points by the continent they belong to. Identify and discuss any countries that have interesting characteristics based on your analysis. Can you explain what happened in any of these countries?
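The mechanics of these steps can be sketched with `prcomp()` on made-up data. The matrix `X` and the factor `group` below are hypothetical placeholders (standing in for a countries-by-years matrix and the continent labels), so the output says nothing about the real data.

```r
# Toy data standing in for a countries-by-years matrix (values are made up).
set.seed(1)
X <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)
X[, 2] <- X[, 1] + 0.1 * X[, 2]   # make two columns strongly correlated
group <- factor(rep(c("g1", "g2"), each = 25))  # hypothetical labels, e.g. continent

# scale. = FALSE gives PCA based on the covariance matrix S;
# scale. = TRUE gives PCA based on the correlation matrix R.
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Proportion of variance explained by each component.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
print(round(prop_var, 3))

# Scree plot of the component variances.
plot(pca$sdev^2, type = "b", xlab = "Component", ylab = "Variance")

# Scatter plot of the first two PC scores, labelled by row and coloured by group.
plot(pca$x[, 1], pca$x[, 2], type = "n", xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = seq_len(nrow(X)), col = as.integer(group))
```

The loadings used to interpret each component are stored in `pca$rotation`.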


Multidimensional scaling

• Perform multidimensional scaling using the combined dataset of log(GDP) and life expectancy, i.e., using


gap[,3:26]


Find and plot a 2-dimensional representation of the data. As before, colour each data point by the continent it is on. Discuss the similarity of this plot with your previous plots.
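A minimal sketch of classical multidimensional scaling with `cmdscale()`, assuming Euclidean distances between rows. The matrix `X` is made-up data, not the gapminder data, and other distance measures or MDS variants may be equally reasonable choices.

```r
# Toy data: 30 observations on 5 variables (values are made up).
set.seed(1)
X <- matrix(rnorm(30 * 5), nrow = 30)

D <- dist(X)                 # Euclidean distances between rows
mds <- cmdscale(D, k = 2)    # classical MDS, 2-dimensional configuration

plot(mds[, 1], mds[, 2], xlab = "Coordinate 1", ylab = "Coordinate 2")
```

When the distances are Euclidean, the classical MDS configuration agrees with the leading principal component scores of the same matrix up to the sign of each axis.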


Hypothesis testing

• Consider the log(GDP) and life expectancy of each country in the year 2007. Conduct a multivariate hypothesis test to test whether there was a statistically significant difference between the mean log(GDP) and life expectancy of Asian and European countries in the year 2007. Were the continents more similar in the year 1952?
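One possible choice here is a two-sample Hotelling's T² test, which assumes multivariate normal data and a common covariance matrix in both groups. The sketch below computes it by hand in base R on made-up samples `X1` and `X2` (hypothetical stand-ins for the two continents), and is not the only valid test.

```r
# Two made-up bivariate samples (p = 2 variables each).
set.seed(1)
p <- 2
X1 <- matrix(rnorm(30 * p), ncol = p)              # group 1, n1 = 30
X2 <- matrix(rnorm(25 * p, mean = 0.5), ncol = p)  # group 2, n2 = 25
n1 <- nrow(X1)
n2 <- nrow(X2)

# Difference in sample mean vectors and pooled covariance estimate.
d  <- colMeans(X1) - colMeans(X2)
Sp <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)

# Hotelling's T^2 statistic and its F transformation under H0: equal means.
T2    <- (n1 * n2 / (n1 + n2)) * drop(t(d) %*% solve(Sp, d))
Fstat <- (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * T2
pval  <- pf(Fstat, p, n1 + n2 - p - 1, lower.tail = FALSE)

print(c(T2 = T2, F = Fstat, p.value = pval))
```

It is worth checking the equal-covariance assumption before relying on the p-value.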


Linear discriminant analysis

We will now look at whether linear discriminant analysis can be used to successfully separate the continents.

• Use linear discriminant analysis to train a classifier to predict the continent of each country using the log(GDP) and life expectancy from 1952-2007. Test the accuracy of your model by randomly splitting the data into test and training sets, and calculating the predictive accuracy on the test set.

• Give a plot of the 2-dimensional projection of the data onto the first two eigenvectors found by Fisher’s discriminant analysis approach. Discuss the difference between this plot and the plot you found using PCA.
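The workflow of splitting, fitting, predicting and projecting can be sketched with `lda()` from the MASS package. The example uses R's built-in `iris` data as a stand-in, and the 70/30 split is an arbitrary assumption rather than a recommendation.

```r
library(MASS)  # provides lda()

set.seed(1)
n <- nrow(iris)
test_idx <- sample(n, size = round(0.3 * n))   # hold out about 30% for testing
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

# Fit on the training set only, then predict the held-out classes.
fit  <- lda(Species ~ ., data = train)
pred <- predict(fit, newdata = test)

accuracy <- mean(pred$class == test$Species)
print(table(predicted = pred$class, actual = test$Species))
print(accuracy)

# Projection of the training data onto the first two discriminant directions.
proj <- predict(fit)$x
plot(proj[, 1], proj[, 2], col = as.integer(train$Species),
     xlab = "LD1", ylab = "LD2")
```

A single random split gives a noisy accuracy estimate, so repeating the split several times and averaging is a common refinement.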


Clustering

• Apply k-means clustering to the data. Give a plot of the final clusters you find, and discuss how you chose the number of clusters.

• Apply agglomerative hierarchical clustering. Try a variety of methods, and give one or two carefully selected plots that you feel represent the most successful clustering. Note that as well as changing the measure of the distance between clusters (e.g., complete, single, or average linkage) you may also want to consider scaling the data before computing the distance matrix, i.e., using


gap.scaled <- gap

gap.scaled[,3:26] <- scale(gap[,3:26])


• Discuss the similarity of the clusters you find using hierarchical clustering with the clusters you found using k-means clustering, and whether the countries naturally cluster by continent or not.
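A minimal sketch of both clustering approaches and one way to compare them, using R's built-in `iris` measurements as stand-in data. The choice of three clusters and complete linkage here is purely illustrative.

```r
set.seed(1)
X <- scale(iris[, 1:4])   # standardise so each variable has unit variance

# k-means: total within-cluster sum of squares for k = 1..6 (an 'elbow' plot).
wss <- sapply(1:6, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")

km <- kmeans(X, centers = 3, nstart = 20)

# Agglomerative clustering with complete linkage, cut into 3 clusters.
hc <- hclust(dist(X), method = "complete")
plot(hc, labels = FALSE)
hc_cl <- cutree(hc, k = 3)

# Cross-tabulate the two clusterings to compare them.
print(table(kmeans = km$cluster, hclust = hc_cl))
```

Cluster labels are arbitrary, so agreement shows up as a table with one dominant entry per row rather than a large diagonal.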


Linear regression

Finally, we will look at whether the 2007 life expectancy in each country can be predicted by a country’s GDP over the previous 55 years.

• Use a linear regression approach to predict the 2007 life expectancy from the GDP values. Explain your choice of regression method, and assess its accuracy.
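Because GDP values in successive years are likely to be highly correlated, one option among several (ridge regression is another) is principal component regression. The sketch below shows the idea on made-up data. The predictors `X`, the response `y`, the choice `k = 2` and the 70/30 split are all hypothetical assumptions.

```r
set.seed(1)
n <- 100
z <- rnorm(n)
# Twelve highly correlated toy predictors (made up), and a toy response.
X <- sapply(1:12, function(j) z + rnorm(n, sd = 0.1))
colnames(X) <- paste0("x", 1:12)
y <- 2 * z + rnorm(n, sd = 0.5)

train <- sample(n, 70)

# PCA on the training predictors only, then regress y on the first k scores.
k <- 2
pca <- prcomp(X[train, ], center = TRUE, scale. = TRUE)
train_df <- data.frame(y = y[train], pca$x[, 1:k])
fit <- lm(y ~ ., data = train_df)

# Project the test predictors with the same centring, scaling and rotation.
test_scores <- predict(pca, newdata = X[-train, ])[, 1:k]
pred <- predict(fit, newdata = data.frame(test_scores))

# Root mean squared prediction error on the held-out rows.
rmse <- sqrt(mean((y[-train] - pred)^2))
print(rmse)
```

Fitting the PCA on the training rows only keeps the test set out of the model-building step, so the error estimate is not optimistic.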