STAT0023 Workshop 6: Advanced Regression
Modelling in R
Live workshop materials
In this week’s live workshop we will see one more example of a Generalized Linear Model and we will also explore Generalized Additive Models.
1 Setting up
For this live workshop, in addition to these instructions you will need the following:
• The lecture slides.
• The R scripts Beetle_GLMs.r and Terrorism_GAMs.r. No more reminders as to what you’re supposed to do with these!
• The data files terrorism.dat and countries.dat.
Having downloaded the files, start up RStudio and use the Session menu to set the working directory appropriately.
2 Generalized linear models
2.1 Another example: beetles and binomials
We’ll now look at another example. The table below shows numbers of insects dead after five hours’ exposure to gaseous carbon disulphide (CS2) at various concentrations: the data are from the book An Introduction to Generalized Linear Models by Annette Dobson. It is of interest to understand the relationship between the CS2 dose and the death rate, for example to determine the lowest dose that could reasonably be expected to kill a specified proportion of the insects (this would be relevant for a farmer who wanted to control pests in a crop), or alternatively to determine the maximum dose that could be applied safely.
Dose (log10 CS2 mg l^-1)   Number of insects   Number of deaths
1.6907                     59                  6
1.7242                     60                  13
1.7552                     62                  18
1.7842                     56                  28
1.8113                     63                  52
1.8369                     59                  53
1.8610                     62                  61
1.8839                     60                  60
Let Ỹ_i denote the number of insects killed at the ith dose (the reason for the tilde ̃ will become clear). In the first instance, it seems reasonable to assume that Ỹ_i is distributed as binomial with parameters n_i and p_i say, where the {n_i} are given by the ‘Number of insects’ column in the table above and p_i is the probability of death, which depends on the dose. In such settings, the relationship between the dose and the probability is sometimes called a dose-response curve.
Under the binomial model, the expected value of Ỹ_i is n_i p_i and its variance is n_i p_i (1 − p_i). In principle we could apply the GLM framework here, and find a link function g(⋅) so that g(n_i p_i) = β_0 + β_1 x_i where x_i is the ith dose level. However, this would not address the real question of interest: we want to know how the dose affects p_i, not n_i p_i. The solution is easy: work with proportions rather than with the original counts. So: define Y_i = Ỹ_i / n_i so that E(Y_i) = p_i and Var(Y_i) = p_i (1 − p_i)/n_i (check that you can relate this to the variance function given in Table 1 for the binomial distribution). Now we can apply the GLM framework with the {Y_i} as responses. The only remaining task is to choose a link function.
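As a quick numerical sanity check of the variance formula for a proportion, the following standalone sketch (not part of the workshop scripts; the values of n and p are arbitrary) simulates binomial proportions and compares their sample variance with p(1 − p)/n:

```r
# Empirical check of Var(Y) = p * (1 - p) / n for a binomial proportion.
# n and p below are arbitrary illustrative values.
set.seed(1)
n <- 60
p <- 0.3
y <- rbinom(100000, size = n, prob = p) / n  # simulated proportions
c(sample = var(y), theory = p * (1 - p) / n) # the two should be close
```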
The most common link function for binomial response data is the logit link, leading to a logistic regression model for our beetle deaths:
log(p_i / (1 − p_i)) = β_0 + β_1 x_i = η_i, say.   (1)
The left-hand side is the log of the odds of dying. You should satisfy yourself of the following:
• As p_i tends to zero, log(p_i/(1 − p_i)) tends to −∞.
• Conversely, as p_i tends to 1, log(p_i/(1 − p_i)) tends to +∞.
• For a given value of η_i, the probability can be calculated as p_i = [1 + exp(−η_i)]^(−1).
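The round trip through the logit link can be checked directly in R (a standalone sketch, separate from Beetle_GLMs.r):

```r
# Round trip through the logit link: p -> eta -> p.
p <- c(0.01, 0.25, 0.5, 0.75, 0.99)
eta <- log(p / (1 - p))        # logit (log-odds); R's qlogis() does the same
p.back <- 1 / (1 + exp(-eta))  # inverse logit; R's plogis() does the same
all.equal(p.back, p)           # TRUE
```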
The logit link function therefore transforms the interval (0,1) to the entire real line and, given a value of the logit η_i, you can find the corresponding probability p_i. The logistic regression model thus guarantees that the modelled probabilities lie within the range (0,1). Let’s see how to deal with this in R. You can use the script Beetle_GLMs.r to get started. This script contains some gaps for you to fill in at the end (so don’t try to run it all at once or you’ll get an error!). The first half of the script is complete, though. Proceed as follows:
1. Together with the students sitting close to you, work through the first half of the script, step by step. It starts by defining the data, making a plot (notice the axis labels) and then fitting a binomial GLM (notice the comment about the weights argument to the glm()
command: this argument must be included when fitting binomial GLMs in R). Then you get a summary and some diagnostic plots: you should know how to interpret these. Notice that the ‘residuals versus fitted’ plot shows some apparent structure: there are positive values at both ends, and negative values in the middle. One possible way to remedy this is to add a quadratic term to the model — but this makes absolutely no sense in the present context (question for discussion: why?), so don’t even think about it. An alternative is to try different link functions.
2. For the binomial distribution, there are two commonly-used alternatives to the logit link function, as follows:
– The probit link: here, instead of setting η_i = log(p_i/(1 − p_i)) as in equation (1), we set η_i = Φ^(−1)(p_i), where Φ^(−1)(⋅) is the inverse cumulative distribution function of a standard normal distribution. To convert back from η_i to p_i, we have p_i = Φ(η_i).
– The complementary log-log link: here, we set η_i = log(−log(1 − p_i)). To convert back, we have p_i = 1 − exp[−exp(η_i)]. As with the logit, both of these link functions transform the interval (0,1) to the entire real line.
Split your group into two: one half should fit a model using the probit link in place of the logit, and the other half should fit a model using the complementary log-log link. To use a probit link, just change link="logit" to link="probit" in the family argument to your glm() command. For the complementary log-log, use link="cloglog". When you have fitted both models, get back together as a group and compare the residuals for each of them, as well as anything else that helps you to decide which model is preferred.
3. Once you have decided which link function you prefer, use it to produce a plot of the data along with the fitted dose-response curve and confidence intervals. The ‘skeleton’ of the code is provided to you: you just have to replace the ##### parts with the correct expressions for your model. The basic procedure is identical to what you did for the initial Galapagos model: use the code in Galapagos_GLMs.r to help you, therefore. You may find it helpful if one of your group members has the script Galapagos_GLMs.r open, for reference in case you get stuck.
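The steps above can be sketched in a few lines of R. This is a standalone illustration using the data from the table, not the contents of Beetle_GLMs.r, so the variable names (dose, n, deaths, prop) and the choice of link for the final plot are assumptions for the sketch only:

```r
# Beetle mortality data from the table above (names are illustrative).
dose   <- c(1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839)
n      <- c(59, 60, 62, 56, 63, 59, 62, 60)
deaths <- c(6, 13, 18, 28, 52, 53, 61, 60)
prop   <- deaths / n

# One GLM per link; note the weights argument, required when fitting
# binomial GLMs to proportions.
fit.logit   <- glm(prop ~ dose, family = binomial(link = "logit"),   weights = n)
fit.probit  <- glm(prop ~ dose, family = binomial(link = "probit"),  weights = n)
fit.cloglog <- glm(prop ~ dose, family = binomial(link = "cloglog"), weights = n)

# Residual deviance is one way to compare the three fits
sapply(list(logit = fit.logit, probit = fit.probit, cloglog = fit.cloglog),
       deviance)

# Fitted curve with approximate 95% pointwise intervals: predict on the
# link scale, then transform back (shown here for the cloglog fit)
newdat <- data.frame(dose = seq(1.65, 1.95, length.out = 100))
pred <- predict(fit.cloglog, newdata = newdat, type = "link", se.fit = TRUE)
inv.cloglog <- function(eta) 1 - exp(-exp(eta))
plot(dose, prop, xlab = "Dose (log10 CS2 mg/l)", ylab = "Proportion killed")
lines(newdat$dose, inv.cloglog(pred$fit))
lines(newdat$dose, inv.cloglog(pred$fit - 1.96 * pred$se.fit), lty = 2)
lines(newdat$dose, inv.cloglog(pred$fit + 1.96 * pred$se.fit), lty = 2)
```

Computing intervals on the link scale and then back-transforming, rather than adding ±1.96 standard errors on the probability scale, keeps the interval endpoints inside (0,1).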
The purpose of this example is to show you how general the GLM framework really is. You’ve fitted a few different models to response data that have binomial distributions, using almost exactly the same commands that you used previously for data that have Poisson distributions. The output is also interpreted in almost exactly the same way: the only tricky part is coding the inverse link functions, to get the fitted curves onto the plots. It is fair to say that GLMs are at the core of most modern applied statistical work, because they bring together so many different techniques into a common framework.
3 Generalized additive models
For the final part of this worksheet, we will cover Generalized Additive Models (GAMs). GAMs have only really become a routine part of the statistician’s toolkit in the last 10 years or so, but their use is spreading widely — largely due to the existence of the mgcv library in R, which is almost entirely the work of Professor Simon Wood at the University of Edinburgh (with a bit of help from UCL’s own Giampiero Marra).
To illustrate the use of GAMs, we’ll look at how the frequency of terrorist attacks has changed through time in different parts of the world. The data are from the RAND Database of Worldwide Terrorism Incidents (if you’re reading this on your computer as a PDF document, click on the blue text to go to the link). This database gives a comprehensive listing of global terrorist attacks from 1968 to 2009. For the present exercise, the data have been aggregated to give annual numbers of terrorist attacks for each of 195 countries, over this time period. There are two data files for this exercise. The first, terrorism.dat, contains data on three variables: Country (the country name), Year (self-explanatory) and Freq (the number of attacks in that
country and year). The second, countries.dat, contains variables Country (the country name), Continent (which continent the country is in) and Region (the name of the World Bank development region that the country is in). The information in this second file was obtained from the countrycode R package, version 0.19 (there really does seem to be an R package for almost everything!).
In both files, the variables are separated by semicolons ‘;’, rather than the usual spaces or commas. The reason is that several of the country names themselves contain commas and spaces: China, for example, appears as “China, People’s Republic Of”. If we were to use something like read.table() to read the information from countries.dat in the normal way, R would think that the country name is ‘China’, and the continent is People's! On the other hand, if we were to pretend it’s a csv file with commas separating the variables, R would think that the country is ‘China’ and the continent is People's Republic Of. Fortunately, the people who wrote R have anticipated this kind of difficulty: the read.csv() command allows us to read a data file that uses something other than a comma to separate the columns. We have chosen the semicolon here because there are no semicolons in any of the country, continent or region names.
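A minimal in-memory illustration of the point (this is not the real countries.dat, just a two-line example in the same format):

```r
# Reading semicolon-separated data with read.csv(sep = ";").
txt <- "Country;Continent;Region
China, People's Republic Of;Asia;East Asia & Pacific"
d <- read.csv(text = txt, sep = ";")
d$Country  # the comma inside the country name survives intact
```

(R also provides read.csv2(), which uses sep = ";" by default, but it additionally assumes a decimal comma, so passing sep = ";" to read.csv() explicitly is clearer here.)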
This example illustrates another problem that is often encountered in applications: merging data from different sources. In this particular instance, this is useful because it might be of interest to compare terrorism trends between continents, or between World Bank development regions. The two data files are linked by the Country variable that is common to both of them: this is the key to merging them.
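In R, this kind of key-based merge is done with the merge() function. A toy version (the real files have many more rows and columns; the data below are invented for illustration):

```r
# Merging two data frames on a shared key column, as with the
# terrorism and countries files (toy data, not the real files).
attacks <- data.frame(Country = c("France", "Japan", "France"),
                      Year = c(2001, 2001, 2002),
                      Freq = c(5, 2, 3))
info <- data.frame(Country = c("France", "Japan"),
                   Continent = c("Europe", "Asia"))
merged <- merge(attacks, info, by = "Country")  # Country is the key
merged  # each attack row now carries its country's continent
```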
The script for this analysis is called Terrorism_GAMs.r. Working together with the people sitting next to you, open it and work through the first part, which reads the data and plots the time series of terror attacks in the United States. It should be clear from this plot that you couldn’t possibly find a linear or generalised linear model to represent the overall trend in terror attacks: if you want to study the underlying trend therefore, a GAM seems like a useful option. In the script, the next 60 lines or so show how to use a simple GAM to represent this structure. Note the following:
• The data are counts, so once again it is natural to start with a Poisson model — and, again, with a log link. The model is thus
Y_t ∼ Poisson(μ_t), with log μ_t = β_0 + f_1(t),   (2)
where β_0 is a constant and f_1(⋅) is a smooth function of the ‘time’ index t (here denoting the year).
• The basic gam() command to fit this Poisson model is very similar to the glm() command: the main difference is the use of s(Year) to represent a “smooth function of Year”.
• The summary() of the fitted model looks a bit different from the GLM and linear model summaries that you’ve seen before: it gives the estimated intercept and then some information about the ‘smooth terms’. This information includes some columns entitled edf and Ref.df. The one to look at is edf, which stands for ‘effective degrees of freedom’. This is a measure of the complexity of the fitted curve: for the initial US model,
the value is 7.946, which is ‘almost 8’. You can interpret this as saying that the curve is in some sense almost as complex as an eighth-degree polynomial (the precise definition of ‘effective degrees of freedom’ is a bit complicated for nonparametric regression models, and you don’t need to know it — you only need to know this very informal interpretation).
• When you plot() a GAM object, you don’t get the usual diagnostic plots: you get a plot of the estimated smooth function. In fact, this is a plot of the estimated function f_1(t) from equation (2). Notice from the equation that f_1(t) can’t be defined completely arbitrarily. For example, for any function f_1(t) you could obtain an identical model fit by adding 1 to the function and then subtracting 1 from the intercept β_0 to compensate. Therefore some constraint is needed in order to ensure that the model is identifiable. The gam() command ensures that all smooth functions are centred on zero: you can see this in the plot.
• Much of the rest of the analysis should be familiar to you from the Galapagos endemics example: the use of Pearson residuals to check for overdispersion, the quasiPoisson modelling, the construction of a plot showing the fitted trend and so forth.
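The whole workflow just described can be sketched compactly. The real analysis is in Terrorism_GAMs.r and uses the terrorism data; the counts below are simulated from an arbitrary smooth trend purely for illustration:

```r
# Sketch of the GAM workflow on simulated count data (illustrative only).
library(mgcv)                    # install.packages("mgcv") if needed
set.seed(42)
Year <- 1968:2009
mu <- exp(1.5 + sin((Year - 1968) / 6))  # an arbitrary smooth trend
Freq <- rpois(length(Year), mu)

# Poisson GAM with a log link: log(mu_t) = beta_0 + f_1(Year)
fit <- gam(Freq ~ s(Year), family = poisson(link = "log"))
summary(fit)                     # note the edf of the smooth term

# Pearson residuals: under the Poisson model their variance should be near 1
pres <- residuals(fit, type = "pearson")
sum(pres^2) / df.residual(fit)

# If that ratio is well above 1, refit allowing for overdispersion:
fit.q <- gam(Freq ~ s(Year), family = quasipoisson(link = "log"))

plot(fit)                        # the estimated smooth, centred on zero
```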
By the time you have finished with the US terrorism trends, you should be able to see how GAMs work in practice and how — conceptually at least — they are very similar to GLMs, which themselves are very similar to linear regression models.
GAMs are incredibly powerful in principle, although it takes a bit of practice to become an expert user. To illustrate this for anybody who is interested, the Terrorism_GAMs.r script contains an optional section at the end. This shows how (after some helpful data visualisation)
GAMs can be used to compare trends in terrorist attacks in different regions of the world. There are plenty of other interesting questions that you could ask about these data, too! This section is optional, however: you don’t have to look at it if you don’t want to.
4 Moodle quiz
We’re back to quizzes this week! There is one for you to try.
2022-03-25