STAT0023 Workshop 6: Advanced Regression
Modelling in R
Self-study materials
We’ll finish the R part of the course by exploring some of its advanced regression capabilities using generalized linear and generalized additive models. We’ll also learn how to combine data from different files — something that is often needed in real-world situations.
1 Setting up
For this self-study worksheet, in addition to these instructions you will need the following:
• The lecture slides.
• The R script Galapagos_GLMs.r. No more reminders as to what you’re supposed to do with these!
• The data file galapagos.dat. Notice that galapagos.dat is the same file that you used
in Workshop 1: depending on how you’re organising your STAT0023 materials therefore, you may already have this in the folder that you’re using for today’s workshop. If so,
obviously you don’t need to download it again.
Having downloaded the files, start up RStudio and use the Session menu to set the working directory appropriately.
2 Generalized linear models
2.1 Summary of theory
In the lecture we saw that generalized linear models (GLMs) extend the classical linear regression model. They are used to study the relationship between a response variable Y and some collection of covariates x₁, …, xₚ. The data consist of pairs (Y₁, x₁), …, (Yₙ, xₙ) say, where Yᵢ is the response for the ith case in the dataset and xᵢ = (xᵢ₁, …, xᵢₚ) is the corresponding vector of covariates. The important things to know are:
• A GLM says that given xᵢ, Yᵢ has some probability distribution with mean μᵢ, such that g(μᵢ) = β₀ + Σⱼ₌₁ᵖ βⱼxᵢⱼ (= ηᵢ, say) for some coefficient vector β = (β₀ β₁ … βₚ) and monotonic link function g(·). ηᵢ is a linear function of the covariates: it is called the linear predictor for Yᵢ.
• The distributions of the Yᵢs are in the exponential family, which contains several standard distributions. A feature of these distributions is that the variance is a function of the mean: Var(Yᵢ) = φV(μᵢ) say, where V(·) is the variance function for the distribution and φ is a dispersion parameter. The dispersion parameter is assumed constant for all observations. Table 1 gives the means, variance functions and dispersion parameters for some common distributions in the exponential family.
• If g(·) is the identity, and we specify a normal distribution for Yᵢ, we get back to a standard linear regression model. Table 1 shows that the dispersion parameter for a normal distribution is the variance σ², so here the assumption of a constant dispersion parameter is just the usual ‘constant variance’ assumption from linear regression modelling.
Distribution          μ      V(μ)       φ      Comment
Normal N(μ, σ²)       μ      1          σ²     Constant variance
Binomial Bin(n, p)    p      μ(1 − μ)   1/n    —
Poisson Poi(μ)        μ      μ          1      Variance equal to mean
Gamma(ν, λ)           ν/λ    μ²         1/ν    Constant coefficient of variation

Table 1: Means (μ), variance functions (V(μ)) and dispersion parameters (φ) for some common distributions in the exponential family. The binomial distribution here is considered as the distribution of the proportion of successes in a series of n trials, rather than the number of successes — see the beetle example in the live workshop notes.
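As a quick check of the point about the identity link and the normal distribution, here is a minimal sketch (using simulated data invented for illustration, not the worksheet's Galapagos data): a gaussian GLM with identity link reproduces ordinary least squares exactly.

```r
# Simulated data: with the gaussian family and identity link, glm()
# fits the same model as lm(), so the coefficients should agree.
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20)
lm.fit  <- lm(y ~ x)
glm.fit <- glm(y ~ x, family = gaussian(link = "identity"))
all.equal(coef(lm.fit), coef(glm.fit))
```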
In a GLM, the coefficient vector β is estimated by maximum likelihood. The MLE is found using a slightly modified Newton-Raphson algorithm, called iterative (re)weighted least squares or Fisher scoring. You might ask ‘if we use maximum likelihood to estimate the coefficients in a GLM, why don’t we also use maximum likelihood instead of least-squares to estimate the coefficients in a linear regression model?’ The answer is: least-squares is maximum likelihood when the assumptions of a linear regression model are satisfied (if you want to know more, see the Appendix to these notes).

GLMs have analogues of all the quantities that you’re already familiar with from linear regression. Some of the important ones are indicated here. For this course, you do not need to know how they are defined: you just need to be able to interpret them by analogy with the corresponding quantities for linear regression models (you can find more details in the book that is recommended at the end of the slides for this week’s lecture).
• Deviance: this measures the lack of fit of a GLM, and generalises the residual sum of squares (RSS) from a linear regression model. Indeed, for a linear regression model the deviance is the residual sum of squares.
• Proportion of deviance explained: this plays the same role as the ‘proportion of variance explained’, or R², that you’re familiar with from linear regression. It’s defined in exactly the same way as R², just replacing ‘RSS’ with ‘deviance’ throughout.
• Residuals: the raw residuals from a GLM are just the differences between the observations and the fitted values: {yᵢ − μ̂ᵢ}, in an obvious (hopefully!) notation. However, residuals are often used for model checking: if a model fits the data well then the residuals should show no obvious structure. In a GLM, the variance can depend on the mean: this can lead to apparent structure in plots of the raw residuals even when the model is correct. For this reason, other types of residuals are often defined for GLMs. The most common are: deviance residuals, which are defined so that when you square and sum them you get back to the deviance (by analogy with the residuals from a linear model, because when you square and sum these you get the RSS); and Pearson residuals, which are defined so that each one has zero mean and constant variance if the model is correct. For its residual plots, R uses Pearson residuals (unless you have a slightly older R installation, in which case R would show deviance residuals). The residual plots for a GLM can usually be interpreted in exactly the same way as those for an ordinary linear regression model — including things like plots of Cook’s distance, which are designed to identify influential observations.
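To make the residual types concrete, here is a small sketch with simulated Poisson data (the numbers are invented for illustration): `residuals()` with a `type` argument selects the residual type, and the squared deviance residuals sum to the deviance.

```r
# Simulated Poisson counts, purely for illustration
set.seed(42)
x <- runif(50)
y <- rpois(50, exp(1 + 2 * x))
fit <- glm(y ~ x, family = poisson)

raw     <- y - fitted(fit)                   # raw residuals
pearson <- residuals(fit, type = "pearson")  # standardised to constant variance
devres  <- residuals(fit, type = "deviance") # square and sum to get the deviance
sum(devres^2)   # matches deviance(fit)
```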
In R, GLMs can be fitted using the glm() command, which works exactly like the lm() command — the only difference is that you need to add a ‘family’ argument to specify which distribution you want to use. We’ll see some examples below. Commands like summary(), plot(), anova() and predict() work in almost exactly the same way for GLMs as for linear models. There are a couple of minor differences, however, as follows:
• In linear regression, to compare two nested models using an F test you can use a command such as anova(Model1, Model2, test="F") (see the notes for Workshop 2). The F test is designed to account for the fact that you need to estimate the residual variance. As explained above, from a GLM perspective the residual variance is the dispersion parameter: whenever you want to compare two nested GLMs with unknown dispersion parameters, the same procedure can be used. However, sometimes the value of the dispersion parameter is known. For example, for the Poisson distribution the dispersion parameter is known to be 1 (see Table 1): we don’t need to estimate it, so an F test is inappropriate. In such situations, nested models can be compared using a different test statistic which has a chi-squared distribution under the null hypothesis that the data were generated from the simpler of the two models. To carry out such a test, you need to use anova(Model1, Model2, test="Chi"). In practice, the only two distributions you’ll encounter for which the dispersion parameter is known are the binomial and Poisson distributions. Use test="Chi" for these, therefore, and test="F" for anything else.
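For instance, with a Poisson family the dispersion parameter is known to be 1, so nested models are compared with test="Chi". A minimal sketch on simulated data (invented for illustration):

```r
# Simulated counts: x2 has no real effect, so the chi-squared test
# should not favour the bigger model
set.seed(7)
x1 <- runif(100)
x2 <- runif(100)
y  <- rpois(100, exp(0.5 + 1.5 * x1))
small <- glm(y ~ x1, family = poisson)
big   <- glm(y ~ x1 + x2, family = poisson)
anova(small, big, test = "Chi")   # dispersion known (= 1): chi-squared test
```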
• If you use summary() on a GLM, you will see an entry called AIC. This is the Akaike Information Criterion (AIC) for the fitted model. AIC is defined as

AIC = −2 log L + 2p ,

where log L is the maximised log-likelihood for the model and p is the number of coefficients estimated. A really good model will have a high log-likelihood with few parameters. If the AICs for models 1 and 2 are AIC₁ and AIC₂ therefore, you might prefer model 1 if AIC₁ < AIC₂ and vice versa. In principle, AIC can be used to compare non-nested models — it is most reliable, however, when the models are nested (i.e. when one of the models is a simplified version of the other). It is tempting to use AIC as your only criterion for choosing between models, but don’t! There are many other things to consider — look at the residual plots, ask yourself whether the results make sense in the context of the problem, and so forth.
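A small illustration with simulated data (invented for this sketch): the AIC() function extracts the criterion from fitted GLMs so that candidate models can be compared directly.

```r
# Simulated Poisson data: the true relationship is linear in x,
# so the quadratic term in m2 is unnecessary
set.seed(7)
x <- runif(100)
y <- rpois(100, exp(0.5 + 1.5 * x))
m1 <- glm(y ~ x, family = poisson)
m2 <- glm(y ~ poly(x, 2), family = poisson)
c(AIC(m1), AIC(m2))   # smaller is better -- but check the residual plots too
```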
• In linear regression, you can use the predict() command to obtain confidence intervals for the true regression function, as well as prediction intervals for future values. With GLMs, predict() works slightly differently: you can’t compute confidence intervals
directly and you can’t compute prediction intervals at all (there’s no easy way to do this for GLMs, except using simulation techniques). You can, however, compute predictions on the scale of the linear predictor, together with the corresponding standard errors: confidence intervals can then be calculated via the usual ‘estimate ±1.96 standard errors’ rule. If you want confidence intervals for the regression function itself (rather than the linear predictor), you must then convert back to the scale of the response variable via the inverse of the link function. There are examples below.
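The recipe in the last bullet point can be sketched as follows, using simulated Poisson data with a log link (so the inverse link is exp); the numbers are invented for illustration.

```r
set.seed(3)
x <- runif(60, 0, 10)
y <- rpois(60, exp(0.2 + 0.3 * x))
fit  <- glm(y ~ x, family = poisson)          # log link by default
newd <- data.frame(x = seq(0, 10, length = 5))

# Predictions and standard errors on the linear-predictor (log) scale
pr <- predict(fit, newdata = newd, type = "link", se.fit = TRUE)
lower <- pr$fit - 1.96 * pr$se.fit            # 'estimate +/- 1.96 se' rule
upper <- pr$fit + 1.96 * pr$se.fit

# Back-transform via the inverse link to get intervals for the mean response
cbind(fit = exp(pr$fit), lower = exp(lower), upper = exp(upper))
```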
2.2 Example: endemic species in the Galapagos
Time for an example. We’ll revisit the Galapagos islands biodiversity data from Workshop 1 (also from this week’s lecture). To get you started on GLMs, the script Galapagos_GLMs.r provides a step-by-step analysis showing how to fit a Poisson GLM to predict the number of endemic species on an island. As usual with these introductory scripts, it is well commented. Work through it line by line, at each stage making sure that you understand what is being done and why. If you have any questions, ask. There is quite a lot of material in this script: if you work steadily, it should take you up to 1½ hours. When you have finished, you should understand:
• How GLMs can be used as an alternative to data transformation in regression-type problems, and why the GLM approach may be preferable.
• How to use the glm() command in R, how to obtain summaries and diagnostic plots for fitted models, and how to interpret the outputs.
• How to plot the fitted relationship in a simple GLM, with confidence intervals.
• How to compare nested GLMs.
• How to recognise overdispersion in models for which the dispersion parameter is known, and what options are available for dealing with it.
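On the last point, a common rough diagnostic is sketched below with simulated overdispersed counts (invented for illustration); the quasipoisson family shown here is one remedy among several.

```r
# Negative binomial counts: variance exceeds the mean, so a Poisson
# GLM will be overdispersed
set.seed(9)
x <- runif(80)
y <- rnbinom(80, mu = exp(1 + x), size = 1)
fit <- glm(y ~ x, family = poisson)

# If the model is adequate, the residual deviance should be comparable
# with the residual degrees of freedom; a ratio well above 1 suggests
# overdispersion
deviance(fit) / df.residual(fit)

# One remedy: quasipoisson estimates the dispersion instead of fixing it at 1
qfit <- update(fit, family = quasipoisson)
summary(qfit)$dispersion
```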
2.3 Appendix: maximum likelihood for the linear regression model
In Section 2.1 it was claimed that maximum likelihood is the same as least squares when the
assumptions of a linear regression model are satisfied. To demonstrate this, consider the simple linear regression model
Yᵢ = β₀ + β₁xᵢ + εᵢ  (i = 1, …, n)

with εᵢ ∼ N(0, σ²) independently for each i. The ith observation thus has a normal distribution:

Yᵢ ∼ N(β₀ + β₁xᵢ, σ²) ,
with probability density function
f(yᵢ) = (2πσ²)^(−1/2) exp[−(yᵢ − (β₀ + β₁xᵢ))² / (2σ²)]   (yᵢ ∈ ℝ).
Furthermore, since the observations are all considered to be independent under the linear regression model, their joint density is the product of their individual marginal densities:
f(y) = ∏ᵢ₌₁ⁿ f(yᵢ) = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − (β₀ + β₁xᵢ))²] ,   (3)

where y = (y₁ y₂ … yₙ) is a vector containing all of the observed responses. The likelihood function for the parameters β₀, β₁ and σ² is just the joint density (3), considered as a function of the parameters. Taking logs then, we obtain the log-likelihood:
ℓ(β₀, β₁, σ²; y) = −(n/2) log(2πσ²) − (1/(2σ²)) Σᵢ₌₁ⁿ [yᵢ − (β₀ + β₁xᵢ)]² .
The maximum likelihood estimates of β₀ and β₁ are the values that maximise this log-likelihood. The first term doesn’t depend on β₀ and β₁, and the second term has a negative sign in front of it. Maximising with respect to β₀ and β₁ is therefore equivalent to minimising Σᵢ₌₁ⁿ [yᵢ − (β₀ + β₁xᵢ)]², which is exactly the quantity that we seek to minimise in least-squares estimation.
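The argument above can also be checked numerically: maximising the normal log-likelihood directly (here with optim(), on simulated data invented for the sketch) recovers the least-squares estimates.

```r
set.seed(5)
x <- runif(30)
y <- 2 + 3 * x + rnorm(30, sd = 0.5)

ls.coef <- coef(lm(y ~ x))   # least-squares estimates

# Negative log-likelihood as a function of (beta0, beta1, log sigma);
# parameterising by log sigma keeps the variance positive
negll <- function(th) {
  -sum(dnorm(y, mean = th[1] + th[2] * x, sd = exp(th[3]), log = TRUE))
}
ml <- optim(c(0, 0, 0), negll)$par

ls.coef     # from lm()
ml[1:2]     # from direct maximum likelihood: agrees to optimisation accuracy
```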
2022-03-25