STAT0023 Workshop 9: Linear regression, ANOVA and Generalized Linear Models in SAS
Self-study materials
In Week 2, we explored linear regression and ANOVA-like models in R; and in Week 6 we covered generalized linear models. You may be tempted to ask whether you can do these things in SAS. The answer is ‘Yes’, and in Week 9 we will show you how. The self-study materials work through a linear regression and an ANOVA model, and in the live workshop you will work through fitting a logistic regression. You will also learn a bit more about the DATA statement, as well as some of the nuances of SAS output.
1 Setting up
In addition to these instructions you will need the following, all of which can be downloaded from the ‘Week 9’ tab of the Moodle page:
• The lecture materials.
• The SAS programs sheepenergy.sas, batterylife.sas.
• The data file energy.dat.
Start up SAS and change the current folder to the location where you saved your downloaded files. Now you’re ready to start. But not without a brief methodological recap …
2 Brief methodological recap
Summaries of linear models were provided in Week 2, and summaries of generalized linear models in Week 6, so make sure you have those materials to hand. You’re expected to know about least-squares and maximum-likelihood fitting, residual plots and how to interpret them, how to identify influential observations (e.g. using Cook’s distance), how to interpret the estimated coefficients in fitted models, and how to compare models (e.g. using F-tests or criteria such as AIC).
I said it was brief :-)
3 This week’s procedures: PROC REG, PROC GLM and PROC GENMOD
This workshop covers the three main procedures for fitting linear and generalized linear models in SAS, which are:
• PROC REG: this fits regression models with continuous covariates.
• PROC GLM: this fits ‘general linear models’ (NB not ‘generalized linear models’! The two things are different, and it’s easy to get confused). The term ‘general linear model’ is a bit old-fashioned: it dates back to the time when people started to realise that classical analysis of variance techniques could be tackled in the framework of linear models. The output of PROC GLM therefore focuses quite heavily on ANOVA tables and F-tests: for example, when the model contains CLASS (factor) variables it doesn’t output the estimated regression coefficients by default! But at some level, PROC GLM does essentially the same thing as PROC REG, and the standard regression modelling assumptions (normally distributed residuals with constant variance, etc.) are the same for both procedures.
• PROC GENMOD: this fits generalized linear models i.e. models with a variety of different response distributions, link functions and so forth, such as logistic regression models.
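As a rough orientation, the three procedures share a similar shape. The sketch below is not taken from the workshop programs: the data set mydata and the variables y, x1, x2, g and z are hypothetical names chosen purely for illustration.

```sas
/* Hypothetical data set 'mydata': continuous response y, continuous
   covariates x1 and x2, grouping variable g, binary (0/1) response z. */

/* PROC REG: regression on continuous covariates */
PROC REG DATA=mydata;
  MODEL y = x1 x2;
RUN;
QUIT;

/* PROC GLM: general linear model; CLASS declares g as a factor */
PROC GLM DATA=mydata;
  CLASS g;
  MODEL y = g x1;
RUN;
QUIT;

/* PROC GENMOD: generalized linear model, here a logistic regression.
   DESCENDING asks SAS to model the probability that z = 1. */
PROC GENMOD DATA=mydata DESCENDING;
  MODEL z = x1 / DIST=BINOMIAL LINK=LOGIT;
RUN;
```

In each case the MODEL statement has the response on the left of the “=” sign and the covariates, separated by spaces, on the right.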
4 ‘Simple’ linear regression: energy requirements of sheep
The SAS program sheepenergy.sas provides some SAS code to analyse data on daily energy requirements (in Mcal/day) and body weight (in kg), for a sample of 64 grazing sheep in Australia. The data are in file energy.dat.
Open the program sheepenergy.sas via the File menu in SAS. If you scroll down, you will notice that there are some missing parts, denoted by “???”: you must fill these in yourselves. Also, some parts of the program contain comments like “!!!! SEE WORKSHOP NOTES !!!!”. You should work through the script step-by-step, and refer to these notes to provide a commentary where necessary.
The steps in the analysis are as follows:
1. Read the data file into SAS using a DATA step. This creates a data set with five variables called Weight, Energy, WeightSq, WtLinear and WtQuad. The last three are, respectively, Weight² and carefully chosen ‘linear’ and ‘quadratic’ transformations of Weight, all of which are defined in the DATA step of sheepenergy.sas.
The three additional variables are defined in the DATA step because we’ll need them later on: in SAS, it’s a bit messy to add new variables to an existing data set (you have to run another DATA step that builds a new data set, e.g. using SET, UPDATE or MERGE statements), so it’s a good idea to assemble everything we need at the outset.
2. Produce a scatterplot to show the relationship between body weight and energy requirements. Note the fairly linear relationship (do you notice anything else?).
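The first two steps might look roughly like the following sketch. It is not the code in sheepenergy.sas: the data set name sheep is hypothetical, and it is an assumption that energy.dat holds two whitespace-separated columns, body weight followed by energy requirement, with no header line. The definitions of WtLinear and WtQuad are not given in these notes, so they are left out.

```sas
/* Sketch only: the column order in energy.dat is an assumption */
DATA sheep;
  INFILE 'energy.dat';
  INPUT Weight Energy;
  WeightSq = Weight**2;   /* squared body weight */
  /* WtLinear and WtQuad would also be defined here: see sheepenergy.sas */
RUN;

/* Scatterplot with energy requirement on the vertical axis */
PROC GPLOT DATA=sheep;
  PLOT Energy*Weight;
RUN;
QUIT;
```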
3. Use PROC REG to fit a linear model regressing energy requirement upon body weight. The model specification is determined by the statement MODEL Energy=Weight; and the ‘formula’ is fairly intuitive, with the name of the response variable on the left-hand side, and the name of the covariate on the right-hand side. Locate the table of coefficient estimates and standard errors. What does the t-test for the coefficient of body weight tell you?
Notice the QUIT; statement at the end of the PROC REG step. This is needed because PROC REG can be used ‘interactively’, in the sense that after the first RUN; statement, you can add additional MODEL statements to fit and compare different models, without having to run the entire procedure again. This is potentially useful if you want to fit many models to large data sets, for which the initialisation of a PROC REG step might be slow.
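The interactive use of PROC REG described above can be sketched as follows, assuming the data have already been read into a data set called sheep (a hypothetical name, not necessarily the one used in sheepenergy.sas).

```sas
PROC REG DATA=sheep;
  MODEL Energy = Weight;            /* first model */
RUN;
  /* The procedure is still active here, so a second model can be
     fitted without starting a new PROC REG step */
  MODEL Energy = Weight WeightSq;
RUN;
QUIT;                               /* ends the PROC REG step */
```

Without the final QUIT; statement, SAS would keep the procedure running and wait for further statements.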
4. Fit the same model again, but this time asking for some diagnostic plots. Here there are some new statements (ODS HTML; and ODS GRAPHICS ON;) before the PROC REG step.
These control the behaviour of the Output Delivery System (ODS) in SAS. You have encountered this already in fact: in Week 7, you saw how to choose between HTML and ‘listing’ output via the drop-down menus. These statements do the same thing as part of your program, which is much easier to control: you can switch between the two types of output as you wish, and it’s much quicker than doing it manually via the menu system.
The statement ODS HTML; ensures that the output of the REG step will appear in HTML format, and then ODS GRAPHICS ON; turns on ‘ODS graphics’. The reason for doing this is that the diagnostic plots from PROC REG — requested via the PLOTS=DIAGNOSTICS option in the procedure statement — will only be produced if ODS graphics are turned on. You will notice that the procedure is quite slow, so it might be a good idea to use ODS graphics only when necessary.
When you run PROC REG this time, you’ll see several plots in the output: some standard diagnostic plots that are very similar to the ones you saw from R in Week 2, as well as a plot of the fitted regression line with a confidence interval around it. Make sure you understand what all the plots are telling you, and ask during Monday’s Q&A or the live workshop if anything is unclear.
While producing those plots, SAS automatically created graphics files in PNG format that can be imported into Powerpoint presentations, Word documents and so forth. If you click the Results tab in the bottom-left of your SAS window, you will see a list of the procedures you’ve run so far: if you expand an item in this list (by clicking the ‘+’ symbol next to it) then you can ‘drill down’ to the detailed results. If you do this for the PROC REG results that you’ve just produced, you will eventually get down to the plots: double-clicking an ‘image file’ icon will then open the corresponding PNG file in an external viewer. The file is stored on your computer in the current SAS folder (which should be the same folder in which you stored all of your other files for this workshop, if you followed the previous instructions correctly).
Warning: because SAS automatically creates a PNG file every time you create a plot when ODS graphics is turned on, you can rapidly generate a very large number of unhelpfully-named files in your current folder if you’re not careful. It’s probably best to use it sparingly therefore: this is why ODS graphics are turned off after this PROC REG step.
This particular PROC REG step also illustrates how you can store some of the output (e.g. residuals, predicted values etc.) in a new data set if required. The subsequent PROC GPLOT step then generates a plot of residuals against fitted values. This isn’t really necessary here because this plot was already produced in the PROC REG step itself, but it shows how you could do it if you didn’t want to use ODS graphics with PROC REG.
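A minimal sketch of this step, under the assumption that the data are in a data set called sheep, is given below. The output data set name sheepfit and the variable names Fitted and Resid are hypothetical, and the code in sheepenergy.sas may differ in detail.

```sas
ODS HTML;                /* send output to HTML */
ODS GRAPHICS ON;         /* needed for the PLOTS= request below */

PROC REG DATA=sheep PLOTS=DIAGNOSTICS;
  MODEL Energy = Weight;
  /* Store fitted values (P=) and residuals (R=) in a new data set */
  OUTPUT OUT=sheepfit P=Fitted R=Resid;
RUN;
QUIT;

ODS GRAPHICS OFF;        /* stop generating PNG files */

/* Residuals against fitted values without ODS graphics */
PROC GPLOT DATA=sheepfit;
  PLOT Resid*Fitted / VREF=0;   /* horizontal reference line at zero */
RUN;
QUIT;
```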
5. The next model to be fitted includes WeightSq as well as Weight. Note the syntax of the model formula: it’s Energy = Weight WeightSq. If you’re used to R then it’s easy to make mistakes here: the corresponding R formula in an lm() command would be Energy ~ Weight + WeightSq. SAS uses “=” instead of “~”, and the covariate terms are not separated by a “+” symbol. Be careful!
After fitting this model, look at the t-statistics in the output. Does Weight appear significant now?
6. The next few steps in the program explore different ways of parameterising the model, and show how to do nested model comparisons using F-tests with PROC REG. Read the comments in the program carefully, and check that you understand what is happening and that you can reconcile the output from the different models. You should aim to get a very clear idea of exactly what’s going on, and what is the relationship between energy requirement and body weight.
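One way in which PROC REG supports nested comparisons is its TEST statement. These notes do not say whether sheepenergy.sas uses it, so the following is only a sketch, with the data set name sheep and the label QuadTerm chosen for illustration.

```sas
PROC REG DATA=sheep;
  MODEL Energy = Weight WeightSq;
  /* F-test of the hypothesis that the coefficient of WeightSq is zero,
     i.e. a comparison with the nested model Energy = Weight */
  QuadTerm: TEST WeightSq = 0;
RUN;
QUIT;
```

Because only one coefficient is being tested, the F-statistic here is the square of the t-statistic for WeightSq in the coefficient table.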
7. The final step in this exercise shows how to fit the same model using PROC GLM instead of PROC REG. We’ll explore PROC GLM in more detail in the next example. For the moment, just note that the estimated coefficients are the same as when you used PROC REG, but that the output is presented in a slightly different way. In particular, the F-tests for nested models are produced automatically by PROC GLM (look at the ‘Type I SS’ table, and see this week’s lecture materials for more discussion of these tables).
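For comparison with the PROC REG syntax, a PROC GLM fit of the quadratic model might be written as follows (again assuming a data set called sheep, which is a hypothetical name).

```sas
PROC GLM DATA=sheep;
  /* No CLASS statement: both covariates are continuous */
  MODEL Energy = Weight WeightSq;
RUN;
QUIT;
```

The Type I sums of squares are sequential, so the row for WeightSq tests whether the quadratic term improves on a model that already contains Weight. The order of the terms in the MODEL statement therefore matters for this table.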
This exercise should take you about 45 minutes. When you’ve finished, you should understand the following:
• How to use PROC REG to fit simple regression models, and how to interpret the output. PROC REG can also be used to fit more complicated models: there are lots of examples in the help system.
• How to switch between ODS and ordinary graphics, and how to use ODS graphics effectively when you want to generate output graphics files automatically.
You have perhaps also reminded yourself of some of the issues that can arise due to collinearity in regression models.
5 PROC GLM for ANOVA models and regression with factor covariates: analysis of battery life data
An engineer investigates the effect of temperature and material type on the life of a battery. Of particular interest is the question ‘is there a material that gives a long battery life, regardless of the temperature?’ In other words, is there a material that is robust to temperature? The experiment has three types of material, and three different temperatures (15, 70 and 125 °F). The experiment is done for each of the nine combinations of material type and temperature and replicated four times, and yields the data shown in Table 5.1.
Table 5.1: Lifetimes of 36 batteries with different material types, at different temperatures.
The SAS program batterylife.sas shows how to define these data to SAS, and how to analyse them using PROC GLM. Open it using the File menu. This is quite a challenging example, so there are no blanks to fill in. However, there are places where you should refer to these notes. The first of these places relates to the DATA statement, which deserves its own section.
5.1 Defining the data to SAS
Since the data set is small, it doesn’t seem worth storing it in a separate file, so the DATA step in this program uses a DATALINES statement to read the data. One way to do this would be to write out three columns of numbers representing the material type, temperature and battery life respectively. If we did this however, we might make a mistake. It’s far easier to type out the lifetime data as they appear in Table 5.1, and to use the structure of the table to define the material and temperature for each observation. This requires some DATA step programming.
The complete DATA step can be found in batterylife.sas.
This requires some translation! As follows:
• Whenever SAS reads data using a DATA step, it processes one row of data at a time. As it works, it creates an automatic variable called _N_ (the underscores are part of the name) which isn’t stored in the final data set: it counts the passes through the DATA step, which here is simply the number of the row that SAS is currently working on. The first IF ... ELSE IF ... ELSE part of the DATA step therefore says “for any observation in the first two rows of input, Material takes the value 1; for any observation in the third and fourth rows, Material takes the value 2; and for the remaining rows, Material takes the value 3”. If you look at the DATALINES section, you’ll see that the input is provided in rows exactly following the layout of Table 5.1: so successive pairs of rows do indeed correspond to the three material types.
• The next part of the step is a DO loop, which works along the input rows and is essentially the same as a for loop in R. The statements within the loop are executed first with i taking the value 0, then with i taking the value 1 and finally with i taking the value 2. Each iteration of the loop reads two values of the life variable from the current row (these are the two INPUT life @@; statements; recall from the Week 8 materials that the @@ symbol tells SAS to stay on the current line of input after reading the values that it needs). Therefore, the loop will read six values of life in total, which is the number of columns of data in Table 5.1. When i is 0, the variable Temp is calculated as 15 + (55 × 0) = 15; when i is 1, Temp is 15 + (55 × 1) = 70; and when i is 2, Temp is 15 + (55 × 2) = 125: these are the three temperatures in Table 5.1. Finally, the OUTPUT; statements cause the current values of all variables (except _N_ and i) to be written to the data set.
• Notice also that a variable TempSq is defined, holding the squared values of Temp. We did something similar for the sheep energy example.
To fully understand this, you may like to imagine that you are the SAS punched-card machine. When you get the first line of data from the DATALINES statement, you set _N_ to 1. According to the IF ... ELSE IF ... ELSE sequence, you must therefore set Material to 1. Then you go to the DO loop. Working through the first iteration of this loop with i set to zero, you set Temp to 15 and TempSq to 15² = 225. Then you read a value of life from the data line that you’ve just read: this value is 130. Then you encounter an OUTPUT statement, so you write the values of Material, Temp, TempSq and life (1, 15, 225, 130) to the output data set. One observation completed! You’ve got another INPUT statement next, so you read another value of life from your data line: this is 155. None of the other variables has changed its value, so for the next OUTPUT statement you write (1, 15, 225, 155). Two observations done! Now you go back to the start of the DO loop, set i to 1 and do the same thing again. When you’ve reached the end of the first line of data, you go back to the start of the DATA step and read another line.
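Putting the description above together, a DATA step of this kind might look like the sketch below. It is a reconstruction from the prose, not the code in batterylife.sas. To keep the sketch free of data values, it reads from a hypothetical file battery.dat, assumed to contain six rows of six lifetimes laid out like Table 5.1, whereas the workshop program uses a DATALINES statement. The data set name battery is also an assumption.

```sas
DATA battery;
  INFILE 'battery.dat';             /* hypothetical file; the program uses DATALINES */
  /* Rows 1-2 are material 1, rows 3-4 material 2, rows 5-6 material 3 */
  IF _N_ <= 2 THEN Material = 1;
  ELSE IF _N_ <= 4 THEN Material = 2;
  ELSE Material = 3;
  DO i = 0 TO 2;                    /* one pass per temperature */
    Temp = 15 + 55*i;               /* 15, 70, 125 */
    TempSq = Temp**2;
    INPUT life @@;                  /* first lifetime for this temperature */
    OUTPUT;
    INPUT life @@;                  /* second lifetime, same row */
    OUTPUT;
  END;
  DROP i;                           /* keep the loop counter out of the data set */
RUN;
```

Each pass through the DATA step consumes one input row and writes six observations, so six rows give the 36 observations of the experiment.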
5.2 Doing the analysis
The DATA step for this example is not entirely trivial. Unfortunately, the analysis isn’t either! It starts innocently enough though. The remaining steps in the program are as follows:
1. Plot the data. There are a couple of new things in this command, such as controlling the size of the points on the plot, and providing a nice format for the legend. Otherwise it should be reasonably familiar to you — if it isn’t, look back at the material for Week 8. From your plot, what effect does temperature appear to have on battery life?
2. The next step fits a model in which both Material and Temp are treated as factors (i.e. ignoring the fact that Temp is actually a numeric variable, and just treating the three levels of temperature as three distinct groups). This is done by using the CLASS statement within PROC GLM: note that this must precede the MODEL statement. The model is in fact a two-way analysis of variance model, and the output of PROC GLM gives the standard ANOVA table (under the heading Type I SS): sums of squares, degrees of freedom, mean squares, F-statistics and p-values for each source of variation, so that you can see whether either the material or the temperature has a significant effect on mean battery life. You may need to revise your STAT0006 notes (or equivalent) if you’ve forgotten about ANOVA models.
SAS has two commands to fit ANOVA models, PROC ANOVA and PROC GLM. PROC ANOVA requires a balanced design i.e. the same number of observations in each group. This is the case here (there are 4 observations for every material:temperature combination), so we could use it. However, PROC GLM is much more flexible (e.g. we’ll go on to see if temperature can be treated as a continuous variable rather than a factor: PROC GLM can do this, PROC ANOVA can’t) and has a greater range of output. The main advantage of PROC ANOVA is that it is faster and uses less memory than PROC GLM. However, with modern computing power the data sets have to get very big before this starts to become a major issue. We’ll use PROC GLM therefore.
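The two-way model with both covariates treated as factors can be sketched as follows, assuming the data are in a data set called battery (a hypothetical name).

```sas
PROC GLM DATA=battery;
  CLASS Material Temp;          /* must come before the MODEL statement */
  MODEL life = Material Temp;   /* main effects only, no interaction */
RUN;
QUIT;
```

Because Temp appears in the CLASS statement, SAS ignores its numeric values and treats 15, 70 and 125 as three group labels.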
3. If you have forgotten about ANOVA models, you might prefer the output from the next PROC GLM step in the program: by including the SOLUTION option in the MODEL statement, you can see the coefficient estimates in the underlying linear model (you also saw this in the earlier sheep energy example). In order to understand these however, you need to remember how factor covariates are handled in regression models. The standard trick is to define dummy binary variables to represent each level of each factor. Here, both temperature and material have three levels, so we need to define six dummy variables. The linear regression formulation of the model is then
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6} + \varepsilon_i \qquad (1)$$
where
– $Y_i$ is the battery life for the $i$th observation;
– $x_{i1}$ takes the value 1 if the $i$th observation is for material type 1, and 0 otherwise;
– $x_{i2}$ takes the value 1 if the $i$th observation is for material type 2, and 0 otherwise;
– $x_{i3}$ takes the value 1 if the $i$th observation is for material type 3, and 0 otherwise;
– $x_{i4}$ takes the value 1 if the $i$th observation had a temperature of 15 °F, and 0 otherwise;
– $x_{i5}$ takes the value 1 if the $i$th observation had a temperature of 70 °F, and 0 otherwise;
– $x_{i6}$ takes the value 1 if the $i$th observation had a temperature of 125 °F, and 0 otherwise;
– the $\{\varepsilon_i\}$ are independent $N(0, \sigma^2)$ random variables.
If you look at the output from this second PROC GLM step, you’ll see that it includes estimates of the regression coefficients. Notice, however, that the estimates of $\beta_3$ and $\beta_6$ are both zero (and that SAS issues a warning message about a singular matrix): this is because, as always with factor covariates, it’s necessary to constrain the coefficients in order that the design matrix for model (1) is nonsingular (see the material from Week 2 to refresh your memory on this). If this kind of model is fitted in R, by default the first level of each variable is treated as a ‘reference’ level with a zero coefficient: by contrast, in SAS it’s the last level of each variable. This means that the estimated intercept in model (1) corresponds to the expected life for a battery with material type 3 and a temperature of 125 °F. To see this, note that for such a battery we have $x_{i3} = x_{i6} = 1$ and all the other dummy covariates are zero: plugging these values into (1) with $\beta_3 = \beta_6 = 0$, the expected battery life is
$$\beta_0 + (\beta_1 \times 0) + (\beta_2 \times 0) + (0 \times 1) + (\beta_4 \times 0) + (\beta_5 \times 0) + (0 \times 1) = \beta_0 .$$
The other non-zero coefficients represent the estimated differences in mean battery life for the other material types and temperatures, relative to this reference level.
In this application, there is no good reason for considering material 3 and a temperature of 125 °F as natural ‘reference’ levels, so the other coefficient estimates aren’t very interpretable. We might instead want to ask whether the battery life for a particular material or temperature is longer or shorter than average. In R, this can be done very easily by specifying ‘sum-to-zero’ constraints on the coefficients (see the analysis of the iris data in Week 2). In SAS however, we have to get out a pencil and paper, and do some maths. You can find it in Box 1 below. The upshot is that if we were to impose ‘sum-to-zero’ constraints, the coefficient estimates for the three materials would be $(2\beta_1 - \beta_2 - \beta_3)/3$, $(-\beta_1 + 2\beta_2 - \beta_3)/3$ and $(-\beta_1 - \beta_2 + 2\beta_3)/3$ respectively.
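The reasoning behind these expressions can be sketched briefly. The symbols $\mu_j$, $c$ and $\alpha_j$ are introduced here purely for illustration and are not part of the original notation: $\mu_j$ is the expected life for material $j$ at some fixed temperature, $c$ collects the intercept and temperature terms of model (1), and $\alpha_j$ is the effect of material $j$ relative to the average over materials.

```latex
\begin{align*}
\mu_j &= c + \beta_j, \qquad j = 1, 2, 3, \\
\bar{\mu} &= \tfrac{1}{3}(\mu_1 + \mu_2 + \mu_3) = c + \tfrac{1}{3}(\beta_1 + \beta_2 + \beta_3), \\
\alpha_j &= \mu_j - \bar{\mu} = \beta_j - \tfrac{1}{3}(\beta_1 + \beta_2 + \beta_3), \\
\alpha_1 &= \tfrac{1}{3}(2\beta_1 - \beta_2 - \beta_3), \quad
\alpha_2 = \tfrac{1}{3}(-\beta_1 + 2\beta_2 - \beta_3), \quad
\alpha_3 = \tfrac{1}{3}(-\beta_1 - \beta_2 + 2\beta_3).
\end{align*}
```

By construction $\alpha_1 + \alpha_2 + \alpha_3 = 0$, which is the ‘sum-to-zero’ constraint, and the term $c$ cancels so the result does not depend on which temperature is considered.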
The ESTIMATE statements in the PROC GLM step produce estimates of these quantities, together with standard errors and t-statistics for testing whether the true values of these quantities are zero (i.e. whether any one material leads to significantly longer or shorter lifetimes than average).
This second PROC GLM step also shows how to use ODS graphics to obtain diagnostic plots, as well as an interaction plot showing the model fit. The process for obtaining these plots is very similar to that for PROC REG. Make sure you understand the plots and, in particular, that you understand why the lines on the interaction plot are parallel given the model formulation above.
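ESTIMATE statements for the three material contrasts might be written as in the sketch below. The data set name battery and the quoted labels are hypothetical, and the statements in batterylife.sas may be laid out differently.

```sas
PROC GLM DATA=battery;
  CLASS Material Temp;
  MODEL life = Material Temp / SOLUTION;   /* SOLUTION prints the coefficients */
  /* Each material compared with the average over all three materials.
     DIVISOR=3 divides the listed coefficients by 3. */
  ESTIMATE 'Material 1 vs average' Material  2 -1 -1 / DIVISOR=3;
  ESTIMATE 'Material 2 vs average' Material -1  2 -1 / DIVISOR=3;
  ESTIMATE 'Material 3 vs average' Material -1 -1  2 / DIVISOR=3;
RUN;
QUIT;
```

The numbers after Material are the weights applied to the coefficients of material types 1, 2 and 3 in that order, so the first statement estimates (2 × first − second − third)/3.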
4. The rest of the program considers adding an interaction between Material and Temperature to account for the possibility that the effect of temperature may be material-dependent (or vice versa); and also treating temperature as a continuous covariate rather than a factor, because this would potentially allow the model to be simplified. There are some tricks here, to try and get SAS to do what we want: they are explained in the program comments, but do ask in the Q&A or live workshop if you don’t understand.
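In outline, the two extensions could be specified as in the following sketch. This is not the code from batterylife.sas and does not reproduce the tricks explained in its comments. It assumes a data set called battery containing Material, Temp, TempSq and life.

```sas
/* (a) Both covariates as factors, with an interaction */
PROC GLM DATA=battery;
  CLASS Material Temp;
  MODEL life = Material Temp Material*Temp;
RUN;
QUIT;

/* (b) Temperature as a continuous covariate: only Material is a factor,
       and each material gets its own quadratic curve in temperature */
PROC GLM DATA=battery;
  CLASS Material;
  MODEL life = Material Temp TempSq Material*Temp Material*TempSq / SOLUTION;
RUN;
QUIT;
```

With only three distinct temperatures, a quadratic in Temp can pass through any three group means, so model (b) gives the same fitted values as model (a). Simplification comes from dropping terms from (b), for example the TempSq terms.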
2022-03-25