STAT0023 Workshop 7: Introduction to SAS
Live workshop materials
In the self-study session you will hopefully have got SAS up and running and will have explored how to carry out some simple data manipulations and basic plots. In the live workshop we will guide you through creating your own SAS script, using the beetle dataset we saw in Week 6.
1 Setting up
For this workshop, in addition to these instructions you will need the following, all of which can be downloaded from the “Week 7” tab of the Moodle page:
• The week 7 lecture materials.
• The week 7 self-study notes.
Your first step for the workshop is to open up SAS following the instructions from the self-study workshop.
2 More on the DATA step: the beetle mortality data
In the self-study materials you learnt how to read data from a file using a SAS DATA step. In SAS, DATA steps are also used to carry out data manipulations such as variable transformation and subsetting, and the results can be used to produce reports. Nowadays much of this kind of work is often done in a spreadsheet (spreadsheets weren’t widely available when SAS was first developed), so we will focus on the data manipulation capabilities that are most useful in a statistical analysis.
Table 2.1: Beetle mortality data (from A. Dobson: ‘An Introduction to Generalized Linear Models’)
Dose (log10 CS2 mg l−1)    Number of insects    Number of deaths
1.6907                     59                   6
1.7242                     60                   13
1.7552                     62                   18
1.7842                     56                   28
1.8113                     63                   52
1.8369                     59                   53
1.8610                     62                   61
1.8839                     60                   60
As an example in this section, we’ll use the beetle mortality data that you studied in Workshop 6. For your convenience, the data are given again in Table 2.1. Here are your instructions:
1. On the File menu, click New Program. A new editor pane will open. Type in some comments in the first few lines to identify the program, for example “Workshop 7: reading the beetle mortality data using the DATA step”. Don’t forget to use /* and */ to begin and end your comments! It’s always worth saving your program as you go, so save it in your current folder when you’ve written the comments, under a name such as ReadBeetleData.sas.
2. Define a title for any output that your program will produce: something like
and run this line using the ‘Run’ button.
3. Now enter the data, by typing the following DATA step into your program and clicking the ‘Run’ button:
Be very careful to get this right! Notice in particular which lines have semicolons at the end and which ones don’t. After clicking the Run button, check the Log pane to make sure there were no errors. If there were, fix them and try again.
4. To see what this DATA step has done, use the PRINT procedure:
This exercise illustrates a few slightly more advanced features of a DATA step. Notice the following:
• The first line specifies that the DATA step will define a new SAS dataset called beetledata, in the STAT0023 library.
• There is no INFILE or SET statement in this DATA step: SAS therefore expects the data to be defined as part of the DATA step itself. This is done via the DATALINES statement. Notice that there are no semicolons after the individual data lines, just a single semicolon indicating the end of the data.
• The INPUT statement now defines the variables that will be entered in the DATALINES part of the DATA step.
• In addition, there are some statements that compute new variables from the original variables: the variable proportion is defined as the proportion of insects that are killed at each dose level, and a variable dosegroup is defined which takes the value High if the log dose is greater than 1.8 and Low otherwise. More on these calculations below.
• Finally, notice the FORMAT statement, which controls how the proportion values are displayed. The 6.3 here translates as ‘output the values in fields of width 6, each with 3 decimal places’. You can see this in the PROC PRINT results. Be aware that the proportions are stored to full accuracy; the FORMAT statement is used only to control how they are displayed in the output.
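The code listings for steps 2 to 4 do not appear above, so here is a sketch of a program consistent with the bullet points. It is a reconstruction, not the original listing. The names logdose, proportion, dosegroup and STAT0023.beetledata are taken from the text; ninsects and ndeaths are assumed names for the two count variables, and the title text is purely illustrative. It also assumes that the STAT0023 library has already been assigned (for example with a LIBNAME statement, as in the self-study session).

```sas
/* Sketch only: ninsects and ndeaths are assumed variable names, */
/* and the title text is illustrative.                           */
TITLE "Beetle mortality data";

DATA STAT0023.beetledata;
  INPUT logdose ninsects ndeaths;            /* variables read from DATALINES */
  proportion = ndeaths / ninsects;           /* proportion killed at each dose */
  IF logdose > 1.8 THEN dosegroup = "High";  /* evaluated for every observation */
  ELSE dosegroup = "Low";
  FORMAT proportion 6.3;                     /* display width 6, 3 decimal places */
  DATALINES;
1.6907 59 6
1.7242 60 13
1.7552 62 18
1.7842 56 28
1.8113 63 52
1.8369 59 53
1.8610 62 61
1.8839 60 60
;
RUN;

/* Display the dataset that has just been created */
PROC PRINT DATA=STAT0023.beetledata;
RUN;
```

Note that the data values come last in the DATA step, after all other statements, and that the lone semicolon marks the end of the data.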
2.1 SAS data calculations
In the beetle mortality example, we used simple arithmetic to define a new variable called proportion, with a value for every observation in the data. The arithmetic operators in SAS are + (addition), - (subtraction), * (multiplication), / (division) and ** (raising to a power). All of these are exactly as you might expect, except for the last one which is different from the ^ symbol used by R. Don’t get the two languages mixed up!
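As a quick check of these operators, the following sketch evaluates each one and writes the results to the Log pane. The variable names are arbitrary, and DATA _NULL_ is used so that no dataset is created.

```sas
/* Sketch: a and b are arbitrary illustrative values */
DATA _NULL_;
  a = 7;
  b = 2;
  total   = a + b;    /* addition */
  diff    = a - b;    /* subtraction */
  product = a * b;    /* multiplication */
  ratio   = a / b;    /* division */
  power   = a ** b;   /* raising to a power: in R this would be a^b */
  PUT total= diff= product= ratio= power=;
RUN;
```

The Log should show total=9 diff=5 product=14 ratio=3.5 power=49.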
In addition to these simple arithmetic operators, there are many inbuilt functions that you can use in a SAS DATA step. Some important ones to know are:
function    meaning                       function     meaning
ABS(x)      absolute value                SUM(x,y)     sum of variables
SQRT(x)     square root                   MEAN(x,y)    mean of variables
EXP(x)      exponential                   MIN(x,y)     minimum of variables
LOG(x)      natural logarithm (base e)    MAX(x,y)     maximum of variables
LOG10(x)    logarithm to base 10          VAR(x,y)     variance of variables
All functions listed with two arguments (x,y) can take two or more variables as arguments. If you are familiar with R (and hopefully you are!), SAS functions such as SUM, MEAN and VAR may appear a bit confusing: in R, such functions are typically applied to calculate the sum, mean or variance of the values in a single vector, whereas in a SAS DATA step they are calculated across different variables, separately for each observation. For example, suppose you have a dataset containing income from sales, rentals and other sources for different parts of a business, and you want to calculate the total income for each part. This can be done using something like income = SUM(sales, rentals, other) in a DATA step.
Of course, you could also define income as sales + rentals + other: the purpose of the example is merely to show you how the SUM function works.
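A minimal sketch of such a DATA step is given below. The dataset name incomedata and the variables sales, rentals, other and income come from the text; the part variable, the data values and the use of a one-level dataset name are assumptions made for illustration.

```sas
/* Sketch only: the part names and figures are invented for illustration */
DATA incomedata;
  INPUT part $ sales rentals other;
  income = SUM(sales, rentals, other);   /* row-wise total for each part */
  DATALINES;
North 120 30 5
South 95 42 8
;
RUN;

PROC PRINT DATA=incomedata;
RUN;
```

One difference between the two approaches is worth knowing: SUM ignores missing values among its arguments, whereas sales + rentals + other gives a missing result if any of the three is missing.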
Question: if you run the example above in your SAS session, will the incomedata data set still be available to you in future SAS sessions? Why? Discuss within your groups to make sure you understand how data sets are stored in SAS.
Another feature of the beetle mortality example was the use of an IF ... THEN ... ELSE construction to define the dosegroup variable. This is an example of a logical statement, and it works in a similar way to the if() ... else ... construction in R. One difference is that this kind of construction is efficient in SAS, but not in R. As in R, the IF statement is followed by a logical condition which evaluates to either TRUE or FALSE. A second, more important difference lies in how the condition is applied: in an R script, the logical condition must be a single value that is either TRUE or FALSE, whereas in a SAS DATA step the condition is evaluated separately for each observation and the corresponding action is taken. So, for the beetle mortality example above, the value of dosegroup was set to "High" for every observation in which the condition logdose > 1.8 is true, and to "Low" for every other observation. Look at the results from your PROC PRINT step to verify this.
The ">" symbol in "logdose > 1.8" is a logical operator meaning ‘greater than’. In SAS, logical operators can be denoted either using symbols such as >, or using abbreviations: for example, the condition logdose > 1.8 could equivalently have been written as logdose GT 1.8 (try it). The GT abbreviation stands for ‘greater than’. Other logical operators and the corresponding abbreviations are given in the following table:
Symbol    Abbreviation    Operation
=         EQ              Equal to
^=        NE              Not equal to
>         GT              Greater than
<         LT              Less than
>=        GE              Greater than or equal to
<=        LE              Less than or equal to
&         AND             And
|         OR              Or
Notice that there are some differences between SAS and R — notably the use of ^= instead of != for ‘not equal to’, and the use of = instead of == for ‘equal to’.
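The equivalence between symbols and abbreviations can be checked with a small sketch like the one below, which writes a message to the Log pane for each condition that is true. The value of logdose and the message text are arbitrary choices for illustration.

```sas
/* Sketch: each pair of IF statements tests the same condition */
DATA _NULL_;
  logdose = 1.8113;

  IF logdose > 1.8  THEN PUT "symbol form: greater than 1.8";
  IF logdose GT 1.8 THEN PUT "abbreviation form: greater than 1.8";

  IF logdose ^= 1.8 & logdose <= 1.9   THEN PUT "symbol form: not 1.8 and at most 1.9";
  IF logdose NE 1.8 AND logdose LE 1.9 THEN PUT "abbreviation form: not 1.8 and at most 1.9";
RUN;
```

With this value of logdose all four conditions are true, so all four messages should appear in the Log.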
You might wonder what happens if you omit the ELSE part of an IF ... THEN ... ELSE statement. Try it: delete the ELSE line (the one that sets dosegroup to "Low") from the beetle example above, and rerun the program, including the final PROC PRINT step so that you can see how the dataset has changed. You might want to refer back to Section 2.1 to understand what has happened.
Exercise: after rereading Section 2.1 of the self-study notes, discuss within your groups what you think would happen if you were to code the dosegroup variable as 1 for a high dose. Try it: change the statement that sets dosegroup to "High" so that it sets dosegroup to the number 1 instead, and then rerun the program. Do you see what you expected? Once you’ve understood what’s going on, finish this exercise by adding an ELSE statement that defines dosegroup to be zero for log doses of 1.8 or less.
2.2 Plotting the beetle mortality data
Your final exercise in today’s workshop is to plot the beetle mortality data so as to show the relationship between the log dose and the proportion of insects killed. In Workshop 6, you did this in R to produce Figure 2.1. How close can you get to this using PROC GPLOT in SAS? Try to match the colour, plotting symbols, title and, as far as possible, axis labels. You may want to look at the earlier GalapagosProg1.sas for some ideas on how to do this, as well as the help system (see below).
[Figure: scatterplot of proportion of insects killed (vertical axis) against log10 CS2 concentration in mg l−1 (horizontal axis)]
Figure 2.1: Beetle mortality data: relationship between log CS2 dose and proportion of insects killed. Plot produced in R during Workshop 6.
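As a possible starting point for this exercise, the sketch below uses SYMBOL and AXIS statements to control the appearance of a PROC GPLOT scatterplot. It assumes that the STAT0023.beetledata dataset described earlier exists, with variables logdose and proportion. The colour, plotting symbol, title and label text are placeholders to adjust until the plot resembles Figure 2.1.

```sas
/* Sketch only: colour, symbol, title and labels are placeholders */
SYMBOL1 VALUE=dot COLOR=blue;                            /* plotting symbol and colour */
AXIS1 LABEL=("log10 CS2 concentration (mg/l)");          /* horizontal axis label */
AXIS2 LABEL=(ANGLE=90 "Proportion of insects killed");   /* vertical axis label, rotated */
TITLE "Beetle mortality data";

PROC GPLOT DATA=STAT0023.beetledata;
  PLOT proportion*logdose / HAXIS=AXIS1 VAXIS=AXIS2;     /* vertical*horizontal */
RUN;
QUIT;
```

In the PLOT statement the vertical variable is given first, and the options after the slash attach the two AXIS definitions to the plot.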
3 Summary
This week, you have learned:
• How to navigate around the SAS interface, including use of the Editor and Log panes and the Results Viewer.
• How to set your current folder in SAS so that it can find your files.
• How to change the output preferences so that you can save SAS graphics and analysis results to other file formats.
• How to organise datasets in a SAS library, either temporary or permanent.
• The basic structure of a SAS program as a collection of DATA and PROC steps.
• How to create a SAS dataset in a DATA step, either by reading from an ASCII file with an INFILE statement, by entering the data directly with a DATALINES statement, or by extracting data from an existing SAS dataset with a SET statement.
• How to perform simple data manipulations in a DATA step.
• How to produce simple data summaries and scatterplots, using PROC MEANS and PROC GPLOT respectively.
4 Moodle quiz
There’s one Moodle quiz this week.
2022-03-25