STAT3064/STAT5061 Statistical Data Science Computer Lab 1
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
STAT3064/STAT5061 Statistical Data Science
Computer Lab 1, Semester 2 2022
Things you may need to know/do.
● Relates to lecture 1.
● Libraries ggplot2, tidyverse, MASS, GGally may be useful. Others may be mentioned below in the hints.
● You might like to set up a project for each lab if you are using RStudio. Then you can copy a .Rmd file into that directory and write your answers in that file
Q1
Consider the aircraft data (aircraft.csv) with variables Year, Period, Power, Span, Length, Weight, Speed and Range.
Data description (from the R help for package sm)
These data record six characteristics of aircraft designs which appeared during the twentieth century. The variables are: - Year: year of first manufacture - Period: a code to indicate one of three broad time periods - Power: total engine power (kW) - Span: wing span (m) - Length: length (m) - Weight: maximum take-off weight (kg) - Speed: maximum speed (km/h) - Range: range (km)
Source: The data were collected by P. Saviotti and are described in detail in Saviotti (1996), “Technological Evolution, Variety and Economy”, Edward Elgar: Cheltenham.
a) Read the data into R, and summarise it to understand the inputs. Are there any missing values? (Hint: read.csv, head and summary.)
b) Show separate histogram of Power and log10( Power ). Describe the shape of the histograms. (Hint: truehist (from MASS) or ggplot with geom_histogram.)
c) Construct smoothed histograms of Power and log10( Power ). (Hint: ggplot + geom_density.)
d) Create new variables logPower equal to log10( Power ) and similarly for the remaining variables, apart from Period and Year. Do this by creating a new dataframe with these variables plus Period and Year. Make Period a factor on this new dataframe. Why might the log transformation be a good idea to help in understanding these data? (Note that we have specified log10 rather than the natural log, because this is easier to interpret in the circumstances we have.) (Hint: Can be done directly but could use mutate from dplyr, which is part of tidyverse.)
e) Show pairwise plots of the logged variables. Which pair has the largest absolute correlation? (Hint: pairs or ggpairs.)
f) The code chunk below generates a 3D scatterplot of the variables logPower, logSpan and logLength. Comment on the scatterplots in relation to what it says about the variables, and discuss the shortcomings of this sort of display. Experiment with changing the viewing angles phi and theta.
g) The code chunk below makes a parallel coordinate plot of the 6 logged variables, coloured by the value of Period, and with boxplots and suitable transparency. It is arranged so that the values are not scaled. Comment on what this figure says about the data, and discuss any shortcomings of the plotting method.
Code chunk for 3d scatterplots.
Remove the “, eval = FALSE” before running knitting in RStudio.
library( plot3D )
scatter3D (aircraft$logWeight, aircraft$logSpan, aircraft$logLength,
phi = 20 , theta = 80 ,
col = NULL , NAcol = "white" , breaks = NULL ,
colkey = NULL , panel .first = NULL ,
clim = NULL , clab = NULL ,
bty = "b" , CI = NULL , surf = NULL ,
add = FALSE , plot = TRUE)
Code chunk for parallel co-ordinate plot.
Remove the “, eval = FALSE” before running knitting in RStudio.
library(GGally)
ggparcoord( aircraft,
groupColumn = "Period" ,
columns = 3:8 ,
alpha = 0.1 ,
scale = "globalminmax" ,
boxplot = TRUE ) +
theme( legend .position = "bottom")
Q2
Continue with the aircraft data of Q1 (using the logged variables).
The data are from three different periods.
(Hints for these: summary, xtabs, sapply, filter.)
a) How many observations belong to each period?
b) Which years belong to each period?
c) Which variable has the largest interquartile range? Which has the largest median? For which variables are the means larger than 1.04 times the median? (Leave out Period for this question, as it is a factor, not a numeric variable. Don’t do the second part of this question by just looking at the output from say the summary command. Do it by writing a small amount of code that will tell you the answer directly.)
d) Divide the dataframe into the three period groups. For all the data and separately for period 1, give the histograms and density plots of logPower. Compare the results for the complete data with those obtained from period 1. Comment. (Hint, for interest and because it will be useful in the first assignment: It is fairly easy to get 3 separate density plots. The following on the other hand will generate a plot more useful for comparisons.)
# density plots
ggplot( aircraft, aes ( logPower, group = Period, colour = Period ) ) +
geom_density( aes ( fill = Period ), alpha = 0.4 ) +
ggtitle("All data"))
e) Compare the pairs or ggpairs results for the logged variables for the whole data and for period 1. Comment.
Q3
Continue with the aircraft data of Q2.
a) Construct contour plots of the 2D smoothed histograms for the pair of variables logLength and logWeight,
and also for logSpeed and logSpan for all records.
(Hint: This code chunk does the second one.)
ggplot( aircraft, aes ( logSpeed, logSpan ) ) +
geom_density_2d ( ) +
geom_point( aes ( colour = Period ), alpha = 0.6 ) +
theme( legend .position = "bottom")
b) Describe the shape of these density plots (eg bimodal, skewed, etc).
c) What do the shapes of the contours in these figures say about the variable pairs? Hint: look at the correlations you calculated earlier.
Answers
Start R
library(tidyverse)
## -- Attaching packages
## v ggplot2 3 .3 .6 v
## v tibble 3 . 1 .7 v
## v tidyr 1 .2 .0 v
## v readr 2 . 1 .2 v
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library (MASS)
##
## Attaching package: 'MASS '
##
## The following object is masked from 'package:dplyr ' :
##
## select
library(GGally)
## Registered S3 method overwritten by 'GGally' :
## method from
## + .gg ggplot2
library( plot3D )
## Warning in system2("/usr/bin/otool", c("-L", shQuote(DSO)), stdout = TRUE): ## running command ''/usr/bin/otool ' -L '/Library/Frameworks/R .framework/Resources/ ## library/tcltk/libs//tcltk.so '' had status 1
Q1
a) Read the data into R, and summarise to understand the inputs. Are there any missing values?
library( ggplot2 )
aircraft0 = read .csv ( "aircraft .csv" )
head ( aircraft0 )
## Year Period Power Span Length Weight Speed Range
## 1 14 1 82 .0 12 .8 7 .60 1070 105 400
## 2 14 1 82 .0 11 .0 9 .00 830 145 402
## 3 14 1 223 .6 17 .9 10 .35 2200 135 500
## 4 15 1 164 .0 14 .5 9 .80 1946 138 500
## 5 15 1 119 .0 12 .9 7 .90 1190 140 400
## 6 15 1 74 .5 7 .5 6 .30 653 177 350
summary( aircraft0 )
## Year Period Power Span Length
![]()
![]()
![]()
## Min . :14 .00 ## 1st Qu . :34 .00 ## Median :42 .00 ## Mean :46 .27 ## 3rd Qu . :61 .00
## Max . :84 .00
## Weight
## Min . : 481
## 1st Qu . : 2386
## Median : 5500 ## Mean : 17236
## 3rd Qu . : 14000 ## Max . :365000
Min . :1 .00 Min . : 30 Min . : 6 .68 Min . : 5 .69 1st Qu . :1 .00 1st Qu . : 421 1st Qu . :11 .00 1st Qu . : 9 .02 Median :2 .00 Median : 1014 Median :14 .30 Median :11 .96 Mean :2 .02 Mean : 4062 Mean :18 .13 Mean :15 .51 3rd Qu . :3 .00 3rd Qu . : 3080 3rd Qu . :22 .10 3rd Qu . :17 .84 Max . :3 .00 Max . :80000 Max . :70 .10 Max . :75 .54
Speed Range
Min . : 105 .0 Min . : 80
1st Qu . : 259 .0 1st Qu . : 850
Median : 408 .0 Median : 1400
Mean : 561 .3 Mean : 2039
3rd Qu . : 655 .0 3rd Qu . : 2400
Max . :3219 .0 Max . :20120
b) Show separate histogram of Power and log10( Power ). Describe the shape of the histograms.
library ( MASS )
truehist ( aircraft0$Power ) # uses MASS::truehist
|
|
0 20000 40000 60000 80000
aircraft0$Power
library( ggplot2 )
ggplot ( aircraft0, aes ( Power ) ) + geom_histogram()
## !stat_bin() ! using !bins = 30 ! . Pick better value
with !binwidth! .
![]()
|
400
300 200
100 0 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
0 20000 40000 60000 80000
Power
ggplot ( aircraft0, aes ( log10( Power) ) ) + geom_histogram() # example, from ggplot2 ## !stat_bin() ! using !bins = 30 ! . Pick better value with !binwidth! .
2022-08-22