Understanding Data and Statistical Design (60117) Spring 2022
Assessment Task 2: Data Analysis Assignment
Q1 & Q2 DATA
The data for Q1 and Q2 is contained in the file “q1q2data.csv”. The variables in this file are summarised in the table below.
Name | Type | Description
poison | experimental factor | type of poison (1-3)
therapy | experimental factor | therapy administered to treat poison (1-4)
time | response | survival time of animal (10s of hours)
The data records the survival time (variable time) of animals randomly allocated a type of poison (variable poison) and randomly allocated a medical therapy to treat the poison (variable therapy).
To read the data into R, run the getwd() function and save the CSV file in the location returned. Alternatively, use the setwd function to point R to the location where the CSV file is saved. Then run the line of code below.
q1q2.data <- read.csv("q1q2data.csv", header=TRUE,
colClasses=c("factor","factor","numeric"))
QUESTION 1. Observational experiment [14 marks]
In this question we assess the survival time (variable time) of animals administered a variety of poisons. The statistical model for the analysis is
time_n = μ + ε_n, n ∈ {1, 2, …, 48},
where
• time_n is the survival time of the n-th animal
• μ is the population mean time
• ε_n is the random effect on time of the n-th animal.
(a) Construct a histogram of time and superimpose over this a normal density curve fitted to the sample [2 marks]. Citing evidence from the plot, determine if the sample looks to be approximately normally distributed [2 marks].
The histogram and fitted normal density curve suggest that the sample is not approximately normally distributed: the histogram has a long right tail that the normal curve does not capture, so the data are skewed to the right.
(b) Using significance level α = 0.05, perform a test to determine if the population mean time of survival is greater than 4.2 hours. Write down the null and alternative hypotheses [1 mark], the test statistic and associated p-value [1 mark], the test decision (providing a reason for this) [1 mark] and a conclusion using a minimum of mathematical language [1 mark].
Hypotheses
H0 : μ = 0.42
HA : μ > 0.42
(time is recorded in 10s of hours, so 4.2 hours corresponds to 0.42)
Test Statistic
The test statistic is t = 1.7593 with p-value = 0.04252.
Test Decision
Reject the null hypothesis as p < 0.05.
Conclusion
At the 5% significance level, there is sufficient evidence that the population mean survival time is greater than 4.2 hours.
(c) From the R output for part (b) you will have noticed the 95% confidence interval 0.42297 ≤ μ < ∞ .
Verify this is correct by performing your own calculation [2 marks].
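One-sided confidence interval for the mean
With n = 48 observations, sample mean x̄ and sample standard deviation s, the one-sided 95% confidence interval is
x̄ − t_{0.95}(n − 1) · s/√n ≤ μ < ∞.
Thus, the 95% confidence interval is from 0.42297 to ∞.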
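The lower bound of a one-sided interval can be checked by hand and compared with the bound that t.test reports. The following is a minimal sketch only: the vector x is a hypothetical simulated sample standing in for the time column (the assignment data is not reproduced here), so its numbers will not match 0.42297.

```r
# Hypothetical stand-in for q1q2.data$time (simulated, NOT the assignment data)
set.seed(1)
x <- rgamma(48, shape = 4, rate = 8)

n     <- length(x)
xbar  <- mean(x)
se    <- sd(x) / sqrt(n)            # estimated standard error of the mean
tcrit <- qt(0.95, df = n - 1)       # upper 5% point of t with n - 1 df

lower_manual <- xbar - tcrit * se   # one-sided 95% lower confidence bound

# The same bound as reported by R's one-sided t-test
fit <- t.test(x, mu = 0.42, alternative = "greater", conf.level = 0.95)
lower_ttest <- fit$conf.int[1]

c(manual = lower_manual, t.test = lower_ttest)
```

With the real data, replacing x by q1q2.data$time should give the two matching values for the lower bound.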
(d) Using significance level α = 0.05, perform a test to determine if the population median time of survival is different to 5.3 hours. Write down the null and alternative hypotheses [1 mark], the test statistic and associated p-value [1 mark], the test decision (providing a reason for this) [1 mark] and a conclusion using a minimum of mathematical language [1 mark].
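Hypotheses
H0 : population median = 0.53
HA : population median ≠ 0.53
(time is recorded in 10s of hours, so 5.3 hours corresponds to 0.53)
Test Statistic
The Wilcoxon signed-rank test statistic is V = 416 with p-value = 0.07849.
Test Decision
Retain the null hypothesis as p > 0.05.
Conclusion
There is not enough evidence that the population median survival time is different to 5.3 hours.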
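As a rough check, the reported p-value can be approximated from the statistic 416 and the sample size 48 using the large-sample normal approximation to the signed-rank statistic. This sketch assumes no tied or zero differences, which is not guaranteed for the real data, so a small discrepancy from 0.07849 is expected.

```r
n <- 48          # sample size from the model in Question 1
V <- 416         # signed-rank statistic reported above

mu_V <- n * (n + 1) / 4                        # mean of V under H0
sd_V <- sqrt(n * (n + 1) * (2 * n + 1) / 24)   # sd of V under H0 (no ties assumed)

# Continuity correction moves V half a unit towards its null mean
z <- (V - mu_V + 0.5 * sign(mu_V - V)) / sd_V
p_approx <- 2 * pnorm(-abs(z))                 # two-sided p-value
p_approx
```

The approximation comes out at roughly 0.079, in line with the p-value in the answer.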
QUESTION 2. Two-factor experiment [16 marks]
In this question we continue the analysis from Q1, but this time also considering the factors poison and therapy.
(a) Write down the statistical model for a 3 × 4 factorial experiment that could give rise to the sample data we are considering, excluding interaction between the factors [2 marks]. Identify the experimental units [2 marks].
In this study, a 3 × 4 factorial completely randomized design (CRD) experiment was used, with 12 treatments repeated four times each, and a total of 48 observations.
The statistical model is described as
time_{i,j,n} = μ + α_i + β_j + ε_{i,j,n}
where
• i ∈ {1,2,3}, j ∈ {1,2,3,4}, n ∈ {1,2,3,4}
• time_{i,j,n} is the survival time of the animal at the n-th experiment with poison i and therapy j
• μ is the global mean time
• α_i is the treatment effect on time of poison i
• β_j is the treatment effect on time of therapy j
• ε_{i,j,n} is the random effect on time at the n-th experiment with poison i and therapy j.
The components of the experiment design:
• experimental factor A – type of poison (variable poison) which has 3 levels
• experimental factor B – therapy administered to treat poison (variable therapy) which has 4 levels
• treatments – the 12 combinations of levels of each factor
• experimental units – each of the 12 groups of 4 animal samples to which the 12 treatments are randomly allocated
• measurement units – the 48 animal samples
• response variable – survival time of the animal (variable time)
(b) Using significance level α = 0.05, perform two-way ANOVA (without interaction) and document the F-test for the factor poison. Write down the null and alternative hypotheses [1 mark], the test statistic and associated p-value [1 mark], the test decision (providing a reason for this) [1 mark] and a conclusion using a minimum of mathematical language [1 mark].
Hypotheses
H0 : α_1 = α_2 = α_3 = 0
HA : at least one α_i ≠ 0
Test Statistic
The test statistic is F = 20.86 with p-value = 5.11 × 10^-7.
Test Decision
Reject the null hypothesis as p < 0.05.
Conclusion
There is strong evidence that at least one poison affects the survival time of animals differently from the others.
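The structure of the additive two-way ANOVA can be sketched as follows. The data frame sim is simulated purely for illustration (same 3 × 4 layout with 4 replicates, but not the assignment data), so the F value will differ from 20.86. The sketch also shows that the F statistic for poison is the ratio of its mean square to the residual mean square.

```r
# Simulated 3 x 4 factorial layout with 4 replicates (NOT the assignment data)
set.seed(2)
sim <- expand.grid(rep = 1:4, therapy = factor(1:4), poison = factor(1:3))
sim$time <- 0.5 - 0.1 * as.integer(sim$poison) +
  0.05 * as.integer(sim$therapy) + rnorm(nrow(sim), sd = 0.1)

fit <- aov(time ~ poison + therapy, data = sim)   # additive model, no interaction
tab <- summary(fit)[[1]]                          # rows: poison, therapy, Residuals
tab

# F statistic for poison = MS(poison) / MS(residual), with its upper-tail p-value
F_poison <- tab[1, "Mean Sq"] / tab[3, "Mean Sq"]
p_poison <- pf(F_poison, tab[1, "Df"], tab[3, "Df"], lower.tail = FALSE)
c(F = F_poison, p = p_poison)
```

With the real data the same call would be aov(time ~ poison + therapy, data = q1q2.data).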
(c) Using significance level α = 0.05, document a normality test on the residuals for the analysis in part (b). Write down the null and alternative hypotheses [1 mark], the test statistic and associated p-value [1 mark], the test decision (providing a reason for this) [1 mark] and a conclusion using a minimum of mathematical language [1 mark].
Hypotheses
H0 : The residuals ε_{i,j,n} are normally distributed
HA : The residuals ε_{i,j,n} are not normally distributed
Test Statistic
The test statistic is W = 0.92202 with p-value = 0.003506.
Test Decision
Reject the null hypothesis as p < 0.05.
Conclusion
There is strong evidence that the residuals are not from a normal distribution.
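A normality test on ANOVA residuals can be sketched as below, assuming the Shapiro-Wilk test (whose statistic R labels W) is the test intended. The response here is simulated and deliberately skewed, so the output illustrates the mechanics only and will not reproduce W = 0.92202.

```r
# Simulated skewed response on a 3 x 4 layout with 4 replicates (NOT the assignment data)
set.seed(6)
sim <- expand.grid(rep = 1:4, therapy = factor(1:4), poison = factor(1:3))
sim$time <- exp(rnorm(nrow(sim), mean = -1, sd = 0.5))

fit <- aov(time ~ poison + therapy, data = sim)

sw <- shapiro.test(residuals(fit))   # H0: residuals are normally distributed
sw$statistic                         # W
sw$p.value
```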
(d) Using significance level α = 0.05, perform Tukey post-hoc analysis on the factor therapy and determine which levels have statistically different means [2 marks].
The mean survival times under therapy 3 and therapy 4 are each statistically different from the mean survival time under therapy 1.
The adjusted p-values for the therapy 3 vs therapy 1 and therapy 4 vs therapy 1 comparisons are both less than 0.05, so these differences are statistically significant.
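The Tukey comparison can be read off the matrix returned by TukeyHSD, whose last column holds the adjusted p-values. A minimal sketch on simulated data follows (the effect sizes are made up so that some therapies differ). Which pairs come out significant here says nothing about the real data.

```r
# Simulated 3 x 4 factorial layout with 4 replicates (NOT the assignment data)
set.seed(3)
sim <- expand.grid(rep = 1:4, therapy = factor(1:4), poison = factor(1:3))
sim$time <- 0.5 - 0.1 * as.integer(sim$poison) +
  c(0, 0.02, 0.25, 0.3)[as.integer(sim$therapy)] + rnorm(nrow(sim), sd = 0.1)

fit <- aov(time ~ poison + therapy, data = sim)
tuk <- TukeyHSD(fit, which = "therapy", conf.level = 0.95)
tuk$therapy                        # columns: diff, lwr, upr, p adj

# Pairs of therapy levels whose adjusted p-value is below 0.05
sig_pairs <- rownames(tuk$therapy)[tuk$therapy[, "p adj"] < 0.05]
sig_pairs
```

Each row name has the form "3-1", meaning the mean of therapy 3 minus the mean of therapy 1.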
(e) Using diagnostic plots of the residuals, assess whether the assumptions of independence and constant variance have been met [2 marks].
Independence.
There are no obvious patterns in the Residuals vs Fitted plot, so no problem with this assumption.
Constant variance.
The range of the residuals in the Residuals vs Fitted plot appears to increase, indicating a potential problem with this assumption.
Q3 & Q4 DATA
The data for Q3 and Q4 is contained in the file “q3q4data.csv”. The variables in this file are summarised in the table below.
Name | Type | Description
river | categorical predictor | 0 (Lumber), 1 (Waccamaw)
length | continuous predictor | length of fish (cm)
weight | continuous predictor | weight of fish (g)
mercury | continuous response | mercury concentration (ppm)
The data records mercury concentration and attributes of fish caught in two rivers in North Carolina.
To read the data into R, run the getwd() function and save the CSV file in the location returned. Alternatively, use the setwd function to point R to the location where the CSV file is saved. Then run the line of code below.
q3q4.data <- read.csv("q3q4data.csv", header=TRUE,
colClasses=c("factor",rep("numeric",times=3)))
QUESTION 3. Simple linear regression [14 marks]
In this question we build a simple linear regression to model the relationship between mercury and length. We consider the population model
mercury = β_0 + β_l ∗ length + ε
where var(ε) = σ².
(a) Fit the model described above, write down the regression equation [1 mark] and calculate the predicted average mercury level of a fish with length equal to the 0.75 quantile of the sample of length [2 marks].
Regression equation
mercury_hat(length) = −1.1316 + 0.0581 ∗ length
Intercept: mercury_hat(length = 0) = −1.1316 + 0.0581 ∗ 0 = −1.1316
Slope: mercury_hat(length + 1) = −1.1316 + 0.0581 ∗ (length + 1)
= −1.1316 + 0.0581 ∗ length + 0.0581
= mercury_hat(length) + 0.0581
Prediction at the 0.75 quantile of length (46.2 cm):
0.0581 ∗ 46.2 − 1.1316 = 1.5526
Therefore, the predicted average mercury level of a fish whose length equals the 0.75 sample quantile is approximately 1.5526 ppm.
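The prediction is a single evaluation of the fitted line, so the arithmetic is easy to check in R from the rounded coefficients and the quantile value 46.2 quoted above. The variable names are illustrative only.

```r
b0 <- -1.1316        # fitted intercept from the regression equation
b1 <- 0.0581         # fitted slope (ppm per cm)
len_q75 <- 46.2      # 0.75 sample quantile of length quoted above (cm)

pred <- b0 + b1 * len_q75   # predicted average mercury (ppm)
pred                        # 1.55262

# With the real data loaded, the same steps would be along the lines of:
#   fit <- lm(mercury ~ length, data = q3q4.data)
#   predict(fit, data.frame(length = quantile(q3q4.data$length, 0.75)))
```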
(b) Write down the model's estimate of σ² [2 marks].
According to the regression output, the residual standard error is 0.5805 on 169 degrees of freedom, so the estimate of the error variance is:
σ̂² = 0.5805 ∗ 0.5805 = 0.33698
(c) Using 0.05 significance level, test whether average mercury level increases by less than 0.065ppm for each additional centimetre of fish length. Write down the null and alternative hypotheses [1 mark], the test statistic [1 mark], the test decision with reason [1 mark] and a conclusion using a minimum of mathematical language [1 mark].
Hypotheses
H0 : β_l ≥ 0.065
HA : β_l < 0.065
Test Statistic
t = (0.058127 − 0.065) / 0.213615 = −0.0321747
Test Decision
Retain H0 as t = −0.0322 > −t_{0.95}(169) ≈ −1.65
Conclusion
There is insufficient evidence that the average mercury level increases by less than 0.065 ppm for each additional centimetre of fish length.
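Because the question asks whether the slope is less than 0.065, the test is lower-tailed. The test statistic, critical value and p-value can be sketched in R as below. The slope estimate and standard error are the values read from the test-statistic line above, and the degrees of freedom are taken from part (b). They are reproduced here as given, not re-derived from the data.

```r
b1_hat <- 0.058127   # estimated slope, as used in the test statistic above
b1_0   <- 0.065      # hypothesised slope
se     <- 0.213615   # standard error, as used in the test statistic above
df     <- 169        # residual degrees of freedom from part (b)

t_stat <- (b1_hat - b1_0) / se
t_crit <- qt(0.05, df)        # lower 5% point, for H_A: slope < 0.065
p_val  <- pt(t_stat, df)      # lower-tail p-value

c(t = t_stat, critical = t_crit, p = p_val)
reject <- t_stat < t_crit     # reject H0 only if t falls below the critical value
reject
```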
(d) Using appropriate diagnostic plots, determine if the modelling assumptions appear to have been satisfied [3 marks].
Normality
The Normal Q-Q plot shows the residuals tracking the line representing normality, indicating compliance with this assumption.
Constant variance
The Residuals vs Fitted plot shows the range of the residuals to be fairly consistent, indicating a compliance with this assumption.
Independence
The Residuals vs Fitted plot shows no obvious patterns in the residuals indicating compliance with this assumption.
(e) Is there any statistical evidence of autocorrelation in the residuals [2 marks]?
There is no strong evidence of autocorrelation in the residuals as the DW statistic is between 1 and 3.
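The Durbin-Watson statistic is the sum of squared successive differences of the residuals divided by their sum of squares. Values near 2 are consistent with no first-order autocorrelation. The sketch below computes it in base R on simulated, independent data (hypothetical variables len and mercury, not the assignment data).

```r
# Simulated independent observations (NOT the assignment data)
set.seed(4)
len     <- runif(100, 25, 60)                    # hypothetical fish lengths (cm)
mercury <- -1 + 0.06 * len + rnorm(100, sd = 0.5)

fit <- lm(mercury ~ len)
e   <- residuals(fit)

# Durbin-Watson statistic: always between 0 and 4, near 2 under independence
dw <- sum(diff(e)^2) / sum(e^2)
dw
```

If the lmtest package is installed, lmtest::dwtest(fit) reports the same statistic together with a p-value.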
QUESTION 4. Multiple linear regression [16 marks]
In this question we extend the model from Q3 into a multiple linear regression.
(a) Create a scatterplot of the variables mercury and weight and colour code the plot according to levels of river [2 marks]. Discuss the need for an interaction term between the predictors river and weight [2 marks].
There is some evidence of different slopes according to river, suggesting need for interaction term.
We now consider the population model
mercury = β_0 + γ ∗ river1 + β_l ∗ length + β_w ∗ weight + δ ∗ river1 ∗ weight + ε
where river1 = 1 if river = 1 (Waccamaw) and river1 = 0 if river = 0 (Lumber).
Note that R will create the dummy variable TiveT1 automatically.
(b) Fit the model described above, write down the regression that applies for the Lumber River [1 mark] and provide interpretations of the estimated coefficients β̂_0 and δ̂ [2 marks].
Regression
mercury_hat = −0.96452 + 0.05724 ∗ length − 0.00018 ∗ weight
The coefficient β̂_0 = −0.96452 is the predicted average mercury level for a Lumber River fish when length = weight = 0.
The coefficient δ̂ = 0.00032 is the estimated difference between the two rivers in the weight slope: with length held fixed, each additional gram of weight changes the predicted average mercury level by 0.00032 ppm more for a Waccamaw River fish than for a Lumber River fish.
(c) Using 0.05 significance level, determine if the interaction term is significant. Write down the null and alternative hypotheses [1 mark], the test statistic [1 mark], the test decision with reason [1 mark] and a conclusion using a minimum of mathematical language [1 mark].
Hypotheses
H0 : δ = 0
HA : δ ≠ 0
Test Statistic
t = 0.0003207 / 0.0001026 = 3.12573
Test Decision
Reject H0 as |t| = 3.126 > t_{0.975}(166) ≈ 1.97
Conclusion
At the 0.05 significance level, the interaction term is significant: there is evidence that the effect of weight on average mercury level differs between the two rivers.
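The two-sided test of the interaction coefficient can be reproduced from the estimate and standard error quoted above. The residual degrees of freedom used here are an assumption: they are inferred as 171 − 5 from the 169 degrees of freedom of the two-parameter model in Question 3, not read from the output.

```r
delta_hat <- 0.0003207   # estimated interaction coefficient
se_delta  <- 0.0001026   # its standard error
df        <- 171 - 5     # assumed: n = 171 fish, 5 estimated coefficients

t_stat <- delta_hat / se_delta
t_crit <- qt(0.975, df)                 # two-sided 5% critical value
p_val  <- 2 * pt(-abs(t_stat), df)      # two-sided p-value

c(t = t_stat, critical = t_crit, p = p_val)
```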
(d) Calculate the predicted average mercury level for a fish of length 37.9cm and weight 607g caught in the Waccamaw River and the associated 95% two-sided mean confidence interval [2 marks]. You will need to construct a data frame containing this new data point.
For a fish of length 37.9 cm and weight 607 g caught in the Waccamaw River:
Fitted value = 1.4378453
95% confidence interval for the mean: [1.25944, 1.6162484]
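Constructing the new data point and requesting a confidence interval for the mean (rather than a prediction interval for a single fish) might look like the sketch below. The data frame sim is simulated with made-up coefficients, so its fitted value and interval will not match the figures above. The point is the shape of the predict call and the need for river to be a factor with the same levels as in the fit.

```r
# Simulated data with the same column names (NOT the assignment data)
set.seed(5)
n <- 171
sim <- data.frame(
  river  = factor(sample(0:1, n, replace = TRUE)),
  length = runif(n, 25, 60)
)
sim$weight  <- 30 * sim$length + rnorm(n, sd = 150)
sim$mercury <- -1 + 0.06 * sim$length - 0.0002 * sim$weight +
  0.0003 * (sim$river == "1") * sim$weight + rnorm(n, sd = 0.5)

fit <- lm(mercury ~ river + length + weight + river:weight, data = sim)

# New data point: the factor levels must match those used in the fit
new_fish <- data.frame(river = factor("1", levels = levels(sim$river)),
                       length = 37.9, weight = 607)

# interval = "confidence" gives the interval for the mean response
ci <- predict(fit, newdata = new_fish, interval = "confidence", level = 0.95)
ci    # columns: fit, lwr, upr
```

Using interval = "prediction" instead would give the wider interval for an individual fish.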
Below are diagnostic plots of the residuals for the model fitted above.
We see that the modelling assumptions have not been satisfied.
Sometimes transforming the response variable and fitting a model with the transformed response can result in a model that does satisfy the assumptions.
Here we take the response variable mercury to the power of 1/5 and consider the population model
mercury^(1/5) = β_0 + γ ∗ river1 + β_l ∗ length + β_w ∗ weight + δ ∗ river1 ∗ weight + ε
where river1 = 1 if river = 1 (Waccamaw) and river1 = 0 if river = 0 (Lumber).
(e) Fit the model described just above, write down the fitted regression equation for the Lumber River [1 mark] and produce diagnostic plots of the residuals [1 mark]. Have the modelling assumptions been satisfied for this model [1 mark]?
Regression equation
predicted mercury^(1/5) = −0.5359 + 0.01341 ∗ length − 0.00006249 ∗ weight
Normality
The Normal Q-Q plot shows the residuals tracking the line representing normality, indicating compliance with this assumption.
Constant variance
The Residuals vs Fitted plot shows the range of the residuals to be fairly consistent, indicating a compliance with this assumption.
Independence
The Residuals vs Fitted plot shows no obvious patterns in the residuals indicating compliance with this assumption.
2023-08-23