Statistics 2: Computer Practical 3
1 Data description
For this computer practical you will need to load the following data:
diabetes <- read.csv("diabetes_data.csv", header = TRUE)
This dataset was originally created by the National Institute of Diabetes and Digestive and Kidney Diseases
and was downloaded from https://www.kaggle.com/datasets. It includes medical records for 768 female
patients of Pima Indian heritage, aged at least 21 years, who were tested for diabetes. In detail, the dataset consists of the following measurements:
● Pregnancies: Number of pregnancies
● Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test (mg/dL)
● BloodPressure: Diastolic blood pressure (mm Hg)
● SkinThickness: Triceps skin fold thickness (mm)
● Insulin: 2-hour serum insulin (µU/mL)
● BMI: Body mass index (kg/m²)
● DiabetesPedigreeFunction: A function indicating the likelihood of diabetes based on family history
● Age: Age of patient (≥ 21)
● Outcome: Outcome of diabetes test (1: test positive, 0: test negative)
Careful inspection of the dataset shows that for some of the variables (Glucose, BloodPressure, SkinThickness, Insulin and BMI) missing values were recorded as 0's. This could lead to misleading results in the analysis, so we need to adjust these entries before we proceed. In general, we need to be very careful with missing data. Deleting observations is usually not good practice: by doing so we may lose a lot of information and, in addition, we may introduce bias into the data, e.g. if the missingness is related to the outcome of interest (diabetes in this case).
For these data, Pregnancies, DiabetesPedigreeFunction, Age and of course Outcome have no missing values and hence we prefer not to delete any of the patients. Instead, we replace all missing values with the median observation of the corresponding variable (we could have used the mean or even more complicated methods for imputation). To do this we run the following code:
missing <- function(var) {
  # replace zero-coded missing values with the median of the non-zero entries
  med <- median(var[var > 0])
  var[var == 0] <- med
  return(var)
}
diabetes$Glucose<-missing(diabetes$Glucose)
diabetes$BloodPressure<-missing(diabetes$BloodPressure)
diabetes$Insulin<-missing(diabetes$Insulin)
diabetes$SkinThickness<-missing(diabetes$SkinThickness)
diabetes$BMI<-missing(diabetes$BMI)
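As a quick sanity check, the imputation can be exercised on a toy vector (the `missing` function is the one defined above, repeated here so the snippet stands alone):

```r
missing <- function(var) {
  # replace zero-coded missing values with the median of the non-zero entries
  med <- median(var[var > 0])
  var[var == 0] <- med
  return(var)
}

x <- c(0, 10, 20, 30, 0)   # toy vector with two "missing" zeros
x_imp <- missing(x)

stopifnot(all(x_imp > 0))  # no zeros left after imputation
stopifnot(x_imp[1] == 20)  # imputed with median(c(10, 20, 30)) = 20
```

The same check (`all(diabetes$Glucose > 0)`, etc.) can be applied to each imputed column of the dataset.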
2 Testing the mean of normal data
We now focus on the BloodPressure measurements. According to the American Heart Association (AHA), a normal diastolic pressure should be in the range 60–80 mm Hg (millimetres of mercury), so one could say that the blood pressure of a 'healthy' person should be 70 mm Hg. We want to test whether the diastolic pressure of the patients in the diabetes dataset is higher than 70 mm Hg.
If the distribution of the data were known, the test would simplify significantly.
Question 1. [2 marks] Use the intervals
(−∞, 45], (45, 55], (55, 65], (65, 75], (75, 85], (85, 95], (95, ∞)
to quantize the data. You can do this using the following code:
BP <- diabetes$BloodPressure
breaks <- c(-Inf, seq(45, 95, by = 10), Inf)
Obs <- table(cut(BP, breaks))
Hence, using Pearson's goodness-of-fit test, confirm that a Normal distribution is a valid assumption for the BloodPressure data, i.e. that they are derived from N(µ, σ²) for an appropriate choice of the parameters µ and σ².
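A minimal sketch of how such a goodness-of-fit computation can be organised (the helper name `gof.normal` is illustrative; the quantization follows the code above, and the degrees of freedom subtract one for the probability constraint and two for the estimated parameters):

```r
gof.normal <- function(x, breaks) {
  Obs <- table(cut(x, breaks))    # observed counts per interval
  mu.hat <- mean(x)               # estimate the Normal parameters from the data
  sigma.hat <- sd(x)

  # interval probabilities and expected counts under N(mu.hat, sigma.hat^2)
  p <- diff(pnorm(breaks, mean = mu.hat, sd = sigma.hat))
  Exp <- length(x) * p

  # Pearson's statistic; df = #intervals - 1 - #estimated parameters
  T.stat <- sum((Obs - Exp)^2 / Exp)
  df <- length(Obs) - 1 - 2
  list(stat = T.stat, df = df,
       p.value = pchisq(T.stat, df = df, lower.tail = FALSE))
}

# For the practical: gof.normal(diabetes$BloodPressure, c(-Inf, seq(45, 95, by = 10), Inf))
```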
You may find the handout on goodness-of-fit for continuous distributions and the relevant case study helpful.
Question 2 [2 marks] Assuming that the BloodPressure measurements are realisations of i.i.d. random variables from N(µ, 12), perform the following test using an appropriate (UMP) test statistic
H0 : µ = 70 vs H1 : µ > 70
reporting the p-value of the test. State clearly the test statistic used and justify your choice appropriately.
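As a hedged sketch of the required computation (the function name `z.test.upper` is illustrative; `sigma` should be set to the standard deviation implied by the Normal model stated in Question 2):

```r
z.test.upper <- function(x, mu0, sigma) {
  n <- length(x)
  z <- (mean(x) - mu0) / (sigma / sqrt(n))  # standardised sample mean
  p.value <- pnorm(z, lower.tail = FALSE)   # P(Z >= z) under H0
  list(z = z, p.value = p.value)
}

# For the practical: z.test.upper(diabetes$BloodPressure, mu0 = 70, sigma = ...)
# with sigma taken from the model in Question 2.
```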
3 Logistic regression (again!)
Recall the logistic model from the previous computer practicals,
Yi ~ Bernoulli(σ(θᵀxi)), i ∈ {1, . . . , n},
where x1 , . . . , xn are d-dimensional real vectors of explanatory variables, and σ is the standard logistic function
σ(z) = 1 / (1 + e^(−z)).
Question 3 [1 mark] Show that, for πi = P(Yi = 1),

log( πi / (1 − πi) ) = Σ_{j=1}^{d} θj xij .
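The identity in Question 3 (the log-odds are linear in θ) can be checked numerically; the parameter and explanatory vectors below are arbitrary illustrative values:

```r
sigma <- function(v) 1 / (1 + exp(-v))  # standard logistic function

theta <- c(0.5, -1.2, 0.3)     # arbitrary parameter vector
x <- c(1, 2, -0.5)             # arbitrary explanatory vector
pi.i <- sigma(sum(theta * x))  # P(Y = 1) under the logistic model

# the log-odds should recover the linear predictor theta' x
stopifnot(abs(log(pi.i / (1 - pi.i)) - sum(theta * x)) < 1e-10)
```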
4 Hypothesis testing in logistic regression
In the second computer practical it was mentioned that if an explanatory variable has no effect on the probability of the response variable, then we expect the corresponding coefficient to be equal to 0. We will examine this in a more formal way through hypothesis testing. Consider the logistic model described above and the hypotheses
H0 : θi1 = θi2 = · · · = θir = 0
H1 : at least one of θi1, θi2, . . . , θir is not equal to 0.
Consider the generalised likelihood ratio statistic for this nested test,

Λn = sup_{θ ∈ H0} L(θ; y) / sup_θ L(θ; y) = L(θ̂0; y) / L(θ̂MLE; y),

where θ̂0 is the maximum likelihood estimator under the null hypothesis and θ̂MLE is the maximum likelihood estimator for the full model (i.e. with no restrictions on the θi). Then,

−2 log Λn

has a χ²(r) distribution under the null hypothesis (notice that r is the number of restrictions under the null hypothesis).
Returning to the diabetes data, we plot all explanatory variables in the data against the Outcome (patient diabetic or not) to get a first impression of possible effects.
Xnames <- colnames(diabetes[, -9])  # names of the explanatory variables
par(mfrow = c(1, 3))
for (i in 1:8) {
  boxplot(diabetes[, i] ~ diabetes$Outcome, main = Xnames[i],
          names = c("Negative", "Positive"), xlab = "Diabetes test")
}
[Figure: boxplots of Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction and Age against the diabetes test outcome (Negative/Positive).]
Question 4 [2 marks] Use the generalised likelihood ratio test to decide whether the variables BloodPressure, SkinThickness, Insulin and Age are statistically significant for the development of diabetes. To do this, start by forming the matrices that correspond to the restricted model under the null hypothesis (X_rest) and to the full model (X_full).
X_full <- cbind(1, as.matrix(diabetes[, 1:8]))
X_rest <- cbind(1, as.matrix(diabetes[, c(1, 2, 6, 7)]))
Y <- diabetes[, 9]
Notice that, as in computer practical 2, we add a constant term to the model, and hence the vector of parameters becomes θ = (θ0, θ1, . . . , θ8). You will need the following functions from computer practical 2 (notice the modification in the return value of maximize.ell):
sigma <- function(v) {
  1 / (1 + exp(-v))
}

ell <- function(theta, X, y) {
  # log-likelihood of the logistic model
  p <- as.vector(sigma(X %*% theta))
  sum(y * log(p) + (1 - y) * log(1 - p))
}

score <- function(theta, X, y) {
  # gradient of the log-likelihood
  p <- as.vector(sigma(X %*% theta))
  as.vector(t(X) %*% (y - p))
}

maximize.ell <- function(ell, score, X, y, theta0) {
  optim.out <- optim(theta0, fn = ell, gr = score, X = X, y = y, method = "BFGS",
                     control = list(fnscale = -1, maxit = 1000, reltol = 1e-16))
  return(list(theta = optim.out$par, value = optim.out$value))
}
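With the functions above in place, the test in Question 4 can be sketched as follows (the helper name `glrt` is illustrative; X_full, X_rest and Y are as constructed earlier, and the null hypothesis imposes r = 4 restrictions):

```r
glrt <- function(X_full, X_rest, Y) {
  # MLE under the full model and under the restricted (null) model
  fit.full <- maximize.ell(ell, score, X_full, Y, rep(0, ncol(X_full)))
  fit.rest <- maximize.ell(ell, score, X_rest, Y, rep(0, ncol(X_rest)))

  # -2 log Lambda_n = 2 * (maximised log-lik, full - maximised log-lik, restricted)
  test.stat <- 2 * (fit.full$value - fit.rest$value)

  # number of restrictions = difference in the number of parameters (here 4)
  r <- ncol(X_full) - ncol(X_rest)
  list(stat = test.stat, df = r,
       p.value = pchisq(test.stat, df = r, lower.tail = FALSE))
}

# For the practical: glrt(X_full, X_rest, Y), then compare p.value with the
# chosen significance level.
```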
5 Verifying the distribution of −2 log Λn
Recall the following function from computer practical 2, which simulates values of the response variable for a given matrix of explanatory variables, X , and a given vector of parameters, theta.
generate.ys <- function(X, theta) {
  n <- dim(X)[1]
  rbinom(n, size = 1, prob = sigma(X %*% theta))
}
Question 5 [3 marks] Consider the test in Question 4. By repeated experiments, simulate the test statistic −2 log Λn under the null hypothesis, using the appropriate maximum likelihood estimate for θ. Hence, verify that it has a χ² distribution with the corresponding degrees of freedom. To do this,
(a) Plot the density of the simulated −2 log Λn values against the density of the corresponding chi-squared distribution.
(b) Perform Pearson's goodness-of-fit test. [Note that the intervals chosen to quantize the observed data should each have expected count ≥ 5.]
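The simulation loop can be sketched as follows (the helper name `simulate.glrt` is illustrative and relies on `generate.ys`, `maximize.ell`, `ell` and `score` defined above; `theta.rest` denotes the MLE under the null model, and the number of replications is an arbitrary choice):

```r
simulate.glrt <- function(X_full, X_rest, theta.rest, n.sims = 1000) {
  stats <- numeric(n.sims)
  for (s in 1:n.sims) {
    y.sim <- generate.ys(X_rest, theta.rest)  # data generated under H0
    fit.full <- maximize.ell(ell, score, X_full, y.sim, rep(0, ncol(X_full)))
    fit.rest <- maximize.ell(ell, score, X_rest, y.sim, rep(0, ncol(X_rest)))
    stats[s] <- 2 * (fit.full$value - fit.rest$value)  # -2 log Lambda_n
  }
  stats
}

# For part (a), compare the simulated density with the chi-squared density, e.g.
# plot(density(stats)); curve(dchisq(x, df = 4), add = TRUE, col = "red")
```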
6 Epilogue
Throughout this computer practical, we applied various forms of hypothesis testing to make decisions about the parameters of a distribution (using the generalised likelihood ratio test) or about the distribution of some observed data (goodness-of-fit test).
We also saw how one can decide between two nested models in logistic regression analysis using an appropriate form of the generalised likelihood ratio. As mentioned in the epilogue of Computer Practical 2, we can easily obtain the estimated values of the parameters in logistic regression using the glm function. One can then use lrtest (after loading the lmtest package) to perform hypothesis testing for nested models. For example, if we have two explanatory variables x1, x2 and we want to test whether x1 is significant for the response y, we can test this in R as follows:
model1<-glm(y~x1+x2,family=binomial) #full model
model2<-glm(y~x2,family=binomial) #restricted model
library(lmtest)
lrtest(model1, model2)
[You may use this (only!) to confirm your result in question 4.]
2022-12-08