Statistics 2: Computer Practical 3
1 Data description
For this computer practical you will need to load the following data:
diabetes <- read.csv("diabetes_data.csv", header = TRUE)
This dataset was originally created by the National Institute of Diabetes and Digestive and Kidney Diseases
and was downloaded from https://www.kaggle.com/datasets. It includes medical records for 768 female
patients of Pima Indian heritage, aged at least 21 years, who were tested for diabetes. In detail, the dataset consists of the following measurements:
● Pregnancies: Number of pregnancies
● Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test (mg/dL)
● BloodPressure: Diastolic blood pressure (mm Hg)
● SkinThickness: Triceps skin fold thickness (mm)
● Insulin: 2-hour serum insulin (µU/mL)
● BMI: Body mass index (kg/m²)
● DiabetesPedigreeFunction: A function indicating the likelihood of diabetes based on family history
● Age: Age of patient (≥ 21)
● Outcome: Outcome of diabetes test (1: test positive, 0: test negative)
Careful inspection of the dataset shows that for some of the variables (Glucose, BloodPressure, SkinThickness, Insulin and BMI) missing values were recorded as 0's. This could lead to misleading results in the analysis, so we need to adjust these entries before we proceed. In general, we need to be very careful with missing data. Deleting observations is usually not good practice: by doing so we may lose a lot of information and, in addition, we may introduce bias into the data, e.g. if the missingness is related to the outcome of interest (diabetes in this case).
For these data, Pregnancies, DiabetesPedigreeFunction, Age and of course Outcome have no missing values and hence we prefer not to delete any of the patients. Instead, we replace all missing values with the median observation of the corresponding variable (we could have used the mean or even more complicated methods for imputation). To do this we run the following code:
missing <- function(var) {
  # replace zero-coded missing values with the median of the non-zero entries
  med <- median(var[var > 0])
  var[var == 0] <- med
  return(var)
}
diabetes$Glucose<-missing(diabetes$Glucose)
diabetes$BloodPressure<-missing(diabetes$BloodPressure)
diabetes$Insulin<-missing(diabetes$Insulin)
diabetes$SkinThickness<-missing(diabetes$SkinThickness)
diabetes$BMI<-missing(diabetes$BMI)
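As a quick sanity check, the imputation can be exercised on a toy vector (the `missing` function is the one defined above, repeated here so the snippet stands alone):

```r
missing <- function(var) {
  # replace zero-coded missing values with the median of the non-zero entries
  med <- median(var[var > 0])
  var[var == 0] <- med
  return(var)
}

x <- c(0, 10, 20, 30, 0)   # toy vector with two "missing" zeros
x_imp <- missing(x)

stopifnot(all(x_imp > 0))  # no zeros left after imputation
stopifnot(x_imp[1] == 20)  # imputed with median(c(10, 20, 30)) = 20
```

The same check (`all(diabetes$Glucose > 0)`, etc.) can be applied to each imputed column of the dataset.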
2 Testing the mean of normal data
We now focus on the BloodPressure measurements. According to the American Heart Association (AHA), a normal diastolic pressure should be in the range 60–80 mm Hg (millimetres of mercury), so one could say that the blood pressure of a 'healthy' person should be 70 mm Hg. We want to test whether the diastolic pressure of the patients in the diabetes dataset is higher than 70 mm Hg.
If the distribution of the data were known, the test would simplify significantly.
Question 1. [2 marks] Use the intervals
(−∞, 45], (45, 55], (55, 65], (65, 75], (75, 85], (85, 95], (95, ∞)
to quantize the data. You can do this using the following code:
BP <- diabetes$BloodPressure
breaks <- c(-Inf, seq(45, 95, by = 10), Inf)
Obs <- table(cut(BP, breaks))
Hence, using Pearson's goodness-of-fit test, confirm that a Normal distribution is a valid assumption for the BloodPressure data, i.e. that they are derived from N(µ, σ²) for an appropriate choice of the parameters µ and σ².
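A minimal sketch of how such a goodness-of-fit computation can be organised (the helper name `gof.normal` is illustrative; the quantization follows the code above, and the degrees of freedom subtract one for the probability constraint and two for the estimated parameters):

```r
gof.normal <- function(x, breaks) {
  Obs <- table(cut(x, breaks))    # observed counts per interval
  mu.hat <- mean(x)               # estimate the Normal parameters from the data
  sigma.hat <- sd(x)

  # interval probabilities and expected counts under N(mu.hat, sigma.hat^2)
  p <- diff(pnorm(breaks, mean = mu.hat, sd = sigma.hat))
  Exp <- length(x) * p

  # Pearson's statistic; df = #intervals - 1 - #estimated parameters
  T.stat <- sum((Obs - Exp)^2 / Exp)
  df <- length(Obs) - 1 - 2
  list(stat = T.stat, df = df,
       p.value = pchisq(T.stat, df = df, lower.tail = FALSE))
}

# For the practical: gof.normal(diabetes$BloodPressure, c(-Inf, seq(45, 95, by = 10), Inf))
```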
You may find the handout on goodness-of-fit for continuous distributions and the relevant case study helpful.
Question 2 [2 marks] Assuming that the BloodPressure measurements are realisations of i.i.d. random variables from N(µ, 12), perform the following test using an appropriate (UMP) test statistic
H0 : µ = 70 vs H1 : µ > 70
reporting the p-value of the test. State clearly the test statistic used and justify your choice appropriately.
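As a hedged sketch of the required computation (the function name `z.test.upper` is illustrative; `sigma` should be set to the standard deviation implied by the Normal model stated in Question 2):

```r
z.test.upper <- function(x, mu0, sigma) {
  n <- length(x)
  z <- (mean(x) - mu0) / (sigma / sqrt(n))  # standardised sample mean
  p.value <- pnorm(z, lower.tail = FALSE)   # P(Z >= z) under H0
  list(z = z, p.value = p.value)
}

# For the practical: z.test.upper(diabetes$BloodPressure, mu0 = 70, sigma = ...)
# with sigma taken from the model in Question 2.
```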
3 Logistic regression (again!)
Recall the logistic model from the previous computer practicals,
Yi ~ Bernoulli(σ(θᵀxi)), i ∈ {1, . . . , n},
where x1 , . . . , xn are d-dimensional real vectors of explanatory variables, and σ is the standard logistic function
σ(z) = 1 / (1 + e^(−z)).
Question 3 [1 mark] Show that, for πi = P(Yi = 1),

log( πi / (1 − πi) ) = Σ_{j=1}^{d} θj xij .
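The identity in Question 3 (the log-odds are linear in θ) can be checked numerically; the parameter and explanatory vectors below are arbitrary illustrative values:

```r
sigma <- function(v) 1 / (1 + exp(-v))  # standard logistic function

theta <- c(0.5, -1.2, 0.3)     # arbitrary parameter vector
x <- c(1, 2, -0.5)             # arbitrary explanatory vector
pi.i <- sigma(sum(theta * x))  # P(Y = 1) under the logistic model

# the log-odds should recover the linear predictor theta' x
stopifnot(abs(log(pi.i / (1 - pi.i)) - sum(theta * x)) < 1e-10)
```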
4 Hypothesis testing in logistic regression
In the second computer practical it was mentioned that if an explanatory variable has no effect on the probability of the response variable, then we expect the corresponding coefficient to be equal to 0. We will examine this in a more formal way through hypothesis testing. Consider the logistic model described above and the hypotheses
H0 : θi1 = θi2 = · · · = θir = 0
H1 : at least one of θi1, θi2, . . . , θir is not equal to 0.
Consider the generalised likelihood ratio statistic for this nested test,

Λn = sup_{θ ∈ H0} L(θ; y) / sup_θ L(θ; y) = L(θ̂0; y) / L(θ̂MLE; y),

where θ̂0 is the maximum likelihood estimator under the null hypothesis and θ̂MLE is the maximum likelihood estimator for the full model (i.e. with no restrictions on the θi). Then,

−2 log Λn

has a χ²(r) distribution under the null hypothesis (notice that r is the number of restrictions under the null hypothesis).
Returning to the diabetes data, we plot all explanatory variables in the data against the Outcome (patient diabetic or not) to get a first impression of possible effects.
Xnames <- colnames(diabetes[, -9])  # names of the explanatory variables
par(mfrow = c(1, 3))
for (i in 1:8) {
  boxplot(diabetes[, i] ~ diabetes$Outcome, main = Xnames[i],
          names = c("Negative", "Positive"), xlab = "Diabetes test")
}
[Figure: boxplots of Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction and Age against the diabetes test outcome (Negative/Positive).]
Question 4 [2 marks] Use the generalised likelihood ratio test to decide whether the variables BloodPressure, SkinThickness, Insulin and Age are statistically significant for the development of diabetes. To do this, start by forming the matrices that correspond to the restricted model under the null hypothesis (X_rest) and to the full model (X_full).
X_full <- cbind(1, as.matrix(diabetes[, 1:8]))
X_rest <- cbind(1, as.matrix(diabetes[, c(1, 2, 6, 7)]))
Y <- diabetes[, 9]
Notice that, as in computer practical 2, we add a constant term to the model, and hence the vector of parameters becomes θ = (θ0, θ1, . . . , θ8). You will need the following functions from computer practical 2 (notice the modification in the return value of maximize.ell):
sigma <- function(v) {
  1 / (1 + exp(-v))
}

ell <- function(theta, X, y) {
  # log-likelihood of the logistic model
  p <- as.vector(sigma(X %*% theta))
  sum(y * log(p) + (1 - y) * log(1 - p))
}

score <- function(theta, X, y) {
  # gradient of the log-likelihood
  p <- as.vector(sigma(X %*% theta))
  as.vector(t(X) %*% (y - p))
}

maximize.ell <- function(ell, score, X, y, theta0) {
  optim.out <- optim(theta0, fn = ell, gr = score, X = X, y = y, method = "BFGS",
                     control = list(fnscale = -1, maxit = 1000, reltol = 1e-16))
  return(list(theta = optim.out$par, value = optim.out$value))
}
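With the functions above in place, the test in Question 4 can be sketched as follows (the helper name `glrt` is illustrative; X_full, X_rest and Y are as constructed earlier, and the null hypothesis imposes r = 4 restrictions):

```r
glrt <- function(X_full, X_rest, Y) {
  # MLE under the full model and under the restricted (null) model
  fit.full <- maximize.ell(ell, score, X_full, Y, rep(0, ncol(X_full)))
  fit.rest <- maximize.ell(ell, score, X_rest, Y, rep(0, ncol(X_rest)))

  # -2 log Lambda_n = 2 * (maximised log-lik, full - maximised log-lik, restricted)
  test.stat <- 2 * (fit.full$value - fit.rest$value)

  # number of restrictions = difference in the number of parameters (here 4)
  r <- ncol(X_full) - ncol(X_rest)
  list(stat = test.stat, df = r,
       p.value = pchisq(test.stat, df = r, lower.tail = FALSE))
}

# For the practical: glrt(X_full, X_rest, Y), then compare p.value with the
# chosen significance level.
```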
5 Verifying the distribution of −2 log Λn
Recall the following function from computer practical 2, which simulates values of the response variable for a given matrix of explanatory variables, X , and a given vector of parameters, theta.
generate.ys <- function(X, theta) {
  n <- dim(X)[1]
  rbinom(n, size = 1, prob = sigma(X %*% theta))
}
Question 5 [3 marks] Consider the test in Question 4. By repeated experiments, simulate the test statistic −2 log Λn under the null hypothesis, using the appropriate maximum likelihood estimate for θ. Hence, verify that it has a χ² distribution with the corresponding degrees of freedom. To do this,
(a) Plot the density of the simulated −2 log Λn values against the density of the corresponding chi-squared distribution.
(b) Perform Pearson's goodness-of-fit test. [Note that the intervals chosen to quantize the observed data should each have expected count ≥ 5.]
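The simulation loop can be sketched as follows (the helper name `simulate.glrt` is illustrative and relies on `generate.ys`, `maximize.ell`, `ell` and `score` defined above; `theta.rest` denotes the MLE under the null model, and the number of replications is an arbitrary choice):

```r
simulate.glrt <- function(X_full, X_rest, theta.rest, n.sims = 1000) {
  stats <- numeric(n.sims)
  for (s in 1:n.sims) {
    y.sim <- generate.ys(X_rest, theta.rest)  # data generated under H0
    fit.full <- maximize.ell(ell, score, X_full, y.sim, rep(0, ncol(X_full)))
    fit.rest <- maximize.ell(ell, score, X_rest, y.sim, rep(0, ncol(X_rest)))
    stats[s] <- 2 * (fit.full$value - fit.rest$value)  # -2 log Lambda_n
  }
  stats
}

# For part (a), compare the simulated density with the chi-squared density, e.g.
# plot(density(stats)); curve(dchisq(x, df = 4), add = TRUE, col = "red")
```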
6 Epilogue
Throughout this computer practical, we applied various forms of hypothesis testing to make decisions about the parameters of a distribution (using the generalised likelihood ratio test) or about the distribution of some observed data (goodness-of-fit test).
We also saw how one can decide between two nested models in logistic regression analysis using an appropriate form of the generalised likelihood ratio. As mentioned in the epilogue of Computer Practical 2, we can easily obtain the estimated values of the parameters in logistic regression using the glm function. One can then use lrtest (after loading the lmtest package) to perform hypothesis testing for nested models. For example, if we have two explanatory variables x1, x2 and we want to test whether x1 is significant for the response y, we can test this in R as follows:
model1<-glm(y~x1+x2,family=binomial) #full model
model2<-glm(y~x2,family=binomial) #restricted model
library(lmtest)
lrtest(model1, model2)
[You may use this (only!) to confirm your result in question 4.]
2022-12-08