Homework 5

Statistics 151A (Linear Models)

Instructions: Please submit with a cover sheet that has your name and student ID. For the data analysis question, please format your report as a document with a brief introduction, instead of a list of numbered answers. Include R code and output only in small portions that directly illustrate points you make in your writing. Please include your full code in the appendix. Make sure to comment your code and label visuals appropriately.

1. Show that the ridge coefficient has two equivalent forms:

$$(X^T X + \lambda I_p)^{-1} X^T Y = X^T (X X^T + \lambda I_n)^{-1} Y.$$

On the computational side, explain when the left-hand side is more useful and when the right-hand side is more useful. (5 points)
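As an optional sanity check (not a substitute for the proof), the identity can be tested numerically on simulated data. The sketch below assumes arbitrary sizes n = 8, p = 5 and λ = 0.7; the names beta_primal and beta_dual are illustrative and do not come from the assignment.

```r
# Numerical check of the two ridge forms on simulated data (not a proof).
set.seed(1)
n <- 8; p <- 5; lambda <- 0.7            # arbitrary illustrative values
X <- matrix(rnorm(n * p), n, p)
Y <- rnorm(n)

# Left-hand side: (X'X + lambda I_p)^{-1} X'Y, which solves a p x p system.
beta_primal <- solve(crossprod(X) + lambda * diag(p), crossprod(X, Y))

# Right-hand side: X'(XX' + lambda I_n)^{-1} Y, which solves an n x n system.
beta_dual <- crossprod(X, solve(tcrossprod(X) + lambda * diag(n), Y))

# The two vectors should agree up to floating-point error.
max_abs_diff <- max(abs(beta_primal - beta_dual))
print(max_abs_diff)
```

Changing n and p in this sketch shows how the size of the linear system being solved differs between the two forms.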

2. (Please answer this question without using R) Consider the frogs dataset that we used in lab. To describe the data briefly, 200 sites of the Snowy Mountain area of New South Wales, Australia were surveyed for the species of the Southern Corroboree frog. The response variable, named pres.abs, takes the value 1 if frogs of this species were found at the site and 0 otherwise. The explanatory variables include altitude, distance, NoOfPools, NoOfSites, avrain, meanmin and meanmax. The dataset contains 200 observations and the response variable equals one for 75 observations and equals 0 for the rest. Suppose we fit a logistic regression model to the data via

frogs.glm <- glm(formula = pres.abs ~ log(distance) +
                   log(NoOfPools) + meanmin,
                 family = binomial, data = frogs)
summary(frogs.glm)

This gave us the following output:

Call:
glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + meanmin,
    family = binomial(link = "logit"), data = frogs)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.9379  -0.7512  -0.4699   0.8643   2.3081

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)      0.7936      XXXXX   0.352 0.724577
log(distance)     XXXXX     0.2116  -4.247 2.17e-05 ***
log(NoOfPools)   0.4961     0.2067   2.400 0.016381 *
meanmin          1.0717     0.3187   3.362 0.000773 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 264.63  on XXX  degrees of freedom
Residual deviance:  XXXXX  on XXX  degrees of freedom
AIC: 210.9

Number of Fisher Scoring iterations: 5
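For orientation, the quantities in a summary of a binary logistic fit are linked by a few standard relationships. The sketch below checks them on simulated data, not on the frogs data; the variable names x1, x2, y and the object names sim, fit, tab are hypothetical.

```r
# Simulated binary-response data (hypothetical variables, not the frogs data).
set.seed(151)
n <- 120
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * sim$x1 - 0.8 * sim$x2))

fit <- glm(y ~ x1 + x2, family = binomial, data = sim)
tab <- coef(summary(fit))   # columns: Estimate, Std. Error, z value, Pr(>|z|)
p <- length(coef(fit))      # number of estimated coefficients

# z value = Estimate / Std. Error; the p-value is two-sided standard normal.
z_manual <- tab[, "Estimate"] / tab[, "Std. Error"]
p_manual <- 2 * pnorm(-abs(z_manual))

check_z <- isTRUE(all.equal(unname(z_manual), unname(tab[, "z value"])))
check_p <- isTRUE(all.equal(unname(p_manual), unname(tab[, "Pr(>|z|)"])))
# Degrees of freedom: n - 1 for the null model, n - p for the fitted model.
check_df <- fit$df.null == n - 1 && fit$df.residual == n - p
# For ungrouped 0/1 responses, AIC = residual deviance + 2 * p.
check_aic <- isTRUE(all.equal(AIC(fit), deviance(fit) + 2 * p))

print(c(z = check_z, p = check_p, df = check_df, aic = check_aic))
```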

a) Fill in the five missing values in the above output, giving appropriate reasons and calculations. (3 points)

b) Suppose a new site is found where the values of the explanatory variables are

distance = 265, NoOfPools = 26, meanmin = 3.5

According to the logistic regression model, what is the predicted probability that Southern Corroboree frogs will be found at this site? (3 points)

c) Suppose we add the variable altitude to the model. Would the residual deviance increase or decrease? Would the null deviance increase or decrease? Explain your reasoning. (2 points)
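The mechanics behind parts b) and c) can be explored on simulated data. The sketch below is only illustrative: the data frame sim, its columns d, pools, temp, alt, and the coefficients used to generate the response are all made up and are unrelated to the frogs data.

```r
# Simulated data with hypothetical variables (not the frogs data).
set.seed(2)
n <- 150
sim <- data.frame(d = runif(n, 50, 2000),
                  pools = sample(1:100, n, replace = TRUE),
                  temp = rnorm(n, 3, 1),
                  alt = rnorm(n, 1.5, 0.1))
eta <- 1 - 0.6 * log(sim$d) + 0.4 * log(sim$pools) + 0.8 * sim$temp
sim$y <- rbinom(n, 1, plogis(eta))

small <- glm(y ~ log(d) + log(pools) + temp, family = binomial, data = sim)

# Predicted probability at a new site: inverse logit of the linear predictor.
new_site <- data.frame(d = 300, pools = 20, temp = 3)
x_new <- c(1, log(new_site$d), log(new_site$pools), new_site$temp)
prob_manual <- plogis(sum(coef(small) * x_new))
prob_predict <- unname(predict(small, newdata = new_site, type = "response"))

# Add one more regressor and compare the two deviances of the nested fits.
big <- update(small, . ~ . + alt)
print(c(manual = prob_manual, predict = prob_predict))
print(c(resid_small = deviance(small), resid_big = deviance(big)))
print(c(null_small = small$null.deviance, null_big = big$null.deviance))
```

Here plogis(z) computes 1 / (1 + exp(-z)), so the manual calculation mirrors what can be done by hand from a coefficient table.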

3. Selection of baseline category in multinomial logistic regression: Suppose that the response variable $Y$ takes one of $m$ categories. Let $\pi_{ij}$ denote the probability that the $i$th observation falls in the $j$th category of the response variable, i.e. $\pi_{ij} \equiv P(Y_i = j)$ for $j = 1, \ldots, m$, and let $X_1, \ldots, X_k$ denote the $k$ regressors on which the $\pi_{ij}$ depend. We have learned in class that the multinomial logistic regression can be written as:

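$$\ln \frac{\pi_{ij}}{\pi_{im}} = \gamma_{0j} + \gamma_{1j} X_{i1} + \cdots + \gamma_{kj} X_{ik} \quad \text{for } j = 1, \ldots, m - 1$$

with resulting probabilities:

$$\pi_{ij} = \frac{\exp(X_i^T \gamma_j)}{1 + \sum_{l<m} \exp(X_i^T \gamma_l)}, \quad j < m,$$

$$\pi_{im} = \frac{1}{1 + \sum_{l<m} \exp(X_i^T \gamma_l)},$$

where $X_i = (1, X_{i1}, \ldots, X_{ik})^T$ and $\gamma_j = (\gamma_{0j}, \gamma_{1j}, \ldots, \gamma_{kj})^T$.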

Show that if we choose a different baseline category j′ instead of m, we obtain the same set of probabilities. (9 points)
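A numerical check can help build intuition before writing the proof. The sketch below assumes one natural reparametrization, in which the coefficient vector of the new baseline is subtracted from every category's coefficient vector; the names gamma, gamma_new, probs and j_prime are illustrative, and the check is not a proof.

```r
# Compare category probabilities under two baseline choices (random inputs).
set.seed(3)
m <- 4; k <- 2
x <- c(1, rnorm(k))   # (1, X_i1, ..., X_ik) for one observation

# Coefficients with category m as baseline: column m is all zeros.
gamma <- cbind(matrix(rnorm((k + 1) * (m - 1)), k + 1, m - 1), 0)

# Probabilities exp(x'gamma_j) / sum_l exp(x'gamma_l); the baseline column
# contributes exp(0) = 1, matching the "1 +" term in the denominator.
probs <- function(G, x) {
  e <- exp(drop(crossprod(G, x)))
  e / sum(e)
}

# Assumed reparametrization: subtract gamma_{j'} from every column, so that
# column j' becomes the all-zero baseline.
j_prime <- 2
gamma_new <- gamma - gamma[, j_prime]

p_old <- probs(gamma, x)
p_new <- probs(gamma_new, x)
max_abs_diff <- max(abs(p_old - p_new))
print(rbind(p_old, p_new))
```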

4. Data Analysis: Download train.csv from https://www.kaggle.com/c/titanic/data (this is the Kaggle competition Titanic: Machine Learning from Disaster). Randomly assign two-thirds of the rows of train.csv to a training dataset. The remaining third will be your test data.
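One possible way to make the random split is sketched below. The helper split_train_test, its arguments, and the seed are hypothetical choices; the stand-in data frame toy only illustrates the call and would be replaced by the result of read.csv("train.csv") once the file is downloaded.

```r
# Hypothetical helper: randomly assign a fraction of rows to the training set.
# Note that set.seed() fixes the global random number state for reproducibility.
split_train_test <- function(df, frac = 2 / 3, seed = 151) {
  set.seed(seed)
  train_idx <- sample(nrow(df), size = round(frac * nrow(df)))
  list(train = df[train_idx, , drop = FALSE],
       test = df[-train_idx, , drop = FALSE])
}

# Stand-in data with 90 rows; in the assignment, use read.csv("train.csv").
toy <- data.frame(id = 1:90, y = rep(0:1, 45))
parts <- split_train_test(toy)
print(c(train = nrow(parts$train), test = nrow(parts$test)))
```

Fixing the seed makes the split reproducible, which helps when the full code is included in the appendix.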

a) Using the training data, build a reasonable model based on logistic regression for the survival status based on the explanatory variables (you can start with a basic model and subsequently either expand it using interactions etc. and/or perform model selection to remove some variables). Describe your model. (4 points)

b) Use your model to predict the survival status $\hat{y}_i$ for the $n_t$ subjects in the test data. Here, the $\hat{y}_i$ are estimated probabilities rounded to 0 or 1. Report the accuracy of your predictions in terms of the misclassification rate, which is defined as:

$$\frac{1}{n_t} \sum_{i=1}^{n_t} I(y_i \neq \hat{y}_i).$$
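A minimal sketch of this calculation follows, reading the misclassification rate as the fraction of test observations whose rounded prediction differs from the observed outcome. The function misclass_rate, the 0.5 cutoff, and the toy vectors are assumptions for illustration; in practice the probabilities would come from predict(..., type = "response") on the test data.

```r
# Hypothetical helper: fraction of observations whose rounded prediction
# differs from the observed 0/1 outcome.
misclass_rate <- function(y, prob, cutoff = 0.5) {
  y_hat <- as.integer(prob > cutoff)   # probabilities rounded to 0 or 1
  mean(y != y_hat)
}

# Toy inputs (made up): observed outcomes and estimated probabilities.
y_test <- c(1, 0, 0, 1, 1)
p_test <- c(0.9, 0.2, 0.6, 0.4, 0.7)

rate <- misclass_rate(y_test, p_test)
print(rate)   # 0.4: the third and fourth observations are misclassified
```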