FIT2086 Assignment 3

2022

Introduction

There are a total of three questions worth 8 + 18 + 14 = 40 marks in this assignment.

This assignment is worth a total of 20% of your final mark, subject to hurdles and any other matters (e.g., late penalties, special consideration, etc.) as specified in the FIT2086 Unit Guide or elsewhere in the FIT2086 Moodle site (including Faculty of I.T. and Monash University policies).

Students are reminded of the Academic Integrity Awareness Training Tutorial Activity and, in particular, of Monash University’s policies on academic integrity. In submitting this assignment, you acknowledge your awareness of Monash University’s policies on academic integrity and that work is done and submitted in accordance with these policies.

Submission: No files are to be submitted via e-mail. Correct files are to be submitted to Moodle, as given above. You must submit the following three files:

1. One PDF file containing non-code answers to all the questions that require written answers. This file should also include all your plots.

2. An R script file containing R code answers. Please make sure this is clearly commented so it is obvious which R statements are answering which questions, and the questions are answered in the order they appear in the assignment.

Please read these submission instructions carefully and take care to submit the correct files in the correct places.

Question 1 (8 marks)

This question will require you to analyse a regression dataset. In particular, you will be looking at predicting the fuel efficiency of a car (in kilometers per litre) based on characteristics of the car and its engine. This is clearly an important and useful problem. The dataset fuel.ass3.2022.csv contains n = 500 observations on p = 9 predictors obtained from actual fuel efficiency tables for car models available for sale during the years 2017 through to 2020. The target is the fuel efficiency of the car measured in kilometers per litre. The higher this score, the better the fuel efficiency of the car. The data dictionary for this dataset is given in Table 1. Provide working/R code/justifications for each of these questions as required.

1. Fit a multiple linear model to the fuel efficiency data using R. Using the results of fitting the linear model, which predictors do you think are possibly associated with fuel efficiency, and why? Which three variables appear to be the strongest predictors of fuel efficiency, and why? [2 marks]

2. How would your assessment of which predictors are associated change if you used the Bonferroni procedure with α = 0.05? [1 mark]

3. Describe what effect engine displacement (Eng.Displacement) appears to have on the mean fuel efficiency of a car. Describe the effect that the Drive.SysF variable has on the mean fuel efficiency of a car. [2 marks]

4. Use the stepwise selection procedure with the BIC penalty (using direction="both") to prune out potentially unimportant variables. Write down the final regression equation obtained after pruning. [1 mark]

5. Imagine that you are looking for a new car to buy to replace your existing car. The characteristics of the new car that you are looking at are given by the thirty-third row of the dataset.

(a) Use your BIC model to predict the mean fuel efficiency for this new car. Provide a 95% confidence interval for this prediction. [1 mark]

(b) The current car that you own has a mean fuel efficiency of 11 km/l (measured over the lifetime of your ownership). Does your model suggest that the new car will have better fuel efficiency than your current car? [1 mark]

Variable name	Description	Values
Model.Year	Year of sale	2017 - 2020
Eng.Displacement	Engine displacement (litres, l)	0.9 - 8.4
No.Cylinders	Number of cylinders	3 - 16
Aspiration	Engine aspiration (oxygen intake)	N*: Naturally; OT: Other; SC: Supercharged; TC: Turbocharged; TS: Turbo+supercharged
No.Gears	Number of gears	1 - 10
Lockup.Torque.Converter	Lockup torque converter present?	N* and Y
Drive.Sys	Drive system	4*: 4-wheel drive; A: All-wheel; F: Front-wheel; P: Part-time 4-wheel; R: Rear-wheel
Max.Ethanol	Maximum % of ethanol allowed	10 - 85
Fuel.Type	Type of fuel	G*: Regular Unleaded; GM: Mid-grade Unleaded Recommended; GP: Premium Unleaded Recommended; GPR: Premium Unleaded Required
Comb.FE	Fuel efficiency (km/l)	4.974 - 26.224

Table 1: Fuel efficiency data dictionary. The * denotes the reference category for each categorical variable.
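The workflow that Question 1 walks through (fit a linear model, apply a Bonferroni threshold, prune with BIC, then form a confidence interval for a mean prediction) can be sketched in base R as follows. This is only an illustrative sketch on synthetic data: the data frame toy and its columns x1, x2, x3, drive and y are made-up stand-ins, not the assignment dataset. Note that R names the coefficient of a factor level by appending the level to the variable name (here driveF), which is how a name such as Drive.SysF arises.

```r
# Synthetic stand-in data (hypothetical; NOT the assignment dataset)
set.seed(1)
n <- 200
toy <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n),
  drive = factor(sample(c("4", "F", "R"), n, replace = TRUE))
)
toy$y <- 10 - 2 * toy$x1 + 0.5 * toy$x2 + 1.5 * (toy$drive == "F") + rnorm(n)

# Fit the full multiple linear regression
fit <- lm(y ~ ., data = toy)

# p-values for every coefficient except the intercept
pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]

# Bonferroni: compare each p-value against alpha / (number of tests)
alpha <- 0.05
keep.bonf <- pvals < alpha / length(pvals)
print(keep.bonf)

# Stepwise selection with the BIC penalty: step() uses k = log(n) for BIC
fit.bic <- step(fit, k = log(nrow(toy)), direction = "both", trace = 0)

# 95% confidence interval for the mean response at one row of the data
pred <- predict(fit.bic, newdata = toy[33, ],
                interval = "confidence", level = 0.95)
print(pred)  # columns: fit, lwr, upr
```

The interval returned with interval = "confidence" is for the mean response, not for a single new observation; the latter would use interval = "prediction" and be wider.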

Question 2 (18 marks)

In this question we will analyse the data in heart.train.ass3.2022.csv. In this dataset, each observation represents a patient at a hospital that reported showing signs of possible heart disease. The outcome is presence of heart disease (HD), or not, so this is a classification problem. The predictors are summarised in Table 2. We are interested in learning a model that can predict heart disease from these measurements.

When answering this question, you must use the rpart package that we used in Studio 9. The wrapper function for learning a tree using cross-validation that we used in Studio 9 is contained in the file wrappers.R. Don’t forget to source this file to get access to the function.

1. Using the techniques you learned in Studio 9, fit a decision tree to the data using the rpart package. Use cross-validation with 10 folds and 5,000 repetitions to select an appropriate size tree. What variables have been used in the best tree? How many leaves (terminal nodes) does the best tree have? [2 marks]

2. Plot the tree found by CV. Clearly describe in plain English what conditions are required for the tree to predict that someone has heart disease. (Hint: use the text(cv$best.tree,pretty=12) function to add appropriate labels to the tree.) [3 marks]

3. For classification problems, the rpart package only labels the leaves with the most likely class. However, if you examine the tree structure in its textual representation on the console, you can determine the probabilities of having heart disease (see Question 2.3 from Studio 9 as a guide) in each leaf (terminal node). Take a screen-capture of the plot of the tree (don’t forget to use the “zoom” button to get a larger image) or save it as an image using the “Export” button in R Studio. Then, use the information from the textual representation of the tree available at the console and annotate the tree in your favourite image editing software; next to all the leaves in the tree, add text giving the probability of contracting heart disease. Include this annotated image in your report file. [1 mark]

4. According to your tree, which predictor combination results in the lowest probability of having heart disease? [1 mark]

5. We will also fit a logistic regression model to the data. Use the glm() function to fit a logistic regression model to the heart data, and use stepwise selection with the KIC score (using direction="both") to prune the model. What variables does the final model include, and how do they compare with the variables used by the tree estimated by CV? Which predictor is the most important in the logistic regression? [3 marks]

6. Write down the regression equation for the logistic regression model you found using step-wise selection. [1 mark]

7. Please describe the effect the variable CA has on heart disease according to this logistic regression model. [1 mark]

8. The file heart.test.ass3.2022.csv contains the data on a further n′ = 92 individuals. Using the my.pred.stats() function contained in the file my.prediction.stats.R, compute the prediction statistics for both the tree and the step-wise logistic regression model on this test data. Compare and contrast the two models in terms of the various prediction statistics. Does one seem better than the other? Justify your answer. [2 marks]

9. Calculate the odds of having heart disease for the 10th patient in the test dataset. The odds should be calculated for both:

(a) the tree model found using cross-validation; and

(b) the step-wise logistic regression model.

How do the predicted odds for the two models compare? [2 marks]

10. For the logistic regression model using only those predictors selected by KIC in Question 2.5, use the bootstrap procedure (use at least 5,000 bootstrap replications) to find a confidence interval for the odds of having heart disease for the 65th and 66th patients in the test data. Use the bca option when computing this confidence interval.

Using these intervals, do you think there is any evidence to suggest that there is a real difference in the population odds of having heart disease between these two individuals? [2 marks]
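The odds and bootstrap steps in Questions 2.9 and 2.10 can be illustrated with a minimal sketch. Odds are obtained from a predicted probability p as p / (1 - p); this applies equally to a leaf probability read off a tree, and for logistic regression it equals the exponential of the linear predictor. The sketch below uses synthetic data and assumed names (toy, hd, x1, x2, new.patient, boot.odds), none of which come from the assignment files, and it uses fewer replications than the assignment asks for purely to keep the run short.

```r
library(boot)  # the boot package is distributed with standard R installations

# Synthetic stand-in data (hypothetical; NOT the heart disease dataset)
set.seed(2)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$hd <- rbinom(n, 1, plogis(-0.5 + 1.2 * toy$x1))

# A single hypothetical individual to predict for
new.patient <- data.frame(x1 = 0.3, x2 = -1)

# Logistic regression fit
fit <- glm(hd ~ x1 + x2, data = toy, family = binomial)

# Odds from the predicted probability: p / (1 - p)
p <- predict(fit, newdata = new.patient, type = "response")
odds <- p / (1 - p)
print(odds)

# Statistic for boot(): refit the model on the resampled rows and
# return the predicted odds for the same individual
boot.odds <- function(data, indices) {
  refit <- glm(hd ~ x1 + x2, data = data[indices, ], family = binomial)
  exp(predict(refit, newdata = new.patient, type = "link"))
}

# R = 2000 here for speed only; the assignment requires at least 5,000
bs <- boot(data = toy, statistic = boot.odds, R = 2000)

# BCa confidence interval for the odds
ci <- boot.ci(bs, conf = 0.95, type = "bca")
print(ci$bca[4:5])  # lower and upper endpoints
```

Repeating the bootstrap for a second individual gives a second interval; whether and how much the two intervals overlap is the kind of evidence the final part of Question 2.10 asks you to weigh.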
