
BU510.650 Data Analytics

Assignment #4

Please submit two documents: your answers to each part of every question in .pdf or .doc format, and your R script in .R format. In your document with answers, please do *not* respond with R output only. While it is okay to include R output in that document, please make sure you spell out the response to the question asked. Please submit your assignment through Blackboard and name your files using the convention LastName_FirstName_AssignmentNumber. For example, Yazdi_Mohammad_4.pdf and Yazdi_Mohammad_4.R.

To answer Question 1, please watch the "Decision Tree in R" recording of class.

To answer Question 2, please watch the "KNN in R" recording of class.

1. In this question, you will estimate a decision tree for the AutoLoss data. The data file for this question, AutoLoss-DT.csv, is slightly different from the data file in Assignment 2. In particular, instead of the actual loss amount for each vehicle, it has a column called HighLoss, which indicates whether the loss is high (“Yes”) or low (“No”) for each vehicle. Our goal is to create a decision tree that predicts whether the loss for a vehicle will be high or low.

To begin your work on this question, run the following two lines of code: The first one replaces ?s with NA while reading the data from the .csv file, and the second one removes all the observations with any NA.

AutoLoss <- read.csv("AutoLoss-DT.csv", na.strings = "?", stringsAsFactors = TRUE)
AutoLoss <- na.omit(AutoLoss)
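R Hint: the following is a minimal sketch of what these two lines do, using a small made-up CSV string in place of AutoLoss-DT.csv. The column names and values in csv_text are invented for illustration and are not taken from the data file.

```r
# Toy CSV text standing in for a data file (values are invented).
csv_text <- "Price,FuelType,HighLoss
13495,gas,Yes
?,gas,No
16500,diesel,?
18920,gas,No"

# na.strings = "?" turns every "?" cell into NA while reading,
# so Price is still read as a numeric column.
toy <- read.csv(text = csv_text, na.strings = "?", stringsAsFactors = TRUE)
n_before <- nrow(toy)

# na.omit() drops every row that contains at least one NA.
toy <- na.omit(toy)
n_after <- nrow(toy)

print(c(before = n_before, after = n_after))
```

Comparing the row counts before and after na.omit() is a quick way to see how many observations were removed.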

**Please include set.seed(5) once at the beginning of your code, so we all get the same results.**

a) Fit a decision tree to the entire data, with HighLoss as the response and all other variables as predictors. Plot the tree (including the names of predictors in the plot) and answer the following questions: Which predictors are used at the nodes of the tree? How many terminal nodes (leaves) does the tree have?

b) Determine the best tree size, using cross-validation and pruning. (See how we accomplished this in TASK 7 of Carseats example.) Plot the tree you obtained (including the names of predictors in the plot).

c) Use the best tree to answer the following question (you do not need to use R for this): Suppose my car fits the description shown below. Will this car incur a high loss or not?

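| FuelType | Aspiration | NumDoors | BodyStyle | DriveWheels | Length | Width | Height |
|----------|------------|----------|-----------|-------------|--------|-------|--------|
| gas      | std        | two      | wagon     | 4wd         | 160    | 70    | 60     |

| Weight | EngineSize | Horsepower | PeakRPM | Citympg | Price |
|--------|------------|------------|---------|---------|-------|
| 3423   | 122        | 241        | 5000    | 26      | 23000 |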
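R Hint: the workflow in parts (a) and (b) of Question 1 can be sketched roughly as follows. This is an illustration only, under two assumptions the assignment does not state: that the class example relies on the tree package, and that the built-in iris data (with Species as the response) is an acceptable stand-in for the AutoLoss data. The variable names are hypothetical.

```r
library(tree)  # assumed package: provides tree(), cv.tree(), prune.misclass()

set.seed(5)

# Fit a classification tree; iris is only a stand-in for the real data.
fit <- tree(Species ~ ., data = iris)

# Predictors that appear at internal nodes, and the number of leaves.
used_vars <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")
n_leaves  <- sum(fit$frame$var == "<leaf>")

# Cross-validate over the pruning sequence, scoring each size by
# the number of misclassified observations.
cv_out <- cv.tree(fit, FUN = prune.misclass)
best_size <- cv_out$size[which.min(cv_out$dev)]

# Prune the full tree back to the selected size.
pruned <- prune.misclass(fit, best = best_size)
pruned_leaves <- sum(pruned$frame$var == "<leaf>")

print(used_vars)
print(c(full = n_leaves, best = best_size, pruned = pruned_leaves))

# To draw the pruned tree with predictor names at the splits:
# plot(pruned); text(pruned, pretty = 0)
```

When several sizes tie for the lowest cross-validation error, which.min() returns the first one in cv_out$size, so it is worth looking at the whole cv_out$size and cv_out$dev table before settling on a size.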

2. In this question, you will use the K-Nearest Neighbors (KNN) algorithm to predict whether a passenger will survive or not.

To begin your work on this question, first read the data from the file "TitanicforKNN.csv" to a data frame named Titanic.

**Note: Please review the data before proceeding. You will notice that I already converted all the categorical variables (Gender, Fare, Class) into 0-1 columns. I did so because KNN does not work well with non-numeric variables.**

Next, split the data into training data and test data, using random selection. Include half of the records in the training data and the rest in the test data. You learned how to do this with the sample function; see Task 3 in Carseats-DecisionTree.R for a related example. (**Remember to include set.seed(1) before the random selection in your code, so we all end up making the same split.**)
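R Hint: a random half-and-half split can be sketched as below. The built-in mtcars data is used purely as a stand-in for the Titanic data frame, and the names dat, train_idx, train, and test are hypothetical.

```r
set.seed(1)

dat <- mtcars        # stand-in data frame, not the Titanic data
n <- nrow(dat)

# Draw half of the row numbers at random, without replacement.
train_idx <- sample(1:n, floor(n / 2))

# Selected rows form the training data; all remaining rows form the test data.
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

print(c(train = nrow(train), test = nrow(test)))
```

Negative indexing with -train_idx keeps every row that was not sampled, so the two sets never share a record.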

(a) Run the KNN algorithm to predict the response variable Survived for each passenger in the test data. Do this for K = 2, 4, and 6. According to these predictions for K = 2, 4, and 6, what is the proportion of passengers in the test data that will survive?

R Hints: To run the function knn(), recall that you need four inputs:

(i) a matrix that contains the values of predictors in the training data,

(ii) a matrix that contains the values of predictors in the test data,

(iii) a vector containing the values of the response (Survived) in the training data,

(iv) a value for K.

To obtain (i), remove the Survived column from the training data. To obtain (ii), remove the Survived column from the test data. To obtain (iii), create a vector that stores the values of the Survived column in the training data. See Smarket-KNN.R for a related example.

(b)  For each K, compute the accuracy of predictions for the test data. Which K works best in this case?
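R Hint: the four inputs to knn() and the accuracy calculation can be sketched as follows. This assumes knn() comes from the class package. The built-in mtcars data stands in for the Titanic data, with its 0-1 column am playing the role of Survived. All object names are hypothetical.

```r
library(class)  # assumed source of knn()

set.seed(1)
dat <- mtcars                          # stand-in data; 'am' is a 0-1 response
train_idx <- sample(1:nrow(dat), floor(nrow(dat) / 2))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# (i) and (ii): predictor matrices with the response column removed.
train_x <- as.matrix(train[, names(train) != "am"])
test_x  <- as.matrix(test[, names(test) != "am"])
# (iii): response values for the training rows.
train_y <- train$am

results <- data.frame(K = c(2, 4, 6), prop_positive = NA, accuracy = NA)
for (i in seq_along(results$K)) {
  # (iv): the value of K for this run.
  pred <- knn(train_x, test_x, train_y, k = results$K[i])
  # Share of test rows predicted as class 1.
  results$prop_positive[i] <- mean(pred == "1")
  # Share of test rows whose prediction matches the actual response.
  results$accuracy[i] <- mean(as.character(pred) == as.character(test$am))
}

print(results)
best_k <- results$K[which.max(results$accuracy)]
```

knn() breaks ties among neighbors at random, which is one more reason to set the seed before running it.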