BU510.650 – Data Analytics Assignment # 4
Please submit two documents: Your answers to each part of every question in .pdf or .doc format, and your R script, in .R format. In your document with answers, please do *not* respond with R output only. While it is okay to include R output in that document, please make sure you spell out the response to the question asked. Please submit your assignment through Blackboard and name your files using the convention LastName_FirstName_AssignmentNumber. For example, Yazdi_Mohammad_4.pdf and Yazdi_Mohammad_4.R.
To answer Question 1, please watch the "Decision Tree in R" class recording.
To answer Question 2, please watch the "KNN in R" class recording.
1. In this question, you will estimate a decision tree for the AutoLoss data. The data file for this question, AutoLoss-DT.csv, is slightly different from the data file in Assignment 2. In particular, instead of the actual loss amount for each vehicle, it has a column called HighLoss, which indicates whether the loss is high (“Yes”) or low (“No”) for each vehicle. Our goal is to create a decision tree that predicts whether the loss for a vehicle will be high or low.
To begin your work on this question, run the following two lines of code: The first one replaces ?s with NA while reading the data from the .csv file, and the second one removes all the observations with any NA.
AutoLoss <- read.csv("AutoLoss-DT.csv", na.strings = "?", stringsAsFactors = TRUE)
AutoLoss <- na.omit(AutoLoss)
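To see what these two lines do, here is a minimal sketch on a tiny made-up table. The inline CSV text and the names `csv_text`, `toy`, and `toy_clean` are illustrative assumptions; they are not part of AutoLoss-DT.csv.

```r
# Made-up CSV content standing in for a file (NOT the assignment data).
csv_text <- "FuelType,Horsepower,HighLoss
gas,111,Yes
diesel,?,No
gas,154,No
?,102,Yes"

# na.strings = "?" turns every "?" entry into NA while reading, so
# Horsepower is read as a numeric column instead of text.
toy <- read.csv(text = csv_text, na.strings = "?", stringsAsFactors = TRUE)
n_before <- nrow(toy)        # 4 rows, two of which contain an NA

# na.omit() drops every row that has at least one NA.
toy_clean <- na.omit(toy)
n_after <- nrow(toy_clean)   # 2 complete rows remain

print(toy)
print(toy_clean)
```

Comparing `nrow()` before and after `na.omit()` on the real data is a quick way to check how many observations were dropped.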
**Please include set.seed(5) once at the beginning of your code, so we all get the same results.**
a) Fit a decision tree to the entire data, with HighLoss as the response and all other variables as predictors. Plot the tree (including the names of predictors in the plot) and answer the following questions: Which predictors are used at the nodes of the tree? How many terminal nodes (leaves) does the tree have?
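As a rough illustration of the workflow in part (a), the sketch below fits and plots a classification tree on made-up data. It assumes the `rpart` package, which ships with standard R installations; the class recording may use a different package and different function names, so treat this as a sketch and not as the course's method. The data frame `toy` and its values are invented and only borrow a few column names from the assignment.

```r
library(rpart)  # assumption: rpart is used here; the course script may differ

set.seed(5)

# Made-up data standing in for AutoLoss (NOT the assignment data).
n <- 300
toy <- data.frame(
  Price      = runif(n, 5000, 40000),
  Horsepower = runif(n, 50, 250),
  FuelType   = factor(sample(c("gas", "diesel"), n, replace = TRUE))
)
# HighLoss mostly follows Price, with about 10% of the labels flipped as noise.
high <- toy$Price > 15000
flip <- runif(n) < 0.10
toy$HighLoss <- factor(ifelse(xor(high, flip), "Yes", "No"))

# Classification tree: HighLoss is the response, all other columns are predictors.
fit <- rpart(HighLoss ~ ., data = toy, method = "class")

# Plot the tree and label each split with the predictor it uses.
plot(fit, margin = 0.1)
text(fit, use.n = TRUE)

# Predictors used at the internal nodes, and the number of terminal nodes.
node_vars <- as.character(fit$frame$var)
used_vars <- unique(node_vars[node_vars != "<leaf>"])
n_leaves  <- sum(node_vars == "<leaf>")
print(used_vars)
print(n_leaves)
```

Reading the predictor names and leaf count from the fitted object, as well as from the plot, makes it easier to cross-check a written answer.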
b) Determine the best tree size, using cross-validation and pruning. (See how we accomplished this in TASK 7 of the Carseats example.) Plot the tree you obtained (including the names of predictors in the plot).
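The idea of choosing a tree size by cross-validation and then pruning can be sketched as follows. This again assumes the `rpart` package, where cross-validation runs during fitting and the results are stored in the `cptable` component; the Carseats example from class may take a different route, so this is only one possible way to do it. The data and the helper `count_leaves` are made up for illustration.

```r
library(rpart)  # assumption: rpart is used here; the course script may differ

set.seed(5)

# Made-up data standing in for AutoLoss (NOT the assignment data).
n <- 300
toy <- data.frame(
  Price      = runif(n, 5000, 40000),
  Horsepower = runif(n, 50, 250),
  FuelType   = factor(sample(c("gas", "diesel"), n, replace = TRUE))
)
high <- toy$Price > 15000
flip <- runif(n) < 0.10   # about 10% label noise
toy$HighLoss <- factor(ifelse(xor(high, flip), "Yes", "No"))

# Grow a fairly large tree (small cp) with 10-fold cross-validation (xval = 10).
fit <- rpart(HighLoss ~ ., data = toy, method = "class",
             control = rpart.control(cp = 0.001, xval = 10))

# Each row of cptable is a candidate tree size; "xerror" is its CV error.
cp_table <- fit$cptable
print(cp_table)
best_row <- which.min(cp_table[, "xerror"])
best_cp  <- cp_table[best_row, "CP"]

# Prune back to the size with the lowest cross-validated error.
pruned <- prune(fit, cp = best_cp)

# Hypothetical helper: number of terminal nodes in an rpart tree.
count_leaves <- function(tr) sum(tr$frame$var == "<leaf>")
print(c(full = count_leaves(fit), pruned = count_leaves(pruned)))

plot(pruned, margin = 0.1)
text(pruned, use.n = TRUE)
```

Because the cross-validation folds are assigned at random, the selected size can change with the seed, which is why the seed is set once at the top.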
c) Use the best tree to answer the following question (you do not need to use R for this): Suppose my car fits the description shown below. Will this car incur a high loss or not?
| FuelType | Aspiration | NumDoors | BodyStyle | DriveWheels | Length | Width | Height |
|---|---|---|---|---|---|---|---|
| gas | std | two | wagon | 4wd | 160 | 70 | 60 |

| Weight | EngineSize | Horsepower | PeakRPM | Citympg | Price |
|---|---|---|---|---|---|
| 3423 | 122 | 241 | 5000 | 26 | 23000 |
2. In this question, you will use the K-Nearest Neighbors (KNN) algorithm to predict whether a passenger will survive or not.
To begin your work on this question, first read the data from the file "TitanicforKNN.csv" to a data frame named Titanic.
**Note: Please review the data before proceeding. You will notice that I already converted all the categorical variables (Gender, Fare, Class) into 0-1 columns. I did so because KNN does not work well with non-numeric variables.**
Next, randomly split the data into training data and test data, with half of the records in the training data and the rest in the test data. You learned how to do this with the sample function; see Task 3 in Carseats-DecisionTree.R for a related example. (**Remember to include set.seed(1) before the random selection in your code, so we all end up making the same split.**)
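A random half-and-half split with `sample` might look like the sketch below. The data frame `toy` is made up and the names `train_idx`, `train`, and `test` are illustrative choices, not names required by the assignment.

```r
# Made-up data standing in for the Titanic data frame (NOT the assignment data).
toy <- data.frame(Survived = rep(c(0, 1), 5), Age = 21:30)

# Fix the seed right before the random selection so the split is reproducible.
set.seed(1)
n <- nrow(toy)

# Draw half of the row numbers at random, without replacement.
train_idx <- sample(n, n / 2)

# Selected rows form the training data; all remaining rows form the test data.
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

print(dim(train))
print(dim(test))
```

Negative indexing with `-train_idx` guarantees that no record ends up in both sets.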
(a) Run the KNN algorithm to predict the response variable Survived for each passenger in the test data. Do this for K = 2, 4, and 6. According to these predictions for K = 2, 4, and 6, what is the proportion of passengers in the test data that will survive?
R Hints: To run the function knn(), recall that you need four inputs:
(i) a matrix that contains the values of predictors in the training data,
(ii) a matrix that contains the values of predictors in the test data,
(iii) a vector containing the values of the response (Survived) in the training data,
(iv) a value for K.
To obtain (i), remove the Survived column from the training data. To obtain (ii), remove the Survived column from the test data. To obtain (iii), create a vector that stores the values of the Survived column in the training data. See Smarket-KNN.R for a related example.
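The four inputs described above can be assembled as in the following sketch, which uses `knn()` from the `class` package on made-up data. The data frame `toy`, its columns `Age` and `Female`, and the names `train_x`, `test_x`, `train_y`, and `prop_survive` are assumptions for illustration only.

```r
library(class)  # provides knn()

set.seed(1)

# Made-up data standing in for the Titanic data frame (NOT the assignment data).
n <- 200
toy <- data.frame(Age = runif(n, 1, 70), Female = rbinom(n, 1, 0.5))
toy$Survived <- rbinom(n, 1, ifelse(toy$Female == 1, 0.75, 0.2))

# Half of the rows for training, the rest for testing.
train_idx <- sample(n, n / 2)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# (i) and (ii): predictors only, with the Survived column removed.
train_x <- train[, names(train) != "Survived"]
test_x  <- test[, names(test) != "Survived"]
# (iii): the response values in the training data.
train_y <- train$Survived

# (iv): run knn() once per K and record the predicted share of survivors.
ks <- c(2, 4, 6)
prop_survive <- sapply(ks, function(k) {
  pred <- knn(train_x, test_x, train_y, k = k)
  mean(pred == "1")
})
names(prop_survive) <- paste0("K=", ks)
print(prop_survive)
```

Note that `knn()` breaks ties at random, so the predictions can depend on the seed, especially for even values of K.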
(b) For each K, compute the accuracy of predictions for the test data. Which K works best in this case?
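Accuracy here means the fraction of test passengers whose predicted class matches the actual value of Survived. A minimal sketch on made-up data is shown below; the data and the names `accuracy` and `best_k` are illustrative assumptions, and the numbers it prints say nothing about the real Titanic file.

```r
library(class)  # provides knn()

set.seed(1)

# Made-up data standing in for the Titanic data frame (NOT the assignment data).
n <- 200
toy <- data.frame(Age = runif(n, 1, 70), Female = rbinom(n, 1, 0.5))
toy$Survived <- rbinom(n, 1, ifelse(toy$Female == 1, 0.75, 0.2))

train_idx <- sample(n, n / 2)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

train_x <- train[, names(train) != "Survived"]
test_x  <- test[, names(test) != "Survived"]
train_y <- train$Survived
test_y  <- test$Survived   # actual outcomes, used only for evaluation

# Accuracy for each K: share of test rows where prediction equals the truth.
ks <- c(2, 4, 6)
accuracy <- sapply(ks, function(k) {
  pred <- knn(train_x, test_x, train_y, k = k)
  mean(as.character(pred) == as.character(test_y))
})
names(accuracy) <- paste0("K=", ks)
print(accuracy)

# The K with the highest test accuracy (the first one, if there is a tie).
best_k <- ks[which.max(accuracy)]
print(best_k)
```

A confusion table such as `table(pred, test_y)` gives the same information in more detail, since the accuracy is the sum of its diagonal divided by the number of test rows.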
2022-12-05