
Introduction to Data Science

HOMEWORK 3

Question 1

Using the Auto data set found in the ISLR package, perform the tasks below using the supervised machine learning algorithm lm() for simple and multiple linear regression. You do not need to split the data set into training and test for this exercise.

A.  Perform a correlation analysis on the Auto data frame using the pairs() and cor() functions (be sure not to use the name variable in this analysis). Review the results and provide a commentary of your findings.
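A minimal sketch of this analysis in R (assuming the ISLR package is installed):

```r
library(ISLR)

# Drop the non-numeric name variable before the correlation analysis
auto_num <- subset(Auto, select = -name)

pairs(auto_num)   # scatterplot matrix of all variable pairs
cor(auto_num)     # correlation matrix
```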

B.  Use the lm() function to perform a simple linear regression with mpg as the response variable and horsepower as the predictor. Store the results in a linear model object named lm1.

1.  Use the summary() function on the lm1 object to print the results.

2.  Comment on the output of summary(), for example: is there a relationship between the predictor and the response variable? If so, how strong is the relationship? Is the relationship positive or negative?

3.  Create a scatterplot using the response variable and predictor. In addition, use the abline() function to display the ordinary least squares (OLS) regression line.
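One way part B might be sketched in R:

```r
library(ISLR)

lm1 <- lm(mpg ~ horsepower, data = Auto)
summary(lm1)

# Scatterplot of response vs. predictor, with the OLS regression line
plot(Auto$horsepower, Auto$mpg, xlab = "horsepower", ylab = "mpg")
abline(lm1, col = "red")
```

The same pattern applies to part C, substituting weight as the predictor and storing the result in lm2.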

C.  Use the lm() function to perform a second simple linear regression with mpg as the response variable and weight as the predictor. Store the results in a linear model object named lm2.

1.  Use the summary() function on the lm2 object to print the results.

2.  Comment on the output of summary(), for example: is there a relationship between the predictor and the response variable? If so, how strong is the relationship? Is the relationship positive or negative?

3.  Create a scatterplot using the response variable and predictor. In addition, use the abline() function to display the ordinary least squares (OLS) regression line.

D.  Use the lm() function to perform a multiple linear regression with mpg as the response variable and horsepower and weight as the predictors. Store the results in a linear model object named lm3.

1.  Use the summary() function on the lm3 object to print the results.

2.  Comment on the output of summary(), for example: are there relationships between the predictors and the response variable? If so, how strong are the relationships? Are the relationships positive or negative?

3.  Use the plot() function on the linear model object lm3 to produce four diagnostic plots describing the regression fit. Comment on each of the plots and any problems you see with the fit.

4.  Using the computed coefficients of the lm3 linear model object, what is the predicted mpg value associated with a horsepower value of 98, and a weight value of 2500?
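Part D might be sketched in R as follows:

```r
library(ISLR)

lm3 <- lm(mpg ~ horsepower + weight, data = Auto)
summary(lm3)

# Four diagnostic plots on one graphics device
par(mfrow = c(2, 2))
plot(lm3)

# Predicted mpg for horsepower = 98 and weight = 2500
predict(lm3, newdata = data.frame(horsepower = 98, weight = 2500))
```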


Question 2

Using the Auto data set, perform the tasks below using the supervised machine learning algorithm glm() for logistic regression. Develop a model to predict whether a given car gets high or low gas mileage:

A.  Create a binary categorical variable mpg01 that contains a 1 if mpg contains a value > its median, and a 0 if mpg contains a value <= its median. You can use the median() function in base R for this purpose. Create a new data frame containing all the variables from Auto plus the new mpg01 variable.

B.  Perform “feature engineering” to determine which of the predictors seem most likely to be useful in predicting mpg01. The cor() statistical function may be useful here to compute a correlation matrix of the predictors.

C.  Split the data into a training set and test set. You can choose the split percentage (you might experiment with several percentages in order to minimize the test error metric).

D.  Train the glm() algorithm using the training set with mpg01 as the response variable along with the predictors you chose above.

E.  Use the predict.glm() function on the test set in order to get predicted probabilities of class membership.

F.  Based on the predicted probabilities, create a vector of 0s and 1s, where a 1 indicates that the predicted probability is > 0.5 (you might experiment with several threshold values in order to minimize the test error metric).

G.  Compare the above vector (predicted response variable values) with the mpg01 variable values in the test set (actual response variable values) and create a vector index of 0s and 1s indicating whether the two values are not equal.

H.  Calculate the mean() of the vector in the above step. This is your test error metric. Your goal is to minimize this metric.
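The steps above can be sketched end to end in R. The 70/30 split, the 0.5 threshold, and the particular predictors shown are illustrative assumptions to experiment with, not prescribed choices:

```r
library(ISLR)
set.seed(1)  # for a reproducible split

# A. Binary response based on the median of mpg
auto2 <- data.frame(Auto, mpg01 = ifelse(Auto$mpg > median(Auto$mpg), 1, 0))

# C. 70/30 train/test split
idx   <- sample(nrow(auto2), floor(0.7 * nrow(auto2)))
train <- auto2[idx, ]
test  <- auto2[-idx, ]

# D. Logistic regression; predictors here are illustrative picks
fit <- glm(mpg01 ~ horsepower + weight + displacement,
           data = train, family = binomial)

# E-H. Predicted probabilities, 0/1 predictions, and the test error metric
probs <- predict.glm(fit, newdata = test, type = "response")
pred  <- ifelse(probs > 0.5, 1, 0)
mean(pred != test$mpg01)  # proportion of misclassified test observations
```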

 

Question 3

Use the K-means clustering algorithm kmeans() on the iris data set for the Sepal.Length and Sepal.Width variables. Perform the following steps:

A.  Set the number of centroids to 3

B.  Call the kmeans() algorithm and store the resulting kmeans class object to a variable named kc. You need to set the random seed (with set.seed()) to get reproducible results, because kmeans() uses a random number generator to choose the initial centers when you use the centers argument.

C.  Review and print the cluster component of the kmeans object.

D.  Review and print the centers component of the kmeans object.

E.  Produce a scatterplot data visualization to plot each of the resulting clusters of data points and their centers. Use different colors for the data points residing in each cluster. Also, plot a special character (e.g. “+”) showing the centroid of each cluster.
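A sketch of these steps in R (the seed value and colors are arbitrary choices):

```r
set.seed(1)  # kmeans() chooses its initial centers at random

kc <- kmeans(iris[, c("Sepal.Length", "Sepal.Width")], centers = 3)
kc$cluster  # cluster assignment of each observation
kc$centers  # coordinates of the three centroids

# Color the points by cluster and mark each centroid with "+"
plot(iris$Sepal.Length, iris$Sepal.Width, col = kc$cluster,
     xlab = "Sepal.Length", ylab = "Sepal.Width")
points(kc$centers, pch = "+", cex = 2)
```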


Question 4

1. Access the Data Set

a)  Read the data set into R using a data frame named housing. Please do not use RStudio’s data import feature, but rather write R code for accessing the data.

b)  Cast the ocean_proximity character variable to the factor class, and display the resulting levels.
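A possible sketch in R; the file name below is a placeholder, since the assignment does not specify the path:

```r
# Substitute the actual location of your data file here
housing <- read.csv("housing.csv", stringsAsFactors = FALSE)

housing$ocean_proximity <- factor(housing$ocean_proximity)
levels(housing$ocean_proximity)
```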

2. EDA and Data Visualization

a)  Run the head() and tail() functions on the data frame to get a feel for the actual data values.

b)  Run the summary() function on the data frame to get a sense for the data classes, range of values for numeric variables, and any NAs found.

c)  Perform a correlation analysis on numeric variables in the data frame.

d)  Create histograms for each numeric variable.

e)  Produce a boxplot for the numeric variables in the data frame.

f) Produce boxplots for the variables: housing_median_age, median_income, and median_house_value “with respect” to the factor variable ocean_proximity.
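The EDA steps might be sketched in R as below; `use = "complete.obs"` is an assumption for handling the NAs in total_bedrooms, and only one of the requested per-factor boxplots is shown:

```r
head(housing)
tail(housing)
summary(housing)

num_vars <- sapply(housing, is.numeric)

# Correlation analysis on the numeric variables, ignoring rows with NAs
cor(housing[, num_vars], use = "complete.obs")

# Histograms for each numeric variable
for (v in names(housing)[num_vars]) hist(housing[[v]], main = v, xlab = v)

# Boxplot of the numeric variables, then one "with respect to" the factor
boxplot(housing[, num_vars])
boxplot(median_income ~ ocean_proximity, data = housing)
```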

3. Data Transformation

The next step is to transform the raw data into a more refined form as indicated in the steps below (that will constitute your data pipeline):

a)  We see from the summary() results above that there are many NA values in the total_bedrooms variable (the only variable with missing values). This needs to be addressed by filling in missing values using imputation. Use the statistical median for missing total_bedrooms values. The median is used instead of the mean because it is less influenced by extreme outliers. This may not be the best method, as these missing values could represent actual buildings (e.g. a warehouse) with no bedrooms, but imputation often makes the best of a bad situation. You can use the impute() function covered in class, or write code to accomplish the requirement.
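A base-R alternative to impute() for the median imputation might look like:

```r
# Replace NAs in total_bedrooms with the median of the observed values
med <- median(housing$total_bedrooms, na.rm = TRUE)
housing$total_bedrooms[is.na(housing$total_bedrooms)] <- med

sum(is.na(housing$total_bedrooms))  # check: should now be 0
```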

b)  Split the ocean_proximity variable into a number of binary categorical variables consisting of 1s and 0s. Although many machine learning algorithms in R can handle categorical data stored in a factor variable, we will cater to the lowest common denominator and do the splitting ourselves. Once you’re done with the splitting, you can remove the ocean_proximity variable from the data frame.
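One simple way to do the splitting by hand, assuming ocean_proximity has already been cast to a factor:

```r
# One binary 0/1 column per level of ocean_proximity
for (lev in levels(housing$ocean_proximity)) {
  housing[[lev]] <- ifelse(housing$ocean_proximity == lev, 1, 0)
}

housing$ocean_proximity <- NULL  # drop the original factor variable
```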

c)  Use the total_bedrooms and total_rooms variables along with households to create two new variables: mean_number_bedrooms and mean_number_rooms, as these are likely to be more accurate depictions of the houses in a given group. You can then remove the total_bedrooms and total_rooms variables once you’ve accomplished this requirement.

d)  Perform feature scaling. Scale each numerical variable except for median_house_value (as this is our response variable), and the binary categorical variables.
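Steps c) and d) might be sketched as follows; the list of columns passed to scale() is an assumption based on the variables named in this assignment:

```r
# c) Per-household means, then drop the raw totals
housing$mean_number_bedrooms <- housing$total_bedrooms / housing$households
housing$mean_number_rooms    <- housing$total_rooms    / housing$households
housing$total_bedrooms <- NULL
housing$total_rooms    <- NULL

# d) Scale the numeric predictors, leaving the response
#    (median_house_value) and the binary dummy variables untouched
num_cols <- c("longitude", "latitude", "housing_median_age", "population",
              "households", "median_income",
              "mean_number_bedrooms", "mean_number_rooms")
housing[num_cols] <- scale(housing[num_cols])
```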

e)  The result of your data transformation efforts should yield a new data frame named cleaned_housing with the following variables:

"NEAR BAY"           "<1H OCEAN"          "INLAND"

"NEAR OCEAN"         "ISLAND"             "longitude"

"latitude"           "housing_median_age" "population"

"households"         "median_income"      "mean_bedrooms"

"mean_rooms"         "median_house_value"

4. Create Training and Test Sets

a)  Create a random sample index for the cleaned_housing data frame.

b)  Create a training set named train consisting of 70% of the rows of the cleaned_housing data frame.

c)  Create a test set named test consisting of 30% of the rows of the cleaned_housing data frame.
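These three steps can be sketched in R; the seed value is an arbitrary choice for reproducibility:

```r
set.seed(1)  # reproducible sampling

# a) Random sample index covering 70% of the rows
idx <- sample(nrow(cleaned_housing), floor(0.7 * nrow(cleaned_housing)))

# b) and c) 70% training set, remaining 30% test set
train <- cleaned_housing[idx, ]
test  <- cleaned_housing[-idx, ]

nrow(train) / nrow(cleaned_housing)  # approximately 0.7
```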