
ACCT3011

Big Data Analytics

Undergraduate Programmes 2022/23

SUMMATIVE ASSIGNMENT

Programming Assignment 1: Part 1 - Linear Regression

Introduction

In this part of the assignment, you will implement linear regression and get to see it work on data.  To get started, you will need to download the starter code (from Learn Ultra) and unzip its contents to the directory where you wish to complete the assignment.  If needed, use the cd command in Matlab/Octave or Python to change to this directory before starting this assignment.

Files included in this assignment

· ex1.m – Matlab/Octave script that will help step you through the assignment

· ex1.py – Python script that will help step you through the assignment

· ex1data1.txt - Dataset for linear regression with one variable

· [⋆] plotData.m - Function to display the dataset

· [⋆] computeCost.m - Function to compute the cost of linear regression

· [⋆] gradientDescent.m - Function to run gradient descent

⋆ indicates files you will need to complete.

Throughout the assignment, you will be using the scripts ex1.m or ex1.py.  These scripts set up the dataset for the problems and make calls to functions that you will write.  You do not need to modify either of them.  You are only required to modify functions in other files, by following the instructions in this assignment.

For this programming assignment, you are required to complete this first part, implementing linear regression with one variable.  The second part of the assignment, which you must also complete, covers logistic regression.

Linear regression with one variable

In this part of this assignment, you will implement linear regression with one variable to predict profits for a food truck.  Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet.  The chain already has trucks in various cities and you have data for profits and populations from the cities.  You would like to use this data to help you select which city to expand to next.

The file ex1data1.txt contains the dataset for our linear regression problem.  The first column is the population of a city and the second column is the profit of a food truck in that city.  A negative value for profit indicates a loss.  The ex1.m and ex1.py scripts have already been set up to load this data for you.

Plotting the Data

Before starting on any task, it is often useful to understand the data by visualizing it.  For this dataset, you can use a scatter plot to visualize the data, since it has only two properties to plot (profit and population).  (Many other problems that you will encounter in real life are multi-dimensional and can’t be plotted on a 2-d plot.)  In ex1.m, the dataset is loaded from the data file into the variables X and y:

data = load('ex1data1.txt');   % read comma-separated data
X = data(:, 1); y = data(:, 2);
m = length(y);                 % number of training examples

In ex1.py, the dataset is loaded into the variables x and y as follows:

import numpy as np
import pandas as pd

data = pd.read_csv("ex1data1.txt", names=["X", "y"])
x = np.array(data.X)[:, None]   # population, as an (m, 1) column
y = np.array(data.y)            # profit
m = len(y)                      # number of training examples

Next, the scripts call the plotData function to create a scatter plot of the data.  If you are using matlab/octave, your job is to complete plotData.m to draw the plot; modify the file and fill in the following code:

plot(x, y, 'rx', 'MarkerSize', 10);        % Plot the data
ylabel('Profit in $10,000s');              % Set the y-axis label
xlabel('Population of City in 10,000s');   % Set the x-axis label

To do this in python, modify the plotData function using the following code:

ax.plot(x, y, 'rx', markersize=10)
ax.set_xlabel("Population of City in 10,000s")
ax.set_ylabel("Profit in $10,000s")

Now, when you continue to run ex1.m or ex1.py, you should see a scatter plot of the data with the red “x” markers and axis labels specified above.

Gradient Descent

In this part, you will fit the linear regression parameters θ to our dataset using gradient descent.

Update Equations

The objective of linear regression is to minimize the cost function

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2    (1)

where the hypothesis h_\theta(x) is given by the linear model:

h_\theta(x) = \theta_0 + \theta_1 x_1    (2)

Here, the θj values are the parameters, which map the x’s to the y’s.  Consequently, the model can be written more concisely as follows:

h_\theta(x) = \theta^{T} x    (3)

The value assigned to h_\theta(x) (the left-hand side of the equation) can take any value from -\infty to +\infty.  This is a linear regression model.

Recall that the parameters of your model are the θj values. These are the values you will adjust to minimize the cost J(θ). One way to do this is to use the batch gradient descent algorithm. With each step of gradient descent, your parameters θj come closer to the optimal values that achieve the lowest cost J(θ).
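For reference, each iteration of batch gradient descent performs the standard simultaneous update (stated here from the textbook definition of the algorithm, not quoted from the starter code):

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} \qquad \text{(simultaneously for all } j\text{)}

With each such step (for a suitably small learning rate α), the cost J(θ) should decrease.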

Implementation

In ex1.m and ex1.py, we have already set up the data for linear regression. In the following lines, we add another dimension to our data to accommodate the θ0 intercept term.  We also initialize the parameters θ to 0 and the learning rate alpha to 0.01.  In ex1.m this is:

X = [ones(m, 1), data(:,1)];   % Add a column of ones to x
theta = zeros(2, 1);           % initialize fitting parameters
iterations = 1500;
alpha = 0.01;

In ex1.py the corresponding code is:

ones = np.ones_like(x)
X = np.hstack((ones, x))   # Add a column of ones to x
theta = np.zeros(2)        # initialize fitting parameters
iterations = 1500
alpha = 0.01

Computing the cost J(θ)

As you perform gradient descent to minimize the cost function J(θ), it is helpful to monitor convergence by computing the cost at each step. In this section, you will implement a function to calculate J(θ) so you can check the convergence of your gradient descent implementation.

Your next task is to complete the code in the file computeCost.m or the computeCost function in ex1.py. These functions compute J(θ).  As you are doing this, remember that the variables X and y are not scalar values, but matrices whose rows represent the examples from the training set.  Once you have completed the function, the next step in ex1.m or ex1.py will run computeCost once using θ initialized to zeros, and you will see the cost printed to the screen. You should expect to see a cost of 32.07.
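As a hedged illustration, here is one possible vectorized sketch of computeCost in Python. It matches the computeCost(X, y, t) call used later in ex1.py, but it is a sketch under those assumptions, not the official solution:

import numpy as np

def computeCost(X, y, theta):
    # X: (m, 2) array whose first column is ones; y: (m,) array; theta: (2,) array
    m = len(y)
    residuals = X.dot(theta) - y        # h_theta(x^(i)) - y^(i) for every example
    return (residuals ** 2).sum() / (2 * m)

Called with theta = np.zeros(2) on this dataset, a correct implementation should produce the expected cost of about 32.07.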

Gradient descent

Next, you will implement gradient descent in the file gradientDescent.m if you are using matlab/octave, or in the gradientDescent function in ex1.py if you are using python.  The loop structure has been written for you, and you only need to supply the updates to θ within each iteration.

As you program, make sure you understand what you are trying to optimize and what is being updated.  Keep in mind that the cost J(θ) is parameterized by the vector θ, not X and y.  That is, we minimize the value of J(θ) by changing the values of the vector θ, not by changing X or y.  Refer to the equations in this handout and to the lecture notes if you are uncertain.

A good way to verify that gradient descent is working correctly is to look at the value of J(θ) and check that it is decreasing with each step.  Assuming you have implemented gradient descent and computeCost correctly, your value of J(θ) should never increase, and should converge to a steady value by the end of the algorithm.  After you are finished, the code will use your final parameters to plot the linear fit.
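For orientation, a minimal Python sketch of the update loop is shown below. The exact signature in the starter code may differ; returning a J_history list of costs is an assumption added here purely to monitor convergence:

import numpy as np

def gradientDescent(X, y, theta, alpha, num_iters):
    # Batch gradient descent: repeat the simultaneous update num_iters times.
    m = len(y)
    J_history = []                                  # hypothetical convergence log
    for _ in range(num_iters):
        gradient = X.T.dot(X.dot(theta) - y) / m    # vectorized dJ/dtheta
        theta = theta - alpha * gradient            # update all theta_j at once
        J_history.append(computeCost(X, y, theta))  # uses the computeCost sketched above
    return theta, J_history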

Your final values for θ will also be used to make predictions on profits in areas of 35,000 and 70,000 people.  Note the way that the following lines in ex1.m use matrix multiplication, rather than explicit summation or looping, to calculate the predictions. This is an example of code vectorization.

predict1 = [1, 3.5] * theta;
predict2 = [1, 7] * theta;

The corresponding lines in ex1.py are:

predict1 = np.dot([1, 3.5], theta)
predict2 = np.dot([1, 7], theta)

Visualizing J(θ)

To understand the cost function J(θ) better, you will now plot the cost over a 2-dimensional grid of θ0 and θ1 values. You will not need to code anything new for this part, but you should understand how the code you have written already is creating these images.

In the next step of ex1.m, there is code set up to calculate J(θ) over a grid of values using the computeCost function that you wrote.

J_vals = zeros(length(theta0_vals), length(theta1_vals));
for i = 1:length(theta0_vals)
    for j = 1:length(theta1_vals)
        t = [theta0_vals(i); theta1_vals(j)];
        J_vals(i,j) = computeCost(X, y, t);   % X includes the column of ones
    end
end

In ex1.py, the code set up to calculate J(θ) over the grid of values using the computeCost function that you wrote is:

J_vals = np.zeros((len(theta0_vals), len(theta1_vals)))
for i in range(len(theta0_vals)):
    for j in range(len(theta1_vals)):
        t = np.array([theta0_vals[i], theta1_vals[j]])
        J_vals[i, j] = computeCost(X, y, t)

After these lines are executed, you will have a 2-D array of J(θ) values. The code will then use these values to produce surface and contour plots of J(θ) using the surf and contour commands.
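In Python, the equivalents of surf and contour are matplotlib's plot_surface and contour. A hedged sketch of how the J_vals grid might be rendered (the exact calls and contour levels in ex1.py may differ) is:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # only needed on older matplotlib versions

T0, T1 = np.meshgrid(theta0_vals, theta1_vals)
fig = plt.figure()
ax = fig.add_subplot(121, projection='3d')
ax.plot_surface(T0, T1, J_vals.T)                             # surface plot of J(theta)
ax2 = fig.add_subplot(122)
ax2.contour(T0, T1, J_vals.T, levels=np.logspace(-2, 3, 20))  # contour plot
plt.show()

Note the transpose of J_vals: meshgrid orders its outputs so that the theta1 axis varies down the rows, the opposite of how the loop above filled the array.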

The purpose of these graphs is to show you how J(θ) varies with changes in θ0 and θ1. The cost function J(θ) is bowl-shaped and has a global minimum. (This is easier to see in the contour plot than in the 3D surface plot.) This minimum is the optimal point for θ0 and θ1, and each step of gradient descent moves closer to this point.

Required:

a) Using your own words, briefly define the functions computeCost and gradientDescent.  

b) Create flowchart solutions for computeCost and gradientDescent.

c) As CEO, your goal is to gain strategic advantage over rival firms.  How can the food truck business use data analytics and exploit social media to accomplish this goal?

d) 

Programming Assignment 1: Part 2 - Logistic Regression

Introduction

In this exercise, you will implement logistic regression and apply it to two different datasets.

To get started with this part of the exercise, you will need to download the starter code (from Learn Ultra) and unzip its contents to the directory where you wish to complete the exercise.  If needed, use the cd command in Octave/Matlab or Python to change to this directory before starting this exercise.

Files included in this programming assignment

· ex2.m – Octave/Matlab script that will help step you through the exercise

· ex2.py - Python script that will help step you through the exercise

· ex2data1.txt - Training set for the first half of the exercise

· plotDecisionBoundary.m - Function to plot classifier’s decision boundary

· [⋆] plotData.m - Function to plot 2D classification data

· [⋆] sigmoid.m - Sigmoid Function

· [⋆] costFunction.m - Logistic Regression Cost Function

· [⋆] predict.m - Logistic Regression Prediction Function

⋆ indicates files you will need to complete.

Throughout the exercise, you will be using the scripts ex2.m and/or ex2.py.  These scripts set up the dataset for the problems and make calls to functions that you will write.  You are only required to modify functions by following the instructions in this assignment.

Logistic Regression

In this part of the exercise, you will build a logistic regression model to predict whether a student gets admitted into a university.

Suppose that you are the administrator of a university department and you want to determine each applicant’s chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression.  For each training example, you have the applicant’s scores on two exams and the admissions decision.

Your task is to build a classification model that estimates an applicant’s probability of admission based on the scores from those two exams.  This outline and the framework code in ex2.m and ex2.py will guide you through the exercise.

Visualizing the data

Before starting to implement any learning algorithm, it is always good to visualize the data if possible.  In the first part of ex2.m, the code will load the data and display it on a 2-dimensional plot by calling the function plotData.

You will now complete the code in plotData so that it displays a figure like Figure 1, where the axes are the two exam scores, and the positive and negative examples are shown with different markers.

[Figure 1: Scatter plot of the training data, with the two exam scores on the axes and the admitted and not-admitted applicants shown with different markers.]

% Find Indices of Positive and Negative Examples
pos = find(y == 1); neg = find(y == 0);
% Plot Examples
plot(X(pos, 1), X(pos, 2), 'k+', 'LineWidth', 2, 'MarkerSize', 7);
plot(X(neg, 1), X(neg, 2), 'ko', 'MarkerFaceColor', 'y', 'MarkerSize', 7);

In the ex2.py script the following code should be added to the plotData function to display the data on a 2-dimensional plot:

pos = X[np.where(y == 1)]
neg = X[np.where(y == 0)]
fig, ax = plt.subplots()
ax.plot(pos[:, 0], pos[:, 1], "k+", neg[:, 0], neg[:, 1], "yo")

Sigmoid function

Before you start with the actual cost function, recall that the logistic regression hypothesis is defined as:

h_\theta(x) = g(\theta^{T} x)    (3)

where the function g is the sigmoid function, defined as:

g(z) = \frac{1}{1 + e^{-z}}    (4)

Your first step is to implement this function in the sigmoid.m script if you are using octave/matlab or the sigmoid function in the ex2.py script if you are using python. When you are finished, try testing a few values by calling sigmoid(x) at the command line.  For large positive values of x, the sigmoid should be close to 1, while for large negative values, the sigmoid should be close to 0. Evaluating sigmoid(0) should give you exactly 0.5.  Your code should also work with vectors and matrices.  For a matrix, your function should perform the sigmoid function on every element.
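A minimal Python sketch satisfying these requirements (elementwise, so it works on scalars, vectors and matrices alike) might look like this; it is an illustration, not the starter code:

import numpy as np

def sigmoid(z):
    # np.exp broadcasts, so this applies the sigmoid to every element of z.
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

As a quick check, sigmoid(0) returns 0.5, sigmoid(100) is essentially 1, and sigmoid(-100) is essentially 0.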

Cost function and gradient

Now you will implement the cost function and gradient for logistic regression.  In octave/matlab, complete the code in costFunction.m to return the cost and gradient.  In python this is the costFunction function in the ex2.py script.  Once you are done, ex2.m and ex2.py will call your costFunction using the initial parameters of θ.  You should see that the cost is about 0.693.
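As a hedged sketch of the Python version (the theta-first argument order is an assumption chosen so the same function can later be handed to an optimizer, and it relies on the sigmoid function implemented above):

import numpy as np

def costFunction(theta, X, y):
    # X: (m, n+1) with a leading column of ones; y: (m,) of 0/1 labels.
    m = len(y)
    h = sigmoid(X.dot(theta))                                   # predicted probabilities
    J = -(y.dot(np.log(h)) + (1 - y).dot(np.log(1 - h))) / m    # cross-entropy cost
    grad = X.T.dot(h - y) / m                                   # gradient w.r.t. theta
    return J, grad

With θ initialized to zeros, h is 0.5 for every example, so J = -log(0.5) ≈ 0.693, matching the value quoted above.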

Learning parameters

In the previous assignment, you found the optimal parameters of a linear regression model by implementing gradient descent.  You wrote a cost function and calculated its gradient, then took a gradient descent step accordingly.  This time, instead of taking gradient descent steps, you will use the built-in fminunc and minimize functions in octave/matlab and python respectively.  These two functions are optimization solvers that find the minimum of an unconstrained function.  For logistic regression, you want to optimize the cost function J(θ) with parameters θ.

Concretely, you are going to use these functions to find the best parameters θ for the logistic regression cost function, given a fixed dataset (of X and y values).  You will pass the following inputs:

· The initial values of the parameters we are trying to optimize.

· A function that, when given the training set and a particular θ, computes the logistic regression cost and gradient with respect to θ for the dataset (X, y)

In ex2.m and ex2.py, we already have code written to call the optimization functions with the correct arguments.  If you have completed costFunction correctly, the optimization functions will converge on the right parameters and return the final values of the cost and θ.  Notice that by using the built-in optimization functions, you did not have to write any loops yourself or set a learning rate as you did for gradient descent. This is all done by the optimization functions; you only needed to provide a function that calculates the cost and the gradient.
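For the python side, a sketch of what such a call might look like (the exact method and options in ex2.py may differ; jac=True tells minimize that costFunction returns the cost and the gradient together):

import numpy as np
from scipy.optimize import minimize

initial_theta = np.zeros(X.shape[1])
res = minimize(costFunction, initial_theta, args=(X, y), method='BFGS', jac=True)
theta_opt, final_cost = res.x, res.fun   # optimal parameters and final cost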

Once the optimization functions have completed, ex2.m or ex2.py will call your costFunction using the optimal parameters θ.  You should see that the cost is about 0.203.

Evaluating logistic regression

After learning the parameters, you can use the model to predict whether a particular student will be admitted.  For a student with an Exam 1 score of 45 and an Exam 2 score of 85, you should expect to see an admission probability of 0.774.
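As a hedged illustration (theta_opt standing for whichever variable holds your optimized parameters), this probability is the sigmoid of the inner product of the feature vector [1, 45, 85] with θ:

prob = sigmoid(np.array([1, 45, 85]).dot(theta_opt))   # expected to be about 0.774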

Another way to evaluate the quality of the parameters we have found is to see how well the learned model predicts on our training set.  In this part, your task is to complete the code in predict.m in octave/matlab or the predict function in ex2.py.  The predict function will produce “1” or “0” predictions given a dataset and a learned parameter vector θ.
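A minimal sketch of the thresholding logic (again assuming the sigmoid above and a theta-first signature):

import numpy as np

def predict(theta, X):
    # Label 1 when h_theta(x) = sigmoid(theta^T x) >= 0.5, otherwise 0.
    return (sigmoid(X.dot(theta)) >= 0.5).astype(int)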

After you have completed the code, the ex2.m and ex2.py scripts will report the training accuracy of your classifier by computing the percentage of examples it classified correctly.

Required:

a) Using your own words, briefly define the functions sigmoid, costFunction and predict.

b) Create flowchart solutions for sigmoid, costFunction and predict.

c) Suppose after training the logistic regression classifier with gradient descent, you find that it has overfit the training set and does not achieve the desired performance on the training or cross validation sets. Briefly discuss the problem of overfitting and identify and discuss two alternative approaches that could be taken to improve the performance of the logistic regression classifier.

Overall word limit: 500 words

SUBMISSION INSTRUCTIONS

Assignments should be typed, using 1.5 spacing and an easy-to-read 12-point font. Assignments and dissertations/business projects must not exceed the word count indicated in the module handbook/assessment brief.

 The word count should:

§ Include all the text, including title, preface, introduction, in-text citations, quotations, footnotes and any other items not specifically excluded below.

§ Exclude diagrams, tables (including tables/lists of contents and figures), equations, executive summary/abstract, acknowledgements, declaration, bibliography/list of references and appendices. However, it is not appropriate to use diagrams or tables merely as a way of circumventing the word limit. If a student uses a table or figure as a means of presenting his/her own words, then this is included in the word count.

Examiners will stop reading once the word limit has been reached, and work beyond this point will not be assessed. Checks of word counts will be carried out on submitted work, including any assignments or dissertations/business projects that appear to be clearly over-length. Checks may take place manually and/or with the aid of the word count provided via an electronic submission. Where a student has intentionally misrepresented their word count, the School may treat this as an offence under Section IV of the General Regulations of the University. Extreme cases may be viewed as dishonest practice under Section IV, 5 (a) (x) of the General Regulations.

Very occasionally it may be appropriate to present, in an appendix, material which does not properly belong in the main body of the assessment but which some students wish to provide for the sake of completeness. Any appendices will not have a role in the assessment - examiners are under no obligation to read appendices and they do not form part of the word count. Material that students wish to be assessed should always be included in the main body of the text.

Guidance on referencing can be found on Durham University website and in the Student Information Hub.

MARKING GUIDELINES

Performance in the summative assessment for this module is judged against the following criteria:

· Relevance to question(s)

· Organisation, structure and presentation

· Depth of understanding

· Analysis and discussion

· Use of sources and referencing

· Overall conclusions