Homework #7

ECE 461/661: Introduction to Machine Learning

2022

Please remember to show your work for all problems and to write down the names of any students that you collaborate with.  The full collaboration and grading policies are available on the course website: https://18661.github.io/. As a reminder, any code that you write must be fully your own.

Your solutions should be uploaded to Gradescope (https://www.gradescope.com/) in PDF format by the deadline. We will not accept hardcopies.  If you choose to hand-write your solutions, please make sure the uploaded copies are legible. Gradescope will ask you to identify which page(s) contain your solutions to which problems, so make sure you leave enough time to finish this before the deadline. We will give you a 30-minute grace period to upload your solutions in case of technical problems.

You will be graded based on the correctness of your code, as well as your explanations for the sub-questions, but we are not imposing a minimum accuracy threshold for any of the models. As part of the write-up, please submit code for the functions you are requested to complete. If you need to modify any methods for a sub-question, please paste all modified methods as answers to the question on Gradescope.

Unlike other homework assignments, in this homework you are encouraged and expected to use existing PyTorch, Python, etc. functions in your code.

All students must complete Q1 on Fashion-MNIST. 18-661 students must also complete either Q2 or Q3 or Q4. If you complete more than one of these three, we will count the highest of these grades. 18-461 students only need to complete Q1.

1    Learning to classify the “classy” digits [60 points]

Fashion-MNIST is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. The goal of this question is to use an important modern machine learning tool, PyTorch, to implement an end-to-end multi-layer perceptron (MLP) fashion classifier on the Fashion-MNIST dataset. We will walk through the general process of building deep learning systems for deployment.

Inputs and Outputs: Before we begin building models, it is important to understand the available inputs and expected outputs for the system. In this problem, our inputs are 28×28 grayscale images, and we will build models to predict one of 10 fashion classes.

The code template has the following directory structure:

HW7
|-- README.md
|-- config.yaml
|-- data.py
|-- main.py
|-- network.py
|-- requirements.txt
|-- train.py

 

Figure 1: Fashion-MNIST

In this assignment, you will modify code in train.py, main.py, and network.py. Provided code in data.py creates dataset objects and dataloader objects that can loop through the dataset to return batches of examples for training, validation and inference. README.md gives instructions on how to install and set up your Python environment and run the provided code. config.yaml is the model configuration file provided as default. By modifying the configuration file, you should be able to run experiments with different configurations and compare the runs.

1.1    Modifying network.py - Designing the architecture [10 points]

A multi-layer perceptron is a series of linear projections followed by non-linear activations. To build a custom MLP with a dynamic number of layers and layer sizes, we can append the required layers, which are of type torch.nn.Module, to a list and then pass that list to torch.nn.Sequential. This creates an object that links the output of the first module in the list to the input of the second, and so on. We can loop over the configuration variable hidden_sizes (of type list), prepended with the input size and appended with the output size, to create the required list of modules.

For a sequential object, we can perform forward propagation by simply passing the input of the first layer to the sequential object; the value it returns is the output of the final layer in the Sequential block.

The provided baseline architecture is as follows:

• Flatten layer to convert input to shape [batch_size, 784]

• Linear layer to map to 256

• Sigmoid Activation

• Linear layer to map to 128

• Sigmoid Activation

• Linear layer to map to 64

• Sigmoid Activation

• Linear layer to map to number of classes
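The construction described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the template: the helper name `build_mlp` and its arguments are hypothetical, and the sizes are taken from the baseline list.

```python
import torch
from torch import nn


def build_mlp(input_size, hidden_sizes, output_size):
    """Chain Linear layers with Sigmoid activations in between (hypothetical helper)."""
    sizes = [input_size] + list(hidden_sizes) + [output_size]
    layers = [nn.Flatten()]  # [batch, 1, 28, 28] -> [batch, 784]
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:  # no activation after the final (logit) layer
            layers.append(nn.Sigmoid())
    return nn.Sequential(*layers)


mlp = build_mlp(28 * 28, [256, 128, 64], 10)
logits = mlp(torch.zeros(4, 1, 28, 28))  # dummy batch of 4 images
print(logits.shape)  # torch.Size([4, 10])
```

Calling the Sequential object on a batch runs every module in order, so the returned tensor holds one row of 10 logits per image.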

(a) Complete the __init__() function for the class Network of type nn.Module as directed. Make sure to instantiate the MLP sequential module. [4 points]

(b) In the method init_weights(), initialize the weights in the neural network to be drawn from a Xavier uniform distribution and the biases to be zeros. You should find torch.nn.init helpful. [2 points]

(c) Complete the forward method of the class Network to take in the input image and return the final classification logits. [2 points]
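As a hedged illustration of the initialization scheme in (b), the sketch below applies Xavier-uniform weights and zero biases to a small stand-in model. The function signature and the stand-in model are assumptions made for this example; the template's init_weights() may be organized differently.

```python
import torch
from torch import nn


def init_weights(module):
    """Xavier-uniform weights and zero biases for every Linear layer."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)


torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 64), nn.Sigmoid(), nn.Linear(64, 10))
model.apply(init_weights)  # apply() visits every submodule recursively

first = model[1]
bound = (6.0 / (784 + 64)) ** 0.5  # Xavier-uniform samples from U(-bound, bound)
print(first.weight.abs().max().item() <= bound, first.bias.abs().sum().item())
```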

1.2    Modifying train.py - The training process [20 points]

Neural networks can be trained by using the entire training dataset to update the model parameters multiple times (each run over the dataset is called an epoch). In each epoch, we perform training and then validation, and we save the model if the current validation accuracy is better than the best validation accuracy seen so far. This enables inference with the model that has the best validation accuracy, as opposed to the model from the last epoch.

Your task is to complete multiple functions within train.py as directed:

(a) In the method train_validate_epoch(), complete the function as directed in the handout. [8 points]

(b) Complete the inference method inference() to retrieve the best saved model and compute test accuracy. [4 points]

(c) Complete the utility compute_accuracy() to obtain the accuracy from the logits and target labels. [2 points]

(d) Complete the utility save_model() to save the best model if the current validation accuracy is better than the best recorded validation accuracy. [3 points]

(e) Complete the utility log_epoch() to record learning rate, training loss, training accuracy, validation loss, validation accuracy and best recorded validation accuracy on Tensorboard using the provided SummaryWriter object. TensorBoard provides the visualization and tooling needed for machine learning experimentation, exposing an easy-to-use interface for plotting scalars, images and histograms while training. [3 points]
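Two of the ideas above, turning logits into an accuracy and keeping only the best model, can be sketched on toy values. The function signature and the made-up accuracy values are assumptions for this example; the template's utilities may take different arguments.

```python
import torch


def compute_accuracy(logits, targets):
    """Fraction of rows whose arg-max logit matches the target label."""
    predictions = logits.argmax(dim=1)
    return (predictions == targets).float().mean().item()


logits = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 0.1, 1.5],
                       [0.9, 1.1, 0.0],
                       [0.1, 0.2, 0.3]])
targets = torch.tensor([0, 2, 0, 2])
accuracy = compute_accuracy(logits, targets)  # 3 of 4 predictions are correct

# Keep a checkpoint only when validation accuracy improves on the best so far.
best_acc, saved_epochs = 0.0, []
for epoch, val_acc in enumerate([0.70, 0.78, 0.75, 0.81]):  # made-up values
    if val_acc > best_acc:
        best_acc = val_acc
        saved_epochs.append(epoch)  # a real run would call torch.save(...) here

print(accuracy, saved_epochs)
```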

1.3    Modifying main.py - Instantiating and running Training [8 points]

It's finally time to initialize your model and trainer on the Fashion-MNIST dataset. Complete the following functionalities in the method main() in main.py:

(a) Instantiate the variable criterion as an instance of torch.nn.CrossEntropyLoss, which is the objective function of our model. [2 points]

(b) Create an optimizer of type torch.optim.SGD (Stochastic Gradient Descent), using the hyperparameters provided via the config object. [2 points]

(c) Create a learning rate scheduler object that can modify the learning rate of your SGD optimizer based on the validation accuracy. Use a scheduler of type torch.optim.lr_scheduler.ReduceLROnPlateau. Read about the patience and factor parameters of this scheduler, and explain their role. Use the hyperparameters from the config file. [4 points]
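A minimal sketch of how these three objects fit together is shown below. The model, learning rate, factor, patience and validation accuracies are all made-up values for illustration; in the assignment they would come from the config object.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)  # tiny stand-in model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# mode="max" because the monitored quantity (validation accuracy) should increase.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2
)

val_accuracies = [0.60, 0.70, 0.70, 0.69, 0.70, 0.71]  # made-up values
lrs = []
for val_acc in val_accuracies:
    optimizer.step()         # stands in for a full epoch of training
    scheduler.step(val_acc)  # the scheduler only sees the monitored metric
    lrs.append(optimizer.param_groups[0]["lr"])
print(lrs)
```

Recording the learning rate after every epoch, as done here, makes it easy to see on which epoch the scheduler reacted to a stalled metric.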

1.4    Running Training, Validation and Inference [22 points]

Now run training, validation and inference on your model by running main.py as directed in README.md for 30 epochs. You should be able to obtain test accuracy of over 83% using the provided default configuration.

Report the following for all experiments (i.e., subparts a-d): (a) Tensorboard plots for training (loss, accuracy and learning rate) and validation (loss, accuracy, best accuracy), (b) test accuracy, (c) whether the training curves show (1) optimal behavior, (2) overfitting, or (3) underfitting.

(a) Run the baseline without the learning rate scheduler using config.yaml and use the tag base_noscheduler. [5 points]

(b) Run the baseline with the ReduceLROnPlateau learning rate scheduler using config.yaml and use the tag base_scheduler. [5 points]

(c) Modify the hidden layer sizes in config.yaml to be "10 10 10" and rerun the experiment with tag smaller_scheduler. Compare and contrast the logs for these two experiments. Why do you think the logs are similar or different when we switch to these smaller layer sizes? [5 points]

(d) Modify the hidden layer sizes in config.yaml to be "1024 1024 1024" and rerun the experiment with tag larger_scheduler. Compare and contrast the logs for these three hidden layer size experiments. Explain the difference and why one might see a difference. [5 points]

(e) In machine learning, we generally use validation sets as non-overlapping subsets of the training data. Why do we need validation sets, and why should the validation and train sets not share any data examples? [2 points]

1.5    BONUS: Convolutional Neural Networks [10 points]

Convolutional neural networks are based on the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation-equivariant responses known as feature maps. In this question, you will implement a Convolutional Neural Network architecture for the Fashion MNIST task in network.py. You will create a copy of class Network as CNNNetwork and modify the __init__() method to define a series of convolution and pooling layers for the task. The recommended model architecture is as follows:

• 2D Convolution layer with kernel size = 3 × 3, padding = 1, stride = 1, in channels = 1, out channels = 32

• Max Pooling layer with kernel size of 2 × 2 and stride 2

• ReLU Activation

• 2D Convolution layer with kernel size = 3 × 3, padding = 0, stride = 1, in channels = 32, out channels = 32

• Max Pooling layer with kernel size of 2 × 2

• ReLU Activation

• Flatten layer that returns shape [batch_size, dim]

• Linear layer that maps to number of classes

You can also use Dropout, or any other techniques to boost your performance. Make sure to explain the modifications you make, and how you obtained the stated performance. If correctly implemented, your test accuracy will be greater than 90% and you will earn the full 10 bonus points.
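To size the final Linear layer, it helps to track how each convolution and pooling layer changes the spatial dimensions. The arithmetic below is a sketch for the recommended layers only; it assumes no dilation, a second pooling stride equal to its kernel size (the PyTorch default when no stride is given), and 32 output channels from the last convolution. The helper name `out_size` is hypothetical.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (no dilation)."""
    return (n + 2 * padding - kernel) // stride + 1


n = 28                                          # Fashion-MNIST images are 28 x 28
n = out_size(n, kernel=3, stride=1, padding=1)  # conv 1 -> 28
n = out_size(n, kernel=2, stride=2)             # pool 1 -> 14
n = out_size(n, kernel=3, stride=1, padding=0)  # conv 2 -> 12
n = out_size(n, kernel=2, stride=2)             # pool 2 -> 6
out_channels = 32
flat_dim = out_channels * n * n                 # features seen by the Linear layer
print(n, flat_dim)  # 6 1152
```

If you change kernel sizes, padding or channel counts, the flattened dimension changes accordingly, so it is worth recomputing it rather than hard-coding a value.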

2    Improving the MLP Classifier [40 points]

In this problem, you will dive deeper into the configuration parameters and evaluation for the Fashion MNIST MLP. Generally, while experimenting with ML model hyper-parameters, we fix all hyper-parameters except one and rerun the experiment to evaluate the impact of changing the one parameter. In this question, use your completed code from the previous question to continue experiments.

Report the following for all experiments (i.e., subparts a-d): (a) Tensorboard plots for training (loss, accuracy and learning rate) and validation (loss, accuracy, best accuracy), (b) test accuracy, (c) whether the training curves show (1) optimal behavior, (2) overfitting, or (3) underfitting, and (d) any functions with changed code.

2.1    Changing Optimizer [4 points]

We have used the SGD optimizer for this task.  In this question, you will replace the SGD optimizer with an Adam optimizer, and rerun the experiment. Compare test accuracy and the logs from Tensorboard. Are they different? Why or why not?

2.2    Modifying the Neural Network [14 points]

(a) Neural networks can use many regularization strategies such as weight decay or dropout. Read about Dropout and how it affects training of deep networks. Now, add dropout to your network and include in the write-up the effect of dropout on the performance (test accuracy, final train loss) of your model. Explain the effect of dropout on performance if it causes noticeable changes, and explain why dropout doesn't help, if it doesn't. [7 points]

(b) Non-linear activations are the reason behind the expressivity and generalization ability of neural networks. What happens to test accuracy when the sigmoid activations in the model configuration are changed to (a) ReLU activations, (b) Hard Tanh, or (c) ELU activations, and why? [7 points]
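For readers who want to see the mechanics of dropout before adding it to the network, here is a small NumPy sketch of "inverted" dropout, the variant in which surviving activations are rescaled by 1 / (1 - p) during training so that no rescaling is needed at evaluation time (this is also how torch.nn.Dropout is documented to behave). The array sizes and drop probability are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5
activations = np.ones((1000, 64))  # stand-in for a hidden layer's outputs

# Training: zero each unit with probability p_drop, rescale survivors by 1 / (1 - p_drop).
mask = rng.random(activations.shape) >= p_drop
train_out = activations * mask / (1.0 - p_drop)

# Evaluation: dropout is the identity.
eval_out = activations

print(round(float(train_out.mean()), 2), float(eval_out.mean()))
```

The rescaling keeps the expected activation the same in both modes, which is why a model must be switched to evaluation mode before validation and inference.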

2.3    Data Split and Augmentation [14 points]

(a) Data Augmentation refers to creating multiple copies of the training data with different transformations to build robust machine learning models. Transformations are chosen so that the model learns to classify potential test data with these transformations. Give two examples of transformations that can be used to augment the Fashion MNIST training set. Rerun the experiment with this additional transformation, and report the test accuracy and validation accuracy curve from Tensorboard. Explain the reason for the change in performance, if any. [8 points]

(b) In our Fashion MNIST task, we have chosen a split of 85-15 for training and validation data. What would happen if we had chosen 50-50 instead? Rerun the experiment with this new split, and report the test accuracy and validation accuracy curve from Tensorboard. [6 points]

2.4    Evaluating Classification with more metrics [8 points]

Classification is evaluated using precision, recall, F-1 score, and confusion matrix. Use sklearn.metrics to evaluate your test predictions for two models: (a) the model with lowest test accuracy (among all models you have trained in Q1 and Q2), and (b) the model with highest test accuracy among all of these previous models.

Report the precision, recall, and F-1 on the test set for these two models. What do higher/lower values mean? Also plot confusion matrices for these models, and comment on any differences and what they reflect.
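The sklearn.metrics calls involved can be tried on a handful of toy labels first. The labels below are invented for illustration and are not real model output; macro averaging is one possible choice for a multi-class problem, not a requirement of the assignment.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = [0, 0, 1, 1, 2, 2]  # toy labels, not real model output
y_pred = [0, 1, 1, 1, 2, 0]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
print(precision, recall, f1)
print(cm)
```

Passing average=None instead returns one value per class, which is often more informative when a few classes are confused with each other.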

3    Decision Tree for Spotify Data [40 points]

In this problem, you will implement a Decision Tree classifier using scikit-learn and use it to classify the Spotify data set.  In this data set, an individual has generated a list of songs, each with a set of features, and whether the individual liked or disliked the song. The goal of this problem is to create a decision tree classifier to predict whether this individual would like or dislike a song based on a list of features.  We encourage you to use sklearn’s DecisionTreeClassifier class for this problem.

3.1    Import Data [8 points]

(a) Import the data into a Pandas dataframe. Pandas is a data analysis library that is very useful for machine learning projects. Examine the data. Which features, if any, appear to not be useful for classification and should be removed? Print the final list of the feature names that you believe to be useful.

(b) Of the remaining features which you believe may be useful for classification, which feature(s) do you estimate will be the most important? Which feature(s) will be the least important? Briefly explain your answers.

(c) Create a Pandas dataframe with just the useful features you have selected, and a separate data series for the targets (labels) of each sample.

(d) Divide the full dataset into a training set and testing set, with 80% of the data used for training. Consider using the train_test_split function for this step.
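The steps in (c) and (d) might look like the following on a toy table. The column names and values here are illustrative stand-ins, not the actual Spotify columns, and stratifying the split is an optional choice made for this sketch.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Spotify table; the column names are illustrative only.
df = pd.DataFrame({
    "song_title": [f"song {i}" for i in range(10)],
    "tempo": [100 + 5 * i for i in range(10)],
    "energy": [0.1 * i for i in range(10)],
    "target": [0, 1] * 5,
})

X = df.drop(columns=["song_title", "target"])  # keep only the chosen features
y = df["target"]                               # labels as a separate Series

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y
)
print(list(X.columns), len(X_train), len(X_test))
```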

3.2    Training the Model [12 points]

(a) Determine the best hyper-parameters for your decision tree using cross-validation with at least 5 folds. Search across at least 3 hyper-parameters for Decision Trees. It is recommended to look at 'criterion', 'max_depth', and 'class_weight', but you are welcome to explore additional or alternative hyper-parameters. The GridSearchCV module may be helpful here. Report which hyper-parameters you searched over and the best hyper-parameter values.

(b) Train your model with the best hyper-parameters found in Q3.2(a). Run it on the test data to generate predictions for the test data. Your final accuracy may vary, but expect it to be around 70%.
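A hedged sketch of the search in (a) and the evaluation in (b) is given below. It runs on synthetic data from make_classification rather than the Spotify set, and the grid values are arbitrary examples, so neither the best parameters nor the accuracy it prints say anything about the real problem.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the Spotify features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 8, None],
    "class_weight": [None, "balanced"],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)  # 5-fold cross-validation for each of the 16 settings

best_tree = search.best_estimator_  # refit on all training data with the best setting
test_accuracy = best_tree.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```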

3.3    Evaluating the Model [20 points]

(a) Generate the precision, recall, accuracy, and F1-score for your predictions from Q3.2(b). These metrics are all refinements of the classification accuracy. The sklearn.metrics module may help with this. What are your results?

(b) Generate a confusion matrix to visualize your predictions (Figure 2 has an example; note that your matrix may have very different values). The sklearn.metrics module may also be useful here.

 

 

Figure 2: Confusion Matrix Example

(c) Generate a representation of your decision tree from Q3.2(b) using the graphviz and export_graphviz functions. Figure 3 shows an example decision tree output (note that your results may look very different from this example).

Figure 3: Decision Tree Visualization Example (root node: instrumentalness <= 0.0, with True and False branches)

(d) Determine the relative importance of each feature for the tree you trained in Q3.2(b). The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. How well do these results match your initial (qualitative) estimates of each feature's importance in Q3.1(b)? Print the importance of each feature.

Hint: This quantity is computed as you train your model, and thus should not require additional computations on your part.
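The sketch below shows where a fitted scikit-learn tree stores its feature importances and how export_graphviz can produce a DOT description of the tree. It uses synthetic data and placeholder feature names, both of which are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

feature_names = ["f0", "f1", "f2", "f3"]  # placeholder names
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Importances are stored on the fitted estimator and sum to 1.
for name, importance in zip(feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")

# With out_file=None, export_graphviz returns the DOT source as a string,
# which graphviz.Source(dot) can render if the graphviz package is installed.
dot = export_graphviz(tree, out_file=None, feature_names=feature_names, filled=True)
```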

4    Clustering for COVID-19 Data [40 points]

In this problem you will analyse COVID data and perform clustering using libraries like scikit-learn and Pandas. Two datasets will be used for this problem:

(a) us-states.csv: This dataset contains information about COVID cases and deaths in each City per date. The relevant columns here are: ['date', 'City', 'cases', 'deaths'].

(b) statelatlong.csv: This dataset contains information about each City and its corresponding Latitude and Longitude coordinates. The relevant columns here are: ['Latitude', 'Longitude', 'City'].

 

Figure 4: COVID data scatter plot

4.1    Import Data and Plotting [5 points]

(a) Import the us-states.csv and statelatlong.csv files into Pandas dataframes. Merge the two dataframes on the 'City' column of us-states.csv so that the merged data contains latitude and longitude columns for each state, in addition to the COVID case information. Print the first 5 rows of the merged dataframe.

(b) Find all the data rows in your merged dataframe for the date 25th March 2020 and print the first 5 rows. For the subsequent questions below, this subset of data will be used and referred to as the sampled data.

(c) Make a weighted scatter plot of the sampled data (from part 4.1(b)) where the x-axis is Latitude, the y-axis is Longitude and the size of each point is scaled according to the number of cases. Attach the scatter plot to your solution PDF. See Figure 4 for an example. (Note: check scatter plots from matplotlib for plotting.)
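The merge and date filter can be tried out on two tiny hand-made frames before loading the real files. The values below are invented, and the date string format is an assumption; check how dates are actually written in us-states.csv.

```python
import pandas as pd

# Tiny made-up frames with the column names listed above.
cases = pd.DataFrame({
    "date": ["2020-03-24", "2020-03-25", "2020-03-25"],
    "City": ["New York", "New York", "California"],
    "cases": [100, 150, 80],
    "deaths": [1, 2, 1],
})
coords = pd.DataFrame({
    "City": ["New York", "California"],
    "Latitude": [43.0, 36.8],      # illustrative coordinates
    "Longitude": [-75.0, -119.4],
})

merged = cases.merge(coords, on="City", how="left")  # keep every row of the case table
sampled = merged[merged["date"] == "2020-03-25"]
print(sampled.head())
```

For the weighted scatter plot, matplotlib's scatter function accepts a per-point marker size through its s argument, so the 'cases' column (possibly rescaled) can be passed there.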

4.2    Geographical Distribution of Cases [20 points]

Now we will use K-means clustering (as taught in class) and its variant called weighted K-means to cluster the data. The K-means algorithm divides a set of $N$ samples $x_n$ for $n = 1, \ldots, N$ into $K$ disjoint clusters, each described by the mean $\mu_k$ of the samples in the cluster. The K-means algorithm aims to choose means (centroids) to minimise the inertia, also called the "within-cluster sum-of-squares criterion," which is mathematically defined as:

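$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2 \tag{1}$$

where $r_{nk} = 1$ if sample $x_n$ is assigned to cluster $k$ and $r_{nk} = 0$ otherwise.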

In weighted K-means, each data point has an associated sample weight. This allows us to assign more weight to samples when computing cluster centers and values of inertia. For example, assigning a weight of 2 to a sample is equivalent to adding a duplicate of that sample to the dataset X. Our goal is then to choose the cluster means to minimize a modified version of the inertia (1):

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, w_n \, \lVert x_n - \mu_k \rVert^2 \tag{2}$$

where $w_n$ is the weight of the $n$th data point.
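The claim that an integer weight acts like duplicating a sample can be checked numerically. The sketch below uses four made-up one-dimensional-like points and scikit-learn's KMeans, whose fit method accepts a sample_weight argument; it evaluates equation (2) by hand and compares it with the fit on an explicitly duplicated dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
w = np.array([3.0, 1.0, 1.0, 1.0])  # made-up sample weights

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
km.fit(X, sample_weight=w)

# Equation (2): sum of w_n * ||x_n - mu_k||^2 over each point's assigned cluster.
assigned_centers = km.cluster_centers_[km.labels_]
J = float(np.sum(w * np.sum((X - assigned_centers) ** 2, axis=1)))

# Weight 3 on the first point behaves like three copies of it.
X_dup = np.vstack([X[[0, 0, 0]], X[1:]])
km_dup = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X_dup)

print(np.sort(km.cluster_centers_[:, 0]), J, km_dup.inertia_)
```

The heavily weighted point pulls its cluster's centroid toward itself (to 0.25 rather than 0.5 on the x-axis), which is the effect the case counts will have on the state clusters.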

Consider three columns of the sampled data (from part 4.1(b)): ['Longitude', 'Latitude', 'cases'] for the subsequent questions. (You can use the K-means clustering function from sklearn to implement both unweighted and weighted K-means clustering.) Use k-means++ initialisation for all the sub-questions below.

(a) Unweighted K-means clustering by location. We will first use K-means with two features ['Longitude', 'Latitude']. An important hyper-parameter in K-means clustering is the number of clusters K. Use the Elbow Method to determine the optimal value of K. Increment the value of K from 1 to 50 with a step size of 1 and plot the K-means loss function versus K, where the loss function is given in (1). Include the plot in your answer.

(b) Unweighted K-means clustering by location. Take the value of K = 15 and perform K-means clustering. In a single plot, make a scatter plot of the data points ('Longitude' (x-axis) vs 'Latitude' (y-axis)) and a scatter plot of the cluster centers. Make sure the data points and centroids have different colors and attach the plot below. Print the names of the states which are clustered together and attach them in the solution PDF.

(c) Weighted K-means by location. We will use weighted K-means here, where the two features are ['Longitude', 'Latitude'] and the weight of each sample corresponds to the ['cases'] feature. Use the Elbow Method to determine the optimal value of K by incrementing the value of K from 1 to 50 with a step size of 1 and plot the weighted K-means loss vs. the value of K, where the loss function is given in (2). Include the resulting plot in your answer.

(d) Weighted K-means by location. Take the value of K = 15 and perform weighted K-means clustering. In a single plot, make a scatter plot of the data ('Longitude' (x-axis) vs 'Latitude' (y-axis)) and a scatter plot of the cluster centers. Make sure the data points and centroids have different colors and attach the plot below. Print the names of the states which are clustered together and attach them in the solution PDF.

(e) Compare the state clustering obtained from unweighted K-means (b) and weighted K-means (d). Which method do you think provides better-defined clusters and why?
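The Elbow Method sweep in parts (a) and (c) amounts to a loop over K that records the fitted loss. A minimal sketch on synthetic blobs (not the COVID data) follows; the upper limit of 10 is chosen only to keep the example fast, whereas the assignment asks for K up to 50.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points standing in for the [Longitude, Latitude] pairs.
X, _ = make_blobs(n_samples=60, centers=4, cluster_std=0.6, random_state=0)

ks = list(range(1, 11))  # the assignment sweeps K up to 50; 10 keeps the sketch fast
losses = []
for k in ks:
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    losses.append(km.inertia_)  # value of the K-means loss in equation (1)

# Plotting losses against ks (e.g. with matplotlib) shows where the curve flattens.
for k, loss in zip(ks, losses):
    print(k, round(loss, 1))
```

For the weighted variant, the same loop applies with sample_weight passed to fit, in which case inertia_ is the weighted sum in equation (2).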

4.3    Growth Rate Modelling and Distribution Analysis [15 points]

In this section we will analyze the growth rate of COVID cases in different states using us-states.csv. For ease of numeric analysis and uniformity, replace the ['date'] column values with integers ranging from 0 to the number of rows in the DataFrame.

 

Figure 5: Growth of cases over time

(a) Plot the ['cases'] vs ['date'] for California and New York in a single plot, as shown in Figure 5. Observe that the number of cases grows exponentially with time in both locations.

(b) Exponential modeling. We will now fit an exponential model $y = Ae^{Bx}$ to model the exponential trend of ['cases'] with respect to time. In other words, we want to find the values of $A$ and $B$ that best fit the data. Note that this model can also be written as $\log y = \log A + Bx$, which reduces the problem to a linear regression task (you can use np.polyfit for this task). Find the values of $A$ and $B$ separately for each state and report them in your answer.

(c) Exponential modeling. Use the above model $y = Ae^{Bx}$ learnt for the states of California and New York to plot the predicted number of cases versus time. Superimpose your plots of this model on the plot of the actual cases versus time obtained in 4.3(a). Attach the combined plot to your answer.

(d) K-means clustering by growth rate. We will use un-weighted K-means here, where the two features are the parameter values [A, B] and each data point corresponds to a location. The clusters will be based here on features [A, B]. We will use the Elbow Method to determine the optimal value of K. Increment the value of K from 1 to 50 with a step size of 1 and plot the un-weighted K-means loss vs the value of K. Attach the obtained plot here.

(e) K-means clustering by growth rate. Take the value of K = 8 and perform un-weighted K-means clustering. In a single plot, make a scatter plot of the data points (A vs B) and a scatter plot of the cluster centers. Make sure the data points and centroids have different colors and attach the plot to your solution PDF. Print the names of the states which are clustered together and attach them in the solution PDF.

(f) Examine the clusters that you obtained in part (e). Can you explain why particular states are clustered together?
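The log-linear fit described in part (b) can be verified on synthetic data where the true parameters are known. In the sketch below the counts are generated exactly from an exponential with A = 5 and B = 0.2 (values chosen for this example), so the fit should recover them.

```python
import numpy as np

# Synthetic counts generated from y = A * exp(B * x) with A = 5 and B = 0.2.
x = np.arange(20)
y = 5.0 * np.exp(0.2 * x)

# Fit log y = log A + B x; np.polyfit returns coefficients from highest degree down.
B, log_A = np.polyfit(x, np.log(y), deg=1)
A = np.exp(log_A)
y_hat = A * np.exp(B * x)  # model predictions to overlay on the observed curve
print(round(float(A), 3), round(float(B), 3))  # 5.0 0.2
```

On real data the logarithm is undefined for days with zero cases, so such rows need to be handled (for example, excluded) before fitting.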