
COMP5318 – Machine Learning and Data Mining

Semester 2, 2022

Assignment 1: Classification

Key information

Deadlines

Submission: 11:59pm, 16 September, 2022 (Friday week 7, Sydney time)

Late submissions policy

Late submissions are allowed for up to 3 days late. A penalty of 5% per day late will apply. Assignments more than 3 days late will not be accepted (i.e. will get 0 marks). The day cut-off time is 11:59pm.

Marking

This assignment is worth 15 marks = 15% of your final mark.

Your code will be marked for correctness. A few marks will be allocated for style: meaningful variable names and comments.

The assignment can be completed individually or in groups of 2 students. No more than 2 students are allowed. See the submission details section for more information about how to submit.

Submission

This assignment must be written in Python in the Jupyter Notebook environment. A Jupyter Notebook template is provided. Your implementation should use the same suite of libraries that we have used during the tutorials, such as scikit-learn, numpy and pandas.

The assignment will be submitted in Canvas. You need to submit two versions of your code: .ipynb and .pdf. There are two submission boxes – “Assignment 1 ipynb” for the .ipynb file and “Assignment 1 pdf” for the .pdf file.

Before you submit, if you work in a group, you will need to create a group in Canvas. Under the “People” page on Canvas, select the “A1 Group” tab. You and your group partner should choose one of the empty groups listed under this tab, and both join it. Groups have a maximum of 2 members. If you are completing the assignment individually, you don't need to create a group.

The submission file should contain the SID number(s) and should be named like this:

•    a1-SID.ipynb (.pdf) for a student working individually, where SID is the student’s SID number

•    a1-SID1-SID2.ipynb (.pdf) for a group of 2 students, where SID1 and SID2 are the SIDs of the two students

Task

In this assignment you will investigate a real dataset by implementing multiple classification algorithms. You will first pre-process the dataset by replacing missing values and normalising the dataset with a min-max scaler. You will then evaluate the performance of multiple classification algorithms: K-Nearest Neighbour, Logistic Regression, Naïve Bayes, Decision Tree, Support Vector Machine, Bagging, AdaBoost, Gradient Boosting and Random Forest, using the stratified 10-fold cross validation method. You will also apply a grid search to find the best parameters for some of these classifiers.

1. Data loading, pre-processing and printing

The dataset for this assignment is the Breast Cancer Wisconsin dataset. It contains 699 examples described by 9 numeric attributes. There are two classes: class1, corresponding to benign breast cancer tumours, and class2, corresponding to malignant breast cancer tumours. The features are computed from a digitized image of a biopsy sample of breast tissue for a subject.

The dataset should be downloaded from Canvas: breast-cancer-wisconsin.csv. This file includes the attribute (feature) headings and each row corresponds to one individual. Missing attribute values in the dataset are recorded with a ‘?’.

You will need to pre-process the dataset before you can apply the classification algorithms. Three types of pre-processing are required: filling in the missing values, normalisation and changing the class values. After this is done, you need to print the first 10 rows of the pre-processed dataset.

1.    Filling in the missing attribute values - The missing attribute values should be replaced with the mean value of the column using sklearn.impute.SimpleImputer.

2.    Normalising the data - Normalisation of each attribute should be performed using a min-max scaler to normalise the values between [0,1] with sklearn.preprocessing.MinMaxScaler.

3.    Changing the  class values - The  classes class1  and class2 should  be changed to 0  and  1 respectively.

4.    Print the first 10 rows of the pre-processed dataset. The feature values should be formatted to 4 decimal places using .4f; the class value is an integer.

For example, if your normalised data looks like this:

Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class
0.1343 | 0.4333 | 0.5432 | 0.8589 | 0.3737 | 0.9485 | 0.4834 | 0.9456 | 0.4329 | 0
0.1345 | 0.4432 | 0.4567 | 0.4323 | 0.1111 | 0.3456 | 0.3213 | 0.8985 | 0.3456 | 1
0.4948 | 0.4798 | 0.2543 | 0.1876 | 0.9846 | 0.3345 | 0.4567 | 0.4983 | 0.2845 | 0

Then your program should print:

0.1343,0.4333,0.5432,0.8589,0.3737,0.9485,0.4834,0.9456,0.4329,0

0.1345,0.4432,0.4567,0.4323,0.1111,0.3456,0.3213,0.8985,0.3456,1

0.4948,0.4798,0.2543,0.1876,0.9846,0.3345,0.4567,0.4983,0.2845,0

(You need to print the first 10 rows, not just the first 3.)

Your program must be able to correctly infer X and y from the file. Do not hard-code the number of features and examples - do not set them to 699 and 9 as in the breast cancer dataset. We will test your code on a dataset with a different number of features and examples than the given breast cancer dataset.
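The pre-processing steps above can be sketched as follows. This is only an illustrative sketch: the inline sample (with made-up columns A and B) stands in for breast-cancer-wisconsin.csv, and with the real file you would instead call pd.read_csv("breast-cancer-wisconsin.csv", na_values="?").

```python
import io
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Stand-in for breast-cancer-wisconsin.csv: headers, '?' for missing
# values, class1/class2 labels in the last column.
sample = io.StringIO(
    "A,B,Class\n"
    "1,7,class1\n"
    "?,3,class2\n"
    "5,?,class1\n"
)
df = pd.read_csv(sample, na_values="?")

# Infer X and y from the file -- do not hard-code their sizes
X = df.iloc[:, :-1]
y = df.iloc[:, -1].map({"class1": 0, "class2": 1})

# 1. Replace missing values with the column mean
X = SimpleImputer(strategy="mean").fit_transform(X)

# 2. Normalise each attribute to [0, 1] with a min-max scaler
X = MinMaxScaler().fit_transform(X)

# 3. Print the first 10 rows: features to 4 dp, class as an integer
for row, label in zip(X[:10], y[:10]):
    print(",".join(f"{v:.4f}" for v in row) + f",{label}")
```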

2. Defining functions for the classification algorithms

Part 1: Cross validation without parameter tuning

You will now apply multiple classifiers to the pre-processed dataset, in particular: K-Nearest Neighbour, Logistic Regression, Naïve Bayes, Decision Tree, Bagging, Ada Boost and Gradient Boosting. All classifiers should use the sklearn modules from the tutorials. All random states in the classifiers should be set to random_state=0.

You need to evaluate the performance of these classifiers using 10-fold stratified cross validation from sklearn.model_selection.StratifiedKFold with these options:

cvKFold=StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

You will need to pass cvKFold (the stratified folds) as an argument when calculating the cross-validation accuracy, not cv=10 as in the tutorials. This ensures that random_state=0.

For each classifier, write a function that accepts the required input and returns the average cross-validation score:

def exampleClassifier(X, y, [options]):
    ...
    return scores.mean()

where X contains the attribute values and y contains the class (as in the tutorial exercises). More specifically, the headers of the functions for the classifiers are given below:

K-Nearest Neighbour

def kNNClassifier(X, y, k):
    ...
    return scores.mean()

It should use the KNeighborsClassifier from sklearn.neighbors.
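A minimal sketch of this function shape, assuming X and y have already been pre-processed as in section 1 and that the folds are the cvKFold object defined above:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

def kNNClassifier(X, y, k):
    knn = KNeighborsClassifier(n_neighbors=k)
    # Pass the stratified folds, not cv=10, so random_state=0 is used
    scores = cross_val_score(knn, X, y, cv=cvKFold)
    return scores.mean()
```

The other Part 1 classifiers follow the same pattern, swapping in the relevant sklearn estimator.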

Logistic Regression

def logregClassifier(X, y):
    ...
    return scores.mean()

It should use LogisticRegression from sklearn.linear_model.

Naïve Bayes

def nbClassifier(X, y):
    ...
    return scores.mean()

It should use GaussianNB from sklearn.naive_bayes.

Decision Tree

def dtClassifier(X, y):
    ...
    return scores.mean()

It should use DecisionTreeClassifier from sklearn.tree, with information gain (the entropy criterion).

Ensembles: Bagging, Ada Boost and Gradient Boosting

def bagDTClassifier(X, y, n_estimators, max_samples, max_depth):
    ...
    return scores.mean()

def adaDTClassifier(X, y, n_estimators, learning_rate, max_depth):
    ...
    return scores.mean()

def gbClassifier(X, y, n_estimators, learning_rate):
    ...
    return scores.mean()

These functions should implement Bagging, Ada Boost and Gradient Boosting using BaggingClassifier, AdaBoostClassifier and GradientBoostingClassifier from sklearn.ensemble. Bagging and Ada Boost should combine decision trees and use information gain.
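A hedged sketch of the three ensemble functions, under the same assumption that the stratified folds are the cvKFold object defined earlier. Bagging and Ada Boost wrap a decision tree that splits on information gain (entropy):

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

def bagDTClassifier(X, y, n_estimators, max_samples, max_depth):
    # Bagging over entropy-based decision trees
    base = DecisionTreeClassifier(criterion="entropy", max_depth=max_depth,
                                  random_state=0)
    bag = BaggingClassifier(base, n_estimators=n_estimators,
                            max_samples=max_samples, random_state=0)
    return cross_val_score(bag, X, y, cv=cvKFold).mean()

def adaDTClassifier(X, y, n_estimators, learning_rate, max_depth):
    # AdaBoost over entropy-based decision trees
    base = DecisionTreeClassifier(criterion="entropy", max_depth=max_depth,
                                  random_state=0)
    ada = AdaBoostClassifier(base, n_estimators=n_estimators,
                             learning_rate=learning_rate, random_state=0)
    return cross_val_score(ada, X, y, cv=cvKFold).mean()

def gbClassifier(X, y, n_estimators, learning_rate):
    gb = GradientBoostingClassifier(n_estimators=n_estimators,
                                    learning_rate=learning_rate,
                                    random_state=0)
    return cross_val_score(gb, X, y, cv=cvKFold).mean()
```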

Part 2: Cross validation with parameter tuning

For two other classifiers, Linear SVM and Random Forest, we would like to find the best parameters using grid search with 10-fold stratified cross validation (GridSearchCV in sklearn).

The data should be split into training and test subsets using train_test_split from sklearn.model_selection with stratification and random_state=0 (as in the tutorials but with random_state=0).

You will need to pass cvKFold (the stratified folds) as an argument to GridSearchCV, not cv=10 as in the tutorials. This ensures that random_state=0.

Write the following functions:

Linear SVM

def bestLinClassifier(X, y):
    ...
    return (appropriate values so that the required printing can be done)

It should use SVC from sklearn.svm.

The grid search should consider the following values for the parameters C and gamma:

C = {0.001, 0.01, 0.1, 1, 10, 100}

gamma = {0.001, 0.01, 0.1, 1, 10, 100}

The function should return appropriate values so that the best parameters found, the best cross-validation accuracy and the test set accuracy can be printed when calling this function (see the next section).
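One possible shape for this function, assuming it returns (best C, best gamma, best CV accuracy, test set accuracy); the exact return tuple is up to you, as long as the printing in section 3 can be done:

```python
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     train_test_split)
from sklearn.svm import SVC

cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

def bestLinClassifier(X, y):
    # Stratified train/test split with random_state=0, as required
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0)
    param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100],
                  "gamma": [0.001, 0.01, 0.1, 1, 10, 100]}
    # gamma is searched as the spec requires, although it has no
    # effect on a linear kernel
    grid = GridSearchCV(SVC(kernel="linear"), param_grid, cv=cvKFold)
    grid.fit(X_train, y_train)
    return (grid.best_params_["C"], grid.best_params_["gamma"],
            grid.best_score_, grid.score(X_test, y_test))
```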

Random Forest

def bestRFClassifier(X, y):

It should use RandomForestClassifier from sklearn.ensemble with information gain and max_features set to 'sqrt'.

The grid search should consider the following values for the parameters n_estimators and max_leaf_nodes:

n_estimators = {10, 20, 30, 50, 100}

max_leaf_nodes = {4, 10, 16, 20, 30}

The function should return appropriate values so that the best parameters found, the best cross-validation accuracy and the test set accuracy can be printed when calling this function (see the next section).
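A hedged sketch under the same assumptions, returning (best n_estimators, best max_leaf_nodes, best CV accuracy, test set accuracy); again, the exact return shape is up to you:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     train_test_split)

cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

def bestRFClassifier(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0)
    param_grid = {"n_estimators": [10, 20, 30, 50, 100],
                  "max_leaf_nodes": [4, 10, 16, 20, 30]}
    # Information gain (entropy) and max_features='sqrt', as required
    rf = RandomForestClassifier(criterion="entropy", max_features="sqrt",
                                random_state=0)
    grid = GridSearchCV(rf, param_grid, cv=cvKFold)
    grid.fit(X_train, y_train)
    return (grid.best_params_["n_estimators"],
            grid.best_params_["max_leaf_nodes"],
            grid.best_score_, grid.score(X_test, y_test))
```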

3. Running the classifiers and printing the results

Run the classifiers from the previous section on the pre-processed dataset and print the results.

For Part 1, set the parameters as follows (this is already done for you in the template):

#KNN

k=3

#Bagging

bag_n_estimators = 50

bag_max_samples = 100

bag_max_depth = 5

#AdaBoost

ada_n_estimators = 50

ada_learning_rate = 0.5

ada_bag_max_depth = 5

#GB

gb_n_estimators = 50

gb_learning_rate = 0.5

The printing should look like this but with the correct numbers (these are random numbers):

kNN average cross-validation accuracy: 0.8234

LR average cross-validation accuracy: 0.8123

NB average cross-validation accuracy: 0.7543

DT average cross-validation accuracy: 0.6345

Bagging average cross-validation accuracy: 0.8765

AdaBoost average cross-validation accuracy: 0.7165

GB average cross-validation accuracy: 0.9054

SVM best C: 0.0100

SVM best gamma: 10.0000

SVM cross-validation accuracy: 0.8676

SVM test set accuracy: 0.8098

RF best n_estimators: 10

RF best max_leaf_nodes: 16

RF cross-validation accuracy: 0.8600

RF test set accuracy: 0.8321

Format all numbers to 4 decimal places using .4f, except n_estimators and max_leaf_nodes which should be formatted as integers.

Academic honesty – very important

Please read the University policy on Academic Honesty very carefully:

https://sydney.edu.au/students/academic-integrity.html

Plagiarism (copying from another student, website or other sources), making your work available to another student to copy, and engaging another person to complete the assignments instead of you (for payment or not) are all examples of academic dishonesty. Note that when there is copying between students, both students are penalised – the student who copies and the student who makes their work available for copying.

The University penalties are severe and include: 1) a permanent record of academic dishonesty on your student file, 2) mark deduction, ranging from 0 for the assignment to Fail for the course and 3) expulsion from the University and cancelling of your student visa.

If there is a suspected case, the investigation takes several months. Your mark will not be finalised until the investigation is completed. This may create problems enrolling in other courses next semester (COMP5318 is a pre-requisite for many courses) or delay your graduation. Going through the investigation is also very stressful.

In addition, the Australian Government passed new legislation last year (the Prohibiting Academic Cheating Services Bill) that makes it a criminal offence to provide or advertise academic cheating services - the provision or undertaking of work for students which forms a substantial part of a student's assessment task.

Do not confuse legitimate co-operation and cheating! You can discuss the assignment with other students, but you (if you work individually) or your group (if you work in pairs) must write your own code.

To detect code similarity in this assignment, we will use TurnItIn and MOSS which are extremely good. If you cheat, the chances that you will be caught are very high.

Do not even think about engaging in plagiarism or academic dishonesty – it is not worth it. Be smart: don't risk your future or break the law!