A1
Objective
This assignment helps you build the coding skills needed to implement your first ML model and the mindset needed to design a proper evaluation procedure for it.
Datasets
You are given three datasets:
Website Phishing (https://canvas.auckland.ac.nz/courses/103626/files/12407848?wrap=1) (https://canvas.auckland.ac.nz/courses/103626/files/12407848/download?download_frd=1)
Breast Cancer Prediction (BCP) (https://canvas.auckland.ac.nz/courses/103626/files/12407849?wrap=1) (https://canvas.auckland.ac.nz/courses/103626/files/12407849/download?download_frd=1)
Arrhythmia (https://canvas.auckland.ac.nz/courses/103626/files/12407892?wrap=1) (https://canvas.auckland.ac.nz/courses/103626/files/12407892/download?download_frd=1)
For each dataset, the header row gives the feature names. The last column, "Class", stores the label of each example in the dataset.
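As a rough illustration of the layout described above (header row with feature names, "Class" as the last column), the following sketch loads a CSV, separates features from labels, and fills missing values with training-set means, which is the option the first task mentions. The inline sample data, the "?" missing-value marker, the helper names (load_dataset, column_means, impute), and the simple 3/1 split are all assumptions made for illustration; check the actual files before relying on any of them.

```python
import csv
import io

# Hypothetical sample standing in for a real file; the "?" marker for
# missing values is an assumption, not something the assignment states.
SAMPLE = """f1,f2,Class
1.0,?,0
2.0,4.0,1
?,6.0,1
3.0,8.0,0
"""

def load_dataset(handle, missing="?"):
    """Return (feature_names, X, y); missing entries are stored as None."""
    reader = csv.reader(handle)
    header = next(reader)
    X, y = [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        X.append([None if v.strip() in (missing, "") else float(v)
                  for v in row[:-1]])
        y.append(row[-1].strip())  # last column holds the label
    return header[:-1], X, y

def column_means(X_train):
    """Mean of each feature over the observed training values only."""
    means = []
    for j in range(len(X_train[0])):
        observed = [row[j] for row in X_train if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means

def impute(X, means):
    """Replace None entries with the supplied per-feature means."""
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in X]

names, X, y = load_dataset(io.StringIO(SAMPLE))
X_train, X_test = X[:3], X[3:]      # toy split for illustration only
means = column_means(X_train)       # statistics come from training data
X_train_filled = impute(X_train, means)
X_test_filled = impute(X_test, means)  # test rows reuse the training means
```

Computing the means on the training portion only, and reusing them for the test portion, keeps information from the test examples out of the preprocessing step. With a real file, `load_dataset` would be called on an opened file handle instead of `io.StringIO(SAMPLE)`.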
Tasks
You will need to use Python to perform the following tasks:
1. Load the datasets and handle any missing values in a suitable way, and describe how you did it. One option is to replace each missing value with the mean of that feature computed on the training set.
2. Implement (1) a decision stump, (2) an unpruned decision tree, and (3) a pruned decision tree, and apply all three to each dataset. Implement these methods from scratch; do not use ready-made implementations such as those available in scikit-learn. You may use pre-pruning and/or post-pruning to obtain the pruned decision tree. Explain the pruning techniques you used.
3. Select your hyperparameters, such as tree depth, in a principled way, and explain how you did it. Describe what you observe across the different datasets and discuss possible reasons.
4. Compare the three methods from task 2 and determine whether any of them performs significantly worse on each dataset. Report the p-values of the significance tests. Explain why the worst method performs worse than the others.
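To make tasks 2 and 4 above more concrete, here is a minimal sketch of a decision stump written without any ML library, followed by one possible paired significance test on per-fold scores. Several choices here are assumptions rather than requirements of the assignment: the Gini impurity as split criterion, midpoint thresholds on numeric features, and an exact sign-flip permutation test (a paired t-test would be another common choice). The function names and toy data are hypothetical.

```python
from collections import Counter
from itertools import product

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    """Most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y):
    """Find the single (feature, threshold) split with lowest weighted Gini."""
    best = None  # (score, feature, threshold, left_label, right_label)
    n = len(y)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, t, majority(left), majority(right))
    if best is None:  # every feature is constant: predict the majority class
        label = majority(y)
        return (0, float("inf"), label, label)
    return best[1:]

def predict_stump(stump, X):
    """Predict a label for each row using the fitted stump."""
    j, t, left_label, right_label = stump
    return [left_label if row[j] <= t else right_label for row in X]

def sign_flip_p_value(scores_a, scores_b):
    """Two-sided exact sign-flip test on paired per-fold score differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = total = 0
    # Enumerate every way of flipping the sign of each paired difference.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float error
            hits += 1
    return hits / total

# Toy usage with made-up data and made-up fold accuracies.
X_toy = [[1.0, 5.0], [2.0, 3.0], [3.0, 9.0], [4.0, 1.0]]
y_toy = ["a", "a", "b", "b"]
stump = fit_stump(X_toy, y_toy)
preds = predict_stump(stump, [[0.5, 0.0], [10.0, 0.0]])
p_value = sign_flip_p_value([0.9, 0.8, 0.85, 0.9], [0.7, 0.7, 0.8, 0.75])
```

The sign-flip test enumerates all 2^k sign patterns for k folds, so it is only practical for a modest number of folds; with very few folds the smallest achievable p-value is also limited. A full decision tree would apply the same split search recursively to each side, with a depth limit or similar rule acting as pre-pruning.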
Submission
You need to submit:
1. The raw Jupyter notebook (.ipynb), AND
2. An HTML generated from the notebook.
The notebook needs to be clearly structured according to the assignment tasks listed above. Each part should contain a header pointing out which task it contains, your code with results, and one paragraph containing your answers to the questions. You won't get any marks for your code and results alone; the discussion is the most important part!
The assignment must be submitted to Canvas. It will be run through Turnitin, so make sure that everything you submit has been done by you.
Note that we will deduct marks if the solution is not submitted in the correct format. You can only submit .html and .ipynb files.
2025-06-27