COMP60711 - Part 2 Coursework 2
COMP60711 - Part 2 Coursework 2 (COURSEWORK 4)
Course Unit: COMP60711: Data Engineering
Reminders
Please state any assumptions clearly and provide evidence to justify your answers.
Jupyter notebooks use markdown. A brief summary of how to use markdown can be seen here. Otherwise, please refer to the brief guide on Blackboard.
You must cite any sources used, from web pages to academic papers and textbooks.
Please ensure your code has no errors, and that the output is shown in your submitted version.
We have added some general notebooks on Blackboard to cover the basics of plotting in Python, Jupyter notebooks, and Anaconda.
Some questions require a mixture of code and text to answer the question. Marks are awarded based on the output of your code (i.e. graphs) and the explanation provided, not on the code itself.
Q1: Pre-processing & Feature Importance (9 marks)
This question will use the "genes-leukemia.csv" dataset available on Blackboard. For some background information about this dataset, see https://www.kdnuggets.com/data_mining_course/data/genes-leukemia-description.txt.
The sub-questions involve inspecting and pre-processing the data in order to use a decision tree. We will then look at which features are deemed important for prediction, and how removing important features affects the tree structure.
You are expected to use pandas for this question. This is not a requirement, but the question may be more difficult without it.
Q1.1 (1 mark)
Count the number of records/examples where the "Treatment_Response" feature is non-missing. Describe these examples in terms of the other features (Year from XXXX to YYYY, Gender = X, etc.).
Hint:
You need to ensure that you are looking at all of the data. By default, some of the columns may be truncated when displayed, in which case you should adjust the display settings, e.g. through pd.set_option("display.max_columns", 100).
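As a minimal sketch of the hint above, the snippet below widens the pandas display and counts the non-missing values in one column. The toy DataFrame and its values are invented for illustration and are not taken from the real file. It also assumes missing values are stored as NaN; if the real file uses a placeholder string instead, that would need handling when loading (for example via the na_values argument of pd.read_csv).

```python
import numpy as np
import pandas as pd

# Show up to 100 columns instead of truncating wide frames with "..."
pd.set_option("display.max_columns", 100)

# Toy stand-in for the real data (values are invented for illustration)
toy = pd.DataFrame(
    {
        "SNUM": [1, 2, 3, 4],
        "Gender": ["F", "M", np.nan, "M"],
        "Treatment_Response": ["Success", np.nan, "Failure", np.nan],
    }
)

# Boolean mask of the rows where the target is present
has_response = toy["Treatment_Response"].notna()
n_non_missing = int(has_response.sum())

# Summarise every feature for just those rows
subset = toy[has_response]
summary = subset.describe(include="all")

print(n_non_missing)
print(summary)
```

Calling describe(include="all") reports counts, unique values and numeric ranges in one table, which may be a convenient starting point for describing the subset.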
Q1.1 Answer
In [ ]:
Q1.2 (1 mark)
Explain why it is not correct to build predictive models for "Treatment_Response" using records where it is missing.
Q1.2 Answer
For Q1.3 to Q1.6 (inclusive), use only the subset of data where "Treatment_Response" is non-missing.
Q1.3 (1 mark)
Remove the features whose values are either all identical or all missing. Which sample fields should you keep?
Hints:
For simplicity in the following questions, also remove "FAB if AML".
"SNUM" should be the index.
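One possible way to approach this step is sketched below on a toy DataFrame. All column names except "SNUM" and "FAB if AML" are hypothetical. The sketch relies on nunique() ignoring NaN by default, so a column that is constant apart from some missing entries would also be dropped; whether that is desirable is a judgment call.

```python
import numpy as np
import pandas as pd

# Toy stand-in; "Source", "Constant" and "Empty" are hypothetical column names
toy = pd.DataFrame(
    {
        "SNUM": [10, 11, 12],
        "Source": ["A", "B", "A"],
        "Constant": ["x", "x", "x"],          # same value in every row
        "Empty": [np.nan, np.nan, np.nan],    # missing in every row
        "FAB if AML": ["M1", np.nan, "M2"],
    }
).set_index("SNUM")  # use the sample number as the index

# nunique() skips NaN, so an all-missing column has 0 distinct values
# and a constant column has 1; neither can help a classifier.
n_unique = toy.nunique()
uninformative = n_unique[n_unique <= 1].index.tolist()

# Drop the uninformative columns, plus "FAB if AML" as the hint suggests
cleaned = toy.drop(columns=uninformative + ["FAB if AML"])

print(uninformative)
print(cleaned.columns.tolist())
```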
Q1.3 Answer
In [ ]:
Q1.4 (1 mark)
Fit a decision tree (DecisionTreeClassifier) with default settings to the pre-processed data.
As we have only a small amount of data, we should use leave-one-out cross-validation to assess the performance more meaningfully. Report the accuracy for each fold and the overall mean accuracy obtained.
Important: Please use random_state=42 where necessary to ensure reproducible results.
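A minimal sketch of leave-one-out cross-validation with scikit-learn follows. The data here comes from make_classification purely as a stand-in, so the numbers say nothing about the leukemia dataset. Note also that scikit-learn trees expect numeric input, so any categorical columns in the real data would need encoding first.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the coursework dataset)
X, y = make_classification(n_samples=30, n_features=5, random_state=42)

clf = DecisionTreeClassifier(random_state=42)

# LeaveOneOut yields one fold per sample: train on n-1 rows, test on 1
fold_scores = cross_val_score(clf, X, y, cv=LeaveOneOut(), scoring="accuracy")
mean_accuracy = fold_scores.mean()

print(fold_scores)    # each entry is 0.0 or 1.0 (one test sample per fold)
print(mean_accuracy)
```

Because every fold tests a single example, each per-fold accuracy is either 0 or 1, and the mean is the fraction of samples predicted correctly.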
Q1.4 Answer
In [ ]:
Q1.5 (3 marks)
Split the data into a training and test set (using a 75:25 ratio). Once again, fit a decision tree to this data, and report the accuracy. Visualize the tree (using tree.plot_tree), and state which feature/predictor is the most important.
Then remove this top predictor and fit the tree again. Again, report the accuracy and visualize the tree. Compare the accuracy between the two trees. Explain why the tree is different with this feature removed.
Important: Please use random_state=3 where necessary to ensure reproducible results.
Hint:
You need to ensure that the original feature names are visible in the tree.
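The workflow described in this question might look roughly like the sketch below. The data is synthetic and the feature names f0 to f3 are hypothetical; the helper fit_and_score is introduced only for illustration. Passing feature_names to plot_tree is what keeps the original column names visible in the drawing.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in data with hypothetical feature names
X_arr, y = make_classification(
    n_samples=80, n_features=4, n_informative=2, n_redundant=0, random_state=3
)
X = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])


def fit_and_score(features):
    """Fit a default tree on a 75:25 split and return it with its test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, y, test_size=0.25, random_state=3
    )
    model = DecisionTreeClassifier(random_state=3).fit(X_train, y_train)
    return model, model.score(X_test, y_test)


# Tree on all features, and the feature with the highest importance
full_tree, full_acc = fit_and_score(X)
top_feature = X.columns[full_tree.feature_importances_.argmax()]

# Tree refitted without that feature
reduced = X.drop(columns=[top_feature])
reduced_tree, reduced_acc = fit_and_score(reduced)

# Draw both trees side by side with readable feature names
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
plot_tree(full_tree, feature_names=list(X.columns), filled=True, ax=axes[0])
plot_tree(reduced_tree, feature_names=list(reduced.columns), filled=True, ax=axes[1])

print(top_feature, full_acc, reduced_acc)
```

Here the top feature is taken from feature_importances_; reading it off the root split of the plotted tree is another reasonable choice, and the two need not always agree.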
Q1.5 Answer
In [ ]:
Q1.6 (2 marks)
Which tree do you think is more generalizable? You may want to more thoroughly compare the trees (readability, sensitivity/specificity, structure simplicity, etc.).
Q1.6 Answer
Q2: Decision Boundaries (4 marks)
In this question, we will visualize the decision boundaries formed by three simple classifiers on an example dataset.
Q2.1 (4 marks)
We have provided code below to produce the data and to create the decision boundary. You will need to run this code using the following models:
1. "ZeroR" classifier - sklearn.dummy.DummyClassifier using the "most_frequent" strategy.
2. KNN classifier - sklearn.neighbors.KNeighborsClassifier
3. Decision tree classifier - sklearn.tree.DecisionTreeClassifier
You will need to modify the code to output the accuracy for each of the models. Using both this information and the visualized decision boundaries, explain the performance of these algorithms. A brief explanation of the classifiers will be required for this.
Hints:
Although not necessary, the use of further visualizations, performance measures, or even datasets may help to support your discussion.
Use the decision boundaries as a reference point to explain the differences between the classifiers.
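As a rough sketch of how the accuracy output could be added, the loop below fits the three models and records each test accuracy. It does not reproduce the provided data or plotting code: the make_moons dataset, the train/test split and the random seeds are all assumptions made for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in 2-D dataset (an assumption, not the coursework data)
X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "ZeroR": DummyClassifier(strategy="most_frequent"),
    "KNN": KNeighborsClassifier(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns mean accuracy on the given data for classifiers
    accuracies[name] = model.score(X_test, y_test)
    print(f"{name}: {accuracies[name]:.3f}")

# ZeroR ignores the inputs and always predicts the majority training class
zero_r_predictions = set(models["ZeroR"].predict(X_test))
```

Since ZeroR predicts a single class everywhere, its decision "boundary" is one uniform region, which makes it a useful baseline for the other two.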
Q2.1 Answer
In [ ]:
Q3: Training Time Comparison (4 marks)
Q3.1 (2 marks)
Plot the training time for both DecisionTreeClassifier and GaussianNB against the data size. A function to generate the data is provided to you, which takes the size as its only argument.
Explain what you observe and your understanding in terms of training time and data size (include a graph). Consider algorithm implementation and potential stochasticity in running times.
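One way to collect the timings is sketched below. The provided data function is not shown here, so make_toy_data is a hypothetical stand-in with the same single required argument, and median_fit_time is likewise an illustrative helper. Each fit is repeated and the median taken, as a simple guard against run-to-run noise.

```python
import time

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def make_toy_data(size, n_features=10, seed=0):
    """Hypothetical stand-in for the provided data-generating function."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(size, n_features))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y


def median_fit_time(make_model, X, y, repeats=3):
    """Fit a fresh model several times and return the median wall-clock time."""
    times = []
    for _ in range(repeats):
        model = make_model()
        start = time.perf_counter()
        model.fit(X, y)
        times.append(time.perf_counter() - start)
    return float(np.median(times))


sizes = [500, 1000, 2000]
timings = {"DecisionTreeClassifier": [], "GaussianNB": []}
for size in sizes:
    X, y = make_toy_data(size)
    timings["DecisionTreeClassifier"].append(
        median_fit_time(lambda: DecisionTreeClassifier(random_state=0), X, y)
    )
    timings["GaussianNB"].append(median_fit_time(GaussianNB, X, y))

print(timings)
```

The two lists in timings can then be plotted against sizes, for instance with matplotlib's plt.plot, to produce the required graph.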
Q3.1 Answer
In [ ]:
Q3.2 (2 marks)
What do you think would happen if we continue increasing the number of instances? Which of the algorithms would be more suitable for a very large number of instances and why?
Consider the algorithms' complexity and how they scale.
Q3.2 Answer
Q4: Memory Usage Comparison (3 marks)
Q4.1 (3 marks)
Plot the memory usage of the DecisionTree model against the data size. Explain the memory usage of the model (including a graph in your answer).
You should use the same create_data() function provided for Q3, and ensure that you have downloaded memory.py from Blackboard in order to load the measure_memory() function.
from memory import measure_memory
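The signature of measure_memory() is not shown here, so the sketch below does not call it. Instead it uses two proxies that are assumptions on our part: the number of nodes in the fitted tree and the size of the pickled model in bytes. The make_noisy_data function is a hypothetical stand-in for create_data(), with label noise added so that the tree keeps growing as the data size increases.

```python
import pickle

import numpy as np
from sklearn.tree import DecisionTreeClassifier


def make_noisy_data(size, n_features=10, seed=0):
    """Hypothetical stand-in for create_data(); about 10% of labels are flipped."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(size, n_features))
    y = (X[:, 0] > 0).astype(int)
    flip = rng.random(size) < 0.1
    y[flip] = 1 - y[flip]
    return X, y


def pickled_size_bytes(model):
    """Proxy for model memory: length of the serialised model in bytes."""
    return len(pickle.dumps(model))


sizes = [200, 800, 3200]
node_counts, model_bytes = [], []
for size in sizes:
    X, y = make_noisy_data(size)
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    node_counts.append(tree.tree_.node_count)   # nodes stored by the model
    model_bytes.append(pickled_size_bytes(tree))

print(list(zip(sizes, node_counts, model_bytes)))
```

An unpruned tree stores one record per node, so tracking the node count alongside the measured memory may help explain the shape of the resulting curve.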
Q4.1 Answer
In [ ]:
2023-10-30
Data Engineering