
COMP60711 - Part 2 Coursework 2 (COURSEWORK 4)

Course Unit: COMP60711: Data Engineering

Reminders

- Please make clear any assumptions and provide evidence to justify your answers.

- Jupyter notebooks use markdown. A brief summary of how to use markdown can be seen here. Otherwise, please refer to the brief guide on Blackboard.

- You must cite any sources used, from web pages to academic papers and textbooks.

- Please ensure your code has no errors, and that the output is shown in your submitted version.

- We have added some general notebooks on Blackboard to cover the basics of plotting in Python, Jupyter notebooks, and Anaconda.

- Some questions require a mixture of code and text to answer. Marks are awarded based on the output of your code (e.g. graphs) and the explanation provided, not on the code itself.

Q1: Pre-processing & Feature Importance (9 marks)

This question will use the "genes-leukemia.csv" dataset available on Blackboard. For some background information about this dataset, see https://www.kdnuggets.com/data_mining_course/data/genes-leukemia-description.txt. The sub-questions will involve inspecting and pre-processing the data in order to use a decision tree. We will then look at which features are deemed important for prediction, and at how removing important features affects tree structure.

It is expected that you will use pandas for this question, though this is not a requirement (the question may be more difficult if you do not).

Q1.1 (1 mark)

Count the number of records/examples where the "Treatment_Response" feature is non-missing. Describe these examples in terms of the other features (Year from XXXX to YYYY, Gender = X, etc.).

Hint:

- You need to ensure that you are looking at all of the data. By default, some of the columns may be truncated, in which case you should adjust this (through e.g. pd.set_option("display.max_columns", 100)).

Q1.1 Answer

In [ ]:
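'''
A sketch of one possible approach, not a model answer. It assumes
the CSV file sits in the working directory and that missing values
are blank in the file, so pandas reads them in as NaN.
'''
import pandas as pd

# Show all columns rather than a truncated view
pd.set_option("display.max_columns", 100)

df = pd.read_csv("genes-leukemia.csv")

# Keep only the records where Treatment_Response is non-missing
responded = df[df["Treatment_Response"].notna()]
print(len(responded))

# Summarise the other features for these records
responded.describe(include="all")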

Q1.2 (1 mark)

Explain why it is not correct to build predictive models for "Treatment_Response" using records where it is missing.

Q1.2 Answer

For Q1.3-Q1.6 (inclusive), use only the subset of data where "Treatment_Response" is non-missing.

Q1.3 (1 mark)

Remove the features whose values are either all the same or all missing. Which sample fields should you keep?

Hints:

- For simplicity in the following questions, also remove "FAB_if_AML".

- "SNUM" should be the index.

Q1.3 Answer

In [ ]:
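'''
A sketch, assuming df holds the non-missing Treatment_Response
subset from Q1.1 and that the field from the hint appears as
FAB_if_AML in the CSV header.
'''
# Use SNUM as the index
df = df.set_index("SNUM")

# Remove FAB_if_AML for simplicity (per the hint)
df = df.drop(columns=["FAB_if_AML"])

# Drop columns that are entirely missing
df = df.dropna(axis=1, how="all")

# Drop columns where every value is identical
constant = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
df = df.drop(columns=constant)

print(df.columns)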

Q1.4 (1 mark)

Fit a decision tree (DecisionTreeClassifier) with default settings to the data, now that it has been pre-processed.

As we have a small amount of data, we should use leave-one-out cross-validation to assess the performance more meaningfully. Report the accuracy for each fold, and the overall mean accuracy obtained.

Important: Please use random_state=42 where necessary to ensure reproducible results.

Q1.4 Answer

In [ ]: 
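'''
A sketch, assuming the remaining features in X are numeric (any
categorical fields would first need encoding, e.g. with
pd.get_dummies).
'''
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

X = df.drop(columns=["Treatment_Response"])
y = df["Treatment_Response"]

clf = DecisionTreeClassifier(random_state=42)

# With leave-one-out CV, each fold's accuracy is 0 or 1
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(scores)
print("Mean accuracy:", scores.mean())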

Q1.5 (3 marks)

Split the data into a training and test set (using a 75:25 ratio). Once again, fit a decision tree to this data, and report the accuracy. Visualize the tree (using tree.plot_tree), and state which feature/predictor is the most important. Then remove this top predictor and fit the tree again. Again, report the accuracy and visualize the tree.

Compare the accuracy between the two trees, and explain why the tree is different with this feature removed.

Important: Please use random_state=3 where necessary to ensure reproducible results.

Hint:

- You need to ensure that the original feature names are visible in the tree.

Q1.5 Answer

In [ ]: 
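'''
A sketch, reusing X and y from the Q1.4 cell. Impurity-based
feature importances identify the top predictor; the same steps
would then be repeated on X with that column dropped.
'''
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=3
)
clf = DecisionTreeClassifier(random_state=3)
clf.fit(X_train, y_train)
print("Accuracy:", clf.score(X_test, y_test))

# Plot with the original feature names visible
fig, ax = plt.subplots(figsize=(12, 8))
tree.plot_tree(clf, feature_names=list(X.columns), filled=True, ax=ax)
plt.show()

# Most important feature by impurity-based importance
top = X.columns[clf.feature_importances_.argmax()]
print("Top predictor:", top)

# Repeat the split/fit/plot steps above using X.drop(columns=[top])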

Q1.6 (2 marks)

Which tree do you think is more generalizable? You may want to compare the trees more thoroughly (readability, sensitivity/specificity, structural simplicity, etc.).

Q1.6 Answer

Q2: Decision Boundaries (4 marks)

In this question, we will visualize the decision boundaries formed by three simple classifiers on an example dataset.

Q2.1 (4 marks)


We have provided code below to produce the data and to create the decision boundary. You will need to run this code using the following models:

1. "ZeroR" classifier - sklearn.dummy.DummyClassifier using the "most_frequent" strategy.

2. KNN classifier - sklearn.neighbors.KNeighborsClassifier

3. Decision tree classifier - sklearn.tree.DecisionTreeClassifier

You will need to modify the code to output the accuracy for each of the models. Using both this information and the visualized decision boundaries, explain the performance of these algorithms. A brief explanation of how each classifier works will also be required.

Hints:

- Although not necessary, the use of further visualizations, performance measures, or even additional datasets may help to support your discussion.

- Use the decision boundaries as a reference point to explain the differences between the classifiers.


In [ ]:
'''
The code below provides you with the functions to get the data,
and plot the decision boundary.
The resulting graphs have not been properly formatted, however,
so you will need to add that. You will also need to modify the
code to output the accuracy.
'''
import numpy as np
import matplotlib.pyplot as plt  # needed by boundary_full below
from sklearn.datasets import make_classification


def get_data():
    # Create data
    data, labels = make_classification(
        n_features=2, n_redundant=0, n_informative=2,
        random_state=1, n_clusters_per_class=1
    )
    # Set the RNG
    rng = np.random.RandomState(42)
    # Add some noise
    data += 2 * rng.uniform(size=data.shape)
    return data, labels


def plot_boundary(X, ax, clf):
    # Plotting decision regions
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(
        np.arange(x_min, x_max, 0.1),
        np.arange(y_min, y_max, 0.1)
    )
    # Predict the class at every point on the grid
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.4)
    return ax


def boundary_full(data, labels, model, name, **kwargs):
    # Create estimator/model/classifier
    clf = model(**kwargs)
    # Fit the classifier
    clf.fit(data, labels)
    # Create axis
    fig, ax = plt.subplots()
    # Call the provided function
    ax = plot_boundary(data, ax, clf)
    # Now add the data (using scatter),
    # colouring the points according to their class labels
    ax.scatter(data[:, 0], data[:, 1], c=labels, s=20, edgecolor="k")
    # Format the graph... (`name` can be used here, e.g. as the title)


Q2.1 Answer

In [ ]: 
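'''
A sketch of how the provided functions might be called. The code
will still need modifying to report accuracy, e.g. by returning clf
from boundary_full and calling clf.score(data, labels).
'''
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

data, labels = get_data()

boundary_full(data, labels, DummyClassifier, "ZeroR", strategy="most_frequent")
boundary_full(data, labels, KNeighborsClassifier, "KNN")
boundary_full(data, labels, DecisionTreeClassifier, "Decision tree")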

Question 3: Training Time Comparison (4 marks)

Q3.1 (2 marks)

Plot the training time for both  DecisionTreeClassifier and  GaussianNB against the data size. A function to generate the data is provided to you, which takes the size as its only argument.

Explain what you observe, and your understanding of the relationship between training time and data size (include a graph). Consider the algorithms' implementations and the potential stochasticity in running times.

In [ ]:
# Use this function to measure the time
from time import time
# make_classification was imported in Q2; re-imported here so that
# this cell is self-contained
from sklearn.datasets import make_classification


# Use this function to generate the data
def create_data(size):
    # Create data
    data, labels = make_classification(
        n_samples=size,
        n_features=2, n_redundant=0, n_informative=2,
        random_state=4, n_clusters_per_class=1
    )
    return data, labels

Q3.1 Answer

In [ ]: 
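'''
A sketch; the sizes below are illustrative, not prescribed.
Training times are stochastic, so averaging over several repeats
per size would give smoother curves.
'''
import matplotlib.pyplot as plt
from time import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

sizes = [1000, 5000, 10000, 50000, 100000]
for model in (DecisionTreeClassifier, GaussianNB):
    times = []
    for size in sizes:
        data, labels = create_data(size)
        start = time()
        model().fit(data, labels)
        times.append(time() - start)
    plt.plot(sizes, times, label=model.__name__)
plt.xlabel("Number of instances")
plt.ylabel("Training time (s)")
plt.legend()
plt.show()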

Q3.2 (2 marks)

What do you think would happen if we continued increasing the number of instances? Which of the algorithms would be more suitable for a very large number of instances, and why?

Consider the algorithms' complexity and how they scale.

Q3.2 Answer

Question 4: Memory Usage Comparison (3 marks)

Q4.1 (3 marks)

Plot the memory usage of the DecisionTreeClassifier model against the data size. Explain the memory usage of the model (including a graph in your answer).

You should use the same create_data() function provided for Q3, and ensure that you have downloaded memory.py from Blackboard in order to load the measure_memory() function.

from memory import measure_memory

Q4.1 Answer

In [ ]:
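'''
A sketch only: measure_memory() comes from the course's memory.py,
whose exact signature is not shown here. This assumes it takes the
fitted model and returns its memory usage -- check the docstring in
memory.py and adjust accordingly.
'''
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from memory import measure_memory

sizes = [1000, 5000, 10000, 50000, 100000]
usage = []
for size in sizes:
    data, labels = create_data(size)
    clf = DecisionTreeClassifier().fit(data, labels)
    usage.append(measure_memory(clf))  # assumed signature, see note above

plt.plot(sizes, usage)
plt.xlabel("Number of instances")
plt.ylabel("Model memory usage")
plt.show()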