
CS235 Default Project Option

1    INTRODUCTION

In this default project for CS235, we will work with the Breast Cancer Wisconsin (Diagnostic) dataset and implement a variety of supervised and unsupervised methods, as we have seen them in class. This document roughly mimics the structure of a proposal that one would put forward for a class project in the proposed track.

1.1    Project Type

This is the default project type.

2    PROBLEM DEFINITION

We will be working with the Breast Cancer Wisconsin (Diagnostic) dataset hosted on Kaggle (https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data).

We are going to tackle two different problem definitions on this dataset.

Problem Definition 1: Given a data point expressed by the numerical features of our dataset (which describe a given breast mass), classify that data point into one of two classes (malignant or benign).

Problem Definition 2: Given N data points expressed by the numerical features of our dataset (each describing a breast mass), cluster those data points into a set of coherent clusters, where points within each cluster have high similarity and points across clusters have low similarity.

3    DATASET DESCRIPTION

We are going to use the Breast Cancer Wisconsin (Diagnostic) dataset hosted on Kaggle (https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data). The dataset contains 569 data points, each described by 30 numerical features computed from a digitized image of a fine needle aspirate of a breast mass, together with a binary diagnosis label (malignant or benign).
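For concreteness, a minimal loading sketch in Python (assuming the CSV keeps its default Kaggle file name, data.csv, and that the diagnosis column is coded as M/B):

```python
import pandas as pd

# Load the Kaggle CSV (assumed default file name: "data.csv").
df = pd.read_csv("data.csv")

# Binary label: 1 = malignant (M), 0 = benign (B).
y = (df["diagnosis"] == "M").astype(int).values

# Feature matrix: drop the id/label columns and any empty trailing column
# pandas may pick up from the CSV, keeping the 30 numerical features.
X = df.drop(columns=["id", "diagnosis"]).dropna(axis=1, how="all").values
```

Later sketches in this document reuse X and y as defined here.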

4    PROPOSED APPROACH

Each team member should work on one of the following methods. Methods 1-3 tackle Problem Definition 1, and methods 4-6 tackle Problem Definition 2.


(1) Random Forest classifier: The base classifier within the Random Forest will be a Decision Tree using the Information Gain Criterion, as we saw in class.

(2) Multi-Layer Perceptron (MLP): This is considered a “deep learning” implementation, which means that you should systematically and thoroughly experiment with different architectural/hyperparameter choices (number of layers, width per layer, activations per layer).

(3) K-nearest neighbors: This is considered a method that lends itself to near-trivial implementation, so in addition to the base functionality, you should also include the following (as also instructed in the project description document; see the sketches after this list):

(a) Two different distance functions (e.g., Euclidean and Manhattan)

(b) Two different classes of feature representations, obtained as follows:

• Low rank approximation of the data using the Singular Value Decomposition (SVD) for 2 different selections of the approximation rank: (i) a “low” value (i.e., before the singular values drop dramatically) and (ii) a “high” value (i.e., right after the singular values drop dramatically) → In total, we have 2 different SVD-based feature reps.

• Repeat the same idea as above, but now, instead of the SVD, we use an off-the-shelf MLP-based AutoEncoder where the encoder and the decoder have 2 layers with ReLU activations. Here you should choose two different sizes of the bottleneck layer: (i) 5% of the original #features, (ii) 20% of the original #features → In total, we have 2 AutoEncoder-based feature reps.

(4) DBSCAN clustering

(5) Spectral clustering, including the base implementation of K-means necessary for this method.

(6) Agglomerative Clustering with Single Linkage
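For method (1), here is a minimal sketch of the information gain criterion for evaluating a binary split; the example split on the first feature column (radius_mean in the Kaggle column order) is purely illustrative, and X, y are as loaded in Section 3:

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, left_mask):
    """Entropy reduction from splitting y into two groups by a boolean mask."""
    n = len(y)
    left, right = y[left_mask], y[~left_mask]
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(y) - children

# Illustrative split: first feature below its median.
split = X[:, 0] < np.median(X[:, 0])
print(information_gain(y, split))
```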
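For method (3), a sketch of the two distance functions and the SVD-based feature representations; the ranks 2 and 10 below are placeholder choices, to be replaced after inspecting where the singular values actually drop:

```python
import numpy as np

# (a) Two distance functions for the k-NN implementation.
def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

# (b) SVD-based low-rank feature representations.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
print(S)  # inspect the spectrum for the dramatic drop

def svd_features(X, k):
    """Project X onto the top-k right singular vectors (rank-k representation)."""
    return X @ Vt[:k].T

X_svd_low = svd_features(X, 2)    # placeholder "low" rank
X_svd_high = svd_features(X, 10)  # placeholder "high" rank
```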

For teams with fewer than 6 team members, please select the subset of methods you prefer; however, please make sure that each team member implements a different method.
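Complementing the SVD sketch above, here is a minimal sketch of the AutoEncoder-based feature representations from method (3), using PyTorch as one possible off-the-shelf choice; the hidden width of 16, the learning rate, and the epoch count are illustrative assumptions, and the final reconstruction layer is left linear as a common design choice:

```python
import torch
import torch.nn as nn

def make_autoencoder(n_features, bottleneck, hidden=16):
    # Two-layer encoder and two-layer decoder with ReLU activations.
    encoder = nn.Sequential(
        nn.Linear(n_features, hidden), nn.ReLU(),
        nn.Linear(hidden, bottleneck), nn.ReLU(),
    )
    decoder = nn.Sequential(
        nn.Linear(bottleneck, hidden), nn.ReLU(),
        nn.Linear(hidden, n_features),  # linear reconstruction layer
    )
    return encoder, decoder

n_features = X.shape[1]
Xt = torch.tensor(X, dtype=torch.float32)
for frac in (0.05, 0.20):                  # 5% and 20% bottlenecks
    k = max(1, round(frac * n_features))   # 30 features -> sizes 2 and 6
    encoder, decoder = make_autoencoder(n_features, k)
    model = nn.Sequential(encoder, decoder)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(200):                   # short illustrative training loop
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(Xt), Xt)
        loss.backward()
        opt.step()
    Z = encoder(Xt).detach().numpy()       # AutoEncoder-based feature rep.
```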

5    EVALUATION PLAN

We will use different evaluation plans depending on the problem definition:

• For Problem Definition 1 (and methods 1-3): We will measure F1 score, Precision, and Recall. We use these measures because there is a relative imbalance in the class distribution, and we would like to make sure this does not introduce any unintended artifacts into the evaluation of our methods.

• For Problem Definition 2 (and methods 4-6): As we saw in class, for unsupervised methods we can use intrinsic and extrinsic methods of evaluation. Here we will use both kinds, since our dataset has labels, which we can use for extrinsic evaluation (see the sketch after this list):

– Intrinsic: We will measure the Silhouette coefficient averaged over all data points.

– Extrinsic: We will measure the Normalized Mutual Information (NMI) given the class labels of the dataset.
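All of these measures are available off the shelf in scikit-learn, which is useful at least as a reference check against the from-scratch implementations. A minimal sketch, assuming y_true and y_pred come from one of methods 1-3 on held-out data, and cluster_labels from one of methods 4-6:

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             silhouette_score, normalized_mutual_info_score)

# Problem Definition 1: classification metrics.
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))

# Problem Definition 2: intrinsic and extrinsic clustering metrics.
print("Silhouette:", silhouette_score(X, cluster_labels))  # mean over all points
print("NMI:       ", normalized_mutual_info_score(y_true, cluster_labels))
```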

For each method, we are going to report the average performance metric (obtained via cross-validation) as a function of each hyperparameter the method includes. For each such hyperparameter, we will produce a metric vs. hyperparameter figure with error bars (obtained via cross-validation).
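As an illustration, a sketch of one such figure for k-NN; knn_predict is a hypothetical stand-in for the team's own implementation, and the grid of k values is a placeholder:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

ks = [1, 3, 5, 7, 9, 11]  # placeholder hyperparameter grid
means, stds = [], []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for k in ks:
    scores = []
    for tr, te in cv.split(X, y):
        y_pred = knn_predict(X[tr], y[tr], X[te], k=k)  # hypothetical helper
        scores.append(f1_score(y[te], y_pred))
    means.append(np.mean(scores))
    stds.append(np.std(scores))

plt.errorbar(ks, means, yerr=stds, marker="o", capsize=3)
plt.xlabel("k (number of neighbors)")
plt.ylabel("F1 score (5-fold cross-validation)")
plt.show()
```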

In addition to the above figures, we will produce a comparison figure of all the methods, where the best hyperparameter setting for each method is chosen via cross-validation, and each method’s best performance is compared via a bar-chart with error-bars.
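A corresponding sketch of the comparison figure; the (mean, std) entries in summary are zero-valued placeholders, to be filled in with each method's best cross-validated results:

```python
import matplotlib.pyplot as plt

# Placeholder entries only, not real results; fill in with the best
# cross-validated (mean, std) per method from the experiments above.
summary = {"Random Forest": (0.0, 0.0), "MLP": (0.0, 0.0), "k-NN": (0.0, 0.0)}

names = list(summary)
means = [summary[n][0] for n in names]
stds = [summary[n][1] for n in names]

plt.bar(names, means, yerr=stds, capsize=4)
plt.ylabel("Best cross-validated F1 score")
plt.show()
```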

6    PROJECT TEAM & PROJECTED LABOR DIVISION

Please submit your team and labor division as part of the Project Team Declaration.