CS5487 Programming Assignment 2
Clustering
In this programming assignment you will implement and test several clustering algorithms on both synthetic and real data. Given a set of points X = {y_1, ..., y_n}, where y_i ∈ R^d, the goal is to assign a cluster label z_i ∈ {1, ..., K} to each point, where K is the number of clusters. You will consider three clustering algorithms:
1. K-means algorithm – The K-means algorithm calculates the cluster centers u_j using the current assignment of points to each cluster (K is assumed to be known). In each iteration, K-means performs the following two steps:

   Cluster assignment:  z_ij = 1 if j = argmin_{k=1,...,K} ||y_i - u_k||^2, and z_ij = 0 otherwise,   (1)

   Estimate center:     u_j = (sum_i z_ij y_i) / (sum_i z_ij).                                        (2)

The cluster label for point y_i is the label of the closest cluster center, z_i = argmax_j z_ij.
2. EM algorithm for Gaussian mixture models (EM-GMM) – The EM algorithm calculates a maximum likelihood estimate of a GMM with K components, with parameters {π_j, u_j, Σ_j}. K is also assumed to be known. The E- and M-steps are:

   E-step:  ẑ_ij = p(z_i = j | y_i) = π_j N(y_i | u_j, Σ_j) / sum_{k=1}^{K} π_k N(y_i | u_k, Σ_k),   (3)

   M-step:  N_j = sum_{i=1}^{n} ẑ_ij,   π_j = N_j / n,   u_j = (1/N_j) sum_{i=1}^{n} ẑ_ij y_i,       (4)

            Σ_j = (1/N_j) sum_{i=1}^{n} ẑ_ij (y_i - u_j)(y_i - u_j)^T.                                (5)
After convergence, the cluster label for y_i is the component with the largest posterior probability, z_i = argmax_j ẑ_ij. The document “gmm-tips”, available on the course website, contains some useful tips on implementing EM for GMMs.
3. Mean-shift algorithm – The mean-shift algorithm uses gradient ascent with an adaptive step-size to find local peaks in a kernel density estimate of X. We will use a Gaussian kernel with bandwidth h. Given an initial point x^(0) (where the superscript denotes the iteration number), the mean-shift update rule is:

   x^(t+1) = [ sum_{i=1}^{n} y_i exp(-||x^(t) - y_i||^2 / (2h^2)) ] / [ sum_{i=1}^{n} exp(-||x^(t) - y_i||^2 / (2h^2)) ].   (6)
To obtain the clustering, the mean-shift algorithm is run using each data point yi as an initial point. After convergence, the points that converged to the same local peak are given the same cluster label. In this case, the number of clusters K is not fixed (and depends on the selected bandwidth h).
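As a reference for implementing the three algorithms above, here is a rough Python/numpy sketch. The function names, the farthest-first initialization, the covariance regularization constant, and the peak-merging tolerance are my own choices, not prescribed by the assignment:

```python
import numpy as np

def kmeans(Y, K, iters=100):
    """K-means on rows of Y (n, d); farthest-first initialization.
    Returns labels z (n,) and centers (K, d)."""
    centers = [Y[0].astype(float)]
    for _ in range(1, K):
        d2 = np.min([((Y - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(Y[d2.argmax()].astype(float))
    centers = np.array(centers)
    for _ in range(iters):
        # cluster assignment (Eq. 1): each point to its nearest center
        z = ((Y[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        # estimate centers (Eq. 2): mean of the assigned points
        new = np.array([Y[z == j].mean(0) if np.any(z == j) else centers[j]
                        for j in range(K)])
        if np.allclose(new, centers):
            break
        centers = new
    return z, centers

def em_gmm(Y, K, iters=100):
    """EM for a K-component GMM; returns argmax-posterior labels (n,)."""
    n, d = Y.shape
    _, mu = kmeans(Y, K)                 # initialize means with K-means
    pi = np.full(K, 1.0 / K)
    cov = np.stack([np.cov(Y.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(iters):
        # E-step (Eq. 3): responsibilities via log-densities
        logp = np.empty((n, K))
        for j in range(K):
            diff = Y - mu[j]
            inv = np.linalg.inv(cov[j])
            _, logdet = np.linalg.slogdet(cov[j])
            maha = np.einsum('nd,de,ne->n', diff, inv, diff)
            logp[:, j] = np.log(pi[j]) - 0.5 * (d * np.log(2 * np.pi)
                                                + logdet + maha)
        logp -= logp.max(1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(1, keepdims=True)
        # M-step (Eqs. 4-5): weighted counts, means, covariances
        Nj = r.sum(0)
        pi = Nj / n
        mu = (r.T @ Y) / Nj[:, None]
        for j in range(K):
            diff = Y - mu[j]
            cov[j] = (r[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return r.argmax(1)

def mean_shift(Y, h, iters=200, tol=1e-4):
    """Mean-shift with a Gaussian kernel of bandwidth h (Eq. 6)."""
    modes = Y.astype(float).copy()
    for _ in range(iters):
        d2 = ((modes[:, None] - Y[None]) ** 2).sum(-1)
        w = np.exp(-d2 / (2 * h * h))
        new = (w @ Y) / w.sum(1, keepdims=True)
        if np.abs(new - modes).max() < tol:
            modes = new
            break
        modes = new
    # points whose trajectories converged to the same peak share a label
    labels, peaks = np.full(len(Y), -1), []
    for i, m in enumerate(modes):
        for k, p in enumerate(peaks):
            if np.linalg.norm(m - p) < h / 2:
                labels[i] = k
                break
        else:
            peaks.append(m)
            labels[i] = len(peaks) - 1
    return labels
```

Note that only mean-shift produces the number of clusters itself; the other two take K as an input, which is the behaviour described above.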
Problem 1 Clustering synthetic data
In the first problem, you will test these clustering methods on synthetic data, and examine how each method performs on different configurations of data. The zip file PA2-cluster-data.zip contains the synthetic data files. Inside the MATLAB data file cluster_data.mat (or cluster_data_*.txt if you are not using MATLAB) are three data sets (points and ground-truth labels). The file contains the following variables:
● dataA_X – dataset A points. Each column is a data point, y_i ∈ R^2.
● dataA_Y – dataset A true labels, {z_i}.
● dataB_X – dataset B points.
● dataB_Y – dataset B true labels.
● dataC_X – dataset C points.
● dataC_Y – dataset C true labels.
The three datasets are shown in the following figure, where the color represents the ground-truth label.
[Figure: scatter plots of dataA, dataB, and dataC (axes roughly -15 to 15 in each dimension), with points colored by ground-truth label.]
The goal is to discover the clusters in each dataset using the data points X .
(a) Implement the three clustering algorithms. You will reuse these algorithms in the next problem, so try to make them as general as possible.
(b) Run the algorithms on the three synthetic datasets. Qualitatively, how does each clustering algorithm perform on each dataset? Comment on the advantages and limitations of each algorithm, in terms of the configuration of the data.
(c) How sensitive is mean-shift to the bandwidth parameter h?
. . . . . . . . .
Problem 2 A real world clustering problem – image segmentation
In this problem we will consider a real-world clustering problem, image segmentation, where the goal is to find homogeneous regions in the image, which hopefully belong to objects or object parts. Below is one of the test images and an example segmentation using K-means (in the right image, each segment is colored based on the average color within the segment).
[Figure: three panels – "original", "segments", and "segments (colored)" – showing the test image and its K-means segmentation.]
To segment the image, we first extract feature vectors from each pixel in the image (or a subset of pixels along a regular grid). In particular, we form a window centered around each pixel and extract a four-dimensional feature vector, y = [u, v, c_x, c_y]^T, where (u, v) are the average chrominance values (color without brightness) in the window, and (c_x, c_y) is the pixel location (the center of the window in x-y coordinates). The feature values for the example image are shown below (each subplot represents one feature dimension over the entire image).
[Figure: four panels ("x 1" through "x 4") visualizing each feature dimension over the image: the two chrominance channels and the two pixel coordinates.]
Next, a clustering algorithm is used to group the feature vectors, and a cluster label is assigned to each corresponding pixel forming the segmentation.
The zip file PA2-cluster-images.zip contains the images and MATLAB code for extracting the features and for creating the segmentation image from the cluster labels. The following MATLAB functions are provided:
● getfeatures.m – MATLAB function for extracting features from an image.
● labels2segm.m – MATLAB function for generating a segmentation image from the cluster labels.
● colorsegm.m – MATLAB function to create a nicely colored segmentation image.
As an example, the following code was used to generate the above segmentation images:
% read the image and view it
img = imread('images/12003.jpg');
subplot(1,3,1); imagesc(img); axis image;
% extract features (stepsize = 7)
[X, L] = getfeatures(img, 7);
XX = [X(1:2,:); X(3:4,:)/10];  % downscale the coordinate features (see part (b))
% run kmeans -- this is the MATLAB version. You have to write your own!
[Y, C] = kmeans(XX', 2);
% make a segmentation image from the labels
segm = labels2segm(Y, L);
subplot(1,3,2); imagesc(segm); axis image;
% color the segmentation image
csegm = colorsegm(segm, img);
subplot(1,3,3); imagesc(csegm); axis image;
Documentation for each function can be viewed in MATLAB using “help getfeatures”, etc.
For Python users, the file PA2-cluster-python.zip contains Python versions of the above demo code and helper functions for extracting features. This Python code requires the numpy, scipy, matplotlib, and Image modules. Alternatively, the file PA2-cluster-imfeatures-txt.zip contains text files of the extracted image features for a step size of 7.
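For orientation, the pipeline can be mimicked end-to-end in Python on a synthetic image. The helpers below (toy_features, a tiny K-means) are simplified stand-ins for the provided getfeatures.m/labels2segm.m, not the actual assignment code, and the chrominance computation is only a crude approximation:

```python
import numpy as np

def toy_features(img, step=7):
    """Extract [u, v, cx, cy] per grid pixel from an RGB image (H, W, 3).
    Chrominance is approximated here as the R-G and B-G differences."""
    H, W, _ = img.shape
    ys, xs = np.meshgrid(np.arange(0, H, step), np.arange(0, W, step),
                         indexing='ij')
    r = img[ys, xs, 0].astype(float)
    g = img[ys, xs, 1].astype(float)
    b = img[ys, xs, 2].astype(float)
    u, v = r - g, b - g                      # crude chrominance stand-in
    X = np.stack([u.ravel(), v.ravel(),
                  xs.ravel().astype(float), ys.ravel().astype(float)])
    return X, ys.shape                       # features (4, n) and grid shape

def kmeans_labels(X, K, iters=50):
    """Tiny K-means on columns of X (d, n); farthest-first initialization."""
    Y = X.T
    C = [Y[0]]
    for _ in range(1, K):
        d2 = np.min([((Y - c) ** 2).sum(1) for c in C], axis=0)
        C.append(Y[d2.argmax()])
    C = np.array(C)
    for _ in range(iters):
        z = ((Y[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        C = np.array([Y[z == j].mean(0) if np.any(z == j) else C[j]
                      for j in range(K)])
    return z

# synthetic test image: left half reddish, right half bluish
img = np.zeros((70, 70, 3), dtype=np.uint8)
img[:, :35] = (200, 40, 40)
img[:, 35:] = (40, 40, 200)

X, grid = toy_features(img, step=7)
XX = np.vstack([X[:2], X[2:] / 10])          # downscale coordinates (part (b))
labels = kmeans_labels(XX, 2)
segm = labels.reshape(grid)                  # segmentation on the sample grid
```

On this toy image the two color regions are far apart in chrominance, so the resulting label grid splits cleanly into a left and a right segment.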
(a) Use the three clustering algorithms to segment a few of the provided images. Qualitatively, which algorithm gives better results? How do the results change with different K and h? Which is less sensitive to changes in the parameters? Comment on any interesting properties or limitations observed about the clustering algorithms.
The feature vector y contains two types of features that are on different scales: the chrominance values (u, v) range from 0-255, while the pixel locations (c_x, c_y) range from 0-512. The EM-GMM clustering algorithm can adapt to the different scales of each feature by changing the covariance matrix. On the other hand, K-means and mean-shift assume a fixed isotropic covariance matrix. We can modify these two algorithms to allow different scaling of the features:
● For K-means, the distance to a cluster center y' is modified to apply a weighting between the feature types,

   d(y, y') = ||[u, v]^T - [u', v']^T||^2 + λ ||[c_x, c_y]^T - [c_x', c_y']^T||^2,   (7)

where λ > 0 controls the weighting.
● For mean-shift, the kernel is modified to use separate bandwidths for each feature type,

   k(y, y') = exp{ -(1/(2 h_c^2)) ||[u, v]^T - [u', v']^T||^2 - (1/(2 h_p^2)) ||[c_x, c_y]^T - [c_x', c_y']^T||^2 },   (8)

where h_c is the bandwidth for the color features, and h_p is the bandwidth for the pixel locations.
(b) Modify your K-means and mean-shift implementations to allow different feature scaling. Hint: changing the distance in (7) or kernel in (8) is equivalent to scaling each dimension in the feature vector y by an appropriate amount. Rerun the segmentation experiment. Do the segmentation results improve?
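To see why the hint holds, note that the weighted distance in (7) equals the ordinary Euclidean distance after scaling the coordinate features by sqrt(λ), so an unmodified K-means or mean-shift implementation can be reused on rescaled features. A quick numerical check (the feature values here are made up for illustration):

```python
import numpy as np

lam = 0.25
y  = np.array([120.0, 95.0, 40.0, 60.0])   # [u, v, cx, cy]
yp = np.array([110.0, 90.0, 10.0, 20.0])

# weighted distance as in (7)
d_weighted = ((y[:2] - yp[:2]) ** 2).sum() + lam * ((y[2:] - yp[2:]) ** 2).sum()

# equivalent: scale the coordinate dimensions by sqrt(lam), then plain Euclidean
s = np.array([1.0, 1.0, np.sqrt(lam), np.sqrt(lam)])
d_scaled = ((s * y - s * yp) ** 2).sum()

assert np.isclose(d_weighted, d_scaled)     # both equal 750.0
```

The same trick works for the kernel in (8): dividing the color features by h_c and the position features by h_p reduces it to a unit-bandwidth Gaussian kernel.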
. . . . . . . . .
Problem 3 Quantitative Evaluation of Clustering (Optional)
In this optional problem, you will quantitatively evaluate your results from Problems 1 and 2. Consider a set S = {s_1, ..., s_n} with n elements, and two partitions of S into subsets, Y = {Y_1, ..., Y_R} and Z = {Z_1, ..., Z_C}, where the Y_i are the subsets of partition Y, and similarly for the Z_j.
The Rand index can be used to quantitatively measure the amount of agreement between the two clusterings of the set S. Intuitively, the Rand index corresponds to the probability of pair-wise agreement between Y and Z, i.e., the probability that the assignment of any two items will be correct with respect to each other (in the same cluster, or in different clusters).
Formally, the Rand index is calculated as follows. Define the following counts:
● a – the number of pairs in S that are in the same set in Y and in the same set in Z,
● b – the number of pairs in S that are in different sets in Y and in different sets in Z,
● c – the number of pairs in S that are in the same set in Y and in different sets in Z,
● d – the number of pairs in S that are in different sets in Y and in the same set in Z,
then the Rand index is the fraction of pairwise agreements,

   Rand = (a + b) / (a + b + c + d) = (a + b) / (n choose 2).
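The pair counts a, b, c, d defined above translate directly into a simple O(n^2) implementation over two label vectors (the function name is my own):

```python
from itertools import combinations

def rand_index(y, z):
    """Rand index between two clusterings given as label sequences."""
    a = b = c = d = 0
    for i, j in combinations(range(len(y)), 2):
        same_y, same_z = y[i] == y[j], z[i] == z[j]
        if same_y and same_z:
            a += 1          # same set in both partitions
        elif not same_y and not same_z:
            b += 1          # different sets in both partitions
        elif same_y:
            c += 1          # same in Y, different in Z
        else:
            d += 1          # different in Y, same in Z
    return (a + b) / (a + b + c + d)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # identical up to relabeling -> 1.0
```

Note the index is invariant to how the clusters are numbered, which is exactly what is needed when comparing an algorithm's labels to the ground truth.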
The Rand index can be efficiently calculated from a contingency table, where n_ij = |Y_i ∩ Z_j| is the number of elements common to Y_i and Z_j:

                     Partition Z
   Class  |  Z_1   Z_2   ...   Z_C  |  Sums
   -----------------------------------------
   Y_1    |  n_11  n_12  ...   n_1C |  a_1
   Y_2    |  n_21  n_22  ...   n_2C |  a_2
   ...    |  ...   ...   ...   ...  |  ...
   Y_R    |  n_R1  n_R2  ...   n_RC |  a_R
   -----------------------------------------
   Sums   |  b_1   b_2   ...   b_C  |  n
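One way this works (a sketch under my own naming, not part of the hand-out): with entries n_ij, row sums a_i, and column sums b_j, the pair counts satisfy a = Σ_ij C(n_ij, 2), a + c = Σ_i C(a_i, 2), and a + d = Σ_j C(b_j, 2), so the Rand index can be computed in O(RC) time without enumerating pairs:

```python
import numpy as np

def rand_index_table(N):
    """Rand index from a contingency table N (R x C) of overlap counts n_ij."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    def comb2(x):
        return x * (x - 1) / 2.0           # number of pairs, C(x, 2)
    T = comb2(n)                           # total pairs, C(n, 2)
    # Rand = (T + 2a - (a + c) - (a + d)) / T, with the identities above
    return (T + 2 * comb2(N).sum()
            - comb2(N.sum(axis=1)).sum()
            - comb2(N.sum(axis=0)).sum()) / T

print(rand_index_table([[2, 0], [0, 2]]))  # identical partitions -> 1.0
```

For large n (e.g. the image-segmentation labels in Problem 2) this is much faster than the pairwise definition, while giving the same value.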
2022-11-25