
Statistical Machine Learning GR5241

Spring 2023

Homework 5

Due: May 1st, 2023

Homework submission:  Please submit your homework electronically through Canvas by 11:59pm on the due date. You need to submit both the pdf file and your code (either in R or Python).

Problem 1 (Boosting, 30 points)

The objective of this problem is to implement the AdaBoost algorithm. We will test the algorithm on handwritten digits from the USPS data set.

AdaBoost:  Assume we are given a training sample (x^(i), y_i), i = 1, ..., n, where the x^(i) are data values in R^d and y_i ∈ {−1, +1} are class labels. Along with the training data, we provide the algorithm with a training routine for some classifier c (the "weak learner"). Here is the AdaBoost algorithm for the two-class problem:

1.  Initialize weights: w_i = 1/n, i = 1, ..., n.

2. for b = 1, ...,B

(a) Train a weak learner cb  on the weighted training data.

(a) Train a weak learner cb on the weighted training data.

(b)  Compute error: ϵ_b := ( Σ_{i=1}^n w_i I{y_i ≠ c_b(x^(i))} ) / ( Σ_{i=1}^n w_i )

(c)  Compute voting weights: α_b = log( (1 − ϵ_b) / ϵ_b )

(d)  Recompute weights: w_i = w_i exp( α_b I{y_i ≠ c_b(x^(i))} )

3.  Return classifier: ĉ_B(x^(i)) = sgn( Σ_{b=1}^B α_b c_b(x^(i)) )
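Translated into R, the loop above might look like the following minimal sketch. It assumes the weak-learner interface train(X, w, y) and classify(X, pars) described under the homework problems below; the function name adaboost and the structure of its return value are illustrative choices, not a prescribed interface.

# Minimal AdaBoost sketch. X is d x n (columns are training vectors),
# y is a vector of labels in {-1, +1}, B is the number of boosting rounds.
adaboost <- function(X, y, B) {
  n <- ncol(X)
  w <- rep(1 / n, n)                        # step 1: uniform initial weights
  alpha <- numeric(B)
  allPars <- vector("list", B)
  for (b in 1:B) {
    allPars[[b]] <- train(X, w, y)          # 2(a): train weak learner on weighted data
    miss <- as.numeric(classify(X, allPars[[b]]) != y)
    eps <- sum(w * miss) / sum(w)           # 2(b): weighted misclassification error
    alpha[b] <- log((1 - eps) / eps)        # 2(c): voting weight
    w <- w * exp(alpha[b] * miss)           # 2(d): up-weight misclassified points
  }
  list(alpha = alpha, allPars = allPars)
}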

Decision stumps:  Recall that a stump classifier c is defined by

c(x | j, θ, m) := {  m     if x_j > θ,
                    −m     otherwise.                                          (1)

Since the stump ignores all entries of x except xj ,  it is equivalent to a linear classifier defined by an affine hyperplane.  The plane is orthogonal to the jth axis, with which it intersects at xj  = θ .  The orientation of the hyperplane is determined by m ∈ {−1, +1}. We will employ stumps as weak learners in our boosting algorithm. To train stumps on weighted data, use the learning rule

(j*, θ*) := arg min_{j,θ} Σ_{i=1}^n w_i I{ y_i ≠ c(x^(i) | j, θ, m) }.                                          (2)

In the implementation of your training routine, first determine an optimal parameter θ_j* for each dimension j = 1, ..., d, and then select the j* for which the cost term in (2) is minimal.
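A brute-force implementation of rule (2) could look as follows; treating the observed values of each dimension as candidate thresholds is an assumption of this sketch (any exhaustive threshold search will do), and a faster implementation would sort each dimension once instead of using nested loops.

# Decision stump sketch. X is d x n, w are weights, y in {-1, +1}.
# Returns pars = list(j, theta, m) as required by the interface below.
train <- function(X, w, y) {
  d <- nrow(X)
  best <- list(err = Inf)
  for (j in 1:d) {
    for (theta in unique(X[j, ])) {           # candidate thresholds: observed values
      for (m in c(-1, 1)) {
        pred <- ifelse(X[j, ] > theta, m, -m)
        err <- sum(w * (pred != y)) / sum(w)  # weighted error of this stump
        if (err < best$err) best <- list(err = err, j = j, theta = theta, m = m)
      }
    }
  }
  list(j = best$j, theta = best$theta, m = best$m)
}

classify <- function(X, pars) {
  ifelse(X[pars$j, ] > pars$theta, pars$m, -pars$m)
}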

Homework problems:

1.i  (8) Implement the AdaBoost algorithm in R. The algorithm requires two auxiliary functions, to train and evaluate the weak learner. We also need a third function which implements the resulting boosting classifier. We will use decision stumps as weak learners, but a good implementation of the boosting algorithm should permit you to easily plug in arbitrary weak learners. To make sure that is possible, please use function calls of the following form:

• pars  <-  train(X,  w,  y) for the weak learner training routine, where X is a matrix the columns of which are the training vectors x(1) , . . . , x(n), and w and y are vectors containing the weights and class labels. The output pars is a list which contains the parameters specifying the resulting classifier. (In our case, pars will be the triplet (j,θ,m) which specifies the decision stump).

• label  <-  classify(X,  pars) for the classification routine, which evaluates the weak learner on X using the parametrization pars.

• A function c_hat <- agg_class(X, alpha, allPars) which evaluates the boosting classifier ("aggregated classifier") on X. The argument alpha denotes the vector of voting weights and allPars contains the parameters of all weak learners.
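For instance, a minimal agg_class consistent with this interface (assuming a classify routine as sketched above) could be:

# Aggregated boosting classifier: sign of the weighted vote of all weak learners.
agg_class <- function(X, alpha, allPars) {
  votes <- rep(0, ncol(X))
  for (b in seq_along(alpha)) {
    votes <- votes + alpha[b] * classify(X, allPars[[b]])
  }
  sign(votes)   # an exact zero vote returns 0; choose your own tie-breaking convention
}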

1.ii  (6) Implement the functions train and classify for decision stumps.

1.iii  (12) Run your algorithm on the USPS data (the digit data we used in Homework 4; use the training and test data for the 5s and 6s) and evaluate your results using cross validation.

Your AdaBoost algorithm returns a classifier that is a combination of B weak learners.  Since it is an incremental algorithm, we can evaluate the AdaBoost classifier at every iteration b by considering the sum up to the b-th weak learner.  At each iteration, perform 5-fold cross validation to estimate the training and test error of the current classifier (that is, the errors measured on the cross validation training and test sets, respectively).
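One possible way to organize this evaluation in R is sketched below; it assumes the adaboost, classify, and agg_class sketches above, and the fold construction is an illustrative choice.

# 5-fold CV sketch: for each fold, fit B weak learners on the CV-training part,
# then record the error of the aggregated classifier truncated to b learners.
cv_errors <- function(X, y, B, nfold = 5) {
  n <- ncol(X)
  fold <- sample(rep(1:nfold, length.out = n))
  train_err <- test_err <- matrix(NA, B, nfold)
  for (f in 1:nfold) {
    tr <- fold != f; te <- fold == f
    fit <- adaboost(X[, tr, drop = FALSE], y[tr], B)
    for (b in 1:B) {
      pred_tr <- agg_class(X[, tr, drop = FALSE], fit$alpha[1:b], fit$allPars[1:b])
      pred_te <- agg_class(X[, te, drop = FALSE], fit$alpha[1:b], fit$allPars[1:b])
      train_err[b, f] <- mean(pred_tr != y[tr])
      test_err[b, f]  <- mean(pred_te != y[te])
    }
  }
  list(train = rowMeans(train_err), test = rowMeans(test_err))
}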

1.iv  (4) Plot the training error and the test error as a function of b.

Submission.  Please make sure your solution contains the following:

• Your implementation for train, classify and agg_class.

• Your implementation of AdaBoost.

•  Plots of your results (training error and cross-validated test error).

Problem 2 (Hierarchical clustering, 5 points)

Perform single-linkage hierarchical clustering on the following 2-dimensional observations.  Use the Manhattan distance between a pair of points: d(A,B) = |X1A − X1B | + |X2A − X2B |. Show your results after each step.

Obs    X1    X2
A       1     4
B       1     3
C       3     4
D       5     2
E       3     2
F       3     0
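The clustering itself should be carried out by hand, but if you want to check your step-by-step merges, R's built-in functions reproduce the dendrogram; a minimal sketch using the observations above:

# Single-linkage clustering with Manhattan distance (only as a check of the hand computation).
pts <- data.frame(X1 = c(1, 1, 3, 5, 3, 3),
                  X2 = c(4, 3, 4, 2, 2, 0),
                  row.names = c("A", "B", "C", "D", "E", "F"))
hc <- hclust(dist(pts, method = "manhattan"), method = "single")
plot(hc)                 # dendrogram; merge heights are the single-linkage distances
hc$merge; hc$height      # the sequence of merges and the distances at which they occur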

Problem 3 (Back Propagation, 15 points)

The dataset "BackPropData.csv" contains two features x1, x2 and a dichotomous response y (coded y = 0 or y = 1). Consider applying a vanilla one-hidden-layer neural network to classify the dichotomous response y.

• One hidden layer with 3 nodes

•  Bias for both the hidden layer and output layer

• Sigmoid activation function at the hidden layer

• Softmax output function

•  Cross-entropy cost (or objective) Q

•  17 total parameters defined by θ

θ is the collection of weights: W [1] , b[1] , W [2] , b[2] .

Required task: Compute ∇Q(θ*) using backpropagation. The symbol ∇Q(θ*) is the gradient of Q(θ) evaluated at θ*. Note that you are not solving the full gradient descent optimization problem; the task is reduced to a single step of the gradient descent algorithm.

(W[1])* = … ,      (b[1])* = … ,      (W[2])* = … ,      (b[2])* = …      (numerical values as given in the handout)

Please solve this problem manually.  A loop is not required because the algorithm only iterates over k = 2, 1. Display the full gradient ∇Q(θ*) in your solution.
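As a sanity check for the hand computation, the forward and backward passes for this architecture can be written out compactly. The sketch below assumes W[1] is 3 × 2, b[1] is length 3, W[2] is 2 × 3, b[2] is length 2 (17 parameters in total), that the columns of the data file are named x1, x2, y, and that Q is the cross-entropy summed over all observations; adjust these assumptions to match the handout's definitions.

# One backpropagation pass for the 2-3-2 network (sigmoid hidden layer, softmax output).
dat <- read.csv("BackPropData.csv")
X <- t(as.matrix(dat[, c("x1", "x2")]))        # 2 x n, columns are observations
Y <- rbind(1 - dat$y, dat$y)                   # one-hot targets, 2 x n
sigmoid <- function(z) 1 / (1 + exp(-z))

backprop <- function(W1, b1, W2, b2) {
  n  <- ncol(X)
  Z1 <- W1 %*% X + matrix(b1, 3, n)            # hidden pre-activation
  A1 <- sigmoid(Z1)                            # hidden activation
  Z2 <- W2 %*% A1 + matrix(b2, 2, n)           # output pre-activation
  P  <- apply(Z2, 2, function(z) exp(z) / sum(exp(z)))   # softmax probabilities
  dZ2 <- P - Y                                 # dQ/dZ2 for softmax + cross-entropy
  dW2 <- dZ2 %*% t(A1);  db2 <- rowSums(dZ2)
  dA1 <- t(W2) %*% dZ2
  dZ1 <- dA1 * A1 * (1 - A1)                   # sigmoid derivative
  dW1 <- dZ1 %*% t(X);   db1 <- rowSums(dZ1)
  list(dW1 = dW1, db1 = db1, dW2 = dW2, db2 = db2)
}
# grads <- backprop(W1_star, b1_star, W2_star, b2_star)   # plug in the given theta* values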

Problem 4 (Multinomial Clustering, 30 points)

In this exercise, we will apply the EM algorithm and a finite mixture of multinomial distributions to the image segmentation problem. Image segmentation refers to the task of dividing an input image into regions of pixels that “belong together” (which is about as formal and precise a definition as you will be able to find in the literature.) One way to approach image segmentation is to extract a suitable set of features from the image and apply a clustering algorithm to them. The clusters output by the algorithm can then be regarded as the image segments. The features we will use are histograms drawn from small regions in the image.

Histogram clustering:  The term histogram clustering refers to grouping methods whose input features are histograms. A histogram can be regarded as a vector; if all input histograms have the same number of bins d, a set of input histograms can be regarded as a set of vectors in R^d. We can therefore perform a simple histogram clustering by running a k-means algorithm on this set of vectors. The method we will implement in this problem is slightly more sophisticated: since histograms are represented by multinomial distributions, we describe the clustering problem by a finite mixture of multinomial distributions, which we estimate with the EM algorithm.

The EM algorithm for multinomial mixtures.  The input of the algorithm is:

• A matrix of histograms, denoted H, which contains one histogram vector in each row.  Each column corre- sponds to a histogram bin.

• An integer K which specifies the number of clusters.

• A threshold parameter τ .

The variables we need are:

• The input histograms. We denote the histogram vectors by H1 , ..., Hn .

• The centroids t1, ..., tK . These are vectors in R^d (just like the input features). Each centroid is the parameter vector of a multinomial distribution and can be regarded as the center of a cluster; they are computed as the weighted averages of all features assigned to the cluster.

• The assignment probabilities a1, ..., an .  Each of the vectors ai is of length K, i.e. it contains one entry for each cluster k . The entry aik specifies the probability that feature i is assigned to cluster k . The matrix which contains the vectors ai in its rows will be denoted A.

The algorithm iterates between the computation of the assignment probabilities ai   and adjusting the cluster centroids tk .

The algorithm:

1.  Choose K of the histograms at random and normalize each. These are our initial centroids t1 , ..., tK .

2.  Iterate:

(a)  E-step: Compute the components of each ai , the assignment probabilities aik , as follows:

ϕ_ik := exp( Σ_j H_ij log(t_kj) )

a_ik := c_k ϕ_ik / Σ_l c_l ϕ_il

Note that ϕ_ik is the multinomial probability of H_i with parameters t_k , up to the multinomial coefficient (Σ_j H_ij)! / Π_j (H_ij !), which cancels out in the computation of a_ik .

(b)  M-step: Compute new mixture weights ck  and class centroids tk  as

c_k = (1/n) Σ_{i=1}^n a_ik

b_k = Σ_{i=1}^n a_ik H_i

t_k = b_k / Σ_j b_kj

(c)  Compute a measure of the change of assignments during the current iteration:

δ := ∥A − Aold ∥ ,

where A is the assignment matrix (with entries aik ), Aold   denotes assignment matrix in the previous step, and ∥ . ∥ is the matrix 1-norm (the largest sum of column absolute values). This is implemented in R as norm( . ,"O").

3. Terminate the iteration when δ < τ .

4. Turn the soft assignments into a vector m of hard assignments by computing, for i = 1, ...,n,

m_i := arg max_{k=1,...,K} a_ik ,

i. e. the index k of the cluster for which the assignment probability aik  is maximal.
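A vectorized R sketch of the algorithm above, with the interface required in problem 4.i, is shown below; the 0.01 smoothing constant and the row-max subtraction in the E-step are numerical-stability choices (compare the hints at the end of this problem), not part of the algorithm's definition.

# EM for a mixture of multinomials. H: n x d histogram matrix, K clusters, tau threshold.
MultinomialEM <- function(H, K, tau) {
  H  <- H + 0.01                                 # guard against empty bins
  n  <- nrow(H)
  tk <- H[sample(n, K), ]                        # step 1: K random histograms ...
  tk <- tk / rowSums(tk)                         #         ... normalized as centroids (K x d)
  ck <- rep(1 / K, K)                            # mixture weights
  A  <- matrix(1 / K, n, K)                      # assignment matrix
  repeat {
    # E-step: a_ik proportional to c_k * exp(sum_j H_ij log t_kj)
    logPhi <- H %*% t(log(tk))                   # n x K matrix of log(phi_ik)
    Anew <- exp(logPhi - apply(logPhi, 1, max))  # subtract row max for numerical stability
    Anew <- sweep(Anew, 2, ck, "*")
    Anew <- Anew / rowSums(Anew)
    # M-step: mixture weights and centroids
    ck <- colSums(Anew) / n
    B  <- t(Anew) %*% H                          # b_k = sum_i a_ik H_i   (K x d)
    tk <- B / rowSums(B)
    delta <- norm(Anew - A, "O")                 # change of assignments, matrix 1-norm
    A <- Anew
    if (delta < tau) break                       # step 3: terminate when delta < tau
  }
  apply(A, 1, which.max)                         # step 4: hard assignments m_1, ..., m_n
}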

The histogram data: The histograms we use have been extracted from an image by the following procedure:

1. Select a subset of pixels, called the sites, at which we will draw a histogram.   Usually, this is done by selecting the pixels at the nodes of an equidistant grid.  (For example, a 2-by-2 grid means that we select every second pixel in every second row as a site.)

2.  Place a rectangle of fixed radius around the site pixel.

3. Select all pixels within the rectangle and sort their intensity values into a histogram.

The data we provide you with was drawn from the 800x800 grayscale image shown below (left):

[Figure: the 800 × 800 grayscale SAR image (left) and an example segmentation into K = 3 clusters (right).]

The image is a radar image; more specifically, it was taken using a technique called synthetic aperture radar (or SAR). This type of image is well-suited for segmentation by histogram clustering, because the local intensity dis- tributions provide distinctive information about the segments. The image on the right is an example segmentation using K = 3 clusters, computed with the algorithm described above.

The histograms were drawn at the nodes of a 4-by-4 pixel grid. Since the image has 800 × 800 pixels, there are 200 × 200 = 40000 histograms.  Each was drawn within a rectangle of edge length 11 pixels, so each histogram contains 11 × 11 = 121 values.

Homework problems.  Download the data file histograms.bin from the course homepage. You can load it in R with the command

H <- matrix(readBin("histograms.bin", "double", 640000), 40000, 16)

which results in a 40000 × 16 matrix. Each row is a histogram with 16 bins. You can check that the matrix has been loaded correctly by saying dim(H) (which should be 40000 rows by 16 columns) and rowSums(H) (the sum of each row should be 121).

4.i  (10) Implement the EM algorithm in R. The function call should be of the form m <- MultinomialEM(H,K,tau), with H the matrix of input histograms, K the number of clusters, and tau the threshold parameter τ.

4.ii  (10) Run the algorithm on the input data for K=3, K=4 and K=5. You may have to try different values of τ to obtain a reasonable result.

4.iii  (10) Visualize the results as an image.  You can turn a hard assignment vector m returned by the EM algorithm into an image using the image function in R.
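For example, assuming a MultinomialEM implementation as in 4.i and that the sites were visited row by row on the 200 × 200 grid (you may need to transpose or flip the matrix depending on the ordering in histograms.bin):

# Reshape the 40000 hard assignments back onto the 200 x 200 grid of sites and plot.
m <- MultinomialEM(H, K = 3, tau = 0.1)        # tau = 0.1 is only an example value
image(matrix(m, nrow = 200, ncol = 200), col = gray.colors(3), axes = FALSE)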

Hints:

• You may encounter numerical problems due to empty histogram bins. If so, add a small constant (such as 0.01) to the input histograms.

•  Both the E-step and the M-step involve sums, which are straightforward to implement as loops, but loops are slow in interpreted programming languages (such as R or Matlab). If you want to obtain a fast algorithm, try to avoid loops by implementing the sums by matrix/vector operations.

Problem 5 (Implementing PageRank, 20 points)

In this problem, you will implement the PageRank algorithm. You will be experimenting with a small randomly generated graph (assume the graph has no dead ends) which can be downloaded from Canvas.

It has n = 100 nodes (numbered 1, 2, . . . , 100) and m = 1024 edges, 100 of which form a directed cycle (through all the nodes) that ensures the graph is connected. There may be multiple edges between a pair of nodes; your code should handle these rather than ignore them.  The first column in graph.txt refers to the source node, and the second column refers to the destination node.

Assume the directed graph G = (V,E) has n nodes (numbered 1, 2, . . . ,n) and m edges, all nodes have positive out-degree, and A = (Aji )n ×n  is an n × n matrix as defined in class such that for any i,j ∈ {1, 2, . . . ,n}:

A_ji = {  1/deg(i)    if (i → j) ∈ E,
          0           otherwise.

Here, deg(i) is the number of outgoing edges of node i in G. Denote the PageRank vector by the column vector r . By the definition of PageRank, we have the following equation:

r = (α/n) · 1 + (1 − α) A r,

where 1 is the vector of length n with all entries equal to 1.

Based on this equation, the iterative procedure to compute PageRank works as follows:

•  Initialize: r^(0) = (1/n) · 1.

•  For i from 1 to k, iterate: r^(i) = (α/n) · 1 + (1 − α) A r^(i−1).
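A minimal R sketch of this procedure, assuming graph.txt is a whitespace-separated edge list with the source node in the first column and the destination node in the second:

# Build A (A[j, i] = multiplicity of edge i -> j divided by deg(i)) and iterate PageRank.
edges <- as.matrix(read.table("graph.txt"))
n <- 100; alpha <- 0.2; k <- 40
deg <- tabulate(edges[, 1], nbins = n)           # out-degrees, counting multi-edges
A <- matrix(0, n, n)
for (e in 1:nrow(edges)) {
  i <- edges[e, 1]; j <- edges[e, 2]
  A[j, i] <- A[j, i] + 1 / deg[i]                # multi-edges contribute multiple times
}
r <- rep(1 / n, n)                               # r^(0) = (1/n) 1
for (it in 1:k) {
  r <- alpha / n + (1 - alpha) * A %*% r         # r^(i) = (alpha/n) 1 + (1 - alpha) A r^(i-1)
}
head(order(r, decreasing = TRUE), 5)             # top-5 node ids; r[...] gives their scores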

You need to do the following tasks:

1.  (5) Implement the iterative procedure mentioned above to compute PageRank.

2.  (5) Run the aforementioned iterative process for 40 iterations with α = 0.2 and obtain the PageRank vector r .

3.  (5) List the top 5 node ids with the highest PageRank scores (submit both ids and scores).

4.  (5) List the bottom 5 node ids with the lowest PageRank scores (submit both ids and scores).