闪电代写 -代写CS作业_CS代写_Finance代写_Economic代写_Statistics代写_代码代做_IT代写_加急帮助

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

PACSLZ-501001: Statistical Inference for Big Data

Exercise 5

This is the ﬁnal project of this course. Our task is to write our own learning and classiﬁcation program on the synthesized data set we used in Homework 4. As a reminder, these data samples x e R2 are generated according to a mixed-Gaussian model

nClusters

pX|Y(x Ii) = cj．N (x; µi,j , I2 ), i = 0, 1

j=1

where µi,j ’s are some randomly chosen cluster centers, I2 is a 2 × 2 identity covariance matrix, and cj ’s satisfy that j cj = 1. A typical plot of the generated data, with nClusters = 10 is shown in the following ﬁgure. We have shown in Homework 4 that this dataset can be

learned and classiﬁed with a simple neural network. The goal of this project is to write our own learning algorithm, based on iterative updating of the parameters, to do the same task, so that we can develop an understanding of what are the key ingredients inside a neural network.

Model Family and Parameters

To start, we deﬁne a model family. Of course we could just deﬁne the family to be mixed Gaussians, but that would appear to be cheating. A commonly used model family is given in a very general form as follows.

eFi(北)

PY |X (iIx; θ) = i eFi(北) y = i e y (1)

for some set of functions Fi (x) for each value i that Y can take. Now a quick observation is that adding a constant to every Fi (x) does not change the ratio, so w.o.l.g. we often take a convenient constraint that

Fi (x) = 0, for all x.

In this project, we will only consider the simple case that y = (0, 1}, i.e., we only have two classes, so we have F1 (x) = 一F0 (x) = F (x). So we will work with a slightly simpler model as

PY |X(1Ix) =

eF (北)

eF (北) + e_F (北) ,

PY |X(0Ix) =

e_F (北)

eF (北) + e_F (北)

(2)

and I will drop the vector sign on X in the following whenever it does not cause confusion. We assume that

k_1

F (x) = weights[j]．b(x, center[j]) (3)

j=0

i.e., a linear combination of some base function b(x, center) that is easy to compute and easy to ﬁt, which we will give an example later. All we need to worry about is to ﬁnd these weights and the center coeﬃcients of the base functions. We allow k of them in total, which gives the complexity of the model. We write θ = (weights[j] e R, center[j] e R2 , j = 0, . . . , k 一 1) as the parameters of the distributions. θ is what we need to learn from the data. When we are given a set of training samples (x[i], y[i]), i = 0, . . . , n 一 1, we need to ﬁnd the maximum likelihood estimate of θ

θˆML = arg mθ(a)x log PY |X (y[i]Ix[i])\ (4)

where PY |X depends on the parameters θ as deﬁned in (2) and (3).

Short Discussion 1 You might feel that the form in (2) a bit strange. Why don’t we directly write the distributions as linear combinations of simple functions? It turns out that this particular form of doing linear combinations on the exponent is convenient when we have a lot of samples, and we would like to add up their log likelihood values. This form is closely related to a concept called “exponential family”, which is very important but we will not go into details in this course. You are certainly encouraged to try with other forms of distribution families.

The base function b(x, center) can be chosen as a uni-modal function that is easy to compute. We would like the function value to be larger for x close to center and smaller otherwise. For example, a simple choice can be

b(x, center) = ,0(1)

if |x 一 center|2 < r2

oherwirse

for r as global parameters. A sensible choice I take is r2 = 20, such that most samples x generated from ~ N (center, I2 ) would have b(x, center) = 1. A slightly better choice that emphasizes on samples closer to the center and gradually decrease the base function values is a truncated quadratic function

b(x, center) = ,0(r)2 一 |x 一 center|2

if |x 一 center|2 < r2

otherwise

This should be compared with neural networks, where the outputs of the last layer of nodes can be thought as such base functions, but chosen from some diﬀerent families of functions that can be generated by the previous layers of activation functions. The last layer combining corresponds to using the weights to form a new linear combination. It turns out that when we use the right loss function, the optimization is exactly the same as matching distributions by using the linear combination on the exponent as we do here.

Task 1. You need to implement an eﬃcient way to evaluate the base functions. In particular, with input x as an array of samples, [nSamples, 2], and the center as a 2-dimensional array, the function should return an array of base function values for all samples together.

Task 2. In order for us to debug later and monitor how the right set of parameter is learned, we should have a way to visualize the chosen function. This takes some plotting skills, and you should do this in a way of your own choice. An example code is given as follows.

When you design your own visualization, you need several components:

. Scatter plot are probably the right choice to see all the samples as well as the centers

of the base functions, which in the above code are stored in coef array (size k × 3), i.e., coef [:, 0] and coef [:, 1]. coef [:, 一1] is used to set the size of the scatter.

. I used colors to indicate which class the samples belong, and how the weights corre-

sponding to classes 0 and 1. A colormap ‘bwr’ is used so that more blue and red colors are used to indicate the two diﬀerent classes. In this colormap, ‘white’ means “not sure which class”, so I had to change the background to gray.

● I made the plot somewhat transparent by the alpha parameter in the plots, so that when the samples and functions are plotted together I can still see both.

● Fix the axis of the plot is helpful.

Update the Parameters

Obviously, we start by randomly choosing all the parameters. We will use an iterative method to ﬁnd the maximum likelihood estimate of these parameters. That is , in each step, we ﬁx all but one of the parameters, and update this one parameter to increase the likelihood.

Without loss of generality, let’s say that we would like to update θ0 = (center[0, :] e R2 , weights[0]). For convenience, let’s write

Fold (x) = weights[j]．b(x, center[j, :])

j 0

f (x; θ0 ) = weight[0]．b(x, center[0, :])

and our goal is to ﬁnd a θ0 such that when we form F (x) = Fold (x) + f (x; θ0 ), and after that PY |X from (2), the updated PY |X ﬁts the data better.

Let’s consider the log likelihood of the training data, where we use XY as the joint

= ExY ┌Y log + (1 一 Y) log ┐

= ExY [2YF (X) 一 log(1 + e2F (X))┐

With the deﬁnition F (x) = Fold (x)+f (x) to ﬁx all but one of the base functions in F (x), we solve in each iteration

fupdate = arg m x Ef(a) xY ← 2Y (Fold (X) + f (X)) 一 log(1 + e2(Fold (X)+f (X)))]

and then make the update Fnew ← Fold + fupdate .

The key is to solve (5). For convenience, let’s deﬁne

l(Fold +f )(x, y) 2y(Fold (x) + f (x)) 一 log(1 + e2(Fold (x)+f (x)))

so that (5) can be written in a shorter form as maximizing

n_1

When we don’t have any other constraint, and are allowed to choose any function f (．), then we can choose the value f (x) for each value of x separately, by solving

fupdate (x) = arg max

1xi =x． ← 2yi (Fold (xi ) + a) 一 log(1 + e2(Fold (xi)+a))]

This is actually simpler than it looks. Since xi ’s take real values, we can safely assume that no two samples in our data set has exactly the same xi , which means we can just do the above optimization for each sample:

fupdate (xi ) = arg m xa(a) ← 2yi (Fold (xi ) + a) 一 log(1 + e2(Fold (xi)+a))]

In reality, we have some constraints of fupdate that has to be some scaled version of a single base function from the set of base functions we have chosen. This means we cannot arbitrarily choose f (x) for every x as we wish, but instead have to compromise between the ideal f (x) values at diﬀerent x, which is really a ﬁtting problem.

Fitting a Single Base Function

We start by taking the derivative of the objective function w.r.t. a particular f (xi ), at the point that the entire function f (．) = 0, (Notice that we should take derivative w.r.t. f (x) for every x, but since the objective function is evaluated at a set of samples, so only the derivative w.r.t. f (xi ) for xi in the sample set matters.)

╱ l(Fold +f) (xi , yi )\ │f(.)=0

= ╱yi 一 \

A gradient descend algorithm (ascend in this case as we are maximizing the objective function) would now pick

fupdate (xi ) α ╱yi 一 \

and use that for the update Fnew = Fold + fupdate .

We should stop here at take a look of this gradient function we just calculated and make some sense out of it. The term eFold (xi)/(eFold (xi) + e_Fold (xi)) looks familiar! It is by deﬁnition (2) the model PY |X(1Ixi ) based on our old function Fold . The conditioning on xi means that based on Fold , we use our model and predict the probability Yi = 1 as this value. Then yi minus that is in a sense the error in our current prediction, on this pair (xi , yi ). Clearly, we would like to use fupdate to correct them. For convenience, we deﬁne a Z-value

Z(x, y) ╱y 一 \

and we hope the update function fupdate to resemble the empirical average of Z-values fupdate (xi ) α Z(xi , yi ) (6)

We need to do that with the constraint that fupdate can only be chosen as one of our base function. So we cannot make the above choice precise, but have to do it approximately. Let’s try to ﬁnd a fupdate , which is a function for all x, but within the family of functions that we can choose from, and hope that it is close to the desired values at all xi ’s, as close as possible. For that, we can change the problem into minimizing the mean square error

fupdate = arg min

[Z(xi , yi ) 一 f (xi )]2

where the optimization is a constrained optimization, where f is chosen from the allowed base functions.

This is the step that we are a little bit hand waving. Why should we use the MSE as a way to see if the chosen f is “close” to the ideal gradient in (6)? Could we use some other error measure for this ﬁtting? These are really good questions that we actually can answer, but with some more analysis. You are encouraged to explore along this line. But to make this project work, implementing this MSE ﬁtting should suﬃce.

Task 3: Design an algorithm to do this!

First, assuming we already ﬁnd the correct base function b(x, center[0, :]) for f . Then, the optimal weight for the least square problem should be (try to convince yourself)

weight[0] = ExY [Z(x, y)．b(x, center[0, :])]

ExY [b(x, center[0, :])2 ] .

Now let’s consider how to ﬁnd the new base function (ﬁnd its center). One can do a brute force search for this center. That is, one can try to put a base function centered at every possible point on the 2-D space, and see how much the square error will be if this center is chosen.

A problem of this approach is that there are too many choices of center’s. One remedy to that is to only allow new centers to be selected on existing samples. That is, to pick centers only within the same class of samples. This is meaningful if we do not really worry about those isolated samples. It is important to notice that the base function of x[n] with center being at another sample x[m] depends only on the distance between the two samples, which should be computed and stored before the iterations start, to avoid repetitive calculations.

You should use your visualization tool in Task 2 to see how this works.

Now we are already doing decently well in the overall task. We ﬁrst compare the mean square error between f (x) and Z-value with diﬀerent base functions. Then, pick the base function for our f (x) corresponding to the minimum error. Finally, assign the optimal weight to the selected base function. The resulting linear combination should work quite well.

Task 4: Design your own algorithm to choose a center and assign a weight to it.

The goal is to make sure your choice give a boost to the F values of those you want to boost, but not change too much on those in the right class. At this point we are deeply in the area of greedy and engineering choices. Clean solutions might not exist. Any creative step that works can be accepted, and this is a perfect place to practice your debug capability, to look at the real data and decide what to do.

Task 5: Think!

This is a simple project designed for you to practice using iterative algorithms to adjust one small part of your model at a time. If successful, you should observe that the neural network on the same data set can have roughly the same level of performance as your own algorithm. In a sense, neural networks, as well as any other learning algorithms that work on this data must be doing roughly the same thing. Your ﬁnal presentation of this should contain not only the performance of your code, but also some reﬂections on what you have done in this project: what creative steps you have taken? what you have been thinking and made connections in this process? how you would propose to improve if you had more time? We do not want to just read your code at the end, but would like you to start learning how to be a thoughtful and creative engineer: thoughtful means you are equipped with the right mathematical tools and good intuitions on the overall procedure, creative means you can come up with good solutions that surprises the others.

The following ﬁgure is what we should see from the learning process.