DSCI 552 Sample Final 2018
1. For the following classification problem, design a single-layer perceptron that has zero error on the training set.
C1 = {x1 = [1, 1]^T, x2 = [1, 2]^T}
C2 = {x3 = [2, -1]^T, x4 = [2, 0]^T}
C3 = {x5 = [-2, 1]^T, x6 = [-1, 2]^T}
C4 = {x7 = [-1, -1]^T, x8 = [-2, -2]^T}
solution: The patterns are shown in the following figure:
And tentative decision boundaries can be those in the following figure:
It is left to the student to verify that a single-layer perceptron with two neurons can classify the vectors, given that the four classes are coded with the target vectors:

C1 → t1 = [-1, -1]^T    C2 → t2 = [-1, 1]^T    C3 → t3 = [1, -1]^T    C4 → t4 = [1, 1]^T
Sample Final DSCI 552, Instructor: Mohammad Reza Rajati May 4, 2018
and the weight vectors are columns of the following matrix; one choice consistent with the decision boundaries x1 = 0 and x2 = 1/2 is

W = [ -1   0 ]
    [  0  -1 ]

and the biases in the vector:

b = [  0  ]
    [ 1/2 ]

The architecture of the perceptron is a single layer of two symmetric hard-limit neurons, a = hardlims(W^T x + b).
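As a sanity check, the classification can be verified numerically. Below is a minimal sketch assuming weight columns w1 = [-1, 0]^T and w2 = [0, -1]^T, biases [0, 0.5]^T, and the two-bit target coding t1 = [-1,-1], t2 = [-1,1], t3 = [1,-1], t4 = [1,1]; these are one consistent choice, not necessarily the exact values from the original figure.

```python
import numpy as np

# Training data: eight 2-D points in four classes (from the problem statement).
X = np.array([[1, 1], [1, 2],      # C1
              [2, -1], [2, 0],     # C2
              [-2, 1], [-1, 2],    # C3
              [-1, -1], [-2, -2]]) # C4

# Two-bit class codes: one target vector per point.
T = np.array([[-1, -1], [-1, -1],
              [-1,  1], [-1,  1],
              [ 1, -1], [ 1, -1],
              [ 1,  1], [ 1,  1]])

# Assumed weights/biases (decision boundaries x1 = 0 and x2 = 0.5);
# the weight vectors are the columns of W.
W = np.array([[-1.0, 0.0],
              [0.0, -1.0]])
b = np.array([0.0, 0.5])

def hardlims(n):
    """Symmetric hard-limit activation: +1 if n >= 0, else -1."""
    return np.where(n >= 0, 1, -1)

outputs = hardlims(X @ W + b)  # X @ W applies the columns of W as weight vectors
print((outputs == T).all())    # zero error on the training set
```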
2. Answer the following questions about the figure:
(a) What is the leave-one-out cross-validation error estimate for maximum margin
separation in the figure? (The answer is a number)
(b) True or False? We would expect the support vectors to remain the same in general
as we move from a linear kernel to higher order polynomial kernels.
(c) If each of the classes is modeled with a normal distribution and the variances are assumed equal, how many parameters are needed to describe the decision boundary for classification?
(d) Assume that all data points outside the margin are unlabeled. Mark on the figure with a - the first data point that has to be labeled, when implementing active learning. Also, mark on the figure with a + the first data point that has to be labeled, when implementing self-training.
solution:
(a) Based on the figure, we can see that removing any single point would not change the resulting maximum-margin separator. Since all the points are initially classified correctly, the leave-one-out error is zero.
(b) There are no guarantees that the support vectors remain the same. The feature vectors corresponding to polynomial kernels are non-linear functions of the original input vectors, and thus the support vectors for maximum-margin separation in the feature space can be quite different.
(c) Modeling classes with normal densities with equal variances creates linear decision boundaries. A linear decision boundary in two variables x1 and x2 is described by three parameters (w0, w1, w2), i.e.: g(x1, x2) = w0 + w1 x1 + w2 x2.
(d)
3. Consider the unlabeled two-dimensional data represented on the following figure.
(a) Using the two points marked as squares as initial centroids, draw (on that same
figure) the clusters obtained after one iteration of the k-means algorithm (k = 2).
(b) Does your solution change after another iteration of the k-means algorithm?
solution:
(a)
(b) No.
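Since the figure is not reproduced here, the k-means iteration can be illustrated on hypothetical data. The sketch below uses made-up 2-D points with two well-separated groups and two of the points as initial centroids (standing in for the "squares"); it performs one assignment-and-update step and shows that a second step leaves the clusters unchanged.

```python
import numpy as np

def kmeans_step(X, centroids):
    """One iteration of k-means: assign each point to its nearest
    centroid, then recompute each centroid as its cluster's mean."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids

# Hypothetical data (the exam figure is not reproduced; these points are
# illustrative only).
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [5., 5.], [5., 6.], [6., 5.]])
init = np.array([[1., 0.], [6., 5.]])  # two data points as initial centroids

labels, centroids = kmeans_step(X, init)
labels2, centroids2 = kmeans_step(X, centroids)
print((labels == labels2).all())  # assignments unchanged after one iteration
```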
4. A company with headquarters in the Bay Area has two offices, in Los Angeles and San Diego. An employee in the San Diego office is sent to the Los Angeles office the next day with probability 0.2 and stays in the San Diego office with probability 0.8. An employee in the Los Angeles office is sent to the San Diego office with probability 0.1 and stays in the Los Angeles office with probability 0.9. A new employee is assigned to the Los Angeles office with probability 0.7 and to the San Diego office with probability 0.3. An employee in the San Diego office works between six and eight hours per day with probability 0.6, works more than eight hours with probability 0.3, and works less than six hours per day with probability 0.1. An employee in the Los Angeles office works between six and eight hours per day with probability 0.15, works more than eight hours with probability 0.8, and works less than six hours per day with probability 0.05. A manager in the headquarters can only observe the number of hours each employee worked each day.
(a) Construct a Hidden Markov Model that models the observations of the manager
in their headquarters. Clearly show the parameters with matrices and vectors and draw a state transition graph for the model.
(b) If the manager observes the number of hours a new employee worked in the first
three consecutive days of work to be 5, 9, 7, what is the most likely sequence of places at which the employee worked in those three days?
(c) What sequence of three places has the maximum expected number of correct places?
solution:
(a) The figure below shows the HMM for the problem. We use L, M, H to denote working less than six hours, between six and eight hours, and more than eight hours per day, respectively.

[State-transition graph: states LA and SD, with self-transitions 0.9 (LA) and 0.8 (SD), LA → SD with probability 0.1, SD → LA with probability 0.2; each state emits L, M, H.]

The parameters will be:

π0 = [ 0.7  0.3 ]

A = [ 0.9  0.1 ]
    [ 0.2  0.8 ]

       L     M     H
B = [ 0.05  0.15  0.80 ]
    [ 0.10  0.60  0.30 ]

where the rows show LA and SD, respectively.
(b) The sequence of observations is L, H, M . We do not need to use the Viterbi
algorithm here. An exhaustive search is easy:
States        Probability
LA, LA, LA    (0.7)(0.05)(0.9)(0.8)(0.9)(0.15) = 0.003402
LA, LA, SD    (0.7)(0.05)(0.9)(0.8)(0.1)(0.6)  = 0.001512
LA, SD, SD    (0.7)(0.05)(0.1)(0.3)(0.8)(0.6)  = 0.000504
LA, SD, LA    (0.7)(0.05)(0.1)(0.3)(0.2)(0.15) = 0.0000315
SD, SD, SD    (0.3)(0.1)(0.8)(0.3)(0.8)(0.6)   = 0.003456
SD, LA, SD    (0.3)(0.1)(0.2)(0.8)(0.1)(0.6)   = 0.000288
SD, SD, LA    (0.3)(0.1)(0.8)(0.3)(0.2)(0.15)  = 0.000216
SD, LA, LA    (0.3)(0.1)(0.2)(0.8)(0.9)(0.15)  = 0.000648
Therefore, the most likely sequence of places is SD, SD, SD.
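The exhaustive search can be reproduced programmatically. A brute-force sketch over all 2^3 state sequences, using the transition, emission, and initial probabilities from the problem statement (the dictionary-based parameter names are ad hoc):

```python
from itertools import product

states = ["LA", "SD"]
pi = {"LA": 0.7, "SD": 0.3}                       # initial distribution
A = {("LA", "LA"): 0.9, ("LA", "SD"): 0.1,        # transition probabilities
     ("SD", "LA"): 0.2, ("SD", "SD"): 0.8}
B = {"LA": {"L": 0.05, "M": 0.15, "H": 0.8},      # emission probabilities
     "SD": {"L": 0.1,  "M": 0.6,  "H": 0.3}}

obs = ["L", "H", "M"]  # 5, 9, 7 hours -> <6, >8, and 6-8 hours

def joint(seq, obs):
    """P(states = seq, observations = obs) under the HMM."""
    p = pi[seq[0]] * B[seq[0]][obs[0]]
    for prev, cur, o in zip(seq, seq[1:], obs[1:]):
        p *= A[(prev, cur)] * B[cur][o]
    return p

probs = {seq: joint(seq, obs) for seq in product(states, repeat=3)}
best = max(probs, key=probs.get)
print(best, round(probs[best], 6))  # ('SD', 'SD', 'SD') 0.003456
```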
(c) For the sequence with the maximum expected number of correct states, we calculate the posterior probability of each state at each step from the table above:
t               0          1          2
P(LA | LHM)   0.5418345  0.5816555  0.4272931
P(SD | LHM)   0.4581655  0.4183445  0.5727069
Hence, the answer using this method is LA, LA, SD.
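These posterior marginals can be computed by summing the eight joint probabilities over the sequences that pass through a given state at a given step and normalizing by P(obs). A self-contained sketch (same ad hoc parameter names as before):

```python
from itertools import product

states = ["LA", "SD"]
pi = {"LA": 0.7, "SD": 0.3}
A = {("LA", "LA"): 0.9, ("LA", "SD"): 0.1,
     ("SD", "LA"): 0.2, ("SD", "SD"): 0.8}
B = {"LA": {"L": 0.05, "M": 0.15, "H": 0.8},
     "SD": {"L": 0.1,  "M": 0.6,  "H": 0.3}}
obs = ["L", "H", "M"]

def joint(seq):
    """P(states = seq, observations = obs) under the HMM."""
    p = pi[seq[0]] * B[seq[0]][obs[0]]
    for prev, cur, o in zip(seq, seq[1:], obs[1:]):
        p *= A[(prev, cur)] * B[cur][o]
    return p

seqs = list(product(states, repeat=3))
total = sum(joint(s) for s in seqs)  # P(obs) = 0.0100575

# Posterior marginal P(state_t = s | obs).
def marginal(t, s):
    return sum(joint(q) for q in seqs if q[t] == s) / total

decoded = [max(states, key=lambda s: marginal(t, s)) for t in range(3)]
print(decoded)  # ['LA', 'LA', 'SD']
```

Note that this marginally-most-likely sequence differs from the jointly most likely sequence (SD, SD, SD) found in part (b).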
5. Consider the two classes of patterns that are shown in the figure below. Design a multilayer network to distinguish these categories.
solution:
There are many different multilayer networks that could solve this problem. The vectors in each class can be represented as:
x1 = [1, 1, -1, -1]^T   x2 = [-1, -1, 1, 1]^T   x3 = [1, -1, 1, -1]^T   x4 = [-1, 1, -1, 1]^T
We will design a network by first noting that for the Class I vectors either the first two elements or the last two elements will be “1” . The Class II vectors have alternating “1” and “-1” patterns. This leads to the network shown in the figure below. Note that x1 , x2 , x3 , x4 denote the four elements in each of the vectors x1 , x2 , x3 , x4 .
[Figure: two-layer network. The first-layer neurons compute AND operations on (x1, x2) and on (x3, x4); the second-layer neuron computes an OR of their outputs.]
The first neuron in the first layer tests the first two elements of the input vector. If they are both “1” it outputs a “1”, otherwise it outputs a “-1” . The second neuron in the first layer tests the last two elements of the input vector in the same way. Both of the neurons in the first layer perform AND operations. The second layer of the network tests whether either of the outputs of the first layer are “1” . It performs an OR operation. In this way, the network will output a “1” if either the first two elements or the last two elements of the input vector are both “1” .
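The AND/OR construction described above can be written out with symmetric hard-limit neurons. The weights below are one standard choice (AND of two ±1 inputs as hardlims(x_i + x_j - 1), OR as hardlims(a1 + a2 + 1)); the exact values in the original figure may differ.

```python
import numpy as np

def hardlims(n):
    """Symmetric hard-limit activation: +1 if n >= 0, else -1."""
    return np.where(n >= 0, 1, -1)

# Class I: first two OR last two elements are both +1; Class II: alternating.
class_I  = np.array([[1, 1, -1, -1], [-1, -1, 1, 1]])
class_II = np.array([[1, -1, 1, -1], [-1, 1, -1, 1]])

# Layer 1: two AND neurons over (x1, x2) and (x3, x4).
W1 = np.array([[1, 0],
               [1, 0],
               [0, 1],
               [0, 1]])
b1 = np.array([-1, -1])   # x_i + x_j - 1 >= 0 only when both inputs are +1

# Layer 2: one OR neuron over the two AND outputs.
W2 = np.array([[1], [1]])
b2 = np.array([1])        # a1 + a2 + 1 >= 0 when either input is +1

def net(X):
    a1 = hardlims(X @ W1 + b1)          # first-layer AND outputs
    return hardlims(a1 @ W2 + b2).ravel()  # second-layer OR output

print(net(class_I), net(class_II))  # [1 1] [-1 -1]
```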
6. The following diagram shows a person's state of mind and the actions they may perform due to happiness (from parts (a)-(c): J = finds a job, F = has food, H = is happy, M = watches a movie, D = dances). Assume that

P(J) = 0.1
P(F) = 0.8
P(H | J, F) = 0.9
P(H | J, ~F) = 0.8
P(H | ~J, F) = 0.7
P(H | ~J, ~F) = 0.1
P(M | H) = 0.7
P(M | ~H) = 0.2
P(D | H) = 0.3
P(D | ~H) = 0.05
(a) Calculate the probability that the person watches a movie.
(b) Show that if the person is happy, having had food explains away finding a job.
(c) Calculate the probability that the person goes to a movie and dances given that the person is happy.
solution:
(a) By the law of total probability, we have:

P(M) = P(M|H)P(H) + P(M|~H)P(~H)

Therefore, we need P(H):

P(H) = P(H|J,F)P(J)P(F) + P(H|~J,F)P(~J)P(F) + P(H|J,~F)P(J)P(~F) + P(H|~J,~F)P(~J)P(~F)
     = (0.9)(0.1)(0.8) + (0.7)(0.9)(0.8) + (0.8)(0.1)(0.2) + (0.1)(0.9)(0.2) = 0.61

Hence

P(M) = (0.7)(0.61) + (0.2)(0.39) = 0.505
(b) We need to show P(F|H) > P(F|H,J). By Bayes' rule:

P(F|H) = P(H|F)P(F) / P(H)

But by the law of total probability:

P(H|F) = P(H|J,F)P(J) + P(H|~J,F)P(~J) = (0.9)(0.1) + (0.7)(0.9) = 0.72

Therefore:

P(F|H) = (0.72)(0.8) / 0.61 = 0.9443
Next,

P(F|H,J) = P(H|J,F)P(J,F) / P(H,J) = P(H|J,F)P(F|J)P(J) / (P(H|J)P(J)) = P(H|J,F)P(F|J) / P(H|J)

On the other hand, P(F|J) = P(F), because F and J are independent. Also, by the law of total probability, P(H|J) = P(H|J,F)P(F) + P(H|J,~F)P(~F) = (0.9)(0.8) + (0.8)(0.2) = 0.88.

Therefore,

P(F|H,J) = P(H|J,F)P(F) / P(H|J) = (0.9)(0.8) / 0.88 = 0.8182

Since 0.9443 > 0.8182, indeed P(F|H) > P(F|H,J): observing that the person found a job explains away the food as a cause of happiness.
(c) Given H, M and D are conditionally independent:

P(M, D | H) = P(M, D, H) / P(H) = P(H)P(M|H)P(D|H) / P(H) = P(M|H)P(D|H) = (0.7)(0.3) = 0.21
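All of the numbers in this problem can be double-checked by brute-force enumeration of the joint distribution over the five binary variables (the CPTs come from the problem statement; the helper name `pr` is ad hoc):

```python
from itertools import product

# CPTs from the problem: J and F are root causes of H; M and D depend on H.
P_J, P_F = 0.1, 0.8
P_H = {(True, True): 0.9, (True, False): 0.8,
       (False, True): 0.7, (False, False): 0.1}   # P(H | J, F)
P_M = {True: 0.7, False: 0.2}                      # P(M | H)
P_D = {True: 0.3, False: 0.05}                     # P(D | H)

def pr(event):
    """Sum the joint probability over assignments satisfying `event`."""
    total = 0.0
    for j, f, h, m, d in product([True, False], repeat=5):
        p = ((P_J if j else 1 - P_J) * (P_F if f else 1 - P_F)
             * (P_H[(j, f)] if h else 1 - P_H[(j, f)])
             * (P_M[h] if m else 1 - P_M[h])
             * (P_D[h] if d else 1 - P_D[h]))
        if event(j, f, h, m, d):
            total += p
    return total

print(round(pr(lambda j, f, h, m, d: m), 4))                 # P(M) = 0.505
p_F_H = pr(lambda j, f, h, m, d: f and h) / pr(lambda j, f, h, m, d: h)
p_F_HJ = (pr(lambda j, f, h, m, d: f and h and j)
          / pr(lambda j, f, h, m, d: h and j))
print(round(p_F_H, 4), round(p_F_HJ, 4))  # 0.9443 > 0.8182: explaining away
p_MD_H = pr(lambda j, f, h, m, d: m and d and h) / pr(lambda j, f, h, m, d: h)
print(round(p_MD_H, 2))                                      # P(M, D | H) = 0.21
```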