闪电代写 -代写CS作业_CS代写_Finance代写_Economic代写_Statistics代写_代码代做_IT代写_加急帮助

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

EE 660

Homework 3 (Week 6)

2022

Please note: in AML, Exercises are embedded in the text of each section; Problems are at the end of each chapter. Please be sure you work the assigned problem or exercise to get credit.

1. AML Problem 1.7a,b (p. 36), plus the following: For part (b), add: also plot for N = 60 .

Hint: in this problem, tossing one coin N times corresponds to one draw of a dataset of size N. Doing this with 1000 coins independently, corresponds to drawing 1000 different datasets, each of size N.

2. Suppose you have trained a model in a binary classification problem, and you want to estimate its error based on a test set.

You want to estimate this error empirically for the case where labeled data is expensive. So you decide to first do some simulations to see how your test-set error might vary with the “luck of the draw” of the test set.

Let the true probability of error on your model be Eout (h) = µ . Because µ is unknown to you, for purposes of simulation you will try different values of µ.

Method: Conceptually, a test set D (N) is created by drawing N data points randomly from the input space, with replacement. An expert could correctly label each data point, and then you could color each data point as “correct” or “incorrect” depending on the ML classifier’s prediction.

You decide to simulate this by drawing (colored) data points randomly, with

replacement, from a bin of “correct” and “incorrect” data points, with

P (incorrect ) = µ.

(a) Let µ = 0.30 and N = 10.

why or why not.

(ii) Calculate theoretically, using the given values of µ and N, the

probability P +ED(1#)(ℎ) = u1 for any draw of N data points, rounded to 2 decimal places (0.xx).

(iii) Give a theoretical expression for the probability P +ED(N)( ℎ) = µ1 in

terms of variables u, N . Assume that uN is an integer.

ED(10) (h) from each run (no need to turn in these 100 values). Give the

following statistics and answer these questions:

(i)

(i) µ = 0.10, N = 10 u = 0.10, N = 100

(ii) u = 0.30, N = 100

(iii) u = 0.50, N = 10

u = 0.50, N = 100

Tip: present your results in a table (including values given in (a)) for easy interpretation. A sample table is as below.

u	N	max{ED }	min{ED }	sample mean {ED }	sample std {ED }	# runs ED ≠ u	P(\|ED − u\| < 0.05)
0.1	10
0.1	100
0.3	10
0.3	100
0.5	10
0.5	100

(d) Comment on your results, in the following:

(i) How does the accuracy of your error-rate estimate from a test dataset, vary with N ? Hint: look at min, max, std, and P +5ED(N)(ℎ) − µ 5 < 0.051.

(ii) For the classifier of (c)(iii):

Þ based on the given true error rate (P(incorrect)), did the classifier learn anything?

Þ How many test datasets of your draws in (c)(iii) gave an error rate indicating that the classifier did learn something, assuming that ED(N)(ℎ) ≤ 0.45 or ED(N)(ℎ) ≥ 0.55 means it learned something?

3. Consider a hypothesis set ℋ consisting of all positive-interval hypotheses and all

negative-interval hypotheses. (See AML pp.43-44, Example 2.2, positive intervals,

for a definition of positive interval hypotheses.) Negative-interval hypotheses return

ℎ = − 1 within the interval and ℎ = +1 outside the interval. Thus, any hypothesis

h in ℋ contains one interval, and that interval can be positive or negative.

Hint: before answering the questions below, you may find it helpful to read or review AML Example 2.2, cases 1 and 2, or Discussion 5 notes.

(a) Find the growth function mH (N) as a function of N. Show your work or

reasoning.

Tip: you may use the AML book’s result for negative intervals, and build on top of that.

(b) Give the smallest break point k of ℋ . Briefly justify your answer.

4. Consider the WiFi localization dataset used in Homework 2: it had N = 2000 data points and 7 features, split into a training set of 1500 data points and a test set of 500 data points. Suppose you intend to use a linear perceptron classifier on that data instead of random forest. In the parts below, for the tolerance δ in the VC generalization bound, use 0.1 (for a certainty of 0.9). The parts below have short

answers.

Hint: You may use without proof the relation that if H is a linear perceptron

classifier in D dimensions (D features), dVC (H ) = D + 1 .

a) What is the VC dimension of the hypothesis set?

b) Expressing the upper bound on the out-of-sample error as

For Ein (hg ) measured on the training data, use dvc from part (a) to get a value

for εvc .

c) To get a lower εvc , suppose you reduce the number of features to D = 4. Now what is εvc ?

d) Suppose that you had control over the number of training samples NTr (by collecting more WiFi data). How many training samples would ensure a generalization error of εvc = 0.1 again with probability 0.9 (the same tolerance

δ = 0.1), and using the reduced feature set (4 features)?

Hint: if you’re not sure how to solve for N, see an example in AML Sec. 2.2.1

(“Sample Complexity” , pp.57).

Use the original feature set and the original test set, so that NTest = 500. Use the version of e that gives the tightest bound. Give the appropriate expression for ε and calculate it numerically.

Does the number of features have any effect on this ε ?

5. You are given a regression problem in 1D. The target function is

f(x) = σ (3x) =

and feature space extends over − 1 ≤ x ≤ +1 . p (x) is a uniform distribution over

Each draw of the dataset consists of N = 2 points:

D = {(x1, y1 ), (x2, y2 )} = {+x1, σ (3x1 )1, +x2, σ (3x2 )1} . The hypothesis set consists of all lines ℎ (x) = ax + b .

In this problem you will explore bias and var numerically.

You may use built-in python functions, including functions in NumPy to draw random numbers, such as numpy.random.uniform and numpy.random.normal.

(a) Give an algebraic expression for the best hypothesis ℎ in terms of

x1, x2, y1, y2 . This is the hypothesis that minimizes the MSE on D . Hint: no need to take derivatives and do a formal minimization.

(b) Find the mean best hypothesis numerically. Tip: average over many draws of D . Give resulting a and b. Draw a plot containing curves of f(x), , and several ℎD) over the range of − 1 ≤ x ≤ +1. The number of ℎD) curves can be determined yourself by best visibility.

(d) Now let there be sensor noise on the data, so that

y (x) = f(x) + ϵn , ϵn ∼ N ( ϵn ∣ un = 0, Gn = 0.004 )

Repeat parts (a)-(c) for this noisy data. Tip: if there is no change in (a), just state so.

Is var of the learned system very sensitive to σn ? Conjecture a reason why or why not.

(e) Now let there also be some sensor bias on the data, so that

y (x) = f(x) + en , en ∼ N ( en ∣ un = 0. 1, Gn = 0.004 )

Repeat parts (a)-(c) for this biased and noisy data. Tip: if there is no change in (a), just state so.

What is the effect of the sensor bias on the bias and var of the learned system (i.e., does it increase, decrease, or have no effect, on bias and on var)?