Problem Set 5 - EC420

Front Matter

This assignment covers the use of instrumental variables to estimate a causal effect. It uses simulated data so that we can gain insight into the nature of endogeneity and into how instrumental variable methods address it.

● Download a fresh copy of the EC420 Assignment Template from D2L. Save it in a folder you created just for EC420 Problem Set 5 and give it a real name.

● For each of the numbered main headers (Part 0, Part 1, Part 2) in this document, create a main header in your markdown document using a double ##, as shown in the template. Headers must be followed by an empty space. Please review your ﬁnal work to make sure your headers are developed properly.

● When there are tasks (which require only coding) and questions (which are to be answered) (e.g. “Task 1.A” and “Question 1.B”), then use triple ###’s, as shown in the template.

● Tasks only require coding. Make sure your code is “echo”ed into your .pdf. Each Task can be done in one code chunk. Questions should not require any additional coding (though you can add to the corresponding code chunk if you want).

Part 0

Task 0.1 (2 points)

In a code chunk, load the wooldridge, lmtest, sandwich, zoo, lubridate, and AER packages. If you have not yet installed all of them (the last three are new), then do so. Remember, never use install.packages inside a code chunk: install once, directly in the console, then use library to load the package in your chunk. Make sure you have put your name in the header of your work, and check now that you can knit to pdf without any problems.
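One way to check what still needs installing before you knit is sketched below. This is only an illustration and not part of the template: the object names pkgs, installed, and missing_pkgs are made up here, and the snippet only reports missing packages rather than installing anything.

```r
# Packages named in Task 0.1
pkgs <- c("wooldridge", "lmtest", "sandwich", "zoo", "lubridate", "AER")

# TRUE for each package that is already installed (nothing is installed here)
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  # Installation is done once, by hand, in the console
  message("Install these from the console first: ",
          paste(missing_pkgs, collapse = ", "))
} else {
  # character.only = TRUE lets library() accept the name as a string
  for (p in pkgs) library(p, character.only = TRUE)
}
```

In the chunk you turn in, plain library(...) calls for each package do the same job.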

Question 0.1 (2 points)

● (a) (2 points) What folder have you saved your .Rmd in, and is it in a separate folder from Problem Set #4?

Part 1

For this problem, we are going to generate the data ourselves as we did before. We can set up the data to have whatever issues we want to examine, and see how a “naive” regression fares versus our preferred method.

The code to generate the data is below. We’ll use Y as the outcome, D as the variable of interest (treatment), Z as the instrument, and X1, X2 as other exogenous variables. We’ll have UO as the unobserved variable causing the problem. It will be part of our data construction, but not part of our estimation. We’ll see how UO biases a naive regression, and we’ll see how well our instrument, Z , addresses the issue.

We will set the R variables that will deﬁne our data. We need to tell R how many observations we will create using the variable NN = 1000, the true β’s that will determine the outcome Y , as well as the true δ’s that will determine the endogenous treatment D .

In order for our data to have an endogeneity problem in D, it must be the case that D is determined within the system. This occurs when D is determined in part by UO. For instance, in our KIPP example, if Y is “test score” and D is “attends KIPP”, then UO might be “parental involvement in the child’s education”; UO would then affect D, and it would also affect Y.

The true data generating process for our D and Y will be:

$$Y = \beta_0 + \beta_D D + \beta_{X1} X1 + \beta_{X2} X2 + \beta_{UO}\, UO + u \tag{1}$$

$$D = \delta_0 + \delta_Z Z + \delta_{UO}\, UO + v \tag{2}$$

We will create values for $\delta_{UO}$ and $\beta_{UO}$, the increases in D and Y per unit increase in UO, ceteris paribus, that reflect this endogeneity. The code below should be copied into a code chunk in your Rmarkdown. Our goal is to recover the true $\beta_D = 1.25$, which is true by construction.

Task 1.1 (5 points)

Copy the following data creating code into a code chunk:

## Copy this section--------------------------------------------------##
## Set up the data generating process
NN = 1000

## All the true betas and variances for the first equation
beta0 = 15
betaD = 1.25
betaX1 = -.25
betaX2 = .15
betaUO = .66
sigma2u = 4

## All the true deltas and variances for the second equation
delta0 = .10
deltaUO = .35
deltaZ = .20
sigma2v = .20

## And let's set our random number generator
set.seed(4202022)

## Creating the data, P2.
### First, all of the exogenous variables:
P2 = data.frame(X1 = rnorm(n = NN, mean=2, sd=1),
                X2 = rpois(n = NN, lambda=3),
                UO = rnorm(n = NN, mean = 0, sd=2),
                Z = rnorm(n = NN, mean = 0, sd=2),
                u = rnorm(n = NN, mean = 0, sd = sqrt(sigma2u)),
                v = rnorm(n = NN, mean = 0, sd = sqrt(sigma2v)))

### Add in the endogenous variables: D and Y
P2$D = delta0 + deltaZ*P2$Z + deltaUO*P2$UO + P2$v
P2$Y = beta0 + betaX1*P2$X1 + betaX2*P2$X2 + betaD*P2$D + betaUO*P2$UO + P2$u

### Note that Y includes the unobserved variable UO, and that D is a function of UO as well.
P2unobserved = P2
## End copy --------------------------------------------------------##

A couple of notes about the code before we move on:

1. We use rnorm and rpois to create our exogenous random variables X1, X2, Z, UO, u, v .

(a) rnorm draws n = NN = 1000 random variables from a normal distribution with the mean and std. dev. specified in the template code.

(b) rpois does the same, but from a Poisson distribution, which generates count data. Try plot(Y ~ X2, P2) to see what this looks like. All the X2 values are integers.

2. Once we have all of the exogenous (determined outside the system) variables, we can create the endogenous variables:

(a) D is a function of UO , the unobserved.

(b) D is also a function of Z , and v , which are exogenous and random.
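As a small illustration of the two random-number functions described in the notes above, the sketch below draws a handful of values from each. The seed and the names n_demo, normal_draws, and poisson_draws are arbitrary choices for this example and are not part of the assignment code.

```r
set.seed(1)  # arbitrary seed, not the one used in the assignment

n_demo <- 5
normal_draws  <- rnorm(n = n_demo, mean = 2, sd = 1)  # continuous values
poisson_draws <- rpois(n = n_demo, lambda = 3)        # non-negative counts

print(normal_draws)
print(poisson_draws)

# Every Poisson draw is a whole number
all(poisson_draws == round(poisson_draws))
```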

Once we have constructed the data, we will pretend that we do not observe UO, u, v at all and try to recover the correct value for βD . We do this to test our estimation strategy. We know that a naive estimate would be biased due to UO , so we want to see if we can recover the correct value without observing UO .
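The idea of testing an estimator against a known truth can be sketched on a separate toy example. Everything below is hypothetical and deliberately different from the assignment's data generating process: w plays the role of an unobserved variable, the true coefficient on x is set to 2, and the names naive and oracle are made up for this illustration.

```r
set.seed(123)  # arbitrary seed for this toy example
n <- 5000

w <- rnorm(n)                       # hypothetical unobserved variable
x <- 0.5 * w + rnorm(n)             # regressor partly driven by w
y <- 1 + 2 * x - 1 * w + rnorm(n)   # true coefficient on x is 2

naive  <- lm(y ~ x)       # omits w, as we must when w is unobserved
oracle <- lm(y ~ x + w)   # infeasible in practice: includes w

coef(naive)["x"]    # pulled away from 2 because w is left in the error term
coef(oracle)["x"]   # close to 2
```

Because the truth is known by construction, the gap between the two estimates can be attributed to the omitted variable rather than to chance alone.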

Questions 1.1 (10 points)

In answering these questions, remember that the quality of endogeneity is relative to the system at hand. A variable may be endogenous in one example, and exogenous in another. The “system” in “determined within the system” refers to all of the variables in our equation(s).

● (a) (2 points) Are u and v endogenous or exogenous in our example? Brieﬂy explain why.

● (b) (3 points) Is D endogenous or exogenous in our example? Brieﬂy explain why.

● (c) (3 points) If D were to be only a function of Z and v, would it then be endogenous or exogenous in our example? Brieﬂy explain why.

● (d) (2 points) Is UO endogenous or exogenous in our example? Brieﬂy explain why.

Task 1.2 (13 points)

● (a) (2 points) Drop the variables UO, v, u from the data entirely. The fastest way to do this is to subset P2 with the columns you do want by using a column index c("Y","D","Z","X1","X2"). Leave the row index empty so we keep all rows.

● (b) (3 points) Plot the relationship between Y and D using plot(...). Set the color to as.factor(P2$X2).

● (c) (5 points) Run a naive regression of Y on D, X1, X2. Do not include UO .

● (d) (3 points) Generate an approximate 95% Confidence Interval for the coefficient on D. You can do this by taking $\hat{\beta}_D$ and adding and subtracting $1.96 \times \widehat{se}(\hat{\beta}_D)$. Your $\widehat{se}(\hat{\beta}_D)$ is in your results from Task 1.2.c.
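The mechanics of the approximate interval can be sketched on R's built-in mtcars data, which is unrelated to P2. The names fit, est, se, and ci_approx are chosen only for this illustration.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # built-in data, not the assignment's P2

# Pull the estimate and its standard error out of the coefficient table
coef_table <- coef(summary(fit))
est <- coef_table["wt", "Estimate"]
se  <- coef_table["wt", "Std. Error"]

# Approximate 95% interval: estimate plus or minus 1.96 standard errors
ci_approx <- c(lower = est - 1.96 * se, upper = est + 1.96 * se)
ci_approx

# For comparison, confint() uses the exact t critical value instead of 1.96
confint(fit, "wt", level = 0.95)
```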

Question 1.2 (16 points)

● (a) (5 points) What is the coeﬃcient on D in our naive regression and what does it mean?

● (b) (5 points) We can think of UO as being “in the error term”. Using what we learned about partialling out and the known values of $\delta_{UO}$ and $\beta_{UO}$, what is the sign of the bias?

● (c) (2 points) Is the true value of βD within the 95% Conﬁdence Interval of our estimate for βD ?

● (d) (2 points) Are the true values of $\beta_{X1}$ and $\beta_{X2}$ within the 95% Confidence Intervals of our naive estimates?

● (e) (2 points) Why would you expect (or not expect) the result in (d)?

Task 1.3 (20 points)

● (a) (4 points) Run the ﬁrst stage regression using lecture notes as your guide. Include X1, X2 in your ﬁrst stage. Call it FirstStageOLS.

● (b) (4 points) Create a predicted value of D from this first stage. You can use P2$Dhat = predict(FirstStageOLS) to create a column in P2 that contains the variable Dhat.

● (c) (2 points) Make a scatterplot with D on the x-axis and Dhat on the y-axis.

● (d) (7 points) Run the second stage regression using lecture notes as your guide. Include X1, X2 in this stage as well.

● (e) (3 points) Create an approximate 95% Confidence Interval for $\hat{\beta}_D^{IV}$.
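The two-stage logic can be sketched on a toy data set that is separate from P2. All names and parameter values below (w, z, x, y, toy, stage1, stage2, and a true coefficient of 2) are hypothetical and chosen only for illustration; your own first and second stages should follow the lecture notes.

```r
set.seed(42)  # arbitrary seed
n <- 5000

w <- rnorm(n)                       # hypothetical unobserved variable
z <- rnorm(n)                       # hypothetical instrument: moves x, unrelated to w
x <- 0.8 * z + 0.5 * w + rnorm(n)   # endogenous regressor
y <- 1 + 2 * x - 1 * w + rnorm(n)   # true coefficient on x is 2
toy <- data.frame(y, x, z)          # w is left out, as if unobserved

stage1   <- lm(x ~ z, data = toy)     # first stage: regressor on instrument
toy$xhat <- predict(stage1)           # fitted values from the first stage
stage2   <- lm(y ~ xhat, data = toy)  # second stage: outcome on fitted values

coef(lm(y ~ x, data = toy))["x"]  # naive OLS estimate
coef(stage2)["xhat"]              # two-stage estimate
```

One caveat worth keeping in mind: the standard errors printed by a hand-run second stage are generally not the correct 2SLS standard errors, which is one reason dedicated routines exist.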

Question 1.3 (18 points)

● (a) (3 points) Is $\hat{D}$ (Dhat) exogenous or endogenous? Why?

● (b) (3 points) Do you think $\hat{D}$ is correlated with D, given your scatterplot from Task 1.3.c?

● (c) (3 points) What is the coefficient on $\hat{D}$ (Dhat) in the second stage and what is its interpretation?

● (d) (3 points) How does it compare to the true value of βD that we created in Task 1.1? Is the true value within the 95% Conﬁdence Interval?

● (e) (3 points) Given the 2SLS (two stage least squares) method we used, why would βD be unbiased?

● (f) (3 points) Are the true values of $\beta_{X1}$ and $\beta_{X2}$ within a 95% Confidence Interval of our estimates?

Part 2: Instrumental Variables with R

In the previous section, we used 2SLS to estimate a model with an instrumental variable. We used constructed data where we knew all the parts, including unobserved errors and parameters, and saw that 2SLS got us an estimate that was close to the true value, while naive OLS did not.

In this question, we will brieﬂy use the AER package’s ivreg to analyze the same data.

Task 2.1 - Estimating using ivreg (15 points)

An important R skill is being able to figure out the syntax of an R function. We will use the AER package’s ivreg function to estimate the same instrumental variables model as in Part 1. Use ?ivreg directly in your console (not in your code chunk) to see the syntax for the function. This will tell you how to specify your formula, and what other inputs the ivreg function needs. If you did not follow Task 0.1 and did not load the AER package, then you will not see anything when you type ?ivreg.

● (a) (3 points) The first input ivreg needs is the formula. We can input our formula and save it as an R object (which we can then pass to the call of ivreg). To do this, simply use as.formula(y ~ x + ... | z + ...). You don’t need to put quotations around the formula when you code it up. It is up to you to figure out how to specify the endogenous and instrumental variables in the formula. See the arguments section of the ivreg help for instructions, and then look at your data to see what to put in which place. Use the “recommended” three-part formula format from the help. It may help to clearly write down which variables are endogenous, which are exogenous, and which are instruments. Then follow the help’s formula.

Remember that we included our exogenous variables X1, X2 in both stages of our 2SLS in Part 1. This is because our exogenous variables “instrument for themselves”. That means we specify them as instruments. Keep this in mind when writing your formula.

● (b) (12 points) Run the ivreg command using your formula and the P2 data.frame. Use robust standard errors by wrapping the command in coeftest as you did in prior Problem Sets.
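Saving a formula as an object, as described in part (a), might look like the sketch below. The variable names y, x, w, and z are placeholders and are not the columns of P2; working out which of your variables belong on each side of the bar is still up to you and the ivreg help page.

```r
# Placeholder names only: outcome ~ regressors | instruments
my_formula <- as.formula(y ~ x + w | z + w)

class(my_formula)      # a formula object that can be stored and reused
all.vars(my_formula)   # the variable names the formula refers to

# A stored formula can then be supplied as the first argument of a
# model-fitting call, together with a data argument.
```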

Question 2.1 - Interpreting ivreg (15 points)

● (a) (4 points) What is your estimate for βD , the coeﬃcient of interest?

● (b) (3 points) What is the interpretation of the coeﬃcient on βD ?

● (c) (3 points) If D is our treatment variable, what type of treatment eﬀect does βD represent? Is it the average treatment eﬀect (ATE)?

● (d) (5 points) We had a way of establishing whether or not our model met the relevant first stage requirement (see our notes on Instrumental Variables and 2SLS). What was the criterion (hint: it has to do with an F-test)?

Task 2.2 - Testing (10 points)

● (a) (5 points) To test the relevant first stage assumption, we will use lm(...) to regress D on Z, the first stage of our 2SLS (leaving aside the exogenous variables). This will tell us whether or not Z has an effect on D. Run this simple regression.

● (b) (5 points) Naturally, ivreg has the ability to output some important tests, including one for the relevant ﬁrst stage. We can get this by using summary(ivreg(myFormula, data=P2)) (robust standard errors are not necessary here as we won’t be looking at the standard errors of the coeﬃcients).
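Where the F-statistic lives in a fitted lm object can be sketched on toy data. The data and the names z, x, first_stage, fs, f_value, and p_value below are hypothetical and unrelated to P2; the point is only how to extract the statistic, not what value to expect in your own output.

```r
set.seed(7)  # arbitrary seed
n <- 500
z <- rnorm(n)             # hypothetical instrument
x <- 0.8 * z + rnorm(n)   # hypothetical regressor that depends on z

first_stage <- lm(x ~ z)

# summary() stores the overall F-statistic as a named vector:
# value, numerator df, denominator df
fs <- summary(first_stage)$fstatistic
f_value <- unname(fs["value"])

# Corresponding p-value from the F distribution
p_value <- unname(pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE))

f_value
p_value
```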

Question 2.2 - Testing (10 points)

● (a) (4 points) Using the output from Task 2.2.a lm(...), the ﬁrst-stage regression of D on Z , what can we say about the relevant first stage assumption based on these results? If our instruments are not relevant to the endogenous variable D, we say we have “weak instruments”. Do we have weak instruments?

● (b) (4 points) The output from summary(ivreg(...)) in Task 2.2.b gives us some additional statistics, including one that speaks to the previous question. What is the value here, and what does this tell us about our instruments’ relevant first stage?

● (c) (2 points) Are the results from Task 2.2.a using lm(...) and the diagnostic result from summary(ivreg(...)) in 2.2.b similar?

Part 3 (2 points)

How much time did you spend on this problem set?

Postword

Don’t forget to render this to .pdf. Do not turn in your Rmarkdown file. I’ll be able to see your code in the chunk outputs. Check to make sure your code chunks are echoing in your final .pdf.