
Assignment 2: ECON-UA 266 - Intro to Econometrics Spring 2023

The second assignment is due on Friday 10th February 2023. It covers the material related to the population regression model and the OLS estimator. For the Data questions and any other questions that rely on R, report the output of your analysis in a “report style” that is pleasing to read, and include the code you used to generate your results. Do not hand in the raw data, the raw output from R, or any intermediate output unless stated otherwise. You are encouraged to discuss the problems with others, but you must write up your own results. Do not copy someone else’s answer.

IMPORTANT DISCLAIMER: The homework is not graded. The points only indicate the weight assigned to each question.

Question 1 [5 points]

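Show that $\bar{Y} = \bar{\hat{Y}}$, i.e. that the sample average of $Y$ equals the sample average of the OLS fitted values.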

Solution :

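From class, we know we can write the predicted/fitted value as:

$$\hat{Y}_i = \hat{\alpha}_{OLS} + \hat{\beta}_{OLS} X_i$$

where $\hat{\alpha}_{OLS}$ and $\hat{\beta}_{OLS}$ are understood to have been obtained from the FOC we derived in class, and where the residual is $\hat{e}_i = Y_i - \hat{Y}_i$. Writing each $Y_i$ as its fitted value plus its residual gives another way to interpret an OLS regression. For each $i$, write:

$$Y_i = \hat{Y}_i + \hat{e}_i$$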

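To show that $\bar{Y} = \bar{\hat{Y}}$, we sum each side of the equation $Y_i = \hat{Y}_i + \hat{e}_i$ over $i$ and divide by $N$:

$$\frac{1}{N}\sum_{i=1}^{N} Y_i = \frac{1}{N}\sum_{i=1}^{N} (\hat{Y}_i + \hat{e}_i)$$

$$\frac{1}{N}\sum_{i=1}^{N} Y_i = \frac{1}{N}\sum_{i=1}^{N} \hat{Y}_i + \frac{1}{N}\sum_{i=1}^{N} \hat{e}_i$$

$$\bar{Y} = \bar{\hat{Y}} + \bar{\hat{e}}$$

where we know that $\bar{\hat{e}} = 0$ from the algebraic properties of the OLS estimators. Recall that one of these properties is that the sum, and therefore the sample average, of the OLS residuals is zero. In other words:

$$\sum_{i=1}^{N} \hat{e}_i = 0$$

With this property in mind, the second term on the right side of the above equation becomes zero and we simplify the expression as:

$$\bar{Y} = \bar{\hat{Y}}$$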
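As a quick numerical check of the result above, the following sketch fits an OLS regression on simulated data and compares the sample average of the outcome with the sample average of the fitted values. The data-generating values (sample size, intercept, slope) are arbitrary assumptions chosen for illustration and are not part of the assignment.

```r
# Simulated data: all numbers below are hypothetical, chosen only for illustration
set.seed(266)
n <- 200
x <- rnorm(n, mean = 10, sd = 2)
y <- 1.5 + 0.8 * x + rnorm(n)

# OLS regression with an intercept
fit <- lm(y ~ x)
y_hat <- fitted(fit)   # fitted values
e_hat <- resid(fit)    # OLS residuals

mean_y     <- mean(y)
mean_y_hat <- mean(y_hat)
mean_e_hat <- mean(e_hat)

# The residuals average to zero (up to floating-point error),
# so the two sample averages coincide
print(c(mean_y = mean_y, mean_y_hat = mean_y_hat, mean_e_hat = mean_e_hat))
```

Note that the property relies on the regression including an intercept; it need not hold for a regression through the origin.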

Question 2 [15 points]

Let X and Y have joint pdf: $P_{X,Y}(x, y) = \frac{x + y}{15}$, where $x = 1, 2, 3$ and $y = 0, 1$.

a.  [5 points] Find the covariance and correlation of X and Y (write the formula first, then compute the covariance and the correlation coefficient).

Solution :

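Let’s first evaluate the joint pdf, $P_{X,Y}(x, y) = \frac{x + y}{15}$, at each point of the support:

$$P_{X,Y}(X = 1, Y = 0) = \frac{1 + 0}{15} = \frac{1}{15}$$

$$P_{X,Y}(X = 2, Y = 0) = \frac{2 + 0}{15} = \frac{2}{15}$$

$$P_{X,Y}(X = 3, Y = 0) = \frac{3 + 0}{15} = \frac{3}{15}$$

$$P_{X,Y}(X = 1, Y = 1) = \frac{1 + 1}{15} = \frac{2}{15}$$

$$P_{X,Y}(X = 2, Y = 1) = \frac{2 + 1}{15} = \frac{3}{15}$$

$$P_{X,Y}(X = 3, Y = 1) = \frac{3 + 1}{15} = \frac{4}{15}$$

Additionally, let’s derive the marginal pdfs by summing the joint pdf over the other variable:

$$P_X(X = 1) = \frac{3}{15} \quad \text{and} \quad P_X(X = 2) = \frac{5}{15} \quad \text{and} \quad P_X(X = 3) = \frac{7}{15}$$

$$P_Y(Y = 0) = \frac{6}{15} \quad \text{and} \quad P_Y(Y = 1) = \frac{9}{15}$$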

Now solving for Cov (X, Y):

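$$Cov(X, Y) = E[(Y - E[Y])(X - E[X])] = E[XY] - E[X]E[Y]$$

Note that the random variable XY can take the following values: xy = 0, 1, 2, 3. We get this by multiplying each possible value for X by each possible value for Y.

$$E[XY] = \frac{6}{15}(0) + \frac{2}{15}(1) + \frac{3}{15}(2) + \frac{4}{15}(3) = \frac{2}{15} + \frac{6}{15} + \frac{12}{15} = \frac{20}{15}$$

$$E[X] = \frac{3}{15}(1) + \frac{5}{15}(2) + \frac{7}{15}(3) = \frac{3}{15} + \frac{10}{15} + \frac{21}{15} = \frac{34}{15}$$

$$E[Y] = \frac{6}{15}(0) + \frac{9}{15}(1) = 0 + \frac{9}{15} = \frac{9}{15}$$

$$Cov(X, Y) = E[XY] - E[X]E[Y] = \left(\frac{20}{15}\right) - \left(\frac{34}{15}\right)\left(\frac{9}{15}\right) = \frac{300 - 306}{225} = -\frac{6}{225} \approx -0.027$$

Now solving for Corr (X, Y):

$$Corr(X, Y) = \rho_{X,Y} = \frac{Cov(X, Y)}{\sqrt{Var(X)}\sqrt{Var(Y)}} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$

$$\sigma_X = \sqrt{E[X^2] - E[X]^2} \quad \text{and} \quad \sigma_Y = \sqrt{E[Y^2] - E[Y]^2}$$

$$\sigma_X = \sqrt{\left((1^2)\left(\tfrac{3}{15}\right) + (2^2)\left(\tfrac{5}{15}\right) + (3^2)\left(\tfrac{7}{15}\right)\right) - \left(\tfrac{34}{15}\right)^2} = \sqrt{\left(\tfrac{3}{15} + \tfrac{20}{15} + \tfrac{63}{15}\right) - \tfrac{1156}{225}} = \sqrt{\tfrac{134}{225}} \approx 0.772$$

$$\sigma_Y = \sqrt{\left((0^2)\left(\tfrac{6}{15}\right) + (1^2)\left(\tfrac{9}{15}\right)\right) - \left(\tfrac{9}{15}\right)^2} = \sqrt{\left(0 + \tfrac{9}{15}\right) - \tfrac{81}{225}} = \sqrt{\tfrac{54}{225}} \approx 0.490$$

$$Corr(X, Y) = \rho_{X,Y} = \frac{-6/225}{\sqrt{134/225}\,\sqrt{54/225}} \approx -0.071$$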
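The covariance and correlation can be cross-checked numerically. The sketch below assumes the joint pdf is $(x + y)/15$ on $x \in \{1, 2, 3\}$, $y \in \{0, 1\}$, which is the form consistent with the marginal probabilities 3/15, 5/15 and 7/15 used in this solution; the variable names are illustrative.

```r
# Assumed joint pdf: p(x, y) = (x + y) / 15 for x in {1, 2, 3} and y in {0, 1}
grid <- expand.grid(x = 1:3, y = 0:1)
grid$p <- (grid$x + grid$y) / 15

# The probabilities must sum to one
total_prob <- sum(grid$p)

# Moments computed as probability-weighted sums over the support
e_x  <- sum(grid$x * grid$p)
e_y  <- sum(grid$y * grid$p)
e_xy <- sum(grid$x * grid$y * grid$p)

cov_xy  <- e_xy - e_x * e_y
sd_x    <- sqrt(sum(grid$x^2 * grid$p) - e_x^2)
sd_y    <- sqrt(sum(grid$y^2 * grid$p) - e_y^2)
corr_xy <- cov_xy / (sd_x * sd_y)

print(round(c(cov = cov_xy, sd_x = sd_x, sd_y = sd_y, corr = corr_xy), 3))
```

Working from a table of support points and probabilities keeps every moment a one-line weighted sum, which makes it easy to spot an arithmetic slip in the hand calculation.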

b.  [5 points] Find E [Y |X] (again write the formula first)

Solution :

For discrete random variables, the conditional expectation is written generally as $E[Y \mid X = x] = \sum_{t \in T} t\,P(Y = t \mid X = x)$, where $T$ is the support of $Y$ (i.e. all the possible values that $Y$ can take). Additionally, remember that $P(Y = t \mid X = x) = \frac{P(X = x, Y = t)}{P(X = x)}$.

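$$E[Y \mid X = 1] = (0)\left(\tfrac{1}{3}\right) + (1)\left(\tfrac{2}{3}\right) = \tfrac{2}{3}$$

$$E[Y \mid X = 2] = (0)\left(\tfrac{2}{5}\right) + (1)\left(\tfrac{3}{5}\right) = \tfrac{3}{5}$$

$$E[Y \mid X = 3] = (0)\left(\tfrac{3}{7}\right) + (1)\left(\tfrac{4}{7}\right) = \tfrac{4}{7}$$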

Note that E [Y |X] is a random variable, as X is a random variable.

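$$E[Y \mid X] = \begin{cases} 2/3 & \text{if } X = 1 \text{ (X = 1 with probability 3/15)} \\ 3/5 & \text{if } X = 2 \text{ (X = 2 with probability 5/15)} \\ 4/7 & \text{if } X = 3 \text{ (X = 3 with probability 7/15)} \end{cases} \tag{1}$$

Equivalently:

$$E[Y \mid X] = \begin{cases} 2/3 & \text{with probability } 3/15 \\ 3/5 & \text{with probability } 5/15 \\ 4/7 & \text{with probability } 7/15 \end{cases} \tag{2}$$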

c.  [5 points] Calculate directly E [E [Y |X]] and hence show that it is equal to E [Y]. This is known as the law of iterated expectation.

Solution :

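$$E[E[Y \mid X]] = \left(\tfrac{2}{3}\right)\left(\tfrac{3}{15}\right) + \left(\tfrac{3}{5}\right)\left(\tfrac{5}{15}\right) + \left(\tfrac{4}{7}\right)\left(\tfrac{7}{15}\right) = \tfrac{2}{15} + \tfrac{3}{15} + \tfrac{4}{15} = 0.6 = E[Y]$$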

Thus the law of iterated expectation holds.
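The conditional expectations and the law of iterated expectation can also be verified numerically. This is a minimal sketch under the assumption that the joint pdf is $(x + y)/15$, which matches the probabilities 3/15, 5/15 and 7/15 quoted above; the object names are illustrative.

```r
# Assumed joint pdf: p(x, y) = (x + y) / 15 for x in {1, 2, 3} and y in {0, 1}
grid <- expand.grid(x = 1:3, y = 0:1)
grid$p <- (grid$x + grid$y) / 15

# Marginal pdf of X: sum the joint pdf over y for each x
p_x <- tapply(grid$p, grid$x, sum)

# E[Y | X = x] = sum_t t * P(X = x, Y = t) / P(X = x)
e_y_given_x <- tapply(grid$y * grid$p, grid$x, sum) / p_x

# Law of iterated expectation: average E[Y | X] over the distribution of X
lie <- sum(e_y_given_x * p_x)
e_y <- sum(grid$y * grid$p)

print(e_y_given_x)
print(c(E_of_E_Y_given_X = lie, E_Y = e_y))
```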

Question 3 [10 points]

In class, we introduced two different concepts to study the relationship between X and Y. The first object was the Conditional Expectation Function (CEF), and the second object was the univariate linear regression model (LRM). Although the CEF is not always linear, when it is linear, the LRM is the CEF. One special case where the CEF is linear is when X takes one of two values, as follows:

Consider E [Y |X] where X is a dummy variable that equals one with probability p and is zero otherwise. Prove that the CEF and the regression of Y on X are the same in this case. Do this by showing that for Bernoulli X:

α = E [Y] − βE[X] = E [Y |X = 0]

β = Cov (X, Y)/Var(X ) = (E [Y |X = 1] − E[Y |X = 0])

Solution :

First, consider the formula for the slope:

β = Cov (X, Y)/Var(X )

Remember that

Cov (X, Y) = E [XY] − E[X]E [Y]

where E [X] = Pr (X = 1) = p. Applying the law of iterated expectation, we can rewrite

E [XY] = E [E [Y |X]X] = E [Y |X = 1]Pr(X = 1) × 1 + E [Y |X = 0]Pr(X = 0) × 0 = E [Y |X = 1]p

and

E [Y] = E [E [Y |X]] = E [Y |X = 1]Pr(X = 1) + E [Y |X = 0]Pr(X = 0) = E [Y |X = 1]p + E [Y |X = 0](1 − p).

Hence, we can rewrite

Cov (X, Y) = E [Y |X = 1]p − (E [Y |X = 1]p + E[Y |X = 0](1 − p))p = (E [Y |X = 1] − E[Y |X = 0])(1 − p)p

On the other hand, the denominator is the variance of a Bernoulli given by:

Var(X) = (1 − p)p

It follows that:

β = E [Y |X = 1] − E[Y |X = 0]

The slope is the difference between the conditional expectation of Y when X = 1 and when X = 0.

For the intercept:

α = E [Y] − βE[X]

= E [Y |X = 1]p + E [Y |X = 0](1 − p) − (E [Y |X = 1] − E [Y |X = 0])p

= E [Y |X = 0]

where we used the fact that E [X] = p.
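The same identities hold for the sample analogues, which gives a simple way to check the result by simulation. The sketch below uses hypothetical values (p = 0.3, group means 2 and 3.5) that are not part of the question.

```r
# Simulated data with a Bernoulli regressor; all parameter values are hypothetical
set.seed(1)
n <- 1000
p <- 0.3
x <- rbinom(n, size = 1, prob = p)
y <- 2 + 1.5 * x + rnorm(n)

# Sample analogues of beta = Cov(X, Y) / Var(X) and alpha = E[Y] - beta * E[X]
beta_hat  <- cov(x, y) / var(x)
alpha_hat <- mean(y) - beta_hat * mean(x)

# Sample analogues of the conditional expectations
mean_y0 <- mean(y[x == 0])
mean_y1 <- mean(y[x == 1])

# The intercept equals the X = 0 group mean and the slope equals the
# difference in group means (up to floating-point error)
print(c(alpha_hat = alpha_hat, mean_y0 = mean_y0))
print(c(beta_hat = beta_hat, diff_in_means = mean_y1 - mean_y0))
```

Running `coef(lm(y ~ x))` on the same data should return the same two numbers, since `lm` computes the OLS intercept and slope.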

Question 4 (Wooldridge Chapter 2 question 5) [15 points]

In the linear consumption function

$$\widehat{cons} = \hat{\alpha} + \hat{\beta}\,inc$$

the (estimated) marginal propensity to consume (MPC) out of income is simply the slope, $\hat{\beta}$, while the average propensity to consume (APC) is $\widehat{cons}/inc = \hat{\alpha}/inc + \hat{\beta}$.

Using observations for 100 families on annual income and consumption (both measured in dollars), the following equation is obtained:

$$\widehat{cons} = 124.84 + 0.853\,inc$$

a.  [5 points] Interpret the intercept in this equation, and comment on its sign and magnitude.

Solution :

The positive intercept indicates that if a given family had zero annual income, its predicted consumption would be $124.84, which of course cannot literally be true. You can see that for low levels of income, this linear function would not describe the relationship between income and consumption very well, which is why we will eventually have to use other types of functions to describe such relationships.

b.  [5 points] What is the predicted consumption when family income is $30,000?

Solution :

Predicted consumption = 124.84 + 0.853 × $30,000 = $25,714.84
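The prediction is a direct evaluation of the estimated equation. As a small sketch, the helper below (the function name `predict_cons` is illustrative, not from the assignment) plugs an income level into the reported coefficients.

```r
# Coefficients reported in the estimated consumption function
alpha_hat <- 124.84
beta_hat  <- 0.853

# Illustrative helper: predicted consumption at a given annual income (in dollars)
predict_cons <- function(inc) {
  alpha_hat + beta_hat * inc
}

pred_30k <- predict_cons(30000)
print(pred_30k)  # 25714.84
```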

c.  [5 points] With inc on the x-axis, draw a graph of the estimated MPC and APC.

Solution :

$$MPC = \hat{\beta} = 0.853$$

$$APC = \widehat{cons}/inc = 124.84/inc + 0.853$$

Note that the APC is not constant, it is always larger than the MPC, and it gets closer to the MPC as income increases.

# If needed, run install.packages("ggplot2") first
library(ggplot2)

# Selecting a range of (arbitrary) income levels to plot
inc <- 25:1000

# Generating and naming the APC curve
apc <- 124.84 / inc + 0.853

# Generating and naming the MPC line
mpc <- 0.853

# Creating a data frame for the three objects of interest: inc, apc and mpc
inc_data <- data.frame(inc, apc, mpc)

# Plotting the two series (apc and mpc with income on x-axis and consumption on y-axis)
ggplot(data = inc_data, aes(x = inc)) +
  geom_line(aes(y = apc, colour = "APC")) +
  geom_line(aes(y = mpc, colour = "MPC")) +
  labs(title = "Estimating MPC & APC", x = "Income", y = "Consumption") +
  scale_colour_manual("", values = c("APC" = "red", "MPC" = "blue"))

[Figure: "Estimating MPC & APC". The APC curve and the MPC line are plotted against Income (x-axis, 0 to 1000).]

Question 5 [15 points]

A college bookseller makes calls at the offices of professors and forms the impression that professors are more likely to be away from their offices on Friday than any other working day. A review of the records of calls, one-fifth of which are on Fridays, indicates that for 16% of Friday calls, the professor is away from the office, while this occurs for only 12% of calls on every other working day. Define the random variables as follows: X is equal to one if the call is made on Friday and zero if the call is made on Monday to Thursday and Y is equal to one if the professor is away from the office and zero if the professor is in the office.

a.  [5 points] Find the joint probability function for X and Y .

Solution :

Let’s first establish the marginal pdf of X , Pr (X = x). Note that the probabilities are derived simply from the fact that there are 5 working days in a week. X takes a value of 1 when the call is made on a Friday (in other words, 1 possible working day of the week). On the other hand, X takes a value of 0 when the call is made on Monday-Thursday (in other words, the other 4 possible working days of the week).

Pr (A call is made on a Friday) = Pr (X = 1) = 1/5 = 0.2

Pr (A call is made on a day that isn’t a Friday) = Pr (X = 0) = 4/5 = 0.8

We can use the following formula to calculate the joint probability distribution:

P (X = x, Y = y) = P (Y = y|X = x)P (X = x)

where

P (X = 1, Y = 1) = P(The professor is away from the office and Friday) = 0.16 ∗ 0.2 = 0.032

P (X = 1, Y = 0) = P(The professor is in the office and Friday) = 0.84 ∗ 0.2 = 0.168

P (X = 0, Y = 1) = P(The professor is away from the office and Not Friday) = 0.12 ∗ 0.8 = 0.096

P (X = 0, Y = 0) = P(The professor is in the office and Not Friday) = 0.88 ∗ 0.8 = 0.704

Note that the joint probabilities sum to one, which is an easy way to check that our calculations are correct.
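The joint distribution can be built in a few lines from the marginal distribution of X and the conditional probabilities given in the question. The sketch below is one way to lay it out as a table; the object names are illustrative.

```r
# Marginal distribution of X (0 = Monday to Thursday, 1 = Friday)
p_x <- c("X=0" = 0.8, "X=1" = 0.2)

# P(Y = 1 | X = x): probability the professor is away, from the question
p_away_given_x <- c("X=0" = 0.12, "X=1" = 0.16)

# Joint probabilities: P(X = x, Y = y) = P(Y = y | X = x) * P(X = x)
joint <- rbind(
  "Y=0" = (1 - p_away_given_x) * p_x,
  "Y=1" = p_away_given_x * p_x
)

print(joint)
print(sum(joint))  # the four cells should sum to one
```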

b.  [5 points] Find the conditional probability function for Y given X = 0.

Solution :

P (Y = 0|X = 0) = P(The professor is in the office|Not Friday) = 0.88

and

P (Y = 1|X = 0) = P(The professor is absent|Not Friday) = 0.12

c.  [5 points] Find E [Y |X]

Solution :

Recall that $E[Y \mid X = x] = \sum_{t \in T} t\,P(Y = t \mid X = x)$, where $T$ is the support of $Y$ (i.e. all the possible values that $Y$ can take). Because Y is a Bernoulli random variable, its conditional expectation equals the conditional probability that Y = 1. Hence E [Y |X = Friday] = P (The professor is away|Friday) = 0.16, and it takes this value with probability 1/5. Similarly, E [Y |X = Not Friday] = P (The professor is away|Not Friday) = 0.12, and it takes this value with probability 4/5.

Note that E [Y |X] is a random variable, as X is a random variable.
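Because E [Y |X] is a random variable with two possible values, averaging it over the distribution of X gives the unconditional probability that the professor is away, by the law of iterated expectation. A minimal sketch using only the numbers in the question (object names are illustrative):

```r
# Distribution of X and the corresponding values of E[Y | X]
p_x         <- c(not_friday = 0.8,  friday = 0.2)
e_y_given_x <- c(not_friday = 0.12, friday = 0.16)

# E[Y] = E[E[Y | X]]: weight each conditional expectation by P(X = x)
e_y <- sum(e_y_given_x * p_x)
print(e_y)  # 0.12 * 0.8 + 0.16 * 0.2
```

This matches the sum of the two joint probabilities with Y = 1 (0.096 + 0.032).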

Data Question 1 [20 points]

Download data from the 2010 CPS at a geographic level of the state or lower [the TA will tell you how to access the dataset]. Choose data to generate two variables that will make up a SLRM, but one variable CANNOT be median household income. Your data set must have AT LEAST 30 observations.

1.  [5 points] Describe your data; include the period of analysis, the number of observations, the location, and the geographic level of the data.

Solution :

library(foreign)
library(ggplot2)
library(stargazer)

##
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

# Load the 2010 CPS file and keep only respondents from Alabama
mydata <- read.dta('morg10.dta')
newdata <- mydata[which(mydata$stfips == 'AL'), ]

# Keep the two variables of interest
cols_keep <- c('grade92', 'ownchild')
subset_data <- newdata[cols_keep]

# Map each grade92 education code to a number of years of education
newvals_grade92 <- c('31' = 0, '32' = 3, '33' = 6, '34' = 8, '35' = 9,
                     '36' = 10, '37' = 11, '38' = 12, '39' = 12, '40' = 14, '41' = 14,
                     '42' = 14, '43' = 16, '44' = 17, '45' = 20, '46' = 22)
subset_data['yrs_ed'] <- newvals_grade92[as.character(subset_data$grade92)]
subset_data$grade92 <- NULL

# Drop rows with missing or infinite values
final_data <- na.omit(subset_data)
final_data <- final_data[!is.infinite(rowSums(final_data)), ]

My data cover monthly census interview dates (‘intmonth’) from January-December 2010. There are 3569 observations in my final data set (including observations only from Alabama, excluding NA’s). The geographic level of my final data set is the state (Alabama).
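The recoding step in the code above relies on indexing a named vector by character keys. The toy sketch below illustrates that lookup on a handful of made-up codes; the input vector and the code 99 are hypothetical and only serve to show that unmapped codes become NA (and are later dropped by `na.omit`).

```r
# A subset of the grade92-to-years mapping used above
codes_to_years <- c("31" = 0, "39" = 12, "43" = 16, "46" = 22)

# Hypothetical input codes; 99 is deliberately not in the mapping
grade92 <- c(39, 43, 46, 99, 31)

# Look up each code by name; names not found in the mapping return NA
yrs_ed <- unname(codes_to_years[as.character(grade92)])
print(yrs_ed)

# Rows with an unmapped code are removed by na.omit()
toy <- na.omit(data.frame(grade92 = grade92, yrs_ed = yrs_ed))
print(nrow(toy))
```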

2.  [5 points] Describe your variables. This will include the definitions of these variables and the summary statistics (mean and standard deviation). DO NOT include the R output as part of your homework; rather, write a sentence that indicates the value of these statistics.

Solution :

As mentioned in recitation, you can find the description and value ranges of the relevant variables in this document: https://data.nber.org/morg/docs/cpsx.pdf

(1) yrs_ed: This variable takes the value of the number of years of education of the survey respondent. It ranges from 0 to 22, with 0 indicating that the respondent completed less than 1st grade and 22 indicating that the respondent completed a doctorate degree. The mean of this variable is 12.9 and the standard deviation is 2.7.

(2) ownchild: This variable takes the value of the number of own children less than 18 in the respondent’s primary family. The mean of this variable is 0.428 and the standard deviation is 0.89.

3.  [5 points] Write down the population LRM that is based on these two variables.  Explain why this is an economically interesting relationship (i.e. what economic theory/reasoning indicates that the independent variable, X , causes the dependent variable, Y?). What is the predicted sign of the slope coefficient in your regression?

Solution :

The general form of a population LRM is: $Y_i = \alpha + \beta X_i + \varepsilon_i$

Using my choice of variables, the population LRM is: $ownchild_i = \alpha + \beta \times yrs\_ed_i + \varepsilon_i$

I am choosing to write the population LRM in terms of the number of years of education and the number of own children. You could have chosen any two other variables besides the log of weekly earnings and years of education. This is an economically interesting relationship, as one might imagine there is a trade-off between obtaining more education and having and raising children. I predict that there will be a negative slope coefficient in my regression, meaning that the more education a respondent has, the fewer children under 18 the respondent’s primary family has.

4.  [5 points] Run a regression using these two variables and interpret the slope parameter estimate from this regression. Include the regression table [hint: you can use stargazer or summary] as part of your homework.

Solution :

# Run after following the data import instructions from the "Census2010_Commands.R" file uploaded
# to NYU Classes (which we also went through in recitation).

# We use the "lm" command in R to estimate our LRM by OLS. The dependent, or response, variable
# is listed first, followed by "~" and then one or more independent variables:
model_with_intercept <- lm(ownchild ~ yrs_ed, data = final_data)
summary(model_with_intercept)

##
## Call:
## lm(formula = ownchild ~ yrs_ed, data = final_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.7295 -0.4640 -0.3976 -0.1984  9.6024
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0007563  0.0717663  -0.011    0.992
## yrs_ed       0.0331946  0.0054303   6.113 1.08e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8886 on 3567 degrees of freedom
## Multiple R-squared:  0.01037,    Adjusted R-squared:  0.01009
## F-statistic: 37.37 on 1 and 3567 DF,  p-value: 1.084e-09

In this model, I obtained a slope coefficient of ≈ 0.033. I can interpret this as follows: each additional year of education is associated with, on average, 0.033 additional own children under the age of 18 in the Alabama respondent’s primary family. From the summary table, I can see this result is statistically significant at the 0.001 level, but the sign of the slope coefficient is the opposite of what I had predicted. Of course, this is a simplistic model, and years of education likely explain only a small part of the variation in the number of children respondents in Alabama have (the R-squared is about 0.01); other factors, such as income and marital status, are left out. Additionally, since “ownchild” only captures the number of children under 18, respondents may have older children who do not show up in our regression.
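To make the size of the slope more concrete, the reported coefficients can be used to compare predicted values at two education levels. The sketch below plugs in the estimates from the regression table; the choice of 12 and 16 years (roughly high school versus a bachelor's degree in the recoding above) is an illustrative assumption.

```r
# Estimates copied from the regression table above
intercept_hat <- -0.0007563
slope_hat     <-  0.0331946

# Illustrative helper: predicted number of own children under 18
predict_ownchild <- function(yrs_ed) {
  intercept_hat + slope_hat * yrs_ed
}

pred_12 <- predict_ownchild(12)
pred_16 <- predict_ownchild(16)

# The gap between the two predictions is four times the slope
gap <- pred_16 - pred_12
print(c(pred_12 = pred_12, pred_16 = pred_16, gap = gap))
```

With the fitted model object in memory, `predict(model_with_intercept, newdata = data.frame(yrs_ed = c(12, 16)))` would give the same predictions without copying the coefficients by hand.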

Data Question 2 [15 points]

Download data from 2016 CPS (the TA will help you find the data during recitation – look at the March CPS) which contains observations on weekly earnings, sex, race, age and education for respondents aged 25-64.

a.  [5 points] Plot the weekly earnings of individuals against the number of years of education.

Solution :

library(foreign)
library(ggplot2)

# Load the 2016 CPS file and keep March interviews of respondents aged 25-64
mydata <- read.dta('morg16.dta')
newdata <- mydata[which(mydata$intmonth == 'March'), ]
newdata2 <- newdata[which(newdata$age >= 25 & newdata$age <= 64), ]

# Keep the variables of interest
cols_keep <- c('earnwke', 'sex', 'age', 'race', 'grade92')
subset_data <- newdata2[cols_keep]

# Map each grade92 education code to a number of years of education
newvals_grade92 <- c('31' = 0, '32' = 3, '33' = 6, '34' = 8, '35' = 9,
                     '36' = 10, '37' = 11, '38' = 12, '39' = 12, '40' = 14, '41' = 14,
                     '42' = 14, '43' = 16, '44' = 17, '45' = 20, '46' = 22)
subset_data['yrs_ed'] <- newvals_grade92[as.character(subset_data$grade92)]
subset_data$grade92 <- NULL
subset_data['log_earnings'] <- log(subset_data$earnwke)

# Drop rows with missing or infinite values
final_data <- na.omit(subset_data)
final_data <- final_data[!is.infinite(rowSums(final_data)), ]

# Scatter plot of weekly earnings against years of education
ggplot(final_data, aes(yrs_ed, earnwke)) +
  geom_point(na.rm = TRUE) +
  labs(title = "Years of Educ. & Weekly Earnings",
       x = "Years of Education", y = "Weekly Earnings")

[Figure: "Years of Educ. & Weekly Earnings". Scatter plot of weekly earnings against years of education (x-axis, 0 to 20).]