Assignment 2: ECON-UA 266 - Intro to Econometrics Spring 2023
The second assignment is due on Friday 10th February 2023. It covers the material related to the population
regression model and the OLS estimator. For the Data questions and any other questions that rely on R, report the output of your analysis in a “report style” that is pleasing to read, and add the code you used to generate your results. Do not hand in the raw data, the raw output from R, or any intermediary output unless stated otherwise. You are encouraged to discuss the problems with others, but you must write up your own results. Do not copy someone else’s answer.
IMPORTANT DISCLAIMER: The homework is not graded. The points are only to give you information about the weight assigned to each question.
Question 1 [5 points]
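Show that $\bar{Y} = \bar{\hat{Y}}$, i.e. that the sample average of $Y$ equals the sample average of the OLS fitted values.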
Solution :
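From class, we know we can write the predicted/fitted value as:
$\hat{Y}_i = \hat{\alpha}_{OLS} + \hat{\beta}_{OLS} X_i$
where we understand $\hat{\alpha}_{OLS}$ and $\hat{\beta}_{OLS}$ to have been obtained from the FOC we derived in class, and where the residual is $\hat{e}_i = Y_i - \hat{Y}_i$. Splitting each observation into a fitted value and a residual is another way to interpret an OLS regression. For each $i$, write:
$Y_i = \hat{Y}_i + \hat{e}_i$
To show that $\bar{Y} = \bar{\hat{Y}}$, we sum each side of the equation over $i$ and divide by $N$:
$\frac{1}{N}\sum_{i=1}^{N} Y_i = \frac{1}{N}\sum_{i=1}^{N} (\hat{Y}_i + \hat{e}_i)$
$\frac{1}{N}\sum_{i=1}^{N} Y_i = \frac{1}{N}\sum_{i=1}^{N} \hat{Y}_i + \frac{1}{N}\sum_{i=1}^{N} \hat{e}_i$
$\bar{Y} = \bar{\hat{Y}} + \bar{\hat{e}}$
where we know that $\bar{\hat{e}} = 0$ from the algebraic properties of the OLS estimators. Recall one of the properties is that the sum, and therefore the sample average, of the OLS residuals is zero. In other words:
$\sum_{i=1}^{N} \hat{e}_i = 0$
With this property in mind, the second term on the right side of the above equation becomes zero and we simplify the expression as:
$\bar{Y} = \bar{\hat{Y}}$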
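The result can also be checked numerically. The following is a minimal sketch on simulated data; the seed, sample size, and coefficients are arbitrary assumptions and are not part of the assignment.

```r
# Simulated data: seed, sample size, and true coefficients are arbitrary assumptions
set.seed(266)
n <- 200
x <- rnorm(n, mean = 5, sd = 2)
y <- 1 + 0.5 * x + rnorm(n)

fit   <- lm(y ~ x)     # OLS with an intercept
y_hat <- fitted(fit)   # fitted values
e_hat <- resid(fit)    # residuals

mean_resid <- mean(e_hat)                 # zero up to floating-point error
gap        <- abs(mean(y) - mean(y_hat))  # zero up to floating-point error
print(c(mean_resid = mean_resid, gap = gap))
```

Note that the property relies on the regression including an intercept; without one, the residuals need not average to zero.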
Question 2 [15 points]
Let X and Y have joint probability mass function (pmf): $P_{X,Y}(x, y) = \frac{x + y}{15}$, where x = 1, 2, 3 and y = 0, 1
a. [5 points] Find the Covariance and correlation of X and Y (write the formula and then find the covariance and coefficient of correlation)
Solution :
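Let’s first derive the joint pmf:
$P_{X,Y}(X = 1, Y = 0) = \frac{1 + 0}{15} = \frac{1}{15}$
$P_{X,Y}(X = 2, Y = 0) = \frac{2 + 0}{15} = \frac{2}{15}$
$P_{X,Y}(X = 3, Y = 0) = \frac{3 + 0}{15} = \frac{3}{15}$
$P_{X,Y}(X = 1, Y = 1) = \frac{1 + 1}{15} = \frac{2}{15}$
$P_{X,Y}(X = 2, Y = 1) = \frac{2 + 1}{15} = \frac{3}{15}$
$P_{X,Y}(X = 3, Y = 1) = \frac{3 + 1}{15} = \frac{4}{15}$
Additionally, let’s derive the marginal pmfs:
$P_X(X = 1) = \frac{3}{15}$ and $P_X(X = 2) = \frac{5}{15}$ and $P_X(X = 3) = \frac{7}{15}$
$P_Y(Y = 0) = \frac{6}{15}$ and $P_Y(Y = 1) = \frac{9}{15}$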
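Now solving for Cov (X, Y):
$Cov(X, Y) = E[(Y - E[Y])(X - E[X])] = E[XY] - E[X]E[Y]$
Note that the random variable XY can take the following values: xy = 0, 1, 2, 3. We get this by multiplying each possible value for X by each possible value for Y.
$E[XY] = (0)(\frac{6}{15}) + (1)(\frac{2}{15}) + (2)(\frac{3}{15}) + (3)(\frac{4}{15}) = \frac{2}{15} + \frac{6}{15} + \frac{12}{15} = \frac{20}{15}$
$E[X] = (1)(\frac{3}{15}) + (2)(\frac{5}{15}) + (3)(\frac{7}{15}) = \frac{3}{15} + \frac{10}{15} + \frac{21}{15} = \frac{34}{15}$
$E[Y] = (0)(\frac{6}{15}) + (1)(\frac{9}{15}) = 0 + \frac{9}{15} = \frac{9}{15}$
$Cov(X, Y) = E[XY] - E[X]E[Y] = (\frac{20}{15}) - (\frac{34}{15})(\frac{9}{15}) = \frac{300}{225} - \frac{306}{225} = -\frac{6}{225} \approx -0.027$
Now solving for Corr (X, Y):
$Corr(X, Y) = \rho_{X,Y} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$
$\sigma_X = \sqrt{E[X^2] - E[X]^2}$ and $\sigma_Y = \sqrt{E[Y^2] - E[Y]^2}$
$\sigma_X = \sqrt{((1^2)(\frac{3}{15}) + (2^2)(\frac{5}{15}) + (3^2)(\frac{7}{15})) - (\frac{34}{15})^2} = \sqrt{(\frac{3}{15} + \frac{20}{15} + \frac{63}{15}) - \frac{1156}{225}} = \sqrt{\frac{86}{15} - \frac{1156}{225}} \approx 0.772$
$\sigma_Y = \sqrt{((0^2)(\frac{6}{15}) + (1^2)(\frac{9}{15})) - (\frac{9}{15})^2} = \sqrt{(0 + \frac{9}{15}) - \frac{81}{225}} = \sqrt{\frac{9}{15} - \frac{81}{225}} \approx 0.490$
$Corr(X, Y) = \rho_{X,Y} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} \approx \frac{-0.027}{(0.772)(0.490)} \approx -0.071$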
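The arithmetic above can be reproduced in a few lines of R. This is a sketch that assumes the joint pmf is p(x, y) = (x + y)/15; that form is an assumption, chosen because it is consistent with the marginal probabilities 3/15, 5/15, 7/15 and the conditional expectations 2/3, 3/5, 4/7 reported in part b.

```r
x_vals <- 1:3
y_vals <- 0:1

# Assumed joint pmf: p(x, y) = (x + y) / 15 (rows index x, columns index y)
p <- outer(x_vals, y_vals, function(x, y) (x + y) / 15)
dimnames(p) <- list(x = as.character(x_vals), y = as.character(y_vals))

p_x <- rowSums(p)   # marginal pmf of X
p_y <- colSums(p)   # marginal pmf of Y

e_x  <- sum(x_vals * p_x)
e_y  <- sum(y_vals * p_y)
e_xy <- sum(outer(x_vals, y_vals) * p)   # sum over all cells of x * y * p(x, y)

cov_xy  <- e_xy - e_x * e_y
sd_x    <- sqrt(sum(x_vals^2 * p_x) - e_x^2)
sd_y    <- sqrt(sum(y_vals^2 * p_y) - e_y^2)
corr_xy <- cov_xy / (sd_x * sd_y)

print(round(c(cov = cov_xy, sd_x = sd_x, sd_y = sd_y, corr = corr_xy), 3))
```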
b. [5 points] Find E [Y |X] (again write the formula first)
Solution :
For discrete random variables, the conditional expectation is written generally as: $E[Y|X = x] = \sum_{t \in T} t \, P(Y = t|X = x)$, where T is the support of Y (i.e. all the possible values that Y can take). Additionally, remember that $P(Y = t|X = x) = \frac{P(Y = t, X = x)}{P(X = x)}$.
$E[Y|X = 1] = (0)(\frac{1}{3}) + (1)(\frac{2}{3}) = \frac{2}{3}$
$E[Y|X = 2] = (0)(\frac{2}{5}) + (1)(\frac{3}{5}) = \frac{3}{5}$
$E[Y|X = 3] = (0)(\frac{3}{7}) + (1)(\frac{4}{7}) = \frac{4}{7}$
Note that E [Y |X] is a random variable, as X is a random variable.
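$E[Y|X] = \begin{cases} 2/3 & \text{if } X = 1 \text{ (X = 1 with probability 3/15)} \\ 3/5 & \text{if } X = 2 \text{ (X = 2 with probability 5/15)} \\ 4/7 & \text{if } X = 3 \text{ (X = 3 with probability 7/15)} \end{cases}$
Equivalently, E [Y |X] takes the value 2/3 with probability 3/15, the value 3/5 with probability 5/15, and the value 4/7 with probability 7/15.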
c. [5 points] Calculate directly E [E [Y |X]] and hence show that it is equal to E [Y]. This is known as the law of iterated expectation.
Solution :
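$E[E[Y|X]] = (\frac{2}{3})(\frac{3}{15}) + (\frac{3}{5})(\frac{5}{15}) + (\frac{4}{7})(\frac{7}{15}) = \frac{2}{15} + \frac{3}{15} + \frac{4}{15} = 0.6 = E[Y]$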
Thus the law of iterated expectation holds.
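As a quick numerical check of the law of iterated expectation, here is a short sketch. It assumes the joint pmf p(x, y) = (x + y)/15, which matches the conditional expectations and probabilities used above; the object names are illustrative only.

```r
x_vals <- 1:3

# Assumed joint pmf p(x, y) = (x + y) / 15; columns are y = 0 and y = 1
p <- cbind(y0 = x_vals / 15, y1 = (x_vals + 1) / 15)

p_x <- rowSums(p)        # P(X = x)
cef <- p[, "y1"] / p_x   # E[Y | X = x] = P(Y = 1 | X = x) because Y is binary
lie <- sum(cef * p_x)    # E[E[Y | X]], weighting each value by P(X = x)
e_y <- sum(p[, "y1"])    # E[Y] = P(Y = 1)

print(cef)
print(c(lie = lie, e_y = e_y))
```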
Question 3 [10 points]
In class, we introduced two different concepts to study the relationship between X and Y . The first object was the Conditional Expectation Function (CEF), and the second object was the univariate linear regression model (LRM). Although the CEF is not always linear, when it is linear, then the LRM is the CEF. One special case where the CEF is linear is when X takes one of two values as follows:
Consider E [Y |X] where X is a dummy variable that equals one with probability p and is zero otherwise. Prove that the CEF and the regression of Y on X are the same in this case. Do this by showing that for
Bernoulli X:
α = E [Y] − βE[X] = E [Y |X = 0]
β = Cov (X, Y)/Var(X ) = (E [Y |X = 1] − E[Y |X = 0])
Solution :
First, consider the formula for the slope:
β = Cov (X, Y)/Var(X )
Remember that
Cov (X, Y) = E [XY] − E[X]E [Y]
where E [X] = Pr (X = 1) = p. Applying the law of iterated expectation, we can rewrite
E [XY] = E [E [Y |X]X] = E [Y |X = 1]Pr(X = 1) × 1 + E [Y |X = 0]Pr(X = 0) × 0 = E [Y |X = 1]p
and we can rewrite
E [Y] = E [E [Y |X]] = E [Y |X = 1]Pr(X = 1) + E[Y |X = 0]Pr(X = 0) = E [Y |X = 1]p + E[Y |X = 0](1 − p).
Hence, we can rewrite
Cov (X, Y) = E [Y |X = 1]p − (E [Y |X = 1]p + E[Y |X = 0](1 − p))p = (E [Y |X = 1] − E[Y |X = 0])(1 − p)p
On the other hand, the denominator is the variance of a Bernoulli given by:
(1 − p)p
It follows that:
β = E [Y |X = 1] − E[Y |X = 0]
The slope is the difference between the conditional expectation of Y when X = 1 and when X = 0.
For the intercept:
α = E [Y] − βE[X]
= E [Y |X = 1]p + E[Y |X = 0](1 − p) − (E [Y |X = 1] − E[Y |X = 0])p
= E [Y |X = 0]
where we used the fact that E [X] = p.
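The same identities hold exactly in a sample when the regressor is a dummy. The sketch below uses simulated data; the seed, p = 0.3, and the data-generating process are assumptions made purely for illustration.

```r
# Simulated data: the seed, Pr(X = 1), and the outcome equation are assumptions
set.seed(1)
n <- 500
p <- 0.3
x <- rbinom(n, size = 1, prob = p)   # Bernoulli regressor
y <- 2 + 1.5 * x + rnorm(n)

fit <- lm(y ~ x)
alpha_hat <- unname(coef(fit)[1])
beta_hat  <- unname(coef(fit)[2])

mean_y0 <- mean(y[x == 0])   # sample analogue of E[Y | X = 0]
mean_y1 <- mean(y[x == 1])   # sample analogue of E[Y | X = 1]

print(c(alpha_hat = alpha_hat, mean_y0 = mean_y0))
print(c(beta_hat = beta_hat, diff_means = mean_y1 - mean_y0))
```

The intercept equals the average of Y in the X = 0 group and the slope equals the difference in group averages, up to floating-point error.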
Question 4 (Wooldridge Chapter 2 question 5) [15 points]
In the linear consumption function
$\widehat{cons} = \hat{\alpha} + \hat{\beta} \, inc$
the (estimated) marginal propensity to consume (MPC) out of income is simply the slope, $\hat{\beta}$, while the
average propensity to consume (APC) is $\widehat{cons}/inc = \hat{\alpha}/inc + \hat{\beta}$.
Using observations for 100 families on annual income and consumption (both measured in dollars), the following equation is obtained:
$\widehat{cons} = 124.84 + 0.853 \, inc$
a. [5 points] Interpret the intercept in this equation, and comment on its sign and magnitude.
Solution :
The positive intercept indicates that if a given family had zero annual income, its predicted consumption would be $124.84, which of course cannot literally be true. You can see that for low levels of income, this linear function would not describe the relationship between income and consumption very well, which is why we will eventually have to use other types of functions to describe such relationships.
b. [5 points] What is the predicted consumption when family income is $30,000?
Solution :
$\widehat{cons} = 124.84 + 0.853 \times 30{,}000 = 25{,}714.84$ dollars
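The prediction can be reproduced directly in R. In this sketch the coefficients are copied from the estimated equation, and `predict_cons` is a hypothetical helper name introduced only for illustration.

```r
# Coefficients copied from the estimated equation in the question
alpha_hat <- 124.84
beta_hat  <- 0.853

# Hypothetical helper: predicted consumption at a given income level
predict_cons <- function(inc) alpha_hat + beta_hat * inc

cons_30k <- predict_cons(30000)   # predicted consumption at $30,000 of income
apc_30k  <- cons_30k / 30000      # average propensity to consume at that income
print(c(cons = cons_30k, apc = apc_30k))
```

At this income level the APC is only slightly above the slope of 0.853, because the intercept is small relative to income.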
c. [5 points] With inc on the x-axis, draw a graph of the estimated MPC and APC.
Solution :
$MPC = \hat{\beta} = 0.853$
$APC = \widehat{cons}/inc = 124.84/inc + 0.853$
Note that the APC is not constant: it is always larger than the MPC, and it gets closer to the MPC as income increases.
# If needed, run install.packages("ggplot2") first
library(ggplot2)
# Selecting a range of (arbitrary) income levels to plot
inc <- 25:1000
# Generating and naming the APC curve
apc <- 124.84 / inc + 0.853
# Generating and naming the MPC line
mpc <- 0.853
# Creating a data frame for the three objects of interest: inc, apc and mpc
inc_data <- data.frame(inc, apc, mpc)
# Plotting the two series (apc and mpc with income on x-axis and consumption on y-axis)
ggplot(data = inc_data, aes(x = inc)) +
  geom_line(aes(y = apc, colour = "APC")) +
  geom_line(aes(y = mpc, colour = "MPC")) +
  labs(title = "Estimating MPC & APC", x = "Income", y = "Consumption") +
  scale_colour_manual("", values = c("APC" = "red", "MPC" = "blue"))
[Figure: "Estimating MPC & APC", with Income (0 to 1000) on the x-axis.]
Question 5 [15 points]
A college bookseller makes calls at the offices of professors and forms the impression that professors are more likely to be away from their offices on Friday than any other working day. A review of the records of calls, one-fifth of which are on Fridays, indicates that for 16% of Friday calls, the professor is away from the office, while this occurs for only 12% of calls on every other working day. Define the random variables as follows: X is equal to one if the call is made on Friday and zero if the call is made on Monday to Thursday and Y is equal to one if the professor is away from the office and zero if the professor is in the office.
a. [5 points] Find the joint probability function for X and Y .
Solution :
Let’s first establish the marginal pdf of X , Pr (X = x). Note that the probabilities are derived simply from the fact that there are 5 working days in a week. X takes a value of 1 when the call is made on a Friday (in other words, 1 possible working day of the week). On the other hand, X takes a value of 0 when the call is made on Monday-Thursday (in other words, the other 4 possible working days of the week).
Pr (A call is made on a Friday) = Pr (X = 1) = 1/5 = 0.2
Pr (A call is made on a day that isn’t a Friday) = Pr (X = 0) = 4/5 = 0.8
We can use the following formula to calculate the joint probability distribution:
P (X = x, Y = y) = P (Y = y|X = x)P (X = x)
where
P (X = 1, Y = 1) = P(The professor is away from the office and Friday) = 0.16 * 0.2 = 0.032
P (X = 1, Y = 0) = P(The professor is in the office and Friday) = 0.84 ∗ 0.2 = 0.168
P (X = 0, Y = 1) = P(The professor is away from the office and Not Friday) = 0.12 ∗ 0.8 = 0.096
P (X = 0, Y = 0) = P(The professor is in the office and Not Friday) = 0.88 ∗ 0.8 = 0.704
Note that the joint probabilities sum to one, which is an easy way to check that our calculations are correct.
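The four joint probabilities can be built in one step from the marginal of X and the conditional probabilities of Y given X. The following sketch uses only numbers stated in the question; the object names are illustrative.

```r
# Marginal distribution of X and P(Y = 1 | X), both taken from the question
p_x            <- c(not_friday = 0.8, friday = 0.2)
p_away_given_x <- c(not_friday = 0.12, friday = 0.16)

# Joint pmf: P(X = x, Y = y) = P(Y = y | X = x) * P(X = x)
joint <- rbind(
  in_office = (1 - p_away_given_x) * p_x,   # Y = 0
  away      = p_away_given_x * p_x          # Y = 1
)
print(joint)

total <- sum(joint)   # should equal 1
```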
b. [5 points] Find the conditional probability function for Y given X = 0.
Solution :
P (Y = 0|X = 0) = P(The professor is in the office|Not Friday) = 0.88
and
P (Y = 1|X = 0) = P(The professor is absent|Not Friday) = 0.12
c. [5 points] Find E [Y |X]
Solution :
By definition, $E[Y|X = x] = \sum_{t \in T} t \, P(Y = t|X = x)$, where T is the support of Y (i.e. all the possible values that Y can take). Because Y is a Bernoulli random variable, we know that E [Y |X = 1] = P (The professor is away|Friday) = 0.16. It takes this value with probability 1/5. E [Y |X = 0] = P (The professor is away|Not Friday) = 0.12. It takes this value with probability 4/5.
Note that E [Y |X] is a random variable, as X is a random variable.
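Written compactly, the conditional expectation function is piecewise. The second expression is an added check using the law of iterated expectation, not part of the original solution; it agrees with the sum of the joint probabilities with Y = 1 from part a (0.032 + 0.096).

```latex
E[Y \mid X] =
\begin{cases}
0.16 & \text{if } X = 1 \text{ (probability } 0.2\text{)} \\
0.12 & \text{if } X = 0 \text{ (probability } 0.8\text{)}
\end{cases}
\qquad
E[Y] = E\big[E[Y \mid X]\big] = (0.16)(0.2) + (0.12)(0.8) = 0.032 + 0.096 = 0.128
```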
Data Question 1 [20 points]
Download data from the 2010 CPS at a geographic level of the state or lower [the TA will tell you how to access the dataset]. Choose data to generate two variables that will make up a SLRM but one variable
CANNOT be median household income. Your data set must have AT LEAST 30 observations.
1. [5 points] Describe your data; include the period of analysis, the number of observations, the location, and the geographic level of the data.
Solution :
library(foreign)
library(ggplot2)
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
# Read the 2010 extract and keep observations from Alabama only
mydata <- read.dta('morg10.dta')
newdata <- mydata[which(mydata$stfips == 'AL'), ]
cols_keep <- c('grade92', 'ownchild')
subset_data <- newdata[cols_keep]
# Map the grade92 education codes to years of education
newvals_grade92 <- c('31' = 0, '32' = 3, '33' = 6, '34' = 8, '35' = 9,
                     '36' = 10, '37' = 11, '38' = 12, '39' = 12, '40' = 14, '41' = 14,
                     '42' = 14, '43' = 16, '44' = 17, '45' = 20, '46' = 22)
subset_data['yrs_ed'] <- newvals_grade92[as.character(subset_data$grade92)]
subset_data$grade92 <- NULL
# Drop rows with missing or infinite values
final_data <- na.omit(subset_data)
final_data <- final_data[!is.infinite(rowSums(final_data)), ]
My data cover monthly census interview dates (‘intmonth’) from January-December 2010. There are 3569 observations in my final data set (including observations only from Alabama, excluding NA’s). The geographic level of my final data set is the state (Alabama).
2. [5 points] Describe your variables. This will include the definitions of these variables and the summary statistics (mean and standard deviation). DO NOT include the R output as part of your homework; rather write a sentence that indicates the value of these statistics.
Solution :
As mentioned in recitation, you can find the description and value ranges of the relevant variables in this document: https://data.nber.org/morg/docs/cpsx.pdf
(1) yrs_ed: This variable takes the value of the number of years of education of the survey respondent. It ranges from 0 to 22, with 0 meaning that the respondent completed less than 1st grade and 22 meaning that the respondent completed a doctorate degree. The mean of this variable is 12.9 and the standard deviation is 2.7.
(2) ownchild: This variable takes the value of the number of own children less than 18 in the respondent’s primary family. The mean of this variable is 0.428 and the standard deviation is 0.89.
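The means and standard deviations can be computed for every column at once. The sketch below runs on a small simulated stand-in for `final_data` (the values are made up and are not the CPS extract), so its printed numbers will not match the ones reported above.

```r
set.seed(42)

# Simulated stand-in for final_data (NOT the CPS extract); values are made up
final_data <- data.frame(
  ownchild = rpois(100, lambda = 0.4),
  yrs_ed   = sample(c(0, 3, 6, 8:12, 14, 16, 17, 20, 22), 100, replace = TRUE)
)

# One column per variable, one row per statistic
stats <- sapply(final_data, function(v) c(mean = mean(v), sd = sd(v)))
print(round(stats, 3))
```

With the real data, the same `sapply` call would be applied to the `final_data` object built in part 1.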
3. [5 points] Write down the population LRM that is based on these two variables. Explain why this is an economically interesting relationship (i.e. what economic theory/reasoning indicates that the independent variable, X , causes the dependent variable, Y?). What is the predicted sign of the slope coefficient in your regression?
Solution :
The general form of a population LRM is: Yi = α + βXi + εi
Using my choice of variables, the population LRM is: ownchild_i = α + β × yrs_ed_i + ε_i
I am choosing to write the population LRM in terms of the number of years of education and number of own children. You could have chosen any two other variables besides the log of weekly earnings and years of education. This is an economically interesting relationship, as one might imagine there is a trade-off between obtaining more education and having and raising children. I predict that there will be a negative slope coefficient in my regression, meaning that the more education a respondent has, the fewer children under 18 the respondent’s primary family has.
4. [5 points] Run a regression using these two variables and interpret the slope parameter estimate from this regression. Include the regression table [hint: you can use stargazer or summary] as part of your homework.
Solution :
# Run after following the data import instructions from "Census2010_Commands.R" file uploaded
# to NYU Classes (which we also went through in recitation).
# We use the "lm" command in R to fit our population LRM. The dependent, or response, variable
# is listed first, followed by "~" and then one or more independent variables:
model_with_intercept <- lm(ownchild ~ yrs_ed, data = final_data)
summary(model_with_intercept)
##
## Call:
## lm(formula = ownchild ~ yrs_ed, data = final_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.7295 -0.4640 -0.3976 -0.1984  9.6024
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0007563  0.0717663  -0.011    0.992
## yrs_ed       0.0331946  0.0054303   6.113 1.08e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8886 on 3567 degrees of freedom
## Multiple R-squared:  0.01037, Adjusted R-squared:  0.01009
## F-statistic: 37.37 on 1 and 3567 DF,  p-value: 1.084e-09
In this model, I obtained a slope coefficient of ≈ 0.033. I can interpret this as follows: among Alabama respondents, each additional year of education is associated with, on average, 0.033 additional own children under the age of 18. From the summary table, I can see this result is statistically significant at the 0.001 level, but the sign of the slope coefficient is the opposite of what I had predicted. Of course, this is a simplistic model, and it is likely that years of education only partially explain the number of children respondents in Alabama have (other factors, e.g. income and marital status, also matter). Additionally, since “ownchild” only captures the number of children under 18, it is possible that respondents have older children who do not show up in our regression as a result.
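The slope reported by `lm` is the sample analogue of Cov(X, Y)/Var(X) from Question 3, and the intercept is the sample analogue of E[Y] − βE[X]. The sketch below illustrates this on simulated stand-in data; the variable names mirror the ones used here, but the values are made up and unrelated to the CPS results.

```r
set.seed(2010)
n <- 300

# Simulated stand-ins (NOT CPS data); distributions are arbitrary assumptions
yrs_ed   <- sample(8:20, n, replace = TRUE)
ownchild <- rpois(n, lambda = 0.4)

# OLS estimates computed by hand from sample moments
beta_hat  <- cov(yrs_ed, ownchild) / var(yrs_ed)
alpha_hat <- mean(ownchild) - beta_hat * mean(yrs_ed)

# The same estimates from lm()
fit <- lm(ownchild ~ yrs_ed)
print(rbind(by_hand = c(alpha_hat, beta_hat), lm = unname(coef(fit))))
```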
Data Question 2 [15 points]
Download data from 2016 CPS (the TA will help you find the data during recitation – look at the March CPS) which contains observations on weekly earnings, sex, race, age and education for respondents aged 25-64.
a. [5 points] Plot the weekly earnings of individuals against the number of years of education.
Solution :
library(foreign)
library(ggplot2)
# Read the 2016 extract and keep March respondents aged 25-64
mydata <- read.dta('morg16.dta')
newdata <- mydata[which(mydata$intmonth == 'March'), ]
newdata2 <- newdata[which(newdata$age >= 25 & newdata$age <= 64), ]
cols_keep <- c('earnwke', 'sex', 'age', 'race', 'grade92')
subset_data <- newdata2[cols_keep]
# Map the grade92 education codes to years of education
newvals_grade92 <- c('31' = 0, '32' = 3, '33' = 6, '34' = 8, '35' = 9,
                     '36' = 10, '37' = 11, '38' = 12, '39' = 12, '40' = 14, '41' = 14,
                     '42' = 14, '43' = 16, '44' = 17, '45' = 20, '46' = 22)
subset_data['yrs_ed'] <- newvals_grade92[as.character(subset_data$grade92)]
subset_data$grade92 <- NULL
subset_data['log_earnings'] <- log(subset_data$earnwke)
# Drop rows with missing or infinite values (e.g. log of zero earnings)
final_data <- na.omit(subset_data)
final_data <- final_data[!is.infinite(rowSums(final_data)), ]
ggplot(final_data, aes(yrs_ed, earnwke)) +
  geom_point(na.rm = TRUE) +
  labs(title = "Years of Educ. & Weekly Earnings", x = "Years of Education", y = "Weekly Earnings")
[Figure: scatter plot "Years of Educ. & Weekly Earnings", with Years of Education (0 to 20) on the x-axis.]
2023-03-27