
Assignment 5: ECON-UA 266 - Intro to Econometrics

Spring 2023

The solution to this assignment will be released on Friday March 10th, 2023. It covers the material related to

hypothesis tests and the multivariate regression model. For the data questions and any other questions that rely on R, report the output of your analysis in a readable "report style" and include the code you used to generate your results.

Continuation of Assignment 4 Q1

a. If children, on average, were expected to be of the same height as their parents, then this would imply two hypotheses, one for the slope and one for the intercept.

(i) What should the null hypothesis be for the intercept? [This means: How would you restrict the intercept in order to give as a prediction that the children, on average, are expected to be the same height as their parents?]

Solution

Note that the regression model is

Student Height = α + β · Midpar Height

If children, on average, are expected to be the same height as their parents, then Student Height = Midpar Height, so

Midpar Height = α + β · Midpar Height

For this to hold for any value of Midpar Height, the intercept must satisfy α = 0 (the slope restriction β = 1 is part (ii)).

As such, we have H0 : α = 0 vs. H1 : α ≠ 0

(ii) What should the null hypothesis be for the slope?  [This means: How would you restrict the slope in order to give as a prediction that the children, on average, are expected to be the same height as their parents?]

Solution

From the equation in part (i) we see that the slope must satisfy β = 1. As such, we have H0 : β = 1 vs. H1 : β ≠ 1
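A minimal sketch of how both tests could be run in R, assuming a hypothetical data frame heights with columns student and midpar (simulated below purely for illustration). The key point is that the t-statistic for the slope is centered at the hypothesized value 1, not at 0:

set.seed(1)
# hypothetical data: children's height roughly equal to midparent height
heights <- data.frame(midpar = rnorm(100, mean = 68, sd = 2))
heights$student <- heights$midpar + rnorm(100)

fit <- lm(student ~ midpar, data = heights)
est <- summary(fit)$coefficients

# t-stat for H0: alpha = 0 (this is the t reported by summary())
t_alpha <- est[1, "Estimate"] / est[1, "Std. Error"]

# t-stat for H0: beta = 1 (subtract the hypothesized value)
t_beta <- (est[2, "Estimate"] - 1) / est[2, "Std. Error"]

# two-sided p-values with N - 2 degrees of freedom
2 * pt(-abs(c(t_alpha, t_beta)), df = nrow(heights) - 2)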

Question 1

Simulation and multiple hypothesis testing in R.

Suppose the true relationship between Yi  and Xi  is captured by

Yi  = 0.9 + 0.7Xi + εi

where E[εi | Xi] = 0, εi is drawn from Fisher's z-distribution with df1 = 5 and df2 = 2, and Xi is a chi-square random variable with 4 degrees of freedom.

a.  Simulate 100 random samples of size N = 200 from the population.

b. Compute the OLS estimates for each sample.

Solution:

set.seed(100323)

N <- 200   # sample size
k <- 100   # number of simulated samples

# columns: intercept, se(intercept), slope, se(slope)
coefs <- matrix(ncol = 4, nrow = k)

for (i in 1:k) {
  X <- rchisq(N, df = 4)
  u <- 0.5 * log(rf(N, df1 = 5, df2 = 2))   # Fisher's z = 0.5 * log(F)
  Y <- 0.9 + 0.7 * X + u
  coefs[i, 1:2] <- summary(lm(Y ~ X))$coefficients[1, 1:2]
  coefs[i, 3:4] <- summary(lm(Y ~ X))$coefficients[2, 1:2]
}

c. Plot the distribution of the OLS estimators. What is the mean of the OLS estimates? Compare it to the population mean of the OLS estimator.

The plot of the intercept estimates:

library(ggplot2)

intercepts <- coefs[, 1]
qplot(intercepts, binwidth = 0.01)

## Warning: `qplot()` was deprecated in ggplot2 3.4.0.

The plot of the slope estimates:

slopes <- coefs[, 3]
qplot(slopes, binwidth = 0.01)

The means of the OLS estimates are given by:

mean_ols <- colMeans(coefs)
mean_ols

## [1] 1.10133616 0.09002781 0.69707510 0.01860230

Note that our average αOLS estimate, about 1.10, is far from 0.9. This comes from the fact that Fisher's z-distribution has a mean of around 0.18, and this mean is absorbed by the intercept in the regression model. Hence the population intercept is α = 0.9 + 0.18 = 1.08, which our average αOLS is close to.
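This shift is easy to verify by simulation; a quick sketch drawing a large number of Fisher's z values (0.5 times the log of an F variate):

set.seed(1)
z <- 0.5 * log(rf(1e6, df1 = 5, df2 = 2))
mean(z)   # roughly 0.18, the shift absorbed by the intercept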

d. Estimate the standard deviation of the OLS estimates and compare it to the standard error formula. Note that the formulas for the standard deviations of the intercept and slope are given by:

Std(α) = σε √((σX² + µX²)/(N σX²))

Std(β) = σε/(σX √N)

The standard deviations of the intercept and slope estimates are given by:

std_intercept <- sd(intercepts)
std_intercept

## [1] 0.0898923

std_slope <- sd(slopes)
std_slope

## [1] 0.01801986

Note that σε² ≈ 0.53, σX = √8 (so σX² = 8), and µX = 4.

Thus for the intercept we have Std(α) = √(0.53 · (8 + 16)/(200 · 8)) = √0.00795 ≈ 0.09.

For the slope we have Std(β) = √(0.53/(200 · 8)) ≈ 0.018.
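A short sketch plugging these population moments into the formulas above, where 0.53 is the approximate error variance used in the text:

N <- 200
sigma2e <- 0.53   # approximate variance of the Fisher's z errors
sigma2X <- 8      # Var of a chi-square(4) is 2 * df = 8
muX <- 4          # mean of a chi-square(4) is df = 4

sqrt(sigma2e * (sigma2X + muX^2) / (N * sigma2X))   # Std(alpha), ~0.089
sqrt(sigma2e / (N * sigma2X))                       # Std(beta),  ~0.018

Both values line up with the simulated standard deviations above.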

e. Test whether the true population parameter is β = 0.7 against the alternative β ≠ 0.7. How often do you reject the null, i.e., the true value? How does this relate to Type I and Type II errors?

Solution:

First we compute the t-statistic for each of the OLS regression slopes using

t = (βOLS − βH0)/se(βOLS)

t.stat <- (coefs[, 3] - 0.7) / coefs[, 4]

We then compute the critical value at the 5% significance level given df = 200 − 1 − 1 = 198.

# +/- 1.972
cv <- qt(0.025, df = N - 2)
cv

## [1] -1.972017

We then compare each of our t-statistics against this threshold.

tests <- (abs(t.stat) > abs(cv))
tests

##   [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE

sum(tests)

## [1] 3

Here "tests" is a vector of boolean values, because we compare each t-statistic against our critical value. If the absolute t-statistic is larger than the critical value, the entry is TRUE; otherwise it is FALSE. In R, TRUE (or T) is treated as 1 and FALSE (or F) as 0.

1 == TRUE

## [1] TRUE

0 == FALSE

## [1] TRUE

So, if we sum up the vector "tests" we get exactly the number of TRUEs, which is 3 here.

Also, given that we run the OLS 100 times and thus test the null hypothesis 100 times, we observe 3 samples out of 100 in which we reject the null hypothesis that β = 0.7. We know the null hypothesis is actually true (it is the value we used to generate the samples), yet we still reject it 3 times. These 3 occasions are cases where we commit a Type I error: falsely rejecting the null hypothesis when it is true. The probability of committing a Type I error equals our significance level (5% in this case), so observing 3 rejections out of 100 is reasonable.
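How reasonable is 3? A quick sketch, continuing from the tests vector above: with 100 independent tests at the 5% level, the number of false rejections is approximately Binomial(100, 0.05), so counts near 5 are typical.

mean(tests)                          # empirical Type I error rate: 0.03
pbinom(3, size = 100, prob = 0.05)   # P(3 or fewer rejections), ~0.26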

f. How does your answer change if your sample decreases to N = 20? [Repeat the entire exercise]

Solution:

set.seed(0)

N <- 20   # smaller sample size
k <- 100

coefs <- matrix(ncol = 4, nrow = k)

for (i in 1:k) {
  X <- rchisq(N, df = 4)
  u <- 0.5 * log(rf(N, df1 = 5, df2 = 2))
  Y <- 0.9 + 0.7 * X + u
  coefs[i, 1:2] <- summary(lm(Y ~ X))$coefficients[1, 1:2]
  coefs[i, 3:4] <- summary(lm(Y ~ X))$coefficients[2, 1:2]
}

# histograms
intercepts <- coefs[, 1]
qplot(intercepts, binwidth = 0.01)

slopes <- coefs[, 3]
qplot(slopes, binwidth = 0.01)

# means
mean_ols <- colMeans(coefs)
mean_ols

## [1] 1.08361260 0.28288773 0.69432237 0.06050888

# standard deviations
sd(intercepts)

## [1] 0.2826424

sd(slopes)

## [1] 0.06190018

# hypothesis testing
t.stat <- (coefs[, 3] - 0.7) / coefs[, 4]
cv <- qt(0.025, df = N - 2)
cv

## [1] -2.100922

tests <- (abs(t.stat) > abs(cv))
tests

##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE  TRUE FALSE

sum(tests)

## [1] 4

Here we repeat everything with the sample size decreased to 20, and we observe 4 Type I errors. This time the error rate is 4%, still around 5%. The chance of a Type I error equals the significance level and does not decrease as we include more observations.

Question 2

Suppose (Yi, Xi) is a random sample of size N = 250, where the estimated regression yields (standard errors in parentheses):

Ŷ = 3.1 + 2.1X
     (6.3)  (0.6)

Furthermore, R² = 0.53 and the SER = 6.2.

a. Test H0 : β = 0 vs. H1 : β ≠ 0 at the 5% and the 1% significance levels.

Solution

We shall employ the t-statistic method. First, let the significance level α = 0.05. The hypothesis is H0 : β = 0 vs. H1 : β ≠ 0, so this corresponds to a two-sided test. The t-statistic is given by

t-stat = (βOLS − β0)/se(βOLS) = (2.1 − 0)/0.6 = 3.5

In this case we reject the null hypothesis if

|t-stat| > tα/2,df

Here, with 248 degrees of freedom, t0.025,248 = 1.97. Since 3.5 > 1.97, we reject the null hypothesis that the slope is zero in favor of the alternative hypothesis that the slope is non-zero.

Proceeding similarly for the case in which the significance level α = 0.01, we have t0.005,248 = 2.60. Since 3.5 > 2.60, we again reject the null hypothesis that the slope is zero in favor of the alternative hypothesis that the slope is non-zero.
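The same calculation is easy to reproduce in R from the reported estimate and standard error; a minimal sketch:

beta_hat <- 2.1
se_beta <- 0.6
df <- 250 - 2

t_stat <- (beta_hat - 0) / se_beta   # 3.5
qt(0.975, df)                        # 5% two-sided critical value, ~1.97
qt(0.995, df)                        # 1% two-sided critical value, ~2.60
2 * pt(-abs(t_stat), df)             # two-sided p-value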

b. Test H0 : β = 0 vs. H1 : β > 0 at the 5% and the 1% significance levels.

Solution

The hypotheses are H0 : β = 0 vs. H1 : β > 0. Again we employ the t-statistic method, noting that t-stat = 3.5. At a significance level of 5%, and noting that this is a one-sided test, the critical value is t0.05,248 = 1.65. Since 3.5 > 1.65, we reject the null hypothesis at the 5% significance level.

For the 1% significance level, the critical value is t0.01,248 = 2.35. Since 3.5 > 2.35, we reject the null hypothesis at the 1% significance level.

c. Compare the results. Explain why there is a difference between the one-sided and the two-sided test.

Solution

Both the two-sided and the one-sided t-test reject the null hypothesis. Nevertheless, the two tests are distinct. The two-sided test asks whether the slope is statistically significantly different from 0, whereas the one-sided test we employ here asks whether the slope is statistically significantly greater than 0. As such, at a given significance level α, the two-sided test rejects only if |t-stat| > tα/2,df, while the one-sided test rejects if t-stat > tα,df. Since tα,df < tα/2,df, the one-sided test rejects more easily when the estimate lies in the direction of the alternative.

d. Construct the 99% confidence interval for β. What do you conclude in terms of the test in (a) given the confidence interval? Is your answer surprising? Why?

Solution

Noting that t0.005,248 = 2.60, the 99% confidence interval is given by

(βOLS − 2.60·se(βOLS), βOLS + 2.60·se(βOLS)) = (2.1 − 2.60 × 0.6, 2.1 + 2.60 × 0.6) = (0.54, 3.66)

Since β0 = 0 does not lie in the confidence interval, we reject the null hypothesis that the slope is zero at the 1% level. This is not surprising: for a two-sided test, rejecting H0 : β = 0 at the 1% significance level, as in part (a), is equivalent to 0 lying outside the 99% confidence interval.
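A minimal sketch of the same interval in R:

beta_hat <- 2.1
se_beta <- 0.6
cv <- qt(0.995, df = 248)            # ~2.60
beta_hat + c(-1, 1) * cv * se_beta   # ~(0.54, 3.66); 0 lies outside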

e. Suppose that Yi and Xi are truly independent. How often do you think the coefficient will be statistically significant (i.e., you reject the null in (a)) if you draw 1,000 samples of size 250?

At a significance level of 5% we would expect, purely by chance, a statistically significant slope coefficient about 50 times out of the 1,000 samples. Similarly, at a significance level of 1% we would expect a statistically significant slope coefficient about 10 times.
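This is straightforward to check by simulation; a sketch in which independent X and Y are drawn as standard normals (a distributional choice made purely for illustration):

set.seed(1)
rejections <- replicate(1000, {
  x <- rnorm(250)
  y <- rnorm(250)
  summary(lm(y ~ x))$coefficients[2, 4] < 0.05   # slope p-value below 5%?
})
sum(rejections)   # close to 50 out of 1,000 on average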

Question 3

Let CEOeducation denote the number of years of education of a CEO in a given firm, and let ROE denote the return on equity of the firm.  The return on equity (ROE) is calculated by dividing net income by shareholders’ equity and it captures the financial performance of a firm. A simple model relating ROE to the

education level of a CEO can be captured as follows:

ROEi  = α + βCEOeducationi + εi ,

a. What kinds of factors are contained in εi ? Are these likely to be correlated with the level of education?

Solution: εi captures the many factors other than the CEO's education that affect ROE. For example, a CEO's management skill could affect the firm's performance, and skill is likely correlated with education: getting an MBA or another management-related degree may improve management ability and thus lead to better firm performance.

b. Will a simple regression analysis uncover the causal effect of the CEO education on the financial performance of the firm? Explain.

Solution: No. A simple linear regression can suggest a relationship between variables, but we cannot infer a causal relationship from it. There could be many other factors that correlate with both the CEO's education and the firm's ROE. Without accounting for these hidden factors, it is hard to infer a causal relationship from this simple regression model.

Question 4

Consider the multiple regression model containing three independent variables, under Assumptions 1 through 3 as seen in class:

Yi  = β0 + β1 X1i + β2 X2i + β3 X3i + εi .

You are interested in estimating the sum of the parameters on X1 and X2; call this θ = β1 + β2.

a. Show that θOLS = β1,OLS + β2,OLS is an unbiased estimator of θ.

Solution:

By linearity of expectation and the unbiasedness of the OLS estimators under Assumptions 1 through 3,

E[θOLS] = E[β1,OLS + β2,OLS] = E[β1,OLS] + E[β2,OLS] = β1 + β2 = θ

Therefore, θOLS is an unbiased estimator of θ.

b. Derive the variance of θOLS .

Solution:

Var(θOLS) = Var(β1,OLS + β2,OLS) = Var(β1,OLS) + Var(β2,OLS) + 2Cov(β1,OLS, β2,OLS)

In general the covariance term does not vanish: in a multiple regression the OLS coefficient estimators are correlated whenever the corresponding regressors are correlated. Only in the special case where X1 and X2 are uncorrelated does Cov(β1,OLS, β2,OLS) = 0, leaving just the sum of the two variances.
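In practice, Var(θOLS) can be read directly off the estimated covariance matrix of the coefficients. A sketch with simulated data in which X1 and X2 are deliberately correlated, so the covariance term visibly matters:

set.seed(1)
n <- 500
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)   # x2 correlated with x1
x3 <- rnorm(n)
y <- 1 + x1 + x2 + x3 + rnorm(n)

V <- vcov(lm(y ~ x1 + x2 + x3))
# Var(b1 + b2) = Var(b1) + Var(b2) + 2 Cov(b1, b2)
V["x1", "x1"] + V["x2", "x2"] + 2 * V["x1", "x2"]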

Question 5

Wooldridge Chapter 4.4

Are rent rates influenced by the student population in a college town? Let rent be the average monthly rent paid on rental units in a college town in the United States. Let pop denote the total city population, avginc the average city income, and pctstu the student population as a percentage of the total population.  One model to test for a relationship is

log(rent) = β0 + β1 log(pop) + β2 log(avginc) + β3pctstu + u

(i) State the null hypothesis that size of the student body relative to the population has no ceteris paribus effect on monthly rents. State the alternative that there is an effect.

Solution: The null and alternative hypotheses are:

H0 : β3 = 0

H1 : β3 ≠ 0

(ii) What signs do you expect for β 1  and β2 ?

Solution: Both positive. Since the model is in logs, β1 and β2 are elasticities: the larger the city population and the higher the average city income, the higher we would expect rents to be.

Data Question

Continue with the dataset you downloaded in the previous question, i.e., the 2016 CPS, which contains observations on weekly earnings, sex, age, race, and education for respondents aged 25-64.

a. Generate the regression table with Y as the log of weekly earnings and X as years of education.

Solution:

library(stargazer)

##
## Please cite as:
##   Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##   R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

cps.data <- foreign::read.dta("morg16.dta")

# subset to the columns we need
d <- cps.data[c("earnwke", "sex", "age", "race", "grade92")]
d <- na.omit(d)

# map the grade92 education codes (31-46) to years of education
map <- setNames(c(0, 3, 6, 8, 9, 10, 11, 12, 12, 14,
                  14, 14, 16, 17, 20, 22), 31:46)
d['edu'] <- map[as.character(d$grade92)]
d$grade92 <- NULL

# take the log of weekly earnings; drop missing and infinite values
d['log_earning'] <- log(d$earnwke)
d <- na.omit(d)
d <- d[!is.infinite(rowSums(d)), ]

# restrict to respondents aged 25-64
finaldata <- subset(d, d$age >= 25 & d$age <= 64)

stargazer(finaldata, header = F)

Table 1:

Statistic     N        Mean     St. Dev.  Min    Max
earnwke       135,751  996.638  673.971   0.010  2,884.610
sex           135,751  1.490    0.500     1      2
age           135,751  43.465   11.165    25     64
race          135,751  1.429    1.275     1      26
edu           135,751  14.231   2.680     0      22
log_earning   135,751  6.666    0.757     4.605  7.967

ols <- lm(log_earning ~ edu, data = finaldata)
stargazer(ols, header = F)

Table 2:

                      Dependent variable:
                         log_earning
edu                        0.103***
                           (0.001)
Constant                   5.201***
                           (0.010)
Observations               135,751
R2
Adjusted R2
Residual Std. Error
F Statistic

Each additional year of education is associated with roughly 10.3% higher weekly earnings, and the coefficient is statistically significant at the 1% level.