Midterm Exam

Statistics & Data Analysis

October 26, 2023 (9:00 am - 9:00 pm)

Part I

1. [Definitions and concepts] Define/discuss the following. Stick to the notation from the lectures. If you use mathematical expressions, add your explanation in words, but be concise and complete. Unnecessarily long answers will be penalized. You can use graphs to elaborate your answer.

(a) Unbiasedness of the OLS estimators of the regression coefficients.

(b) Efficiency of the OLS estimators of the regression coefficients.

(c) The omitted variable bias of the OLS estimator of the coefficient of a variable (the included variable) when a relevant variable that is correlated with the included variable is omitted. In your answer you need to clearly define what the omitted variable bias is: that is, start with the definition of the omitted variable bias discussed in the lecture.

2. [True/False] Identify whether the following statements are true or false and present reasons why you believe so.

(a) A positive regression coefficient of an independent variable implies that the sample covariance between the independent variable and the dependent variable is positive.

(b) Suppose the variance of an estimator is $5/n^2$. Then, if the estimator is unbiased, it is consistent. [You can use a graph to answer this.]

(c) Suppose the true model includes two independent variables, but one independent variable is omitted. If the omitted variable is not correlated with the included variable, then there is no omitted variable bias.

Part II

[The two-step procedure of multiple regression estimation] We aim to estimate the causal effect of household income on household food expenditure. We rescale household income and household food expenditure by taking the log transformation. To measure the causal effect of household income on household food expenditure controlling for other factors, we add the number of household members as an additional independent variable, since household food expenditure is likely to increase if there are more individuals within a household. The multiple regression we need to estimate is

$$\log(\text{food expenditure}) = \beta_0 + \beta_1 \log(\text{income}) + \beta_2\,(\text{number of members}) + U \qquad (1)$$

We use the Ordinary Least Squares (OLS) method to estimate $\beta_0$, $\beta_1$, and $\beta_2$. The OLS estimators are denoted by $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$, respectively. The dataset is given by

[Table: dataset with columns l_income, l_f_expen and hhd_mbrs]

where the variable names l_income, l_f_expen, and hhd_mbrs refer to log(household income), log(food expenditure) and the number of household members, respectively. The following exercises use the two-step procedure of multiple regression estimation justified by the Frisch-Waugh-Lovell Theorem.

1. What are the dependent variable and the independent variables in this multiple regression?

2. Inspection using scatterplots:

(a)  Given the scatterplot of log(food expenditure) against log(income), what is the sign of the sample covariance between the two variables?  Explain why you believe so.

[Figure: scatterplot of l_f_expen (vertical axis) against l_income (horizontal axis)]

(b) Given the scatterplot of log(food expenditure) against the number of household members, what is the sign of the sample covariance between the two variables? Explain why you believe so.

[Figure: scatterplot of l_f_expen (vertical axis) against hhd_mbrs (horizontal axis)]

(c) The scatterplot of the number of household members against log(income) and that of log(income) against the number of household members are as follows.  What is the sign of the sample covariance between the two variables? Explain why you believe so.

[Figure: scatterplot of l_income (vertical axis) against hhd_mbrs (horizontal axis)]

[Figure: scatterplot of hhd_mbrs (vertical axis) against l_income (horizontal axis)]

(d) The exercises in Part II demonstrate what happens to the coefficient of log(income) if the number of household members (hhd_mbrs) is omitted. Given (a)-(c), what can you say about the omitted variable bias of the coefficient of log(income) if hhd_mbrs is omitted? In particular, do you expect the coefficient of log(income) to be larger in a simple regression (with hhd_mbrs omitted) than in a multiple regression (with hhd_mbrs included)? Explain why you believe so.

3. Estimation of the coefficient of log(household income), $\hat{\beta}_1$:

(a) [The 1st step: Running an auxiliary regression of l_income on hhd_mbrs] To find the coefficient of log(income), in the first step we run an auxiliary regression of log(income) (l_income) on the number of household members (hhd_mbrs).

i. What are the dependent and independent variables in this auxiliary regression?

ii. Calculate the residuals of this auxiliary regression, $\{\hat{r}_{i1}\}$ (following the notation of the lecture). Specify all the necessary intermediate calculation results in your answer.

iii. Calculate the SST of this regression ($SST_1$, following the notation of the lecture).

iv. Calculate the $R^2$ of this regression ($R_1^2$, following the notation of the lecture).

v. Interpret $R_1^2$.

(b) [The 2nd step] Now run the second-stage (step) regression of l_f_expen on the first-step residuals $\hat{r}_{i1}$. By the Frisch-Waugh-Lovell Theorem, one can show that the coefficient of $\hat{r}_{i1}$ obtained from this simple regression, which can be calculated as

$$\frac{\sum_i \hat{r}_{i1}\,(\text{l\_f\_expen}_i)}{\sum_i \hat{r}_{i1}^2},$$

is actually the OLS estimator of the multiple regression, (1). That is, by the Frisch-Waugh-Lovell Theorem,

$$\hat{\beta}_1 = \frac{\sum_i \hat{r}_{i1}\,(\text{l\_f\_expen}_i)}{\sum_i \hat{r}_{i1}^2}.$$

Note, however, that the intercept of the 2nd-stage (step) regression would not be the same as $\hat{\beta}_0$.

i. What are the dependent and independent variables in the 2nd-step regression?

ii. How would you interpret the first-step residuals $\hat{r}_{i1}$? (That is, we are using $\hat{r}_{i1}$, instead of log(income), as a regressor in the 2nd-stage simple regression when we calculate the regression coefficient of log(income), $\beta_1$, in (1). How are $\hat{r}_{i1}$ and l_income related? Suppose log(household income) and the number of household members are completely uncorrelated, that is, $R_1^2 = 0$; how are $\hat{r}_{i1}$ and l_income related then?)

iii. Calculate $\hat{\beta}_1$.
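The two-step procedure in questions 3 and 4 can be illustrated numerically. The sketch below is a minimal Python example and does not use the exam's data: it generates a synthetic sample (the sample size, coefficients and noise levels are arbitrary assumptions), reuses the variable names l_income, l_f_expen and hhd_mbrs, and introduces hypothetical helper functions `simple_ols`, `residuals`, `two_step_coefficient` and `multiple_ols`. It compares each two-step coefficient with the corresponding coefficient from the multiple regression, solved directly from the normal equations.

```python
import random

def simple_ols(x, y):
    """Return (intercept, slope) from the OLS regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def residuals(y, x):
    """Residuals from the OLS regression of y on x (with intercept)."""
    b0, b1 = simple_ols(x, y)
    return [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

def two_step_coefficient(x_target, x_other, y):
    """Step 1: residuals of x_target on x_other. Step 2: sum(r*y) / sum(r^2)."""
    r = residuals(x_target, x_other)
    return sum(ri * yi for ri, yi in zip(r, y)) / sum(ri * ri for ri in r)

def multiple_ols(x1, x2, y):
    """Return (b0, b1, b2) from OLS of y on a constant, x1 and x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [a - m1 for a in x1]
    d2 = [a - m2 for a in x2]
    dy = [a - my for a in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 ** 2  # determinant of the 2x2 normal equations
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

# Synthetic data (assumed values, for illustration only).
random.seed(0)
n = 200
hhd_mbrs = [random.randint(1, 8) for _ in range(n)]
l_income = [10 + 0.3 * m + random.gauss(0, 1) for m in hhd_mbrs]
l_f_expen = [1 + 0.5 * inc + 0.2 * m + random.gauss(0, 0.5)
             for inc, m in zip(l_income, hhd_mbrs)]

b0, b1, b2 = multiple_ols(l_income, hhd_mbrs, l_f_expen)
b1_two_step = two_step_coefficient(l_income, hhd_mbrs, l_f_expen)
b2_two_step = two_step_coefficient(hhd_mbrs, l_income, l_f_expen)

print(f"beta1: multiple = {b1:.6f}, two-step = {b1_two_step:.6f}")
print(f"beta2: multiple = {b2:.6f}, two-step = {b2_two_step:.6f}")
```

In this sketch the two printed values in each row should agree up to floating-point rounding, which is what the Frisch-Waugh-Lovell Theorem predicts for the slope coefficients. The second-step intercept is not computed here because it does not reproduce the intercept of the multiple regression.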

4. Estimation of the coefficient of hhd_mbrs, $\hat{\beta}_2$:

(a) [The 1st step: Running an auxiliary regression of hhd_mbrs on l_income] We first run an auxiliary regression of hhd_mbrs on l_income.

i. What are the dependent and independent variables in this auxiliary regression?

ii. Calculate the residuals of this auxiliary regression, $\{\hat{r}_{i2}\}$ (following the notation of the lecture). Specify all the necessary intermediate calculation results in your answer.


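(b) [The 2nd step] Now run the second-stage regression of l_f_expen on the first-step residuals $\hat{r}_{i2}$. The coefficient of $\hat{r}_{i2}$ obtained from this simple regression,

$$\frac{\sum_i \hat{r}_{i2}\,(\text{l\_f\_expen}_i)}{\sum_i \hat{r}_{i2}^2},$$

is actually the OLS estimator of the multiple regression, (1). That is,

$$\hat{\beta}_2 = \frac{\sum_i \hat{r}_{i2}\,(\text{l\_f\_expen}_i)}{\sum_i \hat{r}_{i2}^2}.$$

Note, however, that the intercept of the 2nd-stage regression would not be the same as $\hat{\beta}_0$.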

i. What are the dependent and independent variables in the 2nd-step regression?

ii. How would you interpret the first-step residuals $\hat{r}_{i2}$? (That is, we are using $\hat{r}_{i2}$ instead of hhd_mbrs.)

iii. Calculate $\hat{\beta}_2$.

5. Based on the result in the lecture, the coefficient of $X_1$ satisfies the following relation:

$$\hat{\beta}_1^{S} = \hat{\beta}_1^{M} + \hat{\beta}_2^{M}\,\tilde{\delta}_1,$$

where $\tilde{\delta}_1$ is the coefficient estimator of the simple regression of $X_2$ on $X_1$. Check whether this relationship holds in this example by plugging in the numbers for $\hat{\beta}_1^{S}$, $\hat{\beta}_1^{M}$, $\hat{\beta}_2^{M}$ and $\tilde{\delta}_1$.
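The relation in question 5 can also be checked algebraically. The derivation below is a sketch under stated assumptions: both regressions include an intercept, the superscripts S and M are read as "simple" and "multiple", and the symbols $y_i$, $x_{i1}$, $x_{i2}$, $\bar{x}_1$ and $\hat{u}_i$ (the multiple-regression residuals) are introduced here for illustration and may differ from the lecture notation.

```latex
\begin{align*}
\hat{\beta}_1^{S}
  &= \frac{\sum_i (x_{i1}-\bar{x}_1)\, y_i}{\sum_i (x_{i1}-\bar{x}_1)^2},
  \qquad
  y_i = \hat{\beta}_0^{M} + \hat{\beta}_1^{M} x_{i1} + \hat{\beta}_2^{M} x_{i2} + \hat{u}_i \\[4pt]
  &= \hat{\beta}_0^{M}\,\frac{\sum_i (x_{i1}-\bar{x}_1)}{\sum_i (x_{i1}-\bar{x}_1)^2}
   + \hat{\beta}_1^{M}\,\frac{\sum_i (x_{i1}-\bar{x}_1)\,x_{i1}}{\sum_i (x_{i1}-\bar{x}_1)^2}
   + \hat{\beta}_2^{M}\,\frac{\sum_i (x_{i1}-\bar{x}_1)\,x_{i2}}{\sum_i (x_{i1}-\bar{x}_1)^2}
   + \frac{\sum_i (x_{i1}-\bar{x}_1)\,\hat{u}_i}{\sum_i (x_{i1}-\bar{x}_1)^2} \\[4pt]
  &= 0 + \hat{\beta}_1^{M} + \hat{\beta}_2^{M}\,\tilde{\delta}_1 + 0 .
\end{align*}
```

The first term vanishes because deviations from the mean sum to zero. The second ratio equals one because $\sum_i (x_{i1}-\bar{x}_1)x_{i1} = \sum_i (x_{i1}-\bar{x}_1)^2$. The third ratio is the slope of the simple regression of $X_2$ on $X_1$, and the last term vanishes because the OLS first-order conditions give $\sum_i \hat{u}_i = 0$ and $\sum_i x_{i1}\hat{u}_i = 0$.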

6. Based on the results and discussion above, would you add or omit hhd_mbrs in the regression to find the causal effect of household income on household food expenditure?

7. Consider a simple regression of log(food expenditure) (l_f_expen) on log(income) (l_income). We let $Y$ = l_f_expen and $X_1$ = l_income.

(a) Given the data, find the sample analog of $\mathrm{Cov}(Y, X_1)$, $\widehat{\mathrm{Cov}}(Y, X_1)$.

(b) Given the data, find the sample analog of $\mathrm{Var}(X_1)$, $\widehat{\mathrm{Var}}(X_1)$.

(c) Calculate the ratio of the sample analogs, $\widehat{\mathrm{Cov}}(Y, X_1)$ to $\widehat{\mathrm{Var}}(X_1)$. Is it the same as $\hat{\beta}_1^{S}$?

(d) Can you find the values of $\mathrm{Cov}(Y, X_1)$ and $\mathrm{Var}(X_1)$? How are $\mathrm{Cov}(Y, X_1)$ and $\widehat{\mathrm{Cov}}(Y, X_1)$ related? Likewise, how are $\mathrm{Var}(X_1)$ and $\widehat{\mathrm{Var}}(X_1)$ related? Discuss.
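The comparison asked for in question 7(c) can be tried out on made-up numbers. The sketch below is a minimal Python illustration on synthetic data (the sample size, coefficients and noise level are arbitrary assumptions, and the helper names `sample_cov`, `sample_var` and `ols_slope_normal_equations` are hypothetical). The simple-regression slope is computed from the raw-moment normal equations, so that it is obtained by a different route than the covariance-to-variance ratio.

```python
import random

def sample_cov(x, y, ddof=1):
    """Sample covariance with divisor n - ddof (ddof=1 or ddof=0)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - ddof)

def sample_var(x, ddof=1):
    """Sample variance, computed as the covariance of x with itself."""
    return sample_cov(x, x, ddof)

def ols_slope_normal_equations(x, y):
    """Slope of y on x (with intercept) from the raw-moment normal equations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / (n * sxx - sx ** 2)

# Synthetic data (assumed values, for illustration only).
random.seed(1)
x1 = [random.gauss(10, 2) for _ in range(100)]
y = [1 + 0.6 * a + random.gauss(0, 1) for a in x1]

ratio_n_minus_1 = sample_cov(y, x1, ddof=1) / sample_var(x1, ddof=1)
ratio_n = sample_cov(y, x1, ddof=0) / sample_var(x1, ddof=0)
slope = ols_slope_normal_equations(x1, y)

print(ratio_n_minus_1, ratio_n, slope)
```

The divisor (n or n - 1) cancels in the ratio, so both versions of the ratio should match the slope up to floating-point rounding.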

8. Interpret the coefficients $\hat{\beta}_1$ and $\hat{\beta}_2$ obtained by estimating the regression equation (1).

Part III

1. Data are collected from a random sample of 220 home sales from a community in 2013. Let Price denote the selling price (in $1000), BDR denote the number of bedrooms, Bath denote the number of bathrooms, Hsize denote the size of the house (in square feet), Lsize denote the lot size (in square feet), Age denote the age of the house (in years), and Poor denote a binary variable that is equal to 1 if the condition of the house is reported as "poor" and 0 otherwise. An estimated regression yields

$$\widehat{\ln(\text{Price})} = 119.2 + 0.485\,\text{BDR} + 23.4\,\text{Bath} + 1.56\,\ln(\text{Hsize}) + 0.002\,\text{Lsize} + 0.090\,\text{Age} - 48.8\,\text{Poor}$$

(a) If we change the measure (or unit) of Lsize into thousands of square feet, what would be the coefficient of Lsize?

(b) Bearing in mind that a log-transformation is used in the regression, interpret the coefficient of ln(Hsize). Use a "% change" interpretation whenever possible, instead of "one unit change in...".

(c) Bearing in mind that a log-transformation is used in the regression, interpret the coefficient of Age. Use a "% change" interpretation whenever possible, instead of "one unit change in...".
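Two general facts sit behind the questions in Part III: rescaling a regressor rescales its OLS coefficient by the inverse factor, and in a regression with a logged dependent variable the "100 times the coefficient" percentage reading is an approximation to 100(exp(b) - 1). The sketch below is a minimal Python illustration on synthetic data, not on the home-sales sample: the variable names `lsize_sqft` and `ln_price`, the helper functions `ols_slope` and `exact_pct_change`, and all numerical values are assumptions made for the example.

```python
import math
import random

def ols_slope(x, y):
    """Slope from the OLS regression of y on x (with intercept)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    return s_xy / s_xx

def exact_pct_change(coef, delta=1.0):
    """Exact % change in Y implied by a change of delta in a level regressor
    when the dependent variable is ln(Y)."""
    return 100 * (math.exp(coef * delta) - 1)

# Synthetic data (assumed values, for illustration only).
random.seed(1)
lsize_sqft = [random.uniform(2000, 20000) for _ in range(100)]
ln_price = [4.0 + 0.00002 * s + random.gauss(0, 0.1) for s in lsize_sqft]

# Same regressor measured in square feet and in thousands of square feet.
slope_sqft = ols_slope(lsize_sqft, ln_price)
lsize_thousand = [s / 1000 for s in lsize_sqft]
slope_thousand = ols_slope(lsize_thousand, ln_price)
print(slope_sqft, slope_thousand)

# Approximate versus exact percentage readings for a few hypothetical coefficients.
for b in (0.01, 0.05, 0.20):
    print(b, 100 * b, exact_pct_change(b))
```

In this sketch `slope_thousand` should equal 1000 times `slope_sqft` up to rounding, and the gap between the approximate and exact percentage readings widens as the coefficient grows. For a logged regressor, as with ln(Hsize), the coefficient is read as an elasticity: the percentage change in the dependent variable associated with a 1% change in the regressor.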