ECON 121, Applied Econometrics and Data Analysis

Summer 2022

PROBLEM SET 2: SOCIOECONOMIC DETERMINANTS OF HEALTH

Instructions:

• Use the provided “pset2_submission.R” template file to complete this assignment. Do not modify the file name for your submission. The autograder requires this filename to grade your assignment.

• Use the “setwd()” command to read in the data files locally. Comment out the “setwd()” command before you submit to Gradescope. Do not modify the provided code in the template file that loads the data; doing so will cause an error with the autograder.

• Use only the packages loaded in “pset2_submission.R” when executing the tasks for the problem set. The autograder is configured to use only these packages and may not work if you use others.

Problem Set:

This problem set explores the socioeconomic correlates of health status in the United States. The dataset https://github.com/credpath/econ121/raw/main/nhis2000.rda (download this file locally) contains a sample of adults from the 2000 National Health Interview Survey, with 5-year mortality follow-up. You will analyze two outcome variables: (1) mortality in the five years after the survey and (2) self-reported health status.

Self-reported health status is based on the question: “On a scale of 1 (excellent) to 5 (poor), how would you rate your health?” The dataset contains many interesting covariates, including measures of socioeconomic status, race, health behaviors, and health conditions. To make the results representative of the population, we will use the sampling weights sampweight throughout, except for the local linear regressions and the marginal effects. Similarly, to account for potential serial correlation in the error term within primary sampling units (PSUs), we will cluster standard errors at the PSU level throughout.

1. Generate a binary variable called fpoor that equals 1 if the respondent reports fair or poor health and 0 otherwise. Summarize the data, creating weighted means and unweighted means for each variable, and assign the output to an object called nhis_summary. Do this using the following code:

    nhis_summary <-
      nhis %>%
      sapply(
        FUN = function(x) c(
          Mean = ...
          Mean_wt = ...
        )
      )

Instead of ... write the commands to create your weighted and unweighted means; instead of writing a variable name inside those commands, just write x. Make sure to separate each command by a comma. (Hint: This will create a final matrix with 2 rows and 29 columns.) Does weighting meaningfully change the average age, or sex or race shares? (Answer with code and words.)
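As a rough illustration of the sapply pattern described above, the sketch below computes unweighted and weighted means column by column on a small made-up data frame. The object names (toy, toy_summary) and the columns (age, female, wt) are hypothetical and are not part of the assignment data; base R's mean() and weighted.mean() are one possible choice of functions, not necessarily the ones the template expects.

```r
# Toy data: column names and values are invented for illustration only.
toy <- data.frame(
  age    = c(30, 40, 50, 60),
  female = c(1, 0, 1, 0),
  wt     = c(1, 1, 2, 4)   # hypothetical sampling weight
)

# For every column x, return the unweighted mean and the weighted mean.
toy_summary <- sapply(
  toy,
  FUN = function(x) c(
    Mean    = mean(x, na.rm = TRUE),
    Mean_wt = weighted.mean(x, w = toy$wt, na.rm = TRUE)
  )
)

# A matrix with 2 rows (Mean, Mean_wt) and one column per variable.
toy_summary
```

Because the heavier weights sit on the older rows in this toy example, the weighted mean of age is pulled above the unweighted mean. Comparing the two rows in the same way is one approach to judging whether weighting matters.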

2. To get a sense of how self-reported health status relates to mortality risk over the lifecycle, estimate local linear regressions of mortality on age, with a bandwidth of 1. You do not need to use weights. Do this for two separate samples: (a) people who report being in fair-to-poor health and (b) people who report being in good-to-excellent health. Assign a dataframe of the outputs to poor_health_reg and not_poor_health_reg, respectively. Plot both regressions using the ggplot command and assign the plot to an object called mort_age_plot. (Hint: See the R example from lecture 3.) How does the risk of death change with age? Do people with worse self-reported health status have higher risk of death? Are these plots the non-parametric versions of the linear probability model, the logit model, or the probit model? (Answer with code and words.)
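To make the idea of a local linear regression concrete, here is a hand-rolled sketch in base R on simulated data. It is not the lecture 3 example, which may rely on a packaged function. The helper local_linear(), the Gaussian kernel, the bandwidth of 2, and the simulated variables are all assumptions made for illustration.

```r
# Local linear regression: at each grid point x0, run weighted least squares
# of y on (x - x0) with kernel weights. The fitted intercept estimates
# E[y | x = x0]. The Gaussian kernel is an assumption of this sketch.
local_linear <- function(x, y, grid, bw) {
  sapply(grid, function(x0) {
    k   <- dnorm((x - x0) / bw)              # kernel weights centered at x0
    fit <- lm(y ~ I(x - x0), weights = k)    # local line around x0
    unname(coef(fit)[1])                     # intercept = fitted value at x0
  })
}

# Simulated binary outcome whose probability rises with x (made-up numbers).
set.seed(1)
x    <- runif(500, 25, 85)
y    <- rbinom(500, 1, plogis(-6 + 0.07 * x))
grid <- 30:80

# Data frame of grid points and fitted values, ready for plotting.
fit_df <- data.frame(age = grid, mort = local_linear(x, y, grid, bw = 2))
head(fit_df)
```

Each fitted value is a local average-plus-slope estimate with no link function imposed, so nothing constrains it to lie between 0 and 1.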

3. Now use bar graphs to describe the relationship between socioeconomic variables and health. Write:

    ggplot(
      data = data_frame_name,
      mapping = aes(
        x = xvar,
        y = yvar,
        weight = weightvar
      )
    ) +
      stat_summary(
        fun.data = mean_sdl,
        geom = "bar"
      )

You will need to generate two new socioeconomic variables that contain the right categories: three categories for income and five categories for education. Use 1, 2, 3 or 1, 2, 3, 4, 5 for the values of these categories, where 1 represents the lowest category. Use the following variables to construct these categories: faminc_gt75 represents family income above 75 thousand dollars, faminc_25t75 represents family income between 25 and 75 thousand dollars, and edyrs represents years of education. Name the income category faminc and the education category edlev. For each graph, describe your results and take note of any unexpected patterns. (Answer with code and words.)

(a) Graph rates of mortality and fair/poor health by the level of family income.  Assign the plots to objects called mort_graph_inc and health_graph_inc, respectively.

(b) Graph rates of mortality and fair/poor health by education level, with five categories of educational attainment: less than high school completion (<12), high school completion (12), some college (13-15), college completion (16), and post-graduate study (>16). Assign the plots to objects called mort_graph_educ and health_graph_educ, respectively.
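One way to build ordered category codes like those described above is base R's cut() for a years-of-schooling variable and simple arithmetic for a pair of income dummies. The sketch below uses made-up vectors (edyrs_toy, d_mid, d_high) rather than the assignment data, and it assumes that years of education are whole numbers and that the two income dummies are coded 0/1 and are mutually exclusive.

```r
# Hypothetical years of education (whole numbers assumed).
edyrs_toy <- c(8, 11, 12, 13, 15, 16, 17, 20)

# cut() uses right-closed intervals by default:
# (-Inf,11] -> 1, (11,12] -> 2, (12,15] -> 3, (15,16] -> 4, (16,Inf] -> 5
edlev_toy <- cut(
  edyrs_toy,
  breaks = c(-Inf, 11, 12, 15, 16, Inf),
  labels = FALSE            # return integer codes instead of interval labels
)
edlev_toy

# Hypothetical 0/1 dummies for middle and high income (mutually exclusive).
d_mid  <- c(0, 1, 0)
d_high <- c(0, 0, 1)

# 1 = neither dummy set (lowest), 2 = middle, 3 = high.
faminc_toy <- 1 + 1 * d_mid + 2 * d_high
faminc_toy
```

If a dummy contains missing values, the arithmetic propagates NA into the category, which is usually the desired behavior.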

4. Age, income, education, sex, and race/ethnicity are correlated, so we must use multiple regression to disentangle the relative importance of these variables in determining health. For both outcomes, 5-year mortality and fair/poor health, run linear probability models, probit models, and logit models with age, education, family income, sex, and race/ethnicity as independent variables. Explain why we might want to let age and education enter the regression model continuously (as one term each) or why we might want to let them enter as categorical/factor variables. Let age and education enter as factor/categorical variables. To do so, create a new variable called agecat for age in ten-year bins (25-34, ..., 65-74, 75-84, 85+) with values 1, 2, 3, 4, 5, 6, 7 and use the education categories (edlev) from the previous question. Make sure these variables are factor variables. (To do that, use as.factor(), possibly inside the mutate() command.) Use white = 1 as the omitted base race/ethnicity category, and use low family income as the omitted base category for income. Assign the regression outputs to objects called reg_lpm_mort, reg_probit_mort, and reg_logit_mort for mortality and reg_lpm_health, reg_probit_health, and reg_logit_health for fair/poor health, respectively. For the probit and logit models, compute the marginal effects of the independent variables. Use the probitmfx and logitmfx commands but, for simplicity, don’t weight these regressions. Assign the marginal effects output to objects called reg_probit_mort_mfx, reg_logit_mort_mfx, reg_probit_health_mfx, and reg_logit_health_mfx, respectively. Describe your results and take note of any expected or unexpected patterns. Are the LPM, probit, and logit results similar? (Answer with code and words.)
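The three model types can be compared on a common scale by looking at effects on the predicted probability. The sketch below does this in base R on simulated data. It is deliberately simplified: it is unweighted, uses default (non-clustered) standard errors, and computes an average discrete-change effect by hand instead of calling probitmfx or logitmfx. All names (toy, died, female, ame) and the simulated coefficients are hypothetical.

```r
set.seed(121)
n <- 2000
toy <- data.frame(
  age    = sample(25:90, n, replace = TRUE),
  female = rbinom(n, 1, 0.5)
)

# Ten-year age bins coded 1..7, stored as a factor so each bin gets a dummy.
toy$agecat <- as.factor(
  cut(toy$age, breaks = c(24, 34, 44, 54, 64, 74, 84, Inf), labels = FALSE)
)

# Simulated binary outcome (made-up data-generating process).
toy$died <- rbinom(n, 1, plogis(-4 + 0.05 * (toy$age - 25) - 0.3 * toy$female))

lpm    <- lm(died ~ agecat + female, data = toy)
probit <- glm(died ~ agecat + female, family = binomial(link = "probit"), data = toy)
logit  <- glm(died ~ agecat + female, family = binomial(link = "logit"),  data = toy)

# Average effect of switching female from 0 to 1 on the predicted probability.
ame <- function(model) {
  d1 <- transform(toy, female = 1)
  d0 <- transform(toy, female = 0)
  mean(predict(model, d1, type = "response") - predict(model, d0, type = "response"))
}

# The LPM coefficient is already on the probability scale.
c(LPM = unname(coef(lpm)["female"]), Probit = ame(probit), Logit = ame(logit))
```

Raw probit and logit coefficients are on different latent scales, which is why the comparison is made on probability effects. Note that this hand-computed effect averages over observations, which may differ from what a packaged marginal-effects command reports if that command evaluates effects at the covariate means.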

5. Holding all else equal, do high-income black people have higher or lower mortality risk than low-income white people?  Use your estimates from reg_lpm_mort to run this test.  Assign the t-stat from this test to an object called t_stat. Do you think this regression specification is appropriate for testing for differences between high-income black people and low-income white people? If not, how would you alter it? (You do not need to run any new altered regressions.) (Answer with code and words.)
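Comparing two groups that differ in more than one dummy variable amounts to testing a linear combination of coefficients. As a generic sketch on simulated data, the t-statistic for a combination a'b is the estimate divided by the square root of a'Va, where V is the coefficient covariance matrix. The variables g and h, the weights in a, and the use of the default vcov() are assumptions for illustration. With clustered standard errors, V would be replaced by a cluster-robust covariance matrix.

```r
set.seed(2)
n <- 1000
toy <- data.frame(g = rbinom(n, 1, 0.3), h = rbinom(n, 1, 0.4))
toy$y <- 0.2 + 0.05 * toy$g - 0.08 * toy$h + rnorm(n, sd = 0.3)

fit <- lm(y ~ g + h, data = toy)

# Weights defining the combination of interest: coefficient on g plus
# coefficient on h. The order must match names(coef(fit)).
a <- c("(Intercept)" = 0, g = 1, h = 1)

b <- coef(fit)
V <- vcov(fit)                       # default (non-robust) covariance matrix

est   <- sum(a * b)                  # estimate of a'b
se    <- sqrt(drop(t(a) %*% V %*% a))  # standard error includes the covariance term
t_toy <- est / se
t_toy
```

The covariance between the two coefficients enters the standard error, so dividing the sum by the sum of the two individual standard errors would not give the correct statistic.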

6. Should we think of the coefficients (or marginal effects) on family income as causal? Why or why not? (Answer with words only.)

7. Many wonder how much of the relationship between socioeconomic status and health reflects differences in health insurance or differences in health behaviors (think of behavior as a choice you can make). Use the logit model to explore the role of these mediating variables. Start by recoding the five mediator variables (described below) into dummy variables to make the results easier to interpret. (One value for zero and one value for not zero.) Create a new dataframe nhis_nomiss that subsets the data so that there are no missing values for these variables (use the is.na() command; see lecture 4 R example). Do not subset the original nhis dataframe. Then, using the nhis_nomiss dataframe, estimate the following models similar to reg_logit_mort from question 4: i. include one mediating variable for insurance and assign the output to reg_insurance; ii. include four mediating variables for health behaviors and assign the output to reg_behavior; and iii. re-estimate reg_logit_mort but call it reg_logit_mort2. (Hint: to see the health behavior values and labels, use unique(nhis$var_name) to view the labels of var_name. There can be more labels than values present in the data.) Interpret the results. (Answer with code and words.)
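The recode-then-subset step can be sketched on a toy data frame as follows. The column names (med_a, med_b) are placeholders, not the actual mediator variables, and the zero versus nonzero coding is one reading of the instruction above. The original data frame is left untouched and the complete cases go into a new object.

```r
# Toy data with two hypothetical mediator variables containing missing values.
toy <- data.frame(
  id    = 1:6,
  med_a = c(0, 2, NA, 1, 0, 3),
  med_b = c(1, 0, 0, NA, 0, 2)
)

# Dummy = 0 if the original value is zero, 1 if nonzero; NA stays NA.
toy$med_a_d <- as.numeric(toy$med_a != 0)
toy$med_b_d <- as.numeric(toy$med_b != 0)

# New data frame keeping only rows with no missing mediator; toy is unchanged.
toy_nomiss <- toy[!is.na(toy$med_a_d) & !is.na(toy$med_b_d), ]
toy_nomiss
```

Estimating every model on the same complete-case sample means that changes in the socioeconomic coefficients reflect the added mediators and not a change in which observations are included.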