PPHA 31002: Homework 1

Profs. Frank, Lo & Moskowitz

Harris School of Public Policy

Due by October 16, 2023 at 11:45 PM (CT)

instructions

Please upload the following by 11:45 PM (CT) on the due date:
1. Your write-up through the PDF Gradescope portal – Your write-up should be formatted in a clean and professional way; it should be organized, clearly labeled, and fully legible. Additionally, you must always accurately tag the location of each question in your PDF write-up using Gradescope. Note: the answers that you upload to Gradescope will be the answers that are graded. You are responsible for ensuring that you have uploaded the most up-to-date and most complete file.

2. Your R script through the R-Script Gradescope portal – This script should be able to reproduce all answers you included in your write-up that were generated using R, including numerical quantities, graphs/figures, etc. The script should be organized and appropriately commented so that a skilled R user can easily follow your code. Note that this script should be comprehensive (meaning that it produces everything included in your write-up that you generated in R) and executable (meaning that, after changing the working directory to the appropriate location, it should run all the way through without triggering errors).

Your PDF must contain the answers to all questions, including numerical answers, written explanations, and relevant graphs and figures; your grader will not look to your R script for your answers. In other words, you should consider the R script to be a supporting document for a technical reader, but the reader is not expected to reference your script to find an answer to a question. You must show your work.

Collaboration Policy: You are permitted to work together on this assignment in small groups of 2-4 students. All students are required to submit their own write-ups to Gradescope. Note that your write-up and code should reflect only your own understanding of the material. As such, these should be written in your own words. If your write-up contains identical language to another person in the course (including a member of your group), it is considered a violation of the school’s academic integrity policies. Please indicate at the top of your write-up the names of students with whom you collaborated.

statistical exercises

For questions 1-6, consider the following prompt. Suppose a bowl contains 50 objects: 20 red circles, 15 green squares, 10 blue triangles, and 5 red squares.
1. What is the probability of drawing a green object? [0.5pt]
2. What is the probability of drawing a red object? [0.5pt]
3. Suppose you are blindfolded and you draw an object from the bowl. You feel that it is square. What is the probability it is red? What is the probability it is blue? [1pt]
4. Your friend hears you say that it is a square object, but she can’t see the color of the object, which is concealed in your hand. She wants to play a betting game. Specifically, if the object in your hand is green, your friend wins and you have to pay her $20. However, if the object in your hand is red, you win and your friend has to pay you $50. What is your expected financial return from the game? What is your friend’s expected financial return from the game? [1pt]
5. What is the probability of drawing a red object conditional on having drawn a red circle with replacement? [0.5pt]
6. What is the probability of drawing a red object conditional on having drawn a red circle without replacement? [0.5pt]

For questions 7-11, consider the following information. Suppose a policy school has 48 faculty members: 15 study only politics, 24 study only economics, and 9 study both politics and economics. The table below indicates seniority status (untenured or tenured) among these faculty by disciplinary area:

Seniority    Politics only    Economics only    Both
Untenured    5                8                 3
Tenured      10               16                6

Suppose this policy school requires you to sit for an oral exam at the end of the quarter and chooses a faculty member at random to examine you. Note that fortunately for you, this question is entirely hypothetical and not based in fact.

7. What is the probability your examiner studies only economics? [0.5pt]

8. What is the probability your examiner studies economics? [0.5pt]

9. What is the probability your examiner is untenured and studies only politics? [0.5pt]

10. Are untenured and tenured disjoint events? Briefly explain in one sentence. [0.5pt]

11. Suppose you learn from a reliable source that your examiner is tenured. Is this informative as to their disciplinary area? Briefly explain and demonstrate why your answer is true by calculating the relevant probabilities. [1pt]
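As a generic illustration of comparing marginal and conditional probabilities in a two-way table, here is a minimal sketch in R. The counts, row labels, and column labels are made up for illustration and are not the faculty table above.

```r
# Hypothetical counts (NOT the faculty table above): rows = status, columns = group
counts <- matrix(c(4, 6,
                   8, 12),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(status = c("junior", "senior"),
                                 group  = c("A", "B")))

# Marginal distribution of group: P(group)
p_group <- colSums(counts) / sum(counts)

# Conditional distribution of group within each status: P(group | status)
p_group_given_status <- prop.table(counts, margin = 1)

p_group
p_group_given_status
```

When every row of the conditional table matches the marginal distribution, as it does in this toy example, learning the row carries no information about the column.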

For questions 12-13, consider the following information. Suppose the size composition of households in the population of a city is as follows:

HH size    Percentage
1          50%
2          25%
3          5%
4          15%
5          5%

12. Let X indicate the size of a household. Find the expected value of X. [1pt]

13. Find the variance of X. [1pt]
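As a reminder of the definitions for a discrete random variable, E[X] is the sum of x times p(x) and Var(X) is the sum of (x - E[X])^2 times p(x). The sketch below applies them in R to a made-up distribution; the values and probabilities are hypothetical and are not the household data above.

```r
# Hypothetical discrete distribution (NOT the household table above)
x <- c(0, 1, 2)          # possible values
p <- c(0.2, 0.5, 0.3)    # probabilities; must sum to 1

stopifnot(isTRUE(all.equal(sum(p), 1)))

ev  <- sum(x * p)              # expected value: sum of x * p(x)
v   <- sum((x - ev)^2 * p)     # variance: probability-weighted squared deviation
sdx <- sqrt(v)                 # standard deviation

c(mean = ev, variance = v, sd = sdx)
```

For this toy distribution the mean is 1.1, the variance is 0.49, and the standard deviation is 0.7.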

For questions 14-16, consider the following CDF:

14. Plot the PMF or PDF that corresponds to the above CDF. [2pt]

15. Calculate the expected value of Y. [1pt]

16. Calculate the variance and standard deviation of Y. [1pt]

For questions 17-20, consider the function below. Hint: You do not need to use calculus to find the solutions to any of these questions; you can simply use geometry.

17. Is the above function a valid PDF? Briefly explain in 1-2 sentences how you know. [1pt]

18. Find the following: P(1 ≤ Z ≤ 3), P(Z = 4). [1pt]

19. Find the following: P(Z ≤ 0), P(Z ≤ 1), P(Z ≤ 2), P(Z ≤ 6). [2pt]

20. If the function is a valid PDF for Z, plot the CDF for Z. [1.5pt]

For questions 21-24, consider the probability density function below. Hint: You do not need to use calculus to find the solutions to any of these questions; you can simply use geometry.

21. What is the expected value of X? If it is not possible to determine that value, briefly explain why. [0.5pt]

22. Find P(1 ≤ X ≤ 3). If it is not possible to determine that value, briefly explain why. [0.5pt]

23. Find P(8 ≤ X ≤ 10). If it is not possible to determine that value, briefly explain why. [0.5pt]

24. Let F(x) indicate the cumulative distribution function for X. Find F(5). If it is not possible to determine that value, briefly explain why. [0.5pt]

25. Suppose A and B are disjoint and P(A) = 0.7. What is the possible range for P(B)? [0.5pt]

26. Suppose P(A) = 0.7 and P(A ∩ B) = 0.2. What is the possible range for P(B)? [0.5pt]

27. Suppose P(A) = 0.7, P(A ∩ B) = 0.2, and P(A|B) = 0.8. What is the possible range for P(B)? [1pt]

For questions 28-30, indicate whether the statement is true or false. If the statement is false, briefly explain in 1-2 sentences why it is false.

28. True/false: Sampling weights are necessary for simple random samples but not other types of random samples. [0.5pt]

29. True/false: Suppose we over-sample residents from rural areas to ensure we have an adequately large number in our sample. As a result, the sampling weights for non-rural residents will be larger than the sampling weights for rural residents. [0.5pt]

30. True/false: The primary rationales for a clustered random sampling design are practicality and cost reduction. [0.5pt]

data exercise

In this part of the assignment, we will examine trends over time in voter turnout in U.S. elections. Electoral participation is considered by some to be an important indicator of democratic performance (e.g., Powell 1982). As a result, declines in voter participation within a polity are often concerning to scholars and political observers. For instance, Rosenstone and Hansen (1993), commenting on a decrease in turnout, noted that the “decline of citizen involvement in government has yielded a politically engaged class that is not only growing smaller and smaller but is also less and less representative.” For this data exercise, we will examine data on turnout in U.S. presidential and midterm elections from McDonald and Popkin (2001) for the period of 1948-2020.[1] We are providing you with two data sets, both of which are available on the Canvas HW1 assignment page: mcdonald1.csv and mcdonald2.csv. The data sets contain the following variables:

File             Variable       Description
mcdonald1.csv    year           Election year
mcdonald1.csv    votes_higho    Votes cast for the highest office on the ballot (in thousands)
mcdonald1.csv    vap            Voting-age population living in the U.S. (in thousands)
mcdonald2.csv    noncit_pop     Non-citizen resident population in the U.S. (in thousands)
mcdonald2.csv    felon_inel     Convicted felons ineligible to vote (in thousands)
mcdonald2.csv    overseas_el    Overseas U.S. citizens eligible to vote (in thousands)

Before proceeding, let’s read the mcdonald1.csv data set into R. We will separately examine voting patterns in presidential and midterm election years, so let’s create a variable in your dataframe called midterm, which is coded 0 for presidential election years and 1 for midterm election years.[2] There are various ways to create such a variable, but the following code would work:
# set the working directory
setwd("~/Dropbox/Stats I/Homeworks2023/hw1")
# read in the mcdonald1.csv data set
mcdon <- read.csv(file = "data/mcdonald1.csv")
# create a midterm variable
mcdon$midterm[mcdon$year %in% seq(1952,2020,4)] <- 0
mcdon$midterm[mcdon$year %in% seq(1954,2020,4)] <- 1

[1] For more information on their study, see McDonald, Michael P. and Samuel L. Popkin. 2001. “The Myth of the Vanishing Voter.” American Political Science Review 95(4): 963-974.

[2] The data sets you are provided contain only presidential and midterm elections. Presidential elections occur every four years (e.g., 1952, ..., 2020), and midterm elections occur every four years too (e.g., 1954, ..., 2018).
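One possible alternative for building the midterm indicator is sketched below on a small made-up data frame rather than mcdonald1.csv. The data frame name toy is hypothetical. The shortcut assumes the data contain only presidential and midterm election years, so that presidential years are exactly the multiples of four.

```r
# Hypothetical stand-in for the real data: only a year column
toy <- data.frame(year = seq(1952, 1966, by = 2))

# Presidential years in this range fall on multiples of 4 (1952, 1956, ...),
# so the remaining even years are midterm years
toy$midterm <- ifelse(toy$year %% 4 == 0, 0, 1)

# Sanity check: every year should be coded, with no missing values
table(toy$midterm, useNA = "ifany")
toy
```

Tabulating a new indicator with useNA = "ifany" is a quick way to confirm that no row was left uncoded.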


31. Scholars traditionally measured turnout by examining the number of votes cast in an election as a share of the voting-age resident population. Create such a variable and store it in your dataframe. Based on the variable you just created, what was the average turnout rate in presidential elections for this time period? What was the average turnout rate in midterm elections? Is turnout generally higher in midterm or presidential elections? [1pt]

32. In what year was the turnout rate highest? Write a single line of code to determine this answer. [0.5pt]

33. In what year was the turnout rate the lowest? Again, write a single line of code to find this answer. [0.5pt]

34. What is the average turnout rate for each of the following time periods: 1952–1970, 1972–1990, and 1992–2020? [1pt]

35. Graph the turnout rate over time for presidential elections only. Make sure that the axes are labeled and that your plot has an appropriate title. [2pt]

36. In 1-2 sentences, briefly describe the over-time patterns in presidential turnout as a share of the voting-age resident population. [1pt]
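The base R tools relevant to the questions above (group means, locating the row with an extreme value, and averaging within ranges of years) can be sketched on made-up data. Everything in the toy data frame below, including the column name rate and the period labels, is hypothetical; it is one of several reasonable approaches, not a prescribed one.

```r
# Hypothetical data (NOT mcdonald1.csv): a made-up rate for a few election years
toy <- data.frame(
  year    = c(1952, 1954, 1956, 1958, 1960, 1962),
  midterm = c(0, 1, 0, 1, 0, 1),
  rate    = c(0.60, 0.40, 0.58, 0.42, 0.62, 0.44)
)

# Mean of rate within each value of midterm
tapply(toy$rate, toy$midterm, mean)

# Year in which rate is largest / smallest (one line each)
toy$year[which.max(toy$rate)]
toy$year[which.min(toy$rate)]

# Mean of rate within year ranges: cut() bins years into labelled periods.
# With the default right = TRUE, the bins are (1951, 1956] and (1956, 1962].
toy$period <- cut(toy$year, breaks = c(1951, 1956, 1962),
                  labels = c("1952-1956", "1958-1962"))
tapply(toy$rate, toy$period, mean)
```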

The main insight from McDonald and Popkin (2001) is that using the voting-age population (VAP) living in the U.S. as the denominator for a turnout measure is problematic. Specifically, the VAP includes non-citizens and convicted felons who are not eligible to vote in these elections, and it excludes citizens residing overseas who are eligible to vote.[3] Read the mcdonald2.csv data set into R and merge it into the data set you have been analyzing.

# read mcdonald2.csv into R
mcdon2 <- read.csv(file = "data/mcdonald2.csv")
# merge mcdon2 into our data set
mcdon.merged <- merge(mcdon, mcdon2, by = "year")
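After any merge, it is worth checking that no rows were silently dropped. The sketch below uses two made-up data frames (a and b, with hypothetical columns votes and extra) to show how merge() behaves when the key values do not match exactly.

```r
# Hypothetical stand-ins for the two files (made-up values)
a <- data.frame(year = c(1952, 1954, 1956), votes = c(10, 8, 11))
b <- data.frame(year = c(1954, 1956, 1958), extra = c(1, 2, 3))

# Default merge keeps only years present in BOTH data frames
inner <- merge(a, b, by = "year")

# all.x = TRUE keeps every row of the first data frame, filling gaps with NA
left <- merge(a, b, by = "year", all.x = TRUE)

# Compare row counts before and after merging
nrow(a)
nrow(inner)
nrow(left)
```

If the merged data frame has fewer rows than the original, some years appear in only one of the inputs.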

37. Create a new variable in your dataframe: turnout as a share of the voting-eligible population (VEP). The voting-eligible population accounts for ineligible non-citizens and felons as well as eligible citizens residing overseas. What was the average turnout rate (based on your new VEP measure) in presidential elections for 1952-2020? What was the average turnout rate in midterm elections? [1pt]

38. What is the average turnout rate (again, using the VEP measure) for each of the following time periods: 1952–1970, 1972–1990, and 1992–2020? [1pt]

39. Create a plot in which you graph two separate time series on the same plot. The first series is presidential turnout as a share of VAP, which you plotted for question 35, and the second series is presidential turnout as a share of VEP. As always, make sure that the axes are labeled and that your plot has an appropriate title.[4] [2pt]

40. Briefly describe over-time patterns in presidential turnout as a share of the voting-eligible population. In what ways, if any, are these over-time patterns different from the over-time patterns based on the VAP measure? Are you now more or less concerned about the health of U.S. democracy compared to your assessment after questions 35-36? [1pt]

[3] Whether a felony conviction results in the loss of voting rights, temporarily or permanently, varies based on state law. See here for more information: https://www.ncsl.org/elections-and-campaigns/felon-voting-rights.

[4] If you make this graph using base R tools, in addition to the plot() command, you should also use the points() command to graph the second series on the same plot. You can also use text() to label each series on the graph.
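The base R approach mentioned in the note on plot(), points(), and text() might look roughly like the sketch below. The data frame toy and its columns a and b are made-up numbers, not the turnout data, and the labels and styling are illustrative choices only.

```r
# Hypothetical series (made-up numbers, NOT the turnout data)
toy <- data.frame(
  year = seq(1952, 1972, by = 4),
  a    = c(0.60, 0.59, 0.62, 0.61, 0.60, 0.55),
  b    = c(0.62, 0.60, 0.64, 0.63, 0.62, 0.57)
)

# Common y-axis limits so that neither series is clipped
yl <- range(toy$a, toy$b)

# The first series sets up the axes, labels, and title
plot(toy$year, toy$a, type = "b", pch = 16, ylim = yl,
     xlab = "Year", ylab = "Rate", main = "Two series on one plot")

# The second series is added to the existing plot
points(toy$year, toy$b, type = "b", pch = 1, lty = 2)

# Label each series near its last point (pos = 1 is below, pos = 3 is above)
text(max(toy$year), toy$a[nrow(toy)], "Series A", pos = 1)
text(max(toy$year), toy$b[nrow(toy)], "Series B", pos = 3)
```

Setting ylim from both series before the first plot() call matters because points() does not rescale the axes.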