STATS221-22A Statistical Data Analysis Assignment One
Important Notes:
This assignment is to be solved using RStudio. Type your answers in a Word document, paste the RStudio outputs and figures into this document, and print it for your submission.
Make sure that you download ‘STATS221-22A Assignment Cover Page.pdf’ from Moodle, print it, clearly write your name and student ID in the space provided, and use it as the cover page for your assignment submission.
Submit your assignment by dropping it in the STATS221 box located outside the main reception at the FG link – ground floor.
Please ensure that you clearly identify which question and task each of your answers relates to.
Your attention is drawn to the policies regarding plagiarism and late submission which are described in the course outline available on Moodle.
Maximum marks for each question are indicated in square brackets.
Question 1: [17 Marks]
It is well known that many people try to improve their chances of winning a lottery by using lucky numbers or buying their tickets from certain lucky stores. To estimate what proportion of the population would buy from a lucky store, a question was asked in the class survey. Participants were asked to respond Yes/No to the following question: “If you learned that a certain store has sold the winning lotto ticket a few times in the past year, would you be tempted to buy a ticket from this store?”
In the survey, 12 out of 43 respondents said ‘Yes’. These responses are also stored in the CSV file luckylotto.CSV on Moodle.
Task 1:
a. Based on this data, what is your estimate of the proportion of people in the whole population who would consider buying their tickets this way? [3 Marks]
b. Explain why this estimate is considered to be ‘statistically’ sound. [4 Marks]
Task 2:
a. Find the confidence intervals for your estimate at the 95% and the 99% levels and interpret each. [6 Marks]
b. Explain why the 99% confidence interval is wider than the 95% one. [4 Marks]
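As a rough illustration of Task 2, the sketch below computes a normal-approximation (Wald) interval from the counts given in the question. It is one possible approach, not necessarily the method expected in the course. R's built-in `prop.test` uses a different interval (Wilson score with continuity correction), so its output will differ slightly. The helper name `wald_ci` is made up for this example.

```r
# Counts taken from the question text: 12 'Yes' answers out of 43 respondents.
yes <- 12
n   <- 43

p_hat <- yes / n                      # sample proportion (point estimate)
se    <- sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error of p_hat

# Wald interval: p_hat +/- z * se, where z is the standard normal quantile
# that leaves (1 - level) / 2 in each tail. 'wald_ci' is a hypothetical helper.
wald_ci <- function(level) {
  z <- qnorm(1 - (1 - level) / 2)
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

ci95 <- wald_ci(0.95)
ci99 <- wald_ci(0.99)

print(round(p_hat, 3))
print(round(ci95, 3))
print(round(ci99, 3))
```

The 99% level uses a larger multiplier `z` than the 95% level while `se` stays the same, which is why the 99% interval comes out wider.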
Question 2: [25 Marks]
For this question, we will use the data in the CSV file Treesize.CSV. A copy of this can be found on Moodle. This data was collected in the US state of Georgia to test whether tree growth is superior in a warmer climate compared to a cooler one. The data comes from two regions: north (n) and south (s). The northern region is elevated and hence has a much cooler climate than the southern region. Thirty pine trees were randomly selected from each region. Sizes were determined by measuring the diameter at breast height (DBH) of each tree in the sample. The data contains the following variables:
Variable | Description | Variable Type
ns | Region: n or s | Categorical
dbh | Diameter at breast height (DBH) measurements | Numeric
Task 1:
a. Produce a descriptive summary of the two groups of data. Which statistics did you include in this summary? Why? Paste your output. [2 Marks]
b. Use plot/s to examine the two groups of data graphically. Which plot/s did you choose? Why? Paste your output. [2 Marks]
c. Based on the descriptive summary and the plots that you have produced, describe the patterns in the data. [3 Marks]
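A minimal sketch of how a per-group summary and a side-by-side plot might be produced in R. The data frame here is simulated purely so the example runs on its own; its values are not the Treesize.CSV measurements, and only the column names `ns` and `dbh` come from the variable table above. With the real file, the simulated block would be replaced by something like `read.csv("Treesize.CSV")`, assuming the file sits in the working directory.

```r
set.seed(1)

# Simulated stand-in data (NOT the real Treesize.CSV values).
trees <- data.frame(
  ns  = rep(c("n", "s"), each = 30),
  dbh = c(rnorm(30, mean = 24, sd = 5), rnorm(30, mean = 34, sd = 12))
)

# One column per region, one row per statistic.
group_summary <- sapply(split(trees$dbh, trees$ns), function(x) {
  c(n = length(x), mean = mean(x), median = median(x),
    sd = sd(x), min = min(x), max = max(x))
})
print(round(group_summary, 2))

# Side-by-side boxplots show centre, spread, and outliers for both regions.
boxplot(dbh ~ ns, data = trees, xlab = "Region", ylab = "DBH")
```

Which statistics and plots to report is part of the task; the ones shown are only common choices for comparing two numeric samples.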
Task 2:
What are the appropriate null and alternative hypotheses for comparing the two groups of data? Justify your choice. [4 Marks]
Task 3:
a. Explain why it is appropriate to use a two-sample t-test on this data. [2 Marks]
b. How would you decide which version of the two-sample t-test is most appropriate for this data? Describe the process you would follow in making this decision, then perform the required tasks in that process and paste the outputs of those tasks in your submission. [5 Marks]
c. Based on the results of part b above, which version of the t-test is most appropriate? Why? [2 Marks]
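One common way to choose between the pooled (equal-variance) and Welch (unequal-variance) versions is to compare the two group variances first, for example with an F test. The sketch below assumes that approach and a 5% cut-off, neither of which is prescribed by the assignment. The data is simulated and does not reflect Treesize.CSV.

```r
set.seed(2)

# Simulated stand-in data (NOT the real Treesize.CSV values).
trees <- data.frame(
  ns  = rep(c("n", "s"), each = 30),
  dbh = c(rnorm(30, mean = 24, sd = 5), rnorm(30, mean = 34, sd = 12))
)

# Step 1: F test of the null hypothesis that the two variances are equal.
f_res <- var.test(dbh ~ ns, data = trees)
print(f_res)

# Step 2: treat the variances as equal only if the F test does not reject
# at the (assumed) 5% level.
equal_var <- f_res$p.value >= 0.05

# Step 3: run the matching t-test. var.equal = TRUE gives the pooled test,
# var.equal = FALSE gives the Welch test.
t_res <- t.test(dbh ~ ns, data = trees, var.equal = equal_var)
print(t_res)
```

With the formula interface the groups are ordered alphabetically, so the estimated difference is n minus s. The default alternative is two-sided; a one-sided test can be requested through the `alternative` argument of `t.test` if the hypotheses call for it.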
Task 4:
a. Perform the two-sample t-test (the version you have chosen in Task 3) to test the hypotheses you described in Task 2. Paste your output. [2 Marks]
b. What do you conclude at the 5% level? Report your finding in the context of the question posed. [3 Marks]
Question 3: [14 Marks]
Vehicles have built-in computers that calculate various quantities related to performance. One of these is fuel efficiency. A car manufacturer wants to test the accuracy of these measurements. To do this, they arrange a number of test runs in which, in addition to the computer calculating the miles per gallon (mpg), the driver also records the mpg by dividing the miles driven by the number of gallons at fill-up. They want to determine whether the two readings agree. For this question, we will use the data stored in the CSV file MPG_Comparison.CSV. A copy of this can be found on Moodle. The data contains the following variables:
Variable | Description | Variable Type
Test-run | Test run # | Numeric
Computer | mpg reading recorded by the computer | Numeric
Driver | mpg reading recorded by the driver | Numeric
Diff | Difference = Computer reading – driver reading | Numeric
Task 1:
What are the appropriate null and alternative hypotheses for this data? Justify your choice. [4 Marks]
Task 2:
Which test is the most appropriate for this data? Explain why. [5 Marks]
Task 3:
a. Perform the test you chose in Task 2 to test the hypotheses you described in Task 1. Paste your output. [2 Marks]
b. What do you conclude at the 1% level? Report your finding in the context of the question posed. [3 Marks]
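Because each test run yields both a computer reading and a driver reading, the two columns are matched pairs. The sketch below assumes a paired t-test is the analysis under consideration and shows that it is equivalent to a one-sample t-test on the `Diff` column. The data is simulated, and the number of runs (20) is an arbitrary choice, not taken from MPG_Comparison.CSV.

```r
set.seed(3)

# Simulated stand-in data (NOT the real MPG_Comparison.CSV values).
driver   <- rnorm(20, mean = 40, sd = 4)
computer <- driver + rnorm(20, mean = 2, sd = 1.5)
mpg <- data.frame(Computer = computer, Driver = driver,
                  Diff = computer - driver)

# Paired t-test on the two readings from the same test runs.
paired <- t.test(mpg$Computer, mpg$Driver, paired = TRUE)

# Equivalent one-sample t-test of whether the mean difference is zero.
one_sample <- t.test(mpg$Diff, mu = 0)

print(paired)
print(one_sample)
```

Both calls report the same t statistic, degrees of freedom, and P-value, since the paired test works on the within-run differences.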
Question 4: [14 Marks]
For this question, we will use the data stored in the CSV file SexPartners.CSV. A copy of this can be found on Moodle. The data contains information obtained from the STAT121 students on the number of ‘sexual partners’ they have had. This data was collected many years ago. The data contains:
Variable | Description | Variable Type
Gender | Gender: F or M | Categorical
No partner | Count of the students with no sexual partners | Numeric
1 partner | Count of the students with one sexual partner | Numeric
2 partners | Count of the students with two sexual partners | Numeric
3 partners | Count of the students with three sexual partners | Numeric
4 partners | Count of the students with four sexual partners | Numeric
5 partners | Count of the students with five sexual partners | Numeric
Task 1:
We want to test whether there are gender differences in the number of sexual partners using tests of association.
What would be your null and alternative hypotheses? Why? [4 Marks]
Task 2:
Perform both the Chi-squared test and Fisher’s exact test on this data to test your hypothesis. Paste the output in your Word document. [2 Marks]
Task 3:
Which one of these two tests do you think is more appropriate for this data? Explain why. [3 Marks]
Task 4:
Answer the following questions:
a) For the test that you have deemed to be the most appropriate, what is the P-value? How would you interpret it? [2 Marks]
b) In the Chi-squared test output, what is the ‘Chi-sq’ value? How is it related to the P-value reported? [3 Marks]
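The sketch below illustrates how both tests can be run on a 2 x 6 table of counts in R, and how the Chi-squared statistic maps to its P-value. The counts are invented for illustration and are not the SexPartners.CSV data. Only the layout (two genders by six partner categories) follows the variable table above.

```r
# Hypothetical counts (NOT the real SexPartners.CSV values).
counts <- matrix(c(20, 15, 8, 4, 2, 1,
                   12, 10, 9, 6, 5, 4),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Gender = c("F", "M"),
                                 Partners = c("0", "1", "2", "3", "4", "5")))

# Chi-squared test of association. R warns when expected counts are small,
# so the warning is suppressed here and the expected counts printed instead.
chi <- suppressWarnings(chisq.test(counts))
print(chi)
print(round(chi$expected, 2))

# Fisher's exact test does not rely on the large-sample approximation.
fisher <- fisher.test(counts)
print(fisher)

# The Chi-squared P-value is the upper-tail area of a chi-squared
# distribution with (rows - 1) * (cols - 1) degrees of freedom,
# evaluated at the observed statistic.
p_manual <- pchisq(chi$statistic, df = chi$parameter, lower.tail = FALSE)
print(unname(p_manual))
```

Inspecting `chi$expected` is one way to judge whether the Chi-squared approximation is reasonable; a common rule of thumb is that most expected counts should be at least 5.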
2022-03-17