
STATS 786

SEMESTER 1, 2021

STATISTICS

Time Series Forecasting for Data Science

Midterm - Test

Instructions

• The length of the test includes an additional 30 minutes (to allow for reading time, the additional complexity of the online mode, and submission). You get one hour for answering the questions and an extra 30 minutes for uploading your files.

You must submit your final answers before the due time, so do not leave submitting until just before the due time; make sure you allow time for submission.

• Test answers will not be accepted after the end of this extra 30 minute period.

• If you encounter computer/internet/other issues during the test that affect your ability to work on or submit your test answers, please contact the lecturer via email ([email protected]).

• We STRONGLY recommend you download your submitted document from Canvas, after submitting it, to verify you have uploaded the correct document. It is your responsibility to check you have submitted the correct document.

• It is your responsibility to ensure your test is successfully submitted on time. Please don’t leave it until the last minute to submit your test.

Academic Honesty Declaration

By completing this assessment, I agree to the following declaration: I understand the University expects all students to complete coursework with integrity and honesty. I promise to complete all online assessment with the same academic integrity standards and values. Any identified form of poor academic practice or academic misconduct will be followed up and may result in disciplinary action. As a member of the University's student body, I will complete this assessment in a fair, honest, responsible and trustworthy manner.

This means that:

I declare that this assessment is my own work.

I will not seek out any unauthorised help in completing this assessment.

I am aware the University of Auckland may use plagiarism detection tools to check my content.

I will not discuss the content of the assessment with anyone else in any form, including Canvas, Piazza, Facebook, Twitter or any other social media or online platform, within the assessment period.

I will not reproduce the content of this assessment anywhere in any form at any time.

I declare that I generated the calculations and data in this assessment independently, using only the tools and resources defined for use in this assessment.

I will not share or distribute any tools or resources I developed for completing this assessment.

1 Run the following code in R.

# Use your student ID as the seed
set.seed(2021)
sample(letters[1:6], 3, replace = FALSE)

Use the output from the above R code to select the statements that you need to answer from the list given below. For example, suppose the output for the above code is "f", "b", and "c"; then you should select statements "b", "c", and "f" from the list below to answer this question.

Note: Please make sure to replace the seed used in the above R code by your student ID to select the statements that you need to answer in this question.
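As a minimal sketch of the selection step: the seed 2021 below is only the placeholder shown in the question and stands in for a student ID, and the variable name `picked` is illustrative. Sorting the output makes it easier to match the letters against the list of statements.

```r
# Replace 2021 with your own student ID; 2021 is only the placeholder
# seed shown in the question.
set.seed(2021)

# Draw three distinct letters from "a" to "f".
picked <- sample(letters[1:6], 3, replace = FALSE)

# Sort so the selection reads in the same order as the list of statements.
sort(picked)
```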

State whether the selected statements are true or false. You MUST provide reasoning for your answer.

a There is something wrong with my forecasts because they take the same value for all forecast horizons.

b I should always choose the regression model with the smallest sum of squared errors for obtaining predictions.

c Prediction intervals are not very important because most people want the point forecasts.

d A time series cross-validation based on a rolling forecast origin is better than a simple test set for comparing forecast methods.

e A white noise series has zero mean and constant autocovariance.

f Linear regression models are simplistic because the real world is nonlinear.

[Total: 15 marks]
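For readers unfamiliar with the term in statement d, the following is a rough base R sketch of a rolling forecast origin. It is an illustration only: the series is synthetic, the minimum training length of 30 is an arbitrary assumption, and the two methods compared (mean and naive) are chosen purely for simplicity.

```r
set.seed(1)                        # arbitrary seed for the synthetic series
y <- cumsum(rnorm(60))             # synthetic random walk, not exam data

min_train <- 30                    # smallest training window (assumed)
origins <- min_train:(length(y) - 1)

# One-step-ahead errors, recomputed at each forecast origin.
errors <- t(sapply(origins, function(n) {
  train <- y[1:n]
  actual <- y[n + 1]
  c(mean = actual - mean(train),   # mean method: average of the training data
    naive = actual - train[n])     # naive method: last observed value
}))

# Root mean squared error per method, taken over all origins.
rmse <- sqrt(colMeans(errors^2))
rmse
```

Each origin contributes one out-of-sample error, so the accuracy measure is based on many forecasts rather than on a single test window.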

2 This question attempts to analyze the effect of temperature and pollution level on weekly cardiovascular mortality in one of the states in the US.

Note: Please refer to the appendix on pages 7–9 for the necessary figures.

Figure 1 shows the time plots for average weekly cardiovascular mortality, temperature, and particulate pollution level over ten years. Figure 2 shows a scatter plot matrix of mortality and the two predictor variables.

a Briefly describe the main features that you can observe from Figures 1 and 2. [5 marks]

b Let $M_t$ denote cardiovascular mortality, $T_t$ the temperature, and $P_t$ the particulate level at time $t$. One of the students in the class suggested fitting the following four models:

$M_t = \beta_0 + \beta_1 t + e_t$, (M1)

$M_t = \beta_0 + \beta_1 t + \beta_2 (T_t - \bar{T}) + e_t$, (M2)

$M_t = \beta_0 + \beta_1 t + \beta_2 (T_t - \bar{T}) + \beta_3 (T_t - \bar{T})^2 + e_t$, (M3)

$M_t = \beta_0 + \beta_1 t + \beta_2 (T_t - \bar{T}) + \beta_3 (T_t - \bar{T})^2 + \beta_4 P_t + e_t$, (M4)

where $\bar{T}$ denotes the mean temperature. Explain briefly why the student has suggested fitting these four models. [4 marks]

c Summary statistics for M1–M4 are given in Table 1. Among these models, which one do you select as the best model? Briefly give reasons for your selection. Interpret the value of $\bar{R}^2$.

Table 1: Summary statistics for models M1–M4.

Model   $\hat{\sigma}^2$   $\bar{R}^2$   AIC    BIC
M1      79.1               0.209         2224   2237
M2      62.2               0.378         2103   2120
M3      55.5               0.445         2047   2068
M4      40.8               0.592         1891   1916

[5 marks]

d Figure 3 shows the residual diagnostics for the best model chosen from M1–M4. What conclusions can you draw from these plots? [3 marks] [Total: 17 marks]
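The nested structure of M1–M4 can be sketched in base R as follows. This is an illustration under stated assumptions, not the analysis behind Table 1: the data are synthetic, the sample size of 520 weeks and all variable names (`mort`, `week`, `temp_c`, `part`) are hypothetical, and `lm()` is used in place of any time series modelling package. The AIC and BIC values returned by base R may be on a different scale from those reported in Table 1.

```r
set.seed(786)                       # arbitrary seed; the data below are synthetic
n <- 520                            # assumed: ten years of weekly observations
week <- seq_len(n)
temp <- 70 + 10 * cos(2 * pi * week / 52) + rnorm(n, sd = 5)
part <- 50 + 15 * cos(2 * pi * week / 52) + rnorm(n, sd = 8)
temp_c <- temp - mean(temp)         # temperature centred at its mean
mort <- 90 - 0.02 * week - 0.4 * temp_c + 0.02 * temp_c^2 +
  0.2 * part + rnorm(n, sd = 6)
dat <- data.frame(mort, week, temp_c, part)

# Each model adds one term to the previous one.
fits <- list(
  M1 = lm(mort ~ week, data = dat),
  M2 = lm(mort ~ week + temp_c, data = dat),
  M3 = lm(mort ~ week + temp_c + I(temp_c^2), data = dat),
  M4 = lm(mort ~ week + temp_c + I(temp_c^2) + part, data = dat)
)

# Residual variance, adjusted R-squared and information criteria per model.
summary_table <- data.frame(
  sigma2 = sapply(fits, function(f) summary(f)$sigma^2),
  adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared),
  AIC = sapply(fits, AIC),
  BIC = sapply(fits, BIC)
)
summary_table

# Ljung-Box test for remaining autocorrelation in the M4 residuals.
lb <- Box.test(residuals(fits$M4), lag = 10, type = "Ljung-Box")
lb$p.value
```

Centring the temperature before squaring it keeps the linear and quadratic terms from being strongly correlated with each other.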


3 The revenue-domestic-flights.csv file contains information about monthly revenue from domestic flights in the US from 1979–2000.

a Read the file into R and convert it to a tsibble object. [3 marks]

b Plot the revenue series and comment briefly on the main features of the data. [2 marks]

c Do you think a Box-Cox transformation is useful for this time series? Briefly give reasons for your answer. [3 marks]

d Mention at least four forecasting methods that are most appropriate for this series. [4 marks]

e Using the last 2 years of data as the test set, fit the methods that you suggested in part d (you may transform the original series based on your answer to part c). [11 marks]

f Obtain the forecasts for 2 years. [2 marks]

g Compare the accuracy of your forecasts from different methods against the test set. [2 marks]

h Which method does best? Justify your selection. [3 marks]

i Plot the point forecasts from the best method along with the 95% prediction interval. [3 marks] [Total: 33 marks]
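The transformation, train/test split and accuracy comparison described in parts c to g can be sketched as below. Several things here are assumptions made for illustration: the series is synthetic rather than read from the CSV file, base R `ts` objects are used instead of a tsibble, `lambda = 0` is an arbitrary choice, and the two benchmark methods (naive and seasonal naive) are only examples and not a recommended answer to part d. The helper names `box_cox`, `inv_box_cox` and `rmse` are hypothetical.

```r
set.seed(42)                                   # arbitrary; data are synthetic
months <- 264                                  # Jan 1979 to Dec 2000
trend <- seq(100, 400, length.out = months)
season <- 1 + 0.1 * sin(2 * pi * seq_len(months) / 12)
revenue <- ts(trend * season * exp(rnorm(months, sd = 0.03)),
              start = c(1979, 1), frequency = 12)

# Box-Cox transformation; lambda = 0 is the natural logarithm.
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
inv_box_cox <- function(w, lambda) {
  if (lambda == 0) exp(w) else (lambda * w + 1)^(1 / lambda)
}

lambda <- 0                                    # assumed value for illustration
w <- box_cox(revenue, lambda)

# Hold out the last 2 years (1999-2000) as the test set.
train <- window(w, end = c(1998, 12))
test <- as.numeric(window(revenue, start = c(1999, 1)))
h <- length(test)                              # forecast horizon in months

# Seasonal naive: repeat the last observed year of the training set.
last_year <- as.numeric(tail(train, 12))
snaive_fc <- inv_box_cox(rep(last_year, length.out = h), lambda)

# Naive: carry the last training observation forward.
naive_fc <- inv_box_cox(rep(as.numeric(tail(train, 1)), h), lambda)

# Compare both methods on the original scale of the data.
rmse <- function(actual, forecast) sqrt(mean((actual - forecast)^2))
accuracy <- c(snaive = rmse(test, snaive_fc), naive = rmse(test, naive_fc))
accuracy
```

Forecasts are back-transformed before the errors are computed, so the accuracy measures are comparable across methods whether or not a transformation is used.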


Appendix

Figure 1: Time plots for average weekly cardiovascular mortality, temperature, and particulate pollution level.

[Scatter plot matrix not reproduced. Variable labels recovered from the panels: Mortality, Temperature. Correlations shown in the panels: -0.439***, 0.444***, -0.017.]

Figure 2: A scatter plot matrix of mortality and the two predictor variables.
