
QBUS2820 Assignment 2: Forecasting Google Trends keywords

Overview

The assignment consists of forecasting the number of searches for several search topics as reported by Google via its platform Google Trends.

You will create and validate a methodology that considers the forecasting models seen in the lectures, computes point and probabilistic forecasts, and estimates the errors of the models.

Context and Data

Each time series measures the number of searches for a specific keyword over a given time period. An example is the time series of Google searches for the keyword “Business Analytics” over the last 5 years, measured every week (the sum of searches in that week). Google Trends reports search volume on a relative scale, so each time series is normalized to the range 0 to 100, with 100 marking the time point with the maximum number of searches (this means total volumes cannot be compared across series in this dataset). The dataset contains time series measured at different frequencies: every hour, every day, or every week.

This dataset features some general ‘topics of interest’ and some ‘pop culture’/‘influencer’ keywords. A potential use would be to identify whether a particular keyword is ‘on the rise’ or ‘declining’, how ‘stable’ it is, etc., for example to decide whether to invest in a long advertising campaign with an influencer, or in real estate in a particular city or district. Realistically, however, Google Trends is not a very precise indicator of such things, so this is purely for academic purposes.

The dataset comes in CSV format, with columns:

•    date: Date in datetime format.

•    hits: search volume, normalized to 0 – 100.

•    keyword: The search term that the time series is measuring.

From this data, you will create several time series, one for each keyword, and then forecast them.
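As a minimal sketch of this step, the snippet below reads a CSV with the three columns described above and builds one date-indexed series per keyword. The sample rows and the helper name `split_by_keyword` are hypothetical and only illustrate the format; the real file and its keywords will differ.

```python
import io

import pandas as pd

# Hypothetical sample rows in the described format (date, hits, keyword).
# These are illustrative values, not data from the assignment.
sample_csv = io.StringIO(
    "date,hits,keyword\n"
    "2021-01-17 00:00:00,100,alpha\n"
    "2021-01-03 00:00:00,40,alpha\n"
    "2021-01-10 00:00:00,55,alpha\n"
    "2021-01-01 00:00:00,10,beta\n"
    "2021-01-01 01:00:00,20,beta\n"
)


def split_by_keyword(csv_source):
    """Return a dict mapping each keyword to a date-indexed Series of hits."""
    df = pd.read_csv(csv_source, parse_dates=["date"])
    return {
        keyword: group.set_index("date")["hits"].sort_index()
        for keyword, group in df.groupby("keyword")
    }


series_by_keyword = split_by_keyword(sample_csv)
for keyword, series in series_by_keyword.items():
    print(keyword, len(series), series.index.min(), series.index.max())
```

Sorting each series by date keeps the later train/test split chronological even if the rows in the file are not ordered.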

Remember that there are several frequencies in this dataset. Part of the exercise is to find the potential seasonality (if any) of each time series: for example, some time series might have period 24, others (maybe) 7, etc.
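One possible way to approach this, sketched below, is to infer the sampling step from the median gap between timestamps, map it to a candidate period, and then check the autocorrelation at that lag. The mapping in `CANDIDATE_PERIODS` is an assumption made for illustration (in particular, 52 for weekly data is not stated in this document), and a candidate period should still be confirmed against the data.

```python
import numpy as np
import pandas as pd

# Assumed mapping from sampling step to a candidate seasonal period.
# 24 and 7 follow the examples in the text; 52 for weekly data is a guess.
CANDIDATE_PERIODS = {
    pd.Timedelta(hours=1): 24,
    pd.Timedelta(days=1): 7,
    pd.Timedelta(weeks=1): 52,
}


def candidate_period(index):
    """Guess a seasonal period from the median gap between timestamps."""
    step = index.to_series().diff().median()
    return CANDIDATE_PERIODS.get(step)  # None if the step is not recognized


def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at the given lag."""
    x = np.asarray(values, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))


# Synthetic hourly series with a daily cycle, used only to exercise the helpers.
hourly_index = pd.date_range("2021-01-01", periods=24 * 14, freq=pd.Timedelta(hours=1))
hourly_values = 50 + 30 * np.sin(2 * np.pi * np.arange(len(hourly_index)) / 24)

period = candidate_period(hourly_index)
print(period, autocorrelation(hourly_values, period))
```

A value close to 1 at the candidate lag suggests the series really has that period, while a value near 0 suggests a non-seasonal model may be more appropriate.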

The forecast problem

The forecasts that have to be computed and reported are:

•    Point forecasts: You will forecast the number of hits for the last 10% of the observations of each series (rounding up). This means that the last 10% of each series will be considered the ‘test set’, which cannot be used for modelling.

•    Probabilistic forecasts: For each time series, forecast the 75% quantile and the 25% quantile across the test set.

•    Expected performance: An estimate of the mean absolute error of the predictions for each time series. This is an estimate of what the prediction error on the test set will be, based on the training data only, before looking at the test set.

•    Actual performance: The prediction error calculated on the actual test set.
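The four items above can be sketched for a single series as follows. This is an illustration under stated assumptions, not the required method: it uses a seasonal naïve forecast as a stand-in model, estimates the expected error on a validation block carved out of the training data, and builds the quantile forecasts by shifting the point forecast by the empirical quantiles of the validation errors. All function names and the synthetic data are hypothetical.

```python
import numpy as np


def split_train_test(values):
    """Hold out the last 10% of observations (rounded up) as the test set."""
    n_test = (len(values) + 9) // 10  # integer ceiling of len(values) / 10
    return values[:-n_test], values[-n_test:]


def seasonal_naive(train, horizon, period):
    """Repeat the last observed cycle of length `period` over the horizon."""
    repeats = -(-horizon // period)  # ceiling division
    return np.tile(train[-period:], repeats)[:horizon]


def forecast_report(values, period):
    values = np.asarray(values, dtype=float)
    train, test = split_train_test(values)
    horizon = len(test)

    # Validation block: the last `horizon` training points, forecast from the rest.
    fit, valid = train[:-horizon], train[-horizon:]
    valid_errors = valid - seasonal_naive(fit, horizon, period)
    expected_mae = float(np.mean(np.abs(valid_errors)))

    point = seasonal_naive(train, horizon, period)
    # Assumption: test-set errors behave like the validation errors.
    # Hits are normalized to 0-100, so the quantiles are clipped to that range.
    q25 = np.clip(point + np.quantile(valid_errors, 0.25), 0, 100)
    q75 = np.clip(point + np.quantile(valid_errors, 0.75), 0, 100)

    # The test set is only touched here, after all forecasts are fixed.
    actual_mae = float(np.mean(np.abs(test - point)))
    return {"point": point, "q25": q25, "q75": q75,
            "expected_mae": expected_mae, "actual_mae": actual_mae}


# Synthetic daily series with a weekly pattern plus noise (illustrative only).
rng = np.random.default_rng(0)
weekly_pattern = np.array([30, 35, 40, 45, 50, 80, 90], dtype=float)
demo_values = np.clip(np.tile(weekly_pattern, 15) + rng.normal(0, 3, size=105), 0, 100)

report = forecast_report(demo_values, period=7)
print(report["expected_mae"], report["actual_mae"])
```

The test size is computed with integer arithmetic because `math.ceil(0.1 * n)` can round up incorrectly when `0.1 * n` is not exactly representable as a float.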

There are several time series in the dataset, and you do not need to forecast each one ‘manually’ by trying several models and eyeballing the best one. You can of course use whatever exploratory process you like to get a grasp of the data, but the objective is to create a completely automatic methodology that, given a time series, compares several models (seasonal naïve, exponential smoothing family, ARIMA family, etc.), chooses the best one according to a metric, and computes the forecasts. The methodology is then applied to each time series. If it helps, imagine that someone will later apply the methodology to new, unseen time series in the same format, from a similar CSV file (though it only needs to work for the data in the assignment).
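A minimal sketch of such an automatic procedure is shown below. It assumes every candidate model is wrapped in a function with the same hypothetical signature `(train, horizon, period)` and picks the one with the lowest mean absolute error on a validation block taken from the end of the training data. The three candidates here are simple hand-written stand-ins (with a fixed, arbitrary smoothing parameter); models from the exponential smoothing or ARIMA families would need to be wrapped in the same interface.

```python
import numpy as np


def naive(train, horizon, period):
    """Repeat the last observed value."""
    return np.full(horizon, float(train[-1]))


def seasonal_naive(train, horizon, period):
    """Repeat the last observed cycle of length `period`."""
    repeats = -(-horizon // period)  # ceiling division
    return np.tile(train[-period:], repeats)[:horizon].astype(float)


def simple_exp_smoothing(train, horizon, period, alpha=0.3):
    """Flat forecast at the smoothed level; alpha=0.3 is an arbitrary choice."""
    level = float(train[0])
    for y in train[1:]:
        level = alpha * float(y) + (1 - alpha) * level
    return np.full(horizon, level)


CANDIDATES = {
    "naive": naive,
    "seasonal_naive": seasonal_naive,
    "simple_exp_smoothing": simple_exp_smoothing,
}


def select_and_forecast(train, horizon, period):
    """Score each candidate on a validation block, then refit the winner."""
    train = np.asarray(train, dtype=float)
    fit, valid = train[:-horizon], train[-horizon:]
    scores = {
        name: float(np.mean(np.abs(valid - model(fit, horizon, period))))
        for name, model in CANDIDATES.items()
    }
    best = min(scores, key=scores.get)
    return best, scores, CANDIDATES[best](train, horizon, period)


# Synthetic, perfectly periodic series (illustrative only).
demo_values = np.tile(np.array([10, 20, 30, 40, 50, 90, 100], dtype=float), 12)
n_test = (len(demo_values) + 9) // 10  # last 10%, rounded up
demo_train = demo_values[:-n_test]

best_name, validation_scores, forecast = select_and_forecast(demo_train, n_test, period=7)
print(best_name, validation_scores)
```

Because the selection uses only the training data, the validation score of the chosen model can also serve as the expected-performance estimate, and the same function can be looped over every keyword.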

What you need to submit

•    A notebook (.ipynb file) that runs the methodology, creates the forecasts, and documents the decisions and results along the way.

o Divide it into sections and document clearly what you are doing, using markdown cells before the code of each section. You can use one (or more) cells for the methodological discussion; this should be kept separate from the cells that clarify the technical (programming) parts. Unexplained or incomplete sections of the notebook might lead to strong penalties for those sections (that is, do not just have ‘code’).

o The filename of the notebook should be ‘STUDENTID_ASG2_nbook_QBUS2820.ipynb’.

o No PDF document is needed. Make sure that the notebook is as clear as possible; you can find many examples online that interweave code and analysis.

•    The notebook must have a Results section at the end where you will report the forecast items (the four points in the forecast problem section of this document), using plots for the point and probabilistic forecasts and text output for the performance. You will also report which model was used for the point forecast of each series, to give a rough idea of the dynamics of the series (is it seasonal? does it have a trend? etc.).

Marking

Percent of the total grade that is dedicated to each part of the assignment.

•    Visual aspect of the notebook (15%): proper sectioning, plots, text and code cell structure, etc.

•    Methodology: Point forecast (45%), Probabilistic forecast (25%), Error estimation (15%).

•    Other errors might penalize the final grade, e.g. the notebook does not run, the submission format is not correct, etc.