MSIN0025 Data Analytics II

2022/23

Section B: Assessment Brief and Requirements

For this final assignment, you will need to identify an important business problem, find one or more relevant datasets, generate insightful visualisations of the data, fit a range of models to the data to produce your best predictions/forecasts, and make and justify recommendations to a decision maker related to this problem. A key goal for this final individual assignment is to demonstrate a wide range of the concepts covered in the module.

This assignment is worth 50% of the overall module assessment.

Report Structure

Section 1: The Problem (10%)

• Discuss the problem you are addressing.

• What are the questions and business/management decisions your analysis is trying to address?

• Describe your problem’s decision maker and explain what is important for them to learn from your data analysis.

• Discuss the source of your data. Questions to consider include:

- Where did you find this data?

- How reliable or uncertain is this data?

- How old is the data?

- Is the data recorded at given dates or times?

• Discuss and justify whether your problem relates to a regression analysis or a classification analysis.

• Identify and justify your choice of target attribute(s) and explain how this/these should be derived, if not already available.

Section 2: Understand the Data (30%)

• Discuss the nature and size of the dataset(s) you are using.

• Discuss the data attributes that are relevant to your problem. Exactly what does the data represent and, if relevant, how was it derived? How is it distributed? What type of data is it?

• Explore and discuss whether any of the data attributes you have focused on are closely correlated with other attributes - either positively or negatively.

• Include at least 3 Tableau-generated visualisations (e.g., map, scatter plot, bar chart, pie chart, box-and-whisker plot) that give different insights to support your discussions.

• Include at least 3 R-generated plots or aggregation tables that give different insights to support your discussions.

• Include the R-code you used in the appendix of your report.
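
As an illustration only, the correlation check and the R-generated aggregation table and plots described above could be sketched as follows. This is a minimal sketch in base R, assuming the built-in mtcars dataset as a stand-in for your own data; the chosen columns and the object names (num_cols, cor_matrix, mpg_by_cyl) are hypothetical and not part of the brief.

```r
# Illustrative only: mtcars is a built-in R dataset standing in for your own data.
data(mtcars)

# Pairwise Pearson correlations between a few numeric attributes.
num_cols <- c("mpg", "hp", "wt", "qsec")
cor_matrix <- round(cor(mtcars[, num_cols]), 2)
print(cor_matrix)

# Aggregation table: mean fuel economy (mpg) for each cylinder count.
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mpg_by_cyl)

# Scatter plot of two attributes to inspect their relationship visually.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Fuel economy against weight")

# Box-and-whisker plot of the same target, grouped by a categorical attribute.
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of cylinders", ylab = "Miles per gallon",
        main = "Fuel economy by cylinder count")
```

Each output should be followed by a short discussion of what it tells the decision maker; the code alone does not demonstrate insight.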

Section 3: Prepare the Data (10%)

• If required, explain how you have derived your chosen target attribute(s) in Tableau or in R.

• Discuss and justify what other steps you may have taken to prepare your data, including, where relevant: removing attributes from consideration, adding further "derived" attributes (e.g., Dates), imputing "reasonable" values for missing data, transforming attributes, and standardising data values.

• Prepare suitable separate "Training" and "Testing" datasets.

• Include any R-code you used to prepare your data in the appendix of your report.
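
The preparation steps listed above (deriving a date attribute, splitting into training and testing sets, imputing missing values, and standardising) could be sketched as follows. This is a minimal sketch in base R, assuming the built-in airquality dataset with Ozone as a hypothetical target; the 70/30 split, the seed, and the median imputation are illustrative assumptions rather than requirements of the brief.

```r
# Illustrative only: airquality is a built-in R dataset standing in for your own data.
aq <- airquality

# Rows with a missing target (Ozone) are dropped rather than imputed.
aq <- aq[!is.na(aq$Ozone), ]

# Derived attribute: a proper Date built from the Month and Day columns
# (the airquality measurements were taken in 1973).
aq$Date <- as.Date(ISOdate(1973, aq$Month, aq$Day))

# Reproducible 70/30 split into separate training and testing datasets.
set.seed(42)
n <- nrow(aq)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- aq[train_idx, ]
test <- aq[-train_idx, ]

# Impute missing Solar.R values with the TRAINING median, so that no
# information from the testing set leaks into data preparation.
solar_median <- median(train$Solar.R, na.rm = TRUE)
train$Solar.R[is.na(train$Solar.R)] <- solar_median
test$Solar.R[is.na(test$Solar.R)] <- solar_median

# Standardise the predictors using the training means and standard deviations.
predictors <- c("Solar.R", "Wind", "Temp")
train_means <- colMeans(train[, predictors])
train_sds <- apply(train[, predictors], 2, sd)
train[, predictors] <- scale(train[, predictors], center = train_means, scale = train_sds)
test[, predictors] <- scale(test[, predictors], center = train_means, scale = train_sds)
```

Computing the imputation value and the scaling parameters on the training set only is one defensible design choice; whichever approach you take, justify it in your report.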

Section 4: Generate and Test Prediction Models (40%)

• Select and justify at least 3 different prediction models (with at least 1 ensemble model) that are likely to best help with your stated problem objectives.

• Configure your models (e.g., select attributes and/or tune other model parameters) in the way you expect will best deliver relevant insights and/or provide the lowest error rates, justifying your decisions.

• Run these models, discussing the model outputs and drawing, where possible, insights related to your problem.

• Select proper evaluation metrics to measure the accuracy of your models. Determine and comment on the best model across your 3 prediction models.

• Discuss what steps you may have taken to improve your individual models.

• Include any R-code you used in the appendix of your report.
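
Comparing three models on a held-out testing set, as described above, could be sketched as follows. This is a minimal sketch in base R for a regression problem, assuming the built-in mtcars dataset, two linear models, and a hand-rolled bagged linear model as a simple ensemble; the helper names (rmse, mae, bagged_predict) and all settings are hypothetical. It is not a statement of which models or metrics the module expects, and a classification problem would need different metrics (for example, accuracy from a confusion matrix built with table()).

```r
# Illustrative only: mtcars stands in for your own prepared data.
set.seed(123)
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

# Evaluation metrics for regression: root mean squared error and mean absolute error.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Models 1 and 2: linear regressions with different attribute selections.
fit_simple <- lm(mpg ~ wt, data = train)
fit_multi <- lm(mpg ~ wt + hp, data = train)

# Model 3 (ensemble): bagging, i.e. fit the same model on bootstrap resamples
# of the training data and average the resulting predictions.
# Assumes newdata has more than one row.
bagged_predict <- function(formula, train, newdata, n_boot = 200) {
  preds <- replicate(n_boot, {
    boot <- train[sample(nrow(train), replace = TRUE), ]
    predict(lm(formula, data = boot), newdata = newdata)
  })
  rowMeans(preds)
}

predictions <- list(
  simple_lm = predict(fit_simple, newdata = test),
  multi_lm = predict(fit_multi, newdata = test),
  bagged_lm = bagged_predict(mpg ~ wt + hp, train, test)
)

# Test-set metrics for each model, sorted so the lowest RMSE comes first.
results <- data.frame(
  model = names(predictions),
  RMSE = sapply(predictions, rmse, actual = test$mpg),
  MAE = sapply(predictions, mae, actual = test$mpg),
  row.names = NULL
)
results <- results[order(results$RMSE), ]
print(results)
```

A table like this supports, but does not replace, your commentary on which model is best for the decision maker and why.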

Section 5: Problem Conclusions and Recommendations (10%)

• Combining the results from your various analysis steps, draw conclusions about the particular problem and questions stated at the beginning.

• What recommendations would you now make to your problem’s decision maker and why? E.g.,

- Which are the most important variables/features for the decision maker to look at?

- What benefits would the decision maker gain by implementing your prediction model?

Marking Criteria

Marks will be awarded for:

• Using Tableau and R in a way that is relevant and appropriately justified, and that is ideally different from that presented in the lectures and other module materials.

• Discussing meaningful insights after each analysis task.

• Ensuring your analysis flows, with each step building on the last.

• Structuring your report and analysis so as to follow the standard stages of a data science project.

• The correctness, reproducibility, and quality of your code, visualisations and conclusions.

• Employing a wide range of the concepts and methods covered in this module.

• Problem identification: finding a novel and significant problem.

• Proposing a compelling solution/recommendation: generating important business or policy insights.

• Writing a clear and compelling report.

Submission Requirement

You are required to submit 3 files for this assignment:

1. A PDF file containing your fully completed report, including an appendix containing all your R-based analysis.

2. A runnable R script file (.R file) that contains all your R-based analysis.

3. The data file that you used for your analysis, if it is not too large to upload to Moodle. If it is too large, please include a link in the appendix of your PDF report, either to the original dataset (if it is freely available online) or to the cloud storage (e.g., Dropbox, GitHub) where you have stored the dataset.

Only the PDF report (file 1) will be marked. The code file and data file are used solely to verify that your code works as you have claimed.

Section C: Module Learning Outcomes covered in this Assessment

This assessment contributes towards the achievement of the following stated module Learning Outcomes as highlighted below:

• During the module, students will work with example data sets to experience and understand the stages of the data science process: they will visualise data, propose models that might fit the data, choose a best-fit model, use that model to make predictions, and test those predictions against new realisations.

• The module builds on ideas and tools introduced in MSIN0010 Data Analytics I and MSIN0023 Computational Thinking, including R and Tableau, statistical software used by the world’s leading data scientists.