
CS989: Big Data Fundamentals

RESIT – COURSEWORK

DEADLINE

DUE:  12:00 noon, Wednesday July 12th, 2023

AIM OF THE ASSIGNMENT

To provide a deeper understanding of appropriate methodological approaches to processing and analysing noisy data, and to encourage appreciation of the challenges involved in data analysis.

LEARNING OUTCOMES

•    Understanding of the fundamentals of Python to enable the use of various big data technologies

•    Understanding of how classical statistical techniques are applied in modern data analysis

•    Understanding of the potential application of data analysis tools to various problems, and appreciation of their limitations

•    Understanding of the challenges and complexity of data analysis

THE BRIEF

Provide a brief report on the analysis of an open dataset. There are some restrictions on the dataset that can be selected (see “DATASET RULES” below). You can focus your report on one aspect of the dataset or on multiple aspects; the main objective is to find interesting questions or problems to answer.

The following criteria will be used when marking your assignment:

•    Identification and description of key challenge(s) or problem(s) to be addressed   10%

•    Introduction to the dataset   10%

•   The challenge(s)/problem(s) is (are) to be addressed using the following   20%

o Summary statistics (including figures) for data being analysed   20%

o Description, rationale, application and findings from only one unsupervised analysis method covered in the module   20%

o Description, rationale, application and findings from only one supervised analysis method covered in the module   20%

•    Reflection on methods used for analysis   10%

•    Structure, presentation, and proper citation of references   10%

SUBMISSION

The report should be 2500 words (+/- 10%), excluding the front cover, table of contents, list of figures/tables, references and appendices. The document must be in PDF format. All code used for the analysis must also be submitted; if it is not, the submission will be considered incomplete and the resit will receive a mark of zero. More details will be available on the submission page on MyPlace. Both the code and the report must be submitted using MyPlace; no submission will be accepted by any other route. Assessments submitted after the deadline will receive a mark of zero.
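As a rough illustration of the word limit, the sketch below computes the range implied by 2500 words +/- 10% and checks a piece of text against it. The function names are hypothetical, and counting whitespace-separated tokens is an assumption: a word processor may count slightly differently, and the excluded parts (front cover, table of contents, lists of figures/tables, references, appendices) must be removed from the text before counting.

```python
TARGET_WORDS = 2500   # target length stated in the brief
TOLERANCE_PCT = 10    # +/- 10%, as stated in the brief


def word_limits(target=TARGET_WORDS, tolerance_pct=TOLERANCE_PCT):
    """Return the (lower, upper) word limits implied by the tolerance."""
    lower = target * (100 - tolerance_pct) // 100
    upper = target * (100 + tolerance_pct) // 100
    return lower, upper


def count_words(text):
    """Count whitespace-separated tokens (an assumed counting rule)."""
    return len(text.split())


def within_limits(text):
    """Check whether the body text falls inside the permitted range."""
    lower, upper = word_limits()
    return lower <= count_words(text) <= upper


if __name__ == "__main__":
    lower, upper = word_limits()
    print(f"Permitted range: {lower} to {upper} words")
```

Under these assumptions the permitted range works out to 2250 to 2750 words.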

DATASET RULES

Example datasets are available on:

➢ The UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php

➢ Kaggle website: https://www.kaggle.com/datasets

You can also select a dataset from other sources, but make sure that the dataset is public and that you have the right to access and analyse the dataset and to share the results.

However, you cannot select a dataset that:

➢ comes packaged with Scikit-Learn

❖ Boston house-prices dataset

❖ Iris dataset

❖ Diabetes dataset

❖ Digits dataset

❖ Linnerud dataset

❖ Wine dataset

❖ Breast cancer Wisconsin dataset

For more information: https://scikit-learn.org/stable/datasets/index.html

➢ comes packaged with Seaborn

❖ anscombe.csv: Anscombe dataset

❖ attention.csv: Attention dataset

❖ brain_networks.csv: Brain networks dataset

❖ car_crashes.csv: 538 car crash dataset

❖ diamonds.csv: Diamonds dataset

❖ dots.csv: Dots dataset

❖ exercise.csv: Exercise dataset

❖ flights.csv: Flights dataset

❖ fmri.csv: fMRI dataset

❖ gammas.csv: Gammas dataset

❖ iris.csv: Iris dataset

❖ mpg.csv: MPG dataset

❖ planets.csv: Planets dataset

❖ tips.csv: Tips dataset

❖ titanic.csv: Titanic dataset

For more information: https://github.com/mwaskom/seaborn-data

➢ we have worked on during lab sessions

Please note that submitted projects on one of the above datasets will receive a mark of zero.
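The rules above can be sketched as a simple name lookup. This is a minimal sketch only: the function names and the normalisation rule are hypothetical, the lists are copied from this brief, and an exact-name match will not detect a renamed copy of a prohibited dataset. Datasets used during lab sessions are not listed in this brief, so they have to be checked separately.

```python
# Prohibited dataset names as listed in this brief, grouped by source.
PROHIBITED = {
    "Scikit-Learn": [
        "Boston house-prices", "Iris", "Diabetes", "Digits",
        "Linnerud", "Wine", "Breast cancer wisconsin",
    ],
    "Seaborn": [
        "anscombe", "attention", "brain_networks", "car_crashes",
        "diamonds", "dots", "exercise", "flights", "fmri", "gammas",
        "iris", "mpg", "planets", "tips", "titanic",
    ],
}


def normalise(name):
    """Lower-case a name and drop '.csv', ' dataset', '_' and '-' (assumed rule)."""
    key = name.strip().lower()
    if key.endswith(".csv"):
        key = key[: -len(".csv")]
    if key.endswith(" dataset"):
        key = key[: -len(" dataset")]
    return key.replace("_", " ").replace("-", " ").strip()


def prohibited_sources(name):
    """Return the sources whose prohibited list contains this dataset name."""
    key = normalise(name)
    return [
        source
        for source, names in PROHIBITED.items()
        if key in {normalise(n) for n in names}
    ]


if __name__ == "__main__":
    for candidate in ["Iris dataset", "titanic.csv", "my_city_air_quality.csv"]:
        hits = prohibited_sources(candidate)
        status = "prohibited by " + ", ".join(hits) if hits else "not on the listed names"
        print(f"{candidate}: {status}")
```

An empty result only means the name does not appear in the lists above; it does not confirm that the dataset is public or that you have the right to analyse it and share the results.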