CS989: Big Data Fundamentals
RESIT – COURSEWORK
DEADLINE
DUE: 12:00 noon, Wednesday July 12th, 2023
AIM OF THE ASSIGNMENT
To provide a deeper understanding of appropriate methodological approaches to processing and analysing noisy data, and to encourage appreciation of the challenges involved in data analysis.
LEARNING OUTCOMES
• Understanding of the fundamentals of Python to enable the use of various big data technologies
• Understanding of how classical statistical techniques are applied in modern data analysis
• Understanding of the potential application of data analysis tools to various problems, and appreciation of their limitations
• Understanding of the challenges and complexity of data analysis
THE BRIEF
Provide a brief report on the analysis of an open dataset. There are some restrictions on the dataset that can be selected (see “DATASET RULES” below). You can focus your report on one aspect of the dataset or on multiple aspects; the main objective is to find some interesting questions or problems to answer.
The following criteria will be used when marking your assignment:
• Identification and description of key challenge(s) or problem(s) to be addressed 10%
• Introduction to the dataset 10%
• The challenge(s)/problem(s) is (are) to be addressed using the following:
o Summary statistics (including figures) for data being analysed 20%
o Description, rationale, application and findings from only one unsupervised analysis method covered in the module 20%
o Description, rationale, application and findings from only one supervised analysis method covered in the module 20%
• Reflection on methods used for analysis 10%
• Structure, presentation, and proper citation of references 10%
SUBMISSION
The report to be submitted should be 2500 words (+/- 10%), excluding the front cover, table of contents, list of figures/tables, references and appendices. The document must be in PDF format. All code used for the analysis must also be submitted; if it is not, the submission will be considered incomplete and the resit will receive a mark of zero. More details will be available on the submission page on MyPlace. Both the code and the report should be submitted using MyPlace; submissions made in any other way will not be accepted. Assessments submitted after the deadline will receive a mark of zero.
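As a quick sanity check on the word limit, the 2500 words (+/- 10%) allowance can be worked out as below. This is only an illustrative sketch: the function names are hypothetical, and treating the limits as inclusive is an assumption that the brief does not state.

```python
from typing import Tuple


def word_count_bounds(target: int = 2500, tolerance: float = 0.10) -> Tuple[int, int]:
    """Return the (lower, upper) word-count limits for a target and tolerance."""
    margin = round(target * tolerance)  # 10% of 2500 is 250 words
    return target - margin, target + margin


def within_limit(word_count: int, target: int = 2500, tolerance: float = 0.10) -> bool:
    """Check a word count against the limits (assumed inclusive)."""
    lower, upper = word_count_bounds(target, tolerance)
    return lower <= word_count <= upper


if __name__ == "__main__":
    print(word_count_bounds())  # (2250, 2750)
    print(within_limit(2600))   # True
```

The count passed in should exclude the front cover, table of contents, list of figures/tables, references and appendices, as stated above.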
DATASET RULES
Example datasets are available on:
➢ The UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php
➢ Kaggle website: https://www.kaggle.com/datasets
You can also select a dataset from other sources, but make sure that the dataset is public and that you have the right to access and analyse the dataset and to share the results.
However, you cannot select a dataset that:
➢ Comes packaged with Scikit-Learn:
❖ Boston house-prices dataset
❖ Iris dataset
❖ Diabetes dataset
❖ Digits dataset
❖ Linnerud dataset
❖ Wine dataset
❖ Breast cancer wisconsin dataset
For more information: https://scikit-learn.org/stable/datasets/index.html
➢ Comes packaged with Seaborn:
❖ anscombe.csv: Anscombe dataset
❖ attention.csv: Attention dataset
❖ brain_networks.csv: Brain networks dataset
❖ car_crashes.csv: 538 car crash dataset
❖ diamonds.csv: Diamonds dataset
❖ dots.csv: Dots dataset
❖ exercise.csv: Exercise dataset
❖ flights.csv: Flights dataset
❖ fmri.csv: fMRI dataset
❖ gammas.csv: Gammas dataset (fake fMRI data)
❖ iris.csv: Iris dataset
❖ mpg.csv: MPG dataset
❖ planets.csv: Planets dataset
❖ tips.csv: Tips dataset
❖ titanic.csv: Titanic dataset
For more information: https://github.com/mwaskom/seaborn-data
➢ We have worked on during Lab sessions
Please note that submitted projects on one of the above datasets will receive a mark of zero.
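Before committing to a dataset, a candidate name can be screened against the excluded lists above. The following is a minimal sketch under stated assumptions: the name sets are copied by hand from this brief, the helper names are hypothetical, and a name match is only a heuristic. It cannot detect an excluded dataset republished under a different name, and it does not cover the datasets used in Lab sessions, which are not listed here.

```python
# Names transcribed from the Scikit-Learn list in this brief (lower-cased).
SKLEARN_BUNDLED = {
    "boston house-prices", "iris", "diabetes", "digits",
    "linnerud", "wine", "breast cancer wisconsin",
}

# File stems transcribed from the Seaborn list in this brief.
SEABORN_BUNDLED = {
    "anscombe", "attention", "brain_networks", "car_crashes", "diamonds",
    "dots", "exercise", "flights", "fmri", "gammas", "iris", "mpg",
    "planets", "tips", "titanic",
}


def normalise(name: str) -> str:
    """Lower-case a dataset name and drop a trailing '.csv' or ' dataset'."""
    cleaned = name.strip().lower()
    for suffix in (".csv", " dataset"):
        if cleaned.endswith(suffix):
            cleaned = cleaned[: -len(suffix)]
    return cleaned.strip()


def is_excluded(name: str) -> bool:
    """Return True if the name matches one of the listed excluded datasets."""
    return normalise(name) in (SKLEARN_BUNDLED | SEABORN_BUNDLED)


if __name__ == "__main__":
    for candidate in ("Iris dataset", "titanic.csv", "Adult"):
        print(candidate, is_excluded(candidate))
```

A result of False does not guarantee that a dataset is acceptable; the rules above remain the authority.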
2023-07-10