COMP4030 Data Modelling and Analysis
Coursework 2022 CW2 Brief
Assessment Name | Coursework 2 – Data Analysis Study
Weight | 75%
Description and Deliverable(s) | This assignment requires you to work in a pair. You will need to analyse a data set using all the data science steps you have learnt to create and compare classification models. You will write your work up as a joint academic paper with a coursework partner, comparing and analysing your results at every stage of the data analysis and modelling pathway (6 to 8 pages including references and diagrams), as stated in this coursework specification. The paper should be submitted as a PDF, using the IEEE template for formatting. The code should be submitted as an R script.
Release Date | Tuesday 1st March 2022
Submission Date | Monday 9th May 2022 by 3pm
Late Policy (University of Nottingham default will apply, if blank) | Work submitted after the deadline will be subject to a penalty of 5 marks (the standard 5% absolute) out of the total 100 marks for each late working day. The late submission deadline is Friday 13th May 2022. Submissions after this date will only be accepted through the extenuating circumstances process.
Feedback Mechanism and Date | Written feedback in Moodle on the 6th of June 2022
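The late policy above comes down to simple arithmetic. As a minimal sketch (written in Python purely for illustration; it is not part of the required R submission), the penalised mark could be computed as follows. The function name `penalised_mark` is hypothetical, and three rules are assumptions rather than statements of University policy: any part of a 24-hour period after the deadline counts as a full late day, the late window runs to the end of Friday 13th May, and the mark cannot drop below zero. No weekend handling is included because the late window, Monday to Friday, contains no weekend.

```python
import math
from datetime import date, datetime

DEADLINE = datetime(2022, 5, 9, 15, 0)  # Monday 9th May 2022, 3pm (from the brief)
LATE_CUTOFF = date(2022, 5, 13)         # Friday 13th May 2022 (end of day is an assumption)
PENALTY_PER_DAY = 5                     # marks out of 100 per late working day


def penalised_mark(raw_mark, submitted):
    """Return the mark after the late penalty, or None if past the late cutoff.

    Assumption: each started 24-hour period after the deadline is one late day.
    """
    if submitted <= DEADLINE:
        return raw_mark  # on time, no penalty
    if submitted.date() > LATE_CUTOFF:
        return None  # only accepted via the extenuating circumstances process
    late_seconds = (submitted - DEADLINE).total_seconds()
    late_days = math.ceil(late_seconds / 86400)
    return max(0, raw_mark - PENALTY_PER_DAY * late_days)  # floor at 0 is an assumption


# One hour late on the deadline day: one late day, so 68 becomes 63.
print(penalised_mark(68, datetime(2022, 5, 9, 16, 0)))
```

Under these assumptions, a paper that would have earned 68 and is submitted exactly 48 hours late receives 58.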
Instructions
For this coursework assignment you are required to work in pairs to analyse a data set (select one of the three provided or find one of your own choice) using all the data science steps you have learnt to create and compare classification models.
You will write your work up as a joint academic paper with your coursework partner, comparing and analysing your results at every stage of the data analysis and modelling pathway.
You will need to present your paper in IEEE format, using a template from here:
https://www.ieee.org/conferences/publishing/templates.html
Your paper should be between 6 and 8 pages (including tables, diagrams and references as appropriate) and submitted as a PDF. The tables and diagrams should add value to the writing. Diagrams are preferable to tables.
Your paper should be organised into 8 parts:
1. Title and Abstract (2.5%)
2. Introduction to the data set and research question(s) (5%)
3. Literature Review – covering a few key methods adopted by other researchers who used this or a similar dataset (5%)
4. Methodology – including a justification for your selected approaches for data analysis, pre-processing and data classification (10%)
5. Results from each of the stages – data analysis, pre-processing and classification (20%). Please note that each partner in the pair should use a different approach for each stage.
6. Discussion – comparing your results with each other (between the partners in the pair) and with results from previous research on the dataset, as noted in your literature review (25%)
7. Conclusions and recommendations for future research (10%)
8. References (2.5%)
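For the Discussion part, one simple way the two partners' classifiers could be compared on the same test labels is by accuracy and confusion counts. The following is a minimal sketch in Python, for illustration only (the coursework code itself must be an R script). The helper names `confusion_counts` and `accuracy`, and all labels and predictions shown, are made up for the example and are not results from any of the datasets.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual label."""
    if not actual:
        raise ValueError("at least one label is required")
    counts = confusion_counts(actual, predicted)
    correct = sum(n for (a, p), n in counts.items() if a == p)
    return correct / len(actual)


# Hypothetical test labels and the predictions of each partner's model.
actual = ["benign", "benign", "malignant", "malignant", "benign", "malignant"]
partner_a = ["benign", "benign", "malignant", "benign", "benign", "malignant"]
partner_b = ["benign", "malignant", "malignant", "malignant", "benign", "benign"]

for name, predicted in [("Partner A", partner_a), ("Partner B", partner_b)]:
    print(name, round(accuracy(actual, predicted), 3))
    for (a, p), n in sorted(confusion_counts(actual, predicted).items()):
        print(f"  actual={a} predicted={p}: {n}")
```

Accuracy alone can hide which class a model gets wrong, which is why the sketch prints the confusion counts alongside it.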
Code Submission
Please include all your code as an R script that can be run to generate your results, submitted as a separate file in addition to the paper (20%; each person in the pair will be marked individually on this).
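The eight paper parts and the code submission together make up the full coursework mark. As a quick check of that arithmetic, here is a minimal sketch in Python (illustrative only; the coursework code itself must be in R). The percentages are taken from this brief. The dictionary keys, the function name `coursework_mark` and the idea of combining component scores as a simple weighted sum are assumptions made for the example.

```python
# Weights (in % of the coursework mark) as listed in this brief.
PAPER_WEIGHTS = {
    "title_and_abstract": 2.5,
    "introduction": 5.0,
    "literature_review": 5.0,
    "methodology": 10.0,
    "results": 20.0,
    "discussion": 25.0,
    "conclusions": 10.0,
    "references": 2.5,
}
CODE_WEIGHT = 20.0  # R script, marked individually


def coursework_mark(paper_scores, code_score):
    """Weighted sum of component scores, each given as a fraction in [0, 1].

    Assumption: components combine linearly; the brief does not say how.
    """
    paper = sum(PAPER_WEIGHTS[part] * paper_scores[part] for part in PAPER_WEIGHTS)
    return paper + CODE_WEIGHT * code_score


paper_total = sum(PAPER_WEIGHTS.values())
print(paper_total, paper_total + CODE_WEIGHT)  # 80.0 100.0
```

The paper parts sum to 80% and the code adds the remaining 20%, so the listed weights account for the whole coursework mark.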
The ultimate aim of this coursework is to give you first-hand experience of working with a relatively large, real data set, from the first stages of data description and exploratory data analysis to the later stages of knowledge extraction and classification/prediction.
Please note that you need to include a contributions section in the paper to clearly specify which person worked on what aspects of the paper.
Datasets
You can choose to work on one of the following datasets:
1. Wine Data Set
https://search.r-project.org/CRAN/refmans/HDclassif/html/wine.html
Data Set Information:
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
Format: A data frame with 178 observations on the following 14 variables:
Class | The class vector; the three different cultivars of wine are represented by the integers 1 to 3.
V1 | Alcohol
V2 | Malic acid
V3 | Ash
V4 | Alkalinity of ash
V5 | Magnesium
V6 | Total phenols
V7 | Flavanoids
V8 | Nonflavanoid phenols
V9 | Proanthocyanins
V10 | Color intensity
V11 | Hue
V12 | OD280/OD315 of diluted wines
V13 | Proline
2. Breast Cancer Wisconsin (Diagnostic) Data Set
https://search.r-project.org/CRAN/refmans/mlbench/html/BreastCancer.html
Data Set Information:
The objective is to classify each sample as benign or malignant. Samples arrive periodically as Dr. Wolberg reports his clinical cases, so the database reflects this chronological grouping of the data. The grouping information itself has been removed from the data. Each variable except for the first was converted into 11 primitive numerical attributes with values ranging from 0 through 10. There are 16 missing attribute values. See the documentation linked above for more details.
Format: A data frame with 699 observations on 11 variables: one character variable, 9 ordered or nominal variables, and 1 target class.
[,1] | Id | Sample code number
[,2] | Cl.thickness | Clump Thickness
[,3] | Cell.size | Uniformity of Cell Size
[,4] | Cell.shape | Uniformity of Cell Shape
[,5] | Marg.adhesion | Marginal Adhesion
[,6] | Epith.c.size | Single Epithelial Cell Size
[,7] | Bare.nuclei | Bare Nuclei
[,8] | Bl.cromatin | Bland Chromatin
[,9] | Normal.nucleoli | Normal Nucleoli
[,10] | Mitoses | Mitoses
[,11] | Class | Class
3. Pima Indians Diabetes Dataset
https://search.r-project.org/CRAN/refmans/hhcartr/html/pima.html
Description
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
2022-03-15