FIT5196 Assessment 3
FIT5196-S1-2021
Assessment 3
This is an individual assessment worth 30% of your total mark for FIT5196.
Due date: Please check, Assessment 3: Data Integration and Reshaping
In this assessment, you must write Python code to integrate several datasets into one single schema and find and fix possible problems in the data. The input and output of this assessment are shown below:
Table 1. The input and output of the task
Inputs | Outputs | Jupyter-Notebook & pdf
<Student_ID>.zip, Vic_suburb_boundary.zip, gtfs.zip | <Student_ID>_A3_solution.csv | <Student_ID>_ass3.ipynb, <Student_ID>_ass3.pdf
Note: Submit a single zip file containing the CSV, IPYNB, and PDF files.
The PDF file should be generated from your Jupyter notebook file (after clearing all cell outputs), and it will be used for plagiarism checks via Turnitin.
Each of you is given seven (7) datasets in various formats, and the data is about housing information in Victoria, Australia. You can find your dataset here. In this assignment, you need to perform the following tasks.
Task 1: Data Integration (60%)
In this task, you must integrate the input datasets (i.e., seven datasets including hospitals, schools, recreational activity areas, real estate files (one XML and one CSV), Vic_suburb_boundary, and gtfs) into one dataset with the following schema.
Table 2. Description of the final schema
Column | Description
Property_id | A unique id for the property.
lat | The property latitude.
lng | The property longitude.
addr_street | The property address.
suburb (21%) | The property suburb.
price | The property price.
property_type | The type of the property.
year | The year in which the property was sold.
bedrooms | Number of bedrooms.
bathrooms | Number of bathrooms.
parking_space | The number of parking spaces on the property.
School_id (5%) | The closest school to the property.
Distance_to_school (1%) | The distance from the closest school to the property.
Train_station_id (10%) | The closest train station to the property.
Distance_to_train_station (1%) | The distance from the closest train station to the property.
travel_min_to_CBD (25%) | The average travel time (in minutes) from the closest train station to Southern Cross Station on weekdays (i.e., Monday-Friday) for trips departing between 7 and 9 am. For example, if three (3) trips depart from the closest train station to Southern Cross Station on weekdays between 7 and 9 am, and they take 6, 7, and 8 minutes respectively, then the value of this column for the property should be (6+7+8)/3.
Transfer_flag (25%) | A Boolean attribute indicating whether a direct trip to Southern Cross Station is available from the closest station between 7 and 9 am on weekdays. The flag is 0 if there is a direct trip (i.e., no transfer between trains is required to get from the closest train station to Southern Cross Station) and one (1) otherwise.
Hospital_id (5%) | The closest hospital to the property.
Distance_to_hospital (1%) | The distance from the closest hospital to the property.
Recreation_centre_id (5%) | The closest recreation activity centre to the property.
Distance_to_Recreation_centre (1%) | The distance from the closest recreation activity centre to the property.
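The travel_min_to_CBD and Transfer_flag definitions in Table 2 can be illustrated on a toy list of trips. This is only a sketch: the record layout, the station names "Station A" and "Station B", and the helper names are hypothetical and are not the GTFS schema, which would have to be read from the gtfs files in the real task. Treating the 7-9 am window as inclusive at both ends, and leaving the average undefined when no direct trip exists, are also assumptions.

```python
# Hypothetical direct trips:
# (origin, destination, departure "HH:MM", arrival "HH:MM", runs_on_weekdays)
TRIPS = [
    ("Station A", "Southern Cross", "07:10", "07:16", True),
    ("Station A", "Southern Cross", "07:40", "07:47", True),
    ("Station A", "Southern Cross", "08:30", "08:38", True),
    ("Station A", "Southern Cross", "10:00", "10:05", True),   # outside the window
    ("Station B", "Flinders Street", "07:30", "07:50", True),  # not to Southern Cross
]


def to_minutes(hhmm):
    """Convert an "HH:MM" string to minutes after midnight."""
    hours, minutes = map(int, hhmm.split(":"))
    return hours * 60 + minutes


def cbd_travel_summary(station, trips, destination="Southern Cross"):
    """Return (average travel minutes, transfer flag) for one origin station.

    Only direct weekday trips departing between 07:00 and 09:00 are counted.
    """
    window_start, window_end = to_minutes("07:00"), to_minutes("09:00")
    durations = [
        to_minutes(arrival) - to_minutes(departure)
        for origin, dest, departure, arrival, weekday in trips
        if origin == station
        and dest == destination
        and weekday
        and window_start <= to_minutes(departure) <= window_end
    ]
    if not durations:
        # No direct trip in the window: flag is 1, average is left undefined here.
        return None, 1
    return sum(durations) / len(durations), 0


avg_a, flag_a = cbd_travel_summary("Station A", TRIPS)
avg_b, flag_b = cbd_travel_summary("Station B", TRIPS)
print(avg_a, flag_a)
print(avg_b, flag_b)
```

For "Station A" the three qualifying trips take 6, 7, and 8 minutes, which reproduces the (6+7+8)/3 example from the table.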
Task 2: Data Reshaping (20%)
In this task, you need to study the effect of different normalization/transformation methods (i.e., standardization, min-max normalization, log, power, and Box-Cox transformations) on the “price”, “Distance_to_school”, “travel_min_to_CBD”, and “Distance_to_Recreation_centre” attributes.
Further, observe and explain their effect, assuming we want to develop a linear model to predict the “price” using “Distance_to_school”, “travel_min_to_CBD”, and “Distance_to_Recreation_centre” attributes. The linear regression assumptions that you need to study in this task are Normality and Linearity.
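The transformations named in this task can be sketched with the standard library alone. The values in `prices` are made up for illustration, the function names are hypothetical, and the Box-Cox parameter `lam` is passed in by hand; in practice it is usually estimated from the data (for example, `scipy.stats.boxcox` can do this by maximum likelihood). Sample skewness is used here as one rough indicator of departure from Normality; it is not the only possible check.

```python
import math
import statistics


def standardize(xs):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mu, sigma = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def min_max(xs):
    """Rescale linearly to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def log_transform(xs):
    """Natural log; every value must be strictly positive."""
    return [math.log(x) for x in xs]


def power_transform(xs, p=0.5):
    """Raise each value to the power p (p=0.5 is the square root)."""
    return [x ** p for x in xs]


def box_cox(xs, lam):
    """Box-Cox with a fixed lambda; values must be strictly positive."""
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1) / lam for x in xs]


def skewness(xs):
    """Population skewness: the mean of the cubed z-scores."""
    return sum(z ** 3 for z in standardize(xs)) / len(xs)


prices = [100.0, 200.0, 400.0, 800.0, 1600.0]  # made-up, right-skewed values

print("skew raw:     ", skewness(prices))
print("skew min-max: ", skewness(min_max(prices)))
print("skew log:     ", skewness(log_transform(prices)))
print("skew sqrt:    ", skewness(power_transform(prices)))
```

Standardization and min-max normalization are linear rescalings, so they change the scale of an attribute but not the shape of its distribution, whereas the log, power, and Box-Cox transformations can change the shape.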
Task 3: Documentation (20%)
The main focus of the documentation is the quality of your explanation for Task 2. As in the previous assignments, your notebook file should be in a proper format with appropriate sections and subsections.
Notes:
1. The output CSV file must have the same columns as specified in the schema. Please note that output files which are not in the correct format, as defined in the integrated schema, won’t be marked.
2. If you decide not to calculate any of the required columns, you must still include that column in your final data frame with ‘Null’ as the value in every row. Please note that output files which are not in the correct format, as specified in the integrated schema, won’t be marked.
3. No external data is allowed to calculate the values of the integrated schema. For example, to calculate the suburb, you can only use the provided shapefiles.
4. The radius of the earth is still 6371 km!
5. In Table 2, the numbers in the format (a%) next to some of the columns are the marks allocated to that column. For example, column “suburb” carries 21% of the total output mark of Task 1.
6. For the Transfer_flag column, incorrect answers attract negative marks. For example, if 50% of your Transfer_flag values are correct and the other 50% are incorrect, then your score for that column is zero (0).
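The notes fix the Earth's radius at 6371 km but do not name a distance formula. A minimal sketch of the closest-facility columns (e.g., School_id and Distance_to_school), assuming the haversine great-circle formula is acceptable, might look like the following. The coordinates, ids, and function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # radius stated in the notes


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest(location, facilities):
    """Return (facility_id, distance_km) for the facility nearest to location.

    location is (lat, lng); facilities is an iterable of (id, lat, lng).
    """
    lat, lng = location
    return min(
        ((fid, haversine_km(lat, lng, f_lat, f_lng)) for fid, f_lat, f_lng in facilities),
        key=lambda pair: pair[1],
    )


# Hypothetical coordinates, for illustration only.
schools = [("school_1", -37.80, 144.96), ("school_2", -37.90, 145.10)]
property_location = (-37.81, 144.97)

school_id, distance_km = closest(property_location, schools)
print(school_id, round(distance_km, 3))
```

This brute-force search compares every property with every facility, which is simple to verify; a spatial index would be a possible optimisation for larger inputs.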
2021-05-30