
FIT5196-S1-2021

Assessment 3

This is an individual assessment and is worth 30% of your total mark for FIT5196.

Due date: Please check, Assessment 3: Data Integration and Reshaping

In this assessment, you must write Python code to integrate several datasets into one single schema and find and fix possible problems in the data. The input and output of this assessment are shown below:

Table 1. The input and output of the task

Inputs	Outputs	Jupyter-Notebook & pdf
<Student_ID>.zip, Vic_suburb_boundary.zip, gtfs.zip	<Student_ID>_A3_solution.csv	<Student_ID>_ass3.ipynb, <Student_ID>_ass3.pdf

Note: Submit a single zip file containing the CSV, IPYNB, and PDF files.

The PDF file should be generated from your Jupyter notebook file (after clearing all cell outputs), and it will be used for plagiarism checks via Turnitin.

Each student is given seven (7) datasets in various formats, containing housing information for Victoria, Australia. You can find your dataset here. In this assessment, you need to perform the following tasks.

Task 1: Data Integration (60%)

In this task, you must integrate the input datasets (i.e., seven datasets covering hospitals, schools, recreational activity areas, real estate files (one XML and one CSV), Vic_suburb_boundary, and gtfs) into one dataset with the following schema.

Table 2. Description of the final schema

Column	Description
Property_id	A unique id for the property
lat	The property latitude
lng	The property longitude
addr_street	The property address
suburb (21%)	The property suburb.
price	The property price
property_type	The type of the property
year	The year in which the property was sold
bedrooms	Number of bedrooms
bathrooms	Number of bathrooms
parking_space	The number of parking spaces on the property
School_id (5%)	The closest school to the property.
Distance_to_school (1%)	The distance from the closest school to the property.
Train_station_id (10%)	The closest train station to the property.
Distance_to_train_station (1%)	The distance from the closest train station to the property.
travel_min_to_CBD (25%)	The average travel time (minutes) from the closest train station to “Southern Cross Station” on weekdays (i.e., Monday-Friday) for trips departing between 7 and 9 am. For example, if three (3) trips depart from the closest train station to Southern Cross Station on weekdays between 7 and 9 am, taking 6, 7, and 8 minutes respectively, then the value of this column for the property should be (6+7+8)/3 = 7.
Transfer_flag (25%)	A Boolean attribute indicating whether a direct trip to Southern Cross Station is available from the closest station between 7 and 9 am on weekdays. The flag is 0 if there is a direct trip (i.e., no transfer between trains is required to get from the closest train station to Southern Cross Station) and 1 otherwise.
Hospital_id (5%)	The closest hospital to the property.
Distance_to_hospital (1%)	The distance from the closest hospital to the property.
Recreation_centre_id (5%)	The closest recreation activity centre to the property.
Distance_to_Recreation_centre (1%)	The distance from the closest recreation activity centre to the property.
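
The travel_min_to_CBD and Transfer_flag definitions can be illustrated with a minimal sketch. It assumes the direct trips from the closest station to Southern Cross Station have already been extracted into a list of dictionaries; the field names (day, departure, arrival), the function names, and the sample trips are hypothetical and not part of the provided data. Treating both 7:00 and 9:00 as inside the window is also an assumption, and the specification does not say how to average trips that require a transfer, so the sketch returns None in that case.

```python
WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday", "friday"}

def to_minutes(hms):
    """Convert an 'HH:MM:SS' string to minutes after midnight."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 60 + m + s / 60

def summarise_direct_trips(trips):
    """Return (average travel minutes, transfer_flag) for weekday direct trips
    departing between 07:00 and 09:00 (bounds assumed inclusive)."""
    durations = [
        to_minutes(t["arrival"]) - to_minutes(t["departure"])
        for t in trips
        if t["day"] in WEEKDAYS and 7 * 60 <= to_minutes(t["departure"]) <= 9 * 60
    ]
    if not durations:
        # No direct trip in the window: a transfer is required (flag = 1).
        return None, 1
    return sum(durations) / len(durations), 0

# Hypothetical direct trips; the first three take 6, 7 and 8 minutes.
trips = [
    {"day": "monday", "departure": "07:10:00", "arrival": "07:16:00"},
    {"day": "tuesday", "departure": "07:40:00", "arrival": "07:47:00"},
    {"day": "friday", "departure": "08:30:00", "arrival": "08:38:00"},
    {"day": "saturday", "departure": "08:00:00", "arrival": "08:20:00"},  # not a weekday
    {"day": "monday", "departure": "10:00:00", "arrival": "10:30:00"},    # outside 7-9 am
]

avg, flag = summarise_direct_trips(trips)
print(avg, flag)  # 7.0 0
```

The Saturday trip and the 10 am trip are filtered out, so the average matches the (6+7+8)/3 example in the table.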

Task 2: Data Reshaping (20%)

In this task, you need to study the effect of different normalization/transformation methods (i.e., standardization, min-max normalization, log, power, and Box-Cox transformations) on the “price”, “Distance_to_school”, “travel_min_to_CBD”, and “Distance_to_Recreation_centre” attributes.

Further, observe and explain their effect, assuming we want to develop a linear model to predict the “price” using “Distance_to_school”, “travel_min_to_CBD”, and “Distance_to_Recreation_centre” attributes. The linear regression assumptions that you need to study in this task are Normality and Linearity.
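
As an illustration of the methods named above, the following sketch applies each one to a small list of values using only the Python standard library. The function names and the sample prices are hypothetical, the power exponent of 0.5 is an arbitrary choice, and the Box-Cox parameter is passed in rather than estimated; in a notebook you would more likely estimate it with a library routine such as `scipy.stats.boxcox`. The skewness helper is one simple way to compare how far each result is from a symmetric distribution and is not a substitute for the plots and discussion the task asks for.

```python
import math
from statistics import mean, pstdev

def standardise(xs):
    """Z-score: subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def min_max(xs):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def log_transform(xs):
    """Natural log; requires strictly positive values."""
    return [math.log(x) for x in xs]

def power_transform(xs, p=0.5):
    """Raise each value to the power p (p=0.5 is the square root)."""
    return [x ** p for x in xs]

def box_cox(xs, lam):
    """Box-Cox transform for a given lambda; requires strictly positive values."""
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1) / lam for x in xs]

def skewness(xs):
    """Population skewness: the mean of the cubed z-scores."""
    return mean([z ** 3 for z in standardise(xs)])

# Hypothetical, right-skewed sample of prices (not taken from the provided data).
prices = [400000, 450000, 500000, 650000, 2000000]

scaled = min_max(prices)
logged = log_transform(prices)
print(skewness(prices), skewness(logged))
```

Standardization and min-max normalization are linear rescalings, so they change the scale of an attribute but not the shape of its distribution; the log, power, and Box-Cox transformations are non-linear and can reduce skew, which is why they matter for the Normality and Linearity assumptions.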

Task 3: Documentation (20%)

The main focus of the documentation will be on the quality of your explanation for Task 2. As in the previous assignments, your notebook file should be in a proper format with appropriate sections and subsections.

Notes:

1. The output CSV file must have the same columns as specified in the schema. Please note that output files that are not in the correct format, as defined in the integrated schema, won’t be marked.

2. If you decide not to calculate any of the required columns, you must still include that column in your final data frame with ‘Null’ as the value in every row. Please note that output files that are not in the correct format, as specified in the integrated schema, won’t be marked.

3. No external data is allowed to calculate the values of the integrated schema. For example, to calculate the suburb, you can only use the provided shapefiles.

4. The radius of the earth is still 6371 km!

5. In Table 2, the numbers next to some of the column names, in the format (a%), are the marks allocated to that column. For example, the column “suburb” carries 21% of the total output mark of Task 1.

6. For the transfer_flag column, incorrect answers attract negative marks. For example, if 50% of your transfer_flag values are correct and the other 50% are incorrect, then your score for that column is zero (0).
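
Note 4 fixes the Earth's radius at 6371 km, which suggests computing the distance columns as great-circle distances. The sketch below uses the haversine formula as one common way to do that; the specification does not name a formula, so this choice is an assumption. The function names, the facility IDs, and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # value given in Note 4

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def closest(lat, lng, places):
    """Return (id, distance_km) of the nearest entry in places, a dict of id -> (lat, lng)."""
    return min(
        ((pid, haversine_km(lat, lng, plat, plng)) for pid, (plat, plng) in places.items()),
        key=lambda pair: pair[1],
    )

# Hypothetical property and school coordinates (not taken from the provided data).
schools = {"school_A": (-37.80, 144.97), "school_B": (-37.90, 145.10)}
school_id, distance = closest(-37.8136, 144.9631, schools)
print(school_id, round(distance, 3))
```

The same `closest` call could be reused for hospitals, recreation centres, and train stations by passing a different dictionary of IDs and coordinates.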