Project 2: Data Profiling and Preparation
Project Brief
Key Tasks
In this project, you will perform data profiling and cleansing for the Passengers' Satisfaction Survey dataset. You'll examine and assess the data to expose technical and data issues, then plan and conduct data cleansing using Python. Here is the Passenger's Satisfaction Survey data dictionary, where you can find a description of each column in the dataset. All your work will be performed in a Jupyter Notebook with annotations that explain and document your work. Your submission will also include a CSV file for the cleaned dataset.
For Project 2, set up one Jupyter notebook to complete the following key tasks and annotate your work in the notebook:
1. Conduct Data Access and Profiling:
To begin, read the survey dataset into a pandas DataFrame.
Then, create a subset of the DataFrame to include all passenger satisfaction data about only the Business Class. Name the created DataFrame subset "df_Business".
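The read-and-subset step above can be sketched as follows. This is a minimal illustration, not the required solution: the inline CSV text stands in for the real survey file, and the column name "Class" and the value "Business" are assumptions that should be checked against the data dictionary.

```python
import io
import pandas as pd

# Hypothetical stand-in for the survey file. In the notebook, pass the path
# of the real CSV to pd.read_csv instead of this in-memory text.
sample_csv = io.StringIO(
    "id,Customer Type,Class,Flight Distance\n"
    "1,Loyal Customer,Business,1200\n"
    "2,disloyal Customer,Eco,300\n"
    "3,Loyal Customer,Business,2500\n"
)
df = pd.read_csv(sample_csv)

# Keep only the Business Class rows (assumes the label is exactly "Business").
# .copy() gives an independent DataFrame, so later edits do not touch df.
df_Business = df[df["Class"] == "Business"].copy()

print(df_Business.shape)
```

Checking `df["Class"].unique()` before filtering is worthwhile, since inconsistent spellings or stray whitespace would silently drop rows from the subset.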
Collect information about the "df_Business" dataset and its validity. Report the following:
o The number of observations (or rows) and the number of variables (or columns) in the dataset.
o The name and data type of each data column.
o The unique values of each column.
o The number of missing values of each column.
o Summary statistics for each data column.
o Frequency distribution (list the number of times each unique value appears) for each data column.
o The number of fully duplicated data rows.
Based on the information that you collected, provide a list of the issues that the data contains as an annotation.
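One possible way to collect the profiling items listed above is sketched below. The small DataFrame is a hypothetical stand-in for "df_Business", and its column names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_Business; columns and values are illustrative.
df_Business = pd.DataFrame({
    "Customer Type": ["Loyal Customer", "Loyal Customer",
                      "disloyal Customer", "Loyal Customer"],
    "Age": [34, 34, np.nan, 51],
    "Seat comfort": [4, 4, 3, 5],
})

n_rows, n_cols = df_Business.shape             # observations and variables
dtypes = df_Business.dtypes                    # name and data type per column
uniques = {c: df_Business[c].unique() for c in df_Business.columns}
missing = df_Business.isna().sum()             # missing values per column
summary = df_Business.describe(include="all")  # summary stats, all dtypes
# Frequency distribution per column; dropna=False also counts missing values.
freqs = {c: df_Business[c].value_counts(dropna=False)
         for c in df_Business.columns}
n_dupes = int(df_Business.duplicated().sum())  # fully duplicated rows

print(n_rows, n_cols, n_dupes)
print(missing)
```

In a notebook, displaying each result in its own cell, with a short annotation of what looks wrong, makes the issue list at the end of this task easier to justify.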
2. Conduct Data Cleansing:
Implement the following tasks:
Drop the Class column from the created Business Class passenger satisfaction survey ("df_Business").
Rename the "df_Business" columns so that all column names are in lowercase, and replace any whitespace in a column name with an underscore. (For example: "Customer Type" will be renamed to "customer_type").
Modify improperly formatted data and handle bad data.
Handle the missing values (You should have zero missing values in your cleaned dataset).
Remove the fully duplicated observations.
Fix the columns' data types.
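The cleansing steps above might look like the following sketch. Everything specific here is an assumption for illustration: the stand-in data, the idea that a negative age is bad data, and the choice of the median to fill missing values. The right handling depends on the issues found during profiling.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_Business; columns and bad values are illustrative.
df_Business = pd.DataFrame({
    "Customer Type": ["Loyal Customer", " loyal customer", "disloyal Customer",
                      "Loyal Customer", "Loyal Customer"],
    "Class": ["Business"] * 5,
    "Age": ["34", "34", "-5", None, "34"],
    "Seat comfort": [4.0, 4.0, 3.0, 5.0, 4.0],
})

# 1. Drop the Class column.
cleaned = df_Business.drop(columns=["Class"])

# 2. Lowercase the column names and replace whitespace with underscores.
cleaned.columns = (cleaned.columns.str.strip()
                   .str.lower()
                   .str.replace(r"\s+", "_", regex=True))

# 3. Fix improperly formatted text and treat impossible values as missing.
cleaned["customer_type"] = cleaned["customer_type"].str.strip().str.lower()
cleaned["age"] = pd.to_numeric(cleaned["age"], errors="coerce")
cleaned.loc[cleaned["age"] < 0, "age"] = np.nan  # assumption: negative age is bad data

# 4. Handle missing values (median fill is one option among several).
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())

# 5. Remove fully duplicated rows.
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

# 6. Fix the data types.
cleaned["age"] = cleaned["age"].astype(int)
cleaned["seat_comfort"] = cleaned["seat_comfort"].astype(int)
cleaned["customer_type"] = cleaned["customer_type"].astype("category")

print(cleaned.dtypes)
```

Duplicates are removed after the text fixes on purpose: rows that differ only by stray whitespace or letter case become identical once normalized, so dropping duplicates earlier would miss them.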
3. Export the cleaned dataset to CSV file.
Use the following naming convention: Project_2_Group#.csv
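A minimal export sketch follows. It assumes that "#" in the naming convention stands for the group number; `group_number`, `output_dir`, and the tiny `cleaned` DataFrame are hypothetical placeholders.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame.
cleaned = pd.DataFrame({
    "customer_type": ["loyal customer", "disloyal customer"],
    "age": [34, 51],
})

group_number = 1  # assumption: "#" in the file name is your group number
# The sketch writes to a temporary folder; in the notebook, Path(".") would
# place the file next to the notebook instead.
output_dir = Path(tempfile.mkdtemp())
out_path = output_dir / f"Project_2_Group{group_number}.csv"

# index=False keeps the pandas row index out of the exported file.
cleaned.to_csv(out_path, index=False)

# Read the file back as a quick sanity check on shape and column names.
reloaded = pd.read_csv(out_path)
print(out_path.name, reloaded.shape)
```

Note that a CSV file does not store pandas data types, so a column converted to `category` will come back as plain text when the file is re-read.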
4. Compare the Data Characteristics Before and After Data Cleansing.
At the end of the Jupyter Notebook, provide a brief write-up annotation to compare the characteristics of each column before and after data cleansing (this includes, the column name, datatype, unique values, and the number of missing values).
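One way to assemble the before-and-after comparison is to profile both DataFrames with the same helper and join the results. The sketch below is an illustration under assumptions: `profile_columns` is a hypothetical helper, not a pandas function, and the two small DataFrames stand in for the data before and after cleansing.

```python
import numpy as np
import pandas as pd


def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per column with dtype, uniques, missing."""
    rows = [
        {
            "column": c,
            "dtype": str(df[c].dtype),
            "unique_values": df[c].dropna().unique().tolist(),
            "n_missing": int(df[c].isna().sum()),
        }
        for c in df.columns
    ]
    return pd.DataFrame(rows)


# Illustrative stand-ins for the data before and after cleansing.
before = pd.DataFrame({
    "Customer Type": ["Loyal", "loyal ", None],
    "Class": ["Business"] * 3,
    "Age": [34.0, np.nan, 51.0],
})
after = pd.DataFrame({
    "customer_type": ["loyal", "loyal", "loyal"],
    "age": [34, 42, 51],
})

# Column names change during cleansing, so match on the normalized name.
before_p = profile_columns(before)
before_p["key"] = (before_p["column"].str.strip()
                   .str.lower()
                   .str.replace(r"\s+", "_", regex=True))
after_p = profile_columns(after).rename(columns={"column": "key"})

# The outer join keeps dropped columns (such as Class) with empty "after" cells.
comparison = before_p.merge(after_p, on="key", how="outer",
                            suffixes=("_before", "_after"))
print(comparison)
```

The resulting table gives the raw material for the write-up; the annotation itself still needs to explain what changed for each column and why.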
What to Submit:
Jupyter notebook in IPYNB format with annotations that explain and document your work.
CSV file for the cleaned dataset.
Use the following naming convention for your Jupyter Notebook submission (upload notebook in 2 file formats: HTML format and ipynb):
Project_2_Group#.ipynb
Project_2_Group#.html
Project_2_Group#.csv
2025-05-30