
Assignment 1:  Data Exploration & Visualization

For this assignment, you will use the file july4_snapshot.csv, which can be found on our course Blackboard page.   This dataset includes information about the park visitors on one particular day of operations – July 4, 2021.  

Once you have completed this assignment, you will upload two files into Blackboard:  The .ipynb file that you create in Jupyter Notebook (or Colab), and an .html file that was generated from your .ipynb file.  If you run into any trouble with submitting the .html file to Blackboard, you can submit it as a PDF instead.  Please include your last name in your filenames.  The exact way you save it is up to you, but the last name makes it easier to keep track of the file (e.g. BakerAssignment1.ipynb, bakerAssgn1.ipynb, bakerAssignment1.html, etc. -- any of these would be fine).  

For any question that asks you to perform some particular task, you just need to show your input and output.  Tasks will always be written in regular, non-italicized font.  

For any question that asks you to include interpretation, write your answer in a Markdown cell in Jupyter Notebook (or a ‘Text’ cell if you used Colab).  Any homework question that needs interpretation will be written in italicized font.  Do not simply write your answer in a code cell as a comment, but use a Markdown or Text cell instead.  

Remember to be resourceful!  There are many helpful resources available to you, including the video library, the lecture notes on Blackboard, recitations, the office hours sessions, and the web.  

In the prompt, variables might not be referred to in the exact way that their names appear in the dataset.  This is okay -- that’s very realistic.  You should familiarize yourself with the dataset and its variables through the dataset description table.  

Dataset Description:

| Variable | Description |
| --- | --- |
| visitor | This is an incremental count variable – each person who visited Lobster Land on July 4th is assigned a unique number.  Note that the actual number of visitors is larger than the number of rows here – if a person purchased tickets for a family, only the ticket buyer is included in this dataset. |
| day_pass | This variable indicates that the visitor either used a day pass (1) or did not use a day pass (0).  A day pass gives the buyer access to Lobster Land for one full day.  Season ticket holders do not purchase day passes. |
| season_ticket | A “1” in this variable means that the person used a season ticket, whereas a “0” means that the person did not. |
| domestic | A “1” in this variable means that the person is a U.S. resident, whereas a “0” indicates that the person is not. |
| state | The home state of domestic visitors to Lobster Land on July 4th. |
| country | The visitor’s home country, one of: BRA (Brazil), CAN (Canada), CHN (China), FRA (France), GER (Germany), IND (India), JPN (Japan), MEX (Mexico), ROK (South Korea), UK (United Kingdom), or USA (United States of America). |
| gender | This shows the gender of the visitor, with 1 representing female, and 0 representing male. |
| age | This is an integer variable depicting the age of the visitor. |
| maine_res | A “1” in this variable means that the person is from Maine, whereas a “0” indicates that the person is not. |
| stay_four | A “1” in this variable means that the person stayed at Lobster Land for more than four hours on July 4th, whereas a “0” indicates that the person did not. |
| payment_method | A “1” in this variable indicates that the visitor used cash to purchase a ticket, whereas a “0” indicates the use of a credit card, debit card, or digital payment method. |
| ice_cream_purch | This variable indicates whether the visitor purchased ice cream during their visit (1) or did not purchase ice cream (0). |
| ice_cream_flavor | This indicates the type of ice cream purchased by the visitor during their stay at Lobster Land on July 4th. |
| sky_chair | This variable indicates whether the visitor rode the “Sky Chair” ride during their visit (1) or did not go on this ride (0).  A picture of this ride is below. |
| ferris_wheel | This variable indicates whether the visitor rode the “Ferris Wheel” ride during their visit (1) or did not go on this ride (0).  A picture of this ride is below. |
| lobster_claw | This variable indicates whether the visitor rode the “Lobster Claw” roller coaster during their visit (1) or did not go on this ride (0).  A picture of this ride is below. |
| lobster_junior | This variable indicates whether the visitor rode the “Lobster Junior” kids’ roller coaster during their visit (1) or did not go on this ride (0).  A picture of this ride is below. |
| merch_spend | Total merchandise spending on July 4th by the visitor. |
| lobsterama_spend | Total spending at the Lobsterama (a sit-down restaurant inside of Lobster Land) by the visitor on July 4th. |

 

Your Tasks:

 

Bring this dataset into your local environment (in Jupyter Notebook, or in Colab).  
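One possible way to load the file is the pandas read_csv function. The sketch below reads a small, made-up stand-in through io.StringIO so that it runs anywhere; the sample rows are invented for illustration, and in your own notebook you would pass the path to july4_snapshot.csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for july4_snapshot.csv (invented rows, not real data).
# In your notebook, replace sample_csv with the path to the real file.
sample_csv = io.StringIO(
    "visitor,domestic,state,age,merch_spend\n"
    "1,1,ME,34,25.5\n"
    "2,0,,41,0.0\n"
    "3,1,NH,19,12.25\n"
)

df = pd.read_csv(sample_csv)  # empty fields are read in as NaN
print(df)
```

If you are working in Colab, the file has to be uploaded to the session (or mounted from Drive) before read_csv can find it.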

I. Exploratory Data Analysis:  Exploration & Manipulation

A. Call the head() function on this dataframe and look at your results.

How many rows of the dataset are visible in Jupyter now?   

B. Take a look at the dataset’s shape attribute.

How many rows, and how many columns, are in this entire dataframe?   
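As a rough illustration of the difference between head() and shape, here is a sketch on a made-up 10-row dataframe (the values are invented, not taken from the real file). Note that head() is a method and needs parentheses, while shape is an attribute and does not.

```python
import pandas as pd

# Made-up 10-row frame standing in for the real dataset.
df = pd.DataFrame({
    "visitor": range(1, 11),
    "age": [34, 41, 19, 52, 27, 63, 45, 30, 22, 38],
})

preview = df.head()        # a preview of the first rows; head(n) returns n rows
n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple for the whole frame
print(preview)
print(n_rows, n_cols)
```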

C. Read the dataset description, and take a look at the variables in the dataset.  

Which of your variables should be seen as categorical, and which ones should be seen as numeric?   
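One way to support your reading of the description table is to compare how pandas stores each column with how many distinct values it holds. The sketch below uses a small invented dataframe; a column stored as an integer may still be categorical in meaning (for example, a 0/1 code), so the storage type alone does not settle the question.

```python
import pandas as pd

# Invented sample rows that borrow a few column names from the description table.
df = pd.DataFrame({
    "state": ["ME", "NH", "ME", "MA"],
    "gender": [1, 0, 1, 0],
    "age": [34, 41, 19, 52],
    "merch_spend": [25.5, 0.0, 12.25, 40.0],
})

# Storage type and number of distinct values, side by side.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_unique": df.nunique(),
})
print(summary)

# A 0/1 code can be converted explicitly if you want pandas to treat it as a category.
df["gender"] = df["gender"].astype("category")
```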

D. Lobster Land has two monetary variables in this dataset.  One of them has too many decimal places!  This was caused by an issue with Lobster Land’s software system.  Using Python, round that variable’s values to two decimal places.  
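A minimal sketch of rounding a column with the pandas round method follows. The values are invented, and the choice of lobsterama_spend as the affected column is only an assumption for the example; inspect your own data to see which monetary variable actually needs rounding.

```python
import pandas as pd

# Invented values; which column has the extra decimals here is an assumption.
df = pd.DataFrame({
    "merch_spend": [12.5, 0.0, 40.25],
    "lobsterama_spend": [18.123456, 7.456789, 0.0],
})

# Round to two decimal places and assign the result back to the column.
df["lobsterama_spend"] = df["lobsterama_spend"].round(2)
print(df)
```

Assigning the result back matters: round returns a new Series and leaves the original column unchanged otherwise.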

E. Are there any missing values in this dataset?  If so, how many total values are missing?  Use Python code to answer this question.  
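One common approach is to chain isna() with sum(): the first sum gives a count per column, and a second sum gives the total. The sketch below uses invented rows (including made-up flavor names) purely to show the mechanics.

```python
import numpy as np
import pandas as pd

# Invented rows with a few deliberately missing entries.
df = pd.DataFrame({
    "state": ["ME", np.nan, "NH", np.nan],
    "ice_cream_flavor": ["vanilla", np.nan, np.nan, "mint"],
    "age": [34, 41, 19, 52],
})

missing_per_column = df.isna().sum()         # one count per column
total_missing = int(missing_per_column.sum())  # grand total across the frame
print(missing_per_column)
print(total_missing)
```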

F. Make a separate subset of the dataframe that only includes the rows that have NaN values for ‘state’.  What important thing do they all have in common, which helps to explain why there are NaN values for these rows?
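Filtering with a boolean mask is one way to build such a subset. The sketch below uses invented rows and only shows the mechanics; once you have the subset from the real data, something like nunique() or describe() on it can help you spot what the rows share.

```python
import numpy as np
import pandas as pd

# Invented rows; two of them have no state recorded.
df = pd.DataFrame({
    "visitor": [1, 2, 3, 4],
    "state": ["ME", np.nan, "NH", np.nan],
    "age": [34, 41, 19, 52],
})

# Keep only the rows where 'state' is missing; copy() makes it a separate frame.
no_state = df[df["state"].isna()].copy()
print(no_state)
print(no_state.nunique())  # columns with a single distinct value stand out here
```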

G. Erroneous Data.

We just received an update from the Lobster Land front ticket office.  Apparently, some guests’ ages were recorded incorrectly at the time that their tickets were purchased.  The youngest age of any guest who purchased a ticket on July 4th was 16.  Alter the dataframe so that any guest age currently less than 16 becomes 16.    
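There are several ways to apply a floor like this; one short option is the clip method with a lower bound, sketched here on invented ages. A boolean mask with loc would work just as well.

```python
import pandas as pd

# Invented ages, some below the stated minimum of 16.
df = pd.DataFrame({"age": [34, 7, 16, 12, 58]})

# Raise every age below 16 up to 16; other values are left untouched.
df["age"] = df["age"].clip(lower=16)
print(df["age"].tolist())
```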

H. Lobster Land wants to know more about how its international guests compare to its domestic ones.

First, find the percentage of guests from the entire dataset who stayed at Lobster Land for more than four hours on July 4th.  

Now, let’s break this down a bit more.  What percentage of domestic visitors stayed for more than 4 hours on that day?   What percentage of international visitors stayed for more than 4 hours on that day?

If the two percentages you found in the previous step were different, what do you think might explain this difference?  (No domain knowledge is required here – take a moment to think about it, and come up with a thoughtful, plausible explanation).   
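Because stay_four is coded as 0/1, its mean is the proportion of ones, so multiplying by 100 gives a percentage. The sketch below applies that idea overall and then per group with groupby; the eight rows are invented and say nothing about the real results.

```python
import pandas as pd

# Invented 0/1 data for eight visitors.
df = pd.DataFrame({
    "domestic":  [1, 1, 1, 0, 0, 1, 0, 1],
    "stay_four": [1, 0, 1, 1, 1, 0, 0, 1],
})

# The mean of a 0/1 column is the share of 1s.
overall_pct = df["stay_four"].mean() * 100

# The same calculation, split by domestic (1) versus international (0).
pct_by_group = df.groupby("domestic")["stay_four"].mean() * 100
print(overall_pct)
print(pct_by_group)
```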

I. Removing a variable

Pick any variable from the dataset that is redundant (in other words, all the information that it contains is already included in another variable).  Remove the variable that you have identified as redundant.

In a sentence or two, explain why this variable is not needed.  
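The mechanics of removing a column can be sketched with the drop method. The dataframe and the column name age_copy below are hypothetical and are not part of the real dataset; deciding which real variable is redundant is left to you.

```python
import pandas as pd

# Hypothetical frame: 'age_copy' is an invented column that duplicates 'age'.
df = pd.DataFrame({
    "visitor": [1, 2, 3],
    "age": [34, 41, 19],
    "age_copy": [34, 41, 19],
})

# drop returns a new frame, so assign the result back to keep the change.
df = df.drop(columns=["age_copy"])
print(df.columns.tolist())
```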

J. Renaming a variable.

Pick any variable in the dataset, and rename it.  (For this step, it doesn’t matter which variable you pick -- the purpose is just to become familiar with the process for doing this -- it can sometimes be a very helpful step in data cleaning/data preparation).
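Renaming can be done with the rename method and a dictionary that maps old names to new ones. In this sketch the rows are invented and the new name merchandise_spend is just an arbitrary choice for the example.

```python
import pandas as pd

# Invented rows; only the column name matters for this example.
df = pd.DataFrame({"visitor": [1, 2], "merch_spend": [25.5, 0.0]})

# Map the old name to the new one; columns not listed keep their names.
df = df.rename(columns={"merch_spend": "merchandise_spend"})
print(df.columns.tolist())
```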

II.   Data Visualization

K. Using any plotting tool in Python, generate a boxplot that shows maine_res on the x-axis, and merchandise spending on the y-axis.    

What do you notice about this relationship?   In a couple of sentences, why does this fit or not fit with what you would intuitively expect? 
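Any plotting tool is acceptable here; as one example, the sketch below uses the seaborn boxplot function on invented spending values. The axis label text is an arbitrary choice, and the matplotlib.use("Agg") line is only there so the sketch runs without a display (it is not needed inside a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented spending values for three Maine residents and three non-residents.
df = pd.DataFrame({
    "maine_res":   [1, 1, 1, 0, 0, 0],
    "merch_spend": [5.0, 12.5, 20.0, 30.0, 42.5, 18.0],
})

fig, ax = plt.subplots()
sns.boxplot(data=df, x="maine_res", y="merch_spend", ax=ax)
ax.set_xlabel("Maine resident (1 = yes, 0 = no)")
ax.set_ylabel("Merchandise spending")
```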

L. Which rides were most popular / least popular on July 4th, 2021?  Generate one barplot that depicts the total number of people who went on the Sky Chair, the Ferris Wheel, the Lobster Claw, and the Lobster Junior.  (Note: there are many ways you can solve this – any approach that gets the job done is completely fine).

In a sentence or two, what does this plot show?  
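Since each ride column is coded 0/1, summing a column counts its riders. The sketch below is just one of the many possible approaches: it sums the four ride columns on invented data and draws the totals with the pandas plot method.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented 0/1 ride indicators for four visitors.
df = pd.DataFrame({
    "sky_chair":      [1, 0, 1, 1],
    "ferris_wheel":   [1, 1, 1, 1],
    "lobster_claw":   [0, 0, 1, 0],
    "lobster_junior": [0, 1, 0, 1],
})

ride_cols = ["sky_chair", "ferris_wheel", "lobster_claw", "lobster_junior"]

# Column sums of 0/1 indicators are rider counts; sort so the bars are ordered.
ride_totals = df[ride_cols].sum().sort_values(ascending=False)

ax = ride_totals.plot(kind="bar")
ax.set_ylabel("Number of riders")
ax.set_title("Riders per ride")
```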

M.   Build a histogram that depicts total merchandise spending per visitor.  

How can you increase the number of bins in your histogram?  

Create another Lobsterama spending histogram, but with more bins.   Be sure to include an x-axis label and a title with your histogram.

How is your second histogram different from your first one?  What is the impact of increasing the number of bins?  

Now, add a hue variable to your histogram, to indicate whether the visitor was domestic or international.  Include multiple='stack' inside of the function that builds the histogram.   What does this plot show?

N. Use the countplot() function from seaborn to show a comparison of the home states of domestic visitors to Lobster Land on July 4th.  Set up the bars so that they are in decreasing or increasing order of size.

What does this graph show?  In a sentence or two, explain what it depicts.
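One way to control bar order in countplot is its order argument. In the sketch below, the ordering comes from value_counts(), which sorts from most to least frequent by default; the state values are invented.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented home states for six visitors.
df = pd.DataFrame({"state": ["ME", "NH", "ME", "MA", "ME", "NH"]})

# value_counts() sorts by frequency (largest first), so its index is the bar order.
state_order = df["state"].value_counts().index.tolist()

fig, ax = plt.subplots()
sns.countplot(data=df, x="state", order=state_order, ax=ax)
ax.set_xlabel("Home state")
```

Reversing state_order (for example with state_order[::-1]) would give increasing order instead.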

O. Now, use the barplot() function from seaborn to show a comparison of Lobster Claw riders from country to country.  Construct this plot so that countries are on the x-axis, and the proportion of guests who rode the Lobster Claw is on the y-axis.  Do not include confidence intervals with the bars.  Set up the bars so that they are in decreasing or increasing order of size.

What does this graph show?  In a sentence or two, explain what it depicts.