Project 2: You and Data Science - Data Science Discovery

Due: Last Day of Lecture (Wednesday, December 6th at 11:59pm)

Throughout this semester, you have grown into an amazing Data Scientist! You are analyzing datasets in Python, performing advanced statistical tests, and finding the answers to complex questions using data. You have seen dozens of datasets we have provided throughout the semester. For the final project, we want you to teach us something -- we want to learn about something you are passionate about!

For this final project in Data Science DISCOVERY, you will use Data Science to explore something you are passionate about or interested in learning more about. At the end, you will write a small report telling us about what you found and teaching us something! We only have a few minimal requirements:

You must use a non-trivial dataset. The dataset must have at least 200 data points, counted as rows times columns (this could be 20 rows with 10 columns, 50 rows with 4 columns, etc.).

The dataset you use must NOT be a dataset we used in class or lab (details on how to find datasets in the "Dataset" section below).

You must create your own Python notebook. You will turn in both code and analysis. What you analyze is up to you, as long as you do something with the data.

With students from so many different majors in Data Science DISCOVERY, we are excited for everything we are going to learn from you! :)
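The 200-data-point minimum above can be checked in a line or two once your data is in a DataFrame. The following is a minimal sketch, assuming pandas is installed; the helper name `meets_size_requirement` and the stand-in DataFrame are hypothetical and are not part of the course materials. Note that `DataFrame.size` counts every cell, including missing values.

```python
import pandas as pd

MIN_DATA_POINTS = 200  # minimum stated in the project requirements


def meets_size_requirement(df: pd.DataFrame, minimum: int = MIN_DATA_POINTS) -> bool:
    """Return True if rows x columns is at least `minimum`."""
    # DataFrame.size is the total number of cells (rows * columns).
    return bool(df.size >= minimum)


# Hypothetical stand-in dataset: 50 rows x 4 columns = 200 data points.
example = pd.DataFrame({
    "a": range(50),
    "b": range(50),
    "c": range(50),
    "d": range(50),
})

rows, cols = example.shape
print(f"{rows} rows x {cols} columns = {example.size} data points")
print("Meets requirement:", meets_size_requirement(example))
```

In your own notebook you would call the check on the DataFrame you loaded instead of `example`.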

Setting Up Your Project Workspace

To complete this project, there is no starter code or starter files -- you are building it from scratch!

However, we do want to nerd out with your work so we need you to place it in a specific spot in your stat107 directory so you can turn it in and so that we can find it:

In your stat107/netid directory (the directory that contains all of your labs, extra credit microprojects, etc.), create a new directory called project2.

You will want to complete ALL your work related to project2 in your project2 directory.

At the end, you'll turn in your whole folder. :)

Dataset

The idea of this project is that you will use a dataset you are passionate about. It can be anything -- a dataset from another class (e.g., any data you received in Excel), a dataset you found online, or a dataset you gather yourself. Some ideas include, but are not limited to:

A dataset about a hobby you are interested in (e.g., vacation destinations, best beaches, fashion trends, Instagram, music, etc.)

A dataset about something you enjoy doing or watching (e.g., swimming, volleyball, Rocket League, Illini Football, etc.)

A dataset about a topic related to your major (economics, communications, political science, etc.)

Any dataset that means something to you.

Online Data Sources

The best data is data that you personally care about. This may be data from a club you are part of or data about something you're passionate about that you already have available.

If you have no datasets at all, here are several websites that many people use as sources for datasets:

Government-Provided Datasets

UIUC Division of Management Information (DMI) Student Enrollment Data, https://www.dmi.illinois.edu/stuenr/

City of Chicago Data Portal, https://data.cityofchicago.org/

State of Illinois Data Portal, https://data.illinois.gov/

U.S. Government's Open Data, https://data.gov/

Curated Lists of Datasets by Others

BigML Blog List of Datasets, http://blog.bigml.com/list-of-public-data-sources-fit-for-machine-learning/

Quora Answer: "Where can I find large datasets open to the public?" https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public

Collections of University-Provided Datasets

UC Irvine Machine Learning Repository, https://archive.ics.uci.edu/

Stanford Large Network Dataset Collection, http://snap.stanford.edu/data/index.html

Third-Party Data Sources

(Note: The sites listed here are generally third-party data providers. This means that they do not collect the data themselves, but simply pass on data that others have provided. Some datasets may be high-quality and trusted; others may be completely made-up data.)

Kaggle, http://www.kaggle.com/

League of Legends Match Data Downloads, https://oracleselixir.com/tools/downloads

Sports Reference, https://www.sports-reference.com/

Project Notebook

The major deliverable for this project is a notebook of your analysis and summary of your findings.

To create your notebook:

1. Open Visual Studio Code and then choose "File -> Open Folder". Go into your project2 folder you created earlier (Desktop -> stat107 -> netid -> project2).

2. Once you have the project2 folder open inside of Visual Studio Code, choose "File -> New File..."

3. In the options dialog, choose Jupyter Notebook. You now have a blank notebook.

4. We recommend immediately saving it and calling it project2.ipynb.

Jupyter Notebook Format

You will need to add a combination of "Code" (Python) and "Markdown" cells to complete your notebook. You can hover your mouse below each cell to see the + Code and + Markdown options to add new blocks of certain types.

Code blocks are used for Python programming. Everything in a Code block will be read as Python.

Markdown blocks are used for writing. Everything in a Markdown block will be read as Markdown (formatted text).

You can learn about the options available for Markdown on the "Basic Syntax" guide for Markdown or any other source for Markdown documentation.

Deliverable

The Jupyter notebook is your only deliverable. The requirements are:

1. You must have five sections in your notebook. Each section MUST start with a clearly identifiable Markdown cell that contains a "Header 1" of your current section. (See the "Basic Syntax" guide for Markdown to understand what "Header 1" means in Markdown.)

2. The five sections must be:

Section 1: Dataset: In Markdown, explain what dataset you chose and why you chose it. Include why it is meaningful to you and how you went about finding it. Then, in Python, load your dataset into a DataFrame.

Section 2: Exploratory Data Analysis: In Markdown, explain what descriptive statistics can help you give a broad overview of the data (e.g., size, shape, interesting descriptive statistics, etc.). In Python, do this exploratory data analysis.

Reference Labs: lab_intro, lab_pandas, lab_exp_design, lab_simpsons_paradox

Section 3: Exploratory Data Visualization: In Python, create at least one data visualization. This does not need to be complex, but should showcase something about your EDA or Data Science analysis. In Markdown, provide at least a two-sentence summary of this result.

Reference Labs: lab_plots, lab_gpa

Section 4: Data Science: In Markdown, explain at least one question you have about your dataset. Clearly state the questions you have and how you plan on using Python to answer them. This may involve cleaning or selecting a subset of the data. You can use any technique you learned in DISCOVERY that is beyond simple descriptive statistics. You can use regression, hypothesis testing, correlation, simulation, or ideas from any of the labs, MicroProjects, or lecture. In Python, do the data science! :)

Reference Labs: lab_favorites, lab_similarity, lab_justice, lab_probability, lab_random_variable, lab_hypothesis_tests, lab_regression, lab_kmeans

Section 5: Overall Summary: In Markdown, summarize your dataset, findings, and visualization. A good summary shares a complete overview of your work in only 1-2 paragraphs without going low-level into the code. This might be the summary you would share in a future interview if someone asked you, "What is a data science project you did on your own?" Your summary must be 1-2 paragraphs long (a paragraph is at least 5 sentences).

3. Each section should generally be several sentences AND several lines of code and span at least a full screen. The only exception is Section 3, where many data visualizations can be just one line of code.

4. Your audience is going to be Prof. Wade, Prof. Karle, and/or your lab TA. You do NOT need to explain Python or Data Science to us, but you should assume we know nothing about your specific interest/passion/dataset.

5. Make sure to save your work and submit it to GitHub before 11:59pm on the last day of class (Wednesday, December 6 at 11:59pm).
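As a rough illustration of how the code cells for Sections 1, 2, and 4 might fit together, here is a minimal sketch, assuming pandas is installed. The inline CSV text and the column names `hours_practiced` and `score` are hypothetical stand-ins; a real project would load its own file (for example with `pd.read_csv` on a file in the project2 folder) and would need far more than these six rows to meet the size requirement. Correlation is only one of the techniques listed for Section 4, and the Section 3 visualization is left out here.

```python
import io

import pandas as pd

# Hypothetical stand-in for a real file; in your notebook you would
# typically pass a file path to pd.read_csv instead of inline text.
csv_text = """hours_practiced,score
1,52
2,55
3,61
4,64
5,70
6,73
"""

# Section 1: load the dataset into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Section 2: exploratory data analysis (shape and descriptive statistics).
print(df.shape)
print(df.describe())

# Section 4: one possible data science step, the correlation
# between two numeric columns.
r = df["hours_practiced"].corr(df["score"])
print(f"Correlation: {r:.3f}")
```

Each of these steps would normally sit in its own Code cell, preceded by the Markdown cell that introduces the section and explains the reasoning.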

Submission

Make sure you have saved your notebook. Once your notebook is saved, you will turn in your project2 folder through git, as with your other assignments, but the submission process is different than usual!! In your stat107/netid directory, add your project2 folder:

git add project2

Make sure you don't have any errors. This command will add your project2 folder only if you run it from your stat107/netid directory.

Once you've added the project2 folder, turn it in with the following commands:

git commit -m "project 2 submission"

git push

Make absolutely sure your files are turned in by checking your repository on GitHub found here:

https://github.com/orgs/stat107-illinois/repositories.

Data Science Discovery is an open-source data science resource created by The University of Illinois with support from The Discovery Partners Institute, the College of Liberal Arts and Sciences, and The Grainger College of Engineering. The aim is to support basic data science literacy to all through clear, understandable lessons, real-world examples, and support.