Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

DAT 560M: Big Data and Cloud Computing

Fall 2023, Mini B

Homework #3

INSTRUCTIONS

1. This is an individual assignment. You may not discuss your approach to solving these questions with anyone, other than the instructor or TA.

2. Please include only your Student ID on the submission.

3. The only allowed material is:

a. Class notes

b. Content posted on Canvas

c. Utilize ONLY the codes we practice. Anything beyond will not get any point!

4. You are not permitted to use other online resources.

5. The physical submission is due by the next lab.

6. There will be TA office hours. See the schedule on Canvas.

ASSIGNMENT

Each question worth 10 points. You need to take screenshot of your code environment for each question. In all query questions, you are free to choose Hive or Impala.

In this assignment, we are going to work with a file named accidents.csv and the file is located in our database. This dataset has information about accidents happened in the US at certain time period. There are a few number of columns in the dataset and we would like to do an analysis on that by Linux shell and Hive/Impala. Please answer each question and provide a screenshot.

Part 1- Linux Code Practice

1- Download accidents.csv from our dataset folder into your home directory.

2- Using one line bash command print the number of columns.

3- Using the Linux commands, how many different states are represented here? Your result should be only a number.

4- Using the Linux commands, find the top 3 cities in the state of Missouri (in terms of frequency).

5- Using the Linux commands, find the severity, city, state, and the temperature of the 5 lowest temperature accidents. For this question, you need to learn how to better use sort in a csv file.

Part 2- Hive/Impala Practice

There is a table available on the server called accidents. Please use this table to answer the following questions.

Please do not change the name of this table or do not remove it.

6- Compare the average severity of accidents happened in Saint Louis and Kansas City. List the city, average severity, number of accidents, and average precipitation.

7- Find the top 5 cities in Missouri with the total number of accidents less than 25. Sort the results to find the top 5 ones. List the city, and number of accidents.

8- Find the number of accidents per severity per sunrise_sunset status. Sort the results by the sunset_sunrise status. Can you see a trend?

9- Visualize the number of accidents per state. Which state is like an outlier?

Part 3- PySpark Installation Result

10- After successful installation of PySpark on your computer, please run the following commands on a new jupyter notebook and take a screenshot of the Spark version running on Jupyter Notebook.

import pyspark

from pyspark.sql import SparkSession

spark=SparkSession.builder.appName('Practice').getOrCreate()

spark