

DSCI 551 – HW5

(Hadoop and Spark) (Fall 2021)

100 points, Due 11/28, Sunday


SET UP:

Download solution.json, hw5_grade.py, and run.sh from Blackboard, put them in your homework folder, and submit them along with your homework.

Run "chmod 707 run.sh" to make the script executable.

This gives you a chance to test your answers with the TA's grading script.

1. [50 points] Write a Hadoop MapReduce program SQL2MR.java that implements the following SQL query in Hadoop MapReduce on the film table (from Sakila), stored in multiple CSV files: film1.csv and film2.csv. To learn how to compile and run your Java file in Hadoop, please refer to the WordCount tutorial. Your program does not need to use a combiner. However, you should think about how to implement one (e.g., be able to write pseudocode describing the input, output, and logic of the combiner and the related reducer).

SELECT rating, AVG(rental_rate)
FROM film
WHERE length >= 60
GROUP BY rating
HAVING COUNT(*) >= 160

Note that each line in the CSV files has four comma-separated values: film_id, rental_rate, length, and rating. For example:


1,0.99,86,PG
2,4.99,48,G
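The filter/group/average logic of the query above can be sketched in plain Python (this is NOT a Hadoop program, just an illustration of the map, combine, and reduce steps; the third sample row is made up to give the 60-minute filter something to keep alongside row 1):

```python
# Illustrative sketch of the map -> combine -> reduce logic for the query.
# Field order per CSV line: film_id, rental_rate, length, rating.
from collections import defaultdict

lines = ["1,0.99,86,PG", "2,4.99,48,G", "3,2.99,60,PG"]

# Map: keep films with length >= 60, emit (rating, (rental_rate, 1)).
mapped = []
for line in lines:
    film_id, rental_rate, length, rating = line.split(",")
    if int(length) >= 60:
        mapped.append((rating, (float(rental_rate), 1)))

# Combine/Reduce: sum rates and counts per rating, then apply the HAVING
# filter on the count (160 in the real query; 1 here for the tiny sample).
sums = defaultdict(lambda: (0.0, 0))
for rating, (rate, cnt) in mapped:
    s, c = sums[rating]
    sums[rating] = (s + rate, c + cnt)

result = {r: s / c for r, (s, c) in sums.items() if c >= 1}
```

Because the combiner and reducer both just add partial (sum, count) pairs, the same merge step can run on mapper output before the shuffle, which is what makes a combiner valid for AVG implemented as SUM/COUNT.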



Execution format:

hadoop jar hw5.jar SQL2MR input output

where input is a directory storing all the CSV files and output is the directory for storing the output of your MapReduce program. Two sample CSV files are provided, but note that your program may be tested with additional files in the same format.



Submission:

· SQL2MR.java

· hw5.jar

· q1_screenshot.png: A screenshot showing the output of your program on EC2 (similar to the WordCount example output).




For questions 2 and 3 below, you will use three CSV files as input: film.csv, actor.csv, and film_actor.csv. Note that each of these files contains the complete data for the corresponding table (i.e., the data are NOT split across multiple files as in Question 1). Note also that each file has a header in its first line.



2. [20 points, Spark DataFrame] Write a Spark script using Spark DataFrame functions for each of the following questions (5 points each).

Execution format: python3 q2_x.py — name your files q2_x.py, where x is the subquestion letter.

Any typo in column names or their order will result in the loss of ALL POINTS.

a) Find how many films are rated either 'PG' or 'PG-13'. Use .show() to output your result; your column names and order should look EXACTLY like:

|count|

b) Find the first and last names of the actors who have played in the film 'ANONYMOUS HUMAN'. Use .show() to output your result; your column names and order should look EXACTLY like:

|first_name|last_name|

c) Find the name of the actor who has played in the largest number of films. Use .show() to output your result; your column names and order should look EXACTLY like:

|first_name|last_name|
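The lookup behind c) — count films per actor through the film_actor bridge table, then pick the actor with the highest count — can be sketched in plain Python (this is not a Spark solution; the names and IDs below are made up for illustration):

```python
# Plain-Python sketch of "actor with the most films".
# actors maps actor_id -> (first_name, last_name); film_actor holds
# (actor_id, film_id) pairs, as in the Sakila bridge table.
from collections import Counter

actors = {1: ("PENELOPE", "GUINESS"), 2: ("NICK", "WAHLBERG")}
film_actor = [(1, 10), (1, 11), (2, 10)]

# Count appearances per actor, then take the most frequent one.
counts = Counter(actor_id for actor_id, _ in film_actor)
top_actor_id, _ = counts.most_common(1)[0]
first_name, last_name = actors[top_actor_id]
```

In Spark DataFrame terms the same shape becomes a join of film_actor with actor, a groupBy/count, and an ordering by the count.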

d) Find the answer to the SQL query in Question 1:

SELECT rating, AVG(rental_rate)
FROM film
WHERE length >= 60
GROUP BY rating
HAVING COUNT(*) >= 160

Use .show() to output your result, and round avg_rate to 2 decimal places with pyspark.sql.functions.round (aka fc.round).

Your column names and order should look EXACTLY like:

|rating|avg_rate|

3. [30 points, Spark RDD]

For each of the subquestions in Question 2 above, write a Spark script using Spark RDD API functions to answer the question (7.5 points each).

Execution format: python3 q3_x.py — name your files q3_x.py, where x is the subquestion letter.

a) Output your result with the following template:

for count in res:
    print(count)



b) Output your result with the following template:

for first, last in res:
    print(first, last)

c) Output your result with the following template:

for first, last in res:
    print(first, last)

d) Output your result with the following template (round the average to 2 decimal places, as in the template):

for rating, avg in res:
    print(rating, "{:.2f}".format(avg))
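The per-key aggregation behind d) can be sketched without Spark: the function that merges two (sum, count) pairs plays the same role as the function you would pass to reduceByKey on an RDD of (rating, (rental_rate, 1)) pairs. The data below is a made-up miniature, not the real film table:

```python
# Plain-Python sketch of the reduceByKey-style average for d).
pairs = [("PG", (0.99, 1)), ("PG", (2.99, 1)), ("R", (4.99, 1))]

def merge(a, b):
    # Same shape as a reduceByKey merge function: add sums and counts.
    return (a[0] + b[0], a[1] + b[1])

by_key = {}
for key, value in pairs:
    by_key[key] = merge(by_key[key], value) if key in by_key else value

# Keep groups passing the count threshold (1 here; 160 in the real query),
# then divide sum by count to get the average.
res = [(k, s / c) for k, (s, c) in by_key.items() if c >= 1]
for rating, avg in res:
    print(rating, "{:.2f}".format(avg))
```

Keeping (sum, count) pairs rather than averaging early matters: averages of averages are wrong when group sizes differ, while sums and counts merge associatively across partitions.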



Submission:



1. Put all files in the same directory and compress it into a zip file. Zip file name format: LASTNAME_FIRSTNAME_HW5.zip

Make sure when the file is unzipped, the folder name is LASTNAME_FIRSTNAME_HW5


2. Your submission folder should contain 19 files and look EXACTLY like the listing shown; each extra file: (-2 pts).




Please understand how the TAs will run your Python scripts for Q2 and Q3. They will simply run:

./run.sh

Once done, your folder will look EXACTLY like:




After all ".res" files are generated, we will run python3 hw5_grade.py

Your folder will look EXACTLY like:


After running this command, a score.res file will be generated.

If your column names or order are incorrect, points will be deducted as mentioned above.

You can test your files with the given grading script before you submit them, but please make sure to delete all the .res files before compressing the folder and making your submission.

If you change a single byte in hw5_grade.py, run.sh, or solution.json, 50 pts will be deducted.