DAT 560M – Big Data and Cloud Computing 2023 – Homework #2
DAT 560M: Big Data and Cloud Computing
Fall 2023, Mini B
Homework #2
INSTRUCTIONS
1. This is an individual assignment. You may not discuss your approach to solving these
questions with anyone, other than the instructor or TA.
2. Please include only your Student ID on the submission.
3. The only allowed material is:
a. Class notes
b. Content posted on Canvas
c. Use ONLY the code we have practiced. Anything beyond that will not receive any points!
4. You are not permitted to use other online resources.
5. The submission is due by the next lab.
6. There will be TA office hours. See the schedule on the module.
For each question, submit your code and a screenshot of the results. If the results are too long, partial results are fine.
ASSIGNMENT
In this assignment, we are going to practice MapReduce on some data located in our dataset folder. In case you don’t know where the folder is, you may look here:
http://ip-address/dataset where you need to replace ip-address with the server IP address. For example, if the server IP address is 12.345.67.890 then the address would be:
http://12.345.67.890/dataset
Part 1: Linux Code Practice (15 pts, 5 each)
Please take a look at CarSale.csv in our dataset folder. We are especially interested in the first two columns: VIN and body_type.
a) Please write a command to see if there is any VIN that has been reported more than once.
b) Please write a command to find the count of each body_type. Which one is the most common?
c) Let’s do a simple word count using Linux commands. Find the 3 most common words in the description. You may ignore the non-words.
Part 2: MapReduce Practice
In this homework, we will do some real data scientist work with messy data. In both questions, try to bring data that is as clean as possible into your results.
a) Using only a mapper in debugging mode (meaning you don’t need to run it on HDFS), find the max, min, and average price for Jeeps (franchise_make = Jeep) (25 pts).
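As a rough illustration of what a mapper-only script for part a) could look like, the sketch below reads the CSV from standard input, keeps the rows whose franchise_make is Jeep, and skips prices that are missing or not numeric. The column name price and the rule that non-positive prices count as dirty data are assumptions, not something this handout states; check the actual header of CarSale.csv first.

```python
#!/usr/bin/env python3
"""Mapper-only sketch: max, min, and average price for Jeeps."""
import csv
import math
import sys

# Only franchise_make is named in the handout; "price" is an ASSUMED column name.
MAKE_COL = "franchise_make"
PRICE_COL = "price"


def jeep_price_stats(lines):
    """Return (max, min, average) of valid Jeep prices, or None if there are none."""
    count = 0
    total = 0.0
    highest = float("-inf")
    lowest = float("inf")
    for row in csv.DictReader(lines):
        make = (row.get(MAKE_COL) or "").strip()
        if make.lower() != "jeep":
            continue
        try:
            price = float((row.get(PRICE_COL) or "").strip())
        except ValueError:
            continue  # blank or non-numeric price
        if not math.isfinite(price) or price <= 0:
            continue  # ASSUMPTION: non-positive or non-finite prices are dirty data
        count += 1
        total += price
        highest = max(highest, price)
        lowest = min(lowest, price)
    if count == 0:
        return None
    return highest, lowest, total / count


if __name__ == "__main__":
    stats = jeep_price_stats(sys.stdin)
    if stats is None:
        print("No valid Jeep prices found")
    else:
        print("max\t%.2f\nmin\t%.2f\navg\t%.2f" % stats)
```

In debugging mode, a script like this could be run with cat CarSale.csv | python3 mapper.py, with no HDFS involved.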
b) We are interested in finding the number of colors per manufacturer. (50 pts)
Points:
• For the manufacturer, please refer to franchise_make.
• For the color, please refer to exterior_color.
Please note that you need to run your code on HDFS. For this question, you need 3 files: mapper.py, reducer.py, and bash.sh to run the MapReduce on HDFS.
Note: If you would like to have only one output file after MapReduce, add this line to your bash script (before the input line):
-Dmapred.reduce.tasks=1
i. (30 pts) Write MapReduce code to find the requested summary. You need a mapper, a reducer, and a bash file to run them on HDFS.
• Technically, this question asks for something similar to the following in SQL:
select make, color, count(color)
group by make, color
• Please remove rows with an empty maker, or with an empty or “none” color, from your results. As a hint, it is better to filter them out in the mapper.
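The group-by-and-count idea above can be sketched as a map step and a reduce step. This is only an illustration under assumptions: the column positions MAKE_IDX and COLOR_IDX are hypothetical and must be read off the real CarSale.csv header, lowercasing the colors is one possible cleaning choice, and joining make and color with "|" is just one way to make the pair a single key so that the sort between map and reduce groups identical pairs together. Both steps are shown in one block for brevity; for the submission each function would go into its own file (mapper.py and reducer.py).

```python
#!/usr/bin/env python3
"""Sketch of the map and reduce steps for counting colors per manufacturer."""
import csv
import sys
from itertools import groupby

# HYPOTHETICAL column positions: look them up in the real CarSale.csv header.
MAKE_IDX = 2
COLOR_IDX = 3
SEP = "|"  # joins make and color into one key


def clean(value):
    """Collapse whitespace (including tabs) and drop the key separator."""
    return " ".join(value.replace(SEP, " ").split())


def map_lines(lines):
    """Yield 'make|color<TAB>1' for every row with a usable make and color."""
    for line in lines:
        try:
            fields = next(csv.reader([line]))
        except (csv.Error, StopIteration):
            continue  # unparsable or empty line
        if len(fields) <= max(MAKE_IDX, COLOR_IDX):
            continue  # truncated row
        make = clean(fields[MAKE_IDX])
        color = clean(fields[COLOR_IDX]).lower()  # ASSUMPTION: case is noise
        if not make or make == "franchise_make":
            continue  # empty maker or header row
        if not color or color == "none":
            continue  # empty or "none" color
        yield f"{make}{SEP}{color}\t1"


def reduce_lines(lines):
    """Sum counts of consecutive identical keys; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if "\t" in line)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(kv[1]) for kv in group)
        make, color = key.split(SEP, 1)
        yield f"{make}\t{color}\t{total}"


# Only for local experiments: pass "map" or "reduce" as the first argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Filtering in the map step, as the hint suggests, keeps dirty rows from ever reaching the reducer, so the reduce step only has to add up counts.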
ii. (20 pts) After you get the results from HDFS, run Linux commands to find:
• (5 pts) The 5 most frequent combinations of maker and color.
• (5 pts) The most frequent color among all Toyotas.
• (10 pts) Sort your results in descending order by color. What are the manufacturer, color, and count in the first row?
2023-11-10