
INFS 4020 - Big Data Concepts

Practical Test 1 (SP5 2023)

Due: By 11PM on Thursday 24 August

General Instructions

.    This test is worth 5% of your final grade and it is due no later than 11pm on Thursday 24 August.

.    The test will be marked out of 10.

.    You will need to submit your work via learnonline in zip format.

.    You will need to STRICTLY follow the Submission Instructions.

Assessment Tasks

In this assessment you are required to write two MapReduce programs and run them on data stored in the Hadoop Distributed File System (HDFS).

First, create a directory for this assessment called test1 within the /home/prac/ directory, as we normally do in practicals. From here you should be able to follow the directions in Practical 2 to write and run your MapReduce programs.

Our input file (or data) for this assessment will be the text version of Common Sense by Thomas Paine, which is in the public domain. Copy the file into your input folder and rename it to test_input.txt.

Then, create the /user/prac/test1 directory within HDFS.

Lastly, upload our input file (test_input.txt) to the /user/prac/test1/input HDFS directory, and run our programs using Hadoop Streaming. The output for Q1 should be put in /user/prac/test1/output1 and the output for Q2 should be put in /user/prac/test1/output2.

Q1. Write a MapReduce program to determine the frequency of word lengths within an input file.

The program should return how many times each word length appears within the text file. For example, in the following text

This is a line of text.

The lengths of the words are

4 2 1 4 2 5

So the output file would look something like

1      1

2      2

4      2

5      1

indicating that there is one word with one letter, two words with two letters, two words with four letters and one word with five letters.
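The map and reduce steps for this kind of count can be sketched locally as below. This is only an illustrative sketch, not the expected solution: the function names (map_word_lengths, reduce_counts, run_locally) are hypothetical, and it assumes that a "word" is any whitespace-separated token with punctuation included, which is what the example above implies by giving "text." a length of 5. In an actual Hadoop Streaming job the mapper and reducer would be separate scripts that read from standard input and print tab-separated key/value lines, with the sort between them done by Hadoop.

```python
from itertools import groupby


def map_word_lengths(lines):
    """Map step: emit (word_length, 1) for every whitespace-separated token.

    Assumption: punctuation attached to a token counts towards its length.
    """
    for line in lines:
        for token in line.split():
            yield len(token), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each key.

    The pairs must already be sorted by key, as they would be after the
    shuffle/sort phase of a MapReduce job.
    """
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(count for _, count in group)


def run_locally(lines):
    """Simulate map -> sort -> reduce in a single process for testing."""
    return list(reduce_counts(sorted(map_word_lengths(lines))))


# Print the result in the same tab-separated layout as the example output.
for length, count in run_locally(["This is a line of text."]):
    print(f"{length}\t{count}")
```

Running the sketch on the example sentence gives the four length/count pairs shown in the example output above.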

Q2. Write a MapReduce program to determine the frequency of words by their first letter.

The program should return how many words in the text file start with each letter. For example, in the text

This is a line of text.

The output file would look something like

a    1

i    1

l     1

o    1

t    2

indicating that there is one word starting with ‘a’, one word starting with ‘i’, one word starting with ‘l’, one word starting with ‘o’, and two words starting with ‘t’ in the file.

The program should not differentiate between upper case and lower case letters. So, a word starting with an ‘A’ should be counted in the same group as a word starting with an ‘a’.

Do not count words that start with numbers or other special characters.
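The case-folding and filtering rules above can be sketched locally as follows. This is a hedged illustration rather than the required implementation: the function names (map_first_letters, reduce_counts, run_locally) are hypothetical, and it assumes that words are whitespace-separated tokens and that only the ASCII letters a to z count as valid first letters. As with any Hadoop Streaming job, the real mapper and reducer would be separate scripts exchanging tab-separated lines through standard input and output.

```python
import string
from itertools import groupby


def map_first_letters(lines):
    """Map step: emit (first_letter, 1) for tokens that start with a letter.

    The first character is lower-cased so 'A' and 'a' share one key.
    Assumption: only ASCII letters a-z are counted; tokens starting with
    digits or other characters are skipped.
    """
    for line in lines:
        for token in line.split():
            first = token[0].lower()
            if first in string.ascii_lowercase:
                yield first, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each key (pairs sorted by key)."""
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(count for _, count in group)


def run_locally(lines):
    """Simulate map -> sort -> reduce in a single process for testing."""
    return list(reduce_counts(sorted(map_first_letters(lines))))


# Print the result in the same tab-separated layout as the example output.
for letter, count in run_locally(["This is a line of text."]):
    print(f"{letter}\t{count}")
```

Testing the logic this way before submitting the job makes it easier to check the upper/lower case handling and the skipping of tokens that start with digits or symbols.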

Submission Instructions

You should submit two files for each question with the exact names (four files in total as shown below):

.    mapperq*.py

.    reducerq*.py

Here is the structure of what it looks like in your submission zip file.

Distribution of marks

All questions program code – 2 marks each

All questions output – 2 marks each

All questions code presentation – 1 mark each (good variable naming and comments)

Total of 10 marks