
DAT 560M: Big Data and Cloud Computing

Fall 2023, Mini B

Lab #2

INSTRUCTIONS

1. This is a group assignment, to be worked on during the lab.

2. Use ONLY the commands we practice.

3. Please submit the answers on Canvas.

4. Only one submission per group is sufficient.

COMMANDS TO PRACTICE

COMMAND     EXPLANATION

whoami      Shows the current username in the Linux environment.

pwd         Shows the current folder you are in.

ls          Short for list. Lists the files and folders in your current directory (hidden ones excluded).

ls -a       Shows all files, including hidden ones.

ls -lh      Shows files and folders in detail, with sizes readable by humans.

ll          Shows files and folders in detail. On many systems it is an alias for ls with the -l option; compare its output with ls -lh.

cd          Stands for change directory. We use it to go up, go down, go to root, …

cd          If you type just cd and press Enter, you are taken to your home directory from wherever you are.

cd ..       Takes you to the parent directory. For example, from /a/bb/cc to /a/bb.

~           Stands for your home directory. Type cd ~ and you are taken to your home directory.

mkdir       Makes a directory.

touch       Makes an empty file.

nano        One of the editors in the Linux terminal environment. There are other popular editors like vi, gedit, …

cp          Copy command in Linux. Example: cp a.txt tmp/a.txt

mv          Move command in Linux. You can also use it to rename files and folders.

rm          Delete command in Linux. It is used for files.

rm -r       Deletes a folder and everything in it, whether the folder is empty or not.

rmdir       Deletes a folder, but only if the folder is empty.

clear       Clears the terminal.

top/htop    Shows/visualizes the users, processes, memory usage, …

wget        Downloads a file from the internet.

head        Shows the first n rows of a given file (10 by default; set n with -n).

tail        Shows the last n rows of a given file (10 by default; set n with -n).

>           Redirection character. It writes the result of a command to a file, overwriting the file if it already exists.

>>          Appends the result of a command to the end of a file.

|           Piping character. It passes the result of one command to the next command as its input.

cat         Displays the file on the screen.

wc          Short for word count. Shows how many lines, words, and bytes are in a given file.

wc -l       Shows the number of lines of a given file.

wc -c       Shows the number of bytes of a given file (equal to the number of characters for plain ASCII text).

wc -w       Shows the number of words of a given file.

shuf        Shuffles the rows of a file randomly. It is used to create a sample of a file.

shuf -n     Takes n randomly selected rows of a file.

chmod       Changes the permissions of a file.

chown       Changes the ownership of a file.

history     Lists the commands that the user has run so far.

grep        Searches for a pattern within the given file.

grep -i     Makes the pattern search case-insensitive.

grep -c     Counts the number of lines that contain the pattern.

man         Shows the manual page of almost any Linux command.

--help      An option accepted by most Linux commands to print help.

cut         Extracts a segment (for example, a column) from each line of a given file.

cut -d      Specifies the delimiter of the given file.

cut -f      Specifies which field(s) of the given file to extract.

uniq        Collapses repeated lines in a given file. It only compares adjacent lines, so always sort before using uniq.

uniq -d     Shows only the rows that are duplicated in a given file.

uniq -u     Shows only the rows that are not duplicated in a given file.

uniq -c     Counts the occurrences of each distinct line of a given file.

tr          Changes or removes characters in the given data. Remember that tr does not open a file itself; the data should be piped into tr.

tr -d       Removes the given characters from the given data.

sort        Sorts the given data alphabetically (numerically with -n).

sort -r     Sorts the given data in reverse order. Use the lowercase -r; in GNU sort, the uppercase -R shuffles the data instead.

awk         A scripting language used for manipulating data and generating reports.
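As a minimal sketch of how several of these commands combine in a pipeline, the block below builds a small throwaway CSV and summarizes it. The file names (pets.csv, dogs.csv) and their contents are made up for illustration and have nothing to do with the lab dataset.

```shell
# Build a tiny sample file in a temporary directory (hypothetical data).
workdir=$(mktemp -d)
cat > "$workdir/pets.csv" <<'EOF'
name,animal
Rex,dog
Tom,cat
Fido,dog
Luna,Cat
Bo,dog
EOF

# Count the data rows: tail -n +2 skips the header, wc -l counts lines.
# tr -d ' ' strips the padding some wc versions print.
rows=$(tail -n +2 "$workdir/pets.csv" | wc -l | tr -d ' ')

# Take column 2, lowercase it with tr, and sort so equal values are adjacent.
# uniq -c then counts each value and sort -nr puts the largest count first.
top=$(tail -n +2 "$workdir/pets.csv" | cut -d ',' -f 2 | tr 'A-Z' 'a-z' \
      | sort | uniq -c | sort -nr | head -n 1 | awk '{print $2}')

# grep -i ignores case and -c counts the matching lines.
cats=$(grep -i -c ',cat$' "$workdir/pets.csv")

# awk splits each line on the -F delimiter. Keep the header (NR == 1) and the
# rows whose second field is "dog", then redirect the result into a new file.
awk -F ',' 'NR == 1 || $2 == "dog"' "$workdir/pets.csv" > "$workdir/dogs.csv"
dog_lines=$(wc -l < "$workdir/dogs.csv" | tr -d ' ')

echo "rows=$rows top=$top cats=$cats dog_lines=$dog_lines"
rm -r "$workdir"
```

Sorting before uniq -c matters here: without it, the two dog rows separated by a cat row would be counted as separate groups.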

ASSIGNMENT

In this assignment, we are going to work with a file named fda_c.csv, which is in the dataset folder on the server. The location of the file is “/var/www/html/dataset/fda_c.csv”. This dataset is a huge file, and we do not want to have multiple copies of it. Hence, please do not copy the file into your home directory; use it directly from its location. The dataset has the following columns:

• Primary_id: ID of a patient's visit

• Drug_seq: Sequence of the used drug

• Caseid: ID of the patient

• Indi_pt: Disease name

• Drugname: Name of the used drug

• Age: Age of the patient

• Gndr_cod: Gender of the patient

• Wt: Weight of the patient

• Reporter_country: The country from which the patient record was reported

• De: Whether the patient has passed away

• Lt: Whether the patient has a life-threatening record

• Ho: Whether the patient has a hospitalization record

• Ds: Whether the patient has a disability record

• Ca: Whether the patient has a congenital anomaly record

• Ri: Required intervention

• Ot: Other outcomes

• Pt: Reaction
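A column layout like this one can be inspected without opening the whole file. The sketch below uses a toy stand-in file with a few of the column names listed above. It assumes the data is comma-delimited with the header in the first row, which the handout does not state explicitly, and the sample rows are invented.

```shell
# Toy stand-in for the real file (assumption: comma-delimited, header in row 1).
sample=$(mktemp)
cat > "$sample" <<'EOF'
primary_id,drug_seq,caseid,indi_pt,drugname
1001,1,501,pain,aspirin
1002,1,502,type 2 diabetes mellitus,metformin
EOF

# Read only the header row, then turn each comma into a newline.
columns=$(head -n 1 "$sample" | tr ',' '\n')
echo "$columns"

# grep -n prefixes each match with its line number, which here equals the
# column position usable with cut -f; -x accepts whole-line matches only.
indi_pos=$(echo "$columns" | grep -n -x 'indi_pt' | cut -d ':' -f 1)
echo "indi_pt is column $indi_pos"

# Pull that column by position, skipping the header with tail -n +2.
first_value=$(tail -n +2 "$sample" | cut -d ',' -f "$indi_pos" | head -n 1)
echo "first value: $first_value"
rm "$sample"
```

Splitting on every comma is a simplification: it gives wrong positions if a field contains a quoted comma, so it is worth checking a few rows by eye first.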

1- Get to know the columns of fda_c.csv. To do so, print each column name on its own line.

2- Get the disease name column of this file and return only the first 10 diseases. Note: Remember to ignore the header (indi_pt stands for indication preferred term).

3- Find the most frequent disease in this dataset.

4- Among the patients with diabetes (any type), which gender is the most prominent? How many patients have that gender? Consider only m and f, for male and female patients. Note: Assume that diabetes appears only in the disease column.

5- Find the patients who have passed away and save them in your local directory in a new file named fda_c_dead.csv. Hint: If a patient is dead, the “de” column is set to “death”. You may use awk.

6- We are interested in doing more analysis on the diseases, so we would like to see the 10 most frequent words in the disease column. Let's find the frequency of words using the wordcount MapReduce job. To make the process faster, let's look at only the first 100,000 rows.

To do this, complete the following tasks:

a. Get only the disease column from the first 100,000 rows.

b. Upload your file into HDFS.

c. Run MapReduce job on your file.

d. Run a bash command on your file in HDFS to find the top 25 words. Hint: You may use the sort command. Remove the words that do not look right.

e. Visualize the top 25 words with any tool you may know (Excel, Python, R, …)
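The ranking step in task d can be rehearsed locally before touching the cluster. The sketch below is a local stand-in, not the MapReduce job itself: it emulates wordcount with tr, sort, and uniq on a few invented disease names, writes the result in the word, tab, count layout that the Hadoop wordcount example produces, and then ranks it. The cluster commands appear only as comments, and the HDFS paths and jar location in them are placeholders that vary by installation.

```shell
# On the cluster, the steps would look roughly like this (placeholders, not real paths):
#   hdfs dfs -put diseases.txt /user/<you>/diseases.txt
#   hadoop jar <path-to>/hadoop-mapreduce-examples.jar wordcount \
#       /user/<you>/diseases.txt /user/<you>/wc_out
#   hdfs dfs -cat /user/<you>/wc_out/part-r-* | sort -k 2,2nr | head -n 25

# Local rehearsal on invented data.
work=$(mktemp -d)
printf 'type 2 diabetes mellitus\nchronic pain\ntype 1 diabetes mellitus\nback pain\ndiabetes\n' \
  > "$work/diseases.txt"

# Stand-in for the wordcount job: put one word per line, then count each word.
# awk reorders uniq -c output into "word<TAB>count".
tr -s ' ' '\n' < "$work/diseases.txt" | sort | uniq -c \
  | awk '{print $2 "\t" $1}' > "$work/wordcount.tsv"

# Rank by the count in column 2, numerically, largest first.
top3=$(sort -k 2,2nr "$work/wordcount.tsv" | head -n 3)
echo "$top3"

# cut splits on tabs by default, so field 1 is the word and field 2 the count.
top_word=$(echo "$top3" | head -n 1 | cut -f 1)
top_count=$(echo "$top3" | head -n 1 | cut -f 2)
rm -r "$work"
```

Note that tokens such as 1 and 2 show up as words in this toy output, which is the kind of entry the hint in task d suggests removing.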