DAT 560M – Big Data and Cloud Computing 2023 – Lab #2
DAT 560M: Big Data and Cloud Computing
Fall 2023, Mini B
Lab #2
INSTRUCTIONS
1. This is a group assignment, to be worked on during the lab.
2. Use ONLY the commands we practiced.
3. Please submit the answers on Canvas.
4. One submission per group is sufficient.
COMMANDS TO PRACTICE
COMMAND     EXPLANATION
whoami      Shows the current username in the Linux environment.
pwd         Shows the folder you are currently in.
ls          Short for "list". Lists all files and folders in your current directory.
ls -a       Shows all files, including hidden ones.
ls -lh      Shows files and folders in detail, with sizes in human-readable units.
ll          Shows files and folders in detail (an alias on many systems). Compare its output with ls -lh.
cd          Stands for "change directory". Use it to go up, go down, go to the root, …
cd          If you type cd alone and press Enter, you are taken to your home directory from wherever you are.
cd ..       Takes you up one directory. For example, from /a/bb/cc to /a/bb.
~           Stands for your home directory. Type cd ~ and you are taken to your home directory.
mkdir       Makes a directory.
touch       Makes an empty file.
nano        One of the text editors in the Linux terminal environment. Other popular editors include vi, gedit, …
cp          Copy command in Linux. Example: cp a.txt tmp/a.txt
mv          Move command in Linux. You can also use it to rename files and folders.
rm          Delete command in Linux. It is used for files.
rm -r       Deletes a folder, whether it is empty or not.
rmdir       Deletes a folder, but only if the folder is empty.
clear       Clears the terminal.
top/htop    Shows the users, processes, memory usage, …
wget        Downloads a file from the internet.
head        Shows the first n rows of a given file (10 by default).
tail        Shows the last n rows of a given file (10 by default).
>           Redirection character. Writes the output of a command to a file, overwriting the file if it exists.
>>          Appends the output of a command to the end of a file.
|           Pipe character. Passes the output of one command as input to the next command.
cat         Displays a file on the screen.
wc          Word count. Shows how many lines, words, and bytes are in a given file.
wc -l       Shows the number of lines in a given file.
wc -c       Shows the number of bytes in a given file (the same as the number of characters for plain ASCII text).
wc -w       Shows the number of words in a given file.
shuf        Shuffles the rows of a file. Used to create a random sample.
shuf -n     Takes n randomly selected rows of a file.
chmod       Changes the permissions of a file.
chown       Changes the ownership of a file.
history     Lists the commands that the user has run so far.
grep        Searches for a pattern within a given file.
grep -i     Makes the search case-insensitive.
grep -c     Counts the number of lines that contain the pattern.
man         Shows the manual page for almost any Linux command.
--help      An option accepted by most Linux commands to print a help message.
cut         Extracts a segment (column) of a given file.
cut -d      Specifies the delimiter of the given file.
cut -f      Specifies which field(s) of the given file to extract.
uniq        Collapses repeated adjacent rows in a given file. Always sort before using uniq.
uniq -d     Shows only the rows that are duplicated.
uniq -u     Shows only the rows that are not repeated.
uniq -c     Counts the occurrences of each distinct row.
tr          Translates or removes characters in the given data. Remember that tr does not open a file itself; the data must be piped into tr.
tr -d       Removes the given character(s) from the given data.
sort        Sorts the given data alphabetically/numerically.
sort -r     Sorts the given data in reverse order.
awk         A scripting language used for manipulating data and generating reports.
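As a quick illustration of how several of these commands combine, here is a minimal sketch on a small made-up file. The file names fruits.txt and unique.txt and their contents are hypothetical, chosen only for this example. Note that Linux commands and their options are case-sensitive, so they are typed in lowercase here.

```shell
# Work in a throwaway directory so nothing in the home directory is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create a small hypothetical sample file, one fruit per line.
printf 'apple\nbanana\napple\ncherry\nbanana\napple\n' > fruits.txt

# wc -l counts lines; reading from stdin keeps the file name out of the output.
line_count=$(wc -l < fruits.txt | tr -d ' ')

# sort groups identical lines together, uniq -c counts each group,
# sort -nr orders by count (largest first), head -n 1 keeps the top row,
# and awk prints the second field (the fruit name).
top_fruit=$(sort fruits.txt | uniq -c | sort -nr | head -n 1 | awk '{print $2}')

# grep -c counts matching lines; -i makes the match case-insensitive.
banana_count=$(grep -ic 'BANANA' fruits.txt)

# > overwrites a file, >> appends to it.
sort fruits.txt | uniq > unique.txt
echo 'date' >> unique.txt
unique_count=$(wc -l < unique.txt | tr -d ' ')

echo "lines=$line_count top=$top_fruit bananas=$banana_count unique=$unique_count"
```

The sort before uniq matters: uniq only compares neighbouring lines, so unsorted input would leave repeated values uncollapsed.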
ASSIGNMENT
In this assignment, we are going to work with a file named fda_c.csv, which is in the dataset folder on the server. The location of the file is /var/www/html/dataset/fda_c.csv. This dataset is a huge file, and we do not want to have multiple copies of it. Hence, please do not copy the file into your home directory; use it directly from its location. The dataset has the following columns:
Primary_id: ID of a patient's visit
Drug_seq: Sequence of the used drug
Caseid: ID of the patient
Indi_pt: Disease name
Drugname: Name of the used drug
Age: Age of the patient
Gndr_cod: Gender of the patient.
Wt: Weight of the patient
Reporter_country: The country from which the patient record was reported
De: Whether the patient has passed away
Lt: Whether the patient has a life-threatening record
Ho: Whether the patient has a hospitalization record
Ds: Whether the patient has a disability record
Ca: Whether the patient has a congenital anomaly record
Ri: Required Intervention
Ot: Other outcomes
Pt: Reaction
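To make the comma-delimited layout concrete, the sketch below builds a tiny stand-in file and inspects it with the practiced commands. The file sample.csv, its four columns, and its rows are invented for illustration and are not taken from fda_c.csv. It also assumes that no field contains a comma inside quotes, which cut and a plain awk -F ',' would not handle.

```shell
workdir=$(mktemp -d)
sample="$workdir/sample.csv"

# Hypothetical four-column extract; the real file has more columns and rows.
cat > "$sample" <<'EOF'
caseid,indi_pt,gndr_cod,de
101,asthma,f,
102,type 2 diabetes mellitus,m,death
103,asthma,m,
104,hypertension,f,death
EOF

# Header only, one column name per line: tr turns each comma into a newline.
columns=$(head -n 1 "$sample" | tr ',' '\n')

# Second column without the header row: tail -n +2 starts output at line 2.
diseases=$(tail -n +2 "$sample" | cut -d ',' -f 2)

# Keep the header (NR == 1) plus rows whose fourth field equals "death".
awk -F ',' 'NR == 1 || $4 == "death"' "$sample" > "$workdir/sample_dead.csv"
dead_rows=$(tail -n +2 "$workdir/sample_dead.csv" | wc -l | tr -d ' ')

echo "$columns"
echo "$diseases"
echo "dead rows: $dead_rows"
```

Because every command reads the file by its path, the same pattern works on a large file without copying it first.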
1- Get to know the columns of fda_c.csv. To do so, print each column name on its own line.
2- Get the disease name column of this file and return only the first 10 diseases. Note: Remember to ignore the header (indi_pt stands for indication preferred term).
3- Find the most frequent disease in this database.
4- Among the patients with diabetes (any type), which gender is the most common? How many patients is that? Just look at m and f for male and female patients. Note: Assume that diabetes appears only in the disease column.
5- Find the patients who have passed away and save them in your local directory in a new file named fda_c_dead.csv. Hint: If a patient is dead, then the “de” column is marked as “death”. You may use awk.
6- We are interested in doing more analysis on the diseases, so we would like to see the 10 most frequent words in the disease column. Let’s find the frequency of words using wordcount MapReduce. To make the process faster, let’s look at only the first 100,000 rows.
To do this, complete the following tasks:
a. Get only the disease column from the first 100,000 rows.
b. Upload your file into HDFS.
c. Run a MapReduce job on your file.
d. Run a bash command on your file in HDFS to find the top 25 words. Hint: You may use the sort command. Remove the ones that don’t look right.
e. Visualize the top 25 words with any tool you may know (Excel, Python, R, …)
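The word-count idea behind task 6 can be sketched locally with pipes before running it on the cluster. The example below is only an emulation on made-up data: the file diseases.txt and its five lines are hypothetical, and no Hadoop command is executed. On the server, the upload step would typically use hdfs dfs -put and the counting step the wordcount example shipped with Hadoop, whose jar path depends on the installation.

```shell
workdir=$(mktemp -d)

# Hypothetical disease column: one indication per line.
cat > "$workdir/diseases.txt" <<'EOF'
type 2 diabetes mellitus
diabetes mellitus
rheumatoid arthritis
psoriatic arthritis
type 1 diabetes mellitus
EOF

# Map: split every line into one word per line (tr -s squeezes repeated spaces).
# Shuffle: sort brings identical words together.
# Reduce: uniq -c counts each group; awk rewrites "count word" as
# "word<TAB>count", mirroring the layout of Hadoop's wordcount output.
tr -s ' ' '\n' < "$workdir/diseases.txt" \
  | sort \
  | uniq -c \
  | awk '{print $2 "\t" $1}' > "$workdir/wordcount.tsv"

# Top words: sort numerically on the second (count) field, largest first,
# breaking ties alphabetically on the word itself.
top3=$(sort -k2,2nr -k1,1 "$workdir/wordcount.tsv" | head -n 3)
echo "$top3"
```

In this toy data the bare numbers 1 and 2 are counted as words too, which is the kind of entry one might filter out before visualizing the result.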
2023-12-08