DAT 560M: Big Data and Cloud Computing

Fall 2023, Mini B

Lab #2

INSTRUCTIONS

1. This is a group assignment, to be worked on during the lab.

2. Use ONLY the commands we practiced.

3. Please submit the answers on Canvas.

4. One submission per group is sufficient.

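COMMANDS TO PRACTICE

COMMAND EXPLANATION (commands are listed in uppercase here; type them in lowercase in the terminal, because Linux commands are case-sensitive)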

WHOAMI Shows the current username in the Linux environment.

PWD Shows the folder you are currently in.

LS Short for “list”. It lists all files and folders in your current directory.

LS -A Shows all files, even hidden ones.

LS -LH Shows files and folders in detail, with sizes in human-readable units.

LL Shows files and folders in detail. Compare its output with that of ls -lh.

CD Stands for “change directory”. We use it to go up, go down, go to root, …

CD If you type cd by itself and press Enter, you are taken to your home directory, wherever you are.

CD .. Takes you to the parent directory. For example, from /aa/bb/cc to /aa/bb.

~ It stands for your home directory. Type cd ~ and you’ll be directed to your home directory.

MKDIR Makes a directory.

TOUCH Makes an empty file.

NANO One of the editors available in the Linux terminal environment. There are other popular editors like vi, gedit, …

CP Copy command in Linux. Example: cp a.txt tmp/a.txt

MV Move command in Linux. You can also use it to rename files and folders.

RM Delete command in Linux. It is used for files.

RM -R Deletes a folder and everything inside it, whether the folder is empty or not.

RMDIR Deletes a folder, but only if the folder is empty.

CLEAR Clears the terminal.

TOP/HTOP Shows the running processes along with their users, memory usage, and …

WGET Used to download a file from the internet.

HEAD Shows the first n lines of a given file (10 by default).

TAIL Shows the last n lines of a given file (10 by default).

> The redirection character. It writes the output of a command to a file, replacing the file’s contents if it already exists.

>> Appends the output of a command to the end of a file.

| The pipe character. It passes the output of one command as the input to the next command.

CAT It is used to display the file on the screen.

WC Short for “word count”. It shows how many lines, words, and bytes are in a given file.

WC -L Shows the number of lines of a given file.

WC -C Shows the number of bytes of a given file (the same as the number of characters for plain ASCII text).

WC -W Shows the number of words of a given file.

SHUF Shuffles the lines of a file into random order. It is used to create a sample of a file.

SHUF -N Takes n randomly selected rows of a file.

CHMOD Changes the permissions of a file.

CHOWN Changes the ownership of a file.

HISTORY Lists the commands that the user has run so far.

GREP Searches for a pattern within a given file.

GREP -I Makes the search case-insensitive.

GREP -C Counts the number of lines that contain the pattern.

MAN Shows the manual page of almost any Linux command.

--HELP An option accepted by most Linux commands that prints a short usage summary.

CUT Extracts a segment of each line of a given file.

CUT -D Specifies the delimiter of the given file.

CUT -F Specifies which fields (columns) of the given file to extract.

UNIQ Collapses repeated lines in a given file. Always sort before using the uniq command, because uniq only compares adjacent lines.

UNIQ -D Shows only the lines that are repeated, one copy of each.

UNIQ -U Shows only the lines that are not repeated.

UNIQ -C Counts the occurrences of each distinct line of a given file.

TR Changes or removes characters in the given data. Remember that tr does not take a file name as an argument; the data should be piped into tr.

TR -D Removes the given characters from the given data.

SORT Sorts the given data alphabetically/numerically.

SORT -R Sorts the given data in reverse order.

AWK Awk is a scripting language used for manipulating data and generating reports.
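As a quick illustration of how several of the commands above fit together, here is a minimal sketch that works in a throwaway folder. The folder, the file names (fruits.txt, notes) and the file contents are made up for this example and are not part of the lab.

```shell
#!/usr/bin/env bash
# Toy walk-through in a scratch folder; every name below is hypothetical.
workdir=$(mktemp -d)          # temporary folder so nothing real is touched
cd "$workdir"

mkdir notes                   # make a directory
touch notes/empty.txt         # make an empty file
printf 'apple\nBanana\napple\n' > fruits.txt   # > creates or overwrites the file
echo 'cherry' >> fruits.txt                    # >> appends to the end

line_count=$(wc -l < fruits.txt)               # number of lines
first_line=$(head -n 1 fruits.txt)             # first line only
last_line=$(tail -n 1 fruits.txt)              # last line only
apple_lines=$(grep -i -c 'APPLE' fruits.txt)   # -i ignores case, -c counts matching lines

cp fruits.txt notes/fruits_copy.txt            # copy a file
mv notes/fruits_copy.txt notes/fruits_old.txt  # mv also renames
rm notes/empty.txt                             # delete a file
notes_listing=$(ls notes)                      # what is left in notes

# Sort first, then count: uniq only compares adjacent lines.
sort fruits.txt | uniq -c

cd - > /dev/null              # go back to the previous directory
rm -r "$workdir"              # delete the scratch folder and its contents
```

Running man on any of these commands, or adding --help, shows the remaining options.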

ASSIGNMENT

In this assignment, we are going to work with a file named fda_c.csv, which is in the dataset folder on the server. The location of the file is “/var/www/html/dataset/fda_c.csv”. This dataset is a huge file, and we don’t want to have multiple copies of it. Hence, please don’t copy the file into your home directory; use it directly from its location. This dataset has the following columns:

• Primary_id: ID of a patient’s visit

• Drug_seq: Sequence of the used drug

• Caseid: ID of the patient

• Indi_pt: Disease name

• Drugname: Name of the used drug

• Age: Age of the patient

• Gndr_cod: Gender of the patient

• Wt: Weight of the patient

• Reporter_country: The country from which the patient record was reported

• De: Whether the patient has passed away

• Lt: Whether the patient has a life-threatening record

• Ho: Whether the patient has a hospitalization record

• Ds: Whether the patient has a disability record

• Ca: Whether the patient has a congenital anomaly record

• Ri: Required intervention

• Ot: Other outcomes

• Pt: Reaction
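To get a feel for a comma-separated file laid out like this, the following sketch builds a tiny stand-in file and inspects it. The five-column header is a hypothetical subset of the columns listed above, the rows are made up, and the lowercase column names are an assumption about how the header is written. The real file is not touched.

```shell
#!/usr/bin/env bash
# Toy stand-in for the real file: hypothetical five-column subset, made-up rows.
sample=$(mktemp)
cat > "$sample" <<'EOF'
primary_id,indi_pt,drugname,gndr_cod,de
1001,hypertension,drug_a,f,
1002,type 2 diabetes mellitus,drug_b,m,death
1003,hypertension,drug_c,m,
1004,asthma,drug_a,f,death
EOF

# Header only, one column name per line: tr turns each comma into a newline.
columns=$(head -n 1 "$sample" | tr ',' '\n')
echo "$columns"

# Second column without the header row: tail -n +2 starts printing at line 2.
indications=$(tail -n +2 "$sample" | cut -d ',' -f 2)
echo "$indications"

# Rows whose fifth field is "death"; NR == 1 keeps the header line as well.
deaths=$(awk -F ',' 'NR == 1 || $5 == "death"' "$sample")
echo "$deaths"

rm "$sample"                  # remove the toy file
```

Splitting on commas this way assumes that no field contains a comma inside quotes; whether that holds for the real file would need to be checked.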

1- Get to know the columns of fda_c.csv. To do so, print each column name on its own line.

2- Get the disease name column of this file and return only the first 10 diseases. Note: Remember to ignore the header (indi_pt stands for indication preferred term).

3- Find the most frequent disease in this database.

4- Among the patients with diabetes (any type), which gender is the most common? How many patients have that gender? Just look at m and f for male and female patients. Note: Assume that diabetes only appears in the disease column.

5- Find the patients who have passed away and save them in your local directory in a new file named fda_c_dead.csv. Hint: If a patient is dead, then the “de” column contains “death”. You may use awk.

6- We are interested in doing more analysis on the diseases, so we would like to see the 10 most frequent words in the disease column. Let’s find the frequency of words using the wordcount MapReduce job. To make the process faster, let’s look at only the first 100,000 rows.

To do this, complete the following tasks:

a. Get only the disease column from the first 100,000 rows.

b. Upload your file into HDFS.

c. Run a MapReduce job on your file.

d. Run a bash command on your file in HDFS to find the top 25 words. Hint: You may use the sort command. Remove the ones that don’t sound right.

e. Visualize the top 25 words with any tool you may know (Excel, Python, R, …)

e. Visualize the top 25 words with any tool you may know (Excel, Python, R, …)
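Before running anything on the cluster, the idea of a word count can be checked locally on a few lines. The sketch below is not the MapReduce job itself and does not use HDFS; it only mimics the counting with commands from the table, on a made-up toy file. The -n flag of sort (numeric sort) is not described in the table and is used here as an extra.

```shell
#!/usr/bin/env bash
# Hypothetical toy input: one disease name per line (made-up values).
terms=$(mktemp)
printf '%s\n' 'type 2 diabetes mellitus' 'diabetes mellitus' \
              'rheumatoid arthritis' 'diabetes' > "$terms"

# 1) tr puts each word on its own line (data is redirected into tr)
# 2) sort groups identical words so that uniq -c can count them
# 3) sort -nr orders by count, largest first; head keeps the top rows
top_words=$(tr ' ' '\n' < "$terms" | sort | uniq -c | sort -nr | head -n 2)
echo "$top_words"

rm "$terms"                   # remove the toy file
```

Each output line holds a count followed by a word, so the most frequent words appear at the top.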

2023-12-08
