
DATA3888 Practice questions for the Computer Quiz

Fill in your SID here - TEST STUDENT

Week 6/7

## This document was compiled on: 2023-04-06 14:50:47 in Australia/Sydney

Instructions

1. At the start of your RStudio session, save this R Markdown (Rmd) template file as “SIDXXXRQuiz.Rmd”, where XXX is your Student ID.

2. Put your Student ID at the top of the Rmd file (NOT your name).

3. This is an open-book quiz, and you are allowed to search the Internet and the course webpage for any descriptions and R code that may help you solve the questions.

4. Tips: You are expected to label ALL graphics properly, including legends where appropriate. Suitable rounding is expected for all numerical expressions. Finally, for many questions, simply typing the relevant R code is not adequate, and you are required to answer the questions in words.

5. We suggest you begin by knitting your work to make sure the template works on your system.

6. It's a good idea to submit your work regularly - if you end up submitting the final version late, the marker can go back and mark only the on-time submission.

7. You may find that keeping an eye on the time is a challenge. We suggest you set an alarm on your phone and give yourself plenty of time for submission.

You will require the following libraries:

library(ggthemes)
library(tidyverse)
library(class)
library(cvTools)
library(ggplot2)
library(e1071)
library(glmnet)
library(tuneR)
library(EBImage)

Question 1 - Brain Signal

One tutor has generated a series of signals from left and right eye movements using the Spikerbox. These signals were captured and saved in WAV format in the file “LRL_L1.wav”. Read the .wav file using the tuneR package's readWave function.

(a)

Skill test: write a while loop to identify the number of terms required before the product 1 × 2 × 3 × 4 × ⋯ exceeds 10 million.

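n = 1   # number of terms multiplied so far
p = n   # running product 1 * 2 * ... * n
while (p < 1e7) {
  n = n + 1
  p = p * n
}
n - 1   # last number of terms with the product still below 10 million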


## [1] 10

Answer:

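The answer is 10: the product of the first 10 terms (10! = 3,628,800) is still below 10 million, while including the 11th term (11! = 39,916,800) takes it above.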
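As a cross-check on the loop, the same count can be read off the running products directly. This is a minimal sketch rather than part of the quiz template; the names `running_product`, `terms_below` and `first_above` are illustrative.

```r
# Running products 1, 1*2, 1*2*3, ... for the first 15 terms
# (as.numeric keeps the arithmetic in double precision).
running_product <- cumprod(as.numeric(1:15))

# Number of terms whose running product is still below 10 million
terms_below <- sum(running_product < 1e7)

# Smallest number of terms whose product exceeds 10 million
first_above <- which(running_product > 1e7)[1]

terms_below   # 10
first_above   # 11
```

The two numbers differ by one, which is why the loop above reports `n - 1` rather than `n`.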

(b) What is the dimension of this .wav file?


## Answer

waveSeq <- readWave("LRL_L1.wav")

waveSeq


##
## Wave Object
## Number of Samples:      88246
## Duration (seconds):     8.82
## Samplingrate (Hertz):   10000
## Channels (Mono/Stereo): Mono
## PCM (integer format): TRUE
## Bit (8/16/24/32/64):    16


slotNames(waveSeq)


## [1] "left"      "right"     "stereo"    "samp.rate" "bit"       "pcm"


Answer: The file contains 88246 samples in a single (mono) channel, recorded at 10000 Hz (about 8.82 seconds).
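The sample count, sampling rate and duration in the printed summary are tied together by one division. The short sketch below redoes that arithmetic with the values copied from the output; the variable names are illustrative.

```r
# Values copied from the printed Wave summary
n_samples <- 88246
samp_rate <- 10000   # samples per second (Hz)

# Duration in seconds is the number of samples divided by the sampling rate
duration <- n_samples / samp_rate
round(duration, 2)   # 8.82
```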

(c) Visualize the data from LRL_L1.wav


## Answer

# time (in seconds) of each sample
timeSeq <- seq_len(length(waveSeq)) / waveSeq@samp.rate
plot(timeSeq, waveSeq@left, type = "l", ylab = "Signal", xlab = "Time (seconds)")


(d) Use a window of size n, where n is equal to 0.1 seconds. Calculate the interquartile range (IQR) of the signal in the moving window. Overlay this information on the time series plot.

## Answer

windowsize = 0.1 * waveSeq@samp.rate   # 0.1 seconds expressed in samples
testStat = rep(NA, length(timeSeq) - windowsize)
for (i in 1:(length(timeSeq) - windowsize)) {
  # IQR over exactly `windowsize` consecutive samples starting at i
  testStat[i] <- IQR(waveSeq@left[i:(i + windowsize - 1)])
}
# place each statistic at the centre of its window
start = round(windowsize / 2)
end = start + length(testStat) - 1
plot(timeSeq, waveSeq@left, type = "l", ylab = "Signal", xlab = "Time (seconds)")
lines(timeSeq[start:end], testStat, col = "red")
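To see the moving-window IQR behave as expected without the recording, the idea can be tried on a simulated signal. The sketch below is only an illustration: the signal, the 1000 Hz sampling rate and the helper `rolling_iqr` are assumptions made for the example, not part of the quiz data or of any package.

```r
set.seed(1)
samp_rate <- 1000                        # assumed sampling rate (Hz)
x <- rnorm(3 * samp_rate)                # 3 seconds of background noise
burst <- (samp_rate + 1):(2 * samp_rate)
x[burst] <- x[burst] * 10                # larger "event" in the middle second

# Hypothetical helper: IQR over every window of w consecutive samples
rolling_iqr <- function(x, w) {
  n_win <- length(x) - w + 1
  vapply(seq_len(n_win), function(i) IQR(x[i:(i + w - 1)]), numeric(1))
}

w <- round(0.1 * samp_rate)              # 0.1 second window = 100 samples
stat <- rolling_iqr(x, w)

# Centre each window on the time axis before overlaying
time <- seq_along(x) / samp_rate
centre <- seq_along(stat) + floor(w / 2)
plot(time, x, type = "l", xlab = "Time (seconds)", ylab = "Signal")
lines(time[centre], stat, col = "red")
```

The red curve should sit low over the quiet stretches and rise over the middle second, which is the pattern one would look for around eye movements in the real recording.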


Question 2 - Gene expression analysis

We will use the dataset presented in Golub et al. (1999). These data come from a study of gene expression in three types of acute leukemias: B-cell acute lymphoblastic leukemia (B-ALL), T-cell acute lymphoblastic leukemia (T-ALL) and acute myeloid leukemia (AML). Gene expression levels were measured for 38 B-cell ALL, 9 T-cell ALL and 25 AML tumor samples, using Affymetrix high-density oligonucleotide arrays hgu68a containing p = 6817 human genes.

(a) What is the size of the matrix golub? How many samples are in each of the cancer subtypes?


## Answer

load("GolubData.RData")
dim(golub)


## [1] 3571   72


head(colnames(golub))


## [1] "B-ALL:1" "T-ALL:2" "T-ALL:3" "B-ALL:4" "B-ALL:5" "T-ALL:6"


resp = matrix(unlist(sapply(colnames(golub), strsplit, ":")), nrow=2)[1,]

table(resp)


## resp

## AML B-ALL T-ALL

##  25    38     9

Answer: The matrix has 3571 rows (genes) and 72 columns (samples).

The three cancer types are AML, B-ALL and T-ALL, with 25, 38 and 9 samples respectively.
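The class labels are the part of each column name before the colon. As an alternative to the `strsplit` approach, a regular-expression substitution gives the same labels; the sketch below uses only the six column names printed above, and `sample_names` and `labels` are illustrative names.

```r
# Column names as printed by head(colnames(golub))
sample_names <- c("B-ALL:1", "T-ALL:2", "T-ALL:3", "B-ALL:4", "B-ALL:5", "T-ALL:6")

# Drop everything from the colon onwards to keep only the class label
labels <- sub(":.*$", "", sample_names)
labels

# Count samples per class
table(labels)
```

Applied to all 72 column names, the same call would be expected to reproduce the `resp` vector used in the answer.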

(b) Explain the following four lines of code:


varValue = apply(golub, 1, var, na.rm=TRUE)

cutoffvalue = sort(varValue, decreasing = TRUE)[150]

varid = which(varValue >= cutoffvalue)

varValue[varid] %>% head


## [Output: variances of the first six selected genes. Gene names recovered: AFFX-HUMRGE/M10098_5_at, AFFX-HUMRGE/M10098_M_at, AFFX-HUMRGE/M10098_3_at, AFFX-M27830_5_at, D00017_at. Values recovered: 2.371126, 1.110017.]

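Answer: The code above selects the top 150 most variable genes from golub. The first line computes the variance of each gene (row) across all samples, ignoring missing values. The second line sorts these variances in decreasing order and takes the 150th largest as the cutoff. The third line finds the indices of the genes whose variance is at least the cutoff. The fourth line prints the variances of the first six selected genes.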
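The same variance filter can be tried on a small simulated matrix where the expected result is easy to see. This is a sketch under stated assumptions: the matrix `expr`, the choice `k = 3` and the inflated variance of the first three genes are all invented for illustration.

```r
set.seed(1)
# Simulated expression matrix: 20 genes (rows) by 10 samples (columns)
expr <- matrix(rnorm(200), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("s", 1:10)))
# Make the first three genes far more variable than the rest
expr[1:3, ] <- expr[1:3, ] * 10

k <- 3                                          # number of genes to keep
gene_var <- apply(expr, 1, var, na.rm = TRUE)   # one variance per gene
cutoff <- sort(gene_var, decreasing = TRUE)[k]  # k-th largest variance
keep <- which(gene_var >= cutoff)               # genes at or above the cutoff
names(keep)
```

Because the comparison is `>=`, ties at the cutoff would return more than `k` genes; with continuous expression values this is unlikely in practice.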

(c) Use the genes selected in part (b) and build a risk calculator using KNN with K = 7 to classify the mRNA tumor samples. Estimate the KNN accuracy using repeated cross-validation with 45 repeats and visualise your results.
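For orientation, repeated cross-validation with KNN can be sketched on simulated data. This is not the quiz solution: the two-class data, the 10 repeats and all variable names are assumptions made for the example, and folds are assigned with base R `sample` rather than `cvTools::cvFolds`.

```r
library(class)  # provides knn()

set.seed(1)
n <- 60
# Simulated data: two classes, class B shifted by 3 in both features
y <- factor(rep(c("A", "B"), each = n / 2))
X <- matrix(rnorm(n * 2), ncol = 2) + ifelse(y == "A", 0, 3)

n_folds <- 5     # folds per repeat
n_repeats <- 10  # repeats (the question asks for 45)
k_nn <- 7        # number of neighbours

repeat_acc <- numeric(n_repeats)
for (r in seq_len(n_repeats)) {
  # Randomly assign each sample to one of the folds
  fold <- sample(rep(seq_len(n_folds), length.out = n))
  correct <- 0
  for (j in seq_len(n_folds)) {
    test_id <- which(fold == j)
    pred <- knn(train = X[-test_id, , drop = FALSE],
                test = X[test_id, , drop = FALSE],
                cl = y[-test_id], k = k_nn)
    correct <- correct + sum(pred == y[test_id])
  }
  repeat_acc[r] <- correct / n  # accuracy over all samples in this repeat
}

boxplot(repeat_acc, ylab = "CV accuracy", main = "Repeated 5-fold CV, 7-NN")
```

Each repeat reshuffles the fold assignment, so the spread of the boxplot reflects how much the accuracy estimate depends on the particular split.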


## Answer (including graphics)

cvK = 5 # number of CV folds
cv_acc5_rtimes = cv_acc5 = c()
r = 45 # number of repeats
X = t(golub[varid,]) # samples in rows, selected genes in columns
y = resp
n = nrow(X)
for (i in 1:r) {
  if (i %% 10 == 0) {
    print(i) # progress indicator
  }
  cvSets = cvTools::cvFolds(n, cvK) # permute all the data, into 5 folds
  cv_acc = NA # initialise results vector
  for (j in 1:cvK) {
    test_id = cvSets$subsets[cvSets$which == j]
    X_test = X[test_id, ]
    X_train = X[-test_id, ]
    y_test = y[test_id]
    y_train = y[-test_id]
    fit5 = class::knn(train = X_train, test =