DATA3888 (2023): Assignment 1

Instructions

1. Your assignment submission needs to be an HTML document that you have compiled using R Markdown or Quarto. Name your file “SIDXXX_Assignment.html”, where XXX is your Student ID.

2. Under author, put your Student ID at the top of the Rmd file (NOT your name).

3. For your assignment, please use set.seed(3888) at the start of each chunk (where required).

4. Do not upload the code file (i.e. the Rmd or qmd file).

5. You must use code folding so that the marker can inspect your code where required.

6. Your assignment should make sense and provide all the relevant information in the text when the code is hidden. Don’t rely on the marker to understand your code.

7. Any output that you include needs to be explained in the text of the document. If your code chunk generates unnecessary output, please suppress it by specifying chunk options like message = FALSE.

8. Start each of the 3 questions in a separate section. The parts of each question should be in the same section.

9. You may be penalised for excessive or poorly formatted output.
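The purpose of instruction 3 can be illustrated with a minimal sketch: resetting the seed before a random operation makes that operation reproducible. The variable names below are illustrative only.

```r
# Re-seeding before a random operation makes the result reproducible.
set.seed(3888)
first_draw <- sample(1:100, 5)

set.seed(3888)
second_draw <- sample(1:100, 5)

# Both draws are identical because the generator was reset to the same state.
identical(first_draw, second_draw)
```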

Question 1 - Case Study 1 (Reef): Visualising data

Sully and colleagues have curated a public dataset containing characteristics linked to coral bleaching over the last two decades. The data is in the file Reef_Check_with_cortad_variables_with_annual_rate_of_SST_change.csv, and the authors curated coral bleaching events at 3351 locations in 81 countries from 1998 to 2017. The full description of the variables can be found in the supplementary table of the study.

a. In the paper, the authors claim “the highest probability of coral bleaching occurred at tropical mid-latitude sites (15–20 degrees north and south of the Equator)”. Create an informative map visualisation to explore this claim and comment on what you can learn from your visualisation.

b. A researcher wants to investigate coral bleaching events around the world as they occurred from 1998 to 2017. Create an interactive map visualisation, representing the information you think would be important. Justify your choice of visualisation, and comment on what you can learn from your visualisation.
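For part a, one possible numerical companion to a map is to pool sites north and south of the Equator into bands of absolute latitude and summarise bleaching per band. The sketch below uses a made-up toy data frame; the column names lat and bleaching are hypothetical and will differ from the names in the real CSV file.

```r
# Toy data with hypothetical column names; the real file uses its own names.
reef <- data.frame(
  lat       = c(-18.2, -16.5, -3.1, 2.4, 17.8, 19.1, 24.6, -27.3),
  bleaching = c(12, 20, 1, 0, 15, 30, 4, 2)   # made-up percentages
)

# Bin by absolute latitude in 5-degree bands, so north and south are pooled.
reef$band <- cut(abs(reef$lat), breaks = seq(0, 30, by = 5), right = FALSE)

# Mean bleaching per band (bands without any site are dropped).
summary_by_band <- aggregate(bleaching ~ band, data = reef, FUN = mean)

# Number of sites behind each mean, so sparse bands are easy to spot.
summary_by_band$n <- as.vector(
  table(reef$band)[as.character(summary_by_band$band)]
)
summary_by_band
```

Reporting the number of sites next to each mean matters here, because a band with very few surveys can show an extreme average by chance.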

Question 2 - Case Study 2 (Kidney): Blood vs Biopsy Biomarker for classification

In the data GSE46474, we estimated the accuracy of our predictive model of graft rejection from a peripheral blood gene expression dataset. However, rejection is a very active process that occurs in the kidney itself. Here we will look at a similar kidney microarray dataset: instead of genes being isolated and sequenced from blood, we examine another dataset, GSE138043, where the samples have been sequenced from a kidney biopsy.

a. In each of the GSE46474 and GSE138043 datasets, use the topTable function in the limma package to output the most differentially expressed genes between patients that experience graft rejection and stable patients. Which genes overlap between the top 300 differentially expressed genes for each dataset? In other words, which genes can be found in the top 300 differentially expressed genes for BOTH datasets?

Hint. In the GSE46474 dataset, the outcome is found in the title column of the featureData and the gene symbols are found in the Gene Symbol column of the featureData. In the GSE138043 dataset, the outcome is found in the characteristics_ch1 column of the featureData and the gene symbols are found in the gene_assignment column of the featureData, between the first and second // symbols.
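The string handling described in the hint can be sketched in base R as follows. The example strings and gene names are invented for illustration and are not taken from either dataset; extract_symbol is a hypothetical helper, and the two short vectors stand in for the top 300 symbols that topTable would provide.

```r
# Example strings in the "a // b // c // ..." layout described in the hint.
# These values are illustrative, not taken from the dataset.
gene_assignment <- c(
  "NM_000001 // GENEA // hypothetical gene A // 1p36 // 1",
  "NM_000002 // GENEB // hypothetical gene B // 2q11 // 2",
  "---"
)

# Take the text between the first and second "//" and trim the whitespace.
# Entries without a "//" separator yield NA.
extract_symbol <- function(x) {
  parts <- strsplit(x, "//", fixed = TRUE)
  vapply(
    parts,
    function(p) if (length(p) >= 2) trimws(p[2]) else NA_character_,
    character(1)
  )
}

symbols <- extract_symbol(gene_assignment)

# Overlap between two ranked lists of gene symbols (hypothetical vectors;
# in practice each would hold the top 300 symbols for one dataset).
top_blood  <- c("GENEA", "GENEC", "GENED")
top_biopsy <- c("GENEB", "GENEA", "GENED")
shared <- intersect(top_blood, top_biopsy)
shared
```

Probes with no assigned gene produce NA here, so it may be worth dropping missing symbols before computing the overlap.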

b. Consider the following framework for cross-validation for a support vector machine (SVM) classifier.

Framework 1. Identify the 50 most differentially expressed genes from the entire dataset. Subset the entire dataset to the 50 most differentially expressed genes. Randomly split the data into training and testing sets (80:20 split). Build an SVM classifier on the training set. Calculate the accuracy of the classifier when applied on the testing set.

For each of the GSE46474 and GSE138043 datasets, use repeated 5-fold cross validation (with 50 repeats), following the framework above, to estimate the accuracy of graft survival prediction (rejection or stable). Show your results in a visualisation and comment on the result.
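As a rough sketch of the bookkeeping behind repeated 5-fold cross-validation with 50 repeats, fold labels can be generated once per repeat and then used to pick the 80:20 training and testing split for each fold. The sample size and the helper make_folds are assumptions for illustration only.

```r
set.seed(3888)

n_samples <- 40   # hypothetical number of samples
n_folds   <- 5
n_repeats <- 50

# For each repeat, shuffle balanced fold labels 1..k across the samples.
make_folds <- function(n, k) sample(rep(seq_len(k), length.out = n))

fold_ids <- replicate(n_repeats, make_folds(n_samples, n_folds),
                      simplify = FALSE)

# In repeat r and fold j, samples with fold_ids[[r]] == j form the test set
# (about 20% of the data) and the rest form the training set (about 80%).
test_idx  <- which(fold_ids[[1]] == 1)
train_idx <- which(fold_ids[[1]] != 1)
length(test_idx)
length(train_idx)
```

Each repeat then yields five accuracies, so 50 repeats give 250 values per dataset that can be summarised or plotted.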

c. Consider the following framework for cross-validation for a support vector machine (SVM) classifier.

Framework 2. Randomly split the entire dataset into training and testing sets (80:20 split). Identify the 50 most differentially expressed genes from the training data. Subset both the training and testing data to the 50 most differentially expressed genes. Build an SVM classifier on the training set. Calculate the accuracy of the classifier when applied on the testing set.
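The only structural difference between the two frameworks is where gene selection happens relative to the split. The sketch below makes that difference concrete on pure-noise toy data, where the labels carry no information and an honest accuracy estimate should sit near chance. To stay within base R it swaps in two stand-ins, both named plainly: a two-sample t-statistic replaces limma for ranking genes, and a nearest-centroid rule replaces the SVM. All function names are hypothetical.

```r
set.seed(3888)

# Pure-noise toy data: 40 samples x 1000 "genes", labels unrelated to the data.
n <- 40; p <- 1000; n_top <- 50
x <- matrix(rnorm(n * p), nrow = n)
y <- rep(c("rejection", "stable"), each = n / 2)

# Rank genes by absolute two-sample t-statistic (stand-in for limma).
top_genes <- function(x, y, n_top) {
  a <- x[y == "rejection", , drop = FALSE]
  b <- x[y == "stable", , drop = FALSE]
  t_stat <- (colMeans(a) - colMeans(b)) /
    sqrt(apply(a, 2, var) / nrow(a) + apply(b, 2, var) / nrow(b))
  order(abs(t_stat), decreasing = TRUE)[seq_len(n_top)]
}

# Nearest-centroid classifier (stand-in for the SVM).
centroid_accuracy <- function(x_train, y_train, x_test, y_test) {
  c_rej <- colMeans(x_train[y_train == "rejection", , drop = FALSE])
  c_sta <- colMeans(x_train[y_train == "stable", , drop = FALSE])
  d_rej <- rowSums(sweep(x_test, 2, c_rej)^2)
  d_sta <- rowSums(sweep(x_test, 2, c_sta)^2)
  pred <- ifelse(d_rej < d_sta, "rejection", "stable")
  mean(pred == y_test)
}

# Repeated k-fold CV; the flag decides where gene selection takes place.
run_cv <- function(select_inside_fold, n_repeats = 10, k = 5) {
  acc <- numeric(0)
  for (r in seq_len(n_repeats)) {
    folds <- sample(rep(seq_len(k), length.out = n))
    for (j in seq_len(k)) {
      test <- folds == j
      genes <- if (select_inside_fold) {
        top_genes(x[!test, , drop = FALSE], y[!test], n_top)  # Framework 2
      } else {
        top_genes(x, y, n_top)                                # Framework 1
      }
      acc <- c(acc, centroid_accuracy(
        x[!test, genes, drop = FALSE], y[!test],
        x[test, genes, drop = FALSE], y[test]
      ))
    }
  }
  acc
}

acc_framework1 <- run_cv(select_inside_fold = FALSE)
acc_framework2 <- run_cv(select_inside_fold = TRUE)
c(framework1 = mean(acc_framework1), framework2 = mean(acc_framework2))
```

The sketch uses 10 repeats rather than 50 to keep the run time short; the structure is otherwise the same as the repeated 5-fold scheme described for part b.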

d. Compare all the results from b and c using an appropriate graphic. Which of framework 1 or framework 2 is more valid? Is using blood or biopsy more accurate? Justify your answers.

Question 3 - Case Study 3 (Brain): Streaming classifier for Brain-box

A physics instructor, Zoe, has created a data set, stored in zoe_spiker.zip, that contains brain signal series (each series is a file) corresponding to sequences of eye movements of varying lengths. The file name corresponds to the true eye movement. For example, the file LRL_z.wav corresponds to left-right-left eye movements; the file LLRLRLRL_z.wav corresponds to left-left-right-left-right-left-right-left eye movements.

There are a total of 31 files.
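Since the file name encodes the true sequence, the labels can be recovered with a regular expression. A minimal sketch, using three of the file names from the data set; matching the leading run of L and R characters avoids relying on the position of the underscore, which is not the same in every name.

```r
# Three file names from the data set.
files <- c("LRL_z.wav", "LRRz_2.wav", "LLRLRLRL_z.wav")

# The leading run of L/R characters is the true sequence of eye movements.
true_label <- regmatches(files, regexpr("^[LR]+", files))
true_label

# Sequence length, useful for grouping files by length later on.
nchar(true_label)
```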

a. Build a classification rule for detecting a series of {L, R} under a streaming condition where the function will take a sequence of signals as an input. Explain how your classification rule works.

Note. Your function should take the entire .wav file as an input, but should run through the .wav file under streaming conditions (e.g., by considering overlapping/rolling windows in the signal).
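One way to read "streaming conditions" is sketched below: the function receives the whole signal but only ever looks at one rolling window at a time, flags a window as active when its standard deviation exceeds a threshold, and classifies an event once the activity ends. This is one possible rule among many, not a reference solution. The signal is synthetic, and the mapping of wave shape to direction (L swings up then down, R down then up), the window size, the step and the threshold are all assumptions made for this illustration.

```r
set.seed(3888)

# Synthetic stand-in for a recording: noisy baseline with three one-cycle bumps.
# Assumption for this sketch only: "L" swings up then down, "R" down then up.
make_event <- function(type, len = 200) {
  s <- sin(seq(0, 2 * pi, length.out = len))
  if (type == "L") s else -s
}
gap <- function(len = 400) rnorm(len, sd = 0.02)
signal <- c(gap(), make_event("L"), gap(), make_event("R"), gap(),
            make_event("L"), gap())

stream_classify <- function(signal, window = 100, step = 20,
                            sd_threshold = 0.2) {
  starts <- seq(1, length(signal) - window + 1, by = step)
  predicted <- character(0)
  in_event <- FALSE
  event_start <- NA
  for (s in starts) {
    w <- signal[s:(s + window - 1)]      # only the current window is inspected
    active <- sd(w) > sd_threshold
    if (active && !in_event) {           # event begins
      in_event <- TRUE
      event_start <- s
    } else if (!active && in_event) {    # event ends: classify the whole bump
      seg <- signal[event_start:(s + window - 1)]
      label <- if (which.max(seg) < which.min(seg)) "L" else "R"
      predicted <- c(predicted, label)
      in_event <- FALSE
    }
  }
  paste(predicted, collapse = "")
}

predicted_sequence <- stream_classify(signal)
predicted_sequence
```

An event that is still active when the signal ends is not classified by this sketch; a real recording may need that case handled, along with tuning of the window and threshold.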

b. Create a metric to estimate the accuracy of your classifier on the length 3 wave files, justifying your choice. Comment on the performance of your classifier (i.e. is it reasonable for this context?).
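One candidate metric, offered here as an assumption rather than the expected answer, is a similarity based on the Levenshtein (edit) distance between the true and predicted sequences. Unlike exact matching, it gives partial credit and still works when the classifier returns a sequence of the wrong length. The function name sequence_accuracy and the predicted values are hypothetical; adist is the base R edit-distance function from the utils package.

```r
# Similarity between the true and predicted sequences, scaled to [0, 1]:
# 1 means identical, 0 means every position must be edited.
sequence_accuracy <- function(truth, predicted) {
  edits <- diag(adist(truth, predicted))   # Levenshtein distance per pair
  1 - edits / pmax(nchar(truth), nchar(predicted))
}

truth     <- c("LRL", "LLR", "RRR")
predicted <- c("LRL", "LR",  "LLL")   # hypothetical classifier output
scores <- sequence_accuracy(truth, predicted)
scores
```

Here the first prediction is an exact match, the second misses one movement, and the third gets every movement wrong, so the scores are 1, 2/3 and 0 respectively.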

c. Compare at least four different classification rules on the length 3 wave files, using the metric you created. (This may include changing the parameters, different rules to identify events from non-events, or different rules to identify left-movement from right-movement). What is your best model? Justify your answer with appropriate visualisations.

d. For the best model that you found in part c, evaluate its performance on sequences of varying lengths. Does the length of the sequence have an impact on the classification accuracy? Justify your answer with appropriate visualisations.

dir("data/zoe_spiker/Length3")

##  [1] "LLL_z.wav"  "LLL_z2.wav" "LLL_z3.wav" "LLR_z.wav"  "LLR_z2.wav"
##  [6] "LLR_z3.wav" "LRL_z.wav"  "LRL_z2.wav" "LRL_z3.wav" "LRR_z.wav"
## [11] "LRR_z3.wav" "LRRz_2.wav" "RLL_z.wav"  "RLL_z2.wav" "RLL_z3.wav"
## [16] "RLR_z.wav"  "RLR_z2.wav" "RLR_z3.wav" "RRL_z.wav"  "RRL_z2.wav"
## [21] "RRL_z3.wav" "RRR_z.wav"  "RRR_z2.wav" "RRR_z3.wav"

dir("data/zoe_spiker/Length8")

## [1] "LLRLRLRL_z.wav" "LLRRLLLR_z.wav" "LLRRRLLL_z.wav" "LRRRLLRL_z.wav"
## [5] "RRRLRLLR_z.wav"

dir("data/zoe_spiker/Long")

## [1] "LLLRLLLRLRRLRRRLRLLL_Z.wav" "RRLRRLRLRLLLLLLRRLRL_z.wav"

2023-03-13
