Permutation test acceleration

1 Background

The genomic data is arranged around a circle and divided into colour-shape subsets, shown on the outer edge of the circle (e.g. the blue triangles, the orange cylinders and the grey rectangles). Within each subset, each data point is classified as either a 1 or a 0. The 1s are marked with a red star on the inner edge of the circle. The statistic we are interested in is the number of 1s in each colour-shape subset, and the question we are trying to answer is: are there more 1s in colour-shape subset X than we would expect by random chance? The problem is that we do not know the distribution of the 1s along the circle. They are not, for example, normally or uniformly distributed.
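As a minimal sketch of this statistic, assume the data is held in two NumPy arrays of equal length, ordered by position around the circle. The names `subset_ids`, `labels` and `count_ones_per_subset` are illustrative and are not taken from the original code.

```python
import numpy as np

def count_ones_per_subset(subset_ids, labels, n_subsets):
    """Return the number of 1s in each colour-shape subset.

    subset_ids: integer array, subset index (0..n_subsets-1) of each data point
    labels:     0/1 array of the same length, in order around the circle
    """
    subset_ids = np.asarray(subset_ids)
    labels = np.asarray(labels)
    # bincount with weights sums the labels that fall into each subset index
    counts = np.bincount(subset_ids, weights=labels, minlength=n_subsets)
    return counts.astype(np.int64)

# Toy data (made up for illustration): 8 points, 3 subsets
subset_ids = np.array([0, 0, 1, 1, 1, 2, 2, 0])
labels = np.array([1, 0, 1, 1, 0, 0, 1, 1])
observed = count_ones_per_subset(subset_ids, labels, n_subsets=3)
print(observed)  # [2 2 1]
```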

The solution is to create our own empirical distribution by permutation testing, which needs no assumption about the underlying distribution. The idea is to shuffle the 1/0 labels relative to the colour/shape labels so that any statistical relationship between them is lost. However, the distances between the 1/0 labels need to be preserved. This is why, instead of shuffling freely, we simply shift the whole sequence of 1/0 labels around the circle by a uniformly distributed random offset.
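The shifting procedure can be sketched as follows. This is an illustration under stated assumptions rather than the biologists' actual code: the function name, the array layout, drawing the shift uniformly from 0 to n-1, and the add-one correction in the p-value are all choices made here.

```python
import numpy as np

def circular_shift_test(subset_ids, labels, n_subsets, n_perm=1000, seed=0):
    """One-sided permutation p-value per subset, using random circular shifts."""
    rng = np.random.default_rng(seed)
    subset_ids = np.asarray(subset_ids)
    labels = np.asarray(labels)
    n = labels.size

    def count(lab):
        # Number of 1s that fall in each subset
        return np.bincount(subset_ids, weights=lab, minlength=n_subsets)

    observed = count(labels)
    exceed = np.zeros(n_subsets)
    for _ in range(n_perm):
        shift = rng.integers(0, n)          # assumed: uniform over 0..n-1
        permuted = np.roll(labels, shift)   # rotation keeps the spacing of the 1s
        exceed += count(permuted) >= observed
    # Add-one correction so that no p-value is exactly zero (an assumption)
    p_values = (exceed + 1) / (n_perm + 1)
    return observed, p_values

# Toy data (made up): 100 points, two contiguous subsets, 1s clustered at the start
subset_ids = np.repeat([0, 1], 50)
labels = np.zeros(100, dtype=np.int64)
labels[:10] = 1
observed, p_values = circular_shift_test(subset_ids, labels, n_subsets=2)
print(observed, p_values)
```

Because there are only n distinct rotations of the labels, the empirical distribution built this way has at most n distinct outcomes, however many shifts are drawn.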

2 Objectives

At present, biologists run the permutation test in Python, which takes a very long time: 10M of data takes about 3 minutes on a server, whereas the full data set is about 100GB. Our aims are:

• Serial optimisation:

– Baseline the code performance.

– Write regression test cases so that modifications do not break the code.

– Profile the code to understand where the bottlenecks are.

– Improve the code, then go back to the first step; continue until performance stops improving.

• Use multi-processing methods in Python.

• Use compilers such as Numba/PyPy/Cython.

• Write a C library for Python with MPI, OpenMP, etc.

• Use GPUs.

• Develop better sampling algorithms.
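The serial optimisation loop (baseline, regression test, improve) might look like the sketch below. Everything in it is hypothetical: `counts_reference` stands in for the existing slow code, `counts_sparse` is one possible rewrite that shifts only the positions of the 1s, and the synthetic data sizes are arbitrary. The candidate may be faster when 1s are sparse, but that has to be measured on the real data.

```python
import time
import numpy as np

def counts_reference(subset_ids, labels, n_subsets, shifts):
    """Baseline: roll the full label array once per shift."""
    return np.array([
        np.bincount(subset_ids, weights=np.roll(labels, s), minlength=n_subsets)
        for s in shifts
    ])

def counts_sparse(subset_ids, labels, n_subsets, shifts):
    """Candidate: shift only the positions of the 1s (assumes labels are 0/1)."""
    n = labels.size
    ones = np.flatnonzero(labels)
    # Row r holds the positions of the 1s after shifting by shifts[r]
    shifted = (ones[None, :] + shifts[:, None]) % n
    # Offset each row's subset ids so one bincount call handles all shifts
    keys = subset_ids[shifted] + np.arange(shifts.size)[:, None] * n_subsets
    flat = np.bincount(keys.ravel(), minlength=shifts.size * n_subsets)
    return flat.reshape(shifts.size, n_subsets)

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Synthetic data (made up for illustration)
rng = np.random.default_rng(42)
n, n_subsets = 20_000, 5
subset_ids = rng.integers(0, n_subsets, size=n)
labels = (rng.random(n) < 0.02).astype(np.int64)
shifts = rng.integers(0, n, size=200)

ref, t_ref = timed(counts_reference, subset_ids, labels, n_subsets, shifts)
new, t_new = timed(counts_sparse, subset_ids, labels, n_subsets, shifts)

# Regression test: the rewrite must reproduce the baseline counts exactly
assert np.array_equal(ref, new)
print(f"reference: {t_ref:.4f}s, candidate: {t_new:.4f}s")
```

Keeping the baseline function alongside the rewrite lets every later change (multi-processing, Numba, a C extension, a GPU kernel) be checked against the same expected output.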

A possible extension is to use machine learning to predict which subset a data point will lie in.

2022-11-24