BIOL 5300  IDCGV 2022/23 Assessment - Instructions (DRAFT)

Your assessment consists of two components:

o component 1 - a mini-essay (500-1000 words)

o component 2 - a report with 3 sub-components (1500-3000 words in total)

The full assessment should therefore be 2,000-4,000 words long.

The word limits do not include words in Figure and Table legends or the reference list.

The word limits are absolute - you will be penalised (gently) for going over or under the acceptable word count range.

Do include a reference list and cite key references in both the mini-essay and the report (where appropriate). The reference list does NOT count towards the word limits. Citations in the text DO count towards the word limits.

For both components, you should prepare your document in Word. Margins should be 2.5cm all round. Use 1.5 line spacing everywhere except in Figure/Table legends (1.2 line spacing here). Please add page numbers to the footer in the report component.

Submit both components, separately, as single documents (Word file, please; NOT pdf) via the TWO Moodle Assignments we’ve set up - as usual, for each component, you have two submissions to make:

- DRAFT - this submission will go through Turnitin and you can use that to refine it to remove any detected plagiarism

- FINAL - you MUST remember to submit your final version through this, before the deadline, and press the ‘SUBMIT’ button. (It will also go through Turnitin, but you won’t be able to use the result of that plagiarism scan)

The submission deadline for both components is 9.15am on Monday 20th February 2023.

You are strongly encouraged to complete and submit the mini-essay component by the end of Fri. 17th February to give yourselves time to focus on the report for the last few days.

Component 1 - Mini-essay (25% of the course marks):

Pick a well-known example of a multifactorial/complex phenotype - it can be a human disorder or trait, or a trait in another organism - it needs to be a trait/disorder about whose genetics we know something:

· please run your idea for the phenotype past Dr Bailey (by email or in class) BEFORE doing lots of work on it

· make sure there is clear evidence that the phenotype is multifactorial and well-studied from good sources in the scientific literature

· make sure that there is clear, robust evidence that it is genetic

· don’t use popular science/unreliable sources to determine this

· don’t pick an unusual phenotype that there are few papers about - it may not be appropriate

You should pick an example that consists of ONE phenotype only (it may be related to lots of other similar phenotypes, but trying to cover all that in this document would cause you problems - keep your task simple).

Write a short essay (500-1000 words) discussing what the best way to find (i.e. locate, or map) genetic loci predisposing to this disorder/trait would be and comparing that ideal with the existing genetic studies you find in the literature - i.e. you should analyse the ways in which the existing published studies have fallen short of that ideal (you must name the studies you are using as your comparators). If there is a big literature, limit yourself to a comparison to no more than TWO of these studies. Do NOT focus on CNVs and their role in the genetic architecture - focus on SNVs (incl. SNPs, of course).

Just to re-iterate - this is not supposed to be a simple dissection of a paper - plan the essay structure in relation to the ‘ideal’ approach to doing these studies and then refer to relevant bits of the paper to illustrate points you are making. You do NOT HAVE TO cite original references for EVERY aspect of the ideal approach that you touch on, but citing a few key papers and reviews wouldn’t go amiss.

THINGS YOU MUST NOT DO for the mini-essay component:

· You must NOT pick a single gene disorder caused by high penetrance mutations - you will lose lots of marks for doing this. You may want to use textbooks to help you choose a popular complex phenotype, or maybe look at several issues of Nature Genetics or American Journal of Human Genetics and see what people are publishing on.

· You should NOT choose a phenotype identical to one of those discussed in the module 3 paper analysis tutorial. A similar phenotype is fine, as long as it is similar to a topic that one of the OTHER groups presented on - it must NOT be very similar to the topic YOUR GROUP presented on.

· Do NOT spend most of the time talking about the phenotype definition, however complicated that might be - this is just one element that relates to the ideal overall approach....

· Do NOT simply compare potential or theoretical case/control vs QTL mapping approaches for the disorder/trait you have picked - you must relate what you write in your mini-essay to what was published in the paper(s) reporting the experimental/analytical studies that you are citing.

· Do NOT focus on an example of a cancer/cancer syndrome unless you can be really sure that it has a very clear multifactorial/polygenic architecture (many cancers do not work like this to a sufficient extent - anything driven mainly by somatic mutation is probably not a good pick here).

· Do NOT include anything about identifying causal variants - this task is solely about mapping variants genetically (i.e. finding out the position in the genome of loci associated with the phenotype).

· Do NOT overcomplicate it - keep it simple, talk about standard approaches, and don’t disappear down rabbit holes or get sucked into dwelling on weird/wonderful/unusual approaches that might only be applicable in special circumstances...

Component 2 - Report (75% of the course marks):

Sub-component a) (25% of this component; 18.75% of the course marks):

Explain, in sufficient detail, how you would do each of the following tasks (i) to iii)). Write about BOTH the conceptual basis for the thing being measured AND the practical task of using a computer to find the solution or answer accurately and efficiently. Writing notes (rather than essay-style full sentences) is fine for this section, but do NOT just write a recipe or an instruction manual - what we’re looking for here is an explanation IN FULLY ARTICULATED ENGLISH of the procedures and steps in any pipeline of operations. You can include equations/algebra/code if you think it helps to put across that explanation (in which case you MUST also explain what the code is designed to achieve), but you don’t have to. Bear in mind that your assessors/markers may not have much experience of coding, so you’re not going to be assessed on the quality of any code you’ve written for this section - the point here is to demonstrate that you know how it should be done:

i) calculate whether genotypes at 100 different SNP loci are in HWE (this will involve more than a single command; it could involve a loop, for example; you do NOT need to go through every step of any statistical procedures you talk about - just give an overview of what they are doing and how they are doing it).

ii) calculate which of these 100 SNPs fall into linkage disequilibrium groups (i.e. this analysis is likely to involve more than one individual step...); you can assume that we DO know the order and spacing of the SNPs on each of the chromosomes that they come from; you can also choose whether to talk about haplotype blocks or sliding window-type approaches, if relevant.

iii) analyse a genome-wide set of SNPs (see below*) to look for association with a dichotomous phenotype (e.g. disease case/control study) and plot the results for interpretation by scientists who are not genetics experts; you may want to specify more than one type of plot.

*You do NOT have to explain how to pick the SNPs - assume a set of tagSNPs has been chosen for you and already exist on a handy SNP-chip.

Do not write more than 500 words for sub-component a) part iii).
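
To give a feel for the level of explanation expected for tasks i) and ii), here is a minimal sketch in Python (an illustration only, not a model answer and not the only valid approach). It assumes a chi-square goodness-of-fit test with 1 degree of freedom for HWE, and uses the squared correlation of allele counts (0/1/2) as a simple stand-in for pairwise LD when genotypes are unphased. All SNP names, genotype counts and function names are hypothetical.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    q = 1 - p                                # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper-tail probability of a chi-square with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

def r_squared(x, y):
    """Squared Pearson correlation between two lists of allele counts (0/1/2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0                           # a monomorphic SNP carries no LD information
    return sxy * sxy / (sxx * syy)

# Hypothetical genotype counts (AA, AB, BB); real data would have 100 SNPs
snp_counts = {"snp1": (25, 50, 25), "snp2": (40, 20, 40)}

# The loop applies the same test to every SNP in turn
hwe_results = {name: hwe_chi_square(*counts) for name, counts in snp_counts.items()}
for name, (chi2, p_value) in hwe_results.items():
    print(name, round(chi2, 3), p_value)
```

With 100 SNPs the loop runs 100 tests, so your written explanation should also say how you would allow for multiple testing, and how you would turn pairwise r-squared values between neighbouring SNPs into LD groups.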

Please be clear - sub-component a) is NOT related to sub-component b) below - make them complete in themselves and standalone as pieces of work; do not in sub-component b) say ‘...as stated further up...’*, for example - write it anew, as you would in a report or scientific manuscript.

Sub-component b) (50% of this component; 37.5% of the course marks):

You have been given four sets of data files, specifically for this assessment. Each set relates to genotype and phenotype data for one chromosome from a GWAS. The four datasets relate to i) a dichotomous disease condition - data for two chromosomes (Chr1 and Chr2); and ii) a continuously distributed trait - data for two chromosomes (also called Chr1 and Chr2). The phenotypes in question are multifactorial and have a largely polygenic (perhaps slightly oligogenic) genetic architecture.

These 4 chromosomes are independent of each other and won’t necessarily contain the same SNPs, although some SNP names may overlap, as the data are simulated (i.e. do NOT assume that Chr1 for the dichotomous data and Chr1 for the quantitative data will have the same markers). Your dataset is unique to you - other students will get different data.

You should analyse these data and write a report on the analysis you carried out. This report is completely separate to sub-component a) above and is about how (and why) you actually DID your analysis - do NOT logically link* sub-component b) to sub-component a) nor assume the reader knows anything from sub-component a) when you are deciding what to write in sub-component b).

Use PLINK to run association analyses (a ‘mini-GWAS’) using these datafiles and report on your findings - tell us what you were trying to achieve, how you executed it, and what you found. Find ways to plot your data nicely and ensure that you report a good amount of detail about any loci that you consider to be associated with the phenotypes in question. You can include some downstream analyses if you want e.g. conditional analysis to look for additional independent signals at hit loci. You will need to report on how good/valid the results are overall, as well as what has been found in detail at the level of individual loci. Carry out any QC that you deem necessary or helpful (make no assumptions about whether any cleaning or QC has already been done on these data; only do what you have time for given the credit-weighting of this section).
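
One possible way to report on how good/valid the results are overall (a suggestion, not a requirement) is to calculate a genomic inflation factor (lambda) and a multiple-testing threshold from your association p-values. The sketch below is in Python and assumes you have already read the p-values from your association output into a list. The function names and example p-values are hypothetical, and a simple Bonferroni correction is assumed.

```python
from statistics import NormalDist, median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def genomic_inflation(p_values):
    """Lambda = median observed chi-square (1 df) / median expected under the null."""
    nd = NormalDist()
    # Convert each two-sided p-value back to a 1-df chi-square statistic
    chi2 = [nd.inv_cdf(p / 2) ** 2 for p in p_values if 0 < p <= 1]
    return median(chi2) / CHI2_1DF_MEDIAN

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

# Made-up p-values that are evenly spread between 0 and 1 (i.e. no inflation)
example_p = [(i + 0.5) / 1001 for i in range(1001)]
lam = genomic_inflation(example_p)
print("lambda:", round(lam, 3))
print("threshold:", bonferroni_threshold(len(example_p)))
```

A lambda close to 1 suggests the test statistics are not systematically inflated; values well above 1 may point to problems such as population stratification or poor QC, which you should then discuss.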

Please note - your data are simulated - the process generating the data includes some parameters that are selected at random. Human chromosomes vary quite drastically in length; for example, human chromosome 1 is about 250Mb while chromosome 22 is about 50Mb, roughly a 5-fold difference. In the simulated data that you work from you will find that the chromosomes differ in length. One of the random parameter choices in the simulation is how long to make each chromosome - some of you will receive a chromosome that has a GWAS hit, but is very short, containing only a few thousand SNPs in total - do not worry about this, just assume it’s real and represents a very short chromosome! Some short chromosomes may possibly not have any GWAS hits - this is also OK, just write about what you find. To get a feel for whether the chromosomes in your dataset are similar in length or very different, look at the size of the bim files. The closer in size these files are to each other, the more similar in size the chromosomes will be.
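
If you would rather count SNPs than compare file sizes by eye, something like the following Python sketch would do it. It assumes the standard PLINK .bim layout (six whitespace-separated columns, with the base-pair position in the fourth column); the function name and the example lines are hypothetical.

```python
def summarise_bim(lines):
    """Return (number of variants, first bp position, last bp position)."""
    positions = []
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue                      # skip blank or malformed lines
        positions.append(int(fields[3]))  # 4th column = base-pair position
    if not positions:
        return 0, None, None
    return len(positions), min(positions), max(positions)

# Made-up .bim lines: chromosome, SNP name, genetic distance, bp position, allele 1, allele 2
example_bim = [
    "1\trs0001\t0\t10500\tA\tG",
    "1\trs0002\t0\t20750\tC\tT",
    "1\trs0003\t0\t98000\tG\tA",
]
n_snps, start, end = summarise_bim(example_bim)
print(n_snps, "variants spanning", end - start, "bp")
```

For a real file you would pass an open file handle instead of the list, e.g. `with open("my_chr1.bim") as fh: summarise_bim(fh)` (the file name here is a placeholder for whatever your own files are called).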

Also due to your data being simulated, you won’t get biologically meaningful results about real genes at hit loci by plotting locus details using LocusZoom (see Module 3 GWAS exercise). If you want to include analysis of a locus using LocusZoom (which would be nice...), you have two choices:
i) use the demo GWAS results files provided on the Moodle (these may be results from Joey Ward’s analyses or files downloaded from GWAS results websites e.g. for Mood Instability or Major Depressive Disorder or BMI); or
ii) use a pre-existing dataset or LocusZoom plot found elsewhere (e.g. try plotting demo data at the LocusZoom website) - in each case, simply pretend (explicitly - so we know it’s not by accident**) that the locus emerged as part of your own mini-GWAS results - write a few words to introduce this plot, as if you were writing a paper based on your own results.

In your report for this section, you need to include some explanation of the procedures and steps in any pipeline of operations*** - you can include code if you think it helps to put across that explanation, but you don’t have to, and bear in mind that your assessors/markers may not have much experience of coding so you’re not going to be assessed on the quality of any code you’ve written for this section. The report in this section should read like a mini-version of a combined Methods***/Results/Discussion section*^* of a research paper - see the papers we covered for the module 3 tutorial and others you may find yourself for ideas about content, logical flow etc. If you cite references, please include a reference list (not included in the word count).

**Don’t write just ‘...regional plots are shown in Fig. 6....’ - make it clear that you are substituting the provided output data for your own output data to create the LocusZoom plot.

***Make sure you do include some methods description - see point above - this is NOT covered by sub-component a) in this component.

*^*Note that you are NOT expected to write any introduction for sub-component b).

Sub-component c) (25% of this component; 18.75% of the course marks):

Pick a gene and pretend that this gene lies under one of the significant association peaks from your mini-GWAS analysis (you won’t be able to use your actual mini-GWAS data from sub-component b) for this, as they are simulated). Select a gene using any source you can think of: it can be a disease-related source, a population variant database (see lecture slides), a map of association hits (e.g. existing data plotted at the LocusZoom website), or a gene we have covered or mentioned in class etc.

Analyse this gene for predicted functional effects of variants it harbours. You don’t need to analyse EVERY variant in the gene - just pick a selection (at least two examples!) of missense, nonsense and/or other (e.g. potential splicing or regulatory) variants, depending on what is present in the gene. You should ensure that you pick only germline variants, not somatic variants found only in tumours.

Write a report detailing how you did the analysis, the results of the analysis, and your conclusion about the functional effects of any variants present.

Begin with a very brief introduction (just a couple of sentences will do) - start by telling us, briefly, something about the gene and the function of its product, and mention whether it is a known disease-causing/predisposing gene. If the gene you pick is the cause of a single gene disorder (SGD; check OMIM to find out...), then do include in this introduction why you think it might also harbour variants that predispose to complex phenotypes.

Include some discussion of what your findings imply about whether variation in or near this gene is likely to explain disease risk (for any disease that turns out to be associated - look up the relevant compendium websites to see whether your gene is close to any GWAS hits).

Use of annotated diagrams/Figures is encouraged. Make sure the reader can actually see and read all information provided in each Figure.

For sub-components a), b) and c), use any or all of the approaches we’ve covered in the theory and practical sessions, or any other approaches you come across yourselves - do go looking for ways to do these things, based on what you see in the literature or through online searches. The R stuff at CRAN may be particularly helpful, as may PLINK.

If you want an extra check of your English usage and spelling, do try out Grammarly (https://www.grammarly.com/).

It’s also worth checking out the Manchester Academic Phrasebank (https://www.phrasebank.manchester.ac.uk/) if you struggle to find the right way to express yourself in English with good logical flow.