FILTUS TUTORIAL: EXOME ANALYSIS
BACKGROUND
This exercise will take you through the downstream analysis of the exome of a real patient. The
patient is a child with severe epilepsy, while both parents are healthy. The disorder has unknown
cause, but is believed to be monogenic and recessive. Furthermore, it turns out that the parents are
first cousins, suggesting that autozygosity mapping may be helpful in this case.
The exome of the child has been sequenced, and the resulting variants are annotated with Annovar.
The annotation adds information about each variant, including which gene it lies in, how it affects
the protein (synonymous, non-synonymous, stopgain, etc.), allele frequencies and predictions on its
effect on the protein function. For legal/ethical reasons, all gene names, transcript names, variant
identifiers and variant positions are irreversibly masked or changed. However, care has been taken
to preserve dependencies between the variants, so that the answers you find are almost identical
to the actual analysis of this patient.
DOWNLOAD FILTUS
Windows: Go to http://folk.uio.no/magnusv/GenetiskTeori and download "FiltusExercise.zip". Save
it somewhere on your computer and unzip it. After unzipping, start FILTUS by double clicking
“FILTUS104.exe” inside the Filtus1.0.4 folder.
Mac: Open a terminal window and run the command "pip install filtus". If this completes
successfully, you can then start the program by executing "filtus". The files needed for the exercise
can be downloaded from http://folk.uio.no/magnusv/GenetiskTeori/Mac.
If the installation didn't work, ask for help.
EXERCISE FILES
This tutorial/exercise analyzes the variant file “exome1.csv”, which contains all variants resulting
from the sequencing of the patient’s exome. You should find the file in the FiltusExercise folder.
You should also see the filter configuration file “HQ.fconfig” which you will need towards the end of
the tutorial.
TUTORIAL
To get a first impression of what the file looks like, open “exome1.csv” in a plain text editor (e.g.
Notepad).
a) What is the column separator? Do the columns have headers? What are the first few columns?
Plain text editors are not well suited for exome analysis. Instead you should now load the file into
FILTUS. Do this by opening FILTUS and choosing Load variant files (simple) in the File menu. Take a
moment to study the input settings dialog, but leave everything unchanged (FILTUS is good at
guessing the file format) and press “Use for all files”.
b) How many variants does FILTUS report in the “Unfiltered summaries” box? What is the gene
count?
c) Double click on the entry in "Unfiltered summaries" to display the variants. Locate the first
line with an exonic variant, and find out:
   i. Which gene is the variant in? Which chromosome and base position?
   ii. What sort of variant is it? How does it affect the protein?
   iii. What is the variant's frequency in the 1000 Genomes database ("1000g2010nov_ALL")?
[Hint: Triple clicking on a line will highlight it, making it easier to keep track when scrolling.]
The columns REF and ALT contain the reference allele and the alternative allele for each variant.
The observed genotype can then be read off from the GT column: 0/1 means REF/ALT (i.e.
heterozygous), while 1/1 means ALT/ALT (i.e. homozygous for the alternative allele). This is part of
the widely used VCF format for variant files, which you can read more about at
http://www.1000genomes.org if you want.
d) Consider again the first exonic variant. Is it heterozygous or homozygous? What is the observed
genotype? For experts: How many sequencing reads had the REF allele vs. the ALT allele? [Hint:
This is in the AD column.]
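If you prefer to see the GT and AD conventions spelled out, the following is a minimal Python sketch of how such entries can be read. It is not part of FILTUS; the function names are made up for illustration, and the example depth "12,9" is invented rather than taken from the exercise file. It assumes the usual VCF layout in which AD lists the REF read count first and the ALT read count second.

```python
def describe_genotype(gt: str) -> str:
    """Translate a VCF-style GT string such as '0/1' into words."""
    # Alleles are separated by '/' (unphased) or '|' (phased).
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) == 1:
        # Both alleles identical: 0 is REF, anything else is an ALT allele.
        if alleles[0] == "0":
            return "homozygous reference"
        return "homozygous alternative"
    return "heterozygous"


def allele_depths(ad: str) -> tuple[int, int]:
    """Split an AD string such as '12,9' into (REF reads, ALT reads)."""
    ref_reads, alt_reads = (int(x) for x in ad.split(",")[:2])
    return ref_reads, alt_reads


print(describe_genotype("0/1"))
print(describe_genotype("1/1"))
print(allele_depths("12,9"))  # invented example value
```

A heterozygous call with roughly balanced REF and ALT read counts is generally more convincing than one supported by only a handful of ALT reads.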
To get an overview of the file's structure and contents it is always a good idea to summarize some
of the key columns. For this we use the Summarize column function in the View menu.
e) Make a summary of the FILTER column. How many variants - raw count and in percent - have PASS
in this column? (These are the variants that passed all quality filters in the variant calling process.)
f) Enter "FILTER - equal to - PASS" as a column filter and press "Apply filter". The number
in the "Filtered summaries" box should agree with your answer above. This filter should
remain present in all further analysis.
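For readers who want to see what a column summary and an "equal to" filter amount to, here is a rough Python sketch outside FILTUS. The four-line table is a made-up stand-in for the exercise file: the comma separator, the CHROM and POS column names and all values are assumptions, and only the FILTER column and the PASS value come from the tutorial.

```python
import csv
import io
from collections import Counter

# Invented miniature table; NOT the contents of exome1.csv.
SAMPLE = """CHROM,POS,FILTER
1,1000,PASS
1,2000,LowQual
2,1500,PASS
X,3000,PASS
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))


def summarize(rows, column):
    """Return {value: (count, percent)} for one column."""
    counts = Counter(row[column] for row in rows)
    total = len(rows)
    return {value: (n, 100.0 * n / total) for value, n in counts.items()}


summary = summarize(rows, "FILTER")
for value, (n, pct) in summary.items():
    print(f"{value}: {n} ({pct:.1f}%)")

# Equivalent of the column filter "FILTER - equal to - PASS".
passed = [row for row in rows if row["FILTER"] == "PASS"]
print(len(passed), "variants pass")
```

The count of rows kept by the filter matches the PASS count in the summary, which is the same consistency check the exercise asks you to make in FILTUS.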
g) Do a summary of the Func column, and make sure you know roughly what the different categories
mean. What percentage of the variants are exonic? [Bonus question: Why are so many of the
variants not exonic?]
The Summarize column function also allows us to do a quick-and-dirty check of the patient’s
gender:
h) Add a column filter keeping only the variants on the X chromosome, and then make a summary of
the GT column. Use the result to deduce if the patient is a boy or a girl.
(Remove the chromosome filter before you continue.) The next steps give an example of how we
can filter the variants down to a small set of interesting variants. What makes a variant interesting
depends on the context, but in this case we focus on rare variants that have a big effect on the gene
product.
i) We are primarily interested in variants that are either exonic or affect splicing. Add the column
filter "Func - starts with - exon OR splic" to remove everything else. How many
variants/genes remain?
j) Make a summary of the ExonicFunc column and familiarize yourself with the different categories.
(NB: Splice site variants have empty entries in this column.)
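The logic of a "starts with" filter combined with OR can be sketched in a few lines of Python. This is only an illustration of the matching rule as described above, not FILTUS code; the helper name is hypothetical and the list of Func values is a hand-written sample of typical Annovar categories, not a dump of the exercise file.

```python
def starts_with_any(value: str, prefixes: tuple[str, ...]) -> bool:
    """True if the entry begins with at least one of the prefixes."""
    # str.startswith accepts a tuple and returns True on any match.
    return value.startswith(prefixes)


# Hand-picked sample of Func values (assumed, for illustration only).
func_values = [
    "exonic",
    "intronic",
    "splicing",
    "exonic;splicing",
    "ncRNA_exonic",
    "UTR3",
    "intergenic",
]

kept = [v for v in func_values if starts_with_any(v, ("exon", "splic"))]
print(kept)
```

Note that under this rule an entry like "ncRNA_exonic" is dropped, because it contains "exonic" but does not start with it. A "contains" filter would behave differently.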
A loss-of-function (LoF) variant is a variant that disrupts the protein, for instance by introducing a
premature stop codon or by an indel (insertion or deletion) which shifts the reading frame.
k) Reduce to LoF variants by applying a suitable column filter. Hint: Use “AND” or “OR” to combine
phrases.
l) Also remove variants with a frequency higher than 1% in the 1000 Genomes database or the
ESP5400 database (the "1000g2010nov_ALL" and "ESP5400_ALL" columns). How many
variants/genes remain?
NB: For the filters in the last step it is vital to tick the KIM-box ("keep if missing"). An empty entry in
a database column means that the variant is not reported in the database; we certainly want to
keep those!
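To see why the "keep if missing" setting matters, consider this small Python sketch of a frequency filter. It is an assumption-laden illustration rather than FILTUS's implementation: the function names, the variant ids and all frequency values are invented, and only the two column names and the 1% threshold come from the exercise.

```python
FREQ_COLUMNS = ("1000g2010nov_ALL", "ESP5400_ALL")


def passes_max_frequency(entry: str, max_freq: float, keep_if_missing: bool) -> bool:
    """Keep a variant if its frequency is at most max_freq.

    An empty entry means the variant is absent from the database,
    so the outcome is decided by keep_if_missing.
    """
    entry = entry.strip()
    if entry == "":
        return keep_if_missing
    return float(entry) <= max_freq


def is_rare(variant: dict, keep_if_missing: bool = True) -> bool:
    """Apply the 1% threshold to every frequency column."""
    return all(
        passes_max_frequency(variant[col], 0.01, keep_if_missing)
        for col in FREQ_COLUMNS
    )


# Invented example variants.
variants = [
    {"id": "v1", "1000g2010nov_ALL": "0.25", "ESP5400_ALL": "0.31"},   # common
    {"id": "v2", "1000g2010nov_ALL": "", "ESP5400_ALL": ""},           # novel
    {"id": "v3", "1000g2010nov_ALL": "0.004", "ESP5400_ALL": ""},      # rare
    {"id": "v4", "1000g2010nov_ALL": "", "ESP5400_ALL": "0.02"},       # too common in ESP
]

rare_ids = [v["id"] for v in variants if is_rare(v, keep_if_missing=True)]
rare_ids_no_kim = [v["id"] for v in variants if is_rare(v, keep_if_missing=False)]
print(rare_ids)
print(rare_ids_no_kim)
```

With keep-if-missing switched off, every variant lacking an entry in either database is discarded, so the novel variants we care most about disappear.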
We have now reduced the original haystack down to a set of high quality, very rare (or novel)
variants having damaging effect on the protein. Since the condition is thought to be recessive, we
look for genes containing at least two of the remaining variants (compound heterozygous model),
or one of them in homozygous state.
m) In the "Gene sharing" window, type "1" in the "Affected" field, choose "Recessive c/h" as
the model, and press "Analyze". How many genes turn up? Right click on the gene names to
inspect the variants in each gene (or all of them at the same time). Do any of them seem less likely
than others to be pathogenic?
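The recessive selection rule described above (at least two remaining variants in a gene, or one homozygous variant) can be sketched as follows. This is a simplified single-sample illustration, not the FILTUS "Gene sharing" code; the "Gene" column name, the gene labels and the genotypes are all invented.

```python
from collections import defaultdict


def recessive_candidates(variants):
    """Return genes with >= 2 variants, or >= 1 homozygous ALT variant."""
    by_gene = defaultdict(list)
    for v in variants:
        by_gene[v["Gene"]].append(v)

    hits = []
    for gene, gene_variants in by_gene.items():
        has_homozygous = any(v["GT"] == "1/1" for v in gene_variants)
        possible_compound_het = len(gene_variants) >= 2
        if has_homozygous or possible_compound_het:
            hits.append(gene)
    return sorted(hits)


# Invented variants that survived the earlier filters.
variants = [
    {"Gene": "GENE_A", "GT": "0/1"},  # single heterozygous variant: rejected
    {"Gene": "GENE_B", "GT": "1/1"},  # homozygous: kept
    {"Gene": "GENE_C", "GT": "0/1"},  # two heterozygous variants in one gene:
    {"Gene": "GENE_C", "GT": "0/1"},  # kept as possible compound heterozygote
]

candidates = recessive_candidates(variants)
print(candidates)
```

Keep in mind that two heterozygous variants in the same gene are only a candidate compound heterozygote: from one sample alone, this sketch cannot tell whether they lie on different parental copies of the gene.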
It is revealed to us that the parents of our patient are first cousins. This suggests that the causal
variant lies in an autozygous stretch of the genome – i.e. a long homozygous region where both
haplotypes originate from the same great grandparent. Restricting our search to these regions will
hopefully reduce the number of genes we have to investigate.
To identify these autozygous regions we use the AutEx algorithm implemented in Filtus. We go
through the process step-by-step below. Before you start, you should click “Save current filter
configuration” in the Filters menu, and save to a file that you can reload later. You can name the file
whatever you like, but I’ll refer to it as “rare_LOF.fconfig” later.
n) For AutEx to work well it is important to remove as many erroneous variants as possible. In the
Filters menu choose “Load filter configuration” and load the “HQ.fconfig” file. Do you agree that
these filters are sensible? Press “Apply”.
o) Open the AutEx dialog from the Analysis menu. Set the parental relation to be cousins, and run the
program using “1000g2010nov_ALL” as the frequency column. How many regions are found?
You can set the “Minimum segment size” to be e.g. 1 cM and 20 variants to get rid of the worst
noise.
p) Make a plot showing the autozygosity on chromosome 7, and save it.
q) After closing the AutEx window, save the identified regions by clicking on “Save main window
content” in the File menu. Name the file something like “sample1_autozygous_regions.txt”.
r) Now load the filters you saved earlier (“rare_LOF.fconfig”), and add the file from the last step as a
“Restrict to regions” filter. Press “Apply filters”, and then repeat the “Gene sharing”
analysis. How many genes are you left with now?
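Conceptually, a "Restrict to regions" filter keeps a variant only if its position falls inside one of the saved intervals. The Python sketch below illustrates that idea under stated assumptions: the (chromosome, start, end) layout, the inclusive boundaries and all coordinates are invented, and the actual format of the file saved by FILTUS may differ.

```python
def in_any_region(chrom: str, pos: int, regions) -> bool:
    """True if (chrom, pos) lies inside at least one region (inclusive ends)."""
    return any(
        region_chrom == chrom and start <= pos <= end
        for region_chrom, start, end in regions
    )


# Invented autozygous regions: (chromosome, start, end).
regions = [
    ("7", 1_000_000, 5_000_000),
    ("12", 200_000, 900_000),
]

# Invented variant positions.
variants = [
    {"id": "v1", "chrom": "7", "pos": 2_500_000},   # inside first region
    {"id": "v2", "chrom": "7", "pos": 6_000_000},   # right chromosome, outside
    {"id": "v3", "chrom": "12", "pos": 200_000},    # on a region boundary
    {"id": "v4", "chrom": "3", "pos": 2_500_000},   # no region on this chromosome
]

restricted = [v["id"] for v in variants if in_any_region(v["chrom"], v["pos"], regions)]
print(restricted)
```

Chromosomes are compared as strings here so that labels such as "X" need no special handling.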
s) Discuss possible ways to proceed at this point. Can you think of resources we haven't used in this
exercise, which could help us eliminate further variants? [Keywords to get you started: Technical
artefacts, family members, online resources, phenotype.]
Epilogue: The gene hiding behind the code GENE5661 is in fact KCTD7, a known epilepsy gene
matching the phenotype of our patient. Furthermore, the parents were shown to carry one copy
each of the LoF variant, confirming the recessive inheritance. In the end it was concluded that this
variant caused the patient's disorder.