Download Human Sequencing - Wellcome Trust Centre for Human Genetics

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Human Sequencing
Stefano Lise
Bioinformatics & Statistical Genetics (BSG) Core
The Wellcome Trust Centre for Human Genetics
(WTCHG), Oxford
Email: [email protected]
Outline
●
Human genetic variation in health and disease
–
●
How do we identify pathogenic mutations amongst
many genomic variants?
The WGS500 project
–
Whole-genome sequencing of 500 genomes of
clinical significance
Human Genome
●
●
The (haploid) reference human genome is about 3 x
109 bases
–
Human genome is diploid => ~ 2 x 3 x 3 109 bases
–
The exome is ~ 30-60 Mb (1-2% of the genome)
Some more numbers (from GENCODE, Nov 2012)
–
20,387 protein-coding genes
●
81,626 protein-coding transcripts
–
13,220 long non-coding RNA genes
–
9,173 small non-coding RNA genes
–
13,419 pseudogenes
Sequence Variants
●
Single nucleotide variants (SNV)
●
Small insertions/deletions (INDEL)
●
Structural variants
–
Large insertions/deletions
–
Inversions
–
Copy number variants
–
Translocations
–
….
Functional Consequences
From Ensembl
Human Genome Variation
●
The 1000 Genomes Project (www.1000genomes.org)
provides a catalogue of all (most) types of human
genetic variation
–
●
●
Population-scale genome sequencing
Phase 1 (October 2012)
–
High-throughput sequencing of 1092 human genomes
–
Identified up to 98% of all SNPs with a frequency > 1% in
the population
1,500 additional genomes in the next (final) phase
Human Genome Variation
(1000 Genomes Project, Nature 491, 56-65, 2012 )
Allele Frequency
< 0.5 %
0.5 - 5 %
> 5%
Total
All variants
30 -150 K
120 – 680 K
3.6 – 3.9 M
3.7 – 4.7 M
Synonymous
139 - 640
480 - 2470
12 – 13 K
13 – 16 K
Non-synonymous
(at conserved sites)
220 – 800
(130 - 400)
540 -2400
(240 - 910)
10 – 11 K
(2.3 – 2.7 K)
11 – 14 K
(2.7 – 4 K)
LOF
10 - 20
20 - 55
85 - 105
115 - 180
HGMD-DM
(at conserved sites)
4–8
(2.5 -5)
10 – 33
(4.8 - 17)
28 – 43
(11 - 18)
40 – 85
(18 - 40)
LOF=loss-of-function variant (stop-gain, frameshift indel, essential splice site)
Conserved sites = sites with GERP conservation score > 2
Rare and Common Diseases
Only 1 or 2 causal
variants
adapted from TA Manolio et al. Nature 461, 747-753 (2009)
Sequencing Strategies
●
Targeted sequencing
–
–
●
Whole exome sequencing
–
●
E.g. screening of known genes associated with
cardiomyopathies or ataxia
Applications in clinical diagnostic
Protein coding regions
Whole genome sequencing
–
–
Can detect all types of information relevant to pathology in
a single go
Still costly, but decreasing rapidly
Identifying causal variants:
Assumptions and Filters
●
●
After variant calling, filter out low quality (confidence)
calls
Variant is unique in patients or at least very rare in
the general population, e.g. < 1%
–
●
●
Use of in-house databases too
Variant has complete penetrance: every carrier will
have the phenotype
In general these steps will not identify the pathogenic
variant uniquely but will restrict the list of candidates.
Further analysis required
Ideal Scenario
●
●
Variant is common amongst all affected and absent in all
unaffected
Variant is in a gene with known function and disrupts the
protein
(Almost) Ideal Scenario
Variant Prioritization
●
Focus first on protein-coding regions (exome)
–
–
–
●
Easier to interpret the consequences of the variant
–
●
●
Nonsense and missense mutations
Frame-shift indels
Essential splice sites disruptions
E.g. mutation affects catalytic residues in an enzyme
Targeted exome sequencing has been very successful in
disease gene discovery
Cautionary note: on average each “normal, healthy”
individual carries
–
–
10-20 rare LOF variants
2-5 rare, disease-associated variants
Non-coding variants
●
●
Many functional elements lie outside protein-coding regions
(ENCODE)
Variants can disrupt
–
–
–
–
●
Regulatory elements, e.g. transcription factor binding sites
Splicing regulatory elements (branch sites, intronic splicing
enhancers/inhibitors, …)
ncRNA transcripts
…
Many non-coding variants in individual genome sequences
lie in ENCODE-annotated functional regions
–
At least as many as in protein-coding genes
Disease models
●
Diseases can be
–
Mendelian
●
–
Sporadic
●
–
De novo mutation
Cancer
●
●
Dominant, recessive or X-linked
Driver mutations
Analysis strategy needs to be adjusted to each
disease category
Autosomal Dominant Disease
●
Familiar, inherited disorder
●
Search for heterozygous variants
–
●
Present in affected individuals, absent in non-affected ones
Linkage analysis can substantially narrow the genomic search
space
–
E.g. SNP array all family members and sequence one or two affected
members
Recessive Disease
●
●
Suspected consanguinity
Search for homozygous
variants
–
●
Heterozygous in parents
Homozygosity mapping by SNP
arrays can substantially reduce
the number of variants for
follow-up
●
●
No indication of consanguinity
Search for compound
heterozygous variants
–
Affected individual carries two
separate variants in the same
gene
–
Each parent carries one of the
two variants
Sporadic Genetic Disease
●
Dominant disorder, parents are unaffected
●
Search for de novo mutations
–
●
Expect 50-100 de novo mutations in “normal, healthy” individual
–
●
Present in child and not in parents
Father’s age effect, 2 extra mutations per year (Kong et al, Nature 488,
471–475, 2012)
Sometimes difficult to distinguish from a recessive disease
Cancer
●
Matched normal to tumour samples
●
Search for somatic variants
–
Present in tumour(s), absent in normal sample
●
Identify driver mutations
●
More on this tomorrow, JB Cazier’s lecture
Predicting Phenotypic Consequences
●
Methods based on comparative genomics
●
Evolution as a measure of deleteriousness
–
●
Variants at conserved positions more likely to be deleterious
Several conservation scores
–
phyloP - single-site score (http://compgen.bscb.cornell.edu/phast/)
–
GERP - single-site score
(http://mendel.stanford.edu/sidowlab/downloads/gerp/index.html)
–
phastCons – region-based score
(http://compgen.bscb.cornell.edu/phast/)
–
…
Conservation Scores
Benign vs Pathogenic Variants
Gilissen et al, European Journal of Human Genetics (2012) 20, 490–497;
b-haemoglobin locus
From GM Cooper & J Shendure, Nature Reviews Genetics 12, 628-640 (2011)
Protein Sequence Variants
●
Most established methods. They exploit
–
–
–
–
●
Amino acid properties, e.g. charge, size, …
Structural information, e.g. local secondary structure,
surface/core amino acid, …
Evolutionary information, e.g. pattern of observed substitutions
Database information, e.g. known binding site
Several methods available
–
–
–
SIFT (http://sift.bii.a-star.edu.sg/)
Polyphen-2 (http://genetics.bwh.harvard.edu/pph2/)
…
PolyPhen-2
(http://genetics.bwh.harvard.edu/pph2/)
●
●
Prediction based on sequence, phylogenetic and structural
information characterizing the substitution
–
8 sequence-based properties
–
3 structure-based properties
The 11 properties (features) used as input of a probabilistic classifier
–
Trained to differentiate benign from pathogenic variants
Non-coding variants
●
A substantial fraction of disease causing mutations are
not exonic
–
●
●
Regulatory variants can have a large effect
More difficult to discover
–
●
Probably under-represented in databases
Non-coding positions less conserved than coding positions
ENCODE has provided a detailed map of regulatory
regions
–
Search for variants that disrupt a consensus sequence
motif within a known binding site
Gene Prioritization Methods
●
Methods focus on genes rather than on variants
–
●
Identify the genes most likely to cause a given disease
in a list of candidates
Methods combine heterogeneous pieces of
information
–
–
–
–
Shared biological pathways with other disease genes
Orthologues genes involved in similar diseases in
model organisms
Localization in affected tissue
…
Follow up
●
Definite proof of pathogenicity requires
–
Validation in independent patient cohort
●
–
In vitro functional experiments
●
–
But many diseases are genetically heterogeneous and
caused by extremely rare variants
Evaluate molecular consequences, e.g. disruption of
expression or protein folding
In vivo experiments in model organisms
●
Is the human phenotype reproduced in, e.g., a knock-out
mouse?
Bioinformatics Challenges
●
●
●
How reliably can we read and annotate an individual’s genome?
How well can we interpret genetic variation in the context of a clinical
presentation?
Community experiment to objectively assess computational methods
–
Critical Assessment of Genome Interpretation (CAGI 2012)
●
●
●
●
–
Distinguish between exomes of Crohn’s disease patients and healthy individuals
PGP genomes: predict clinical phenotypes from genome data, and match individuals to
their health records
Whole genomes of a family affected by primary congenital glaucoma: discover the
genetic basis of the disease
...
Critical Assessment of Massive Data Analysis (CAMDA 2013)
●
Reliable variant calling
●
…
The WGS500 Project
●
●
Collaboration involving the WTCHG, Oxford BRC, Oxford University
Hospitals and Illumina
Sequence 500 genomes of clinical significance
–
Mendelian diseases
–
Immunological disorders
–
Cancers
●
Target coverage: 25x (50x for cancer)
●
Diverse set of experimental designs
●
–
Familial: Linkage information
–
De novo: trios
–
Cancer: Tumour-normal, metastases, multiple-mets, ..
Substantial follow-up (screening and functional) to establish candidacy
Overview of processing
400 genomes
Oxford
Genomics
100 genomes
Illumina
QC
Large-scale CNV
scan
Read alignment
(Stampy)
Homozygosity
scan
Individual/grou
p variant calls
(Platypus)
Union file
Individual
genotypes
Web server
Annotated
genotypes
Read
alignment
and calls
(Eland/
Casava)
Referencecompressed
Archive
• Frequency (1000G, EVS)
• Conservation
• Coding consequence (x2)
• Predicted effect (x3)
• Pathogenicity (HGMD)
• Regulatory annotation
Case Study
PI: Dr A Nemeth
• 3 affected individuals from a highly consanguineous
family
– Childhood developmental ataxia
– Cognitive impairment
Targeted Sequencing
• Targeted sequencing on V3 using a panel of > 100 known
ataxia genes
– Found an homozygous stop codon in SPTBN2
– Mutation present as homozygous in all 3 affected individuals and
as heterozygous in parents of V3, by Sanger sequencing
• Mutations in SPTBN2 cause spinocerebellar ataxia type 5
(SCA5)
– Sometimes referred to as “Lincoln ataxia”
– Autosomal dominant, slowly progressing, adult onset
• Is the cognitive impairment due to the mutation in SPTBN2?
– Could be caused by mutations in a second gene (homozygous
or compound heterozygous)
• Investigated this possibility using a combination of SNP
array and whole genome sequencing
Homozygosity Mapping
•
•
SNP array genotyped V1, V2, V3, IV3 and IV4 (~300K SNPs)
Identified regions of homozygosity (ROH) shared by V1, V2 and V3
and not present in either IV3 or IV4
– Homozygosity mapping with PLINK
– Found 23 regions totalling 28.7 Mb
– Largest segments on chromosome 11
Chromosome 11
Whole Genome Sequencing
•
Searched for rare, homozygous variants in shared ROH
– Present in 1000 Genomes with an allele frequency < 1%
– Not observed in other WGS500 samples
•
Found 68 candidate variants
Functional class
Exonic
•
•
•
2
Stop gain
Synonymous
(1)
(1)
ncRNA
1
UTR
3
Intronic
•
Number of
variants
40
Upstream
1
Intergenic
21
Based on evolutionary conservation and available information in databases (eg
HGMD) the only likely pathogenic variant is the stop codon in SPTBN2
Excluded also a compound heterozygous model (data not shown)
SPTBN2 variant
•
The position is actually not well conserved
– E.g. G->A in gorilla, baboon and mouse
– GERP = -6.71
– PhyloP = -1.28
•
•
TGT and TGC encode for cysteine
TGA is a stop codon
SPTBN2 knock-out mouse
• Investigated a mouse knock-out of SPTBN2
(Mandy Jackson Lab, Edinburgh)
– Ataxia (previously reported)
– Morphological abnormalities in neurons from
prefrontal cortex, an area believed to be
important in human for cognitive tasks
– Deficits in object recognition tasks
• The mouse model supports the hypothesis that
both ataxia and cognitive impairment are
caused by the recessive mutation in SPTBN2
WGS500 overview of findings (as of Dec 2012)
• Project about 75% complete, with 292 samples (195 case
studies) over 38 projects with initial analysis
• 75/195 cases there is at least one candidate viewed by the PI
and analysts as a strong candidate for causing (strongly
contributing to) the phenotype
– 45/82 in Mendelian
– 19/61 in Immune
– 11/52 in Cancer
• Papers in press/submitted to date on
– Ataxia, CMS, CLL, Multiple adenomas