Human Sequencing
Stefano Lise
Bioinformatics & Statistical Genetics (BSG) Core
The Wellcome Trust Centre for Human Genetics (WTCHG), Oxford

Outline
● Human genetic variation in health and disease
  – How do we identify pathogenic mutations amongst many genomic variants?
● The WGS500 project
  – Whole-genome sequencing of 500 genomes of clinical significance

Human Genome
● The (haploid) reference human genome is about 3 x 10^9 bases
  – Human genome is diploid => ~ 2 x 3 x 10^9 bases
  – The exome is ~ 30-60 Mb (1-2% of the genome)
● Some more numbers (from GENCODE, Nov 2012)
  – 20,387 protein-coding genes
    ● 81,626 protein-coding transcripts
  – 13,220 long non-coding RNA genes
  – 9,173 small non-coding RNA genes
  – 13,419 pseudogenes

Sequence Variants
● Single nucleotide variants (SNV)
● Small insertions/deletions (INDEL)
● Structural variants
  – Large insertions/deletions
  – Inversions
  – Copy number variants
  – Translocations
  – …

Functional Consequences
[Figure: functional consequences of variants, from Ensembl]

Human Genome Variation
● The 1000 Genomes Project (www.1000genomes.org) provides a catalogue of all (most) types of human genetic variation
  – Population-scale genome sequencing
● Phase 1 (October 2012)
  – High-throughput sequencing of 1092 human genomes
  – Identified up to 98% of all SNPs with a frequency > 1% in the population
● 1,500 additional genomes in the next (final) phase

Human Genome Variation (1000 Genomes Project, Nature 491, 56-65, 2012)

Allele frequency                     < 0.5%                 0.5 - 5%                > 5%                     Total
All variants                         30 - 150 K             120 - 680 K             3.6 - 3.9 M              3.7 - 4.7 M
Synonymous                           139 - 640              480 - 2470              12 - 13 K                13 - 16 K
Non-synonymous (at conserved sites)  220 - 800 (130 - 400)  540 - 2400 (240 - 910)  10 - 11 K (2.3 - 2.7 K)  11 - 14 K (2.7 - 4 K)
LOF                                  10 - 20                20 - 55                 85 - 105                 115 - 180
HGMD-DM (at conserved sites)         4 - 8 (2.5 - 5)        10 - 33 (4.8 - 17)      28 - 43 (11 - 18)        40 - 85 (18 - 40)

LOF = loss-of-function variant (stop-gain, frameshift indel, essential splice site)
Conserved sites = sites with GERP conservation score > 2

Rare and Common Diseases
[Figure adapted from TA Manolio et al., Nature 461, 747-753 (2009); annotation: "Only 1 or 2 causal variants"]

Sequencing Strategies
● Targeted sequencing
  – E.g. screening of known genes associated with cardiomyopathies or ataxia
  – Applications in clinical diagnostics
● Whole exome sequencing
  – Protein coding regions
● Whole genome sequencing
  – Can detect all types of information relevant to pathology in a single go
  – Still costly, but decreasing rapidly

Identifying causal variants: Assumptions and Filters
● After variant calling, filter out low quality (confidence) calls
● Variant is unique in patients or at least very rare in the general population, e.g. < 1%
  – Use of in-house databases too
● Variant has complete penetrance: every carrier will have the phenotype
● In general these steps will not identify the pathogenic variant uniquely but will restrict the list of candidates. Further analysis required

Ideal Scenario
● Variant is shared by all affected individuals and absent in all unaffected
● Variant is in a gene with known function and disrupts the protein

(Almost) Ideal Scenario

Variant Prioritization
● Focus first on protein-coding regions (exome)
  – Nonsense and missense mutations
  – Frame-shift indels
  – Essential splice site disruptions
● Easier to interpret the consequences of the variant
  – E.g. mutation affects catalytic residues in an enzyme
● Targeted exome sequencing has been very successful in disease gene discovery
● Cautionary note: on average each “normal, healthy” individual carries
  – 10-20 rare LOF variants
  – 2-5 rare, disease-associated variants

Non-coding variants
● Many functional elements lie outside protein-coding regions (ENCODE)
● Variants can disrupt
  – Regulatory elements, e.g.
    transcription factor binding sites
  – Splicing regulatory elements (branch sites, intronic splicing enhancers/inhibitors, …)
  – ncRNA transcripts
  – …
● Many non-coding variants in individual genome sequences lie in ENCODE-annotated functional regions
  – At least as many as in protein-coding genes

Disease models
● Diseases can be
  – Mendelian
    ● Dominant, recessive or X-linked
  – Sporadic
    ● De novo mutation
  – Cancer
    ● Driver mutations
● Analysis strategy needs to be adjusted to each disease category

Autosomal Dominant Disease
● Familial, inherited disorder
● Search for heterozygous variants
  – Present in affected individuals, absent in non-affected ones
● Linkage analysis can substantially narrow the genomic search space
  – E.g. genotype all family members on a SNP array and sequence one or two affected members

Recessive Disease
● Suspected consanguinity
● Search for homozygous variants
  – Heterozygous in parents
● Homozygosity mapping by SNP arrays can substantially reduce the number of variants for follow-up

● No indication of consanguinity
● Search for compound heterozygous variants
  – Affected individual carries two separate variants in the same gene
  – Each parent carries one of the two variants

Sporadic Genetic Disease
● Dominant disorder, parents are unaffected
● Search for de novo mutations
  – Present in child and not in parents
● Expect 50-100 de novo mutations in a “normal, healthy” individual
  – Father’s age effect, 2 extra mutations per year (Kong et al, Nature 488, 471–475, 2012)
● Sometimes difficult to distinguish from a recessive disease

Cancer
● Matched normal to tumour samples
● Search for somatic variants
  – Present in tumour(s), absent in normal sample
● Identify driver mutations
● More on this tomorrow, JB Cazier’s lecture

Predicting Phenotypic Consequences
● Methods based on comparative genomics
● Evolution as a measure of deleteriousness
  – Variants at conserved positions more likely to be deleterious
● Several conservation scores
  – phyloP - single-site score (http://compgen.bscb.cornell.edu/phast/)
  – GERP - single-site score (http://mendel.stanford.edu/sidowlab/downloads/gerp/index.html)
  – phastCons - region-based score (http://compgen.bscb.cornell.edu/phast/)
  – …

Conservation Scores
[Figure: Benign vs Pathogenic Variants; Gilissen et al, European Journal of Human Genetics (2012) 20, 490–497]
[Figure: β-haemoglobin locus; from GM Cooper & J Shendure, Nature Reviews Genetics 12, 628-640 (2011)]

Protein Sequence Variants
● Most established methods. They exploit
  – Amino acid properties, e.g. charge, size, …
  – Structural information, e.g. local secondary structure, surface/core amino acid, …
  – Evolutionary information, e.g. pattern of observed substitutions
  – Database information, e.g. known binding site
● Several methods available
  – SIFT (http://sift.bii.a-star.edu.sg/)
  – PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/)
  – …

PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/)
● Prediction based on sequence, phylogenetic and structural information characterizing the substitution
  – 8 sequence-based properties
  – 3 structure-based properties
● The 11 properties (features) are used as input to a probabilistic classifier
  – Trained to differentiate benign from pathogenic variants

Non-coding variants
● A substantial fraction of disease-causing mutations are not exonic
  – Regulatory variants can have a large effect
● More difficult to discover
  – Probably under-represented in databases
● Non-coding positions less conserved than coding positions
● ENCODE has provided a detailed map of regulatory regions
  – Search for variants that disrupt a consensus sequence motif within a known binding site

Gene Prioritization Methods
● Methods focus on genes rather than on variants
  – Identify the genes most likely to cause a given disease in a list of candidates
● Methods combine heterogeneous pieces of information
  – Shared biological pathways with other disease genes
  – Orthologous genes involved in similar diseases in model organisms
  – Localization in affected tissue
  – …
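
The filtering and prioritization ideas from the preceding slides (drop low-confidence calls, keep variants rarer than 1%, look first at protein-disrupting consequences, then at conservation) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Variant` record, the `SEVERITY` ranks, the variant names and all numbers in the example are hypothetical, and the GERP > 2 cut-off simply reuses the "conserved site" definition given earlier. It is not the method of any particular tool.

```python
from dataclasses import dataclass

# Illustrative consequence ranks (an assumption, not taken from any tool):
# loss-of-function classes first, then missense, synonymous, non-coding.
SEVERITY = {
    "stop_gained": 3,
    "frameshift": 3,
    "essential_splice": 3,
    "missense": 2,
    "synonymous": 1,
    "non_coding": 0,
}


@dataclass
class Variant:
    name: str             # hypothetical identifier
    allele_freq: float    # population allele frequency, e.g. from 1000 Genomes
    consequence: str      # one of the SEVERITY keys
    gerp: float           # GERP conservation score
    quality_ok: bool = True


def prioritize(variants, max_freq=0.01, gerp_conserved=2.0):
    """Keep confident, rare variants; rank by consequence, then conservation."""
    kept = [v for v in variants if v.quality_ok and v.allele_freq < max_freq]
    return sorted(
        kept,
        key=lambda v: (SEVERITY[v.consequence], v.gerp > gerp_conserved, v.gerp),
        reverse=True,
    )


# Made-up candidates, for illustration only.
candidates = [
    Variant("var_a", 0.20, "stop_gained", 4.1),                  # too common
    Variant("var_b", 0.001, "missense", 5.2),
    Variant("var_c", 0.0, "stop_gained", -6.7),
    Variant("var_d", 0.002, "non_coding", 3.0),
    Variant("var_e", 0.004, "missense", 0.5),
    Variant("var_f", 0.0, "frameshift", 4.0, quality_ok=False),  # low-confidence call
]

ranked = [v.name for v in prioritize(candidates)]
print(ranked)
```

Ranking consequence ahead of conservation is a design choice of this sketch: a rare stop-gain at a poorly conserved position still ends up ahead of well-conserved missense variants. In practice such a ranking only shortens the candidate list; it does not establish pathogenicity.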
Follow up
● Definitive proof of pathogenicity requires
  – Validation in independent patient cohort
    ● But many diseases are genetically heterogeneous and caused by extremely rare variants
  – In vitro functional experiments
    ● Evaluate molecular consequences, e.g. disruption of expression or protein folding
  – In vivo experiments in model organisms
    ● Is the human phenotype reproduced in, e.g., a knock-out mouse?

Bioinformatics Challenges
● How reliably can we read and annotate an individual’s genome?
● How well can we interpret genetic variation in the context of a clinical presentation?
● Community experiments to objectively assess computational methods
  – Critical Assessment of Genome Interpretation (CAGI 2012)
    ● Distinguish between exomes of Crohn’s disease patients and healthy individuals
    ● PGP genomes: predict clinical phenotypes from genome data, and match individuals to their health records
    ● Whole genomes of a family affected by primary congenital glaucoma: discover the genetic basis of the disease
    ● ...
  – Critical Assessment of Massive Data Analysis (CAMDA 2013)
    ● Reliable variant calling
    ● …

The WGS500 Project
● Collaboration involving the WTCHG, Oxford BRC, Oxford University Hospitals and Illumina
● Sequence 500 genomes of clinical significance
  – Mendelian diseases
  – Immunological disorders
  – Cancers
● Target coverage: 25x (50x for cancer)
● Diverse set of experimental designs
  – Familial: linkage information
  – De novo: trios
  – Cancer: tumour-normal, metastases, multiple-mets, ..
● Substantial follow-up (screening and functional) to establish candidacy

Overview of processing
[Flowchart: 400 genomes (Oxford Genomics) and 100 genomes (Illumina); QC; large-scale CNV scan; read alignment (Stampy); homozygosity scan; individual/group variant calls (Platypus); union file; individual genotypes; annotated genotypes; web server; read alignment and calls (Eland/Casava); reference-compressed archive]
Annotation of genotypes:
• Frequency (1000G, EVS)
• Conservation
• Coding consequence (x2)
• Predicted effect (x3)
• Pathogenicity (HGMD)
• Regulatory annotation

Case Study
PI: Dr A Nemeth
• 3 affected individuals from a highly consanguineous family
  – Childhood developmental ataxia
  – Cognitive impairment

Targeted Sequencing
• Targeted sequencing on V3 using a panel of > 100 known ataxia genes
  – Found a homozygous stop codon in SPTBN2
  – Mutation present as homozygous in all 3 affected individuals and as heterozygous in parents of V3, by Sanger sequencing
• Mutations in SPTBN2 cause spinocerebellar ataxia type 5 (SCA5)
  – Sometimes referred to as “Lincoln ataxia”
  – Autosomal dominant, slowly progressing, adult onset
• Is the cognitive impairment due to the mutation in SPTBN2?
  – Could be caused by mutations in a second gene (homozygous or compound heterozygous)
• Investigated this possibility using a combination of SNP array and whole genome sequencing

Homozygosity Mapping
• SNP array genotyped V1, V2, V3, IV3 and IV4 (~300K SNPs)
• Identified regions of homozygosity (ROH) shared by V1, V2 and V3 and not present in either IV3 or IV4
  – Homozygosity mapping with PLINK
  – Found 23 regions totalling 28.7 Mb
  – Largest segments on chromosome 11
[Figure: Chromosome 11]

Whole Genome Sequencing
• Searched for rare, homozygous variants in shared ROH
  – Present in 1000 Genomes with an allele frequency < 1%
  – Not observed in other WGS500 samples
• Found 68 candidate variants

Functional class    Number of variants
Exonic              2 (stop gain: 1, synonymous: 1)
ncRNA               1
UTR                 3
Intronic            40
Upstream            1
Intergenic          21

• Based on evolutionary conservation and available information in databases (e.g. HGMD), the only likely pathogenic variant is the stop codon in SPTBN2
• Also excluded a compound heterozygous model (data not shown)

SPTBN2 variant
• The position is actually not well conserved
  – E.g. G->A in gorilla, baboon and mouse
  – GERP = -6.71
  – PhyloP = -1.28
• TGT and TGC encode cysteine
• TGA is a stop codon

SPTBN2 knock-out mouse
• Investigated a mouse knock-out of SPTBN2 (Mandy Jackson Lab, Edinburgh)
  – Ataxia (previously reported)
  – Morphological abnormalities in neurons from prefrontal cortex, an area believed to be important in humans for cognitive tasks
  – Deficits in object recognition tasks
• The mouse model supports the hypothesis that both ataxia and cognitive impairment are caused by the recessive mutation in SPTBN2

WGS500 overview of findings (as of Dec 2012)
• Project about 75% complete, with 292 samples (195 case studies) over 38 projects with initial analysis
• In 75/195 cases there is at least one candidate viewed by the PI and analysts as a strong candidate for causing (strongly contributing to) the phenotype
  – 45/82 in Mendelian
  – 19/61 in Immune
  – 11/52 in Cancer
• Papers in press/submitted to date on
  – Ataxia, CMS, CLL, Multiple adenomas
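
As a quick check of the counts on this last slide, the per-category rates can be worked out directly. The short Python sketch below uses only the numbers reported above; the variable names are illustrative.

```python
# Counts as reported on the slide: (cases with a strong candidate, total cases)
findings = {
    "Mendelian": (45, 82),
    "Immune": (19, 61),
    "Cancer": (11, 52),
}

# The category counts should add up to the overall 75/195 figure.
total_hits = sum(hits for hits, _ in findings.values())
total_cases = sum(cases for _, cases in findings.values())

rates = {name: hits / cases for name, (hits, cases) in findings.items()}
overall = total_hits / total_cases

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
print(f"Overall: {total_hits}/{total_cases} = {overall:.1%}")
```

The three categories sum to 75 of 195 cases, which matches the overall figure on the slide.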