Canadian Bioinformatics Workshops
www.bioinformatics.ca

Module 8 – Variants to Networks
Part 1 – How to annotate variants and prioritize potentially relevant ones
Jüri Reimand
Informatics and Biocomputing, Ontario Institute for Cancer Research

Learning Objectives of Module
I have detected somatic variants in a cancer sample. What information can I use to interpret them?
• What variant annotations can I use?
• How do impact prediction models work?
• How to use an annotation tool: Annovar (LAB)

Introduction

Variant vs Gene Information
We have to consider information at two levels:
• Gene
– Is the gene central to processes related to cancer? (e.g. proliferation, apoptosis, matrix degradation)
– Is the gene sensitive to perturbation? (e.g. haploinsufficiency)
• Variant
– What is the variant effect on the gene product?

Integrating Different Types of Evidence
• Variant recurrence
• Gene product function / pathway
• Variant gene product effect

On Variant Size
Small: 1-50 bp
• SNV (Single Nucleotide Variants): 1 bp substitution, relatively easy to detect
• Small In/dels: a bit more challenging to detect
• Most available in databases; can be mapped by exact coordinates
Medium: 100-1,000 bp
• Insertions, Deletions, Translocations, Complex re-arrangements
• Most challenging to detect
• More tolerant mapping (e.g. partners of gene fusion)
Large: > 5 kbp
• Copy number variants relatively easy to detect using arrays, more challenging using next generation sequencing
• More tolerant mapping (e.g. 50% reciprocal overlap, cytoband(s))

Variant Annotation Components
A. Variant database mapping
– Allele frequencies from reference data-sets (1000G, NHLBI-ESP, ExAC)
– dbSNP (sequence variation database)
– COSMIC (somatic variant database)
B. Gene mapping (coding/splicing, UTR, intergenic)
C. Gene product effect type (e.g. loss of function, missense)
D. Coding Missense Effect Scoring
– SIFT
– PolyPhen2
– MutationAssessor
E. Other Effect Scoring
– PhyloP (conservation)
– CADD
– Splicing-regulatory predictions

Variant databases and allele frequencies

1000 Genomes (Phase 3)
• Goal:
– Identify all variants at > 1% frequency in represented human populations
• Subjects: 2,504
– Apparently healthy
– Ethnicities: caucasian European, admixed Latin Americans, African, South Asians, East Asians
• Platform: Illumina
– Low coverage (2-4x) whole genome
– Exon (50x coverage)

NHLBI-ESP
• Goal:
– Discover heart, lung and blood disorder variants at frequency < 1%
• Subjects: 6,503 (ESP 6500 release)
– Not necessarily healthy (includes individuals with extreme subclinical traits and diseased)
– Ethnicities: 2,203 African-Americans, 4,300 European-Americans
• Platform: Illumina, exome sequencing (average 110x)

ExAC (Exome Aggregation Consortium)
• Goal:
– Compile the largest set of exomes ever
• Subjects: 60,706 (unrelated)
– Not necessarily healthy: includes cardiovascular, autoimmune, schizophrenia and cancer, but removed individuals with severe pediatric disease
– Ethnicities: non-Finnish European, Finnish, Latin Americans, African, South Asians, East Asians, Other
• Platform: Illumina, exome
• Variant calling:
– GATK

dbSNP
• Broad scope repository of “small” genetic variation (the NCBI counterpart for structural variants is dbVar)
– Submissions before and after NGS era
– Includes polymorphisms found in general population
– Includes rare germline variants that are disease-associated (or suspected to be)
– Includes somatic variants (also in COSMIC)
Good to look up variants
If you want to use it as a filter, make sure you remove “clinically flagged” variants (somatic, germline)

COSMIC
• “Catalogue of Somatic Mutations in Cancer”
• Reference database for somatic variation in cancer
• Worth following up variants matching COSMIC entries
– How many studies/samples was it found in? 1, many?
– Does the variant overlap a hotspot?
– Is the gene frequently mutated?

Gene mapping

Gene Mapping: Types of Genes
Types of genes:
• Protein-coding genes
• Non-protein-coding RNA genes (e.g. miRNA)
• Different functional relevance
• Different knowledge of variant effects

Gene Mapping: Parts of Genes
• Protein-coding genes have these parts:
– UTR (transcribed, not translated)
– Coding exons (translated)
– Introns (spliced out, not translated)
– Splice sites
Also:
• Upstream, downstream of transcribed gene
• Inter-genic

Gene Mapping: Annovar’s priority system
• Gene types and parts: what if they overlap..?
• Whenever more than one mapping is possible, Annovar will follow this priority system
• You can also ask Annovar to report all possible effects

[Figure: gene model of protein-coding gene G1, with the TSS of G1 (Transcription Start Site), and non-coding RNA ncR1 (e.g. miRNA)]

[Figure: the same gene model with regions labeled: G1 Downstream, ncR1 Downstream, G1 UTR 3’, ncR1, G1 Exonic, G1 Intronic, G1 Splicing **, G1 UTR 5’, G1 Upstream]
** Splice sites after the first were omitted to avoid clutter

Gene Mapping: Database
• Goal: map our variants to (coding and non-coding) genes
• RefSeq is the suggested database for transcribed gene and coding sequence definition
– In the lab we will use Annovar with the RefSeq database
• Other databases available: UCSC known genes, Ensembl

Gene product effect type

Gene Product Effect
• Regulatory / other non-protein-coding sequences: difficult to establish what a change “means” (certain cases are easier, e.g. miRNA seed)
• Protein-coding sequences: how is protein sequence affected?
Definitely easier to chase after protein effects
But don’t forget other gene products exist…

Gene Product Effect: Protein-coding
• Stop-gain SNV: adds a STOP codon, producing a truncated protein
• Frameshift In/Del: shifts the reading frame, so the protein is translated incorrectly from that point
• Splicing: alters key sites guiding splicing
• In-frame In/Del: removes/adds one or more amino acids
• Stoploss: loss of STOP codon, adding an extra piece to the protein
• Missense SNV: modifies one amino acid
• Synonymous: no amino acid change

Loss of Function (LOF) Variants
Definition: Stop-gain, Frameshift, Splicing
These are the more disruptive, BUT:
• What percentage of the protein is affected?
• Are there multiple transcript isoforms?
• Splicing effect difficult to predict
– Cryptic splice sites
• Frameshift can be rescued by another frameshift or bypassed by splicing

Missense Variants: Tell Me More..
• How do we tell if a missense alters protein function?
– Type of amino acid change (amino acid groups)
– Conservation across species
– Conserved protein domain
– Secondary protein structure
– Tertiary (3D) protein structure + simulation
– Other functional features (e.g. phosphosite)
• Machine learning model tying all of these together
– What training set?

Missense Example: Back to BRAF
• BRAF V600E (T>A): somatic, pathogenic
• BRAF V600A (T>C): somatic / germline, pathogenicity untested

Conservation and Missense Variant Scoring Models

Conservation
• Conservation is a powerful and broadly used idea
• How conserved is a given nucleotide or genomic interval, comparing different species to human?
• How conserved is an amino acid in a protein sequence?
• Available from UCSC (nucleotide conservation):
– PhyloP score: useful to assess single variants
– PhastCons score/element: useful to assess putative regulatory regions and genes not coding proteins
– Multi-species alignment: generally useful

Look for coding exons, UTRs and third nucleotide within codons

PhyloP
• PhyloP: test to detect if nucleotide substitution rates are faster or slower than expected under neutral drift
– Only where aligned sequence available!
• PhyloP score
– Positive: conserved (e.g. PhyloP > 2)
– Zero: neutral
– Negative: more diverged than neutral
• Species group:
– All vertebrates
– Only placental mammals
– Only primates

Conservation
• Main caveat:
– Conservation at a given position does not tell you directly what the effect of your variant is, only whether the position is important!

Missense Variant Effect: Scoring Models Overview
Criteria to keep in mind:
• What features are used?
– Nucleotide / amino acid conservation
– Amino acid physicochemical properties
• Direct scoring versus Machine learning
– Machine learning models are heavily dependent on the training-set used
• What data-set was used for assessment / learning / optimization?
– E.g. Activating / gain-of-function versus inactivating / loss-of-function mutations
– E.g. Mendelian disorders (prevailingly loss-of-function) versus cancer (some are unique to cancer, e.g. drug resistance)

SIFT
• Broadly used, relatively old (first published: 2001)
• Designed for deleterious mutation (i.e. disruptive of protein function)
• Based uniquely on protein sequence (amino acid) conservation
1. Start from query protein sequence
2. Identify similar protein sequences (PSI-BLAST)
3. Multiple alignment of protein sequences (orthologs and paralogs)
4. Amino acid x residue probability matrix (PSSM)
5. For every residue, amino acid probability reweighted by amino acid diversity at the position (sum of frequency rank * frequency)
Score: probability of observing amino acid normalized by residue conservation
Cut-off: 0.05 (based on case studies)
Predicting deleterious amino acid substitutions. Ng PC, Henikoff S. Genome Res. 2001 May;11(5):863-74.

PolyPhen2
• Integrates multiple features
– 8 sequence-based, 3 structure-based (nucleotide and amino acid level) (e.g. side chain volume change, overlap with PFAM domain, multiple alignment metrics)
• Machine learning method (Naïve Bayes)
Requires training set
– Set 1: HumDiv
  – Positive: damaging alleles for known Mendelian disorders (Uniprot)
  – Negative: nondamaging differences between human proteins and related mammalian homologs
  – Performance, 5-fold cross-validation: (TP ~ 80%, FP ~ 10%), (TP ~ 90%, FP ~ 20%)
– Set 2: HumVar
  – Positive: all human disease causing mutations (Uniprot)
  – Negative: non-synonymous SNPs without disease association
Richer model than SIFT
More biased towards training set(s) than SIFT
A method and server for predicting damaging missense mutations. Adzhubei IA, Schmidt S, Peshkin L, […], Bork P, Kondrashov AS, Sunyaev SR. Nat Methods. 2010 Apr;7(4):248-9.

MutationAssessor
• Direct / theoretical model (no machine learning)
• Based on amino acid conservation, also specifically modeling conservation unique to protein subfamilies (can be regarded as an enhanced SIFT)
• Entropy-based score based on protein sequence alignment
• Performs well for (recurrent) somatic variants
Predicting the functional impact of protein mutations: application to cancer genomics. Reva B, Antipin Y, Sander C. Nucleic Acids Res. 2011 Sep 1;39(17):e118

CADD
• Intended as a measure of “deleteriousness” for coding and noncoding sequence, not biased to known disease variation
– However, does not model gene-specific constraint in detail
• Machine learning model (Linear SVM)
– Negative training set: nearly fixed human alleles, variant if compared to inferred human-chimp ancestral genome
– Positive training set: simulated variants based on mutation model aware of sequence context and primate substitution rates
– Predictive features (63): VEP (Variant Effect Predictor) output, UCSC tracks, Encode tracks; includes missense predictions and nucleotide-level conservation
– Performance assessment: using pathogenic variants from ClinVar, performs a bit better than PhyloP for all sites and PolyPhen/SIFT for missense coding
A general framework for estimating the relative pathogenicity of human genetic variants. Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. Nat Genet. 2014 Mar;46(3):310-5.

[Figure: CADD, pathogenic ClinVar vs NHLBI-ESP > 5%]

Splicing Regulatory Predictions
• Goal: predict how SNVs affect exon inclusion / exclusion
• Strategy:
1. Learn “Wild Type” splicing code based on reference genome sequence motifs and experimentally-measured splicing patterns in human tissues
2. “Mutant” code: predicts splicing change when variant alters splicing-guiding sequence motif
Does not learn based on known disease splicing alterations
Science 2015

Phosphorylation and other protein modifications
• Post-translational modifications (PTMs) extend protein function
• Human: >130,000 PTM sites, 12% of protein sequence
• Enriched in inherited disease and somatic cancer mutations
• Negatively selected in population
• Often not detected with mutation assessment tools
Reimand et al, 2013 Mol Sys Bio; 2015 PLOS Genet

Effect Scoring: Conclusive Remarks
1. Nucleotide-level conservation (PhyloP) is simple yet powerful, and multiple alignments can be additionally inspected
2. Missense scoring models are powerful, but their strengths and weaknesses need to be understood
3. Variants should always be reviewed putting all information in context
– Consider conservation and effect scores using different models
– Review the amino acid change and sequence context
– Look for clusters of somatic variants and protein domains
– Don’t forget gene-level information!
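
The idea of reviewing a variant with all annotations in context can be illustrated with a small Python sketch. It is not part of the original slides: the `AnnotatedVariant` record, its field names and the `review_flags` helper are hypothetical, and the effect labels are illustrative strings rather than the output of any particular tool. The thresholds are only those quoted in the slides (PhyloP > 2 as an example of a conserved position, SIFT cut-off 0.05, with scores below the cut-off treated as predicted deleterious). The function collects flags for manual review; it does not make a pathogenicity call.

```python
from dataclasses import dataclass
from typing import List, Optional

# Effect types the slides group as loss of function (LOF).
LOF_EFFECTS = {"stopgain", "frameshift", "splicing"}

# Thresholds quoted in the slides; starting points, not hard rules.
PHYLOP_CONSERVED = 2.0  # "e.g. PhyloP > 2" is given as an example of conserved
SIFT_CUTOFF = 0.05      # assumption: scores below the cut-off are deleterious


@dataclass
class AnnotatedVariant:
    """Hypothetical record holding a few annotations for one variant."""
    gene: str
    effect: str                     # e.g. "missense", "stopgain", "synonymous"
    phylop: Optional[float] = None  # None where no aligned sequence is available
    sift: Optional[float] = None    # only meaningful for missense changes
    cosmic_samples: int = 0         # number of COSMIC samples carrying the variant


def review_flags(v: AnnotatedVariant) -> List[str]:
    """Collect independent lines of evidence worth a closer manual look."""
    flags = []
    if v.effect in LOF_EFFECTS:
        flags.append("loss-of-function effect type")
    if v.phylop is not None and v.phylop > PHYLOP_CONSERVED:
        flags.append("conserved position")
    if v.effect == "missense" and v.sift is not None and v.sift < SIFT_CUTOFF:
        flags.append("SIFT predicted deleterious")
    if v.cosmic_samples > 1:
        flags.append("found in more than one COSMIC sample")
    return flags


if __name__ == "__main__":
    # Made-up values for illustration only.
    example = AnnotatedVariant("GENE1", "missense", phylop=4.2, sift=0.01,
                               cosmic_samples=12)
    print(review_flags(example))
```

Keeping the flags separate instead of summing them into one score mirrors the point of the concluding remarks: each annotation answers a different question, and gene-level information still has to be considered alongside them.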