Canadian Bioinformatics Workshops
www.bioinformatics.ca
Module 8 – Variants to Networks
Part 1 – How to annotate variants and
prioritize potentially relevant ones
Jüri Reimand
Informatics and Biocomputing
Ontario Institute for Cancer Research
Learning Objectives of Module
I have detected somatic variants in a cancer sample.
What information can I use to interpret them?
• What variant annotations can I use?
• How do impact prediction models work?
• How to use an annotation tool: Annovar (LAB)
Module 8
bioinformatics.ca
Introduction
Variant vs Gene Information
We have to consider information at two levels:
• Gene
– Is the gene central to processes related to cancer?
(e.g. proliferation, apoptosis, matrix degradation)
– Is the gene sensitive to perturbation?
(e.g. haploinsufficiency)
• Variant
– What is the variant effect on the gene product?
Integrating Different Evidences
[Diagram: three kinds of evidence to integrate: variant recurrence; gene product function / pathway; variant gene product effect]
On Variant Size
Small: 1-50 bp
• SNV (Single Nucleotide Variants): 1 bp substitution, relatively easy to detect
• Small In/dels: a bit more challenging to detect
• Most available in databases; can be mapped by exact coordinates
Medium: 100-1,000 bp
• Insertions, Deletions, Translocations, Complex re-arrangements
• Most challenging to detect
• More tolerant mapping (e.g. partners of gene fusion)
Large: > 5 kbp
• Copy number variants relatively easy to detect using arrays, more challenging
using next generation sequencing
• More tolerant mapping (e.g. 50% reciprocal overlap, cytoband(s))
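The "50% reciprocal overlap" rule for large variants can be sketched as follows. This is a minimal illustration, assuming half-open (start, end) coordinates; the function names and the tuple layout are hypothetical and not taken from any particular tool.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the smaller of the two overlap fractions (0.0 if no overlap)."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    # The overlap must cover a large share of BOTH intervals,
    # so we report the smaller of the two fractions.
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def same_cnv(a, b, threshold=0.5):
    """a, b: (chrom, start, end) tuples. True if they match under the rule."""
    return a[0] == b[0] and reciprocal_overlap(a[1], a[2], b[1], b[2]) >= threshold
```

Requiring the overlap to be reciprocal prevents a small CNV from "matching" a much larger database entry that merely contains it.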
Variant Annotation Components
A. Variant database mapping
   a. Allele frequencies from reference data-sets (1000G, NHLBI-ESP, ExAC)
   b. dbSNP (sequence variation database)
   c. COSMIC (somatic variant database)
B. Gene mapping (coding/splicing, UTR, intergenic)
C. Gene product effect type (e.g. loss of function, missense)
D. Coding Missense Effect Scoring
   a. SIFT
   b. PolyPhen2
   c. MutationAssessor
E. Other Effect Scoring
   a. PhyloP (conservation)
   b. CADD
   c. Splicing-regulatory predictions
Variant databases
and allele frequencies
1000 Genomes (Phase 3)
• Goal:
– Identify all variants at > 1% frequency in represented human populations
• Subjects: 2,504
– Apparently healthy
– Ethnicities: caucasian European, admixed Latin Americans, African, South
Asians, East Asians
• Platform: Illumina
– Low coverage (2-4x) whole genome
– Exon (50x coverage)
NHLBI-ESP
• Goal:
– discover heart, lung and blood disorder variants at frequency < 1%
• Subjects: 6,503 (ESP 6500 release)
– Not necessarily healthy (includes individuals with extreme subclinical traits and diseased)
– Ethnicities: 2,203 African-Americans, 4,300 European-Americans
• Platform: Illumina, exome sequencing (average 110x)
ExAC (Exome Aggregation Consortium)
• Goal:
– Compile the largest set of exomes ever
• Subjects: 60,706 (unrelated)
– Not necessarily healthy: includes cardiovascular, autoimmune, schizophrenia
and cancer, but removed individuals with severe pediatric disease
– Ethnicities: non-Finnish European, Finnish, Latin Americans, African, South
Asians, East Asians, Other
• Platform: Illumina, exome
• Variant calling:
– GATK
dbSNP
• Broad scope repository of “small” genetic variation
(NCBI counterpart for structural variants: dbVar)
– Submissions before and after NGS era
– Includes polymorphisms found in general population
– Includes rare germline disease-associated variants (or suspected to be)
– Includes somatic variants (also in COSMIC)
→ Good to look up variants
→ If you want to use it as a filter, make sure you remove
“clinically flagged” variants (somatic, germline)
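One way to picture the filtering advice above is the sketch below. It assumes each variant is a dictionary with the hypothetical keys `pop_af`, `in_dbsnp` and `clinically_flagged`; these field names and the 1% frequency threshold are illustrative assumptions, not the output format of any annotation tool.

```python
def is_likely_polymorphism(variant, max_af=0.01):
    """Return True if the variant looks like a common or known germline polymorphism.

    variant: dict with hypothetical keys
      'pop_af'             : {data-set name: allele frequency}, may be empty
      'in_dbsnp'           : bool
      'clinically_flagged' : bool, dbSNP entry flagged as somatic / disease-associated
    """
    highest_af = max(variant.get("pop_af", {}).values(), default=0.0)
    if highest_af >= max_af:
        return True
    # dbSNP membership alone only counts when the entry is NOT clinically flagged,
    # otherwise known somatic or disease variants would be filtered out.
    return variant.get("in_dbsnp", False) and not variant.get("clinically_flagged", False)


def keep_candidates(variants):
    """Keep the variants that are not explained away as polymorphisms."""
    return [v for v in variants if not is_likely_polymorphism(v)]
```

The threshold is a parameter on purpose: a stricter cut-off may be reasonable when the reference data-set is large enough to estimate rare frequencies.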
COSMIC
• “Catalogue Of Somatic Mutations In Cancer”
• Reference database for somatic variation in cancer
• Worth following up variants matching COSMIC entries
– How many studies/samples was it found in? 1, many?
– Does the variant overlap a hotspot?
– Is the gene frequently mutated?
Gene mapping
Gene Mapping:
Types of Genes
Types of genes:
• Protein-coding genes
• Non-protein-coding RNA genes (e.g. miRNA)
• Different functional relevance
• Different knowledge of variant effects
Gene Mapping:
Parts of Genes
• Protein-coding genes have these parts:
– UTR (transcribed, not translated)
– Coding exons (translated)
– Introns (spliced out, not translated)
– Splice sites
Also:
• Upstream, downstream of the transcribed gene
• Inter-genic
Gene Mapping:
Annovar’s priority system
• Gene types and parts: what if they overlap..?
• Whenever more than one mapping is possible, Annovar
will follow this priority system
• You can also ask Annovar to report all possible effects
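A priority system of this kind can be sketched as an ordered list of tiers. The tier order below (exonic = splicing > ncRNA > UTR5/UTR3 > intronic > upstream/downstream > intergenic) is the precedence as I understand it from the Annovar documentation and should be checked there; the region labels are simplified and the function name `prioritize` is hypothetical.

```python
# Assumed precedence, highest first; regions in the same set are tied.
PRECEDENCE = [
    {"exonic", "splicing"},
    {"ncRNA"},
    {"UTR5", "UTR3"},
    {"intronic"},
    {"upstream", "downstream"},
    {"intergenic"},
]


def prioritize(annotations):
    """annotations: list of (gene, region) pairs for one variant.

    Returns only the pairs that fall in the highest-ranked tier present.
    """
    for tier in PRECEDENCE:
        hits = [(gene, region) for gene, region in annotations if region in tier]
        if hits:
            return hits
    return []
```

For a variant that is intronic in G1 but inside ncR1, `prioritize([("G1", "intronic"), ("ncR1", "ncRNA")])` would report only the ncR1 mapping under this assumed order.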
Gene Mapping:
Annovar’s priority system
[Figure: a protein-coding gene G1, marked with the TSS of G1 (Transcription Start Site), and a non-coding RNA gene ncR1 (e.g. miRNA)]
Gene Mapping:
Annovar’s priority system
[Figure: the G1 / ncR1 locus annotated region by region. Labels, in the order extracted from the slide: G1 Downstream; ncR1 Downstream; G1 UTR 3’; ncR1; G1 Exonic; ncR1; G1 Intronic; G1 Exonic; G1 Intronic; G1 Exonic; G1 Intronic; G1 Splicing **; G1 Exonic; G1 UTR 5’; G1 Upstream]
** Splice sites after the first were omitted to avoid clutter
Gene Mapping:
Database
• Goal:
map our variants to (coding and non-coding) genes
• RefSeq is the suggested database for transcribed gene
and coding sequence definition
– In the lab we will use Annovar with RefSeq database
• Other databases available: UCSC known genes, Ensembl
Gene product effect type
Gene Product Effect
• Regulatory / other non-protein-coding sequences:
difficult to establish what a change “means”
(certain cases are easier, e.g. miRNA seed)
• Protein-coding sequences:
how is protein sequence affected?
→ Definitely easier to chase after protein effects
→ But don’t forget that other gene products exist…
Gene Product Effect: Protein-coding
• Stop-gain SNV: adds a STOP codon → truncated protein
• Frameshift In/Del: shifts the reading frame
→ protein translated incorrectly from that point
• Splicing: alters key sites guiding splicing
• In-frame In/Del: removes/adds one or more amino acids
• Stop-loss: loss of STOP codon → extra piece in the protein
• Missense SNV: modifies one amino acid
• Synonymous: no amino acid change
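The effect types listed above can be derived from the standard genetic code. The sketch below is a simplified illustration: it classifies a coding SNV from its reference and alternate codons, and an in/del from the change in allele length. It does not handle splicing, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_snv(ref_codon, alt_codon):
    """Classify a single-nucleotide change given the codon before and after."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop-gain"
    if ref_aa == "*":
        return "stop-loss"
    return "missense"


def classify_indel(ref_allele, alt_allele):
    """An in/del keeps the reading frame only if the length change is a multiple of 3."""
    if (len(alt_allele) - len(ref_allele)) % 3 == 0:
        return "in-frame indel"
    return "frameshift indel"
```

For example, `classify_snv("GTG", "GAG")` replaces a valine codon with a glutamate codon and is therefore a missense change.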
Loss of Function (LOF) Variants
Definition: Stop-gain, Frameshift, Splicing
These are the most disruptive, BUT:
• What percentage of the protein is affected?
• Are there multiple transcript isoforms?
• Splicing effect difficult to predict
– Cryptic splice sites
• Frameshift can be rescued by another frameshift or
bypassed by splicing
Missense Variants: Tell Me More..
• How do we tell if a missense variant alters protein function?
– Type of amino acid change (amino acid groups)
– Conservation across species
– Conserved protein domain
– Secondary protein structure
– Tertiary (3D) protein structure + simulation
– Other functional features (e.g. phosphosite)
• Machine learning model tying all of these together
– What training set?
Missense Example:
Back to BRAF
BRAF V600E
T>A
Somatic
Pathogenic
BRAF V600A
T>C
Somatic / germline
Pathogenicity untested
Conservation and
Missense Variant Scoring Models
Conservation
• Conservation is a powerful and broadly used idea
• How conserved is a given nucleotide or genomic interval,
comparing different species to human?
• How conserved is an amino acid in a protein sequence?
• Available from UCSC (nucleotide conservation):
– PhyloP score – useful to assess single variants
– PhastCons score/element – useful to assess putative regulatory
regions and genes not coding proteins
– Multi-species alignment – generally useful
Look for coding exons, UTRs and third nucleotide within codons
PhyloP
• PhyloP: test to detect if nucleotide substitution rates are faster or
slower than expected under neutral drift
– Only where aligned sequence available!
• PhyloP score
– Positive: conserved (e.g. PhyloP > 2)
– Zero: neutral
– Negative: more diverged than neutral
• Species group:
– All vertebrates
– Only placental mammals
– Only primates
Conservation
• Main caveat:
– conservation at a given position only tells you whether the
position is important; it does not directly tell you what the
effect of your variant is!
Missense Variant Effect:
Scoring Models Overview
Criteria to keep in mind:
• What features are used?
– Nucleotide / amino acid conservation
– Amino acid physicochemical properties
• Direct scoring versus Machine learning
– Machine learning models are heavily dependent on the training-set used
• What data-set used for assessment / learning / optimization?
– E.g. Activating / gain-of-function versus inactivating / loss-of-function mutations
– E.g. Mendelian disorders (prevailingly loss-of-function) versus cancer (some are
unique to cancer, e.g. drug resistance)
SIFT
• Broadly used, relatively old (first published: 2001)
• Designed for deleterious mutation (i.e. disruptive of protein function)
• Based solely on protein sequence (amino acid) conservation
1. Start from query protein sequence
2. Identify similar protein sequences (PSI-BLAST)
3. Multiple alignment of protein sequences (orthologs and paralogs)
4. Amino acid x residue probability matrix (PSSM)
5. For every residue, amino acid probability reweighted by amino acid diversity at the
position (sum of frequency rank * frequency)
→ Score: probability of observing amino acid normalized by residue conservation
Cut-off: 0.05 (based on case studies)
Predicting deleterious amino acid substitutions.
Ng PC, Henikoff S. Genome Res. 2001 May;11(5):863-74.
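The core idea of a conservation-based score with a 0.05 cut-off can be illustrated with a heavily simplified sketch. It is not the SIFT algorithm: it uses raw amino acid counts from one alignment column, scaled by the most frequent amino acid, with no pseudocounts or diversity reweighting. The function names are hypothetical.

```python
from collections import Counter


def scaled_probabilities(column):
    """column: string of amino acids seen at one alignment position ('-' = gap).

    Returns each observed amino acid's count divided by the count of the
    most frequent amino acid, so the most frequent one scores 1.0.
    """
    counts = Counter(aa for aa in column if aa != "-")
    if not counts:
        return {}
    top = max(counts.values())
    return {aa: n / top for aa, n in counts.items()}


def predict(column, alt_aa, cutoff=0.05):
    """Score a substitution to alt_aa; amino acids never observed score 0.0."""
    score = scaled_probabilities(column).get(alt_aa, 0.0)
    return score, ("deleterious" if score < cutoff else "tolerated")
```

With the column `"VVVVVVVVVI"`, a change to I is tolerated (it is seen in a homolog) while a change to E, never observed at that position, is called deleterious. A real implementation needs pseudocounts so that shallow alignments do not make every unseen amino acid look damaging.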
PolyPhen2
• Integrates multiple features
– 8 sequence-based, 3 structure-based (nucleotide and amino acid level)
(e.g. side chain volume change, overlap with PFAM domain, multiple alignment metrics)
• Machine learning method (Naïve Bayes) → Requires training set
– Set 1: HumDiv
   – Positive: damaging alleles for known Mendelian disorders (UniProt)
   – Negative: nondamaging differences between human proteins and related mammalian homologs
   – Performance, 5-fold cross-validation: (TP ~ 80%, FP ~ 10%), (TP ~ 90%, FP ~ 20%)
– Set 2: HumVar
   – Positive: all human disease-causing mutations (UniProt)
   – Negative: non-synonymous SNPs without disease association
→ Richer model than SIFT
→ More biased towards training set(s) than SIFT
A method and server for predicting damaging missense mutations.
Adzhubei IA, Schmidt S, Peshkin L, […], Bork P, Kondrashov AS, Sunyaev SR. Nat Methods. 2010 Apr;7(4):248-9.
MutationAssessor
• Direct / theoretical model (no machine learning)
• Based on amino acid conservation, also specifically modeling
conservation unique to protein subfamilies (can be regarded as an
enhanced SIFT)
• Entropy-based score based on protein sequence alignment
• Performs well for (recurrent) somatic variants
Predicting the functional impact of protein mutations: application to cancer genomics.
Reva B, Antipin Y, Sander C. Nucleic Acids Res. 2011 Sep 1;39(17):e118
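The entropy ingredient of such a score can be sketched as the Shannon entropy of one alignment column. This is only an illustration of the idea, not the published MutationAssessor score; the function name is hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the amino acids in one alignment column.

    Gaps ('-') are ignored. 0.0 means the position is fully conserved.
    """
    counts = Counter(aa for aa in column if aa != "-")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

The subfamily idea shows up when entropy is computed separately per subfamily: a column such as `"LLLLFFFF"` has an entropy of 1 bit over the whole family, yet each half (`"LLLL"`, `"FFFF"`) has entropy 0, which suggests a residue that is conserved within each subfamily but differs between them.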
CADD
• Intended as a measure of “deleteriousness” for coding and noncoding sequence, not biased to known disease variation
– However, does not model gene-specific constraint in detail
• Machine learning model (Linear SVM)
– Negative training set: nearly fixed human alleles, variant if compared to
inferred human-chimp ancestral genome
– Positive training set: simulated variants based on mutation model aware of
sequence context and primate substitution rates
– Predictive features (63): VEP (Variant Effect Predictor) output, UCSC tracks,
ENCODE tracks → includes missense predictions and nucleotide-level
conservation
– Performance assessment: using pathogenic variants from ClinVar, performs a
bit better than PhyloP for all sites and than PolyPhen/SIFT for missense coding variants
A general framework for estimating the relative pathogenicity of human genetic variants.
Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. Nat Genet. 2014 Mar;46(3):310-5.
CADD
[Figure: Pathogenic ClinVar vs NHLBI-ESP > 5%]
Splicing Regulatory Predictions
• Goal: predict how SNVs affect exon inclusion / exclusion
• Strategy:
1. Learn “Wild Type” splicing code based on reference genome sequence
motifs and experimentally-measured splicing patterns in human tissues
2. “Mutant” code: predicts splicing change when variant alters splicing-guiding sequence motif
→ Does not learn based on known disease splicing alterations
Science 2015
Phosphorylation and other protein
modifications
• Post-translational modifications
(PTMs) extend protein function
• Human: >130,000 PTM sites,
12% of protein sequence
• Enriched in inherited disease and
somatic cancer mutations
• Negatively selected in population
• Often not detected with mutation
assessment tools
Reimand et al, 2013 Mol Sys Bio; 2015 PLOS Genet
Effect Scoring:
Concluding Remarks
1. Nucleotide-level conservation (PhyloP) is simple yet powerful,
and multiple alignments can be additionally inspected
2. Missense scoring models are powerful, but their strengths and
weaknesses need to be understood
3. Variants should always be reviewed putting all information in
context
– Consider conservation and effect scores using different models
– Review the amino acid change and sequence context
– Look for clusters of somatic variants and protein domains
– Don’t forget gene-level information!
We are on a Coffee Break &
Networking Session