CSE291: Personal genomics for bioinformaticians
Class meetings: TR 3:30-4:50 MCGIL 2315
Office hours: M 3:00-5:00, W 4:00-5:00 CSE 4216
Contact: [email protected]

Today’s schedule:
• 3:30-4:15 Evaluating de novo mutations
• 4:15-4:20 Break
• 4:20-4:50 Other constraint-based metrics

Announcements:
• PS5 due today

De novo mutations and constraint
CSE291: Personal Genomics for Bioinformaticians
03/09/17

Outline
• Challenges in analyzing de novo mutations
• A framework to interpret de novos (Samocha et al. 2014)
• Constraint analysis in ExAC
• Analyzing constraint genome-wide

Challenges in analyzing de novo mutations
[Figure: genotype labels C,D; A,B; A,E]
• Requires accurate calling of genotypes in parents and child
• Can’t trace inheritance in families
• Depending on the disease, de novo mutations may be spread across multiple genes.
• Kabuki (1 major gene, KMT2D)
• ASD (many genes) (hard!)

Autism spectrum disorder (ASD)
Characteristics
• Difficulty communicating and interacting with others
• Repetitive behaviors, limited interests or activities
• Symptoms that hurt the individual’s ability to function socially, at school or work, or in other areas
• Usually recognized in the first two years of life
• 46% of ASD children have above-average IQ, whereas others are severely impaired
Risk factors
• Boys > girls
• Sibling with ASD
• Having older parents
• Other genetic conditions (e.g. Down Syndrome, Fragile

Genetics of ASD
• Over 50% of ASD risk attributed to genetic variation
• Diverse genetic architecture!
• Common variation (combination of small additive effects from many SNPs)
• Copy number variants
• Rare variants and monogenic cases (~25% of cases have an identifiable causative mutation, often de novo)

Initial ASD exome sequencing studies
Limited ability to implicate individual genes

Complication: what is the background de novo rate?
Which gene is expected to have more mutations?
[Figure: GENE 1, GENE 2]
Example: TTN, huge protein, frequent de novos
What about this one?
(Hint: CpG is highly mutable)
CGCGCGCGCGCGCGCG
AATCACTACACGTCATCG A

A framework to interpret de novo mutations

The key: an accurate mutation model
What factors determine the number of mutations in a gene?
• Length! (longer gene, more opportunities for mutation)
• Sequence! In particular, trinucleotide context
• Also: replication timing, recombination, …
1. Create a “neutral” mutation rate table for each trinucleotide context
2. For each coding base, determine the probability of that base changing to each other base
3. Determine the outcome of each change
4. Sum probabilities (synonymous, missense, nonsense) across all bases of a gene

P-values and Z-scores
P-value: probability of finding the observed, or more extreme, results when the null hypothesis (H0) is true
Example: rolling a pair of dice.
H0: dice are fair
H1: dice are
Observation: we roll a 12
P-value: 1/36 (probability we roll 12 or greater under H0)
[Figure: “null” distribution (results under H0); x-axis: Result, y-axis: Frequency; a red line marks the Observation]
P-value: percent of area under the curve to the right of the red line
Z-score = (Observation - mean) / (standard deviation)

Calculating constraint
• Fit a linear model to predict the observed number of synonymous variants using calculated mutation probabilities
• Apply the fit using missense probabilities to determine the expected number of rare missense variants per gene
• For synonymous and missense mutations: perform a chi-squared test comparing observed to expected counts
• Convert to a signed Z-score:
• Negative values (not constrained): more variants than expected
• Positive values (constrained): fewer variants than expected

Validating the model
• Synonymous variants are mostly neutral: per-gene mutation probabilities are highly correlated with the number of observed rare synonymous mutations per gene in ESP
• A significant fraction of genes is depleted for missense variation (p < 10^-16)
• Most strongly constrained genes associated with dominant or
sporadic diseases in OMIM (27/86)
• Genes with average constraint associated with recessive disorders (11/111)

Now that we have a model, we can…
In ASD:
1. Test for a genome-wide excess of certain mutation classes in cases
2. Test whether individual genes have an excess of de novo mutations in cases at genome-wide significance
3. Test whether specific sets of genes show excesses of de novo mutations
Beyond ASD: Which genes are most constrained, and thus most likely to be disease-causing? Very useful for prioritization

Application to ASD
Individually significant genes

Genetic constraint in intellectual disability
Individually significant genes

Stratifying ASD patients by IQ

Gene-set constraint enrichment

Evaluating constrained genes

Constraint analysis in ExAC

Exome Aggregation Consortium
http://exac.broadinstitute.org/
IDEA: collect ALL available exome sequencing datasets:
• 1000 Genomes
• ESP
• NIMH
• SIGMA-T2D
• GoT2D
• TCGA
• More…
60,000+ total exomes!

Gene constraint model fits ExAC data
[Figure panels: Synonymous variants, Missense variants, Nonsense variants]

A new metric for LoF mutations
Challenge: constraint score power depends on gene size
• Bigger genes = higher number of obs/exp mutations, stronger p-values
Alternative metric: ratio of missing nonsense variants
λgene = (obs # nonsense) / (expected # nonsense)
λgene ≈ 1 for tolerant (null) genes
λgene ≈ 0.5 for recessive disease genes
λgene ≈ 0 for haploinsufficient (intolerant) genes
Use EM algorithm to estimate P(null), P(recessive), P(HI) for each gene
pLI (probability of being loss-of-function intolerant) = P(HI) / (P(null) + P(recessive) + P(HI))
[Figure axis label: pLI (prob. of being loss-of-function intolerant)]

Highly constrained genes are broadly expressed

Constraint metric provides additional information

Extending gene-based constraint metrics
What other functional units can be evaluated for constraint?
• Individual exons
• Different transcripts
• Protein domains
What about non-coding regions of the genome? (up next)

ExAC browser
http://exac.broadinstitute.org/

Analyzing constraint genome-wide

Constraint: a general framework
[Figure: Genome divided into Chunk 1, Chunk 2, Chunk 3]
Hard/impossible to measure constraint at a specific position. Why?
Instead, aggregate positions into convenient “chunks”. Compare observed vs. expected variation
Coding regions:
• Genes
• Transcripts
• Exons
Can also aggregate by annotations:
• Synonymous
• Missense
• Nonsense
• Transcription factor binding site disruption
What about for non-coding regions?
• Promoters?
• Enhancers?
• Windows of x bp?

Constraint: other challenges
Many factors affect mutation patterns:
• We know about length, sequence context, but also:
• GC content
• Replication timing
• DNA secondary structure
• Recombination
• Demographic history
• Selection on nearby regions
Technical: genotyping errors and coverage. Why?
For whole genomes: less data and lower coverage than exomes, reducing power

INSIGHT
Key challenge: compare polymorphism in elements of interest (e.g. TFBS) with presumably neutral flanking regions
Aggregate across many similar short elements across the genome
Also integrates phylogenetic information from multiple outgroup species
Gronau et al. 2013

fitCons
Combines INSIGHT with functional annotations
Gather functional annotations, e.g. histone modifications, TFBS
Cluster genomic regions based on annotations
Apply INSIGHT to infer selection parameters for each cluster
Assign “fitCons” (fitness consequences) scores to each genomic region based on cluster assignment
Gulko et al. 2014

10,000 Whole genome sequences from HLI
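
Worked sketch: observed vs. expected counts
The “Calculating constraint” and “A new metric for LoF mutations” slides both compare observed to expected variant counts per gene. The Python sketch below shows one way to turn such counts into a signed Z-score (positive = fewer variants than expected) and into the ratio λgene. It assumes the expected counts are already available from a mutation model. The function names, gene names, and counts are hypothetical and for illustration only, and the Z-score is taken as the signed square root of a single-term chi-squared statistic, which may differ in detail from the published implementation.

```python
import math


def signed_constraint_z(observed: int, expected: float) -> float:
    """Signed Z-score for one gene and one variant class.

    chi-squared = (observed - expected)^2 / expected; the Z-score is its
    square root, signed so that fewer variants than expected is positive.
    """
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return (expected - observed) / math.sqrt(expected)


def nonsense_ratio(observed: int, expected: float) -> float:
    """lambda_gene = (obs # nonsense) / (expected # nonsense)."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected


# Hypothetical genes: (observed missense, expected missense,
#                      observed nonsense, expected nonsense)
example_genes = {
    "GENE_A": (120, 200.0, 1, 20.0),   # far fewer variants than expected
    "GENE_B": (210, 200.0, 19, 20.0),  # close to expectation
}

for name, (obs_mis, exp_mis, obs_non, exp_non) in example_genes.items():
    z = signed_constraint_z(obs_mis, exp_mis)
    lam = nonsense_ratio(obs_non, exp_non)
    print(f"{name}: missense Z = {z:+.2f}, lambda = {lam:.2f}")
```

In this toy input, GENE_A gets a large positive Z-score and a λgene near 0 (the constrained, intolerant pattern), while GENE_B gets a Z-score near zero and a λgene near 1 (the tolerant pattern).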