CSE291: Personal genomics for bioinformaticians
Class meetings: TR 3:30-4:50 MCGIL 2315
Office hours: M 3:00-5:00, W 4:00-5:00 CSE 4216
Contact: [email protected]
Today’s schedule:
• 3:30-4:15 Evaluating de novo mutations
• 4:15-4:20 Break
• 4:20-4:50 Other constraint-based metrics
Announcements:
• PS5 due today
De novo mutations and constraint
CSE291: Personal Genomics for Bioinformaticians
03/09/17
Outline
• Challenges in analyzing de novo
mutations
• A framework to interpret de novos
(Samocha et al. 2014)
• Constraint analysis in ExAC
• Analyzing constraint genomewide
Challenges in analyzing de novo mutations
[Figure: pedigree with genotype labels C,D; A,B; A,E]
• Requires accurate calling of genotypes in parents and child
• Can’t trace inheritance in families
• Depending on the disease, de novo
mutations may be spread across
multiple genes.
• Kabuki (1 major gene, KMT2D)
• ASD (many genes) (hard!)
Autism spectrum disorder (ASD)
Characteristics
• Difficulty communicating and interacting with others
• Repetitive behaviors, limited interests or activities
• Symptoms that hurt the individual’s ability to function
socially, at school or work, or other areas
• Usually recognized in the first two years of life
• 46% of ASD children have above average IQ, whereas
others are severely impaired
Risk factors
• Boys > girls
• Sibling with ASD
• Having older parents
• Other genetic conditions (e.g. Down Syndrome, Fragile
Genetics of ASD
• Over 50% of ASD risk attributed to genetic
variation
• Diverse genetic architecture!
• Common variation (combination of small
additive effects from many SNPs)
• Copy number variants
• Rare variants and monogenic cases (~25%
of cases have an identifiable causative
mutation, often de novo)
Initial ASD exome sequencing studies
Limited ability to implicate
individual genes
Complication: what is background de novo rate?
Which gene is expected to have more mutations?
GENE 1
GENE 2
Example: TTN, huge protein, frequent de novos
What about this one? (Hint: CpG highly mutable)
CGCGCGCGCGCGCGCG
AATCACTACACGTCATCG
A framework to interpret de
novo mutations
The key: an accurate mutation model
What factors determine the number of mutations
in a gene?
• Length! (longer gene, more
opportunities for mutation)
• Sequence! In particular, trinucleotide
context
• Also: replication timing, recombination,
…
1. Create a “neutral” mutation rate table
for each trinucleotide context
2. For each coding base, determine
probability of each base changing to each
other base
3. Determine the outcome of each change
4. Sum probabilities (synonymous, missense,
nonsense) across all bases of a gene
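Steps 1-4 can be sketched in a few lines of Python. This is only an illustration: `toy_rate` is a made-up stand-in for a real trinucleotide rate table (a flat rate with a 10x boost for CpG transitions), the flanking bases are hypothetical defaults, and stop-loss changes are simply counted as missense.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def toy_rate(trinuc, alt):
    """Hypothetical 'neutral' rate table: flat rate, higher for CpG transitions."""
    ref = trinuc[1]
    rate = 1e-8
    cpg_c_to_t = ref == "C" and trinuc[2] == "G" and alt == "T"
    cpg_g_to_a = ref == "G" and trinuc[0] == "C" and alt == "A"
    if cpg_c_to_t or cpg_g_to_a:
        rate *= 10
    return rate

def gene_mutation_probs(cds, flank5="A", flank3="A", rate=toy_rate):
    """Sum per-base mutation probabilities by consequence across a coding sequence."""
    padded = flank5 + cds + flank3
    totals = {"synonymous": 0.0, "missense": 0.0, "nonsense": 0.0}
    for i, ref in enumerate(cds):
        trinuc = padded[i:i + 3]          # context centred on cds[i]
        start = i - i % 3
        codon = cds[start:start + 3]
        pos = i % 3
        for alt in "ACGT":
            if alt == ref:
                continue
            mutated = codon[:pos] + alt + codon[pos + 1:]
            old_aa, new_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
            if new_aa == old_aa:
                kind = "synonymous"
            elif new_aa == "*":
                kind = "nonsense"
            else:
                kind = "missense"   # stop-loss is lumped in here for simplicity
            totals[kind] += rate(trinuc, alt)
    return totals

print(gene_mutation_probs("ATGCGA"))
```

The coding sequence length must be a multiple of 3. A longer gene accumulates more terms in each sum, which is the "length" effect; the CpG boost in `toy_rate` is the "sequence context" effect.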
P-values and Z-scores
P-value: probability of finding the observed, or
more extreme, results when the null hypothesis
(H0) is true
Example: rolling a pair of dice.
H0: dice are fair
H1: dice are not fair
Observation: we roll a 12
P-value: 1/36 (probability we roll 12 or greater under H0)
[Figure: “null” distribution (results under H0); x-axis: result, y-axis: frequency; the observation is marked with a red line]
P-value: percent of area under the curve to the right of the red line
Z-score = (Observation-mean)/(standard deviation)
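The dice example can be checked by enumerating all 36 outcomes. A minimal sketch (function names are ours, chosen for illustration):

```python
from fractions import Fraction
from itertools import product
from math import sqrt

# All 36 equally likely sums of two fair dice (the null distribution under H0).
SUMS = [a + b for a, b in product(range(1, 7), repeat=2)]

def p_value_at_least(observed):
    """P(sum >= observed) under H0."""
    return Fraction(sum(s >= observed for s in SUMS), len(SUMS))

def z_score(observed):
    """(observation - mean) / (standard deviation) of the null distribution."""
    mean = sum(SUMS) / len(SUMS)
    variance = sum((s - mean) ** 2 for s in SUMS) / len(SUMS)
    return (observed - mean) / sqrt(variance)

print(p_value_at_least(12))      # 1/36
print(round(z_score(12), 2))     # 2.07
```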
Calculating constraint
• Fit a linear model to predict observed number of
synonymous variants using calculated mutation
probabilities
• Apply fit using missense probabilities to determine
expected number of rare missense variants per gene
• For synonymous and missense mutations: perform
chi-squared test comparing observed to expected
counts
• Convert to a signed Z-score:
• Negative values (not constrained): more variants
than expected
• Positive values (constrained): fewer variants than expected
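A rough sketch of the procedure above, assuming an ordinary least-squares line for the fit and taking the signed Z-score as the square root of the chi-squared statistic (obs - exp)^2 / exp with the sign convention from the slide. The function names and input numbers are made up for illustration.

```python
from math import sqrt

def fit_line(x, y):
    """Ordinary least squares: returns (slope, intercept) for y ~ slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def signed_z(observed, expected):
    """Positive when fewer variants than expected (constrained), negative otherwise."""
    chi2 = (observed - expected) ** 2 / expected
    return sqrt(chi2) if observed < expected else -sqrt(chi2)

# Hypothetical per-gene inputs: synonymous mutation probabilities and observed counts.
syn_prob = [1.0e-6, 2.0e-6, 4.0e-6]
syn_obs = [11, 19, 42]
slope, intercept = fit_line(syn_prob, syn_obs)

# Apply the synonymous fit to a gene's missense probability to get its expected count.
mis_prob, mis_obs = 6.0e-6, 30
mis_expected = slope * mis_prob + intercept
print(mis_expected, signed_z(mis_obs, mis_expected))
```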
Validating the model
• Synonymous variants are mostly neutral: per-gene mutation probabilities are highly
correlated with the number of observed rare
synonymous mutations per gene in ESP
• Significant fraction of genes depleted for
missense variation (p < 10^-16)
• Most strongly constrained genes
associated with dominant or sporadic
diseases in OMIM (27/86)
• Genes with average constraint associated
with recessive disorders (11/111)
Now that we have a model, we can…
In ASD:
1. Test for a genome-wide excess of certain mutation classes in cases
2. Test whether individual genes have excess of de novo
mutations in cases at genome-wide significance
3. Test whether specific sets of genes show excesses of de
novo mutations
Beyond ASD:
Which genes are most constrained, and thus most
likely to be disease-causing? Very useful prioritization
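One way to sketch the per-gene test (item 2 above) is a one-sided Poisson test. We assume here that the expected number of de novo mutations in a gene is 2 x (number of trios) x (per-gene mutation probability), i.e. the probability is per chromosome per generation; the example numbers are invented.

```python
from math import exp

def poisson_sf(k, mu):
    """P(X >= k) for X ~ Poisson(mu)."""
    term = exp(-mu)          # P(X = 0)
    below = 0.0
    for j in range(k):
        below += term
        term *= mu / (j + 1)  # P(X = j + 1) from P(X = j)
    return max(0.0, 1.0 - below)

def de_novo_excess(n_observed, n_trios, per_gene_prob):
    """Expected de novo count and one-sided p-value for an excess in cases."""
    expected = 2 * n_trios * per_gene_prob   # assumption: two chromosomes per child
    return expected, poisson_sf(n_observed, expected)

# Hypothetical gene: 3 de novo mutations seen in 1,000 trios.
expected, p = de_novo_excess(3, 1000, 1e-5)
print(expected, p)
```

To claim genome-wide significance, the p-value would still have to survive correction for the number of genes tested.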
Application to ASD
Individually significant genes
Genetic constraint in intellectual disability
Individually significant genes
Stratifying ASD patients by IQ
Gene-set constraint enrichment
Evaluating constrained genes
Constraint analysis in ExAC
Exome Aggregation Consortium
http://exac.broadinstitute.org/
IDEA: collect ALL available exome sequencing datasets:
• 1000 Genomes
• ESP
• NIMH
• SIGMA-T2D
• GoT2D
• TCGA
• More…
60,000+ total exomes!
Gene constraint model fits ExAC data
[Figure panels: synonymous variants, missense variants, nonsense variants]
A new metric for LoF mutations
Challenge: constraint score power depends on gene
size
• Bigger genes = more observed and expected mutations, stronger p-values
Alternative metric: ratio of observed to expected nonsense variants
λgene = (obs # nonsense)/(expected # nonsense)
λgene ≈ 1 for tolerant (null) genes
λgene ≈ 0.5 for recessive disease genes
λgene ≈ 0 for haploinsufficient (intolerant) genes
Use EM algorithm to estimate P(null), P(recessive),
P(HI) for each gene
pLI (probability of being loss-of-function intolerant) = P(HI) / (P(null) + P(recessive) + P(HI))
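A toy version of the EM step, to make the idea concrete. It assumes the observed nonsense count in each gene is Poisson with mean λ x expected, with λ fixed per class; the class values below follow the slide except that 0.1 is used for the HI class because a rate of exactly 0 cannot explain any observed variant. These values and the function names are illustrative assumptions, not the published parameters.

```python
from math import exp, lgamma, log

# Assumed obs/exp ratio for each class (HI set to 0.1 rather than 0; see above).
CLASS_LAMBDA = {"null": 1.0, "recessive": 0.5, "HI": 0.1}

def poisson_logpmf(k, mu):
    return k * log(mu) - mu - lgamma(k + 1)

def posteriors(observed, expected, prior):
    """E-step: per-gene posterior probability of each class."""
    out = []
    for obs, exp_n in zip(observed, expected):
        loglik = {c: poisson_logpmf(obs, lam * exp_n) for c, lam in CLASS_LAMBDA.items()}
        top = max(loglik.values())   # subtract the max to avoid underflow
        weight = {c: prior[c] * exp(loglik[c] - top) for c in CLASS_LAMBDA}
        total = sum(weight.values())
        out.append({c: w / total for c, w in weight.items()})
    return out

def estimate_pli(observed, expected, n_iter=50):
    """Alternate E- and M-steps; return per-gene P(HI) and the class proportions."""
    prior = {c: 1.0 / len(CLASS_LAMBDA) for c in CLASS_LAMBDA}
    for _ in range(n_iter):
        post = posteriors(observed, expected, prior)
        # M-step: class proportions are the average posterior across genes.
        prior = {c: sum(p[c] for p in post) / len(post) for c in CLASS_LAMBDA}
    post = posteriors(observed, expected, prior)
    return [p["HI"] for p in post], prior

# Three hypothetical genes, each with 20 expected nonsense variants.
pli, proportions = estimate_pli([0, 20, 10], [20.0, 20.0, 20.0])
print(pli, proportions)
```

Because the three posteriors sum to 1 for each gene, P(HI) here is already the normalized quantity in the pLI definition.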
Highly constrained genes are broadly expressed
Constraint metric provides additional information
Extending gene-based constraint metrics
What other functional units can be evaluated for constraint?
• Individual exons
• Different transcripts
• Protein domains
What about non-coding regions of the genome? (up next)
ExAC browser
http://exac.broadinstitute.org/
Analyzing constraint genome-wide
Constraint: a general framework
[Figure: genome divided into Chunk 1, Chunk 2, Chunk 3]
Hard/impossible to measure constraint at a specific position. Why?
Instead, aggregate positions into convenient “chunks”. Compare observed vs.
expected variation
Coding regions:
• Genes
• Transcripts
• Exons
Can also aggregate by annotations:
• Synonymous
• Missense
• Nonsense
• Transcription factor binding site disruption
What about for non-coding regions?
• Promoters?
• Enhancers?
• Windows of x bp?
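For the "windows of x bp" option, the observed vs. expected comparison might look like the sketch below. It assumes 0-based variant positions and a single uniform per-base expected rate, which is a deliberate oversimplification; a real expectation would come from a mutation model.

```python
from collections import Counter

def window_obs_exp(variant_positions, region_length, window_size, per_base_rate):
    """Return (start, end, observed, expected, obs/exp) for each fixed-size window."""
    counts = Counter(pos // window_size for pos in variant_positions)
    n_windows = -(-region_length // window_size)   # ceiling division
    rows = []
    for w in range(n_windows):
        start = w * window_size
        end = min(start + window_size, region_length)
        expected = (end - start) * per_base_rate    # assumption: uniform rate
        observed = counts.get(w, 0)
        rows.append((start, end, observed, expected, observed / expected))
    return rows

for row in window_obs_exp([0, 5, 12], region_length=20, window_size=10, per_base_rate=0.1):
    print(row)
```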
Constraint: other challenges
Lots of factors affect mutation patterns:
• We know about length, sequence context, but also:
• GC content
• Replication timing
• DNA secondary structure
• Recombination
• Demographic history
• Selection on nearby regions
Technical: genotyping errors and coverage. Why?
For whole genomes: less data and lower coverage than
exomes, reducing power
INSIGHT
Key challenge: compare polymorphism in elements of interest (e.g. TFBS), with presumably
neutral flanking regions
Aggregate across many similar short elements across the genome
Also integrates phylogenetic information from multiple outgroup species
Gronau et al. 2013
fitCons
Combines INSIGHT with functional annotations
1. Gather functional annotations, e.g. histone modifications, TFBS
2. Cluster genomic regions based on annotations
3. Apply INSIGHT to infer selection parameters for each cluster
4. Assign “fitCons” (fitness consequences) to each genomic region based on cluster assignment
Gulko et al. 2014
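The cluster-then-score structure of this pipeline can be sketched as follows. Note that this does not implement INSIGHT: `depletion_stand_in` is a crude placeholder (1 - observed/expected polymorphism, pooled per cluster), and the region records and field names are hypothetical.

```python
from collections import defaultdict

def depletion_stand_in(members):
    """Placeholder for INSIGHT: pooled depletion of polymorphism in a cluster."""
    observed = sum(m["observed"] for m in members)
    expected = sum(m["expected"] for m in members)
    return max(0.0, 1.0 - observed / expected)

def cluster_scores(regions, estimate=depletion_stand_in):
    """Group regions by their annotation combination, score each cluster once,
    then assign the cluster's score to every region in it."""
    clusters = defaultdict(list)
    for region in regions:
        clusters[frozenset(region["annotations"])].append(region)
    per_cluster = {key: estimate(members) for key, members in clusters.items()}
    return {r["name"]: per_cluster[frozenset(r["annotations"])] for r in regions}

regions = [
    {"name": "r1", "annotations": {"TFBS"}, "observed": 1, "expected": 4.0},
    {"name": "r2", "annotations": {"TFBS"}, "observed": 1, "expected": 4.0},
    {"name": "r3", "annotations": set(), "observed": 5, "expected": 5.0},
]
print(cluster_scores(regions))
```

Pooling across all regions in a cluster is what gives short elements enough data to estimate anything at all.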
10,000 Whole genome sequences from HLI