Software tools for the analysis of medically
important sequence variations
Gabor T. Marth, D.Sc.
Boston College
Department of Biology
[email protected]
http://bioinformatics.bc.edu/marthlab
Pfizer visit, March 7, 2006
Our lab focuses on three main projects…
1. software tools for clinical case-control association
studies
2. software for SNP discovery in clonal and resequencing data,
3. connecting HapMap and pharmaco-genetic data
1. We are developing computer software to aid tagSNP
selection and association testing
[Figure labels: user input; study specification; reference samples; computational sample database; representative computational samples; tag evaluation; marker selection; association testing; GUI; user control interface; input data views; tags; gene annotations; LD views; association statistics]
[Plot: 5-site computationally generated LD (r2) plotted against LA LD (r2), both axes from 0 to 1; series: 1-4, 5-9, 10-17, and 18-26 Mrk Sep.]
(discussed in more detail)
2. We build computer tools for SNP discovery
• inherited (germ line) polymorphisms are important as they can predispose to disease
• looking for SNPs and short INDELs
\[
P(\mathrm{SNP}) = \frac{\displaystyle \sum_{\text{all variable } S} \frac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \frac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \, P_{\mathrm{Prior}}(S_1, \ldots, S_N)}{\displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]} \frac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \frac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \, P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
\]
Marth et al.
Nature Genetics 1999
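A minimal sketch of how the Bayesian SNP probability on this slide could be evaluated for one alignment column by brute-force enumeration. The per-read model (called base gets 1 - e, every other base e/3), the uniform single-base prior of 0.25, and the `joint_prior` function are illustrative assumptions, not the priors used in PolyBayes.

```python
from itertools import product

BASES = "ACGT"

def base_posteriors(called_base, error_prob):
    """P(S_i | R_i) under an assumed model: called base 1 - e, others e / 3."""
    return {b: (1.0 - error_prob) if b == called_base else error_prob / 3.0
            for b in BASES}

def joint_prior(column, p_poly=0.001):
    """Hypothetical P_Prior(S_1..S_N): 1 - p_poly is shared by the 4 monomorphic
    columns, p_poly is shared evenly by the 4**N - 4 variable columns."""
    if len(set(column)) == 1:
        return (1.0 - p_poly) / 4.0
    return p_poly / (4 ** len(column) - 4)

def snp_probability(reads, p_poly=0.001):
    """reads: list of (called_base, error_prob), one entry per sequence."""
    posts = [base_posteriors(base, err) for base, err in reads]
    single_prior = 0.25  # assumed uniform P_Prior(S_i)
    variable = total = 0.0
    for column in product(BASES, repeat=len(reads)):
        weight = joint_prior(column, p_poly)
        for post, base in zip(posts, column):
            weight *= post[base] / single_prior
        total += weight              # denominator: all base combinations
        if len(set(column)) > 1:
            variable += weight       # numerator: variable combinations only
    return variable / total
```

Enumeration costs 4^N terms, so this form is only practical for shallow columns; it is meant to make the structure of the formula concrete, not to be fast.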
• we have a 5-year NIH R01 grant to re-develop our
computer package, PolyBayes© , our SNP discovery
tool originally developed while the PI was at the
Washington University Medical School
Apply our tools for genome-scale SNP mining
[Figure labels: genome reference; EST; WGS; BAC; ~ 10 million]
Sachidanandam et al.
Nature 2001
Extend our methods for SNP detection in medical resequencing data from traditional Sanger sequencers…
Homozygous C
Heterozygous C/T
Homozygous T
… and in 454 pyrosequence data
• detection of heterozygotes
in medical re-sequencing data
• accurate base calling for de novo sequencing
454 sequence from the NCBI Trace Archive
(discussed in more detail)
Figure from Nordfors et al.,
Human Mutation 19:395-401 (2002)
Developing methods to detect somatic mutations (as
distinguished from inherited polymorphisms)
• the detection of somatic mutations, and their distinction from inherited
polymorphism, will be important to separate pre-disposing variants from
mutations that occur during disease progression e.g. in cancer
© Brian Stavely, Memorial University of Newfoundland
(discussed in more detail)
Process DNA methylation data obtained with sequencing
DNA methylation is important e.g. because hypo- and hypermethylation are
consistently present in various cancers
we are developing methods to interpret DNA
methylation data obtained with sequencing, in
the presence of methodological artifacts such
as incomplete bisulfite conversion of unmethylated cytosines
Issa. Nature Reviews Cancer, 4, 2004: 988-993
Lewin et al. Bioinformatics, 20:3005-3012, 2004
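To make the bisulfite artifact concrete, here is a small sketch of one common correction, not necessarily the method being developed here. It assumes methylated cytosines always read as C, while each unmethylated cytosine is converted (and read as T) with probability `conversion_rate`. The observed C fraction is then f = m + (1 - m)(1 - c), which is solved for the methylation level m. The function name and the C/T signal inputs are hypothetical.

```python
def corrected_methylation(c_signal, t_signal, conversion_rate):
    """Estimate the methylation level at one cytosine.

    c_signal, t_signal: amount of C and T signal at the site (counts or
    peak heights; an assumed input format).
    conversion_rate: fraction of unmethylated C converted, in (0, 1].
    """
    total = c_signal + t_signal
    if total <= 0:
        raise ValueError("no signal at this position")
    if not 0.0 < conversion_rate <= 1.0:
        raise ValueError("conversion_rate must be in (0, 1]")
    observed_c = c_signal / total
    # Invert f = m + (1 - m) * (1 - c)  =>  m = 1 - (1 - f) / c
    estimate = 1.0 - (1.0 - observed_c) / conversion_rate
    return min(1.0, max(0.0, estimate))  # clip noise-driven overshoot
```

With complete conversion the estimate is just the observed C fraction; with 80% conversion an observed C fraction of 0.6 corresponds to a methylation level of 0.5.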
… and tools to integrate genetic and epigenetic data from varied
sources to find “common themes” during cancer development
[Figure labels: somatic mutations; chromosome rearrangements; methylation profiles; chromatin structure; copy number changes; gene expression profiles; repeat expansions]
3. We are planning a project to connect multi-marker
haplotypes to drug metabolic phenotypes
• predicting metabolic phenotypes (ADR) based on haplotype markers
• evolutionary origin of drug
metabolizing enzyme polymorphisms
Computer software to aid case-control association studies:
tagSNP selection and association testing (details)
[Plot: 5-site computationally generated LD (r2) plotted against LA LD (r2), both axes from 0 to 1; series: 1-4, 5-9, 10-17, and 18-26 Mrk Sep.]
Dr. Eric Tsung
Clinical case-control association studies – concepts
• association studies are designed to find disease-causing genetic variants
• genotyping cases and controls at various polymorphisms
• searching for “significant” marker allele frequency differences between cases
and controls
[Figure: clinical cases and clinical controls, compared by AF(cases) vs. AF(controls)]
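As an illustration of the allele frequency comparison, the sketch below runs a standard Pearson chi-square test on a 2x2 table of allele counts (1 degree of freedom). This is a textbook test chosen for illustration; the function name and argument layout are assumptions, and it is not the sample-based significance assessment developed later in this talk.

```python
import math

def allele_association(case_alt, case_total, ctrl_alt, ctrl_total):
    """Return (delta_af, chi2, p_value) for marker allele counts.

    *_alt: chromosomes carrying the marker allele; *_total: chromosomes typed.
    """
    n = case_total + ctrl_total
    alt = case_alt + ctrl_alt
    delta_af = case_alt / case_total - ctrl_alt / ctrl_total
    if alt == 0 or alt == n:
        return delta_af, 0.0, 1.0  # marker is monomorphic in the pooled data
    observed = [[case_alt, case_total - case_alt],
                [ctrl_alt, ctrl_total - ctrl_alt]]
    rows = [case_total, ctrl_total]
    cols = [alt, n - alt]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return delta_af, chi2, p_value
```

For example, 400 of 1,000 case chromosomes against 300 of 1,000 control chromosomes gives an allele frequency difference of 0.1 and a chi-square statistic of about 21.98.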
Association study designs
• region(s) interrogated: single gene, list of candidate genes (“candidate gene study”),
or entire genome (“genome scan”)
• direct (the causative variant itself is genotyped) or indirect (a marker that is
co-inherited with the causative variant is genotyped)
• single-SNP marker or multi-SNP haplotype marker
• single-stage or multi-stage
Marker (tag) selection for association studies
for economy, one cannot genotype every SNP in thousands of clinical samples:
marker selection is the process by which a subset of all available SNPs is chosen
1. hypothesis-driven (i.e. based on gene function)
2. LD-driven: based entirely on the reduction of redundancy presented by the
linkage disequilibrium (LD) between SNPs; tags represent the other SNPs they are
correlated with
[Figure label: causative variant]
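LD-driven selection can be sketched as a greedy set cover over pairwise r2. The code below is one simple, widely used heuristic and is not claimed to be the selection algorithm in our software; the 0.8 threshold, the 0/1 haplotype encoding, and the function names are assumptions for illustration.

```python
def r_squared(a, b):
    """r^2 between two biallelic SNPs; a and b list 0/1 alleles per haplotype."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    if pa in (0.0, 1.0) or pb in (0.0, 1.0):
        return 0.0  # monomorphic site: LD undefined, treated as 0 here
    pab = sum(1 for x, y in zip(a, b) if x == 1 and y == 1) / n
    d = pab - pa * pb
    return d * d / (pa * (1.0 - pa) * pb * (1.0 - pb))

def greedy_tags(snps, threshold=0.8):
    """snps: dict name -> 0/1 alleles. Returns dict tag -> set of SNPs it represents."""
    names = list(snps)
    covers = {
        s: {t for t in names
            if s == t or r_squared(snps[s], snps[t]) >= threshold}
        for s in names
    }
    untagged = set(names)
    tags = {}
    while untagged:
        # Pick the SNP that represents the most still-untagged SNPs
        best = max(names, key=lambda s: len(covers[s] & untagged))
        tags[best] = covers[best] & untagged
        untagged -= covers[best]
    return tags
```

Every SNP covers itself, so the loop always terminates, and a SNP with no strong LD partner simply becomes its own tag.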
The International HapMap project
The international HapMap project was designed to
provide a set of physical and informational reagents for
association studies by mapping out human LD structure
http://www.hapmap.org
LD varies across samples
there are large differences in LD
between different human populations…
European reference (CEU)
African reference (YRI)
… and even between samples from the
same population.
Other European samples
Sample-to-sample LD differences make tagSNP selection
problematic
groups of SNPs that are in LD in the
HapMap reference samples may not
be in a future set of clinical samples…
… and tags that were selected based
on LD in the HapMap may no longer
work (i.e. represent the SNPs they
were supposed to) in the clinical
samples…
… possibly resulting in missed disease
associations.
Natural marker allele frequency differences confound
association testing
• the HapMap reference samples are much smaller than clinical sample sizes
reference samples: ~ 120 chromosomes
cases: 500-2,000 chromosomes
controls: 500-2,000 chromosomes
• it is therefore difficult to assess the statistical significance
of candidate associations
• it is difficult to accurately assess both marker allele frequency (single-SNP or
haplotype frequency) in the clinical samples and the naturally occurring variation of
marker allele frequency differences between cases and controls
[Figure: AF(cases) vs. AF(controls)]
We are developing technology for assessing sample-to-sample variance in silico
we estimate LD differences between
HapMap and future clinical samples…
…by generating
“computational” samples
representing future
clinical samples…
… and use computational “proxy”
samples for tabulating LD and
allele frequency differences.
[Figure labels: reference; tag selection; tag evaluation; cases; controls; association testing; computational “cases” and “controls”]
Two methods of computational sample generation
Method 1. “Data-relevant Coalescent”. This
algorithm uses a population genetic model to
connect mutations in the HapMap reference
to mutations in future clinical samples. A full
model, but computationally slow.
Method 2. The PAC method (product of
approximate conditionals, Li & Stephens).
This method constructs “new” samples as
mosaics of existing haplotypes, mimicking
the effects of recombination. An
approximation, but fast.
[Figure labels: HapMap; “HapMap”; “cases”; “controls”]
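The mosaic idea behind Method 2 can be sketched as a simple copying process: walk along the markers, copy alleles from one reference haplotype, and occasionally switch to another. This is a deliberately simplified stand-in, not the full Li & Stephens model. The constant per-site `switch_prob`, the optional `mutation_prob`, and the function names are all assumptions; in the PAC model the switch rate depends on recombination distance and sample size.

```python
import random

def mosaic_haplotype(reference, switch_prob=0.05, mutation_prob=0.0, rng=None):
    """Build one new haplotype as a mosaic of reference haplotypes.

    reference: list of haplotypes, each a list of 0/1 alleles of equal length.
    """
    if rng is None:
        rng = random.Random()
    source = rng.randrange(len(reference))
    new = []
    for site in range(len(reference[0])):
        if site > 0 and rng.random() < switch_prob:
            source = rng.randrange(len(reference))  # recombination-like switch
        allele = reference[source][site]
        if rng.random() < mutation_prob:
            allele = 1 - allele  # imperfect copying (mutation-like)
        new.append(allele)
    return new

def computational_sample(reference, size, **kwargs):
    """Generate `size` mosaic haplotypes; kwargs go to mosaic_haplotype."""
    return [mosaic_haplotype(reference, **kwargs) for _ in range(size)]
```

Passing a seeded `random.Random` as `rng` makes a run reproducible, which helps when many consecutive computational samples are compared.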
Computational samples
HapMap (CEU)
Computational (PAC)
Extra genotypes (Estonia)
Computational (Coalescent)
MARKER EVALUATION with computational samples
test if markers selected from the HapMap continue to
“tag” other SNPs in their original LD group
MARKER SELECTION with computational samples
selecting tags in multiple consecutive
sets of computational samples and
choosing for the association study the
best-performing tags
ASSOCIATION TESTING with computational samples
tabulating ΔAF in “cases” vs. “controls” in
multiple consecutive computational pairs of
samples provides the natural range of allele
frequency differences, against which one can decide whether a candidate
association is statistically significant
[Figure: repeated pairs of computational “cases” and “controls”, compared by AF(cases) vs. AF(controls)]
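A sketch of how such a natural range could be tabulated and turned into an empirical p-value. For brevity the “cases” and “controls” here are drawn by resampling marker alleles from one pool of computational haplotypes, which is only a stand-in for generating fresh computational samples with the coalescent or PAC methods. The function names and the add-one p-value estimate are assumptions.

```python
import random

def null_delta_af(pool_alleles, n_cases, n_controls, n_pairs=1000, rng=None):
    """Tabulate AF("cases") - AF("controls") over many computational pairs.

    pool_alleles: 0/1 marker alleles from a pool of computational haplotypes.
    """
    if rng is None:
        rng = random.Random()
    deltas = []
    for _ in range(n_pairs):
        cases = rng.choices(pool_alleles, k=n_cases)
        controls = rng.choices(pool_alleles, k=n_controls)
        deltas.append(sum(cases) / n_cases - sum(controls) / n_controls)
    return deltas

def empirical_p_value(observed_delta, null_deltas):
    """Fraction of null pairs with a difference at least as large (two-sided)."""
    extreme = sum(1 for d in null_deltas if abs(d) >= abs(observed_delta))
    return (extreme + 1) / (len(null_deltas) + 1)  # add-one avoids p = 0
```

The smallest attainable p-value is 1 / (n_pairs + 1), so the number of computational pairs sets the resolution of the test.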
Do computational samples represent future clinical genotypes
realistically?
we quantify the quality of representation by
comparing the correlation of LD between
corresponding pairs of markers (i.e. we ask: if
two markers were in strong LD in one set of
samples, are they ALSO in strong LD in the
other set?)
[Plot: both axes from 0 to 1]
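One way to express this comparison in code is a Pearson correlation over the r2 values of corresponding marker pairs. The choice of Pearson correlation and the dictionary layout (marker pair to r2) are assumptions made for this sketch.

```python
import math

def ld_correlation(r2_a, r2_b):
    """Correlation of LD between two sample sets.

    r2_a, r2_b: dicts mapping a marker pair, e.g. ("m1", "m2"), to its r^2
    in each sample set. Only pairs present in both sets are compared.
    """
    pairs = sorted(set(r2_a) & set(r2_b))
    if len(pairs) < 2:
        raise ValueError("need at least two shared marker pairs")
    xs = [r2_a[p] for p in pairs]
    ys = [r2_b[p] for p in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("r^2 values are constant in one sample set")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 means that marker pairs in strong LD in one sample set are also in strong LD in the other.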
LD difference -- comparison to extra experimental genotypes
• we have analyzed two extra genotype sets collected at the HapMap SNPs in
three genome regions, from our clinical collaborators (Prof. Thomas Hudson,
McGill; Prof. Stanley Nelson, UCLA)
0.949 +/- 0.013
0.963 +/- 0.014
0.978 +/- 0.010
AF difference -- comparisons to extra experimental genotypes
[Plot: AF Diff, Comp Samples vs. AF Diff, Estonian Data; both axes from 0 to 0.06]
• according to our limited initial test, computational samples can represent
future clinical samples well for estimating sample-to-sample variability
A new marker selection and association testing software tool
• data visualization
• gene annotations overlaid on physical map of
SNPs (i.e. the human genome sequence)
• representative computational
sample generation
• advanced tag selection functionality
• advanced association testing functionality
• multi-level user customization including
user conveniences e.g. tag prioritization
based on SNP assay score
[Figure labels: tags; gene annotations; LD views; reference samples; representative computational samples; association statistics]
[Plot: 5-site computationally generated LD (r2) plotted against LA LD (r2), both axes from 0 to 1; series: 1-4, 5-9, 10-17, and 18-26 Mrk Sep.]
User community
• companies designing new generations of whole-genome or specialized SNP arrays
• researchers comparing alternative platforms (e.g. the Affymetrix 500K and the
Illumina 300K) to find the one most suitable for their study
• clinical researchers designing candidate gene studies
• researchers designing second-stage follow-up studies in specific genome regions
after an initial genome scan (our methods can take advantage of first-stage data
already available in the clinical samples)
• the association testing features should be useful for analysts regardless of study
design
Base calling and SNP detection in sequence traces including
454 data
Aaron Quinlan
Base calling and SNP detection in sequence traces including
454 “pyrogram” data
• PolyBayes was originally written to find SNPs
in clonal sequences in large SNP discovery
projects
• medical re-sequencing projects require the detection of SNPs in heterozygous
diploid sequence traces
[Figure labels: 5’ and 3’ strand ends; base pairs C/G, C/G, C/G, T/A]
Heterozygote detection in sequence traces
[Figure: individual traces for Ind. 1, Ind. 2, Ind. 3, and Ind. 4]
• we use a machine learning method (Support Vector Machine, SVM) to recognize
characteristic features of homozygous vs. heterozygous positions
Aggregating information from multiple traces
forward/reverse sequences from the same individual:
P(GT | Read) = .98
P(GT | Read) = .87
resultant genotype call:
P(GT) = .993
Discovery vs. genotyping
discovery: “uninformed prior”
don’t know if site is polymorphic
have to test each site
Prior(CT) = .001
genotyping: “informed prior”
1. site is known to be polymorphic
2. allele frequency estimate
Prior(CT) = 0.34
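The effect of the prior, and of aggregating several reads, can be illustrated with a small Bayesian sketch. It assumes each per-read genotype probability was computed under a flat prior and that reads are independent, so the combined posterior is proportional to the prior times the product of the per-read values. This is an illustrative model only; it is not claimed to reproduce the numbers on these slides, and the genotype labels and non-CT prior values in the usage below are made up.

```python
def combine_genotype_calls(read_posteriors, prior):
    """Combine per-read genotype probabilities with a genotype prior.

    read_posteriors: list of dicts genotype -> P(genotype | read), each
    assumed to come from a flat prior over the same genotypes.
    prior: dict genotype -> prior probability.
    """
    combined = dict(prior)
    for post in read_posteriors:
        for genotype in combined:
            combined[genotype] *= post[genotype]
    total = sum(combined.values())
    return {g: v / total for g, v in combined.items()}
```

Usage with hypothetical values: two reads supporting CT with 0.98 and 0.87 give a posterior above 0.99 under an informed prior of Prior(CT) = 0.34, but a much lower one under a discovery prior of Prior(CT) = 0.001.

```python
forward = {"CC": 0.01, "CT": 0.98, "TT": 0.01}
reverse = {"CC": 0.065, "CT": 0.87, "TT": 0.065}
informed = combine_genotype_calls(
    [forward, reverse], {"CC": 0.33, "CT": 0.34, "TT": 0.33})
uninformed = combine_genotype_calls(
    [forward, reverse], {"CC": 0.9985, "CT": 0.001, "TT": 0.0005})
```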
Our heterozygote detection works better than other methods
               Fraction of    False        Fraction of      Fraction of
               Data           Discovery    Heterozygotes    Homozygotes
               Analyzed       Rate         Found            Found
PolyBayes+     85.1           0.0375       86.60%           97.8%
Polyphred 5    86.17          0.0389       83.16%           82.63%
Performance measured on ~1000 alignments covering a 500 Kb
region of chromosome 4
Base calling for “pyrograms”
• readout in pyrosequencing is based on
instantaneous detection of base
incorporation… multiple bases of the same
type are incorporated in the same cycle
• the identity of consecutive bases is
very reliable but the length of mononucleotide runs (base number) is difficult
to quantify (great for re-sequencing; but
problematic for de novo sequencing)
• we have access to standardized data
formats
[Figure: pyrogram with sequence TCAGGGGGGGGGGGACGACAAGGCGTGGGGA]
From NCBI Trace Archive
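The run-length problem can be seen in a naive base caller that rounds each flow signal to the nearest integer. This is a sketch of the simplest possible approach, not our base-calling method. It assumes signals are normalized so that one incorporated base gives a value near 1.0, and it assumes a repeating flow order of T, A, C, G.

```python
def call_bases(flow_values, flow_order="TACG"):
    """Convert normalized flow signals to a base sequence by rounding.

    flow_values: one non-negative signal per nucleotide flow.
    flow_order: assumed repeating order in which nucleotides are flowed.
    """
    sequence = []
    for i, signal in enumerate(flow_values):
        base = flow_order[i % len(flow_order)]
        run_length = int(signal + 0.5)  # nearest integer for signal >= 0
        sequence.append(base * run_length)
    return "".join(sequence)
```

A signal of 1.1 is unambiguous, but for a long mononucleotide run a signal of 10.6 is called as 11 bases while 10.4 is called as 10: the identity of the base is certain, the length of the run is not.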
SNP genotyping with pyrosequencers
we are in the process of identifying discriminating
pyrogram features to use in our machine-learning
methods to recognize polymorphic positions within traces
Nordfors et al. Human Mutation 19:395-401 (2002)
Somatic mutation detection
Michael Stromberg
Somatic mutations
the detection of somatic mutations, and their
distinction from inherited polymorphism, is
important to separate pre-disposing variants
from mutations that occur during disease
progression e.g. in cancer
© Brian Stavely, Memorial University of Newfoundland
1. detect the mutations
2. classify whether somatic or inherited
Detecting somatic mutations with comparative data
• based on comparison of cancer and normal tissue from
the same individual
• often cancer tissue is highly heterogeneous and the
somatic mutant allele may be present at low allele
frequency
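For the comparative design, one generic way to ask whether a variant allele is enriched in cancer tissue is a one-sided Fisher exact test on allele observations from the two tissues. This is a standard test used here for illustration, not necessarily the method we apply; the count-based inputs and the function name are assumptions.

```python
from math import comb

def fisher_one_sided(tumor_alt, tumor_ref, normal_alt, normal_ref):
    """P(at least tumor_alt variant observations fall in the tumor sample)
    if variant and reference observations were spread at random between
    tumor and normal tissue (hypergeometric tail)."""
    n_tumor = tumor_alt + tumor_ref
    n_alt = tumor_alt + normal_alt
    n_total = n_tumor + normal_alt + normal_ref
    upper = min(n_tumor, n_alt)
    tail = sum(
        comb(n_alt, k) * comb(n_total - n_alt, n_tumor - k)
        for k in range(tumor_alt, upper + 1)
    )
    return tail / comb(n_total, n_tumor)
```

A mutant allele seen in 8 of 100 tumor observations and 0 of 100 normal observations gives a small p-value even though its frequency in the heterogeneous tumor sample is low.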
Detecting somatic mutations with subtraction
• if normal tissue samples are not available,
we detect SNPs in cancer tissue against
e.g. the human genome reference sequence
• search for evidence that these mutations are
genetic
• subtract apparent mutations that are present in sequence variation
databases
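The subtraction step itself is simple to sketch: remove every apparent mutation that is already listed as a known inherited variant. The tuple layout (chromosome, position, reference allele, variant allele) and the function name are hypothetical; what remains is only a candidate list, since an inherited variant missing from the databases would also survive.

```python
def subtract_known(candidates, known_variants):
    """Keep candidate variants that are absent from a variation database.

    candidates, known_variants: iterables of hashable variant keys, e.g.
    (chromosome, position, reference_allele, variant_allele).
    """
    known = set(known_variants)
    return [v for v in candidates if v not in known]
```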
Detecting somatic mutations with subtraction
• we have applied our methods for somatic mutation
detection in murine mitochondrial sequences
heteroplasmy
homoplasmy
• we will be applying our methods for
human nuclear DNA from our
collaborators
Using new haplotype resources to connect genotype and clinical
outcome in pharmaco-genetic systems
• the HapMap was designed as a tool to detect high-frequency (common)
phenotypic (e.g. disease-causing) alleles
• important drug metabolizing enzymes are relatively few in number, well
studied, are at known genome locations, many associated phenotypes are
well described
• many functional alleles are known, and of high frequency (common)
• multi-SNP alleles are highly predictive of metabolic phenotype
• clinical phenotype (adverse drug reaction) less predictable
• ideal candidate for applying haplotype resources
Multi-marker haplotypes as accurate markers for ADRs?
• genetic marker (haplotype) in genome regions of drug metabolizing enzyme (DME) genes
• computational prediction based on haplotype structure
• functional allele (known metabolic polymorphism)
• clinical endpoint (adverse drug reaction)
• molecular phenotype (drug concentration measured in blood plasma)
Resources
• functional alleles
• LD and haplotype structure in the
HapMap reference samples, based on
a high-density SNP map
• specifics of enzyme-drug interactions
• existing DME P genotyping chips
Evolutionary questions
• mutations single-origin or recurrent?
• geographic origin of mutations?
• mutation age?
• analysis based on complete local
variation structure and haplotype
background of functional mutations
• specifics of the selection process that led
to specific functional alleles?
Proposed steps of analysis
• complete polymorphic structure?
• ethnicity?
• haplotype block?
• additional functional SNPs?
• haplotypes vs. functional alleles?
• haplotypes vs. metabolic phenotype?
• haplotypes vs. ADR phenotype?
[Figure labels: haplotype; functional allele (genotype); metabolic phenotype; clinical phenotype (ADR)]