Estimating and Reconstructing
Recombination in Populations:
Problems in Population
Genomics
Dan Gusfield
UC Davis
Different parts of this work are joint with Satish Eddhu, Charles
Langley, Dean Hickerson, Yun Song, Yufeng Wu, Z. Ding
University of Puerto Rico, Mayaguez,
Feb. 22, 2007
What is population genomics?
• The Human genome “sequence” is done.
• Now we want to sequence many individuals
in a population to correlate similarities and
differences in their sequences with genetic
traits (e.g. disease or disease susceptibility).
• Presently, we can’t sequence large numbers
of individuals, but we can sample the
sequences at SNP sites.
SNP Data
• A SNP is a Single Nucleotide Polymorphism - a site in the
genome where two different nucleotides appear with
sufficient frequency in the population (say each with 5%
frequency or more). Hence binary data.
• SNP maps are being compiled with a density of about 1
site per 1000 nucleotides.
• SNP data is what is mostly collected in populations - it is
much cheaper to collect than full sequence data, and
focuses on variation in the population, which is what is of
interest.
Haplotype Map Project:
HAPMAP
• NIH-led project ($100M) to find common SNP
haplotypes (“SNP sequences”) in the Human population.
• Association mapping: HAPMAP used to try to associate
genetic-influenced diseases with specific SNP haplotypes,
to either find causal haplotypes, or to find the region near
causal mutations.
• The key to the logic of Association mapping is historical
recombination in populations. Nature has done the
experiments, now we try to make sense of the results.
Our work: Reconstructing the
Evolution of SNP Sequences
• I: Clean mathematical and algorithmic results:
Galled-Trees, near-uniqueness, graph-theory lower
bound, and the Decomposition theorem
• II: Practical computation of Lower and Upper
bounds on the number of recombinations needed.
Construction of (optimal) phylogenetic networks;
uniform sampling; haplotyping with ARGs
• III: Extension to Gene Conversion
• IV: Applications
Perfect Phylogeny: Where it all
starts
The Evolution of SNP Sequences
by Point Mutations
[Figure: a tree over sites 1-5, rooted at the ancestral sequence 00000, with site mutations labeling the edges and the extant sequences at the leaves. The tree derives the set M: 10100, 10000, 01011, 01010, 00010.]
Sequence Recombination
[Figure: single crossover recombination. P = 10100 and S = 01011 recombine at point 5 to give 10101.]
A recombination of P and S at recombination point 5.
The first 4 sites come from P (Prefix) and the sites
from 5 onward come from S (Suffix).
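The single-crossover rule above can be sketched in a few lines. This is a minimal illustration, assuming sequences are stored as strings and recombination points are 1-based as on the slide; the function name `recombine` is hypothetical.

```python
def recombine(prefix_parent: str, suffix_parent: str, point: int) -> str:
    """Single-crossover recombination with a 1-based recombination point.

    Sites 1..point-1 are copied from prefix_parent (P);
    sites point..m are copied from suffix_parent (S).
    """
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have the same number of sites")
    if not 1 <= point <= len(prefix_parent):
        raise ValueError("recombination point out of range")
    return prefix_parent[:point - 1] + suffix_parent[point - 1:]


# The slide's example: P = 10100, S = 01011, recombination point 5.
print(recombine("10100", "01011", 5))  # 10101
```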
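A Phylogenetic Network or ARG
[Figure: an ARG rooted at 00000, with mutations at sites 1-5 on its edges and two recombination nodes, each with a prefix parent P and a suffix parent S. Leaf sequences: a:00010, b:10010, c:00100, d:10100, e:01100, f:01101, g:00101. Recombining 10010 (P) and 00100 (S) at point 3 gives 10100 (d); recombining 01100 (P) and 00101 (S) at point 4 gives 01101 (f).]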
A Min ARG for Kreitman’s data
ARG created by
SHRUB
An illustration of why we are
interested in recombination:
Association Mapping of
Genetic Diseases Using
ARGs
Association Mapping
• A major strategy being practiced to find genes
influencing disease from haplotypes of a subset of
SNPs.
– Disease mutations: unobserved.
• A simple example to explain association mapping
and why ARGs are useful, assuming the true ARG is
known.
[Figure: a disease mutation site, which is not observed, lying among the observed SNP sites.]
Very Simplistic Mapping of the Unobserved
Mutation of Mendelian Diseases with ARGs
Assumption (for now):
A sequence is diseased
iff it carries the single
disease mutation.
[Figure: the ARG for sequences a-g (a:00010, b:10010, c:00100, d:10100, e:01100, f:01101, g:00101), with the diseased sequences marked. Callouts: “Where is the disease mutation?”; “What part of 01100 do d, e, f inherit?”; “The single disease mutation occurs near sites 1 or 2!”]
Mapping Disease Gene with
Inferred ARGs
• “..the best information that we could possibly
get about association is to know the full
coalescent genealogy…” – Zollner and
Pritchard, 2005
• But we do not know the true ARG!
• Goal: infer ARGs from SNP data for
association mapping
– Not easy, and often done by approximation (e.g. Zollner and
Pritchard)
– Improved results due to Y. Wu (RECOMB 2007)
Using the Network in
Association Mapping
• Given case-control data M, uniformly sample the
minimum ARGs (in practice for small windows of
fixed number of SNPs)
• Build the “marginal” tree for each interval between
adjacent recombination points in the ARG
• Look for non-random clustering of cases in the tree;
accumulate statistics over the trees to find the best
mutation explaining the partition into cases and
controls.
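The clustering step can be sketched as follows. This is only an illustration under stated assumptions: the statistic is taken to be a Pearson chi-square on a 2x2 table (suggested by the "Average Chi-square value" axis in the results plot), a marginal tree is represented simply as a list of clades (the set of leaves below each edge), and the clades in the example are hypothetical rather than read off any tree in these slides.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no signal
    return n * (a * d - b * c) ** 2 / denom


def best_clade(clades, cases, n_leaves):
    """Return (score, clade) for the clade that best separates cases from controls.

    A mutation placed on the edge above a clade is carried by exactly
    the leaves in that clade, so each clade is a candidate mutation.
    """
    best = (0.0, None)
    for clade in clades:
        a = len(clade & cases)           # cases carrying the mutation
        b = len(clade) - a               # controls carrying it
        c = len(cases) - a               # cases not carrying it
        d = n_leaves - len(clade) - c    # controls not carrying it
        score = chi_square_2x2(a, b, c, d)
        if score > best[0]:
            best = (score, clade)
    return best


# Hypothetical clades for six leaves; leaves 0-2 are cases, 3-5 are controls.
cases = {0, 1, 2}
clades = [{0, 1}, {0, 1, 2}, {3, 4}, {0, 1, 2, 5}]
print(best_clade(clades, cases, 6))  # (6.0, {0, 1, 2})
```

In a full pipeline the scores would be accumulated over the marginal trees of many sampled ARGs; that accumulation is omitted here.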
One Min ARG for the data
Input Data
00101
10001
10011
11111
10000
00110
Seqs 0-2: cases
Seqs 3-5: controls
[Figure: one sampled Min ARG for the input data.]
The marginal tree for the
interval past both breakpoints
[Figure: the marginal tree for the same input data (Seqs 0-2: cases, Seqs 3-5: controls), with the cases marked.]
Experimental results on Cystic Fibrosis data.
Disease mutation is at 885kb. Our estimate is at
844kb.
[Figure: average chi-square value (vertical axis, 0 to 80) against marker index (horizontal axis, 0 to 25).]
Back to Clean Algorithmic and
Mathematical Results
The Perfect Phylogeny Model for
the History of SNP sequences
Only one mutation per site
allowed.
[Figure: a tree over sites 1-5, rooted at the ancestral sequence 00000, with site mutations labeling the edges and the extant sequences at the leaves. The tree derives the set M: 10100, 10000, 01011, 01010, 00010.]
When can a set of sequences be
derived on a perfect phylogeny?
Classic necessary and sufficient condition (NASC): Arrange the sequences in a
matrix. Then (with no duplicate columns),
the sequences can be generated on a unique
perfect phylogeny if and only if no two
columns (sites) contain all four pairs:
0,0 and 0,1 and 1,0 and 1,1.
This is the 4-Gamete Test.
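The 4-Gamete Test is straightforward to check directly. The sketch below assumes the sequences are equal-length strings of 0s and 1s; the function names are hypothetical.

```python
from itertools import combinations


def conflicting_pairs(rows):
    """Return the 1-based site pairs (i, j) whose two columns contain
    all four gametes 0,0 / 0,1 / 1,0 / 1,1."""
    n_sites = len(rows[0])
    conflicts = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(row[i], row[j]) for row in rows}
        if len(gametes) == 4:
            conflicts.append((i + 1, j + 1))
    return conflicts


def passes_four_gamete_test(rows):
    """True when no pair of sites conflicts."""
    return not conflicting_pairs(rows)


M = ["10100", "10000", "01011", "01010", "00010"]
print(passes_four_gamete_test(M))        # True
print(conflicting_pairs(M + ["10101"]))  # [(1, 5), (2, 5), (3, 5), (4, 5)]
```

The set M from the perfect phylogeny slide passes the test. Adding the sequence 10101 makes site 5 conflict with every other site, including the pair 4, 5.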
A richer model
M
10100
10000
01011
01010
00010
10101 added
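Pair 4, 5 fails the four-gamete
test. The sites 4, 5
“conflict”.
[Figure: the perfect phylogeny for the first five sequences, rooted at 00000 with site mutations 1-5 on its edges; the added sequence 10101 cannot be derived on it.]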
Real sequence histories often involve recombination.
Network with Recombination
M: 10100, 10000, 01011, 01010, 00010, 10101 (new)
The previous tree with one
recombination event now derives
all the sequences.
[Figure: the previous tree plus one recombination node: P = 10100 and S = 01011 recombine at point 5 to give 10101.]
Minimizing recombinations in
Phylogenetic networks
Problem: given a set of sequences M, find a
phylogenetic network generating M, minimizing
the number of recombinations used to generate M.
The minimization objective is a rough, but useful,
reflection of the true number of “observable”
recombinations that have occurred in the
derivation of M.
Minimization is an NP-hard
Problem
There is no known efficient solution to this problem and there likely
will never be one.
What we can do:
• Solve special cases optimally with efficient
algorithms (galled-trees).
• Solve small data-sets optimally with
algorithms that are not provably efficient but work well in
practice.
• Efficiently compute lower and upper bounds on the number of
needed recombinations (HapBound, SHRUB).
Galled-Trees: an efficient special
case
Definition: Recombination Cycle
• In a Phylogenetic Network, with a
recombination node x, if we trace two paths
backwards from x, then the paths will
eventually meet.
• The cycle specified by those two paths is
called a “recombination cycle”.
Galled-Trees
• A phylogenetic network where no
recombination cycles share an edge is called
a galled tree.
• A cycle in a galled-tree is called a gall.
• Problem: If Haplotype matrix M cannot be
generated on a true tree, can it be generated
on a galled-tree?
Incompatibility Graph
[Figure: a phylogenetic network deriving the sequences a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101, shown with the incompatibility graph on sites 1-5.]
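A minimal sketch of building an incompatibility graph, assuming it has one node per site and an edge between every pair of sites that fails the four-gamete test. The function names are hypothetical.

```python
from itertools import combinations


def incompatibility_edges(rows):
    """Edges (i, j), 1-based, between sites that contain all four gametes."""
    n_sites = len(rows[0])
    edges = set()
    for i, j in combinations(range(n_sites), 2):
        if len({(row[i], row[j]) for row in rows}) == 4:
            edges.add((i + 1, j + 1))
    return edges


def connected_components(n_sites, edges):
    """Connected components of the graph on sites 1..n_sites."""
    adjacent = {site: set() for site in range(1, n_sites + 1)}
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    seen, components = set(), []
    for site in adjacent:
        if site in seen:
            continue
        stack, component = [site], set()
        while stack:  # depth-first search from this site
            x = stack.pop()
            if x not in component:
                component.add(x)
                stack.extend(adjacent[x] - component)
        seen |= component
        components.append(sorted(component))
    return components


seqs = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]  # a..g
edges = incompatibility_edges(seqs)
print(sorted(edges))                   # [(1, 3), (1, 4), (2, 5)]
print(connected_components(5, edges))  # [[1, 3, 4], [2, 5]]
```

For the sequences a-g the graph has two non-trivial components, {1, 3, 4} and {2, 5}.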
Results about galled-trees
• Theorem: There is an efficient (provably polynomial-time) algorithm to determine
whether or not a haplotype set M can be derived on a galled-tree.
• Theorem: A galled-tree (if one exists) produced by the algorithm
minimizes the number of recombinations used over all possible
phylogenetic-networks.
• Theorem: If M can be derived on a galled tree, then the Galled-Tree is
“nearly unique”. This is important for biological conclusions derived
from the galled-tree.
Gusfield et al. papers from 2003-2005.
Elaboration on Near Uniqueness
Theorem: The number of arrangements (permutations) of the
sites on any gall is
at most three, and this happens only if the gall has two
sites.
If the gall has more than two sites, then the number of
arrangements is at most two.
If the gall has four or more sites, with at least two sites
on each side of the recombination point (not the side of
the gall) then the arrangement is forced and unique.
Theorem: All other features of the galled-trees for M are invariant.
Efficient Bounding Algorithms
We cannot efficiently compute the exact minimum
number of needed recombinations, in general, but
we can efficiently compute close lower and upper
bounds on the minimum number.
The bounds and the computations to obtain them
have many practical applications.
The general composite lower
bound method (S. Myers 2002)
Given a set of intervals on the line, and for each interval I, a
number N(I), which is a (local) lower bound on the number of
recombinations needed in interval I, define Vmin as the
minimum number of vertical lines needed so that every
interval I intersects at least N(I) of the vertical lines.
Vmin is a valid lower bound on the total number of
recombinations needed in the whole data. Vmin is called a
composite bound.
Vmin is easy to compute by a left-to-right myopic algorithm.
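One way to realize a left-to-right computation of Vmin is the greedy sketch below. It is an illustration under assumptions not stated on the slide: an interval is a triple (leftmost site, rightmost site, N(I)) with left < right, a vertical line sits in a gap g between sites g and g+1, and several lines may share a gap (several recombinations between the same two adjacent sites). The function name is hypothetical.

```python
from bisect import bisect_left


def composite_bound(intervals):
    """Greedy composite bound.

    intervals: iterable of (left_site, right_site, local_bound).
    Interval [left, right] contains the gaps left .. right-1.
    Returns (number_of_lines, sorted_gap_positions).
    """
    lines = []  # gap positions; stays sorted because right endpoints only grow
    for left, right, need in sorted(intervals, key=lambda t: t[1]):
        # Lines already placed lie at gaps <= right-1, so the ones inside
        # this interval are exactly those at gap >= left.
        have = len(lines) - bisect_left(lines, left)
        deficit = need - have
        if deficit > 0:
            # Put the missing lines as far right as possible, so they can
            # also serve intervals that end later.
            lines.extend([right - 1] * deficit)
    return len(lines), lines


print(composite_bound([(1, 3, 1), (2, 5, 2), (4, 6, 1)]))  # (2, [2, 4])
```

Intervals are processed by increasing right endpoint, and lines are added only when an interval is still short of its local bound.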
The Composite Method (Myers & Griffiths 2003)
1. Given a set of intervals, and
2. for each interval I, a number N(I)
Composite Problem: Find the minimum number of vertical
lines so that every I intersects at least N(I) vertical lines.
[Figure: a matrix M with intervals drawn over its sites, each interval labeled with its local bound N(I).]
Haplotype (local) Lower Bound
(S. Myers)
• Rh = (Number of distinct sequences (rows)) - (Number of
distinct sites (columns)) - 1 <= minimum number of
recombinations needed (folklore)
• Generally Rh is a really bad bound, often negative, when
used on large intervals, but very good when used as local
bounds on small intervals with the Composite Method and
other methods.
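The haplotype bound as stated above is a one-line computation. A minimal sketch, assuming rows are equal-length 0/1 strings; the function name is hypothetical.

```python
def haplotype_bound(rows):
    """Rh = (#distinct rows) - (#distinct columns) - 1."""
    distinct_rows = set(rows)
    distinct_columns = set(zip(*rows))  # each column becomes a tuple of characters
    return len(distinct_rows) - len(distinct_columns) - 1


print(haplotype_bound(["00", "01", "10", "11"]))                        # 1
print(haplotype_bound(["10100", "10000", "01011", "01010", "00010"]))  # -1
```

The second call uses the set M from the perfect phylogeny slide and shows the bound going negative, as the slide warns.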
Composite Subset Method
(Myers)
• Let S be a subset of sites, and Rh(S) be the
haplotype bound computed on the input sequences
restricted to the sites in S. If the leftmost site in S
is L and the rightmost site in S is R, then use
Rh(S) as a local bound N(I) for interval I = [L,R].
• Compute Rh(S) on many subsets, and then solve
the composite problem to find a valid composite
bound.
RecMin (S. Myers)
• Computes local bounds using subsets of sites, but
limits the size and the span of the subsets. Default
parameters are s = 6, w = 15 (s = size, w = span).
• Generally, impractical to set s and w large, so
generally one doesn’t know if increasing the
parameters would increase the composite bound.
• Still, RecMin often gives a bound more than three
times the HK (Hudson-Kaplan) bound. Example LPL data: HK
gives 22, RecMin gives 75.
Optimal RecMin Bound (ORB)
• The Optimal RecMin Bound is the lower bound
that RecMin would produce if both parameters
were set to their maximum possible values.
• In general, RecMin cannot compute the ORB in
practical time.
• We have developed a practical program,
HAPBOUND, based on integer linear
programming, that is guaranteed to compute the
ORB, and have incorporated ideas that lead to
even higher lower bounds than the ORB.
HapBound: The general approach
For an interval of sites I, let H(I) be the largest
haplotype lower bound obtained from any subset
of sites in I.
We have shown that we can efficiently compute
H(I) by using integer linear programming.
We set N(I) = H(I) in the composite method, and
the resulting composite bound is the ORB.
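To make the definition of H(I) concrete, the sketch below computes it by brute force over all subsets of sites in an interval. This is not how HapBound works (the slides say it uses integer linear programming); exhaustive enumeration is exponential and only meant for tiny examples. The function names are hypothetical.

```python
from itertools import combinations


def haplotype_bound(rows, sites):
    """Haplotype bound on the rows restricted to the given 0-based sites."""
    restricted_rows = {tuple(row[s] for s in sites) for row in rows}
    columns = {tuple(row[s] for row in rows) for s in sites}
    return len(restricted_rows) - len(columns) - 1


def best_subset_bound(rows, left, right):
    """H(I) for the interval I = [left, right] (1-based, inclusive),
    found by trying every non-empty subset of its sites."""
    sites_in_interval = range(left - 1, right)
    best = 0  # zero recombinations is always a valid lower bound
    for size in range(1, len(sites_in_interval) + 1):
        for subset in combinations(sites_in_interval, size):
            best = max(best, haplotype_bound(rows, subset))
    return best


rows = ["001", "010", "100", "110"]
print(haplotype_bound(rows, (0, 1, 2)))  # 0
print(best_subset_bound(rows, 1, 3))     # 1
```

Here all three sites together give a bound of 0, while the subset of sites 1 and 2 gives 1, which is why searching over subsets can pay off.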
HapBound vs. RecMin on LPL
from Clark et al.
Program               Lower Bound   Time
RecMin (default)      59            3s
RecMin -s 25 -w 25    75            7944s
RecMin -s 48 -w 48    No result     5 days
HapBound ORB          75            31s
HapBound -S           78            1643s
(Run on a 2 GHz PC.)
Example where RecMin has
difficulty in Finding the ORB on a
25 by 376 Data Matrix
Program               Bound   Time
RecMin default        36      1s
RecMin -s 30 -w 30    42      3m 25s
RecMin -s 35 -w 35    43      24m 2s
RecMin -s 40 -w 40    43      2h 9m 4s
RecMin -s 45 -w 45    43      10h 20m 59s
HapBound              44      2m 59s
HapBound -S           48      39m 30s
Constructing Optimal
Phylogenetic Networks
Optimal = minimum number of
recombinations. Called Min ARG.
Kreitman’s 1983 ADH Data
• 11 sequences, 43 segregating sites
• Both HapBound and SHRUB took only a fraction of a
second to analyze this data.
• Both produced 7 for the number of detected
recombination events.
Therefore, independently of all other methods, our lower
and upper bound methods together imply that 7 is the
minimum number of recombination events.
A Min ARG for Kreitman’s data
ARG created by
SHRUB
The Human LPL Data (Nickerson et al. 1998)
(88 Sequences, 88 sites)
[Figures: two panels, one labeled “Our new lower and upper bounds” and one labeled “Optimal RecMin Bounds”.]
(We ignored insertion/deletion, unphased sites, and sites with missing data.)
Haplotyping (Phasing)
genotypic data using a Min
ARG
Genotypes and Haplotypes
Each individual has two “copies” of each
chromosome.
At each site, each chromosome has one of two
alleles (states) denoted by 0 and 1 (motivated by
SNPs)
0 1 1 1 0 0 1 1 0
1 1 0 1 0 0 1 0 0
Two haplotypes per individual
Merge the haplotypes
2 1 2 1 0 0 1 2 0
Genotype for the individual
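The merge shown above can be written out directly. A minimal sketch, assuming haplotypes are lists of 0/1 values and a heterozygous site is coded 2; the function name is hypothetical.

```python
def merge_haplotypes(h1, h2):
    """Genotype of an individual: the shared allele where the two
    haplotypes agree, and 2 where they differ."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(merge_haplotypes(h1, h2))  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```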
Haplotyping Problem
• Biological Problem: For disease association studies,
haplotype data is more valuable than genotype data, but
haplotype data is hard to collect. Genotype data is easy to
collect.
• Haplotyping (Phasing) Problem: Given a set of n
genotypes, determine the original set of n haplotype pairs
that generated the n genotypes. This is hopeless without a
genetic model for the evolution of haplotype sequences.
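To see why a genetic model is needed, it helps to enumerate every haplotype pair that explains one genotype. The sketch below assumes the 0/1/2 genotype coding from the previous slide; the function name is hypothetical. A genotype with k > 0 heterozygous sites has 2^(k-1) explaining pairs, so the data alone cannot single one out.

```python
from itertools import product


def explaining_pairs(genotype):
    """All unordered haplotype pairs whose merge equals the genotype."""
    het_sites = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het_sites, bits):
            h1[site], h2[site] = bit, 1 - bit  # complementary alleles
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs


genotype = [2, 1, 2, 1, 0, 0, 1, 2, 0]
pairs = explaining_pairs(genotype)
print(len(pairs))  # 4
```

The pair shown on the genotype slide is one of these four.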
Haplotyping by Minimizing
Recombinations
We want to haplotype genotypic data by
finding those pairs of haplotypes (that
explain the genotypes) and minimize
the number of recombinations needed
to derive the haplotypes. Minimizing
recombination encodes the biology.
We have a branch and bound algorithm
that finds the haplotypes minimizing the
number of recombinations, building a
Min ARG for deduced haplotypes. It
works for genotype data with a small
number of sites, but a larger number of
genotypes.
Application: Detecting
Recombination Hotspots with
Genotype Data
• Bafna and Bansal (2005) use recombination lower
bounds to detect recombination hotspots with
haplotype data.
• We apply our program on the genotype data
– Compute the minimum number of recombinations for all
small windows with a fixed number of SNPs
– Plot a graph showing the minimum level of recombinations
normalized by physical distance
– Initial results show this approach can give good estimates
of the locations of the recombination hotspots
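The windowing and normalization steps can be sketched as below. This is an illustration only: `min_recomb` is a pluggable, hypothetical callback standing in for the program that computes the minimum number of recombinations in a window, and the example uses a simple haplotype-style bound on 0/1 rows rather than the genotype-based computation described above. The site positions are made up.

```python
def windowed_rates(rows, positions, window, min_recomb):
    """For every window of `window` consecutive SNPs, return
    (start position, recombination count / physical span).

    positions: strictly increasing physical coordinates, one per SNP.
    min_recomb: callable taking the rows restricted to one window.
    """
    rates = []
    for start in range(len(positions) - window + 1):
        sub_rows = [row[start:start + window] for row in rows]
        span = positions[start + window - 1] - positions[start]
        rates.append((positions[start], min_recomb(sub_rows) / span))
    return rates


def stand_in_bound(sub_rows):
    """Stand-in for the real program: haplotype bound, floored at zero."""
    return max(0, len(set(sub_rows)) - len(set(zip(*sub_rows))) - 1)


rows = ["0001", "0100", "1000", "1101"]
positions = [100, 200, 300, 1300]  # hypothetical coordinates
print(windowed_rates(rows, positions, 2, stand_in_bound))
# [(100, 0.01), (200, 0.0), (300, 0.0)]
```

Windows whose normalized value stands out from their neighbors are the candidate hotspots.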
Recombination Hotspots on
Jeffreys et al. (2001) Data
[Figure: Jeffreys et al. (2001) data, sliding window size = 5. Two results are shown: the result from Bafna and Bansal (2005) on haplotype data, and our result on genotype data. Horizontal axis 0 to 250; vertical axis -1 to 8.]
Application: Missing Data Imputation by
Constructing near-optimal ARGs
For  = 5.
Datasets with about 1,000 entries
#Seq   #Sites   %missing   Accuracy
20     50       5%         96%
20     50       10%        95%
20     50       30%        93%
32     32       5%         97%
32     32       10%        96%
32     32       30%        94%
50     20       5%         97%
50     20       10%        96%
50     20       30%        94%
Datasets with about 10,000 entries
#Seq   #Sites   %missing   Accuracy
20     100      5%         95%
20     100      10%        95%
20     100      30%        93%
45     45       5%         98%
45     45       10%        97%
45     45       30%        96%
100    20       5%         97%
100    20       10%        96%
100    20       30%        95%
Haplotyping genotype data via
a minimum ARG
• Compare to the program PHASE, in order to try to
understand why PHASE is so accurate.
• Experience shows PHASE may give solutions whose
recombination is close to the minimum
– Example: In all solutions of PHASE for three sets of
case/control data from Steven Orzack, recombinations are
minimized.
– Simulation results: PHASE’s solution minimizes
recombination in 57 of 100 datasets (20 rows and 5 sites).
I would like to thank the organizers of
the Information Technology and
Life Sciences Symposium for inviting me,
and thank you for your attention.
Papers and
Software on wwwcsif.cs.ucdavis.edu/~gusfield