Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Algorithms for
Universal DNA Tag Array
Design and Optimization
Ion Mandoiu
Computer Science & Engineering Department
University of Connecticut
High-Throughput Assay Design
• Fast growing number of applications, with more stringent
constraints on sensitivity, reproducibility, cost, etc.
• Source of challenging combinatorial problems
–
–
–
–
–
–
Multiplex PCR primer set selection
Probe selection
Fidelity probes
Mask design
…
This talk: design and optimization of universal tag arrays
Overview
Background on DNA Microarrays
Universal Tag Arrays
Tag Set Design Problem
Tag Assignment Problem
Conclusions
Watson-Crick Complementarity
• Four nucleotide types: A,C,T,G
• A’s paired with T’s (2 hydrogen bonds)
• C’s paired with G’s (3 hydrogen bonds)
DNA Microarrays
Labeled DNA/RNA mixture flushed
over array of probes
Laser activation of fluorescent labels
Optical scanning used to
identify probes with
complements in the mixture
Images courtesy of Affymetrix.
Applications
• Gene expression (transcription analysis)
• Genomic-based microorganism identification
• Single Nucleotide Polymorphism (SNP)
genotyping
Gene Expression
• Cells express different subsets of genes
under different environments
Translation
Transcription
mRNA
Gene
Protein
Two-Color Technique
• Sample labeled RED
• Control labeled GREEN
• YELLOW probes hybridize to both
sample and control
•BLACK probes hybridize to neither
cell type 1
RNA 1
target 1
Cy3 Cy3Cy3
target 2
Cy5Cy5Cy5
RNA 2
cell type 2
Microarray Technologies
• Arrays of cDNAs
– Obtained by reverse transcription from
Expressed Sequence Tags (ESTs)
• Oligonucleotide arrays
– Short (20-60bp) synthetic DNA strands
Robotic cDNA Arrayers
Pin Technology
Quill Pen Technology
Ink jet Technology
Pin Ring Technology
VLSIPS Array Synthesis
Images courtesy of Affymetrix.
Oligonucleotide Arrays
• Pros
– Highly versatile
– Very high integration (VLSIPS ~106 probes/cm2)
• Cons
– Higher cost
– Application specific, non-reusable designs
Overview
Background on DNA Microarrays
Universal Tag Arrays
Tag Set Design Problem
Tag Assignment Problem
Conclusions
Universal Arrays
• Brenner 97, Morris et al. 98
• Key idea
– Array consisting of application independent tags
– Two-part “reporter” probes: aplication specific
primers ligated to antitags
– Detection carried by a sequence of reactions
separately involving the primer and the antitag part
of reporter probes
Universal Array Experiment
+
Universal Array Advantages
• Cost effectiveness
– Array does not change economies of scale
• Customization
– Fast, only need to synthesize new set of reporter
probes
• Reliability
– Solution phase hybridization better understood than
hybridization on solid support
Overview
Background on DNA Microarrays
Universal Tag Arrays
Tag Set Design Problem
Tag Assignment Problem
Conclusions
Tag Set Requirements
• Hybridization constraints
(H1) Antitags hybridize strongly to complementary tags
(H2) No antitag hybridezes to a non-complementary tag
(H3) Antitags do not cross-hybridize to each other
• Two conflicting goals
– Maximize number of tags
– Ensure hybridization constraints
Hybridization Model
• Melting temperature Tm: temperature at
which 50% of duplexes are in hybridized state
• 2-4 rule
Tm = 2 #(As and Ts) + 4 #(Cs and Gs)
• More accurate models exist, e.g., the nearneighbor model, however, 2-4 rule reasonably
accurate for short oligos
Tag Set Design Problem
• [Ben-Dor et al. 00] Conservative formalization of
(H1)+(H2) based on nucleation complex theory
and 2-4 rule
(C1) Every tag must have total weight h
(C2) No 2 tags share a common substring of weight c
Where
– w(A)=w(T)=1, w(C)=w(G)=2
– c, h are given constant (Affymetrix uses l+h in C1)
c-h codes
• c-token: left-minimal DNA string of weight c, i.e.,
– w(x) c
– w(x’) < c for every proper suffix of x
• [Ben-Dor et al. 00] A set of tags is called c-h code if
(C1) Every tag has weight h
(C2*) Every c-token is used at most once
c-h code problem: given c and h, find largest c-h code
c-h code problem
• [Ben-Dor et al. 00] give a constructive upperbound on largest c-h code size and a nearoptimal algorithm based on DeBruijn sequences
• [MT05] c-h code problem can be formulated as a
maximum integer flow problem with capacity
constraints on (disjoint) sets of vertices
– Solvable in practical time for small values of c
Token Content of a Tag
c=4
CCAGATT
CC
CCA
CAG
AGA
GAT
GATT
Layered c-token graph, length-l tags
c/2
(c/2)+1 …
l-1
l
c1
s
t
cN
token s-t path in layer graph
c=4, l=7
tag = CCAGATT
Layer 2
3
4
5
6
7
sCC CCACAGAGAGATGATTt
Integer Program Formulation
O(hN) constraints and variables, where N = #c-tokens
Number of c-tokens
Let Gn = #strings of weight n
G1 = 2; G2 = 6; Gn = 2Gn-2 + 2Gn-1
W=A or T, S=C or G
Token type
Num tokens
<c-2>S
2 Gc-2
S<c-3>S
4 Gc-3
<c-1>W
2 Gc-1
S<c-2>W
4 Gc-2
ILP Results
Periodic Tags
• If multiple c-token copies are not allowed in a tag
(uniqueness property), every tag uses roughly one ctoken per letter (except for first up to c-1 letters)
• A tag t is periodic if it is the prefix of () for
some string
– t periodic with period t uses at most || c-tokens
• Lemma: there is always an optimum solution
consisting exclusively of periodic tags and tags
with uniqueness property
Portion of successor c-token graph, c=4
CC
AAG
AAC
AAAA
AAAT
Tag Set Design Algorithm
1. Construct “successor” graph G over c-tokens
2. T{}
3. For all cycles C defining periodic tags, in increasing
order of cycle length, do
–
If C has no c-tokens in common with T, then
•
•
Add a tag defined by C to T
Remove C from G
4. Perform a pruned alphabetic tree search and add to T
tags with no c-tokens in common with T
5. Return T
Vertex-disjoint Cycle Packing Problem
• Given directed graph G, find maximum number of
vertex disjoint directed cycles in G
• [MT 05] APX-hard even for regular directed graphs
with in-degree and out-degree 2
– h-c/2+1 approximation factor for tag set design problem
• [Salavatipour and Verstraete 05]
– Quasi-NP-hard to approximate within (log1- n)
– O(n1/2) approximation algorithm
Experimental Results
Antitag-to-Antitag Hybridization
• Additional practical constraint: antitags do not
cross-hybridize (including self)
– Ignored by Ben-Dor et al
• Formalization in c-token hybridization model:
(C3) No two (anti)tags contain complementary
substrings of weight c
• Cycle packing and tree search extend easily
Results w/ Extended Constraints
Overview
Background on DNA Microarrays
Universal Tag Arrays
Tag Set Design Problem
Tag Assignment Problem
Conclusions
More Possible Mis-Hybridizations
Degrees of freedom: partition of primers on multiple arrays, tag
assignment within an array (may be forced to leave tags unassigned)
Here we focus on avoiding case (a), primer-to-tag hybridization
Assignable Primers
• Set P of primers is assignable to set T of tags if, for
every tags t,t’ and primers p,p’ with t hybridizing to
primer p, at most one of the assignments (p,t), (p,t’),
(p’,t) can be made
p’
t
p
t’
Characterization of Assignable Sets
• [Ben-Dor 04] Set P is assignable to T iff X+Y |P|,
where, in the conflict graph induced by P+T
– X = number of primers incident to a degree 1 tag
– Y = number of degree 0 tags
X=1
Y=2
MAPS Problem
• Maximum Assignable Primer Set (MAPS) Problem:
given primer set P and tag set T, find maximum size
assignable subset of P
• [Ben-Dor 04] Greedy deletion heuristic: repeatedly
delete primer of maximum weight from P until it
becomes assignable, where
– Potential of tag t is 2-|P(t)|
– Potential of primer p is sum of potentials of conflicting tags
Universal Array Multiplexing Problem
• Multiplexing Problem: given primer set P and tag set
T, find partition of P into minimum number of
assignable sets
• [Ben-Dor 04] Repeatedly find approximate MAP using
greedy deletion
Integration with Probe Selection
• In practice, several primer candidates with equivalent
functionality
– In SNP genotyping, can pick primer from either forward and
reverse strand
– In gene expression/identification applications, many primers
have desired length, Tm, etc.
Pooled Array Multiplexing Problem
•
[MPT 05] Given set of primer pools P and tag set T,
find a primer from each pool and a partition of
selected primers into minimum number of assignable
sets
X+Y Characterization no Longer Holds
Pooled Multiplexing Algorithms
1. Primer-Del = greedy deletion for pools similar to
[Ben-Dor et al 04]
2. Primer-Del+ = same but never delete last primer from
pool unless no other choice
3. Min-Pot = select primer with min potential from each
pool, then run Primer-Del
4. Min-Deg = select primer with min conflict degree,
then run Primer-Del
Results: GenFlex Tags, c=7
Results: GenFlex Tags, c=8
GenFlex vs. PerTags, c=8 (|T|=213)
Overview
Background on DNA Microarrays
Universal Tag Arrays
Tag Set Design Problem
Tag Assignment Problem
Conclusions
Conclusions
• New techniques for tag set design and tag assignment
lead to significantly improved multiplexing
– Enforcing antitag-to-antitag hybridization constraints make
universal arrays more reliable
• Other applications of universal tags:
– Lab-on-chip, DNA-driven assembly (e.g., carbon nanotubes), DNA computing [Brenneman&Condon 02]
• Other applications of assignment techniques:
– Genotyping by mass-spectroscopy [Aumann et al 05]
– Genotyping using l-mer arrays
Ongoing Work
•
Special type of universal arrays: l-mer arrays
– Initially introduced for sequencing by hybridization, but
proved impractical
– Currently investigated for use in resequencing by hybridization
– More promising application: SNP genotyping
– Isothermal prefix sets instead of l-mer arrays?
•
•
Deeper integration between tag assignment and probe
selection, e.g., in string barcoding
Extend algorithms to more accurate near-neighbor
hybridization model
– monotonic Tm c-tokens successor graph
Open Problems
•
Settle approximation complexity of (vertex) disjoint cycle
packing ([Salavatipour and Verstraete 05] show
approximation preserving reductions between edge and
vertex disjoint versions)
•
Establish better approximation bound for special instances
arising in tag set design
•
Improved approximation algorithms for maximum
assignable tag set and the tag assignment problems
Acknowledgments
• Claudia Prajescu and Dragos Trinca
• UCONN Research Foundation