Download Designing Compressive Sensing DNA Microarrays

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Replisome wikipedia , lookup

DNA profiling wikipedia , lookup

Zinc finger nuclease wikipedia , lookup

DNA nanotechnology wikipedia , lookup

Microsatellite wikipedia , lookup

United Kingdom National DNA Database wikipedia , lookup

Helitron (biology) wikipedia , lookup

Transcript
DESIGNING COMPRESSIVE SENSING DNA MICROARRAYS
Mona A. Sheikh,∗ Olgica Milenkovic,† Richard G. Baraniuk∗
Rice University
University of Illinois at Urbana-Champaign
∗
†
ABSTRACT
A Compressive Sensing Microarray (CSM) is a new device
for DNA-based identification of target organisms that leverages the nascent theory of Compressive Sensing (CS). In contrast to a conventional DNA microarray, in which each genetic
sensor spot is designed to respond to a single target organism,
in a CSM each sensor spot responds to a group of targets. As
a result, significantly fewer total sensor spots are required. In
this paper, we study how to design group identifier probes that
simultaneously account for both the constraints from the CS
theory and the biochemistry of probe-target DNA hybridization. We employ Belief Propagation as a CS recovery method
to estimate target concentrations from the microarray intensities.
Index Terms— Compressive Sensing, DNA Microarrays,
Probe Design, Hybridization Affinity
1. INTRODUCTION
Biosensing of pathogens is a timely research area of high consequence. An accurate and rapid biosensing paradigm has the
potential to impact the fields of healthcare, defense, and environmental monitoring, among others.
The DNA microarray is a frequently applied solution for
microbe sensing that has a significant edge over competitors
due to its ability to sense many organisms in parallel [1].
The principle behind all current DNA microarrays is basically
the same. A DNA microarray consists of genetic sensors or
spots, each containing DNA sequences termed probes. From
the perspective of a microarray, each DNA sequence can be
viewed as a sequence of four DNA bases {A, T , G, C} that
tend to bind with one another in complementary pairs: A with
T and G with C. Therefore, a DNA subsequence in a target
organism’s genetic sample will tend to bind or “hybridize”
with its complementary subsequence on a microarray to form
a stable structure. The target DNA sample to be identified is
fluorescently tagged before it is flushed over the microarray.
The extraneous DNA is washed away so that only the bound
DNA is left on the array. The array is then scanned using laser
light of a wavelength designed to trigger fluorescence in the
Supported by NSF Award CCF 0729216. Rice Electrical & Computer
Engineering, UIUC Electrical & Computer Engineering; Email:{msheikh,
richb}@rice.edu; [email protected]; Web: dsp.rice.edu/csdna
spots where binding has occurred. A specific pattern of array
spots will fluoresce, which is then used to infer the genetic
makeup in the test sample.
There are three issues with traditional DNA microarrays
that stem from the fact that each sensing spot is designed to
uniquely identify only one target of interest.
The first concern is that very often the targets in a test sample have similar base sequences, causing them to hybridize
with the wrong probe. These cross-hybridization events lead
to errors in the array readout. Current microarray design
methods do not address cross-matches between similar DNA
sequences.
The second concern is the restriction on the number of
organisms that can be identified. In a typical biosensing application, multiple organisms must be identified, which necessitates a large number of spots. As a consequence readout
systems for traditional DNA arrays are difficult to miniaturize.
The third concern is the inefficient utilization of the large
number of array spots in traditional microarrays. While the
number of potential agents in a sample can be very large,
only a few agents are expected to be present in a significant concentration at a given time and location or in a given
air/water/soil sample. Therefore, in a traditionally designed
microarray only a small fraction of the large number of spots
will be active at a given time, corresponding to the few targets
present.
To combat these issues, we have proposed a new microarray architecture using “combinatorial testing sensors” in order to reduce the number of sensor spots. We refer to this
new type of array as a Compressive Sensing DNA Microarray
(CSM), since it is based on the nascent theory of Compressive
Sensing [2–4]. Each spot in a CSM identifies a group of target organisms, and several spots together generate a unique
pattern identifier for a single target. Designing the probes that
perform this combinatorial sensing is the essence of the microarray design process, and what we aim to describe in this
paper.
Note that although we address microarrays for a biosensing application in this paper, the principles behind CSMs are
also applicable for the laboratory identification of diseasecausing genes in a single organism.
This paper is organized as follows. Section 2 describes the
2. PRINCIPLES OF CS MICROARRAYS
2.1. Compressive Sensing
Compressive Sensing (CS) is a recently developed sampling
theory for sparse signals [5]. The main result of CS, introduced by Candès, Romberg, and Tao [6] and Donoho [7], is
that a length-N signal x that is K-sparse in some basis can
be recovered exactly from just M = O(K log(N/K)) measurements of the signal. In this paper we choose the canonical
basis; hence x has K ! N nonzero and N − K zero entries.
In matrix notation, we measure y = Φx, where x is the
N ×1 sparse signal vector we aim to sense, y is an M ×1 measurement vector, and the measurement matrix Φ is an M × N
matrix. Since M < N , recovery of the signal x from the
measurements y is ill-posed in general; however the additional assumption of signal sparsity makes recovery possible.
In the presence of measurement noise, the model becomes
y = Φx + n where n denotes a vector of, for example, i.i.d.
additive white Gaussian noise samples with zero mean.
The two critical conditions to realize CS are that: (i) the
vector x to be sensed is sufficiently sparse and (ii) the rows
of Φ are sufficiently incoherent with the signal sparsity basis.
Incoherence is achieved if Φ satisfies the so-called Restricted
Isometry Property (RIP) [5]. For example, random matrices built from Gaussian and Bernoulli distributions satisfy the
RIP with high probability. Φ can also be sparse with only L
nonzero entries per row (L can vary from row to row) [8].
Various methods have been developed to recover a sparse
x from the measurements y, including linear programs and
greedy algorithms. When Φ itself is sparse, Belief Propagation and related graphical algorithms can also be applied for
fast signal reconstruction [8].
An important property of CS is its information scalability;
CS measurements can be used for a wide range of statistical
inference tasks besides signal reconstruction, including estimation, detection and classification.
2.2. Compressive Sensing Meets Microarrays
Let the N × 1 vector x represent the concentrations of N
possible organisms in a sample; we assume that only K ! N
of them are actually present. We aim to design a microarray
that implements a Φ that satisfies the RIP. Such a CSM will
be able to reconstruct an estimate of x using M ! N probe
spots.
To obtain a CS-type measurement scheme, we can choose
each probe in a CSM to be a group identifier such that the
readout of each probe is a probabilistic combination of all the
"MxN =
Sensing matrix
M spots
connection between Compressive Sensing and DNA microarrays. Section 3 describes the factors from bioinformatics and
DNA biochemistry that influence probe-target hybridization,
gives an algorithm for the design of the CSM probes based on
these factors, and an example of probe design. Finally Section
4 describes future directions of this work and concludes.
!11 !12 ... ...
!21 !22 ... ...
...
!1N
!2N
... ...
...
...
!M1 !M2 ... ...
...
!MN
...
...
...
N target agents
Fig. 1. Structure of the sensing matrix Φ of the CSM with M
spots identifying N target agents
targets in its group. The probabilities are representative of
each probe’s hybridization affinity (or stickiness) to those targets in its group; the targets that are not in its group have low
affinity to the probe. Then the readout signal at each spot of
the microarray is a linear combination of hybridization affinities between its probe sequence and each of the target agents.
Figure 1 illustrates the sensing process pictorially. To formalize, we assume there are M spots on the CSM and N
targets; we have far fewer spots than target agents, so that
M ! N . For 1 ≤ i ≤ M and 1 ≤ j ≤ N , the probe
at spot i hybridizes with target j with affinity φi,j . The target j occurs in the tested DNA sample with concentration
xj . Then the measured microarray signal intensity vector
y = {yi }, i = 1, . . . , M fits nicely
!into the CS measurement
model of Section 2.1. Each yi = i,j φi,j xj .
In related work, group testing has previously been proposed for microarrays [9]. The chief advantage of a CS-based
approach over direct group testing [10] is its information scalability. With a reduced number of measurements, we are
able to not just detect, but also estimate the target signal.
This is important, because often pathogens in the environment are only harmful to us in large concentrations. Furthermore, we are able to use CS recovery methods such as Belief
Propagation that decode x while accounting for experimental
noise and measurement nonlinearities due to excessive target
molecules [4].
3. FROM Φ TO PROBE SEQUENCES
Given a sparse target vector x, our goal is to design a microarray that implements a CS measurement matrix Φ with the RIP.
In the following we make the design assumption that we can
choose the Φ to achieve in the microarray. This specifies both
the number of probe spots M and the desired degree of hybridization φi,j between probe i and each potential target j.
3.1. Factors Affecting Hybridization Affinity
An element of Φ should not just be representative of a probe’s
hybridization affinity to a target but reflect the spot intensity
of a probe-target hybridization, since the measurements available to us are in fact the spot intensities. Given a model to
predict a spot intensity from any (probe i, target j) pair, we
can use it to engineer an appropriate probe sequence to hybridize with a given target and create a desired spot intensity
element φi,j in Φ.
There are two main steps to translating (probe, target) →
spot intensity. First, we require a multivariate model that uses
features of the probe and target sequences to predict a hybridization affinity value between them. Second, we must
translate this hybridization value to a microarray spot intensity using a model that takes into account physical parameters
of the experiment such as background noise.
While there are many theories for which features of the
probe and target sequences most influence hybridization,
one factor they all agree on is the sequence similarity between probe and target when they are locally aligned using
the Smith-Waterman (S-W) alignment algorithm. This is a
dynamic programming algorithm that finds the optimal local alignment between two sequences with respect to some
scoring system. For example, if we have two sequences
AGCCAGGG and GT AAGGGA, then the S-W alignment
returns:
AGCCAGGG
|||||
GT AAGGGA
where the two sequences have offsets of 5 and 4 respectively
in order for them to be maximally aligned.
Ideally we would like an all-inclusive hybridization affinity model that takes into account all sequence features that potentially affect hybridization. An exhaustive list of hybridization sequence features is available in [11]. In the absence of
such a model, we use percent identity (defined as the percentage of aligned bases in the S-W alignment) as a measure of hybridization affinity. We currently do not include a
noise model and merely proceed under the assumption that sequence similarity is equivalent to hybridization affinity, which
is equivalent to spot intensity.
There are other critical factors from DNA biochemistry to
consider when designing a microarray probe, such as its GCcontent. This is defined as the fraction of G and C bases in
the probe. It is the single biggest factor affecting the melting
temperature of a DNA probe, which in turn must be as constant as possible between all probes for good probe synthesis. Another important factor from biochemistry is the secondary structure of the probes; we must check that they do
not fold onto themselves, which may hinder their hybridization with a target. Secondary structure can be predicted using
dynamic programming algorithms with thermodynamic parameters [12, 13].
3.2. Probe Selection Algorithm
In practice, probe selection is a delicate task due to the finicky
nature of DNA hybridization. Our design goal is to build
up Φ one row at a time by finding DNA sequences that are
shared between groups of targets. A convenient way to do
this is to utilize the COGs (Clusters of Orthologous Groups
of Proteins) database [14], an NIH-governed database that
organizes prokaryote and unicellular eukaryotes into groups
based on the similarity of their protein sequences. Since protein sequences can be translated back to DNA, this gives us
a basis for grouping organisms according to their DNA sequence similarity. The advantage of using the COGs database
is that it is based on exhaustive alignments between organisms
and is therefore a good leapfrog into target grouping. Furthermore, as more genomes are sequenced, they are added into the
grouping so that the database is continually expanded. Each
COG corresponds to a potential row of Φ.
In preparation for the probe design algorithm, we must
choose: (i) a sparse Φ to approximate [8]; (ii) a COG for
each row of Φ, such that its target grouping resembles that
row; (iii) an appropriate probe length (for an oligonucleotide
microarray this can currently be between 20-70 bases.)
The probe design algorithm is as follows:
Algorithm 1 : Probe design for CSMs
1. For rows i :1→M :
2. Use a sliding window of probe length on the shortest
target sequence of the L targets to find all potential
probe candidates satisfying the GC-content constraint.
3. For each of the remaining L − 1 targets in row i:
(a) Perform a Smith-Waterman alignment between
all probe candidates and the target;
(b) Calculate the hybridization affinity between each
probe candidate and the target;
(c) If the hybridization affinities of a probe with all
L targets are close to their corresponding φi,j values, store the probe;
(d) End
4. Check for loop formation in the secondary structure of
the complements of all the surviving probe candidates.
5. Choose the probe with zero or fewest loops. If more
than one, choose the probe with the shortest length loop
and highest target hybridization affinities.
6. End
Two modifications will improve the algorithm in practical use. First, the preprocessing step described for choosing
COGs is fairly restrictive. Instead we might integrate this into
the algorithm and use the guidelines for good Φ design [15]
to adaptively choose every subsequent row of Φ according to
the available COG groupings for the targets we are interested
in. Second, in Step 2 of Algorithm 1 we choose the shortest length target to find potential probe candidates because it
leaves us a larger search space (consisting of the longer targets) for a probe candidate to potentially align with. However,
to be precise, an exhaustive search of potential probe candidates over all target sequences is necessary.
3.3. Toy Probe Design Example for Φ3×7
We describe a small-scale probe design example, designing 3
probes of length 26 to identify 7 unicellular organisms. The
genomes are available from the NCBI Genbank database:
4. DISCUSSION AND CONCLUSION
50
Frequency
40
30
20
10
0
0
5
10
15
20
Mean Squared Error in Target Reconstruction
Fig. 2. MSE histogram (summarizing 500 trials) during target
estimation by Belief Propagation with N = 7, K = 1, M = 3.
1.
2.
3.
4.
5.
6.
7.
Mth = Methanothermobacter thermautotrophicus
Mja = Methanococcus jannaschii
Mac = Methanosarcina acetivorans str.C2A
Pab = Pyrococcus horikoshii
Afu = Archaeoglobus fulgidus
Mka = Methanopyrus kandleri AV19
Tvo = Thermoplasma volcanium
5. REFERENCES
[1] “Affymetrix microarrays,” http://www.affymetrix.com/products/arrays/
specific/cexpress.affx.
In the probe design algorithm, we choose to approximate a
3 × 7 Hamming group testing matrix Φ, which satisfies the
RIP. We generated 3 probe sequences of length 26. None of
the probes have loops in their secondary structure. Read from
the 5’ to the 3’ end of the DNA strand:
Probe 1: AGAGGAT GAAGAAGT T GGT CGGT CT C
Probe 2: GT CCT GCCCCGT GT CCT CT T AAT GAC
Probe 3: AAAGGGAGGAGCT T T T GGAGGT ACGA
Using the hybridization affinity model from Section 3.1, we
predict that the probes will hybridize with the target to produce a Φ:
Targets →
Probe 1
Probe 2
Probe 3
1
0.80
0.80
0.45
2
0.82
0.36
0.71
3
1.00
1.00
1.00
4
0.95
0.48
0.75
5
0.62
0.75
0.81
There are several directions for improving CSM probe design.
First, while the COGs database is a good starting point for
grouping organisms, it would be more useful to have groupings based directly on the hybridization affinities of targets
with probes. Second, for a biosensing application that classifies 1000s of organisms we need longer probe sequences
(∼70 bases), which may be better designed using a simultaneous “multiple alignment” between targets. Third, we can
include design features from existing unique probe selection
algorithms [16] and integrate a check for each probe’s nonalignment with the N − L non-targets of its Φ row.
6
0.58
0.57
0.80
7
0.52
0.76
0.57
While the predicted Φ is non-sparse (real-valued), many
of its entries will be effectively zero due to DNA hybridization
effects. Even if a probe has some sequence similarity with a
target, it may not be sufficient for hybridization. Thus, we
apply a thresholding rule that accounts for probe-target hybridization affinities to sparsify the non-sparse Φ. As a result,
the effective Φ (that we use in our reconstruction algorithm) is
sparse. In our example above, by applying a hard threshold of
0.75%, we transform Φ into the 3 × 7 Hamming group testing
matrix. Admittedly, these thresholding rules are somewhat
arbitrary and will ultimately need to be determined through
experimental calibration.
We apply Belief Propagation to decipher the sensor readout [4, 8]. Figure 2 shows a histogram of error estimates in
reconstructing a single target (only 1 of the 7 organisms is
present) using the above Φ. We expect that using a real-valued
Φ with an accurate spot intensity model will estimate the targets even better.
[2] O. Milenkovic, R. Baraniuk, and T. Simunic-Rosing, “Compressed
sensing meets bioinformatics: A novel DNA microarray design.,” in
Second Annual ITA Workshop, San Diego, California, Jan. 2007.
[3] M. A. Sheikh, S. Sarvotham, O. Milenkovic, and R. G. Baraniuk,
“Compressed sensing DNA microarrays,” Rice University Technical
Report ECE-07-06, May 2007.
[4] M. A. Sheikh, O. Milenkovic, and R. G. Baraniuk, “DNA array decoding from nonlinear measurements by belief propagation,” in IEEE SSP
Workshop, Madison, WI, Aug. 2007.
[5] E. J. Candès and T. Tao, “Decoding by linear programming,” IEEE
Trans. Inform. Theory, vol. 51, pp. 4203–4215, Dec. 2005.
[6] E. J. Candès and T. Tao, “Near optimal signal recovery from random
projections: Universal encoding strategies?,” IEEE Trans. Info. Theory,
vol. 52, no. 12, pp. 5406–5425, Dec. 2006, Preprint.
[7] D. L. Donoho, “Compressed sensing,” IEEE Trans. Info. Theory, vol.
52, no. 4, pp. 1289–1306, Sept. 2006.
[8] S. Sarvotham, D. Baron, and R. G. Baraniuk, “Compressed sensing
reconstruction via belief propagation,” Rice University Tech Report
ECE-06-01, Oct. 2006.
[9] A. Schliep, D. Torney, and S. Rahmann, “Group testing with DNA
chips: Generating designs and decoding experiments,” in Proc. of Computational Systems Bioinformatics Conf., 2003.
[10] D. Z. Du and F. K. Hwang, Combinatorial group testing and its applications, World Scientific Publishing Co., 2000.
[11] Y. Chen, C.C. Chou, X. Lu, E. Slate, K. Peck, W. Xu, E. Voit, and
J. Almeida, “A multivariate prediction model for microarray crosshybridization.,” BMC Bioinformatics, vol. 7, no. 1, pp. 101–112, 2006.
[12] I. Hofacker, “Vienna RNA secondary structure server.,” Nucleic Acids
Research, vol. 31, pp. 3429–3431, 2003.
[13] J. SantaLucia Jr., “A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics.,” Proc. Natl. Acad.
Sci., pp. 1460–1465, 1998.
[14] “Clusters of orthologous groups — NCBI/NIH,”
http://www.ncbi.nlm.nih.gov/COG/.
Available at
[15] S. Sarvotham and R. G. Baraniuk, “Bounds and practical constructions
of optimal CS matrices,” Rice University Technical Report ECE, Nov.
2007.
[16] J. Rouillard, M. Zuker, and E. Gulari, “Oligoarray 2.0: Oligo probe
design using a thermodynamic approach.,” Nucleic Acids Research,
vol. 31, pp. 3057–3062, 2003.