* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Designing Compressive Sensing DNA Microarrays
Survey
Document related concepts
Transcript
DESIGNING COMPRESSIVE SENSING DNA MICROARRAYS Mona A. Sheikh,∗ Olgica Milenkovic,† Richard G. Baraniuk∗ Rice University University of Illinois at Urbana-Champaign ∗ † ABSTRACT A Compressive Sensing Microarray (CSM) is a new device for DNA-based identification of target organisms that leverages the nascent theory of Compressive Sensing (CS). In contrast to a conventional DNA microarray, in which each genetic sensor spot is designed to respond to a single target organism, in a CSM each sensor spot responds to a group of targets. As a result, significantly fewer total sensor spots are required. In this paper, we study how to design group identifier probes that simultaneously account for both the constraints from the CS theory and the biochemistry of probe-target DNA hybridization. We employ Belief Propagation as a CS recovery method to estimate target concentrations from the microarray intensities. Index Terms— Compressive Sensing, DNA Microarrays, Probe Design, Hybridization Affinity 1. INTRODUCTION Biosensing of pathogens is a timely research area of high consequence. An accurate and rapid biosensing paradigm has the potential to impact the fields of healthcare, defense, and environmental monitoring, among others. The DNA microarray is a frequently applied solution for microbe sensing that has a significant edge over competitors due to its ability to sense many organisms in parallel [1]. The principle behind all current DNA microarrays is basically the same. A DNA microarray consists of genetic sensors or spots, each containing DNA sequences termed probes. From the perspective of a microarray, each DNA sequence can be viewed as a sequence of four DNA bases {A, T , G, C} that tend to bind with one another in complementary pairs: A with T and G with C. Therefore, a DNA subsequence in a target organism’s genetic sample will tend to bind or “hybridize” with its complementary subsequence on a microarray to form a stable structure. The target DNA sample to be identified is fluorescently tagged before it is flushed over the microarray. The extraneous DNA is washed away so that only the bound DNA is left on the array. The array is then scanned using laser light of a wavelength designed to trigger fluorescence in the Supported by NSF Award CCF 0729216. Rice Electrical & Computer Engineering, UIUC Electrical & Computer Engineering; Email:{msheikh, richb}@rice.edu; [email protected]; Web: dsp.rice.edu/csdna spots where binding has occurred. A specific pattern of array spots will fluoresce, which is then used to infer the genetic makeup in the test sample. There are three issues with traditional DNA microarrays that stem from the fact that each sensing spot is designed to uniquely identify only one target of interest. The first concern is that very often the targets in a test sample have similar base sequences, causing them to hybridize with the wrong probe. These cross-hybridization events lead to errors in the array readout. Current microarray design methods do not address cross-matches between similar DNA sequences. The second concern is the restriction on the number of organisms that can be identified. In a typical biosensing application, multiple organisms must be identified, which necessitates a large number of spots. As a consequence readout systems for traditional DNA arrays are difficult to miniaturize. The third concern is the inefficient utilization of the large number of array spots in traditional microarrays. While the number of potential agents in a sample can be very large, only a few agents are expected to be present in a significant concentration at a given time and location or in a given air/water/soil sample. Therefore, in a traditionally designed microarray only a small fraction of the large number of spots will be active at a given time, corresponding to the few targets present. To combat these issues, we have proposed a new microarray architecture using “combinatorial testing sensors” in order to reduce the number of sensor spots. We refer to this new type of array as a Compressive Sensing DNA Microarray (CSM), since it is based on the nascent theory of Compressive Sensing [2–4]. Each spot in a CSM identifies a group of target organisms, and several spots together generate a unique pattern identifier for a single target. Designing the probes that perform this combinatorial sensing is the essence of the microarray design process, and what we aim to describe in this paper. Note that although we address microarrays for a biosensing application in this paper, the principles behind CSMs are also applicable for the laboratory identification of diseasecausing genes in a single organism. This paper is organized as follows. Section 2 describes the 2. PRINCIPLES OF CS MICROARRAYS 2.1. Compressive Sensing Compressive Sensing (CS) is a recently developed sampling theory for sparse signals [5]. The main result of CS, introduced by Candès, Romberg, and Tao [6] and Donoho [7], is that a length-N signal x that is K-sparse in some basis can be recovered exactly from just M = O(K log(N/K)) measurements of the signal. In this paper we choose the canonical basis; hence x has K ! N nonzero and N − K zero entries. In matrix notation, we measure y = Φx, where x is the N ×1 sparse signal vector we aim to sense, y is an M ×1 measurement vector, and the measurement matrix Φ is an M × N matrix. Since M < N , recovery of the signal x from the measurements y is ill-posed in general; however the additional assumption of signal sparsity makes recovery possible. In the presence of measurement noise, the model becomes y = Φx + n where n denotes a vector of, for example, i.i.d. additive white Gaussian noise samples with zero mean. The two critical conditions to realize CS are that: (i) the vector x to be sensed is sufficiently sparse and (ii) the rows of Φ are sufficiently incoherent with the signal sparsity basis. Incoherence is achieved if Φ satisfies the so-called Restricted Isometry Property (RIP) [5]. For example, random matrices built from Gaussian and Bernoulli distributions satisfy the RIP with high probability. Φ can also be sparse with only L nonzero entries per row (L can vary from row to row) [8]. Various methods have been developed to recover a sparse x from the measurements y, including linear programs and greedy algorithms. When Φ itself is sparse, Belief Propagation and related graphical algorithms can also be applied for fast signal reconstruction [8]. An important property of CS is its information scalability; CS measurements can be used for a wide range of statistical inference tasks besides signal reconstruction, including estimation, detection and classification. 2.2. Compressive Sensing Meets Microarrays Let the N × 1 vector x represent the concentrations of N possible organisms in a sample; we assume that only K ! N of them are actually present. We aim to design a microarray that implements a Φ that satisfies the RIP. Such a CSM will be able to reconstruct an estimate of x using M ! N probe spots. To obtain a CS-type measurement scheme, we can choose each probe in a CSM to be a group identifier such that the readout of each probe is a probabilistic combination of all the "MxN = Sensing matrix M spots connection between Compressive Sensing and DNA microarrays. Section 3 describes the factors from bioinformatics and DNA biochemistry that influence probe-target hybridization, gives an algorithm for the design of the CSM probes based on these factors, and an example of probe design. Finally Section 4 describes future directions of this work and concludes. !11 !12 ... ... !21 !22 ... ... ... !1N !2N ... ... ... ... !M1 !M2 ... ... ... !MN ... ... ... N target agents Fig. 1. Structure of the sensing matrix Φ of the CSM with M spots identifying N target agents targets in its group. The probabilities are representative of each probe’s hybridization affinity (or stickiness) to those targets in its group; the targets that are not in its group have low affinity to the probe. Then the readout signal at each spot of the microarray is a linear combination of hybridization affinities between its probe sequence and each of the target agents. Figure 1 illustrates the sensing process pictorially. To formalize, we assume there are M spots on the CSM and N targets; we have far fewer spots than target agents, so that M ! N . For 1 ≤ i ≤ M and 1 ≤ j ≤ N , the probe at spot i hybridizes with target j with affinity φi,j . The target j occurs in the tested DNA sample with concentration xj . Then the measured microarray signal intensity vector y = {yi }, i = 1, . . . , M fits nicely !into the CS measurement model of Section 2.1. Each yi = i,j φi,j xj . In related work, group testing has previously been proposed for microarrays [9]. The chief advantage of a CS-based approach over direct group testing [10] is its information scalability. With a reduced number of measurements, we are able to not just detect, but also estimate the target signal. This is important, because often pathogens in the environment are only harmful to us in large concentrations. Furthermore, we are able to use CS recovery methods such as Belief Propagation that decode x while accounting for experimental noise and measurement nonlinearities due to excessive target molecules [4]. 3. FROM Φ TO PROBE SEQUENCES Given a sparse target vector x, our goal is to design a microarray that implements a CS measurement matrix Φ with the RIP. In the following we make the design assumption that we can choose the Φ to achieve in the microarray. This specifies both the number of probe spots M and the desired degree of hybridization φi,j between probe i and each potential target j. 3.1. Factors Affecting Hybridization Affinity An element of Φ should not just be representative of a probe’s hybridization affinity to a target but reflect the spot intensity of a probe-target hybridization, since the measurements available to us are in fact the spot intensities. Given a model to predict a spot intensity from any (probe i, target j) pair, we can use it to engineer an appropriate probe sequence to hybridize with a given target and create a desired spot intensity element φi,j in Φ. There are two main steps to translating (probe, target) → spot intensity. First, we require a multivariate model that uses features of the probe and target sequences to predict a hybridization affinity value between them. Second, we must translate this hybridization value to a microarray spot intensity using a model that takes into account physical parameters of the experiment such as background noise. While there are many theories for which features of the probe and target sequences most influence hybridization, one factor they all agree on is the sequence similarity between probe and target when they are locally aligned using the Smith-Waterman (S-W) alignment algorithm. This is a dynamic programming algorithm that finds the optimal local alignment between two sequences with respect to some scoring system. For example, if we have two sequences AGCCAGGG and GT AAGGGA, then the S-W alignment returns: AGCCAGGG ||||| GT AAGGGA where the two sequences have offsets of 5 and 4 respectively in order for them to be maximally aligned. Ideally we would like an all-inclusive hybridization affinity model that takes into account all sequence features that potentially affect hybridization. An exhaustive list of hybridization sequence features is available in [11]. In the absence of such a model, we use percent identity (defined as the percentage of aligned bases in the S-W alignment) as a measure of hybridization affinity. We currently do not include a noise model and merely proceed under the assumption that sequence similarity is equivalent to hybridization affinity, which is equivalent to spot intensity. There are other critical factors from DNA biochemistry to consider when designing a microarray probe, such as its GCcontent. This is defined as the fraction of G and C bases in the probe. It is the single biggest factor affecting the melting temperature of a DNA probe, which in turn must be as constant as possible between all probes for good probe synthesis. Another important factor from biochemistry is the secondary structure of the probes; we must check that they do not fold onto themselves, which may hinder their hybridization with a target. Secondary structure can be predicted using dynamic programming algorithms with thermodynamic parameters [12, 13]. 3.2. Probe Selection Algorithm In practice, probe selection is a delicate task due to the finicky nature of DNA hybridization. Our design goal is to build up Φ one row at a time by finding DNA sequences that are shared between groups of targets. A convenient way to do this is to utilize the COGs (Clusters of Orthologous Groups of Proteins) database [14], an NIH-governed database that organizes prokaryote and unicellular eukaryotes into groups based on the similarity of their protein sequences. Since protein sequences can be translated back to DNA, this gives us a basis for grouping organisms according to their DNA sequence similarity. The advantage of using the COGs database is that it is based on exhaustive alignments between organisms and is therefore a good leapfrog into target grouping. Furthermore, as more genomes are sequenced, they are added into the grouping so that the database is continually expanded. Each COG corresponds to a potential row of Φ. In preparation for the probe design algorithm, we must choose: (i) a sparse Φ to approximate [8]; (ii) a COG for each row of Φ, such that its target grouping resembles that row; (iii) an appropriate probe length (for an oligonucleotide microarray this can currently be between 20-70 bases.) The probe design algorithm is as follows: Algorithm 1 : Probe design for CSMs 1. For rows i :1→M : 2. Use a sliding window of probe length on the shortest target sequence of the L targets to find all potential probe candidates satisfying the GC-content constraint. 3. For each of the remaining L − 1 targets in row i: (a) Perform a Smith-Waterman alignment between all probe candidates and the target; (b) Calculate the hybridization affinity between each probe candidate and the target; (c) If the hybridization affinities of a probe with all L targets are close to their corresponding φi,j values, store the probe; (d) End 4. Check for loop formation in the secondary structure of the complements of all the surviving probe candidates. 5. Choose the probe with zero or fewest loops. If more than one, choose the probe with the shortest length loop and highest target hybridization affinities. 6. End Two modifications will improve the algorithm in practical use. First, the preprocessing step described for choosing COGs is fairly restrictive. Instead we might integrate this into the algorithm and use the guidelines for good Φ design [15] to adaptively choose every subsequent row of Φ according to the available COG groupings for the targets we are interested in. Second, in Step 2 of Algorithm 1 we choose the shortest length target to find potential probe candidates because it leaves us a larger search space (consisting of the longer targets) for a probe candidate to potentially align with. However, to be precise, an exhaustive search of potential probe candidates over all target sequences is necessary. 3.3. Toy Probe Design Example for Φ3×7 We describe a small-scale probe design example, designing 3 probes of length 26 to identify 7 unicellular organisms. The genomes are available from the NCBI Genbank database: 4. DISCUSSION AND CONCLUSION 50 Frequency 40 30 20 10 0 0 5 10 15 20 Mean Squared Error in Target Reconstruction Fig. 2. MSE histogram (summarizing 500 trials) during target estimation by Belief Propagation with N = 7, K = 1, M = 3. 1. 2. 3. 4. 5. 6. 7. Mth = Methanothermobacter thermautotrophicus Mja = Methanococcus jannaschii Mac = Methanosarcina acetivorans str.C2A Pab = Pyrococcus horikoshii Afu = Archaeoglobus fulgidus Mka = Methanopyrus kandleri AV19 Tvo = Thermoplasma volcanium 5. REFERENCES [1] “Affymetrix microarrays,” http://www.affymetrix.com/products/arrays/ specific/cexpress.affx. In the probe design algorithm, we choose to approximate a 3 × 7 Hamming group testing matrix Φ, which satisfies the RIP. We generated 3 probe sequences of length 26. None of the probes have loops in their secondary structure. Read from the 5’ to the 3’ end of the DNA strand: Probe 1: AGAGGAT GAAGAAGT T GGT CGGT CT C Probe 2: GT CCT GCCCCGT GT CCT CT T AAT GAC Probe 3: AAAGGGAGGAGCT T T T GGAGGT ACGA Using the hybridization affinity model from Section 3.1, we predict that the probes will hybridize with the target to produce a Φ: Targets → Probe 1 Probe 2 Probe 3 1 0.80 0.80 0.45 2 0.82 0.36 0.71 3 1.00 1.00 1.00 4 0.95 0.48 0.75 5 0.62 0.75 0.81 There are several directions for improving CSM probe design. First, while the COGs database is a good starting point for grouping organisms, it would be more useful to have groupings based directly on the hybridization affinities of targets with probes. Second, for a biosensing application that classifies 1000s of organisms we need longer probe sequences (∼70 bases), which may be better designed using a simultaneous “multiple alignment” between targets. Third, we can include design features from existing unique probe selection algorithms [16] and integrate a check for each probe’s nonalignment with the N − L non-targets of its Φ row. 6 0.58 0.57 0.80 7 0.52 0.76 0.57 While the predicted Φ is non-sparse (real-valued), many of its entries will be effectively zero due to DNA hybridization effects. Even if a probe has some sequence similarity with a target, it may not be sufficient for hybridization. Thus, we apply a thresholding rule that accounts for probe-target hybridization affinities to sparsify the non-sparse Φ. As a result, the effective Φ (that we use in our reconstruction algorithm) is sparse. In our example above, by applying a hard threshold of 0.75%, we transform Φ into the 3 × 7 Hamming group testing matrix. Admittedly, these thresholding rules are somewhat arbitrary and will ultimately need to be determined through experimental calibration. We apply Belief Propagation to decipher the sensor readout [4, 8]. Figure 2 shows a histogram of error estimates in reconstructing a single target (only 1 of the 7 organisms is present) using the above Φ. We expect that using a real-valued Φ with an accurate spot intensity model will estimate the targets even better. [2] O. Milenkovic, R. Baraniuk, and T. Simunic-Rosing, “Compressed sensing meets bioinformatics: A novel DNA microarray design.,” in Second Annual ITA Workshop, San Diego, California, Jan. 2007. [3] M. A. Sheikh, S. Sarvotham, O. Milenkovic, and R. G. Baraniuk, “Compressed sensing DNA microarrays,” Rice University Technical Report ECE-07-06, May 2007. [4] M. A. Sheikh, O. Milenkovic, and R. G. Baraniuk, “DNA array decoding from nonlinear measurements by belief propagation,” in IEEE SSP Workshop, Madison, WI, Aug. 2007. [5] E. J. Candès and T. Tao, “Decoding by linear programming,” IEEE Trans. Inform. Theory, vol. 51, pp. 4203–4215, Dec. 2005. [6] E. J. Candès and T. Tao, “Near optimal signal recovery from random projections: Universal encoding strategies?,” IEEE Trans. Info. Theory, vol. 52, no. 12, pp. 5406–5425, Dec. 2006, Preprint. [7] D. L. Donoho, “Compressed sensing,” IEEE Trans. Info. Theory, vol. 52, no. 4, pp. 1289–1306, Sept. 2006. [8] S. Sarvotham, D. Baron, and R. G. Baraniuk, “Compressed sensing reconstruction via belief propagation,” Rice University Tech Report ECE-06-01, Oct. 2006. [9] A. Schliep, D. Torney, and S. Rahmann, “Group testing with DNA chips: Generating designs and decoding experiments,” in Proc. of Computational Systems Bioinformatics Conf., 2003. [10] D. Z. Du and F. K. Hwang, Combinatorial group testing and its applications, World Scientific Publishing Co., 2000. [11] Y. Chen, C.C. Chou, X. Lu, E. Slate, K. Peck, W. Xu, E. Voit, and J. Almeida, “A multivariate prediction model for microarray crosshybridization.,” BMC Bioinformatics, vol. 7, no. 1, pp. 101–112, 2006. [12] I. Hofacker, “Vienna RNA secondary structure server.,” Nucleic Acids Research, vol. 31, pp. 3429–3431, 2003. [13] J. SantaLucia Jr., “A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics.,” Proc. Natl. Acad. Sci., pp. 1460–1465, 1998. [14] “Clusters of orthologous groups — NCBI/NIH,” http://www.ncbi.nlm.nih.gov/COG/. Available at [15] S. Sarvotham and R. G. Baraniuk, “Bounds and practical constructions of optimal CS matrices,” Rice University Technical Report ECE, Nov. 2007. [16] J. Rouillard, M. Zuker, and E. Gulari, “Oligoarray 2.0: Oligo probe design using a thermodynamic approach.,” Nucleic Acids Research, vol. 31, pp. 3057–3062, 2003.