Application of Graphics Processing Unit to Compute Multiple Sequence Alignment using Star Method for Soybean Genes

Muhammad Adi Puspo Sujiwo (1), Wisnu Ananta Kusuma (1), Agus Buono (1), Mukhlis Hidayat (2)
(1) Department of Computer Science, Bogor Agricultural University
(2) University of Syah Kuala

ABSTRACT

Multiple sequence alignment (MSA) is an important method in bioinformatics, with applications in plant breeding. However, aligning a large number of sequences may take a long time due to high time and space complexity. Recently, the GPU has become a promising parallel computing architecture due to its good performance per price. Our objective is to employ the GPU to accelerate the computation of MSA using the star method. This paper presents an implementation of the star method that runs on the GPU, written in CUDA. Parallelization of the star method is obtained by adopting the wavefront parallel pattern. Tests using semi-synthetic data derived from soybean genes show that our program achieves significant speed-ups over a CPU-only implementation.

Keywords: Multiple Sequence Alignment, Parallel Processing, GPU, Bioinformatics

INTRODUCTION

Multiple Sequence Alignment (MSA) is an area of research in bioinformatics concerned with aligning three or more biological sequences, as a generalization of pairwise alignment. MSA is useful in phylogenetic analysis for assessing the origin of species, and in the search for Single Nucleotide Polymorphisms (SNPs), which are useful for DNA fingerprinting, the investigation of hereditary diseases, and molecular genetics-based plant breeding.

The optimal solution for MSA is based on exhaustive dynamic programming, which has been shown to be an NP-complete problem [1]. Most approaches that approximate MSA are heuristic or probabilistic, such as the star method or the progressive method. However, these methods still have excessively high complexity, especially when run on a single processor.

One of the most promising processor architectures for handling this complexity is the Graphics Processing Unit (GPU). The GPU was originally designed as a graphics accelerator for personal computers and game consoles, but it has since been developed to run general-purpose computations for many purposes, such as numerical computation and bioinformatics. Compared to a general-purpose CPU, the GPU offers a very high level of parallelism owing to its conception as a graphics accelerator. The drawback is that it handles branching instructions poorly, and algorithms must take this into account when they are implemented on GPUs. Besides reducing the number of branches, vectorizing memory access patterns and reducing CPU-GPU synchronization play important roles in the optimization of GPU programs.

Many parallel solutions for MSA have been developed before, such as [2] and [3]; these solutions are based on the progressive method. This paper, in contrast, discusses the parallelization of the star method on the GPU. Although its objective and use case is that of a short aligner, the importance of this software should not be understated: many MSA-like routines are used inside larger bioinformatics software, mainly DNA assembly and short-read mapping routines. Our software therefore gains additional usability when included as a component in solutions to these problems.

GPU COMPUTING

The Graphics Processing Unit (GPU) was conceived primarily as an accelerator for graphics applications such as games and CAD. However, it has been realized that the GPU can also handle general-purpose parallel computing tasks due to its inherently parallel architecture.
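To make this concrete, here is a minimal CUDA sketch of a general-purpose GPU computation (illustrative only, not part of the MSA program described in this paper; all names are ours): each GPU thread adds one element of two vectors.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: thread i computes one element of c = a + b.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);
    /* ... fill a and b via cudaMemcpy from host arrays (omitted) ... */
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The <<<blocks, threadsPerBlock>>> launch syntax and the thread/block hierarchy it configures are explained in the remainder of this section.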
Interest in GPGPU has increased primarily since the introduction of programming tools such as CUDA and OpenCL, which let programmers express problems in a general programming language. Figure 1 illustrates the GPU architecture relevant to general-purpose GPU computing. In short, a GPU may be viewed as a shared-memory multiprocessor architecture, similar to today's multicore processors in personal computers. A GPU consists of several independent streaming multiprocessors (SMs). Each SM comprises several stream processors (SPs) and provides a common memory controller, register set, caches, and shared memory for its SPs. This small and fast shared memory may be used as a kind of "user-managed cache".

Figure 1 General GPU architecture

It is important to note that the GPU is a device separate from the main CPU, connected through a bus such as PCI Express. The two therefore have separate address spaces: the CPU cannot easily access GPU memory and vice versa. Due to high latency and limited bus bandwidth, GPU programs should limit this cross-bus communication (data transfers and synchronization).

For programmers, a GPU computing system consists of two parts: the host, i.e. the traditional CPU, and one or more GPU devices. A GPU program then comprises two kinds of sections: serial sections, which have little data parallelism and run better on the CPU; and parallel sections, which have rich data parallelism and are run on the GPU. A GPU section is called a kernel. When executed, a kernel generates a large number of threads, collectively called a grid. These threads can be grouped into blocks to exploit synchronization and shared memory. This thread, block, and grid hierarchy is illustrated in Figure 2.

Figure 2 Programmer's view of a GPU device: threads grouped into blocks within a grid

STAR METHOD FOR MSA AND ITS PARALLELIZATION

The computation of MSA with the star method can be divided into three main stages (Figure 3):

1. Pairwise similarity score computation. The objectives of this stage are to compute a score for each pair of sequences and to compute gap vectors for all sequences. The complexity of this stage is $O(s^2)$, where $s$ is the number of sequences to align.
2. Selection of a pivot sequence designated as the star, followed by construction of the star sequence by inserting the appropriate gaps from its gap vector.
3. Realignment of all other sequences against the star. The complexity of this stage is $O(s)$.

In our research, the parallelization effort is focused on stages 1 and 3, as these are the most time-consuming parts of the star method (both involve the Smith-Waterman algorithm).

Figure 3 The three stages of the star method for computing MSA

Parallelization of Pairwise Similarity Score

Given a pair of sequences S1 and S2 with lengths $l_1$ and $l_2$, the pairwise similarity score is defined as the number of exact matches in the optimal local alignment of S1 and S2, as described in [3] and [4]. This score is computed using the Smith-Waterman algorithm [5] by constructing a matrix H. The matrix is initialized as $H_{k0} = H_{0l} = 0$ for $0 \le k \le l_1$ and $0 \le l \le l_2$. The elements of H are calculated as

$$H_{ij} = \max \begin{cases} 0 \\ H_{i-1,j-1} + s(c_1, c_2) \\ H_{i-1,j} - g \\ H_{i,j-1} - g \end{cases} \tag{1}$$

where $s(c_1, c_2)$ is the substitution-matrix value for character $c_1$ from S1 compared against character $c_2$ from S2, and $g$ is the gap penalty.

The parallelization of H is derived from the data dependency of each cell. From equation (1), each cell depends on three other cells: its left, upper, and upper-left neighbors. Using the wavefront pattern described in [6], the cells of H are computed in parallel per anti-diagonal.
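As an illustration of this wavefront scheme, the following simplified CUDA kernel is a hypothetical sketch, not the paper's actual code: it assumes row-major storage of H, toy match/mismatch scores in subst(), a linear gap penalty g, and at least $l_1$ threads per block, and it sweeps H one anti-diagonal at a time with one sequence pair per block.

```cuda
// Hypothetical wavefront sketch. One block aligns one sequence pair;
// thread t is responsible for row t+1 of H on every anti-diagonal.
// Longer sequences would require each thread to loop over several rows.
__device__ int subst(char c1, char c2)
{
    return (c1 == c2) ? 2 : -1;   // toy substitution scores (assumed)
}

__global__ void swWavefront(const char *s1, int l1,
                            const char *s2, int l2,
                            int *H, int g)
{
    const int w = l2 + 1;                 // row width of H
    // First row and column of H are assumed pre-initialized to 0.
    for (int d = 2; d <= l1 + l2; ++d) {  // anti-diagonal index d = i + j
        int i = threadIdx.x + 1;
        int j = d - i;
        if (i <= l1 && j >= 1 && j <= l2) {
            int h = max(0, H[(i - 1) * w + (j - 1)] + subst(s1[i - 1], s2[j - 1]));
            h = max(h, H[(i - 1) * w + j] - g);
            h = max(h, H[i * w + (j - 1)] - g);
            H[i * w + j] = h;
        }
        __syncthreads();  // finish diagonal d before starting d + 1
    }
}
```

The row-major indexing here is only for readability; the coalesced anti-diagonal layout actually used is described next.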
The wavefront pattern and its dependencies are illustrated in Figure 4.

Figure 4 Cells of the alignment matrix on the same anti-diagonal can be computed in parallel

To facilitate vectorization and to ensure that memory accesses to H are coalesced, the elements of H are arranged so that the anti-diagonals are stored one after another, rather than in the more conventional row- or column-major order. With this arrangement, the threads in a block read and write consecutive cells of matrix H.

After computing H, the program performs backtracking in order to build the gap vectors of all sequences. The gap vector of a sequence is defined as the maximum number of gaps between each character (base) of that sequence [5]. Backtracking starts from the cell of H holding the highest value; this value is stored as the pairwise similarity score. From this starting point $(i, j)$, the process moves backwards to one of the positions $(i-1, j-1)$, $(i, j-1)$, or $(i-1, j)$, whichever holds the highest value. The process continues until $i = 0$ or $j = 0$.

The computation of pairwise similarity scores may be parallelized further by observing that pairs can be processed concurrently as long as they share no sequence index. This requirement is imposed by the backtracking process that builds the gap vectors: the gap vector of a sequence is valid only when it is computed by a single process. If we arrange all pairs in a matrix, concurrent execution may be performed per anti-diagonal, as illustrated in Figure 5.

Figure 5 Pair matrix of a group of 8 sequences. All pairs with the same anti-diagonal number may be executed concurrently.

This research does not attempt to reduce the quadratic space complexity of Smith-Waterman, so the memory requirement of matrix H is high. In fact, it can limit the number of concurrently executed pairs, to the point that only one pairwise similarity score is computed at a time. For example, aligning two mutations of the rhodopsin gene (used in this research) with an average length of 8720 requires $8720^2 \times 4$ bytes ≈ 300 MB, since the matrix is stored as a dense integer array with 4 bytes per cell. For a GPU with 1 GB of RAM, this means only three such pairs can run concurrently.

Using these two levels of parallelism (computation of matrix H and concurrent pair execution), we can build the GPU solution for the pairwise similarity score matrix by mapping the matrix H of one pair onto one GPU block, where each thread in the block is mapped to one cell as in Figure 4. One anti-diagonal of the sequence-pair matrix is then mapped onto the GPU grid. However, we must take into account that the grid size is limited by GPU memory, as described above.

When handling these memory-limited cases, we should be aware that the parallelization scheme described above launches too few GPU blocks (so that GPU utilization is too low), which in turn may reduce GPU performance [7]. This case can be addressed by splitting the grid across multiple CPU threads. Each CPU thread creates a stream to handle one pair of sequences (and thus creates one instance of matrix H). Each stream can launch multiple blocks, so the GPU can be kept busy. As suggested in [8], there is a sequence-length threshold for using this scheme; the authors set it to 4096.

Selection of Star

After computing the pairwise similarity score matrix, one sequence is chosen as the "star".
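The selection criterion, described in the next paragraph, is the accumulated pairwise score; a minimal host-side sketch under that assumption (all names hypothetical):

```cuda
#include <vector>

// Host-side sketch (hypothetical): pick as star the sequence whose
// accumulated pairwise similarity score against all others is maximal.
// score is the symmetric s x s matrix computed in stage one.
int selectStar(const std::vector<std::vector<int>> &score)
{
    int star = 0;
    long long best = -1;
    for (size_t i = 0; i < score.size(); ++i) {
        long long sum = 0;
        for (size_t j = 0; j < score.size(); ++j)
            if (j != i)
                sum += score[i][j];
        if (sum > best) { best = sum; star = (int)i; }
    }
    return star;
}
```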
The choice of star is based on the accumulated similarity score. Afterwards, this sequence is transformed into a new sequence by inserting the appropriate number of gaps from its gap vector, which was computed during stage one. This stage can be performed by only one process, so the options for parallelization are limited. Building the new, gapped sequence from the un-gapped one is illustrated in Figure 6.

Figure 6 Building a gapped sequence from its gap vector and source sequence

Realignment

The last stage of MSA with the star method is the alignment of the remaining sequences against the star sequence, without adding gaps to the star. The main difference from stage one lies in the computation of the cell values of H; equation (1) is modified into

$$H_{ij} = \max \begin{cases} 0 \\ H_{i-1,j-1} + s(c_1, c_2) \\ H_{i-1,j} - g \\ H_{i,j-1} \end{cases} \tag{2}$$

The backtracking process in this stage is also similar to stage one, except that only one gap vector is created. The process then continues by building the gapped sequences from their respective gap vectors, as in Figure 6.

TESTING AND PERFORMANCE MEASUREMENT

In this research, the authors developed two versions of the MSA program: a GPU version and a CPU version. The CPU version is parallelized using OpenMP [8] to utilize multicore systems. All programs are written in C++ and compiled with GCC 4.3 under Linux; the CUDA program is written with CUDA Toolkit 5.5. The GPU version was measured on an Nvidia GeForce GTX 550 Ti, with 4 SMs comprising 192 SPs and 1 GB of RAM. For comparison, the CPU version was run on a server with an Intel Xeon E5620 (4 cores, 8 visible processors with Hyper-Threading enabled [9]) and 8 GB of RAM.

For testing purposes, the authors generated four artificial datasets derived from soybean genes in NCBI GenBank [10], as listed in Table 1. Each sequence was mutated randomly to generate 2, 7, 15, 31, and 63 other sequences in the respective datasets. Due to limited space, only the runtimes for the BBI and SS genes are included in this paper; results for the other genes are available from the Subversion server below.

Table 1 Genes used in this experiment

  Gene Name   NCBI ID                   Length
  BBI         548083                    834
  GBD2        UniProtKB/TrEMBL:Q43709   1607
  APX2        60658                     3077
  SS          547508                    6918

Table 2 shows the total running time (in milliseconds) of the MSA programs, using configurations of 1, 2, 4, and 8 CPUs and varying numbers of sequences derived from BBI. This table and the following figures make clear that a high number of sequences favors multiple CPUs and the GPU. Going from 4 to 8 CPUs (that is, utilizing Hyper-Threading, or virtual CPUs) improves performance only slightly as the number of sequences rises. Meanwhile, the GPU shows performance superior to the CPU for short sequences.

Table 2 Runtime of CPU- and GPU-based MSA for BBI (in milliseconds)

  # of sequences   CPU 1   CPU 2   CPU 4   CPU 8    GPU
  3                  141     115     114     114     96
  8                  703     444     216     217    232
  16                2549    1468     725     387    490
  32                9117    5033    2722    1408   1144
  64               34775   18413    9527    5208   3145

Figure 7 Speed-up of the CPU and GPU MSA programs for the BBI datasets. Each bar represents a problem size: 3, 8, 16, 32, or 64 sequences.
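For reference, the speed-ups plotted in Figure 7 are the ratios of the single-CPU runtime in Table 2 to the runtime of each configuration (assuming CPU 1 as the baseline, which matches the scale of the figure). For example, for 64 sequences:

$$\text{speed-up}_{\mathrm{GPU}} = \frac{T_{\mathrm{CPU\,1}}}{T_{\mathrm{GPU}}} = \frac{34775}{3145} \approx 11.1$$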
Table 3 Runtime of CPU- and GPU-based MSA for SS (in milliseconds)

  # of sequences     CPU 1     CPU 2    CPU 4    CPU 8      GPU
  3                   6264      5237     5272     5221      771
  8                  43296     27041    14078    12544     5363
  16                168037     94794    51101    28766    21361
  32                648999    349362   192832   103355    82699
  64               2522419   1364315   742694   397075   320062

Figure 8 Speed-up of the CPU and GPU MSA programs for the SS datasets

Table 3 shows the runtime (in milliseconds) of the MSA programs for the SS gene with varying numbers of sequences. As in Table 2, the number of CPUs is also varied to assess parallelization performance. In this table and the accompanying Figure 8, the GPU is shown to give an almost constant speed-up despite the increasing number of sequences. The authors take this constant speed-up to be consistent with the prediction above, since GPU memory cannot accommodate more than three instances of matrix H in the case of SS.

CONCLUSION

We presented a program that implements the star method for computing MSA. Experimental results show that GPU parallelization of the star method can deliver slightly higher performance than CPU solutions, even for a high number of sequences. However, our program can only handle limited sequence lengths due to the high space requirements of the Smith-Waterman algorithm. In the future, one key improvement to our program is to reduce the space complexity to linear by implementing the modification of Smith-Waterman suggested in [11]. Other suggestions are to support multi-GPU installations and to develop heterogeneous parallelism (i.e., using the GPU and the multicore CPU simultaneously).

This research shows the potential advantage of the GPU as a general computing device for scientific problems. Considering that the GPU used in this research is a low-end device, the GPU appears to be more cost-effective than the CPU. Future research may target the built-in GPUs embedded in current portable devices (tablets, smartphones), which would open up further possibilities.

ACKNOWLEDGEMENT

This research is supported by a grant from KKP3N of the Ministry of Agriculture, Indonesia. The authors would like to thank the Directorate of Information Systems of IPB for hosting the materials of this program. All source code, documentation, and testing data are available via Subversion from http://code.ipb.ac.id:8082/bioinformatika/cudaMSA.

REFERENCES

[1] L. Wang and T. Jiang, "On the Complexity of Multiple Sequence Alignment," Journal of Computational Biology, vol. 1, no. 4, pp. 337-348, 1994.
[2] A. Datta and J. Ebedes, "Multiple Sequence Alignment in Parallel on a Cluster of Workstations," in Parallel Computing for Bioinformatics and Computational Biology, New Jersey: John Wiley & Sons, Inc., 2006, pp. 193-210.
[3] Y. Liu, D. L. Maskell and B. Schmidt, "MSA-CUDA: Multiple Sequence Alignment on Graphics Processing Units with CUDA," in 20th IEEE International Conference on Application-specific Systems, Architectures and Processors, 2009.
[4] D. Gusfield, "Efficient Methods for Multiple Sequence Alignment with Guaranteed Error Bounds," 1991.
[5] T. F. Smith and M. S. Waterman, "Identification of Common Molecular Subsequences," Journal of Molecular Biology, vol. 147, pp. 195-197, 1981.
[6] J. Anvik, "Generating Parallel Programs from the Wavefront Design Pattern," in Proceedings of the 7th International Workshop on High-Level Parallel Programming Models and Supportive Environments, 2002.
[7] D. B. Kirk and W.-M. W. Hwu, Programming Massively Parallel Processors: A Hands-On Approach, Morgan Kaufmann, 2012.
[8] B. Chapman, G. Jost and R. van der Pas, Using OpenMP: Portable Shared Memory Parallel Programming, MIT Press, 2008.
[9] OSDEV Community Staff, "HyperThreading Technology Overview," OSDEV, 26 March 2006. [Online]. Available: http://web.archive.org/web/20090227123128/http://www.osdcom.info/content/view/30/39/.
[10] "NCBI Home Page," [Online]. Available: http://www.ncbi.nlm.nih.gov/.
[11] E. W. Myers and W. Miller, "Optimal Alignments in Linear Space," Department of Computer Science, University of Arizona, Tucson, 1988.
[12] Y. Ye and H. Tang, "Dynamic Programming Algorithms for Biological Sequence and Structure Comparison," in Bioinformatics Algorithms: Techniques and Applications, New Jersey: John Wiley & Sons, Inc., 2008, pp. 9-28.
[13] V. Volkov and J. Demmel, "LU, QR and Cholesky Factorizations Using Vector Capabilities of GPUs," University of California, Berkeley, 2008.
[14] M. McCool, J. Reinders and A. Robison, Structured Parallel Programming: Patterns for Efficient Computation, Waltham: Elsevier, 2012.
[15] A. Bustamam, G. Ardaneswari and D. Lestari, "Implementation of CUDA GPU-Based Parallel Computing on Smith-Waterman Algorithm to Sequence Database Searches," in 2013 International Conference on Advanced Computer Science and Information Systems, Bali, 2013.
[16] J. Blazewicz, W. Frohmberg, M. Kierzynka, E. Pesch and P. Wojciechowski, "Protein Alignment Algorithms with an Efficient Backtracking Routine on Multiple GPUs," BMC Bioinformatics, 2011.