Application of Graphics Processing Unit to Compute Multiple Sequence Alignment using Star Method for Soybean Genes
Muhammad Adi Puspo Sujiwo 1, Wisnu Ananta Kusuma 1, Agus Buono 1, Mukhlis Hidayat 2
1 Department of Computer Science, Bogor Agricultural University
2 Syiah Kuala University
ABSTRACT
Multiple sequence alignment (MSA) is an important method in bioinformatics, with applications in plant breeding. However, aligning a large number of sequences may take a long time due to high complexity in both time and space. Recently, the GPU has become a promising parallel computing architecture due to its good performance per price. Our objective is to employ the GPU to accelerate the computation of MSA using the star method. This paper presents an implementation of the star method that runs on the GPU and is written using CUDA. Parallelization of the star method is obtained by adopting the wavefront parallel pattern. Tests using semi-synthetic data derived from soybean genes show that our program achieves significant speedups compared to a CPU-only implementation.
Keywords: Multiple Sequence Alignment, Parallel Processing, GPU, Bioinformatics
INTRODUCTION
Multiple Sequence Alignment (MSA) is an area of research in bioinformatics concerned with aligning three or more biological sequences as a generalization of pairwise alignment. MSA is useful in phylogenetic analysis for assessing the origin of species, and in searching for Single Nucleotide Polymorphisms (SNPs), which are useful for DNA fingerprinting, investigation of hereditary diseases, and molecular genetics-based plant breeding.
The optimal solution for MSA is based on exhaustive dynamic programming, which has been shown to be an NP-complete problem [1]. Most approaches to approximating MSA are heuristic or probabilistic, such as the star method or the progressive method. However, these methods still have very high complexity, especially when executed on a single processor.
One of the most promising processor architectures for handling this complexity is the Graphics Processing Unit (GPU). The GPU was originally designed as a graphics accelerator for personal computers and game consoles, but it has since been developed to run general-purpose computations for many purposes, such as numerical computation and bioinformatics. Compared to a general-purpose CPU, the GPU has a very high level of parallelism due to its conception as a graphics accelerator. The drawback is that it has some weaknesses in handling branching instructions. Therefore, algorithms must take this into account when they are implemented on GPUs. Besides reducing the number of branches, vectorization of memory access patterns and reduction of CPU-GPU synchronization play important roles in the optimization of GPU programs.
Many parallel solutions for MSA have been developed before, such as [2] and [3]. These solutions are based on the progressive method. This paper, in contrast, discusses the parallelization of the star method on the GPU.
Although its main use case is as a short aligner, the importance of this software should not be overlooked. Many MSA-like routines are used inside larger bioinformatics software, mainly DNA assemblers and short-read mappers. Therefore, our software gains additional usability when included as a component of solutions to these problems.
GPU COMPUTING
The Graphics Processing Unit (GPU) was conceived primarily as an accelerator for graphics applications such as games and CAD. However, it was later realized that the GPU can also handle general-purpose parallel computing tasks due to its inherently parallel architecture. Interest in GPGPU has increased primarily since the introduction of programming tools like CUDA and OpenCL, which enable programmers to express problems in a general programming language.
Figure 1 illustrates the GPU architecture relevant to general-purpose GPU computing. In short, a GPU may be viewed as a shared-memory multiprocessor, similar to today's multi-core processors in personal computers. A GPU consists of several independent streaming multiprocessors (SM). Each SM comprises several stream processors (SP) and provides a common memory controller, register set, caches, and shared memory for its SPs. This small and fast shared memory may be used as a kind of "user-managed cache". It is important to note that the GPU is a device separate from the main CPU, connected by a bus such as PCI Express. Therefore, they have separate address spaces; the CPU cannot easily access GPU memory and vice versa. Due to high latency and limited bus bandwidth, GPU programs should limit this cross-bus communication (data transfer and synchronization).
Figure 1 General GPU Architecture
For programmers, a GPU computing system consists of two parts: the host, a traditional CPU, and one or more GPU devices. A GPU program then comprises two major parts: serial sections, which have little data parallelism and run better on the CPU, and parallel sections, which have rich data parallelism and are run on the GPU. A GPU section is called a kernel. When executed, a kernel generates a large number of threads, collectively called a grid. These threads can be grouped into blocks to exploit synchronization and shared memory. This thread, block, and grid hierarchy is illustrated in Figure 2 and in the short example after it.
Figure 2 Programmer-view of a GPU Device
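As a minimal illustration of this host/kernel split and of the thread, block and grid hierarchy (a standalone sketch, not part of the MSA program), the following CUDA example launches a grid in which every thread scales one array element; the kernel name and sizes are arbitrary.

    // Minimal CUDA sketch of the thread/block/grid hierarchy described above.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scaleKernel(float *data, float factor, int n)
    {
        // Each thread handles one element; its global index combines the
        // block index and the thread index within the block.
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) data[idx] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d = nullptr;                      // memory on the GPU device
        cudaMalloc(&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));
        int threadsPerBlock = 256;               // threads grouped into one block
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // blocks form the grid
        scaleKernel<<<blocks, threadsPerBlock>>>(d, 2.0f, n);      // kernel launch
        cudaDeviceSynchronize();                 // cross-bus synchronization with the host
        cudaFree(d);
        std::printf("launched %d blocks of %d threads\n", blocks, threadsPerBlock);
        return 0;
    }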
STAR METHOD FOR MSA AND ITS PARALLELIZATION
Computation of MSA with the star method can be divided into three main stages (Figure 3):
1. Pairwise similarity score computation. The objectives of this stage are to compute a score for each pair of sequences and to compute gap vectors for all sequences. The complexity of this stage is O(s²), where s is the number of sequences to be aligned.
2. Selection of the pivot sequence designated as the star, followed by construction of the star sequence by adding appropriate gaps from its gap vector.
3. Realignment of all other sequences against the star. The complexity of this stage is O(s).
In our research, the parallelization effort is focused on stages 1 and 3, as these are the most time-consuming processes (involving the Smith-Waterman algorithm) in the star method; a host-side outline of the three stages is sketched after Figure 3.
Figure 3 Three Steps of Star Method in computing MSA
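The host-side outline below sketches how these three stages fit together; the helper functions are hypothetical placeholders used for illustration only, not the authors' actual routines.

    // Sketch of the star-method driver; the helper declarations are placeholders.
    #include <cstddef>
    #include <string>
    #include <vector>

    // Stage 1: pairwise scores and gap vectors for all sequences (O(s^2) pairs).
    void pairwiseScoresAndGaps(const std::vector<std::string> &seqs,
                               std::vector<std::vector<int>> &score,
                               std::vector<std::vector<int>> &gapVec);
    // Stage 2: pick the sequence with the highest accumulated score as star.
    int selectStar(const std::vector<std::vector<int>> &score);
    std::string insertGaps(const std::string &seq, const std::vector<int> &gapVec);
    // Stage 3: align one sequence against the gapped star (O(s) alignments in total).
    std::string realign(const std::string &seq, const std::string &gappedStar);

    std::vector<std::string> starMSA(const std::vector<std::string> &seqs)
    {
        std::vector<std::vector<int>> score, gapVec;
        pairwiseScoresAndGaps(seqs, score, gapVec);
        int star = selectStar(score);
        std::string gappedStar = insertGaps(seqs[star], gapVec[star]);
        std::vector<std::string> msa{gappedStar};
        for (std::size_t k = 0; k < seqs.size(); ++k)
            if (static_cast<int>(k) != star)
                msa.push_back(realign(seqs[k], gappedStar));
        return msa;
    }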
Parallelization of Pairwise Similarity Score
Given a pair of sequences S1 and S2 with lengths l1 and l2, the pairwise similarity score is defined as the number of exact matches in the optimal local alignment of S1 and S2, as described in [3] and [4]. This score is computed using the Smith-Waterman algorithm as described in [5], by constructing a matrix H. This matrix is initialized as H_{k,0} = H_{0,l} = 0 for 0 ≤ k ≤ l1 and 0 ≤ l ≤ l2. The elements of H are calculated as

H_{i,j} = max{ 0, H_{i-1,j-1} + s(c_1, c_2), H_{i-1,j} - g, H_{i,j-1} - g }        (1)

where s(c_1, c_2) is the value from the substitution matrix for character c_1 from S1 when compared to character c_2 from S2, and g is the gap penalty.
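For reference, a plain sequential version of recurrence (1) could look like the sketch below; it assumes a simple match/mismatch score in place of a full substitution matrix and stores H in row-major order.

    // Sequential sketch of recurrence (1); not the authors' implementation.
    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    int fillMatrixH(const std::string &s1, const std::string &s2,
                    int matchScore, int mismatchScore, int g,
                    std::vector<int> &H)              // (l1+1) x (l2+1), row-major
    {
        int l1 = (int)s1.size(), l2 = (int)s2.size();
        H.assign((std::size_t)(l1 + 1) * (l2 + 1), 0);  // H[k][0] = H[0][l] = 0
        int best = 0;                                   // highest cell value seen so far
        for (int i = 1; i <= l1; ++i) {
            for (int j = 1; j <= l2; ++j) {
                int s = (s1[i - 1] == s2[j - 1]) ? matchScore : mismatchScore;
                int h = std::max(0, H[(i - 1) * (l2 + 1) + (j - 1)] + s);
                h = std::max(h, H[(i - 1) * (l2 + 1) + j] - g);
                h = std::max(h, H[i * (l2 + 1) + (j - 1)] - g);
                H[i * (l2 + 1) + j] = h;
                best = std::max(best, h);
            }
        }
        return best;   // the cell holding this value is the backtracking start point
    }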
The parallelization of H is derived from the data dependencies of each cell. From equation (1), each cell depends on three other cells: the one to its left, the one above, and the one to its upper left. Using the wavefront pattern described in [6], the computation of the cells of H is parallelized per anti-diagonal. This pattern and its dependencies are illustrated in Figure 4, and a kernel sketch is given after the figure.
Figure 4 Cells of the alignment matrix on the same anti-diagonal can be computed in parallel
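A minimal CUDA kernel sketch of this wavefront scheme is given below. It assigns one thread block to one sequence pair, keeps H in row-major order for readability (the anti-diagonal layout actually used is described in the next paragraph), uses a toy match/mismatch score, and assumes the first row and column of H were zero-initialized before the launch.

    // Wavefront sketch: one block computes H for one pair; cells on the same
    // anti-diagonal d = i + j are independent and computed in parallel.
    __global__ void swWavefront(const char *s1, int l1,
                                const char *s2, int l2,
                                int *H, int gapPenalty)
    {
        for (int d = 2; d <= l1 + l2; ++d) {                 // sweep anti-diagonals
            for (int i = 1 + threadIdx.x; i <= l1; i += blockDim.x) {
                int j = d - i;
                if (j < 1 || j > l2) continue;               // cell outside the matrix
                int s = (s1[i - 1] == s2[j - 1]) ? 2 : -1;   // toy substitution score
                int h = max(0, H[(i - 1) * (l2 + 1) + (j - 1)] + s);
                h = max(h, H[(i - 1) * (l2 + 1) + j] - gapPenalty);
                h = max(h, H[i * (l2 + 1) + (j - 1)] - gapPenalty);
                H[i * (l2 + 1) + j] = h;
            }
            __syncthreads();    // finish diagonal d before any thread starts d + 1
        }
        // The maximum cell (the backtracking start point) is located afterwards,
        // e.g. with a separate reduction.
    }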
To facilitate vectorization and ensure that memory accesses to H are coalesced, the elements of H are arranged so that anti-diagonals are stored one after another, rather than in the more conventional row- or column-major order. Using this arrangement, the threads of a block can directly read and write consecutive cells of matrix H.
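One possible realization of this layout (an assumption about the padding scheme, not taken verbatim from the paper) stores each anti-diagonal contiguously and pads every diagonal to the same length, so that neighbouring threads walking one diagonal touch neighbouring addresses:

    // Hypothetical anti-diagonal-major offset for H. The buffer must hold
    // (l1 + l2 + 1) * (l1 + 1) integers, since every diagonal is padded to l1 + 1 cells.
    __host__ __device__ inline int diagOffset(int i, int j, int l1)
    {
        int d = i + j;                // index of the anti-diagonal holding cell (i, j)
        int pos = i;                  // position of the cell inside its diagonal
        return d * (l1 + 1) + pos;    // consecutive i on one diagonal -> consecutive addresses
    }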
After computing H, the program performs backtracking in order to build gap vectors for all sequences. The gap vector of a sequence is defined as the maximum number of gaps between each character (base) of that sequence [5]. Backtracking starts from the cell of H that holds the highest value; this value is stored as the pairwise similarity score. From this starting point (i, j), the process goes backwards to one of the positions (i-1, j-1), (i, j-1) or (i-1, j), whichever holds the highest value. This process continues until i = 0 or j = 0.
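A sequential sketch of this backtracking step is shown below. It records, for one pair, how many gaps are placed after each prefix of S1 and of S2; following the maximum-gap definition above, the per-pair counts would then be combined (element-wise maximum over all pairs) into the final gap vectors. It reuses the row-major H of the earlier sketch and breaks ties in favour of the diagonal move.

    // Backtracking sketch: walk from the maximum cell towards the border,
    // counting gaps for both sequences of the pair.
    #include <vector>

    void backtrackGaps(const std::vector<int> &H, int l1, int l2,
                       int iStart, int jStart,              // position of the maximum cell
                       std::vector<int> &gapS1, std::vector<int> &gapS2)
    {
        gapS1.assign(l1 + 1, 0);   // gapS1[i]: gaps placed after the first i characters of S1
        gapS2.assign(l2 + 1, 0);   // gapS2[j]: gaps placed after the first j characters of S2
        int i = iStart, j = jStart;
        while (i > 0 && j > 0) {
            int diag = H[(i - 1) * (l2 + 1) + (j - 1)];
            int up   = H[(i - 1) * (l2 + 1) + j];
            int left = H[i * (l2 + 1) + (j - 1)];
            if (diag >= up && diag >= left) { --i; --j; }        // match or mismatch
            else if (left >= up)            { ++gapS1[i]; --j; } // gap in S1
            else                            { ++gapS2[j]; --i; } // gap in S2
        }
    }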
The computation of pairwise similarity scores may be parallelized further by noting that pairs may be processed concurrently as long as they do not share a sequence index. This requirement comes from the backtracking process that builds the gap vectors for all sequences: the gap vector of a sequence is valid only when it is updated by one process at a time. If we arrange all pairs in a matrix, concurrent execution may be performed per anti-diagonal, as shown in Figure 5; a sketch of this grouping follows the figure.
Figure 5 Map of the pairs of a group of 8 sequences. All pairs with the same anti-diagonal number may be executed concurrently.
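A small host-side sketch of this grouping is given below. It enumerates the pairs anti-diagonal by anti-diagonal of the pair matrix (the exact numbering in Figure 5 may differ), which guarantees that no two pairs in the same group share a sequence index.

    // Group the s*(s-1)/2 pairs so that each group can run concurrently.
    #include <utility>
    #include <vector>

    std::vector<std::vector<std::pair<int, int>>> schedulePairs(int s)
    {
        std::vector<std::vector<std::pair<int, int>>> groups;
        for (int d = 3; d <= 2 * s - 1; ++d) {       // anti-diagonal d = i + j (1-based)
            std::vector<std::pair<int, int>> g;
            for (int i = 1; i < s; ++i) {
                int j = d - i;
                if (j > i && j <= s) g.emplace_back(i, j);
            }
            if (!g.empty()) groups.push_back(std::move(g));   // one GPU grid launch per group
        }
        return groups;
    }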
This research does not attempt to address the quadratic space complexity of Smith-Waterman, so the memory requirement of matrix H is high. In fact, it can limit the number of pairs processed in parallel, to the point where only one pairwise similarity score is computed at a time. For example, aligning two mutations of the rhodopsin gene (used in this research) with an average length of 8720 requires about 8720² × 4 bytes ≈ 300 MB, since the matrix is stored as a dense integer array with 4 bytes per cell. For a GPU with 1 GB of RAM, this means only three such pairs can be processed concurrently.
Using these two levels of parallelism (computation of matrix H and concurrent pair execution), we build the GPU solution for the pairwise similarity score matrix by mapping the matrix H of one pair onto one GPU block, where each thread of the block is mapped to one cell as in Figure 4. Meanwhile, one anti-diagonal of the sequence pair matrix is mapped onto the GPU grid. However, we must take into account that the grid size is limited by GPU memory, as described above.
When handling these memory-limited cases, we should be aware that the parallelization scheme described above launches too few GPU blocks (so that GPU utilization is too low), which in turn may reduce GPU performance [7]. This case may be handled by splitting the grid over multiple CPU threads. Each CPU thread creates a stream to handle one pair of sequences (and thus creates one instance of matrix H). Each stream can launch multiple blocks, so the GPU may be kept busy. As suggested in [8], there is a threshold of sequence length above which this scheme is used. The authors set this threshold to 4096.
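The sketch below illustrates this long-sequence scheme; alignPairOnStream is a hypothetical placeholder for the per-pair kernel launches, and the pairs are simply distributed round-robin over the CPU threads.

    // Each CPU thread owns one CUDA stream and feeds it sequence pairs, so several
    // H matrices are processed concurrently while staying within GPU memory.
    #include <cuda_runtime.h>
    #include <cstddef>
    #include <thread>
    #include <utility>
    #include <vector>

    // Hypothetical placeholder: enqueue the wavefront kernels for pair (i, j) on stream s.
    void alignPairOnStream(int i, int j, cudaStream_t s) { (void)i; (void)j; (void)s; }

    void runLongPairs(const std::vector<std::pair<int, int>> &pairs, int nThreads)
    {
        std::vector<std::thread> workers;
        for (int t = 0; t < nThreads; ++t) {
            workers.emplace_back([&pairs, nThreads, t] {
                cudaStream_t stream;
                cudaStreamCreate(&stream);
                for (std::size_t p = t; p < pairs.size(); p += nThreads)   // round-robin
                    alignPairOnStream(pairs[p].first, pairs[p].second, stream);
                cudaStreamSynchronize(stream);     // wait for this thread's pairs to finish
                cudaStreamDestroy(stream);
            });
        }
        for (auto &w : workers) w.join();
    }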
Selection of Star
After computing the pairwise similarity score matrix, one sequence is chosen as the "star". The choice of star is based on the accumulated similarity score. Afterwards, this sequence is transformed into a new sequence by inserting the appropriate number of gaps from its gap vector, which was computed during stage one. This stage can be performed by only one process, so the options for parallelization are limited. Building a new, gapped sequence from the un-gapped one is described in Figure 6 and sketched after the figure.
Figure 6 Building a gapped sequence from its gap vector and source
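A minimal sketch of this expansion step is shown below; it assumes gapVec[k] holds the number of gaps to insert before character k of the source sequence, with one optional trailing entry for gaps after the last character.

    // Expand an un-gapped sequence using its gap vector (cf. Figure 6).
    #include <cstddef>
    #include <string>
    #include <vector>

    std::string buildGapped(const std::string &src, const std::vector<int> &gapVec)
    {
        std::string out;
        for (std::size_t k = 0; k < src.size(); ++k) {
            out.append(gapVec[k], '-');     // gaps preceding base k
            out.push_back(src[k]);
        }
        if (gapVec.size() > src.size())
            out.append(gapVec[src.size()], '-');   // trailing gaps, if tracked
        return out;
    }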
Realignment
The last stage of MSA with the star method is the alignment of the remaining sequences against the star sequence, without adding gaps to the star. The main difference between this stage and stage one is the computation of the cell values of H; equation (1) is modified into

H_{i,j} = max{ 0, H_{i-1,j-1} + s(c_1, c_2), H_{i-1,j} - g, H_{i,j-1} }        (2)

The backtracking process in this stage is also similar to stage one; however, only one gap vector is created. The process continues by building gapped sequences from their respective gap vectors, as described in Figure 6.
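The per-cell update for this stage then differs from stage one only in the last term of the maximum, as in the small sketch below.

    // Cell update following recurrence (2): the H[i][j-1] term carries no gap penalty.
    #include <algorithm>

    inline int realignCell(int diag, int up, int left, int subScore, int g)
    {
        int h = std::max(0, diag + subScore);   // H[i-1][j-1] + s(c1, c2)
        h = std::max(h, up - g);                // H[i-1][j] - g
        h = std::max(h, left);                  // H[i][j-1], no penalty in this stage
        return h;
    }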
TESTING AND PERFORMANCE MEASUREMENT
In this research, the authors developed two versions of the MSA program: a GPU version and a CPU version. The CPU version is parallelized using OpenMP [8] to utilize multi-core systems. All programs are written in C++ and compiled using GCC 4.3 under Linux. The CUDA program is written using CUDA Toolkit 5.5. The GPU version was measured on an Nvidia GeForce GTX 550 Ti with 4 SMs comprising 192 SPs in total and 1 GB of RAM. For comparison, the CPU version was run on a server with an Intel Xeon E5620 (4 cores, 8 visible processors with HyperThreading enabled [9]) and 8 GB of RAM.
For testing purposes, the authors generated four artificial datasets derived from soybean genes in NCBI GenBank [10], as listed in Table 1. Each sequence was mutated randomly to generate 2, 7, 15, 31 and 63 other sequences in the respective datasets. Due to limited space, only the runtimes for the BBI and SS datasets are presented here (Tables 2 and 3); results for the other genes are available from the Subversion server listed below.
Table 1 Genes used in this experiment
Gene Name   NCBI ID                   Length
BBI         548083                    834
GBD2        UniProtKB/TrEMBL:Q43709   1607
APX2        60658                     3077
SS          547508                    6918
Table 2 shows the total running time (in milliseconds) of the MSA programs, using configurations of 1, 2, 4 and 8 CPUs and a varying number of sequences derived from BBI. In this table and the following figure, it is clear that a high number of sequences favors the multi-CPU and GPU versions. Going from 4 to 8 CPUs (which means using hyperthreaded, virtual CPUs) only increases performance slightly as the number of sequences rises. Meanwhile, the GPU shows performance superior to the CPU even for these short sequences.
Table 2 Runtime of CPU and GPU-based MSA for BBI (in milliseconds)
Configuration   # of sequences: 3      8      16     32     64
CPU 1                           141    703    2549   9117   34775
CPU 2                           115    444    1468   5033   18413
CPU 4                           114    216    725    2722   9527
CPU 8                           114    217    387    1408   5208
GPU                             96     232    490    1144   3145
Figure 7 Speed-up of the CPU and GPU MSA programs for the BBI datasets. Each bar represents a problem size: 3, 8, 16, 32 and 64 sequences.
Table 3 Runtime of CPU and GPU-based MSA for SS (in milliseconds)
Configuration   # of sequences: 3      8      16      32      64
CPU 1                           6264   43296  168037  648999  2522419
CPU 2                           5237   27041  94794   349362  1364315
CPU 4                           5272   14078  51101   192832  742694
CPU 8                           5221   12544  28766   103355  397075
GPU                             771    5363   21361   82699   320062
Figure 8 Speed-up of the CPU and GPU MSA programs for the SS datasets. Each bar represents a problem size: 3, 8, 16, 32 and 64 sequences.
Table 3 shows the runtime (in milliseconds) of the MSA programs for the SS gene with a varying number of sequences. As in Table 2, the number of CPUs is also varied to assess parallelization performance. In this table and in Figure 8, the GPU is shown to give an almost constant speed-up despite the increasing number of sequences. The authors assume that this constant speed-up is consistent with the prediction above, since GPU memory cannot accommodate more than three H matrices in the case of SS.
CONCLUSION
We presented a program that implements the star method for computing MSA. Experimental results show that GPU parallelization of the star method can deliver slightly higher performance than CPU solutions, even for a high number of sequences. However, our program can only handle limited sequence lengths due to the high space requirements of the Smith-Waterman algorithm.
In the future, one of the key improvements for our program is to reduce the space complexity to linear by implementing the modification of Smith-Waterman suggested in [11]. Other suggestions are to support multi-GPU installations or to develop heterogeneous parallelism (i.e., supporting the GPU and a multi-core CPU simultaneously).
This research shows the potential advantage of the GPU as a general computing device for scientific problems. Considering that the GPU used in this research is a low-end device, the GPU appears more cost-effective than the CPU. Future research may involve the built-in GPUs embedded in current portable computers (tablets, smartphones), which would open up more possibilities.
ACKNOWLEDGEMENT
This research is supported by a grant from KKP3N of the Ministry of Agriculture, Indonesia. The authors would like to thank the Directorate of Information Systems of IPB for hosting the materials of this program. All source code, documentation and testing data are available via Subversion from http://code.ipb.ac.id:8082/bioinformatika/cudaMSA.
REFERENCES
[1] L. Wang and T. Jiang, "On the Complexity of Multiple Sequence Alignment," Journal of Computational Biology, vol. 1, no. 4, pp. 337-348, 1994.
[2] A. Datta and J. Ebedes, "Multiple Sequence Alignment in Parallel on a Cluster of
Workstations," in Parallel Computing for Bioinformatics and Computational Biology, New
Jersey, John Wiley & Sons, Inc., 2006, pp. 193-210.
[3] Y. Liu, D. L. Maskell and B. Schmidt, "MSA-CUDA: Multiple Sequence Alignment on Graphics Processing Units with CUDA," in 20th IEEE International Conference on Application-specific Systems, Architectures and Processors, 2009.
[4] D. Gusfield, “Efficient Methods for Multiple Sequence Alignment with Guaranteed Error
Bounds,” 1991.
[5] T. F. Smith and M. S. Waterman, "Identification of Common Molecular Subsequences,"
Journal of Molecular Biology, vol. 147, pp. 195-197, 1981.
[6] J. Anvik, "Generating parallel programs from the wavefront design pattern," in Proceedings of the 7th International Workshop on High-Level Parallel Programming Models and Supportive Environments, 2002.
[7] D. B. Kirk and W.-M. W. Hwu, Programming Massively Parallel Processors: A Hands-On Approach, Morgan Kaufmann, 2012.
[8] B. Chapman, G. Jost and R. Van der Pas, Using OpenMP: Portable Shared Memory Parallel Programming, Massachusetts Institute of Technology, 2008.
[9] OSDEV Community Staff, "HyperThreading Technology Overview," OSDEV, 26 March 2006. [Online]. Available: http://web.archive.org/web/20090227123128/http://www.osdcom.info/content/view/30/39/.
[10] “NCBI Home Page,” [Online]. Available: http://www.ncbi.nlm.nih.gov/.
[11] E. W. Myers and W. Miller, "Optimal Alignments in Linear Space," Department of
Computer Science, University of Arizona, Tucson, 1988.
[12] Y. Ye and H. Tang, "Dynamic Programming Algorithms for Biological Sequence and
Structure Comparison," in Bioinformatics Algorithms: Techniques and Applications, New
Jersey, John Wiley & Sons, Inc., 2008, pp. 9-28.
[13] V. Volkov and J. Demmel, "LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs," University of California, Berkeley, 2008.
[14] M. McCool, J. Reinders and A. Robison, Structured Parallel Programming: Patterns for
Efficient Computation, Waltham: Elsevier, 2012.
[15] A. Bustamam, G. Ardaneswari and D. Lestari, "Implementation of CUDA GPU-Based
Parallel Computing on Smith-Waterman Algorithm to Sequence Database Searches," in
2013 International Conference on Advanced Computer Science and Information Systems,
Bali, 2013.
[16] J. Blazewicz, W. Frohmberg, M. Kierzynka, E. Pesch and P. Wojciechowski, "Protein Alignment Algorithms With an Efficient Backtracking Routine on Multiple GPUs," BMC Bioinformatics, 2011.