ACCELERATING PAIRWISE SEQUENCE ALIGNMENT USING GPU: REVIEW ARTICLE
ABSTRACT
Sequence alignment algorithms are used to compare sequences and to identify their origins.
Various techniques are used for aligning sequences. Dynamic Programming gives accurate results
but is slow, whereas heuristic methods are faster but less precise. Accelerating alignment
algorithms is therefore vital for mining and analyzing large sequence databases with accurate
results. Several hardware platforms, such as FPGAs and GPUs, are used for acceleration. In this
paper we concentrate on the acceleration of sequence alignment algorithms using GPUs.
KEY WORDS: Bioinformatics, Sequence alignment, GPU.
1. INTRODUCTION
Bioinformatics is an interdisciplinary field combining the biological and information sciences; it
studies methods for storing, retrieving and analyzing biological data. Biological data include nucleic
acid sequences (deoxyribonucleic acid “DNA” or ribonucleic acid “RNA”) and protein sequences.
Proteins, DNA and RNA are represented as sequences of characters. The comparison of biological
sequences is one of the major problems in Bioinformatics. It can be achieved by aligning the
sequences to find their matching characters; this operation is called sequence alignment.
Alignment is mainly used to find the similarity between biological sequences, and it may also be
used to find the origin of a query sequence. There are two kinds of sequence alignment: pairwise
sequence alignment and multiple sequence alignment. Pairwise alignment aligns two sequences,
whereas multiple alignment aligns more than two. Dynamic programming and heuristic methods
are used to compute sequence alignments. Dynamic programming (DP) methods find optimal
alignments but are slow. In contrast, heuristic methods are faster than DP but less precise.
Graphics Processing Units (GPUs) have been used to accelerate sequence alignment algorithms by
executing them in parallel. In this paper we are interested in pairwise alignment, and we present
pairwise alignment algorithms that use GPUs.
The rest of the paper is organized as follows. In section 2, we present the three types of
pairwise alignment algorithms. In section 3, we give an overview of GPUs. In section 4, we
show how GPUs can be used to accelerate pairwise alignment algorithms. Finally, in the last
section, we present the conclusion.
2. PAIRWISE ALIGNMENT ALGORITHMS
There are three types of pairwise alignment algorithms: pairwise global alignment, pairwise
local alignment and pairwise semi-global alignment. Let us begin with pairwise local alignment
algorithms.
2.1 Pairwise Local Alignment Algorithms
The most widely used Dynamic Programming (DP) algorithm for pairwise local alignment is that of Smith
and Waterman [1]. Its main difference from the algorithm of Needleman and Wunsch [2] is that any cell
of the matrix M can be a starting point for the calculation of the scores, and that any score which
would become lower than zero stops the progression of the calculation. The associated cell is then
reset to zero and can be considered as a new starting point. Equation 1, used to calculate each score
while transforming the initial matrix, is the following:
$$
M[i,j] = \max\begin{cases}
se(i,j) + M[i+1,\, j+1] \\
se(i,j) + \max_{x}\big(M[x,\, j+1] - P\big) \\
se(i,j) + \max_{y}\big(M[i+1,\, y] - P\big) \\
0
\end{cases}
\tag{1}
$$
where i + 2 ≤ x ≤ m and j + 2 ≤ y ≤ n, se(i, j) is the score between the character at position i in S1
and the character at position j in S2, and P is a gap penalty; m and n are the lengths of the sequences
S1 and S2 to align, respectively. With a linear or affine gap penalty, the time complexity of the
Smith-Waterman algorithm is O(m*n).
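To make Equation 1 concrete, the following is a minimal Python sketch, not the implementation of [1]. It assumes a linear gap penalty P = a*k (Table 1), under which the maximum over x and y in Equation 1 reduces to single-step gap moves, and it uses illustrative match/mismatch/gap values (2, -1, 2) in place of a substitution matrix. It fills the matrix forward from the upper-left corner, the mirror image of the backward convention of Equation 1; both directions optimize over the same set of alignments, so the best local score is the same. The function name smith_waterman is hypothetical.

```python
def smith_waterman(s1: str, s2: str, match: int = 2, mismatch: int = -1,
                   gap: int = 2) -> int:
    """Return the best local alignment score of s1 and s2.

    Assumes a linear gap penalty P = gap * k: each gap character costs `gap`.
    """
    m, n = len(s1), len(s2)
    # Row 0 and column 0 stay at zero: a local alignment may start at any cell.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            se = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(
                H[i - 1][j - 1] + se,  # diagonal: align s1[i-1] with s2[j-1]
                H[i - 1][j] - gap,     # s1[i-1] aligned with a gap
                H[i][j - 1] - gap,     # s2[j-1] aligned with a gap
                0,                     # a negative score is reset to zero
            )
            best = max(best, H[i][j])  # the local alignment may end at any cell
    return best


print(smith_waterman("XXACGTYY", "ZZACGTWW"))  # 8: the shared ACGT, 4 matches x 2
print(smith_waterman("AAA", "TTT"))            # 0: no common character
```

Only one (m+1) x (n+1) matrix is filled, each cell in constant time, which is where the O(m*n) bound for the linear-gap case comes from.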
2.2 Pairwise Global Alignment Algorithms
A pairwise global alignment involves the alignment of two entire sequences. There are two
main approaches to construct a pairwise global alignment:
(i) Dynamic programming approach [3,4]: The most widely used dynamic programming algorithm
for pairwise global alignment is that of Needleman and Wunsch [2]. With this algorithm, a
pairwise global alignment of two sequences S1 and S2, of respective lengths m and n, is
constructed in two steps:
(i.1) In the first step, we construct a matrix M of size m*n and initialize it using a
substitution matrix, e.g., PAM (Percent Accepted Mutations) [5] or BLOSUM (BLOcks
SUbstitution Matrix) [6]. Then we transform matrix M by adding scores row by row, starting from
the lower-right cell and ending at the upper-left one, using the following equation:
$$
M[i,j] = se(i,j) + \max\big(M[x,\, y]\big)
\tag{2}
$$
where x = i + 1 and j < y ≤ n, or i < x ≤ m and y = j + 1, and se(i, j) is the score between the
character at position i in S1 and the character at position j in S2. A gap penalty can also be
incorporated into the equation. A gap is a character, e.g., '-', inserted into the aligned sequences
so that aligned characters face each other. It suffices to subtract from every sum that skips
characters a penalty that depends on how many characters are skipped. Equation 2 then becomes:
 M [i  1, j  1], 


M [i, j ]  se(i, j )  max M [ x, j  1]  P, 
 M [i  1, y ]  P 


(3)
where i + 2 ≤ x ≤ m and j + 2 ≤ y ≤ n, and P is the gap penalty for the k = x - i - 1
(respectively k = y - j - 1) skipped characters.
The gap penalty P can take several forms, as shown in Table 1, where k is the number of
successive gaps and a, b and c are constants.
(i.2) In the second step, we establish a path in the matrix, called the maximum-score
path, which leads to an optimal pairwise global alignment. This path is constructed by
starting from the cell that contains the maximum score in the transformed matrix, which is
normally the upper-left cell, and allowing three types of movement:
(a) Diagonal movement: This movement corresponds to the passage from a cell (i,j) to a
cell (i+1,j+1).
(b) Vertical movement: This movement corresponds to the passage from a cell (i,j) to a cell
(i+1,j).
(c) Horizontal movement: This movement corresponds to the passage from a cell (i,j) to a
cell (i,j+1).
With a linear or affine gap penalty, the time complexity of the Needleman-Wunsch algorithm is O(m*n).
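The two steps can be sketched in Python as below. This is an illustration under stated assumptions rather than the exact procedure of [2]: it uses the common forward formulation (filling from the upper-left cell and tracing back from the lower-right one, the mirror image of steps (i.1) and (i.2)), a linear gap penalty, and simple match/mismatch scores instead of PAM or BLOSUM. The function name needleman_wunsch and the score values are hypothetical.

```python
from typing import Tuple


def needleman_wunsch(s1: str, s2: str, match: int = 1, mismatch: int = -1,
                     gap: int = 1) -> Tuple[int, str, str]:
    """Global alignment with a linear gap penalty; returns (score, row1, row2)."""

    def se(a: str, b: str) -> int:
        return match if a == b else mismatch

    m, n = len(s1), len(s2)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    # Leading gaps are penalized: every character of both sequences is aligned.
    for i in range(1, m + 1):
        M[i][0] = -gap * i
    for j in range(1, n + 1):
        M[0][j] = -gap * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j - 1] + se(s1[i - 1], s2[j - 1]),
                          M[i - 1][j] - gap,
                          M[i][j - 1] - gap)

    # Traceback from the cell holding the final score, using diagonal,
    # vertical and horizontal moves (the movements of step (i.2), reversed).
    row1, row2 = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + se(s1[i - 1], s2[j - 1]):
            row1.append(s1[i - 1])
            row2.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and M[i][j] == M[i - 1][j] - gap:
            row1.append(s1[i - 1])
            row2.append("-")
            i -= 1
        else:
            row1.append("-")
            row2.append(s2[j - 1])
            j -= 1
    return M[m][n], "".join(reversed(row1)), "".join(reversed(row2))


print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

The fill visits each of the (m+1)(n+1) cells once and the traceback takes at most m + n steps, consistent with the O(m*n) bound for the linear-gap case.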
Other dynamic programming algorithms for pairwise global alignment exist, such as the one of Huang
and Chao [7] and NGILA [8].
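The gap penalty forms of Table 1 can be written as small functions, for instance to plug into an alignment routine. In this sketch the default values of the constants a, b and c are hypothetical, and the natural logarithm is assumed because Table 1 does not state the base.

```python
import math

# Gap penalty forms from Table 1. k is the number of successive gap
# characters (k >= 1; math.log raises ValueError for k = 0).
# The default constants a, b and c are hypothetical illustration values.


def linear_gap(k: int, a: float = 1.0) -> float:
    return a * k


def affine_gap(k: int, a: float = 1.0, c: float = 10.0) -> float:
    return a * k + c


def log_gap(k: int, b: float = 1.0, c: float = 10.0) -> float:
    return b * math.log(k) + c  # natural log assumed


def log_affine_gap(k: int, a: float = 1.0, b: float = 1.0, c: float = 10.0) -> float:
    return a * k + b * math.log(k) + c


for k in (1, 2, 5, 10):
    print(k, linear_gap(k), affine_gap(k), round(log_gap(k), 2), round(log_affine_gap(k), 2))
```

With these defaults, a single gap of length 4 costs less under the affine or logarithmic forms than four separate gaps of length 1, whereas the linear form charges both the same.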
2.3 Pairwise Semi-Global Alignment
A pairwise semi-global alignment is like a pairwise global alignment but ignores start gaps, i.e., gaps
that occur before the first character of a sequence, and end gaps, i.e., gaps that occur after the last
character of a sequence. An overlap of two sequences is considered an alignment in which start and end
gaps are ignored.
One DP algorithm used for pairwise semi-global alignment is that of Sodiya et al. [6]. To align
two sequences A and B, with |A| = m and |B| = n, this algorithm begins by initializing an (m+1)*(n+1)
matrix M. Starting from cell M[0,0], it iterates through each cell, whose value is determined by a choice
among three transitions into that cell:
• Diagonal step: aligns symbol (i−1) of A with symbol (j−1) of B. The alignment score of these
symbols, defined by the scoring system, is added to the value of M[i−1, j−1]; this candidate is denoted
diagonal.
• Vertical step: inserts a gap into A and aligns it with symbol (j−1) of B. The gap penalty is added to
the value of M[i, j−1]; this candidate is denoted top. The gap penalty for this transition depends on the
scoring system used.
• Horizontal step: inserts a gap into B and aligns it with symbol (i−1) of A. The gap penalty is added to
the value of M[i−1, j]; this candidate is denoted left. As with the vertical step, the gap penalty for this
transition depends on the scoring system used.
The time complexity of the semi-global algorithm is O(m*n).
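A minimal Python sketch of this recurrence follows, using the diagonal, top and left candidates described above. The steps above do not say how the first row and column are initialized or where the result is read; here, as an assumption consistent with the definition at the start of this section, row 0 and column 0 are set to zero (free start gaps) and the result is the maximum over the last row and last column (free end gaps). The linear gap penalty, the score values and the name semi_global are hypothetical.

```python
def semi_global(a: str, b: str, match: int = 1, mismatch: int = -1,
                gap: int = 1) -> int:
    """Best semi-global score: start and end gaps cost nothing."""
    m, n = len(a), len(b)
    # Row 0 and column 0 are left at zero, so gaps before the first
    # character of either sequence are free (start gaps).
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = M[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            top = M[i][j - 1] - gap    # gap inserted into A, aligned with b[j-1]
            left = M[i - 1][j] - gap   # gap inserted into B, aligned with a[i-1]
            M[i][j] = max(diagonal, top, left)
    # End gaps are free too: the alignment may finish anywhere in the last
    # row (rest of B unused) or in the last column (rest of A unused).
    return max(max(M[m]), max(row[n] for row in M))


print(semi_global("ACGT", "XXACGTXX"))  # 4: ACGT placed inside the longer sequence
print(semi_global("AACC", "CCGG"))      # 2: suffix CC of A overlaps prefix CC of B
```

A global alignment of the same pairs would be charged for the unaligned flanks; the zero borders and the final maximum are what make the overlap alignments free of those penalties.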
3. GRAPHICS PROCESSOR UNITS
A graphics processing unit (GPU) is a device originally designed to accelerate graphics
operations. It is essentially an electronic device consisting of many processors with memory, and it
was used to accelerate the building of images for output to a display unit. Its high memory bandwidth,
high computational capability and highly parallel structure make it useful as a general-purpose
processor for applications other than imaging, such as computational biology.
A GPU consists of many multiprocessors and a large Dynamic RAM (DRAM). Each multiprocessor is
coupled with a small cache memory and consists of a large number of cores, i.e., Arithmetic Logic
Units (ALUs), controlled by a control unit.
Fig 1 compares the organization of a GPU with that of a CPU. A CPU consists of a small number of
ALUs, a large DRAM and a single large cache memory. A GPU has a large number of ALUs, which give
it high computational power (Fig 3) and allow it to perform data-parallel computations faster than a
CPU. A GPU has a small cache memory and a control unit for each set of ALUs. A GPU also has a
higher bandwidth between memory and the processing elements (ALUs) (Fig 2), which lets it perform
operations faster.
GPUs are well suited to problems with data-parallel processing, i.e., problems in which the same
program is executed on many data elements in parallel. Data-parallel processing maps data
elements to parallel processing threads. A thread is a sequence of instructions that can be
executed in parallel with other threads. Data-parallel processing is an efficient way to accelerate
many algorithms.
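The mapping of data elements to threads can be illustrated with a sequential Python emulation of a one-dimensional CUDA-style launch, in which each thread computes its global index as block index * block size + thread index and processes the element at that index. On a real GPU the blocks and threads run concurrently; here they are simply looped over. The names launch_grid, kernel and block_dim are hypothetical.

```python
from typing import Any, Callable, List, Sequence


def launch_grid(kernel: Callable[[Any], Any], data: Sequence[Any],
                block_dim: int = 4) -> List[Any]:
    """Sequentially emulate a 1-D grid of thread blocks, one element per thread."""
    n = len(data)
    grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover every element
    out: List[Any] = [None] * n
    for block_idx in range(grid_dim):        # on a GPU, blocks are spread over multiprocessors
        for thread_idx in range(block_dim):  # and the threads of a block run in parallel
            i = block_idx * block_dim + thread_idx  # global index of this thread's element
            if i < n:                        # the last block may be only partly used
                out[i] = kernel(data[i])
    return out


print(launch_grid(lambda x: x * x, [1, 2, 3, 4, 5], block_dim=2))  # [1, 4, 9, 16, 25]
```

The bounds check matters because the number of launched threads is rounded up to a whole number of blocks.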
GPUs were originally designed to accelerate computer graphics algorithms. With GPUs,
impressive speedups can be achieved using a programming model that is simpler than the one
required for FPGAs. Their high computational capabilities and highly parallel structure have
opened them up to a wide range of other fields, such as scientific computing [10],
computational geometry [11] and bioinformatics [12].
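GPU memory consists of on-chip memory located with the processors, namely registers and shared
memory, and off-chip memory separated from the processors, namely local, global, constant and
texture memory [23]. Local memory is private to each thread even though it resides off chip, and
constant and texture memory are read through on-chip caches.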
Access times differ between these memories, as shown in Fig 4, which gives the time the processors
need to access each GPU memory. Global memory has the highest access time, around 400 to 600
cycles, while the access times of the constant cache, the texture cache and shared memory are
approximately the same as that of the registers, about 1 cycle.
4. ACCELERATING PAIRWISE ALIGNMENT ALGORITHMS
With new sequencing technologies, the number of biological sequences in databases such as
GenBank [13] and PubMed [14] is increasing exponentially. In addition, individual sequences are
long, often hundreds of bases. Hence, comparing a query sequence to all the sequences of such a
database with a pairwise alignment algorithm is expensive in computing time and memory space,
so pairwise alignment algorithms need to be accelerated.
Many hardware solutions exist to accelerate the alignment process, such as multiprocessor
architectures, FPGAs and GPUs; in this paper we focus on GPUs.
GPU programming was originally based on the OpenGL graphics API [15] and is now largely based on
CUDA, the parallel computing engine of NVIDIA graphics processing units [23]. Accordingly, there are
two types of implementations of the SW algorithm on GPUs: those using OpenGL and those using CUDA.
4.1 Accelerating Smith and Waterman Algorithm by Using OpenGL
The first implementations of the SW algorithm on GPUs are described in [16]. These
implementations are based on similar approaches and use OpenGL [15]. They operate as follows.
First, the database and the query sequence are copied to GPU memory as textures [9]. The
score matrix is then processed anti-diagonal by anti-diagonal. For each element of the current
anti-diagonal, a pixel is drawn. Drawing this pixel executes a small program, called a pixel shader,
that computes the score for the cell. The results are written to a texture, which is then used as
input for the next pass.
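The anti-diagonal order can be sketched sequentially in Python. Cells with the same i + j depend only on the two previous anti-diagonals, so all cells of one anti-diagonal could be computed in the same pass, which is what drawing one pixel per cell exploits. This illustrates the wavefront order only, not the shader code of [16]; it assumes a linear gap penalty and hypothetical score values, and the name sw_antidiagonal is hypothetical.

```python
def sw_antidiagonal(s1: str, s2: str, match: int = 2, mismatch: int = -1,
                    gap: int = 2) -> int:
    """Smith-Waterman score computed one anti-diagonal (i + j = d) at a time."""
    m, n = len(s1), len(s2)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for d in range(2, m + n + 1):  # one pass per anti-diagonal
        # Every cell on anti-diagonal d depends only on anti-diagonals d-1 and d-2,
        # so the iterations of this inner loop are independent of each other; in
        # the OpenGL approach each one would be a pixel drawn by the pixel shader.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            se = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + se, H[i - 1][j] - gap, H[i][j - 1] - gap, 0)
            best = max(best, H[i][j])
    return best


print(sw_antidiagonal("XXACGTYY", "ZZACGTWW"))  # 8
```

The result is identical to a row-by-row fill; only the order of the cells changes, and the longest anti-diagonal (min(m, n) cells) bounds how much parallelism one pass can offer.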
The implementation of [16] searched 99.8% of the Swiss-Prot database, now merged into the
Universal Protein Resource (UniProt) database [17], and obtained a maximum speed of 650
Mega Cell Updates Per Second (MCUPS), compared to around 75 MCUPS for the CPU version it
was compared with. The Cell Updates Per Second (CUPS) measure is defined as follows:
$$
\mathrm{CUPS} = \frac{\text{query sequence length} \times \text{database size}}{\text{run time}}
\tag{4}
$$
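As a worked example of the CUPS formula above, the helper below computes the rate. It assumes that "database size" means the total number of residues in the database, since each pair of one query residue and one database residue is one matrix cell; the numbers in the example are hypothetical, not measurements from the cited works, and the name cups is hypothetical.

```python
def cups(query_length: int, database_residues: int, run_time_s: float) -> float:
    """Cell updates per second: one cell per (query residue, database residue) pair."""
    if run_time_s <= 0:
        raise ValueError("run time must be positive")
    return query_length * database_residues / run_time_s


# Hypothetical example: a 500-residue query against 2,000,000 database
# residues in 2 seconds means 10**9 cells updated in 2 s.
rate = cups(query_length=500, database_residues=2_000_000, run_time_s=2.0)
print(rate / 1e6, "MCUPS")  # 500.0 MCUPS
```

Because CUPS normalizes by the amount of work, it allows runs with different query lengths and databases to be compared, which is why the speeds reported in this section are given in MCUPS or GCUPS.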
The implementation of [18] has two versions, the first with traceback and the second
without. Both versions were benchmarked on a GeForce 7800 GTX GPU using a database of
just 983 sequences. The version without traceback obtained a maximum speed of 241 MCUPS,
compared to 178 MCUPS with traceback and 120 MCUPS for the CPU version used for comparison.
4.2 Accelerating Smith and Waterman Algorithm by Using CUDA
SW-CUDA [19] uses CUDA to implement sequence alignment on the GPU. Each GPU thread
computes the whole alignment of the query sequence with one database sequence. During execution,
the threads are grouped in a grid of blocks. To make the most efficient use of the GPU resources,
the computing times of all the threads in the same grid must be as close as possible.
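The thread organization described above, one thread per database sequence with threads grouped into blocks, can be emulated sequentially in Python. Sorting the database by sequence length is used here as an assumed way to meet the requirement that threads have similar computing times; it is not taken from the description above. The names sw_score, inter_task_search and block_dim, the linear gap penalty and the score values are hypothetical.

```python
from typing import List


def sw_score(query: str, subject: str, match: int = 2, mismatch: int = -1,
             gap: int = 2) -> int:
    """Smith-Waterman score with a linear gap penalty, keeping only two rows."""
    prev = [0] * (len(subject) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        cur = [0] * (len(subject) + 1)
        for j in range(1, len(subject) + 1):
            se = match if query[i - 1] == subject[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + se, prev[j] - gap, cur[j - 1] - gap, 0)
            best = max(best, cur[j])
        prev = cur
    return best


def inter_task_search(query: str, database: List[str], block_dim: int = 2) -> List[int]:
    """Emulate one thread per database sequence, with threads grouped into blocks."""
    # With a fixed query, a thread's work is proportional to its sequence length,
    # so sorting by length puts threads with similar work in the same block.
    order = sorted(range(len(database)), key=lambda k: len(database[k]))
    scores = [0] * len(database)
    for start in range(0, len(order), block_dim):  # one thread block per batch
        for k in order[start:start + block_dim]:   # one thread per database sequence
            scores[k] = sw_score(query, database[k])
    return scores


db = ["ACGT", "TTTT", "GGACGTCC", "A"]
print(inter_task_search("ACGT", db))  # [8, 2, 8, 2]
```

Keeping only two rows per thread reflects that this scheme needs the score, not the full matrix, which limits the memory each thread requires.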
Experimental studies compared SW-CUDA, running on two GeForce 8800 GTX GPUs, with BLAST [20]
and SSEARCH [21] running on a 3 GHz Intel Pentium IV processor. The CUDA implementation was
up to 30 times faster than SSEARCH and up to 2.4 times faster than BLAST. SW-CUDA was also
3 times faster than the Single Instruction Multiple Data (SIMD) implementation [22].
SW-CUDA does not exploit the full resources of the GPU, because it uses the GPU's local memory,
which is one of the slowest memory resources on the GPU.
An improvement has been proposed that uses on-chip shared memory to reduce the amount of data
transferred from global memory to the processing elements of the GPU [19]. This reduces the amount
of data fetched to 1/140.
The implementation proposed by Striemer [25] is 23 times faster than SSEARCH. In Striemer's
implementation, the SW matrices are computed entirely on the GPU. It works in three stages:
1) the biological sequence database is loaded into the GPU's global memory; 2) each thread computes
the alignment score between the query sequence and one sequence of the database; 3) the alignment
scores are copied back to the CPU, which determines the highest alignment score. This implementation
therefore gives only the alignment score, not the alignment itself.
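The three stages can be mirrored in a short Python sketch. The flat buffer with start offsets in stage 1 is an assumed layout for copying the database into global memory, not a detail given in [25]; stage 2 gives each emulated thread one database sequence; stage 3 returns only the index and score of the best hit, not the alignment. All names are hypothetical, the database is assumed to be non-empty, and a linear gap penalty with illustrative scores is used.

```python
from typing import List, Tuple


def sw_score(query: str, subject: str, match: int = 2, mismatch: int = -1,
             gap: int = 2) -> int:
    """Smith-Waterman score with a linear gap penalty, keeping only two rows."""
    prev = [0] * (len(subject) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        cur = [0] * (len(subject) + 1)
        for j in range(1, len(subject) + 1):
            se = match if query[i - 1] == subject[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + se, prev[j] - gap, cur[j - 1] - gap, 0)
            best = max(best, cur[j])
        prev = cur
    return best


def pack_database(database: List[str]) -> Tuple[str, List[int]]:
    """Stage 1 (assumed layout): one flat buffer plus the start offset of each sequence."""
    offsets = [0]
    for seq in database:
        offsets.append(offsets[-1] + len(seq))
    return "".join(database), offsets


def gpu_style_search(query: str, database: List[str]) -> Tuple[int, int]:
    buffer, offsets = pack_database(database)       # stage 1: load the database
    scores = []
    for t in range(len(database)):                  # stage 2: "thread" t scores sequence t
        subject = buffer[offsets[t]:offsets[t + 1]]
        scores.append(sw_score(query, subject))
    best = max(range(len(scores)), key=scores.__getitem__)  # stage 3: host picks the best
    return best, scores[best]


print(gpu_style_search("ACGT", ["TTTT", "GGACGTCC", "ACG"]))  # (1, 8)
```

Recovering the alignment itself would require keeping the matrix, or recomputing it for the best hit, which this score-only scheme avoids.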
Unlike SW-CUDA, this implementation does not use the CPU for partial SW computations; they are
done entirely on the GPU. Another difference is that this implementation stores the query sequence
and the substitution matrix in the GPU's constant memory, because the constant cache has about the
same access time as registers. For its part, SW-CUDA achieves speeds of more than 3.5 GCUPS on a
workstation running two GeForce 8800 GTX GPUs. SW-CUDA was also compared to the SIMD
implementation [22], and the experimental results show that SW-CUDA performs from 2 to 30 times
faster than any other previous implementation.
CONCLUSION
Biological databases are growing exponentially, and the sequences they contain are long.
Biological computations such as sequence alignment therefore need to be faster. Many hardware
solutions, such as FPGAs and GPUs, are used to accelerate these algorithms. Sequence alignment is
also used in other biological applications, such as the GeneTracer algorithm [24], which finds the
ancestors of an offspring in a biological database. This algorithm will also be accelerated using
GPUs.
REFERENCES
[1] T. F. Smith, M. S. Waterman, "Identification of common molecular subsequences", Journal of
Molecular Biology, Vol. 147, pp. 195-197, 1981.
[2] S. B. Needleman, C. D. Wunsch, "A general method applicable to the search for similarities in
the amino acid sequence of two proteins", Journal of Molecular Biology, Vol. 48, No. 3,
pp. 443-453, 1970.
[3] R. Bellman, Dynamic Programming, Princeton University Press, Princeton, New Jersey, 1957.
[4] R. Bellman, S. Dreyfus, Applied Dynamic Programming, Princeton University Press, 1962.
[5] M. O. Dayhoff, R. M. Schwartz, B. C. Orcutt, "A model of evolutionary change in proteins", in
Atlas of Protein Sequence and Structure, Chapter 22, National Biomedical Research Foundation,
Washington, DC, pp. 345-358, 1978.
[6] S. Henikoff, J. G. Henikoff, "Amino acid substitution matrices from protein blocks", Proc. Natl.
Acad. Sci. USA, Vol. 89, No. 22, pp. 10915-10919, 1992.
[7] X. Huang, K. M. Chao, "A generalized global alignment algorithm", Bioinformatics, Vol. 19,
No. 2, pp. 228-233, 2003.
[8] R. A. Cartwright, "NGILA: Global pairwise alignments with logarithmic and affine gap costs",
Bioinformatics, Vol. 23, No. 11, pp. 1427-1428, 2007.
[9] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, J. C. Phillips, "GPU Computing",
Proceedings of the IEEE, Vol. 96, No. 5, May 2008.
[10] J. Krüger, R. Westermann, "Linear algebra operators for GPU implementation of numerical
algorithms", ACM Transactions on Graphics (TOG), Proc. ACM SIGGRAPH'03, Vol. 22, No. 3,
pp. 908-916, July 2003.
[11] P. Agarwal, S. Krishnan, N. Mustafa, S. Venkatasubramanian, "Streaming geometric
optimization using graphics hardware", in Proc. 11th Annual European Symposium on
Algorithms (ESA'03), Budapest, Hungary, September 2003.
[12] M. Charalambous, P. Trancoso, A. Stamatakis, "Initial experiences porting a bioinformatics
application to a graphics processor", in Proc. 10th Panhellenic Conference on Informatics
(PCI'05), Volos, Greece, November 2005.
[13] GenBank. http://www.ncbi.nlm.nih.gov/genbank/
[14] PubMed. http://www.ncbi.nlm.nih.gov/pubmed/
[15] D. Shreiner, M. Woo, J. Neider, T. Davis, OpenGL Programming Guide, 5th edition,
Addison-Wesley, Reading, MA, August 2005.
[16] W. Liu, B. Schmidt, G. Voss, A. Schroder, W. Muller-Wittig, "Bio-Sequence Database
Scanning on a GPU", in Proc. 20th IEEE International Parallel & Distributed Processing
Symposium (IPDPS'06), 5th IEEE International Workshop on High Performance Computational
Biology (HICOMB'06), Rhodes Island, Greece, 2006.
[17] UniProt. http://www.uniprot.org/
[18] Y. Liu, W. Huang, J. Johnson, S. Vaidya, "GPU accelerated Smith-Waterman", in Proc.
International Conference on Computational Science (ICCS'06), Lecture Notes in Computer
Science, Vol. 3994, Springer-Verlag, Berlin, Germany, pp. 188-195, 2006.
[19] S. A. Manavski, G. Valle, "CUDA compatible GPU cards as efficient hardware accelerators for
Smith-Waterman sequence alignment", BMC Bioinformatics, Vol. 9 (Suppl. 2), S10, 2008.
[20] W. R. Pearson, D. J. Lipman, "Improved tools for biological sequence comparison", Proc. Natl.
Acad. Sci. USA, Vol. 85, pp. 2444-2448, April 1988.
[21] W. R. Pearson, "Searching protein sequence libraries: comparison of the sensitivity and
selectivity of the Smith-Waterman and FASTA algorithms", Genomics, Vol. 11, No. 3,
pp. 635-650, November 1991.
[22] M. Farrar, "Striped Smith-Waterman speeds database searches six times over other SIMD
implementations", Bioinformatics, Vol. 23, No. 2, pp. 156-161, 2007.
[23] NVIDIA CUDA parallel programming book.
[24] M. Issa, A. Alzohairy, H. Abo Bakr, I. Ziedan, "Tracking Genes Modifications in the Pedigree
through GeneTracer Algorithm", 23rd International Workshop on Database and Expert Systems
Applications (DEXA), IEEE, 2012.
[25] G. M. Striemer, A. Akoglu, "Sequence Alignment with GPU: Performance and Design
Challenges", IEEE Xplore, 2009.
Table 1 Gap penalties
Gap penalty type            Formula
Linear                      P = a*k
Affine                      P = a*k + c
Logarithmic                 P = b*log k + c
Logarithmic-affine          P = a*k + b*log k + c
Fig 1. CPU organization versus GPU one [23]
Fig 2. Memory bandwidth for CPU and GPU [23]
Fig 3. FLOPS for CPU and GPU [23].
Fig 4. Access time between the processors and the different memories of the GPU device [23,25].