Compilation and Parallelization Techniques with Tool Support to Realize Sequence Alignment Algorithm on FPGA and Multicore

Sunita Chandrasekaran (1), Oscar Hernandez (2), Douglas Leslie Maskell (1), Barbara Chapman (2), Van Bui (2)

(1) Nanyang Technological University, Singapore
(2) University of Houston, HPCTools, Texas, USA
[Figure: Speed-up vs. number of threads (1 to 128) for 100, 600, and 1000 sequence data sets, with the ideal speed-up shown for comparison.]
• Challenge
• Application – Bioinformatics
• Proposed Idea
• Tool Support
• Tuning Methodology
• Scheduling
• Execution and Tuning Model
• Conclusion and Future Work
Challenge
• Reconfigurable Computing – customizing a computational fabric for specific applications, e.g. an FPGA (Field Programmable Gate Array)
• Combining reconfigurable computing with HPC is now a reality
• Fills the gap between hardware and software
• FPGA-based accelerators – involving massive parallelism and extensible hardware optimizations
• Portions of the application can be run on reprogrammable hardware
It is important to identify the hot spots in the application in order to determine which portions should run in software and which on the reconfigurable hardware.
The paper presents a tuning methodology that identifies bottlenecks in the program using a parallelizing compiler together with static and dynamic analysis tools.
Application
Bioinformatics – Multiple Sequence Alignment (MSA)
Arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity
Areas of research in Bioinformatics:
• Sequence Alignment
  – Local alignment (internal small stretches of similarity) – Smith-Waterman (S-W) algorithm
  – Global alignment (end-to-end alignment) – Needleman-Wunsch (N-W) algorithm
• Gene Structure Prediction – classification and identification of genes
• Phylogenetic Tree – constructed based on the distances between the sequences
• Protein Folding – 2D and 3D
Smith-Waterman Algorithm
• Finds similar subsequences of two sequences
• Implemented by large bioinformatics organizations
• A dynamic programming algorithm used to compute the local alignment of a pair of sequences
• Applying it exhaustively across many sequences is impractical due to its time and space complexity
• Progressive alignment is the widely used heuristic: compute a distance value between each pair of sequences, build a phylogenetic tree, then perform pairwise alignment of the resulting profiles
• Hardware implementations of the algorithm exploit opportunities for parallelism and further accelerate execution
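For reference, a minimal sketch of the Smith-Waterman recurrence for a single pair of sequences is shown below. The scoring values (match +2, mismatch -1, linear gap -1) and the fixed buffer size are illustrative assumptions, not parameters taken from the paper.

#include <stdio.h>
#include <string.h>

#define MAXLEN 256   /* illustrative maximum sequence length */

static int max4(int a, int b, int c, int d) {
    int m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    if (d > m) m = d;
    return m;
}

/* Returns the score of the best local alignment of a and b. */
int smith_waterman(const char *a, const char *b) {
    int n = (int)strlen(a), m = (int)strlen(b);
    static int H[MAXLEN + 1][MAXLEN + 1];   /* DP matrix; row 0 and column 0 stay 0 */
    int best = 0;

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int s = (a[i - 1] == b[j - 1]) ? 2 : -1;   /* match / mismatch score */
            H[i][j] = max4(0,                          /* local-alignment floor */
                           H[i - 1][j - 1] + s,        /* extend along the diagonal */
                           H[i - 1][j] - 1,            /* gap in b */
                           H[i][j - 1] - 1);           /* gap in a */
            if (H[i][j] > best) best = H[i][j];
        }
    }
    return best;
}

int main(void) {
    printf("best local score: %d\n", smith_waterman("ACACACTA", "AGCACACA"));
    return 0;
}

Cells on the same anti-diagonal of H do not depend on each other, which is the property typically exploited by hardware and multithreaded implementations to extract parallelism.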
Proposed Idea
• Efficient C code implementation of the MSA
• Preprocessing steps and parallel processing approaches
• Profiling to determine the performance bottlenecks and to identify the areas of the code that can benefit from parallelization
• High-level optimizations performed to obtain a better speed-up and to improve the CPI
• These include pipelining, data prefetching, improving data locality, avoiding resource contention, and supporting parallelization of the main kernel (an illustrative locality transformation is sketched below)
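As a concrete, if generic, illustration of the kind of data-locality optimization meant here (it is not the specific transformation applied in the paper), interchanging loops so that a row-major C array is walked contiguously improves cache-line reuse and lowers the CPI:

#include <stddef.h>

#define N 1024   /* illustrative matrix dimension */

/* Strided traversal of a row-major array: each access touches a new cache line. */
void scale_slow(double a[N][N], double s) {
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            a[i][j] *= s;
}

/* Interchanged loops: consecutive accesses fall in the same cache line. */
void scale_fast(double a[N][N], double s) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] *= s;
}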
Tool Support
OpenUH Compiler Infrastructure:
• Front-end (C/C++ & Fortran 77/90) accepts source code with OpenMP directives
• IPA (Inter-Procedural Analyzer)
• LNO (Loop Nest Optimizer)
• WOPT (global scalar optimizer)
• Backend, or IR-to-source translation (whirl2c & whirl2f) emitting source code with OpenMP runtime library calls for native compilers
• Linking against the portable OpenMP runtime library to produce executables
The OpenUH Compiler
• Based on the Open64 compiler
• A suite of optimizing compiler tools for Linux/Intel IA-64 systems and IA-32 (source-to-source)
• First release open-sourced by SGI
  – Available for researchers/developers in the community
• Multiple languages and multiple targets
  – C, C++ and Fortran 77/90
  – OpenMP 2.0 support
(University of Houston, Tsinghua University, PathScale)
OpenUH/Open64 includes the Dragon analysis tool, which provides:
• Call graph
• Array regions
• Flow graph
• Data dependence analysis
TAU – profiling toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java, or Python
Tuning Methodology
• Bottlenecks in the program are identified with hardware performance counters
The measurements for the unoptimized code are:
• Count of useful instructions = 7.63E+9
• No-op (NOP) operations = 44% (moving this portion to the reconfigurable platform would be inefficient)
• Branch mispredictions = 75% (these stall the pipeline and waste resources)
• Cycles per instruction (CPI) = 0.3178 (instructions are stalling)
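The paper does not name the counter interface used; as one common way to collect such numbers, the sketch below reads total instructions, total cycles, and branch mispredictions around a region of interest with the PAPI library (the event choice and the measured region are assumptions for illustration; error checking is omitted for brevity).

#include <stdio.h>
#include <papi.h>

int main(void) {
    int events[3] = { PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_BR_MSP };
    long long counts[3];
    int eventset = PAPI_NULL;

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&eventset);
    PAPI_add_events(eventset, events, 3);

    PAPI_start(eventset);
    /* ... region of interest, e.g. the alignment kernel ... */
    PAPI_stop(eventset, counts);

    printf("instructions: %lld  cycles: %lld  branch mispredictions: %lld\n",
           counts[0], counts[1], counts[2]);
    printf("CPI: %.4f\n", (double)counts[1] / (double)counts[0]);
    return 0;
}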
Goal:
Reduce total cycles, stalls, no-ops, and conditionals (hoisting invariant work and conditionals out of loops where possible), and improve memory locality
• Used OpenMP, a software parallel programming paradigm, and its pragmas to parallelize the code
• Identified the dependencies in the program with the Dragon tool
• Used the control flow and data flow graphs to distinguish between regions
• Applied aggressive privatization to most of the arrays
• Defined fine-grained locks to guard accesses to shared arrays
• Identified the hot spots of the application
OpenMP Pseudocode

msap() {
  #pragma omp parallel private(...) firstprivate(...)
  {
    #pragma omp for
    for (...)
      /* initialize array of locks */

    #pragma omp for nowait
    for (...) {
      for (...) {
        for (...) {
          Computations();
          for (...) {
            Computations();
          }
          /* update to shared data */
          omp_set_lock(...);
          /* updates to shared data */
          omp_unset_lock(...);
        }
      }
    }
  }
}
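The "initialize array of locks" and lock/unlock steps in the pseudocode could be realized as in the minimal sketch below; the number of locks, the mapping of an index onto a lock, and the function names are illustrative assumptions, not details from the paper.

#include <omp.h>

#define NLOCKS 1024                 /* illustrative: number of fine-grained locks */
static omp_lock_t locks[NLOCKS];

void init_locks(void) {
    for (int i = 0; i < NLOCKS; i++)
        omp_init_lock(&locks[i]);
}

/* Guarded update of one element of a shared array: only the lock covering
 * that element is held, so threads touching different elements do not serialize. */
void update_shared(double *shared, int idx, double value) {
    omp_set_lock(&locks[idx % NLOCKS]);
    shared[idx] += value;
    omp_unset_lock(&locks[idx % NLOCKS]);
}

void destroy_locks(void) {
    for (int i = 0; i < NLOCKS; i++)
        omp_destroy_lock(&locks[i]);
}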
Results obtained after performing the optimizations:
• Count of useful instructions = 8.40E+9
• No-op (NOP) operations = 24%
• Branch mispredictions = 59%
• Cycles per instruction (CPI) = 0.28 (lower, hence higher performance)
CPI improved by 11.89%, branch mispredictions were reduced by 21.33%, and NOP instructions were reduced by 45.45%.
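These figures are the relative reductions implied by the table below: CPI (0.3178 - 0.28) / 0.3178 ≈ 11.89%, branch mispredictions (75 - 59) / 75 ≈ 21.33%, and NOPs (44 - 24) / 44 ≈ 45.45%.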
Parameter                      Unoptimized   Optimized
Useful instruction count       7.63E+9       8.40E+9
NOP operations                 44%           24%
Branch mispredictions          75%           59%
Cycles per instruction (CPI)   0.3178        0.28
Scheduling
Static Scheduling
• Reduced synchronization/communication overhead
• Uneven-sized tasks
• Load imbalance and idle processors, leading to wasted resources
• Triangular resultant matrix – statically assigned chunks carry uneven work, so ideal speed-up is not achieved
Dynamic Scheduling
• Offers flexibility
• As the parallel loop is executed, the number of iterations each thread performs is determined dynamically
• The loop is divided into chunks of h iterations, with the chunk size set to 1 or to some percentage of the iterations
• Around 80% of the ideal speed-up is achieved (an illustrative sketch follows)
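As a sketch of what dynamic scheduling of a triangular loop looks like in OpenMP (this is not the paper's code; the pair_distance kernel and the chunk size of 4 are assumptions for illustration):

#include <omp.h>

#define N 1000   /* illustrative: number of sequences */

extern double pair_distance(int i, int j);   /* hypothetical pairwise-distance kernel */

void fill_distances(double dist[N][N]) {
    /* The inner trip count grows with i (triangular work), so equal static chunks of
     * the outer loop are unbalanced.  schedule(dynamic, 4) lets idle threads grab the
     * next 4 rows as they finish, evening out the load. */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 1; i < N; i++)
        for (int j = 0; j < i; j++)
            dist[i][j] = pair_distance(i, j);
}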
[Figure: Dynamic scheduling (triangular matrix) vs. static scheduling]
Execution and Tuning Model
Conclusion and Future Work
• The multithreaded application achieves 78% of the ideal speed-up with dynamic scheduling and 128 threads on a 1000-sequence protein data set.
• Looking at translating OpenMP to Impulse C, a tool for mainstream embedded programmers seeking high performance through FPGA co-processing.
• Plan to address the lack of tools and techniques for turn-key mapping of algorithms to hybrid CPU-FPGA systems by developing an OpenUH add-on module to perform this mapping automatically.