Milind Shakya
Winter 2009
High Performance Computing for Genetic Research
High Performance Computing is accelerating several genetic research projects, among them the famous Human Genome Project. The Human Genome Project's main goal is to identify every gene in human DNA. The project has to deal with over 30,000 genes and 3 billion chemical base pairs, so HPC is vital for processing, sequencing, and storing the genome information. Many life science organizations, such as the Sanger Institute, use an HPC management software solution called Platform LSF to intelligently schedule parallel and serial workloads. It is software for managing and accelerating batch workload processing for compute- and data-intensive applications.
Platform LSF
Platform LSF is a key product of the Platform Accelerate Suite that manages and accelerates batch workload processing for mission-critical compute- and data-intensive applications. It can intelligently schedule and guarantee the completion of batch workloads across distributed, virtualized High Performance Computing (HPC) environments, and it increases the utilization of existing hardware resources regardless of the industry in which it is used. This HPC management software has many notable features:
1. High-performance, flexible, scalable HPC infrastructure software
2. Takes maximum advantage of modern multi-core and multi-threaded architectures, with advanced scheduling controls for both sequential and parallel jobs
3. Proven to work in heterogeneous IT environments, utilizing all IT infrastructure resources regardless of operating system (including desktops, servers, and mainframes) or architecture (32-bit and 64-bit operating systems, multi-core CPUs)
4. Its Platform LSF Session Scheduler enables low-latency, high-throughput job scheduling
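To make the scheduling model concrete, the sketch below composes an LSF job submission command of the kind described above. The queue name, core count, and memory request are illustrative examples, not the Sanger Institute's actual configuration; `-q`, `-n`, `-R "rusage[...]"`, and `-o` are standard `bsub` options.

```python
def bsub_command(script, queue="normal", cores=8, mem_mb=4000,
                 out_log="job.%J.out"):
    # Compose an LSF "bsub" submission command as an argument list.
    # Queue name and resource values here are example placeholders.
    return [
        "bsub",
        "-q", queue,                    # target batch queue
        "-n", str(cores),               # number of job slots (parallel jobs)
        "-R", f"rusage[mem={mem_mb}]",  # per-slot memory reservation
        "-o", out_log,                  # stdout log; %J expands to the job ID
        script,
    ]
```

On a cluster with LSF installed, passing this list to `subprocess.run` would hand the job to the scheduler, which then picks a suitable node from the heterogeneous pool.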
The Genome Research Center Laboratory
According to Tim Cutts, Platform LSF administrator at the Sanger Institute, the Human Genome Project would be very expensive to run on single machines. Initially, the research team encountered several problems that traditional supercomputers were not set up to deal with. Consequently, the research center adopted clusters of computers to better exploit the principles of High Performance Computing and advance the Human Genome Project.
The Sanger Institute has a total of 12 clusters, 8 of which run Platform LSF; these include the three largest of the 12. The clusters form a multivendor, heterogeneous Linux environment, with dual- and quad-core IBM and HP machines as well as several SGI Altix machines. The largest cluster, with more than 700 nodes, is used for general-purpose research by the institute's researchers.
A DNA sequencing run on the cluster takes around three days to complete, so one can imagine the huge amount of data the system has to work with. Each run generates about two terabytes of data, roughly four terabytes per machine per week. With 30 machines devoted solely to processing DNA, the institute handles about 120 terabytes of raw data every week. The sequencing is carried out on the 128-node Platform cluster. Figure 1 shows a typical DNA sequencing run being carried out on multiple nodes. From the figure one can see that a single one of these instruments has the sequencing capacity of an entire genome research institute of a decade ago.
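The throughput figures above can be sanity-checked with simple arithmetic (rounding a three-day run to two runs per machine per week, as the text does):

```python
tb_per_run = 2        # each sequencing run yields about 2 TB
runs_per_week = 2     # a 3-day run fits roughly twice in one week
machines = 30         # machines dedicated solely to DNA processing

tb_per_machine_per_week = tb_per_run * runs_per_week   # 4 TB
weekly_total_tb = tb_per_machine_per_week * machines   # 120 TB
print(weekly_total_tb)  # prints 120
```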
Parallel Algorithms for Sequence Analysis
Building gene models manually is extremely time consuming, so automating processes such as the recognition of features in a sequence (for example, genes) proves very beneficial. Five distinct types of algorithms, namely pattern recognition, statistical measurement, sequence comparison, gene modeling, and data mining, must be combined into a coordinated toolkit to synthesize the complete analysis.
DNA sequences are almost impossible to interpret by visual examination alone. When examined computationally, however, DNA sequence has proved to be a rich source of interesting patterns with periodic, stochastic, and chaotic properties that vary across functional domains. These properties, and methods to measure them, form the basis for recognizing the parts of a sequence that contain important biological features. In genomics and computational biology, pattern recognition systems often employ artificial neural networks or similar classifiers to distinguish sequence regions containing a particular feature from those that do not. Machine-learning methods allow computer-based systems to learn such patterns from examples in DNA sequence; they have proven valuable because our biological understanding of the properties of sequence patterns is very limited. Moreover, the underlying patterns corresponding to genes or other features are often very weak, so several measures must be combined to improve the reliability of the prediction. A well-known example is ORNL's GRAIL gene detection system, deployed originally in 1991, which combined seven statistical pattern measures using a simple feed-forward neural network.
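The idea of combining several weak statistical measures through a neural classifier can be sketched as follows. This is an illustration only: GRAIL's actual seven measures and trained network are not reproduced here. The two toy measures (GC content and longest open reading frame) and the fixed weights are assumptions chosen for the example; a real system would learn its weights from labeled training sequences.

```python
import math

def gc_content(seq):
    # Fraction of G/C bases; coding regions often show elevated GC.
    return (seq.count("G") + seq.count("C")) / len(seq)

def longest_orf_fraction(seq):
    # Length of the longest open reading frame (ATG ... stop codon),
    # as a fraction of sequence length -- a crude coding indicator.
    stops = {"TAA", "TAG", "TGA"}
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in stops and start is not None:
                best = max(best, i + 3 - start)
                start = None
    return best / len(seq)

def coding_score(seq, weights=(4.0, 3.0), bias=-3.0):
    # Combine the weak measures with a single sigmoid output neuron.
    # Weights and bias are fixed illustrative values, not trained ones.
    features = (gc_content(seq), longest_orf_fraction(seq))
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

A GC-rich sequence consisting of one long open reading frame scores near 1, while an AT-repeat with no reading frame scores near 0 -- each measure alone is weak, but their combination separates the two cases cleanly.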
Seamless high-performance computing
The megabases of DNA sequence analyzed each day will strain the capacity of existing supercomputing centers. Interoperability between high-performance computing centers will be needed to provide the aggregate computing power, managed through sophisticated resource management tools. The system must be tolerant of machine and network failures so that no data or results are lost, which argues for the use of High Availability HPC clusters as well.
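One common building block for the fault tolerance described above is checkpointing: persisting each completed unit of work so an analysis can resume after a machine failure instead of restarting. A minimal sketch, assuming results are small enough to serialize as JSON (the file names and function are hypothetical):

```python
import json
import os
import tempfile

def atomic_write(path, data):
    # Write to a temp file and rename, so a crash mid-write never
    # leaves a corrupt checkpoint (rename is atomic on POSIX).
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f)
    os.replace(tmp, path)

def run_with_checkpoints(chunks, analyze, ckpt="progress.json"):
    # Analyze each chunk, persisting results after every one; on
    # restart, previously completed chunks are skipped.
    done = {}
    if os.path.exists(ckpt):
        with open(ckpt) as f:
            done = json.load(f)
    for i, chunk in enumerate(chunks):
        key = str(i)
        if key in done:
            continue  # already analyzed before the failure
        done[key] = analyze(chunk)
        atomic_write(ckpt, done)  # persist after every chunk
    return [done[str(i)] for i in range(len(chunks))]
```

Re-running after a crash re-reads the checkpoint and performs no duplicate work, which is the property that keeps multi-day sequencing analyses from losing results.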
Figure: A region of 60 kilobases of sequence (horizontal axis) with several genes predicted by GRAIL (the up and down peaks represent gene coding segments on the forward and reverse strands of the double-stranded DNA helix). Visualization tools like the GRAIL interface provide researchers with an overview of large genomic regions and often have hyperlinks to underlying detailed information.
Visualization for data and collaboration
The sheer volume and complexity of the analyzed information, and its links to data in many remote databases, require advanced data visualization methods to give users access to the data. Users need to interface with the raw sequence data; the analysis process; the resulting synthesis of gene models, features, patterns, genome map data, and anatomical or disease phenotypes; and other relevant data. In addition, collaborations among multiple sites are required for most large genome analysis problems, so collaboration tools such as video conferencing and electronic notebooks are very useful. A display of several genes and other features from the GRAIL Internet server is shown in Fig. 6. Even more
complex and hierarchical displays are being developed that will be able to zoom in from each
chromosome to see the chromosome fragments (or clones) that have been sequenced and then display
the genes and other functional features at the sequence level. Linked (or hyperlinked) to each feature
will be detailed information about its properties, the computational or experimental methods used for
its characterization, and further links to many remote databases that contain additional information.
Analysis processes and intelligent retrieval agents will provide the feature details available in the
interface and dynamically construct links to remote data.
The development of the sequence analysis engine is part of a rapid shift in the biological sciences toward a paradigm that makes much greater use of computation, networking, simulation and modeling, and sophisticated data management systems. Unlike any other existing system in the
genomics arena, the sequence analysis engine will link components for data input, analysis, storage,
update, and submission in a single distributed high-performance framework that is designed to carry
out a dynamic and continual discovery process over a 10-year period. The approach outlined is flexible
enough to use a variety of hardware and data resources; configure analysis steps, triggers, conditions,
and updates; and provide the means to maintain and update the description of each genomic region.
Users (individuals and large-scale producers) can specify recurrent analysis and data mining operations
that continue for years. The combined computational process will provide new knowledge about the
genome on a scale that is impossible for individual researchers using current methods. Such a process
is absolutely necessary to keep up with the flood of data that will be gathered over the remainder of the
Human Genome Project.
High Performance Computing helps modern genetic research by accelerating comparisons of similar genetic structures. It helps the genetics industry by providing the ability to perform massive, regular updates to the genome browser database to incorporate the latest research advances. Research on the genomes of numerous species generates hundreds of thousands of processes per week that must be scheduled and completed as efficiently as possible; HPC makes all this possible in a fraction of the original time frame. The HPC community has its roots in solving computational problems in physics (such as fluid flow, structural analysis, and molecular dynamics). However, traditional approaches to these problems, and to ranking HPC systems based on the Linpack benchmark, may not be optimal for HPC architectures in computational biology. Many researchers are carefully considering the architectural needs of HPC systems to enable next-generation biology. New HPC algorithms for biomedical research will require tight integration of computation with database operations and queries, along with the ability to handle new types of queries that are highly dependent on irregular spatial or temporal locality.