Milind Shakya
HAPC Winter 2009

High Performance Computing for Genetic Research

Introduction

High Performance Computing (HPC) is accelerating several genetic research projects, among them the famous Human Genome Project, whose main goal is to identify every gene in human DNA. The project has to deal with over 30,000 genes and 3 billion chemical base pairs, so HPC is vital for processing, sequencing, and storing the genome information. Many life-science organizations, such as the Sanger Institute, use an HPC management software solution called Platform LSF to intelligently schedule parallel and serial workloads; it is software for managing and accelerating batch workload processing for compute- and data-intensive applications.

Platform LSF

Platform LSF is a key product of the Platform Accelerate Suite. It manages and accelerates batch workload processing for mission-critical, compute- and data-intensive applications, and it can intelligently schedule and guarantee the completion of batch workloads across distributed, virtualized High Performance Computing (HPC) environments. Platform LSF increases the utilization of existing hardware resources regardless of the industry in which it is used. Its notable features include:

1. A high-performance, flexible, scalable HPC infrastructure software solution.
2. Maximum advantage of modern multi-core and multi-threaded architectures, with advanced scheduling controls for both sequential and parallel jobs.
3. Proven operation in heterogeneous IT environments, using all infrastructure resources regardless of operating system (desktops, servers, and mainframes) or architecture (32-bit and 64-bit operating systems, multi-core CPUs).
4. A scheduler, the Platform LSF Session Scheduler, that enables low-latency, high-throughput job scheduling.

The Genome Research Center Laboratory

According to Tim Cutts, the Platform LSF administrator at the Sanger Institute, the Human Genome Project would be very expensive to run on single machines. Initially, the research team encountered several problems that traditional supercomputers were not set up to deal with, so the center adopted clusters of computers to better exploit the principles of High Performance Computing and speed the progress of the project. The Sanger Institute has twelve clusters in total, eight of which run Platform LSF; these eight include the three largest of the twelve. The environment is a multivendor, heterogeneous Linux setup, with dual- and quad-core IBM and HP machines as well as several SGI Altix machines spread across the clusters. The largest cluster, with more than 700 nodes, is used for general-purpose research by the institute's scientists.

DNA Sequencing Using HPC

A DNA sequencing run on the cluster takes around three days to complete, and one can imagine the huge amount of data the system has to work with. Each run generates about two terabytes of data, or roughly four terabytes per machine per week; with 30 machines devoted solely to processing DNA, the institute handles about 120 terabytes of raw data every week. The sequencing is carried out on the 128-node Platform LSF cluster (a sketch of how such a workload might be submitted appears below). Figure 1 shows a typical DNA sequencing run being carried out on multiple nodes.

Fig. 1. A DNA sequencing run at the Sanger Institute. Source: http://www.archbishopofcanterbury.org/media/image/2/9/archbishop-at-sangerinstitute_large.jpg

From the figure one can see that a single one of these instruments has the sequencing capacity that the entire genome research institute had a decade ago.
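To make the scheduling side of this concrete, the following is a minimal Python sketch of how such a sequencing workload might be dispatched through Platform LSF. Only the bsub command and its -q/-J/-o options are actual LSF; the queue name, the job script process_run.sh, and the run counts are hypothetical assumptions for illustration.

    import subprocess

    # Throughput arithmetic from the text: a run takes ~3 days and
    # yields ~2 TB, so each machine produces ~4 TB per week; with
    # 30 machines that is ~120 TB of raw data per week.
    TB_PER_RUN = 2
    RUNS_PER_MACHINE_PER_WEEK = 7 // 3   # about two three-day runs per week
    MACHINES = 30
    print(TB_PER_RUN * RUNS_PER_MACHINE_PER_WEEK * MACHINES, "TB/week")  # 120

    # Submit one processing job per sequencing machine to Platform LSF.
    # "sequencing" (queue) and process_run.sh (job script) are hypothetical.
    for machine in range(MACHINES):
        subprocess.run(
            ["bsub",
             "-q", "sequencing",              # hypothetical queue name
             "-J", f"seqrun_{machine}",       # job name
             "-o", f"seqrun_{machine}.out",   # log file for stdout/stderr
             "./process_run.sh", str(machine)],
            check=True,                       # fail loudly if submission fails
        )

Because LSF queues the jobs and tracks their completion across the cluster, the submitting script needs no scheduling logic of its own; that is precisely the batch-management role described above.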
Parallel Algorithms for Sequence Analysis

Building gene models manually is extremely time consuming, so automating processes such as recognizing features (genes, for example) in a sequence is very beneficial. Five distinct types of algorithms, namely pattern recognition, statistical measurement, sequence comparison, gene modeling, and data mining, must be combined into a coordinated toolkit to synthesize a complete analysis.

DNA sequences are almost impossible to interpret through visual examination. Examined computationally, however, a DNA sequence proves to be a rich source of interesting patterns with periodic, stochastic, and chaotic properties that vary across functional domains. These properties, and methods to measure them, form the basis for recognizing the parts of a sequence that contain important biological features. In genomics and computational biology, pattern recognition systems often employ artificial neural networks or similar classifiers to distinguish sequence regions containing a particular feature from those that do not. Machine-learning methods allow computer-based systems to learn such patterns from examples in DNA sequence; they have proven valuable because our biological understanding of sequence patterns is very limited. Moreover, the underlying patterns corresponding to genes or other features are often very weak, so several measures must be combined to improve the reliability of the prediction. A well-known example is ORNL's GRAIL gene detection system, deployed originally in 1991, which combined seven statistical pattern measures using a simple feed-forward neural network (a toy sketch of this idea appears at the end of this section).

Seamless high-performance computing. Megabases of DNA sequence analyzed each day will strain the capacity of existing supercomputing centers. Interoperability between high-performance computing centers will be needed to provide the aggregate computing power, managed through sophisticated resource management tools. The system must tolerate machine and network failures so that no data or results are lost, which also argues for high-availability HPC clusters.

Figure: a region of 60 kilobases of sequence (horizontal axis) with several genes predicted by GRAIL; the up and down peaks represent gene-coding segments on the forward and reverse strands of the double-stranded DNA helix. Visualization tools like the GRAIL interface give researchers an overview of large genomic regions and often hyperlink to the underlying detailed information. Source: https://hpcrd.lbl.gov/SciDAC08/files/presentations/SciDAC_2008_oehmen_final.ppt

Visualization for data and collaboration. The sheer volume and complexity of the analyzed information, and its links to data in many remote databases, require advanced data visualization methods to give users access to the data. Users need to interface with the raw sequence data; the analysis process; the resulting synthesis of gene models, features, patterns, genome map data, and anatomical or disease phenotypes; and other relevant data. In addition, collaborations among multiple sites are required for most large genome-analysis problems, so collaboration tools such as video conferencing and electronic notebooks are very useful. A display of several genes and other features from the GRAIL Internet server is shown in the figure above. Even more complex, hierarchical displays are being developed that will zoom in from each chromosome to the sequenced chromosome fragments (or clones) and then display the genes and other functional features at the sequence level. Linked (hyperlinked) to each feature will be detailed information about its properties, the computational or experimental methods used for its characterization, and further links to many remote databases containing additional information. Analysis processes and intelligent retrieval agents will provide the feature details available in the interface and dynamically construct links to remote data.
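To make the pattern-combination idea concrete, here is a toy Python sketch, not GRAIL's actual code, of a feed-forward network that merges simple statistical measures of a DNA window into one coding-potential score. The two measures shown (GC fraction and a crude three-base periodicity) and the untrained random weights are illustrative assumptions; GRAIL itself combined seven trained measures.

    import numpy as np

    def gc_fraction(window):
        """Fraction of G/C bases in the window (one simple measure)."""
        return sum(b in "GC" for b in window) / len(window)

    def periodicity_3(window):
        """Crude three-base periodicity: how unevenly G falls across
        the three codon positions (coding DNA tends to be periodic)."""
        counts = [0, 0, 0]
        for i, b in enumerate(window):
            if b == "G":
                counts[i % 3] += 1
        total = sum(counts) or 1
        freqs = np.array(counts) / total
        return float(freqs.max() - freqs.min())

    def coding_score(window, w1, b1, w2, b2):
        """One-hidden-layer feed-forward pass combining the measures."""
        x = np.array([gc_fraction(window), periodicity_3(window)])
        hidden = np.tanh(w1 @ x + b1)                             # hidden layer
        return float(1.0 / (1.0 + np.exp(-(w2 @ hidden + b2))))   # sigmoid output

    # Random, untrained weights purely for illustration; a real system
    # would fit them to windows with known coding/non-coding labels.
    rng = np.random.default_rng(0)
    w1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
    w2, b2 = rng.normal(size=4), rng.normal()

    print(coding_score("ATGGCGGCGGAGGCGGAGGCGATG", w1, b1, w2, b2))

The point is architectural: each measure alone is a weak signal, and the network's job is to weight and combine them into a more reliable prediction, exactly the role the seven-measure GRAIL network played.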
The development of the sequence analysis engine is part of a rapid paradigm shift in the biological sciences toward much greater use of computation, networking, simulation and modeling, and sophisticated data management systems. Unlike any other existing system in the genomics arena, the sequence analysis engine will link components for data input, analysis, storage, update, and submission in a single distributed high-performance framework designed to carry out a dynamic, continual discovery process over a ten-year period. The approach outlined is flexible enough to use a variety of hardware and data resources; to configure analysis steps, triggers, conditions, and updates; and to maintain and update the description of each genomic region. Users (individuals and large-scale producers) can specify recurrent analysis and data-mining operations that continue for years (a toy sketch of such a loop follows the conclusion). The combined computational process will provide new knowledge about the genome on a scale impossible for individual researchers using current methods. Such a process is absolutely necessary to keep up with the flood of data that will be gathered over the remainder of the Human Genome Project.

Conclusion

High Performance Computing helps modern genetic research by accelerating comparisons of similar genetic structures, and it gives the genetics industry the ability to perform massive, regular updates to genome browser databases so that they incorporate the latest research advances. Research on the genomes of numerous species generates hundreds of thousands of processes per week that must be scheduled and completed as efficiently as possible; HPC makes all this possible in a fraction of the original time frame. The HPC community has its roots in solving computational problems in physics (such as fluid flow, structural analysis, and molecular dynamics), but traditional approaches to those problems, and to ranking HPC systems with the Linpack benchmark, may not be optimal for HPC architectures in computational biology. Many researchers are therefore carefully considering the architectural needs of HPC systems for next-generation biology. New HPC algorithms for biomedical research will require tight integration of computation with database operations and queries, along with the ability to handle new types of queries that depend heavily on irregular, spatial, or temporal locality.
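As promised above, here is a toy Python sketch of the kind of recurrent, fault-tolerant analysis loop the sequence analysis engine implies: new data triggers an analysis step, completed work is recorded, and failures are retried so no results are silently lost. The directory names, the .fasta pattern, and the analyze() placeholder are hypothetical assumptions, not part of any real pipeline.

    import pathlib
    import time

    INCOMING = pathlib.Path("incoming")   # hypothetical drop directory
    DONE = pathlib.Path("analyzed")       # hypothetical results directory
    MAX_RETRIES = 3

    def analyze(seq_file):
        """Placeholder for one analysis step (e.g., feature recognition).
        A real engine would dispatch this to the cluster scheduler."""
        (DONE / (seq_file.name + ".result")).write_text(
            "analyzed " + seq_file.name + "\n")

    def run_forever(poll_seconds=3600):
        """Recurrent loop: poll for new data, analyze it, retry failures."""
        INCOMING.mkdir(exist_ok=True)
        DONE.mkdir(exist_ok=True)
        retries = {}
        while True:
            for seq_file in INCOMING.glob("*.fasta"):
                marker = DONE / (seq_file.name + ".result")
                if marker.exists():
                    continue              # already analyzed
                try:
                    analyze(seq_file)
                except OSError:           # machine/disk trouble: retry later
                    n = retries.get(seq_file.name, 0) + 1
                    retries[seq_file.name] = n
                    if n >= MAX_RETRIES:
                        # record the failure and stop retrying this file
                        marker.write_text("failed after retries\n")
            time.sleep(poll_seconds)      # wait before the next pass

    if __name__ == "__main__":
        run_forever()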
Sources:
1) http://www.platform.com/resources/casestudies/Sanger-CS-08.pdf
2) http://www.ornl.gov/info/ornlreview/v30n3-4/genome.htm
3) http://www.hpc.usu.edu/files/uploads/ACRES/HPC_Larson.pdf
4) https://hpcrd.lbl.gov/SciDAC08/files/presentations/SciDAC_2008_oehmen_final.ppt
5) http://www.archbishopofcanterbury.org/media/image/2/9/archbishop-at-sangerinstitute_large.jpg
6) http://delivery.acm.org/10.1145/1030000/1029523/p34-bader.pdf