Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
BioFilter An architecture for parallel deployment and dynamic chaining of standalone BioInformatics tools. Master’s Thesis Avinash Kewalramani Outline • • • • • • • Motivation. Abstract. Software Patterns and BioFilter. Biofilter architecture. Performance results. Future work. Conclusion. Motivation • • • • • • Most Bioinformatics tools are processor intensive. Few have capabilties to run on multi-processor machines Almost none have capabilty to be run on Beowulf clusters. Bioinformatics analysis is mostly based on multiple tools with their Input/Output chained together. Human intervention to set up this chaining BioFilter provides the above two capabilities – – Easy and fast deployment of such tools on the cluster providing a huge gain in performance Allows creation of dynamic pipelines of different analysis tools as per the requirement. Abstract • • • • • • • • Uses “Pipes and Filter pattern” to setup dynamic pipelines. It uses “Client-Dispatcher-Server” pattern for parallelization. BioFilter is not tool-specific but more of a plug and play environment. Original tools are not altered. So output is consistent with single CPU machine runs Cluster hardware, operating system and queuing system independent Operating system on the cluster should support multiprocessing via forks BioFilter code has been implemented in Object Oriented Perl Cluster should have a shared disk system and Perl installed. Typical Annotation Pipelines Prodom Blocks Prosite Genome Glimmer Blast BlastParser Psort tRNAscan Coils Seg FastaRecordsFile Blast BlastParser Sofware Patterns and BioFilter • • Patterns are based on a three part schema – Context – Problem – Solution Pattern categories – Architectural – Design Pipes and Filters pattern • Context : Processing of data streams • Problem – Specifications and Forces • Avoiding Monolithic system design • Classes which perform single transformation or analysis provide greater flexibilty and reuse. • System can be built by different people on the team. • Dynamic combination of such classes to form pipelines. • Common interface provides transparent use. Pipes and Filters pattern • Solution – Divide the task into several sequential processing steps with the data flowing through the system connecting these steps. – Components • • • • Filters – Active/Passive Pipes – Queues/method calls Data Source -Provider Data Sink -Consumer Pipes and Filter pattern Static structure of a Pull Filter for Blast Pipeline Pipes and Filters Pattern Dynamic structure of a Pull Filter for Blast Pipeline Client-Dispatcher-Server Pattern • Context: A software system integrating a set of distributed servers, with the servers running locally or distributed over a network. • Problem – Specification and Forces • Software system that uses servers distributed over a network must provide a means of communication between them • Separation of Functional/Communication code. • Service providers should be location independent. Client-Dispatcher-Server Pattern • Solution – Components • Client – Carries out domain specific tasks – Requests the dispatcher to set up a communication channel with the server. • Server – Provides set of operations to the clients – Registers with the dispatcher • Dispatcher – Location transparency for the Servers. Client-Dispatcher-Server pattern Static structure for Blast Filter Client-Dispatcher-Server pattern Dynamic structure for Blast Filter BioFilter Architecture Framework • Splitting of the input data set and the database – Database splitting model is cumbersome. – Monolithic shared disk database with input data splitting is easier. – Capacity computing model • Exclusivity of partitions of the input data set • Capability to process one input partition at a time. – Synchronization of results is an issue. • Queuing System – Interaction with queuing system is part of the architecture BioFilter Architecture Static structure-Blast Pipeline BioFilter Architecture Dynamic structure-Blast Pipeline BioFilter Architecture • Crash Recovery – TCP/IP based socket communication is blocking – Timed sockets overcome blocking – Timed waiting is predetermined but dynamically computed dependent on the computational intensity of the job submitted in the communication between components. • Interaction with nodes outside the cluster is possible by manually starting the server or by rsh utilities Filter Variants • Simple Filters. • Split Filters. • Join Filters Concrete Source Split FastaRecordsFile OrfFilter Simple Join Fasta2TblFilter BuildImmFilter Simple GlimmerFilter Split TranslationFilter Performance Results • 240 node cluster, Each node is a Pentium III 1200 Mhz processor • 20GB local disk and memory ranging from 1-2GB. • Shared disk via Netapps disk server. • Speedup Ratio for x nodes= Time to run the job on the initial number of nodes/Time to run the job on x nodes Performance Results Benchmark Test 1 • 1000 protein sequences from Bacteria (287 AA average length) blast against NR database. • Servers ranging from 1-200 • Non-linear increase in performance – Shared Database bottleneck – Single Dispatcher Performance Results Benchmark Test 2 –Data Scalability Performance Results Benchmark Test 3-It’s just not a Blast accelerator Performance Results Benchmark Test 3 Future Work • Tests required to analyse and cure the performance bottleneck. • Forking the pipeline for added parallelism, requires a queue type of datastructure Conclusion Advantages •Eliminate intermediate files, provides filter reuse and rapid prototyping of pipelines by filter recombination and exchange •Provides an environment for easy deployment of tools which are embarrasingly parallel on the Beowulf clusters. •Development time and Run time preformance gain. •Encapsulates the queuing system information in the implementation. •Interaction with nodes outside the cluster. •Provides exchangeabilty of servers, location and migration transperancy, reconfiguration of servers and fault tolerance Conclusion Disadvantages • Speed increase not proportional to the number of nodes. •Absence of a GUI based interface. • Architecture not tested with tools where the input data cannot be split. •Error handling is very difficult. •Pipeline disaster recovery is difficult. •Client-Dispatcher-Server pattern makes the architecture slower. Thanks and Possibilities • Advisors – Dr Sun Kim (Indiana University-School of Informatics) – Thomas Brettin (Los Alamos National Labs) • Clients or Open Source. – California Digital. – Rocketcalc. – Microway. – Western Scientific. References • • • • • • • • • • • • • • S. Salzberg, A. Delcher, S. Kasif, and O. White.,"Microbial gene identification using interpolated Markov models" Nucleic Acids Research 26:2 (1998), 544-548. A.L. Delcher, D. Harmon, S. Kasif, O. White, and S.L. Salzberg., "Improved microbial gene identification with GLIMMER" Nucleic Acids Research, 27:23, 4636-4641. Lowe, T.M. & Eddy, S.R. (1997) ``tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence'', Nucl. Acids Res., 25, 955-964. Altschul, S.F.,et al., "Basic local alignment search tool.". J Mol Bio. 215 403-410(1990) Nakai K, Kanehisa M.,"Expert system for predicting protein localization sites in gram-negative bacteria." Proteins. 1991;11(2):95-110. Lupas, A., Van Dyke, M., and Stock, J.,"Predicting Coled Coils from Protein Sequences", Science 252:1162-1164. Wootton, J. C. and S. Federhen (1993)., "Statistics of local complexity in amino acid sequences and sequence databases.", Computers in Chemistry 17:149-163. Wootton, J. C. and S. Federhen (1996).," Analysis of compositionally biased regions in sequence databases. Methods in Enzymology 266: 554-571. Servant F, Bru C, Carrère S, Courcelle E, Gouzy J, Peyruc D, Kahn D (2002) ProDom: Automated clustering of homologous domains. Briefings in Bioinformatics. vol 3, no 3:246-251 J.G. Henikoff, E.A. Greene, S. Pietrokovski & S. Henikoff, "Increased coverage of protein families with the blocks database servers", Nucl. Acids Res. 28:228-230 (2000). S.Henikoff, J.G.Henikoff & S. Pietrokovski, "Blocks+: A non-redundant database of protein alignment blocks derived from multiple compilations", Bioinformatics 15(6):471-479 (1999). Sigrist C.J., Cerutti L., Hulo N., Gattiker A., Falquet L., Pagni M., Bairoch A., Bucher P., "PROSITE: a documented database using patterns and profiles as motif descriptors." Brief Bioinform. 3:265-274(2002). Mark Grand. "Patterns in Java .Volume 1" ,Second Edition,Wiley Publication Inc ,2002 Frank Buschmann, Regine Meunier, Hans Rohnert, Peter Sommerlad,Michael Stal."Pattern Oriented Software Architecture Volume 1:A system of Patterns" Chicester,England:John Wiley and Sons, 1996.