Current Bioinformatics, 2008, 3, 000-000

Web & Grid Technologies in Bioinformatics, Computational and Systems Biology: A Review

Azhar A. Shah1, Daniel Barthel1, Piotr Lukasiak2,3, Jacek Blazewicz2,3 and Natalio Krasnogor*,1

1 School of Computer Science, University of Nottingham, Jubilee Campus, NG81BB, Nottingham, UK
2 Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
3 Institute of Bioorganic Chemistry, Laboratory of Bioinformatics, Polish Academy of Sciences, Noskowskiego 12/14, 61704 Poznan, Poland

Abstract: The acquisition of biological data, ranging from molecular characterization and simulations (e.g. protein folding dynamics), to systems biology endeavors (e.g. whole organ simulations) all the way up to ecological observations (e.g. to ascertain climate change’s impact on the biota), is growing at unprecedented speed. The use of computational and networking resources is thus unavoidable. As the datasets become bigger and the acquisition technology more refined, the biologist is empowered to ask deeper and more complex questions. These, in turn, drive a runoff effect in which large research consortia emerge that span organizational and national boundaries. Thus the need for reliable, robust, certified, curated, accessible, secure and timely data processing and management becomes entrenched within, and crucial to, 21st century biology. Furthermore, the proliferation of biotechnologies and advances in biological sciences have produced a strong drive for new informatics solutions, both at the basic science and technological levels. The unprecedented situation of dealing, on the one hand, with (potentially) exabytes of data, much of which is noisy or carries large experimental errors or theoretical uncertainties, and, on the other hand, with large quantities of data that require automated, computationally intensive analysis and processing, has produced important innovations in web and grid technology.
In this paper we trace these technological changes in Web and Grid technology, including details of emerging infrastructures, standards, languages and tools, as they apply to bioinformatics, computational biology and systems biology. A major focus of this technological review is to collate up-to-date information regarding the design and implementation of various bioinformatics Webs, Grids, Web-based grids or Grid-based webs in terms of their infrastructure, standards, protocols, services, applications and other tools. Besides surveying the current state-of-the-art, this review also provides a road map for future research and open questions.

Keywords: Semantic web, web services, web agents, grid computing, middleware, in silico experiments.

1. INTRODUCTION

‘The impact of computing on biology can fairly be considered a paradigm change as biology enters the 21st century. In short, computing and information technology applied to biological problems is likely to play a role for 21st century biology that is in many ways analogous to the role that molecular biology has played across all fields of biological research for the last quarter century and computing and information technology will become embedded within biological research itself’ [1].

As an illustration of the conclusion quoted above regarding the embedding of computing and information technology (IT) in biological research, one can look at the current state-of-the-art in web and grid technologies as applied to bioinformatics, computational biology and systems biology. The World Wide Web, or simply the web, has revolutionized the field of IT and related disciplines by providing information-sharing services on top of the internet.
Similarly, grid technology has revolutionized the field of computing by providing location-independent resource-sharing services covering computational power, storage, databases, networks, instruments, software applications and other computer-related hardware equipment. These information and resource-sharing capabilities of web and grid technologies could upgrade a single-user computer into a global supercomputer with vast computational power and storage capacity. Such a web and grid-enabled global supercomputer would be a strong candidate for use in resource-hungry computing domains. For example, it could be used to efficiently solve complex calculations such as parameter sweep scenarios with Monte Carlo simulation and modeling techniques, which would normally require several decades of execution time on a traditional single desktop processor.

*Address correspondence to this author at the ASAP Group, School of Computer Science, University of Nottingham, Jubilee Campus, NG81BB, Nottingham, UK; Tel: +44 115 8467592; Fax: +44 115 8467066
1574-8936/08 $55.00+.00 © 2008 Bentham Science Publishers Ltd. Current Bioinformatics, 2008, Vol. 3, No. 1

A quick look at the literature reveals that web and grid technologies are continuously being taken up by the biological community as an alternative to traditional monolithic high performance computing, mainly because of the inherent nature of biological resources (distributed, heterogeneous and CPU intensive) and the smaller financial costs, better flexibility, scalability and efficiency offered by the web and grid-enabled environment. An important factor that justifies the ever-growing use of web and grid technologies in life sciences is the continuous and rapid increase in biological data production. About eight years ago it was estimated that the amount of information produced by a single gene laboratory could be as high as 100 terabytes, which is equivalent to about 1 million encyclopedias [2]. Besides being extensive, these data also differ in terms of storage and access technologies and are dispersed throughout the world. Currently, there are no uniform standards, or at least none that have been properly adopted by the biological community as a whole, to deal with the diverse nature, type, location and storage formats of these data. On the other hand, in order to obtain the most comprehensive and competitive results, a biologist may in many situations need to access several different types of data, which are publicly available in more than 700 [3] biomolecular databases. One way to handle this situation would be to convert the required databases into a single format and then store them on a single storage device with extremely large capacity. Considering the tremendous size and growth of the data, this solution seems infeasible, inefficient and very costly. The application of Web and Grid technology provides an opportunity to standardize access to these data in an efficient, automatic and seamless way through the use of appropriate DataGrid middleware technologies. These technologies include grid middleware specific Data Management Services (DMS) and distributed storage environments such as Open Grid Services Architecture Data Access and Integration (OGSA-DAI) (http://www.ogsadia.org) with Distributed Query Processing (OGSA-DQP), Storage Resource Broker (SRB) [4] and IBM DiscoveryLink [5] middleware.

Furthermore, it is also very common for a typical biological application that involves very complex analysis of large-scale datasets and other simulation-related tasks to demand high-throughput computing power in addition to seamless access to very large biological datasets. The traditional approach to this problem was to purchase extremely costly special-purpose supercomputers or dedicated clusters.
This type of approach is both costly and somewhat limited, as it locks in the type of computing resources. Another problem associated with this approach is the poor utilization of very costly resources: once a particular application finishes its execution, the resources may remain idle. Grid technology, on the other hand, provides a more dynamic, scalable and economical way of achieving as much computing power as needed through computational grid infrastructures connected to a scientist’s desktop machine. There are many institutional, organizational, national and international Data/Computational/Service Grid testbeds and well-established production grid environments, which provide these facilities free of charge to their respective scientific communities. Some of these projects include Biomedical Research Informatics Delivered by Grid Enabled Services (BRIDGES) (http://www.brc.dcs.gla.ac.uk/projects/), Enabling Grids for E-sciencE project (EGEE) (http://public.eu-egee.org), Biomedical Informatics Research Network project (BIRN) (http://www.nbirn.net), National Grid Service UK (http://www.ngs.ac.uk), OpenBioGrid Japan [6], SwissBioGrid [7], Asia Pacific BioGrid (http://www.apgrid.org), North Carolina BioGrid (http://www.ncbiotech.org), KidneyGrid [8] and the Virtual Laboratory for drug design on World Wide Grid [9]. All these projects consist of an internet-based interconnection of a large number of pre-existing individual computers or dedicated clusters located at various distributed institutional and organizational sites that are part of the consortium.

Some other large-scale bioinformatics grid projects have provided a platform where a biologist can design and run complex in silico experiments by combining several distributed and heterogeneous resources that are wrapped as web services.
Examples of these are myGrid [10, 11], BioMOBY [12-14], Seqhound (http://www.blueprint.org/seqhound) and Biomart (http://www.biomart.org), which allow for the automatic discovery and invocation of many bioinformatics applications, tools and databases, such as the EMBOSS [15] suite of bioinformatics applications and other publicly available services from the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov) and the European Bioinformatics Institute (EBI) (http://www.ebi.ac.uk). These projects also provide special toolkits with the necessary application programming interfaces (APIs), which can be used to transform any legacy bioinformatics application into a web service that can be deployed on their platforms.

The availability of these BioGrid projects brought into sharp focus the need for better user interfaces that give the biologist easier access to these web/grid resources. This has led to the development of various web-based interfaces, portals, workflow management systems, problem solving environments, frameworks, application programming environments, middleware toolkits, and data and resource management approaches, along with various ways of controlling grid access and security. This review attempts to provide an up-to-date, coherent and curated overview of the most recent advances in these technologies as applied to life sciences. The review aims to complement previous reviews in this field such as [16, 17].

The organization of the paper is as follows: section 2 presents an overview of the state-of-the-art in web and grid technologies, focusing on the direction of progress of each technology and how they are becoming part and parcel of life sciences. Section 3 describes the architecture, implementation and services provided by a selection of ‘flagship’ projects.
The idea is to present some success stories with brief architectural details that could provide the basic building blocks to a researcher interested in the further exploration and utilization of web and grid technologies in life sciences. Section 4 presents the analysis of the reviewed literature with a clear indication of certain key open problems within the existing technological approaches and provides a roadmap and open questions for the future. Finally, section 5 provides the concluding remarks.

2. STATE-OF-THE-ART OVERVIEW

Among the many advances that the computational sciences have provided to the life sciences, the proliferation of web and grid technologies is one of the most conspicuous. Driven by the demands of biological research, these technologies have moved from their classical and somewhat static architectures to more dynamic and service-oriented ones. Current development in these technologies is converging towards an integrated and unified Web-based grid service [18] or Grid-based web service environment (Fig. 2). Accompanying this rapid growth, a huge diversity of implementation approaches and deployment routes has been investigated in relation to the use of various innovative web and grid technologies for the solution of problems related to life sciences. This section provides an overview of some of these works through a hierarchical organization as illustrated in Fig. 1.

Fig. (1). Hierarchical organization of the state-of-the-art overview.

2.1. Application of Web Technologies in Life Sciences

As can be observed from Fig. 2, currently there are three main thrusts in the development of web technologies: semantic web, web services and web agents. The continuous growth of these technologies follows a converging path, giving rise to agent-based semantic web services and web portals.
The use and effect of these technologies in relation to life sciences are analyzed in the following three subsections.

Fig. (2). Review of technological infrastructure for life sciences: the classical HTML-based web started in 1991 and the traditional Globus-based grid was introduced by Ian Foster in 1997. With the introduction and development of semantic web, web services and web agents in and after 2001, the new web and grid technologies are converging into a single uniform platform termed the ‘service-oriented autonomous semantic grid’ that could satisfy the needs of HT (high throughput) [19] experimentation in diverse fields of life sciences as depicted above.

2.1.1. Semantic Web Technology

One of the most important limitations of the information shared through classical web technology is that it is only interpretable by humans, which limits the automation required for more advanced and complex life science applications. The basic purpose of semantic web technology is therefore to eliminate this limitation by enabling the machine (computer) to interpret the meaning (semantics) of the information and hence allow artificial intelligence based applications to make decisions autonomously. It does so by adding some important features to the basic information-sharing service provided by classical web technology. These features provide a common format for the interchange of data through standard languages and data models such as XML (eXtensible Markup Language) and RDF (Resource Description Framework), along with several variants of schema and semantic based markup languages such as the Web Ontology Language (OWL) and the Semantic Web Rule Language (SWRL). Each of these models and languages has its own benefits and limitations.
Therefore, a particular model or language that is highly useful for the solution of one problem at some point in time may not be suitable for another problem, or even for the same problem at another point in time. For example, Wang et al. [21] argue that although XML was initially used as a data standard for platform-independent exchange and sharing of data, its basic syntactic and document-centric nature was found to be limiting, especially for the representation of rapidly increasing and diverse ‘omic’ data. Therefore, RDF along with some new variants of OWL such as OWL-Lite, OWL-DL and OWL-Full are currently being adopted for implementations that would free the end-user (biologist) from manually invoking analysis tools and interpreting the partial results needed for the execution of the remaining software components in a complex in silico experiment. Consequently, the semantic web is being used for life sciences in various ways, mainly focusing on data/application integration, data provenance, knowledge discovery, machine learning and mining. For example, Sooho et al. [22] discussed the development of a semantic framework based on publicly available ontologies such as GlycO and ProPreO that could be used for modeling the structure and function of enzymes, glycans and pathways. This framework uses a sublanguage of OWL called OWL-DL [23] to integrate an extremely large (~500MB) and structurally diverse collection of biomolecules. Biological database integration, as discussed here, also encounters the problem of inconsistencies between databases. Therefore, there have also been efforts to provide external semantic-based tools for measuring the degree of inconsistency between different databases. One such effort is discussed by Chen et al. [24]. It describes an ontology-based method to determine if two databases are compatible.
The database compatibility determination is based on the results of semantically matching the reference attributes through a mathematical function. Several practical examples have been demonstrated with encouraging results. This measure can be used as a criterion to decide whether a particular database can be integrated or not, thereby making the process of integration more predictable, efficient and reliable.

The autonomous and uniform integration, invocation and access to biological data and resources provided by the semantic web have also created an environment that supports the use of in silico experiments. Proper and effective use of in silico experiments requires the maintenance of user-specific provenance data, such as a record of the goals, hypotheses, materials, methods, results and conclusions of an experiment. For example, Zhao et al. [25] showcased the design of an RDF-based provenance log for a typical in silico experiment that performs DNA sequence analysis as a part of myGrid [10, 11] middleware services. The authors reported the use of Life Science Identifiers (LSIDs) for achieving location-independent access to distributed data and metadata resources. Similarly, RDF and OWL have been used for associating uniform semantic information and relationships between resources, while Haystack [26], a semantic web browser, has been used for delivering the provenance-based web pages to the end-user. Compared to XML, the RDF model provides a more flexible, graph-based resource description with location-independent resource identification through URIs (Uniform Resource Identifiers).

There are various other significant contributions that illustrate the use of semantic web technology for the proper integration and management of biological data. For example, the Gene Ontology Consortium [27, 28] makes use of semantic web technologies to provide a central gene ontology resource for the unification of biological information. Ruttenberg et al.
[29, 30] used OWL-DL to develop a data exchange format that facilitates integration of biological pathway knowledge. Similarly, Whetzel et al. [31] and Navarange et al. [32] used semantic web to provide a resource for the development of tools for microarray data acquisition and query according to the concepts specified in the Minimum Information About a Microarray Experiment (MIAME) standard [33]. Further information about some other semantic web and ontology based applications and tools for life sciences is presented in Table 1.

Table 1. Semantic-Web and Ontology Based Resources
Semantic Web Based Application/Tool | Full Name and Source | Usage
BioPAX [43, 44] | Biological Pathway Exchange, http://www.biopax.org/ | Data exchange format for biological pathway data
MGED [45, 46] | Microarray for Gene Expression Data, http://www.mged.org/ | Data standard for Systems Biology
TAMBIS [47] | Transparent Access to Multiple Bioinformatics Information Sources, http://img.cs.man.ac.uk/tambis | Biological Data Integration
caCORE SDK [48] | Software Development Kit for cancer informatics, http://ncicb.nci.nih.gov/infrastructure/cacoresdk | Semantically integrated bioinformatics software system
AutoMed Toolkit [49] | AutoMatic Generation of Mediator Tools for Heterogeneous Data Integration, http://www.doc.ic.ac.uk/automed/ | Tools for assisting transformation and integration of distributed data
Gaggle [50] | Gaggle, http://gaggle.systemsbiology.org/docs/ | An integrated environment for systems biology
EcoCyc [51] | Encyclopedia of Escherichia coli K-12 Genes and Metabolism, http://ecocyc.org/ | Molecular catalog of the E. coli cell
SBML [52] | Systems Biology Markup Language, http://sbml.org/index.psp | Computer-readable models of biochemical reaction networks
CellML [53] | Cell Markup Language, http://www.cellml.org/ | Storage and exchange of computer-based mathematical models for biomolecular simulations
OBO [54] | Open Biomedical Ontology, http://obo.sourceforge.net/ | Open source controlled-vocabularies for different biomedical domains
GO [28] | Gene Ontology, http://www.geneontology.org/ | Controlled-vocabulary for genes
GMOD [55] | Generic Model Organism Database, http://www.gmod.org/home | An integrated organism database
PSI-MI [56] | Proteomics Standards Initiative: Molecular Interactions, http://psidev.sourceforge.net/ | Data standard for proteomics

2.1.2. Web Service Technology

Web service technology further extends the capabilities of the classical web and the semantic web by allowing information and resources to be shared among machines even in a distributed heterogeneous environment (such as a grid environment). Therefore, applications developed as web services can easily interoperate with peer applications. Web services are defined through the Web Services Description Language (WSDL) and deployed and discovered through the Universal Description, Discovery and Integration (UDDI) protocol. They can exchange XML-based messages through the Simple Object Access Protocol (SOAP) across different computer platforms. Furthermore, with the introduction of the Web Services Resource Framework (WSRF), web services can now store state information during the execution of a particular transaction. These features have made web services extremely important for the life science domain. Today many life science applications are being developed as web services. For example, the National Center for Biotechnology Information (NCBI) provides a wide range of biological databases and analytical tools as web services, such as all the Entrez e-utilities including EInfo, ESearch, EPost, ESummary, EFetch, ELink, MedLine, and PubMed.
Similarly, the European Bioinformatics Institute (EBI) provides many biological resources as web services, such as SoapLab, WSDbfetch, WSfasta, WSBLast, WSInterProScan and EMBOSS, amongst others. A comprehensive list of all publicly available and accessible biological web services developed by different organizations, institutions and groups can be found at the myGrid website (http://taverna.sourceforge.net).

All these web services can be used as part of complex application-specific programs. IBM provides WebSphere Information Integrator (WS II) as an easy way for developers to integrate individual web service components into large programs. As an example, North Carolina BioGrid (NC BioGrid) [34], in collaboration with IBM, uses web services to integrate several bioinformatics applications into a high performance grid computing environment. The NC BioGrid also provides a tool (WSDL2Perl) to facilitate the wrapping of Perl-based legacy bioinformatics applications as web services. Other open source projects that provide registry, discovery and use of web services for biosciences include BioMOBY [12-14], myGrid [10, 11] and caBIG (https://cabig.nci.nih.gov/).

The integration and interoperability of distributed and heterogeneous biological resources through web services have also opened an important niche for data mining and knowledge discovery. For example, Hahn et al. [35] introduced web-based reusable text mining middleware services that could be used for medical knowledge discovery. The middleware provides a Java-based API for clients to call searching and mining services. Similarly, Hong et al. [36] used web services for the implementation of a microarray data mining system for drug discovery.
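The XML-based SOAP message exchange mentioned in this subsection can be illustrated with a minimal sketch that uses only the Python standard library. The service namespace, the operation name `fetchSequence` and its parameters are hypothetical and do not correspond to any NCBI or EBI service; only the SOAP 1.1 envelope namespace is taken from the standard. A real client would normally be generated from the service's WSDL rather than written by hand.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace (from the SOAP standard).
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical namespace for an illustrative sequence-retrieval service.
SERVICE_NS = "urn:example:sequence-service"


def build_soap_request(operation, params):
    """Return a SOAP 1.1 envelope (bytes) that calls `operation` with `params`."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        child = ET.SubElement(call, f"{{{SERVICE_NS}}}{name}")
        child.text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def read_soap_request(message):
    """Parse an envelope and return (operation, params) found in its Body."""
    root = ET.fromstring(message)
    call = root.find(f"{{{SOAP_NS}}}Body")[0]
    operation = call.tag.split("}", 1)[1]
    params = {child.tag.split("}", 1)[1]: child.text for child in call}
    return operation, params


# Hypothetical operation and parameter names, for illustration only.
request = build_soap_request(
    "fetchSequence", {"database": "uniprot", "accession": "P12345"}
)
```

Because both sides agree only on the XML message format, the sender and receiver can run on different platforms and be written in different languages, which is the interoperability property the text attributes to web services.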
Due to the success of web services in providing flexible, evolvable and scalable architectures with interoperability between heterogeneous applications and platforms, grid middleware is also being transformed from its pre-web service versions to new ones based on web services. There are several initiatives in this direction: Globus [37], an open source grid middleware, has adopted a web service architecture in its current version, Globus Toolkit 4; the EGEE project is moving from its pre-web service middleware LCG2 [38] to a new web service based middleware named gLite [39]; and similarly, the Imperial College E-science Network Infrastructure (ICENI) [40] and myGrid are also adopting web services through the use of Jini, OGSI, JXTA and other technologies [41, 42].

2.1.3. Agent-Based Semantic Web-services

Agents are described as software components that exhibit autonomous behavior and are able to communicate with their peers in a semantically defined high-level language such as FIPA-ACL (Foundation for Intelligent Physical Agents Agent Communication Language). Since the main focus of agent technology is to enable software components to perform certain tasks on behalf of the user, it relates to and supports the goals of web service technology. Hence the two technologies have started converging towards the development of more autonomous web services that exhibit the behavior of both web services and agents.

There have been many attempts to use agents in bioinformatics, computational biology and systems biology. For example, Merelli et al. [57] reported the use of agents for the automation of bioinformatics tasks and processes such as phylogenetic analysis of diseases, protein secondary structure prediction, stem cell analysis, and simulation, among others.
In their paper, the authors also highlight some key open challenges in agent research: analysis of mutant proteins, laboratory information management systems (LIMS), cellular process modeling, and formal and semiformal methods in bioinformatics. Similarly, the use of mobile agents for the development of a decentralized, self-organizing peer-to-peer grid computing architecture for computational biology has been demonstrated in [66], which we have selected as one of our case studies and briefly describe in section 3.2. Luc et al. [58] suggested the use of agents in the myGrid [10, 11] middleware in order to best fit the ever-dynamic and open nature of biological resources. In particular, the authors propose the use of agents for ‘personalization, negotiation and communication’. The personalization agent can act on behalf of the user to automatically provide certain preferences, such as the selection of preferred resources for a workflow-based in silico experiment, and other user-related information. The user-agent can store these preferences and other user-related information from previously conducted user activities, thus freeing the user from tedious repetitive interactions. The user-agent could also provide a point of contact for notification and other services requiring user interaction during the execution of a particular experiment. Other experiences related to the use of agents for biological data management and annotation have also been discussed in [59, 60].

2.2. Application of Grid Technologies in Life Sciences

Since its inception, the main focus of grid technology has been to provide a platform-independent, global and dynamic resource-sharing service in addition to co-ordination, manageability, and high performance. In order to best satisfy these goals, its basic architecture has undergone substantial changes to accommodate other emergent technologies. As shown in Fig.
2, the grid has moved from its initial static and pre-web service architecture to a more dynamic Web Services Resource Framework (WSRF) based Open Grid Services Architecture (OGSA) [61]. This architecture combines existing grid standards with emerging Service Oriented Architectures (SOAs) and web technologies in order to provide an innovative grid architecture known as the service-oriented semantic grid. The main characteristic of this service-oriented semantic grid is that it maintains intelligent agents that act as software services (grid services) capable of performing well-defined operations and communicating with peer services through uniform standard protocols. This paradigm shift in the grid’s architecture is deemed to have a significant impact on its usability for bioinformatics, computational biology and systems biology. The impact is also evident from the very large number of published contributions that report experiences of using grid technologies in these domains. We have tried to summarize the findings of some of these contributions under the relevant categories of a typical BioGrid infrastructure. Some important components of a typical BioGrid infrastructure are shown in Fig. 3, and are briefly described in the following ten subsections.

2.2.1. Grid-Enabled Applications and Tools

The actual process of developing (writing the code), deploying (registering, linking and compiling), testing (checking the results and performing debugging if necessary) and executing (scheduling, coordinating and controlling) an application in a grid-based environment is far from trivial. The difficulties faced by developers arise mainly because traditional software development tools and techniques are unable to support the development of virtual applications or workflows whose components can run on multiple machines within a heterogeneous and distributed environment such as a grid [62].
Despite these difficulties, there are several grid-enabled applications for life sciences [17], mostly developed by using either standard languages such as Java along with message passing interfaces (e.g. MPICH/G) or web services. For example, Jacq et al. [63] reported the deployment of various bioinformatics applications on the European Data Grid (EDG) testbed project. One of the deployed applications was PhyloJava, a GUI-based application that calculates the phylogenetic trees of a given genomic sequence using the fastDNAml [64] algorithm. This algorithm uses bootstrapping, which is a reliable albeit computationally intensive technique that calculates the consensus from a large number of repeated individual tree calculations (about 500-1000 repeats). The gridification of this application was carried out at a granularity of 50 for a total of 1000 independent sequence comparison jobs (20 independent packets of 50 jobs each), and the individual job results were then merged to obtain the final bootstrapped tree. The selection of an appropriate granularity depends on the overall performance: highly parallelized jobs can be hampered by resource brokering and scheduling times, whereas poorly parallelized jobs would not give a significant CPU time gain [63]. The execution of this gridified application on the EDG testbed required the installation of a Globus [37] based EDG user interface on a Linux RedHat machine, the use of the Job Description Language (JDL) and the actual submission of the parallel jobs through the Java Jobs Submission Interface (JJSI). It is reported that the gridified execution of this application provided a 14-fold speedup compared with a non-grid standalone execution. The deviation from the ideal gain (a speedup of 20) is considered to be the effect of network and communication overhead (latencies).
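The packet arithmetic in the PhyloJava example can be checked with a short sketch. It assumes, as the text implies, that the ideal speedup equals the number of packets (each packet running concurrently on its own node with no overhead); the function names are illustrative and are not part of the EDG tooling described in [63].

```python
def make_packets(n_jobs, granularity):
    """Group job indices 0..n_jobs-1 into packets of at most `granularity` jobs."""
    return [
        list(range(start, min(start + granularity, n_jobs)))
        for start in range(0, n_jobs, granularity)
    ]


def parallel_efficiency(observed_speedup, ideal_speedup):
    """Fraction of the ideal speedup that was actually achieved."""
    return observed_speedup / ideal_speedup


# Figures reported in the text: 1000 bootstrap jobs at a granularity of 50,
# with an observed speedup of 14.
packets = make_packets(1000, 50)
ideal_speedup = len(packets)  # one packet per node, no overhead (assumption)
efficiency = parallel_efficiency(14, ideal_speedup)
```

With these figures the sketch yields 20 packets and an efficiency of 0.7, i.e. about 30% of the ideal gain is lost, which the text attributes to network and communication latencies. A smaller granularity would raise the ideal speedup but also increase the number of jobs subject to brokering and scheduling delays.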
Similarly, the gridification of other applications, such as a grid-enabled bioinformatics portal for protein sequence analysis, a grid-enabled method for securely finding unique sequences for PCR primers, and a grid-enabled BLAST for orthology rules determination, has also been discussed in [63] with successful and encouraging results. A brief description of some other grid-enabled applications is presented in Table 2. There may still be many other legacy applications that could take advantage of grid-based resources; however, the migration of these applications to a grid environment requires more sophisticated tools than are currently available [65].

Fig. (3). Major components of a generic BioGrid infrastructure: application layer services help the user to design, execute, monitor and visualize the output of grid-enabled applications; middleware services provide access and management services for the use of physical layer resources; each site at the physical layer usually has a pool of compute elements managed by a local resource manager system (LRMS) such as Sun Grid Engine (SGE), Portable Batch System (PBS), Load Sharing Facility (LSF) or Condor (see subsections 2.2.1 to 2.2.10 for further details).

2.2.2. Grid-based BioPortals

As mentioned above, the actual process of developing and deploying an individual application on a grid requires a significant level of expertise and a considerable period of time. This issue hinders the usage of available grid infrastructures. Therefore, in order to enhance the use of different grid infrastructures, some individuals, groups, institutions and organizations have started to provide the most frequently used and standard domain-specific resources as grid-enabled services, which can be accessed by any (or any authenticated) researcher through a common browser-based single point of access, without the need to install any additional software.
In this context, a grid portal is considered to be an extended web-based application server with the necessary software capabilities to communicate with the backend grid resources and services [66]. This type of environment provides a full level of abstraction and makes it easy to exploit the potential of the grid seamlessly. Grid-based portals are normally developed using publicly available grid portal construction toolkits such as GridPort Toolkit [67, 68], NinF Portal Toolkit [66], GridSphere (http://www.gridsphere.org) and IBM WebSphere (http://www-306.ibm.com/software/websphere). Most of these toolkits follow the Java portlet specification (JSR 168) standard and thus make it easy for the developer to design the portal front-end and connect it to the backend resources through middleware services. For example, GridSphere [66] enables the developer to specify the requirements of the portal front-end (e.g. authentication, user interaction fields, job management, resources etc.) in an XML-based file, from which a JSP file is automatically generated (through a Java-based XML parser) that provides an HTML-based web page for the front-end. Similarly, it provides some general-purpose Java Servlets that can communicate with grid-enabled backend applications and resources through a Globus-based GridRPC mechanism. The toolkit also helps the developer with the gridification of the applications and resources needed at the backend. It is because of this ease of creating grid-based portals that [69] claims that ‘portal technology has become critical for future implementation of the bioinformatics grids’. Another example is the BRIDGES (http://www.brc.dcs.gla.ac.uk/projects/bridges) project, which provides portal-based access to many biological resources (federated databases, analytical and visualization tools etc.) distributed across all the major UK centers, with an appropriate level of authorization, convenience and privacy.
It uses IBM WebSphere based portal technology because of its ‘versatility and robustness’. The portal provides a separate workspace for each user, which the user can configure as required; the configuration settings are stored using session management techniques. This type of environment can help in many important fields of life sciences, such as exploratory genetics, which leads towards the understanding of complex disease phenotypes such as heart disease, addiction and cancer on the basis of the analysis of data from multiple sources (e.g. model organisms, clinical drug trials and research studies). Similarly, [70] presents another instance of a system that is easy to use, scalable and extensible, and that provides, among other things, secure and authenticated access to standard bioinformatics databases and analysis tools such as nucleotide and protein databases, BLAST [71], CLUSTAL [72] etc. A common portal engine was developed with reusable components and services from the Open Grid Computing Environment Toolkit (OGCE) [73], which combines the components of different individual grid portal toolkits. This common portal engine was integrated with a biological application framework by using PISE [74] (a web interface generator for molecular biology). The portal provides access to around 200 applications related to molecular biology and also provides a way to add any other application through the description of a simple XML-based file.

Table 2. Grid-Enabled Applications

GADU/GNARE [75]
Task: Genome Analysis and Database Update
Source: http://compbio.mcs.anl.gov
Grid middleware and application-level tools, services and languages: Globus Toolkit and Condor/G for distributing DAG-based workflows; GriPhyN Virtual Data System for workflow management; user interface to standard databases (NCBI, JGI etc.) and analysis tools (BLAST, PFAM etc.)
Effect of gridification: Analysis of 2314886 sequences on a single 2 GHz CPU can take 1061 days. A grid with an average of 200 nodes took only 8 days and 16 hours for the above task.

MCell [76]
Task: Computational biology simulation framework based on Monte Carlo algorithm
Source: http://www.mcell.cnl.salk.edu/
Grid middleware and application-level tools, services and languages: Globus GRAM, SSH, NetSolve, PBS for remote job starting/monitoring; GridFTP and scp for moving application data to the grid; Java-based GUI, relational database (Oracle), adaptive scheduling
Effect of gridification: A typical r_disk MCell simulation on a single 1.5 GHz CPU can take 329 days. A grid with an average of 113 dual-CPU nodes took only 6 days and 6 hours for the above task.

Grid Cellware [77]
Task: Modeling and simulation for systems biology
Source: http://www.cellware.org
Grid middleware and application-level tools, services and languages: Globus, Apache Axis, GridX-Meta Scheduler; GUI-based job creation editor; jobs mapped and submitted as web services
Effect of gridification: Different stochastic (Gillespie, Gibson etc.), deterministic (Euler Forward, Runge–Kutta) and MPI-based swarm algorithms have been successfully implemented in a way that distributes their execution on grid nodes.

2.2.3. BioGrid Application Development Toolkits

Although some general-purpose grid toolkits such as Globus, COSM (http://www.mithral.com/projects/cosm) and GridLab (http://www.gridlab.org) provide certain tools (APIs and run-time environments) for the development of grid-enabled applications, they are primarily aimed at the provision of the low-level core services needed for the implementation of a grid infrastructure. It is therefore difficult and time consuming for an ordinary programmer to go through the actual process of developing and testing a grid-enabled application using these toolkits; instead, there are some simulation-based environments, such as EDGSim (http://www.hep.ucl.ac.uk/~pac/EDGSim), the extensible grid simulation environment [78], GridSim [79] and GridNet [80], that can be used at the initial design and verification stage. Different application domains require specific sets of tools that make the life-cycle of grid-enabled application development (development, deployment, testing and execution) more convenient and efficient. One such proposal, for the development of a ‘Grid Life sciences Application Developer (GLAD)’, was presented in [81]. This publicly available toolkit works on top of ALiCE (Adaptive scaLable Internet-based Computing Engine), a lightweight grid middleware, and provides a Java-based grid application programming environment for life sciences. It provides a list of commonly used bioinformatics algorithms and programs as reusable library components, along with other software components needed for interacting (fetching, parsing etc.) with remote, distributed and heterogeneous biological databases. The toolkit also assists in the implementation of task-level parallelism (by providing an effective parallel execution control system) for algorithms and applications ranging from those with regular computational structures (such as database searching applications) to those with irregular patterns (such as phylogenetic trees) [82]. Certain limitations of GLAD include the non-conformance of ALiCE with the OGSA standard and the use of socket-based data communication, which might not be suitable for performance-critical applications. Another grid application development toolkit for bioinformatics, which provides a high-level user interface with a problem solving environment related to biomedical data analysis, has been presented in [83]. The toolkit provides a Java-based GUI that enables the user to design a Directed Acyclic Graph (DAG) based workflow by selecting from a variety of bioinformatics tools and data (wrapped as Java-based JAX-RPC web services).
The toolkit also enables the user to assign appropriate dependencies and relationships among the components of the workflow. Once the workflow is submitted to the grid, the scheduler uses the information from these dependencies and relationships to determine the order of execution of the individual components.

2.2.4. Grid-based Problem Solving Environments

The grid-based Problem Solving Environment (PSE) is another way of providing a higher-level interface, such as a graphical user interface or web interface, to an ordinary user, so that he/she can design, deploy and execute any grid-enabled application related to a particular class of a specific domain and visualize the results without knowing the underlying architectural and functional details of the backend resources and services. In fact, a grid-based PSE brings grid application programming to the level of drawing: instead of writing the code and worrying about compilation and execution, the user can simply use the appropriate GUI components provided by the PSE to compose, compile and run the application in a grid environment. PSEs are developed using high-level languages such as Java and are designed to transform the user-designed/modeled application into an appropriate script (distributed application or web service) that can be submitted to a grid resource allocation and management service for execution; on completion of the execution, the results are displayed through an appropriate visualization mechanism. There are several different grid-based PSEs available for bioinformatics applications. For example, Cannataro et al. [84] described the design and architecture of a PSE (Proteus) that provides an integrated environment for biomedical researchers to search, build and deploy distributed bioinformatics applications on computational grids.
The PSE uses a semantic ontology (developed in the DAML+OIL language (http://www.daml.org)) to associate essential meta-data, such as goals and requirements, with three main classes of bioinformatics resources: data sources (e.g. the SwissProt and PDB databases), software components (e.g. BLAST, SRS, Entrez and EMBOSS, an open source suite of bioinformatics applications for sequence analysis) and tasks/processes (e.g. sequence alignment, secondary structure prediction and similarity comparison). This information is stored in a meta-data repository. The data sources are specified on the basis of the kind of biological data, its storage format and the type of the data source. Similarly, the components and tasks are modeled on the basis of the nature of the tasks, the steps and the order in which the tasks are to be performed, the algorithm used, the data source, the type of the output etc. On the basis of this ontology, the PSE provides a dictionary (knowledge base) of data and tool locations, allowing users to compose their applications as workflows that make use of all the necessary resources without worrying about their underlying distributed and heterogeneous nature. The modeled applications are automatically translated into grid execution scripts corresponding to GRSL (Globus Resource Specification Language) and are then submitted for execution on the grid through GRAM (Globus Resource Allocation Manager). The performance of this PSE was checked with a simple application that used TribeMCL (http://www.ebi.ac.uk/research/cgg/tribe) for clustering human protein sequences, which were extracted from the SwissProt database using the seqret program of the EMBOSS suite and compared all against all for similarity through the BLAST program. In order to take advantage of the grid resources and enhance the performance of the similarity search process, the output of the seqret program was split into three separate files so that three instances of BLAST could run in parallel.
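The splitting step can be sketched in Python as follows. This is only an illustration: the source does not say how Proteus divided the seqret output, so the round-robin strategy and the function names `read_fasta_records` and `split_round_robin` are assumptions made for the example.

```python
def read_fasta_records(fasta_text: str) -> list[str]:
    """Return one string per FASTA record (header line plus sequence lines)."""
    records, current = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">"):          # a header line starts a new record
            if current:
                records.append("\n".join(current) + "\n")
            current = [line]
        elif line.strip() and current:    # sequence line of the current record
            current.append(line)
    if current:
        records.append("\n".join(current) + "\n")
    return records


def split_round_robin(records: list[str], n_parts: int) -> list[str]:
    """Deal records into n_parts chunks so that the chunks have similar sizes."""
    parts = [[] for _ in range(n_parts)]
    for index, record in enumerate(records):
        parts[index % n_parts].append(record)
    return ["".join(part) for part in parts]


example = ">a\nACGT\n>b\nTTGA\n>c\nGGCC\n>d\nCCAA\n"
for chunk in split_round_robin(read_fasta_records(example), 3):
    print(repr(chunk))
```

Each chunk would then be passed to its own BLAST instance. Round-robin dealing balances the number of sequences per chunk but not their lengths, which may matter when sequence lengths vary widely.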
The individual BLAST outputs were concatenated and transformed into the Markov matrix required as input for TribeMCL. Finally, the PSE displayed the results of the clustering in a suitable visualization format. It was observed that the total clustering process on the grid took 11h50’53’’ as compared to 26h48’26’’ on a standalone machine. It was also noted, on the basis of another experimental case (taking just 30 protein sequences for clustering), that the data extraction and result visualization steps in the clustering process are nearly independent of the number of protein sequences (i.e. approximately the same time was observed for all protein sequences and for 30 protein sequences). Furthermore, the BLAST computation takes more time than TribeMCL (for the all-against-all case BLAST took 8h50’13’’ while TribeMCL took 2h50’28’’) [17]. Another PSE for bioinformatics has been proposed in [85]. It uses Condor/G to implement a PSE that provides an integrated environment for developing component-based workflows from commonly used bioinformatics applications and tools such as Grid-BLAST, Grid-FASTA, Grid-SWSearch, Grid-SWAlign and Ortholog-Picker. Condor/G extends Condor to the grid via Globus: it combines the inter-domain resource management protocols of the Globus Toolkit with the intra-domain resource management methods of Condor to provide computation management for multi-institutional grids. The choice of Condor/G is justified on the basis of its low implementation overhead compared with other grid technologies. The implementation of a workflow-based PSE is made simple by the special functionality of the Condor meta-scheduler DAGMan (Directed Acyclic Graph Manager), which supports the cascaded execution of programs in a grid environment. The developed prototype model was tested by the integrated (cascaded) execution of the above-mentioned sequence search and alignment tools in a grid environment.
In order to enhance efficiency, the sequence databases and queries were split into as many parts as there were available nodes, and the independent tasks were executed in parallel.

2.2.5. Grid-Based Workflow Management Systems

As discussed in the context of PSEs, a workflow is a way of composing an application by specifying the tasks and their order of execution. A grid-based workflow management system provides all the necessary services for the creation and execution of the workflow and for the visualization of its status and results in a seamless manner. These features make workflows ideal for the design and implementation of life science applications that consist of multiple steps and require integrated access to, and execution of, various data and application resources. Therefore, one can find various domain-specific efforts for the development of proper workflow management systems for life sciences (Table 3). There have been several important demonstrations of different types of life science applications on grid-based workflow management systems. For example, the design and execution of a tissue-specific gene expression analysis experiment for human has been demonstrated in a grid-based workflow environment called ‘WildFire’ [86]. The workflow takes as input 24 compressed GenBank files corresponding to the 24 human chromosomes and, after decompression, performs exon extraction (through the exonx program) on each file in parallel, resulting in 24 FASTA files. To make the parallel tasks finer-grained, each FASTA file is split into five sub-files (through the dice script, developed in Perl), making a total of 120 small files ready for parallel processing with BLAST. Each file was matched against a database of transcripts (‘16,385 transcripts obtained from Mammalian Gene Collection’).
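The dependency structure of this workflow can be made concrete with a small Python sketch (Python 3.9 or later, for `graphlib`). It is not how WildFire itself works, since WildFire maps workflows into GEL scripts; the task names and the helper functions `build_workflow` and `execution_waves` are hypothetical. The sketch only shows how a scheduler could derive, from the declared dependencies, which tasks may run concurrently.

```python
from graphlib import TopologicalSorter


def build_workflow(n_chromosomes: int = 24, n_splits: int = 5) -> dict[str, set[str]]:
    """Map every task name to the set of tasks it depends on."""
    deps: dict[str, set[str]] = {}
    for c in range(1, n_chromosomes + 1):
        deps[f"decompress_{c}"] = set()
        deps[f"extract_exons_{c}"] = {f"decompress_{c}"}
        deps[f"split_{c}"] = {f"extract_exons_{c}"}
        for s in range(1, n_splits + 1):
            deps[f"blast_{c}_{s}"] = {f"split_{c}"}
    return deps


def execution_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; a task appears only after all its prerequisites."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose prerequisites are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(build_workflow())
print([len(wave) for wave in waves])  # [24, 24, 24, 120]
```

The last wave contains the 120 independent BLAST tasks mentioned in the text. A real scheduler would not wait for a whole wave to finish before releasing tasks from the next one; waves are used here only to keep the sketch short.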
The execution of this experiment on a cluster of 128 Pentium III nodes took about 1 hour and 40 minutes, which is reported to be 9 times less than the time required for the execution of its sequential version. The iteration and dynamic capabilities of WildFire have also been demonstrated through the implementation of a swarm algorithm for a parameter estimation problem related to a biochemical pathway model based on 36 unknowns and 8 differential equations.

Table 3. Grid-Based Workflow Management Systems

Wildfire [86]
http://wildfire.bii.a-star.edu.sg/
Supported grid middleware, technologies and platforms: Condor/G, SGE, PBS, LSF; workflows are mapped into Grid Execution Language (GEL) scripts; platform: Windows and Linux
Main features: GUI-based drag-and-drop environment for workflow construction through the EMBOSS suite of tools; supports complex operations such as iteration and dynamic parallelism; open source and extensible

Taverna [87]
http://taverna.sourceforge.net/
Supported grid middleware, technologies and platforms: myGrid middleware; SCUFL language for workflows; workflows are mapped into web services; platform: cross-platform
Main features: GUI-based workbench for creating in silico experiments using the EMBOSS suite of tools, NCBI, EBI, DDBJ, SoapLab, BioMoby and other web services; currently does not support complex operations; open source and extensible

ProGenGrid [93]
http://www.cact.unile.it/projects/
Supported grid middleware, technologies and platforms: Globus Toolkit 4.1; GridFTP and DIME for data transfer; iGrid information service for resource and web service discovery; Java Axis and gSOAP Toolkit
Main features: UML-based editor for workflow construction, execution and monitoring; AutoDock for drug design; RASMOL for visualization

Similarly, the effectiveness of the Taverna [87] workflow system has been demonstrated by the construction of a workflow that provides a genetic analysis of Graves’ disease.
The demonstrated workflow makes use of the Sequence Retrieval System (SRS), a mapping database service and other programs deployed as SoapLab services to obtain information about candidate genes that have been identified through Affymetrix U95 microarray chips as being involved in Graves’ disease. The main function of the workflow was to map a candidate gene to an appropriate identifier in biological databases such as Swiss-Prot and EMBL in order to retrieve the sequence and the published literature about that gene through the SRS and MedLine services. The result of a tBLAST search against the PDB identified some related genes, whereas the information about the molecular weight and isoelectric point of the candidate gene was provided by the Pepstat program of the EMBOSS suite. Similarly, Taverna has also been demonstrated with the successful execution of other workflows for a variety of in silico experiments, such as pathway map retrieval and the tracking of data provenance.

2.2.6. Grid-Based Frameworks

In software development, a framework specifies the required structure of the environment needed for the development, deployment, execution and organization of a software application/project related to a particular domain in an easy, efficient, standardized, collaborative, future-proof and seamless manner. Once they become fully successful and are widely accepted and used, most of these frameworks are also made available as toolkits. There are some general frameworks for grid-based application development, such as Grid Application Development Software (GrADS) [88], Cactus [89] and the IBM Grid Application Development Framework for Java (GAF4J) (http://www.alphaworks.ibm.com/tech/GAF4J).
Similarly, some specific grid-based frameworks for life sciences have also been proposed and demonstrated, such as the Grid Enabled Bioinformatics Application Framework (GEBAF) [90], which proposes an integrated environment for grid-enabled bioinformatics applications using a set of open source tools. The open source tools used in GEBAF include the BioPerl Toolkit [91], Globus Toolkit [37], Nimrod/G [92] and Citrina (a database management tool (http://www.gmod.org/citrina)). The framework provides a portal-based interface that allows the user to submit a query of any number of sequences to be processed with BLAST against publicly available sequence databases. The user options are stored in a hash data structure, a new directory is created for each experiment, and a script using the BioPerl Bio::SeqIO module divides the user query into sub-queries, each consisting of just a single sequence. The distributed query is then submitted through a Nimrod/G plan file for parallel execution on the grid. Each grid node maintains an updated and formatted version of the sequence database through Citrina. The individual output of each sequence query is parsed and concatenated by another script, which generates the summary of the experiment in the form of XML and Comma Separated Value (CSV) files containing the number of most significant hits from each query. The contents of these files are then displayed through the result interface. A particular demonstration for BLAST was carried out with 55,000 sequences against the SwissProt database. With 55,000 parallel jobs, the grid was fully exploited within the limits of its free nodes, and it was observed that the job management overhead was low compared with the actual search time for BLASTing each sequence. Although summarizing thousands of results is somewhat slow and nontrivial, its execution time remains insignificant when compared with the time of the experiment itself.
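The summary step can be illustrated with a short Python sketch. GEBAF itself uses BioPerl scripts, so this is a substitute illustration and not the framework's code; the input format (a mapping from query identifier to a list of hit E-values), the cut-off of 1e-5 and the function name `summarize_hits` are all assumptions made for the example.

```python
import csv
import io


def summarize_hits(results: dict[str, list[float]], evalue_cutoff: float = 1e-5) -> str:
    """Return CSV text with one row per query and its number of significant hits."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["query_id", "significant_hits"])
    for query_id in sorted(results):
        # A hit counts as significant when its E-value is at or below the cut-off.
        count = sum(1 for evalue in results[query_id] if evalue <= evalue_cutoff)
        writer.writerow([query_id, count])
    return buffer.getvalue()


# Hypothetical per-query E-values, standing in for parsed BLAST outputs.
parsed = {"seqB": [1e-30, 0.5], "seqA": [1e-8, 1e-6, 2.0]}
print(summarize_hits(parsed), end="")
```

Because every single-sequence sub-query produces its own output, the summary is a simple reduction over independent results, which is consistent with the observation that its cost stays small relative to the searches themselves.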
The developed scripts were also tested for reusability with other similar applications such as ClustalW and HMMER, with little modification. A web service interface has been proposed for the future development of GEBAF in order to make use of other bioinformatics services such as Ensembl. Similarly, for better data management, the Storage Resource Broker (SRB) middleware is also proposed as a future addition. GEBAF is not the only grid-enabling framework available. For example, Asim et al. [94] described the benefits of using the Grid Application Development Software (GrADS) framework [88] for the gridification of bioinformatics applications such as FASTA [77]. An MPI-based master-slave version of FASTA already existed, but it used a different approach: it kept the reference database at the master side and made the master responsible for the equal distribution of the database to the slaves and for the subsequent collection and concatenation of the results. In contrast, the GrADS-based implementation makes the reference databases (as a whole or in part) available at some or all of the worker nodes through database replication. Thus, the master first sends a message to the workers to load their databases into memory and then distributes the search query and collects the results. This data-locality approach eliminates the communication overhead associated with the distribution of large-scale databases. Furthermore, through the integration of various software development and grid middleware technologies (such as Configurable Object Program, Globus Monitoring and Discovery Service (MDS) and Network Weather Service (NWS)), the GrADS framework provides all the necessary user, application and middleware services for the composition, compilation, scheduling, execution and real-time monitoring of applications on a grid infrastructure in a seamless manner.

2.2.7. BioGrid Data Management Approaches

It is well recognized that most of the publicly available biological data originates from different sources of information, i.e. it is heterogeneous, and is acquired, stored and accessed in different ways at different locations around the world, i.e. it is distributed [95]. The heterogeneity of the data may be syntactic (differences in file formats, query languages, access protocols etc.), semantic (e.g. genomic versus proteomic data) or schematic (differences in the names of database tables and fields etc.). In order for this heterogeneous and distributed data to be accessed in a uniform and federated environment, one has to make use of appropriate web and grid technologies to form an intermediary bridge (Fig. 4). The gridification of biological databases and applications is also motivated by the fact that their number, size and diversity are growing rapidly and continuously. This makes it impossible for an individual biologist to store a local copy of any major database and to execute either data-intensive or compute-intensive applications in a local environment. This impossibility of working locally also calls for the grid-enablement of the resources. However, an important factor that hinders the deployment of existing biological applications, analytical tools and databases on grid-based environments is their inherent pre-grid design (legacy interface): the design suits the requirements of a local workstation environment in terms of input/output. The gridification of such applications requires a transparent mechanism to connect local input/output with grid-based distributed input/output through intermediary tools such as grid middleware specific Data Management Services (DMS) and distributed storage environments. One such example of biological data management in a grid environment has been discussed in [96].
It provides a transparent interface that allows legacy bioinformatics applications, tools and databases to be connected to computational grid infrastructures such as EGEE without any change to the code of these applications. The authors report the use of a modified version of Parrot [97] as a tool to connect a legacy bioinformatics application to the EGEE database management system. The EGEE database management system enables the location and replication of databases needed for the management of very large distributed data repositories. With a Parrot-based connection, the user is freed from the overhead of performing file staging and of specifying an application’s data needs in advance. Instead, an automated agent launched by Parrot takes care of replicating the required data from the remote site and supplying it to the legacy application as if the application were accessing the data with local input/output capabilities. The agent resolves the logical file name to the storage file name, selects the best location for replication and launches the program for execution on the downloaded data. For the purpose of demonstration, the authors report the deployment (virtualization) of some biological databases such as Swiss-Prot and TrEMBL [98] by registering these databases with the replica management service (RMS) of EGEE. Similarly, programs for protein sequence comparison such as BLAST [71], FASTA [99], ClustalW [72] and SSearch [100] have been deployed by registering them with the experiment software management service (ESM).

Fig. (4). Major architectural components of a biological DataGrid environment (reproduced from [6] and annotated).
The deployed programs were run in a grid environment and their access to the registered databases was evaluated by two methods: replication (copying the required database directly to the local disk) and remote input/output (attaching the local input/output stream of the program to the copied data in cache or on-the-fly mode). The evaluation showed that both methods perform similarly in terms of efficiency: for example, on a database of about 500,000 protein sequences (205 MB), each method takes about 60 seconds for downloading from any grid node and about a quarter of this time if the data node is near the worker node. It is important to note, however, that the replication method creates an overhead in terms of free storage capacity on the worker node. This problem may be particularly severe if the database to be replicated is very large or if the worker node has many CPUs sharing the same storage, each accessing a different set of databases. This is why the remote input/output method outweighs replication. However, the actual choice of method depends on the nature of the program (algorithm). For compute-intensive programs such as SSearch, remote input/output is always better (as it works by progressively copying file blocks), whereas for data-intensive programs such as BLAST and FASTA, the replication method may work better [96]. Similarly, there are other DataGrid projects that provide integrated use of these highly heterogeneous and distributed data sources in an easy and efficient way. A biologist could take advantage of specific Data Grid infrastructure and middleware services such as BRIDGES (http://www.brc.dcs.gla.ac.uk/projects/bridges), BIRN (http://www.nbirn.net) and various European Union Data Grid projects, e.g. EU-DataGrid (http://www.edg.org/) and EU-DataGrid for Italy (http://web.datagrid.cnr.it/Tutorial_Rome).
These Data Grid projects make use of standard middleware technologies such as Storage Resource Broker (SRB), OGSA-DAI and IBM DiscoveryLink. Some of the important features of these DataGrid middleware technologies are listed in Table 4.

Table 4. DataGrid Middleware Technologies
(DataGrid Technology: Main Features and Services)

SRB (Storage Resource Broker) [4], http://www.sdsc.edu/srb/
  Hierarchical logical name space for managing distributed heterogeneous databases
  Data transfer, replication, publication, sharing and preservation services
  Callable library functions for application programs
  Runs on Windows, Linux and Unix platforms and is freely available for academic use

OGSA-DAI (Open Grid Services Architecture Data Access and Integration), http://www.ogsadai.org.uk/
  Web service based data access and integration
  Supports two main web service specifications, namely Web Services Interoperability (WS-I) and Web Services Resource Framework (WSRF)
  Runs on all platforms and is freely available as open source

IBM DiscoveryLink [5], http://www-306.ibm.com/software/data/integration/
  Suite of wrappers for relational and non-relational databases
  Federated view of distributed heterogeneous resources
  Runs on Windows, Linux, Unix and IBM AIX platforms and is free for authorized academic use

2.2.8. Computing and Service Grid Middleware

In the same way that a computer operating system provides a user-friendly interface between user and computer hardware, grid middleware provides the services needed for easy, convenient and proper operation of the grid infrastructure. These services include access, authentication, information, security and monitoring services as well as data and resource description, discovery and management services. In order to further reduce the difficulties involved in installing, configuring and setting up grid middleware, there have also been proposals for the development of Grid Virtual Machines (GVM) and Grid Operating Systems [101-105]. The development of specific grid operating systems, or even the embedding of grid middleware in existing operating systems, would greatly boost the use of grid computing in all computer related domains, but this remains to be seen. The important features of currently used computing and service grid middleware are listed in Table 5.

Table 5. Computing and Service Grid Middleware
(Grid Middleware, Goal and Developer: Brief Description of Architecture and Services)

Globus Toolkit GT4 [37], http://www.globus.org/
  Goal: To provide a suite of services for job, data and resource management
  Developer: Argonne National Laboratory, University of Chicago and other partners
  OGSA-WSRF based architecture
  Credential management services (MyProxy, Delegation, SimpleCA)
  Data management services (GridFTP, RFT, OGSA-DAI, RLS, DRS)
  Information and monitoring services (Index, Trigger and WebMDS)
  Resource management services (RSL and GRAM)
  Instrument management services (GTCP)
  Platform: Linux

LCG-2/gLite [39], http://glite.web.cern.ch/glite/
  Goal: Large scale data handling and compute power infrastructure
  Developer: EGEE project in collaboration with VTD, US and other partners
  LCG-2: a pre-web service middleware based on Globus 2
  gLite: an advanced version of LCG-2 based on web services architecture
  Authentication and security services (GSI, X.509, SSL, CA)
  Information and monitoring services (Globus-MDS, R-GMA, GIIS, BDII)
  Data management services (GUID, SURL)
  Resource management services (WMS, DLI)
  Platform: Linux and Windows

UNICORE [108], http://www.unicore.eu/
  Goal: Light weight grid middleware
  Developer: UNICORE: Fujitsu Lab EU; UniGrid: EU funded project
  UNICORE: a pre-web service grid middleware based on the OGSA standard
  UNICORE 6: a web service-based and OGSA compatible advanced version of UNICORE
  Security services (X.509, CA, SSL/TLS)
  Execution management engine (XNJS)
  Platform: Unix/Linux

myGrid [10, 11], http://www.mygrid.org.uk/
  Goal: Service-oriented in silico environment for life sciences
  Developer: ServiceGrid @ University of Manchester, UK
  OGSA and web service-based architecture
  Semantic web-based annotation, ontologies and discovery management
  Talisman: user interface and workbench for in silico experiment design
  Platform: Linux, Windows and Mac

ABCGrid [107], http://abcgrid.cbi.pku.edu.cn/
  Goal: Easily installable and simple grid setup for bioinformatics
  Developer: Center for Bioinformatics, Peking University
  Java based client server implementation with a package of three independent and easy to install programs: ABCUser, ABCMaster and ABCWorker
  Platform: Windows, Linux/Unix, Mac OS X

2.2.9. Local Resource Management System (LRMS)

The grid middleware interacts with different clusters of computers through a Local Resource Management System (LRMS), also known as a Job Management System (JMS). The LRMS (such as Sun Grid Engine, Condor/G and Nimrod/G) is responsible for the submission, scheduling and monitoring of jobs in a local area network environment and for providing the results and status information to the grid middleware through appropriate wrapper interfaces. Some of the important features [106] of commonly used LRMS software are listed in Table 6.

2.2.10. BioGrid Infrastructure

Mostly, BioGrid infrastructure is based on the simple idea of cluster computing and is leading towards the creation of a globally networked, massively parallel supercomputing infrastructure that connects not only the computing units along with their potential hardware, software and data resources, but also expensive laboratory and industrial equipment and ubiquitous sensor devices, in order to provide the unlimited computing power and experimental setup required for modern day biological experiments. Moreover, this infrastructure is also being customized so that it becomes easily accessible from an ordinary general purpose desktop/laptop machine or any type of handheld device. Some of the major components of a generic BioGrid infrastructure are illustrated in Fig. 3. The overall architectural components are organized at three major levels (layers) of services. The focus of application layer services is to provide user-friendly interfaces to a biologist for carrying out the desired grid-based tasks with a minimum number of steps and interactions (enhanced automation and intelligence). Similarly, the focus of grid middleware services is to provide seamless access to, and usability of, distributed and heterogeneous physical layer resources for the application layer services. In the following sections we discuss various contributions related to the development and use of some of these services at both application and middleware level.

The design and implementation of a typical BioGrid infrastructure vary mainly in terms of the availability of resources and the demands of the biological applications that are supposed to use that particular grid.
There are many infrastructures, ranging from institutional/organizational grids consisting of simple PC based clusters or combinations of clusters [109-112] to national and international BioGrid projects with different architectural models, each aimed at appropriate problem handling. The models include Computing Grid architecture (providing basic services for task scheduling, resource discovery, allocation and management etc.), Data Grid architecture (providing services for locating, accessing, integrating and managing data), Service Grid architecture (services for advertising, registering and invoking resources) and Knowledge Grid architecture (services for sharing collaborative scientific published or unpublished data). The infrastructure details of some major BioGrid projects are presented in Table 7. It may be observed that the same infrastructure may be used to serve more than one application model, depending on the availability of additional services and resources.

Table 6. Job Management Systems (Local Resource Manager)
(Local Resource Manager: General features, i.e. platform, GUI and APIs; Job support, i.e. job description, type and MPI support)

Sun Grid Engine 6, http://gridengine.sunsource.net
  Open source, developed by Sun Microsystems
  General features: Platform: Solaris, Apple Macintosh, Linux and Windows; user friendly GUI and portal; DRMAA API; integration with Globus through GE-GT adapter; 5 million jobs on more than 10,000 hosts
  Job support: Shell scripts for job description; supports standard and complex job types with arrays and workflows; provides modules for integration with MPI

Condor-G (version 6.8.2), http://www.cs.wisc.edu/condor
  Open source, developed by University of Wisconsin
  General features: Platform: Solaris, Apple Macintosh, Linux and Windows; web-service interface; DRMAA API; Condor-G is Globus enabled and allows jobs to be sent to other resource managers
  Job support: Classified Advertisements (ClassAds) for job description; supports standard and complex jobs with arrays and workflows; provides modules for integration with MPI

Nimrod-G 3.0.1 [92], http://www.csse.monash.edu.au/~davida/nimrod/nimrodg.htm
  Open source research prototype developed by Monash University; other flavors: Nimrod/O, Netsolve, Active Sheets
  General features: Platform: Linux, Solaris and Mac with x86 and SPARC architecture; provides web portal and API; supports integration with Globus, Legion, Condor, NetSolve and others
  Job support: Nimrod Agent Language for job description; uses GRAM interfaces to dispatch jobs to computers

Table 7. BioGrid Infrastructure Projects
(BioGrid Project: Grid Technologies and Infrastructure; Main Applications)

Asia Pacific BioGrid, http://www.apgrid.org
  Grid technologies and infrastructure: Middleware: Globus 1.1.4; job managers: Nimrod/G, LSF, SGE; Virtual Lab; size: 5 nodes, 25+ CPUs, 5 sites
  Main applications: FASTA, BLAST, SSEARCH, MFOLD, DOCK, EMBASSY, PHYLIB; EMBOSS suite of applications

Open BioGrid Japan (OBIGRID) [6], http://www.obigrid.org
  Grid technologies and infrastructure: Middleware: Globus 3.2; IPv6 for secure communication; VPN over internet for connecting multiple sites; size: 363 nodes, 619 CPUs, 27 sites
  Main applications: Workflow based distributed bioinformatics environment; BLAST search service; genome annotation system; biochemical network simulator

Swiss BioGrid [7], http://www.swissbiogrid.org
  Grid technologies and infrastructure: Middleware: NorduGrid's ARC and GridMP; infrastructure consists of heterogeneous hardware platforms including both clusters and desktop-PC grids
  Main applications: High throughput compound docking into protein structure binding sites; high throughput analysis of proteomics data

Enabling Grids for E-sciencE (EGEE) [113], http://www.eu-egee.org
  Grid technologies and infrastructure: Middleware: gLite; size: more than 30,000 CPUs and 20 petabytes of storage; maintains 20,000 concurrent jobs on average; connects more than 90 institutions in 32 countries worldwide
  Main applications: WISDOM: drug discovery application; GATE: radiotherapy planning and medical tomography application; SiMRI3D: parallel MRI simulator; GPS@: Grid Protein Sequence @Analysis; and other applications

North Carolina BioGrid, http://www.ncbiotech.org
  Grid technologies and infrastructure: Avaki data grid middleware with Virtual File System across grid nodes; heterogeneous standalone and clustered processors with a variety of operating systems and cluster management systems
  Main applications: Bioinformatics datasets and applications installed on native file system and shared across the grid; unified view of data and computers by making them appear local

3. SOME FLAGSHIP BIOGRID PROJECTS

We present here some selected flagship case studies which have elicited a positive public response from bioscientists for their special role and contribution to the life science domain. The description of the most important implementation strategies, along with some main services, is provided with the help of appropriate illustrations.

3.1. EGEE Project

The Enabling Grids for E-science in Europe (EGEE) project evolved as a testbed from its predecessor, the European Data Grid (EDG) project, and completed its first phase during 2004-2006 by providing support to only three scientific applications. The second phase of this project (2006-2008) has evolved from the testbed to a pilot production project that runs more than 20 applications and projects related to different domains of science and industry, ranging from high-energy physics to life science and nanotechnology [96]. For example, BioInfoGrid (http://www.bioinfogrid.eu/), WISDOM (http://wisdom.eu-egee.fr) and EMBRACE (http://www.embracegrid.info) are some of the major biomedical applications that are continuously making use of the EGEE infrastructure. It is during the second phase that the project has been renamed Enabling Grids for E-sciencE in order to widen its scope from the European to the international level [113]. At this time, the overall EGEE infrastructure consists of more than 30,000 CPUs with 20 petabytes of storage capacity, provided by various academic institutes and other organizations and industries around the world in the form of high-speed and high-throughput compute clusters, which are being updated and interoperated through its web-service based, light-weight, more dynamic and inter-disciplinary grid middleware named gLite [39]. Like EGEE, the gLite middleware also builds on a combination of various other projects including LCG-2 (http://cern.ch/LCG), DataGrid (http://www.edg.org), DataTag (http://cern.ch/datatag), Globus Alliance (http://www.globus.org), GriPhyN (http://www.griphyn.org) and iVDGL (http://www.ivdgl.org). A technical description of some of the gLite services is presented in the following sections.

3.1.1. Authentication and Security Services

EGEE uses Grid Security Infrastructure (GSI) for authentication (through digital X.509 certificates) and secure communication (through SSL, the Secure Sockets Layer protocol, with enhancements for single sign-on and delegation). Therefore, in order to use the EGEE grid infrastructure resources, the user first has to register and obtain a digital certificate from an appropriate Certificate Authority (CA).
When the user signs in with the original digital certificate, which is protected with a private key and a password, the system creates another, passwordless temporary certificate called a proxy certificate, which is then associated with every user request and activity.

3.1.2. Information and Monitoring Services

gLite 3 uses Globus MDS (Monitoring and Discovery Service) for resource discovery and status information. Additionally, it uses the Relational Grid Monitoring Architecture (R-GMA) for accounting and monitoring. In order to provide more stable information services, the gLite Grid Index Information Service (GIIS) uses BDII (Berkeley Database Information Index), which stores data in a more stable manner than the original Globus based GIIS.

3.1.3. Data Management Services

As in traditional computing, the primary unit of data management in the EGEE grid is the file [39]. gLite provides a location-independent way of accessing files on the EGEE grid through a Unix-like hierarchical logical file naming mechanism. When a file is registered for the first time on the grid, it is assigned a GUID (Grid Unique Identifier, built on the Universally Unique Identifier (UUID) scheme from a MAC address and a time stamp) and it is bound to an actual physical location represented by a SURL (Storage URL). Once a file is registered on the EGEE grid, it cannot be modified or updated, because the data management system creates several replicas of the file in order to enhance the efficiency of subsequent data access. Updating any single replica would therefore create a data inconsistency problem, which has not yet been solved in EGEE.

3.1.4. Workload Management System (Resource Broker)

The new gLite based workload management system (WMS, or resource broker) is capable of receiving even multiple inter-dependent jobs described in the Job Description Language (JDL). It dispatches these jobs to the most appropriate grid sites (the selection of an appropriate grid site is based on a dynamic match-making process), keeps track of the status of the jobs and retrieves the results when the jobs are finished. While dispatching the jobs, the resource broker uses the Data Location Interface (DLI) service to supply input files along with the job to the worker node.

3.2. Organic Grid: Self Organizing Computational Biology on Desktop Grid

The idea of Organic Grid [114] is based on the decentralized functionality and behavior of self organizing, autonomous and adaptive organisms (entities) in natural complex systems. Examples of natural complex systems include the functioning of biological systems and the behavior of social insects such as ants and bees. The idea of the Organic Grid leads towards a novel grid infrastructure that could eliminate the limitations of traditional grid computing. The main limitation of traditional grid computing lies in its centralized approach. For example, a Globus based computational grid may use a centralized meta-scheduler and would thus be limited to a smaller number of machines. Similarly, desktop grid computing based on a distributed computing infrastructure such as BOINC may use a centralized master/slave approach and would thus be suitable only for coarse-grained independent jobs. The idea of Organic Grid is to provide a ‘ubiquitous’ peer-to-peer grid computing model capable of executing arbitrary computing tasks on a very large number of machines over a network of any quality, by redesigning the existing desktop computing model so that it supports distributed adaptive scheduling through the use of mobile agents. In essence, this means that a user application submitted to this type of architecture would be encapsulated in a mobile agent containing the application code along with the scheduling code.
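The encapsulation idea can be illustrated with a minimal Python sketch. This is not the Organic Grid implementation: the `Machine` and `MobileAgent` classes, the two scoring policies and the relative speed and bandwidth figures are all hypothetical names and values introduced here for illustration, and real agent migration between hosts is deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Machine:
    """Hypothetical description of a peer as seen by an agent."""
    name: str
    cpu_speed: float   # relative speed, higher is faster (assumed unit)
    bandwidth: float   # relative link capacity, higher is better (assumed unit)


# A scheduling policy scores a candidate machine; higher means preferred.
SchedulingPolicy = Callable[[Machine], float]


def prefer_bandwidth(machine: Machine) -> float:
    """Illustrative policy for an application dominated by data transfer."""
    return machine.bandwidth


def prefer_cpu(machine: Machine) -> float:
    """Illustrative policy for an application dominated by computation."""
    return machine.cpu_speed


@dataclass
class MobileAgent:
    """Bundles the application code with the scheduling code."""
    application: Callable[[str], str]
    policy: SchedulingPolicy

    def choose_host(self, candidates: List[Machine]) -> Machine:
        # The agent itself, not a central scheduler, ranks the candidates.
        return max(candidates, key=self.policy)

    def run_on(self, host: Machine, task_input: str) -> str:
        # Migration is not modelled; the payload simply runs locally.
        return f"{host.name}: {self.application(task_input)}"


if __name__ == "__main__":
    peers = [Machine("node-a", cpu_speed=3.0, bandwidth=1.0),
             Machine("node-b", cpu_speed=1.0, bandwidth=5.0)]
    agent = MobileAgent(application=str.upper, policy=prefer_bandwidth)
    print(agent.run_on(agent.choose_host(peers), "acgt"))
```

Because the policy travels with the agent, swapping `prefer_bandwidth` for `prefer_cpu` changes the placement decision without touching any shared scheduler, which is the point of the decentralized design.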
After encapsulation, the mobile agent can itself decide (based on its scheduling code and the network information) to move to any machine that has the resources needed for the proper execution of the application. This mechanism provides the same user-level abstraction as a traditional Globus-based grid, but additionally it builds on a decentralized scheduling approach that enables the grid to span a very large number of machines in a more dynamic peer-to-peer computing model. The use of mobile agents (which are based on an RMI mechanism built on top of a client/server architecture), as compared to the alternative service based architecture, provides a higher level of ease and abstraction in terms of validation (experimentation) of different types of scheduling, monitoring and migration schemes. Although the project uses a scheduling scheme that builds on a tree-structured overlay network, it is made adaptive based on the value of an application specific performance metric. For example, the performance metric for a data-intensive application such as BLAST would give high consideration to the bandwidth capacity of the communication link before actually scheduling the job on a particular node. Similarly, it will select a high-speed node for an application that belongs to the class of compute-intensive applications. Furthermore, in order to provide uninterrupted execution with dynamic and transparent migration features, the project makes use of strongly mobile agents instead of traditional weakly mobile agents (Java based mobile agents that cannot access their state information).
Following the common practice in grid computing research, the proof-of-concept has been demonstrated with the execution of NCBI BLAST (which falls in the class of independent task applications) on a cluster of 18 machines with heterogeneous platforms, ranked under the categories of fast, medium and slow through the introduction of appropriate delays in the application code. The overall task required the comparison of a 256 KB sequence against a set of 320 data chunks, each of size 512 KB. This gave rise to 320 subtasks, each responsible for matching the candidate 256 KB sequence against one specific 512 KB data chunk. The project successfully carried out the execution of these tasks, and it has been observed that adapting the scheduling to the dynamics of the architecture greatly improves performance and quality of results. The project is being further extended to support different categories of applications and to enable the user to configure different scheduling schemes for different applications through easy to use APIs.

3.3. Advancing Clinico-Genomic Trials on Cancer (ACGT)

ACGT is a Europe-wide integrated biomedical grid for post-genomic research on cancer (http://www.eu-acgt.org) [115]. It intends to build on the results of other biomedical grid projects such as caBIG, BIRN, MEDIGRID and MyGrid. The project is based on an open source and open access architecture and provides the basic tools and services required for medical knowledge discovery, analysis and visualization. The overall grid infrastructure and services aim to provide an environment that could help scientists to:

a. Reveal the effect of genetic variations on oncogenesis
b. Promote the molecular classification of cancer and the development of individual therapies
c. Model in silico tumor growth and therapy response

In order to create an environment that supports these objectives, ACGT focuses on the development of a virtual web that interconnects various cancer related centres, organizations and individual investigators across Europe through appropriate web and grid technologies. Mainly, it uses semantic web technologies and ontologies for data integration and knowledge discovery, and the Globus toolkit with its WS-GRAM, MDS and GSI services for cross-organization resource sharing, job execution, monitoring and result visualization. Additionally, ACGT uses some higher level grid services from the Gridge framework developed at Poznan Supercomputing and Networking Centre (PSNC) [116]. These services include GRMS (Grid Resource Management System), GAS (Grid Authorization System) and DMS (Data Management System). These additional services provide the required level of dynamic and policy-based resource management; efficient and reliable data handling; and monitoring and visualization of results. Fig. 5 provides a usage scenario of these services in the context of the ACGT environment.

Fig. (5). ACGT integrated environment usage scenario (reproduced from [116]).

3.4. KidneyGrid

The KidneyGrid project [8] uses web service-based technologies to develop a collaborative e-Research platform in the form of a virtual organization that interconnects and provides interaction among different entities (e.g. kidney model developers, resources and users) related to the study of human and animal kidneys. Fig. 6 shows the basic architecture of the system along with the list of typical interactions among its components.
The interaction among heterogeneous legacy components and the overall functionality of the project have been achieved by using the following web and grid technologies [8]:

• Gridsphere Framework for portal development
• MyProxy for virtual organization
• Gridbus as application level meta-scheduler (resource broker)
• WSRF for the development of wrappers that provide interaction among legacy resources and applications
• GT4, Unicore and Alchemi as core grid middleware at different sites

The project has been tested and evaluated by the implementation of a medullary kidney model. It allows the user to compose, run and monitor an in silico experiment by selecting an appropriate kidney model and the required parameters through a web-based user interface. The results of the experiment are also displayed with the help of suitable visualization tools.

3.5. Virtual Laboratory

The Virtual Laboratory project [9] develops an integrated e-Research environment that could provide the computational power and fast access to databases required for molecular modeling-based drug design. The process of molecular modeling requires each target protein to be screened against all the molecules stored in a particular chemical database (CDB). Given the huge size of CDBs (millions of compounds in a single database), completing a single docking experiment would take decades on a single computer. This time could be reduced to a single day or even less with a grid based architecture. As a testbed, the Virtual Laboratory project interconnects geographically distributed resources at three main sites, i.e. Monash University, Melbourne, Australia; AIST, Tokyo, Japan; and Argonne National Lab., Chicago, USA. It uses Nimrod parameter modeling tools to adapt the DOCK molecular modeling software to run as a parametric sweep application in the grid environment. DOCK jobs are scheduled on the grid using the Nimrod/G resource broker.
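A parametric sweep of this kind can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: it does not use Nimrod plan files or any Nimrod/G or DOCK interface, the wrapper script name `run_docking.sh`, the parameter name `ligand_number` and the site labels are hypothetical, and the round-robin placement is a naive stand-in for whatever policy a real resource broker applies.

```python
from typing import Dict, Iterator, List


def sweep_jobs(param: str, start: int, stop: int, step: int,
               command_template: str) -> Iterator[Dict[str, object]]:
    """Yield one job description per value of a range parameter.

    The range is inclusive of `stop`, so 1..200 with step 1 gives 200 jobs.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    for value in range(start, stop + 1, step):
        yield {
            "id": f"{param}-{value}",
            "params": {param: value},
            # The template is filled with the current parameter value.
            "command": command_template.format(**{param: value}),
        }


def assign_round_robin(jobs: List[Dict[str, object]],
                       resources: List[str]) -> Dict[str, List[str]]:
    """Naive placement: hand jobs to resources in turn (not a real broker)."""
    plan: Dict[str, List[str]] = {resource: [] for resource in resources}
    for index, job in enumerate(jobs):
        plan[resources[index % len(resources)]].append(str(job["id"]))
    return plan


if __name__ == "__main__":
    # Hypothetical wrapper script; one docking run per ligand number.
    jobs = list(sweep_jobs("ligand_number", 1, 200, 1,
                           "run_docking.sh {ligand_number}"))
    plan = assign_round_robin(jobs, ["site-1", "site-2", "site-3"])
    print(len(jobs), {site: len(ids) for site, ids in plan.items()})
```

The useful property of the sweep formulation is that every job is independent and fully described by its parameter value, so the jobs can be placed on any available resource in any order.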
Each resource on the grid is accessed via Globus, whereas the GRACE software toolkit has been used for resource trading. Similarly, some intelligent tools were used for chemical database management. The Virtual Laboratory architecture has been tested with the molecular docking of some 200 molecules from the aldrich 300 CDB, using the 3D structure of endothelin converting enzyme (ECE), which is involved in hypotension, as the target receptor. With the range parameter ‘ligandnumber’ running from 1 to 200 with a step size of 1, there were 200 docking jobs. The common input files and executables required for these jobs were pre-staged on all the grid resources at the different sites using the globus-rcp command. These jobs were successfully scheduled and executed on the available resources using deadline and budget constrained optimization schemes to meet the user requirements regarding completion time and affordable cost (see [117, 118]). According to these experiments, economy driven and service-oriented architectures provide the best utilization of grid resources.

Fig. (6). Architecture of the KidneyGrid project: steps 1-9 illustrate a typical user/component interaction and activities in the system (reproduced from [8]).

4. CONCLUSIONS AND FUTURE RECOMMENDATIONS

This paper presents a scrupulous analysis of the state of the art in web and grid technology for bioinformatics, computational biology and systems biology, in a manner that provides a clear picture of currently available technological solutions for a very wide range of problems. While surveying the literature, it has been observed that there are many grid-based PSEs, Workflow Management Systems, Portals and Toolkits under the name of Bioinformatics, but not as many for Systems or Computational Biology. However, in each case a mix of projects and applications has been found, overlapping from bioinformatics to computational and systems biology.
Thus, the paper provides a synthetic overview of several major contributions under each technological category in a way that could serve a wide range of individuals and organizations interested in:

• Setting up a local, enterprise or global IT infrastructure for life sciences.
• Solving a particular life science related problem by selecting the most appropriate technological options that have been successfully demonstrated and reported herein.
• Migrating existing bioinformatics legacy applications, tools and services to a grid-enabled environment in a way that requires less effort and is motivated by the previous related studies provided herein.
• Comparing the features of a newly developed application/service with those currently available.
• Starting a research and development career related to the use of web and grid technology for biosciences.

Based on the analysis of the state of the art, we identify below some key open problems:

• The use of semantic web technologies such as domain ontologies for life sciences is still not at its full level of maturity, perhaps because of the semi-structured nature of XML and the limited expressiveness of ontology languages [28].
• Biological data analysis and management are still quite difficult because of the lack of development and adaptation of optimized and unified data models and query engines.
• Some of the existing bioinformatics ontologies and workflow management systems are simply in the form of Directed Acyclic Graphs (DAGs) and their descriptions lack expressiveness in terms of formal logic [86].
• Lack of open-source standards and tools required for the development of thesaurus and meta-thesaurus services [22].
• Need for appropriate query, visualization and authorization mechanisms for the management of provenance data and meta-data in in silico experiments [10, 86].
• Some of the BioGrid projects seem to be discontinued in terms of information updating. This might arise from funding problems or from difficulties associated with their implementation.
• There is a lack of mature, domain specific application programming models, toolkits and APIs for grid-enabled application development, deployment, debugging and testing.
• There still seems to be a gap between the application layer and the middleware layer of a typical BioGrid infrastructure, because existing middleware services do not fully meet the demands of applications; for example, no grid middleware properly supports automatic application deployment on all grid nodes.
• Deploying existing bioinformatics applications on available grid testbeds (such as NGS, EGEE etc.) requires the installation and configuration of a specific operating system and grid middleware toolkits, which is not a trivial task from a biologist end-user's point of view.
• There are still many issues with grid based workflow management systems in terms of their support for complex operations (such as loops), legacy bioinformatics applications and tools, use of proper ontologies and web services etc. [86].
• The job submission process on existing grid infrastructures seems to be quite complex because resource broker services are not yet sufficiently mature.
• Lack of appropriate implementation initiatives regarding knowledge grid infrastructure for life sciences.

These facts provide evidence that web and grid technologies are still moving from an infant to a semi-mature state in terms of their proper application and use in life sciences. We believe that much work remains to be done in easing the task of exploiting these technologies properly.
In particular we recommend:

• Automation of the grid deployment process for legacy applications
• Domain specific grid application programming models with appropriate toolkits and libraries
• More user friendly interfaces with dynamic problem solving environments, portals and workflows
• Complete virtualization of all resources needed for the development, deployment and execution of an application
• Application of agent technology to support modeling and simulation of complex biological processes
• A single unified grid middleware that could provide the services for Data, Computing, Service and Knowledge Grid
• Bringing the grid middleware to the level of an operating system in terms of its ease of use, efficiency and reliability. It should be an operating system for a world wide virtual supercomputer.
• Implementation of more dynamic service and knowledge grid infrastructures for life sciences
• Development of sophisticated data management services for aggregation, integration, query, visualization and inference of complex biological knowledge and data

Needless to say, the actual list of key open problems is far larger, but the above mentioned are some of the key issues that most directly affect the biologist.

ACKNOWLEDGEMENTS

N. Krasnogor acknowledges the BBSRC for project BB/CS11764/1 and EPSRC for projects GR/T07534/01 and EP/D061571/1; Azhar A. Shah acknowledges the University of Sindh, Pakistan for scholarship SU/PLAN/F.SCH/794; Jacek Blazewicz and Piotr Lukasiak acknowledge the KBN for partial research grant. Finally, all the authors acknowledge the anonymous reviewers for their valuable suggestions for the improvement of this paper.

ABBREVIATIONS

BIRN = Biomedical Informatics Research Network
BRIDGES = Biomedical Research Informatics Delivered by Grid Enabled Services
DAG = Directed Acyclic Graph
DAI = Data Access and Integration
DMS = Data Management Service
DQP = Distributed Query Processor
EBI = European Bioinformatics Institute
EDG = European Data Grid
EGEE = Enabling Grids for E-sciencE
EMBOSS = European Molecular Biology Open Software Suite
EMBRACE = European Model for Bioinformatics Research and Community Education
GEBAF = Grid Enabled Bioinformatics Application Framework
GLAD = Grid Life Science Application Developer
GSI = Grid Security Infrastructure
GUI = Graphical User Interface
JDL = Job Description Language
LCG = LHC Computing Grid
LHC = Large Hadron Collider
LIM = Laboratory Information Management System
LRMS = Local Resource Management System
LSF = Load Sharing Facility
LSIDs = Life Science Identifiers
MIAME = Minimum Information About a Microarray Experiment
NCBI = National Center for Biotechnology Information
NGS = National Grid Service
OGSA = Open Grid Services Architecture
OWL = Web Ontology Language
PBS = Portable Batch System
PSE = Problem Solving Environment
RDF = Resource Description Framework
RFTP = Reliable File Transfer Protocol
SGE = Sun Grid Engine
SOAP = Simple Object Access Protocol
SRB = Storage Resource Broker
SWRL = Semantic Web Rule Language
UDDI = Universal Description, Discovery and Integration
WFMS = Workflow Management System
WISDOM = Wide In Silico Docking On Malaria
WSDL = Web Services Description Language
WSRF = Web Services Resource Framework
XML = eXtensible Markup Language

REFERENCES

[1] Wooley JC, Lin HS. Catalyzing Inquiry at the Interface of Computing and Biology. National Academies Press, Washington, DC 2005.
[2] Malakoff D. NIH urged to fund centres to merge computing and biology. Science 1999; 284: 1742.
[3] Galperin MY. The Molecular Biology Database Collection: 2005 update. Nucleic Acids Res 2005; 33: D5-24.
[4] Baru C, Moore R, Rajasekar A, Wan M. The SDSC storage resource broker, In: Proceedings of CASCON'98 Conference, Toronto, Canada, 1998.
[5] Haas LM, Schwarz PM, Kodali P et al. DiscoveryLink: a system for integrated access to life sciences data sources. IBM Systems J 2001; 40: 464-88.
[6] Susumu D, Kazutoshi F, Hideo M, Haruki N, Shinji S. An Empirical Study of Grid Applications in Life Science: lessons learnt from Biogrid project in Japan. Int J Inf Tech 2005; 11: 16-28.
[7] Podvinec M, Maffioletti S, Kunszt P et al. The SwissBioGrid Project: objectives, preliminary results and lessons learned, In: Proceedings of the 2nd IEEE International Conference on e-Science and Grid Computing (e-Science 2006), Workshop on Production Grids. IEEE Computer Society Press, 2006; 148-56.
[8] Chu X, Lonie A, Harris P, Thomas SR, Buyya R. KidneyGrid: a grid platform for integration of distributed kidney models and resources, In: Proceedings of the 4th International Workshop on Middleware for Grid Computing (MGC 2006). ACM Press, NY 2006; 194: 5-12.
[9] Buyya R, Branson K, Giddy J, Abramson D. The Virtual Laboratory: enabling molecular modelling for drug design on the World Wide Grid. Concurrency Computat: Pract and Exper 2003; 15: 1-25.
[10] Stevens RD, Robinson AJ, Goble CA. myGrid: personalised bioinformatics on the information grid. Bioinformatics 2003; 19 Suppl 1: i302-04.
[11] Goble C, Wroe C, Stevens R, myGrid consortium. The myGrid project: services, architecture and demonstrator, In: Proceedings of the UK e-Science All Hands Meeting, Nottingham, UK, 2003.
[12] Wilkinson MD, Links M. BioMOBY: an open source biological web services proposal. Brief Bioinform 2002; 3: 331-41.
[13] Wilkinson M, Schoof H, Ernst R, Haase D. BioMOBY Successfully Integrates Distributed Heterogeneous Bioinformatics Web Services: the PlaNet exemplar case. Plant Physiol 2005; 138: 5-17.
[14] Kerhornou A, Guigo R. BioMoby web services to support clustering of co-regulated genes based on similarity of promoter configurations. Bioinformatics 2007; 23: 1831-33.
[15] Rice P, Longden I, Bleasby A. EMBOSS: the European molecular biology open software suite. Trends Genet 2000; 16: 276-83.
[16] Guillermo PT, Oscar P, Emilio LZ. A survey on grid architectures for bioinformatics, In: Martino BD, Dongarra J, Hoisie A, Yang LT, Zima H Eds, Engineering The Grid: Status and Perspective. American Scientific Publishers, 2006; 109-121.
[17] Krishnan A. A survey of life sciences applications on the grid. New Generation Computing Archive 2004; 22: 111-26.
[18] Naseer A, Stergioulas LK. Integrating Grid and Web Services: a critical assessment of methods and implications to resource discovery, In: Proceedings of the 2nd Workshop on Innovations in Web Infrastructure, Edinburgh, UK, 2005.
[19] Luck R. Networking requirements of the life sciences community, http://www.geant2.net/upload/pdf/Lueck_Rupert.pdf [Accessed: August 2007].
[20] Cheung KH, Smith AK, Yip KYL, Baker CJO, Gerstein MB. Semantic web approach to database integration in the life sciences, In: Baker C, Cheung K Eds, Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences. Springer, NY 2007; 11-30.
[21] Wang X, Gorlitsky R, Almeida JS. From XML to RDF: how semantic web technologies will change the design of 'omic' standards. Nat Biotechnol 2005; 23: 1099-103.
[22] Sahoo SS, Thomas C, Sheth A, York WS, Tartir S. Knowledge Modelling and its Application in Life Sciences: tale of two ontologies, In: Proceedings of the 15th International Conference of World Wide Web, Edinburgh, UK, 2006.
[23] Horrocks I, Patel-Schneider PF, Harmelen FV. From SHIQ and RDF to OWL: the making of a web ontology language. J Web Semantics 2003; 1: 7-26.
[24] Chen YP, Chen Q. Analyzing inconsistency toward enhancing integration of biological molecular databases, In: Proceedings of the 4th Asia-Pacific Bioinformatics Conference, Taipei, Taiwan, 2006.
Zhao J, Wroe C, Goble C et al., Using semantic web technologies for representing e-science provenance, In: Sheila MA, Dimitris P, Frank H Eds, The Semantic Web - ISWC 2004. Springer-Verlag, 2004; LNCS 3298: 92-106. Quan D, Karger DR, How to make a semantic web browser, In: Proceedings of the 13th International World Wide Web Conference, ACM Press, NY 2004; 255-65. The Gene Ontology Consortium. Creating the Gene Ontology Resource: design and implementation. Genome Res 2001; 11: 1425-33. The Gene Ontology Consortium. The Gene Ontology (GO) project in 2006. Nucleic Acids Res 2006; 34: D322-28. Ruttenberg A, Rees J, Luciano J, Experience using OWL DL for the exchange of biological pathway information, In: Proceedings of the Workshop on OWL Experiences and Directions, Galway, Ireland, 2005. Ruttenberg A, Rees J, Zucker J, What BioPAX communicates and how to extend OWL to help it, In: Proceedings of the Workshop series on OWL: Experiences and Directions, Manchester, UK, 2006. Whetzel PL, Parkinson H, Causton HC et al. The MGED Ontology: a resource for semantics-based description of microarray experiments. Bioinformatics 2006; 22: 866-73. Navarange M, Game L, Fowler D et al. MiMiR: a comprehensive solution for storage, annotation and exchange of microarray data. BMC Bioinformatics 2005; 6: 268-78. Brazma A, Hingamp P, Quackenbush J et al. Minimum Information About a Microarray Experiment (MIAME) - toward standards for microarray data. Nat Genet 2001; 29: 365-71. Altunay M, Colonnese D, Warade C, High throughput web services for life sciences, In: Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC’05). IEEE Computer Society, DC 2005; 1: 329-34. Hahn U, Wermter J, Blasczyk R, Horn PA. Text Mining: powering the database revolution. Nature 2007; 448: 130. Gao HT, Hayes JH, Cai H. Integrating biological research through web services. Computer 2005; 38: 26-31. Foster I. 
Globus Toolkit Version 4: Software for service-oriented systems. J Computer Sci and Tech 2006; 21: 513-20. Shah et al. [38] [39] [40] [41] [42] [43] [44] [45] [46] [47] [48] [49] [50] [51] [52] [53] [54] [55] [56] [57] [58] [59] [60] [61] [62] EGEE-II overview paper, http://www.dcc.ac.uk/events/policy2006/EGEE-II%20overview %20paper.pdf [Accessed: August 2007]. gLite3 user guide, https://edms.cern.ch/file/722398/1.1/gLite-3User Guide.html [Accessed: August 2007]. Furmento N, Hau J, Lee W, Newhouse S, Darlington J, Implementations of a service-oriented architecture on top of Jini, JXTA and OGSA, In: Proceedings of the UK e-science All Hands Meeting, Nottingham, UK, 2003. Wietrzyk B, Radenkovic M, Semantic life science middleware with web service resource framework, In: Proceedings of the 4th All Hands Meeting, Nottingham, UK, 2005. Radenkovic M, Wietrzyk B, Life science grid middleware in a more dynamic environment, In: Meersman R, Tari Z, Herrero P et al. Eds, Proceedings of OTS Workshops. Springer-Verlag, 2005; LNCS 3762: 264-73. Luciano JS. PAX of mind for pathway researchers. Drug Discov Today 2005; 10: 937-42. Joshi-Tope G, Gillespie M, Vastrik I et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005; 33: D428-32. Brazma A, Robinson A, Cameron G, Ashburner M. One-stop shop for microarray data. Nature 2000; 403: 699-700. Brazma A. On the importance of standardisation in life sciences. Bioinformatics 2001; 17: 113-4. Stevens R, Baker P, Bechhofer S et al. TAMBIS: transparent access to multiple bioinformatics information sources. Bioinformatics 2000; 16: 184-89. Komatsoulis GA, Warzel DB, Hartel FW et al. caCORE version 3: implementation of a model driven, service-oriented architecture for semantic interoperability. J Biomed Inform 2007 (To appear). Maibaum M, Zamboulis L, Rimon G et al. 
Cluster based integration of heterogeneous biological databases using the AutoMed toolkit, In: Proceedings of the 2nd International Workshop on Data Integration in the Life Sciences, San Diego, USA, 2005. Shannon PT, Reiss DJ, Bonneau R, Baliga NS. The Gaggle: an open-source software system for integrating bioinformatics software and data sources. BMC Bioinformatics 2006; 7: 176-89. Karp PD, Riley M, Saier M et al. The EcoCyc and MetaCyc databases. Nucleic Acids Res 2000; 28: 56-65. Rodriguez N, Donizelli M, Novere LN. SBMLeditor: effective creation of models in the Systems Biology Markup Language (SBML). BMC Bioinformatics 2007; 8: 79-87. Nickerson D, Hunter P, Using CellML in computational models of multi-scale physiology, In: Proceedings of the 27th IEEE Annual International Conference of the Engineering in Medicine and Biology Society (IEEE-EMBS 2005). IEEE Press, 2005; 6096-99. Bard JB, Rhee SY. Ontologies in Biology: design, applications and future challenges. Nat Rev Genet 2004; 5: 213-22. Stein LD, Mungall C, Shu S et al. The Generic Genome Browser: a building block for a model organism system database. Genome Res 2002; 12: 1599-610. Hermjakob H, Montecchi-Palazzi L, Bader G et al. The HUPO PSI's molecular interaction format--a community standard for the representation of protein interaction data. Nat Biotechnol 2004; 22: 177-83. Merelli E, Armano G, Cannata N et al. Agents in bioinformatics, computational and systems biology. Brief Bioinform 2007; 8: 4559. Luc M, Simon M, Carole G, On the use of agents in a bioinformatics grid, In: Proceedings of the 3rd International Symposium on Cluster Computing and the Grid. IEEE Computer Society, DC 2003; 653-61. Bryson K, Luck M, Joy M, Jones D. Agent interaction for bioinformatics data management. Applied Artificial Intelligence 2001; 15: 917–47. Decker K, Zheng X, Schmidt C, A multi-agent system for automated genetic annotation, In: Proceedings of the 5th ACM International Conference on Autonomous Agents. 
ACM Press, NY 2001; 433-40. Foster I. Service-oriented science. Science 2005; 308: 814-17. Yen E, Lee HC, Ueng W, Lin S. Building Grid-enabled applications in bioinformatics and digital archive, http://www2.twgrid.org/event/isgc2004/ presentation/0728/ GridAPs.pdf [Accessed: January 2007]. Web & Grid Technologies [63] [64] [65] [66] [67] [68] [69] [70] [71] [72] [73] [74] [75] [76] [77] [78] [79] [80] [81] [82] [83] [84] [85] Jacq N, Blanchet C, Combet C. Grid as a bioinformatics tool. Parallel Computing 2004; 30: 1093 -107. Felsenstein J. Evolutionary Trees from DNA Sequences: a maximum likelihood approach. J Mol Evol 1981; 17: 368-76. Kommineni J, Abramson D, GriddLeS enhancements and building virtual applications for the GRID with legacy components, In: Proceedings of the European Grid Conference (EGC 2005). SpringerVerlag, 2005; LNCS 3470: 961-71. Nakada H, Sato M, Sekiguchi S. Design and Implementations of Ninf: towards a global computing infrastructure. Future Generation Computer Systems 1999; 15: 649-58. Thomas M, Mock S, Boisseau J et al., The GridPort toolkit architecture for building grid portals, In: Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing (HPDC-01), San Francisco, 2001. Novotny J. The Grid portal development kit. Concurrency and Computat: Pract and Exper 2002; 14:1129–44. Chongjie Z, Kelley I, Allen G. Grid Portal Solutions: a comparison of the GridPortlets and OGCE. Cocurrency and Computat: pract and exper 2007; 19: 1739-48. Ramakirshnan L, Reed MSC, Tilson JL, Reed DA, Grid portals for bioinformatics, In: Proceedings of the 2nd International Workshop on Grid Computing Environments (GCE'06), Tampa, USA, 2006. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic Local Alignment Search Tool. J Mol Biol 1990; 215: 403-13. Thompson JD, Higgins DG, Gibson TJ. 
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994; 22: 4673-80. Alameda J, Christie M, Fox G et al. The Open Grid Computing Environments Collaboration: portlets and services for science gateways. Concurrency and Computat: Pract and Exper 2007; 19: 921-42. Letondal C. A Web interface generator for molecular biology programs in Unix. Bioinformatics 2001; 17: 73-82. Sulakhe D, Rodriguez A, D'Souza M et al. GNARE: automated system for high-throughput genome analysis with grid computational backend. J Clin Monit Comput 2005; 19: 361-70. Casanova H, Berman F, Bartol T. The Virtual Instrument: support for grid-enabled Mcell Simulations. Int J High Perform Computing Appl 2004; 18: 3-17. Dhar PK, Meng TC, Somani S et al. Grid Cellware: the first gridenabled tool for modelling and simulating cellular processes. Bioinformatics 2005; 21: 1284-87. Lu E, Xu Z, Sun J, An extendable grid simulation environment based on GridSim, In: Proceedings of the 2nd International Workshop on Grid and Cooperative Computing (GCC 2003). SpringerVerlag, 2004; 3032: 205-08. Buyya R, Murshed M. GridSim: a toolkit for the modelling and simulation of distributed resource management and scheduling for Grid computing. Concurrency and Computat: Pract and Exper 2002; 14: 1175-1220. Lamehamedi H, Shentu Z, Szymanski B, Deelman E, Simulation of Dynamic Data Replication Strategies in Data Grids, In: Proceedings of the 17th International Symposium on Parallel and Distributed Processing (IPDPS 2003). IEEE Computer Society, DC 2003; 100-02. Teo YM, Wang X, Ng YK. GLAD: a system for developing and deploying large-scale bioinformatics grid. Bioinformatics 2005; 21: 794-802. Trelles O. On the parallelization of bioinformatics applications. Briefings in Bioinformatics 2001; 2: 181–94. 
Nobrega R, Barbosa J, Monteiro AP, BioGrid Application Toolkit: a Grid-based problem solving environment tool for biomedical data analysis, In: Proceedings of the VECPAR-7th International Meeting on High Performance Computing for Computational Science, IMPA, Brazil, 2006. Cannataro M, Comito C, Schiavo FL, Veltri P. Proteus: a grid based problem solving environment for bioinformatics- architecture and experiments. IEEE Computational Intel. Bull. 2004; 3: 07-18. Choong HS, Byoung JK, Gwan SY, A model of problem solving environment for integrated bioinformatics solution on grid by using Condor, In: Proceedings of the 3rd International Conference on Grid and Cooperative Computing (GCC 2004). Springer-Verlag, 2004; LNCS 3251: 935-38. Current Bioinformatics, 2008, Vol. 3, No. 1 [86] [87] [88] [89] [90] [91] [92] [93] [94] [95] [96] [97] [98] [99] [100] [101] [102] [103] [104] [105] [106] [107] [108] [109] 21 Tang F, Chua CL, Ho LY et al. Wildfire: distributed, Grid-enabled workflow construction and execution. BMC Bioinformatics 2005; 6: 69-78. Hull D, Wolstencroft K, Stevens R et al. Taverna: a tool for building and running workflows of services. Nucleic Acids Res 2006; 34: W729-32. Berman F, Chien A, Cooper K. The GrADS Project: software support for high-level grid application development. Int J High Perform Computing App 2001; 15: 327-44. Goodale T, Allen G, Lanfermann G et al., The cactus framework and toolkit: design and applications, In: Proceedings of the 5th International Conference on High Performance Computing for Computational Science, Springer-Verlag, 2003; LNCS 2565: 15-36. Moskwa S, A Grid-enabled Bioinformatics Applications Framework (GEBAF), Masters Thesis, School of Computer Science, The University of Adelaide, 2005. Stajich JE, Block D, Boulez K et al. The bioperl toolkit: Perl modules for the life sciences. Genome Res 2002; 12:1611-18. Abramson D, Buyya R, Giddy J. 
A computational economy for grid computing and its implementation in the Nimrod-G resource broker. Future Generation Computer Systems 2002; 18: 1061-74. Aloisio G, Cafaro M, Fiore S, Mirto M, A WorkFlow management system for bioinformatics grid, In: Proceedings of the Network Tools and Applications in Biology (NETTAB) Workshops, Naples, Italy, 2005. Yarkhan A, Dongarra JJ. Biological sequence alignment on the computational grid using the GrADS framework. Future Generation Computer Systems 2005; 21: 980-86. Sinnott R, Atkinson M, Bayer M et al. Grid services supporting the usage of secure federated, distributed biomedical data, In: Proceedings of the UK e-Science all hands meeting, Nottingham, 2004. Blanchet C, Mollon R, Thain D, Deleage G, Grid deployment of legacy bioinformatics applications with transparent data access, In: Proceedings of IEEE Conference on Grid Computing 2006. IEEE Computer Society, 2006; 120-27. Thain D, Livny M. Parrot: an application environment for data intensive computing. Scalable Computing: Pract and Exper 2005; 6: 9-18. Bairoch A, Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res 2000; 28: 45-8. Smith TF, Waterman MS. Identification of Common Molecular Subsequences. J Mol Biol 1981; 147: 195-97. Pearson WR. Searching Protein Sequence Libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genom 1991; 11: 635-50. Krauter K, Maheswaran M, Architecture for a grid operating system, In: Buyya R, Baker M, Proceedings of the 1st IEEE/ACM International Workshop on Grid Computing, Springer Verlag, 2000; LNCS 1971: 65-76. Padala P, Wilson JN, GridOS: operating system services for grid architectures, In: Proceedings of the High Performance Computing (HiPC 2003). Springer-Verlag, 2003; LNCS 2913: 353-62. 
Huh EN, Mun Y, Performance analysis for real-time grid systems on COTS operating systems, In: Proceedings of the International Conference on Computational Science (ICCS 2003). SpringerVerlag, 2003; LNCS 2660: 482-90. Talia D, Towards GRID Operating Systems: from GLinux to a GVM, In: Proceedings of the Workshop on Network Centric Operating Systems, Bruxelles, Belgium, 2005. Grimshaw AS, Natrajan A, Legion: Lessons learned building a grid operating system, In: Proceedings of the IEEE, 2005; 93: 589-603. Imamagic E, Radic B, Dobrenic D, Job Management Systems Analysis, In: Proceedings of the 6th CARNET Users Conference, Zagreb, Croatia, 2004. Sun Y, Zhao S, Yu H, Gao G, Luo J. ABCGrid: application for bioinformatics computing grid. Bioinformatics 2007; 23: 1175-77. Benedyczak K, Wronski M, Nowinski A et al, UNICORE as Uniform Grid Environment for Life Sciences, In: Proceedings of the European Grid Conference (EGC 2005). Springer-Verlag, 2005; LNCS 3470: 364-73. Yang CT, Kuo YL, Li KC, Gaudiot JL, On design of cluster and grid computing environment toolkit for bioinformatics applications, In: Proceedings of 6th International Workshop on Distributed Computing (IWDC 2004). Springer-Verlag, 2004; LNCS 3326: 82-87. 22 Current Bioinformatics, 2008, Vol. 3, No. 1 [110] [111] [112] [113] [114] Shah et al. Yang CT, Hsiung YC, Kan HC, Implementation and evaluation of a Java based computational grid for bioinformatics applications, In: Proceedings of the 19th IEEE International Conference on Advanced Information Networking and Applications (AINA 2005), IEEE Computer Society, 2005; 1: 298-303. Li KC, Chen CN, Liu CC, Chang CF, Hsu CW. PCGrid: integration of college’s research computing infrastructures using grid technology, In: Proceedings of the National Computer Symposium (NCS'2005), Tainan, Taiwan, 2005. 
Chen CN, Li KC, Tang CY, Lin YL, Wang HH, Wu TY, On design and implementation of a bioinformatics portal in cluster and grid environments, In: Proceedings of the 7th International Meeting on High Performance Computing for Computational Science (VECPAR), IMPA, Brazil, 2006. Gagliardi F, Jones B, Grey F, Begin MEHeikkurinen M, Building an Infrastructure for Scientific Grid Computing: status and goals of the EGEE project. Philos Transact A Math Phys Eng Sci, 2005; 363: 1729-42. Chakravarti AJ, Baumgartner G, Lauria M, The Organic Grid: selforganizing computational biology on desktop grids, In: Zomaya Received: February 22, 2007 [115] [116] [117] [118] Revised: May 30, 2007 AY Eds, Parallel Computing for Bioinformatics and Computational Biology: Models, Enabling Technologies and Case Studies, John Wiley & Sons, 2005; 671-703. Desmedt C, Nabrzyski J, Tsiknakis M. A semantic grid infrastructure enabling integrated access and analysis of multilevel biomedical data in support of post-genomic clinical trials on cancer. IEEE Transactions on Inform Tech in Biomedicine 2007; to appear. Pukacki J, Nabrzyski J, Stroiski M, Programming grid applications with Gridge, In: Nabrzyski J, Stroiski M Eds, Grid Applications – New Challenges for Computational Methods, 2006; 12: 1020. Buyya R, Giddy J, Abramson D, An evaluation of economy-based resource trading and scheduling on computational power grids for parameter sweep applications, In: Proceedings of 2nd Workshop on Active Middleware Services (AMS 2000) in conjunction with HPDC 2000, Pittsburgh, USA, 2000. Buyya R, Date S, Mizuno-Matsumoto Y, Venugopal S, Abramson D. Neuroscience Instrumentation and Distributed Analysis of Brain Activity Data: a case for eScience on global grids. Concurrency Computat: Pract and Exper 2005; 17(15): 1783-98. Accepted: May 30, 2007