Proc. of Int. Conf. on Advances in Communication, Network, and Computing, CNC

A Perusal on Genetical Diseases using Conducive Hadoop Cluster

Bincy P Andrews1, Binu A2
1 Rajagiri School of Engineering and Technology, Kochi, India, [email protected]
2 Rajagiri School of Engineering and Technology, Kochi, India, [email protected]

Abstract— A genetic disease is a disease caused by an abnormality in an individual's genome. Some genetic disorders are congenital, while others are caused by acquired changes, or mutations, in a gene or group of genes. Mutations occur either at random or through environmental exposure. Thus, by examining a gene, one can identify the associated disease. Most existing genetic disease analysis systems are software based and therefore suffer from numerous disadvantages. This paper proposes a cost-effective genetic disease analysis system built on Hadoop. Implementing it on low-cost commodity machines with a simple cluster deployment avoids additional expenses. Considering performance versus cost, the proposed commodity-cluster model for genetic disease analysis is a practical approach for small research organizations.

Index Terms— Hadoop, Rocks Cluster, commodity clusters, DNA, Hive, bioinformatics

I. INTRODUCTION

A genetic disease is any disease caused by an abnormality in an individual's genome. The abnormality can range from the minuscule to the major: from a discrete mutation in a single base in the DNA of a single gene to a gross chromosomal abnormality involving the addition or deletion of an entire chromosome or set of chromosomes. Some genetic disorders are inherited, while others are caused by acquired changes, or mutations, in a gene or group of genes. Mutations occur either at random or through environmental exposure. Thus, by examining a gene, one can identify the associated disease.
Most of this genetic information involves bulk amounts of data. Bioinformatics researchers now face problems analyzing such ultra-large-scale data sets, a problem that will only grow at an alarming rate in coming years. Moreover, colossal challenges are involved in processing, storing, and analyzing these petabytes of data without delay. Data manipulation with a conventional single-system approach is therefore impractical, and a parallel cluster environment becomes inevitable. Advantages of using parallel clusters include:
- Time saving
- Cost-effectiveness
- Solving larger and more complex problems quickly
- Making use of resources that would otherwise sit idle

© Elsevier, 2014

Hadoop [1] can handle this class of problems with good performance and scalability. Normally, Hadoop [3] is deployed over high-performance computing systems, which are so expensive that only large enterprises can afford them. Moreover, such deployments are complex, making them impractical for smaller organizations to manage. A cluster setup is normally influenced by factors such as:
- Cost
- Electricity
- Temperature
These factors rarely align: choosing systems with high computational capability raises the cost, and such systems also consume more energy. For smaller research organizations, where cost is an important factor, systems with high computational capability are therefore not an option for cluster setup. With Rocks, however, it is not necessary to satisfy each of these constraints: high-performance systems are not required, and temperature is not a major constraint. The Rocks Cluster Distribution [2], originally called NPACI Rocks, is a Linux distribution intended for high-performance computing clusters. Its most important feature is that it does not need high-performance compute nodes for deployment.
The Rocks cluster has the following features:
- Easy cluster deployment
- Hosts can be added and deleted easily; Rocks automatically edits most of the files used to identify its hosts
- Password-less SSH, so no additional login to compute nodes is needed
- A rich set of rolls that are configured automatically during Rocks cluster installation

This paper proposes a cost-effective genetic disease analysis system using Hadoop; [12] suggests such an architecture. The implementation uses low-cost commodity machines, which minimizes cost to a large extent. Most existing genetic disease analysis systems are software based and therefore suffer from numerous disadvantages. As for the cluster environment, most existing approaches involve high-performance systems; when cost is considered, such machines are expensive, and using them for cluster setup is not feasible in a smaller research organization. Considering performance versus cost, the proposed commodity-cluster model for genetic disease analysis is a practical approach for small research organizations. The remainder of the paper is organized as follows: Section II provides details on existing systems and their disadvantages compared to the proposed system. Section III describes the system design, followed by the experimental setup. Finally, the paper ends with concluding remarks.

II. BACKGROUND

Existing genetic disease analysis systems [6] [7] [8] [11] include the following. FASTLINK [11] is one of them. Alejandro Schäffer led the development of the FASTLINK software package for genetic linkage analysis, a statistical technique used to map genes and find the approximate locations of disease genes. FASTLINK aims to replace the main programs of the widely used LINKAGE package by performing the same computations faster.
MSA [11], developed by Alejandro Schäffer in collaboration with Sandeep Gupta, performs multiple sequence alignment and is significantly faster and more space-efficient. CASPAR [11] (Computerized Affected Sibling Pair Analyzer and Reporter), developed by Richa Agarwala, Jeremy Buhler, and Alejandro Schäffer, performs conditional linkage analysis of polygenic diseases such as diabetes, asthma, and glaucoma. Other participants in the design of CASPAR were Kenneth Gabbay (Baylor College of Medicine), Prof. Marek Kimmel (Rice University), and David Owerbach (Baylor College of Medicine). PedHunter [11], developed by Richa Agarwala, can be used to query a genealogical database. Among the problems PedHunter solves is how best to connect a set of relatives with the same disease into a pedigree suitable as input to genetic linkage analysis. PedHunter is currently used at NCBI to query the Amish Genealogy Database (AGDB), a database of over 295,000 members of the Amish and Mennonite religious groups and their relatives. Other participants in the design of PedHunter and AGDB include Leslie Biesecker (NHGRI/NIH), Clair Francomano (now at NIA/NIH), and Alejandro Schäffer. PedHunter is also used by other research groups to query other genealogical databases. The PedHunter query software comes in two flavors, depending on how the genealogy is stored: in a SYBASE database or in ASCII text files.

Though most of these systems help in disease detection, they also pose several disadvantages:
- Payment before use is obligatory
- An annual fee for software maintenance is inevitable
- Online software needs Internet access, which is an additional expense
- There is no single entity on which the future of the software depends; no black boxes are possible
- The software might not be open source
- Bug fixing is tedious
- Databases consuming large amounts of memory are required
- Software upgrades are difficult
- The software runs on a single system
- Multiple tasks cannot run at the same time
- Maintenance is expensive

The proposed system overcomes these disadvantages because it is based on open-source software and the Rocks Cluster. Some of its advantages are:
- Availability of the source code and the right to modify it
- The right to redistribute modifications and improvements to the code, and to reuse other open-source code
- The right to use the software in any way
- Non-exclusive rights to the software
- A distributed environment for processing
- The ability to perform multiple tasks at a time
- Rocks reduces cluster setup complexity
- Rocks provides a secure cluster environment

III. SYSTEM DESIGN

Fig. 1. System design

Figure 1 shows the overall architecture of the proposed system. It consists of two main phases: the disease analysis phase and the disease diagnosis phase. The following sections describe them in detail.

A. Disease Analysis

The input to this phase is a folder containing the DNA of patients in FASTA format, which is used to represent the patients' DNA information. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data [9]. The description line is distinguished from the sequence data by a greater-than symbol at its beginning [9]. The Hadoop MapReduce paradigm is used to process the multiple files inside this folder; the MapReduce task is implemented in Java. The file name and path of each FASTA file are provided as the key-value pair of the map task. Inside the map phase, a bioinformatics tool called fasta36 is invoked with each FASTA file and the path of the database as arguments. fasta36 identifies the percentage similarity of the genes provided as its arguments. It processes each file individually, comparing it with the database, and writes the results back to separate files.
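The per-file work described above can be sketched as follows. This is an illustrative sketch, not the authors' actual code: the sample FASTA record, the file and database paths, and the flag-free fasta36 invocation are all assumptions (real deployments may need scoring and output-format options).

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the work done for one patient file inside the map task.
public class Fasta36Invoker {

    // Example of the FASTA format described above: a '>' description
    // line followed by lines of sequence data (hypothetical content).
    static final String SAMPLE_FASTA =
            ">patient_001 sample DNA fragment\n" +
            "ATGCGTACGTTAGC\n" +
            "GGCATTACGAT\n";

    // Build the fasta36 command line for one patient file against the
    // reference database. Basic usage is `fasta36 <query> <library>`;
    // any extra options are omitted here.
    static List<String> buildCommand(String queryFile, String dbPath) {
        return Arrays.asList("fasta36", queryFile, dbPath);
    }

    // Inside the actual Hadoop map(), something like the following
    // would launch the tool and redirect its result to a separate file
    // (omitted so this sketch stays self-contained and runnable):
    //   new ProcessBuilder(buildCommand(fileName, dbPath))
    //       .redirectOutput(new java.io.File(fileName + ".out"))
    //       .start().waitFor();

    public static void main(String[] args) {
        System.out.println(String.join(" ",
                buildCommand("/data/fasta/patient_001.fa", "/data/db/diseases.fa")));
    }
}
```

Separating command construction from process launch keeps the map task testable without fasta36 installed.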
B. Disease Diagnosis

Though the previous module provides the percentage similarity of each patient's DNA with every DNA sequence in the database, Hive is needed for the final diagnosis of the disease associated with each patient; Fig. 1 depicts this scenario. In this phase the output from the previous module is combined into a single file and loaded into Hive. Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop that provides data summarization, ad-hoc query, and analysis of large datasets [10]. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL [10]. Hive also eases integration between Hadoop and tools for business intelligence and visualization [10]. Disease diagnosis can now be accomplished in just two steps by means of Hive queries:
1. Identify the maximum similarity percentage associated with each patient
2. Identify the disease name corresponding to that patient and percentage

IV. PERFORMANCE EVALUATION

A. Experimental Setup

A master node is connected to twelve compute nodes through a Cisco SG 300 switch. Rocks Cluster version 6.1 (Emerald Boa) is used for the cluster setup, after which Hadoop-1.0.4 was deployed [4] [5]. The next step is to run the MapReduce program. The time taken to execute the MapReduce program in the parallel environment was measured while varying the number of input files; the time taken to perform the same task individually on a single system was also measured.

Fig. 2. Time taken by the cluster environment

B. Experimental Results

Fig. 2 shows the time consumption of the MapReduce paradigm in a cluster environment, and Fig. 3 shows the time consumption of the conventional approach involving a single system. From these graphs we can conclude that deploying a cluster environment makes disease detection much faster than a conventional approach involving a single system.

Fig. 3. Time taken by the conventional method
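The two-step Hive diagnosis described in Section III-B can be mimicked in plain Java over the combined similarity rows. The table name, column names, and sample data below are assumptions for illustration; the paper does not give its actual Hive schema or queries.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each row of the combined fasta36 output: {patient, disease, similarity%}.
public class DiseaseDiagnosis {

    // Equivalent HiveQL, under an ASSUMED schema `results(patient, disease, similarity)`:
    //   Step 1: SELECT patient, MAX(similarity) AS best FROM results GROUP BY patient;
    //   Step 2: SELECT r.patient, r.disease FROM results r
    //           JOIN best b ON r.patient = b.patient AND r.similarity = b.best;
    static Map<String, String> diagnose(List<String[]> rows) {
        Map<String, Double> best = new HashMap<>();
        Map<String, String> diagnosis = new HashMap<>();
        for (String[] r : rows) {
            double pct = Double.parseDouble(r[2]);
            if (pct > best.getOrDefault(r[0], -1.0)) {
                best.put(r[0], pct);         // step 1: track each patient's maximum similarity
                diagnosis.put(r[0], r[1]);   // step 2: remember the disease at that maximum
            }
        }
        return diagnosis;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"patient_001", "cystic_fibrosis", "91.5"},
                new String[]{"patient_001", "thalassemia", "47.2"},
                new String[]{"patient_002", "sickle_cell", "88.0"});
        // Maps patient_001 to cystic_fibrosis (its highest-similarity match).
        System.out.println(diagnose(rows));
    }
}
```

In the actual system both steps run as HiveQL over the full combined file; the in-memory version only illustrates the logic.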
V. CONCLUSIONS

This paper proposed a cost-effective genetic disease analysis system using Hadoop. The implementation uses low-cost commodity machines, which minimizes cost to a large extent. Most existing genetic disease analysis systems are software based and therefore suffer from numerous disadvantages. As for the cluster environment, most existing approaches involve high-performance systems; when cost is considered, such machines are expensive, and using them for cluster setup is not feasible in a smaller research organization. Results show that the proposed system consumes only half the time taken by conventional approaches for disease analysis. Thus, considering performance versus cost, the proposed commodity-cluster model for genetic disease analysis is a practical approach for small research organizations.

ACKNOWLEDGMENT

We are greatly indebted to the college management and the faculty members for providing the necessary facilities and hardware, along with timely guidance and suggestions for implementing this work.

REFERENCES

[1] Hadoop, http://www.hadoop.apache.org
[2] Rocks cluster, http://www.rocksclusters.org/
[3] Hadoop, http://www.ibm.com/developerworks/library/l-hadoop-1/
[4] Hadoop deployment, http://ankitasblogger.blogspot.in/2011/01/hadoop-cluster-setup.html
[5] Hadoop deployment, http://icanhadoop.blogspot.in/2012/09/configuring-hadoop-is-very-if-you-just.html
[6] H. Stockinger, M. Pagni, L. Cerutti, and L. Falquet, "Grid Approach to …," Proc. of the 2nd IEEE Intl. Conf. on e-Science and Grid Computing, 2006, doi:10.1109/E-SCIENCE.2006.70.
[7] J. Andrade, M. Andersen, L. Berglund, and J. Odeberg, "Applications of Grid Computing in Genetics and Proteomics," LNCS, 2007, doi:10.1007/978-3-540-75755-9.
[8] P. Balaji et al., "Distributed I/O with ParaMEDIC: Experiences with a Worldwide Supercomputer," Proceedings of the IEEE International Supercomputing Conference, June 2008.
[9] FASTA format, http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml
[10] Hive, http://hortonworks.com/hadoop/hive/
[11] Genetic analysis software, http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/genetic_analysis.html
[12] Bincy P Andrews and Binu A, "Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster," IOSR Journal of Computer Engineering, vol. 14, no. 6, Sep-Oct. 2013, pp. 89-93.