Proc. of Int. Conf. on Advances in Communication, Network, and Computing, CNC
A Perusal on Genetical Diseases using Conducive
Hadoop Cluster
Bincy P Andrews1, Binu A2
1 Rajagiri School of Engineering and Technology, Kochi, India
[email protected]
2 Rajagiri School of Engineering and Technology, Kochi, India
[email protected]
Abstract— A genetic disease is a disease caused by an abnormality in an individual's genome. Some
genetic disorders are congenital, while others are caused by acquired changes or mutations in a gene or
group of genes. Mutations occur either randomly or due to some environmental exposure. Thus, by
scrutinizing a gene, one can determine the type of disease involved. Most existing genetic disease
analysis systems are standalone software packages, thereby posing numerous disadvantages. This paper
proposes a cost-effective genetic disease analysis system using Hadoop. The implementation uses low-cost
commodity machines and a simple cluster deployment scenario, avoiding additional expenses. Thus,
considering performance versus cost-effectiveness, the proposed commodity cluster model for genetic
disease analysis is a suitable approach for small research organizations.
Index Terms— Hadoop, Rocks Cluster, commodity clusters, DNA, Hive, bioinformatics
I. INTRODUCTION
A genetic disease is any disease caused by an abnormality in an individual's genome. The abnormality can
range from minuscule to major: from a discrete mutation in a single base in the DNA of a single gene to a
gross chromosome abnormality involving the addition or subtraction of an entire chromosome or set of
chromosomes. Some genetic disorders are inherited, while others are caused by acquired changes or
mutations in a gene or group of genes. Mutations occur either randomly or due to some environmental
exposure. Thus, by scrutinizing a gene, one can determine the type of disease involved. Most of this
genetic information involves bulk amounts of data. Bioinformatics researchers are now facing
problems with the analysis of such ultra-large-scale data sets, a problem that will only increase at an alarming
rate in coming years. Moreover, colossal challenges are involved in processing, storing, and analyzing these
petabytes of data without delay. This implies that data manipulation by means of a conventional
approach on a single system is impractical. Hence, a parallel cluster environment is inevitable. The
advantages of using parallel clusters include:
• Time saving
• Cost-effectiveness
• Expeditious solution of larger and more complex problems
© Elsevier, 2014
• Use of resources that are otherwise secluded
Hadoop [1] can be used to handle such classes of problems with good performance and scalability. Normally,
Hadoop [3] is deployed over high-performance computing systems, which are so expensive that only big
enterprises can afford them. Moreover, such deployment scenarios are complex, making them impractical
for smaller organizations to handle. In normal scenarios, a cluster setup is highly influenced by the following
factors:
• Expenditure
• Electricity
• Temperature
These factors never get along: choosing systems with high computational capability elevates expenditure,
and such systems also need more energy. So for smaller research organizations, where cost is an important
factor, one cannot choose systems with high computational capability for a cluster setup. With Rocks,
however, it is not necessary to satisfy each of these factors: setting up Rocks does not require
high-performance systems, and temperature is not a big constraint. The Rocks Cluster [2] Distribution,
originally called NPACI Rocks, is a Linux distribution intended for high-performance computing clusters.
The most important feature of Rocks is that it does not need high-performance computing nodes for
deployment. Rocks Cluster has the following features:
• Easy deployment of a cluster
• Easy addition and deletion of hosts
• Automatic editing of most of the files used to identify hosts
• Password-less SSH
• No additional login to compute nodes is needed, due to password-less SSH
• A rich set of rolls that are automatically configured during Rocks Cluster installation
This paper proposes a cost-effective genetic disease analysis system using Hadoop; Ref. [12] suggests
such an architecture. The implementation is done using low-cost commodity machines, which minimizes the
cost to a large extent. Most existing genetic disease analysis systems are standalone software packages,
thereby posing numerous disadvantages. Also, regarding the cluster environment, most existing approaches
involve high-performance systems. But when the cost aspect is considered, high-performance machines
involve a large cost, and using such systems for a cluster setup is not feasible in a smaller research
organization. Considering performance versus cost-effectiveness, the proposed commodity cluster model for
genetic disease analysis is a suitable approach for small research organizations.
The remainder of the paper is organised as follows: Section II provides details on existing systems and
their disadvantages compared to the proposed system. Section III provides details of the system design,
followed by the experimental setup and results in Section IV. Finally, the paper closes with concluding remarks.
II. BACKGROUND
Existing genetic disease analysis systems [11] [6] [7] [8] include the following. FASTLINK [11] is one of
them: Alejandro Schäffer led the development of the FASTLINK software package for genetic linkage
analysis. Genetic linkage analysis is a statistical technique used to map genes and find the approximate
locations of disease genes. FASTLINK aims to replace the main programs of the widely used package
LINKAGE by doing the same computations faster. MSA [11] was developed by Alejandro Schäffer in
collaboration with Sandeep Gupta; it is significantly faster and more space-efficient and can perform
multiple sequence alignment. For CASPAR [11], Richa Agarwala, Jeremy Buhler, and Alejandro Schäffer
developed software to do conditional linkage analysis of polygenic diseases such as diabetes, asthma, and
glaucoma. The software is called CASPAR (Computerized Affected Sibling Pair Analyzer and Reporter).
Other participants in the design of CASPAR are Kenneth Gabbay (Baylor College of Medicine), Prof. Marek
Kimmel (Rice University), and David Owerbach (Baylor College of Medicine). PedHunter [11], developed
by Richa Agarwala, can be used to query a genealogical database. Among the problems PedHunter solves is
how best to connect a set of relatives with the same disease into a pedigree suitable for input to genetic
linkage analysis. PedHunter is currently being used at NCBI to query the Amish Genealogy Database
(AGDB), a database of over 295,000 members of the Amish and Mennonite religious groups and their
relatives. Other participants in the design of PedHunter and AGDB include Leslie Biesecker (NHGRI/NIH),
Clair Francomano (now at NIA/NIH), and Alejandro Schäffer. PedHunter is also being used by other research
groups to query other genealogical databases. The PedHunter query software comes in two flavors depending
on how the genealogy is stored: in a SYBASE database or in ASCII text files. Though most of these systems
help in disease detection, they also pose several disadvantages:
• Payment before use is obligatory
• Annual fee payment for software maintenance is inevitable
• Online software needs Internet access, which is an additional expense
• There is no single entity on which the future of the software depends
• No black boxes are possible
• The software might not be open source
• Bug fixing is tedious
• Databases consuming a large amount of memory space are required
• Software upgrading is difficult
• Runs on a single system
• Cannot run multiple tasks at the same time
• Maintenance is expensive
The proposed system overcomes these disadvantages successfully since it is based on open-source software
and the Rocks Cluster. The following are some of its notable advantages:
• The availability of the source code and the right to modify it
• The right to redistribute modifications and improvements to the code, and to reuse other open-source code
• The right to use the software in any way
• Non-exclusive rights to the software
• A distributed environment for processing
• Can perform multiple tasks at a time
• Rocks reduces cluster setup complexities
• Rocks provides a secure cluster environment
III. SYSTEM DESIGN
Fig. 1 System design
Figure 1 shows the overall architecture of the proposed system. It consists of two main phases: the disease
analysis phase and the disease diagnosis phase. The following sections describe them in detail.
A. Disease Analysis
The input to this phase is a folder containing the DNA of patients in FASTA format. The FASTA format is
used for representing the DNA information of patients. A sequence in FASTA format begins with a single-line
description, followed by lines of sequence data [9]. The description line is distinguished from the sequence
data by a greater-than symbol at the beginning [9]. The Hadoop MapReduce paradigm is used for processing
the multiple files inside this folder. The MapReduce task is implemented in Java. The file name and path of
each of these FASTA files are provided as the key-value pair of the map task. Inside the map phase, a
bioinformatics tool called fasta36 is invoked with each FASTA file and the path of the database as arguments.
Fasta36 is a tool for identifying the percentage similarity of the genes provided as its arguments. Fasta36
then processes each of the files individually by comparing it against the database and writes the results
to separate files.
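The FASTA layout described above (a greater-than description line followed by sequence lines) can be parsed in a few lines of code. The sketch below is illustrative Python rather than the paper's Java MapReduce code, and the sample records are hypothetical:

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into (description, sequence) pairs.

    A record starts with a '>' description line; the lines up to the
    next '>' (or end of input) hold that record's sequence data.
    """
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:], []
        else:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records

# Hypothetical two-record input, as one patient file might look.
sample = ">patient01 fragment\nATGCGT\nACGT\n>patient02\nGGCATT\n"
print(parse_fasta(sample))
```

In the paper's pipeline, each such file is handed whole to fasta36 inside the map task; a parser like this is only needed if the records are to be inspected or split beforehand.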
B. Disease Diagnosis
Though the previous module provides the percentage similarity of each patient's DNA against every DNA
sequence in the database, Hive is necessary for the final diagnosis of the disease associated with each
patient. Fig. 1 depicts this scenario clearly. The output obtained from the previous module is combined into
a single file and uploaded to Hive in this phase. Apache Hive is a data warehouse infrastructure built on top
of Apache Hadoop for providing data summarization, ad-hoc querying, and analysis of large datasets [10]. It
provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like
language called HiveQL [10]. Hive eases integration between Hadoop and tools for business intelligence and
visualization [10]. Disease diagnosis can now be accomplished in just two steps by means of Hive queries:
• Identify the maximum percentage similarity associated with each patient
• Identify the disease name corresponding to that patient and percentage
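The two steps above amount to a max-per-patient aggregation followed by a lookup of the matching disease name, which HiveQL would express with a GROUP BY and a join. As an illustration of the same logic outside Hive, here is a Python sketch; the table schema and rows are hypothetical, since the paper does not give them:

```python
# Each row: (patient_id, disease_name, percent_similarity), as produced
# by combining the per-file fasta36 outputs into a single table.
rows = [
    ("p1", "disease_a", 62.0),
    ("p1", "disease_b", 91.5),
    ("p2", "disease_c", 88.0),
    ("p2", "disease_b", 47.2),
]

def diagnose(rows):
    # Step 1: find the maximum percentage associated with each patient.
    best = {}
    for patient, disease, pct in rows:
        if patient not in best or pct > best[patient][1]:
            best[patient] = (disease, pct)
    # Step 2: report the disease name at that maximum percentage.
    return {patient: disease for patient, (disease, _) in best.items()}

print(diagnose(rows))
```

The dictionary update plays the role of the GROUP BY/max aggregation, and the final comprehension plays the role of the join back to the disease name.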
IV. PERFORMANCE EVALUATION
A. Experimental Setup
A master node is connected to twelve compute nodes by means of a Cisco SG 300 switch. Rocks Cluster
version 6.1 (Emerald Boa) is used for the cluster setup. After the cluster setup, Hadoop-1.0.4 was deployed
[4] [5]. The next step was to run the MapReduce program. The time taken to execute the MapReduce program
in the parallel environment was measured while varying the number of input files. The time taken to perform
the same task on a single system was also measured.
Fig. 2 Time taken by the cluster environment
B. Experimental Results
Fig 2 shows time consumption of MapReduce paradigm in a cluster environment and Fig 3 shows time
consumption of conventional approach involving single system. From the following graphs we can conclude
that by deploying a cluster environment we can perform disease detection much faster compared to a
conventional approach involving a single system.
Fig. 3 Time taken by the conventional method
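The comparison behind the two figures reduces to a speedup ratio: single-system time divided by cluster time for the same number of input files. The sketch below computes that ratio; the timings are hypothetical illustrations chosen to match the roughly two-fold improvement reported, not the paper's measured values:

```python
# Hypothetical timings (seconds) for increasing numbers of input FASTA
# files; the cluster figures are set near half the single-system ones.
num_files = [10, 20, 40, 80]
single_system = [120.0, 240.0, 480.0, 960.0]
cluster = [60.0, 118.0, 235.0, 470.0]

# Speedup for each workload size: conventional time / cluster time.
speedups = [s / c for s, c in zip(single_system, cluster)]
for n, sp in zip(num_files, speedups):
    print(f"{n} files: speedup {sp:.2f}x")
```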
V. CONCLUSIONS
This paper proposed a cost-effective genetic disease analysis system using Hadoop. The implementation was
done using low-cost commodity machines, which minimizes the cost to a large extent. Most existing genetic
disease analysis systems are standalone software packages, thereby posing numerous disadvantages, and most
existing cluster-based approaches involve high-performance systems. When the cost aspect is considered,
high-performance machines involve a large cost, and the use of such systems for a cluster setup is not
feasible in a smaller research organization. The results show that the proposed system consumes only half
the time taken by conventional approaches for disease analysis. Thus, considering performance versus
cost-effectiveness, the proposed commodity cluster model for genetic disease analysis is a suitable approach
for small research organizations.
ACKNOWLEDGMENT
We are greatly indebted to the college management and the faculty members for providing necessary
facilities and hardware along with timely guidance and suggestions for implementing this work.
REFERENCES
[1] Hadoop, http://www.hadoop.apache.org
[2] Rocks Cluster, http://www.rocksclusters.org/
[3] Hadoop, http://www.ibm.com/developerworks/library/l-hadoop-1/
[4] Hadoop deployment, http://ankitasblogger.blogspot.in/2011/01/hadoop-cluster-setup.html
[5] Hadoop deployment, http://icanhadoop.blogspot.in/2012/09/configuring-hadoop-is-very-if-you-just.html
[6] H. Stockinger, M. Pagni, L. Cerutti, L. Falquet, “Grid Approach to,” Proc. of the 2nd IEEE Intl. Conf. on e-Science and Grid Computing, 2006, doi:10.1109/E-SCIENCE.2006.70.
[7] J. Andrade, M. Andersen, L. Berglund, and J. Odeberg, “Applications of Grid Computing in Genetics and Proteomics,” LNCS, 2007, doi:10.1007/978-3-540-75755-9.
[8] P. Balaji, et al., “Distributed I/O with ParaMEDIC: Experiences with a Worldwide Supercomputer,” Proceedings of the IEEE International Supercomputing Conference, June 2008.
[9] FASTA format, http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml
[10] Hive, http://hortonworks.com/hadoop/hive/
[11] Genetic analysis software, http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/genetic_analysis.html
[12] Bincy P Andrews, Binu A, “Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster,” IOSR Journal of Computer Engineering, Volume 14, Issue 6 (Sep-Oct. 2013), pages 89-93.