Implementation of the BLAST Algorithm Using Hadoop MapReduce

University of Belgrade
School of Electrical Engineering
Siniša Ivković, Goran Rakočević, Prof. Veljko Milutinovic
Introduction
- Sequence alignment
• A way of arranging DNA, RNA, or protein sequences to identify regions of similarity that may reflect:
• functional
• structural
• evolutionary relationships between sequences
- How do we know that two genes, often in different organisms, are in fact two versions of the same gene?
Similarity!
Siniša Ivković - [email protected]
Introduction
• A number of algorithms solve the sequence alignment problem and guarantee the best solution
• As the amount of data to be processed grows, the execution time of these algorithms becomes unacceptable
• Therefore, we must turn to heuristic methods such as BLAST
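To illustrate the kind of exact method mentioned above, here is a minimal sketch of the Smith-Waterman dynamic programming recurrence, which returns the optimal local alignment score of two sequences. The function name and the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for illustration and are not taken from the slides. The table has one cell per pair of positions, so the work grows with the product of the two sequence lengths, which is what makes exact search slow on large databases.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not values from the slides.
    """
    prev = [0] * (len(b) + 1)   # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score of aligning a[i-1] with b[j-1] on top of the diagonal cell.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

For example, `smith_waterman_score("XXACGTXX", "YYACGTYY")` returns 8, the score of the shared stretch `ACGT`.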
BLAST - Basic Local Alignment Search Tool
• Fast local sequence alignment algorithm
• BLAST is efficient because it looks for regions of high similarity rather than trying to find and check every possible local alignment.
KRKLQRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKL
KKKHRRNRTTFTTYQLHQLERAFEASHYPDVYSREELAAKVHLPEVRVQVWFQNRRAKWRRQERL
KKKHRRNRTTFTTYQLHQLERAFEASHYPDVYSREELAAKVHLPEVRVQVWFQNRRAKWRRQERL
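The heuristic idea can be sketched as "seed and extend": find short words that occur in both sequences, then grow each match in both directions without gaps. The code below is a simplified illustration, not the BLAST implementation. The names `seed_and_extend` and `_extend`, the word length `k=3`, the +1/-1 scoring, and the `drop` threshold are all assumptions; real BLAST scores protein residues with a substitution matrix and also seeds on similar, not only identical, words.

```python
def seed_and_extend(query, subject, k=3, drop=2):
    """Return (score, query_start, subject_start, length) of the best ungapped hit.

    Simplified sketch: exact k-letter seeds, +1 per match, -1 per mismatch.
    """
    # Index every k-letter word of the subject by its start positions.
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)

    best = (0, 0, 0, 0)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            # The seed itself contributes k matches; extend on both sides.
            left = _extend(query, subject, i - 1, j - 1, -1, drop)
            right = _extend(query, subject, i + k, j + k, +1, drop)
            score = k + left[0] + right[0]
            length = k + left[1] + right[1]
            hit = (score, i - left[1], j - left[1], length)
            if hit[0] > best[0]:
                best = hit
    return best


def _extend(query, subject, i, j, step, drop):
    """Extend without gaps in one direction; return (gain, positions kept)."""
    score = best_score = 0
    kept = n = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        score += 1 if query[i] == subject[j] else -1
        n += 1
        if score > best_score:
            best_score, kept = score, n
        elif best_score - score > drop:
            break  # the score fell too far below the best seen so far
        i += step
        j += step
    return best_score, kept
```

Only diagonals that contain a shared word are examined, so most of the alignment table that an exact method would fill is never touched.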
Parallel BLAST
- Most bioinformatics algorithms are designed to run sequentially
• The very nature of bioinformatics processing
• Rapid progress in biology constantly produces new concepts and significantly changes established ones
- The declining price of genome sequencing demands faster execution of these algorithms
- Existing implementations of parallel BLAST
• PThread
• MPI
ETF Hadoop BLAST
- Big Data – a collection of data sets so large and complex that it becomes difficult to process with standard database tools or traditional data processing applications
- Parallel computing – a form of computation in which many calculations are carried out simultaneously; it raises problems such as:
• communication and synchronization between processes
• hardware failure
- MapReduce – a programming model that frees programmers from having to think about these problems
- Apache Hadoop – a free implementation of the MapReduce paradigm
MapReduce
[Figure: MapReduce data flow. Input VALUEs are processed by parallel MAP tasks, the intermediate VALUEs are SORTed, and REDUCE tasks combine them into output VALUEs.]
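The MAP, SORT, and REDUCE stages can be imitated in a few lines of ordinary Python. This is an in-memory sketch of the programming model only; it does not use Hadoop, and the names `run_mapreduce`, `count_residues`, and `total` are hypothetical helpers introduced for the example.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer):
    """Run mapper over records, sort by key, then reduce each key group."""
    # MAP: each input record may produce any number of (key, value) pairs.
    pairs = [pair for record in records for pair in mapper(record)]
    # SORT: bring all pairs with the same key next to each other.
    pairs.sort(key=itemgetter(0))
    # REDUCE: combine the values collected for each key into one result.
    return {key: reducer(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))}


def count_residues(sequence):
    # Example mapper: emit (residue, 1) for every letter in the sequence.
    return [(residue, 1) for residue in sequence]


def total(key, values):
    # Example reducer: sum the counts collected for one residue.
    return sum(values)
```

Calling `run_mapreduce(["AAC", "CA"], count_residues, total)` returns `{"A": 3, "C": 2}`. In a real framework the map and reduce calls run on different machines, and the framework handles the communication and failure recovery.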
ETF Hadoop BLAST - Implementation
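[Figure: map phase. The query {q1} (mySequence) is paired with each database partition {db1}, {db2}, {db3}. One MAP task runs per pair and emits hits tagged with their partition: {hit1} and {hit2} for {db1}, {hit3} and {hit4} for {db2}, {hit5} and {hit6} for {db3}.]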
ETF Hadoop BLAST - Implementation
[Figure: map and reduce phases. MAP tasks pair the query {q1} (mySequence) with the database partitions {db1}, {db2}, {db3} and emit {hit1} to {hit6}, each tagged with its partition. REDUCE tasks then output one hit per partition: {hit1} for {db1}, {hit3} for {db2}, {hit6} for {db3}.]
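The data flow in the diagram can be sketched as follows: the query is sent to every database partition, each MAP task searches its own partition, and a REDUCE step keeps one hit per partition. This is an assumed reading of the figure, not the authors' code. The scoring function `shared_words` is a deliberately crude stand-in for a BLAST search, and all function names, the word length, and the "keep the highest score" rule are assumptions.

```python
from collections import defaultdict


def shared_words(query, subject, k=3):
    """Stand-in for a BLAST search: count k-letter words of query found in subject."""
    words = {subject[i:i + k] for i in range(len(subject) - k + 1)}
    return sum(query[i:i + k] in words for i in range(len(query) - k + 1))


def blast_map(partition_name, sequences, query):
    """MAP: search one database partition, emit (partition, (score, seq_id))."""
    for seq_id, subject in sequences.items():
        score = shared_words(query, subject)
        if score > 0:
            yield partition_name, (score, seq_id)


def blast_reduce(partition_name, hits):
    """REDUCE: keep the highest scoring hit of one partition (assumed rule)."""
    return max(hits)


def search(partitions, query):
    """Run one MAP task per partition, group hits by key, then reduce."""
    grouped = defaultdict(list)
    for name, sequences in partitions.items():
        for key, hit in blast_map(name, sequences, query):
            grouped[key].append(hit)   # shuffle step: group hits by partition
    return {key: blast_reduce(key, hits) for key, hits in grouped.items()}
```

Because each MAP task needs only the query and its own partition, the tasks are independent and can run on different nodes without communicating with one another.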
ETF Hadoop BLAST
>GENSCAN00000000013 pep:genscan chromosome:GRCh37:18:4755977:4807982:1 transcript:GENSCAN00000000013 transcript_biotype:protein_coding
TANTGLLAVKVEVIILVSLTHAQLSRAGQHAGCTTCLQDECAVAAGEEEETQQGELADVIYPSLL
AASTSSVLEDGAGPHKGLQKLSRLIRFVDVVGGFRREKGYMAWIKPRYSEFPKVNSWTESSFP
FG
TANTGLLAVKVEVIILVSLTHAQLSRAGQHAGCTTCLQDECAVAAGEEEETQQGELADVIYPSLL
AASTSSVLEDGAGPHKGLQKLSRLIRFVDVVGGFRREKGYMAWIKPRYSEFPKVNSWTESSFP
FG
HSP: 661
E-value: 0.001446314485823671
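The record above follows the FASTA convention: a header line starting with `>` followed by one or more lines of sequence. A minimal sketch of reading such records is given below. The function names are hypothetical, and the parser assumes that each header sits on a single line and that a sequence may be wrapped over several lines.

```python
def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append(_finish(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)   # sequence lines are joined later
    if header is not None:
        records.append(_finish(header, chunks))
    return records


def _finish(header, chunks):
    # The identifier is the text up to the first space; the rest is description.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)
```

Splitting a large FASTA database on record boundaries like this is one way the partitions handed to the MAP tasks could be prepared.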
Conclusion
- Bioinformatics has become an important part of many areas of
biology
• Sequencing and annotating genomes and their observed mutations
• Data mining of biological literature and the development of gene ontologies
• Understanding the evolutionary aspects of molecular biology
- Personalized medicine
• Medical model that proposes the customization of healthcare
• We need to consider the whole spectrum of clinical information
• Electronic health care records
• Clinical trials
• etc.
Conclusion
- We need to collect information from the real world
- Develop analytics that can actually extract causal relationships and generate predictive models
- Future steps:
• Specialized hardware (FPGA)