Asian Journal of Computer Science And Information Technology 4 : 3(2014) 21 - 24.
Contents lists available at www.innovativejournal.in
Asian Journal of Computer Science And Information Technology
Journal Homepage: http://www.innovativejournal.in/index.php/ajcsit
BIO SEQUENCE DATA MINING : A SURVEY
Nirbhaya Shaji, Sminu Izudheen
Department of Computer Science, Rajagiri School of Engineering and Technology, Cochin, India
ARTICLE INFO
Corresponding Author:
Nirbhaya Shaji,
Department of Computer Science,
Rajagiri School of Engineering
and Technology, Cochin, India
[email protected]
ABSTRACT
The rapid progress of biotechnology and biodata analysis methods has led to the emergence and fast growth of a promising field, bioinformatics. Bioinformatics is the science of collecting and analyzing complex biological data. More specifically, it is the science of managing, mining, integrating and interpreting information from biological data at the genomic, proteomic, transcriptomic and metabolomic levels. Data mining, or knowledge discovery from data (KDD), in its most fundamental form, is the extraction of interesting, nontrivial, implicit, previously unknown and potentially useful information from data. The aim is therefore to bridge these two fields, data mining and bioinformatics, for the successful mining of biological data. In this paper, we present an overview of data mining studies that help biodata analysis and survey some algorithms that may motivate the further development of data mining tools for taming various kinds of biological data.
2014, AJCSIT, All Rights Reserved.
INTRODUCTION
Bioinformatics is a field in which data mining can be applied extensively with fruitful results. Biomedical data is produced very rapidly at different locations, using varying devices and several different data acquisition techniques [1]. Extracting and analyzing this data poses a much bigger challenge than generating it. The data are also heterogeneous and distributed. To tame this data into a logical and meaningful format, we may therefore need to transform the data, link it with annotations, or specify it explicitly using support programs. Of all the different types of data produced, biosequence data is the one-dimensional ordering of monomers covalently linked within a biopolymer. It is also referred to as the primary structure of the biological macromolecule, as in protein sequences, DNA sequences and other nucleotide sequences. Biosequence data mining mainly aims to recognize functional elements, predict the biological functions of sequences, and identify the interactions and mutual functions between sequences [2]. Biosequence patterns reflect elements in biosequences such as repeated patterns and conserved sequence patterns [3]. Pattern mining is hence one of the most important research areas in biosequence data mining.
While tremendous progress has been made over the years, many of the fundamental problems in bioinformatics, such as protein structure prediction, gene-environment interaction, and regulatory network mapping, have not been convincingly addressed. Besides these, new technologies such as next-generation sequencing are producing massive amounts of sequence data. Managing, mining and compressing these data raise challenging issues. Finally, there is a pressing need to use these data and computational techniques to build network models of complex biological processes and disease phenotypes. Data mining will play an essential role in addressing these fundamental problems and in the development of novel therapeutic and diagnostic solutions in the post-genomic era of medicine.
The remainder of this paper is organized as follows: Section II gives an overview of several pattern mining algorithms that form the basis of biodata mining algorithms, Section III reviews biodata mining algorithms in detail, and the paper closes with a conclusion.
PATTERN MINING ALGORITHMS
A. Sequential Pattern Mining Algorithms
The problem of mining sequential patterns was first introduced by Agrawal and Srikant in their paper Mining Sequential Patterns (1995) [11]. In that paper they studied the problem of mining sequential patterns from a database of customer sales transactions. The two algorithms they proposed, AprioriAll and AprioriSome, showed comparable performance, although AprioriSome performs a little better for lower values of the minimum number of customers that must support a sequential pattern. However, this problem formulation had the following limitations: absence of time constraints, a rigid definition of a transaction, and absence of taxonomies. So in 1996, as an improvement on their initial work, they introduced the GSP algorithm [11], which discovers generalized sequential patterns. The problem was to find all sequences whose support was greater than the user-specified minimum support, while considering user-specified min-gap and max-gap time constraints and a user-specified sliding window size. Empirical evaluation using synthetic and real-life data showed that GSP is much faster than the AprioriAll algorithm that preceded it.
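To make the notion of support concrete, the following minimal sketch counts how many customer sequences contain a given sequential pattern. It is not AprioriAll or GSP itself: it only illustrates the support test those algorithms rely on, it ignores the min-gap, max-gap and sliding-window constraints, and the names `contains`, `support` and the toy `database` are assumptions made for illustration.

```python
from typing import List, Set

# One customer's transactions in time order; each transaction is a set of items.
Sequence = List[Set[str]]

def contains(sequence: Sequence, pattern: Sequence) -> bool:
    """True if the itemsets of `pattern` appear, in order, inside distinct
    transactions of `sequence` (each pattern itemset must be a subset)."""
    pos = 0
    for itemset in pattern:
        # Greedily take the earliest transaction that covers this itemset.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(database: List[Sequence], pattern: Sequence) -> int:
    """Number of customer sequences that contain the pattern."""
    return sum(contains(seq, pattern) for seq in database)

# Hypothetical toy database of three customers.
database = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a", "b"}, {"d"}],
    [{"b"}, {"a"}, {"c"}],
]
```

A pattern is frequent when its support reaches the user-specified minimum, so with a minimum support of 2 the pattern a followed by d would qualify in this toy database while b followed by c would not.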
Later, in 2001, Zaki proposed the SPADE algorithm [12], based on the candidate generate-and-test technique, which resulted in much faster discovery of sequential patterns. SPADE utilizes combinatorial properties to decompose the original problem into smaller sub-problems that can be solved independently using efficient lattice search techniques and simple join operations. All sequences are discovered in only three database scans. SPADE outperformed the best previous algorithms by greatly reducing the number of scans and increasing the processing speed.
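The "simple join operations" can be illustrated with a vertical id-list layout, in which each item is stored with the (sequence id, position) pairs where it occurs. The sketch below is a simplified illustration under that assumption, not SPADE's actual implementation: it handles only single-character events and omits the lattice decomposition, and the function names are hypothetical.

```python
from collections import defaultdict

def build_idlists(database):
    """Map each item to its id-list: (sequence id, event position) pairs."""
    idlists = defaultdict(list)
    for sid, sequence in enumerate(database):
        for eid, item in enumerate(sequence):
            idlists[item].append((sid, eid))
    return idlists

def temporal_join(first, second):
    """Id-list of the pattern 'first, later followed by second'.

    Keeps the occurrences of `second` that come after the earliest
    occurrence of `first` in the same sequence."""
    earliest = {}
    for sid, eid in first:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return [(sid, eid) for sid, eid in second
            if sid in earliest and eid > earliest[sid]]

def support(idlist):
    """Number of distinct sequences in which the pattern occurs."""
    return len({sid for sid, _ in idlist})

# Hypothetical toy database of DNA strings.
database = ["ACGT", "AGCT", "TTAC"]
idlists = build_idlists(database)
a_then_c = temporal_join(idlists["A"], idlists["C"])
a_c_t = temporal_join(a_then_c, idlists["T"])
```

Once the id-lists are built, the support of longer patterns is obtained from joins alone, without rescanning the original sequences, which is the property that keeps the number of database scans small.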
Lately, some mining algorithms for biological data have been proposed, such as PTR-based (perfect tandem repeat) algorithms, ATR-based (approximate tandem repeat) algorithms, the REPuter algorithm by Kurtz and the Trfinder algorithm by Benson.
The PTR-based algorithm [19] proposed by Kolpakov and Kucherov in 1999 builds on the result that a word contains only a linear number of maximal repetitions. Because a word may contain a quadratic number of repetitions, they introduced the notion of a maximal repetition: a repetition in a word whose extension by one letter to the right or to the left yields a word with a bigger period. The main drawback was that the approach does not allow a reasonable constant factor to be extracted for the linear bound.
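The definition of a maximal repetition can be checked on small words with a brute-force enumeration. The sketch below is not the linear-time algorithm of [19]; it is a quadratic-or-worse illustration of the definition only, and the function names and the example word are assumptions.

```python
def smallest_period(w):
    """Smallest p such that w[k] == w[k + p] for every valid k."""
    for p in range(1, len(w) + 1):
        if all(w[k] == w[k + p] for k in range(len(w) - p)):
            return p

def maximal_repetitions(s):
    """Return sorted (start, end, period) triples, with `end` exclusive.

    A triple is reported when s[start:end] has smallest period `period`,
    is at least two periods long, and cannot be extended by one letter
    to the left or right without increasing the period."""
    n = len(s)
    runs = set()
    for p in range(1, n // 2 + 1):
        k = 0
        while k + p < n:
            if s[k] != s[k + p]:
                k += 1
                continue
            start = k
            # Extend as far as positions p apart keep matching.
            while k + p < n and s[k] == s[k + p]:
                k += 1
            end = k + p
            if end - start >= 2 * p and smallest_period(s[start:end]) == p:
                runs.add((start, end, p))
    return sorted(runs)
```

For the word "mississippi" this reports the repetition "ississi" with period 3 together with the three squares "ss", "ss" and "pp" with period 1.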
The ATR-based algorithms STAR [20] and whole-genome tandem repeat search [21] are considered more effective for biosequences. In the STAR work, an exact algorithm to locate approximate tandem repeats (ATR) of a motif in a DNA sequence was developed. The idea behind this method is that a compression method tries to reduce the size of a sequence s by exploiting a property P. It re-encodes s relative to P, and this may or may not compress s. The more relevant the property, the better the compression. STAR operates in three steps. First, STAR aligns the sequence s with a perfect repeat (ETR: exact tandem repeat) of the motif m and obtains an optimal list of mutations that convert this repeat into s, together with the optimal length for the ETR. Second, from this mutation list, STAR evaluates the compression gain as if s were a single ATR of m. Third, it maximizes the global compression gain by decomposing s optimally into ATR and non-ATR segments with respect to that gain.
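The first step can be approximated with a plain edit-distance computation. The sketch below is a brute-force stand-in under strong assumptions, not STAR's alignment procedure: it only tries whole numbers of motif copies, uses unit costs for substitutions, insertions and deletions, and returns the number of mutations rather than the mutation list. The names `edit_distance` and `best_exact_repeat` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]

def best_exact_repeat(s, motif):
    """Return (copies, mutations) for the exact tandem repeat of `motif`
    that is closest to `s`; ties go to the smaller number of copies."""
    max_copies = len(s) // len(motif) + 1
    candidates = [(edit_distance(s, motif * k), k)
                  for k in range(1, max_copies + 1)]
    mutations, copies = min(candidates)
    return copies, mutations
```

A low mutation count relative to the length of s suggests that describing s as "motif repeated, plus a few edits" is shorter than writing s out, which is the intuition behind the compression gain evaluated in the second step.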
The REPuter algorithm [22], which is based on suffix trees and sequence alignment techniques, has no length limitation on the input sequences. REPfind, the search engine of REPuter, uses an efficient and compact implementation of suffix trees to locate exact repeats in linear space and time. However, REPuter is not efficient at finding frequent repeats.
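Exact repeats show up as shared prefixes of neighbouring suffixes once the suffixes are sorted, which is the observation that suffix trees exploit. The sketch below swaps in a naively sorted suffix list for the suffix tree, so it is far from linear time and is meant only to illustrate the idea on short strings; it is not REPfind, and the function name is an assumption.

```python
def longest_exact_repeat(s):
    """Return (substring, start positions) for a longest substring that
    occurs at least twice in s (occurrences may overlap)."""
    # Naive suffix array: sort suffix start positions by suffix text.
    suffixes = sorted(range(len(s)), key=lambda i: s[i:])
    best_len, best_pos = 0, None
    for a, b in zip(suffixes, suffixes[1:]):
        # Length of the common prefix of two neighbouring suffixes.
        k = 0
        while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
            k += 1
        if k > best_len:
            best_len, best_pos = k, a
    if best_len == 0:
        return "", []
    repeat = s[best_pos:best_pos + best_len]
    positions = [i for i in range(len(s) - best_len + 1)
                 if s.startswith(repeat, i)]
    return repeat, positions
```

A production tool would replace the sorting step with a linear-time suffix tree or suffix array construction; the neighbouring-suffix comparison stays the same.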
To address this, in 2005 Wang et al. proposed a new concept of repetition, the largest pattern repetition (LPR), together with the concept of a pattern unit [23]. A lightweight index structure, the succeeding unit array (SUA), was designed based on the pattern unit. The SUA efficiently decreases space consumption and removes the space bottleneck in the search for repetitions.
Wang et al. also introduced a new comparability standard and the concept of SATR (segment-similarity-based approximate tandem repeats) [24]. They designed an algorithm named Suasatr for detecting similar repeated segments. The algorithm places no limitation on pattern length during the search. For the same set of DNA sequences with fixed similarity, the algorithm is faster than traditional ones, but it is not very efficient when processing long sequences. The BioPM algorithm [25] for protein sequences by Xiong et al. introduced the concept of multiple supports to improve performance and efficiency. Later, the mMbioPM algorithm [26] improved BioPM's efficiency by optimizing hash list structures to reduce the running time. However, these algorithms were still not efficient, because of the huge volume of the projected databases they construct. Another disadvantage was that they produce a large number of irrelevant short patterns during the mining process, which eventually wastes a large amount of memory space and computation time.
B. Structured Pattern Mining Algorithms
In 2004 the field took a new turn when Han, Pei, et al. introduced structured pattern mining. They proposed the pattern-growth method [13], in which the sequence database is partitioned into much smaller projected databases, which are then mined recursively. However, the efficiency of the pattern-growth method decreases as the pattern length grows [14].
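The projection step can be sketched in a few lines for sequences of single characters. This is a minimal PrefixSpan-style illustration assumed for this survey, not the authors' implementation: it handles plain strings rather than sequences of itemsets, and the name `prefix_span` and the toy database are hypothetical.

```python
from collections import Counter

def prefix_span(database, min_support, prefix=""):
    """Yield (pattern, support) for every frequent subsequence pattern
    that extends `prefix`, by recursive database projection."""
    counts = Counter()
    for suffix in database:
        counts.update(set(suffix))  # count each item once per sequence
    for item, count in sorted(counts.items()):
        if count < min_support:
            continue
        pattern = prefix + item
        yield pattern, count
        # Projected database: what remains after the first occurrence of item.
        projected = [s[s.index(item) + 1:] for s in database if item in s]
        yield from prefix_span(projected, min_support, pattern)

# Hypothetical toy database of DNA strings.
database = ["ACGT", "AGCT", "TTAC"]
frequent = dict(prefix_span(database, min_support=3))
```

Each recursive call works on shorter suffixes than its parent, which is why the projected databases shrink, but a long pattern still requires one level of recursion per symbol, which is the cost noted above.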
Biosequences always undergo mutation. A gene mutation is defined as an alteration in the sequence of nucleotides in DNA. Considering this, many probabilistic approaches have been proposed to model biosequence pattern mining. In 2010, Zaki et al. presented VOGUE, a method based on the hidden Markov model (HMM) for modeling complex patterns in sequential data [15]. It combines two separate techniques: pattern mining and data modeling. VOGUE was applied to a variety of real sequence data taken from domains such as protein sequence classification, web usage logs, intrusion detection, and spelling correction. In protein sequence classification, given a database of protein sequences, the goal is to build a statistical model that can determine whether or not a query protein belongs to a given family (class). Statistical models for proteins, such as profiles, position-specific scoring matrices, and hidden Markov models [Eddy 1998], have been developed to find homology. However, in most biological sequences, interesting patterns repeat (either within the same sequence or across sequences) and may be separated by variable-length gaps. A method like VOGUE, which specifically takes these kinds of patterns into consideration, was therefore very effective.
Around the same time, Wan proposed a method based on a hierarchical hidden Markov model [16] to detect frequent patterns in sequences, with the added advantage that no prior knowledge is required to learn the structure of the model. More recently, Li et al. [17] presented two algorithms: Gap-BIDE, for mining closed gap-constrained subsequences from a set of input sequences, and Gap-Connect, for mining repetitive gap-constrained subsequences from a single input sequence. Studies showed these methods to be very efficient in mining frequent subsequences with gap constraints. Rassi et al. tackled the problem of bounding sequential patterns and introduced two methods to estimate the number of candidate sequences. Liu et al. [18] proposed an incremental mining algorithm for sequential patterns that uses a frequent sequence tree as the storage structure. They also proposed an algorithm for constructing the frequent sequence tree and a pruning strategy to optimize the tree construction.
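A gap constraint can be stated precisely as a bound on how many characters may lie between two consecutive symbols of a pattern. The sketch below checks such a constraint with a regular expression and counts supporting sequences; it illustrates the constraint only and is not Gap-BIDE or Gap-Connect. The function names and the toy database are assumptions.

```python
import re

def gap_regex(pattern, min_gap, max_gap):
    """Compile a regex in which consecutive pattern symbols are separated
    by between min_gap and max_gap arbitrary characters."""
    gap = ".{%d,%d}" % (min_gap, max_gap)
    return re.compile(gap.join(re.escape(ch) for ch in pattern))

def support_with_gaps(database, pattern, min_gap, max_gap):
    """Number of sequences containing the pattern under the gap constraint."""
    rx = gap_regex(pattern, min_gap, max_gap)
    return sum(1 for seq in database if rx.search(seq))

# Hypothetical toy database.
database = ["ACGTAC", "AGGGC", "CA"]
```

Widening the allowed gap can only increase support, so a miner has to balance tolerance for mutations and insertions against the growth in the number of matching patterns.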
BIO DATA MINING ALGORITHMS
Because of the particular nature of biological data, the mining methods described above are not completely efficient for large-scale mining of such data. Many studies have been done in this area recently. Finding tandem repeats in sequences has proven useful in genome cartography, forensics, population studies, and so on.
A. MSPM-Based Algorithm
Clearly, an algorithm is needed that improves the efficiency and speed of frequent pattern detection. This leads to the concept of a primary pattern, which can be extended to form larger patterns in the sequence; a prefix tree for detecting frequent primary patterns; and, based on this prefix tree, a pattern-extending approach. The pattern-extending approach mines frequent patterns without producing a large number of irrelevant candidate patterns. Moreover, because biosequences are built on an alphabet of only 4 (DNA) or 20 (protein) different characters, the number of primary patterns in a biosequence is limited, the size of their prefix tree is limited, and the degree of the tree is constant. Considering all this, L. Chen and W. Liu proposed a fast and efficient algorithm, MSPM (multiple sequence pattern mining). Experimental results showed that MSPM achieves not only faster speed but also higher-quality results than other methods [27].
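The idea of growing patterns from short primary patterns can be sketched as follows. This is a loose illustration under stated assumptions and not the MSPM algorithm of [27]: primary patterns are taken to be contiguous substrings of a fixed length k, a dictionary of occurrence sets stands in for the prefix tree, and patterns are extended one character to the right only. The function name and toy data are hypothetical.

```python
from collections import defaultdict

def frequent_patterns(sequences, k, min_support):
    """Frequent substrings of length >= k, grown from frequent length-k
    primary patterns. Returns {pattern: number of supporting sequences}."""
    # Occurrences of every length-k window: pattern -> {(sequence id, start)}.
    occurrences = defaultdict(set)
    for sid, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            occurrences[seq[start:start + k]].add((sid, start))

    def support(occ):
        return len({sid for sid, _ in occ})

    frontier = {p: occ for p, occ in occurrences.items()
                if support(occ) >= min_support}
    result = {}
    while frontier:
        next_frontier = {}
        for pattern, occ in frontier.items():
            result[pattern] = support(occ)
            # Extend each known occurrence by one character to the right.
            extended = defaultdict(set)
            for sid, start in occ:
                end = start + len(pattern)
                if end < len(sequences[sid]):
                    extended[pattern + sequences[sid][end]].add((sid, start))
            for longer, longer_occ in extended.items():
                if support(longer_occ) >= min_support:
                    next_frontier[longer] = longer_occ
        frontier = next_frontier
    return result

# Hypothetical toy database of DNA strings.
sequences = ["ACGTACG", "TACGA", "GGACG"]
```

Because only occurrences of already frequent patterns are extended, infrequent candidates are never generated, and with a DNA alphabet each pattern has at most four possible one-character extensions.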
B. Mining Frequent Patterns from Sequences with Wildcards
The traditional pattern mining algorithms focus on the problem of mining frequent patterns from sequence databases. The support of a pattern is defined as the number of sequences that contain the pattern, and no gap constraints are defined. Fei Xie, Xindong Wu, Xuegang Hu, Jun Gao, Dan Guo, Yulian Fei and Ertian Hua, in their paper [28], define a wildcard as a special symbol that can be matched by any character in the alphabet. A gap is a sequence of wildcards, and the size of a gap is the number of wildcards it contains. In the paper they proposed an algorithm, MAIL, for mining frequent patterns with wildcards, i.e. gap constraints. The experimental results showed that MAIL mines about 2 times more patterns and runs about 12 times faster on average than earlier proposed methods.
C. Bio Sequencing in Health Care
Integrating genetic data into Electronic Health Records (EHRs) has been a challenge in the health sector for more than a decade because of many legal and ethical issues. But it is quite evident that if that level of data integration can be brought into EHRs, then the challenging and daunting goal of personalized medicine and tailor-made drugs can be made part of mainstream diagnosis and treatment strategies by physicians. The processes of inferring and processing information for knowledge, in turn, depend on the ability to analyze this sequence of genomic data. A lot of research is being done in this field, including "Incorporating personalized gene sequence variants, molecular genetics knowledge, and health knowledge into an EHR prototype based on the Continuity of Care Record standard" by Xia Jing et al. [29], "Technical desiderata for the integration of genomic data into Electronic Health Records" by Daniel R. Masys et al. [30], and the work by Jin Fan et al. on leveraging the electronic medical record (EMR) to conduct genome-wide association studies (GWAS) [31].
CONCLUSION
Both data mining and bioinformatics are fast-expanding research frontiers. It is important to examine what the important research issues in bioinformatics are and to develop new data mining methods for scalable and effective biodata analysis. Efficient biosequencing methods will be needed in a wide variety of fields. A very recent and challenging application is integrating genetic data into the Electronic Medical Record (EMR). This will enable greater possibilities for successful personalized medicine and for more patient-centric treatments in the health care sector. The active interactions and collaborations between these two fields have just started, and a lot of exciting results will appear in the near future.
REFERENCES
[1] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 2001.
[2] P. Bajcsy, J. Han, L. Liu, et al. Survey of data analysis from a data mining perspective, 200
[3] D. He, X. Zhu, X. Wu. Approximate repeating pattern mining with gap requirements, in: Proceedings of the 21st International Conference on Tools with Artificial Intelligence (ICTAI 09), 2009, pp. 17-24.
[4] D. Mount. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, Woodbury, NY, 2001.
[5] D. G. Higgins and P. M. Sharp. Clustal: a package for performing multiple sequence alignment on a microcomputer. Gene, 73:237-244, 1988.
[6] X. Huang and A. Madan. CAP3: a DNA sequence assembly program. Genome Research, 9:868-877, 1999.
[7] M. Borodovsky and J. McIninch. GeneMark: parallel gene recognition for both DNA strands. Comput. Chem., 17:123-133, 1993.
[8] C. B. Burge and S. Karlin. Finding the genes in genomic DNA. 1998.
[9] J. Quackenbush. Computational analysis of microarray data. Nature Reviews Genetics, 2001.
[10] P. Karp, M. Riley, M. Saier. The EcoCyc Database. Nucleic Acids Research, 2002.
[11] R. Srikant, R. Agrawal. Mining sequential patterns: generalization and performance improvements, in: Proceedings of the 5th International Conference on Extending Database Technology. London: Springer-Verlag, 1996, pp. 3-17.
[12] M. Zaki. SPADE: an efficient algorithm for mining frequent sequences. Mach. Learn. (2001) 31-60.
[13] J. Han, J. Pei. From sequential pattern mining to structured pattern mining: a pattern-growth approach. J. Comput. Sci. Technol. 23 (2004) 257-279.
[14] J. H. Wang, Asanuma, K. Yoshiaki. A scalable sequential pattern mining algorithm, in: Proceedings of the 2006 IEEE International Conference on Computer Systems and Applications, 2006, pp. 437-444.
[15] M. J. Zaki, C. D. Carothers, B. K. Szymanski. VOGUE: a variable order hidden Markov model with duration based on frequent sequence mining. Trans. Knowl. Discovery Data 4(1) (2010).
[16] Li Wan. Learning frequent episodes based hierarchical hidden Markov models in sequence data. Commun. Comput. Inf. Sci. 153 (2011) 120-124.
[17] C. Li, Q. Y. Yang, J. Y. Wang, M. Li. Efficient mining of gap-constrained subsequences and its various applications. Trans. Knowl. Discovery Data 6(1) (2012) 2.1-2.39.
[18] J. Liu, S. Yan, J. Ren. The design of frequent sequence tree in incremental mining of sequential patterns, in: Proceedings of 2011 IEEE Second International Conference on Software Engineering and Service Science (ICSESS), pp. 679-682.
[19] R. Kolpakov, G. Kucherov. Finding maximal repetitions in a word in linear time, in: Proceedings of the 1999 Symposium on Foundations of Computer Science. Washington, 1999, pp. 596-604.
[20] O. Delgrange, E. Rivals. STAR: an algorithm to search for tandem approximate repeats. Bioinformatics 20 (2004) 2812-2820.
[21] A. Krishnan, F. Tang. Exhaustive whole-genome tandem repeats search. Bioinformatics 20 (2004) 2702-2710.
[22] S. Kurtz, J. V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, R. Giegerich. REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 29 (2001) 4633-4642.
[23] D. Wang, G. Wang, G. Q. Wu, B. Chen. Finding LPRs in DNA sequence based on a new index SUA, in: Proceedings of the IEEE Fifth Symposium on Bioinformatics and Bioengineering (BIBE 2005). Washington, 2005, pp. 281-284.
[24] D. Wang, Y. Zhao, B. Chen, G. Wang. SUA-based algorithm for finding SATRs in DNA sequence. J. Northeastern Univ. (Nat. Sci.) 28 (2007) 209-212.
[25] Y. Xiong, Y. Zhu. BioPM: an efficient algorithm for protein motif mining, in: Proceedings of ICBBE 07, IEEE Press, 2007, pp. 394-397.
[26] Q. Zhou, Q. Jiang, S. Li, X. Xie, L. Lin. An efficient algorithm for protein sequence pattern mining, in: Proceedings of 2010 Fifth International Conference on Computer Science and Education (ICCSE), 2010, pp. 1876-1881.
[27] Ling Chen, Wei Liu. Frequent pattern mining in multiple biological sequences. Computers in Biology and Medicine.
[28] Fei Xie, Xindong Wu, Xuegang Hu, Jun Gao, Dan Guo, Yulian Fei, Ertian Hua. Sequential Pattern Mining with Wildcards. Coll. of Comput. Sci. and Info. Eng., Hefei Univ. of Tech., Hefei, China. October 2010. Elsevier, 2013 July.
[29] Xia Jing, Stephen Kay, Thomas Marley, Nicholas R. Hardiker, James J. Cimino. Incorporating personalized gene sequence variants, molecular genetics knowledge, and health knowledge into an EHR prototype based on the Continuity of Care Record standard. April 2011.
[30] Daniel R. Masys, Gail P. Jarvik, Neil F. Abernethy, Nicholas R. Anderson, George J. Papanicolaou, Dina N. Paltoo, Mark A. Hoffman, Isaac S. Kohane, Howard P. Levy. Technical desiderata for the integration of genomic data into Electronic Health Records. December 2011.
[31] Iftikhar J. Kullo, Jin Fan, Jyotishman Pathak, Guergana K. Savova, Zeenat Ali, Christopher G. Chute. Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease. June 2010.