Survey
Asian Journal of Computer Science And Information Technology 4:3 (2014) 21-24.
Contents lists available at www.innovativejournal.in
Journal homepage: http://www.innovativejournal.in/index.php/ajcsit

BIO SEQUENCE DATA MINING: A SURVEY

Nirbhaya Shaji, Sminu Izudheen
Department of Computer Science, Rajagiri School of Engineering and Technology, Cochin, India

Corresponding author: Nirbhaya Shaji, Department of Computer Science, Rajagiri School of Engineering and Technology, Cochin, India

ABSTRACT

The rapid progress of biotechnology and biodata analysis methods has led to the emergence and fast growth of a promising field, bioinformatics. Bioinformatics is the science of collecting and analyzing complex biological data. More specifically, it is the science of managing, mining, integrating and interpreting information from biological data at the genomic, proteomic, transcriptomic and metabolomic levels. Data mining, or knowledge discovery from data (KDD), in its most fundamental form extracts interesting, nontrivial, implicit, previously unknown and potentially useful information from data. The aim is therefore to bridge these two fields, data mining and bioinformatics, for the successful mining of biological data. In this paper, we present an overview of data mining studies that help biodata analysis and survey some algorithms that may motivate the further development of data mining tools for taming various kinds of biological data.

2014, AJCSIT, All Rights Reserved.

INTRODUCTION

Bioinformatics is a field in which data mining can be applied extensively with fruitful results. Biomedical data is produced very rapidly at different locations, using varying devices and several different data acquisition techniques [1]. Extracting and analyzing this data poses a much bigger challenge than generating it.
These data are also heterogeneous and distributed in nature. To bring them into a logical and meaningful format, we may therefore need to transform the data, link it with annotations, or have its structure explicitly specified by support programs. Of all the different types of data produced, biosequence data is the one-dimensional ordering of monomers covalently linked within a biopolymer. It is also referred to as the primary structure of a biological macromolecule; examples are protein sequences, DNA sequences and other nucleotide sequences. Biosequence data mining mainly aims to recognize functional elements, predict the biological functions of sequences, and identify the interactions and mutual functions between sequences [2]. Biosequence patterns reflect elements in biosequences such as repeated patterns and conserved sequence patterns [3]. Hence pattern mining is one of the most important research areas in biosequence data mining.

While tremendous progress has been made over the years, many of the fundamental problems in bioinformatics, such as protein structure prediction, gene-environment interaction, and regulatory network mapping, have not been convincingly addressed. In addition, new technologies such as next-generation sequencing are producing massive amounts of sequence data. Managing, mining and compressing these data raise challenging issues. Finally, there is a pressing need to use these data and computational techniques to build network models of complex biological processes and disease phenotypes. Data mining will play an essential role in addressing these fundamental problems and in the development of novel therapeutic and diagnostic solutions in the post-genomic era of medicine.

The remainder of this paper is organized as follows. Section II gives an overview of several pattern mining algorithms that form the basis of biodata mining algorithms.
This is followed by a detailed review of biodata mining algorithms in Section III, and the paper closes with a conclusion.

PATTERN MINING ALGORITHMS

A. Sequential Pattern Mining Algorithms

The problem of mining sequential patterns was first introduced by Agrawal and Srikant in their paper Mining Sequential Patterns (1995) [11]. In that paper they studied the problem of mining sequential patterns from a database of customer sales transactions. The two algorithms they proposed, AprioriAll and AprioriSome, showed comparable performance, although AprioriSome performs slightly better for lower values of the minimum number of customers that must support a sequential pattern. However, this problem formulation had the following limitations: absence of time constraints, a rigid definition of a transaction, and absence of taxonomies. In 1996, as a performance improvement on their initial work, they therefore introduced the GSP algorithm [11], which discovers generalized sequential patterns. The problem was to find all sequences whose support is greater than the user-specified minimum support, while respecting user-specified min-gap and max-gap time constraints and a user-specified sliding window size. Empirical evaluation using synthetic and real-life data showed that GSP is much faster than the AprioriAll algorithm that preceded it.

Later, in 2001, Zaki proposed the SPADE algorithm [12], based on the candidate generate-and-test technique, which resulted in much faster discovery of sequential patterns. SPADE uses combinatorial properties to decompose the original problem into smaller sub-problems that can be solved independently using efficient lattice search techniques and simple join operations. All sequences are discovered in only three database scans. SPADE outperformed the best previous algorithms by greatly reducing the number of scans and increasing the processing speed.
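The support definition with min-gap and max-gap constraints can be illustrated with a minimal sketch. It is not GSP itself: it assumes each sequence is a plain string of single items, measures a gap as the number of skipped positions rather than as a time difference, and ignores the sliding window and taxonomies. The names contains_with_gaps and support are hypothetical.

```python
def contains_with_gaps(seq, pattern, min_gap=0, max_gap=None):
    """Return True if `pattern` occurs in `seq` as a subsequence such that the
    number of skipped positions between consecutive matched items lies in
    [min_gap, max_gap] (max_gap=None means unbounded)."""
    if not pattern:
        return True
    # Positions where a match of the first pattern item can end.
    ends = {i for i, item in enumerate(seq) if item == pattern[0]}
    for wanted in pattern[1:]:
        next_ends = set()
        for j, item in enumerate(seq):
            if item != wanted:
                continue
            # Position j extends a partial match if some earlier end position
            # is at an allowed distance.
            for i in ends:
                gap = j - i - 1
                if gap >= min_gap and (max_gap is None or gap <= max_gap):
                    next_ends.add(j)
                    break
        ends = next_ends
        if not ends:
            return False
    return True


def support(database, pattern, min_gap=0, max_gap=None):
    """Number of sequences in `database` that contain `pattern`."""
    return sum(contains_with_gaps(s, pattern, min_gap, max_gap) for s in database)
```

Tracking every possible end position, rather than greedily taking the first match, matters once a max-gap is set: in "AXAC" the pattern "AC" with max_gap=0 is only found by starting from the second A.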
Lately, several mining algorithms for biological data have been proposed, such as PTR-based (perfect tandem repeat) algorithms, ATR-based (approximate tandem repeat) algorithms, the REPuter algorithm by Kurtz, and the Trfinder algorithm by Benson.

The PTR-based algorithm [19] proposed by Kolpakov and Kucherov in 1999 relies on the fact that a word contains only a linear number of maximal repetitions. Because a word may contain a quadratic number of repetitions, they introduced the notion of a maximal repetition: a repetition whose extension by one letter to the right or to the left yields a word with a larger period. The main drawback is that the approach does not yield a reasonable constant factor in the linear bound.

The ATR-based algorithms, STAR [20] and exhaustive whole-genome tandem repeat search [21], are considered more effective for biosequences. The STAR work developed an exact algorithm to locate approximate tandem repeats (ATRs) of a motif in a DNA sequence. The idea behind the method is that a compression method tries to reduce the size of a sequence s by exploiting a property P. It re-encodes s relative to P, which may or may not compress s; the more relevant the property, the better the compression. STAR operates in three steps. First, it aligns the sequence s with a perfect repeat (ETR: exact tandem repeat) of the motif m and obtains an optimal list of mutations that convert this repeat into s, together with the optimal length of the ETR. Second, from this mutation list, it evaluates the compression gain as if s were a single ATR of m. Third, it maximizes the global compression gain by decomposing s optimally into ATR and non-ATR segments.

The REPuter algorithm [22], which is based on suffix trees and sequence alignment, imposes no length limitation on the input sequences.
The search engine REPfind of REPuter uses an efficient and compact implementation of suffix trees to locate exact repeats in linear space and time. However, finding frequent repeats is not efficient with REPuter. To address this, in 2005 Wang et al. proposed a new concept of repetition, the largest pattern repetition (LPR), together with the concept of a pattern unit [23]. A lightweight index structure, the succeeding unit array (SUA), was designed on top of the pattern unit. The SUA reduces space consumption and removes the space bottleneck in the search for repetitions. Wang et al. also introduced a new comparability standard and the concept of SATR (segment-similarity based approximate tandem repeats) [24]. They designed an algorithm named Suasatr for detecting similar repeated segments. The algorithm places no limit on pattern length during the search. For the same set of DNA sequences with fixed similarity, it is faster than traditional algorithms, but it is not very efficient when processing long sequences.

The BioPM algorithm [25] for protein sequences, by Xiong et al., introduced the concept of multiple supports to improve performance and efficiency. Later, the mMbioPM algorithm [26] improved BioPM's efficiency by optimizing the hash list structures to reduce the running time. However, these algorithms were still not efficient given the huge volume of projected databases they construct. Another disadvantage is that they produce a large number of irrelevant short patterns during mining, which wastes a great deal of memory space and computation time.

B. Structured Pattern Mining Algorithms

In 2004 the field took a new turn when Han, Pei, et al. introduced structured pattern mining. They proposed the pattern growth method [13], in which the sequence database is partitioned into much smaller projected databases. Recursive mining is then performed on these smaller projected databases.
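The idea of projected databases can be sketched as follows. This is a simplified prefix-projection routine written for illustration, not the algorithm of [13]: it assumes each sequence is a string of single items, uses no gap constraints, and stores each projected database as a list of suffixes. The function name pattern_growth is hypothetical.

```python
from collections import Counter


def pattern_growth(database, min_support):
    """Return {pattern: support} for every subsequence pattern contained in at
    least `min_support` sequences, found by recursive prefix projection."""
    results = {}

    def mine(prefix, projected):
        # Count each item at most once per projected sequence.
        counts = Counter()
        for suffix in projected:
            counts.update(set(suffix))
        for item, count in sorted(counts.items()):
            if count < min_support:
                continue
            new_prefix = prefix + item
            results[new_prefix] = count
            # Projected database for the longer prefix: what remains of each
            # sequence after the first occurrence of `item`.
            new_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(new_prefix, new_projected)

    mine("", list(database))
    return results
```

Each recursive call works on suffixes only, so the projected databases shrink as the prefix grows, while the recursion depth equals the length of the pattern being grown.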
However, the efficiency of the pattern growth method decreases as pattern length increases [14].

Biosequences always undergo mutation. A gene mutation is defined as an alteration in the sequence of nucleotides in DNA. With this in mind, many probabilistic approaches have been proposed to model biosequence pattern mining. In 2010, Zaki et al. presented VOGUE, a method based on a hidden Markov model (HMM), for modeling complex patterns in sequential data [15]. It combines two separate techniques: pattern mining and data modeling. VOGUE was applied to a variety of real sequence data taken from domains such as protein sequence classification, web usage logs, intrusion detection, and spelling correction. Given a database of protein sequences, the goal is to build a statistical model that can determine whether or not a query protein belongs to a given family (class). Statistical models for proteins, such as profiles, position-specific scoring matrices, and hidden Markov models [Eddy 1998], have been developed to find homology. However, in most biological sequences, interesting patterns repeat (either within the same sequence or across sequences) and may be separated by variable-length gaps. A method like VOGUE, which specifically takes these kinds of patterns into account, was therefore very effective. In 2011, Wan proposed a method based on a hierarchical hidden Markov model [16] to detect frequent patterns in sequences, with the added advantage that no prior knowledge is required to learn the structure of the model.

More recently, Li et al. [17] presented two algorithms, Gap-BIDE and Gap-Connect: Gap-BIDE mines closed gap-constrained subsequences from a set of input sequences, and Gap-Connect mines repetitive gap-constrained subsequences from a single input sequence. Studies showed these methods to be very efficient in mining frequent subsequences with gap constraints. Rassi et al.
tackled the problem of bounding sequential patterns, and introduced two methods to estimate the number of candidate sequences. Liu et al. [18] proposed an incremental mining algorithm for sequential patterns that uses a frequent sequence tree as the storage structure. They also proposed an algorithm for constructing the frequent sequence tree, and a pruning strategy to optimize the tree construction.

BIO DATA MINING ALGORITHMS

Because of the particular nature of biological data, the mining methods described above are not fully efficient for large-scale mining of such data. Many studies have been carried out in this area recently. Finding tandem repeats in sequences has proven useful in genome cartography, forensics, population studies, etc.

A. MSPM based algorithm

Clearly, there is a need for an algorithm that improves the efficiency and speed of frequent pattern detection. This motivates three ideas: the primary pattern, which can be extended to form larger patterns in the sequence; a prefix tree to detect frequent primary patterns; and, based on this prefix tree, a pattern extending approach. The pattern extending approach mines the frequent patterns without producing a large number of irrelevant candidate patterns. Moreover, because biosequences are built on an alphabet of only 4 (DNA) or 20 (protein) characters, the number of primary patterns in a biosequence is limited, the size of their prefix tree is limited, and the degree of the tree is a constant. Considering all this, L. Chen and W. Liu proposed a fast and efficient algorithm, MSPM (multiple sequence pattern mining).
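The prefix tree of primary patterns can be illustrated with a small sketch. It is not the MSPM algorithm of Chen and Liu: it assumes, purely for illustration, that a primary pattern is a contiguous substring of fixed length k and that support is the number of sequences containing it. The names Node, build_prefix_tree and frequent_primary_patterns are hypothetical.

```python
class Node:
    """Prefix-tree node; its degree is bounded by the alphabet size."""

    def __init__(self):
        self.children = {}    # next character -> child Node
        self.seq_ids = set()  # ids of sequences containing the pattern ending here


def build_prefix_tree(sequences, k):
    """Insert every length-k substring of every sequence into a prefix tree."""
    root = Node()
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            node = root
            for ch in seq[start:start + k]:
                node = node.children.setdefault(ch, Node())
            node.seq_ids.add(seq_id)
    return root


def frequent_primary_patterns(root, min_support):
    """Return {pattern: support} for leaves supported by >= min_support sequences."""
    found = {}
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        if not node.children and len(node.seq_ids) >= min_support:
            found[prefix] = len(node.seq_ids)
        for ch, child in node.children.items():
            stack.append((prefix + ch, child))
    return found
```

For DNA every node has at most 4 children, so the tree for length-k patterns has at most 4^k leaves regardless of how many sequences are inserted, which is the bounded-size property the paragraph above relies on.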
Experimental results showed that MSPM achieves not only higher speed but also higher-quality results than other methods [27].

B. Mining frequent patterns from sequences with wildcards

Traditional pattern mining algorithms focus on the problem of mining frequent patterns from sequence databases. The support of a pattern is defined as the number of sequences that contain the pattern, and no gap constraints are defined. Fei Xie, Xindong Wu, Xuegang Hu, Jun Gao, Dan Guo, Yulian Fei and Ertian Hua [28] define a wildcard as a special symbol that can be matched by any character in the alphabet. A gap is a sequence of wildcards, and the size of a gap is the number of wildcards it contains. In their paper they proposed the MAIL algorithm for mining frequent patterns with wildcards, i.e. with gap constraints. The experimental results showed that, on average, the number of mined patterns was about 2 times larger and the running time about 12 times faster than with earlier methods.

C. Bio sequencing in Health Care

Integrating genetic data into Electronic Health Records (EHRs) has been a challenge in the health sector for more than a decade because of many legal and ethical issues. It is nevertheless quite evident that if this level of data integration can be brought into EHRs, the challenging and daunting goal of personalized medicine and tailor-made drugs can become part of mainstream diagnosis and treatment strategies used by physicians. The process of inferring knowledge from this information in turn depends on the ability to analyze genomic sequence data. A great deal of research is being done in this field. Examples are "Incorporating personalized gene sequence variants, molecular genetics knowledge, and health knowledge into an EHR prototype based on the Continuity of Care Record standard" by Xia Jing et al. [29], "Technical desiderata for the integration of genomic data into Electronic Health Records" by Masys et al. [30], and the work of Kullo, Fan et al. on leveraging the electronic medical record (EMR) to conduct genome-wide association studies (GWAS) [31].

CONCLUSION

Both data mining and bioinformatics are fast-expanding research frontiers. It is important to examine which research issues in bioinformatics matter most and to develop new data mining methods for scalable and effective biodata analysis. Efficient biosequence mining methods will be needed in a wide variety of fields. A very recent and challenging application is integrating genetic data into the Electronic Medical Record (EMR). This will open up greater possibilities for successful personalized medicine and more patient-centric treatment in the health care sector. Active interaction and collaboration between these two fields has just started, and many exciting results will appear in the near future.

REFERENCES

[1] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, 2001.
[2] P. Bajcsy, J. Han, L. Liu, et al., Survey of data analysis from a data mining perspective, 200
[3] D. He, X. Zhu, X. Wu, Approximate repeating pattern mining with gap requirements, in: Proceedings of the 21st International Conference on Tools with Artificial Intelligence (ICTAI 09), 2009, pp. 17-24.
[4] D. Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press, Woodbury, NY, 2001.
[5] D. G. Higgins, P. M. Sharp, Clustal: a package for performing multiple sequence alignment on a microcomputer, Gene 73:237-244, 1988.
[6] X. Huang, A. Madan, CAP3: a DNA sequence assembly program, Genome Research 9:868-877, 1999.
[7] M. Borodovsky, J. McIninch, GeneMark: parallel gene recognition for both DNA strands, Comput. Chem. 17:123-133, 1993.
[8] C. B. Burge, S. Karlin, Finding the genes in genomic DNA, 1998.
[9] J. Quackenbush, Computational analysis of microarray data, Nature Reviews Genetics, 2001.
[10] P. Karp, M. Riley, M. Saier, The EcoCyc Database, Nucleic Acids Research, 2002.
[11] R. Srikant, R. Agrawal, Mining sequential patterns: generalization and performance improvements, in: Proceedings of the 15th International Conference on Extending Database Technology, London: Springer-Verlag, 1996, pp. 3-17.
[12] M. Zaki, SPADE: an efficient algorithm for mining frequent sequences, Mach. Learn. (2001) 31-60.
[13] J. Han, J. Pei, From sequential pattern mining to structured pattern mining: a pattern-growth approach, J. Comput. Sci. Technol. 23 (2004) 257-279.
[14] J. H. Wang, Asanuma, K. Yoshiaki, A scalable sequential pattern mining algorithm, in: Proceedings of the 2006 IEEE International Conference on Computer Systems and Applications, 2006, pp. 437-444.
[15] M. J. Zaki, C. D. Carothers, B. K. Szymanski, VOGUE: a variable order hidden Markov model with duration based on frequent sequence mining, Trans. Knowl. Discovery Data 4(1) (2010).
[16] Li Wan, Learning frequent episodes based hierarchical hidden Markov models in sequence data, Commun. Comput. Inf. Sci. 153 (2011) 120-124.
[17] C. Li, Q. Y. Yang, J. Y. Wang, M. Li, Efficient mining of gap-constrained subsequences and its various applications, Trans. Knowl. Discovery Data 6(1) (2012) (2.1-2.39).
[18] J. Liu, S. Yan, J. Ren, The design of frequent sequence tree in incremental mining of sequential patterns, in: Proceedings of 2011 IEEE Second International Conference on Software Engineering and Service Science (ICSESS), pp. 679-682.
[19] R. Kolpakov, G. Kucherov, Finding maximal repetitions in a word in linear time, in: Proceedings of the 1999 Symposium on Foundations of Computer Science, Washington, 1999, pp. 596-604.
[20] O. Delgrange, E. Rivals, STAR: an algorithm to search for tandem approximate repeats, Bioinformatics 20 (2004) 2812-2820.
[21] A. Krishnan, F. Tang, Exhaustive whole-genome tandem repeats search, Bioinformatics 20 (2004) 2702-2710.
[22] S. Kurtz, J. V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, R. Giegerich, REPuter: the manifold applications of repeat analysis on a genomic scale, Nucleic Acids Res. 29 (2001) 4633-4642.
[23] D. Wang, G. Wang, G. Q. Wu, B. Chen, Finding LPRs in DNA sequence based on a new index SUA, in: Proceedings of the IEEE Fifth Symposium on Bioinformatics and Bioengineering (BIBE 2005), Washington, 2005, pp. 281-284.
[24] D. Wang, Y. Zhao, B. Chen, G. Wang, SUA-based algorithm for finding SATRs in DNA sequence, J. Northeastern Univ. (Nat. Sci.) 28 (2007) 209-212.
[25] Y. Xiong, Y. Zhu, BioPM: an efficient algorithm for protein motif mining, in: Proceedings of ICBBE07, IEEE Press, 2007, pp. 394-397.
[26] Q. Zhou, Q. Jiang, S. Li, X. Xie, L. Lin, An efficient algorithm for protein sequence pattern mining, in: Proceedings of 2010 Fifth International Conference on Computer Science and Education (ICCSE), 2010, pp. 1876-1881.
[27] Ling Chen, Wei Liu, Frequent pattern mining in multiple biological sequences, Computers in Biology and Medicine,
[28] Fei Xie (Coll. of Comput. Sci. and Info. Eng., Hefei Univ. of Tech., Hefei, China), Xindong Wu, Xuegang Hu, Jun Gao, Dan Guo, Yulian Fei, Ertian Hua, Sequential Pattern Mining with Wildcards, October 2010. ELSEVIER, 2013 July.
[29] Xia Jing, Stephen Kay, Thomas Marley, Nicholas R. Hardiker, James J. Cimino, Incorporating personalized gene sequence variants, molecular genetics knowledge, and health knowledge into an EHR prototype based on the Continuity of Care Record standard, April 2011.
[30] Daniel R. Masys, Gail P. Jarvik, Neil F. Abernethy, Nicholas R. Anderson, George J. Papanicolaou, Dina N. Paltoo, Mark A. Hoffman, Isaac S. Kohane, Howard P. Levy, Technical desiderata for the integration of genomic data into Electronic Health Records, December 2011.
[31] Iftikhar J. Kullo, Jin Fan, Jyotishman Pathak, Guergana K. Savova, Zeenat Ali, Christopher G. Chute, Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease, June 2010.