Research about Data Mining Application on Bioinformatics

Hsinto Cheung
College of Software
Nankai University
Tianjin, China
[email protected]

Abstract-Bioinformatics is a new cross-disciplinary science area. It aims to analyze and use biological data to solve biological problems. In this research, scientists use many tools from statistical theory and computer science. Meanwhile, data mining is widely used in bioinformatics applications. With data mining methods, we can extract a variety of information from biological data. With the development of computer science, data mining has developed greatly and achieved much success. In this paper, I mainly introduce some data mining methods (including their models and algorithms) and their application to bioinformatics. The aim of this paper is to solve some biological problems with data mining methods.

Keywords-Data Mining, Bioinformatics, Microarray

Ⅰ INTRODUCTION

Bioinformatics is an emerging cross-disciplinary science area that applies statistical theory and computer science to molecular biology. With the development of the Human Genome Project and high-throughput technology, bioinformatics has become widely used in science. By now, bioinformatics includes the use of databases, computer technology, algorithms, statistics, and so on.

Now, the research areas of bioinformatics include sequence analysis, genome annotation, computational evolutionary biology, the analysis of biological regulation, and so on. In this paper, we mainly focus on the analysis of gene expression and the prediction of operators in genome annotation. The aim of genome annotation is to exploit known sequence information to find new gene knowledge and to annotate gene function, the operators of microorganisms, and the pathways of the genome regulatory molecular network. Computational evolutionary biology shows us the origin of species and the changes that occur during evolution. To adapt to different environments, different organisms differ greatly in the course of their evolution and in their genetics.

II EASE OF USE

There is a lot of information in our gene sequences. How to properly find and use this information is a major problem that needs study. In recent years, great progress has been made in data mining technology, with important achievements in related fields. Sequence analysis, genome annotation, and computational evolutionary biology are the main fields of bioinformatics research. So, we can use data mining methods to solve the prediction problem in genome annotation, the construction of phylogenetic trees in computational evolutionary biology, and feature selection in gene expression microarray data. In biological analysis, for sequence classification, a variety of strategies, depending on the problem type, can be used to map sequences to a representation that can be handled by traditional classifiers.

III RELATED WORK

In traditional data mining methods, such as supervised learning [1] and semi-supervised learning [2], one common assumption is that both the labeled and unlabeled data are sampled from the same distribution or lie on the same manifold. When studying biological information, it would be useful if a previously trained model could be reused. Smyth introduces a mixture of HMMs in [3] and presents an initialization technique that is similar to our model in that an individual HMM is learned for each sequence [4]. It differs from our model, however, in that the emission matrices are not shared between HMMs. These initial models are used to compute the set of all pairwise distances between sequences, defined as the symmetrized log-likelihood of each element of the pair under the other's respective model.
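The symmetrized log-likelihood used in that initialization step can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation from [3] or [4]: it assumes discrete observations encoded as integer symbols, represents each HMM as a hypothetical tuple (start, trans, emit) of NumPy arrays, and averages the two cross log-likelihoods, which is one common reading of "symmetrized". The function names are our own.

```python
import numpy as np


def log_likelihood(obs, start, trans, emit):
    """Return log P(obs | HMM) using the scaled forward algorithm.

    obs   : sequence of integer symbol indices
    start : (K,) initial state probabilities
    trans : (K, K) transition matrix, rows sum to 1
    emit  : (K, V) emission matrix, rows sum to 1
    """
    # Forward variable for the first symbol.
    alpha = start * emit[:, obs[0]]
    scale = alpha.sum()
    log_p = np.log(scale)
    alpha = alpha / scale  # rescale to avoid underflow on long sequences
    for symbol in obs[1:]:
        # Propagate through the transitions, then weight by the emission.
        alpha = (alpha @ trans) * emit[:, symbol]
        scale = alpha.sum()
        log_p += np.log(scale)
        alpha = alpha / scale
    return log_p


def symmetrized_log_likelihood(seq_a, hmm_a, seq_b, hmm_b):
    """Average of each sequence's log-likelihood under the OTHER sequence's HMM.

    Assumption: the two cross terms are averaged; other symmetrizations
    (for example, a plain sum) differ only by a constant factor.
    """
    return 0.5 * (log_likelihood(seq_a, *hmm_b) + log_likelihood(seq_b, *hmm_a))


# Hypothetical two-state HMMs over a two-symbol alphabet, for illustration only.
hmm_a = (np.array([0.6, 0.4]),
         np.array([[0.7, 0.3], [0.4, 0.6]]),
         np.array([[0.9, 0.1], [0.2, 0.8]]))
hmm_b = (np.array([0.5, 0.5]),
         np.array([[0.9, 0.1], [0.1, 0.9]]),
         np.array([[0.5, 0.5], [0.3, 0.7]]))
seq_a = [0, 1, 0, 0]
seq_b = [1, 1, 0, 1]
score = symmetrized_log_likelihood(seq_a, hmm_a, seq_b, hmm_b)
```

Evaluating this score for every pair of sequences gives the pairwise matrix described above. Values closer to zero indicate that each sequence is well explained by the other's model.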
IV MAIN SECTION

V EXPERIMENT RESULT

VI CONCLUSION

With the constant development of computer technology and biological techniques, bioinformatics has become an important area of scientific research. Genome annotation, computational evolutionary biology, and the analysis of gene expression data are its main research fields. In this paper, we proposed a novel domain data mining method to solve this problem. Among data mining methods, the HMM variant is an extension of the standard HMM that assigns an individual transition matrix to each sequence in a dataset but keeps a single emission matrix for the entire dataset. This paper describes three inference algorithms, including a Baum-Welch-like algorithm and a Gibbs sampling algorithm. Because our model fits within a large existing body of work on generative models, we are especially interested in related models that perform classification directly.

VII ACKNOWLEDGMENT

The author would like to thank Mr. Jason Lee and all my classmates for their valuable suggestions.

REFERENCES

[1] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Citeseer, 2011.
[2] X. Zhu. Semi-supervised learning literature survey. Computer Science, University of Wisconsin-Madison, 2006.
[3] P. Smyth. Clustering sequences with hidden Markov models. Advances in Neural Information Processing Systems, 1997.
[4] S. Blasiak and H. Rangwala. Hidden Markov Model Variant for Sequence Classification. 2007.
[5] Y. Tang, Y. Q. Zhang, N. V. Chawla, and S. Krasser. SVMs modeling for highly imbalanced classification. 2009.