Research about Data Mining Application on Bioinformatics
Hsinto Cheung
College of Software
Nankai University
Tianjin, China
[email protected]
Abstract: Bioinformatics is a new cross-disciplinary science area. It
aims to analyze and use bio-molecular data to solve biological
problems. In this research, scientists use many tools from statistical
theory and computer science, and data mining is widely used in
bioinformatics applications. With data mining methods, we can extract a
variety of information from biological data. With the development of
computer science, data mining has developed greatly and achieved much
success. In this paper, I mainly introduce some data mining methods
(including their models and algorithms) and their application to
bioinformatics. The aim of this paper is to solve some biological
problems with data mining methods.

Keywords: Data Mining, Bioinformatics, Microarray

I INTRODUCTION
Bioinformatics is an emerging cross-disciplinary science area that
applies statistical theory and computer science to molecular biology.
With the development of the Human Genome Project and high-throughput
technology, bioinformatics has become widely used in science. By now,
bioinformatics includes the use of databases, computer technology,
algorithms, statistics and so on.

The research areas of bioinformatics now include sequence analysis,
genome annotation, computational evolutionary biology, the analysis of
biological regulation and so on. In this paper, we mainly focus on the
analysis of gene expression and the prediction of operators in genome
annotation. The aim of genome annotation is to exploit known sequence
information to find new gene knowledge and to annotate gene function,
the operators of microorganisms, and the pathways of genome regulatory
networks. Computational evolutionary biology shows us the origin of
species and the changes that occur during evolution. To adapt to
different environments, different organisms have diverged greatly in
the course of evolution and in their genetics.

II EASE OF USE
There is a lot of information in our gene sequences. How to properly
find and use this information is a major problem that needs study. In
recent years, great progress has been made in data mining technology,
with important achievements in related fields. Sequence analysis,
genome annotation and computational evolutionary biology are the main
fields of bioinformatics research. So, we can use data mining methods
to solve prediction problems in genome annotation, the construction of
phylogenetic trees in computational evolutionary biology, and feature
selection in gene expression microarray data. In biological analysis,
for sequence classification, a variety of strategies, depending on the
problem type, can be used to map sequences to a representation that can
be handled by traditional classifiers.

III RELATED WORK
In traditional data mining methods, such as supervised learning [1] and
semi-supervised learning [2], one common assumption is that both the
labeled and unlabeled data are sampled from the same distribution or
lie on the same manifold. When studying biological information, it
would be useful if a previously trained model could be reused. Smyth
introduces a mixture of hidden Markov models (HMMs) in [3] and presents
an initialization technique that is similar to our model in that an
individual HMM is learned for each sequence [4]. It differs from our
model in that the emission matrices are not shared between HMMs. These
initial models are used to compute the set of all pairwise distances
between sequences, defined as the symmetrized log-likelihood of each
element of the pair under the other's respective model.
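
The symmetrized log-likelihood described above can be illustrated with a minimal Python sketch. It is not taken from [3] or [4]: the function names, the toy parameter values, and the choice to average the two cross log-likelihoods are assumptions made for illustration. Each model is a hypothetical tuple of start probabilities, a transition matrix and an emission matrix, and each sequence is scored under the other sequence's model with the scaled forward algorithm.

```python
import math


def log_likelihood(seq, start, trans, emit):
    """Return log p(seq | HMM) using the scaled forward algorithm.

    start[k]    : probability of starting in state k
    trans[k][l] : probability of moving from state k to state l
    emit[k][s]  : probability that state k emits symbol s
    Assumes seq is non-empty and has non-zero probability under the model.
    """
    n_states = len(start)
    # Forward variable for the first symbol.
    alpha = [start[k] * emit[k][seq[0]] for k in range(n_states)]
    scale = sum(alpha)
    log_p = math.log(scale)
    alpha = [a / scale for a in alpha]
    # Propagate through the remaining symbols, rescaling at each step
    # so that long sequences do not underflow.
    for symbol in seq[1:]:
        alpha = [
            sum(alpha[k] * trans[k][l] for k in range(n_states)) * emit[l][symbol]
            for l in range(n_states)
        ]
        scale = sum(alpha)
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_p


def symmetrized_log_likelihood(seq_i, model_i, seq_j, model_j):
    """Average of each sequence's log-likelihood under the other's model.

    Averaging (rather than summing) is an illustrative assumption.
    """
    return 0.5 * (log_likelihood(seq_i, *model_j) + log_likelihood(seq_j, *model_i))


# Toy two-state models over a binary alphabet (hypothetical values).
model_a = ([0.6, 0.4], [[0.9, 0.1], [0.2, 0.8]], [[0.8, 0.2], [0.3, 0.7]])
model_b = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[0.1, 0.9], [0.4, 0.6]])
seq_a = [0, 0, 1, 0]
seq_b = [1, 1, 0, 1]

similarity = symmetrized_log_likelihood(seq_a, model_a, seq_b, model_b)
```

Larger (less negative) values indicate that each sequence is well explained by the other's model, so the value behaves as a similarity; negating it is one way to obtain a distance-like quantity for clustering.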
IV MAIN SECTION
V EXPERIMENT RESULT
VI CONCLUSION
With the constant development of computer technology and biological
techniques, bioinformatics has become an important area of scientific
research. Genome annotation, computational evolutionary biology and the
analysis of gene expression data are its main research fields. In this
paper, we proposed a novel domain data mining method to address
problems in these fields. The HMM model used here is an extension of
the standard HMM that assigns an individual transition matrix to each
sequence in a dataset but keeps a single emission matrix for the entire
dataset. Inference algorithms for this model include a Baum-Welch-like
algorithm and a Gibbs sampling algorithm. Because our model fits within
a large existing body of work on generative models, we are especially
interested in related models that perform classification directly.
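
One way to write down the model summarized above is the following sketch. The notation is an assumption and does not come from the paper: $x^{(n)}$ is the $n$-th observed sequence of length $T_n$, $z^{(n)}$ its hidden state path, $A^{(n)}$ the transition matrix belonging to that sequence, and $B$ the single emission matrix shared by the whole dataset. Sharing the initial state distribution $\pi$ across sequences is also assumed here for simplicity.

```latex
p\bigl(x^{(n)}, z^{(n)} \mid \pi, A^{(n)}, B\bigr)
  = \pi_{z^{(n)}_1} \, B_{z^{(n)}_1,\, x^{(n)}_1}
    \prod_{t=2}^{T_n} A^{(n)}_{z^{(n)}_{t-1},\, z^{(n)}_t} \, B_{z^{(n)}_t,\, x^{(n)}_t}
```

Only $A^{(n)}$ carries the sequence index $n$. In this reading, the per-sequence transition matrices act as a representation of each sequence, while the shared $B$ ties the hidden states to the same symbols across the dataset.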
VII ACKNOWLEDGMENT
The author would like to thank Mr. Jason Lee and all my classmates
for their valuable suggestions.
REFERENCES
[1] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern classification.
Citeseer, 2011.
[2] X. Zhu. Semi-supervised learning literature survey. Computer
Science, University of Wisconsin-Madison, 2006.
[3] P. Smyth. Clustering sequences with hidden Markov models. Advances
in Neural Information Processing Systems, 1997.
[4] S. Blasiak and H. Rangwala. Hidden Markov model variant for
sequence classification. 2007.
[5] Y. Tang, Y. Q. Zhang, N. V. Chawla, and S. Krasser. SVMs modeling
for highly imbalanced classification. 2009.