International Journal of Information Technology and Engineering Vol. 2, No. 1, January-June 2011, pp. 33-37
© International Science Press,
ISSN: 2229-7367
A Review on the Usefulness of Data Mining Techniques in
Bio-Informatics
S. Nirmala Devi1 & S.P. Rajagopalan2
1 Research Scholar, Bharath University, Chennai-600005, India, E-mail: [email protected].
2 D.M.G.R Educational and Research Institute, Chennai-600095, India, E-mail: [email protected].
ABSTRACT
This article presents some of the general ideas drawn from data mining and bioinformatics. The applications of
data mining in the field of bioinformatics are explained and the major research areas of bioinformatics are highlighted.
The current challenges and open issues of data mining in bioinformatics are also discussed.
Keywords: Data mining, bioinformatics, protein sequence analysis, bioinformatics tools.
1. INTRODUCTION
Over the past few decades, rapid developments in genomic
and other molecular research technologies, combined with developments in information technologies, have
produced a tremendous amount of information related to
molecular biology. Bioinformatics is the name given to the mathematical and computing approaches used to glean
understanding of biological processes.
Bioinformatics now entails the creation and
advancement of databases, algorithms, computational
and statistical techniques and theory to solve formal and
practical problems arising from the management and
analysis of biological data.
The application and development of data mining
techniques to solve biological problems is a particularly
active area of research in bioinformatics.
Evaluating large biological data sets requires
making sense of the data by inferring structure or generalizations from it. Major research efforts in the field
include sequence alignment, gene finding, genome
assembly, drug design, drug discovery, protein structure
alignment, protein structure prediction, prediction of gene
expression and protein-protein interactions, genome-wide
association studies and the modeling of evolution.
2. BIOINFORMATICS
The term bioinformatics was coined by Paulien
Hogeweg and Ben Hesper in 1970 for the study of information content
and information flow in biological systems and processes.
It was primarily used in genomics and genetics,
particularly in those areas of genomics involving large-scale DNA sequencing.
The terms bioinformatics, computational biology and bioinformation infrastructure are often used interchangeably.
Bioinformatics is the application of information
technology and computer science to the field of molecular
biology. It uses computer technology to manage large
amounts of biological data (e.g., genetic information) for
research in areas such as molecular biology, genomics,
and proteomics. In short, it is the science of using computer technology
to gather, store, extract, analyze and merge biological data.
The primary goal of this discipline is to establish the
grounds for discovering new insights in biology and, at
the same time, to establish a perspective from
which the unifying principles in biology can be
discerned. Computationally intensive
techniques (e.g., pattern recognition, data mining, machine
learning algorithms, and visualization) are applied to achieve
this goal. The main research areas in bioinformatics are:
2.1 Sequence Analysis
Computational biology encompasses the use of algorithmic
tools to facilitate biological analyses, and sequence analysis
is its most basic tool. Sequence
analysis encompasses the use of various bioinformatics
methods to determine the biological function and/or
structure of genes and the proteins they code for. It consists
of finding which parts of biological sequences are alike
and which parts differ during medical analysis and
genome mapping processes. In practice, sequence analysis means
subjecting a DNA or peptide sequence to sequence
alignment, sequence database searches, repeated-sequence
searches and other bioinformatics methods on a
computer.
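As an illustration of the alignment step described above, the following is a minimal sketch of global pairwise alignment by dynamic programming (the Needleman-Wunsch scheme). The function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example; production tools use substitution matrices and more refined gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b.

    The scoring parameters are illustrative assumptions, not tuned values.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "ACT"))
```

The table costs time and memory proportional to the product of the two sequence lengths, which is why database-scale searches rely on heuristic tools rather than full dynamic programming.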
2.2 Genome Annotation
Genome annotation is the process of identifying the locations of genes and all
of the coding regions in a genome and determining what
those genes do. Once a genome is sequenced, it needs to
be annotated to make sense of it. Gene finding is the most
important step in understanding the genome of a species
once it has been sequenced.
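A very simplified picture of gene finding is the search for open reading frames (ORFs): stretches that begin with a start codon and run in frame to a stop codon. The sketch below scans only the forward strand, and the function name and the `min_codons` threshold are assumptions made for illustration. Real gene finders rely on statistical models rather than this rule alone.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return a list of (start, end, frame) for forward-strand ORFs.

    start is the 0-based index of the ATG, end is exclusive and includes the
    stop codon. min_codons counts the codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    sequence = "CCATGAAATTTGGGTAACC"  # made-up example sequence
    for start, end, frame in find_orfs(sequence):
        print(frame, start, end, sequence[start:end])
```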
2.3 Prediction of Protein Structure
Proteins play a crucial role in virtually all biological
processes with a broad range of functions. Proteins are
the building blocks of life. In a cell, 70% is water and 15%-20% is protein. A protein is composed of a central
backbone and a collection of 50-2000 amino acids. There are 20
different kinds of amino acids, each consisting of up to 18
atoms, e.g., leucine, alanine, serine and glutamic acid.
Proteins must fold to function. Some diseases are caused
by misfolding, e.g., mad cow disease. The amino acid
sequence of a protein, the so-called primary structure, can
be easily determined from the sequence of the gene that
codes for it.
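The step from gene sequence to primary structure can be sketched as a codon-by-codon lookup in the standard genetic code. The code below assumes a coding-strand DNA sequence that is already in frame and contains only the letters A, C, G and T; the names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Translate an in-frame coding sequence into one-letter amino acids.

    Translation stops at the first stop codon. An unknown base (e.g. N)
    raises KeyError in this simplified sketch.
    """
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[pos:pos + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    # Codons for Met, Ala, Leu, Ser, Glu followed by a stop codon.
    print(translate("ATGGCTCTGTCTGAATAA"))  # MALSE
```

Predicting the folded three-dimensional structure from this sequence is the hard problem; the lookup above only recovers the primary structure.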
Amino acids are molecules containing an amine
group, a carboxylic acid group and a side chain that varies
between different amino acids. These molecules contain
the key elements of carbon, hydrogen, oxygen, and
nitrogen. Amino acids are critical to life, and have many
functions in metabolism. One particularly important
function is to serve as the building blocks of proteins.
Protein structure prediction is important for drug design
and the design of novel enzymes.
Levels of protein structure:
(a) Primary structure – the linear sequence of amino
acids in a protein chain.
(b) Secondary structure – regions of local regularity
within a protein fold.
(c) Super-secondary structure – the arrangement of
α-helices and β-strands into discrete folding units.
(d) Tertiary structure – the overall fold of a protein
sequence, formed by the packing of its secondary
and super-secondary structure elements.
(e) Quaternary structure – the arrangement of
separate protein chains in a protein molecule
with more than one subunit.
(f) Quinternary structure – the arrangement of
separate molecules, such as in protein-protein or
protein-nucleic acid interactions.
2.4 Analysis of Gene Expression
mRNA levels are measured using various techniques such
as microarrays, serial analysis of gene expression (SAGE)
tag sequencing, massively parallel signature sequencing
(MPSS) and expressed sequence tag (EST) sequencing
to determine the expression of many genes. All of these
techniques are extremely noise-prone and/or subject to
bias in the biological measurement. Developing statistical
tools to separate signal from noise is a major research
area in computational biology.
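One common way to separate signal from noise is to compare replicate measurements of each gene under two conditions and rank genes by a test statistic. The sketch below uses Welch's t-statistic as one possible choice. The gene names and expression values are made up for illustration, and each sample is assumed to have at least two replicates with non-zero variance.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


def rank_genes(expr_a, expr_b):
    """Rank genes by |t|, largest first.

    expr_a and expr_b map a gene name to its replicate (log-scale)
    expression values under condition A and condition B.
    """
    scores = {gene: welch_t(expr_a[gene], expr_b[gene]) for gene in expr_a}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    # Hypothetical replicate measurements for two genes.
    condition_a = {"geneA": [8.0, 8.2, 7.9], "geneB": [6.0, 7.5, 5.1]}
    condition_b = {"geneA": [5.0, 5.1, 4.8], "geneB": [6.2, 5.0, 7.3]}
    for gene, t in rank_genes(condition_a, condition_b):
        print(gene, round(t, 2))
```

With thousands of genes tested at once, a statistic like this would normally be followed by a multiple-testing correction, which is omitted here.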
2.5 Analysis of Protein Expression
Protein expression is a subcomponent of gene
expression: both mRNA and protein expression are used to
measure gene expression. Protein expression is one of the
best indicators of actual gene activity, since proteins are usually the
final catalysts of cell activity. Protein expression systems
are very widely used in the life sciences, biotechnology
and medicine. Molecular biology research uses an
enormous number of proteins and enzymes, many of which
come from expression systems. High-throughput (HT) mass
spectrometry (MS) and protein microarrays can provide a
snapshot of the proteins present in a biological sample.
2.6 Analysis of Mutation in Cancer
In cancer, the genomes of affected cells are rearranged in complex
or even unpredictable ways. Massive sequencing efforts are used to identify
previously unknown point mutations in a variety of genes
in cancer. Bioinformaticians produce specialized
automated systems to
manage the sheer volume of sequence data produced, and
they create new algorithms and software for comparing the sequencing results to the growing
collection of human genome sequences and germline
polymorphisms.
New physical detection techniques are also employed, such as oligonucleotide microarrays to identify
chromosomal gains and losses and single nucleotide
polymorphism arrays to detect known point mutations.
These techniques produce
terabytes of data per experiment.
2.7 Comparative Genomics
Comparative genomics is the study of the relationship of
genome structure and function across different biological
species [1]. There are currently more than 300 completely
sequenced microbial genomes publicly available, and
many are of closely related species. Comparing them reveals both
similarities and differences in the proteins, RNA and
regulatory regions of different organisms. Computational
approaches to genome comparison have recently become
a common research topic in computer science.
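A crude, alignment-free way to quantify how similar two sequences are is to compare their sets of k-mers (substrings of length k). The sketch below computes the Jaccard similarity of two k-mer sets. The function names and the default k = 3 are assumptions for illustration, and this measure is only a rough proxy for the whole-genome comparisons used in practice.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(seq_a, seq_b, k=3):
    """Jaccard similarity of the k-mer sets of two sequences (0.0 to 1.0)."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a and not b:
        return 1.0  # two sequences shorter than k are treated as identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Made-up sequences standing in for two related genomes.
    print(jaccard_similarity("ACGTACGTGA", "ACGTACGTCA"))
```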
2.8 Modeling Biological Systems
Modeling biological systems is a significant task of systems biology and mathematical
biology. Systems biology is
the study of the interactions between the components of
biological systems, and how these interactions give rise to
the function and behaviour of that system. Mathematical
biology aims at the mathematical representation, treatment
and modeling of biological processes, using a variety of
applied mathematical techniques and tools. It has both
theoretical and practical applications in biological,
biomedical and biotechnology research.
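As a small example of the mathematical modeling described above, the sketch below simulates logistic population growth, dN/dt = r N (1 - N/K), with the forward Euler method. The choice of model, the function name and all parameter values are assumptions made for illustration only.

```python
def simulate_logistic(n0, r, k, dt=0.01, steps=2000):
    """Integrate dN/dt = r * N * (1 - N / K) with forward Euler.

    n0 is the initial population, r the growth rate, k the carrying
    capacity. Returns the list of population values, one per time step.
    """
    n = float(n0)
    trajectory = [n]
    for _ in range(steps):
        n += dt * r * n * (1 - n / k)  # one Euler step
        trajectory.append(n)
    return trajectory


if __name__ == "__main__":
    # Hypothetical parameters: 10 cells, rate 1.0, carrying capacity 1000.
    values = simulate_logistic(n0=10, r=1.0, k=1000)
    print(values[0], values[len(values) // 2], values[-1])
```

Forward Euler is used here only because it is short; stiff or larger models would normally be handled with a dedicated ODE solver.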
Computational systems biology aims to develop and
use efficient algorithms, data structures, visualization and
communication tools for the integration of large quantities
of biological data with the goal of computer modeling. It
involves the use of computer simulations of biological
systems, such as cellular subsystems, to both analyze and
visualize the complex connections of these cellular
processes.
2.9 Protein-Protein Docking
Proteins control and mediate many of the biological
activities of cells.
X-ray crystallography and protein nuclear magnetic
resonance spectroscopy (protein NMR) have produced tens of
thousands of protein three-dimensional structures.
A cell is not static: it changes in shape, division and
metabolism, and not all cells are equivalent. The binding
of one signaling protein to another can have a number of
consequences. Binding can serve to recruit a signaling
protein, and it can induce conformational changes
that affect activity.
3. BIOLOGICAL DATABASES
Biological databases are computer sites that organize,
store and disseminate files containing
literature references, nucleic acid sequences,
protein sequences and protein structures. They are divided into
three categories.
(1) Primary Database – a collection of nucleic acid
sequences.
(2) Secondary Database – a collection of protein
sequences.
(3) Tertiary or Composite Database – sequences of
composite proteins.
Several databases have been built for the storage and
management of biological data, but techniques for
information retrieval are equally important. Mining facilitates
the identification of patterns in the retrieved information.
Data mining enables fast retrieval of information from
the available data sets by using mathematical and
statistical methods with minimal input. The mined
(previously unknown) information is then used to
make crucial strategic decisions. Data mining techniques
make it possible to discover and analyze data from different
sources using a centralized approach, and they provide
knowledge, generalization and smooth
analytical operations.
4. BIOINFORMATICS TOOLS
Research Area (Application)   Bioinformatics Tool   Reference
Sequence Alignment            BLAST                 http://blast.ncbi.nlm.nih.gov/Blast.cgi
                              CSBLAST               ftp://toolkit.lmb.uni-muenchen.de/csblast/
                              HMMER                 http://hmmer.janelia.org/
                              FASTA                 www.ebi.ac.uk/fasta33
Multiple Sequence Alignment   MSAProbs              http://msaprobs.sourceforge.net/
                              DNA Alignment         http://www.fluxus-engineering.com/align.html
                              MultiAlin             http://multalin.toulouse.inra.fr/multalin/multalin.html
                              DiAlign               http://bibiserv.techfak.uni-bielefeld.de/dialign/
Gene Finding                  GenomeScan            http://genes.mit.edu/genomescan.html
                              GeneMark              http://exon.biology.gatech.edu/
Pattern Identification        Gibbs Sampler         http://bayesweb.wadsworth.org/gibbs/gibbs.html
                              AlignACE              http://atlas.med.harvard.edu/
                              MEME                  http://meme.sdsc.edu/
Protein Domain Analysis       Pfam                  http://pfam.sanger.ac.uk/
                              BLOCKS                http://blocks.fhcrc.org/
                              ProDom                http://prodom.prabi.fr/prodom/current/html/home.jsp
Genomic Analysis              SLAM                  http://bio.math.berkeley.edu/slam/
                              Multiz                http://www.bx.psu.edu/miller_lab/
Motif Finding                 MEME/MAST             http://meme.sdsc.edu
                              eMOTIF                http://motif.stanford.edu
5. DATA MINING
Data mining is the extraction of hidden predictive information
from large databases. It is the science of finding patterns and
relationships in huge amounts of data. Generally, data mining
(sometimes called data or knowledge discovery) is the
process of analyzing data from different perspectives and
summarizing it into useful information: information that
can be used to increase revenue, cut costs, or both. Data
mining software is one of a number of analytical tools for
analyzing data; technically, data mining is the process of finding correlations
or patterns among dozens of fields in large relational
databases.
Data mining is also sometimes called Knowledge
Discovery in Databases (KDD). Knowledge discovery in
databases is a well-defined process consisting of several
distinct steps.
Data mining consists of five major elements:
• Extract, transform, and load transaction data onto
the data warehouse system.
• Store and manage the data in a multidimensional
database system.
• Provide data access to business analysts and
information technology professionals.
• Analyze the data by application software.
• Present the data in a useful format, such as a
graph or table.
Mining biological data helps to extract useful knowledge from massive datasets gathered in biology and in
other related life science areas such as medicine and
neuroscience.
Applications of data mining include ad revenue
forecasting, churn (turnover) management, claims
processing, credit risk analysis, cross-marketing,
customer profiling, customer retention, electronic
commerce, exception reports, food-service menu analysis,
fraud detection, government policy setting, hiring
profiles, market basket analysis, medical management,
member enrollment, new product development,
pharmaceutical research, process control, quality control,
shelf management/store management, student recruiting
and retention, targeted marketing, warranty analysis, etc.
5.1 Data Mining Tasks
The cycle of data and knowledge mining comprises
various analysis steps, each step focusing on a different
aspect or task. Hand et al. [13] propose the following categorization
of data mining tasks. The two “high-level” primary goals
of data mining, in practice, are prediction and description.
(a) Prediction involves using some variables or fields
in the database to predict unknown or future
values of other variables of interest.
(b) Description focuses on finding human-interpretable patterns describing the data.
The relative importance of prediction and description
for particular data mining applications can vary
considerably. However, in the context of KDD, description
tends to be more important than prediction. This is in
contrast to pattern recognition and machine learning
applications (such as speech recognition) where
prediction is often the primary goal of the KDD process.
The goals of prediction and description are achieved
by using the following primary data mining tasks:
(i) Classification is learning a function that maps
(classifies) a data item into one of several
predefined classes.
(ii) Regression is learning a function which maps a
data item to a real-valued prediction variable.
(iii) Clustering is a common descriptive task where
one seeks to identify a finite set of categories or
clusters to describe the data. Closely related to
clustering is the task of probability
density estimation, which consists of techniques for
estimating, from data, the joint multivariate probability
density function of all of the variables/fields in the
database.
(iv) Summarization involves methods for finding a
compact description for a subset of data.
(v) Dependency Modeling consists of finding a model
which describes significant dependencies between
variables. Dependency models exist at two levels:
(1) The structural level of the model specifies (often
graphically) which variables are locally
dependent on each other, and
(2) The quantitative level of the model specifies the
strengths of the dependencies using some
numerical scale.
(vi) Change and Deviation Detection focuses on
discovering the most significant changes in the
data from previously measured or normative
values.
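To make the clustering task concrete, the sketch below implements the basic k-means procedure: assign each point to its nearest centroid, recompute the centroids, and repeat until the assignment stops changing. k-means is only one of many clustering methods and is chosen here as an assumption; the function name, the parameters and the sample points are illustrative.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points (tuples of numbers) into k groups.

    Returns (assignment, centroids), where assignment[i] is the cluster
    index of points[i]. Initial centroids are k randomly sampled points.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids


if __name__ == "__main__":
    # Two made-up, well-separated groups of two-dimensional points.
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, centers = kmeans(data, k=2)
    print(labels)
    print(centers)
```

The result of k-means depends on the initial centroids, so in practice it is usually run several times with different seeds.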
6. APPLICATION OF DATA MINING IN
BIOINFORMATICS
Applications of data mining in bioinformatics include gene finding, function motif detection, protein function
domain detection, protein function inference, disease
prognosis, disease treatment optimization, motif detection,
protein and gene interaction network reconstruction, data
cleansing and protein sub-cellular location prediction.
Patient outcome prediction using microarray
technologies is an important application in bioinformatics.
Based on patients’ genotypic microarray data, predictions
are made to estimate patients’ survival time and their risk
of tumor metastasis or recurrence. Accurate prediction
can therefore potentially help to provide better treatment for
patients. Machine learning can also be used for peptide
identification through mass spectrometry.
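The idea of outcome prediction from expression data can be sketched with a nearest-centroid classifier: each class is summarized by its mean expression profile, and a new sample receives the label of the closest profile. This is only one simple classifier, chosen here as an assumption, and the class labels and expression values below are invented for illustration and have no clinical meaning.

```python
def train_centroids(samples, labels):
    """Return a dict mapping each class label to its mean expression profile."""
    sums, counts = {}, {}
    for profile, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(profile))
        for i, value in enumerate(profile):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, profile):
    """Return the label whose centroid is closest to the given profile."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(profile, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


if __name__ == "__main__":
    # Hypothetical two-gene expression profiles for four patients.
    training = [[2.0, 8.0], [2.5, 7.5], [8.0, 2.0], [7.5, 2.5]]
    outcomes = ["low_risk", "low_risk", "high_risk", "high_risk"]
    model = train_centroids(training, outcomes)
    print(predict(model, [3.0, 7.0]))
```

Real microarray data contain thousands of genes per sample, so feature selection and cross-validation would be needed before a classifier of this kind could be trusted.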
7. CONCLUSION AND CHALLENGES
Both data mining and bioinformatics are fast-expanding
and closely related research frontiers. Data mining
approaches are ideally suited to bioinformatics, since
bioinformatics is data-rich but lacks a comprehensive
theory of life’s organization at the molecular level.
In research where the model is not known, failing to apply data mining methods
may mean missing essential discoveries.
The data in genome and protein databases is growing
constantly. The large size of biological data sets, the inherent
complexity of biological problems and the need to deal
with error-prone data all result in special requirements
such as large memory space and huge computation time.
It is therefore important to examine the key research
issues in bioinformatics and develop new data mining
methods for scalable and effective biological analysis.
REFERENCES
[1] Stefano Lonardi (2010), “IEEE/ACM Transactions on Computational Biology and Bioinformatics”, 7, pp. 195-196.
[2] Jake Y. Chen, Mohammed J. Zaki and Stefano Lonardi (2008), “BIOKDD08: A Workshop Report on Data Mining in Bioinformatics”, SIGKDD Explorations, 10(2), pp. 54-56.
[3] Yang, Qiang, “Data Mining and Bioinformatics: Some Challenges”, http://www.cse.ust.hk/~qyang.
[4] Tuan D. Pham (2008), “Computational Prediction Models for Cancer Classification using Mass Spectrometry Data”, International Journal of Data Mining and Bioinformatics.
[5] Lee, Kyoungrim (2008), “Computational Study for Protein-Protein Docking using Global Optimization and Empirical Potentials”, Int. J. Mol. Sci., 9, pp. 65-77.
[6] Chris Ding and Hanchuan Peng (2005), “Minimum Redundancy Feature Selection from Microarray Gene Expression Data”, Journal of Bioinformatics and Computational Biology.
[7] Lei Liu; Jion Yang, A. and Tung K. H., “Data Mining Techniques for Microarray Datasets”, Proceedings of the 21st International Conference on Data Engineering (ICDE 2005).
[8] John F. Elder IV and Dean W. Abbott (1998), “A Comparison of Leading Data Mining Tools”, Fourth International Conference on Knowledge Discovery & Data Mining.
[9] Berson, Alex, Smith, Stephen and Thearling, Kurt, “Building Data Mining Application for CRM”, Tata McGraw Hill.
[10] Richard, R.J. A. and Sriraam, N. (2005), “A Feasibility Study of Challenges and Opportunities in Computational Biology: A Malaysian Perspective”, American Journal of Applied Sciences, 2(9), pp. 1296-1300.
[11] Wodak, S.J. and Janin, J. (1978), “Computer Analysis of Protein-Protein Interactions”, Journal of Molecular Biology, 124(2), pp. 323-42.
[12] Lee, Kyoungrim (2008), “Computational Study for Protein-Protein Docking using Global Optimization and Empirical Potentials”, Int. J. Mol. Sci., 9, pp. 65-77.
[13] Hand, D.J., Mannila, H. and Smyth, P. (2001), “Principles of Data Mining”, MIT Press.
[14] Hirschman, L., Park, J. C., Tsujii, J., Wong, L. and Wu, C. H. (2002), “Accomplishments and Challenges in Literature Data Mining for Biology”, Bioinformatics Review, 18(12), pp. 1553-1561.
[15] Margaret H. Dunham (2006), “Data Mining Introductory and Advanced Topics”, Pearson Education.
[16] Han and Kamber (2006), “Data Mining Concepts and Techniques”, Morgan Kaufmann Publishers.
[17] Hand, D. J., Mannila, H. and Smyth, P., “Principles of Data Mining”, MIT Press.
[18] T. K. Attwood (2008), “Introduction to Bioinformatics”, Pearson Education.
[19] Mount, D. W. (2002), “Bioinformatics: Sequence and Genome Analysis”, Cold Spring Harbor Press.
[20] Pevzner, P. A. (200), “Computational Molecular Biology: An Algorithmic Approach”, The MIT Press.