International Journal of Information Technology and Engineering
Vol. 2, No. 1, January-June 2011, pp. 33-37
© International Science Press, ISSN: 2229-7367

A Review on the Usefulness of Data Mining Techniques in Bio-Informatics

S. Nirmala Devi1 & S.P. Rajagopalan2
1 Research Scholar, Bharath University, Chennai-600005, India, E-mail: [email protected].
2 D.M.G.R Educational and Research Institute, Chennai-600095, India, E-mail: [email protected].

ABSTRACT

This article presents some of the general ideas drawn from data mining and bioinformatics. The applications of data mining in the field of bioinformatics are explained, and the major research areas of bioinformatics are highlighted. The current challenges and possibilities of data mining in bioinformatics are also highlighted.

Keywords: Data mining, bioinformatics, protein sequence analysis, bioinformatics tools.

1. INTRODUCTION

Over the past few decades, rapid developments in genomic and other molecular research technologies, together with developments in information technologies, have combined to produce a tremendous amount of information related to molecular biology. Bioinformatics is the name given to the mathematical and computing approaches used to glean understanding of biological processes. Bioinformatics now entails the creation and advancement of databases, algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data. The application and development of data mining techniques to solve biological problems is a particularly active area of research in bioinformatics. Evaluating large biological data sets requires making sense of the data by inferring structure or generalizations from the data.
Major research efforts in the field include sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, genome-wide association studies and the modeling of evolution.

2. BIOINFORMATICS

The term bioinformatics was coined by Paulien Hogeweg in 1979 for the study of information content and information flow in biological systems and processes. It was primarily used in genomics and genetics, particularly in those areas of genomics involving large-scale DNA sequencing. The terms bioinformatics, computational biology and bioinformation infrastructure are often used interchangeably.

Bioinformatics is the application of information technology and computer science to the field of molecular biology. It uses computer technology to manage large amounts of biological data (e.g., genetic information) for research in areas such as molecular biology, genomics and proteomics; in short, it is the science of using computer technology to gather, store, extract, analyze and merge biological data. The primary goal of the discipline is to enable the discovery of new biological insights and to create a global perspective from which unifying principles in biology can be discerned, applying computationally intensive techniques (e.g., pattern recognition, data mining, machine learning algorithms and visualization) to achieve this goal.

The research areas in bioinformatics are:

2.1 Sequence Analysis

Computational biology encompasses the use of algorithmic tools to facilitate biological analyses, and sequence analysis is its basic tool. Sequence analysis encompasses the use of various bioinformatics methods to determine the biological function and/or structure of genes and the proteins they code for.
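As a minimal illustration of the sequence alignment mentioned above, the following Python sketch computes a global alignment score with the Needleman-Wunsch dynamic programming recurrence. The function name `global_alignment_score` and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration; they are not taken from any tool discussed in this review.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Return the optimal global alignment score of sequences a and b.

    Illustrative Needleman-Wunsch scoring with a linear gap penalty;
    the default scores are assumptions, not values from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATCACA"))
```

A higher score indicates that the two sequences are more alike under the chosen scoring scheme; production tools use substitution matrices and more elaborate gap models than this sketch.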
Sequence analysis consists of finding which parts of biological sequences are alike and which parts differ during medical analysis and genome mapping. It involves subjecting a DNA or peptide sequence to sequence alignment, sequence database searches, repeated sequence searches and other bioinformatics methods on a computer.

2.2 Genome Annotation

Genome annotation is the process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do. Once a genome is sequenced, it needs to be annotated to make sense of it. Gene finding is the most important step in understanding the genome of a species once it has been sequenced.

2.3 Prediction of Protein Structure

Proteins play a crucial role in virtually all biological processes, with a broad range of functions. Proteins are the building blocks of life. In a cell, 70% is water and 15%-20% is protein. A protein is composed of a central backbone and a collection of 50-2000 amino acids. There are 20 different kinds of amino acids, each consisting of up to 18 atoms, e.g., leucine, alanine, serine and glutamic acid. Proteins must fold to function, and some diseases, such as mad cow disease, are caused by misfolding. The amino acid sequence of a protein, the so-called primary structure, can be easily determined from the sequence of the gene that codes for it.

Amino acids are molecules containing an amine group, a carboxylic acid group and a side chain that varies between different amino acids. These molecules contain the key elements carbon, hydrogen, oxygen and nitrogen. Amino acids are critical to life and have many functions in metabolism. One particularly important function is to serve as the building blocks of proteins. Protein structure prediction is important for drug design and the design of novel enzymes.

Levels of protein structure:

(a) Primary structure– the linear sequence of amino acids in a protein chain.
(b) Secondary structure– regions of local regularity within a protein fold.
(c) Super-secondary structure– the arrangement of α-helices and β-strands into discrete folding units.
(d) Tertiary structure– the overall fold of a protein sequence, formed by the packing of its secondary and super-secondary structure elements.
(e) Quaternary structure– the arrangement of separate protein chains in a protein molecule with more than one subunit.
(f) Quinternary structure– the arrangement of separate molecules, such as in protein-protein or protein-nucleic acid interactions.

2.4 Analysis of Gene Expression

The expression of many genes can be determined by measuring mRNA levels with various techniques such as microarrays, serial analysis of gene expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS) and expressed cDNA sequence tag (EST) sequencing. All of these techniques are extremely noise-prone and/or subject to bias in the biological measurement. Developing statistical tools to separate signal from noise is a major research area in computational biology.

2.5 Analysis of Protein Expression

Protein expression is a subcomponent of gene expression: both mRNA and protein levels are used to measure gene expression. Protein expression is often used as the indicator of gene activity, since proteins are usually the final catalysts of cell activity. Protein expression systems are very widely used in the life sciences, biotechnology and medicine. Molecular biology research uses an enormous number of proteins and enzymes, many of which come from expression systems. High-throughput (HT) mass spectrometry (MS) and protein microarrays can provide a snapshot of the proteins present in a biological sample.

2.6 Analysis of Mutation in Cancer

In cancer, the genomes of affected cells are rearranged in complex or even unpredictable ways. Massive sequencing efforts are used to identify previously unknown point mutations in a variety of genes in cancer.
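Under strong simplifying assumptions, point-mutation identification can be illustrated by comparing a sequenced sample with a reference sequence position by position. The Python sketch below assumes the two sequences are already aligned and of equal length (no insertions or deletions); the function name `find_point_mutations` and the optional `known_polymorphisms` argument are hypothetical and serve only to show how already-catalogued positions might be excluded.

```python
from typing import Iterable, List, Tuple


def find_point_mutations(reference: str, sample: str,
                         known_polymorphisms: Iterable[int] = ()
                         ) -> List[Tuple[int, str, str]]:
    """List (position, reference_base, sample_base) for each mismatch.

    Positions are 1-based. Positions listed in known_polymorphisms
    (a hypothetical catalogue of already-known variants) are skipped.
    Assumes pre-aligned sequences of equal length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned and of equal length")
    known = set(known_polymorphisms)
    mutations = []
    pairs = zip(reference.upper(), sample.upper())
    for position, (ref_base, sample_base) in enumerate(pairs, start=1):
        if ref_base != sample_base and position not in known:
            mutations.append((position, ref_base, sample_base))
    return mutations


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    print(find_point_mutations("ACGTACGT", "ACGAACGA", known_polymorphisms=[8]))
```

Real pipelines must additionally handle sequencing errors, insertions and deletions, and read alignment, all of which this sketch ignores.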
Bioinformaticians build specialized automated systems to manage the sheer volume of sequence data produced, and they create new algorithms and software to compare the sequencing results with the growing collection of human genome sequences and germline polymorphisms. New physical detection techniques are also employed, such as oligonucleotide microarrays to identify chromosomal gains and losses and single-nucleotide polymorphism arrays to detect known point mutations; these techniques produce terabytes of data per experiment.

2.7 Comparative Genomics

Comparative genomics is the study of the relationship of genome structure and function across different biological species [1]. There are currently more than 300 completely sequenced microbial genomes publicly available, and many are of closely related species. Comparing them reveals both similarities and differences in the proteins, RNA and regulatory regions of different organisms. Computational approaches to genome comparison have recently become a common research topic in computer science.

2.8 Modeling Biological Systems

Modeling biological systems is a significant task of systems biology and mathematical biology. Systems biology is the study of the interactions between the components of biological systems, and of how these interactions give rise to the function and behaviour of the system. Mathematical biology aims at the mathematical representation, treatment and modeling of biological processes, using a variety of applied mathematical techniques and tools. It has both theoretical and practical applications in biological, biomedical and biotechnology research.

Computational systems biology aims to develop and use efficient algorithms, data structures, visualization and communication tools for the integration of large quantities of biological data with the goal of computer modeling. It involves the use of computer simulations of biological systems, such as cellular subsystems, to both analyze and visualize the complex connections of these cellular processes.

2.9 Protein-Protein Docking

Proteins control and mediate many of the biological activities of cells. X-ray crystallography and protein nuclear magnetic resonance spectroscopy (protein NMR) have produced tens of thousands of three-dimensional protein structures. A cell is not static: it changes in shape, division and metabolism, and not all cells are equivalent. The binding of one signaling protein to another can have a number of consequences. Binding can serve to recruit a signaling protein, and it can induce conformational changes that affect activity.

3. BIOLOGICAL DATABASES

Biological databases are computer sites that organize, store and disseminate files containing literature references, nucleic acid sequences, protein sequences and protein structures. They are divided into three categories:

(1) Primary database– a collection of nucleic acid sequences.
(2) Secondary database– a collection of protein sequences.
(3) Tertiary or composite database– sequences of composite proteins.

Several databases have been built to store and manage biological data, but the techniques of information retrieval are equally important. Mining facilitates the identification of patterns in the retrieved information. Data mining enables fast retrieval of information from the available data sets by using mathematical and statistical methods with minimal input. The mined, previously unknown information is used to make crucial strategic decisions. Data mining techniques make it possible to discover and analyze data from different sources using a centralized approach, and they provide knowledge intelligence, generalization and smooth analytical operations.

4. BIOINFORMATICS TOOLS

Bioinformatics Research Area / Tool (Application) / References

Sequence Alignment
    BLAST            http://blast.ncbi.nlm.nih.gov/Blast.cgi
    CSBLAST          ftp://toolkit.lmb.uni-muenchen.de/csblast/
    HMMER            http://hmmer.janelia.org/
    FASTA            www.ebi.ac.uk/fasta33

Multiple Sequence Alignment
    MSAProbs         http://msaprobs.sourceforge.net/
    DNA Alignment    http://www.fluxus-engineering.com/align.html
    MultiAlin        http://multalin.toulouse.inra.fr/multalin/multalin.html
    DiAlign          http://bibiserv.techfak.uni-bielefeld.de/dialign/

Gene Finding
    Genome Scan      http://genes.mit.edu/genomescan.html
    GeneMark         http://exon.biology.gatech.edu/

Pattern Identification
    Gibbs Sampler    http://bayesweb.wadsworth.org/gibbs/gibbs.html
    AlignACE         http://atlas.med.harvard.edu/
    MEME             http://meme.sdsc.edu/

Protein Domain Analysis
    Pfam             http://pfam.sanger.ac.uk/
    BLOCKS           http://blocks.fhcrc.org/
    ProDom           http://prodom.prabi.fr/prodom/current/html/home.jsp

Genomic Analysis
    SLAM             http://bio.math.berkeley.edu/slam/
    Multiz           http://www.bx.psu.edu/miller_lab/

Motif Finding
    MEME/MAST        http://meme.sdsc.edu
    eMOTIF           http://motif.stanford.edu

5. DATA MINING

Data mining is the extraction of hidden predictive information from large databases. It is the science of finding patterns and relationships in huge amounts of data. Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information, that is, information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data; data mining itself is the process of finding correlations or patterns among dozens of fields in large relational databases. Data mining is also sometimes called Knowledge Discovery in Databases (KDD). Knowledge discovery in databases is a well-defined process consisting of several distinct steps.
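The idea of finding correlations among the fields of a table can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the table is a small in-memory dictionary of numeric columns rather than a relational database, Pearson correlation stands in for whatever measure a real data mining tool would use, and the names `pearson`, `correlated_fields` and the 0.8 threshold are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long numeric fields."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("fields must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant field")
    return covariance / sqrt(var_x * var_y)


def correlated_fields(table: Dict[str, Sequence[float]],
                      threshold: float = 0.8
                      ) -> List[Tuple[str, str, float]]:
    """Return field pairs whose absolute correlation reaches the threshold."""
    pairs = []
    for first, second in combinations(sorted(table), 2):
        r = pearson(table[first], table[second])
        if abs(r) >= threshold:
            pairs.append((first, second, r))
    return pairs


if __name__ == "__main__":
    # Toy values, invented for illustration only.
    toy_table = {
        "gene_a": [1.0, 2.0, 3.0, 4.0],
        "gene_b": [2.0, 4.0, 6.0, 8.0],
        "gene_c": [1.0, -1.0, 1.0, -1.0],
    }
    print(correlated_fields(toy_table))
```

Pairwise correlation scales quadratically with the number of fields, which is one reason large biological data sets call for more scalable mining methods.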
Data mining consists of five major elements:

• Extract, transform, and load transaction data onto the data warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information technology professionals.
• Analyze the data by application software.
• Present the data in a useful format, such as a graph or table.

Mining biological data helps to extract useful knowledge from massive datasets gathered in biology and in other related life science areas such as medicine and neuroscience.

Applications of data mining include ad revenue forecasting, churn (turnover) management, claims processing, credit risk analysis, cross-marketing, customer profiling, customer retention, electronic commerce, exception reports, food-service menu analysis, fraud detection, government policy setting, hiring profiles, market basket analysis, medical management, member enrollment, new product development, pharmaceutical research, process control, quality control, shelf management/store management, student recruiting and retention, targeted marketing and warranty analysis.

5.1 Data Mining Tasks

The cycle of data and knowledge mining comprises various analysis steps, each step focusing on a different aspect or task. Hand, Mannila and Smyth [13] propose the following categorization of data mining tasks. The two “high-level” primary goals of data mining, in practice, are prediction and description.

(a) Prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest.
(b) Description focuses on finding human-interpretable patterns describing the data.

The relative importance of prediction and description for particular data mining applications can vary considerably. However, in the context of KDD, description tends to be more important than prediction. This is in contrast to pattern recognition and machine learning applications (such as speech recognition), where prediction is often the primary goal. The goals of prediction and description are achieved by using the following primary data mining tasks:

(i) Classification is learning a function that maps (classifies) a data item into one of several predefined classes.
(ii) Regression is learning a function which maps a data item to a real-valued prediction variable.
(iii) Clustering is a common descriptive task where one seeks to identify a finite set of categories or clusters to describe the data. Closely related to clustering is the task of probability density estimation, which consists of techniques for estimating, from data, the joint multivariate probability density function of all of the variables/fields in the database.
(iv) Summarization involves methods for finding a compact description for a subset of data.
(v) Dependency Modeling consists of finding a model which describes significant dependencies between variables. Dependency models exist at two levels: (1) the structural level of the model specifies (often graphically) which variables are locally dependent on each other, and (2) the quantitative level of the model specifies the strengths of the dependencies using some numerical scale.
(vi) Change and Deviation Detection focuses on discovering the most significant changes in the data from previously measured or normative values.

6. APPLICATION OF DATA MINING IN BIOINFORMATICS

Applications of data mining in bioinformatics include gene finding, function motif detection, protein function domain detection, protein function inference, disease prognosis, disease treatment optimization, motif detection, protein and gene interaction network reconstruction, data cleansing and protein sub-cellular location prediction.

Patient outcome prediction using microarray technologies is an important application in bioinformatics. Based on patients’ genotypic microarray data, predictions are made to estimate patients’ survival time and their risk of tumor metastasis or recurrence. Accurate prediction can therefore help to provide better treatment for patients. Machine learning can also be used for peptide identification through mass spectrometry.

7. CONCLUSION AND CHALLENGES

Both data mining and bioinformatics are fast-expanding and closely related research frontiers. Data mining approaches are ideally suited to bioinformatics, since bioinformatics is data-rich but lacks a comprehensive theory of life’s organization at the molecular level. Not applying data mining methods in research where the model is not known risks missing essential discoveries. The data in genome and protein databases are growing constantly. The large size of biological data sets, the inherent complexity of biological problems and the need to deal with error-prone data all result in special requirements such as large memory space and long computation time. It is important to examine the key research issues in bioinformatics and to develop new data mining methods for scalable and effective biological analysis.

REFERENCES

[1] Stefano Lonardi (2010), “IEEE/ACM Transactions on Computational Biology and Bioinformatics”, 7, pp. 195-196.
[2] Jake Y.
Chen, Mohammed J. Zaki and Stefano Lonardi, “BIOKDD08: A Workshop Report on Data Mining in Bioinformatics”, SIGKDD Explorations, 10(2): pp. 54-56, Dec 2008.
[3] Yang, Qiang, “Data Mining and Bioinformatics: Some Challenges”, http://www.cse.ust.hk/~qyang.
[4] Tuan D. Pham (2008), “Computational Prediction Models for Cancer Classification using Mass Spectrometry Data”, International Journal of Data Mining and Bioinformatics.
[5] Lee, Kyoungrim (2008), “Computational Study for Protein-Protein Docking using Global Optimization and Empirical Potentials”, Int. J. Mol. Sci., 9, pp. 65-77.
[6] Chris Ding and Hanchuan Peng (2005), “Minimum Redundancy Feature Selection from Microarray Gene Expression Data”, Journal of Bioinformatics and Computational Biology.
[7] Lei Liu; Jion Yang, A. and Tung K. H., “Data Mining Techniques for Microarray Datasets”, Proceedings of the 21st International Conference on Data Engineering (ICDE 2005).
[8] John F. Elder IV & Dean W. Abbott, “A Comparison of Leading Data Mining Tools”, Fourth International Conference on Knowledge Discovery & Data Mining (1998).
[9] Berson, Alex, Smith, Stephen and Thearling, Kurt, “Building Data Mining Application for CRM”, Tata McGraw Hill.
[10] Richard, R.J. A. and Sriraam, N. (2005), “A Feasibility Study of Challenges and Opportunities in Computational Biology: A Malaysian Perspective”, American Journal of Applied Sciences, 2(9): pp. 1296-1300.
[11] SJ, Wodak and Janin, J. (1978), “Computer Analysis of Protein-Protein Interactions”, Journal of Molecular Biology, 124(2): pp. 323-42.
[12] Lee, Kyoungrim (2008), “Computational Study for Protein-Protein Docking using Global Optimization and Empirical Potentials”, Int. J. Mol. Sci., 9, pp. 65-77.
[13] Hand, D.J., Mannila, H. and Smyth, P. (2001), “Principles of Data Mining”, MIT Press.
[14] Hirschman, Lynette; C. Park, Jong; T., Junichi; Wong, L. and H. Wu, Cathy (2002), “Accomplishments and Challenges in Literature Data Mining for Biology”, BIOINFORMATICS REVIEW, 18, no. 12, pp. 1553-1561.
[15] Margaret H. Dunham (2006), “Data Mining Introductory and Advanced Topics”, Pearson Education.
[16] Han and Kamber (2006), “Data Mining Concepts and Techniques”, Morgan Kaufmann Publishers.
[17] Hand, D. J.; Mannila, H. and Smyth, P., “Principles of Data Mining”, MIT Press.
[18] T K Attwood (2008), “Introduction to Bioinformatics”, Pearson Education.
[19] Mount, D. W. (2002), “Bioinformatics: Sequence and Genome Analysis”, Cold Spring Harbor Press.
[20] Pevzner, P. A. (200), “Computational Molecular Biology: An Algorithmic Approach”, The MIT Press.