Bioinformatics and its application

Introduction:
An exceptional wealth of biological data has been generated by the human genome project and by sequencing projects in many other organisms. The huge demand for analysis and interpretation of these data is being managed by bioinformatics. Bioinformatics is defined as the application of tools of computation and analysis to the capture and interpretation of biological data. It is an interdisciplinary field, which harnesses computer science, mathematics, physics, and biology.

Bayat (2002) accurately and broadly defines the discipline as “the application of tools of computation and analysis to the capture and interpretation of biological data” and, operationally, states that “The main tools of a bioinformatician are computer software programs and the internet. A fundamental activity is sequence analysis of DNA and proteins using various programs and databases available on the world wide web”.

NCBI also defines bioinformatics as a single broad discipline, but with three “important subdisciplines”:
• “The development of new algorithms and statistics with which to assess relationships among members of large data sets;
• The analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and
• The development and implementation of tools that enable efficient access and management of different types of information” (http://www.ncbi.nlm.nih.gov/Education/)

However, Ellis (2003a) notes 40 published operational definitions from 2000-2001 and another 37 in 2003 (Ellis, 2003b), suggesting that the definitions vary by subdiscipline. It appears that several specialties were working out for themselves their roles in molecular biology and the legacy of computational biology’s influence on their work.

Features of Bioinformatics
The features of bioinformatics are of three types.
First, at its simplest, bioinformatics organizes data in a way that allows researchers to access existing information and to submit new entries as they are produced, for example 3D macromolecular structures. While data curation is an essential task, the information stored in these databases is essentially useless until analyzed. Thus the purpose of bioinformatics extends much further.

Second, it develops tools and resources that aid in the analysis of data. For example, having sequenced a particular protein, it is of interest to compare it with previously characterized sequences. This needs more than a simple text-based search: it requires programs, such as alignment and sequence homology search tools, that must consider what comprises a biologically significant match. Development of such resources demands expertise in computational theory as well as a thorough understanding of biology.

Third, these tools are used to analyze the data and interpret the results in a biologically meaningful manner. Traditionally, biological studies examined individual systems in detail and frequently compared them with a few related ones. In bioinformatics, we can now conduct global analyses of all the available data with the aim of uncovering common principles that apply across many systems and of highlighting novel features.

In recent years, however, new directions have emerged that are shaping the future of bioinformatics. The practice of studying genetic disorders is changing from the investigation of single genes in isolation to discovering cellular networks of genes, understanding their complex interactions, and identifying their role in disease. As a result, a whole new age of individually tailored medicine will emerge. Bioinformatics will guide and help molecular biologists and clinical researchers to capitalize on the advantages brought by computational biology.
The clinical research teams that will be most successful in the coming decades will be those that can switch effortlessly between the laboratory bench, clinical practice, and the use of these sophisticated computational tools. Artificial intelligence, in the form of machine learning and neural networks, has been applied to gain a better understanding of disease. This opens the prospect of designing new algorithms that make fuller use of artificial intelligence.

Table 1. Sources of data used in bioinformatics, the quantity of each type of data that was available, and bioinformatics subject areas that utilize this data.

Data source                 Bioinformatics topics
Raw DNA sequence            Separating coding and non-coding regions
                            Identification of introns and exons
                            Gene product prediction
                            Forensic analysis
Protein sequence            Sequence comparison algorithms
                            Multiple sequence alignments algorithms
                            Identification of conserved sequence motifs
Macromolecular structure    Secondary, tertiary structure prediction
                            3D structural alignment algorithms
                            Protein geometry measurements
                            Surface and volume shape calculations
                            Intermolecular interactions
Genomes                     Characterisation of repeats
                            Structural assignments to genes
                            Phylogenetic analysis
                            Genomic-scale censuses (characterisation of protein content, metabolic pathways)
                            Linkage analysis relating specific genes to diseases
Gene expression             Correlating expression patterns
                            Mapping expression data to sequence, structural and biochemical data
Literature                  Digital libraries for automated bibliographical searches
                            Knowledge databases of data from literature
Metabolic pathways          Pathway simulations

Application of Bioinformatics:
1. Sequence Analysis: Sequence analysis has been an important technique in bioinformatics.
Apart from the basic features that simply represent the nucleotide or amino acid at each position in a sequence, many other features, such as higher-order combinations of these building blocks, can be derived, their number growing exponentially with the pattern length. The prediction of subsequences that code for proteins has been a focus of interest since the early days of bioinformatics (Saeys et al., 2007). To deal with the large number of possible features that can be extracted from sequences, and the often limited number of samples, Salzberg et al. (1998) introduced the interpolated Markov model (IMM), which interpolates between different orders of the Markov model to cope with small sample sizes and uses a filter method to select only relevant features.

A second class of techniques focuses on the prediction of protein function from sequence. The early work of Chuzhanova et al. (1998), who combined a genetic algorithm with the Gamma test to score feature subsets for classification of large subunits of rRNA, inspired researchers to use feature selection (FS) techniques to focus on important subsets of amino acids that relate to the protein’s functional class (Al-Shahib et al., 2005). An interesting technique is described in Zavaljevsky et al. (2002), which uses selective kernel scaling for support vector machines (SVM) as a way to assess feature weights and subsequently remove features with low weights.

Sequences are also involved in the recognition of conserved signals, which mainly represent binding sites for various proteins or protein complexes. A common approach to finding regulatory motifs is to relate motifs to gene expression levels using a regression approach (Saeys et al., 2007). Feature selection can then be used to search for the motifs that maximize the fit to the regression model (Keles et al., 2002; Tadesse et al., 2004). In Sinha (2003), a classification approach is chosen to find discriminative motifs.

2. High-throughput genome analysis: Advances in sequencing technology have led to a remarkable increase in the production of experimental data. Genomics studies now typically involve the analysis of dozens of sequencing datasets. Data such as transcripts/genes, exons/introns, promoter sites, alignments, binding sites, repeat elements, microarray probes, sequencing data (RNA-seq, ChIP-seq, DNA-seq, etc.), or chromosomal conformations (3C-seq, 4C-seq, etc.) can be represented as genomic regions, i.e. ordered sets of genomic intervals, which in turn are defined as tuples: <chromosome, strand, start position, end position>. Given the amount of data, the analysis requires efficient management of computational resources such as time, memory and development time (Tsirigos et al., 2012). Gene finding (the prediction of introns and exons in a segment of DNA sequence), sequence comparison, transcriptome analysis and many other genome analyses are carried out with bioinformatics tools, so that each dataset can be studied in detail.

3. Microarray analysis: The advent of microarray datasets motivated a new line of research in bioinformatics. Microarray data pose a great challenge for computational techniques because of their large dimensionality and their small sample sizes (Somorjai et al., 2003). To deal with these particular characteristics of microarray data, the obvious need for dimension reduction techniques was realized (Alon et al., 1999; Ben-Dor et al., 2000; Golub et al., 1999; Ross et al., 2000), and their application soon became standard practice in the field.

4. Mass spectra analysis: Mass spectrometry (MS) technology is an emerging and attractive framework for disease diagnosis and protein-based biomarker profiling (Petricoin and Liotta, 2003). A mass spectrum sample is characterized by mass/charge (m/z) ratios on the x-axis, each with its corresponding signal intensity value on the y-axis.
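The representation of a mass spectrum described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a method taken from the cited works: the function name `spectra_to_matrix`, the sample values, and the assumption that m/z values are already aligned across samples (with missing peaks filled in as 0.0) are all hypothetical choices made for the example.

```python
from typing import List, Tuple

# A spectrum is a list of (m/z, intensity) pairs. The values below are
# made-up illustration data, not measurements from any study.
Spectrum = List[Tuple[float, float]]


def spectra_to_matrix(spectra: List[Spectrum]) -> Tuple[List[float], List[List[float]]]:
    """Arrange spectra as rows of a samples-by-m/z intensity matrix.

    Assumption: m/z values are already aligned across samples, so identical
    values denote the same variable. Peaks absent from a sample become 0.0.
    """
    # Column order: every distinct m/z value seen in any sample, ascending.
    mz_axis = sorted({mz for spectrum in spectra for mz, _ in spectrum})
    column = {mz: j for j, mz in enumerate(mz_axis)}

    matrix = []
    for spectrum in spectra:
        row = [0.0] * len(mz_axis)
        for mz, intensity in spectrum:
            row[column[mz]] = intensity
        matrix.append(row)
    return mz_axis, matrix


sample_a: Spectrum = [(1000.5, 12.0), (1500.2, 3.5), (2000.0, 8.1)]
sample_b: Spectrum = [(1000.5, 10.4), (2000.0, 9.9), (2500.7, 1.2)]

mz_axis, matrix = spectra_to_matrix([sample_a, sample_b])
print(mz_axis)
for row in matrix:
    print(row)
```

Each row of `matrix` is one sample and each column one m/z value. Even in this toy case the rows contain zeros where a sample has no peak, and real spectra have far more m/z values than samples.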
For data mining and bioinformatics purposes, it can initially be assumed that each m/z ratio represents a distinct variable whose value is the intensity. Somorjai et al. (2003) explain that the data analysis step is severely constrained by both high-dimensional input spaces and their inherent sparseness. Starting from the raw data, and after an initial step to reduce noise and normalize the spectra from different samples (Coombes et al., 2007), the following step is to extract the variables that will represent the initial pool of candidate discriminative features.

References:
1. Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D. and Levine, A.J. 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Nat. Acad. Sci. USA 96: 6745–6750.
2. Al-Shahib, A., Breitling, R. and Gilbert, D. 2005. Feature selection and the class imbalance problem in predicting protein function from sequence. Appl. Bioinformatics 4(3): 195–203.
3. Bayat, A. 2002, April. Science, medicine, and the future: Bioinformatics. British Medical Journal 324: 1018–1022. Retrieved June 24, 2003 from http://bmj.com/cgi/reprint/324/7344/1018
4. Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M. and Yakhini, Z. 2000. Tissue classification with gene expression profiles. J. Comput. Biol. 7(3-4): 559–584.
5. Butler, D. 2001. Are you ready for the revolution? Nature 409: 758–760.
6. Chuzhanova, N.A., Jones, A.J. and Margetts, S. 1998. Feature selection for genetic sequence classification. Bioinformatics 14(2): 139–143.
7. Coombes, K., Baggerly, K. and Morris, J. 2007. Pre-processing mass spectrometry data. In Dubitzky, W., Granzow, M. and Berrar, D. (eds.), Fundamentals of Data Mining in Genomics and Proteomics. Kluwer, Boston, pp. 79–99.
8. Debouk, C. and Metcalf, B. 2000. The impact of genomics on drug discovery. Annu. Rev. Pharmacol. Toxicol. 40: 193–208.
9. Ellis, L. 2003a. What is bioinformatics? 2000-2001. Retrieved June 24, 2003 from http://www.binf.umn.edu/whatsbinf2000.html
10. Ellis, L. 2003b. What is bioinformatics? 2003. Retrieved June 24, 2003 from http://www.binf.umn.edu/whatsbinf.html
11. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D. and Lander, E.S. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286: 531–537.
12. Keles, S., van der Laan, M.J. and Eisen, M.B. 2002. Identification of regulatory elements using a feature selection method. Bioinformatics 18: 1167–1175.
13. Petricoin, E. and Liotta, L. 2003. Mass spectrometry-based diagnostic: the upcoming revolution in disease detection. Clin. Chem. 49: 533–534.
14. Ross, D., et al. 2000. Systematic variation in gene expression patterns in human cancer cell lines. Nat. Genet. 24: 227–234.
15. Saeys, Y., Inza, I. and Larrañaga, P. 2007. A review of feature selection techniques in bioinformatics. Bioinformatics 23(19): 2507–2517.
16. Salzberg, S., et al. 1998. Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 26: 544–548.
17. Sinha, S. 2003. Discriminative motifs. J. Comput. Biol. 10: 599–615.
18. Somorjai, R.L., Dolenko, B. and Baumgartner, R. 2003. Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics 19(12): 1484–1491.
19. Tadesse, M.G., Vannucci, M. and Liò, P. 2004. Identification of DNA regulatory motifs using Bayesian variable selection. Bioinformatics 20(16): 2553–2561.
20. Tsirigos, A., Haiminen, N., Bilal, E. and Utro, F. 2012. GenomicTools: a computational platform for developing high-throughput analytics in genomics. Bioinformatics 28(2): 282–283.
21. Zavaljevsky, N., Stevens, F.J. and Reifman, J. 2002. Support vector machines with selective kernel scaling for protein classification and identification of key amino acid positions. Bioinformatics 18(5): 689–696.