Bioinformatics and its application
Introduction:
An exceptional wealth of biological data has been generated by the Human Genome Project and by
sequencing projects in many other organisms. Bioinformatics has emerged to meet the resulting
demand for analysis and interpretation of these data. It is an interdisciplinary field that
harnesses computer science, mathematics, physics, and biology. Bayat (2002) broadly defines the
discipline as “the application of tools of computation and analysis to the capture and
interpretation of biological data” and adds, operationally, that “The main tools of a
bioinformatician are computer software programs and the internet. A fundamental activity is
sequence analysis of DNA and proteins using various programs and databases available on the
world wide web.”
NCBI also defines bioinformatics as a single broad discipline, but with three “important subdisciplines”:
• “The development of new algorithms and statistics with which to assess relationships
among members of large data sets;
• The analysis and interpretation of various types of data including nucleotide and amino
acid sequences, protein domains, and protein structures; and
• The development and implementation of tools that enable efficient access and management
of different types of information” (http://www.ncbi.nlm.nih.gov/Education/)
However, Ellis (2003a) lists 40 operational definitions published in 2000-2001 and another 37
(2003b) published in 2003, suggesting that the definitions vary by subdiscipline. It appears that
several specialties were working out for themselves their roles in molecular biology and the
legacy of computational biology’s influence on their work.
Features of Bioinformatics
Bioinformatics has three main features. First, at its simplest, bioinformatics organizes data in
a way that allows researchers to access existing information and to submit new entries, such as
3D macromolecular structures, as they are produced. While data curation is an essential task,
the information stored in these databases is essentially useless until it is analyzed, so the
purpose of bioinformatics extends much further. Second, it develops tools and resources that aid
in the analysis of data. For example, having sequenced a particular protein, it is of interest to
compare it with previously characterized sequences. This requires more than a simple text-based
search: programs such as alignment tools and sequence homology search tools must consider what
constitutes a biologically significant match. Developing such resources demands expertise in
computational theory as well as a thorough understanding of biology. Third, these tools are used
to analyze the data and interpret the results in a biologically meaningful manner.
Traditionally, biological studies examined individual systems in detail and frequently compared
them with a few related ones. In bioinformatics, we can now conduct global analyses of all the
available data with the aim of uncovering common principles that apply across many systems and
highlighting novel features. In recent years, new directions have emerged that point to the
future of the field. The practice of studying genetic disorders is changing from the
investigation of single genes in isolation to discovering cellular networks of genes,
understanding their complex interactions, and identifying their role in disease.
As a result, a new age of individually tailored medicine is expected to emerge.
Bioinformatics will guide and help molecular biologists and clinical researchers to capitalize on
the advantages brought by computational biology. The clinical research teams that will be most
successful in the coming decades will be those that can switch effortlessly between the
laboratory bench, clinical practice, and the use of these sophisticated computational tools.
Artificial intelligence, in the form of machine learning and neural networks, has also been
applied to gain a better understanding of disease, which opens the prospect of designing new
algorithms that make fuller use of it.
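To illustrate why comparing a new sequence with known ones needs more than a text search, here is a minimal Python sketch of a Smith-Waterman style local alignment score. The function name and the match, mismatch, and gap values are illustrative assumptions, not taken from any particular tool; real programs typically use substitution matrices and more elaborate gap penalties.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    The scoring values are illustrative assumptions only.
    """
    best = 0
    prev = [0] * (len(b) + 1)          # scores for the previous row
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)      # scores for the current row
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                     # start a new local alignment
                prev[j - 1] + step,    # align a[i-1] with b[j-1]
                prev[j] + gap,         # gap in b
                curr[j - 1] + gap,     # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# Two sequences that differ at one position: a plain substring test fails,
# but the alignment score still reflects their similarity.
query, subject = "ACGTTGCA", "ACGATGCA"
print(query in subject)
print(local_alignment_score(query, subject))
```

A substring test reports no match for these two sequences, while the alignment score stays high because seven of the eight positions agree. Deciding whether such a score is biologically significant is a separate statistical question that this sketch does not address.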
Table 1. Sources of data used in bioinformatics, the quantity of each type of data that was
available, and bioinformatics subject areas that utilize this data.
Data source | Bioinformatics topics
Raw DNA sequence | Separating coding and non-coding regions; identification of introns and exons; gene product prediction; forensic analysis
Protein sequence | Sequence comparison algorithms; multiple sequence alignment algorithms; identification of conserved sequence motifs
Macromolecular structure | Secondary, tertiary structure prediction; 3D structural alignment algorithms; protein geometry measurements; surface and volume shape calculations; intermolecular interactions
Genomes | Characterisation of repeats; structural assignments to genes; phylogenetic analysis; genomic-scale censuses (characterisation of protein content, metabolic pathways); linkage analysis relating specific genes to diseases
Gene expression | Correlating expression patterns; mapping expression data to sequence, structural and biochemical data
Literature | Digital libraries for automated bibliographical searches; knowledge databases of data from literature
Metabolic pathways | Pathway simulations
Application of Bioinformatics:
1. Sequence Analysis:
Sequence analysis has long been an important technique in bioinformatics. Apart from the basic
features that simply represent the nucleotide or amino acid at each position in a sequence, many
other features, such as higher-order combinations of these building blocks, can be derived, and
their number grows exponentially with the pattern length. The prediction of subsequences that
code for proteins has been a focus of interest since the early days of bioinformatics (Saeys et
al., 2007). To deal with the large number of possible features and the often limited number of
samples, Salzberg et al. (1998) introduced the interpolated Markov model (IMM), which
interpolates between Markov models of different orders to cope with small sample sizes and uses
a filter method to select only relevant features.
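The idea of interpolating between Markov models of different orders can be sketched in a few lines of Python. This is a simplified illustration, not the published IMM: the function names, the smoothing constant, and the count-based weight n / (n + smoothing) are assumptions made for this example. For DNA there are 4^k possible contexts of length k, which is why long contexts are rarely observed often enough to be trusted on their own.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_counts(sequences, max_order):
    """Count how often each base follows each context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, base in enumerate(seq):
            for k in range(max_order + 1):
                if i - k < 0:
                    break
                counts[seq[i - k:i]][base] += 1
    return counts


def imm_probability(counts, context, base, smoothing=10.0):
    """Estimate P(base | context) by blending short and long contexts.

    The weight given to each context grows with the number of times it was
    seen; this weighting rule is an assumption chosen for simplicity.
    """
    zero = counts.get("", {})
    total = sum(zero.values())
    # Order-0 estimate with add-one smoothing is the starting point.
    prob = (zero.get(base, 0) + 1) / (total + len(ALPHABET))
    for k in range(1, len(context) + 1):
        seen = counts.get(context[len(context) - k:], {})
        n = sum(seen.values())
        if n == 0:
            break  # longer contexts cannot have been seen either
        weight = n / (n + smoothing)
        prob = weight * (seen.get(base, 0) / n) + (1 - weight) * prob
    return prob


def log_likelihood(counts, seq, max_order):
    """Sum of log probabilities of each base given its preceding context."""
    return sum(
        math.log(imm_probability(counts, seq[max(0, i - max_order):i], base))
        for i, base in enumerate(seq)
    )


model = train_counts(["ATGATGATGATG"], max_order=2)
print(log_likelihood(model, "ATGATG", 2), log_likelihood(model, "CCCCCC", 2))
```

A sequence that resembles the training data receives a higher log-likelihood than one that does not, which is the basis for using such models to score candidate coding regions.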
A second class of techniques focuses on the prediction of protein function from sequence. The
early work of Chuzhanova et al. (1998), who combined a genetic algorithm with the Gamma test to
score feature subsets for classification of large subunits of rRNA, inspired researchers to use
feature selection (FS) techniques to focus on important subsets of amino acids that relate to
the protein’s functional class (Al-Shahib et al., 2005). An interesting technique is described in
Zavaljevsky et al. (2002), who use selective kernel scaling for support vector machines (SVM) to
assess feature weights and subsequently remove features with low weights.
Sequences are also involved in the recognition of conserved signals, which mainly represent
binding sites for various proteins or protein complexes. A common approach to finding regulatory
motifs is to relate motifs to gene expression levels using a regression approach (Saeys et al.,
2007). Feature selection can then be used to search for the motifs that maximize the fit to the
regression model (Keles et al., 2002; Tadesse et al., 2004). In Sinha (2003), a classification
approach is chosen to find discriminative motifs.
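As a rough illustration of relating motifs to expression through regression, the Python sketch below counts each candidate motif in a set of promoter sequences, fits a one-variable least-squares line of expression on motif count, and keeps the motif with the best fit. It is a deliberately simple univariate selection, not the method of any of the cited papers; the function names, sequences, and expression values are hypothetical.

```python
def count_occurrences(seq, motif):
    """Count possibly overlapping occurrences of motif in seq."""
    width = len(motif)
    return sum(1 for i in range(len(seq) - width + 1) if seq[i:i + width] == motif)


def r_squared(x, y):
    """Coefficient of determination for a least-squares fit of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable explains nothing
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def best_motif(promoters, expression, candidates):
    """Return the candidate whose counts best fit expression, plus all scores."""
    scores = {
        motif: r_squared([count_occurrences(p, motif) for p in promoters], expression)
        for motif in candidates
    }
    return max(scores, key=scores.get), scores


# Hypothetical promoter sequences and expression levels, for illustration only.
promoters = ["TGACTCATTT", "TGACTCATGACTCA", "AAAAAAAAAA", "CCTGACTCAG"]
expression = [1.0, 2.1, 0.1, 0.9]
winner, scores = best_motif(promoters, expression, ["TGACTCA", "AAAA", "CCCC"])
print(winner, scores)
```

In practice many motifs are considered jointly and the search over motif subsets is the hard part; this sketch only shows the scoring step that such a search would maximize.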
2. High-throughput genome analysis:
Advances in sequencing technology have led to a remarkable increase in the production of
experimental data. Genomics studies now typically involve the analysis of dozens of sequencing
datasets. Many kinds of data, such as transcripts/genes, exons/introns, promoter sites,
alignments, binding sites, repeat elements, microarray probes, sequencing data (RNA-seq,
ChIP-seq, DNA-seq, etc.), or chromosomal conformations (3C-seq, 4C-seq, etc.), can be
represented as genomic regions, i.e. ordered sets of genomic intervals, which in turn are
defined as tuples <chromosome, strand, start position, end position>. Handling this volume of
data requires efficient management of computational resources such as time and memory, as well
as development time (Tsirigos et al., 2012). Bioinformatics supports gene finding (the
prediction of introns and exons in a segment of DNA sequence), sequence comparison,
transcriptome analysis, and many other genome analyses, so that each dataset can be studied
thoroughly.
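The tuple representation of a genomic interval described above maps directly onto a small data type. The Python sketch below is one possible encoding, not the GenomicTools API: the class and function names are hypothetical, and coordinates are assumed to be closed intervals (both ends included).

```python
from typing import List, NamedTuple


class GenomicInterval(NamedTuple):
    """A <chromosome, strand, start position, end position> tuple."""
    chromosome: str
    strand: str
    start: int  # assumed inclusive
    end: int    # assumed inclusive


def overlaps(a: GenomicInterval, b: GenomicInterval) -> bool:
    """True if a and b share at least one position (strand is ignored)."""
    return a.chromosome == b.chromosome and a.start <= b.end and b.start <= a.end


def merge_overlapping(intervals: List[GenomicInterval]) -> List[GenomicInterval]:
    """Sort intervals, then merge those that overlap on the same chromosome and strand."""
    merged: List[GenomicInterval] = []
    for iv in sorted(intervals):  # tuple order: chromosome, strand, start, end
        last = merged[-1] if merged else None
        if (last is not None and last.chromosome == iv.chromosome
                and last.strand == iv.strand and iv.start <= last.end):
            merged[-1] = last._replace(end=max(last.end, iv.end))
        else:
            merged.append(iv)
    return merged


region = [
    GenomicInterval("chr1", "+", 150, 300),
    GenomicInterval("chr2", "+", 10, 20),
    GenomicInterval("chr1", "+", 100, 200),
    GenomicInterval("chr1", "-", 150, 300),
]
for interval in merge_overlapping(region):
    print(interval)
```

Keeping a region sorted is what makes operations such as merging or intersecting possible in a single pass, which matters when a dataset contains millions of intervals. Note that plain string sorting places "chr10" before "chr2"; a real tool would need a chromosome-aware ordering.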
3. Microarray analysis:
The advent of microarray datasets motivated a new line of research in bioinformatics. Microarray
data pose a great challenge for computational techniques because of their high dimensionality
and small sample sizes (Somorjai et al., 2003). To deal with these characteristics, the need for
dimension reduction techniques was quickly recognized (Alon et al., 1999; Ben-Dor et al., 2000;
Golub et al., 1999; Ross et al., 2000), and their application soon became a de facto standard in
the field.
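One simple form of dimension reduction for such data is a filter that scores each gene separately and keeps only the top-ranked ones. The Python sketch below ranks genes by a signal-to-noise style statistic, the difference in class means divided by the sum of class standard deviations. It is offered as an illustration of the filter idea under stated assumptions: the function names and the expression values are hypothetical, and each class is assumed to have at least two samples.

```python
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """(mean_a - mean_b) / (sd_a + sd_b); needs at least two values per class."""
    spread = stdev(values_a) + stdev(values_b)
    if spread == 0:
        return 0.0
    return (mean(values_a) - mean(values_b)) / spread


def top_genes(expression, labels, k):
    """Rank genes by absolute signal-to-noise between classes 0 and 1.

    expression maps a gene name to one value per sample; labels holds the
    class (0 or 1) of each sample in the same order.
    """
    scores = {}
    for gene, values in expression.items():
        class0 = [v for v, lab in zip(values, labels) if lab == 0]
        class1 = [v for v, lab in zip(values, labels) if lab == 1]
        scores[gene] = abs(signal_to_noise(class0, class1))
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical expression values for three genes across six samples.
expression = {
    "g1": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],
    "g2": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    "g3": [3.0, 1.0, 2.0, 2.5, 1.5, 2.2],
}
labels = [0, 0, 0, 1, 1, 1]
print(top_genes(expression, labels, k=2))
```

Because each gene is scored independently, a filter like this scales to thousands of genes, but it ignores interactions between genes, which is one reason other families of feature selection methods exist.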
4. Mass spectra analysis:
Mass spectrometry (MS) technology is an emerging and attractive framework for disease
diagnosis and protein-based biomarker profiling (Petricoin and Liotta, 2003). A mass spectrum
sample is characterized by mass/charge (m/z) ratios on the x-axis, each with its corresponding
signal intensity value on the y-axis. For data mining and bioinformatics purposes, it can
initially be assumed that each m/z ratio represents a distinct variable whose value is the
intensity. Somorjai et al. (2003) explain that the data analysis step is severely constrained by
both the high-dimensional input spaces and their inherent sparseness. Starting from the raw
data, and after an initial step to reduce noise and normalize the spectra from different samples
(Coombes et al., 2007), the next step is to extract the variables that will form the initial
pool of candidate discriminative features.
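The pipeline just described, normalizing a spectrum and then extracting candidate variables, can be sketched in Python as follows. Total-intensity normalization and local-maximum peak picking are simple stand-ins chosen for illustration, not the preprocessing of the cited work; the function names, the threshold, and the spectrum values are hypothetical.

```python
def normalize_total_intensity(intensities):
    """Scale intensities so that they sum to 1 (left unchanged if all are zero)."""
    total = sum(intensities)
    if total == 0:
        return list(intensities)
    return [value / total for value in intensities]


def pick_peaks(mz, intensities, min_intensity):
    """Return (m/z, intensity) pairs that are local maxima at or above a threshold."""
    peaks = []
    for i in range(1, len(intensities) - 1):
        here = intensities[i]
        if (here >= min_intensity
                and here > intensities[i - 1]
                and here >= intensities[i + 1]):
            peaks.append((mz[i], here))
    return peaks


# Hypothetical spectrum: m/z positions on the x-axis, raw intensities on the y-axis.
mz = [1000.0, 1000.5, 1001.0, 1001.5, 1002.0, 1002.5, 1003.0]
raw = [1.0, 2.0, 10.0, 3.0, 1.0, 6.0, 1.0]
normalized = normalize_total_intensity(raw)
for position, height in pick_peaks(mz, normalized, min_intensity=0.2):
    print(position, round(height, 3))
```

Each retained peak becomes one candidate feature, so peak picking already acts as a first reduction from thousands of m/z positions to a much smaller pool of variables.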
References:
1. Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D. and Levine A.J.
1999. Broad patterns of gene expression revealed by clustering analysis of tumor and
normal colon tissues probed by oligonucleotide arrays. Proc. Nat. Acad. Sci. USA 96:
6745–6750.
2. Al-Shahib, A., Breitling, R. and Gilbert D. 2005. Feature selection and the class
imbalance problem in predicting protein function from sequence. Appl. Bioinformatics
4(3): 195–203.
3. Bayat, A. 2002, April. Science, medicine, and the future: Bioinformatics. British
Medical Journal 324: 1018-1022. Retrieved June 24, 2003 from
http://bmj.com/cgi/reprint/324/7344/1018
4. Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M. and Yakhini, Z.
2000. Tissue classification with gene expression profiles. J. Comput. Biol., 7(3-4): 559–
584.
5. Butler, D. 2001. Are you ready for the revolution? Nature 409: 758-760.
6. Chuzhanova, N.A., Jones, A.J. and Margetts, S. 1998. Feature selection for genetic
sequence classification. Bioinformatics 14(2): 139–143.
7. Coombes, K., Baggerly, K. and Morris, J. 2007. Pre-processing mass spectrometry data.
In Dubitzky, W., Granzow, M., and Berrar, D. (eds.), Fundamentals of Data Mining in
Genomics and Proteomics. Kluwer, Boston, pp. 79–99.
8. Debouk, C. and Metcalf, B. 2000. The impact of genomics on drug discovery. Annu. Rev.
Pharmacol. Toxicol. 40: 193-208.
9. Ellis, L. 2003a. What is bioinformatics? 2000-2001. Retrieved June 24, 2003 from
http://www.binf.umn.edu/whatsbinf2000.html
10. Ellis, L. 2003b. What is bioinformatics? 2003. Retrieved June 24, 2003 from
http://www.binf.umn.edu/whatsbinf.html
11. Golub, T.R, Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller,
H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D. and Lander, E.S. 1999.
Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene
Expression Monitoring. Science, 286: 531–537.
12. Keles, S., van der Laan, M.J. and Eisen, M.B. 2002. Identification of regulatory elements
using a feature selection method. Bioinformatics, 18: 1167–1175.
13. Petricoin, E. and Liotta, L. 2003. Mass spectrometry-based diagnostic: the upcoming
revolution in disease detection. Clin. Chem. 49: 533–534.
14. Ross, D., et al. 2000. Systematic variation in gene expression patterns in human cancer
cell lines. Nat. Genet., 24: 227–234.
15. Saeys, Y., Inza, I. and Larrañaga, P. 2007. A review of feature selection techniques in
bioinformatics. Bioinformatics, 23(19): 2507–2517.
16. Salzberg, S., et al. 1998. Microbial gene identification using interpolated Markov
models. Nucleic Acids Res., 26: 544–548.
17. Sinha, S. 2003. Discriminative motifs. J. Comput. Biol., 10: 599–615.
18. Somorjai, R.L., Dolenko, B. and Baumgartner, R. 2003. Class prediction and discovery
using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions.
Bioinformatics, 19(12): 1484–1491.
19. Tadesse, M.G., Vannucci, M. and Liò, P. 2004. Identification of DNA regulatory motifs
using Bayesian variable selection. Bioinformatics, 20(16): 2553–2561.
20. Tsirigos, A., Haiminen, N., Bilal, E. and Utro F. 2012. GenomicTools:
a computational platform for developing high-throughput analytics in genomics.
Bioinformatics. 28(2):282-283.
21. Zavaljevsky, N., Stevens, F.J. and Reifman, J. 2002. Support vector machines with
selective kernel scaling for protein classification and identification of key amino acid
positions. Bioinformatics, 18(5): 689–696.