Algorithms in Bioinformatics I, WS’06, ZBIT, D. Huson, October 16, 2006

1 Introduction

1.1 What is bioinformatics? What is computational biology?

Bioinformatics and computational biology are multidisciplinary fields at the intersection of the natural sciences (esp. biology and chemistry), computer science and mathematics. The two terms are often used synonymously, but sometimes with different nuances. The US National Institutes of Health defined both notions as follows:

• Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
• Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

In German, the term “Bioinformatik” is used for both; the term “Computerbiologie” is virtually unused. When we speak of bioinformatics in the following, we include the notions of computational biology. (Note: in October 2006 Google found 26.7 million pages on Bioinformatics, 4.1 million on Computational Biology, 1.6 million on Bioinformatik and 238 on Computerbiologie.)

1.2 Why is there a need for this new scientific discipline?

Technological advances, such as high-throughput sequencers, DNA arrays, protein mass spectrometry, faster processors and larger computer memories, have helped transform molecular biology into a high-throughput science in which huge amounts of data are being generated ever faster. To handle and interpret these data, bioinformaticians are needed who understand the experiments by which the data are generated and how the data can be efficiently processed and stored.
They not only need to be capable of developing or adapting algorithms and their implementations, but must also be able to communicate with the scientists involved from various other fields. Bioinformaticians develop simulations to test theoretical models of biological or chemical processes, possibly derived from data analysis. Verified models can be implemented in predictive programs that make it possible, for example, to narrow down the number of biological experiments needed in drug research (known as rational drug design).

1.3 Contents of the lecture

1. Introduction
2. Probabilities
3. DNA compression algorithm
4. Pairwise sequence alignment
5. BLAST: Searching for similar sequences in a database
6. Multiple sequence alignment
7. Genome comparison
8. RNA secondary structure
9. Protein secondary structure
10. Protein tertiary structure
11. Physical mapping
12. Hidden Markov models

1.4 Probability theory is vital in bioinformatics

Figure 1.1: A die: Can you be sure that it is fair?

Probability theory and statistics are important to bioinformatics:

• One wants to estimate how significant the result of a predictive or heuristic method is.
• In heuristic algorithms, probabilistic decisions may be taken to walk through the solution space (e.g. in simulated annealing).
• etc.

1.5 DNA compression

Figure 1.2: DNA compression takes repeats into account to efficiently reduce the space requirement.

The DNA compression algorithm discussed in this lecture uses a 2-bit nucleotide encoding and repeat recognition. The compression rates of two genomes make it possible to estimate their relatedness.

1.6 Pairwise sequence alignment

Hemoglobin is the oxygen carrier in the blood. Myoglobin, found in muscle, also carries oxygen, and both have a heme group as a cofactor to which O2 is bound (see Figure 1.3).

Figure 1.3: Heme molecule with one oxygen molecule bound.

Are the two protein sequences similar? A pairwise sequence alignment can help! (see Figure 1.4).
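
To make the idea of a pairwise alignment concrete, the following is a minimal sketch of a global alignment computed by dynamic programming (the Needleman-Wunsch scheme). The function name `global_align`, the scoring values (match +1, mismatch -1, gap -1) and the toy DNA strings are assumptions chosen for illustration and are not taken from the lecture; alignments of real proteins such as hemoglobin and myoglobin would normally use an amino acid substitution matrix instead of a flat match/mismatch score.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b.

    Scoring values are illustrative assumptions, not values from the lecture.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning the prefixes a[:i] and b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    s, x, y = global_align("ACGT", "AGT")
    print(s)
    print(x)
    print(y)
```

The table has (n+1) x (m+1) entries, so time and memory both grow with the product of the two sequence lengths. The traceback returns only one of possibly several optimal alignments.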
Figure 1.4: A pairwise alignment of human hemoglobin (α chain) and myoglobin.

1.7 BLAST: Basic Local Alignment Search Tool

MbtH-like protein occurs in certain bacteria that produce antibiotics. Its function is still unclear. Are there other proteins in the databases of known proteins that are similar to it and might have a known function? BLAST compares a query protein to all proteins in the database (see Figure 1.5 and Figure 1.6).

Figure 1.5: Result of a BLAST search with the MbtH-like protein of Amycolatopsis balhimycina as query and Swiss-Prot as database.

Figure 1.6: The best relevant hit is a hypothetical protein with unknown function.

1.8 Multiple sequence alignment

Besides hemoglobin and myoglobin, a third protein uses a heme cofactor to transport oxygen: leghemoglobin, a hemoprotein found in the nitrogen-fixing root nodules of leguminous plants. A multiple sequence alignment (MSA) of all three globins shows their mutual conservation.

Figure 1.7: Multiple sequence alignment of leghemoglobin (of yellow lupine), human hemoglobin and myoglobin. Pairwise identities: hemoglobin vs. myoglobin 25%, leghemoglobin vs. myoglobin 23% and leghemoglobin vs. hemoglobin 14%.

Figure 1.8: The 3D structures of hemoglobin (α chain), myoglobin and leghemoglobin are astoundingly similar despite their low sequence identity. [Source: Berg, Tymoczko & Stryer (2002). Biochemistry].

The three-dimensional structure is much more closely associated with function than the sequence is. This is why structure is evolutionarily more conserved than sequence.

1.9 Genome comparison

Figure 1.9: Chimp-human chromosome differences. The major structural difference is that human chromosome 2 (green color code) was derived from two smaller chromosomes that are found in other great apes (now called 2A and 2B, see: Entrez PubMed 15218271).
In species that are more distantly related to humans, such as cat and rat, parts of human chromosome 2 are scattered among parts of several chromosomes (more ancient common ancestors; about 85 million years since the human/rodent common ancestor: Entrez PubMed 12552136). [Source: Wikipedia, Chimpanzee Genome Project].

1.10 RNA secondary structure prediction

In this chapter we will discuss algorithms to predict the secondary structure of RNAs.

Figure 1.10: Secondary structure of tRNA^Leu(UUR): The indicated nucleotide transitions are associated with the MELAS syndrome. [Source: Saara Finnilä, Oulu University, 2000].

Figure 1.11: 3D structure model of a transfer RNA (tRNA). [Source: Wikipedia, transfer RNA].

1.11 Prediction of secondary and tertiary protein structure

Figure 1.12: 10 models for the tertiary structure of MbtH-like protein generated by ROSETTA.

1.12 Physical mapping

Physical mapping is the process of identifying the order and orientation (called the layout) of overlapping unsequenced DNA pieces (sequences). A prerequisite is to first locate “orientation points” called markers: short known words (subsequences), with a length on the order of 10 base pairs, that should occur in more than one sequence. In the physical map, the sequences should be placed and oriented consistently with the marker information.

Figure 1.13: A map of human chromosome X. [Source: NCBI Mapview].

1.13 Hidden Markov Models (HMMs)

HMMs are statistical models used particularly in pattern recognition. Because only their output (the data) is visible, the underlying model is hidden and has to be estimated.

Figure 1.14: A simple Hidden Markov Model of the weather. [Source: K. Noto, University of Wisconsin-Madison].
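
As an illustration of what "hidden" means here, the following minimal sketch decodes the most likely sequence of hidden weather states from a sequence of visible observations using the Viterbi algorithm. The two states, the three observable activities and all probabilities are hypothetical values chosen for this example; they are not the parameters of the model in Figure 1.14. The sketch also assumes that the model parameters are already known; estimating them from data is a separate problem that is not shown.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[t][s] = probability of the best state path ending in s at time t.
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    # back[t][s] = predecessor of s on that best path.
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: best[t - 1][r] * trans_p[r][s])
            best[t][s] = (
                best[t - 1][prev] * trans_p[prev][s] * emit_p[s][observations[t]]
            )
            back[t][s] = prev

    # Follow the back pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path


# Hypothetical toy model: the weather is hidden, only the activity is observed.
STATES = ["Rainy", "Sunny"]
START_P = {"Rainy": 0.6, "Sunny": 0.4}
TRANS_P = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
EMIT_P = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

if __name__ == "__main__":
    prob, path = viterbi(["walk", "shop", "clean"], STATES, START_P, TRANS_P, EMIT_P)
    print(path, prob)
```

With these assumed parameters, the observations walk, shop, clean are best explained by the hidden path Sunny, Rainy, Rainy. For long observation sequences the products of probabilities become very small, so practical implementations usually work with log-probabilities instead.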