Algorithms in Bioinformatics I, WS’06, ZBIT, D. Huson, October 16, 2006
1 Introduction
1.1 What is bioinformatics? What is computational biology?
Bioinformatics and computational biology are multidisciplinary fields, at the intersection of natural
sciences (esp. biology, chemistry), computer science and mathematics. Both terms are often used
synonymously but sometimes they are used with different nuances.
The US National Institutes of
Health defined both notions as follows:
• Bioinformatics: Research, development, or application of computational tools and approaches for
expanding the use of biological, medical, behavioral or health data, including those to acquire,
store, organize, archive, analyze, or visualize such data.
• Computational Biology: The development and application of data-analytical and theoretical
methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.
In German, the term “Bioinformatik” is used for both; the term “Computerbiologie” is virtually never
used. When we speak of bioinformatics in the following, we include the notions of computational biology.
(Note: in October 2006 Google found 26.7 million pages on Bioinformatics, 4.1 million on Computational Biology,
1.6 million on Bioinformatik and 238 on Computerbiologie.)
1.2 Why is there a need for this new scientific discipline?
Technological advances, such as high-throughput sequencers, DNA arrays, protein mass spectrometry,
faster processors and larger computer memories, have helped transform molecular biology into a
high-throughput science, in which huge amounts of data are being generated ever faster.
To handle and interpret these data, bioinformaticians are needed who understand the experiments by
which the data are generated and how the data can be efficiently processed and stored. They not only
need to be capable of developing or adapting algorithms and their implementations, but also need to be
able to communicate with the other scientists involved, who come from various fields. Bioinformaticians
develop simulations to test theoretical models of biological or chemical processes, possibly derived
from data analysis. Verified models can be implemented in predictive programs that, for example, make
it possible to narrow down the number of necessary biological experiments in drug research (known as
rational drug design).
1.3 Contents of the lecture
1. Introduction
2. Probabilities
3. DNA compression algorithm
4. Pairwise sequence alignment
5. BLAST: Searching for similar sequences in a database
6. Multiple sequence alignment
7. Genome comparison
8. RNA secondary structure
9. Protein secondary structure
10. Protein tertiary structure
11. Physical mapping
12. Hidden Markov models
1.4 Probability theory is vital in bioinformatics
Figure 1.1: A die: Can you be sure that it is fair?
Probability theory and statistics are important to bioinformatics:
• one wants to estimate how significant the result of a predictive or heuristic method is.
• in heuristic algorithms probabilistic decisions may be taken to walk through solution space (e.g.
in simulated annealing)
• etc...
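As a small illustration of estimating significance, the question in Figure 1.1 can be approached with a Pearson chi-square goodness-of-fit test. The following is only a sketch: the function name, the example rolls and the choice of a 5% significance level are assumptions made for illustration and are not taken from the lecture.

```python
from collections import Counter

def chi_square_statistic(rolls, sides=6):
    """Pearson chi-square statistic of observed rolls against a fair die."""
    counts = Counter(rolls)
    expected = len(rolls) / sides  # a fair die shows each face equally often
    return sum((counts.get(face, 0) - expected) ** 2 / expected
               for face in range(1, sides + 1))

# Standard table value: chi-square distribution, 5 degrees of freedom,
# 5% significance level.
CRITICAL_5DF_5PCT = 11.07

# Hypothetical data: 60 rolls of a die that only ever shows a six.
rolls = [6] * 60
statistic = chi_square_statistic(rolls)
print(statistic, "reject fairness" if statistic > CRITICAL_5DF_5PCT else "no evidence against fairness")
```

A statistic above the critical value means the observed counts would be unlikely for a fair die. A statistic below it does not prove fairness; it only means the data give no evidence against it.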
1.5 DNA compression
Figure 1.2: DNA compression takes repeats into account to efficiently reduce space requirement.
The DNA compression algorithm discussed in this lecture uses a 2-bit nucleotide encoding together with
repeat recognition. The compression rates of two genomes make it possible to estimate their relatedness.
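The 2-bit encoding alone can be sketched as follows. This is a minimal illustration, not the algorithm from the lecture: the particular bit assignment (A=00, C=01, G=10, T=11), the function names and the padding of the last byte are assumptions, and the repeat recognition step is not shown.

```python
# Assumed bit assignment; any fixed mapping of four bases to two bits works.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"

def pack(seq):
    """Pack a DNA string into bytes, four nucleotides per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)

def unpack(data, length):
    """Recover the first `length` nucleotides from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])

packed = pack("ACGTACGTA")
print(len(packed), unpack(packed, 9))
```

Because the last byte may be padded, the original sequence length has to be stored alongside the packed bytes. Compared with one byte per character, this encoding alone reduces the space requirement to about a quarter.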
1.6 Pairwise sequence alignment
Hemoglobin is the oxygen carrier in the blood. Myoglobin, found in muscle, also carries oxygen, and both
have a heme group as a cofactor to which O2 is bound (see Figure 1.3).
Figure 1.3: Heme molecule with one oxygen molecule bound.
Are the two protein sequences similar?
A pairwise sequence alignment can help (see Figure 1.4).
Figure 1.4: A pairwise alignment of human hemoglobin (α chain) and myoglobin.
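To give a first impression of how an alignment can be scored, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch recurrence). The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration; real protein alignments such as the one in Figure 1.4 typically use substitution matrices instead.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of strings a and b (score only)."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j characters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first i characters of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag,                # align a[i-1] with b[j-1]
                            prev[j] + gap,       # a[i-1] against a gap
                            curr[j - 1] + gap))  # b[j-1] against a gap
        prev = curr
    return prev[-1]

print(global_alignment_score("ACGT", "AGT"))
```

The sketch keeps only two rows of the table, which is enough for the score. Recovering the alignment itself would require the full table or a traceback structure.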
1.7 BLAST: Basic Local Alignment Search Tool
MbtH-like protein is a protein that occurs in certain bacteria that produce antibiotics. Its function is
still unclear.
Are there other proteins in the databases of known proteins that are similar to it and might have a
known function?
BLAST will compare a query protein to all proteins in the database (see Figure 1.5 and Figure 1.6).
Figure 1.5: Result of a BLAST search with MbtH-like protein of Amycolatopsis balhimycina as query and Swiss-Prot
as database.
Figure 1.6: The best relevant hit is a hypothetical protein with unknown function.
1.8 Multiple sequence alignment
Besides hemoglobin and myoglobin, a third protein uses a heme cofactor to transport oxygen:
leghemoglobin, a hemoprotein found in the nitrogen-fixing root nodules of leguminous plants. A multiple
sequence alignment (MSA) of all three globins shows their mutual conservation.
Figure 1.7: Multiple sequence alignment of leghemoglobin (of yellow lupine), human hemoglobin and myoglobin.
Pairwise identities: hemoglobin vs. myoglobin 25%, leghemoglobin vs. myoglobin 23% and leghemoglobin vs. hemoglobin
14%.
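Pairwise identities like those quoted in the caption of Figure 1.7 can be computed from two rows of an alignment. The sketch below assumes one common convention, namely counting identical residues over the columns in which neither row has a gap. Other conventions exist, and the source does not state which one was used. The function name and the example rows are hypothetical.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent identity of two aligned rows, ignoring columns with a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = identical = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            continue  # skip columns where either row has a gap
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0

# Hypothetical aligned rows: three identical residues in four compared columns.
print(percent_identity("AC-GT", "ACTGA"))
```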
Figure 1.8: The 3D structures of hemoglobin (α chain), myoglobin and leghemoglobin are astoundingly similar despite
their low sequence identity. [Source: Berg, Tymoczko & Stryer (2002). Biochemistry].
The three-dimensional structure is much more closely associated with function than the sequence is. This
is the reason why it is evolutionarily more conserved.
1.9 Genome comparison
Figure 1.9: Chimp-human chromosome differences. The major structural difference is that human chromosome 2
(green color code) was derived from two smaller chromosomes that are found in other great apes (now called 2A and
2B, see: Entrez PubMed 15218271). Parts of human chromosome 2 are scattered among parts of several cat and rat
chromosomes in these species that are more distantly related to humans (more ancient common ancestors; about 85
million years since the human/rodent common ancestor: Entrez PubMed 12552136). [Source: Wikipedia, Chimpanzee
Genome Project].
1.10 RNA secondary structure prediction
In this chapter we will discuss algorithms to predict the secondary structure of RNAs.
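As a foretaste, one classic approach is base-pair maximization by dynamic programming (the Nussinov recurrence). This introduction does not say which algorithms the chapter covers, so the following is only an illustrative sketch. The set of allowed pairs (Watson-Crick plus G-U wobble), the minimum hairpin loop length of 3 and the function name are assumptions.

```python
# Assumed pairing rules: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs in an RNA string."""
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j stays unpaired
            # case 2: j pairs with some k, leaving at least min_loop bases between
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]

print(max_base_pairs("GGGAAACCC"))
```

Counting base pairs is only a crude stand-in for stability. Practical prediction methods usually minimize an estimated free energy instead.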
Figure 1.10: Secondary structure of tRNALeu(UUR): The indicated nucleotide transitions are associated with the
MELAS syndrome. [Source: Saara Finnilä, Oulu University, 2000].
Figure 1.11: 3D structure model of a transfer RNA (tRNA). [Source: Wikipedia, transfer RNA].
1.11 Prediction of secondary and tertiary protein structure
Figure 1.12: 10 models for the tertiary structure of MbtH-like protein generated by ROSETTA.
1.12 Physical mapping
Physical mapping is the process of identifying the order and orientation (called the layout) of overlapping
unsequenced DNA pieces (sequences). A prerequisite is that the positions of “orientation points”,
called markers, have been determined beforehand. Markers are short known words (subsequences) with a
length on the order of 10 base pairs that should occur in more than one sequence. In the physical map,
the sequences should be placed and oriented consistently with the marker information.
Figure 1.13: A map of human chromosome X. [source: NCBI Mapview].
1.13 Hidden Markov Models (HMMs)
HMMs are statistical models used particularly in pattern recognition. Because only their output (the
data) is visible, the underlying model is hidden and has to be estimated.
Figure 1.14: A simple Hidden Markov Model of the weather. [Source: K. Noto, University of Wisconsin-Madison].
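A weather HMM of this kind can be sketched in a few lines. The figure itself is not reproduced here, so all states, observations and probabilities below are hypothetical values chosen for illustration. The function computes the probability of an observation sequence with the forward algorithm, which sums over all hidden state paths.

```python
# Hypothetical model: the weather is hidden, only the behaviour is observed.
STATES = ("sunny", "rainy")
START = {"sunny": 0.5, "rainy": 0.5}
TRANS = {"sunny": {"sunny": 0.8, "rainy": 0.2},
         "rainy": {"sunny": 0.4, "rainy": 0.6}}
EMIT = {"sunny": {"walk": 0.9, "umbrella": 0.1},
        "rainy": {"walk": 0.2, "umbrella": 0.8}}

def forward_probability(observations):
    """Probability of the observations, summed over all hidden state paths."""
    if not observations:
        return 1.0
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {s: EMIT[s][obs] * sum(alpha[r] * TRANS[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())

print(forward_probability(["walk", "umbrella", "umbrella"]))
```

Here the parameters are simply given. Estimating them from observed data, as the text describes, is a separate problem.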