* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download presentation source
Oncogenomics wikipedia , lookup
Genetic engineering wikipedia , lookup
Gene desert wikipedia , lookup
Public health genomics wikipedia , lookup
Non-coding DNA wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Genomic imprinting wikipedia , lookup
Polycomb Group Proteins and Cancer wikipedia , lookup
Transposable element wikipedia , lookup
Genomic library wikipedia , lookup
Ridge (biology) wikipedia , lookup
Biology and consumer behaviour wikipedia , lookup
Point mutation wikipedia , lookup
Human genome wikipedia , lookup
Genome (book) wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
History of genetic engineering wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Designer baby wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Pathogenomics wikipedia , lookup
Gene expression programming wikipedia , lookup
Minimal genome wikipedia , lookup
Metagenomics wikipedia , lookup
Gene expression profiling wikipedia , lookup
Genome editing wikipedia , lookup
Microevolution wikipedia , lookup
Sequence alignment wikipedia , lookup
Multiple sequence alignment wikipedia , lookup
Smith–Waterman algorithm wikipedia , lookup
Computational phylogenetics wikipedia , lookup
Helitron (biology) wikipedia , lookup
Complexity Issues in Bioinformatics Winfried Just Department of Mathematics and Quantitative Biology Institute Ohio University Biology’s dilemma: There is too much to know about living things Roughly 1.5 million species of organisms have been described and given scientific names to date. Some biologists estimate that the total number of all living species may be several times higher. It is impossible to learn everything about all these organisms. Biologists solve the dilemma by focusing on some species, so-called model organisms, and trying to find out as much as they can about these model organisms. Some important model organisms Mammals: Homo sapiens, Chimpanzee, mouse, rat Fish: Zebrafish, Pufferfish Insects: Fruitfly (Drosophila melanogaster) Roundworms: Ceanorhabditis elegans Protista: Malaria parasite (Plasmodium falciparum) Fungi: Yeast (Saccharomyces cerevisiae, S. pombe) Plants: Thale cress (Arabidopsis thaliana), corn, rice Bacteria: Escherichia coli, salmonella Archea: Methanococcus janaschii Let’s find out everything about some species What would it mean to learn everything about a given species? All available evidence indicates that the complete blueprint for making an organism is encoded in the organism’s genome. Chemically, the genome consists of one or several DNA molecules. These are long strings composed of pairs of nucleotides. There are only four different nucleotides, denoted by A, C, G, T. The information about how to make the organism is encoded by the order in which the nucleotides appear. Some genome sizes HIV2 virus Mycoplasma genitalis Haemophilus influenzae Saccharomyces cerevisiae Caenorhabditis elegans Drosophila melanogaster Homo sapiens Some amphibians Amoeba dubia 9671 bp 5.8 · 105 bp 1.83 · 106 bp 1.21 · 107 bp 108 bp 1.65 · 108 bp 3.14 · 109 bp 8 · 1010 bp 6.7 · 1011 bp Sequencing Genomes Contemporary technology makes it possible to completely sequence entire genomes, that is, determine the sequence of A’s, C’s, G’s, and T’s in the organism’s genome. The first virus was sequenced in the 1980’s, the first bacterium (Haemophilus influenzae) in 1995, the first multicellular organism (Caenorhabditis elegans) in 1998. The rough draft of the human genome was announced in June 2000. Where to store all these data? Some of the sequence data are stored in proprietary data bases, but most of them are stored in the public data base Genbank and can be accessed via the World Wide Web. In fact, most relevant journals require proof of submission to Genbank before an article discussing sequence data will be published. A notable exception was the publication of Celera’s announcement in Science. What’s in the databases? As of February 20, 2000, Genbank contained 5,861,088,510 bp of information. There were about 600 completely sequenced viruses, 19 completely sequenced bacteria, 6 completely sequenced archaea, and 3 complete genomes of eukaryotes: S. cerevisiaea (baker’s yeast), C. elegans (a roundworm), and Drosophila melanogaster (fruitfly). What’s in the databases? As of November 23, 2000, Genbank contained 10,853,673,034 bp of information. There were about 600 completely sequenced viruses, 29 completely sequenced bacteria, 8 completely sequenced archaea, and 3 complete genomes of eukariotes: S. cerevisiaea (baker’s yeast), C. elegans (a roundworm), and Drosophila melanogaster (fruitfly). The genome of Arabidopsis (thale cress) was near completion, and a first draft of the human genome had been completed. What’s in the databases? As of March 18, 2002, Genbank contained 20,197,497,568 bp of information. There were about 700 completely sequenced viruses, 63 completely sequenced bacteria, 13 completely sequenced archaea, and 5 complete genomes of eukaryotes: S. cerevisiaea, S. pombe (two yeasts), C. elegans (a roundworm), Drosophila melanogaster (fruitfly) and Arabidopsis thaliana (thale cress), as well as a draft of the human genome. Time-complexity of algorithms Suppose you have an algorithm that takes discrete objects (such as DNA sequences) as inputs. We say that the worst-case time complexity of this algorithm is O(f(n)) if there is a constant M such that on each input of size n the algorithm produces an output after running for at most M ·f(n) steps. We say that the average-case time complexity of this algorithm is O(f(n)) if there is a constant M such that the expected number of steps on inputs of size n is bounded by M ·f(n). Consequences of database growth For many algorithms in bioinformatics, the whole data set in Genbank is the input. Let us make the conservative estimate that the size of this input doubles every nine months. Moore’s Law says that computer speed doubles every 18 months. Thus if you have an algorithm with running time O(n2) that takes Genbank as input and that takes one hour to run today on the fastest available computer, it will take eight hours to run on the fastest available computer 18 months from now. How to overcome the problem? Designing faster and faster algorithms will remain a constant challenge in bioinformatics. If the worst-case running time of an algorithm is too long, its average running time may still make it suitable for your purposes. If your problem is an optimization problem, one can work with approximation algorithms that are not guaranteed to find the best solution, but a pretty good one. One can use heuristic algorithms that have no performance guarantee whatsoever, but in most cases appear to find pretty good solutions. You have sequenced your genome - what do you do with it? This is known as genome analysis or sequence analysis. At present, most of bioinformatics is concerned with sequence analysis. Here are some of the questions studied in sequence analysis: gene finding protein 3D structure prediction gene function prediction prediction of important sites in proteins reconstruction of phylogenetic trees From genes to proteins Proteins are the workhorses of biochemistry. Practically all chemical reactions in a cell are catalyzed by proteins and many proteins have diverse other functions. From the chemical point of view, proteins are long chains of chemicals called amino acids. There are 20 amino acids that are the building blocks for most proteins in most organisms. Amino acids are coded by triplets of nucleotides, which are also called codons. A stretch of a genome that codes for a given protein is called a gene. What did the gene finding algorithms find? Mycoplasma genitalis (bacterium) 500 Genes Escherichia coli (bacterium) 4,500 Genes Saccharomyces cerevisiae (yeast) 6,000 Genes Caenorhabditis elegans (worm) 19,000 Genes Drosophila Melanogaster (fruitfly) 13,500 Genes Arabidopsis thaliana (thale cress) 25,500 Genes Homo sapiens (Human) 24,000-40,000 Genes Oryza sativa japonica (rice) 32,000-50,000 Genes Oryza sativa indica (rice) 45,000-56,000 Genes So we know the genes - do we know everything? Far from it. The next two questions are: Given a single gene, how does it function in the biology of an organism? How do various genes interact? Prediction of gene function Suppose you have identified a gene. What is its role in the biochemistry of its organism? Sequence databases can help us in formulating reasonable hypotheses. Search the database for genes with similar nucleotide sequences in other organisms. If the functions of the most similar genes are known and if they tend to be the same function (e.g., “codes enzyme involved in glucose metabolism”), then it is reasonable to conjecture that your gene also codes an enzyme involved in glucose metabolism. How to detect similarity of genes? DNA and protein molecules evolve mostly by three processes: point mutations (exchange of a single letter for another), insertions, and deletions. If two genes have evolved from a common ancestral gene, then it should be possible to detect the similarity by inserting gaps into the two sequences that correspond to insertions and deletions so that the nucleotides derived from common ancestral nucleotides end up on top of each other. This process is known as sequence alignment. The columns of the alignment should show significantly more nucleotide identities than would be expected by chance. Example of an alignment ACT- GGTTCAGTC CG | | | | | | | | | AGTAGGTA - -GACCG How to score an alignment Alignments are scored column by column. There is usually a reward for character matches, a small penalty for character mismatches, and a large penalty for gaps. These penalties should be such that the highest scoring alignment is the most likely one to reflect the true evolutionary relationship of the loci. Two sequences can be considered “similar” if their best alignment has a sufficiently high score. The Smith-Waterman algorithm finds the best scoring local alignment of two sequences of length n and m respectively in running time O(n ·m). Prediction of gene function: homology searches Given a nucleotide or DNA sequence, searching the data base(s) for similar sequences is known as “homology searches”. This is essentially the task of finding all (local) alignments between the data base of length n and a query sequence of length m. The most popular software tool for performing these searches is called BLAST; therefore biologists often speak of “BLAST searches”. It is a heuristic algorithm with average running time O(n), independently of the length of the query sequence. Prediction of important sites in proteins Not all parts of a protein are equally important; the function of most of its amino acids is often just to maintain an appropriate 3D structure, and mutations of those less crucial amino acids often don't have much effect. However, most proteins have crucial parts such as binding sites. Any mutations occurring at binding sites tend to be lethal and will be weeded out by evolution. How to predict binding sites from sequence data: Get a collection of proteins of similar amino acid sequences and analogous biochemical function from your database. Align these sequences amino acid by amino acid. Check which regions of the protein are highly conserved in the course of evolution. The binding site should be in one of the highly conserved regions. Example of a multiple alignment A T G | A C G | - C G | - T G - - T T C G | A A T C C A | A A T C C T | A G C A C T G A C | G - C | A A C | A A C T T C C How to score multiple sequence alignments? The most popular way of scoring multiple sequence alignments, the so-called SP-score (sum of pairs score), works by computing the scores of all the induced pairwise alignments and summing them up. The pairwise alignments are being scored by a fixed scoring scheme that rewards or penalizes matches, mismatches, and gaps. Computational complexity of multiple sequence alignment The best scoring alignment should be the most likely hypothesis about the evolutionarily correct alignment. Finding the best scoring alignment for a given scoring scheme is called the multiple sequence alignment problem. The size of an instance is given by the number k of sequences to be aligned and the length n of the longest sequence. For any fixed k the problem can be solved in time O(nk) by a variant of the Smith-Waterman algorithm. Can we do any better? The class P A decision problem is in the class P if there is an algorithm that always gives the correct answer and has worst-case running time of O(na) for a fixed a, where n is the size of the input. An optimization problem is in the class P if the corresponding decision problem of determining for every input and given number c whether there is a solution with score at least c is in the class P. The class NP A decision problem of the form “there exists an object J such that for the given input I property(I,J) holds” is in the class NP if there is an algorithm that for any given input I and given object J determines whether property(I,J) holds and has worst-case running time of O(na) for a fixed a, where n is the size of the input. Note that the decision problem corresponding to the multiple sequence alignment problem is in the class NP. The P = NP problem It is not known whether the classes P and NP are identical. This is probably the biggest open problem in mathematics. There are literally hundreds of so-called NP-hard problems. These are problems in the class NP of which it is know that if any one of these problems is in the class P, then the classes P and NP coincide. But nobody has ever found a polynomial-time algorithm for any of the NP-hard problems. This is strong empirical evidence for the conjecture that P is not equal NP, but no proof exists. Computational complexity of multiple sequence alignment With respect to the number of sequences k, the multiple sequence alignment problem is NP-hard. This was shown around 1993 by Wang and Jiang. Unfortunately, their proof uses a biologically totally irrelevant scoring scheme. It was not clear until recently whether the problem might become tractable for scoring schemes that are actually used by biologists, or if we limit the number of gaps that can be inserted. It was also unknown whether there might be always a PTAS (polynomial time approximation scheme). Polynomial-time approximation schemes An optimization problem is said to have a PTAS if for every ε > 0 there exists an algorithm A(ε) that finds a solution that is within a factor of 1 - ε of optimum and whose running time is O(na(ε)). NP-hard optimization problems may still have a PTAS if a(ε) increases to infinity as ε decreases to zero. Computational complexity of multiple sequence alignment Theorem 1: (Just) For every scoring scheme used by biologists, the multiple sequence alignment problem is NP-hard. This remains true even if the number and size of gaps that can be inserted into each sequence is restricted in the most severe way possible. Theorem 2: (Just) There exists a scoring scheme for which the multiple alignment problem has no PTAS, even if the number and size of gaps that can be inserted into each sequence is most severely restricted. A few words about the proof Theorems 1 and 2 were proved by reducing the MAX-CUT problem for graphs to the multiple sequence alignment problem. That is, given a simple graph G, a multiple sequence alignment problem is constructed in such a way that from a (nearly) optimal solution of the sequence alignment problem a cut in the graph G of (nearly) maximal size can be reconstructed in polynomial time. An open question A scoring scheme is metric if the mismatch penalties satisfy the triangle inequality. Biologically meaningful scoring schemes have to be metric in this sense. However, the scoring scheme used in the proof of Theorem 2 does not have this property. Problem: (Jiang, Kearney and Li, 1999) Is there a PTAS for the multiple sequence alignment problem for metric scoring schemes? An partial answer Theorem 3: (Just, Della Vedova) The restriction of the multiple sequence alignment problem to alignments that insert at most L spaces into each sequence has a PTAS that works for all metric scoring schemes. This theorem was proved by conceptualizing the multiple sequence alignment problem as a facility location problem. The practice of multiple sequence alignment The most frequently used multiple sequence alignment algorithm used in practice is CLUSTAL. This is a heuristic algorithm for which no performance guarantee is known. Using genomic data for reconstruction of phylogenies A phylogenetic tree depicts the branching pattern in the evolution of contemporary species from their common ancestor. Given the sequences of homologous genes (i.e., genes derived from a common ancestral gene), one can try to reconstruct the phylogenetic tree for these species by looking at the amount of evolutionary change that has occurred at the molecular level and estimating the times at which any two of these species diverged. Reconstructing phylogenies: How to get started Most reconstructions of phylogenetic trees from molecular sequences start from a multiple alignment of these sequences. The quality of the phylogenetic tree thus constructed crucially depends on the quality of the multiple sequence alignment. Methods for reconstruction of phylogenies Distance methods Maximum parsimony Maximum likelihood Bayesian Analysis Methods for reconstruction of phylogenies Maximum parsimony: Given a tree, count the number of character changes that must have occurred during evolution if the tree is correct. The most parsimonious tree is the one that postulates the fewest such changes among all possible trees. Maximum likelihood: Given a tree and a model of molecular evolution, compute the probability that the sequences actually observed in the extant species have evolved. The maximum likelihood tree is the one for which this probability is largest among all possible trees. The computational challenge for reconstruction of phylogenies Maximum parsimony and maximum likelihood require us to consider all possible trees. Given n species, there are (2n-3)!/[2n-2(n-2)!] distinct possible (rooted) trees for depicting the phylogenetic relationship between these species. For n = 10, this number is 34,459,425; for n = 15 it is already equal to 213,458,046,676,875, for n = 21 we get half a mole of trees. Again, we are confronted with a problem whose most obvious solution requires exponential running time. Reconstruction of phylogenies: A success story There are two basic kinds of free-living organisms: prokaryotes, that do not have a cell nucleus, and eukaryotes, which do. Prokaryotes fall into two major groups: Eubacteria and Archaea. Phenotypically, eubacteria and archaea are very similar to each other. However, it has been demonstrated by using molecular data that archaea are more closely related to eukaryotes than to eubacteria, and thus it appears that the evolutionary branching between archaea and eubacteria occurred before the branching of archaea and eukaryotes. Gene interactions: Collecting gene expression data All cells of a multicellular organism have the same set of genes. What accounts for the differences in various cell types and function is which of the genes are being expressed (switched on) at a given time in a given cell. A relatively new technology, called gene chips or microarrays makes it possible to monitor, for tens of thousands of genes simultaneously, the differences in gene expression levels between two different experimental conditions. Gene interactions: Interpeting gene expression data Once gene expression data have been collected, it is possible to identify clusters of genes that have similar expression profiles, that is, are up- or downregulated under the same experimental conditions. One then conjectures that genes with similar expression profiles have similar functions, for example, are involved in the same biochemical pathways. Such conjectures can serve as powerful guides for setting up experiments to confirm the biochemical role of groups of genes. Interpeting gene expression data: A mathematical challenge Gene expression data sets are peculiar in the sense that we typically have very few experiments (5-10 perhaps) and a large number (tens of thousands) monitored genes. It seems inevitable that some genes will show similar expression profiles just by random accident. Question: How can we tell spurious clusters of genes from biologically meaningful ones? Gene expression profiles: A success story Cancer patients with the same clinical picture often respond to very different types of treatment. Gene expression profiles of groups of cancer patients have revealed that what looks to the clinician as the same disease can sometimes be one of several diseases at the biochemical level. The latter can be distinguished by characteristic expression profiles of certain groups of genes. Once the biochemical nature of the disease has been established, treatment can be tailored to the type of disease a patient actually has.