* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download CSE527 Project Report
Survey
Document related concepts
Transcript
Finding regulatory modules using a linear HMM and the Viterbi algorithm December 17, 2003 CSE527 Computational Biology Project Report Raphael Hoffmann Introduction To fully understand the function of genes in higher eukaryotes, one has to know the complex regulatory mechanisms that control gene expression. It is well-known that finding transcription factor binding sites can be a key to “crack” these mechanisms. Therefore, many techniques and algorithms that tackle this task have been developed and are available. However, their capabilities are limited: One reason is that short transcription factor motifs tend to occur frequently outside promoter regions, resulting in a large number of false positives. Another reason is that usually multiple transcription factors act in concert. Their motifs have to be aligned within a certain distance and often ordering. These cis-regulatory modules (CRM’s) are typically a few hundred base pairs long. Finding regulatory modules A common approach to finding regulatory module, is to consider statistical significance of transcription factor binding motif clusters. MCAST, which is part of MetaMEME 1, uses exactly this idea. As input, it requires a set of motifs, each with its position-specific probability matrix (PSPM) and its occurences in a DNA sequence. This information can be obtained by running MEME2, a motif discovery tool, on a DNA sequence. MCAST then builds a Hidden Markov Model (HMM) according to this data. The HMM represents a statistical model for a module, and contains an intra-module spacer state, an inter-module spacer state and each motif as a state. The Viterbi algorithm then uses the HMM to find statistically significant sites. This general approach is not new, however, the authors have improved it in a number of ways: The HMM can be created using a linear topology, a star topology or a complete topology. The latter being more flexible, but requiring many parameters. Another main improvement solves a problem that all HMM’s have: Since transitions are Markov, gap lengths are usually distributed geometrically which is not true for real data. The authors created a sophisticated scoring function that among other things allows arbitrary distributions for spacer lengths. Additionally, they tried to learn the transition probabilities of the HMM using the EM algorithm. However, they mentioned that this is not yet feasible due to the limited amount of reliable data that is available. 1 MetaMEME - Motif-based hidden Markov modeling of biological sequences is available from http://metameme.sdsc.edu/ 2 Multiple EM for Motif Elicitation (MEME) can be downloaded from http://meme.sdsc.edu 1 Architecture My application is object-oriented and based on JAVA. It is a simplified version of the approach that has been described. It only considers linear HMMs and uses the standard logodds scoring function. The following figure shows an overview. MEME Motifs PSPMs occurences DNA sequences Viterbi Linear HMM modules Figure 1. Overview of the architecture MEME delivers a set of motifs with PSPMs and their occurences. Based on that information, a Linear HMM is built and the Viterbi algorithm applied. It delivers the most probable path through the model and its probability. Testing on Simulated Data Since results on real data depend on a large number of factors, it is recommendable to first test the performance of the algorithm on simulated data. I developed a small program that uses a given HMM and generates a random sequence according to background noise and the HMM. The actual states that served for generating the sequence are being stored in a log file for later comparison with the results of the module finding algorithm. Of course, the probability of detecting the module correctly in the simulated data depends on the PSPM’s of the motifs, the background distribution and especially the length of the motifs and the number of motifs. In the following example I used two very short Transcription Factor binding motifs and connected them sequentially through a Linear HMM. The used motifs are explained later, when the same HMM will be applied to real data. The entire sequence is 16,000 bp long. I used a sliding window and calculated the log-odds score in each position. The results are shown in the following figure 10 5 0 -5 -10 Figure 2. The above diagram shows the log-odds scores. Below, blue indicates that the data was generated from the background model. Red indicates, it was generated by the HMM. The motif was found, i.e. it had the highest score. 2 Real Data Testing the developed algorithm on real data turned out to be much harder than on simulated data. The reasons are various. There is only very limited well known data about regulatory modules available. Most research work focuses on one of the following two datasets: There are about 20 Drosophila developmental genes of which some CRM sequences are known. Those sequences can be obtained from [4]. Another dataset is from human, however I focused on the Drosophila dataset. Some research in this field is obviously inspired by a simple and successful experiment [3]. About 1-Mb of DNA around the even-skipped gene was scanned using a 700-bp window. Sites that contained at least 13 predicted Transcription Factor binding sites (Bcd, Cad, Hb, Kr and Kni) correlated with known CRM sites unexpectedly well. In this experiment the relative ordering and distances between the binding sites was entirely ignored. I first tried to extract the binding site motifs and their relative ordering from the sequences that contained known CRMs (available at [4]) using MEME. However, MEME did not precisely detect the known motifs (Bcd, Cad, Hb, Kr and Kni). I used “meme –dna –revcomp –mod tcm –minw 8 –maxw 11 –nmotifs 5”. However, even helping MEME by adding sample motif sequences, restricting the CRM sites and tweaking the parameters did not lead to success. Probably some motifs are not so clear from a statistical perspective. I then used PSPM’s for the five motifs which were available from [5], and explicitly aligned the motifs and the known CRM’s using MAST3. The alignments are shown in the following figure: Knirpscis -5 -4 +1 + +1 -3 -3 +1 -1 3 - 3 + -2 4 - 3 + 1 - +2 +2 +4 2 Eve2 + 4 + 4 - 2 - 2 + 4 + 4 + 1 TaillesPD + 2 - 2 + 1 - 1 + 1 + 1 - 1 - 5 - 5 - 4 - 5 - 1 - 1 - 4 Runt 3+7 + 1 + 1 + 1 + 1 + 1 + 1 + 4 + 4 + 1 - 5 - 1 + 5 - 1 - 1 - 2 - 2 Hairy7 + 1 - 2 - 2 - 2 + 1 - 1 + 1 - 1 - 4 + 1 - 4 - 3 Giant PE - 1 - 5 - 1 - 1 + 4 - 1 - 1 - 2 Abdominal + 5 + 1 - 5 + 1 + 5 + 1 - 5 - 5 - 5 Eve3+7 - 4 + 1 + 1 + 1 + 2 + 1 + 2 Hairy6+2 + 4 - 3 - 1 + 1 Kruppel CD1 + 1 + 1 - 4 + 1 - 1 - 1 Eve4+6 + 1 + 3 + 4 + 1 + 1 Hairy5+1 - 4 - 3 - 2 + 1 Eve1 + 3 + 2 + 3 + 3 - 3 Buttonhead - 2 + 2 Spalt b | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1 2 5 5 0 7 5 1 0 0 1 2 5 1 5 0 1 7 5 2 0 0 2 2 5 2 5 0 2 7 5 3 0 0 3 2 5 3 5 0 3 7 5 4 0 0 4 2 5 4 5 0 4 7 5 5 0 0 5 2 5 5 5 0 5 7 5 6 0 0 6 2 5 6 5 0 6 7 5 7 0 0 7 2 5 7 5 0 7 7 5 8 0 0 8 2 5 8 5 0 8 7 5 9 0 0 9 2 5 9 5 0 9 7 5 1 0 0 0 1 0 2 5 1 0 5 0 1 0 7 5 1 1 Figure 3. Known regulatory modules were aligned with a set of transcription factor motifs. These are Hb (cyan), Kr (blue), Bcd (red), Cad (purple), Kni (yellow). The output was generated using MAST. There is hardly any pattern in the alignment of the different motifs. Therefore, it makes no sense to apply the presented algorithm with a large HMM to find regulatory modules. However, there is obviously a large amount of Hb-binding sites, and they often appear in pairs, e.g. in the first CRM. An obvious question is: Can this fact be used to identify the regulatory modules? I developed a simple HMM (following figure) and ran the algorithm on larger sequences. 3 Motif Alignment and Search Tool (MAST) can be downloaded from http://meme.sdsc.edu 3 0 0 1 1 2 5 1 1 5 0 Spacer Hb PWM A C G T 0.3953 0.2791 0.3256 0.0000 0.0000 0.0000 0.0930 0.9070 0.0000 0.0233 0.0000 0.9767 0.0000 0.0233 0.0000 0.9767 0.0000 0.0000 0.0000 1.0000 0.0233 0.0000 0.0000 0.9767 0.2558 0.0698 0.1163 0.5581 0.1628 0.1860 0.1860 0.4651 0.0698 0.1395 0.5814 0.2093 0.2558 0.2558 0.1628 0.3256 0.1163 0.2093 0.2326 0.2326 0.1860 0.4186 0.1481 0.3023 0.2000 0.8000 0.8000 Spacer 0.2000 Hb PWM A C G T 0.3953 0.2791 0.3256 0.0000 0.0000 0.0000 0.0930 0.9070 0.0000 0.0233 0.0000 0.9767 0.0000 0.0233 0.0000 0.9767 0.0000 0.0000 0.0000 1.0000 0.0233 0.0000 0.0000 0.9767 0.2558 0.0698 0.1163 0.5581 0.1628 0.1860 0.1860 0.4651 0.0698 0.1395 0.5814 0.2093 0.2558 0.2558 0.1628 0.3256 0.1163 0.2093 0.2326 0.2326 0.1860 0.4186 0.1481 0.3023 Spacer Figure 4. The linear HMM used. It contains the Hb-motif twice. I used a sliding window and calculated the log-odds score in each position around the Hairy gene. The results are shown in the following figure. 10 5 0 -5 -10 Figure 5. Log-odds scores. Below, blue indicates that these regions are believed to be noncoding or that their function was not yet discovered. Red regions represent known regulatory modules. The green regions are exons. The results are not very clear. However, the scores in the exons (green) tend to be lower than average, while the scores (at least in the in the first 3 CRM’s) are above average. There seems to be a region around 10,000 bp that has rather high scores, but has not been previously identified as a CRM. Discussion Interesting observations can be made by comparing the results of the same HMM on simulated data and on real data. On average, the log-odds scores on the real data were significantly higher than those on the simulated data, even in noncoding regions. However, in general it is not possible to reliably detect CRM’s using the simple HMM described above. Reasons: 4 I only regarded motifs on one strand. However, the MAST experiments already showed that some patterns appear on the opposite strand. To include both strands, the HMM topology has to be extended. Also visible in the MAST results is that there is hardly any significant sequence of motifs that appears in multiple CRM’s. Therefore, the Linear HMM approach fails. A Complete HMM or Star HMM could improve results. Since I used a classical HMM, the length of spaces between motifs is distributed geometrically. This model is definetly wrong, since short gaps will always be preferred over longer ones. A solution would be to associate individual arbitrary distributions to gap lengths. The biological processes are probably not fully understood and therefore no model is absolutely accurate. The large number of available papers that deal with finding the modules on the same small dataset (described above) shows that the task is not trivial. I simply used the log-odds score as a scoring function. [1] and others have defined sophisticated scoring functions that span hundreds of lines of code and that include many tweakable parameters. Future Work Timothy Bailey and William Noble [1] proposed to use the EM-algorithm to learn the parameters of the Hidden Markov Model, i.e. the transition probabilities. However, since few transcription factor binding sites are exactly known, the learning approach is infeasible at the time. As more and more data becomes available, the learning performance could certainly improve. I am convinced that learning the transition probabilities could become crucial, because the parameter values (e.g. transition probabilities) are now based on MEME output. This however could be often unreliable. Furthermore, MEME itself uses thresholds for the alignment and doesn’t even output occurences with weaker agreements. This data could still be valuable for defining the right HMM. Applying a training method could solve these problems. References 1. Bailey, T. L., Noble, W. S. (2003) Searching for statistically significant modules. Bioinformatics, vol 19, ii16-ii25. 2. Bailey, T. L. (2003) Details of MCAST Statistics. Technical Report IMB-TR0001. Institute for Molecular Bioscience, University of Queensland. 3. Berman, B. P., Nibu, Y., Pfeiffer, B. D., Tomancak, P., Celniker, S., Levine, M., Rubin, G. M., Eisen, M. B. (2002) Exploiting transcription factor binding site clustering to identify cisregulatory modules involved in pattern formation in the Drosophila genome. PNAS vol. 99 no. 2, 757-762. 4. Lifanov A., Makeev V., Nazina A. and Papatsenko D.(2003) Interactive collection of cisregulatory modules from Drosophila. http://homepages.nyu.edu/~dap5/PCL/appendix2.htm 5. Rajewsky, N., Vergassola, M., Gaul, U., Siggia, E. D. (2002) Computational Detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics. http://www.biomedcentral.com/content/pdf/1471-2105-3-30.pdf 5