Alternative Splicing: Functionality, Evolution and Selection

Motivation and Background

Alternative splicing (AS) was discovered in 1978, and for the subsequent 10-15 years was viewed mostly as a curiosity: an interesting way to generate several proteins from one gene (Ast, 2004). With the advent of large scale genome sequencing and EST determination, it has become clear that a very large percentage of genes are alternatively spliced.

A key goal of bioinformatics is to predict as much as possible of the behaviour of a biological system by computational means using available knowledge. Presently, we can only predict coarse features of a gene from the sequence alone, although progress in this field is steady. AS increases the variety of proteins encoded by the genome. For the researcher it creates the challenge of determining which of this variety is functional, which is tolerated noise, and which represents directly detrimental or advantageous novelty. This is analogous to previous debates on the selective value of observed sequence variation within a species and of molecular differences between species. We propose to address this problem by a comparative analysis of AS from different species.

The standard method of extracting a compact representation of AS data is to construct the associated alternative splicing graph (ASG), where the empirically observed transcripts correspond to different paths through the graph (Leipzig et al., 2004). The ASG only contains information about putative transcripts, but not about their probability. While ASGs may serve as a starting point, more comprehensive models which include the stochastic nature of AS are needed. Comparing AS is also an area where theoretical work is lagging behind experimental techniques. In particular, missing information needs to be incorporated.
There is no guarantee that, for instance, the mouse fibrinogen ASG fully represents the AS of that gene: not all transcripts may have been observed, and there can be tissue variation. Ideally this should be part of an evolutionary model used to compare AS models, but it could also be ignored in initial investigations or for limited curated data sets of high quality.

Drosophila melanogaster and related species are among the most studied organisms of the last century. The knowledge accumulated about their genetics, ecology, neuroscience, molecular biology and embryology cannot be rivalled by any other metazoan. A series of large scale projects have been launched that will ensure Drosophila's role as a central model species from a genomics perspective as well. These projects will generate genome as well as expression and transcript data. This provides a unique opportunity to study several open issues in modern biology, and we propose to focus on alternative splicing.

Biological Issues to be Addressed

This proposal aims to investigate AS and its variation within an individual as well as between species. We will do this by modelling AS on several levels, starting from AS in one particular cell and progressing over different tissue types to AS evolution and selection. In broad terms, the proposed research can be characterised and structured as follows.

1. Quantification and Characterisation of Alternative Splicing in Drosophila
   a) Inference of AS structure
   b) Parametrisation of AS models
      i. Splicing dependencies
      ii. Tissue specific and tissue dependent models
2. Evolution of Alternative Splicing
   a) Models of AS evolution
      i. Sequence oblivious evolutionary models
      ii. Sequence dependent evolutionary models
   b) Comparative AS inference
3. Selection and Neutrality of Alternative Splicing

Initially the project will focus on characterising and modelling AS within a single organism.
This work will be based on data becoming available for several species of Drosophila, and will in itself be valuable in characterising AS in these species. The main aim of this part is to obtain a better understanding of correlations between splicing events and to determine adequate models for describing these correlations. This includes extending models from considering AS in one specific cell type to capturing AS across multiple tissue types. We would expect good models to be parametrisable such that only a few parameters change between different tissue types.

Once this necessary groundwork for understanding and describing AS has been done, it becomes feasible to formulate evolutionary models describing evolution of AS in terms of changes to AS descriptions. A key derivative of this work will be the ability to assess and quantify how selection on AS differs from neutral evolution. This will be essential for harnessing the power of comparative approaches for AS inference, a methodology that has already proven invaluable for, e.g., finding non-coding RNA and determining regulatory signals.

Research Plan

Modelling alternative splicing

The figure below illustrates the relationship between transcripts and the ASG. The solid straight line represents a genic region of DNA. The dashed lines represent “jumps” over regions that would not be part of a messenger RNA transcript if that jump was selected; that is, the left endpoints of the dashed lines correspond to donor sites, and the right endpoints correspond to acceptor sites. The transcripts a given ASG can generate are obtained by traversing it from left to right along every possible route through the graph. Assuming that all intronic regions can be retained, the graph below can generate 18 different transcripts. In reality, we observe the transcripts (or partial knowledge thereof, such as splice-array experiments) and have to infer the ASG.
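As a minimal sketch of how an ASG acts as a transcript generator, the following enumerates every left-to-right route through a small graph. The graph `TOY_ASG`, its segment names and the function `enumerate_transcripts` are hypothetical and chosen only for illustration; this is not the graph from the figure and does not reproduce its 18 transcripts.

```python
from typing import Dict, List

# Hypothetical toy splicing graph: nodes are exonic segments, edges are the
# allowed continuations (plain adjacency or a splice "jump").
TOY_ASG: Dict[str, List[str]] = {
    "start": ["E1"],
    "E1": ["E2", "E3"],    # E2 is a cassette exon that may be skipped
    "E2": ["E3"],
    "E3": ["E4a", "E4b"],  # two alternative acceptor sites
    "E4a": ["end"],
    "E4b": ["end"],
    "end": [],
}


def enumerate_transcripts(graph: Dict[str, List[str]],
                          node: str = "start",
                          sink: str = "end") -> List[List[str]]:
    """Return every path from `node` to `sink` as a list of segment names."""
    if node == sink:
        return [[]]
    paths = []
    for nxt in graph[node]:
        for tail in enumerate_transcripts(graph, nxt, sink):
            head = [] if nxt == sink else [nxt]
            paths.append(head + tail)
    return paths


transcripts = enumerate_transcripts(TOY_ASG)
for t in transcripts:
    print("-".join(t))
```

In this toy graph the two binary choices are independent, so the number of transcripts is the product of the number of options at each choice point. Inferring the graph from an observed subset of such paths is the inverse problem discussed in the text.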
Typically, the ASG is chosen as the smallest ASG (minASG) that can explain all the observed transcripts, and the true ASG can easily be more complicated than the minASG. Conversely, some transcripts allowed by the minASG may not be observable, e.g. if two splicing events are mutually exclusive. Simulation studies and analytical investigations using ASGs as transcript generators could help to estimate how much AS could be missed. If the transcripts are generated with the “wrong” kind of ASG, how will this influence the recovered ASG? For example, the ASG will assume independence among different splicing events, but that could be wrong. How many transcripts would be needed to reveal this? In general, we want to assess to what extent the transcripts generated depend on the precise model employed.

The situation is further complicated by different transcripts having different probabilities, and the above graph should be parametrised so that each transcript has a well defined probability. The parametrised ASG should describe AS for the gene, but no consensus presently exists on the details of such a description. Ideally, it should generate transcripts according to their frequency by assigning probabilities to individual paths. There are numerous ways to do this. If an arbitrary level of complexity of dependencies is allowed, any probability distribution over transcripts can be modelled. The important question is to find a sufficient level of complexity that still allows us to make generalisations from observed data.

We have introduced models essentially enriching the ASG to a Markov chain description of AS, thus ignoring any long range dependencies. This is, at least in some cases, too simplistic. An extension to this approach will be to assume that the full splicing potential of a cell consists of a collection of simple splicing mechanisms, differing in conformation, associated regulatory elements etc.
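The Markov chain parametrisation mentioned above can be sketched as follows: each edge of the graph carries a transition probability, and the probability of a transcript is the product along its path. The graph, the numerical values in `TRANSITIONS` and the helper names are assumptions made for illustration only, not the models developed in the project.

```python
import math
from typing import Callable, Dict, Tuple

# Hypothetical transition probabilities for a toy splicing graph.
# The outgoing probabilities of every node sum to one.
TRANSITIONS: Dict[str, Dict[str, float]] = {
    "start": {"E1": 1.0},
    "E1": {"E2": 0.7, "E3": 0.3},    # include or skip cassette exon E2
    "E2": {"E3": 1.0},
    "E3": {"E4a": 0.9, "E4b": 0.1},  # choose between two acceptor sites
    "E4a": {"end": 1.0},
    "E4b": {"end": 1.0},
}

Transcript = Tuple[str, ...]


def transcript_distribution(node: str = "start", prob: float = 1.0,
                            prefix: Transcript = ()) -> Dict[Transcript, float]:
    """Map every start-to-end path to the product of its edge probabilities."""
    if node == "end":
        return {prefix: prob}
    dist: Dict[Transcript, float] = {}
    for nxt, p in TRANSITIONS[node].items():
        step = prefix if nxt == "end" else prefix + (nxt,)
        dist.update(transcript_distribution(nxt, prob * p, step))
    return dist


def prob_given(dist: Dict[Transcript, float], target: str,
               condition: Callable[[Transcript], bool]) -> float:
    """P(target segment is used | condition holds for the transcript)."""
    num = sum(p for t, p in dist.items() if condition(t) and target in t)
    den = sum(p for t, p in dist.items() if condition(t))
    return num / den


dist = transcript_distribution()
with_e2 = prob_given(dist, "E4a", lambda t: "E2" in t)
without_e2 = prob_given(dist, "E4a", lambda t: "E2" not in t)
print(dist)
print(with_e2, without_e2)
```

Under a single chain of this kind, the choice of acceptor site is independent of whether the upstream cassette exon was included, so the two conditional probabilities agree. That is the sense in which the description ignores long range dependencies.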
If each simple mechanism is described by a simple Markov chain, this will still allow the introduction of long range dependencies. Under this approach, tissue specific splicing will correspond to different mixtures of the constituent simple splicing mechanisms. The differences between tissue types are then captured by parameters describing the strength with which each simple mechanism contributes to the overall splicing. A simple approach to describing such a mixture would be to have a distribution over simple models. This corresponds to initially choosing a single model, whose Markov chain then exclusively describes the splicing of the transcript. More realistic approaches will allow for jumps between the simple constituent models. We will also investigate the description of splicing as a temporal process, repeatedly excising introns from the current state of the transcript. Coupled with the model mixture approach, this can be viewed as a step towards modelling the reality of spliceosomes attaching to and splicing the transcript.

Even though the ASG is a more complicated structure than the original set of transcripts, it is still sequential in nature with a clear ordering on exons. This means most computational methods for sequence analysis can be extended to ASG analysis. In its simplest form, a comparative analysis of AS can be approached as the problem of computing the distance between ASGs for homologous genes given a suitable score matrix. Extending this to allow inferences in statistical models of the evolution of AS will have several benefits. First, it will allow testing models with varying degrees of detail in the description of the ASG, as well as of the evolution of the ASG, to determine the necessary and sufficient features required in these descriptions. Secondly, it provides a framework for interpreting the comparison of two or more ASGs. Finally, it puts the comparative analysis on a stronger footing, where parameters are rigorously estimated rather than determined in an ad hoc manner.
Moreover, posterior decoding can be used to assess the confidence with which parameters and annotations have been inferred.

Gene and Associated Alternative Splicing Evolution

Modelling how a gene and its alternative splicing evolve is a complicated affair. Modelling this over a short time period requires very good knowledge of the regulation of alternative splicing, so that the consequences of substitutions and selective pressures can be incorporated. Over longer time periods the underlying mechanism of alternative splicing could also evolve, which would have to be incorporated as well. For example, mutations in RNAs involved in the spliceosome could slightly change the recognition of splice signals. This latter phenomenon is a serious complication that can be dealt with either by time inhomogeneous models of molecular evolution or by more general models that also describe the evolution of the AS mechanism. The latter would require a high level of knowledge of the molecular mechanism of AS. In practice it will probably be necessary, at least initially, to ignore changes in the AS mechanism and focus on modelling AS evolution for a constant mechanism.

Modelling AS evolution with a constant mechanism can be done in a variety of ways. The short time modelling problem is still complicated, but techniques exist that are suited for this purpose. The standard (easy) way would be to model the evolution of the probabilities parametrising the ASG with, for instance, Brownian motion. This could be useful, but would be a “zero functional knowledge” approach; pursuing this through modelling of the regulation will be harder but considerably more rewarding. Modelling AS and sequence evolution simultaneously has strong similarity to models used by Liu and colleagues (Jensen et al., 2005) for the combined analysis of expression levels and sequence change.
While the combined expression and sequence analysis models expression levels directly as a function of the presence of given regulatory signals, the combined AS and sequence analysis would model observed transcripts in a two step procedure: the presence of splicing signals defines a distribution over possible ASGs, and an ASG in turn defines probabilities of observed transcripts. The position and content of signals can be found by well established MCMC algorithms, especially the Gibbs sampler (Lawrence et al., 1993; Liu, 2001). Moreover, a long series of signals and molecular mechanisms are known, allowing some description of key signals such as Exonic Splicing Enhancers (ESE), Intronic Splicing Enhancers (ISE), Exonic Splicing Silencers (ESS), Intronic Splicing Silencers (ISS), branch points, donor sites and acceptor sites. Together with standard HMM gene finding algorithms defining possible exons, such signals allow the formulation of an HMM that would define a graph containing the true ASG. Deeper understanding would shrink this graph toward the true ASG.

Selection and Neutrality

The last few years have seen many investigations concerning the degree of conservation of nucleotides as a function of alternative splice patterns (Sorek et al., 2004), the correlation of AS with protein function (Matlin et al., 2005) and the ratio of synonymous to non-synonymous substitution rates under different splicing scenarios. It is clear that there is a large component of functionality, as alternatively spliced exons are often accompanied by segments under extra selective constraints. These results have been intriguing and have underlined the importance of AS, but also the need for further data and analysis. Estimates of the fraction of AS evolving under purifying selection, neutrally, or under positive selection will be a major step towards understanding the contribution of AS to the complexity of an organism.
After the discovery of the low number of genes in humans and higher animals, many have looked to AS as the hidden source of complexity (Maniatis and Tasic, 2002), but this still remains to be proven. The biological literature is also full of functional explanations of AS variants, but much could well be tolerated noise. A large scale AS analysis for Drosophila could finally estimate these quantities.

Models for combining multiple constraints have been proposed for combinations of RNA and protein genes (Pedersen et al., 2004a,b) and can be readily transferred to incorporate the constraints from a regulatory signal. The idea is simple: one constraint, for example encoding a functioning protein, will accelerate or decelerate the rate of nucleotide (or nucleotide pair) substitution by a certain factor; a second constraint, for example encoding a regulatory element controlling splicing, will contribute a second factor to the final rate. Given such a combined model, it is possible to test for the presence of an additional selective constraint; for example, in the case where splicing is controlled by an RNA regulatory element, this constraint could be the presence of one or more conserved base pairings. The success of this would depend on the strength of selection and the amount of data. Short signals with only a few additional constraints can only be detected with significant confidence if large amounts of data are available.

The data required for such a test are a list of splicing signals, or regions containing a splicing signal, that are alternatively spliced and a list of those that are not. Initially we will compile a curated data set of known positive and negative examples of AS. Through our genome analyses we aim to provide a comprehensive computational annotation of all the Drosophila genomes with this information.
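The multiplicative combination of constraints described in this section, and the test for an additional constraint, can be illustrated with a deliberately simplified sketch. It assumes a Jukes-Cantor substitution model, a pairwise comparison of two sequences and independent sites, none of which is specified in the proposal; the cited models (Pedersen et al., 2004a,b) are considerably richer. All function names and numerical values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def combined_rate(base_rate: float, factors: Sequence[float]) -> float:
    """Each constraint scales the substitution rate by its own factor."""
    rate = base_rate
    for f in factors:
        rate *= f
    return rate


def jc69_p_diff(rate: float, time: float) -> float:
    """Jukes-Cantor probability that a site differs between two sequences
    separated by `time`, evolving at `rate` substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * rate * time / 3.0))


def _xlogy(x: float, y: float) -> float:
    return 0.0 if x == 0 else x * math.log(y)


def binom_loglik(k: int, n: int, p: float) -> float:
    """Log-likelihood (up to a constant) of k differing sites out of n."""
    return _xlogy(k, p) + _xlogy(n - k, 1.0 - p)


def lrt_extra_factor(k: int, n: int, base_rate: float,
                     known_factor: float, time: float) -> Tuple[float, float]:
    """Likelihood ratio test for one additional rate factor.

    Null: rate = base_rate * known_factor.
    Alternative: rate = base_rate * known_factor * extra, with extra free.
    Returns (test statistic, estimated extra factor).
    """
    null_rate = combined_rate(base_rate, [known_factor])
    p_null = jc69_p_diff(null_rate, time)
    p_alt = min(k / n, 0.7499)  # stay below the JC69 saturation level 0.75
    stat = 2.0 * (binom_loglik(k, n, p_alt) - binom_loglik(k, n, p_null))
    d_alt = -0.75 * math.log(1.0 - 4.0 * p_alt / 3.0)  # inverse of jc69_p_diff
    return stat, d_alt / (null_rate * time)


# Hypothetical data: 200 aligned sites in a putative splicing signal,
# 5 of which differ between two species.
stat, extra = lrt_extra_factor(k=5, n=200, base_rate=1.0,
                               known_factor=0.5, time=0.2)
print(stat, extra)
```

Under these assumptions the statistic can be compared with a chi-square distribution with one degree of freedom (critical value about 3.84 at the 5% level). With few sites the statistic has little power, which mirrors the remark above that short signals need large amounts of data to be detected.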
Software Development

An intrinsic part of the investigations proposed here will be software development, implementing methods and models proposed for the study of evolution of AS. This software will be made available to the research community as both freely available source code and web servers. The researcher we have identified for this post already has extensive experience in developing software implementing model based statistical inference. Our group also has a strong tradition of developing bioinformatics software, ranging from recombination analysis to phylogenetic RNA structure prediction. We see it as a further strength that this project brings together collaborators from different aspects of AS research. This will allow the software development to benefit from immediate feedback, ensuring its relevance for the intended users.

The Data

Species Data: A main resource for this project will be the 12 Drosophila genomes whose sequences were completed in February 2006. These genomes have been chosen carefully from a very large number of available species to span a range of evolutionary distances from sibling species to quite diverged species. They can be treated as raw data or as the genome alignment generated by MAVID (Bray and Pachter, 2004). The genomes have also been annotated using GeneMapper (Chatterji and Pachter, 2006). This is useful, although we will re-annotate with regard to features of relevance.

Transcript Data: In the period of relevance, a large set of transcripts generated by a long series of laboratories will also be available. On average this will yield at least 50–100 transcripts per gene for the 18,000 Drosophila melanogaster genes. This will additionally be supplemented with data for D. pseudoobscura, D. simulans, D. yakuba and possibly others. This is most likely a serious underestimate of the true amount of transcript data that will be available during 2007-9.
Reasons for the anticipated growth are the increased interest in Drosophila as a model organism, the rise of high throughput technologies and the potential of comparative approaches.

Population Data: The 12 genomes of different Drosophila species are additionally supplemented by at least 7 Mb of sequence from each of 50 genomes of D. melanogaster from one population (www.dpgp.org). It is the hope that eventually 50 complete genomes will be available. This would allow many classic molecular evolution versus population genetics issues to be addressed (Hein et al., 2005).

References

• Ast, G. (2004) How did alternative splicing evolve? Nat. Rev. Genet. 5(10): 773-82.
• Boue, S., Vingron, M., Kriventseva, E. and Koch, I. (2002) Theoretical analysis of alternative splice forms using computational methods. Bioinformatics 18 suppl. 2: S65-73.
• Bray, N. and Pachter, L. (2004) MAVID: constrained ancestral alignment of multiple sequences. Genome Res. 14(4): 693-9.
• Cawley, S.L. and Pachter, L. (2003) HMM sampling and applications to gene finding and alternative splicing. Bioinformatics 19 suppl. 2: ii36-41.
• Chatterji, S. and Pachter, L. (2006) Reference based annotation with GeneMapper. Genome Biology 7: R29.
• Hein, J. (1989) A new method that simultaneously aligns and reconstructs ancestral sequences for any number of homologous sequences, when the phylogeny is given. Mol. Biol. Evol. 6(6): 649-68.
• Hein, J., Schierup, M.H. and Wiuf, C. (2005) Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. Oxford University Press.
• Jenkins, P., Lyngsø, R.B. and Hein, J. (2006) How many transcripts does it take to reconstruct the alternative splicing graph. In press.
• Jensen, S., Shen, L. and Liu, J. (2005) Combining Phylogenetic Motif Discovery and Motif Clustering to Predict Co-Regulated Genes. Bioinformatics 21(20): 3832-9.
• Kan, Z., Rouchka, E.C., Gish, W.R. and States, D.J. (2001) Gene Structure Prediction and Alternative Splicing Analysis Using Genomically Aligned ESTs. Genome Res. 11(5): 889-900.
• Knudsen, B. and Hein, J. (1999) RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 15(6): 446-54.
• Lawrence, C., Altschul, S.F., Boguski, M.S., Liu, J., Neuwald, A.F. and Wootton, J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262(5131): 208-14.
• Leipzig, J., Pevzner, P. and Heber, S. (2004) The Alternative Splicing Gallery (ASG): bridging the gap between genome and transcriptome. Nucleic Acids Res. 32(13): 3977-83.
• Liu, J. (2001) Monte Carlo Strategies in Scientific Computing. Springer.
• Lunter, G., Drummond, A.J., Miklos, I. and Hein, J. (2005) Statistical Alignment: Recent Progress, New Applications, and Challenges. Chapter in Statistical Methods in Molecular Evolution, ed. Rasmus Nielsen. Springer.
• Maniatis, T. and Tasic, B. (2002) Alternative pre-mRNA splicing and proteome expansion in metazoans. Nature 418(6894): 236-43.
• Matlin, A.J., Clark, F. and Smith, C.W. (2005) Understanding Alternative Splicing: Towards a Cellular Code. Nat. Rev. Mol. Cell. Biol. 6(5): 386-98.
• Modrek, B. and Lee, C. (2002) A Genomic view of Alternative Splicing. Nature Genetics 30(1): 13-9.
• Pedersen, J.S. and Hein, J. (2003) Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics 19(2): 219-27.
• Pedersen, J.S., Meyer, I.M., Forsberg, R., Simmonds, P. and Hein, J. (2004a) A comparative method for finding and folding RNA secondary structures within protein-coding regions. Nucleic Acids Res. 32(16): 4925-36.
• Pedersen, J.S., Forsberg, R., Meyer, I.M. and Hein, J. (2004b) An Evolutionary Model for Protein-Coding Regions with Conserved RNA Structure. Mol. Biol. Evol. 21(10): 1913-22.
• Sorek, R., Shamir, R. and Ast, G. (2004) How prevalent is functional alternative splicing in the human genome? Trends Genet. 20(2): 68-71.
• Xing, Y., Yu, T., Wu, Y.N., Roy, M., Kim, J. and Lee, C. (2006) An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs. Nucleic Acids Res. 34(10): 3150-60.
• Xing, Y. and Lee, C. (2006) Alternative Splicing and RNA Selection pressure – evolutionary consequences for eukaryotic genomes. Nat. Rev. Genet. 7(7): 499-509.

Milestones and Deliverables

Year 1, Months 1-3: Literature review of bioinformatics and molecular biology of AS, and collecting relevant existing programs for AS prediction and data analysis.
Year 1, Months 4-6: Detailed planning of the overall structure of the methods and algorithms to be developed during the span of the project. Extending the Jenkins et al. (2006) methods to incomplete transcripts. Creating core curated data (genomes and associated transcripts) and analysing this data set with the extended methods.
Year 1, Months 7-12: Simple analysis programs are developed that can be continuously tested on existing databases. Implement simple methods for ASG comparison.
Year 2, Months 13-18: Scientific visit with Gil Ast in Tel Aviv, Israel. Increased focus on the molecular biology and cellular machinery of AS and existing knowledge of regulatory signals governing AS in the cell. Analysis of data with focus on characterising regulatory signals and their evolution.
Year 2, Months 19-24: Software development and large scale analysis including probabilistic parametrisation of the ASG. The core data set will be supplemented with predicted ASGs.
Year 3, Months 25-30: Scientific visit with Lior Pachter at UC Berkeley, USA. Analysis of comprehensive data sets with increased focus on validating prediction methods.
Year 3, Months 31-36: Documentation of developed software and publication of final analysis results.

Management of Project

The postdoc will start 1.1.07. He will be based in Oxford, but will visit our collaborators in Israel and the US, for 6 months each.
These visits are planned for 1.1.08-30.6.08 (Gil Ast, Israel) and 1.1.09-30.6.09 (Lior Pachter, Stephen Brenner and Mike Eisen, all Berkeley). Lior Pachter will be based in Oxford in the academic year 06/07, so it is reasonable to visit Gil Ast first. This will allow the postdoc to benefit from Gil Ast's expertise on the functional molecular biology of AS from an early stage in the project. The postdoc will collaborate closely with Pachter, both during Pachter's visit to Oxford and during a later visit with Pachter and colleagues at UC Berkeley. The association with Lior Pachter is a major asset. He developed MAVID, a key program for whole genome alignments, as well as methods for alternative splicing prediction. He is also heavily involved in the development of Drosophila genome databases.

The Hein group has projects on RNA gene finding (postdoc Rune Lyngsø, funded by a BBSRC grant expiring 31.12.08, and PhD student Naila Mimouni from the Life Sciences Interface Doctoral Training Centre in Oxford) and statistical string comparison (postdocs Andrea Rocco and David Dale, funded by a BBSRC grant expiring 31.3.08). Additionally, Thomas Mailund (postdoc funded by a grant from the Danish government expiring 31.12.07) is a computer scientist with bioinformatics expertise, and Rahul Satija (Rhodes Scholar, who will be in our group until 30.9.09) will work on detection of regulatory signals in the same 12 Drosophila genomes. Our group also has strong links to Chris Holmes' group, which specialises in the application of computer intensive statistical methods to bioinformatics.