Alternative Splicing: Functionality, Evolution and Selection
Motivation and Background
Alternative splicing (AS) was discovered in 1978, and in the subsequent 10-15 years was viewed
mostly as a curiosity: an interesting way to generate several proteins from one gene (Ast, 2004).
With the advent of large scale genome sequencing and EST determination, it has become clear
that a very large percentage of genes are alternatively spliced. A key goal of bioinformatics is to
predict as much as possible of the behaviour of a biological system by computational means using
available knowledge. Presently, we can only predict coarse features of a gene from the sequence
alone, although progress in this field is steady. Obviously, AS increases the variety of proteins
encoded by the genome. For the researcher, it creates the challenge of determining which of these
variants are functional, which are tolerated noise, and which are directly detrimental or advantageous novelties.
This is analogous to previous debates on the selective value of observed sequence variation
within a species and molecular differences between species.
We propose to address this problem by a comparative analysis of AS from different species.
The standard method of extracting a compact representation of AS data is to construct the
associated alternative splicing graph (ASG), where the empirically observed transcripts
correspond to different paths through the graph (Leipzig et al., 2004). The ASG only contains
information about putative transcripts, but not about their probability. While ASGs may serve as a
starting point, more comprehensive models which include the stochastic nature of AS are needed.
Comparing AS is also an area where theoretical work is lagging behind experimental techniques.
In particular, missing information needs to be incorporated. There is no guarantee that, for
instance, the mouse fibrinogen ASG fully represents the AS of that gene – all transcripts might
not have been observed and there can be tissue variation. Ideally this should be part of an
evolutionary model used to compare AS models, but could also be ignored in initial
investigations or for limited curated data sets of high quality.
Drosophila melanogaster and related species are among the most studied organisms of the
last century. The vast knowledge accumulated about their genetics, ecology, neuroscience,
molecular biology and embryology cannot be rivalled by any other metazoan. A series of large scale
projects have been launched that will ensure Drosophila's role as a central model species from a
genomics perspective as well. These projects will generate genome as well as expression and
transcript data. This provides a unique opportunity to study several open issues in modern
biology, and we propose to focus on alternative splicing.
Biological Issues to be Addressed
This proposal aims to investigate AS and its variation within an individual as well as between
species. We will do this by modelling AS on several levels, starting from AS in one particular
cell and progressing over different tissue types to AS evolution and selection. In broad terms, the
proposed research can be characterised and structured as follows.
1. Quantification and Characterisation of Alternative Splicing in Drosophila
   a) Inference of AS structure
   b) Parametrisation of AS models
      i. Splicing dependencies
      ii. Tissue specific and tissue dependent models
2. Evolution of Alternative Splicing
   a) Models of AS evolution
      i. Sequence oblivious evolutionary models
      ii. Sequence dependent evolutionary models
   b) Comparative AS inference
3. Selection and Neutrality of Alternative Splicing
Initially the project will focus on characterising and modelling AS within a single organism.
This work will be based on data becoming available for several species of Drosophila, and will in
itself be valuable in characterising AS in these species. The main aim of this part is to obtain a
better understanding of correlations between splicing events and determine adequate models for
describing these correlations. This includes extending models from considering AS in one
specific cell type to capturing AS across multiple tissue types. We would expect good models to
be parametrisable such that only a few parameters change between different tissue types. Once
this necessary ground work for understanding and describing AS has been done, it becomes
feasible to formulate evolutionary models describing evolution of AS in terms of changes to AS
descriptions. A key derivative of this work will be the ability to assess and quantify how selection
on AS differs from neutral evolution. This will be essential for harnessing the power of
comparative approaches for AS inference, a methodology that has already been proven invaluable
for e.g. finding non-coding RNA and determining regulatory signals.
Research Plan
Modelling alternative splicing
The figure below illustrates the relationship between transcripts and the ASG. The solid straight
line represents a genic region of DNA. The dashed lines represent “jumps” over regions that would
not be part of a messenger RNA transcript if that jump were selected; that is, the left endpoints of the
dashed lines correspond to donor sites, and the right endpoints of the dashed lines correspond to
acceptor sites. A given ASG can generate all possible transcripts by traversing from left to right
selecting all possible routes through the graph. Assuming that all intronic regions can be retained,
the graph below can generate 18 different transcripts.
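As a rough illustration of how an ASG generates transcripts, the sketch below enumerates every left-to-right path through a small graph. The graph is a hypothetical toy example, not the one in the figure, and the names ASG, enumerate_transcripts, "S" and "E" are assumptions introduced only for this sketch.

```python
from typing import Dict, Iterator, List

# Hypothetical toy ASG (not the graph from the figure).
# Nodes are exons; an edge a -> b means b can directly follow a in a transcript.
# "S" and "E" are artificial start and end markers.
ASG: Dict[str, List[str]] = {
    "S": ["e1"],
    "e1": ["e2", "e3"],    # e2 is a cassette exon that may be skipped
    "e2": ["e3"],
    "e3": ["e4a", "e4b"],  # two alternative final exons
    "e4a": ["E"],
    "e4b": ["E"],
    "E": [],
}


def enumerate_transcripts(
    graph: Dict[str, List[str]], node: str = "S", end: str = "E"
) -> Iterator[List[str]]:
    """Yield every path from `node` to `end` as a list of exon names."""
    if node == end:
        yield []
        return
    for nxt in graph[node]:
        for tail in enumerate_transcripts(graph, nxt, end):
            # The end marker itself is not part of the transcript.
            yield ([nxt] if nxt != end else []) + tail


transcripts = list(enumerate_transcripts(ASG))
for t in transcripts:
    print("-".join(t))
```

In this toy graph one skippable exon and one two-way choice give 2 x 2 = 4 transcripts; the number of paths grows multiplicatively with the number of independent choices.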
In reality, we can see the transcripts (or partial knowledge thereof like splice-array experiments)
and have to infer the ASG. Typically, the ASG is chosen as the smallest ASG (minASG) that can
explain all the observed transcripts, but the true ASG can easily be more complicated than the
minASG. Conversely, some transcripts allowed by the minASG may not be observable, e.g. if
two splicing events are mutually exclusive. Simulation studies and analytical investigations using
ASGs as transcript generators could help to estimate how much AS could be missed. If the
transcripts are generated with the “wrong” kind of ASG, how will this influence the recovered
ASG? For example, the ASG will assume independence among different splicing events, but that
could be wrong. How many transcripts would be needed to reveal this? In general, we want to
assess to what extent the transcripts generated depend on the precise model employed.
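A minimal sketch of the inference step, assuming exons are already identified and an ASG is represented simply by its set of exon-to-exon junctions: under that assumption the smallest graph explaining the observations is the union of the observed junctions. The function names and the example transcripts are hypothetical.

```python
from typing import Dict, List, Set


def min_asg(observed: List[List[str]]) -> Dict[str, Set[str]]:
    """Union of all junctions seen in the observed transcripts.

    Assumption for this sketch: each transcript is a complete, ordered
    list of exon names, and "S"/"E" are artificial start/end markers.
    """
    graph: Dict[str, Set[str]] = {"E": set()}
    for transcript in observed:
        path = ["S", *transcript, "E"]
        for a, b in zip(path, path[1:]):
            graph.setdefault(a, set()).add(b)
    return graph


def count_paths(graph: Dict[str, Set[str]], node: str = "S", end: str = "E") -> int:
    """Number of distinct transcripts the graph can generate."""
    if node == end:
        return 1
    return sum(count_paths(graph, nxt, end) for nxt in graph[node])


# Two observed transcripts in which exons e2 and e4 never co-occur.
observed = [
    ["e1", "e2", "e3", "e5"],
    ["e1", "e3", "e4", "e5"],
]
graph = min_asg(observed)
print(count_paths(graph), "transcripts allowed,", len(observed), "observed")
```

Here the inferred graph allows four transcripts although only two were observed. If e2 and e4 were in fact mutually exclusive, the graph alone could not express this, which is the independence issue raised above.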
The situation is further complicated by different transcripts having different probabilities, and
the above graph should be parametrised so that each transcript has a well defined probability. The
parametrised ASG should describe AS for the gene, but no consensus presently exists on the
details of such a description. Ideally, it should generate transcripts according to their frequency by
assigning probabilities to individual paths. There are numerous different ways to do this. If an
arbitrary level of complexity of dependencies is allowed, any probability distribution over
transcripts can be modelled. The important question is to find a sufficient level of complexity,
allowing us to make generalisations from observed data. We have introduced models essentially
enriching the ASG to a Markov chain description of AS, thus ignoring any long range
dependencies. This is, at least in some cases, too simplistic.
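The Markov chain enrichment can be sketched as follows, assuming each edge of the ASG carries a transition probability and a transcript's probability is the product along its path. The graph, the numerical values and the function name are illustrative assumptions, not estimates from data.

```python
import math
from typing import Dict, List

# Hypothetical transition probabilities on a toy ASG; outgoing
# probabilities from each node sum to one.
TRANSITIONS: Dict[str, Dict[str, float]] = {
    "S": {"e1": 1.0},
    "e1": {"e2": 0.7, "e3": 0.3},  # include or skip exon e2
    "e2": {"e3": 1.0},
    "e3": {"e4": 0.2, "E": 0.8},   # optional final exon e4
    "e4": {"E": 1.0},
}


def transcript_probability(
    exons: List[str], transitions: Dict[str, Dict[str, float]] = TRANSITIONS
) -> float:
    """Probability of one transcript under a first-order Markov chain."""
    path = ["S", *exons, "E"]
    prob = 1.0
    for a, b in zip(path, path[1:]):
        # A junction absent from the graph has probability zero.
        prob *= transitions.get(a, {}).get(b, 0.0)
    return prob


all_transcripts = [
    ["e1", "e2", "e3"],
    ["e1", "e2", "e3", "e4"],
    ["e1", "e3"],
    ["e1", "e3", "e4"],
]
total = sum(transcript_probability(t) for t in all_transcripts)
print(math.isclose(total, 1.0))
```

Because each choice depends only on the current exon, inclusion of e4 is independent of whether e2 was included, which is exactly the long range dependency such a chain cannot capture.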
An extension to this approach will be to assume that the full splicing potential of a cell consists
of a collection of simple splicing mechanisms, differing in conformation, associated regulatory
elements etc. If each simple mechanism is described by a simple Markov chain, this will still
allow the introduction of long range dependencies. Under this approach, tissue specific splicing
will correspond to different mixtures of the constituent simple splicing mechanisms. The
differences between tissue types are then captured by parameters describing the strength with
which each simple mechanism contributes to the overall splicing. A simple approach to describe
such a mixture would be to have a distribution over simple models. This corresponds to initially
choosing one simple model, whose Markov chain then exclusively describes the splicing. More realistic
approaches will allow for jumps between the simple constituent models. We will also investigate
descriptions of splicing as a temporal process, repeatedly excising introns from the current state of
the transcript. Coupled with the model mixture approach this can be viewed as a step towards
modelling the reality of spliceosomes attaching to and splicing the transcript.
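The simplest mixture described above, where one constituent mechanism is chosen up front and then governs the whole transcript, might look like the sketch below. The two mechanisms, the tissue names and all weights are hypothetical values chosen only to make the effect visible.

```python
import math
from typing import Dict, List

Chain = Dict[str, Dict[str, float]]


def make_chain(p_include_e2: float, p_include_e4: float) -> Chain:
    """Build a simple Markov chain on a toy ASG with two optional exons."""
    return {
        "S": {"e1": 1.0},
        "e1": {"e2": p_include_e2, "e3": 1.0 - p_include_e2},
        "e2": {"e3": 1.0},
        "e3": {"e4": p_include_e4, "E": 1.0 - p_include_e4},
        "e4": {"E": 1.0},
    }


# Two hypothetical simple mechanisms: A favours both optional exons, B neither.
MECHANISMS: Dict[str, Chain] = {"A": make_chain(0.9, 0.9), "B": make_chain(0.1, 0.1)}

# Hypothetical tissue-specific mixture weights over the mechanisms.
TISSUE_WEIGHTS: Dict[str, Dict[str, float]] = {
    "brain": {"A": 0.8, "B": 0.2},
    "muscle": {"A": 0.3, "B": 0.7},
}


def chain_probability(exons: List[str], chain: Chain) -> float:
    path = ["S", *exons, "E"]
    prob = 1.0
    for a, b in zip(path, path[1:]):
        prob *= chain.get(a, {}).get(b, 0.0)
    return prob


def mixture_probability(exons: List[str], tissue: str) -> float:
    """Weighted sum of the constituent chains' transcript probabilities."""
    weights = TISSUE_WEIGHTS[tissue]
    return sum(w * chain_probability(exons, MECHANISMS[m]) for m, w in weights.items())


both = mixture_probability(["e1", "e2", "e3", "e4"], "brain")
print(round(both, 3))
```

Only the mixture weights differ between tissues, yet the mixture makes inclusion of e2 and e4 correlated even though each constituent chain treats them independently.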
Even though the ASG is a more complicated structure than the original set of transcripts, it is
still sequential in nature with a clear ordering on exons. This means most computational methods
for sequence analysis can be extended to ASG analysis. In its simplest form, a comparative
analysis of AS can be approached as the problem of computing the distance between ASGs for
homologous genes given a suitable score matrix. Extending this to allow inferences in statistical
models of the evolution of AS will have several benefits. First, it will allow testing models with
varying degrees of detail in the description of the ASG as well as of the evolution of the ASG to
determine necessary and sufficient features required in these descriptions. Secondly, it provides a
framework for interpreting the comparison of two or more ASGs. Finally, it puts the comparative
analysis on a stronger footing, where parameters are rigorously estimated rather than determined
in an ad hoc manner. Moreover, posterior decoding can be used to assess the confidence with
which parameters and annotations have been inferred.
Gene and Associated Alternative Splicing Evolution
Modelling how a gene and its alternative splicing evolve is a complicated affair indeed. To
model this over a short time period will require very good knowledge of the regulation of
alternative splicing for consequences of substitutions and selective pressures to be incorporated.
Over longer time periods the underlying mechanism of alternative splicing could also evolve, which
would have to be incorporated. For example, mutations in RNAs involved in the spliceosome
could slightly change the recognition of splice signals. This latter phenomenon is a serious
complication that can either be dealt with by having time inhomogeneous models of
molecular evolution or by more general models also modelling the evolution of the AS
mechanism. The latter would require a high level of knowledge of the molecular mechanism of
AS. In practice it will probably be necessary, at least initially, to ignore changes in AS
mechanism and focus on modelling AS evolution for a constant mechanism. Modelling AS
evolution with a constant mechanism can be done in a variety of ways. The short time modelling
problem is still complicated, but techniques exist that are suited for this purpose. The standard
(easy) way would be to model the evolution of the probabilities parametrising the ASG with, for
instance, Brownian motion. This could be useful, but would be a “zero functional knowledge”
approach, thus pursuing this through modelling of the regulation will be harder but considerably
more rewarding. Modelling AS and sequence evolution simultaneously has strong similarity to
models used by Liu and colleagues (Jensen et al., 2005) for combined analysis of expression levels
and sequence change. While combined expression and sequence analysis models expression levels
directly as a function of the presence of given regulatory signals, the combined AS and sequence
analysis would model observed transcripts in a two step procedure: the presence of splicing
signals will define a distribution over possible ASGs, and an ASG will in turn define
probabilities of observed transcripts.
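The “zero functional knowledge” Brownian motion approach mentioned above could be simulated roughly as follows. One assumption goes beyond the text: the motion is run on the logit scale so that the evolving splice probability stays strictly between 0 and 1. The function names, sigma and the step count are likewise hypothetical.

```python
import math
import random
from typing import List


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def evolve_probability(
    p0: float, time: float, sigma: float, steps: int, rng: random.Random
) -> List[float]:
    """Simulate one splice probability evolving along a branch.

    Assumption of this sketch: Brownian motion acts on logit(p), with
    variance sigma**2 per unit time, discretised into `steps` increments.
    """
    x = logit(p0)
    dt = time / steps
    trajectory = [p0]
    for _ in range(steps):
        x += rng.gauss(0.0, sigma * math.sqrt(dt))
        trajectory.append(inv_logit(x))
    return trajectory


rng = random.Random(0)  # fixed seed so the run is reproducible
trajectory = evolve_probability(p0=0.3, time=1.0, sigma=0.5, steps=100, rng=rng)
print(f"start {trajectory[0]:.2f}, end {trajectory[-1]:.2f}")
```

Nothing in this model refers to splicing signals or regulation, so it can describe how far a probability drifts but not why.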
The position and content of signals can be found by well established MCMC algorithms,
especially the Gibbs sampler (Lawrence et al., 1993; Liu, 2001). Moreover, a long series of
signals and molecular mechanisms are known, allowing some description of key signals such as
Exonic Splicing Enhancers (ESE), Intronic Splicing Enhancers (ISE), Exonic Splicing Silencers
(ESS), Intronic Splicing Silencers (ISS), Branch Point, donor sites and acceptor sites. Together
with standard HMM gene finding algorithms defining possible exons, such signals allow the
formulation of an HMM that would define a graph containing the true ASG. Deeper
understanding would shrink this graph toward the true ASG.
Selection and Neutrality
The last few years have seen many investigations concerning the degree of conservation of
nucleotides as a function of alternative splice patterns (Sorek et al., 2004), the correlation of AS
with protein function (Matlin et al., 2005) and the ratio of synonymous and non-synonymous
substitution rates dependent on different splicing scenarios. It is clear that there is a large
component of functionality, as alternatively spliced exons are often accompanied by segments
under extra selective constraints. These results have been intriguing and have underlined the
importance of AS, but also the need for further data and analysis. Estimates of the fraction of AS
under purifying, neutral and positive selection will be a major step towards understanding the
contribution of AS to the complexity of an organism. After the discovery of the low number of
genes in humans and higher animals, many have looked to AS as the hidden source of complexity
(Maniatis and Tasic, 2002), but this still remains to be proven. Biological literature is also full of
functional explanations of AS variants, but much could well be tolerated noise. A large scale AS
analysis for Drosophila could finally estimate these quantities.
Models for combining multiple constraints have been proposed for combinations of RNA and
protein genes (Pedersen et al., 2004a,b) and can be readily transferred to incorporate the
constraints from a regulatory signal. The idea is simple: one constraint, encoding a functioning
protein for example, will accelerate/decelerate the rate of nucleotide (or nucleotide pair)
substitution by a certain factor; a second constraint, encoding a regulatory element controlling
splicing for example, will contribute a second factor to the final rate. Given such a combined
model, it is possible to test for the presence of an additional selective constraint; for example in
the case where splicing is controlled by an RNA regulatory element, this constraint could be the
presence of one or more conserved base pairings. The success of this would depend on the
strength of selection and the amount of data. Short signals with only few additional constraints
can only be detected with significant confidence if large amounts of data are available. The data
required for such a test are a list of splicing signals, or regions containing a splicing signal, that are
alternatively spliced and a list of those that are not alternatively spliced. Initially we will compile a curated
data set of known positive and negative examples of AS. Through our genome analyses we aim to
provide a comprehensive computational annotation of all the Drosophila genomes with this
information.
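The multiplicative combination of constraints described above can be written down in a few lines. This is only a sketch of the idea: the factor values, the constraint names and the function site_rate are hypothetical, and in practice the factors would be estimated from data.

```python
import math
from typing import Dict, Iterable

# Hypothetical rate factors: values below 1 slow substitution down,
# values above 1 would speed it up.
FACTORS: Dict[str, float] = {
    "protein_coding": 0.3,
    "splicing_signal": 0.5,
}


def site_rate(
    constraints: Iterable[str],
    base_rate: float = 1.0,
    factors: Dict[str, float] = FACTORS,
) -> float:
    """Substitution rate at a site: base rate times one factor per constraint."""
    rate = base_rate
    for constraint in constraints:
        rate *= factors[constraint]
    return rate


unconstrained = site_rate([])
coding_only = site_rate(["protein_coding"])
coding_and_signal = site_rate(["protein_coding", "splicing_signal"])
print(unconstrained, coding_only, coding_and_signal)
```

Testing for an additional constraint then amounts to asking whether the data support a second factor different from 1 in the regions of interest.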
Software Development
An intrinsic part of the investigations proposed here will be software development, implementing
methods and models proposed for the study of evolution of AS. This software will be made
available to the research community as both freely available source code and web servers. The
researcher we have identified for this post already has extensive experience in developing
software implementing model based statistical inference. Our group also has a strong tradition of
developing bioinformatics software, ranging from recombination analysis to phylogenetic RNA
structure prediction. We see it as a further strength that this project brings together collaborators
from different aspects of AS research. This will allow the software development to benefit from
immediate feedback, ensuring its relevance for the intended users.
The Data
Species Data: A main resource for this project will be the 12 Drosophila genomes whose
sequences were completed in February 2006. These genomes have been chosen carefully from a
very large number of available species to span a range of evolutionary distances from sibling
species to quite diverged species. They can be treated as raw data or as the genome alignment
generated by MAVID (Bray and Pachter, 2004). The genomes have also been annotated using
GeneMapper (Chatterji and Pachter, 2006). This is useful although we will re-annotate with regard to features
of relevance.
Transcript Data: In the period of relevance, a large set of transcripts generated by a long
series of laboratories will also be available. On average this will generate at least 50–100
transcripts for the 18,000 Drosophila melanogaster genes. This will additionally be supplemented
with data for D. pseudoobscura, D. simulans, D. yakuba and possibly also others. This is most
likely a serious underestimate of the true amount of transcript data that will be available during
2007-9. Reasons for the anticipated growth are the increased interest in Drosophila as a model
organism, the rise of high throughput technologies and the potential of comparative approaches.
Population Data: The 12 genomes of different Drosophila species are additionally supplemented
by at least 7 Mb of sequence from 50 D. melanogaster genomes from one population
(www.dpgp.org). It is hoped that eventually 50 complete genomes will be available. This
would allow many classic molecular evolution versus population genetics issues to be addressed
(Hein et al., 2005).
References
• Ast, G. (2004) How did alternative splicing evolve? Nat. Rev. Genet. 5(10): 773-82.
• Boue, S., Vingron, M., Kriventseva, E. and Koch, I. (2002) Theoretical analysis of alternative splice
forms using computational methods. Bioinformatics 18 suppl. 2: S65-73.
• Bray, N. and Pachter, L. (2004) MAVID: constrained ancestral alignment of multiple sequences.
Genome Res. 14(4): 693-9.
• Cawley, S.L. and Pachter, L. (2003) HMM sampling and applications to gene finding and alternative
splicing. Bioinformatics 19 suppl. 2: ii36-41.
• Chatterji, S. and Pachter, L. (2006) Reference based annotation with GeneMapper. Genome Biology 7:
R29.
• Hein, J. (1989) A new method that simultaneously aligns and reconstructs ancestral sequences for any
number of homologous sequences, when the phylogeny is given. Mol. Biol. Evol. 6(6): 649-68.
• Hein, J., Schierup, M.H. and Wiuf, C. (2005) Gene Genealogies, Variation and Evolution: A Primer in
Coalescent Theory. Oxford University Press.
• Jenkins, P., Lyngsø, R.B. and Hein, J. (2006) How many transcripts does it take to reconstruct the
alternative splicing graph? In press.
• Jensen, S., Shen, L. and Liu, J. (2005) Combining Phylogenetic Motif Discovery and Motif Clustering
to Predict Co-Regulated Genes. Bioinformatics 21(20): 3832-9.
• Kan, Z., Rouchka, E.C., Gish, W.R. and States, D.J. (2001) Gene Structure Prediction and Alternative
Splicing Analysis Using Genomically Aligned ESTs. Genome Res. 11(5): 889-900.
• Knudsen, B. and Hein, J. (1999) RNA secondary structure prediction using stochastic context-free
grammars and evolutionary history. Bioinformatics 15(6): 446-54.
• Lawrence, C., Altschul, S.F., Boguski, M.S., Liu, J., Neuwald, A.F. and Wootton, J.C. (1993)
Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science
262(5131): 208-14.
• Leipzig, J., Pevzner, P. and Heber, S. (2004) The Alternative Splicing Gallery (ASG): bridging the gap
between genome and transcriptome. Nucleic Acids Res. 32(13): 3977-83.
• Liu, J. (2001) Monte Carlo Strategies in Scientific Computing. Springer.
• Lunter, G., Drummond, A.J., Miklos, I. and Hein, J. (2005) Statistical Alignment: Recent Progress,
New Applications, and Challenges. Chapter in Statistical Methods in Molecular Evolution, ed. Rasmus
Nielsen. Springer.
• Maniatis, T. and Tasic, B. (2002) Alternative pre-mRNA splicing and proteome expansion in
metazoans. Nature 418(6894): 236-43.
• Matlin, A.J., Clark, F. and Smith, C.W. (2005) Understanding Alternative Splicing: Towards a Cellular
Code. Nat. Rev. Mol. Cell. Biol. 6(5): 386-98.
• Modrek, B. and Lee, C. (2002) A Genomic view of Alternative Splicing. Nature Genetics 30(1): 13-9.
• Pedersen, J.S. and Hein, J. (2003) Gene finding with a hidden Markov model of genome structure and
evolution. Bioinformatics 19(2): 219-27.
• Pedersen, J.S., Meyer, I.M., Forsberg, R., Simmonds, P. and Hein, J. (2004a) A comparative method for
finding and folding RNA secondary structures within protein-coding regions. Nucleic Acids Res.
32(16): 4925-36.
• Pedersen, J.S., Forsberg, R., Meyer, I.M. and Hein, J. (2004b) An Evolutionary Model for Protein-Coding
Regions with Conserved RNA Structure. Mol. Biol. Evol. 21(10): 1913-22.
• Sorek, R., Shamir, R. and Ast, G. (2004) How prevalent is functional alternative splicing in the human
genome? Trends Genet. 20(2): 68-71.
• Xing, Y., Yu, T., Wu, Y.N., Roy, M., Kim, J. and Lee, C. (2006) An expectation-maximization algorithm for
probabilistic reconstructions of full-length isoforms from splice graphs. Nucleic Acids Res. 34(10):
3150-60.
• Xing, Y. and Lee, C. (2006) Alternative Splicing and RNA Selection pressure – evolutionary
consequences for eukaryotic genomes. Nat. Rev. Genet. 7(7): 499-509.
Milestones and Deliverables
Year 1 Months 1-3: Literature review of bioinformatics and molecular biology of AS, and
collecting relevant existing programs for AS prediction and data analysis.
Year 1 Months 4-6: Detailed planning of the overall structure of the methods and algorithms to be
developed during the span of the project. Extending the Jenkins et al. (2006) methods to
incomplete transcripts. Creating core curated data (genomes and associated transcripts) and
analysing this data set with the extended methods.
Year 1 Months 7-12: Simple analysis programs are developed that can be continuously tested on
existing databases. Implement simple methods for ASG comparison.
Year 2 Months 13-18: Scientific visit with Gil Ast in Tel Aviv, Israel. Increased focus on the
molecular biology and cellular machinery of AS and existing knowledge of regulatory signals
governing AS in the cell. Analysis of data with focus on characterizing regulatory signals and
their evolution.
Year 2 Months 19-24: Software development and large scale analysis including probabilistic
parametrisation of the ASG. The core data set will be supplemented with predicted ASGs.
Year 3 Months 25-30: Scientific visit with Lior Pachter at UC Berkeley, USA. Analysis of
comprehensive data sets with increased focus on validating prediction methods.
Year 3 Months 31-36: Documentation of developed software and publication of final analysis
results.
Management of Project
The postdoc will start 1.1.07. He will be based in Oxford, but will visit our collaborators in
Israel and the US, for 6 months each. These visits are planned for 1.1.08-30.6.08 (Gil Ast, Israel)
and 1.1.09-30.6.09 (Lior Pachter, Stephen Brenner and Mike Eisen, all Berkeley). Lior Pachter
will be based in Oxford in the academic year 06/07, so it is reasonable to visit Gil Ast first.
This will allow the postdoc to benefit from Gil Ast's expertise on the functional molecular
biology of AS from an early stage in the project. The postdoc will collaborate closely with
Pachter, both during Pachter's visit to Oxford and during a later visit with Pachter and colleagues
at UC Berkeley. The association with Lior Pachter is a major asset. He developed MAVID, a key
program for whole genome alignments, as well as methods for alternative splicing prediction. He
is also heavily involved in the development of Drosophila genome databases. The Hein group
has projects on RNA gene finding (postdoc Rune Lyngsø funded by a BBSRC grant expiring
31.12.08 and PhD student Naila Mimouni from the Life Sciences Interface Doctoral Training
Centre in Oxford) and statistical string comparison (postdocs Andrea Rocco and David Dale funded
by a BBSRC grant expiring 31.3.08). Additionally, Thomas Mailund (postdoc funded by a grant
from the Danish government expiring 31.12.07) is a computer scientist with bioinformatics
expertise, and Rahul Satija (Rhodes Scholar, who will be in our group until 30.9.09) will work on
detection of regulatory signals in the same 12 Drosophila Genomes. Our group also has strong
links to Chris Holmes' group, which specialises in the application of computer intensive statistical
methods to bioinformatics.