Gene Finding in Viral Genomes.
Stephen McCauley, Jotun Hein
Introduction
Viral genomes are small and exploit overlapping reading frames to code more compactly. In the regions of the genome where two or more genes
overlap, the nucleotide composition differs from that of other regions because of the evolutionary constraints that coding in multiple frames imposes.
In nature there are three classes of overlapping genes (see Figure 1): unidirectional, where the 3’ end of one gene overlaps with the 5’ start of another
gene in a different reading frame; convergent, where the 3’ end of one gene overlaps with the 3’ end of another gene (read in the opposing direction);
and divergent, where the 5’ start of one gene overlaps with the 5’ start of another gene (read in the opposing direction). This listing roughly follows
the relative preponderance of these types of overlapping genes in nature: unidirectional overlaps are more common than
convergent overlaps, which are more common than divergent overlaps (1: Rogozin et al 2002). Our methodology models unidirectional overlaps and is extended to allow for
three genes which overlap in a unidirectional manner.
[Figure 1, unidirectional panel: the sequence TGATGATGA with the codon positions (1, 2, 3) of Gene1 and Gene2 marked in two different reading frames, both read 5’ to 3’.]
Overlapping genes are observed in most viral genomes, and they are not uncommon even in eukaryotic genomes (where there are not the same evolutionary
constraints on genomic size). There are evident limitations to the amino acid coding capabilities of overlapping
regions (a UUU encoding Phe may overlap in one reading frame with a UUA Leu in another, but a UUU Phe may not overlap with a GGG Gly). It
seems intuitive that these regions of overlap might be compositionally biased in some manner, and it is possible to examine these overlaps mathematically
and propose likely biases. These biases have been observed, and we can use this information when predicting genes in viral
genomes. Below we briefly discuss the nucleotide compositional constraints on unidirectional overlapping genes.
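The codon constraint in the UUU/UUA versus UUU/GGG example can be checked mechanically. The following is a minimal sketch, not part of the original method; the function names `overlap_shifts` and `can_overlap` are hypothetical, and only overlaps on the same strand (the unidirectional case) are considered.

```python
def overlap_shifts(first, second):
    """Return the frame shifts (1 or 2) at which codon `second` can start
    inside codon `first` on the same strand."""
    shifts = []
    for shift in (1, 2):
        # The tail of the first codon must equal the head of the second.
        if first[shift:] == second[:3 - shift]:
            shifts.append(shift)
    return shifts


def can_overlap(codon_a, codon_b):
    """True if the two codons can overlap in either order."""
    return bool(overlap_shifts(codon_a, codon_b) or overlap_shifts(codon_b, codon_a))


print(overlap_shifts("UUU", "UUA"))  # [1, 2]
print(can_overlap("UUU", "GGG"))     # False
```

UUU and UUA are compatible at both shifts (for example in UUUA and UUUUA), whereas UUU and GGG share no prefix or suffix and so cannot overlap.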
We then describe the methodology that we have employed to predict genes in viral genomes. We employ a Hidden Markov Model (HMM) framework which assigns a
genomic annotation at the nucleotide level. A simulation study illustrates the potential improvement that such a methodology may
bring. We discuss these simulation results and leave prediction on actual viral genomes for future publications.
[Figure 1, convergent panel: the sequence TGATGATGA with Gene1 read 5’ to 3’ and Gene2 read in the opposing direction, so that the 3’ ends of the two genes overlap.]
Nucleotide Compositional Constraints on Unidirectional Overlapping Genes
Information-theoretic entropy measurements have shown that overlapping genes tend to have a more uniform nucleotide composition than
non-overlapping genes. In addition, they tend to have higher-order structuring, which takes the form of a greater frequency of amino acid
residues with a high level of degeneracy (2: Pavesi et al 1997). These observations can be understood in terms of a mathematical analysis of the potential
overlapping codon pairs (3: Kozlov 2000) and from evolutionary observations of viral genomes undergoing simultaneous positive and purifying selection
on overlapping reading frames (4: Hughes et al 2001).
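[Figure 1, divergent panel: the sequence TGATGATGA with Gene1 read 5’ to 3’ and Gene2 read in the opposing direction, so that the 5’ starts of the two genes overlap.]
Fig. 1. The three classes of overlapping genes: unidirectional, convergent and divergent.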
Kozlov (2000) examines the set of potential overlapping amino acids. Considering two codons in non-overlapping genes yields a space of 20 × 20 = 400
possible amino acid pairs. This space is reduced to only 80 possible amino acid pairs under some overlapping constraints, and 50% of this reduced space incorporates
Ser, Leu or Arg as one of the encoded amino acids. A more detailed examination of the potential 61*61 codon-pair space (of which the amino acid
pair space is a summary, excluding STOP codons) indicates that substantially more than 50% of the potential codon pairs encode Ser, Leu or Arg
(unpublished). These amino acids are six-fold degenerate at the codon level, and although we know that nature often favours one or two of these
degenerate codons, we would nevertheless be surprised if the nucleotide composition of overlapping unidirectional genes were not biased towards Ser-, Leu- and
Arg-rich sequences, since the majority of overlapping codon pairs incorporate these amino acids, whereas in non-overlapping genes only 18/61 codons code for Ser, Leu or
Arg.
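The counts quoted above for the non-overlapping case (400 amino acid pairs, 61 sense codons, 18 codons for Ser, Leu or Arg) follow directly from the standard genetic code. The sketch below only checks this arithmetic. It makes no attempt to reproduce the 80-pair result or the unpublished codon-pair analysis, whose exact overlap constraints are not given here. The names `CODON_TABLE`, `sense_codons` and `slr_codons` are illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" marks STOP.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

# Sense codons are all codons that do not encode STOP.
sense_codons = [codon for codon, aa in CODON_TABLE.items() if aa != "*"]
# Codons encoding Ser (S), Leu (L) or Arg (R), each six-fold degenerate.
slr_codons = [codon for codon in sense_codons if CODON_TABLE[codon] in "SLR"]
amino_acids = set(AMINO) - {"*"}

print(len(amino_acids) ** 2)                # 400 unconstrained amino acid pairs
print(len(slr_codons), len(sense_codons))   # 18 61
```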
Hughes et al (2001) described the observed pattern of simultaneous positive selection in the tat gene of SIV and purifying selection in the corresponding
overlapping region of the vpr gene. Nonsynonymous substitutions were observed which altered the region of the tat gene encoding an epitope
(positive selection indicative of, and favouring, immune escape). These nonsynonymous substitutions in the tat reading frame were associated with
synonymous substitutions in the vpr reading frame. This evolutionary mode is only possible when amino acids with some degree of codon degeneracy are employed
(see Figure 2).
Sequence Example 1: 5’ CUUCUUCUU 3’
The above RNA sequence implies the following peptide in Frame 1:
Leu,Leu,Leu and in Frame 2: Phe,Phe. We can imagine a mutation at the 3rd
nucleotide to either A, C or G. This would result in no change to the peptide chain in
Frame 1, but Frame 2 could become Tyr,Phe or Ser,Phe or Cys,Phe. This is an
example of an RNA sequence with degenerate codons (the Leu in Frame 1) that can
give rise to synonymous substitutions in one reading frame and nonsynonymous
substitutions in another.
Amino acids which are multiply degenerate are involved in the greater proportion of the potential overlapping coding space. Overlapping regions which
employ the greatest proportion of these offer increased flexibility for evolutionary adaptation under selection pressure, which perhaps explains their
greater documented abundance (2: Pavesi et al 1997).
A Hidden Markov Model for Explicit Modelling of Unidirectional Overlapping Genes at the Nucleotide Level
The composition of nucleotides in non-gene and gene regions differs. Furthermore, the nucleotide composition in genes whose reading frames overlap
differs from that of conventional non-overlapping genes. We define an HMM with 8 active states as follows: one non-gene state, 3 single-gene states (one for
each of 3 unidirectional reading frames), 3 paired overlapping gene states (genes in reading frames 1&2, 2&3 and 1&3) and a triple overlapping state.
Each nucleotide is emitted from one of these states according to a defined conditional emission probability distribution, and transitions between states (from
one nucleotide to the next) are permitted according to a set of defined conditional transition probability matrices. We shall examine one particular example
of a transition matrix to illustrate how the model operates.
We are currently applying this methodology to actual viral genomes, where the realistic performance of the procedure can be ascertained. There are several
extensions to the procedure that we already wish to apply. Modelling introns explicitly may be necessary, and we may also be able to use first-pass
annotations to help parameterise the HMM in a genome-specific fashion (some EM procedure may work well with such a starting point). We would like to
employ some type of evolutionary model to help annotate a multiple alignment of viral genomes where the phylogeny is well documented.
[Figure 3, State Transition Matrix for Reading Frame #2, 1st Nucleotide Position: rows and columns are States 1 to 8; stars mark non-zero transition probabilities.]
Fig. 2. Sequence Examples 1 and 2.
Simulation Results and Suggestions for Further Work
We parameterised an HMM as described above using HIV-1 sequence as a guide. From this HMM we simulated many thousands of genomes of length
10,000 (approximately the length of the HIV genome). We then annotated the sequences with the Viterbi and Posterior Decoding algorithms and
compared these annotations with the known simulated state sequences. Using this methodology, more than 98% of gene nucleotides were correctly
annotated by either the Viterbi or the Posterior Decoding procedure. Of course, these results come from annotating sequences generated according to the HMM
itself, with known parameters, and so they likely represent an upper bound on annotation performance for real genomes, where neither of these conditions
necessarily holds. We also designed a simpler gene finder in which the overlapping gene regions were not explicitly modelled: genes were annotated
in separate reading frames and then combined in a final annotation. Using this simplified model, both the Viterbi and Posterior Decoding procedures
performed very poorly (the simulated data had many overlaps, as is typical of viral genomes). This encourages us that our hypothesis, that modelling gene
overlaps explicitly is an important consideration in viral gene prediction, is likely correct.
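The structure of such a simulation study can be sketched as follows. This is only an illustration under stated assumptions: it uses a toy two-state HMM with invented parameters rather than the 8-state model parameterised from HIV-1, and the names `simulate` and `gene_accuracy` are hypothetical. The accuracy measure mirrors the statistic quoted above, the fraction of true gene nucleotides that an annotation labels correctly.

```python
import random

# Toy parameters, invented for illustration (not the poster's HIV-1 values).
TRANSITION = {
    "non-gene": {"non-gene": 0.99, "gene": 0.01},
    "gene": {"non-gene": 0.005, "gene": 0.995},
}
EMISSION = {
    "non-gene": {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25},
    "gene": {"A": 0.35, "C": 0.15, "G": 0.15, "U": 0.35},
}


def draw(distribution, rng):
    """Sample one key of `distribution` according to its probabilities."""
    keys = list(distribution)
    return rng.choices(keys, weights=[distribution[k] for k in keys])[0]


def simulate(length, rng):
    """Simulate a hidden state path and the nucleotide sequence it emits."""
    path, sequence = [], []
    state = "non-gene"
    for _ in range(length):
        path.append(state)
        sequence.append(draw(EMISSION[state], rng))
        state = draw(TRANSITION[state], rng)
    return path, "".join(sequence)


def gene_accuracy(true_path, predicted_path):
    """Fraction of true gene nucleotides whose predicted state is correct."""
    gene_sites = [i for i, s in enumerate(true_path) if s != "non-gene"]
    if not gene_sites:
        return float("nan")
    hits = sum(predicted_path[i] == true_path[i] for i in gene_sites)
    return hits / len(gene_sites)


true_path, genome = simulate(10_000, random.Random(0))
# A decoder (Viterbi or posterior decoding) would supply the predicted path.
print(len(genome), gene_accuracy(true_path, true_path))
```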
Sequence Example 2: 5’ UGGUGGUGG 3’
The above RNA sequence implies the following peptide in Frame 1:
Trp,Trp,Trp and in Frame 2: Gly,Gly. Similarly, we can imagine a mutation at the 3rd
nucleotide to either A, C or U. Unlike the previous example, any mutation at this
locus would result in peptide chain changes in both reading frames. This
example shows how intolerant of mutation sequences rich in non-degenerate codons (the
Trp in Frame 1) are. There is no flexibility for selectively driven mutations in such an
arrangement, and one can imagine that this is a drawback for the propagation of such
sequences over the course of evolutionary time.
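Both sequence examples can be verified by translating the two reading frames before and after each possible substitution at the 3rd nucleotide. The sketch below assumes the example sequences are CUUCUUCUU and UGGUGGUGG, which reproduce the peptides stated in the captions. It uses the standard genetic code with one-letter amino acid codes (L = Leu, F = Phe, W = Trp, G = Gly, `*` = STOP); the function names are illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" marks STOP.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(rna, offset):
    """Translate the complete codons of `rna` starting at 0-based `offset`."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(offset, len(rna) - 2, 3))


def mutation_effects(rna, position):
    """Map each substituted base at 1-based `position` to the resulting
    (Frame 1 peptide, Frame 2 peptide) pair."""
    effects = {}
    for base in BASES:
        if base == rna[position - 1]:
            continue  # not a mutation
        mutant = rna[:position - 1] + base + rna[position:]
        effects[base] = (translate(mutant, 0), translate(mutant, 1))
    return effects


for rna in ("CUUCUUCUU", "UGGUGGUGG"):
    print(rna, translate(rna, 0), translate(rna, 1), mutation_effects(rna, 3))
```

For CUUCUUCUU every substitution leaves Frame 1 as LLL while Frame 2 changes from FF to YF, SF or CF. For UGGUGGUGG every substitution changes both frames, and the change to A introduces a STOP codon (UGA) in Frame 1.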
Consider Figure 3, which is concerned with the first nucleotide position in reading frame 2 (so nucleotide loci 2, 5, 8, etc.). State 1 is the Non-Gene state,
and if the previous nucleotide were in this state then it is possible that the nucleotide under consideration could be the first position of a codon in
reading frame 2. This is described as State 3, and so there is a defined probability of transitioning from State 1 to State 3 (there is also a probability
of remaining in State 1). Consider State 8, which represents the triple overlapping gene state. Were the previous nucleotide in this state, then the HMM
could remain in this state (by continuing with a new codon in reading frame 2), or the HMM could leave the gene state in reading frame 2 and transition to
the doubly overlapping gene State 6 (which represents a gene in reading frame 1 overlapping with a gene in reading frame 3). The transition matrix in
Figure 3 is populated with stars, which denote non-zero probabilities. We use the star notation because we further condition these state transition
matrices on whether the previous nucleotide triple could represent a START codon, a STOP codon or NONE. There are further nuances which need to
be employed to start and end the HMM in a consistent manner, but the above description represents the crux of the model.
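The pattern of stars described above can be sketched by treating each state as the set of reading frames that are currently coding. The sketch rests on two assumptions that are not confirmed by the text: the state numbering follows the Figure 3 legend (State 2 = RF1, State 3 = RF2, State 4 = RF3, State 5 = RF1,2, State 6 = RF1,3, State 7 = RF2,3), and at a first codon position of one frame only the gene in that frame may start, continue or end. The further conditioning on START, STOP or NONE is not modelled, and `allowed_transitions` is a hypothetical name.

```python
# Assumed numbering: state -> set of reading frames in which a gene is present.
STATE_FRAMES = {
    1: frozenset(),
    2: frozenset({1}),
    3: frozenset({2}),
    4: frozenset({3}),
    5: frozenset({1, 2}),
    6: frozenset({1, 3}),
    7: frozenset({2, 3}),
    8: frozenset({1, 2, 3}),
}


def allowed_transitions(frame):
    """For each source state, list the destination states reachable at a
    first codon position of `frame`: the other frames are mid-codon, so
    their gene membership must stay unchanged."""
    allowed = {}
    for src, src_frames in STATE_FRAMES.items():
        allowed[src] = sorted(
            dst for dst, dst_frames in STATE_FRAMES.items()
            if src_frames - {frame} == dst_frames - {frame}
        )
    return allowed


# Print the support of the matrix for reading frame 2, with stars as in Figure 3.
print("  " + " ".join(str(s) for s in STATE_FRAMES))
for src, dsts in allowed_transitions(2).items():
    print(src, " ".join("*" if dst in dsts else "." for dst in STATE_FRAMES))
```

Under these assumptions State 1 can move only to States 1 or 3, and State 8 only to States 6 or 8, in agreement with the two cases worked through in the text.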
We employ the Viterbi and Posterior Decoding procedures to infer, respectively, the most likely genomic state annotation and the most likely annotation state for every
individual nucleotide. These annotations are only optimal in so far as (a) the parameters of the HMM describe nature; (b) the HMM is a suitable
model (for example, this methodology implicitly assumes that gene length is geometrically distributed); and (c) introns, which we do not model, can be neglected.
Since we are annotating viral genomes, in which introns are less common than in eukaryotic genomes, we suspect this is not a major weakness;
the model can be extended if deemed necessary.
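The two decoding procedures differ in what they optimise: Viterbi returns the single most probable state path, while posterior decoding picks the most probable state at each nucleotide separately. The sketch below shows both on a toy two-state HMM with invented parameters; it is not the 8-state model, and it assumes all probabilities are strictly positive so that logarithms are defined.

```python
import math

# Toy two-state HMM, invented for illustration.
STATES = ("AU-rich", "GC-rich")
START = {"AU-rich": 0.5, "GC-rich": 0.5}
TRANS = {"AU-rich": {"AU-rich": 0.9, "GC-rich": 0.1},
         "GC-rich": {"AU-rich": 0.1, "GC-rich": 0.9}}
EMIT = {"AU-rich": {"A": 0.4, "U": 0.4, "G": 0.1, "C": 0.1},
        "GC-rich": {"A": 0.1, "U": 0.1, "G": 0.4, "C": 0.4}}


def viterbi(seq, states, start, trans, emit):
    """Most probable state path, computed in log space."""
    score = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    back = []
    for symbol in seq[1:]:
        pointers, new_score = {}, {}
        for s in states:
            best = max(states, key=lambda r: score[r] + math.log(trans[r][s]))
            pointers[s] = best
            new_score[s] = score[best] + math.log(trans[best][s]) + math.log(emit[s][symbol])
        back.append(pointers)
        score = new_score
    state = max(states, key=score.get)
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]


def posterior_decode(seq, states, start, trans, emit):
    """Most probable state at each position (scaled forward-backward)."""
    fwd, scales, prev = [], [], None
    for i, symbol in enumerate(seq):
        if i == 0:
            row = {s: start[s] * emit[s][symbol] for s in states}
        else:
            row = {s: emit[s][symbol] * sum(prev[r] * trans[r][s] for r in states)
                   for s in states}
        total = sum(row.values())
        prev = {s: row[s] / total for s in states}  # rescale to avoid underflow
        fwd.append(prev)
        scales.append(total)
    bwd = [None] * len(seq)
    bwd[-1] = {s: 1.0 for s in states}
    for i in range(len(seq) - 2, -1, -1):
        bwd[i] = {s: sum(trans[s][r] * emit[r][seq[i + 1]] * bwd[i + 1][r] for r in states)
                  / scales[i + 1] for s in states}
    return [max(states, key=lambda s: fwd[i][s] * bwd[i][s]) for i in range(len(seq))]


SEQ = "AAAAUUAA" + "GGGCGCGGCC"
print(viterbi(SEQ, STATES, START, TRANS, EMIT))
print(posterior_decode(SEQ, STATES, START, TRANS, EMIT))
```

On this toy sequence both procedures label the first 8 nucleotides AU-rich and the remaining 10 GC-rich. The constant self-transition probability is also what makes the length of a stay in any state geometrically distributed, the assumption noted in (b).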
State Diagram Legend
State 1: No Gene
State 2: G RF1
State 3: G RF2
State 4: G RF3
State 5: G RF1,2
State 6: G RF1,3
State 7: G RF2,3
State 8: G RF1,2,3
Fig. 3. Hierarchical flow chart illustrating the state transitions possible according to the defined state transition matrix.
1: Purifying and directional selection in overlapping prokaryotic genes. Rogozin IB, Spiridonov AN, Sorokin AV, Wolf YI, Jordan IK, Tatusov RL, Koonin EV. Trends Genet. 2002 May;18(5):228-32.
2: On the informational content of overlapping genes in prokaryotic and eukaryotic viruses. Pavesi A, De Iaco B, Granero MI, Porati A. J Mol Evol. 1997 Jun;44(6):625-31.
3: Analysis of a set of overlapping genes. Kozlov NN. Dokl Biochem. 2000 Jul-Aug;373(1-6):119-22.
4: Simultaneous positive and purifying selection on overlapping reading frames of the tat and vpr genes of simian immunodeficiency virus. Hughes AL, Westover K, da Silva J, O'Connor DH, Watkins DI. J Virol. 2001 Sep;75(17):7966-72.