Translational evidence and the accuracy of prokaryotic gene
annotation
Luciano Brocchieri
Department of Molecular Genetics & Microbiology and Genetics Institute
University of Florida, Gainesville, FL 32610
From gene prediction to genome annotation
• Computational gene predictions. E.g., GeneMark2.5 (Borodovsky
and McIninch 1993), GeneMarkHMM (Lukashin and Borodovsky
1998), Glimmer3.0 (Delcher et al. 2007), Prodigal (Hyatt et al.
2010), etc.
• Union of predictions (comprehensive compilation)
• Intersection of predictions (robust predictions)
• Evolutionary conservation
• Annotations modeled on closely-related species
• Long-range conservation indicative of functionality
• Expression
• Microarrays
• RNA-seq
• Proteomics
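The union and intersection of predictions listed above can be illustrated with plain set operations. This is a minimal sketch, not the pipeline used in the talk: the coordinates are made up, and treating two predictions as the same gene only when start, end and strand all agree is an assumption (in practice one might match on the stop codon alone).

```python
# Hypothetical predictions: each gene is a (start, end, strand) tuple.
# Coordinates are invented for illustration only.
predictions = {
    "GeneMarkHMM": {(100, 400, "+"), (900, 1500, "-")},
    "Glimmer3.0": {(100, 400, "+"), (2000, 2300, "+")},
    "Prodigal": {(100, 400, "+"), (900, 1500, "-")},
}

# Comprehensive compilation: every gene predicted by at least one method.
union = set().union(*predictions.values())

# Robust predictions: genes predicted by all methods.
intersection = set.intersection(*predictions.values())

# Genes predicted exclusively by one method.
exclusive = {}
for name, genes in predictions.items():
    others = set().union(*(g for n, g in predictions.items() if n != name))
    exclusive[name] = genes - others

print(len(union), len(intersection), exclusive["Glimmer3.0"])
```

Exclusive predictions are the class that the later slides report as tending to be less conserved.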
Missing genes in genome annotation
• Extensive conservation analysis of genomic ORFs from 1,300
bacterial chromosomes has revealed conservation across distantly
related genomes of 40,000 ORFs not represented in genome
annotations (Warren et al., BMC Bioinformatics 2010)
• More than 52,000 genes predicted by Glimmer3.0 and not included
in 1,574 bacterial chromosome annotations are confirmed by
evolutionary conservation and functional characterization (Wood et
al., Biology Direct 2012)
• Significant 3-base periodicity identifies more than 68,000 conserved ORFs in annotated inter-genic regions of 2,000 prokaryotic
chromosomes (Oden and Brocchieri, Bioinformatics 2015)
NPACT and the identification of coding regions by 3-base periodicity (http://genome.ufl.edu/npact)
ORFs not included in gene annotations can be identified by significant 3-base
periodicity in the sequence (Oden and Brocchieri 2015, in revision)
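To give a feel for what 3-base periodicity means, the sketch below measures G+C content separately at the three codon phases of a sequence. It is only a crude illustration under stated assumptions: the function names are hypothetical, and the spread between phases is not the significance statistic used by NPACT, which is not described in these slides.

```python
def gc_by_phase(seq):
    """Return the G+C fraction at each of the three codon phases of seq."""
    seq = seq.upper()
    fractions = []
    for phase in range(3):
        bases = seq[phase::3]  # every third base, starting at this phase
        gc = sum(1 for b in bases if b in "GC")
        fractions.append(gc / len(bases) if bases else 0.0)
    return fractions


def periodicity_spread(seq):
    """Crude periodicity indicator: largest minus smallest phase G+C fraction."""
    fractions = gc_by_phase(seq)
    return max(fractions) - min(fractions)


print(gc_by_phase("ATG" * 10), periodicity_spread("ATG" * 10))
```

A sequence with no phase-dependent composition gives a spread near zero; coding sequences in G+C-rich genomes typically show a marked difference between phases.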
Why are genes missed in genome annotation?
• Missed genes do not depend on date of annotation (Wood et al.,
Biology Direct 2012)
• Lack of sensitivity of computational gene predictors
• Lack of consistency among computational gene predictors
• Lack of specificity of computational gene predictors
• Stringent criteria (e.g., on consistency or conservation) for
acceptance during annotation
• Problems with the annotation pipelines
Gene annotation and conservation
We define a gene to be conserved if it:
• has sequence similarity with E-value ≤ 1.0E-6;
• is conserved in length:
1/1.2 ≤ [length target] / [length query] ≤ 1.2;
• is conserved across genera or phyla.
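The three criteria above can be combined into a single test. This is a minimal sketch: the hit dictionaries and their keys (`evalue`, `target_length`, `same_genus`) are hypothetical, and reading "conserved across genera or phyla" as "at least one qualifying hit outside the query's genus" is an assumption.

```python
def is_conserved(hits, query_length, max_evalue=1.0e-6, max_ratio=1.2):
    """Return True if any hit satisfies all three conservation criteria.

    hits: iterable of dicts with hypothetical keys
        'evalue'        - E-value of the similarity hit
        'target_length' - length of the matched (target) sequence
        'same_genus'    - True if the target is from the query's own genus
    """
    for hit in hits:
        ratio = hit["target_length"] / query_length
        similar = hit["evalue"] <= max_evalue
        length_ok = 1 / max_ratio <= ratio <= max_ratio
        distant = not hit["same_genus"]  # assumed reading of "across genera"
        if similar and length_ok and distant:
            return True
    return False


example_hits = [{"evalue": 1e-10, "target_length": 110, "same_genus": False}]
print(is_conserved(example_hits, query_length=100))
```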
Conservation by class of prediction
None (NPACT)   101,019   0.0677
• Genes exclusively predicted by one method tend to be less conserved.
• Glimmer3.0 predicts substantially more exclusive genes than other methods, of
which a greater number but a smaller fraction are conserved.
Gene predictions and periodicity
in Pseudomonas aeruginosa strains
Experimental evidence of expression in P. aeruginosa PAO1: RNA-seq
Expression of predicted genes by length and conservation classes
[Figure legend: Published annotation; Newly identified ORFs; ORFs with RNA-seq coverage]
What do we learn about gene predictions from transcription in bacteria?
Unexpected patterns
[Figure: genome region 33000 to 38000. Tracks: New, Annotation, Hits (log-count), % C+G. Labels: H-51*A, 0032, betC, 0033, 0034, trpA, trpB.]
Contradictory patterns of expression of well-defined protein-coding genes
What do we learn about gene predictions from transcription in bacteria?
The problem of antisense transcription
[Figure: genome region 347000 to 349000. Tracks: New, Annotation, Hits (log-count), % C+G. Labels: 0306, H-443*A, 0307.]
In the case of the prediction of H-443*A, sequence features are more
convincing than RNA-seq expression evidence.
‘Pervasive transcription’ in bacterial genomes (see Wade and Grainger,
Nature Reviews Microbiology 2014) limits the detection power of RNA-seq.
Ribosome footprinting
(Ingolia et al, Science 2009)
• Ribosome stalling with a translation-elongation inhibitor (cycloheximide or tetracycline)
• Cell lysis and digestion of unprotected RNA
• Footprints: cDNA library preparation for deep sequencing and genome mapping
Schematic representation of ribosome footprinting. In the
application to P. aeruginosa,
tetracycline replaces cycloheximide.
Ribosome footprints at initiation sites
The antibiotic tetracycline inhibits translation elongation,
stalling actively translating ribosomes.
However, tetracycline does not prevent additional ribosomes from being
recruited at the initiation site.
The accumulation of ribosomes results in an increased
number of ribosome-profiling reads at the initiation site.
Ribosome footprint coverage in P. aeruginosa
Example of ribosome footprint coverage in P. aeruginosa
PAO1 showing relation with S-profiles, annotated genes and
newly identified ORFs.
Ribosome footprint coverage by codon position
Metagene analysis of ribosome-footprint coverage
Coverage is averaged over all genes, relative to the start of translation
Ribosome footprint coverage by codon position: center of reads
Metagene analysis of coverage by read center + 2 nt
Coverage is averaged over all genes, relative to the start of translation
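The metagene averaging described above can be sketched as follows. This is an illustration under simplifying assumptions, not the analysis code behind the slides: coverage is a plain list of per-nucleotide read counts for the forward strand only, start positions are 0-based, and the function name and window sizes are hypothetical.

```python
def metagene_coverage(coverage, starts, upstream=30, downstream=60):
    """Average coverage over all genes, aligned on the start of translation.

    coverage: list of read counts per genome position (forward strand, 0-based)
    starts:   positions of the first nucleotide of each start codon
    Returns a list of length upstream + downstream; index `upstream`
    corresponds to the first nucleotide of the start codon.
    """
    width = upstream + downstream
    totals = [0.0] * width
    n_genes = 0
    for start in starts:
        lo, hi = start - upstream, start + downstream
        if lo < 0 or hi > len(coverage):
            continue  # skip genes whose window runs off the sequence
        n_genes += 1
        for i in range(width):
            totals[i] += coverage[lo + i]
    if n_genes == 0:
        return []
    return [t / n_genes for t in totals]


toy = [0] * 100
toy[20], toy[50] = 4, 2
print(metagene_coverage(toy, [20, 50], upstream=5, downstream=5))
```

In the returned profile, `(i - upstream) % 3` gives the codon position of index `i`, so coverage by codon position can be read off by summing every third entry downstream of the start.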
Translational evidence by ribosome
footprinting in P. aeruginosa
Ribosome-footprint read-count patterns identify mRNA
translation, translation-initiation sites, and translational pausing.
Ribosome-footprint-coverage patterns are
robustly reproducible
Similar patterns of coverage of groEL observed in
independent biological replicates.
What drives ribosome-footprint coverage patterns?
Newly identified genes in P. aeruginosa
Position relative to predicted start of translation
Examples of RFP-based gene
discovery in P. aeruginosa PAO1
showing relation with S-profiles
and annotated genes.
Identification of new genes by ribosome-footprint
evidence
A new gene is found to be expressed 5′ of eco, the gene for
ecotin, a protease inhibitor localized to the periplasmic space.
Translational evidence for newly identified ORFs
Scoring RFP expression
“Strength” of evidence decreases for poorly
translated mRNA.
“Strength” of the evidence of expression is measured
by an “Expression Index”:
$$\text{Expression Index} = C_1 \ln\frac{C_0}{C_1}$$
$C_0$: count of RFP reads in codon positions [-2, +2], divided by 5;
$C_1$: count of RFP reads in codon positions [+8, len/2], divided by (len/2 - 8).
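The Expression Index can be computed directly from per-codon read counts, reading the formula as C1 · ln(C0 / C1). The sketch below makes several assumptions that the slides do not state: codon position 0 is the start codon, `len` is the gene length in codons, and a gene with no reads in either window scores 0. The function name and input format are hypothetical.

```python
import math


def expression_index(codon_counts, length_codons):
    """Expression Index = C1 * ln(C0 / C1), following the slide's definitions.

    codon_counts:  dict mapping codon position relative to the start codon
                   (0 = start codon, an assumption) to the RFP read count
    length_codons: gene length in codons (assumed meaning of 'len')
    """
    half = length_codons // 2
    if half <= 8:
        return None  # gene too short for the [+8, len/2] window
    # C0: mean read count over the five codons around the start.
    c0 = sum(codon_counts.get(p, 0) for p in range(-2, 3)) / 5
    # C1: read count over codons +8 .. len/2, divided by (len/2 - 8) as given.
    c1 = sum(codon_counts.get(p, 0) for p in range(8, half + 1)) / (half - 8)
    if c0 <= 0 or c1 <= 0:
        return 0.0  # assumption: no evidence when either window is empty
    return c1 * math.log(c0 / c1)


counts = {p: 8 for p in range(-2, 3)}
counts.update({p: 2 for p in range(8, 20)})
print(expression_index(counts, 40))
```

With these toy counts C0 = 8 and C1 = 2, so the index is 2 · ln 4. The index grows with the accumulation of reads at the start relative to the gene body, weighted by the body coverage itself, which is why poorly translated mRNAs score low.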
Expression of predicted genes by length and conservation classes
[Figure legend: Published annotation; Newly identified ORFs; ORFs with Expression Index ≥ 12.0]
Conservation and expression of genes
annotated in Pseudomonas aeruginosa PAO1
            Number        Fraction
Conserved   5,457/5,567   0.980
Expressed   3,208/5,567   0.576
Number and fraction of conserved or expressed genes among
all genes annotated in P. aeruginosa PAO1
Conservation and expression of predicted genes not included
in annotations by class of prediction
Conserved
Expressed
Number and fraction of conserved or expressed genes of all genes
predicted by different sets of predictors in P. aeruginosa PAO1
Identification of translation-initiation sites by
ribosome-footprinting
Hypothetical gene 1889. RFP evidence of translation from an
alternative start at +600.
Start of translation identification by RFP read
accumulation
                  Annotated   Newly identified
Same start        85.0%       77.8%
Different start   15.0%       22.2%
Ribosome footprints confirm the predicted start of translation for 85% of
annotated genes and 78% of newly identified ORFs, among those
with evidence of translation.
Alternative start of translation?
RFP read patterns suggest that translation of cysH [phosphoadenylyl-sulfate
(PAPS) reductase] starts 75 nucleotides downstream of the
computationally predicted start.
Alternative start of translation?
FliA is the sigma factor of RNA polymerase for transcription of flagellar genes.
CheY is involved in transmission of the sensory signal to the flagellar motor.
Post-transcriptional control of translation after
oxidative stress
[Figure: scatter plot of G(RFP) / G(0.001) (vertical axis) against G(RNA) / G(0.001) (horizontal axis). Gene classes: RNA>1,RFP>1; RNA<-1,RFP<-1; RFP>1,RNA=0; RFP<-1,RNA=0; RNA>1,RFP=0; RNA<-1,RFP=0; RNA<-1,RFP>1; RNA>1,RFP<-1; Others.]
Thanks to
Lab members
• Steve Oden – Postdoctoral associate. Development of gene finding
methods and software, gene content analysis in human and prokaryotes.
• Nathan Bird – Programmer with Acceleration.com.
• Anna Picca – Postdoctoral associate. RNA-seq and ribosome profiling
• Ying Zhang – Postdoctoral associate. RNA-seq
Collaborators
• Silvia Tornaletti (UF Dept. of Medicine). RNA biology.
• Shouguang Jin (UF Dept. of Molecular Genetics and Microbiology). P.
aeruginosa samples and advice
Funding
• NIH R01 GM08748501A2
• MGM, Genetics Institute, College of Medicine.