Translational evidence and the accuracy of prokaryotic gene annotation

Luciano Brocchieri
Department of Molecular Genetics & Microbiology and Genetics Institute
University of Florida, Gainesville, FL 32610

From gene prediction to genome annotation
• Computational gene predictions, e.g., GeneMark2.5 (Borodovsky and McIninch 1993), GeneMarkHMM (Lukashin and Borodovsky 1998), Glimmer3.0 (Delcher et al. 2007), Prodigal (Hyatt et al. 2010)
  • Union of predictions (comprehensive compilation)
  • Intersection of predictions (robust predictions)
• Evolutionary conservation
  • Annotations modeled on closely related species
  • Long-range conservation indicative of functionality
• Expression
  • Microarrays
  • RNA-seq
  • Proteomics

Missing genes in genome annotation
• Extensive conservation analysis of genomic ORFs from 1,300 bacterial chromosomes has revealed conservation across distantly related genomes of 40,000 ORFs not represented in genome annotations (Warren et al., BMC Bioinformatics 2010)
• More than 52,000 genes predicted by Glimmer3.0 and not included in 1,574 bacterial chromosome annotations are confirmed by evolutionary conservation and functional characterization (Wood et al., Biology Direct 2012)
• Significant 3-base periodicity identifies more than 68,000 conserved ORFs in annotated inter-genic regions of 2,000 prokaryotic chromosomes (Oden and Brocchieri, Bioinformatics 2015)

NPACT and the identification of coding regions by 3-base periodicity (http://genome.ufl.edu/npact)
ORFs not included in gene annotations can be identified by significant 3-base periodicity in the sequence (Oden and Brocchieri 2015, in revision)

Why are genes missed in genome annotation?
• Missed genes do not depend on date of annotation (Wood et al., Biology Direct 2012)
• Lack of sensitivity of computational gene predictors
• Lack of consistency among computational gene predictors
• Lack of specificity of computational gene predictors
• Stringent criteria (e.g., on consistency or conservation) for acceptance during annotation
• Problems with the annotation pipelines

Gene annotation and conservation
We define a gene to be conserved if it:
• Has sequence similarity with E-value ≤ 1.0E-6.
• Is conserved in length: 1/1.2 ≤ [length target] / [length query] ≤ 1.2
• Is conserved across genera or phyla.

Conservation by class of prediction
[Table: only one row recovered: None (NPACT), 101,019, 0.0677]
• Genes exclusively predicted by one method tend to be less conserved.
• Glimmer3.0 predicts substantially more exclusive genes than other methods, of which a greater number but a smaller fraction are conserved.

Gene predictions and periodicity in Pseudomonas aeruginosa strains

Experimental evidence of expression in P. aeruginosa PAO1: RNA-seq
[Figure: RNA-seq evidence of expression in P. aeruginosa PAO1]

Expression of predicted genes by length and conservation classes
[Figure legend: Published annotation; Newly identified ORFs; ORFs with RNA-seq coverage]

What do we learn about gene predictions from transcription in bacteria? Unexpected patterns
[Figure: tracks New, Annotation, Hits, Log-count, % C+G; labels H-51*A, 0032, betC, 0033, 0034, trpA, trpB; genome positions 33000 to 38000]
Contradictory patterns of expression of well-defined protein-coding genes.

What do we learn about gene predictions from transcription in bacteria? The problem of antisense transcription
[Figure: tracks New, Annotation, Hits, Log-count, % C+G; labels 0306, H-443*A, 0307; genome positions 347000 to 349000]
In the case of the prediction of H-443*A, sequence features are more convincing than RNA-seq expression evidence.
‘Pervasive transcription’ in bacterial genomes (see Wade and Grainger, Nature Reviews 2014) limits the detection power of RNA-seq.

Ribosome footprinting (Ingolia et al., Science 2009)
• Ribosome stalling with a translation-elongation inhibitor (cycloheximide, tetracycline)
• Cell lysis and digestion of unprotected RNA
• Footprints
• cDNA library preparation for deep sequencing and genome mapping
Schematic representation of ribosome footprinting. In application to P. aeruginosa, tetracycline replaces cycloheximide.

Ribosome footprints at initiation sites
The antibiotic tetracycline inhibits translation elongation, stalling actively translating ribosomes.

Ribosome footprints at initiation sites
However, tetracycline does not prevent more ribosomes from being recruited at the initiation site.

Ribosome footprints at initiation sites
The accumulation of ribosomes results in an increased number of profile reads corresponding to the initiation site.
[Figure axis: # of reads]

Ribosome footprint coverage in P. aeruginosa
Example of ribosome footprint coverage in P. aeruginosa PAO1 showing relation with S-profiles, annotated genes and newly identified ORFs.

Ribosome footprint coverage by codon position
Metagene analysis of ribosome-footprint coverage. Coverage is averaged over all genes, relative to the start of translation.

Ribosome footprint coverage by codon position: center of reads
Metagene analysis of coverage by read center + 2 nt. Coverage is averaged over all genes, relative to the start of translation.

Translational evidence by ribosome footprinting in P. aeruginosa
Ribosome-footprint read-count patterns identify mRNA translation, translation-initiation sites, and translational pausing.

Ribosome-footprint-coverage patterns are robustly reproducible
Similar patterns of coverage of groEL are observed in independent biological replicates.

What drives ribosome-footprint coverage patterns?

Newly identified genes in P. aeruginosa
[Figure axis: Position relative to predicted start of translation]
Examples of RFP-based gene discovery in P. aeruginosa PAO1 showing relation with S-profiles and annotated genes.

Identification of new genes by ribosome-footprint evidence
A new gene is found to be expressed 5’ of the gene eco for Ecotin, a protease inhibitor localized to the periplasmic space.

Translational evidence for newly identified ORFs

Scoring RFP expression
“Strength” of evidence decreases for poorly translated mRNA.
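The metagene analysis described in the coverage-by-codon-position slides can be sketched as follows. This is a minimal illustration, not the pipeline used for the talk: the function names, the forward-strand-only simplification, the integer read center, and the window sizes are all assumptions made for the example.

```python
from collections import defaultdict


def assign_position(read_start, read_length, offset=2):
    """Map a read to one nucleotide: its center plus a fixed offset.

    Assumption: 'read center + 2 nt' is taken as start + length // 2 + 2.
    """
    return read_start + read_length // 2 + offset


def metagene_profile(gene_starts, reads, upstream=30, downstream=60):
    """Average read counts over all genes, aligned at each start of translation.

    gene_starts: start coordinates of genes (forward strand only, hypothetical input)
    reads: (read_start, read_length) tuples
    Returns a dict mapping relative position -> mean read count per gene.
    """
    # Count reads at the single position each read is assigned to.
    counts = defaultdict(int)
    for read_start, read_length in reads:
        counts[assign_position(read_start, read_length)] += 1

    # For every offset from the start, average the counts across genes.
    profile = {}
    for rel in range(-upstream, downstream + 1):
        total = sum(counts.get(start + rel, 0) for start in gene_starts)
        profile[rel] = total / len(gene_starts)
    return profile


if __name__ == "__main__":
    starts = [100, 500]
    toy_reads = [(85, 30), (86, 28), (485, 30)]
    print(metagene_profile(starts, toy_reads, upstream=5, downstream=5))
```

Assigning each read to a single position, instead of spreading it over its full length, is what lets a profile of this kind show codon-level structure such as accumulation at the initiation site.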
Scoring RFP expression
Expression Index C0 = C1 ln C1
C0: count of RFP reads in codon positions [-2, +2] / 5;
C1: count of RFP reads in codon positions [+8, len/2] / (len/2 - 8);
“Strength” of the evidence of expression is measured by an “Expression Index”.

Expression of predicted genes by length and conservation classes
[Figure legend: Published annotation; Newly identified ORFs; ORFs with Expression Index ≥ 12.0]

Conservation and expression of genes annotated in Pseudomonas aeruginosa PAO1
Conserved: 5,457/5,567 (0.980)
Expressed: 3,208/5,567 (0.576)
Number and fraction of conserved or expressed genes among all genes annotated in P. aeruginosa PAO1.

Conservation and expression of predicted genes not included in annotations by class of prediction
[Figure: Conserved; Expressed]
Number and fraction of conserved or expressed genes among all genes predicted by different sets of predictors in P. aeruginosa PAO1.

Identification of translation-initiation sites by ribosome footprinting
Hypothetical gene 1889. RFP evidence of translation from alternative start at +600.

Start of translation identification by RFP read accumulation
                   Annotated   Newly identified
Same start         85.0%       77.8%
Different start    15.0%       22.2%
Ribosome footprints confirm the predicted start of translation of 85% of annotated genes and of 78% of the newly identified ORFs, among those with evidence of translation.

Alternative start of translation?
RFP read patterns suggest that translation of cysH [phospho-adenylylsulphate (PAPS) reductase] starts 75 nucleotides downstream of the computationally predicted start.

Alternative start of translation?
FliA, sigma factor of RNA polymerase for transcription of flagellum genes. CheY is involved in transmission of the sensory signal to the flagellar motor.
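The two read-count averages C0 and C1 defined in the scoring slide can be computed along these lines. This sketch rests on several assumptions that are not stated in the slides: codon 0 is the start codon, `len` is measured in codons, and the [+8, len/2] window is treated as half-open so that it contains exactly len/2 - 8 codons, matching the divisor. The function name and input layout are hypothetical, and the sketch stops at C0 and C1 without combining them into the Expression Index.

```python
def rfp_components(codon_counts, start_index, length_codons):
    """Return (C0, C1) for one gene from per-codon RFP read counts.

    codon_counts: read counts per codon, including codons upstream of the start
    start_index: index in codon_counts of codon 0 (assumed to be the start codon)
    length_codons: gene length in codons (assumed unit for 'len')
    """
    half = length_codons // 2
    if half <= 8:
        raise ValueError("gene too short for the [+8, len/2] window")
    if start_index < 2:
        raise ValueError("need at least two codons upstream of the start")

    # C0: mean count over the five codons -2 .. +2 around the start.
    c0 = sum(codon_counts[start_index - 2:start_index + 3]) / 5

    # C1: mean count over codons +8 .. len/2 (upper bound excluded, an assumption).
    c1 = sum(codon_counts[start_index + 8:start_index + half]) / (half - 8)
    return c0, c1


if __name__ == "__main__":
    # Toy gene of 40 codons with a peak at the start and uniform body coverage.
    counts = [0, 0] + [5, 5, 10, 5, 5] + [0] * 5 + [2] * 12 + [0] * 20
    print(rfp_components(counts, start_index=4, length_codons=40))
```

Skipping the first codons of the body keeps the accumulation of ribosomes at the initiation site out of the C1 average.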
Post-transcriptional control of translation after oxidative stress
[Figure: scatter plot of G(RFP) / G(0.001) against G(RNA) / G(0.001); gene classes: Others; RNA>1,RFP>1; RNA<-1,RFP<-1; RFP>1,RNA=0; RFP<-1,RNA=0; RNA>1,RFP=0; RNA<-1,RFP=0; RNA<-1,RFP>1; RNA>1,RFP<-1]

Thanks to
Lab members
• Steve Oden – Postdoctoral associate. Development of gene finding methods and software, gene content analysis in human and prokaryotes.
• Nathan Bird – Programmer with Acceleration.com.
• Anna Picca – Postdoctoral associate. RNA-seq and ribosome profiling.
• Ying Zhang – Postdoctoral associate. RNA-seq.
Collaborators
• Silvia Tornaletti (UF Dept. of Medicine). RNA biology.
• Shouguang Jin (UF Dept. of Molecular Genetics and Microbiology). P. aeruginosa samples and advice.
Funding
• NIH R01 GM08748501A2
• MGM, Genetics Institute, College of Medicine.