IslandPath: A computational aid for identifying genomic islands that may play a role in microbial pathogenicity

William Hsiao1*, Nancy Price2, Ivan Wan3, Steven J. Jones3, and Fiona S. L. Brinkman1

1 Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby
2 Department of Medical Genetics, University of British Columbia, Vancouver
3 Genome Sequence Centre, B.C. Cancer Agency, British Columbia, Canada

www.pathogenomics.bc.ca/brinkman

Abstract

As more genomes from bacterial pathogens are sequenced, it is becoming apparent that a significant proportion of virulence factors are encoded in clusters of genes, termed Pathogenicity Islands (reviewed in 1). These islands, and other genomic islands, tend to have atypical guanine and cytosine content (%G+C), contain mobility genes (e.g. transposases and integrases), and are associated with tRNA sequences. We have developed a web-based computational tool, IslandPath, to aid the visualization of these features in a full genome display, in order to facilitate the identification of genes in new genome sequences that may be involved in virulence or have horizontal origins. The ability to visualize these features within the genomic context can facilitate better detection of genomic island borders and neighbouring genes.

Atypical %G+C by itself is not indicative of the horizontal origin of the sequence involved; however, the predictive power increases when such regions are associated with mobile elements or direct repeats, or contain genes with similarity to known virulence factors. Therefore, we are incorporating into IslandPath algorithms to detect partial tRNAs in new genomic sequences that are likely to be remnants of phage insertion events, and are also comparing the genomic sequences to a custom-built database of a subset of known virulence factors.
Preliminary results are encouraging: IslandPath displays known Pathogenicity Islands as distinct regions within the genomes. This computational tool also permitted us to perform a more in-depth analysis of %G+C variance in genomes and enabled us to detect correlations not previously reported. As more genome data become available, tools like IslandPath, which can be updated in an automated fashion, will become valuable for genomic research.

Horizontal Gene Transfer and Bacterial Pathogenicity:

Several types of mobile elements have been shown to carry virulence factors:
  Transposons: ST enterotoxin genes in E. coli
  Prophages: Shiga-like toxins in EHEC; Diphtheria toxin gene; Cholera toxin; Botulinum toxins
  Plasmids: Shigella, Salmonella, Yersinia
  Pathogenicity Islands: Uro/Entero-pathogenic E. coli; Salmonella typhimurium; Yersinia spp.; Helicobacter pylori; Vibrio cholerae

Whole Genome (predicted) ORF Display:

Genome ORFs are displayed to allow interesting regions (rich in mobility genes, abnormal %G+C, close to structural RNAs) to be viewed in a genome context. Example: H. pylori 26695 genome.

IslandPath Graphical Display:

Each dot in a graphic corresponds to a predicted protein-coding ORF in the genome. Dot colours indicate whether an ORF has a higher or lower %G+C than the cutoffs you set (the default cutoffs are +/- 3.48* %G+C from the mean). You may click on a dot to view a portion of an annotation table presented below the graphic.

* 3.48 = 1.5 S.D. of the mean for Chlamydia genomes, which are proposed to have undergone no recent horizontal gene transfer (data not shown).

Several low %G+C regions can be seen in the graphic display:
  CAG island
  A region containing virB homologues; not present in strain J99
  Plasticity zone (contains different genes in J99 and 26695)

Detection of Known Pathogenicity Islands:

Yersinia pestis strain CO92: High Pathogenicity Island core (in red rectangle). Mean: 47.9; STD DEV: 4.9.

%GC | S.D. | Location | Orientation | Product
56.48 | +1 | 2140840..2142861 | | pesticin/yersiniabactin receptor protein
58.81 | +2 | 2142992..2144569 | | yersiniabactin siderophore biosynthetic protein
58.33 | +2 | 2144573..2145376 | | yersiniabactin biosynthetic protein YbtT
60.40 | +2 | 2145373..2146473 | | yersiniabactin biosynthetic protein YbtU
60.79 | +2 | 2146470..2155961 | | yersiniabactin biosynthetic protein
60.15 | +2 | 2156049..2162156 | | yersiniabactin biosynthetic protein
56.35 | +1 | 2162347..2163306 | | transcriptional regulator YbtA
57.29 | +1 | 2163473..2165275 | + | lipoprotein inner membrane ABC-transporter
58.62 | +2 | 2165262..2167064 | + | inner membrane ABC-transporter YbtQ
59.48 | +2 | 2167057..2168337 | + | putative signal transducer
55.25 | +1 | 2168365..2169669 | + | putative salicylate synthetase
52.65 | | 2169863..2171125 | | integrase

Vibrio cholerae chromosome I: VPI (toxin regulated pili), delineated as a stretch of low %G+C flanked by mobility genes.

Detection of Proposed or Potential Genomic Islands:

Escherichia coli O157:H7: The area displayed in the white rectangle is ~28 kb in size (from 3708 kbp to 3736 kbp) and contains Type III Secretion proteins (Epr's, Epa's, and Eiv's) and numerous hypothetical proteins with unknown functions.

Vibrio cholerae chromosome I: The area displayed in the red rectangle is ~34 kb in size (from 1896 kbp to 1930 kbp) and contains a tRNA-ser in the same orientation as the phage integrase downstream of it.

Methods:

  Core scripts written in Perl and CGI/Perl
  Sequence data: NCBI Genome FTP site
  Potential mobility elements: COG analysis (2, 3) plus keyword scan
  RNA locations: NCBI data plus tRNAscan-SE (4)
  %G+C calculated for each ORF
  Mean and Std. Dev. calculated for all ORFs in the genome
  File containing all ORF information used to generate a graphical representation
  Virulence Gene Subset (VGS) database developed through literature analysis of genes identified as virulence factors using the "Molecular Koch's Postulates" (i.e. gene knockout affects virulence)
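The per-ORF %G+C calculation, the genome-wide mean and standard deviation, the cutoff-based colouring, and the keyword scan for mobility genes can be sketched as follows. This is a minimal Python illustration, not the IslandPath code itself (which the poster states was written in Perl). The function names, the two-word keyword list, the use of the population standard deviation, and the reading of the S.D. column as a deviation truncated to whole standard deviations are all assumptions made for this example.

```python
import math
from statistics import mean, pstdev

# Illustrative keyword list (assumption): the poster names transposases and
# integrases as examples of mobility genes but does not give its full list.
MOBILITY_KEYWORDS = ("transposase", "integrase")


def gc_percent(seq):
    """Return the %G+C of a nucleotide sequence (assumes only A, C, G, T)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def sd_units(value, mu, sd):
    """Deviation from the mean, truncated to whole standard deviations."""
    return math.trunc((value - mu) / sd) if sd > 0 else 0


def is_mobility_gene(product):
    """Keyword scan of an ORF's product annotation."""
    text = product.lower()
    return any(keyword in text for keyword in MOBILITY_KEYWORDS)


def classify_orfs(orfs, cutoff=3.48):
    """Label each ORF relative to the genome mean %G+C.

    orfs maps an ORF name to a (sequence, product) pair. cutoff is in
    %G+C points; 3.48 is the default quoted on the poster.
    """
    gc = {name: gc_percent(seq) for name, (seq, _) in orfs.items()}
    values = list(gc.values())
    mu = mean(values)
    sd = pstdev(values)  # population S.D. (assumption)
    table = {}
    for name, (_, product) in orfs.items():
        diff = gc[name] - mu
        if diff > cutoff:
            label = "high"
        elif diff < -cutoff:
            label = "low"
        else:
            label = "typical"
        table[name] = {
            "gc": gc[name],
            "sd_units": sd_units(gc[name], mu, sd),
            "label": label,
            "mobility": is_mobility_gene(product),
        }
    return mu, sd, table


# Hypothetical toy genome: four 20-bp "ORFs" with made-up annotations.
demo = {
    "orf1": ("ATGC" * 5, "hypothetical protein"),
    "orf2": ("GCTA" * 5, "putative signal transducer"),
    "orf3": ("GC" * 7 + "AT" * 3, "integrase"),
    "orf4": ("AT" * 7 + "GC" * 3, "hypothetical protein"),
}
mu, sd, table = classify_orfs(demo)
for name, row in table.items():
    print(name, row)
```

Applied to the Yersinia pestis CO92 figures (mean 47.9, S.D. 4.9), `sd_units(60.40, 47.9, 4.9)` gives 2, which agrees with the +2 listed for that ORF; whether IslandPath truncates or rounds is not stated on the poster.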
Vibrio cholerae chromosome I candidate region, ORF content: The ORFs in the ~34 kb region (1896 kbp to 1930 kbp) include one putative helicase, one chemotaxis protein MotB-related protein, one putative type I restriction enzyme HsdR, one putative DNA methylase, one putative N-acetylneuraminate lyase, one C4-dicarboxylate-binding periplasmic protein, and numerous hypothetical proteins and conserved hypothetical proteins.

A tRNA adjacent to an abnormal %G+C region is often observed to be in the same orientation as the stretch. This might be an artefact of phage insertion and excision events, as the 3' ends of tRNAs are common phage attachment (att) sites.

%G+C Analysis for Complete Genome Sequences:

Bacterial Pathogens | Primary Diseases | Cellular Localization | # of ORFs | %G+C Mean (ORFs >300bp) | %G+C S.D. (ORFs >300bp)
Neisseria meningitidis serogroup B strain MC58 | meningitis | extracellular | 2025 | 52.4 | 6.9
Neisseria meningitidis serogroup A strain Z2491 | meningitis | extracellular | 2121 | 52.6 | 6.5
Xylella fastidiosa | Citrus variegated chlorosis | extracellular | 2766 | 53.4 | 5.4
Escherichia coli O157:H7 (E. coli O157:H7_EDL933) | diarrhoea | facultative intracellular | 5361 (5349) | 51.1 (51.9) | 5.3 (5.3)
Mycoplasma pneumoniae M129 | mycoplasmal pneumonia ("walking pneumonia") | extracellular | 677 | 40.3 | 4.9
Yersinia pestis strain CO92 | bubonic plague and pneumonic plague | facultative intracellular | 3885 | 48.3 | 4.7
Streptococcus pneumoniae TIGR4 (S. pneumoniae R6) | bacterial pneumonia, meningitis, sepsis, and otitis media | extracellular | 2094 (2043) | 40.3 (40.4) | 4.4 (4.3)
Treponema pallidum Nichols | syphilis | extracellular | 1031 | 51.4 | 4.2
Mycoplasma pulmonis | murine respiratory mycoplasmosis | extracellular | 782 | 27.2 | 3.8
Pseudomonas aeruginosa PAO1 | variety of mucosal infections (opportunistic) | extracellular | 5565 | 67.0 | 3.8
Rickettsia conorii Malish 7 | Mediterranean spotted fever | obligate intracellular | 1374 | 32.4 | 3.8
Ureaplasma urealyticum serovar 3 | urethritis | extracellular | 613 | 25.8 | 3.8
Vibrio cholerae N16961 | cholera | extracellular | I: 2736, II: 1092 | I: 48.1, II: 46.9 | I: 3.7, II: 4.3
Borrelia burgdorferi B31 | Lyme disease | facultative intracellular | 851 | 28.7 | 3.6
Streptococcus pyogenes | scarlet fever, toxic shock like syndrome | extracellular | 1696 | 38.9 | 3.6
Mycoplasma genitalium G37 | urethritis (opportunistic, usually HIV patients) | extracellular | 484 | 31.4 | 3.5
Campylobacter jejuni NCTC11168 | gastroenteritis | extracellular | 1654 | 30.6 | 3.5
Helicobacter pylori 26695 (H. pylori J99) | peptic ulcers and gastritis | extracellular | 1566 (1491) | 39.4 (39.7) | 3.4 (3.3)
Haemophilus influenzae Rd-KW20 | upper respiratory infection, meningitis | extracellular | 1709 | 38.5 | 3.4
Mycobacterium tuberculosis CDC1551 (M. tuberculosis H37Rv) | tuberculosis | facultative intracellular | 4187 (3918) | 65.5 (65.6) | 3.3 (3.3)
Pasteurella multocida PM70 | fowl cholera, cattle septicemia, etc. | extracellular | 2014 | 40.8 | 3.3
Rickettsia prowazekii Madrid E | epidemic typhus | obligate intracellular | 834 | 30.1 | 3.3
Staphylococcus aureus Mu50 (S. aureus N315) | food poisoning, toxic shock syndrome, necrotizing fasciitis | extracellular | 2714 (2595) | 33.3 (32.2) | 3.0 (3.0)
Mycobacterium leprae | leprosy | obligate intracellular | 2720 | 60.0 | 2.9
Agrobacterium tumefaciens C58 (Cereon) | crown gall (in plants) | extracellular | c: 2721, l: 1833 | c: 59.8, l: 59.7 | c: 2.7, l: 2.9
Chlamydophila pneumoniae AR39 (C. pneumoniae J138) [C. pneumoniae CWL029] | chlamydial pneumonia | obligate intracellular | 1110 (1070) [1052] | 41.1 (41.1) [41.1] | 2.6 (2.6) [2.6]
Chlamydia trachomatis D | chlamydia | obligate intracellular | 894 | 41.5 | 2.3
Chlamydia muridarum MoPn | chlamydia | obligate intracellular | 909 | 40.8 | 2.2

Non-pathogens | # of ORFs | %G+C Mean (ORFs >300bp) | %G+C S.D. (ORFs >300bp)
Escherichia coli K12 | 4289 | 51.3 | 4.7

Frequencies of ORF %G+C in Genomes:

Histograms of frequencies of %G+C were plotted for several organisms.

Observations:
  Lowest kurtosis occurs most commonly with a mode of 33.33% for %G+C values of ORFs in a genome (e.g. M. jannaschii DSM2661). This G+C value corresponds to maximum A/T in synonymous sites for the standard codon usage table.
  Long tails in the frequency plots occur more frequently downward (e.g. H. pylori J99 and N. meningitidis) than upward.
  These observations likely reflect either a bias in gene identification in high G+C genomes, or selection toward higher A+T content.

%G+C Analysis General Observations:

  High %G+C variance is associated with species with evidence of recent horizontal gene transfers (e.g. N. meningitidis).
  Low %G+C variance is associated with highly clonal species and species with no evidence of horizontal gene transfers (e.g. Chlamydia species, which are obligate intracellular microbes thought to have been ecologically isolated from other bacteria for a longer period than other obligate intracellular bacteria).
  %G+C variance is similar within a single species, with the exception of the two V. cholerae chromosomes and the two E. coli strains. However, chromosome II of V. cholerae appears to have originated from a megaplasmid captured by Vibrio (5). For E. coli, the pathogenic strain O157:H7 has higher %G+C variance. This might be due to the presence of PAIs and other potentially horizontally transferred genetic elements.

Discussion:

IslandPath appears to be an effective automated tool to visualize and detect genomic islands. Previous reports have expressed concern about the use of %G+C to detect HGT; however, these reports examined %G+C for individual genes. We propose that %G+C analysis is effective if clusters of genes containing motifs associated with mobility elements are considered. Foreign genes with %G+C similar to that of the organism's genome are not detected and, due to gene amelioration, only "recent" HGT can be detected. This tool represents one approach that can be complemented with others to prioritize particular genomic islands that merit further research.

Future developments:

  Virulence factor homology search (based on comparison to our VGS dataset)
  Alternative DNA signatures (e.g. codon usage)
  Allow users to input their own sequences for analysis

References

1 Hacker J and Kaper JB, 2000, Annu Rev Microbiol. 54:641-79
2 Tatusov RL, et al., 1997, Science 278(5338):631-7
3 Tatusov RL, et al., 2001, Nucleic Acids Res. 29(1):22-8
4 Lowe TM and Eddy SR, 1997, Nucleic Acids Res. 25(5):955-64
5 Heidelberg JF, et al., 2000, Nature 406:477-84

Acknowledgements

This project is funded by the Peter Wall Institute for Advanced Studies. We wish to thank Tatiana Tatusov of NCBI for providing helpful files for IslandPath and acknowledge the efforts of the many genome projects that have made our analysis possible.