Int. J. Bioinformatics Research and Applications, Vol. 11, No. 4, 2015

Opposite nucleotide usage biases in different parts of the Corynebacterium diphtheriae spaC gene

Vladislav Victorovich Khrustalev* and Eugene Victorovich Barkovsky
Department of General Chemistry, Belarusian State Medical University, Dzerzinskogo 83, Minsk, Belarus
Email: [email protected]
Email: [email protected]
*Corresponding author

Valentina Leonidovna Kolodkina
Vaccine Preventable Diseases Laboratory, Republican Research and Practical Centre for Epidemiology and Microbiology, Filimonova 23, Minsk, Belarus
Email: [email protected]

Tatyana Aleksandrovna Khrustaleva
Laboratory of Cellular Technologies, Institute of Physiology of the National Academy of Sciences of Belarus, Academicheskaya 28, Minsk, Belarus
Email: [email protected]

Abstract: In this work we describe a bacterial open reading frame whose two parts show nucleotide usage biases of opposite directions. The level of GC-content in third codon positions (3GC) is 40.17 ± 0.22% along most of the length of the Corynebacterium diphtheriae spaC gene. However, in the 3'-end of the same gene (from codon #1600 to codon #1873) the 3GC level is 64.61 ± 0.91%. Using original methodology (‘VVTAK Sliding window’ and ‘VVTAK VarInvar’) we confirmed that there is an ongoing mutational AT-pressure along most of the length of the spaC gene (up to codon #1599) and an ongoing mutational G-pressure in the 3'-end of spaC. Intragenic promoters predicted by three different methods may cause the differences in the preferred types of nucleotide mutations between the parts of spaC, because those parts may be transcribed autonomously.
Keywords: mutational pressure; transcription-associated mutational pressure; asymmetric mutational pressure; genomic islands; pathogenicity islands; Corynebacterium diphtheriae; Corynebacterium ulcerans; spaC; adhesion; pili; nucleotide mutations; promoter prediction; terminator prediction; intragenic promoters; intragenic terminators.

Copyright © 2015 Inderscience Enterprises Ltd.

Reference to this paper should be made as follows: Khrustalev, V.V., Barkovsky, E.V., Kolodkina, V.L. and Khrustaleva, T.A. (2015) ‘Opposite nucleotide usage biases in different parts of the Corynebacterium diphtheriae spaC gene’, Int. J. Bioinformatics Research and Applications, Vol. 11, No. 4, pp.347–365.

Biographical notes: Vladislav Victorovich Khrustalev, PhD, is an Associate Professor in the Department of General Chemistry at the Belarusian State Medical University. His research interests are in the areas of biochemistry, computational biology, immunology, virology, microbiology, genomics, proteomics and bioinformatics. Most of his scientific projects are connected with the mutational pressure theory.

Eugene Victorovich Barkovsky, PhD, Professor, is the Head of the Department of General Chemistry at the Belarusian State Medical University. His research interests are in the areas of biochemistry, molecular biology, molecular evolution, proteomics, immunology, computational biology and bioinformatics. He has written over 400 research articles and six monographs.

Valentina Leonidovna Kolodkina, PhD, is a senior researcher in the Vaccine Preventable Diseases Laboratory at the Republican Research and Practical Centre for Epidemiology and Microbiology, Minsk, Belarus. Her research interests are in the fields of microbiology, immunology, vaccine design, molecular phylogenetics and molecular evolution.
Tatyana Aleksandrovna Khrustaleva is a researcher in the Laboratory of Cellular Technologies at the Institute of Physiology of the National Academy of Sciences of Belarus. Her research interests are in the fields of biochemistry, ligand-protein interactions, bioinformatics and comparative genomics.

1 Introduction

Mutational pressure is the resulting direction of all the nucleotide mutations happening in a genome, a gene, or even a fragment of a gene. Some types of nucleotide mutations either occur more frequently than others or are repaired less efficiently than others (Sueoka, 1988). As a result, the usage of nucleotides prone to more frequent mutations decreases in the genome, gene or gene fragment in a given population. For a bacterial genome (consisting of double-stranded DNA), the sum of guanine and cytosine usages (G+C) can be used to characterise the direction of symmetric mutational pressure (Sueoka, 1988; Khrustalev and Barkovsky, 2010a). Mutations leading to an increase or decrease of GC-content are fixed more frequently in third codon positions (Sueoka, 1988). This happens because there are 32 fourfold degenerated sites in third codon positions: all nucleotide substitutions in those sites are synonymous (they do not cause amino acid substitutions in the protein) (Nei and Kumar, 2000). There are also 14 twofold degenerated sites in third codon positions in which thymine to cytosine and cytosine to thymine mutations are synonymous, and 12 twofold degenerated sites in which adenine to guanine and guanine to adenine mutations are synonymous (Khrustalev and Barkovsky, 2010b).

For a single bacterial gene, the sum of guanine and cytosine levels cannot characterise nucleotide usage biases properly. There can be significant differences between the usages of guanine and cytosine, as well as between the usages of adenine and thymine, in a single gene.
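The counts of degenerated sites quoted in the Introduction (32 fourfold, 14 twofold T/C and 12 twofold A/G) can be checked against the standard genetic code. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and it assumes that the figure of 12 counts the TAA/TAG stop pair, since only 10 sense codons fall into the A/G class.

```python
# Standard genetic code in TCAG order; '*' marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def third_position_class(codon):
    """Classify the third position of a codon as '4f', '2f_TC', '2f_AG' or None."""
    prefix, meaning = codon[:2], CODE[codon]
    # Third-position nucleotides that keep the codon's meaning unchanged.
    same = {b for b in BASES if CODE[prefix + b] == meaning}
    if same == set(BASES):
        return "4f"
    if same == {"T", "C"}:
        return "2f_TC"
    if same == {"A", "G"}:
        return "2f_AG"
    return None  # e.g. ATA (threefold), ATG, TGG, TGA


def count_classes(include_stops=True):
    """Count codons of each degeneracy class (assumption: stops optionally counted)."""
    counts = {"4f": 0, "2f_TC": 0, "2f_AG": 0}
    for codon, meaning in CODE.items():
        if meaning == "*" and not include_stops:
            continue
        cls = third_position_class(codon)
        if cls is not None:
            counts[cls] += 1
    return counts


print(count_classes())        # counts with the TAA/TAG pair included
print(count_classes(False))   # counts for sense codons only
```

Isoleucine codons (ATT, ATC, ATA) are threefold degenerated and therefore fall into none of the three classes in this sketch.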
Biases of that kind are formed by asymmetric mutational pressures (Lobry and Sueoka, 2002). That is why such indices as nucleotide content in fourfold (T4f, A4f, C4f and G4f) and twofold degenerated sites from third codon positions (T2f3p, A2f3p, C2f3p and G2f3p) were suggested for the description of asymmetric mutational pressure (Khrustalev and Barkovsky, 2012). Replication-associated mutational pressure makes nucleotide usage biases in genes from lagging strands of DNA different from those in genes from leading strands (Lobry and Sueoka, 2002). Transcription-associated mutational pressure is responsible for different nucleotide usage biases in differentially expressed genes (Chen and Chen, 2007). From that point of view, information on nucleotide usage biases can be used in studies on differential transcription.

Previously we described a eukaryotic gene (the gene encoding platelet phosphofructokinase of birds from the Passeriformes order) with short regions of elevated GC-content which may be associated with autonomous transcription of microRNA precursors (Khrustalev et al., 2014). In the present work we describe a situation in which autonomously transcribed elements can be suggested inside a single bacterial open reading frame because of the presence of opposite nucleotide usage biases. To visualise those biases we used an original MS Excel-based algorithm entitled ‘VVTAK Sliding Window’ (http://chemres.bsmu.by/VVK%20SW.htm).

An interesting fact about nucleotide usage biases is that they are retrospective: they still exist long after the point in time when the change in mutational pressure direction happened (Khrustalev and Barkovsky, 2010a). Mutational pressure can change its direction for reasons such as a mutation in a gene encoding an enzyme of the DNA repair system. After that change in mutational pressure direction, nucleotide usage biases will start to change too.
However, it may take a long time to establish a new equilibrium between the rates of nucleotide mutations and the usages of nucleotides. That is why it is important not just to study nucleotide usage biases, but also to check whether mutational pressure is ongoing. The aim of the present study was to check for the existence of ongoing mutational pressures of different directions inside a single bacterial open reading frame. To that end we developed a new computer algorithm entitled ‘VVTAK VarInvar’ (http://chemres.bsmu.by/VVK%20VarInvar.htm).

Pili (polymeric adhesins linked to the bacterial cell wall) play an important role in corynebacterial pathogenesis. There are three separate pilus gene clusters in the first complete genome sequence of Corynebacterium diphtheriae (strain NCTC13129, a clinical isolate from the UK) (Cerdeño-Tárraga et al., 2003). Three types of polymeric pili are available for C. diphtheriae: spaA-, spaD- and spaH-type pili (Ton-That and Schneewind, 2003). All three types of pili have a similar architecture: a polymeric shaft made of numerous major pilins is joined to a specific tip pilin and a base pilin (minor pilins). The specific binding of corynebacteria to pharyngeal epithelial cells is attributed to the two minor pilins (spaB and spaC), which can exist not only as the base and the tip of the polymeric pilus but, most intriguingly, can also be linked to the bacterial cell wall in monomeric and heterodimeric forms (Chang et al., 2011).

In this study we show that there are different directions of mutational pressure in different regions of the same open reading frame coding for spaC. This heterogeneous GC-content distribution along the length of a single gene cannot be understood without a description of the heterogeneous GC-content distribution along the length of the C. diphtheriae genome. According to our hypothesis, both heterogeneities are consequences of transcription-associated mutational pressure.
Other genes with heterogeneous GC-content distribution along their lengths have been found in the C. diphtheriae genome and in the genomes of its close relatives from the same genus. There are genes homologous to spaC in the genomes of other Corynebacterium species. 3GC is higher than 50% along most of their lengths. So, one cannot suggest that the specific nucleotide usage biases in the spaC gene have been formed because of recent gene fusion event(s). There is a possibility of partial gene transfer (Boc et al., 2012) of the GC-rich 3'-end of spaC from other species to the GC-poor C. diphtheriae spaC gene. In any case, there should be some inner causes which do not let those nucleotide usage biases fade away over time. According to the results of promoter and terminator predictions, there might be autonomously transcribed elements in the spaC gene which may be involved in the regulation of adhesion.

2 Materials and methods

2.1 Materials

As material we used nucleotide sequences of the spaC gene from completely sequenced and annotated (Trost et al., 2012) genomes of different C. diphtheriae strains: NCTC 13129 (NC_002935), Va01 (NC_016790), 241 (NC_016782), BH8 (NC_016800), CDCE 8392 (NC_016785), HC01 (NC_016786), HC02 (NC_016802), HC03 (NC_016787), NCTC 5011 (AJVH01000021), and HC04 (NC_016788). The last genome contains two copies of spaC.

A BLAST search (http://blast.ncbi.nlm.nih.gov/Blast.cgi) has been performed with the amino acid sequence of spaC from the C. diphtheriae NCTC 13129 strain as a query. Several sequences of homologous proteins (also entitled spaC in some species) which belong to other species from the Corynebacterium genus have been found. Nucleotide sequences coding for those spaC homologues have been included in the current study: Corynebacterium ulcerans 809 spaC (NC_017317: nucleotides 2229085 – 2234643; 38.3% of identical amino acid residues with C. diphtheriae spaC); C. ulcerans 0102 (AP012284; 2302504 – 2308062; 38.4%); C. ulcerans BR-AD22 spaC (NC_015683; 2324724 – 2330282; 38.0%); Corynebacterium pseudotuberculosis 258 (NC_017945; 2058211 – 2063823; 40.1%); Corynebacterium casei UCMA 3821 (NZ_CAFW01000104; 509 – 6160; 38.3%).

The data on codon usage in each coding region along the complete genome of C. diphtheriae NCTC 13129 have been downloaded from the Codon Usage Database (Nakamura et al., 2000) (www.kazusa.or.jp/codon). Since there are no records on complete C. ulcerans, C. pseudotuberculosis and C. casei genomes in the Codon Usage Database, we used their nucleotide sequences and protein tables from GenBank to calculate nucleotide usage biases in each coding region along their lengths. We used the genomes of C. ulcerans strain 809 (NC_017317), C. pseudotuberculosis strain 258 (NC_017945) and C. casei strain LMG S-19264 (CP004350).

2.2 Methods

Each of the nucleotide sequences coding for spaC (from C. diphtheriae and other Corynebacterium species) has been studied with the help of the ‘VVTAK Sliding Window’ algorithm (http://chemres.bsmu.by/VVK%20SW.htm). Nucleotide content distribution along the length of each coding region has been calculated in sliding windows 150 codons in length (the longest window length available in that algorithm). The algorithm calculates nucleotide usage in three codon positions, as well as in fourfold and twofold degenerated sites from third codon positions, in each sliding window. The ‘VVTAK Sliding Window’ algorithm works with a single nucleotide sequence no more than 3000 nucleotides in length. The user should copy a nucleotide sequence into the designated cell of the ‘Sequence’ sheet and write the length of the sliding window (no more than 150 codons) in another designated cell of the same sheet.
The inserted nucleotide sequence will be cut into short fragments of the required length (the step of the sliding window is equal to one codon), and all the indices describing nucleotide usage in each of those fragments will appear in columns of the ‘Results’ sheet.

Alignment of spaC from different C. diphtheriae strains has been performed by the PAM algorithm included in the MEGA 5.1 program (Tamura et al., 2011). All the gaps have been deleted from that alignment. Then the alignment has been cut into three parts, according to the changes of biases in nucleotide content distribution along spaC coding regions: part 1 (from codon #1 to codon #400 relative to the NCTC 13129 spaC), part 2 (from codon #401 to codon #1599), and part 3 (from codon #1600 to codon #1873). The first border (codon #400) is a compromise between the intersection points for G4f and A4f (codon #350), for C4f and T4f (codon #450), for G2f3p and A2f3p (codon #300) and for C2f3p and T2f3p (codon #500). The second border (codon #1600) is the point at which 3GC becomes higher than 50%.

The percentage of different types of variable fourfold degenerated sites has been calculated in the three alignments with the help of the ‘VVTAK VarInvar’ algorithm (http://chemres.bsmu.by/VVK%20VarInvar.htm). The ‘VVTAK VarInvar’ algorithm works with an alignment of nucleotide sequences (the alignment may contain up to 100 sequences, each no more than 4000 nucleotides in length).
The logical flow of the ‘VVTAK VarInvar’ algorithm includes:
(a) determination of fourfold and twofold degenerated sites in third codon positions in each of the sequences;
(b) finding those twofold and fourfold sites which stay twofold and fourfold degenerated, respectively, in all the sequences from the alignment (we called them stable twofold and stable fourfold degenerated sites);
(c) calculation of nucleotide content in stable twofold and fourfold degenerated sites for each of the sequences;
(d) finding invariable sites among stable twofold and fourfold degenerated sites;
(e) calculation of nucleotide content in invariable twofold and fourfold degenerated sites (that index is the same for each of the sequences);
(f) testing whether the difference between nucleotide usage in stable and invariable sites is significant, using a two-tailed t-test.

The ‘VVTAK VarInvar’ algorithm does not use those sites which are fourfold (or twofold) degenerated only in part of the sequences. Those ‘unstable’ sites are formed because of mutations in first or second codon positions and might introduce significant bias into the calculations.

We used three methods for promoter prediction: ‘BPROM’, available via the SoftBerry server (http://linux1.softberry.com); the ‘NNPP’ program (http://fruitfly.org/seq_tools/promoter.html) (Reese, 2001); and the ‘PromPredict’ program (http://nucleix.mbu.iisc.ernet.in/prompredict/prompredict.html) (Rangannan and Bansal, 2009).

Three methods for the prediction of Rho-independent transcription terminators were used in this study: the ‘Erpin’ (Gautheret and Lambert, 2001) and ‘RNAmotif’ (Macke et al., 2001; Lesnik et al., 2001) methods, both included in the ‘ARNold’ algorithm (http://rna.igmors.u-psud.fr/toolbox/arnold/index.php), as well as the ‘RibEx’ server (http://132.248.32.45/cgi-bin/ribex.cgi) (Abreu-Goodger and Merino, 2005).
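Steps (a) to (e) of the ‘VVTAK VarInvar’ flow described above can be illustrated for fourfold degenerated sites with a short sketch. This is not the authors' Excel implementation: the function names are hypothetical, the input is assumed to be a gap-free, in-frame alignment containing only A, T, G and C, and step (f) (the t-test) is left out because the text does not specify how the test is set up.

```python
# Standard genetic code in TCAG order; '*' marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def is_fourfold(codon):
    """Step (a): True if any third-position nucleotide gives the same amino acid."""
    amino_acid = CODE[codon]
    return amino_acid != "*" and all(CODE[codon[:2] + b] == amino_acid for b in BASES)


def percent(bases):
    """Usage (%) of each nucleotide in a list of single-letter bases."""
    if not bases:
        return {}
    return {b: 100.0 * bases.count(b) / len(bases) for b in "ATGC"}


def fourfold_usage(seqs):
    """Return (per-sequence usage in stable sites, usage in invariable sites)."""
    stable, invariable = [], []
    for i in range(len(seqs[0]) // 3):
        codons = [s[3 * i:3 * i + 3].upper() for s in seqs]
        # Step (b): keep the site only if it is fourfold in every sequence.
        if all(is_fourfold(c) for c in codons):
            column = [c[2] for c in codons]
            stable.append(column)
            # Step (d): a stable site with a single nucleotide is invariable.
            if len(set(column)) == 1:
                invariable.append(column[0])
    # Step (c): one usage profile per sequence over all the stable sites.
    per_sequence = [percent([col[k] for col in stable]) for k in range(len(seqs))]
    # Step (e): one profile shared by all sequences over the invariable sites.
    return per_sequence, percent(invariable)


stable_usage, invariable_usage = fourfold_usage(["GCTGCAATG", "GCCGCAATG"])
print(stable_usage)
print(invariable_usage)
```

In the toy alignment the first codon is a stable but variable fourfold site (T in one sequence, C in the other), the second is a stable invariable site, and the third (ATG) is not degenerated at all.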
Consensus secondary structures have been predicted for regions homologous to the two predicted Rho-independent terminators from the C. diphtheriae Va01 strain spaC with the help of the CentroidFold algorithm (Hamada et al., 2009) (http://www.ncrna.org/centroidfold/). Promoter and Rho-independent terminator predictions have been performed in spaC coding regions from 11 strains of C. diphtheriae and in its homologues from other Corynebacterium species.

The graph demonstrating GC-content distribution between three codon positions along the length of the C. diphtheriae NCTC 13129 strain chromosome has been built with the help of the ‘Chore Viewer’ algorithm (Khrustalev and Barkovsky, 2012) (http://chemres.bsmu.by/CRV.htm). As an input that algorithm uses a record from the Codon Usage Database (Nakamura et al., 2000) (www.kazusa.or.jp/codon) describing codon usage in each of the coding regions along the length of a genome. The graphs demonstrating GC-content distribution in three codon positions along the length of the C. ulcerans 809, C. pseudotuberculosis 258 and C. casei LMG S-19264 genomes have been created with the help of the ‘VVK Protective Buffer’ algorithm (http://chemres.bsmu.by/VVK%20Protective%20buffer.htm). That algorithm calculates a number of indices describing nucleotide usage (including G+C, 1GC, 2GC and 3GC) and amino acid usage in a set of protein coding regions (it can work simultaneously with 100 sequences, each no more than 10000 nucleotides in length).

We searched for additional examples of coding regions with heterogeneous 3GC distribution in the complete genomes of C. diphtheriae NCTC 13129, C. ulcerans 809, C. pseudotuberculosis 258 and C. casei LMG S-19264. Each gene has been studied with the help of the ‘VVTAK Sliding Window’ algorithm with a sliding window length of 150 codons. For each of those genes we calculated the difference in 3GC between the window with the highest 3GC level and the window with the lowest 3GC level (∆3GC150).
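The ∆3GC150 index defined above can be sketched in a few lines. This is an illustration under stated assumptions rather than the ‘VVTAK Sliding Window’ code itself: the function names are hypothetical, the input is assumed to be an in-frame coding sequence, and genes shorter than one window are given a value of zero.

```python
def gc3_windows(cds, window=150):
    """3GC (%) in every window of `window` codons, moving one codon at a time."""
    third = cds.upper()[2::3]          # third-position nucleotide of each complete codon
    n_codons = len(third)
    if n_codons < window:
        return []                      # gene shorter than a single window
    is_gc = [1 if base in "GC" else 0 for base in third]
    total = sum(is_gc[:window])
    values = [100.0 * total / window]
    for start in range(1, n_codons - window + 1):
        # Slide by one codon: add the entering site, drop the leaving one.
        total += is_gc[start + window - 1] - is_gc[start - 1]
        values.append(100.0 * total / window)
    return values


def delta_gc3(cds, window=150):
    """Difference between the highest and the lowest 3GC among all windows."""
    values = gc3_windows(cds, window)
    return max(values) - min(values) if values else 0.0


# Toy example with a window of 2 codons instead of 150.
print(gc3_windows("AAAAAAAAGAAG", window=2))
print(delta_gc3("AAAAAAAAGAAG", window=2))
```

Whether the terminal stop codon is included in the last windows is not stated in the text, so the sketch simply uses whatever sequence is passed in.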
Genes were sorted according to the ∆3GC150 value for each genome.

3 Results

3.1 Distribution of GC-content in three codon positions along the length of spaC from C. diphtheriae NCTC 13129

As one can see in Figure 1, the 3GC distribution along the length of the gene coding for spaC is quite heterogeneous. In the first 400 codons of this gene the 3GC level varies approximately from 45 to 50% (the average level for all the sliding windows is 45.51 ± 0.36%). In the next part of the gene (from codon #401 to codon #1599) the 3GC level becomes lower: it varies around 40% (the average level is 38.72 ± 0.20%). Interestingly, in the last part of the gene coding for spaC (starting from codon #1600) 3GC becomes much higher (64.61 ± 0.91%). That last (GC-rich) part of the gene is rather long (273 codons), even though it is shorter than the two previous parts.

Figure 1 GC-content in three codon positions (1GC, 2GC and 3GC) in sliding windows 150 codons in length along the spaC gene from C. diphtheriae NCTC 13129

There are proteins homologous to C. diphtheriae spaC in other species from the Corynebacterium genus. 3GC is higher than 50% in spaC homologues from C. casei, C. ulcerans, and C. pseudotuberculosis. The data described above allow us to reject the hypothesis of a lateral transfer of the complete spaC gene to C. diphtheriae from species with low GC-content. First, 3GC is high in spaC homologues from closely related species of the same genus. Second, 3GC is not decreased in the last part of C. diphtheriae spaC: it is even higher than in the homologous regions of C. ulcerans and C. pseudotuberculosis. Theoretically, a partial lateral transfer might have happened, although the probability of such an event is not very high (Boc et al., 2012).

3.2 Nucleotide usage biases in fourfold and twofold degenerated sites from third codon positions along the length of C. diphtheriae NCTC 13129 spaC

More specific changes in nucleotide content distribution along the length of the spaC gene from C. diphtheriae NCTC 13129 are represented in Figure 2. In Figure 2A one can see that in the middle of the spaC gene adenine content in fourfold degenerated sites (A4f) is much higher than guanine content (G4f). In the beginning of the coding region A4f is close to G4f, while in the end of the gene G4f is much higher than A4f (see Figure 2A). A similar tendency can be observed in Figure 2C. The usage of C4f is higher than the usage of T4f along the most of the spaC gene. However, there are peaks of C4f usage near the codons #200, #400 and #1600 (Figure 2B). In twofold degenerated sites the usage of thymine (T2f3p) is much higher than the usage of cytosine (C2f3p) in the middle part of the spaC coding region (Figure 2D).

Figure 2 Guanine and adenine (A, C), and cytosine and thymine (B, D) content in fourfold degenerated (A, B) and twofold degenerated sites from third codon positions (C, D) in sliding windows 150 codons in length along the spaC gene from C. diphtheriae NCTC 13129

According to the nucleotide usage biases along the length of the spaC gene, there should be mutational AT-pressure in the middle part of that gene and G-pressure in its 3'-end. We separated the spaC coding region into three parts. The first part ends at codon #400, according to the positions of the intersection points for G4f and A4f (codon #350), for C4f and T4f (codon #450), for G2f3p and A2f3p (codon #300) and for C2f3p and T2f3p (codon #500). The last part starts from codon #1600, where the lines corresponding to G4f and A4f usages (and to G2f3p and A2f3p usages) cross each other, making the 3GC level higher than 50% for the 3'-end of the gene.

3.3 Evidence of the ongoing mutational AT-pressure in the middle part of C. diphtheriae spaC and the ongoing mutational G-pressure in its 3'-end

In the next step of the study we used 11 sequences of spaC.
In Table 1 one can see that the levels of A4f, T4f, A2f3p and T2f3p are higher in invariable sites (sites without nucleotide mutations) than in all the stable sites from third codon positions (sites which stay twofold degenerated or fourfold degenerated in all the sequences from the alignment) of spaC Part 1 (from codon #1 to codon #400). In contrast, the levels of G4f, C4f, G2f3p and C2f3p are lower in invariable sites than in all the stable sites from third codon positions of spaC Part 1. These results confirm that cytosine and guanine residues are more mutable than adenine and thymine residues in that part of spaC. In other words, there is an ongoing mutational AT-pressure in spaC Part 1.

Table 1 Nucleotide content in fourfold and twofold degenerated sites from third codon positions in three parts of spaC gene. The usage of nucleotide in invariable sites is written in bold underlined type in case if it is significantly higher than the average usage of the same nucleotide in all the stable sites. Insignificant differences are written in italic font

Nucleotide   Part 1: codons 1 – 400           Part 2: codons 401 – 1599        Part 3: codons 1600 – 1873
usage        Invariable  Stable   P-value     Invariable  Stable   P-value     Invariable  Stable   P-value
             sites       sites                sites       sites                sites       sites
A4f          23.94       22.60    0.032       29.02       29.10    0.594       6.41        7.69     0.003
T4f          40.85       32.88    < 10^-3     33.16       31.48    < 10^-3     30.77       33.55    < 10^-3
G4f          18.31       19.24    0.045       19.69       18.93    0.142       42.31       36.94    < 10^-3
C4f          16.90       25.28    < 10^-3     18.13       20.45    < 10^-3     20.51       21.82    0.005
A2f3p        29.33       23.57    < 10^-3     32.27       29.06    < 10^-3     5.66        8.09     < 10^-3
T2f3p        33.33       31.36    < 10^-3     38.65       33.88    < 10^-3     18.87       21.42    0.015
G2f3p        14.67       18.80    < 10^-3     11.95       15.94    < 10^-3     56.60       45.33    < 10^-3
C2f3p        22.67       26.27    < 10^-3     17.13       21.13    < 10^-3     18.87       25.16    < 10^-3

In Part 2 of spaC (from codon #401 to codon #1599) the level of T4f is significantly higher in invariable sites than in all the stable sites, while the level of C4f is significantly lower (see Table 1). There are no significant differences between the levels of A4f and G4f in invariable sites and in all the stable sites. One may conclude that cytosine residues are more mutable than thymine residues. The level of thymine keeps growing in fourfold degenerated sites, while the level of adenine does not. Probably, the levels of adenine and guanine have already reached their equilibrium. On the other hand, there is clear evidence of the ongoing mutational AT-pressure in twofold degenerated sites: the levels of A2f3p and T2f3p are higher in invariable sites than in all the stable sites.

In Part 3 of spaC (from codon #1600 to codon #1873) the frequency of guanine mutations should be lower than the frequencies of adenine, thymine and cytosine mutations. Indeed, the usage of G4f in invariable sites is significantly higher than its average usage in all the stable fourfold degenerated sites, while the usages of A4f, T4f and C4f are lower in invariable sites than in all the stable ones.
It means that in Part 3 of spaC mutations leading to the appearance of guanine in place of other nucleotides are more frequent than mutations leading to the replacement of guanine by other nucleotides. There is clear evidence of mutational G-pressure in twofold degenerated sites: the level of G2f3p is higher in invariable sites than in all the stable sites of that kind, while the levels of A2f3p and T2f3p are lower (see Table 1).

The most probable cause of G-pressure in Part 3 of spaC should be the elevated rate of T to G transversions. As one can see in Table 2, 21.88% of variable fourfold degenerated sites in spaC Part 3 are sites containing G and T: this is more than two times higher than in Part 2 and more than three times higher than in Part 1. Interestingly, the percentage of sites containing G and T nucleotides (sites with a transversion) is even higher than the percentage of sites containing A and G nucleotides (sites with a transition) in Part 3 of spaC. Once again, according to the biases in nucleotide content (see Figure 2), those sites occurred mostly due to G to T transversions in Parts 1 and 2, and mostly due to T to G transversions in Part 3 of the spaC gene.

Table 2 Percentage of different types of variable fourfold degenerated sites in the alignment of spaC genes

Type of the     Part 1:          Part 2:              Part 3:
variable site   codons 1 – 400   codons 401 – 1599    codons 1600 – 1873
TC              40.00            27.54                37.50
AG              21.33            28.50                18.75
GC              8.00             4.83                 9.38
AT              4.00             9.18                 9.38
AC              13.33            10.14                3.13
TG              6.67             9.66                 21.88
ATG             0.00             2.90                 0.00
ATC             0.00             4.83                 0.00
AGC             5.33             0.97                 0.00
TGC             1.33             0.97                 0.00
ATGC            0.00             0.48                 0.00

In general, the data presented above show that there are at least two regions with different mutational pressure directions along the length of the same open reading frame. Moreover, the first of those regions can be divided into two parts with different intensities of the mutational AT-pressure.
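The classification of variable sites used in Table 2 can be sketched as follows. This is a hedged illustration, not the ‘VVTAK VarInvar’ code: the function name is hypothetical, each input column is assumed to hold the third-position nucleotides of one stable fourfold degenerated site across all aligned sequences, and the labels follow the A, T, G, C letter order seen in the table.

```python
from collections import Counter


def variable_site_types(columns):
    """Percentage of each variable-site type among all variable columns."""
    labels = []
    for column in columns:
        present = set(column)
        if len(present) > 1:  # invariable columns are skipped
            # Label the site by the nucleotides it contains, e.g. 'TG' or 'ATC'.
            labels.append("".join(b for b in "ATGC" if b in present))
    counts = Counter(labels)
    return {label: 100.0 * n / len(labels) for label, n in counts.items()}


# Toy alignment columns: two TG sites, one AG site, one TC site, one invariable site.
print(variable_site_types(["TG", "GT", "AG", "TC", "AA"]))
```

A site labelled 'TG' only records that both nucleotides occur at that position; the direction of the substitution (T to G or G to T) has to be inferred separately, as the text does from the nucleotide content biases.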
3.4 Prediction of intragenic promoters and terminators in spaC from C. diphtheriae NCTC 13129 and its homologues

Theoretically, there may be autonomous promoters and transcription terminators inside the spaC ORF which are responsible for the differential transcription of its parts. Unfortunately, large scale RNA expression data for C. diphtheriae are not currently available, unlike those for C. glutamicum (Pfeifer-Sancar et al., 2013). So, we tested the hypothesis of autonomous transcription in silico: we predicted promoter regions and Rho-independent transcription terminators along the length of spaC from C. diphtheriae and its homologues from other species of the same genus. Rho-dependent transcription terminators cannot be predicted using computer software, although it is known that those terminators are enriched with cytosine residues (Ciampi, 2006).

There are many putative promoter regions along the length of the spaC ORF. However, there are just two regions in which promoters have been predicted by all three methods: near codon #400 and near codon #1550. It is very important to highlight that these putative promoters are situated near the areas in which nucleotide usage biases change their direction or intensity.

There are also two regions in which promoters were predicted by three methods in the spaC homologue from C. ulcerans strain 0102. These putative promoters are situated in the same regions as in the C. diphtheriae spaC. Moreover, the two abovementioned putative promoters from C. ulcerans spaC are homologous to the promoters from C. diphtheriae spaC. Interestingly, 3GC between those two predicted promoters is higher than in the 5′ and 3′ parts of the spaC gene from C. ulcerans (see Figure 3), in contrast to the C. diphtheriae spaC gene (see Figure 1).

Figure 3 GC-content in three codon positions (1GC, 2GC and 3GC) in sliding windows 150 codons in length along the spaC gene from C. ulcerans strain 0102

The search for Rho-independent transcription terminators was less productive than the search for promoters. Two methods (RibEx and Erpin) failed to predict any terminator in any of the sequences studied. However, the RNAmotif method predicted terminators in one sequence: two Rho-independent transcription terminators have been predicted in the spaC gene from the C. diphtheriae Va01 strain. In the ten other spaC gene sequences RNAmotif failed to predict terminators.

Nucleotide sequences of the two predicted terminators from the Va01 strain have been aligned with sequences from other strains. Homologous regions containing a few nucleotide substitutions have been found in each of them. It is known that an L-type Rho-independent terminator consists of a short inverted repeat which is able to form a hairpin and a 3'-tail enriched with thymine residues (uracil residues in mRNA) (Naville et al., 2011). We predicted consensus secondary structures of both regions homologous to the terminators predicted in the Va01 strain spaC with the help of the CentroidFold algorithm. As one can see in Figure 4, hairpins were predicted by CentroidFold in both cases. It means that those hairpins may be formed not only in spaC from the Va01 strain, but in sequences from other strains too. The 3'-tails of both putative terminators are enriched with uracil residues (see Figure 4). The first putative Rho-independent terminator (Figure 4A) is located near the codon #1300, while the second one is located near the codon #1500. The second one is situated just upstream of the putative promoter region.

The only known feature of sequences which serve as Rho-dependent transcription terminators is their elevated cytosine content (Ciampi, 2006). In Figures 2B and 2D one can see several regions in which cytosine content is elevated. The highest peak of cytosine content in twofold degenerated sites is situated near codon #1650 of C. diphtheriae spaC (see Figure 2D).

Figure 4 Consensus secondary structures of the regions homologous to Rho-independent terminator predicted near the codon #1330 (A) and near the codon #1500 (B) of spaC from the C. diphtheriae Va01 strain

In general, regions in which 3GC increases or decreases along the length of the spaC gene are associated with such sequence features as putative promoters or terminators. These data confirm that changes in nucleotide usage biases (and so, changes of mutational pressure direction or intensity) along the length of the same bacterial ORF can be associated with the existence of intragenic transcription terminators and additional promoter regions.

4 Discussion

As one can see in Figure 5A, 3GC levels of genes from the C. diphtheriae strain NCTC 13129 genome vary greatly. For most of the genes 3GC varies from 50 to 80%, while there is a group of genes with 3GC lower than 50%. Most of the GC-poor genes are grouped together near the region of replication termination. However, there are also many genes with low 3GC levels outside the borders of the large GC-poor genomic island (see Figure 5). One of those genes is the gene coding for the spaC minor pilin. So, there is a single large GC-poor island and many small GC-poor islands in the genome of C. diphtheriae.

Interestingly, a similar pattern of 3GC distribution along the genome can be observed in C. pseudotuberculosis (Figure 5B) and C. ulcerans (Figure 5C). However, the difference between 3GC levels for genes from the GC-poor islands and the rest of the genome is lower for C. pseudotuberculosis and C. ulcerans than for C. diphtheriae. The genome of C. casei has no large GC-poor island in the middle, but it has a GC-rich island with 3GC levels around 85% in its third quarter (Figure 5D).

Figure 5 GC-content in three codon positions (1GC, 2GC and 3GC) in each coding region along the length of C. diphtheriae NCTC 13129 (A), C. pseudotuberculosis 258 (B), C.
ulcerans 809 (C) and C. casei LMG S-19264 (D) genomes. Location of the spaC gene is shown by the arrow

According to a recent study (D’Afonseca et al., 2012), there are 24 pathogenicity islands in the C. diphtheriae NCTC 13129 genome. How do they relate to GC-poor genomic islands? Approximately 50% of pathogenicity islands are situated in GC-poor regions of the chromosome. Only one relatively short pathogenicity island has been identified in the long GC-poor area described above. The other half of the pathogenicity islands demonstrates elevated GC-content. Interestingly, approximately one half of the small GC-poor islands have no genes known to be involved in pathogenesis.

There are 30 genes which demonstrate heterogeneous 3GC distribution along their lengths (the maximal difference between two sliding windows 150 codons in length is higher than 25%) in the C. diphtheriae strain NCTC 13129 genome (see Table 3). Interestingly, 14 of them code for membrane-anchored, surface-anchored or membrane proteins. Theoretically, those proteins may be involved in the adhesion process, just like spaC. The total number of genes encoding membrane, surface-anchored or membrane-anchored proteins (according to the latest genome annotation) is equal to 86. A chi-square test showed that the percentage of genes coding for membrane, surface-anchored and membrane-anchored proteins among those with ∆3GC150 higher than 25% (14 out of 30; 46.7%) is significantly higher (P < 0.001) than the percentage of such genes among those with ∆3GC150 lower than 25% (72 out of 2314; 3.11%).

Table 3 Percentage of genes distributed according to their maximal difference in 3GC between sliding windows 150 codons in length (∆3GC150) in genomes of four species from Corynebacterium genus

∆3GC150, %             C. diphtheriae   C. ulcerans   C. casei       C. pseudotuberculosis
                       NCTC 13129       809           LMG S-19264    258
> 40 ≤ 45              0.088            0.000         0.000          0.000
> 35 ≤ 40              0.080            0.000         0.036          0.000
> 30 ≤ 35              0.264            0.046         0.107          0.000
> 25 ≤ 30              0.880            0.734         0.641          0.287
> 20 ≤ 25              2.817            2.615         2.136          2.059
> 15 ≤ 20              8.803            10.275        9.220          8.764
> 10 ≤ 15              19.058           19.266        20.007         20.259
> 5 ≤ 10               30.766           35.229        33.571         32.519
> 0 ≤ 5                15.845           14.725        14.667         14.655
= 0 or < 150 codons    21.391           17.110        19.616         21.456
Genes coding for membrane, surface-anchored or membrane-anchored proteins among those with ∆3GC150 > 25%:
                       14 of 30         5 of 17       0 of 22        0 of 6

The spaC gene itself demonstrates a maximal difference between 3GC levels for two sliding windows equal to 43.33%. There is just a single gene with a higher ∆3GC150 level (44.00%) in that genome, which codes for a bifunctional alpha-amylase endo-alpha-glucosidase. Some so-called housekeeping genes (such as DNA polymerase subunit III, DNA methylases, translation initiation factor IF-2, serine-threonine protein kinase, asparagine synthetase) also demonstrate 3GC heterogeneity in the C. diphtheriae strain NCTC 13129 genome.

In the C. ulcerans 809 genome there are 17 genes with ∆3GC150 > 25%, and 5 of those genes code for membrane proteins which may be directly involved in adhesion (see Table 3). A chi-square test showed that the percentage of genes coding for membrane proteins among those with ∆3GC150 higher than 25% (5 out of 17; 29.41%) is significantly higher (P < 0.001) than the percentage of genes coding for membrane, surface-anchored and membrane-anchored proteins among those with ∆3GC150 lower than 25% (19 out of 2156; 0.88%). Once again, we used the current annotation of the C. ulcerans 809 genome to find out which of the proteins are membrane, surface-anchored or membrane-anchored ones. In contrast, in C. 
casei genome there are no genes encoding membrane, membrane-anchored or surface-anchored proteins among the 22 genes with ∆3GC150 > 25%, while the total number of membrane, surface-anchored and membrane-anchored proteins (according to the current annotation) is equal to 25. In the C. pseudotuberculosis genome there are also no genes coding for membrane, surface-anchored or membrane-anchored proteins among the 6 with ∆3GC150 > 25% (see Table 3), while the total number of genes coding for membrane, surface-anchored or membrane-anchored proteins annotated in that genome is equal to 24.

Table 3 as a whole shows that cases similar to that described for the spaC gene are relatively rare among bacteria from the Corynebacterium genus. In four complete genomes we found 75 genes with anomalous 3GC content variations. In the genomes of C. diphtheriae and C. ulcerans the percentage of membrane-anchored, surface-anchored and membrane proteins among the genes with ∆3GC150 > 25% is higher than that among other genes. This shows that 3GC heterogeneity may be linked with the adhesion process. However, that tendency was not found in the C. pseudotuberculosis and C. casei genomes. In general, heterogeneities in 3GC distribution along the length of genes of 5–10% are widespread in the four studied genomes (Table 3).

Transcription-associated mutational pressure occurs due to the bias in rates of nucleotide mutations which take place during transcription (Beletskii and Bhagwat, 1996). If bacteria can exist in different environments, differential expression of genes should take place: certain groups of genes should be expressed only in certain periods of the infection process. During at least one of those periods some genes encoding enzymes from the base excision repair system may become repressed or overexpressed. 
As a result, transcription-associated mutational pressure may become different for a group of genes expressed only during a special period of bacterial life than for those genes which are not expressed in that period (Khrustalev and Barkovsky, 2010a). Theoretically, suppression of uracil-DNA-glycosylase(s) expression may lead to the accumulation of C to T transitions (Gros et al., 2002) in genes (and their parts) expressed during a certain period of bacterial life and cause transcription-associated AT-pressure. In contrast, overexpression of the MutY enzyme (it excises adenine from 8-oxo-G:A mispairs and leads to T to G transversions) during certain periods of bacterial life may lead to transcription-associated GC-pressure (Gros et al., 2002). In the case of the 3'-end of the spaC gene, T to G transversions are frequent only in the non-transcribed strand. Probably, 8-oxo-G residues are removed effectively from the transcribed strand by transcription-associated repair in the phase of autonomous expression of the spaC 3'-end.

There are several different environments in which C. diphtheriae is able to survive: nasopharyngeal epithelium, skin (Orouji et al., 2012), and cytoplasm of host cells (during invasive infection of inner organs) (Viguetti et al., 2012). There are also asymptomatic carriers of C. diphtheriae, in which bacteria should survive but should not express genes responsible for the beginning of acute infection with clear clinical symptoms.

The possibility of autonomous transcription from the inner area of a coding region has been discussed in several studies. Specific areas enriched with putative promoters have been found in bacterial genomes using predictive software (Shavkunov et al., 2009). Most of the putative intragenic promoters were considered to code for short RNAs (Tutukina et al., 2007). Transcription from some of those promoters was confirmed experimentally (Shavkunov et al., 2009). 
Transcription beginning from intragenic promoters is used in archaeal genomes even more frequently than transcription beginning from intergenic regions (Koide et al., 2009). The existence of intragenic transcription terminators has been confirmed in many experimental works (Naville and Gautheret, 2009).

There are two putative transcription start sites inside the long open reading frame of C. diphtheriae spaC. If transcription starts from the promoter near codon #400 or near codon #1550 of spaC, the resulting proteins will lack their N-terminal domain, which is homologous to von Willebrand factor. It was shown that a recombinant protein containing only the domain homologous to von Willebrand factor from spaC is able to interact with human pharyngeal epithelium cells (Mandlik et al., 2007). If transcription starts from intragenic promoters, the resulting proteins may not be able to promote adhesion to pharyngeal epithelium, while they theoretically may promote adhesion to some other surfaces.

If transcription ends near codon #1300 or #1500, where Rho-independent transcriptional terminators were mapped, the resulting transcript may be nonfunctional, since there is no stop codon at its 3'-end. On the other hand, tmRNA may be involved in rescuing ribosomes from mRNA without a stop codon (Fu et al., 2011). In this case, the resulting protein will lack its C-terminal part with the sortase motif. The protein will not be transported to the cell wall without that conserved motif (Mandlik et al., 2007).

Full-length spaC is expressed together with other genes from the spaA-type pili operon, at least in the period of C. diphtheriae infection when adhesion to the pharyngeal epithelium is needed. Since the 3GC level is low in the middle of the spaC gene (from codon #400 to codon #1600), it is likely that this part of the coding region may be expressed autonomously in certain conditions. 
Transcription of this long mRNA may begin from the promoter near codon #400 and end in one of the two regions with Rho-independent terminators (near codon #1300 or #1500), or in the region enriched with cytosine residues near codon #1600 (it may contain a Rho-dependent terminator). A protein without any known functional domains or motifs may be translated from that mRNA with the help of tmRNA.

Different types of proteins may theoretically be translated from different mRNAs transcribed from a single spaC ORF when C. diphtheriae survives in different conditions. However, transcription-associated mutational AT-pressure should take place during the suggested autonomous transcription of the central spaC part, while the suggested autonomous expression of its 3'-end should take place during the period(s) with transcription-associated G-pressure. Nucleotide mutations in putative promoters, in transcription factor binding sites, as well as in Rho-dependent and Rho-independent transcription terminators inside the spaC gene may cause differences in the success of adhesion to certain surfaces between different strains of C. diphtheriae.

According to the results of phylogenetic analyses (Ruimy et al., 1995), C. ulcerans and C. pseudotuberculosis are the closest known relatives of C. diphtheriae. Interestingly, C. ulcerans and C. pseudotuberculosis are more closely related to each other than to C. diphtheriae. The fourth genome from the Corynebacterium genus which has a spaC homologue belongs to C. casei, which is an outgroup for the three abovementioned species (Brennan et al., 2001). According to these data, the specific large GC-poor island was formed in the common ancestor of C. diphtheriae, C. ulcerans and C. pseudotuberculosis, while the ancestor of C. casei did not have such a large GC-poor area. 
This fact confirms that different transcription-associated mutational pressure directions existed during an evolutionary period long enough to make 3GC levels significantly lower, while that period was not long enough to decrease 1GC and 2GC in genes from the GC-poor area (see Figure 5A, B and C). Moreover, genes coding for membrane, surface-anchored and membrane-anchored proteins (including spaC) were already involved in autonomous transcription in the common ancestor of C. diphtheriae, C. ulcerans and C. pseudotuberculosis. Since the divergence of C. ulcerans and C. pseudotuberculosis, the genes coding for membrane, surface-anchored and membrane-anchored proteins of C. pseudotuberculosis have acquired a more homogeneous 3GC distribution along their lengths.

Coming back to the C. casei genome, we have to highlight that heterogeneity in 3GC distribution can be clearly observed along the length of both the whole genome and several genes of this bacterium, while the pattern of its 3GC heterogeneity is different from that in C. diphtheriae.

The cause of 3GC heterogeneity along a coding region may differ: it can be formed by insertion of gene fragments from different species or by transcription-associated mutational pressure. As we have shown in this study, local transcription-associated mutational pressure may become weaker and even disappear or change its direction. The ‘VVTAK VarInvar’ algorithm is necessary for answering the question about the direction of ongoing mutational pressure in different parts of the same gene or in different genes from the same genome. The aim of the algorithm is to separate the current mutational bias from previously existing nucleotide usage biases.
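
The two calculations behind Table 3 can be illustrated with a minimal Python sketch. It is not the 'VVTAK Sliding window' implementation: the function names, the one-codon window step and the use of an uncorrected Pearson chi-square for a 2 x 2 table are assumptions made for illustration. The text specifies only the window length of 150 codons, the 25% threshold and the gene counts.

```python
import math


def third_position_gc(codons):
    """Return 3GC: the percentage of G or C in third codon positions."""
    gc = sum(1 for codon in codons if codon[2] in "GC")
    return 100.0 * gc / len(codons)


def delta_3gc(cds, window=150, step=1):
    """Maximal difference in 3GC between sliding windows along one ORF.

    `window` is given in codons (150 in the text); `step` is an assumed
    value, since the step size is not stated here. Returns None when the
    ORF is shorter than one window.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if len(codons) < window:
        return None
    levels = [
        third_position_gc(codons[start:start + window])
        for start in range(0, len(codons) - window + 1, step)
    ]
    return max(levels) - min(levels)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and its P value for one degree of freedom,
    using the identity P = erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Toy ORF: an AT-ending half followed by a G-ending half.
    toy_orf = "AAA" * 150 + "AAG" * 150
    print("delta 3GC150 of toy ORF:", delta_3gc(toy_orf))

    # C. diphtheriae counts from the text: 14 of 30 genes with
    # delta 3GC150 > 25% versus 72 of 2314 genes below that threshold.
    chi2, p_value = chi_square_2x2(14, 30 - 14, 72, 2314 - 72)
    print("chi-square:", chi2, "P:", p_value)
```

With the counts quoted above for C. diphtheriae, the statistic exceeds the critical value for P = 0.001 at one degree of freedom (about 10.83), which agrees with the reported significance level. Whether a continuity correction was applied in the original analysis is not stated.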

5 Conclusions

In this study we showed that mutational pressure theory (Sueoka, 1988) can be applied not only to complete bacterial genomes or full-length genes, but even to different parts of the same bacterial open reading frame. The preferred direction of single nucleotide mutations may differ between parts of the same coding region. Indeed, with the help of original bioinformatic algorithms we confirmed the existence of ongoing AT-pressure in the 5'-part of the C. diphtheriae spaC gene (up to codon #1599) and the existence of ongoing G-pressure in its 3'-part (from codon #1600 to codon #1873).

The repertoire of adhesins may be wider than currently thought for C. diphtheriae and C. ulcerans, due to the possibility of transcription from intragenic promoters inside spaC (and several other genes coding for membrane, surface-anchored and membrane-anchored proteins) and termination of transcription at intragenic Rho-dependent or Rho-independent terminators.

Bioinformatic analyses of nucleotide usage biases along the length of a bacterial genome, and even along the length of a single gene, may be used as promising procedures in studies on transcription regulation and differential transcription.

References

Abreu-Goodger, C. and Merino, E. (2005) ‘RibEx: a web server for locating riboswitches and other conserved bacterial regulatory elements’, Nucleic Acids Research, Vol. 33, pp.W690–W692.
Beletskii, A. and Bhagwat, A.S. (1996) ‘Transcription-induced mutations: Increase in C to T mutations in the nontranscribed strand during transcription in Escherichia coli’, Proceedings of the National Academy of Sciences USA, Vol. 93, pp.13919–13924.
Boc, A., Diallo, A.B. and Makarenkov, V. (2012) ‘T-REX: a web server for inferring, validating and visualizing phylogenetic trees and networks’, Nucleic Acids Research, Vol. 40, pp.W573–W579.
Brennan, N.M., Brown, R., Goodfellow, M., Ward, A.C., Beresford, T.P., Simpson, P.J., Fox, P.F. and Cogan, T.M. (2001) ‘Corynebacterium mooreparkense sp. nov. and Corynebacterium casei sp. nov., isolated from the surface of a smear-ripened cheese’, International Journal of Systematic and Evolutionary Microbiology, Vol. 51, pp.843–852.
Cerdeño-Tárraga, A.M., Efstratiou, A., Dover, L.G., Holden, M.T., Pallen, M., Bentley, S.D., Besra, G.S., Churcher, C., James, K.D., De Zoysa, A., Chillingworth, T., Cronin, A., Dowd, L., Feltwell, T., Hamlin, N., Holroyd, S., Jagels, K., Moule, S., Quail, M.A., Rabbinowitsch, E., Rutherford, K.M., Thomson, N.R., Unwin, L., Whitehead, S., Barrell, B.G. and Parkhill, J. (2003) ‘The complete genome sequence and analysis of Corynebacterium diphtheriae NCTC13129’, Nucleic Acids Research, Vol. 31, pp.6516–6523.
Chang, C., Mandlik, A., Das, A. and Ton-That, H. (2011) ‘Cell surface display of minor pilin adhesins in the form of a simple heterodimeric assembly in Corynebacterium diphtheriae’, Molecular Microbiology, Vol. 79, pp.1236–1247.
Chen, C. and Chen, C.W. (2007) ‘Quantitative analysis of mutation and selection pressures on base composition skews in bacterial chromosomes’, BMC Genomics, Vol. 8, No. 286.
Ciampi, M.S. (2006) ‘Rho-dependent terminators and transcription termination’, Microbiology, Vol. 152, pp.2515–2528.
D’Afonseca, V., Soares, S.C., Ali, A., Santos, A.R., Pinto, A.C., Magalhaes, A.A.C., Faria, C.J., Barbosa, E., Guimaraes, L.C., Eslabao, M., Almeida, S.S., Abreu, V.A.C., Zerlotini, A., Carneiro, A.R., Cerdeira, L.T., Ramos, R.T.J., Hirata, R. Jr., Mattos-Guaraldi, A.L., Trost, E., Tauch, A., Silva, A., Schneider, M.P., Miyoshi, A. and Azevedo, V. (2012) ‘Reannotation of the Corynebacterium diphtheriae NCTC13129 genome as a new approach to studying gene targets connected to virulence and pathogenicity in diphtheria’, Open Access Bioinformatics, Vol. 4, pp.1–13.
Fu, J., Hashem, Y., Wower, J. and Frank, J. (2011) ‘tmRNA on its way through the ribosome: two steps of resume and what next?’, RNA Biology, Vol. 8, pp.586–590.
Gautheret, D. and Lambert, A. (2001) ‘Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles’, Journal of Molecular Biology, Vol. 313, pp.1003–1011.
Gros, L., Saparbaev, M.K. and Laval, J. (2002) ‘Enzymology of the repair of free radicals-induced DNA damage’, Oncogene, Vol. 21, pp.8905–8925.
Hamada, M., Kiryu, H., Sato, K., Mituyama, T. and Asai, K. (2009) ‘Prediction of RNA secondary structure using generalized centroid estimators’, Bioinformatics, Vol. 25, pp.465–473.
Khrustalev, V.V. and Barkovsky, E.V. (2010a) ‘Study of completed archaeal genomes and proteomes: hypothesis of strong mutational AT pressure existed in their common predecessor’, Genomics, Proteomics & Bioinformatics, Vol. 8, pp.22–32.
Khrustalev, V.V. and Barkovsky, E.V. (2010b) ‘The level of cytosine is usually much higher than the level of guanine in two-fold degenerated sites from third codon positions of genes from Simplex- and Varicelloviruses with G+C higher than 50%’, Journal of Theoretical Biology, Vol. 266, pp.88–98.
Khrustalev, V.V. and Barkovsky, E.V. (2012) ‘A blueprint for a mutationist theory of replicative strand asymmetries formation’, Current Genomics, Vol. 13, pp.55–64.
Khrustalev, V.V., Barkovsky, E.V., Khrustaleva, T.A. and Lelevich, S.V. (2014) ‘Intragenic isochores (intrachores) in the platelet phosphofructokinase gene of Passeriform birds’, Gene, Vol. 546, pp.16–24.
Koide, T., Reiss, D.J., Bare, J.C. et al. (2009) ‘Prevalence of transcription promoters within archaeal operons and coding sequences’, Molecular Systems Biology, Vol. 5, p.285.
Lesnik, E.A., Sampath, R., Levene, H.B., Henderson, T.J., McNeil, J.A. and Ecker, D.J. (2001) ‘Prediction of Rho-independent transcriptional terminators in Escherichia coli’, Nucleic Acids Research, Vol. 29, pp.3583–3594.
Lobry, J.R. and Sueoka, N. (2002) ‘Asymmetric directional mutation pressures in bacteria’, Genome Biology, Vol. 3.
Macke, T., Ecker, D., Gutell, R., Gautheret, D., Case, D.A. and Sampath, R. (2001) ‘RNAMotif – a new RNA secondary structure definition and discovery algorithm’, Nucleic Acids Research, Vol. 29, pp.4724–4735.
Mandlik, A., Swierczynski, A., Das, A. and Ton-That, H. (2007) ‘Corynebacterium diphtheriae employs specific minor pilins to target human pharyngeal epithelial cells’, Molecular Microbiology, Vol. 64, pp.111–124.
Nakamura, Y., Gojobori, T. and Ikemura, T. (2000) ‘Codon usage tabulated from the international DNA sequence databases: status for the year 2000’, Nucleic Acids Research, Vol. 28, p.292.
Naville, M. and Gautheret, D. (2009) ‘Transcription attenuation in bacteria: theme and variations’, Briefings in Functional Genomics & Proteomics, Vol. 8, pp.482–492.
Naville, M., Ghuillot-Gaudefroy, A., Marchais, A. and Gautheret, D. (2011) ‘ARNold: A web tool for the prediction of Rho-independent transcription terminators’, RNA Biology, Vol. 8, pp.11–13.
Nei, M. and Kumar, S. (2000) Molecular Evolution and Phylogenetics, Oxford University Press, New York.
Orouji, A., Kiewert, A., Filser, T., Goerdt, S. and Peitsch, W.K. (2012) ‘Cutaneous diphtheria in a German man with travel history’, Acta Dermato Venereologica, Vol. 92, pp.179–180.
Pfeifer-Sancar, K., Mentz, A., Ruckert, C. and Kalinowski, J. (2013) ‘Comprehensive analysis of the Corynebacterium glutamicum transcriptome using an improved RNAseq technique’, BMC Genomics, Vol. 14, No. 888.
Rangannan, V. and Bansal, M. (2009) ‘Relative stability of DNA as a generic criterion for promoter prediction: whole genome annotation of microbial genomes with varying nucleotide base composition’, Molecular BioSystems, Vol. 5, pp.1758–1769.
Reese, M.G. (2001) ‘Application of a time-delay neural network to promoter annotation in the Drosophila melanogaster genome’, Computers and Chemistry, Vol. 26, pp.51–56.
Ruimy, R., Riegel, P., Boiron, P., Monteil, H. and Christen, R. (1995) ‘Phylogeny of the genus Corynebacterium deduced from analyses of small-subunit ribosomal DNA sequences’, International Journal of Systematic Bacteriology, Vol. 45, pp.740–746.
Shavkunov, K.S., Masulis, I.S., Tutukina, M.N., Deev, A.A. and Ozoline, O.N. (2009) ‘Gains and unexpected lessons from genome-scale promoter mapping’, Nucleic Acids Research, Vol. 37, pp.4919–4931.
Sueoka, N. (1988) ‘Directional mutation pressure and neutral molecular evolution’, Proceedings of the National Academy of Sciences USA, Vol. 85, pp.2653–2657.
Tamura, K., Peterson, D., Peterson, N., Stecher, G., Nei, M. and Kumar, S. (2011) ‘MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods’, Molecular Biology and Evolution, Vol. 28, pp.2731–2739.
Ton-That, H. and Schneewind, O. (2003) ‘Assembly of pili on the surface of Corynebacterium diphtheriae’, Molecular Microbiology, Vol. 50, pp.1429–1438.
Trost, E., Blom, J., Soares, S.C., Huang, I.H., Al-Dilaimi, A., Schreder, J., Jaenicke, S., Dorella, F.A., Rocha, F.S., Miyoshi, A., Azevedo, V., Schneider, M.P., Silva, A., Camello, T.C., Sabbadini, P.S., Santos, C.S., Santos, L.S., Hirata, R. Jr., Mattos-Guaraldi, A.L., Efstratiou, A., Schmitt, M.P., Ton-That, H. and Tauch, A. (2012) ‘Pan-genomics of Corynebacterium diphtheriae: insights into the genomic diversity of pathogenic isolates from cases of classical diphtheria, endocarditis and pneumonia’, Journal of Bacteriology, Vol. 194, pp.3199–3215.
Tutukina, M.N., Shavkunov, K.S., Masulis, I.S. and Ozoline, O.N. (2007) ‘Intragenic promotor-like sites in the genome of Escherichia coli discovery and functional implication’, Journal of Bioinformatics and Computational Biology, Vol. 5, pp.549–560.
Viguetti, S.Z., Pacheco, L.G.C., Santos, L.S., Soares, S.C., Bolt, F., Baldwin, A., Dowson, C.G., Rosso, M.L., Guiso, N., Miyoshi, A., Hirata, R. Jr., Mattos-Guaraldi, A.L. and Azevedo, V. (2012) ‘Multilocus sequence types of invasive Corynebacterium diphtheriae isolated in the Rio de Janeiro urban area, Brazil’, Epidemiology & Infection, Vol. 140, pp.617–620.