The distribution of substitutions reflects features of homologous recombination in bacterial species

Anastasia S. Kalinina
IITP RAS

Abstract. Homologous recombination is an important evolutionary force that drives the spread of beneficial mutations through a population. Previous studies have shown that the distribution of the number of differences in fixed-size windows, computed for pairwise comparisons of strains, may provide insights into features of the recombination process. This technique has been applied to Escherichia coli, Burkholderia pseudomallei and Streptococcus suis. The shape of the distribution of the number of substitutions depends only on the genetic distance between the considered strains and is characteristic of each species. Two regimes are observed in such distributions in E. coli and S. suis: one for vertically inherited segments and one for recombined segments. It has been demonstrated that this fact can be used for setting thresholds in more sophisticated approaches for the detection of recombination events.

1 Introduction

Genetic information in bacteria can be transmitted in two ways: vertically from the parent cell and horizontally from other cells in the environment. Vertically inherited DNA usually has no or few differences from the DNA of the parent cell, since such differences are introduced by the mutation process, which occurs at a low rate. There are several routes for horizontal DNA transfer in bacteria: some species are capable of taking up DNA directly from the environment (transformation), whereas in others DNA transfer is phage-mediated (transduction) or plasmid-mediated (conjugation). Once in the cell, exogenous DNA may be integrated into the genome by homologous recombination. There are natural limitations on the efficiency of homologous recombination [1]. In particular, the homologous recombination rate depends on sequence similarity [2, 3]. This point is still under discussion [4], but the assumption is currently in common use [5–7].
Numerous approaches for the detection of homologous recombination events are available [8–10]. However, alternative methods can also provide insights into features of the recombination process. In [11], an approach to the analysis of homologous recombination in Escherichia coli was proposed that is based on the distribution of the number of differences (DND) in fixed-size windows in pairwise comparisons of strains. It was shown that the shape of such distributions depends only on the genetic distance between the considered strains.

In this study this approach was applied to Escherichia coli, Burkholderia pseudomallei and Streptococcus suis, which are reported to have different rates of homologous recombination [12–14].

2 Data and Methods

2.1 Strains used in this study

In this study 20 strains of E. coli were used, 5 from each of the phylogroups A, B1, B2 and E, together with 4 strains of B. pseudomallei, 1 strain of B. mallei and 17 strains of S. suis. Three other fully sequenced genomes of B. mallei were not used because of high genome similarity (fewer than 100 differences in each pairwise comparison). In the rest of the text the B. mallei strain is treated as one of the B. pseudomallei strains, since B. mallei is an internal clade of B. pseudomallei and demonstrates the same properties in this study.

2.2 Concatenated alignment of universal genes

Ortholog rows for E. coli were constructed from the full-genome alignment using the “export orthologs” tool of Mauve [15]. For B. pseudomallei and B. mallei, and for S. suis, BLAST was used for the ortholog search. Only universal genes (i.e. genes present in each considered genome in a single copy) were used for further analysis. Multiple alignments were built with ClustalW [16] and inspected manually. All alignments were then concatenated, and columns with gaps were excluded.

2.3 Distributions of the number of differences

The basic approach is described in [11].
An alignment is sliced into non-overlapping fixed-size windows, and the distribution of the number of differences (DND) is then computed for pairs of genomes. Instead of the concatenated blocks from the full-genome alignment used in the original publication [11], a concatenated alignment of universal genes is used. The reason is to exclude intergenic regions, which have a higher mutation rate. The shapes of the distributions for E. coli in this study and in [11] are similar, so the approach is fairly robust to the way the alignment is prepared. Since the average length of a bacterial gene is about 1000 bp, the window length was set to 1000 bp. Smaller windows are also applicable, but provide poorer resolution in comparisons of strains separated by a low genetic distance.

3 Results and Discussion

In the absence of homologous recombination the number of differences in a fixed-size window should follow a Poisson distribution, since mutations are rare and occur independently. Deviations of the DND from the Poisson distribution provide information about the homologous recombination process in different bacterial species.

3.1 Escherichia coli

There are four well-defined phylogroups in E. coli: A, B1, B2 and E [17]. Phylogroups A and B1 are close, and DNDs for pairs of genomes from these phylogroups with the same distance have the same shape, whereas B2 diverged from them much earlier and behaves differently. Phylogroup E consists of strains with a low genetic distance between them, so it cannot be used in this analysis.

The shape of DND. Fig. 1A shows four pairs of genomes with the same genetic distance of 0.0076: two pairs from phylogroup A and two pairs from phylogroup B2. In pairwise comparisons of strains within a phylogroup the shape of the DND depends only on the genetic distance between the strains.
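As a minimal sketch of the procedure in Section 2.3, the following Python code computes a DND for one pair of aligned sequences, together with the Poisson probabilities expected without recombination. It is an illustration, not the code used in this study: the function names dnd, poisson_pmf and poisson_expectation are hypothetical, the input is assumed to be two gap-free aligned strings of equal length, a trailing partial window is assumed to be discarded, and the Poisson mean is assumed to be the genetic distance multiplied by the window length.

```python
from collections import Counter
from math import exp, factorial


def dnd(seq_a, seq_b, window=1000):
    """Fraction of non-overlapping windows having each number of differences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    n_windows = len(seq_a) // window  # assumption: trailing partial window is dropped
    if n_windows == 0:
        raise ValueError("alignment is shorter than one window")
    counts = Counter()
    for w in range(n_windows):
        start = w * window
        chunk_a = seq_a[start:start + window]
        chunk_b = seq_b[start:start + window]
        # Number of mismatching columns in this window
        diffs = sum(a != b for a, b in zip(chunk_a, chunk_b))
        counts[diffs] += 1
    return {k: v / n_windows for k, v in sorted(counts.items())}


def poisson_pmf(k, lam):
    """Probability of exactly k events for a Poisson distribution with mean lam."""
    return exp(-lam) * lam ** k / factorial(k)


def poisson_expectation(distance, window=1000, k_max=20):
    """Poisson probabilities for 0..k_max differences per window.

    Assumption: the mean number of differences per window equals
    distance * window, i.e. a pure mutation model without recombination.
    """
    lam = distance * window
    return [poisson_pmf(k, lam) for k in range(k_max + 1)]
```

For example, a pair of strains at genetic distance 0.0076 compared in 1000 bp windows would have a Poisson mean of 7.6 differences per window under this assumption; the observed DND can then be compared with this expectation value by value.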
For strains from phylogroups A and B1, two regimes with a clearly visible change-point at 4–5 differences per window are observed in the DND: vertically inherited segments with a low number of differences and recombined segments with a high number of differences. The distribution of the number of differences in vertically inherited segments is similar to the Poisson distribution, whereas the distribution for recombined segments decays slowly and can be approximated by a sum of Poisson distributions. The heavy tail of the distribution characterizes the genetic distances to the sources of homologous recombination in E. coli. For phylogroup B2 there is no visible change-point in the DND. Since the number of segments without any differences for pairs of strains from phylogroup B2 is significantly lower than for pairs of strains from phylogroup A with the same genetic distance between them (Fig. 1A), the ratio of the mutation rate to the recombination rate is presumably higher in phylogroup B2 than in phylogroups A and B1.

Setting thresholds using DND. To estimate the fraction of homologous recombination events with a source outside E. coli phylogroups A, B1, B2 and E, an algorithm was used that treats a high local density of substitutions as a signal of homologous recombination. The first step of the algorithm is to convert a multiple alignment to a Bernoulli sequence of “0” and “1” for each strain. An empirical rule was used: a position in the alignment is significant and is set to “1” if the considered strain differs at this position from more than half of the members of its own phylogroup and coincides with more than half of the members of another phylogroup. A high local density of “1” in the sequence indicates that a haplotype characteristic of the other phylogroup was introduced into the considered genome by homologous recombination.
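The empirical rule above can be sketched as follows. This is one possible reading of the rule, not the implementation used in this study: the function name bernoulli_sequence and the two input dictionaries are hypothetical, and the considered strain is assumed to be excluded from the count over its own phylogroup.

```python
def bernoulli_sequence(strain, alignment, phylogroup_of):
    """Convert one strain of a multiple alignment to a list of 0/1 marks.

    alignment:     dict mapping strain name -> aligned sequence (equal lengths)
    phylogroup_of: dict mapping strain name -> phylogroup label
    Both structures are hypothetical and chosen only for illustration.
    """
    own_group = phylogroup_of[strain]
    members_by_group = {}
    for name, group in phylogroup_of.items():
        if name != strain:  # assumption: the strain itself is not counted
            members_by_group.setdefault(group, []).append(name)
    own_members = members_by_group.pop(own_group, [])

    marks = []
    for i, base in enumerate(alignment[strain]):
        # Differs from more than half of the members of its own phylogroup
        n_differ = sum(alignment[m][i] != base for m in own_members)
        differs_from_own = 2 * n_differ > len(own_members)
        # Coincides with more than half of the members of some other phylogroup
        matches_other = any(
            2 * sum(alignment[m][i] == base for m in members) > len(members)
            for members in members_by_group.values()
        )
        marks.append(1 if differs_from_own and matches_other else 0)
    return marks
```

The resulting list of marks is the per-strain sequence whose local density of “1” is examined in the next step.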
Then Adaptive Weights Smoothing (AWS) [18] is applied to the resulting Bernoulli sequence to estimate the local density of “1” at each position. To separate vertically inherited segments, which have a low substitution rate, from recombined segments, which have a high substitution rate, a threshold has to be set. The average genetic distance between strains within phylogroups is 0.006–0.008; it is composed of distances between vertically inherited segments and between recombined segments, so the threshold should lie below it. Since for phylogroups A and B1 there are two regimes in the DND with the change-point at 3–5 differences per window, the threshold 0.005 provides a minimal false positive rate.

Fig. 1. Distributions of the number of substitutions per 1000 nt window in pairwise comparisons of strains. The x-axis shows the number of substitutions and the y-axis the fraction of segments. A: pairwise comparisons of 4 pairs of E. coli strains with the same genetic distance 0.0076 between them: two pairs from phylogroup A (× and +; black) and two pairs from phylogroup B2 (⊕ and ; grey). B: pairwise comparisons of 4 pairs of B. pseudomallei strains with genetic distances 0.0021 (◦), 0.0031 (×), 0.0036 (+) and 0.0040 (♦). C: distributions of the average values of the number of differences for pairwise comparisons of 23 pairs of strains with genetic distances 0.025–0.028 (+), 23 pairs with genetic distances 0.0283–0.0299 (×) and 34 pairs with genetic distances higher than 0.03 (4).

AWS was also applied to pairwise comparisons between strains within phylogroups. The fraction of segments inferred to be recombined in pairwise comparisons that cannot be explained by recombination with strains from other phylogroups is approximately 30% for phylogroup A and 34% for phylogroup B1. These values do not depend on the genetic distances between the considered pairs of strains, which lie in the range 0.0001–0.0084.
If loci with a high mutation rate were being classified as recombined at the threshold 0.005, the fraction of segments with an unknown source would be higher for pairs of strains with higher genetic distances. For phylogroup B2 the fraction of recombined segments with an unknown source is higher than 50%, but this estimate is unreliable, since recombined and vertically inherited segments are mixed in the DNDs for phylogroup B2.

3.2 Burkholderia pseudomallei

A Poisson-like shape of the DND does not necessarily mean a low rate of homologous recombination in a bacterial species. Fig. 1B presents DNDs for pairs of B. pseudomallei strains with different genetic distances. Unlike the DNDs for E. coli and S. suis, they have a Poisson-like shape. However, B. pseudomallei is reported to have a high recombination rate relative to the mutation rate [19], comparable with that in Streptococcus pneumoniae, a highly recombinogenic species [13, 14], and much higher than that in E. coli [12]. It is unclear whether the Poisson-like shape of the DND is common for bacteria with a high ratio of the recombination rate to the mutation rate and low diversity within the population, or whether it is a result of the B. pseudomallei population structure. The absence of windows with more than 15 differences indicates that homologous recombination with sources at a high genetic distance from B. pseudomallei is rare or does not occur. The two B. pseudomallei chromosomes were also considered separately. Chromosome II has higher diversity than chromosome I, but the shapes of the DNDs for chromosomes I and II are similar.

3.3 Streptococcus suis

In this study S. suis was used as an example of a highly recombinogenic species. In Fig. 1C vertically inherited segments, with fewer than 4 substitutions, are clearly visible for pairs of strains with a genetic distance of 0.0249–0.0279 between them.
For pairs of strains with a genetic distance of 0.0283–0.0299 vertically inherited segments are still visible, but there are fewer of them. Finally, for pairs of strains with a genetic distance higher than 0.03 the vertically inherited segments are indistinguishable from recombined ones. This demonstrates that the detection of recombination events in highly recombinogenic species is feasible only for closely related strains, since the genomes of more distant strains are entirely covered by recombination events.

4 Conclusions

The goal of this paper is to demonstrate that the properties of homologous recombination in bacteria can be examined not only by detecting recombined segments, but also in indirect ways. The approach based on the distribution of the number of substitutions across the genome [11] was applied to three species: E. coli, B. pseudomallei and S. suis. Each species has a unique shape of DND that reflects previously reported properties of homologous recombination. The shape of the DNDs for E. coli phylogroup B2 differs from that for phylogroup A, which suggests that the ratio of the mutation rate to the recombination rate is higher in phylogroup B2. It has also been demonstrated that the shape of the DND may be used for setting thresholds and estimating the false positive rate in more sophisticated approaches for the detection of homologous recombination events.

The DND-based approach to the analysis of homologous recombination is rapid and transparent, and it can easily be applied to any species in which more than one strain has been sequenced. For this purpose ortholog databases for fully sequenced bacterial genomes, such as OrtholugeDB [20], are in demand. The dependence of the stability of the approach on the order of genes should also be examined more closely, since it is crucial in the study of species with a large number of genome rearrangements, for example Yersinia pestis.

This is joint work with M. Gelfand.

References

1. Ellegaard K.M., Klasson L., Näslund K., Bourtzis K., Andersson S.G.: Comparative genomics of Wolbachia and the bacterial species concept. PLoS Genet. 9, e1003381 (2013)
2. Majewski J., Cohan F.M.: DNA sequence similarity requirements for interspecific recombination in Bacillus. Genetics. 153, 1525–33 (1999)
3. Shen P., Huang H.V.: Homologous recombination in Escherichia coli: dependence on substrate length and homology. Genetics. 112, 441–57 (1986)
4. Bao H.X. et al.: Differential efficiency in exogenous DNA acquisition among closely related Salmonella strains: implications in bacterial speciation. BMC Microbiol. 14, 157 (2014)
5. Falush D. et al.: Mismatch induced speciation in Salmonella: model and data. Philos Trans R Soc Lond B Biol Sci. 361, 2045–53 (2006)
6. Fraser C., Hanage W.P., Spratt B.G.: Recombination and the nature of bacterial speciation. Science. 315, 476–80 (2007)
7. Doroghazi J.R., Buckley D.H.: A model for the effect of homologous recombination on microbial diversification. Genome Biol Evol. 3, 1349–56 (2011)
8. Didelot X., Falush D.: Inference of bacterial microevolution using multilocus sequence data. Genetics. 175, 1251–66 (2007)
9. Croucher N.J. et al.: Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Res. 43, e15 (2015)
10. Marttinen P. et al.: Detection of recombination events in bacterial genomes from large population samples. Nucleic Acids Res. 40, e6 (2012)
11. Dixit P.D., Pang T.Y., Studier F.W., Maslov S.: Quantifying evolutionary dynamics of the basic genome of E. coli. arXiv:1405.2548 [q-bio.PE] (2014)
12. Didelot X., Méric G., Falush D., Darling A.E.: Impact of homologous and nonhomologous recombination in the genomic evolution of Escherichia coli. BMC Genomics. 13, 256 (2012)
13. Nandi T. et al.: Burkholderia pseudomallei sequencing identifies genomic clades with distinct recombination, accessory, and epigenetic profiles. Genome Res. 25, 129–41 (2015)
14. Croucher N.J. et al.: Rapid pneumococcal evolution in response to clinical interventions. Science. 331, 430–4 (2011)
15. Darling A.E., Mau B., Perna N.T.: progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss, and Rearrangement. PLoS One. 5, e11147 (2010)
16. Larkin M.A. et al.: Clustal W and Clustal X version 2.0. Bioinformatics. 23, 2947–8 (2007)
17. Chaudhuri R.R., Henderson I.R.: The evolution of the Escherichia coli phylogeny. Infect Genet Evol. 12, 214–26 (2012)
18. Polzehl J., Spokoiny V.: Adaptive Weights Smoothing with applications to image restoration. J R Stat Soc Ser B Stat Methodol. 62, 335–354 (2000)
19. Cheng A.C. et al.: Genetic diversity of Burkholderia pseudomallei isolates in Australia. J Clin Microbiol. 46, 249–54 (2008)
20. Whiteside M.D., Winsor G.L., Laird M.R., Brinkman F.S.: OrtholugeDB: a bacterial and archaeal orthology resource for improved comparative genomic analysis. Nucleic Acids Res. 41(Database issue), D366–76 (2013)