The distribution of substitutions reflects features
of homologous recombination in bacterial species
Anastasia S. Kalinina
IITP RAS
Abstract. Homologous recombination is an important evolutionary force
that drives the spread of beneficial mutations through a population.
Previous studies have shown that the distribution of the number of
differences in fixed-size windows, computed for pairwise comparisons of
strains, may provide insights into the features of the recombination
process. Here this technique is applied to Escherichia coli, Burkholderia
pseudomallei and Streptococcus suis. The shape of the distribution of the
number of substitutions depends only on the genetic distance between the
considered strains and is characteristic of each species. Two regimes are
observed in such distributions for E. coli and S. suis: one for vertically
inherited segments and one for recombined segments. It is demonstrated
that this fact can be used to set thresholds in more sophisticated
approaches to the detection of recombination events.
1 Introduction
Genetic information in bacteria can be transmitted in two ways: vertically, from
the parent cell, and horizontally, from other cells in the environment. Vertically
inherited DNA usually has no or only a few differences from the DNA of the
parent cell, since such differences are introduced by the mutation process, which
occurs at a low rate.
Horizontal transfer of DNA in bacteria occurs in several ways: some species
are capable of taking up DNA directly from the environment (transformation),
whereas in others DNA transfer is phage-mediated (transduction) or
plasmid-mediated (conjugation). Within a cell, exogenous DNA may be
incorporated into the genome by homologous recombination. Natural limitations
on the efficiency of homologous recombination exist [1]. In particular, the rate
of homologous recombination depends on sequence similarity [2, 3]. This point
is still under discussion [4], but the assumption is commonly used [5–7].
Numerous approaches to the detection of homologous recombination events are
available [8–10]. However, alternative methods can also provide insights into the
features of the recombination process. In [11], an approach to the analysis of
homologous recombination in Escherichia coli was proposed that is based on the
distribution of the number of differences (DND) in fixed-size windows in pairwise
comparisons of strains. It was shown that the shape of such distributions depends
only on the genetic distance between the considered strains.
In this study this approach was applied to Escherichia coli, Burkholderia
pseudomallei and Streptococcus suis, which are reported to have different rates
of homologous recombination [12–14].
2 Data and Methods
2.1 Strains used in this study
This study used 20 strains of E. coli (5 strains from each of the phylogroups A,
B1, B2 and E), 4 strains of B. pseudomallei, 1 strain of B. mallei and 17 strains
of S. suis. Three other fully sequenced genomes of B. mallei were not used
because of high genome similarity (fewer than 100 differences in each pairwise
comparison). In the remainder of the text the B. mallei strain is treated as one
of the B. pseudomallei strains, since B. mallei is an internal clade of
B. pseudomallei and demonstrates the same properties in this study.
2.2 Concatenated alignment of universal genes
Rows of orthologs for E. coli were constructed from the full-genome alignment
using the “export orthologs” tool of Mauve [15]. For B. pseudomallei, B. mallei
and S. suis, BLAST was used to search for orthologs. Only universal genes (i.e.
genes that are present in each considered genome in a single copy) were used for
further analysis. Multiple alignments were built with ClustalW [16] and inspected
manually. All alignments were then concatenated, and columns with gaps were
excluded.
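The concatenation step can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the function name `concatenate_without_gaps` and the representation of each gene alignment as a dictionary from strain name to aligned sequence are assumptions made for illustration, and "-" is assumed to be the gap character.

```python
from typing import Dict, List


def concatenate_without_gaps(gene_alignments: List[Dict[str, str]]) -> Dict[str, str]:
    """Concatenate per-gene alignments and drop every column that contains a gap.

    Each element of gene_alignments maps a strain name to its aligned sequence
    for one universal gene (hypothetical input format).
    """
    strains = sorted(gene_alignments[0])
    kept = {s: [] for s in strains}
    for aln in gene_alignments:
        if sorted(aln) != strains:
            raise ValueError("every gene alignment must contain the same strains")
        rows = [aln[s] for s in strains]
        if len({len(r) for r in rows}) != 1:
            raise ValueError("aligned rows of one gene must have equal length")
        for column in zip(*rows):
            if "-" in column:  # skip columns with a gap in any strain
                continue
            for s, base in zip(strains, column):
                kept[s].append(base)
    return {s: "".join(chars) for s, chars in kept.items()}
```

Dropping gapped columns before windowing keeps every window the same length in every strain, so that counts of differences are comparable between pairs.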
2.3 Distributions of the number of differences
The basic approach is described in [11]. An alignment is sliced into
non-overlapping fixed-size windows, and then the distribution of the number of
differences (DND) is computed for pairs of genomes. Instead of the concatenated
blocks of the full-genome alignment used in the original publication [11], a
concatenated alignment of universal genes is used here. The reason is to exclude
intergenic regions, which have a higher mutation rate. The shapes of the
distributions for E. coli in this study and in [11] are similar, so the approach
is robust to the way the alignment is prepared. Since the average length of a
bacterial gene is about 1000 bp, the window length was set to 1000 bp. Smaller
windows are also applicable, but provide poorer resolution in comparisons of
strains with a low genetic distance between them.
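The windowing procedure described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the study; the function name `dnd` is hypothetical, and discarding a trailing partial window is an assumption, since the text does not say how the end of the alignment is handled.

```python
from collections import Counter
from typing import Dict


def dnd(seq_a: str, seq_b: str, window: int = 1000) -> Dict[int, float]:
    """Distribution of the number of differences (DND) for one pair of strains.

    Returns a mapping: number of differences in a window -> fraction of windows.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same gap-free alignment")
    n_windows = len(seq_a) // window  # assumption: a trailing partial window is dropped
    if n_windows == 0:
        raise ValueError("alignment is shorter than one window")
    counts = Counter()
    for w in range(n_windows):
        start = w * window
        block_a = seq_a[start:start + window]
        block_b = seq_b[start:start + window]
        counts[sum(a != b for a, b in zip(block_a, block_b))] += 1
    return {k: counts[k] / n_windows for k in sorted(counts)}
```

For example, `dnd(alignment["strain1"], alignment["strain2"])` would give the fractions plotted on the y-axis of a DND for that pair, with `alignment` being a gap-free concatenated alignment keyed by strain name.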
3 Results and Discussion
In the absence of homologous recombination, the number of differences in a
fixed-size window should follow a Poisson distribution, since mutations are rare
and occur independently. Deviations of the DND from the Poisson distribution
provide information about the homologous recombination process in different
bacterial species.
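As a rough illustration of this null expectation, the sketch below computes a Poisson reference curve and the deviation of an observed DND from it. It assumes that the genetic distance is the per-site fraction of differing positions, so that the expected number of differences per window is the distance multiplied by the window length; the function names are hypothetical.

```python
import math
from typing import Dict, List


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_reference(distance: float, window: int = 1000, max_k: int = 20) -> List[float]:
    """Expected fraction of windows with 0..max_k differences without recombination.

    Assumption: the mean number of differences per window is distance * window.
    """
    lam = distance * window
    return [poisson_pmf(k, lam) for k in range(max_k + 1)]


def excess_over_poisson(observed: Dict[int, float], distance: float,
                        window: int = 1000) -> Dict[int, float]:
    """Observed fraction minus the Poisson expectation for each observed count."""
    lam = distance * window
    return {k: frac - poisson_pmf(k, lam) for k, frac in observed.items()}
```

A positive excess at zero differences together with a positive excess at high counts is the kind of deviation that the following sections interpret as a mixture of vertically inherited and recombined segments.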
3.1 Escherichia coli
There are four well-defined phylogroups in E. coli: A, B1, B2 and E [17].
Phylogroups A and B1 are close to each other, and the DNDs for pairs of genomes
from these phylogroups with the same genetic distance have the same shape,
whereas B2 separated from them much earlier and behaves differently. Phylogroup
E consists of strains with a low genetic distance between them, so it cannot be
used in this analysis.
The shape of DND. Fig. 1A shows four pairs of genomes with the same genetic
distance of 0.0076: two pairs from phylogroup A and two pairs from phylogroup
B2. In pairwise comparisons of strains within a phylogroup, the shape of the DND
depends only on the genetic distance between the strains. For strains from
phylogroups A and B1, two regimes with a clearly visible change-point at 4–5
differences per window are observed in the DND: vertically inherited segments
with a low number of differences and recombined segments with a high number of
differences. The distribution of the number of differences in vertically
inherited segments is similar to the Poisson distribution, whereas the
distribution for recombined segments decays slowly and can be approximated by a
sum of Poisson distributions. This heavy tail of the distribution characterizes
the genetic distances to the sources of homologous recombination in E. coli.
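One way to write the "sum of Poisson distributions" explicitly is as a finite mixture. The form below is an illustrative reading, not a formula from the original study: the weights $w_i$, the per-source divergences $d_i$ and the window length $L$ are symbols introduced here as assumptions.

```latex
P(k) \;\approx\; \sum_{i} w_i \, \frac{\lambda_i^{k} e^{-\lambda_i}}{k!},
\qquad \lambda_i = d_i L, \qquad w_i \ge 0, \qquad \sum_{i} w_i = 1
```

Under this reading, the component with the smallest $\lambda_i$ corresponds to vertically inherited segments, and the remaining components correspond to recombined segments acquired from sources at different genetic distances.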
For phylogroup B2 there is no visible change-point in the DND. Since the number
of segments without any differences is significantly lower for pairs of strains
from phylogroup B2 than for pairs of strains from phylogroup A with the same
genetic distance between them (Fig. 1A), the ratio of the mutation rate to the
recombination rate is presumably higher in phylogroup B2 than in phylogroups A
and B1.
Setting thresholds using DND. To estimate the fraction of homologous
recombination events with a source outside the E. coli phylogroups A, B1, B2
and E, we used an algorithm that treats a high local density of substitutions as
a signal of homologous recombination.
The first step of the algorithm is to convert a multiple alignment into a
Bernoulli sequence of “0” and “1” for each strain. We used an empirical rule: a
position in the alignment is significant and is set to “1” if the considered
strain differs in this position from more than half of the members of its
phylotype and coincides with more than half of the members of another phylotype.
A high local density of “1” in a sequence indicates that a haplotype
characteristic of the other phylotype was introduced into the considered genome
by homologous recombination. Then, Adaptive Weights Smoothing (AWS) [18] is
applied to the resulting Bernoulli sequence to estimate the local density of “1”
at each position.
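The conversion rule can be sketched as follows. This is a hedged illustration, not the implementation used in the study: the function names and input format are hypothetical, the considered strain is assumed to be excluded when counting members of its own phylotype, and "another phylotype" is read as "at least one other phylotype". The second function is a plain fixed-width moving average, used here only as a simple stand-in for AWS, which adapts its smoothing window to the data.

```python
from typing import Dict, List


def bernoulli_sequence(strain: str, alignment: Dict[str, str],
                       phylogroup: Dict[str, str]) -> List[int]:
    """Mark positions where `strain` disagrees with its own group but matches another."""
    own = phylogroup[strain]
    own_members = [s for s in alignment if phylogroup[s] == own and s != strain]
    other_groups: Dict[str, List[str]] = {}
    for s in alignment:
        if phylogroup[s] != own:
            other_groups.setdefault(phylogroup[s], []).append(s)
    marks = []
    for i, base in enumerate(alignment[strain]):
        n_differ = sum(alignment[s][i] != base for s in own_members)
        differs_from_own = 2 * n_differ > len(own_members)
        matches_other = any(
            2 * sum(alignment[s][i] == base for s in members) > len(members)
            for members in other_groups.values()
        )
        marks.append(1 if differs_from_own and matches_other else 0)
    return marks


def moving_average_density(marks: List[int], half_width: int = 2) -> List[float]:
    """Local density of 1s in a fixed window (a stand-in for AWS, not AWS itself)."""
    n = len(marks)
    density = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        density.append(sum(marks[lo:hi]) / (hi - lo))
    return density
```

In practice `half_width` would be far larger than in a toy example, and an adaptive method such as AWS would be preferable because it preserves sharp boundaries between segments.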
Fig. 1. Distributions of the number of substitutions per 1000 nt window in pairwise
comparisons of strains. The x-axis shows the number of substitutions and the y-axis
shows the fraction of segments. A: pairwise comparisons of 4 pairs of E. coli strains
with the same genetic distance of 0.0076 between them: two pairs from phylogroup A
(× and +; black) and two pairs from phylogroup B2 (⊕ and ; grey). B: pairwise
comparisons of 4 pairs of B. pseudomallei strains with genetic distances 0.0021 (◦),
0.0031 (×), 0.0036 (+) and 0.0040 (♦). C: distributions of the average values of the
number of differences for pairwise comparisons of 23 pairs of strains with genetic
distances 0.025–0.028 (+), 23 pairs with genetic distances 0.0283–0.0299 (×) and
34 pairs with genetic distances higher than 0.03 (△).
To separate vertically inherited segments, which have a low substitution rate,
from recombined segments, which have a high substitution rate, a threshold must
be set. The average genetic distance between strains within phylogroups is
0.006–0.008, and it is composed of the distances in vertically inherited segments
and in recombined segments, so the threshold should lie below it. Since for
phylogroups A and B1 there are two regimes in the DND with the change-point at
3–5 differences per 1000 bp window, the threshold 0.005 provides a minimal false
positive rate.
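Applying such a threshold to a smoothed density profile could look like the sketch below. The function names are hypothetical, and the density values are assumed to be per-site estimates of the local fraction of “1” positions, such as those produced by a smoothing step.

```python
from typing import List, Tuple


def segments_above_threshold(density: List[float],
                             threshold: float = 0.005) -> List[Tuple[int, int]]:
    """Return half-open intervals (start, end) where density exceeds the threshold."""
    segments = []
    start = None
    for i, value in enumerate(density):
        if value > threshold and start is None:
            start = i  # a putative recombined segment begins
        elif value <= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(density)))
    return segments


def recombined_fraction(density: List[float], threshold: float = 0.005) -> float:
    """Fraction of positions that fall in segments above the threshold."""
    if not density:
        return 0.0
    covered = sum(end - start for start, end in segments_above_threshold(density, threshold))
    return covered / len(density)
```

Lowering the threshold towards the density typical of vertically inherited segments would increase the number of false positives, which is why the change-point of the DND is a natural guide for choosing it.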
AWS was also applied to pairwise comparisons between strains within phylogroups.
The fraction of segments presumed to be recombined in pairwise comparisons that
cannot be explained by recombination with strains from other phylogroups is
approximately 30% for phylogroup A and 34% for phylogroup B1. These values do
not depend on the genetic distance between the considered pairs of strains, which
lies in the range 0.0001–0.0084. If loci with a high mutation rate were being
mistaken for recombined segments at the threshold 0.005, the fraction of segments
with an unknown source would be higher for pairs of strains with higher genetic
distances.
For phylogroup B2 the fraction of recombined segments with an unknown source is
higher than 50%, but this estimate is unreliable, since recombined and vertically
inherited segments are mixed in the DNDs for phylogroup B2.
3.2 Burkholderia pseudomallei
A Poisson-like shape of the DND does not necessarily mean a low rate of
homologous recombination in a bacterial species. Fig. 1B presents DNDs for pairs
of B. pseudomallei strains with different genetic distances. Unlike the DNDs for
E. coli and S. suis, they have a Poisson-like shape. However, B. pseudomallei is
reported to have a high recombination rate relative to the mutation rate [19],
comparable with that of Streptococcus pneumoniae, a highly recombinogenic species
[13, 14], and much higher than that of E. coli [12]. It is unclear whether the
Poisson-like shape of the DND is common for bacteria with a high ratio of the
recombination rate to the mutation rate and low diversity within the population,
or whether it is a result of the B. pseudomallei population structure.
The absence of windows with more than 15 differences indicates that homologous
recombination with sources at a high genetic distance from B. pseudomallei is
rare or does not occur. The two B. pseudomallei chromosomes were also considered
separately. Chromosome II has higher diversity than chromosome I, but the shapes
of the DNDs for chromosomes I and II are similar.
3.3 Streptococcus suis
In this study S. suis was used as an example of a highly recombinogenic species.
In Fig. 1C, vertically inherited segments, with fewer than 4 substitutions, are
clearly visible for pairs of strains with a genetic distance of 0.0249–0.0279
between them. For pairs of strains with a genetic distance of 0.0283–0.0299,
vertically inherited segments are still visible, but there are fewer of them.
Finally, for pairs of strains with a genetic distance higher than 0.03, the
vertically inherited segments are indistinguishable from recombined ones. This
demonstrates that the detection of recombination events in highly recombinogenic
species is feasible only for closely related strains, since the genomes of more
distant strains are entirely covered by recombination.
4 Conclusions
The goal of this paper is to demonstrate that the properties of homologous
recombination in bacteria can be examined not only by detecting recombined
segments, but also in indirect ways. The approach based on the distribution of
the number of substitutions across the genome [11] was applied to three species:
E. coli, B. pseudomallei and S. suis. Each species has a unique DND shape that
reflects previously reported properties of homologous recombination. In addition,
the shape of the DNDs for E. coli phylogroup B2 differs from that for phylogroup
A, which suggests that the ratio of the mutation rate to the recombination rate
is higher in phylogroup B2. It has also been demonstrated that the shape of the
DND may be used for setting thresholds and estimating the false positive rate in
more sophisticated approaches to the detection of homologous recombination
events.
The DND-based approach to the analysis of homologous recombination is rapid and
transparent, and it can easily be applied to any species in which more than one
strain has been sequenced. For this purpose, ortholog databases for fully
sequenced bacterial genomes, such as OrtholugeDB [20], are needed. Also, the
dependence of the results on the order of genes should be examined more closely,
since it is crucial for studying species with a large number of genome
rearrangements, for example Yersinia pestis.
This is joint work with M. Gelfand.
References
1. Ellegaard K.M., Klasson L., Näslund K., Bourtzis K., Andersson S.G.: Comparative
genomics of Wolbachia and the bacterial species concept. PLoS Genet. 9, e1003381
(2013)
2. Majewski J., Cohan F.M.: DNA sequence similarity requirements for interspecific
recombination in Bacillus. Genetics. 153, 1525–33 (1999)
3. Shen P., Huang H.V.: Homologous recombination in Escherichia coli: dependence
on substrate length and homology. Genetics. 112, 441–57 (1986)
4. Bao H.X. et al.: Differential efficiency in exogenous DNA acquisition among closely
related Salmonella strains: implications in bacterial speciation. BMC Microbiol. 14,
157 (2014)
5. Falush D. et al.: Mismatch induced speciation in Salmonella: model and data. Philos
Trans R Soc Lond B Biol Sci. 361, 2045–53 (2006)
6. Fraser C., Hanage W.P., Spratt B.G.: Recombination and the nature of bacterial
speciation. Science. 315, 476–80 (2007)
7. Doroghazi J.R., Buckley D.H.: A model for the effect of homologous recombination
on microbial diversification. Genome Biol Evol. 3, 1349–56 (2011)
8. Didelot X., Falush D.: Inference of bacterial microevolution using multilocus sequence data. Genetics. 175, 1251–66 (2007)
9. Croucher N.J. et al.: Rapid phylogenetic analysis of large samples of recombinant
bacterial whole genome sequences using Gubbins. Nucleic Acids Res. 43, e15 (2015)
10. Marttinen P. et al.: Detection of recombination events in bacterial genomes from
large population samples. Nucleic Acids Res. 40, e6 (2012)
11. Dixit P.D., Pang T.Y., Studier F.W., Maslov S.: Quantifying evolutionary dynamics of the basic genome of E. coli. arXiv:1405.2548 [q-bio.PE] (2014)
12. Didelot X., Méric G., Falush D., Darling A.E.: Impact of homologous and non-homologous recombination in the genomic evolution of Escherichia coli. BMC Genomics. 13, 256 (2012)
13. Nandi T. et al.: Burkholderia pseudomallei sequencing identifies genomic clades
with distinct recombination, accessory, and epigenetic profiles. Genome Res. 25,
129–41 (2015)
14. Croucher N.J. et al.: Rapid pneumococcal evolution in response to clinical interventions. Science. 331, 430–4 (2011)
15. Darling A.E., Mau B., Perna N.T.: progressiveMauve: Multiple Genome Alignment
with Gene Gain, Loss, and Rearrangement. PLoS One. 5, e11147 (2010)
16. Larkin M.A. et al.: Clustal W and Clustal X version 2.0. Bioinformatics. 23, 2947–8
(2007)
17. Chaudhuri R.R., Henderson I.R.: The evolution of the Escherichia coli phylogeny.
Infect Genet Evol. 12, 214–26 (2012)
18. Polzehl J., Spokoiny V.: Adaptive Weights Smoothing with applications to image
restoration. J R Stat Soc Ser B Stat Methodol. 62, 335–354 (2000)
19. Cheng A.C. et al.: Genetic diversity of Burkholderia pseudomallei isolates in Australia. J Clin Microbiol. 46, 249–54 (2008)
20. Whiteside M.D., Winsor G.L., Laird M.R., Brinkman F.S.: OrtholugeDB: a bacterial and archaeal orthology resource for improved comparative genomic analysis.
Nucleic Acids Res. 41(Database issue), D366–76 (2013)