Rapid Microbial-Taxonomy Identification Tools
for Medical Diagnostics and Genomic Epidemiology
Dhany Saputra
June 30th, 2014
Abstract
(I have re-written it in such a way that it acts as an abstract, and not
as a general introduction, as you did. Then I have added a new text,
under the title Introduction, see below Abstract)
In recent years, the genomic sequences of thousands of microorganisms have been
determined and studied in detail. By combining genotypic studies with phenotypic
ones, it is now possible to develop bioinformatics tools that can be used to
make both medical diagnoses and prognoses, and that can help to prevent disease
outbreaks.
Two such tools, Reads2Type and Chainmapper, are described here. Both enable
rapid microbial taxonomy identification for medical diagnostics and public
health microbiology. Reads2Type is a web-based tool used to identify isolates. It
is rapid and accurate compared to other existing tools. The other tool,
Chainmapper, determines the composition of the complex bacterial
community in a clinical sample. This is done by direct sequencing, i.e., without
prior microbial isolation. Sequencing can be done within hours, and the
outcome provides a comprehensive view of the complex bacterial community in
the investigated sample.
(An Abstract is supposed to highlight the main results, not to
outline the content of each chapter in your thesis, as you did. So if
I were you I would add the text—which I have written here
below—just after your ‘Introduction’, which is in page_1 of your
thesis_(dhany)13may2014.pdf file—which describes what you do
in each chapter. This text is therefore before your Chapter 1 )
Introduction
The world-wide emergence and re-emergence of infectious diseases is caused by
bacteria and viruses (79.7%), protozoa (10.7%), fungi (6.3%), and parasitic
worms (3.3%) [Jones et al., 2008]. A snapshot of
such events is shown in Figure 1.1. From a socio-economic viewpoint, the
battle against infectious diseases is a very costly one. Therefore much effort is
poured into the development of methods and tools that can help to predict and
prevent the outbreak and spread of infectious diseases. In genomic
epidemiology, the DNA sequences of suspected pathogens can be thoroughly
investigated and characterized by using cutting-edge genome sequencing methods
together with computational analysis tools. Real-time genomic epidemiology can
not only shorten the characteristic time-scale of traditional patient diagnostics, but
can also make a difference in how outbreaks and the spread of
infectious diseases are prevented.
The aim of the work described in this thesis was to develop easy-to-use
bioinformatics tools for analyzing sequencing data from clinical samples, such as
urine samples. Such tools would allow rapid detection and characterization of
the microorganisms present in clinical samples.
The thesis is structured as follows. Chapter 1 describes the Center for
Genomic Epidemiology (CGE). The long-term objective of CGE is to create an
information system that can rapidly detect the spread of diseases and monitor
their geographical diffusion.
Chapter 2 describes DNA-sequencing technologies and how the resulting
data are analyzed for taxonomy identification purposes. It also describes both the
culture-dependent and the culture-independent whole genome sequencing
approaches.
The Reads2Type tool, which was developed to identify bacterial isolates, is
presented in Chapter 3, together with the results of an extensive benchmarking
study that compares Reads2Type with existing comparable tools
for species identification. The benchmarking results are presented in Paper II.
Chapters 4, 5 and 6 focus on the Chainmapper tool, which can be used to
determine the composition of the complex bacterial community in an investigated
clinical sample. The Chainmapper tool was successfully applied to urine
samples. The results for this type of sample are presented in Paper III, while the
results for other sample types are shown in Chapter 6.
The main points are discussed and summarized in Chapter 7.
(Dhany, I did notice that in page_133 of your file,
thesis_(dhany)13may2014.pdf the title of chapter_7 is ‘Epilog’. If I were you I
would eliminate the word ‘Epilog’, and substitute it with ‘Conclusion and
future perspectives’. Also, I would eliminate both the sectioning 7.1 and 7.2,
and their title). I did this already. Yet I have not edited the text of that
section)
Chapter 1
The Center for Genomic Epidemiology
Figure 1.1. A world-wide map showing the emergence and re-emergence of infectious
diseases. The red bullets indicate regions affected by emerging diseases, the blue by
re-emerging diseases, and the black a “deliberately emerging” disease, as explained in
[Morens et al., 2004]
1.1 CGE and the work packages
The Center for Genomic Epidemiology (CGE) arose from a collaboration
between the Center for Biological Sequence Analysis and the National Food
Institute, both at the Technical University of Denmark. The idea behind CGE is to
combine next-generation and parallel sequencing technology, computational
biology, and global epidemiology to provide real-time genomic epidemiology
results.
To accomplish this goal, the work at CGE is subdivided into work packages (WP-i,
i = 1, 2, …, 7). The packages are listed below together with a brief description of the
work that they entail:
WP-1. To develop tools for analyzing and organizing data of complete and nearly
complete genome sequences. The results of my doctoral studies
contributed to WP-1, as I developed tools for rapid identification of
taxonomy, antibiotic resistance, and virulence in single isolated bacterial
strains and metagenomic samples.
WP-2. To build the traditional (what do you mean by ‘traditional’? You have to
define briefly what it means). classification methods (for what?).
WP-3. To identify novel genomic targets for epidemiology and evolutionary
investigations.
WP-4. To explore, via a holistic approach, new challenges in epidemiology.
WP-5. To make pathogenicity predictions.
WP-6. To combine data from sequence analysis with epidemic and geographic
information.
WP-7. To build the CGE web interface.
1.2 CGE workflow
The CGE-workflow for the rapid complete genome analysis of isolates is depicted
in Figure 1.2. Both a server-side and a client-side are considered.
(I am not sure, yet, the text in red should be kept
The client side refers to the client computers that request and receive services,
over the network, from a centralized resource provider at the server side. On the
one hand, client computers initially display a standardized interface that allows
services to be requested from the CGE server and displays the results that the server
returns; on the other hand, the CGE server waits for client requests and responds
to them.)
Figure 1.2. CGE workflow
The client side of the CGE workflow concerns the users, i.e., the healthcare authorities,
particularly those located in remote and/or less-developed areas where
access to clinical facilities is limited. The idea is that laboratories around the
world submit complete or partial genome sequences to the CGE server. These data
are then analyzed in silico and a rapid response regarding pathogen
identification is returned to the healthcare authorities, who then decide
how to proceed. Because pathogen identification is crucial for their decision
making, the services provided by CGE are an innovative aid to the detection
and prevention of pathogenic diseases.
In particular, once the collected medical samples have been grown in a laboratory,
the bacterial strains isolated, and their DNA sequenced, the bacterial species
is identified using the Reads2Type tool. Based on its output, the
healthcare authorities can subsequently compare their isolates to the precompiled
microbial-isolate data present in the CGE databases. To do that, they need to agree to
make their data publicly available, thereby contributing to global real-time
surveillance. Once the sequencing files are uploaded to the CGE server, the
reads are assembled to reconstruct a representation of the original DNA
sequences. Based on this draft assembly, other tools (namely MLST [Larsen et
al., 2012], KmerFinder, TaxonomyFinder, and SpeciesFinder) are used for
species identification. In addition, a so-called SNP-Tree module uses the draft
assembly to provide users with additional information regarding the times and places
where closely related outbreaks were seen before [Leekitcharoenphon et al.,
2012]. Information regarding virulence, pathogenicity prediction [Cosentino et al.,
2013], and antibiotic resistance [Zankari et al., 2012] of the microorganisms is
also provided, based on the draft DNA assembly.
Chapter 2
Next Generation Microbial
Diagnostics
The approach adopted in microbial diagnostics is currently shifting from a
conventional phenotypic approach to a sequencing-based one. Within the conventional
approach, microbiologists identify microorganisms based on phenotypic and
biochemical investigations. Microbial identification by this approach is often
error-prone, as two different pathogens may have similar physical properties and
biochemical reactions [Cattoir et al., 2010] [Frank et al., 2008]. Also, pathogens
may be wrongly identified as non-pathogens. The development of sequencing
technologies has revolutionized the field of genomics and provided a reliable
alternative to the phenotypic approach. It has also triggered the development of
computational tools for the rapid analysis of sequencing data. In this chapter, the
next generation sequencing (NGS) analysis steps for microbial diagnostics, from
sequencing to organism identification, are described, starting from a historical
perspective.
2.1 Next generation sequencing
Figure 2.1. The Sanger sequencing workflow [Men et al., 2008]
The era of the first generation sequencing technology started in the 1970s with the
development of the Sanger sequencing method [Sanger and Coulson, 1975] (in
figure 2.1 caption you cite Men et al. 2008, not Sanger and Coulson, 1975! ).
In the 1980s and 1990s the method was further improved thanks to techniques
such as fluorescence labeling [Prober et al., 1987] and capillary electrophoresis
[Swerdlow and Gesteland, 1990] [Swerdlow et al., 1991]. Figure 2.1 shows the
workflow of Sanger sequencing. The read length of Sanger sequencing
reaches up to 1000 base pairs (bp) with 99.999% per-base raw accuracy.
Unfortunately, Sanger sequencing has a low throughput and a high cost
[Shendure and Ji, 2008]. This era culminated in the successful
sequencing of the first human genome in 2001, which cost USD 3 billion and took
about ten years to finish [Venter et al., 2001]. The infeasibility of resequencing more
people with the Sanger approach stimulated the invention of NGS.
Figure 2.2. The workflow of pyrosequencing [Metzker, 2009]
Not long after that, the era of NGS began, marked by the introduction of various
sequencing machines that produce short DNA sequences [Metzker, 2009]. The short
DNA sequences produced by these machines are also known as “reads”. NGS
technology is further classified into second generation sequencing (SGS) and
third generation sequencing (TGS) technologies, which produce short reads and
significantly longer reads, respectively [Schadt et al., 2010]. However, SGS
technology is still widely used for sequencing, as TGS technology is still in a
development phase, with the Sanger sequencing approach being used for validation
purposes [Shendure and Ji, 2008].
Figure 2.3. The workflow of Illumina Genome Analyzer [Ansorge, 2009]
The first commercialized SGS machines were 454 pyrosequencers, which achieve
higher sequencing throughput at a lower cost than Sanger sequencing [Margulies
et al., 2005]. Subsequently, in 2007 Illumina/Solexa came up with the Genome
Analyzer and Applied Biosystems introduced SOLiD sequencing machines. These
new machines became the technology of choice for whole genome sequencing
(WGS), genome resequencing, chromatin immunoprecipitation sequencing
(ChIP-seq), ribonucleic acid sequencing (RNA-seq), and metagenomic sequencing. SGS
methods basically follow the same steps: library preparation by fragmenting
the DNA, ligation of adaptor sequences, clonal amplification, and sequencing
cycles based on enzyme-driven biochemistry and data imaging. The workflows of
454 pyrosequencing and Illumina sequencing are illustrated in Figure 2.2 and
Figure 2.3, respectively. One of the 454 pyrosequencing machines, the GS FLX Titanium
XL+, has a read length of up to 1 kbp with a typical throughput of 700 Mb and a runtime
of approximately 23 hours (source: http://www.454.com/). However, a known artifact
of 454 sequencing is homopolymer errors, i.e. insertions of the same base, leading to a high error
rate [Huse et al., 2007]. Meanwhile, Illumina reads are shorter than those of 454
pyrosequencing, i.e. 100 bp with Genome Analyzer machines and 250 bp at best
with MiSeq machines, whereas the Illumina throughput can reach 600 Gb for
HiSeq machines. The MiSeq needs 1.5 hours for preparation and 4
hours for sequencing using the Nextera Sample Preparation Kit (source:
http://www.illumina.com/).
Figure 2.4. The workflow of Ion Torrent PGM sequencing [Herper, 2010]
TGS technology is mostly based on single-molecule, real-time (SMRT) systems without the
need to halt between read steps [Schadt et al., 2010]. It produces longer reads,
each of which represents a single DNA molecule. The Ion Torrent semiconductor
sequencer, released in 2010, was the first TGS machine [Rothberg et al., 2011]
[Merriman et al., 2012]. It is based on ion detection instead of dye-labeled
oligonucleotides and expensive optics (see Figure 2.4 for the workflow). There are
two types of commercially available Ion Torrent machines: Ion PGM and Ion
Proton. The throughput of Ion Proton is 60 Gb per run, with reads reaching up to
200 bp, finished within 2-4 hours (source:
http://www.lifetechnologies.com/dk/en/home/brands/iontorrent.html).
PacBio RS II was another TGS machine, launched in 2013, which is anticipated to
produce 50,000 reads per run with read lengths above 20,000 bp at most [Roberts et
al., 2013]. Regrettably, the machine is large and expensive and has a very high
error rate. However, with the right algorithms and the help of SGS machines
[Koren et al., 2012], these errors can be cancelled out to obtain accurate assemblies,
as demonstrated by Korlach and co-workers, who generated a de novo assembled genome with
99.999% base concordance with its reference genome [Chin et al., 2013]. Illumina
also “goes long” by acquiring Moleculo, which synthesizes long reads from discrete
pools of short Illumina reads. Nevertheless, Moleculo sequencing still
requires small DNA fragments as input, can introduce biases, and still has
GC-rich artefacts [Eisenstein, 2013]. In addition, Oxford Nanopore is
commencing an early-access program for scientists to test its USB stick-sized
MinION sequencer by the end of November 2013 [McDougall, 2013].
None of the sequencing data in this thesis comes from TGS machines other than Ion Torrent, but
TGS machines are worth mentioning, as they can potentially
replace the SGS ones in the future.
2.2 Preprocessing the Sequencing Data
The sequencing files produced by NGS machines must be preprocessed to ensure
their quality. There are two important steps in preprocessing: quality control
and read trimming.
FastQC is open source software for quality control that can spot probable errors
coming from the sequencers and the library preparation [Andrews, 2010]. Guided by the
FastQC report, one can proceed with read trimming and then run
FastQC again to check the quality of the sequencing data after trimming.
Read trimming means clipping out the parts of the reads that have low quality scores or
contain adaptor sequences, so that these parts do not interfere with the
subsequent analysis steps. In addition, the first several base pairs can also be cut when
DNA damage is presumed. Finally, very short sequences resulting from read
trimming are removed. Genobox-trim is the open source software used in subsequent
chapters for read trimming [Rasmussen, 2011].
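The trimming steps described above can be illustrated with a minimal sketch. It is not the Genobox-trim implementation; the function names, the Phred+33 quality encoding, and the default thresholds (minimum quality 20, minimum length 30) are assumptions chosen only for illustration.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 is assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def trim_read(sequence, quality_line, adaptor="", min_quality=20, min_length=30):
    """Return the trimmed (sequence, quality) pair, or None if the read is dropped.

    Hypothetical helper: thresholds and behaviour are illustrative assumptions.
    """
    # Clip the adaptor and everything after it, if the adaptor is present.
    if adaptor:
        cut = sequence.find(adaptor)
        if cut != -1:
            sequence, quality_line = sequence[:cut], quality_line[:cut]

    # Remove low-quality bases from the 3' end of the read.
    scores = phred_scores(quality_line)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    sequence, quality_line = sequence[:end], quality_line[:end]

    # Discard reads that became too short after trimming.
    if len(sequence) < min_length:
        return None
    return sequence, quality_line
```

For instance, a read whose last two bases have quality character `#` (Phred score 2) loses those two bases, and a read that ends up shorter than `min_length` is removed entirely.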
2.3 Genome Alignment
After read trimming, the sequencing files are aligned to reference genomes.
Sequence alignment, or sequence mapping, means matching DNA, RNA, or protein
sequences to identify regions of similarity. These similarities are consequences of
functional, structural, or evolutionary relationships between the sequences [Mount,
2004]. Matches, mismatches, and gaps can occur in a sequence alignment.
Two aligned elements are called a mismatch if they are different;
otherwise they are called a match. To minimize the number of mismatches,
elements may be inserted into or deleted from either of the aligned sequences, and these
insertions/deletions are called gaps. An alignment is scored by adding a
reward for each match and subtracting a penalty for each
mismatch and gap found.
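The scoring scheme above can be sketched as follows. The score values (match +1, mismatch -1, gap -2) and the function names are illustrative assumptions, not the parameters of any aligner cited in this thesis. The second function finds the best global alignment score by dynamic programming (the Needleman-Wunsch recurrence with a linear gap penalty).

```python
def alignment_score(aligned_a, aligned_b, match=1, mismatch=-1, gap=-2):
    """Score two already-aligned sequences of equal length; '-' denotes a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


def best_global_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score via dynamic programming (linear gap penalty)."""
    # previous[j] holds the best score of aligning the processed prefix of
    # seq_a with the first j elements of seq_b.
    previous = [j * gap for j in range(len(seq_b) + 1)]
    for i, a in enumerate(seq_a, start=1):
        current = [i * gap]
        for j, b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]
```

With these assumed scores, the alignment `ACGT-A` / `ACCTGA` has four matches, one mismatch, and one gap, giving 4 - 1 - 2 = 1.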
The major challenge of genome alignment is how to quickly but accurately align
numerous reads to lengthy reference genomes using as few computational
resources as possible. To speed up the alignment, several short read aligners use a
computational strategy called indexing. Similar to the index at the end of a book,
the index of a reference genome helps the aligners to rapidly find short sequences
embedded within it. In general, there are two approaches to genome alignment
that use an index to speed up the mapping: hash table and trie. BLAST [Altschul et
al., 1990] [Altschul et al., 1997], BLAT [Kent, 2002], SOAP [Li et al., 2008b],
MAQ [Li et al., 2008a], SHRiMP [Rumble et al., 2009], Novoalign [Technologies,
2013], and BFAST [Homer et al., 2009] implement hash table-based algorithms,
whereas Bowtie [Langmead et al., 2009], Bowtie2 [Langmead and Salzberg,
2012], BWA-ALN [Li and Durbin, 2009], BWA-SW [Li and Durbin, 2010],
BWA-MEM [Li, 2013] and SOAP2 [Li et al., 2009b] implement trie-based
algorithms.
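The hash table idea can be illustrated with a toy k-mer index. This is a simplified sketch under stated assumptions (exact matching only, a single reference held in memory, hypothetical function names); the aligners listed above are considerably more elaborate and also handle mismatches and gaps.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for position in range(len(reference) - k + 1):
        index[reference[position:position + k]].append(position)
    return index


def find_exact_matches(read, reference, index, k):
    """Use the first k-mer of the read as a seed, then verify the full read."""
    hits = []
    for position in index.get(read[:k], []):
        if reference[position:position + len(read)] == read:
            hits.append(position)
    return hits


reference = "ACGTACGTTAGC"
index = build_kmer_index(reference, k=4)
hits = find_exact_matches("ACGTT", reference, index, k=4)
```

The index lookup replaces a scan of the whole reference: only the two positions where the seed `ACGT` occurs are checked, and only one of them extends to the full read.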
In this thesis, Bowtie2 and BWA-MEM are used to align metagenomic samples
because they require less memory and are fast [Li, 2013]. The product of
sequence alignment is a Sequence Alignment/Map (SAM) file [Li et al., 2009a]. It
is recommended to always compress SAM files into BAM (Binary
Alignment/Map) files, as this conversion reduces disk use [Li and Durbin,
2009]. The files can then be used for downstream analyses, such as generalized
RNA-seq, variant calling, genome assembly, and taxonomy identification. The
downstream analysis on which this thesis focuses is taxonomy identification.
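As an illustration of how a SAM file can feed taxonomy identification, the sketch below counts mapped reads per reference sequence. It relies only on the SAM text layout (header lines start with `@`; the tab-separated fields begin with QNAME, FLAG, and RNAME; FLAG bit 0x4 marks an unmapped read). The function name and the example reference names are hypothetical, and this is not the procedure used by the tools in this thesis.

```python
from collections import Counter

UNMAPPED_FLAG = 0x4  # SAM FLAG bit set when the read is unmapped


def count_mapped_reads(sam_lines):
    """Count mapped reads per reference name from an iterable of SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, reference_name = int(fields[1]), fields[2]
        if flag & UNMAPPED_FLAG:
            continue
        counts[reference_name] += 1
    return counts


# Toy records with hypothetical read and reference names.
example = [
    "@SQ\tSN:ref_A\tLN:1000",
    "read1\t0\tref_A\t10\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t16\tref_A\t50\t60\t5M\t*\t0\t0\tTTGCA\tIIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGG\tIIIII",
    "read4\t0\tref_B\t7\t60\t5M\t*\t0\t0\tCCCCC\tIIIII",
]
counts = count_mapped_reads(example)
```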
2.4 Taxonomy Identification
Figure 2.5. The taxonomy of American black bear [Campbell and Reece, 2005]
Once the sequencing data of the sample have been obtained, the taxonomy of the
pathogens needs to be identified. Taxonomy is the science of classifying organisms.
The classification of life has changed many times, from the Linnaean classification,
which divides life into plants and animals, to Woese’s three-domain classification
system, which divides life into archaea, bacteria, and eukaryotes. Bacterial
systematics has also changed over time. One of the conventional
approaches is Gram staining, which is still used as the first step of bacterial
identification [Austrian, 1960]. Based on the Gram staining and the cell shape,
microbiologists narrow down the bacterial identification to a smaller group, but
more effort is needed to establish the precise identity.
Current taxonomic classification uses ordered ranks: domain, phylum,
class, order, family, genus, and species, where species is the lowest rank. Figure 2.5
shows the taxonomy of the American black bear. According to Bergey’s Manual of
Systematic Bacteriology, the basic and most important taxonomic rank in bacterial
systematics is the species. A microbial species is defined as a collection of organisms
with a 16S rRNA sequence similarity of at least 98.7% amongst the members, and a
DNA-DNA reassociation experiment is required for confirmation [Stackebrandt and
Ebers, 2006]. However, there are some species with controversial definitions. For
example, Yersinia pestis and Yersinia pseudotuberculosis are very similar in their 16S
rRNA sequences, yet the Judicial Commission of the International Committee on
Systematics of Prokaryotes rejected the idea of merging them into one species because
they differ in the danger they pose to public health [Whitman et al., 2012]. A genus is a
collection of species that have the same phenotypic characteristics and that
cluster together by 16S rRNA sequence. The genus classification is still used although
there is no satisfactory definition of genus, and higher taxa below phylum are even
less certain in definition than genus. Moreover, different organisms within the
same species may have completely different impacts on public health. Escherichia
coli O157:H7, for instance, is highly virulent, while many other E. coli subtypes
are non-pathogenic [Wirth et al., 2006]. Hence, a species is further subtyped into
strains based on their special features. For example, biovar/biotype and
serovar/serotype are used to distinguish strains having different biochemical and
antigenic properties, respectively [Whitman et al., 2012].
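The 98.7% 16S rRNA similarity criterion mentioned above can be expressed as a simple screen on two already-aligned sequences. The sketch below is an assumption-laden illustration: definitions of sequence identity vary (here, columns where both sequences have a gap are ignored and any other gap counts as a difference), the function names are hypothetical, and passing the screen does not replace the DNA-DNA reassociation experiment.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical positions between two aligned sequences ('-' = gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue  # a column with gaps in both sequences carries no information
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * identical / compared


def passes_species_screen(aligned_a, aligned_b, threshold=98.7):
    """True if the 16S similarity reaches the threshold of Stackebrandt and Ebers."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

As the Yersinia example shows, such a screen is only a first indication: two organisms can pass it and still be kept as separate species.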
Identifying the taxonomy of the pathogen in a sample is helpful because every
pathogen inherits certain properties from its ancestor and shares these properties
with other members of its taxonomic group. In addition, microbial taxonomy
reveals how different microbes are related. Consequently, when organisms
are correctly classified, the efficacy of a drug can be tested and developed against
all pathogens within the same taxonomic group, especially if the group can be
characterized by a molecule, pathway, or trait targeted by the drug. Furthermore,
when two pathogens are found to differ at the species level despite similar
phenotypic traits, the possibility of different treatments should be considered
carefully [Berman, 2012].
2.5 Culture Dependent WGS
Figure 2.6. Culture dependent approach vs direct sequencing approach
Several sequencing approaches for routine diagnosis have been proposed, such as
culture-dependent and culture-independent WGS, illustrated in Figure 2.6. In
culture-dependent WGS, microbiologists cultivate the microbes in patient samples,
wait until the microbial colonies emerge, purify the DNA, and proceed to
sequencing. Producing a sufficient amount of pure organisms in a colony requires
1-2 days for fast-growing bacteria, or even months for slow-growing ones [Köser et
al., 2012b]. Figure 2.7 shows the complexity and the timescale of cultivating a
bacterial pathogen. Once the whole DNA of the selected colony has been
sequenced, computational analysis tools are used to identify the taxonomy of the
species. This is done by aligning the sequences to defined marker genes,
either a single locus or multiple loci [Larsen et al., 2012].
Once the species is detected, the presence of antimicrobial resistance and virulence
factors can also be determined, in order to find the right antibiotic for the patient and to
examine the disease progression. Micro dilution and disk diffusion are common
methods for antibiotic susceptibility testing, and they take one extra day, or even longer for
some bacteria [Köser et al., 2012a]. However, the databases of resistance gene
and virulence factor sequences are still incomplete; therefore conventional
susceptibility testing is still required to complement the molecular prediction.
Furthermore, WGS also promises a definitive resolution for outbreak
investigation, for which it is essential to identify the strain.
Figure 2.7. Workflow of culture-dependent investigation of bacteria according to
[Didelot et al., 2012]
Compared to traditional phenotypic analysis, WGS-based microbial
identification provides more rapid, unambiguous, and detailed genotype information on
the disease-causing pathogens, which is very helpful for finding a targeted treatment for
the patient and for preventing an outbreak at an earlier stage. The pathogen surveillance
of the cholera outbreak in Haiti [Hasan et al., 2012][Hendriksen et al., 2011] and the E.
coli O104:H4 outbreak in Germany [Brzuszkiewicz et al., 2011][Grad et al.,
2012][Rasko et al., 2011] by sequencing the entire bacterial genome are some of
its success stories. WGS also enables the reconstruction of the transmission
pathways between healthcare centers, hospital wards, or even patients in the same
ward [Andersen et al., 2010][Köser et al., 2012b].
2.6 Metagenomics
Figure 2.8. Metagenomic workflow: targeted gene sequencing (top) vs whole
genome sequencing (bottom). Source: http://bgiamericas.com/servicesolutions/services/metagenomics/
The second approach in sequence-based medical diagnosis is the culture-independent
technique, also known as the metagenomic approach, illustrated in Figure 2.6.
Metagenomics gives a comprehensive genomic view of the ecological population in a
biological sample. Understanding the microbial composition of a
sample is important in many disciplines, such as healthcare [Loman et al., 2013],
marine biology [Brum et al., 2013], and terrestrial biology [Tseng et al., 2013].
Microbes live in many different environments and interact to contribute
to the stability of their habitats. In nature, the microbial community is integral
to important cycles of life such as photosynthesis [Cuvelier et al.,
2010], conversion of nutrients [Sessitsch et al., 2012], and degradation of pollutants in
the environment [Fang et al., 2012]. Billions of microbes living in synergy in the
gut help the host to digest food and protect it against pathogens [Fujimura et al.,
2010]. However, before the advent of metagenomics, the understanding of
microbial communities was limited because microbiologists studied
individual species one by one in pure cultures and were unable to study
microbes in their habitats. In addition, studies of microbial communities in
biofilms focused on a few selected species at a time [O’Toole et al.,
2000].
The most prominent advantage of metagenomics over microbial cultivation is its
ability to investigate hard-to-cultivate organisms [Streit and Schmitz, 2004], whereas
microbial cultivation is currently still the primary method for microbial
identification [van Belkum et al., 2013]. Furthermore, less than 0.02% of the
organisms in some environments are cultivable [Rappé and Giovannoni, 2003]
[Hugenholtz et al., 1998], while the rest are hard to culture. Even if there were a way
to cultivate them with artificial, superenriched media, the effort would be too great.
New pathogenic strains are often hard to cultivate, making medical diagnosis
even more challenging [Campbell et al., 2013]. Metagenomics offers an in
silico solution that explores all organisms in a sample, including the uncultured
ones.
There are two approaches in metagenomic sequencing: targeted gene and whole
genome sequencing, see Figure 2.8. In targeted gene sequencing, the ITS and 16S
rRNA regions are typically amplified to survey the community of fungi
and bacteria in a sample, respectively [Buée et al., 2009] [Wang and Qian, 2009].
However, amplification bias in 16S sequencing may occur and affect the
accuracy [Yilmaz et al., 2010]. Additionally, the 16S rRNA sequences of several
species are sometimes very similar [Olive and Bean, 1999]. Whole genome
shotgun metagenomics offers a more in-depth analysis and a broader view of the
microbial composition than 16S sequencing, as it captures all parts of the
genomes in the sample [Chen and Pachter, 2005]. In this approach, the DNA
fragments from the sample are sequenced directly, without the need for any further
preprocessing steps. However, whole genome shotgun metagenomics requires a higher
depth of sequencing coverage to enable a closer look at the underrepresented
minority organisms.
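The need for deeper sequencing can be made concrete with the usual back-of-the-envelope estimate of coverage: number of reads times read length divided by genome size. The sketch below scales this by the fraction of the sequenced DNA that belongs to the organism of interest. That linear scaling, the function names, and the numbers in the example are simplifying assumptions for illustration, not results from this thesis.

```python
import math


def expected_coverage(num_reads, read_length, genome_size, dna_fraction=1.0):
    """Expected average depth for one organism in a (metagenomic) sample.

    dna_fraction is the assumed share of sequenced bases coming from the organism.
    """
    return num_reads * read_length * dna_fraction / genome_size


def reads_needed(target_coverage, read_length, genome_size, dna_fraction=1.0):
    """Number of reads required to reach the target depth for that organism."""
    return math.ceil(target_coverage * genome_size / (read_length * dna_fraction))


# Illustrative numbers: 10 million reads of 100 bp and a 5 Mb genome.
isolate_depth = expected_coverage(10_000_000, 100, 5_000_000)
minority_depth = expected_coverage(10_000_000, 100, 5_000_000, dna_fraction=0.01)
```

Under these assumptions the same run gives about 200x depth for a pure isolate but only about 2x for an organism that contributes 1% of the DNA, which is why minority organisms call for deeper sequencing.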
Chapter 3
Reads2Type
3.1 Introduction
Though the development of metagenomics is exciting and rewarding for
microbiologists, culture-based WGS is still of interest for detecting diseases and
preventing outbreaks [Didelot et al., 2012] [Köser et al., 2012a]. Implementing
metagenomics still raises open issues regarding speed, complexity of the workflow,
sequencing costs, and sensitivity [Tanaseichuk et
al., 2012]; therefore some people still rely on microbial cultivation in clinical
practice. In the culture-based approach, once the significant microbial isolate is
detected, the species is identified, and then antimicrobial susceptibility, microbial
susceptibility, and outbreak possibility are investigated.
Reads2Type is a novel web-based taxonomy identification tool that fits well as a
front-end component of a culture-based WGS analysis pipeline. The first paper,
”Reads2Type: Rapid Microbial Taxonomy Identification” in Section 3.2,
introduces and elaborates on Reads2Type. Figure 3.1 shows where Reads2Type fits
in the CGE pipeline for WGS analysis. The key advantages of the web-based
Reads2Type are its speed, minimal use of Internet bandwidth, and user-friendliness.
Reads2Type has also been tested using Internet broadband connections in Sweden
(Figure 3.2), Jordan (Figure 3.3), and Indonesia (Figure 3.4) to correctly identify
the species of a Klebsiella pneumoniae Short Read Archive (SRA) raw sequencing
file. No file submission is required in Reads2Type and the identification is
performed entirely on the user’s computer. The implementation of Reads2Type is
available as a free-to-use web service
(http://cge.cbs.dtu.dk/services/Reads2Type) as a part of the services
provided by the CGE project.
Figure 3.1. Where Reads2Type fits in the CGE pipeline of Figure 1.2
Figure 3.2. Screen capture of Reads2Type correctly identifying a Klebsiella
pneumoniae raw read file, tested using an Internet broadband connection in Lund,
Sweden
Figure 3.3. Screen capture of Reads2Type correctly identifying a Klebsiella
pneumoniae raw read file, tested at the National Center for Agricultural Research
and Extension, Amman, Jordan
Figure 3.4. Screen capture of Reads2Type correctly identifying a Klebsiella
pneumoniae raw read file, tested using an Internet broadband connection in Indonesia
The second paper, presented in Section 3.3, “Benchmarking of Methods for Genomic
Taxonomy”, compares the performance of five available methods for species
identification: Reads2Type, SpeciesFinder, KmerFinder, TaxonomyFinder, and MLST.
The paper notes that only the first three methods are suitable for taxonomy
identification of raw sequencing files. Among those tools, Reads2Type is the only
species identification tool with a web-based version that does not require the
sequencing data to be uploaded to the server. Instead, a small reference database
(4.6 MB) is automatically transferred into the client computer’s memory in the
initiation step. Based on an extensive series of experiments identifying the
species of 10,407 raw sequencing files representing 168 species, Reads2Type (87%)
is slightly more accurate than SpeciesFinder (86%), while KmerFinder (97%) is the
most accurate of the three. However, on average Reads2Type is almost 2.5 times as
fast as SpeciesFinder and KmerFinder. The low runtime of Reads2Type results from
three design choices: the small reference database (i.e. the database only
consists of frequently seen bacterial sequences), the narrow-down strategy (i.e.
when a read matches a sequence shared by a group of bacteria, the search space is
reduced to that group), and the use of a suffix tree for fast string matching.
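The narrow-down strategy can be illustrated with a minimal Python sketch. It is
not the Reads2Type implementation: the marker sequences, the species groups, and
the function name narrow_down are hypothetical, and Python's built-in substring
search stands in for the suffix tree mentioned in the text.

```python
from typing import Dict, FrozenSet, Iterable, Set

# Hypothetical toy reference: marker sequence -> species sharing that marker.
TOY_MARKERS: Dict[str, FrozenSet[str]] = {
    "ACGTAC": frozenset({"Klebsiella pneumoniae", "Escherichia coli"}),
    "GGATCC": frozenset({"Klebsiella pneumoniae"}),
    "TTTAAA": frozenset({"Staphylococcus aureus"}),
}


def narrow_down(reads: Iterable[str],
                markers: Dict[str, FrozenSet[str]]) -> Set[str]:
    """Shrink the candidate species set each time a read hits a marker."""
    # Start with every species present in the reference.
    candidates: Set[str] = set().union(*markers.values())
    for read in reads:
        for marker, group in markers.items():
            # Markers of species already ruled out are skipped, which is
            # the reduced search space; plain substring search is used here
            # instead of a suffix tree.
            if group & candidates and marker in read:
                candidates &= group
        if len(candidates) == 1:
            break  # a single species remains, so the search can stop
    return candidates


if __name__ == "__main__":
    toy_reads = ["NNACGTACNN", "CCGGATCCAA"]
    print(narrow_down(toy_reads, TOY_MARKERS))
```

In this toy run the first read matches a marker shared by two species, and the
second read matches a marker specific to one of them, so the candidate set
shrinks from three species to two and then to one.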
3.2 Paper I
The following paper was submitted to Journal of Clinical Microbiology.
3.3 Paper II
The following paper was submitted to Journal of Clinical Microbiology.
Chapter 4
Chainmapper
4.1 Introduction
Culture-based WGS has several drawbacks: the complicated procedures for
microbial cultivation, the long cultivation time, the different selective media
required for different types of organisms to grow, sequencing data that cover
only one genome at a time [Didelot et al., 2012], and the inability to culture
some organisms in the samples [Hugenholtz et al., 1998] [Rappé and Giovannoni,
2003]. Sequencing the 16S rRNA genes of the bacteria in a complex environmental
sample offers a culture-independent approach to analyzing its genetic diversity
without the need to grow isolates [Muyzer et al., 1993]. This approach is faster
and can reveal the presence of uncultured organisms. However, 16S rRNA-based
analysis of metagenomic samples only identifies the organisms in the sample; it
cannot detect plasmids, non-16S genes, or non-coding regions. Moreover, 16S
sequencing requires gene filtering to retain only the expected 16S genes.
Whole genome shotgun metagenomic sequencing allows sequencing all genomes
in a sample simultaneously [Tyson et al., 2004]. This technique offers an
alternative solution for cataloging the microbial composition of a sample. It
makes it possible to find antibiotic resistance genes, virulence genes, and
other genes, and even to detect the presence of plasmids. However, short reads
from a mixture of several organisms make it difficult to assign each read to an
organism and a gene. When a reliable analysis technique is available, whole
genome shotgun metagenomic sequencing allows deep exploration of the organism
composition, the enumeration of each organism in the sample, as well as other
DNA-related information.
Chainmapper is a command line tool for profiling the microbial community in a
biological sample and estimating the abundance of its members. Chainmapper can
be accessed at http://cge.cbs.dtu.dk/services/Chainmapper-1.0/.
4.2 Method
The input data is the raw metagenomic sequencing data. The workflow of
Chainmapper is shown in Figure 4.1.
Figure 4.1. The workflow of Chainmapper
4.2.1 Data Preprocessing
As explained in Chapter 2.2, the quality of the sequencing data must be ensured
before the analysis starts. Chainmapper provides an option for automated read
trimming (Step 1 in Figure 4.1). However, users are still advised to perform the
read trimming and quality control manually and to start Chainmapper with trimmed
sequencing data. Manual preprocessing is advised because trimmed sequencing
data might still be of low quality due to presumed DNA damage.
4.2.2 Alignment
Chainmapper aligns the metagenomic sequence reads to one reference database
immediately after another (Steps 2.1, 2.2, 2.3, 3.1, and 5.1 in Figure 4.1). By
default, the reference database used for mapping is the NCBI database of
bacterial complete and draft genomes [Federhen, 2012]. Other reference
databases, such as the databases of fungi, viruses, invertebrates, protozoa, all
nucleotides, and MetaHIT, can also be used. All alignments are done using BWA
MEM [Li, 2013], except the alignment against the nucleotide database (Step 5.1
in Figure 4.1), which uses Bowtie2 [Langmead and Salzberg, 2012] because BWA MEM
is unable to index the huge nucleotide database. Species identification, strain
identification, antibiotic resistance finding, and virulence finding are the
four Chainmapper modules that can be run in parallel after the alignment.
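The idea of mapping against one database after another can be sketched in
Python as follows. This is only an illustration under stated assumptions, not
the Chainmapper code: it assumes that reads left unmapped by one database are
passed on to the next one, the function names toy_align and chain_map are
hypothetical, and an exact substring match stands in for a real aligner such as
BWA MEM or Bowtie2.

```python
from typing import Dict, List, Sequence, Tuple


def toy_align(read: str, reference: str) -> bool:
    """Stand-in for a real aligner: exact substring match only."""
    return read in reference


def chain_map(reads: Sequence[str],
              databases: Sequence[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map reads against the databases in order.

    Assumption: only the reads that a database fails to map are forwarded
    to the next database in the chain.
    """
    assigned: Dict[str, List[str]] = {name: [] for name, _ in databases}
    remaining = list(reads)
    for name, reference in databases:
        still_unmapped = []
        for read in remaining:
            if toy_align(read, reference):
                assigned[name].append(read)
            else:
                still_unmapped.append(read)
        remaining = still_unmapped
    # Reads that matched no database at all.
    assigned["unmapped"] = remaining
    return assigned


if __name__ == "__main__":
    toy_databases = [("bacteria", "ACGTACGTTT"), ("viruses", "GGGCCCAAA")]
    print(chain_map(["ACGT", "CCCA", "TTTTTT"], toy_databases))
```

Under this assumption each later database only sees the reads that earlier ones
could not place, which keeps the expensive alignment against large databases
limited to a smaller set of reads.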
4.2.3 Species Identification
Species identification is the core function of Chainmapper (Steps 2.1, 3.1, 4.1,
5.1, and 6.1 in Figure 4.1). This module quickly identifies the dominant
organisms in the sample at the species level and calculates the number of
unmapped reads. The outputs are the list of the most abundant species in the
sample and their estimated abundance, presented together in a graph and a
tab-separated text file. Users can adjust the minimum number of reads (as a
percentage) that categorizes a species as abundant (the default is 0.01%). In
addition, this module can also profile the microbial community of the given
sample at the phylum level or at other taxonomic ranks. As the differences in
genome sequences among lower taxa are smaller, assigning short reads to
organisms is more reliable at higher taxonomic levels. Hence, the confidence in
assigning each read to a species is lower than in assigning each read to a
phylum, especially when many reads map to more than one organism, in which case
the strain identification module is required to further confirm the species
abundance.
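The abundance cutoff can be illustrated with a short Python sketch. It is a
sketch under assumptions rather than the Chainmapper code: the function name
abundant_species is hypothetical, the read counts are made up, and the
percentage is assumed to be taken relative to the total number of reads in the
sample.

```python
from typing import Dict, List, Tuple


def abundant_species(read_counts: Dict[str, int],
                     total_reads: int,
                     min_percent: float = 0.01) -> List[Tuple[str, float]]:
    """Return (species, percent of reads) pairs at or above the cutoff.

    min_percent defaults to 0.01, the default threshold quoted in the text.
    Assumption: percentages are relative to all reads in the sample.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    selected = []
    for species, count in read_counts.items():
        percent = 100.0 * count / total_reads
        if percent >= min_percent:
            selected.append((species, percent))
    # Most abundant species first.
    selected.sort(key=lambda item: item[1], reverse=True)
    return selected


if __name__ == "__main__":
    toy_counts = {
        "Escherichia coli": 500_000,
        "Enterococcus faecalis": 20_000,
        "rare species": 50,
    }
    for species, percent in abundant_species(toy_counts, 1_000_000):
        print(f"{species}\t{percent:.3f}")
```

With one million reads, a species supported by 50 reads (0.005%) falls below
the default cutoff and is not listed, whereas raising or lowering min_percent
changes which species count as abundant.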
4.2.4 Strain Identification
Strain identification identifies the strains that have high genome coverage and
average depth in the given sample (Steps 4.2 and 5.2 in Figure 4.1). The genome
coverage represents the fraction of the reference genome of a particular strain
covered by the sequencing reads, whereas the average depth represents the number
of copies of that genome in the sequencing data, given its genome coverage. The
average depth can be associated with the strain’s relative abundance. Identified
organisms are grouped into plasmids, bacteria, fungi (if required), viruses (if
required), invertebrates (if required), and protozoa (if required). Strains are
reported if their genome coverage is at least 10% and they are covered at least
1X. For plasmids and viruses, the coverage threshold is raised to 97.5%, as
their reference sequences are relatively short. Protozoa and invertebrates
should only be counted when a very large fraction of the sequencing reads map to
them, giving a genome coverage close to 100%, as they have larger genomes.
Therefore, when eukaryotic reads are found in large numbers during species
identification yet their coverage is low in strain identification, these
assignments are most likely false positives. Strain identification is mainly
used to double-check the presence and abundance of the species, although this
module can also be run independently to find the dominant strains, at the cost
of a longer runtime. For ~5 million reads, strain identification (~120 minutes)
takes 3 times as long as species identification (~40 minutes) because of the
time needed to merge the alignment results from the paired reads, to sort the
merged file by reference genome, to index it using SAMtools [Li et al., 2009a],
and to calculate the genome coverage and depth using BEDTools [Quinlan and Hall,
2010].
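The two reported quantities can be made concrete with a small Python sketch
that works on a list of per-base depths. It does not reproduce the SAMtools and
BEDTools steps used by Chainmapper. The function names and group keys are
hypothetical, and the sketch assumes that the average depth is taken over the
covered positions only, which is one possible reading of "given its genome
coverage".

```python
from typing import Dict, Sequence, Tuple

# Minimum genome coverage per group, as quoted in the text: 10% in general
# and 97.5% for plasmids and viruses. The group keys are hypothetical.
MIN_COVERAGE: Dict[str, float] = {
    "bacteria": 0.10,
    "plasmid": 0.975,
    "virus": 0.975,
}
MIN_DEPTH = 1.0  # "covered at least 1X"


def coverage_and_depth(per_base_depth: Sequence[int]) -> Tuple[float, float]:
    """Return (genome coverage, average depth) for one reference sequence."""
    if not per_base_depth:
        raise ValueError("reference has no positions")
    covered = [depth for depth in per_base_depth if depth > 0]
    coverage = len(covered) / len(per_base_depth)
    # Assumption: depth is averaged over covered positions only.
    average_depth = sum(covered) / len(covered) if covered else 0.0
    return coverage, average_depth


def is_reported(per_base_depth: Sequence[int], group: str = "bacteria") -> bool:
    """Apply the coverage and depth thresholds for the given group."""
    coverage, average_depth = coverage_and_depth(per_base_depth)
    return coverage >= MIN_COVERAGE[group] and average_depth >= MIN_DEPTH


if __name__ == "__main__":
    toy_depths = [2, 2, 0, 0]  # half of the reference covered twice
    print(coverage_and_depth(toy_depths))
    print(is_reported(toy_depths), is_reported(toy_depths, "plasmid"))
```

The same toy reference passes the general 10% threshold but fails the 97.5%
threshold applied to short references, which shows why a partially covered
plasmid or virus would not be reported.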
4.2.5 Antibiotic Resistance and Virulence Finding
This module identifies the presence of genes causing antibiotic resistance and
virulence in the sequencing data, which is crucial for targeted treatment and
for confirming the presence of virulent strains. The database of antibiotic
resistance genes was adopted from ResFinder [Zankari et al., 2012], while the
database of virulence genes was taken from the virulence factor database (VFDB)
[Yang et al., 2008] [Chen et al., 2012]. Because their reference sequences are
short, resistance genes and virulence factors are reported only if the
sequencing data cover at least 97.5% of the gene at a depth of at least 1X.
Finding the resistance genes in sequencing data of 3-5 million reads takes about
10 minutes using one CPU core and 8 GB RAM, most of which is spent on sorting
the binary alignment/map (BAM) files, indexing them, and calculating the genome
coverage.
Chapter 5
Direct Sequencing for Diagnostics
Doctors diagnose patients with microbial infections based on visible symptoms
and medical records. If there are doubts, the doctors may proceed with clinical
laboratory tests to identify the precise pathogenic strain causing the disease.
The microbiological laboratory personnel isolate the microbe and then use either
conventional morphological species identification or the culture-based WGS
technique to investigate the pathogen causing the disease. However, the tests
can take days or even weeks, while in the meantime the patient’s condition could
worsen. Therefore, a rapid diagnosis on a medical sample is important to
prescribe a correct, targeted treatment to the patient and to monitor the
progression of the disease. More importantly, a rapid detection of a potential
outbreak is vital for early warning and control of the disease.
5.1 Direct sequencing and Chainmapper
A direct DNA sequencing approach for routine diagnostics is introduced in this
chapter. Whole genome shotgun metagenomic sequencing, also known as direct
sequencing, is a sequencing technique used to study the genomic material
recovered directly from samples. This technique does not require growing the
microbe in a suitable culture medium and it allows simultaneous sequencing of
all genomes in a sample. In addition, direct sequencing can be done within hours
because the DNA from the sample goes directly to the sequencing steps.
Meanwhile, culture-dependent WGS demands several days or even months for
growing the microbes and selecting the microbial colonies under study, provided
that the microbes can grow in the given medium. Rapid growers like E. coli might
take one day for incubation and media preparation, but slow growers like
Mycobacterium tuberculosis require three months for incubation.
In 2012, CGE introduced a novel approach of direct sequencing for routine
diagnostics. A paper titled “Rapid whole genome sequencing for the detection and
characterization of microorganisms directly from clinical samples” presented in
Section 5.2 evaluates the applicability of whole genome shotgun metagenomic
sequencing for routine diagnostics. This paper shows that Chainmapper, as
described in Chapter 4, fits into the concept of direct sequencing. Moreover, it was
shown that Chainmapper outperforms MG-RAST in terms of execution time and
Kmer in terms of accuracy.
This paper also presents an application of Chainmapper to surveying the
microbial composition of 24 samples from patients suspected to have Urinary
Tract Infection (UTI). UTI is microbial infiltration of the otherwise sterile
urinary tract that causes infection in the urethra, bladder, ureter, or kidney.
The most common cause is uropathogenic Escherichia coli (UPEC), which accounts
for 80% of the cases. The remaining 20% are caused by other pathogens such as
Staphylococcus saprophyticus, Klebsiella spp., Proteus mirabilis, and
Enterococcus faecalis. Comprehensive reviews of UTI are given in [Wang et al.,
2013] [Shepherd and Pottinger, 2013] and [Barber et al., 2013]. Most of the
samples were found to be infected by UPEC and E. faecalis. However, there are a
few exceptional cases: some samples contain other pathogens and other samples
contain no pathogens. This paper shows that sequencing the urine samples
directly yielded not only the list of the most dominant organisms in the
samples, but also a collection of hard-to-culture bacteria, such as
Lactobacillus iners, Gardnerella vaginalis, Prevotella timonensis, and
Aerococcus urinae, found in some urine samples. The success of this study
indicates that sequencing whole genomes directly from clinical samples brings
advantages to routine diagnostics and outbreak surveillance in clinical
settings.
5.2 Paper III
The following paper was published in Journal of Clinical Microbiology in
November 2013.
Chapter 6
Chainmapper in Practice
6.1 Introduction
The previous two chapters describe the methodology of Chainmapper and its
implementation on direct sequencing for routine diagnostics. This chapter extends
the implementation of Chainmapper to identify organisms found in various
metagenomic samples and to detect any potential virulence in the samples. In
addition to the identification of organisms, Chainmapper also investigates the
presence of antibiotic resistance genes, virulence factors, and human microbiome
in these samples.
6.1.1 Antibiotic Resistance Genes
Figure 6.1. The timeline of antibiotic deployment and the evolution of antibiotic
resistance. The text above the timeline gives the years the antibiotics were
discovered; the text below the timeline gives the years the resistance was found.
Since the discovery of antibiotics in 1928, the number of incurable bacterial
diseases has decreased [McIntyre, 2007]. Many antibiotics are naturally produced
by bacteria and fungi, either in response to environmental stress, to discourage
microbial competitors in the environment, or as signaling molecules in quorum
sensing [Martínez, 2012], and some antibiotics are developed synthetically once
their lead compounds have been revealed. However, the awareness of bacterial
resistance to both natural and synthetic antibiotics has shifted attitudes in
hospitals and agriculture [Cantas et al., 2013]. Antibiotics may attack the cell
wall biosynthesis, protein synthesis, DNA synthesis, and folic acid synthesis of
the pathogen, but bacteria evade the antibiotics by restricting their access,
secreting enzymes that inactivate them, modifying the antibiotic target, or
failing to activate the antibiotic [Wilson et al., 2011]. Antibiotic resistance
may arise naturally, as the species producing antibiotics have their own
mechanisms of resistance [Allen et al., 2010]. On the other hand,
overprescribing antibiotics, prescribing antibiotics at the wrong doses or
durations, prescribing antibiotics for viral infections, and the use of
antibiotics on crowded livestock also contribute greatly to resistance [Arason
et al., 1996] [Boken et al., 1995] [Wang et al., 1999] [Yap, 2013]. Furthermore,
resistance genes can be transferred into a pathogen, making antibiotics unable
to kill or inhibit it. The biggest contributor to resistance is genetic
plasticity, such as acquired mutations and horizontal gene transfer (HGT). For
bacteria, HGT is an easier, quicker, and safer route to resistance than acquired
mutation, as resistance genes can be picked up by plasmids and shared in any new
environment [Wilson et al., 2011]. The options for effective antibiotics are
becoming limited, while the discovery of novel antibiotics has almost stopped,
as depicted in Figure 6.1. However, because antibiotic resistance is conferred
by specific resistance genes, sequencing the whole bacterial genome enables the
detection of antimicrobial resistance.
6.1.2 Virulence Factors
Not only do our bodies develop strategies to attack pathogenic bacteria, but
bacteria also have strategies to attack our bodies effectively, contributing to
microbial pathogenicity. Virulence factors are gene products that enable
pathogens to settle on or within a host and increase their potential to cause
disease [Yang et al., 2008] [Chen et al., 2012]. Some examples of virulence
factors are bacterial toxins, factors for adherence to the host, antiphagocytic
capsules, iron uptake systems, and bacterial proteases. If healthcare personnel
are in doubt whether a sample contains a pathogenic or a non-pathogenic strain,
then the presence of a virulence gene is enough to increase the likelihood that
the sample contains the pathogenic strain [Wilson et al., 2011].
6.1.3 Human microbiome
The human microbiome is the complex ensemble of microbes and microbial genes
associated with humans. Aligning the metagenomic sequencing data against the
novel gene catalogues of the human microbiome is necessary when the samples are
closely associated with any part of the human body. For instance, samples from a
sewer, which might contain human feces, need to be mapped to the curated
references of human microbiome genes.
Figure 6.2. The composition and health effects of predominant human fecal
bacteria. The figure shows approximate numbers of the different genera [Gibson
and Roberfroid, 1995].
In this thesis, the second MetaHIT gene catalogue [Qin et al., 2010] is used.
Some MetaHIT genes match genes submitted to NCBI, while other MetaHIT genes are
novel and unnamed. When a large number of sequencing reads do not match any NCBI
reference genome, finding that these reads map to novel MetaHIT genes is
informative enough to suggest that they originate from novel fecal genes.
The identification of microbes in samples taken from body sites can also be
associated with the health profile. Some microbes in the human body are harmful,
but others synthesize nutrients for the host from undigested food and displace
pathogens from attachment sites in the body to pay their “accommodation rent”,
as depicted in Figure 6.2 [Wikoff et al., 2009] [Benson et al., 2010]. Several
studies have suggested the consumption of probiotic products, as they contain a
number of gut bacteria beneficial for human health [Oelschlaeger, 2010] [Picard
et al., 2005] [Soccol et al., 2010]. Some association studies have also shown
that shifts in the microbial population of the human body are associated with
certain diseases. For example, preliminary results showed that colorectal cancer
is associated with the anaerobic Fusobacterium nucleatum [Castellarin et al.,
2012] and that patients with Crohn’s disease show an abundance of Enterococcus
faecium as well as several Proteobacteria [Mondot et al., 2011]. Human gut
microbial composition has been associated with diet, obesity, and inflammatory
bowel disease (IBD) [Greenblum et al., 2012]. An intestine rich in Bacteroidetes
and Parabacteroidetes is associated with animal protein and saturated fats in
the diet, while an intestine dominated by Prevotella and Desulfovibrio is
associated with a diet of carbohydrates, simple sugars, and vegetables [Wu et
al., 2011]. The ratio of Bacteroidetes to Firmicutes is associated with obesity
[Turnbaugh et al., 2006] [Ley et al., 2005] [Ley et al., 2006] and a large
population of Enterobacteriaceae is associated with IBD [Garrett et al., 2010].
Humans have been categorized according to the microbial composition of their
intestines [Arumugam et al., 2011] and vaginas [Ravel et al., 2011], but the
inherent differences between and within different groups are still under
examination and debate.
6.2 Datasets
The datasets used are sequencing data from urine specimens, sewer samples,
ancient toilet samples, airplane toilet samples, and vulture samples.
6.2.1 Urine specimens
The first set of samples was collected in April and September 2012 from patients
with suspected UTI at Hvidovre hospital, Denmark. A total of 24 urine samples,
each 10 mL, were prepared for whole genome metagenomic sequencing. The
samples were sequenced using Ion Torrent PGM (Life Technologies) producing
variable-length single-end reads with the number of reads ranging from 1.3 million
to 3.8 million. The sequencing preparation was described in Section 5.2.
Chainmapper was used to explore the microbial content of these urine
specimens. Knowing whether the urine samples contain no pathogens, pathogens
related to other diseases, or pathogens that are related to the disease but
were named or described only a few months earlier is very informative and helps
the doctor to diagnose the patients.
6.2.2 Sewer sample
A sample was taken from the sewer around Herlev hospital in Northern Denmark
and the DNA extracts were directly sequenced using Ion Torrent PGM, producing
2.8 million variable-length single-end reads. Identifying the microbial life in
the sewer around a hospital may reveal the pathogens behind threatening
nosocomial infections in the hospital.
6.2.3 Ancient toilet samples
A pair of 300-year-old latrines unearthed from beneath Kultorvet Square,
Copenhagen, Denmark, were sampled and sequenced. The DNA was extracted from
feces and the soil around the feces. This study was done in collaboration with
the Museum of Copenhagen and the Center for GeoGenetics, University of
Copenhagen. The low oxygen content of the soil at the excavation site meant that
the remains were very well preserved, and the smell of rotten eggs indicated
that the bacteria had not yet consumed all of the contents. The samples were
sequenced using Illumina with multiple lanes, where the adaptors have slightly
different sequences, so that reads from two different lanes can be distinguished
even though the mixed barcoded libraries are sequenced together in one run. The
length of the sequencing reads is 100 bp and the numbers of reads are given in
Table 6.1.
The study of the complex microbial assemblage in these old fecal samples aims to
characterize the health profile of the lower social class people living in the
18th century near Kultorvet Square, Copenhagen, Denmark, and to find any
potential outbreaks that occurred at that time.
6.2.4 Airplane toilet samples
Toilets of airplanes departing from five different cities to Copenhagen were
sampled. Those five departure cities are Aalborg in Denmark, Bangkok in
Thailand, Washington D.C. and Newark in the United States, and Toronto in
Canada. The samples were directly sequenced at the National Food Institute,
Technical University of Denmark using Illumina MiSeq with the read length of
150 bp and the number of reads as mentioned in Table 6.2.
The assessment of the microbial content in the human waste collected from these
samples may reveal the microbial diversity of body waste flushed by passengers
into the toilet. Most of the passengers are presumably either the residents of the
departure cities or inhabitants of Copenhagen spending several days in those cities.
6.2.5 Vulture samples
Vultures are scavenging birds that are notorious for eating the carcasses of
dead animals, which have typically died of infectious diseases [Bangert et al.,
1988]. Their diet raises microbiological and pathological questions. First, do
the pathogens survive in their digestive system, come out alive in their feces,
and remain capable of causing an outbreak? It is possible that vultures develop
resistance against bacteria [Carvalho et al., 2003], so this study might point
to alternative medication for the diseases. Second, as their typical food is
armadillo, which naturally carries and disseminates pathogens such as the
leprosy bacterium, why do they not develop sores on their faces? To answer this
question, it should be established whether leprosy bacteria are indeed present
on their faces or are eradicated before they disseminate to the entire body of
the vultures.
The Molecular Microbial Ecology Group at the University of Copenhagen
sequenced samples from the intestines of two dead North American vultures
(Cathartes aura) to answer the first question and from the faces of both
vultures to answer the second question using Illumina, resulting in 100 bp reads
with the number of reads given in Table 6.3.
6.3 Results
The datasets were evaluated using Chainmapper. The community profile of each
dataset is presented and discussed below.
6.3.1 Profiling organisms in the urine specimens
The species and strain identifications of the urine specimens are shown in
Figure A.1 to Figure A.24 and summarized in Figure 6.3.
Figure 6.3. The community profiles of the 24 urine samples.
Taken together, the results from the species identification of the urine
specimens agree with those from the strain identification. Many reads match
strains from draft genomes, indicating that the inclusion of the database of
prokaryotic draft genomes in Chainmapper is very helpful for metagenomic
profiling. The patients were all initially suspected to have UTI, yet
Chainmapper found that not all of the urine samples contained pathogens. The
urine samples can be grouped into four categories: 1) the ones containing no
pathogen, 2) the ones dominated by enterococci, 3) the ones dominated by E.
coli, and 4) the ones dominated by other pathogens.
6.3.1.1 Urine samples containing no pathogen
As seen in Table 6.4, urine samples #1, #4, #7, #8, and #16 were not dominated
by any pathogen. The dominant organisms in #4 and #8 were the fastidious
bacteria Lactobacillus iners and Lactobacillus sp. 7_1_47FAA. L. iners can be
present in women with a healthy vagina, women with bacterial vaginosis, or women
who have just been subjected to antibiotic therapy [Macklaim et al., 2011],
because L. iners has a persistence mechanism that is independent of the presence
of pathogens [McMillan et al., 2013]. This suggests that these patients might
not have UTI.
Based on the species identification, there are no dominant species in urine
sample #16. The strain identification showed that the most abundant organism in
the sample (depth = 30X) is candidate division TM7 single-cell isolate TM7b,
although it could just be a false positive due to its low genome coverage. This
strain has been found in various environmental samples, ranging from forest soil
and activated sludge to the human mouth. Its role was still under investigation
and, at the time of writing, its phylum name was still under proposal
[Hugenholtz et al., 2001]. However, a large number of reads in this sample
(41.9%) do not match either the NCBI microbial database or the human genome. To
further identify these unknown reads, Chainmapper was resumed to align them
against the NCBI nucleotide database, which increased the percentage of total
reads mapped to the human genome from 46.31% to 83.6%. This happens because a
specific human genome build (build-37.1) is used as the reference in the
contaminant removal stage, whereas the nucleotide reference database contains
human genome sequences from various sources whose base sequences differ in
places from build-37.1. In addition, after mapping to the nucleotide database,
the proportion of unknown reads dropped markedly to 8.7%, because many unknown
reads were assigned to the human genomic sequences from the nucleotide database
and 6.75% of the total reads matched Pan troglodytes, Mus musculus, Macaca
mulatta, Pongo abelii, Danio rerio, Macaca fascicularis, Sus scrofa, and Gorilla
gorilla, suggesting that these reads match the subset of human genome sequences
shared with those animals. Many reads can match sequences shared between the
human genome and other organisms because of unclosed gaps in the current human
genome, single nucleotide polymorphisms between human and those organisms, or
sequencing errors in the assembly of the human genome. Furthermore, the species
identification shows that a minor but significant number of reads match
Plasmodium vivax, Epichloe festucae, and many other unrelated eukaryotes. Given
their low genome coverage and depth, these assignments are most likely false
positives, as eukaryotic genomes are longer than prokaryotic genomes. These
findings suggest that this patient does not have UTI.
Gardnerella vaginalis, which typically indicates bacterial vaginosis, dominates
urine sample #7 (6.6%) according to the species identification. However, strain
identification showed that the dominant organism was G. vaginalis 409-05, which
is commensal [Santiago et al., 2011]. This shows the importance of strain
identification in determining whether the dominant strain is pathogenic or not.
6.3.1.2 Urine samples dominated by enterococci
Urine samples #3, #26, #31, #33, and #34 were dominated by E. faecalis PC1.1
and E. faecalis 62 (see Table 6.5). E. faecalis PC1.1 is a candidate probiotic
isolated from human feces and was not known to confer any virulence at the time
of writing [O Cuív et al., 2013]. E. faecalis 62 lacks elements involved in
virulence [Brede et al., 2011]. It is possible that the virulent strain of this
species, the vancomycin resistant E. faecalis V583, coexists in the samples.
Although its coverage and depth are lower than those of the two other strains,
information about the presence of this virulent strain could be more crucial
than information about the presence of the dominant E. faecalis 62 or E.
faecalis PC1.1. Finding virulence factors is thus important to determine whether
E. faecalis V583 was really present in these urine samples.
E. faecium dominates urine sample #32, and further strain identification narrows
down the list of possible strains to the vancomycin resistant E. faecium Aus0004
[Lam et al., 2012] and E. faecium DO. Virulence finding is thus required to
confirm the presence of the disease-causing strain.
6.3.1.3 Urine samples dominated by E. coli
The dominant organism in urine samples #6, #10, #12, #20, #21, #25, #27, and
#29 was E. coli, as seen in Table 6.6. In particular, the highly uropathogenic
E. coli 536 was found in urine samples #6, #20, and #21. Urine sample #29 was
also dominated by this strain, yet the number of supporting reads and the
average genome depth were almost negligible, possibly because the patient was
recovering. Uropathogenic E. coli strains such as this one are responsible for
70-90% of the estimated 150 million UTIs diagnosed annually [Brzuszkiewicz et
al., 2006]. The strain identification also found a large number of E.
coli/Shigella plasmids with nearly 100% coverage in urine sample #21, from E.
coli O7:K1 strain CE10 to E. coli SE11. Shigella sonnei Ss046 plasmids were also
found with slightly lower coverage in urine sample #21. This supports the
conjecture of E. coli domination in these samples.
E. coli ATCC-8739 was also found in abundance in urine samples #25 and #10.
This pathogenic strain is typically found in fecal samples. E. coli ATCC-8739 in
urine sample #10 might co-occur with the pathogenic Citrobacter, which usually
invades the gastrointestinal tract. Urine sample #12 mostly contained E. coli
UM146, which had been described as uropathogenic only two years earlier [Reeves
et al., 2011]. This sample was also suspected to contain E. coli 536, supporting
the suspicion of UTI, while Bifidobacterium bifidum NCIMB 41171, which may
originate from a probiotic taken by the patient, was also found. Urine sample
#27 contained another exceptional E. coli strain, M605 (45.37%, 1X), which was
still a draft genome in the NCBI database at the time of writing, so no further
information about the pathogenicity of the strain is available. However, there
is a small chance that this sample also contained E. coli 536.
6.3.1.4 Urine samples dominated by other pathogens
Table 6.7 shows the dominant organisms in the urine samples dominated by other
pathogens. Prevotella timonensis, which is associated with bacterial vaginosis
[Srinivasan et al., 2012], predominates in urine sample #13. Proteus mirabilis
HI4320 in urine sample #19 typically causes UTI in immunocompromised patients
or patients with catheters [Nielubowicz and Mobley, 2010], as opposed to E.
coli, which typically causes UTI in healthy individuals. Meanwhile, urine sample
#24 was dominated by Proteus mirabilis strain HI4320 as well as the uncultured
Aerococcus urinae strain ACS-120-V-Col10a. In conventional biochemical
identification, A. urinae can easily be misidentified as staphylococci because
these bacteria are also cocci and turn violet on Gram staining [Cattoir et al.,
2010] [de Jong et al., 2010]. Avoiding such misidentification is one of the
potential advantages of exploring the microbial composition through a shotgun
metagenomic approach. A. urinae infection is emerging, although not dominating,
and this bacterium is resistant to several antibiotics such as sulfamethoxazole
and gentamicin [Rasmussen, 2013]. Urine sample #28 was dominated by
Staphylococcus lugdunensis, another rare cause of UTI. This species is often
mistaken for Staphylococcus aureus in cultivation, showing that direct
sequencing promises better typing accuracy [Frank et al., 2008] [Haile et al.,
2002]. Urine sample #35 contains another rare UTI pathogen, Stenotrophomonas
maltophilia K279a, together with E. faecalis. S. maltophilia is known to be
multidrug resistant [Nicodemo and Paez, 2007], so it is important to find the
antibiotics to which it is susceptible for this patient.
6.3.2 Profiling organisms in the sewer sample
Figure A.25 represents the species, strain, and plasmid identification for the sewer
sample. Again, the species identification results agree with the strain identification
ones. However, 42.4% of the reads did not match any of the genome databases,
even after they had been aligned to the nucleotide database. This suggests that
there is a large amount of novel DNA in the sewer sample.
The microbial life in the sewer sample consisted mostly of typical fecal bacteria
and bacteria causing nosocomial infections. Setting aside the unknown reads, the
largest proportion of the reads (6.19%) mapped to novel MetaHIT genes. The
next most abundant organisms were the pathogenic soil bacteria Acinetobacter
lwoffii, A. johnsonii, and A. baumannii. A. lwoffii is part of the normal flora
of the skin, oropharynx, and perineum of healthy people [Rathinavelu et al.,
2003], but when it enters the human body, e.g. via catheters, it can cause
nosocomial bacteremia. Additionally, A. lwoffii in this sewer sample matched a
multidrug-resistant strain. A. johnsonii is naturally found in water, soil, human
skin, and feces, and it is common to see A. johnsonii in hospital sewage [Zong
and Zhang, 2013]. Meanwhile, A. baumannii is one of the most troublesome
pathogens in healthcare institutions, especially in intensive care units, as it
is resistant to all older antibiotics and so far has no known natural habitat
outside the hospital [Peleg et al., 2008]. A. baumannii plasmids were found at
high coverage and depth, which explains why this species had a higher percentage
in the species identification but low coverage and depth in the strain
identification.
The next most abundant organism in the sewer was Arcobacter butzleri, which
causes watery diarrhea and bacteremia [Bücker et al., 2009]. The clear presence of
Bacteroides vulgatus, a typical intestinal bacterium in healthy individuals,
supported the speculation that the sewer contained a large amount of feces [Cuív
et al., 2011]. Aeromonas caviae, which was also plentiful in the sample, is
associated with both intestinal and extra-intestinal infections and usually causes
diarrhea in children [Wilcox et al., 1992] without blood or mucus [Beatson et al.,
2011], as well as bacteremia [Kimura et al., 2013].
6.3.3 Profiling organisms in the ancient toilet
Figure A.26 to Figure A.30 show the species and strain identification for the
200-year-old fecal samples. In principle, the species identification results for
the feces agree with the strain identification ones. However, for the control
soil samples the results from strain identification differ from those of the
species identification, owing to the high diversity of the microbial community
and the low genome coverages.
6.3.3.1 Ancient feces
As confirmed by the strain identification, the dominant population of the feces was
Collinsella aerofaciens, followed by Bifidobacterium spp. A decrease in C.
aerofaciens has been associated with weight-loss diets, i.e. reduced-carbohydrate
diets [Walker et al., 2011], irritable bowel syndrome [Salonen et al., 2010], and
a reduced risk of colon cancer [Moore and Moore, 1995]. C. aerofaciens is a
commensal anaerobic gut bacterium that produces major amounts of lactic acid. C.
aerofaciens ferments oligosaccharides and simple sugars [Rey et al., 2013] and
produces H2 gas [Moore and Moore, 1995]. Next, three Bifidobacterium species
were consistently abundant both on the surface of and inside the feces: B.
angulatum, B. adolescentis, and B. catenulatum. Bifidobacteria benefit humans by
lowering the blood cholesterol level, acting as immunomodulators, producing
vitamins, reducing the blood ammonia level, and producing acetate and lactate
that inhibit the growth of pathogens [Gibson and Roberfroid, 1995]. Based on the
traces found in the latrine, lower-class people living in Copenhagen in the 1700s
ate seasonal foods, such as raspberries, blackberries, and apples, as well as
cherries, figs, flaxseeds, rye, and a whole range of plants. From this
archaeological information and the microbial profile of the feces, researchers
may proceed with a study associating the foods these people ate with the
resulting bacterial composition in the gut.
6.3.3.2 Control soils
The most abundant microbes in both control soil samples were Mycobacterium
phlei and Mycobacterium tusciae JS617. M. phlei is a fast-growing, saprophytic
nontuberculous mycobacterium typically found in soil and dust and on plants
[Abdallah et al., 2012]. Meanwhile, M. tusciae JS617 is a slow-growing
scotochromogenic mycobacterium [Tortoli et al., 1999], which was first isolated
from creosote-contaminated soil in Germany and whose genome was still a draft at
the time of writing. Both species can cause disease in immunosuppressed people.
The next most abundant strain was Micromonospora lupini strain Lupac_08, which
was first isolated from root nodules of the wild legume Lupinus angustifolius
[Alonso-Vega et al., 2012]. M. lupini plays an important role in soil ecology,
biodegradation, biocontrol, and plant growth promotion [Hirsch and Valdés, 2010].
All three species are soil-related.
6.3.4 Profiling organisms in the airplane toilet
The species and strain identification for the airplane toilets are shown in Figure
A.31 to Figure A.35. The results from the species identification again agree with
the ones from the strain identification. The proportions of novel MetaHIT genes
were 25.75% in the Bangkok airplane toilet, 33.34% in the Aalborg airplane toilet,
and 28.73%-29.98% in the three North American airplane toilets. The fraction of
unknown reads was generally low: 5.07% in Aalborg, 7.64% in Bangkok, 2.3%-2.4%
in Newark and Washington D.C., and 1.5% in the Toronto airplane toilet.
In general, all samples were dominated by Eubacterium rectale and Bacteroides
vulgatus. E. rectale is an anaerobic fecal bacterium responsible for the
production of butyrate, which protects the colon from many diseases [Duncan and
Flint, 2008]. This organism is typically abundant in the colon of people
without ulcerative colitis [Duncan and Flint, 2008] [Macfarlane et al., 2004],
suggesting that the chance of passengers having ulcerative colitis was low. B.
vulgatus is a commensal bacterium [Tilg and Gasbarrini, 2013] that could promote
or protect against colitis [Cuív et al., 2011]. Besides these two bacteria,
Faecalibacterium prausnitzii, which may have potential to treat ulcerative colitis
and Crohn’s disease [Siaw and Hart, 2013], was plentiful in the US and
Danish samples but less so in the Thai and Toronto samples. An interesting
finding in the Aalborg toilet sample, not seen in the other samples, was the
presence of Ruminococcus gnavus. Species identification suggested that R. gnavus
was only a minor constituent, but strain identification suggested that it was the
second most covered genome (81%) and the second most abundant strain (5X) after
E. rectale. The slight difference between species and strain identification was
perhaps because many R. gnavus regions are shared with other organisms or
because the amount of feces was too low. In addition, there were two recent cases
of R. gnavus bacteremia, one in Odense and one in Vejle [Hansen et al., 2013].
Further research on R. gnavus could help to reveal its pathogenicity.
Figure 6.4. The B/F Ratio of the airplane toilet dataset. Green and purple are the
percentages of Bacteroidetes and Firmicutes, respectively.
According to the composition of the phyla in each sample depicted in Figure 6.4,
the Bacteroidetes/Firmicutes (B/F) ratio in the Bangkok sample was the highest
of all, followed by the North American samples. In contrast, the Aalborg sample
had the lowest ratio, with Firmicutes outnumbering Bacteroidetes. One
speculation, though purely conjectural, is that the Bangkok passengers had the
lowest risk of obesity [Turnbaugh et al., 2006] [Ley et al., 2005] [Ley et al.,
2006], which of course requires further confirmation.
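The B/F ratio discussed above is simply the Bacteroidetes percentage divided by the Firmicutes percentage of a sample. The following is a minimal sketch of that arithmetic; the sample names and percentages are made-up placeholders, not the values plotted in Figure 6.4, and the function name is hypothetical.

```python
def bf_ratio(bacteroidetes_pct, firmicutes_pct):
    """Return the Bacteroidetes/Firmicutes ratio from two phylum percentages."""
    if firmicutes_pct <= 0:
        raise ValueError("Firmicutes percentage must be positive")
    return bacteroidetes_pct / firmicutes_pct


# Placeholder values for illustration only (NOT the thesis data):
# sample name -> (Bacteroidetes %, Firmicutes %)
samples = {
    "sample_A": (40.0, 20.0),
    "sample_B": (15.0, 30.0),
}

ratios = {name: bf_ratio(b, f) for name, (b, f) in samples.items()}

# Rank samples from the highest to the lowest B/F ratio.
ranked = sorted(ratios, key=ratios.get, reverse=True)

for name in ranked:
    print(f"{name}: B/F = {ratios[name]:.2f}")
```

A ratio below 1, as in the second placeholder sample, corresponds to the situation described for Aalborg, where Firmicutes outnumber Bacteroidetes.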
6.3.5 Profiling organisms in the vulture samples
Vultures could be reservoirs and vectors of many diseases, or scavengers that
might harbor curative substances. The species, strain, and plasmid identification
as well as the summary of organisms at the kingdom level are shown in Figure A.36
to Figure A.40. In general, the results from species identification, again, agree
with the ones from strain identification. With the help of the vulture reference
sequences from the nucleotide database, the majority of the reads were mapped to
Cathartes aura. Many reads were mapped to other vulture genomes and to chicken
genomes, which have high chromosome homology with vulture genomes [Nanda et al.,
2006], suggesting the importance of sequencing the complete genome of C. aura
before identifying the vulture samples. The percentages of reads assigned to
Aves, which should be considered contaminants, were 45.4%, 84.53%, 84.67%, and
67.17% for samples GRG4217_FS, GRG4217_LI, GRG4227_FS, and GRG4227_LI,
respectively, showing that many sequencing reads were putatively mapped to
vultures. The fractions of unmapped reads were 31.76%, 11.24%, 8.6%, and 8.96%
for samples GRG4217_FS, GRG4217_LI, GRG4227_FS, and GRG4227_LI, respectively.
6.3.5.1 Face swab
The most abundant bacteria found during species and strain identification of the
face swab were Psychrobacter cryohalolentis (GRG4217_FS = 0.59%,
GRG4227_FS = 0.51%) and Psychrobacter arcticus (GRG4217_FS = 0.1%,
GRG4227_FS = 0.11%). Both Psychrobacter species are aerobic bacteria that
can grow at -10 to 30 °C (-10 to 28 °C for P. arcticus), with an optimal growth
temperature of 22 °C [Bakermans et al., 2006]. P. arcticus was first isolated from
permafrost sediment cores in Siberia [Ayala-del Río et al., 2010]. It was also
recently described that P. arcticus strain 273-4, which matched these
samples, can develop biofilm under laboratory conditions and uses a large adhesin
to attach to surfaces [Hinsa-Leasure et al., 2013], which may explain why these
permafrost bacteria were found on the turkey vulture’s face. P. cryohalolentis
K5 was first isolated from a cryopeg within permafrost in Siberia [Bakermans et
al., 2006]. The plasmid of P. cryohalolentis K5 was also found in abundance with
relatively high coverage, supporting the speculation that this organism dominates
the vultures’ faces. The abundant Pseudomonas fluorescens in the species
identification of sample GRG4227_FS may be a false positive, as the abundance of
this organism was not confirmed in the strain identification. The plasmid of
Acinetobacter baumannii AYE was additionally found in abundance in sample
GRG4227_FS.
6.3.5.2 Gut intestine
In sample GRG4217_LI, the only bacterium found in abundance was
Herbaspirillum seropedicae. The most abundant microbe was Escherichia phage
rv5, with 92.74% coverage and 38X depth, which did not stand out in the
species identification because of its short sequence and its homology with other
viruses.
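Figures such as "92.74% coverage and 38X depth" combine two different quantities: the fraction of reference positions hit by at least one read, and the average number of reads per position. The sketch below shows one common way to compute both from a per-position depth list. It is an illustration under stated assumptions, not Chainmapper's implementation: this section does not say whether Chainmapper averages depth over all positions or only over covered ones (all positions are used here), and the function name and toy numbers are hypothetical.

```python
def coverage_and_depth(per_base_depth):
    """Return (coverage in percent, mean depth) for one reference sequence.

    per_base_depth: list with the number of reads overlapping each position.
    Assumption: depth is averaged over ALL positions of the reference.
    """
    if not per_base_depth:
        raise ValueError("reference has no positions")
    covered = sum(1 for d in per_base_depth if d > 0)
    coverage_pct = 100.0 * covered / len(per_base_depth)
    mean_depth = sum(per_base_depth) / len(per_base_depth)
    return coverage_pct, mean_depth


# Toy 10-position reference (made-up numbers, not thesis data).
toy_depth = [0, 3, 5, 4, 0, 2, 6, 4, 0, 0]
coverage_pct, mean_depth = coverage_and_depth(toy_depth)
print(f"coverage = {coverage_pct:.1f}%, depth = {mean_depth:.1f}X")
```

A short genome such as a phage can reach high values on both measures while contributing only a small share of the total reads, which is consistent with the observation that phage rv5 did not stand out in the species identification.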
In sample GRG4227_LI, Clostridium perfringens, along with its plasmid, was the
dominant microbe in the gut. Lactobacillus sakei, which is normally found as a
psychrotrophic lactic acid bacterium in fresh meat and is used for
biopreservation and food safety in fermented meat, was also found in abundance
with high coverage. So was Hafnia alvei, which is commensal but sometimes
causes disease in immunocompromised people. E. coli, by contrast, was found only
at moderate coverage and abundance. Another copious bacterium was the pathogenic
Enterococcus hirae, which can cause septicemia in humans.
6.3.6 Virulence factors
Finding the virulence factors not only confirms the strain identification and the
pathogenicity of a sample, but also confirms whether the strains in the samples are
the virulent ones, especially when many strains of the same species are found with
high confidence. Chainmapper virulence finding might help to alert doctors
to the presence of certain toxins or other virulence factors produced by the
bacteria. However, if no virulence gene is detected in a sample, this does not
necessarily mean that the sample is free of toxins. The list of known virulence
factors is still incomplete, as not all virulence factors have been
comprehensively studied or even annotated.
6.3.6.1 Virulence factors in urine samples containing no pathogens
No virulence factors were detected in urine samples #1, #4, #7, #8, and #16. This
supports the earlier finding that these samples did not contain any pathogens.
6.3.6.2 Virulence factors in urine samples dominated by enterococci
Table A.1 shows the virulence factors in urine samples dominated by enterococci.
Most of the known virulence factors of E. faecalis were found in abundance in
urine #3, #31, and #34. Hyaluronidase genes (EF3023 and EF0818), which encode
enzymes that aid the dissemination of toxins and bacteria from cell to cell, were
the most plentiful spreading factors found, followed by the sprE serine protease
gene with a similar virulence function. The ace genes, which encode adherence
proteins, were also found. The finding of efaA, an adhesin in endocarditis and a
manganese transporter, could alert the doctor to the risk of endocarditis and the
possibility of manganese deficiency in the patient. All known genes for biofilm
formation (bopD and fsrABC) were found, alerting the doctor to consider
anti-biofilm therapy, as biofilms are resistant to antibiotics.
The patient with urine #26 had a condition similar to that of the patient with
urine #3, except that genes expressing bacterial capsules (cps) were found.
Special attention was needed for the patient with urine #33, as not only the
gelatinase gene (gelE), whose product degrades hemoglobin, collagen, and fibrin,
but also cytolysin genes (cyl), whose products lyse erythrocytes, neutrophils,
leukocytes, and macrophages, were found. This suggests that doctors should keep
an eye on the patient’s blood tests, in case a blood transfusion or hyperbaric
oxygen is needed.
The absence of E. faecium virulence factors in urine #32 does not mean that the
strain carried no virulence factors; rather, research on E. faecium virulence was
still at a very early stage.
In addition, the source organisms of the virulence factors in the urines with E.
faecalis made it clear that E. faecalis V583 did exist in urine #3, #26, #31, #33,
and #34, resolving the uncertainty raised in subchapter 6.3.1.2. Virulence finding
thus proved essential in confirming the occurrence or co-occurrence of a
virulent strain with low coverage when an avirulent strain of the same species
was dominant.
6.3.6.3 Virulence factors in urine samples dominated by E. coli
Table A.2 shows the virulence factors in urine samples dominated by E. coli. All
virulence factors in urine samples #6, #20, and #21 belong to UPEC. The
virulence genes in urine sample #6 were dominated by pap genes, which
encode P fimbriae, with PapG as the adhesin protein. P fimbriae are often
associated with pyelonephritis, meaning that the UTI might ascend to the
kidneys. This alerts the doctor to check whether pyelonephritis has actually
occurred, as nitrofurantoin is not suitable for patients with pyelonephritis,
and the patient could develop kidney mucosal inflammation, sepsis, bacteremia,
or even meningitis. Once pyelonephritis is confirmed, the bacteria must be
eliminated rapidly, at the maximum recommended doses, before the condition
worsens. The second most abundant virulence factors in this patient were iron
acquisition genes: shuV, ireA, iucABD, iutA, chu, sit, irp1, irp2, ybtAEPQSTX,
and fyuA. This informs the doctor to prescribe an iron supplement if necessary.
Also, all fim and ecp genes related to the formation of type 1 fimbriae and the
common pilus, respectively, were found. The tia invasion determinant is another
adhesin that might have been transferred horizontally from ETEC, adding to the
adherence capacity and enabling invasion of the upper urinary tract epithelium of
this patient. The sat genes, which trigger kidney epithelial cell autophagy, were
also found in abundance. Meanwhile, urine #21 contained many iron acquisition
virulence genes and genes related to the production of type 1 fimbriae. The gsp,
ecp, ipaH, and pap genes were also found in urine #21.
Urine sample #20 had many UPEC virulence factors, dominated by sfa and cnf1
genes. The sfa genes encode S fimbriae, which mediate adherence to bladder,
kidney, erythrocytes, and endothelial cells. This adherence can cause
pyelonephritis, sepsis, and meningitis. Another finding is that this urine sample
contained many cytotoxic necrotizing factor-1 (cnf1) genes, which trigger
necrosis of the epithelial cells and decrease bacterial phagocytosis. The agn43
genes, which encode autotransporters that use the Type V secretion pathway, were
also clearly found in the sample, suggesting bacterial autoaggregation, i.e.
reciprocal bacterial adherence. This type of adherence has been associated with
biofilm formation and long-term bacterial colonization of the bladder. The
hemolysin gene hlyB, whose product lyses red blood cells, was also found,
alerting doctors to keep an eye on the blood tests. Besides those, urine #20 also
had many iron and hemin uptake genes, as well as type 1 fimbriae, tia, gsp, and
ecp genes.
Urine sample #12 contained many ibe and sfa genes, although they did not dominate.
The sfa genes contribute to the production of S fimbriae, which enable binding to
brain microvascular endothelial cells. Nonetheless, without ibe genes, invasion
does not happen. The ibe genes mediate invasion of the brain microvascular
endothelial cells, enabling traversal of the blood-brain barrier. The patient
was thus at risk of meningitis. Other than those, iron and heme uptake genes as
well as genes encoding type 1 fimbriae were found in abundance.
In urine samples #10 and #25, enteroinvasive virulence factors dominated,
confirming the non-UPEC pathogens found in the strain identification.
The Shigella virulence factors in urine sample #10 might have been contributed by
the commensal/environmental E. coli ATCC-8739, by Enterobacteriaceae mobile
elements, or by the abundant Citrobacter, whose virulence has not yet been well
defined, but not by Shigella according to the species and strain identification.
There were plenty of genes with unknown functions that are putatively related to
virulence: 1) ipaH, one of Shigella’s invasion plasmid antigens, and 2) gsp genes,
which are putatively related to Shigella’s Type II secretion system. The ecp
genes were exceptionally dominant in urine sample #25, indicating the common
pilus, listed as a motility virulence factor. Also, iron uptake virulence factors
were plentiful in both urine samples: the enterobactin genes fep and ent, the ABC
transporter sit, and the yersiniabactin genes fyuA/psn and ybt.
Urine samples #27 and #29 contained few virulence factors, although not a
negligible number. Urine sample #27 contained tia, ipaH, ecpE, gsp, fim, ibe, and
chu, while urine sample #29 contained chu and shu (heme uptake), pap, sat, fim,
and hly (hemolysin).
6.3.6.4 Virulence factors in urine samples dominated by other pathogens
Table A.3 shows the virulence factors in the urine samples dominated by other
pathogens. The virulence database used for this study did not contain virulence
factors of Prevotella timonensis and Proteus mirabilis; thus none were found in
urine samples #13 and #19. Meanwhile, urine sample #28 was dominated by
Staphylococcus lugdunensis. Staphylococcus was included in the virulence
database, but S. lugdunensis was not. The high numbers of capE virulence genes,
whose expression can protect staphylococci with a capsule, were perhaps found
because of genes shared between staphylococcal species.
Urine sample #24 was dominated by P. mirabilis and Aerococcus urinae, but its
virulence factors were dominated by UPEC’s iron uptake genes, genes encoding S
fimbriae, genes encoding F1C fimbriae, genes encoding type 1 fimbriae, cnf1, and
tia. This is because the virulence factors of P. mirabilis and A. urinae were not
in the virulence database used in this study, and UPEC was the next most abundant
organism. The same happened in urine sample #35: Stenotrophomonas maltophilia was
not in the virulence database, so the virulence factors of the next most abundant
organisms, E. faecalis and E. coli, were the ones found in abundance.
6.3.6.5 Virulence factors in the sewer sample
No virulence factors were detected in the sewer sample. This does not necessarily
mean that there were no virulence factors in the sample. As seen in the community
profile, the sewer contained organisms whose genomes and genes have only recently
been explored in depth.
6.3.6.6 Virulence factors in the ancient toilet
No virulence factors were detected in the old fecal samples. Again, this does not
mean that the feces contained no virulence factors. However, Chainmapper found a
high number of copies of the mycobacterial virulence factor aceA, a persistence
factor that sustains intracellular infection in inflammatory macrophages.
6.3.6.7 Virulence factors in the airplane toilet
Table 6.8, Table 6.9, and Table 6.10 show the virulence genes in the toilets of the
airplanes departing from Aalborg, Bangkok, and Toronto, respectively. There
were several low-depth Pseudomonas virulence factors in the Aalborg airplane
fecal samples. The Bangkok and Toronto airplane fecal samples contained low-depth
virulence factors related to type 1 fimbriae and adherence, respectively. No
virulence factors were found in the other airplane toilets. These data do not
mean that those were the only virulence factors in the samples, as the study of
virulence factors is still ongoing.
6.3.6.8 Virulence factors in the vulture samples
Table A.4, Table A.5, and Table A.6 show the virulence genes found on the face
swab and in the gut of the vultures. Very few virulence genes were found in
samples GRG4217_LI and GRG4227_FS, and none were found in sample GRG4217_FS.
The ipaH gene in sample GRG4217_LI is one of Shigella’s invasion plasmid
antigens, indicating the presence of Shigella or enteroinvasive E. coli. The
algU gene, found in sample GRG4227_FS, is associated with antiphagocytosis.
However, sample GRG4227_LI was dominated by C. perfringens virulence factors.
These are the C. perfringens virulence factors, ordered by abundance:
1. GroEL and fbp (fibronectin-binding protein) adherence factors.
2. plc, which produces alpha-toxin. Alpha-toxin has lethal, hemolytic, and
dermonecrotic activities and helps in developing gas gangrene.
3. pfoA, which produces theta-toxin. Theta-toxin damages the host cell
membranes by forming pores.
4. nag, which produces mu-toxin. Mu-toxin is a hyaluronidase, a spreading
factor that helps C. perfringens to spread into deeper tissue.
5. colA, which produces kappa-toxin. Kappa-toxin actively degrades the host
tissues, aiding the growth, survival, and spread of C. perfringens, as well as
helping the diffusion of other toxins.
6. nanHIJ sialidase. Sialidase cleaves carbohydrate polymers and scavenges
them as nutrients for C. perfringens. Sialidase also increases the attachment
of bacteria and the binding of toxins to host cells.
7. hly hemolysin genes, whose products lyse the host’s red blood cells.
These are the E. coli or Shigella virulence factors found in the sample:
1. entD enterobactin factors and other iron chelation genes that steal the
host’s iron. The entD gene is the most abundant recognized virulence gene in
this sample.
2. gsp, the Type II secretion system, which translocates toxins to host cells.
3. fim genes, which promote the development of type 1 fimbriae.
6.3.7 Antibiotic Resistance
Once the exact strains causing the patient’s illness are known, perhaps with the
additional help of virulence finding, the final step is to prescribe the right,
targeted antibiotics. Finding the antibiotic resistance genes is important for
choosing targeted, effective antibiotics for the patient. Prescribing antibiotics
to which the infecting bacteria are resistant would harm the patient.
Administering broad-spectrum antibiotics, which might also kill bacteria not
responsible for the disease, should be avoided to limit the spread of antibiotic
resistance. After narrowing down the list of possible antibiotics according to
other conditions of the patient, e.g. allergy to penicillin, the doctor should
pick the antibiotic to which the pathogen is susceptible and that carries the
minimum risk.
6.3.7.1 Resistance in urine samples dominated by enterococci
Since tetracycline, streptomycin, and kanamycin are not on the list of possible
antibiotic therapies for UTI, the only information helpful for the patients was
that all urines with E. faecalis infection, i.e. urine samples #3, #26, #31, #33,
and #34, and with E. faecium infection, i.e. urine #32, had lsa(A) genes, which
confer resistance against lincosamides and streptogramin A. Streptogramin is
typically used to treat enterococcal infection that is resistant to vancomycin.
However, the administration of vancomycin itself should be avoided, since it is
usually the last resort for enterococcal infection. With this partial
information, Chainmapper at least helps healthcare personnel to remove
streptogramin from the antimicrobial susceptibility test for those patients.
6.3.7.2 Resistance in urine samples dominated by E. coli
Setting aside tetracycline, streptomycin, and gentamicin, the doctor can perform
the susceptibility tests on urine samples #12, #20, #27, and #29 without any
suggestion from Chainmapper. However, the UPEC patient #21 was predicted to
be resistant to beta lactams, owing to the high copy number of blaTEM-1 genes.
Patients with urine samples #6 and #25 were found to be resistant to beta lactams
owing to the abundance of the gene blaTEM-1, as well as to trimethoprim, inferred
from the abundance of the gene dfrA7 in urine #6 and dfrA14 in urine #25. With
Chainmapper finding potential resistance genes in only a few minutes, the
antibiotics to which resistance was predicted can be removed from the
susceptibility test. In urine #10, where both E. coli and Citrobacter freundii
were found in abundance, resistance to beta lactams was possible, with gene
coverage as low as 96.68%. Interestingly, the resistance genes found usually
belong to C. freundii, supporting the hypothesis that C. freundii also
contributed to the illness.
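The reasoning above, in which detected resistance genes are used to drop antibiotic classes from the susceptibility test, can be sketched as a simple lookup. This is an illustrative sketch only, not Chainmapper's code: the gene-to-class table contains just the associations stated in this chapter (blaTEM-1 and blaZ for beta lactams, dfrA7 and dfrA14 for trimethoprim, lsa(A) for lincosamides and streptogramin A), while the function names and the two-entry test panel are hypothetical.

```python
# Associations taken from this chapter only; a real resistance database
# is far larger.
GENE_TO_CLASSES = {
    "blaTEM-1": {"beta lactams"},
    "blaZ": {"beta lactams"},
    "dfrA7": {"trimethoprim"},
    "dfrA14": {"trimethoprim"},
    "lsa(A)": {"lincosamides", "streptogramin A"},
}


def predicted_resistance(detected_genes):
    """Collect the antibiotic classes implied by the detected genes."""
    classes = set()
    for gene in detected_genes:
        # Genes missing from the table contribute nothing (unknown != susceptible).
        classes |= GENE_TO_CLASSES.get(gene, set())
    return classes


def remaining_panel(panel, detected_genes):
    """Return the panel entries for which no resistance was predicted."""
    resistant = predicted_resistance(detected_genes)
    return [drug_class for drug_class in panel if drug_class not in resistant]


# Hypothetical two-entry panel, for illustration only.
panel = ["beta lactams", "trimethoprim"]

urine_21 = ["blaTEM-1"]            # genes reported for urine #21
urine_25 = ["blaTEM-1", "dfrA14"]  # genes reported for urine #25

print(remaining_panel(panel, urine_21))
print(remaining_panel(panel, urine_25))
```

An empty gene list leaves the panel unchanged, which mirrors the point made for urine samples #12, #20, #27, and #29: without a prediction, the full susceptibility test still has to be run.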
6.3.7.3 Resistance in urine samples dominated by other pathogens
Chainmapper did not suggest any resistance genes in urine sample #13 (Prevotella
timonensis) and urine sample #19 (Proteus mirabilis). Since lincosamides,
streptomycin, and kanamycin are not part of the treatment for UTI, the resistance
prediction by Chainmapper for the patient with urine sample #32 was not
necessarily informative. The same applies to the patient with urine sample #24,
as chloramphenicol is not suggested for UTI patients. Patient #28 was highly
resistant to fusidic acid, but since it is not a treatment for UTI, the concern
was the next most abundant resistance gene, blaZ, which confers beta lactam
resistance. Urine sample #35, which was dominated by E. faecalis and S.
maltophilia, was found to be resistant to quinolones. The list of resistance
genes found in the urine samples is shown in Table A.7.
6.3.7.4 Resistance in the hospital sewer
The list of resistance genes found in the sewer sample is shown in Table A.8. It
is normal to find many resistance genes in a hospital sewer [Guardabassi et al.,
1998]. Aminoglycoside resistance genes were found in abundance, followed by
macrolide resistance genes carried by A. baumannii, the class D beta lactamase
gene blaOXA-10, and the tetracycline efflux gene tet(39).
6.3.7.5 Resistance in the ancient toilet
No antibiotic resistance genes were found in the feces. However, the ole(C) gene,
which confers resistance against oleandomycin, was found in both control soil
samples.
6.3.7.6 Resistance in the airplane toilet
Table A.9 to Table A.13 show the lists of resistance genes found in airplanes
departing from five different cities. Samples from the toilets of five different
airplanes were sequenced. In general, tetracycline resistance dominated in all
of the samples. Additionally, all samples except the Aalborg sample were rich in
tet(Q). In the toilet of the Bangkok airplane, clindamycin and kanamycin
resistance genes were the most abundant after tetracycline resistance genes. The
beta lactam resistance genes were found with lower gene coverage and depth
(97.72%, 5X). The Aalborg airplane toilet contained fewer resistance genes,
probably because of the low number of reads. The Newark airplane toilet had more
beta lactam resistance genes: cfxA6, cfxA3, cfxA4, cfxA, and cfxA5. Clindamycin,
erythromycin, macrolide, aminoglycoside, chloramphenicol, and lincomycin
resistance genes were present too. Beta lactam resistance genes were also found
in moderate amounts in the Washington and Toronto airplane fecal samples.
6.3.7.7 Resistance in vulture’s face and intestine samples
Table A.14 and Table A.15 show the lists of resistance genes found in the face
swab and gut samples of the vultures. In sample GRG4217_FS, the face swab
contained various resistance genes, conferring resistance to streptothricin
(sat2A), macrolide-streptogramin (msrE), macrolide (mphE), florfenicol (floR),
chloramphenicol (cmx), and aminoglycoside (aac(3)-IVa). Meanwhile, in the
second gut sample, more antibiotic resistance genes were found, especially for
tetracycline, as can be seen in Table 6.14. Those genes confer resistance to
tetracycline (tetA(P), tetB(P), tet(W), tet(O), tet(M), tet(L), tet(D), tet(40)),
streptomycin (strB, strA), macrolides (mef(A)), lincomycin (lnuC), florfenicol
(floR, fexA), erythromycin (ermT), trimethoprim (dfrA1), beta lactam antibiotics
(blaDHA), and aminoglycoside (aph(3’)-III). Tetracycline, lincomycin, and
aminoglycoside resistance genes were the most plentiful in this sample. However,
the second face sample and the first gut sample did not contain known antibiotic
resistance genes.
6.4 Discussion
This study provides further proof of concept that Chainmapper can profile the
microbial community of various samples, as shown by the results from five
different datasets. Strain identification is important to confirm species
identification. The source organisms of virulence factors and resistance genes
support the species and strain identification, but they are not the primary means
of identifying organisms. The virulence finding is mainly important to support
the speculation that virulent strains are present, especially when the strain
identification suggests that both virulent and avirulent strains are likely to be
present in the samples. The antibiotic resistance finding is also important for
identifying antibiotics that are likely to be less effective for the patient.
However, Chainmapper comes with some limitations. First, it requires substantial
computational resources to achieve a short runtime. Second, the databases
of microbial genomes, virulence factors, and antimicrobial resistance genes are
still incomplete. The more complete and thorough the reference databases are, the
more accurate the results are. Because the reference databases may be incomplete
or outdated, a sample may contain harmful species, virulence factors, or
resistance to antibiotics that Chainmapper does not report.
Further possible studies that would improve Chainmapper are:
1. Providing more proof of concept that Chainmapper works, by testing
Chainmapper with more samples.
2. Updating the databases of virulence factors and antibiotic resistance genes,
as well as the genomes of organisms, so that Chainmapper can provide
better identification.
3. Testing the sensitivity of Chainmapper.
Concluding remarks
In brief, whole genome shotgun sequencing combined with Chainmapper holds great
promise for real-time routine diagnosis and for outbreak prevention and
control. Shotgun sequencing offers a faster turnaround time for sequencing
clinical samples and an affordable price for direct metagenomic sequencing, while
Chainmapper offers rapid analysis of the microbial composition from the
sequencing data. With the ongoing commercial competition in sequencing
technologies, the cost of sequencing will most likely drop steadily, enabling
clinical microbiological laboratories to perform whole genome shotgun sequencing
and to use rapid microbial identification tools like Chainmapper. Once an
automated metagenomic analysis pipeline is fully implemented, metagenomic
shotgun sequencing is likely to soon become standard practice in many clinical
settings for routine diagnostics and real-time outbreak surveillance.
Based on Chainmapper's precise and rapid taxonomic identification of the microbial
community in a patient sample, together with the additional information about
virulence factors and antibiotic resistance, medical personnel can reach a
diagnosis more rapidly and administer targeted treatment. Furthermore, if the
disease has considerable potential to start an outbreak, its spread can be
closely monitored and contained earlier.
Chapter 7
Conclusion
(I have eliminated the ‘future perspectives’ from the title because in
practice you list just one )
In this thesis two bioinformatics tools, Reads2Type and Chainmapper, for
identifying microorganisms in clinical samples are presented. Reads2Type is a
web-based tool that can be used to rapidly identify isolates via a whole genome
sequencing (WGS) approach. Chainmapper can be used to determine microbial
community profiles directly from sequenced clinical samples, without prior
microbial isolation. Chainmapper was tested by determining the bacterial
community profiles of five public-health-related datasets, together with strain
identification, antibiotic resistance, and bacterial virulence.
By using such tools, one avoids the drawbacks of growing bacteria in culture,
such as long incubation times and time-consuming microbial identification.
(The three lines above are meant to summarize what you wrote below, which
is definitely more detailed than what I wrote. The problem with what you
wrote is that it is a bit convoluted. I will try to simplify it.
Sequencing the whole genome of bacterial isolates makes rapid, real-time
microbial identification practical. The biggest problems of the culture-based
WGS approach are the long incubation time and the difficulty of growing some
bacteria outside their natural environments. Nevertheless, Reads2Type is
currently practical for typing bacterial isolates.
On the other hand, the culture-independent WGS approach offers a faster, simpler,
and more thorough alternative to bacterial cultivation in terms of wet-lab work
before sequencing, because it eliminates the need to grow the organisms
in culture. In addition, this approach can explore hard-to-culture microbes,
which is less practical with the culture-based WGS approach.
)
The results from my PhD study show how a combination of NGS technology,
bioinformatics, and real-time epidemiology can be very beneficial for routine
diagnostics and public health epidemiology.
Nevertheless, one of the main problems of direct sequencing is its dependence on
the computing power and the geographical availability of computers.
Figure 7.1. MinION, a nanopore-based USB stick-sized DNA-sequencing machine
With regard to this, it is worth mentioning that, thanks to nanopore technology
(Figure 7.1), it is already possible to sequence DNA with a device as small as a
USB memory stick. In fact, Oxford Nanopore Technologies has initiated an
early-access program for scientists to test its MinION sequencer [McDougall, 2013].
Therefore, in the near future, one would expect high-throughput, ultra-long-read
sequence data to be produced with very high accuracy and a very short
runtime.
Because time and accuracy are key factors when dealing with infections and
outbreaks, tools such as Reads2Type and Chainmapper can be used by medical
authorities to speed up routine diagnostics and prevent the spread of diseases.