Rapid Taxonomy Identification for Medical Diagnostics and Genomic Epidemiology

Dhany Saputra
July 4th, 2014

Summary

In recent years, the DNA of thousands of microorganisms has been completely sequenced and their genes have been characterized. The approach used for medical diagnostics is shifting from a phenotypic approach to a genotypic one. It is now possible to develop bioinformatics tools that can be used to make medical diagnoses and that can help to prevent disease outbreaks. Two such tools, Reads2Type and Chainmapper, are described here; they enable rapid microbial taxonomy identification for both medical diagnostics and public health microbiology. Reads2Type is a web-based tool used to identify isolates. It is rapid and accurate compared to other existing tools. The other tool, Chainmapper, makes it possible to determine the composition of complex bacterial communities in clinical samples. This is done by direct sequencing, i.e., without prior microbial isolation, and within hours. The outcome provides a comprehensive view of the complex bacterial community present in the sample under investigation.

The thesis is structured as follows. Chapter 1 describes the Center for Genomic Epidemiology (CGE). The long-term objective of CGE is to create an information system that can rapidly detect the spread of diseases and monitor their geographical diffusion. Chapter 2 describes DNA sequencing technologies and how the resulting data are analyzed for taxonomy identification purposes. It also describes both the culture-dependent and the culture-independent whole genome sequencing approaches. The Reads2Type tool, which was developed to identify bacterial isolates, is presented in Chapter 3, together with the results from an extensive benchmarking study that compared Reads2Type with existing comparable tools for species identification. The results are presented in Paper II.
Chapters 4, 5 and 6 focus on the Chainmapper tool, which can be used to determine the composition of complex bacterial communities present in clinical samples. The Chainmapper tool was successfully applied to urine samples. The results for this type of sample are discussed in Paper III, while the results for other sample types are presented in Chapter 6. The main points are summarized in Chapter 7.

Introduction

Bacteria and viruses (79.7%), protozoa (10.7%), fungi (6.3%), and parasitic worms (3.3%) are responsible, in the given proportions [Jones et al., 2008], for the world-wide emergence and re-emergence of infectious diseases. A world-wide map of these events is shown in Figure 1.1. From a socio-economic viewpoint, the battle against infectious diseases is very costly. Much effort is therefore invested in developing methods and tools that can help to predict and prevent the outbreak and spread of infectious diseases. In genomic epidemiology, the DNA sequences of suspected pathogens can be thoroughly investigated and characterized using cutting-edge genome sequencing methods together with computational analysis tools. Real-time genomic epidemiology can not only change the characteristic time scale of traditional patient diagnostics, but can also make a difference in how outbreaks and the spread of infectious diseases are prevented. The aim of the work described in this thesis was to develop easy-to-use bioinformatics tools for analyzing sequencing data from clinical samples, such as urine samples. Such tools would allow rapid detection and characterization of the microorganisms present in the investigated samples.

Chapter 1 The Center for Genomic Epidemiology

Figure 1.1 A world-wide map showing the emergence and re-emergence of infectious diseases.
The red bullets indicate regions affected by emerging diseases, the blue bullets regions affected by re-emerging diseases, and the black bullet a “deliberately emerging” disease, as explained in [Morens et al., 2004].

1.1 CGE and the work packages

The Center for Genomic Epidemiology (CGE) was born out of a collaboration between the Center for Biological Sequence Analysis and the National Food Institute, both at the Technical University of Denmark. The aim of this collaboration is to combine next-generation and parallel sequencing technology, computational biology, and global epidemiology to provide real-time genomic epidemiology results. To accomplish this goal, the work at CGE is subdivided into work packages (WP-i, i = 1, 2, …, 7). The packages are listed below, each with a brief description of the work it entails:

WP-1. To develop tools for analyzing and organizing data of complete and nearly complete genome sequences. The results from my doctoral studies contributed to WP-1, as I developed tools for rapid identification of taxonomy, antibiotic resistance, and virulence in single isolated bacterial strains and in metagenomic samples.
WP-2. To build traditional classification methods, such as xxx. (figure out what’s it?)
WP-3. To identify novel genomic targets for epidemiology and evolutionary investigations.
WP-4. To explore, via a holistic approach, new challenges in epidemiology.
WP-5. To make pathogenicity predictions.
WP-6. To combine data from sequence analysis with epidemic and geographic information.
WP-7. To build the CGE web interface.

1.2 CGE pipeline (workflow?)

The CGE workflow for the rapid complete genome analysis of isolates is depicted in Figure 1.2. The seven work packages of CGE are implemented in a pipeline with a client-server architecture, comprising a server side and a client side. Service requesters (i.e., the client computers) request and receive services from a centralized resource provider (i.e., the server) over the network.
Client computers initially display a standardized interface that allows services to be requested from the CGE server and that displays the results the server returns. The CGE server, in turn, waits for clients’ requests and responds to them.

Figure 1.2. CGE workflow

The client side of the CGE workflow concerns the users, i.e., healthcare authorities, particularly those located in remote and/or less-developed areas where access to clinical facilities is limited. The idea is that laboratories around the world submit complete or partial genome sequences to the CGE server. These data are then analyzed in silico, and a rapid response regarding pathogen identification is returned to the healthcare authorities, who then decide how to proceed. Because pathogen identification is crucial for their decision making, the services provided by CGE are very innovative with regard to the detection and prevention of pathogenic diseases. In particular, once the collected medical samples have been grown in a laboratory, the bacterial strains isolated, and their DNA sequenced, the bacterial species is identified using the Reads2Type tool. Based on its output, the healthcare authorities can subsequently compare their isolates to the precompiled microbial isolate data in the CGE databases. To do so, they need to agree to make their data publicly available, thereby contributing to global real-time surveillance. Once the sequencing files are uploaded to the CGE server, the reads are assembled to reconstruct a representation of the original DNA sequences. Based on this draft assembly, other tools, namely MLST [Larsen et al., 2012], KmerFinder, TaxonomyFinder, and SpeciesFinder, are used for species identification. In addition, a so-called SNP-Tree module uses the draft assembly to provide users with additional information regarding the times and places at which closely related outbreaks were previously observed [Leekitcharoenphon et al., 2012].
Information regarding virulence, pathogenicity prediction [Cosentino et al., 2013], and antibiotic resistance [Zankari et al., 2012] of the microorganisms is also provided, based on the draft DNA assembly.

Chapter 2 Next Generation Microbial Diagnostics

The approach adopted in microbial diagnostics is currently shifting from the traditional phenotypic one to a sequencing-based one. In the traditional approach, microbiologists identify microorganisms based on phenotypic and biochemical investigations. Microbial identification by this approach is often error-prone, as two different pathogens may have similar physical properties and biochemical reactions [Cattoir et al., 2010] [Frank et al., 2008]. Pathogens may also be wrongly identified as non-pathogens. The development of sequencing technologies has revolutionized the field of genomics and provided a reliable alternative to the phenotypic approach. It has also triggered the development of computational tools for the rapid analysis of sequencing data. In this chapter, the next generation sequencing (NGS) analysis steps for microbial diagnostics, from sequencing to organism identification, are described. Details about the second-generation sequencing (SGS) and third-generation sequencing (TGS) technologies are given in Appendix_X. For this work, the SGS Illumina technology is used. None of the sequencing data in this thesis derive from TGS technologies other than Ion Torrent. Machines based on TGS technology are nevertheless worth mentioning because they can be expected to replace the SGS ones in the future.

2.2: I explain how to preprocess sequencing data from the NGS machine (preprocess = part of analysis)
2.3: I explain how to align (NGS analysis = alignment, assembly, SNP calling, etc., but I only do alignment).
2.4: I explain the concept of taxonomy & problems in taxonomy (what, why, how), because in the next chapters I identify which bacteria are in the sample.
2.5: I explain culture dependent WGS, related to chap 3
2.6: culture independent WGS (metagenomics), related to chap 4-6

2.2 Preprocessing the Sequencing Data

The sequencing files produced by NGS machines must be preprocessed to ensure their quality. There are two important preprocessing steps: quality control and read trimming, where a ‘read’ is a DNA sequence produced by an NGS machine. FastQC is open source software for quality control that can spot probable errors originating from the sequencer or the library preparation [Andrews, 2010]. Based on the FastQC report, one can proceed with read trimming and then run FastQC again to verify the quality of the sequencing data after trimming. Read trimming clips out the parts of the reads that have low quality scores or contain adaptor sequences, so that these parts do not interfere with the subsequent analysis steps. In addition, the first several base pairs can be cut when DNA damage is presumed. Finally, very short sequences resulting from read trimming are removed. Genobox-trim is the open source software used in subsequent chapters for read trimming [Rasmussen, 2011].

2.3 Genome Alignment

After read trimming, the reads are aligned to reference genomes. Sequence alignment, or sequence mapping, is the matching of DNA, RNA, or protein sequences to identify regions of similarity. These similarities are consequences of functional, structural, or evolutionary relationships between the sequences [Mount, 2004]. Matches, mismatches, and gaps can occur in a sequence alignment. Two aligned elements are called a match if they are identical and a mismatch if they differ. To minimize the number of mismatches, elements may be inserted into or deleted from either of the aligned sequences; these insertions/deletions are called gaps.
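As an illustration of these definitions, the following minimal Python sketch counts the matches, mismatches, and gaps in a pair of sequences that have already been aligned, and combines the counts into a simple linear score. The function names, the '-' gap symbol, and the reward and penalty weights are illustrative assumptions; they are not taken from any of the aligners used in this thesis.

```python
from dataclasses import dataclass

GAP = "-"  # assumed gap symbol for this sketch


@dataclass
class AlignmentSummary:
    matches: int
    mismatches: int
    gaps: int


def summarize_alignment(seq_a: str, seq_b: str) -> AlignmentSummary:
    """Count matches, mismatches and gaps column by column.

    Both inputs must already be aligned, i.e. padded with GAP
    so that they have the same length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(seq_a, seq_b):
        if a == GAP or b == GAP:
            gaps += 1          # insertion or deletion in one sequence
        elif a == b:
            matches += 1       # identical elements
        else:
            mismatches += 1    # different elements
    return AlignmentSummary(matches, mismatches, gaps)


def linear_score(summary: AlignmentSummary,
                 match_reward: int = 1,
                 mismatch_penalty: int = 1,
                 gap_penalty: int = 2) -> int:
    """Reward per match minus penalties per mismatch and per gap.

    The default weights are hypothetical, chosen only for illustration.
    """
    return (match_reward * summary.matches
            - mismatch_penalty * summary.mismatches
            - gap_penalty * summary.gaps)


if __name__ == "__main__":
    summary = summarize_alignment("ACGT-ACGT", "ACCTGAC-T")
    print(summary, linear_score(summary))
```

Real aligners search for the gap placement that maximizes such a score; the sketch only evaluates a given alignment.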
The alignment score is computed by adding a reward for each match and subtracting a penalty for each mismatch and each gap. The major challenge of genome alignment is to align numerous reads to lengthy reference genomes quickly and accurately while using as few computational resources as possible. To speed up the alignment, several short read aligners use a computational strategy called indexing. Similar to the index at the end of a book, the index of a reference genome helps the aligner to rapidly find a short sequence embedded within it. In general, there are two index-based approaches to speeding up the mapping: hash tables and tries. BLAST [Altschul et al., 1990] [Altschul et al., 1997], BLAT [Kent, 2002], SOAP [Li et al., 2008b], MAQ [Li et al., 2008a], SHRiMP [Rumble et al., 2009], Novoalign [Technologies, 2013], and BFAST [Homer et al., 2009] implement hash table-based algorithms, whereas Bowtie [Langmead et al., 2009], Bowtie2 [Langmead and Salzberg, 2012], BWA-ALN [Li and Durbin, 2009], BWA-SW [Li and Durbin, 2010], BWA-MEM [Li, 2013], and SOAP2 [Li et al., 2009b] implement trie-based algorithms. In this thesis, Bowtie2 and BWA-MEM are used to align metagenomic samples because they are fast and require less memory [Li, 2013]. The product of sequence alignment is a Sequence Alignment/Map (SAM) file [Li et al., 2009a]. It is recommended to always compress SAM files into Binary Alignment/Map (BAM) files, as this conversion reduces disk use [Li and Durbin, 2009]. The files can then be used for downstream analyses, such as generalized RNA-seq, variant calling, genome assembly, and taxonomy identification. The downstream analysis of interest in this thesis is taxonomy identification.

2.4 Taxonomy Identification

Figure 2.5. The taxonomy of American black bear [Campbell and Reece, 2005]

Once the sequencing data of the sample have been obtained, the taxonomy of the pathogens needs to be identified.
Taxonomy is the science of classifying organisms. The classification of life has been revised many times, from the Linnaean classification, which divides life into plants and animals, to Woese’s three-domain system, which divides life into archaea, bacteria, and eukaryotes. Bacterial systematics has also changed over time. One of the conventional approaches is Gram staining, which is still used as the first step of bacterial identification [Austrian, 1960]. Based on the Gram stain and the cell shape, microbiologists narrow down the bacterial identification to a smaller group, but more effort is needed to establish the precise identity. Current taxonomic classification uses ordered ranks: domain, phylum, class, order, family, genus, and species, where species is the lowest rank. Figure 2.5 shows the taxonomy of the American black bear. According to Bergey’s Manual of Systematic Bacteriology, the basic and most important taxonomic rank in bacterial systematics is the species. A microbial species is defined as a collection of organisms with a 16S rRNA sequence similarity of at least 98.7% among its members, and a DNA-DNA reassociation experiment is required for confirmation [Stackebrandt and Ebers, 2006]. However, the definitions of some species are controversial. For example, Yersinia pestis and Yersinia pseudotuberculosis have very similar 16S rRNA sequences, yet the Judicial Commission of the International Committee on Systematics of Prokaryotes rejected the idea of merging them into one species because of the difference in the danger they pose to public health [Whitman et al., 2012]. A genus is a collection of species that share phenotypic characteristics and that cluster together by 16S rRNA sequence. The genus classification is still used although there is no satisfactory definition of a genus, and higher taxa below phylum are even less well defined than the genus.
However, different organisms within the same species may have completely different impacts on public health. Escherichia coli O157:H7, for instance, is highly virulent, while many other E. coli subtypes are non-pathogenic [Wirth et al., 2006]. Hence, species are further subtyped into strains based on their special features. For example, biovar/biotype and serovar/serotype are used to distinguish strains having different biochemical and antigenic properties, respectively [Whitman et al., 2012]. Identifying the taxonomy of the pathogen in a sample is helpful because every pathogen inherits certain properties from its ancestor and shares these properties with other members of its taxonomic group. In addition, microbial taxonomy reveals how different microbes are related. Consequently, when organisms are correctly classified, a drug can be developed and its efficacy tested against all pathogens within the same taxonomic group, especially if the group can be characterized by a molecule, pathway, or trait targeted by the drug. Furthermore, when two pathogens are found to differ at species level despite having similar phenotypic traits, the possibility that they require different treatments should be considered carefully [Berman, 2012].

2.5 Culture Dependent WGS

Figure 2.6. Culture dependent approach vs direct sequencing approach

Several sequencing approaches for routine diagnosis have been proposed, such as culture dependent and culture independent whole genome sequencing (WGS), illustrated in Figure 2.6. In culture dependent WGS, microbiologists cultivate the microbes in patient samples, wait until microbial colonies emerge, purify the DNA, and proceed to sequencing. Producing a sufficient amount of pure organisms in a colony requires 12 days for fast-growing bacteria, or even months for slow-growing ones [Köser et al., 2012b]. Figure 2.7 shows the complexity and the timescale of cultivating a bacterial pathogen.
Once the whole DNA of the selected colony has been sequenced, computational analysis tools are used to identify the taxonomy of the species. This is done by aligning the sequences to defined marker genes, either a single locus or multiple loci [Larsen et al., 2012]. Once the species is identified, the presence of antimicrobial resistance and virulence factors can also be detected in order to find the right antibiotic for the patient and to examine the disease progression. Microdilution and disk diffusion are common methods for antibiotic susceptibility testing; they take one extra day, or even longer for some bacteria [Köser et al., 2012a]. However, the databases of resistance gene and virulence factor sequences are still incomplete, so conventional susceptibility testing is still required as a complement to the molecular prediction. Furthermore, WGS promises definitive resolution for outbreak investigation, for which it is essential to identify the strain.

Figure 2.7. Workflow of culture-dependent investigation of bacteria according to [Didelot et al., 2012]

Compared to traditional phenotypic analysis, WGS-based microbial identification provides more rapid, unambiguous, and detailed genotype information about the disease-causing pathogens, which is very helpful for finding a targeted treatment for the patient and for preventing an outbreak at an earlier stage. The surveillance, by sequencing of the entire bacterial genome, of the cholera outbreak in Haiti [Hasan et al., 2012][Hendriksen et al., 2011] and of the E. coli O104:H4 outbreak in Germany [Brzuszkiewicz et al., 2011][Grad et al., 2012][Rasko et al., 2011] are some of its success stories. WGS also enables the reconstruction of transmission pathways between healthcare centers, hospital wards, or even patients in the same ward [Andersen et al., 2010][Köser et al., 2012b].

2.6 Metagenomics

Figure 2.8. Metagenomic workflow: targeted gene sequencing (top) vs whole genome sequencing (bottom).
Source: http://bgiamericas.com/servicesolutions/services/metagenomics/

The second approach in sequence-based medical diagnosis is the culture independent technique, also known as the metagenomic approach, illustrated in Figure 2.6. Metagenomics provides a comprehensive genomic view of the ecological population in a biological sample. Understanding the microbial composition of a sample is important in many disciplines, such as healthcare [Loman et al., 2013], marine biology [Brum et al., 2013], and terrestrial biology [Tseng et al., 2013]. Microbes live in many different environments and interact with one another, contributing to the stability of their habitats. In nature, the microbial community contributes integrally to important cycles of life such as photosynthesis [Cuvelier et al., 2010], the conversion of nutrients [Sessitsch et al., 2012], and the degradation of pollutants in the environment [Fang et al., 2012]. Billions of microbes living in synergy in the gut help the host to digest food and protect it against pathogens [Fujimura et al., 2010]. However, before the advent of metagenomics, the understanding of microbial communities was limited because microbiologists studied individual species one by one in pure cultures and were unable to study microbes in their habitats. In addition, studies of microbial communities in biofilms focused on a few selected species at a time [O’Toole et al., 2000]. The most prominent advantage of metagenomics over microbial cultivation is its ability to investigate hard-to-cultivate organisms [Streit and Schmitz, 2004], as microbial cultivation is currently still the primary method for microbial identification [van Belkum et al., 2013]. Furthermore, less than 0.02% of the organisms in some environments are cultivable [Rappé and Giovannoni, 2003] [Hugenholtz et al., 1998], while the rest are hard to culture. Even if there were a way to cultivate them with artificial, super-enriched media, the effort would be too great.
New pathogenic strains are often hard to cultivate, making medical diagnosis even more challenging [Campbell et al., 2013]. Metagenomics offers an in silico solution that explores all organisms in a sample, including the uncultured ones. There are two approaches to metagenomic sequencing: targeted gene sequencing and whole genome sequencing (see Figure 2.8). In targeted gene sequencing, the ITS and 16S rRNA regions are typically amplified to survey the communities of fungi and bacteria in a sample, respectively [Buée et al., 2009] [Wang and Qian, 2009]. However, amplification bias may occur in 16S sequencing and affect its accuracy [Yilmaz et al., 2010]. Additionally, the 16S rRNA sequences of several species are sometimes very similar [Olive and Bean, 1999]. Whole genome shotgun metagenomics offers a more in-depth analysis and a broader view of the microbial composition than 16S sequencing, as it captures all parts of the genomes in the sample [Chen and Pachter, 2005]. In this approach, the DNA fragments from the sample are sequenced directly, without any further preprocessing steps. However, whole genome shotgun metagenomics requires a higher depth of sequencing coverage to enable a closer look at underrepresented minority organisms.

Chapter 3 Reads2Type

3.1 Introduction

Metagenomics (= direct sequencing = culture independent approach) has drawbacks. Thus, culture dependent is still practical. Reads2Type (Paper 1) is software for culture dependent analysis. Then I explain why Reads2Type is good. I’ve tested the speed. Paper 2 benchmarks Reads2Type against other methods.

Although the development of metagenomics is exciting and rewarding for microbiologists, culture based WGS is still of interest for detecting diseases and preventing outbreaks [Didelot et al., 2012] [Köser et al., 2012a].
There are still open issues regarding the speed, workflow complexity, sequencing cost, and sensitivity of metagenomics [Tanaseichuk et al., 2012]; therefore, some people still rely on microbial cultivation in clinical practice. In the culture based approach, once a significant microbial isolate is detected, the species is identified, and then antimicrobial susceptibility, microbial susceptibility, and outbreak possibility are investigated. Reads2Type is a novel web-based taxonomy identification tool that fits well as a front-end component of a culture based WGS analysis pipeline. The first paper, “Reads2Type: Rapid Microbial Taxonomy Identification” in Section 3.2, introduces and elaborates on Reads2Type. Figure 3.1 shows where Reads2Type fits into the CGE pipeline for WGS analysis. The key advantages of the web-based Reads2Type are its speed, its minimal use of Internet bandwidth, and its user-friendliness. Reads2Type has also been tested over broadband Internet connections in Sweden (Figure 3.2), Jordan (Figure 3.3), and Indonesia (Figure 3.4), where it correctly identified the species of a Klebsiella pneumoniae Short Read Archive (SRA) raw sequencing file. No file submission is required in Reads2Type, and the identification is performed entirely on the user’s computer. Reads2Type is available as a free-to-use web service (http://cge.cbs.dtu.dk/services/Reads2Type) as part of the services provided by the CGE project.

Figure 3.1. Where Reads2Type fits into the CGE pipeline in Figure 1.2

Figure 3.2. Screen capture of Reads2Type correctly identifying a Klebsiella pneumoniae raw read file, tested using a broadband Internet connection in Lund, Sweden

Figure 3.3. Screen capture of Reads2Type correctly identifying a Klebsiella pneumoniae raw read file, tested at the National Center for Agricultural Research and Extension, Amman, Jordan

Figure 3.4.
Screen capture of Reads2Type correctly identifying a Klebsiella pneumoniae raw read file, tested using a broadband Internet connection in Indonesia

The second paper, “Benchmarking of Methods for Genomic Taxonomy” in Section 3.3, compares the performance of five available methods for species identification: Reads2Type, SpeciesFinder, KmerFinder, TaxonomyFinder, and MLST. The paper notes that only the first three methods are suitable for taxonomy identification from raw sequencing files. Among these tools, Reads2Type is the only species identification tool that has a web-based version, which does not require the sequencing data to be uploaded to the server. Instead, a small reference database (4.6 MB) is automatically transferred into the client computer’s memory during the initiation step. Based on an extensive series of experiments identifying the species of 10,407 raw sequencing files representing 168 species, Reads2Type (87%) is slightly more accurate than SpeciesFinder (86%), and KmerFinder (97%) is the most accurate of the three. However, on average Reads2Type is almost 2.5 times as fast as SpeciesFinder and KmerFinder. The low runtime of Reads2Type is due to its small reference database (i.e., the database consists only of frequently seen bacterial sequences), its narrow-down strategy (i.e., when a read matches a sequence shared by a group of bacteria, the search space is reduced to that group), and its use of a suffix tree for fast string matching.

3.2 Paper I

The following paper was submitted to Journal of Clinical Microbiology.

3.3 Paper II

The following paper was submitted to Journal of Clinical Microbiology.

Chapter 4 Chainmapper

Intro: Culture dependent has drawbacks. Culture independent via 16S sequencing is better but still has drawbacks. Culture independent via direct sequencing (shotgun) is better than 16S. Why better? Because it finds things that 16S & culture can’t find.
Challenge of direct sequencing: difficult to analyze. Solution: Chainmapper.
Method - Data Preprocessing: same as Chapter 2
Method - Alignment: same as Chapter 2
Method - Species ID: to ID up to species level.
Method - Strain ID: more accurate & better than species ID, but slower.
Method - Antibiotic resistance & virulence (Chainmapper does this too)
(We stop here, then we go to Chapter 5, a paper about direct sequencing + use of Chainmapper for direct sequencing, case study: patient’s urine samples. Then Chapter 6 = more about urine sample results not in paper, plus trying Chainmapper in 4 samples)

4.1 Introduction

Culture based WGS has several problems: the complicated procedures for microbial cultivation, the long cultivation time, the different selective media required for different types of organisms to grow, the fact that the resulting sequencing data cover only one genome at a time [Didelot et al., 2012], and the inability to culture some organisms in the samples [Hugenholtz et al., 1998] [Rappé and Giovannoni, 2003]. Sequencing the 16S rRNA of the bacteria in a complex environmental sample offers a culture independent approach to analyzing its genetic diversity without the need to grow isolates [Muyzer et al., 1993]. This approach is faster and can reveal the presence of uncultured organisms. However, 16S rRNA based analysis of metagenomic samples only identifies the organisms in the sample. It cannot identify the presence of plasmids, non-16S genes, or non-coding regions. Moreover, 16S sequencing requires gene filtering to retain only the expected 16S genes. Whole genome shotgun metagenomic sequencing allows all genomes in a sample to be sequenced simultaneously [Tyson et al., 2004]. This technique offers an alternative solution for cataloging the microbial composition of a sample.
With this approach it becomes possible to find antibiotic resistance genes, virulence genes, and other genes, and even to detect the presence of plasmids. However, because the short reads come from a mixture of several organisms, assigning each read to an organism and a gene is difficult. When a reliable analysis technique is available, whole genome shotgun metagenomic sequencing allows deep exploration of the organism composition, the enumeration of each organism in the sample, and the recovery of other DNA-related information. Chainmapper is a command line tool for profiling the microbial community in a biological sample and estimating the abundance of its members. Chainmapper can be accessed at http://cge.cbs.dtu.dk/services/Chainmapper-1.0/.

4.2 Method

The input is the raw metagenomic sequencing data. The workflow of Chainmapper is shown in Figure 4.1.

Figure 4.1. The workflow of Chainmapper

4.2.1 Data Preprocessing

As explained in Section 2.2, the quality of the sequencing data must be ensured before the analysis starts. Chainmapper provides an option for automated read trimming (Step 1 in Figure 4.1). However, users are still advised to perform read trimming and quality control manually and to start Chainmapper with trimmed sequencing data. Manual preprocessing is advised because trimmed sequencing data might still be of low quality due to presumed DNA damage.

4.2.2 Alignment

Chainmapper aligns the metagenomic reads to a series of reference databases, one immediately after another (Steps 2.1, 2.2, 2.3, 3.1, and 5.1 in Figure 4.1). By default, the reference databases used for mapping are the NCBI databases of complete and draft bacterial genomes [Federhen, 2012]. Other reference databases, such as the databases of fungi, viruses, invertebrates, protozoa, all nucleotides, and MetaHIT, can also be used.
All alignments are done using BWA-MEM [Li, 2013], except the alignment against the nucleotide database (Step 5.1 in Figure 4.1), which uses Bowtie2 [Langmead and Salzberg, 2012] because BWA-MEM is unable to index the huge nucleotide database. Species identification, strain identification, antibiotic resistance finding, and virulence finding are the four Chainmapper modules that can be run in parallel after the alignment.

4.2.3 Species Identification

Species identification is the core function of Chainmapper (Steps 2.1, 3.1, 4.1, 5.1, and 6.1 in Figure 4.1). This module quickly identifies the dominant organisms in the sample at species level and calculates the number of unmapped reads. The outputs are the list of the most abundant species in the sample and their estimated abundances, presented together in a graph and in a tab-separated text file. Users can adjust the minimum fraction of reads (as a percentage) above which a species is categorized as abundant (the default is 0.01%). In addition, this module can profile the microbial community of the given sample at phylum level or at other taxonomic ranks. As the differences in genome sequence among lower taxa are smaller, assigning short reads to organisms is more reliable at higher taxonomic levels. Hence, the confidence in assigning a read to a species is lower than the confidence in assigning it to a phylum, especially when many reads map to more than one organism; in that case the strain identification module is needed to confirm the species abundance.

4.2.4 Strain Identification

Strain identification identifies the strains that have high genome coverage and high average depth in the given sample (Steps 4.2 and 5.2 in Figure 4.1). The genome coverage represents the fraction of the reference genome of a particular strain that is covered by the sequencing reads, whereas the average depth represents the number of copies of that genome in the sequencing data, given its genome coverage.
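The two quantities just defined can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Chainmapper's implementation: the input is a hypothetical list holding the read depth at every position of one reference genome, and, because "average depth given its genome coverage" can be read in two ways, the sketch returns both the mean over the whole genome and the mean over covered positions only.

```python
from typing import Sequence, Tuple


def coverage_and_depth(per_base_depth: Sequence[int]) -> Tuple[float, float, float]:
    """Return (genome coverage, mean depth over all positions,
    mean depth over covered positions) for one reference genome.

    per_base_depth[i] is the number of reads overlapping position i
    (a hypothetical input format chosen for this sketch).
    """
    genome_length = len(per_base_depth)
    if genome_length == 0:
        raise ValueError("reference genome must have at least one position")

    covered_positions = sum(1 for depth in per_base_depth if depth > 0)
    total_depth = sum(per_base_depth)

    # Fraction of the reference covered by at least one read.
    coverage = covered_positions / genome_length
    # Mean depth across the whole reference, uncovered positions included.
    mean_depth_all = total_depth / genome_length
    # Mean depth restricted to positions that are actually covered.
    mean_depth_covered = total_depth / covered_positions if covered_positions else 0.0
    return coverage, mean_depth_all, mean_depth_covered


if __name__ == "__main__":
    # Toy 10-base reference with 6 covered positions.
    toy_depths = [0, 0, 3, 5, 4, 0, 2, 2, 0, 4]
    print(coverage_and_depth(toy_depths))
```

In the toy example, 6 of 10 positions are covered (coverage 0.6) and the 20 aligned bases give a mean depth of 2.0 over the whole reference.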
The average depth can be associated with the strain’s relative abundance. Identified organisms are grouped into plasmids, bacteria, fungi (if required), viruses (if required), invertebrates (if required), and protozoa (if required). Strains are reported if their genome coverage is at least 10% and their average depth is at least 1X. For plasmids and viruses, the coverage threshold is raised to 97.5%, as their reference sequences are relatively short. Protozoa and invertebrates should only be counted when a very large fraction of the sequencing reads maps to them, contributing to a genome coverage close to 100%, as they have larger genomes. Therefore, when eukaryotic reads are found in large amounts during species identification yet their coverage is low in strain identification, these reads are considered false positives. Strain identification is mainly used to double-check the presence and abundance of the species, although this module can also work independently to find the dominant strains, at the cost of a longer runtime. For ~5 million reads, strain identification (~120 minutes) takes three times longer than species identification (~40 minutes). The extra time is needed to merge the alignment results from the paired reads, sort the merged file by reference genome, index it using SAMtools [Li et al., 2009a], and calculate the genome coverage and depth using BEDTools [Quinlan and Hall, 2010].

4.2.5 Antibiotic Resistance and Virulence Finding

This module identifies the presence of genes causing antibiotic resistance and virulence in the sequencing data, which is crucial for targeted treatment and for confirming the presence of virulent strains. The database of antibiotic resistance genes was adopted from ResFinder [Zankari et al., 2012], while the database of virulence genes was taken from the virulence factor database (VFDB) [Yang et al., 2008] [Chen et al., 2012].
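The reporting rule given for strain identification in Section 4.2.4 can be sketched as a simple filter in Python. This is a hedged illustration only. The names Hit and reportable and the example values are hypothetical, and coverage is assumed to be stored as a fraction between 0 and 1. The thresholds are the ones stated in the text: 10% coverage at 1X in general, and 97.5% coverage for the short plasmid and virus references.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    name: str        # reference name (made-up examples below)
    group: str       # e.g. "bacteria", "plasmid", "virus"
    coverage: float  # fraction of the reference covered, 0 to 1
    depth: float     # average depth, in X

# Thresholds as stated in the text.
DEFAULT_MIN_COVERAGE = 0.10
STRICT_MIN_COVERAGE = {"plasmid": 0.975, "virus": 0.975}
MIN_DEPTH = 1.0


def reportable(hits: List[Hit]) -> List[Hit]:
    """Keep the hits that pass the coverage and depth thresholds."""
    kept = []
    for hit in hits:
        min_coverage = STRICT_MIN_COVERAGE.get(hit.group, DEFAULT_MIN_COVERAGE)
        if hit.coverage >= min_coverage and hit.depth >= MIN_DEPTH:
            kept.append(hit)
    return kept


# Made-up values for illustration only.
hits = [
    Hit("strain_A", "bacteria", 0.85, 12.0),  # passes 10% and 1X
    Hit("strain_B", "bacteria", 0.05, 30.0),  # coverage too low
    Hit("strain_C", "bacteria", 0.50, 0.4),   # depth too low
    Hit("plasmid_P", "plasmid", 0.60, 30.0),  # below 97.5% for plasmids
    Hit("virus_V", "virus", 0.98, 2.0),       # passes 97.5% and 1X
]
reported = [hit.name for hit in reportable(hits)]
```

The same rule with the 97.5% threshold would apply to the resistance and virulence genes handled by this module, since their reference sequences are also short.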
The resistance genes and virulence factors are reported if the sequencing data covers at least 97.5% of each gene at a depth of at least 1X. The high coverage threshold is used because the reference sequences are short. Finding the resistance genes in sequencing data of 3-5 million reads takes about 10 minutes using one CPU core and 8 GB of RAM, mostly spent on sorting the binary alignment map (BAM) files, indexing them, and calculating the coverage.

Chapter 5 Direct Sequencing for Diagnostics

Doctors diagnose patients with microbial infections based on visible symptoms and medical records. If there are doubts, the doctors may proceed with clinical laboratory tests to identify the precise pathogenic strain causing the disease. The microbiological laboratory personnel isolate the microbe and then use either conventional morphological species identification or the culture-based WGS technique to investigate the pathogen causing the disease. However, the tests can take days or even weeks, while in the meantime the patient's condition may worsen. Therefore, rapid diagnosis of a medical sample is important for prescribing a correct, targeted treatment and for monitoring the progression of the disease. More importantly, rapid detection of a potential outbreak is vital for early warning and control of the disease.

5.1 Direct sequencing and Chainmapper

The direct DNA sequencing approach for routine diagnostics is introduced in this chapter. Whole genome shotgun metagenomic sequencing, also known as direct sequencing, is a sequencing technique used to study the genomic material recovered directly from samples.
This technique does not require growing the microbe in a suitable culture medium, and it allows simultaneous sequencing of all genomes in a sample. In addition, direct sequencing can be done within hours because the DNA from the sample goes directly to the sequencing steps. Meanwhile, culture-dependent WGS demands several days or months for growing the microbes and selecting the microbial colonies under study, provided that the microbes can grow in the given medium. Rapid growers like E. coli might take one day for incubation and media preparation, but slow growers like Mycobacterium tuberculosis require three months of incubation. In 2012, CGE introduced a novel approach of direct sequencing for routine diagnostics. The paper titled “Rapid whole genome sequencing for the detection and characterization of microorganisms directly from clinical samples”, presented in Section 5.2, evaluates the applicability of whole genome shotgun metagenomic sequencing for routine diagnostics. This paper shows that Chainmapper, as described in Chapter 4, fits into the concept of direct sequencing. Moreover, it shows that Chainmapper outperforms MG-RAST in terms of execution time and Kmer in terms of accuracy. The paper also presents an application of Chainmapper to surveying the microbial composition of 24 samples from patients suspected of having urinary tract infection (UTI). UTI is microbial infiltration of the normally sterile urinary tract that causes infection in the urethra, bladder, ureter, or kidney. The most common cause is uropathogenic Escherichia coli (UPEC), which accounts for 80% of the cases. The remaining 20% are caused by other pathogens such as Staphylococcus saprophyticus, Klebsiella spp., Proteus mirabilis, and Enterococcus faecalis. Comprehensive reviews of UTI are given in [Wang et al., 2013], [Shepherd and Pottinger, 2013], and [Barber et al., 2013]. Most of the samples were found to be infected with UPEC or E. faecalis.
However, there are a few exceptional cases. Some samples contain other pathogens, and other samples contain no pathogens. This paper shows that sequencing the urine samples directly yielded not only the list of the most dominant organisms in the samples, but also a collection of hard-to-culture bacteria, such as Lactobacillus iners, Gardnerella vaginalis, Prevotella timonensis, and Aerococcus urinae, found in some urine samples. The success of this study indicates that sequencing whole genomes directly from clinical samples brings advantages to routine diagnostics and outbreak surveillance in clinical settings.

5.2 Paper III

The following paper was published in Journal of Clinical Microbiology in November 2013.

Chapter 6 Chainmapper in Practice

6.1 Introduction

The previous two chapters describe the methodology of Chainmapper and its application to direct sequencing for routine diagnostics. This chapter extends the application of Chainmapper to identifying organisms found in various metagenomic samples and to detecting any potential virulence in the samples. In addition to the identification of organisms, Chainmapper also investigates the presence of antibiotic resistance genes, virulence factors, and human microbiome genes in these samples.

6.1.1 Antibiotic Resistance Genes

Figure 6.1. The timeline of antibiotic deployment and the evolution of antibiotic resistance. Text above the timeline gives the years in which the antibiotics were discovered. Text below the timeline gives the years in which resistance was found.

Since the discovery of antibiotics in 1928, the number of incurable bacterial diseases has decreased [McIntyre, 2007]. Many antibiotics are naturally produced by bacteria and fungi, either in response to environmental stress, to discourage microbial competitors in the environment, or as signaling molecules in quorum sensing [Martínez, 2012], and some antibiotics are synthetically developed once their lead compounds have been identified. However, the awareness of bacterial resistance to both natural and synthetic antibiotics has shifted attitudes in hospitals and agriculture [Cantas et al., 2013]. Antibiotics may attack the cell wall biosynthesis, protein synthesis, DNA synthesis, or folic acid synthesis of the pathogen, but bacteria evade antibiotics by restricting the access of the antibiotic, secreting enzymes that inactivate it, modifying the antibiotic target, or failing to activate the antibiotic [Wilson et al., 2011]. Antibiotic resistance may arise naturally, as the species producing antibiotics have their own mechanisms of resistance [Allen et al., 2010].
On the other hand, overprescribing antibiotics, prescribing antibiotics at the wrong dose or duration, prescribing antibiotics for viral infections, and using antibiotics on crowded livestock also contribute greatly to resistance [Arason et al., 1996] [Boken et al., 1995] [Wang et al., 1999] [Yap, 2013]. Furthermore, resistance genes can be transferred into a pathogen, making antibiotics unable to kill or inhibit it. The biggest contributor to resistance is genetic plasticity, such as acquired mutations and horizontal gene transfer (HGT). For bacteria, HGT is an easier, quicker, and safer route to resistance than acquired mutation, as resistance genes can be picked up by plasmids and spread to new environments [Wilson et al., 2011]. The number of antibiotics to which pathogens remain susceptible is becoming limited, while the discovery of novel antibiotics has almost stopped, as depicted in Figure 6.1. However, antibiotic resistance is known to be governed by specific resistance genes, so sequencing the whole bacterial genome enables the detection of antimicrobial resistance.

6.1.2 Virulence Factors

Not only do our bodies develop strategies to attack pathogenic bacteria, but bacteria also have strategies to attack our bodies effectively, contributing to microbial pathogenicity. Virulence factors are gene products that enable pathogens to settle on or within a host and increase their potential to cause disease [Yang et al., 2008] [Chen et al., 2012]. Examples of virulence factors are bacterial toxins, factors for adherence to the host, antiphagocytic capsules, iron uptake systems, and bacterial proteases. If healthcare personnel are in doubt whether a sample contains a pathogenic or a non-pathogenic strain, the presence of a virulence gene increases the likelihood that the sample contains the pathogenic strain [Wilson et al., 2011].

6.1.3 Human microbiome

The human microbiome is the complex ensemble of microbes and microbial genes associated with humans.
Aligning the metagenomic sequencing data against the novel gene catalogues of the human microbiome is necessary when the samples are closely associated with any part of the human body. For instance, samples from a sewer, which might contain human feces, need to be mapped to the curated references of human microbiome genes.

Figure 6.2. The composition and health effects of predominant human fecal bacteria. The figure shows approximate numbers of the different genera [Gibson and Roberfroid, 1995].

In this thesis, the MetaHIT 2nd gene catalogue [Qin et al., 2010] is used. Some MetaHIT genes match genes submitted to NCBI, while others are novel and unnamed. When a large number of sequencing reads do not match any NCBI reference genome, knowing that the reads map to novel MetaHIT genes is informative enough to suggest that they originate from novel fecal genes. Identification of microbes in samples taken from body sites can also be associated with the health profile. Some microbes in the human body are harmful, but others synthesize nutrients for the host from undigested food and displace pathogens from attachment sites in the body to pay their “accommodation rent”, as depicted in Figure 6.2 [Wikoff et al., 2009] [Benson et al., 2010]. Several studies have suggested the consumption of probiotic products, as they contain a number of gut bacteria beneficial to human health [Oelschlaeger, 2010] [Picard et al., 2005] [Soccol et al., 2010]. Some association studies have also shown that shifts in the microbial population of the human body are associated with disease. For example, preliminary results showed that colorectal cancer is associated with the anaerobic Fusobacterium nucleatum [Castellarin et al., 2012] and that patients with Crohn’s disease have an abundance of Enterococcus faecium as well as several Proteobacteria [Mondot et al., 2011].
Human gut microbial composition has been associated with diet, obesity, and inflammatory bowel disease (IBD) [Greenblum et al., 2012]. An intestine rich in Bacteroidetes and Parabacteroidetes is associated with animal protein and saturated fats in the diet, while an intestine dominated by Prevotella and Desulfovibrio is associated with a diet of carbohydrates, simple sugars, and vegetables [Wu et al., 2011]. The ratio of Bacteroidetes to Firmicutes is associated with obesity [Turnbaugh et al., 2006] [Ley et al., 2005] [Ley et al., 2006] and a large population of Enterobacteriaceae is associated with IBD [Garrett et al., 2010]. Humans have been categorized according to the microbial composition of their intestines [Arumugam et al., 2011] and vaginas [Ravel et al., 2011], but the inherent differences between and within the groups are still under examination and debate.

6.2 Datasets

The datasets used are sequencing data from urine specimens, a sewer sample, ancient toilet samples, airplane toilet samples, and vulture samples.

6.2.1 Urine specimens

The first set of samples was collected in April and September 2012 from patients with suspected UTI at Hvidovre hospital, Denmark. A total of 24 urine samples, each 10 mL, were prepared for whole genome metagenomic sequencing. The samples were sequenced using Ion Torrent PGM (Life Technologies), producing variable-length single-end reads, with the number of reads per sample ranging from 1.3 million to 3.8 million. The sequencing preparation is described in Section 5.2. Chainmapper was used to explore the microbial content of these urine specimens. Knowing whether a urine sample contains no pathogens, pathogens related to other diseases, or disease-related pathogens that were named or described only a few months earlier is very informative in helping the doctor diagnose the patient.
6.2.2 Sewer sample

A sample was taken from the sewer around Herlev hospital in Northern Denmark and the DNA extract was directly sequenced using Ion Torrent PGM, producing 2.8 million variable-length single-end reads. Identifying the microbial life in the sewer around a hospital may reveal the nosocomial infections that pose a threat in the hospital.

6.2.3 Ancient toilet samples

A pair of 300-year-old latrines unearthed from beneath Kultorvet Square, Copenhagen, Denmark, were sampled and sequenced. The DNA was extracted from the feces and from the soil around the feces. This study was done in collaboration with the Museum of Copenhagen and the Center for GeoGenetics, University of Copenhagen. The low oxygen content of the soil at the excavation site means that the remains were very well preserved, and the smell of rotten eggs means that the bacteria had not yet consumed all of the contents. The samples were sequenced using Illumina on multiple lanes, where the adaptors have slightly different sequences, so that reads from two different lanes can be distinguished even though the mixed barcoded libraries are run at the same time. The length of the sequencing reads is 100 bp and the numbers of reads are given in Table 6.1. The study of the complex microbial assemblage in these old fecal samples aims to characterize the health profile of the lower social class people living in the 18th century near Kultorvet Square, Copenhagen, Denmark, and to find potential outbreaks, if any, that occurred at that time.

6.2.4 Airplane toilet samples

Toilets of airplanes departing from five different cities to Copenhagen were sampled. The five departure cities are Aalborg in Denmark, Bangkok in Thailand, Washington D.C. and Newark in the United States, and Toronto in Canada. The samples were directly sequenced at the National Food Institute, Technical University of Denmark, using Illumina MiSeq with a read length of 150 bp and the numbers of reads given in Table 6.2.
The assessment of the microbial content of the human waste collected in these samples may reveal the microbial diversity of the body waste flushed by passengers into the toilets. Most of the passengers are presumably either residents of the departure cities or inhabitants of Copenhagen who had spent several days in those cities.

6.2.5 Vulture samples

Vultures are scavenging birds that are notorious for eating the carcasses of dead animals, which have typically died of infectious diseases [Bangert et al., 1988]. Their diet raises microbiological and pathological questions. First, do the pathogens survive in their digestive system, come out alive in their feces, and remain capable of causing an outbreak? It is possible that vultures have developed resistance against bacteria [Carvalho et al., 2003], so this study might point to alternative medication for the diseases. Second, as their typical food is armadillo, which naturally carries and disseminates pathogens such as the leprosy bacterium, why do they not develop sores on their faces? To answer this question, it should be established whether leprosy bacteria are indeed present on their faces or are eradicated before they can disseminate to the entire body of the vulture. The Molecular Microbial Ecology Group at the University of Copenhagen used Illumina to sequence the intestines of two dead North American vultures (Cathartes aura) to answer the first question and the faces of both vultures to answer the second question, resulting in reads of 100 bp, with the numbers of reads given in Table 6.3.

6.3 Results

The datasets were evaluated using Chainmapper. The community profile of each dataset is presented and discussed.

6.3.1 Profiling organisms in the urine specimens

The species and strain identifications of the urine specimens are shown in Figure A.1 to Figure A.24 and summarized in Figure 6.3.

Figure 6.3. The community profiles of the 24 urine samples.
Taken together, the results from the species identification of the urine specimens agree with those from the strain identification. Many reads match strains from draft genomes, indicating that including the database of prokaryotic draft genomes in Chainmapper is very helpful for metagenomic profiling. The patients were all initially suspected of having UTI, yet Chainmapper found that not all of the urine samples contained pathogens. The urine samples can be grouped into four categories: 1) those containing no pathogen, 2) those dominated by enterococci, 3) those dominated by E. coli, and 4) those dominated by other pathogens.

6.3.1.1 Urine samples containing no pathogen

As seen in Table 6.4, urine samples #1, #4, #7, #8, and #16 were not dominated by any pathogen. The dominant organisms in #4 and #8 were the fastidious bacteria Lactobacillus iners and Lactobacillus sp. 7_1_47FAA. L. iners can be present in women with a healthy vagina, women with bacterial vaginosis, or women who have just undergone antibiotic therapy [Macklaim et al., 2011], because L. iners has a persistence mechanism that works regardless of the presence of pathogens [McMillan et al., 2013]. This suggests that these patients might not have UTI. Based on the species identification, there is no dominant species in urine sample #16. The strain identification showed that the most abundant organism in the sample (depth = 30X) is candidate division TM7 single-cell isolate TM7b, although it could be a false positive due to its low genome coverage. This strain has been found in various environmental samples, ranging from forest soil and activated sludge to the human mouth. Its role was still under investigation, and at the time of writing its phylum name was still under proposal [Hugenholtz et al., 2001]. However, a large number of reads in this sample (41.9%) match neither the NCBI microbial database nor the human genome.
To further identify these unknown reads, Chainmapper was resumed to align them against the NCBI nucleotide database, which increased the percentage of total reads mapped to the human genome from 46.31% to 83.6%. This happens because one specific human genome build, build-37.1, is used as the reference in the contaminant removal stage, whereas the nucleotide reference database contains human genome sequences from various sources whose base sequences differ from build-37.1. In addition, after mapping to the nucleotide database, the proportion of unknown reads dropped markedly to 8.7%, because many unknown reads were assigned to human genomic sequences from the nucleotide database. A further 6.75% of the total reads match Pan troglodytes, Mus musculus, Macaca mulatta, Pongo abelii, Danio rerio, Macaca fascicularis, Sus scrofa, and Gorilla gorilla, suggesting that these reads match the subset of human genome sequences shared with those vertebrates. There can be many shared sequences between the human genome and other organisms, due to unclosed gaps in the current human genome, single nucleotide polymorphisms between human and those organisms, or sequencing errors in the assembly of the human genome. Furthermore, the species identification shows that a minor but significant number of reads match Plasmodium vivax, Epichloe festucae, and many other unrelated eukaryotes. Given their low genome coverages and depths, these assignments are most likely false positives, as eukaryotic genomes are longer than prokaryotic genomes. These findings suggest that this patient does not have UTI. Gardnerella vaginalis, which typically indicates bacterial vaginosis, dominates urine sample #7 (6.6%) according to the species identification. However, strain identification showed that the dominant organism was G. vaginalis 409-05, which is commensal [Santiago et al., 2011].
This shows the importance of strain identification for determining whether the dominant strain is pathogenic or not.

6.3.1.2 Urine samples dominated by enterococci

Urine samples #3, #26, #31, #33, and #34 were dominated by E. faecalis PC1.1 and E. faecalis 62 (see Table 6.5). E. faecalis PC1.1 is a candidate probiotic isolated from human feces and was not known to confer any virulence at the time of writing [O Cuív et al., 2013]. E. faecalis 62 lacks elements involved in virulence [Brede et al., 2011]. It is possible that the virulent strain of this species, the vancomycin-resistant E. faecalis V583, coexists in the sample. Although its coverage and depth are lower than those of the other two strains, information about the presence of this virulent strain could be more crucial than information about the dominating E. faecalis 62 or E. faecalis PC1.1. Finding virulence factors is thus important for determining whether E. faecalis V583 is really present in these urine samples. E. faecium dominates urine sample #32, and further strain identification narrows down the list of possible strains to the vancomycin-resistant E. faecium Aus0004 [Lam et al., 2012] and E. faecium DO. Virulence finding is thus required to confirm the presence of the disease-causing strain.

6.3.1.3 Urine samples dominated by E. coli

The dominant organism in urine samples #6, #10, #12, #20, #21, #25, #27, and #29 was E. coli, as seen in Table 6.6. In particular, the highly uropathogenic E. coli 536 was found in urine samples #6, #20, and #21. Urine sample #29 was also dominated by this strain, yet the number of supporting reads and the average genome depth were almost negligible, possibly because the patient was recovering. Uropathogenic E. coli is responsible for 70-90% of the estimated 150 million UTIs diagnosed annually [Brzuszkiewicz et al., 2006]. The strain identification also found a large number of E. coli/Shigella plasmids with nearly 100% coverage in urine sample #21, from E.
coli O7:K1 strain CE10 to E. coli SE11. Shigella sonnei Ss046 plasmids were also found, with slightly lower coverage, in urine sample #21. This supports the conjecture that E. coli dominates these samples. E. coli ATCC-8739 was also found in abundance in urine samples #25 and #10. This pathogenic strain is typically found in fecal samples. E. coli ATCC-8739 in urine sample #10 might co-occur with the pathogenic Citrobacter, which usually invades the gastrointestinal tract. Urine sample #12 mostly contained E. coli UM146, which had been described as uropathogenic only two years earlier [Reeves et al., 2011]. This sample was also suspected to contain E. coli 536, supporting the suspicion of UTI, while Bifidobacterium bifidum NCIMB 41171, which may originate from a probiotic taken by the patient, was also found. Urine sample #27 contained another exceptional E. coli strain, M605 (45.37%, 1X), which was still a draft genome in the NCBI database at the time of writing, so there is no further information about the pathogenicity of this strain. However, there is a small chance that this sample also contained E. coli 536.

6.3.1.4 Urine samples dominated by other pathogens

Table 6.7 shows the dominant organisms in urine samples dominated by other pathogens. Prevotella timonensis, which is associated with bacterial vaginosis [Srinivasan et al., 2012], predominates in urine sample #13. Proteus mirabilis HI4320 in urine sample #19 typically causes UTI in immunocompromised patients or patients with a catheter [Nielubowicz and Mobley, 2010], as opposed to E. coli, which typically causes UTI in healthy individuals. Meanwhile, urine sample #24 was dominated by Proteus mirabilis strain HI4320 as well as the uncultured Aerococcus urinae strain ACS-120-V-Col10a. In conventional biochemical identification, A. urinae can easily be misidentified as staphylococci because the bacterium is also coccus-shaped and turns violet on Gram staining [Cattoir et al., 2010] [de Jong et al., 2010].
This is one of the potential advantages of exploring the microbial composition through the shotgun metagenomic approach. A. urinae infections are increasingly reported, although they are not dominant, and this bacterium is resistant to several antibiotics such as sulfamethoxazole and gentamicin [Rasmussen, 2013]. Urine sample #28 was dominated by Staphylococcus lugdunensis, another rare cause of UTI. This species is usually mistaken for Staphylococcus aureus in cultivation, showing that direct sequencing promises better typing accuracy [Frank et al., 2008] [Haile et al., 2002]. Urine sample #35 contains another rare UTI pathogen, Stenotrophomonas maltophilia K279a, co-occurring with E. faecalis. S. maltophilia is known to be multidrug resistant [Nicodemo and Paez, 2007], so it is important to find the antibiotics to which the pathogen in this patient is susceptible.

6.3.2 Profiling organisms in the sewer sample

Figure A.25 presents the species, strain, and plasmid identification for the sewer sample. Again, the species identification results agree with the strain identification results. However, 42.4% of the reads did not match any known genome database, even after they had been aligned to the nucleotide database. This suggests that there is a large amount of novel DNA in the sewer sample. The microbial life in the sewer sample consisted mostly of typical fecal bacteria and bacteria causing nosocomial infections. Setting aside the unknown reads, the largest proportion of the reads (6.19%) mapped to novel MetaHIT genes. The next most abundant organisms were the pathogenic soil bacteria Acinetobacter lwoffii, A. johnsonii, and A. baumannii. A. lwoffii is part of the normal flora found on the skin, oropharynx, and perineum of healthy people [Rathinavelu et al., 2003], but when it enters the human body, e.g. via catheters, it can cause nosocomial bacteremia. Additionally, A. lwoffii in this sewer sample matched the multidrug-resistant strain. A.
johnsonii is naturally found in water, soil, human skin, and feces, yet it is common to see A. johnsonii in hospital sewage [Zong and Zhang, 2013]. Meanwhile, A. baumannii is one of the most troublesome pathogens in healthcare institutions, especially in intensive care units, as it is resistant to all old-school antibiotics and so far has no known natural habitat outside the hospital [Peleg et al., 2008]. A. baumannii plasmids were found at high coverage and depth, explaining why this species has a higher percentage in the species identification but low coverage and depth in the strain identification. The next most abundant organism in the sewer is Arcobacter butzleri, which causes watery diarrhea and bacteremia [Bücker et al., 2009]. The clear presence of Bacteroides vulgatus, a typical intestinal bacterium in healthy individuals, supported the speculation that the sewer contained a large amount of feces [Cuív et al., 2011]. Aeromonas caviae, which was also plentiful in the sample, is associated with both intestinal and extra-intestinal infections and usually causes diarrhea in children [Wilcox et al., 1992] without blood and mucus [Beatson et al., 2011], as well as bacteremia [Kimura et al., 2013].

6.3.3 Profiling organisms in the ancient toilet

Figure A.26 to Figure A.30 show the species and strain identification for the 300-year-old fecal samples. In principle, the species identification results for the feces agree with the strain identification results. However, for the control soil samples, the results from strain identification differ from those from species identification, due to the high diversity of the microbial community, whose members have low genome coverages.

6.3.3.1 Ancient feces

As confirmed by the strain identification, the dominant population in the feces was Collinsella aerofaciens, followed by Bifidobacterium spp. A decrease of C. aerofaciens might be associated with a weight-loss diet, i.e.
reduced carbohydrate diet [Walker et al., 2011], irritable bowel syndrome [Salonen et al., 2010], and reduced risk of colon cancer [Moore and Moore, 1995]. C. aerofaciens is a commensal anaerobic gut bacterium that produces large amounts of lactic acid. C. aerofaciens ferments oligosaccharides and simple sugars [Rey et al., 2013] and produces H2 gas [Moore and Moore, 1995]. Next, three Bifidobacterium species are consistently abundant both on the surface of and inside the feces: B. angulatum, B. adolescentis, and B. catenulatum. Bifidobacteria benefit humans by lowering the blood cholesterol level, acting as immunomodulators, producing vitamins, reducing the blood ammonia level, and producing acetate and lactate that inhibit the growth of pathogens [Gibson and Roberfroid, 1995]. Based on the traces in the latrines, lower social class people living in Copenhagen in the 1700s ate seasonal foods, such as raspberries, blackberries, and apples, as well as cherries, figs, flaxseeds, rye, and a whole range of plants. From this archaeological information and the microbial profile of the feces, researchers may proceed with a study associating the foods that were eaten with the resulting bacterial composition of the gut.

6.3.3.2 Control soils

The most abundant microbes in both control soil samples were Mycobacterium phlei and Mycobacterium tusciae JS617. M. phlei is a fast-growing, saprophytic nontuberculous mycobacterium typically found in soil and dust and on plants [Abdallah et al., 2012]. Meanwhile, M. tusciae JS617 is a slow-growing scotochromogenic mycobacterium [Tortoli et al., 1999], which was first isolated from creosote-contaminated soil in Germany and was still a draft genome at the time of writing. Both species cause disease in immunosuppressed people. The next most abundant strain was Micromonospora lupini strain Lupac_08, which was first isolated from root nodules of the wild legume Lupinus angustifolius [Alonso-Vega et al., 2012]. M.
lupini plays an important role in soil ecology, biodegradation, biocontrol, and plant growth promotion [Hirsch and Valdés, 2010]. All three species are soil-related.

6.3.4 Profiling organisms in the airplane toilet

The species and strain identification for the airplane toilets are shown in Figure A.31 to Figure A.35. The results from the species identification again agree with those from the strain identification. The proportion of novel MetaHIT genes is 25.75% in the Bangkok airplane toilet, 33.34% in the Aalborg airplane toilet, and 28.73%-29.98% in the three North American airplane toilets. The number of unknown reads was generally low: 5.07% in Aalborg, 7.64% in Bangkok, 2.3%-2.4% in Newark and Washington D.C., and 1.5% in the Toronto airplane toilet. In general, all samples were dominated by Eubacterium rectale and Bacteroides vulgatus. E. rectale is an anaerobic fecal bacterium responsible for the production of butyrate, which protects the colon from many diseases [Duncan and Flint, 2008]. This organism is typically abundant in the colon of people without ulcerative colitis [Duncan and Flint, 2008] [Macfarlane et al., 2004], suggesting that the chance of passengers having ulcerative colitis is low. B. vulgatus is a commensal bacterium [Tilg and Gasbarrini, 2013] that could either promote or protect against colitis [Cuív et al., 2011]. Besides these two bacteria, Faecalibacterium prausnitzii, which might have potential for treating ulcerative colitis and Crohn’s disease [Siaw and Hart, 2013], was plentiful in the US and Danish samples but less so in the Thailand and Toronto samples. An interesting finding in the Aalborg toilet sample, not seen in the other samples, was the presence of Ruminococcus gnavus. Species identification suggests that R. gnavus was only a minority, but strain identification suggests that it was the second most covered genome (81%) and the second most abundant strain (5X) after E. rectale.
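The coverage and depth values quoted for strain identification (for example 81% and 5X for R. gnavus) can be illustrated with a small calculation. The sketch below assumes that coverage means the share of reference positions overlapped by at least one read and that depth means the mean number of reads per reference position; the text here does not spell out the exact definitions used by Chainmapper, and the function name and the toy depth vector are hypothetical.

```python
from typing import Sequence, Tuple


def coverage_and_depth(per_base_depth: Sequence[int]) -> Tuple[float, float]:
    """Return (coverage in percent, mean depth) for one reference genome.

    per_base_depth[i] is the number of reads overlapping reference position i.
    Assumed definitions (not taken from the Chainmapper source):
      coverage = share of positions with depth >= 1
      depth    = mean depth over all reference positions
    """
    if not per_base_depth:
        raise ValueError("reference must contain at least one position")
    n = len(per_base_depth)
    covered = sum(1 for d in per_base_depth if d > 0)
    return 100.0 * covered / n, sum(per_base_depth) / n


# Hypothetical 10 bp reference: 8 positions covered, 50 read bases in total.
toy = [5, 6, 7, 8, 0, 0, 6, 6, 6, 6]
cov, depth = coverage_and_depth(toy)
print(f"coverage = {cov:.0f}%, depth = {depth:.0f}X")
```

Under these assumptions a genome can have high coverage at modest depth, which is the situation described for R. gnavus: most of the reference is seen, but only by a few reads per position.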
The slight difference between species and strain identification arose perhaps because many R. gnavus regions are shared with other organisms or because the amount of feces was too low. In addition, there were two recent cases of R. gnavus bacteremia, one in Odense and one in Vejle [Hansen et al., 2013]. Further research on R. gnavus could help to reveal its pathogenicity.

Figure 6.4. The B/F ratio of the airplane toilet dataset. Green and purple are the percentages of Bacteroidetes and Firmicutes, respectively.

According to the composition of the phyla in each country depicted in Figure 6.4, the Bacteroidetes/Firmicutes (B/F) ratio in the Bangkok sample was the highest of all, followed by the North American samples. In contrast, the Aalborg sample has the lowest ratio, with Firmicutes outnumbering Bacteroidetes. One speculation, though purely conjecture, is that Bangkok had the lowest risk of obesity [Turnbaugh et al., 2006] [Ley et al., 2005] [Ley et al., 2006], which of course requires further confirmation.

6.3.5 Profiling organisms in the vulture samples

Vultures could be a reservoir and vector of many diseases, or scavengers that might possess curative substances. The species, strain, and plasmid identification as well as the summary of organisms at kingdom level are shown in Figure A.36 to Figure A.40. In general, the results from species identification again agree with those from strain identification. With the help of the vulture reference sequences from the nucleotide database, the majority of the reads were mapped to Cathartes aura. Many reads were mapped to other vulture genomes and to chicken genomes, which have high chromosome homology with vulture genomes [Nanda et al., 2006], suggesting the importance of sequencing the complete genome of C. aura before identifying the vulture samples.
The Aves percentages, which should be considered contaminants, were 45.4%, 84.53%, 84.67%, and 67.17% for samples GRG4217_FS, GRG4217_LI, GRG4227_FS, and GRG4227_LI, respectively, showing that many sequencing reads are putatively mapped to vultures. The fractions of unmapped reads were 31.76%, 11.24%, 8.6%, and 8.96% for samples GRG4217_FS, GRG4217_LI, GRG4227_FS, and GRG4227_LI, respectively.

6.3.5.1 Face swab

The most abundant bacteria found during species and strain identification of the face swabs were Psychrobacter cryohalolentis (GRG4217_FS = 0.59%, GRG4227_FS = 0.51%) and Psychrobacter arcticus (GRG4217_FS = 0.1%, GRG4227_FS = 0.11%). Both Psychrobacter species are aerobic bacteria that can grow at -10 to 30 °C (-10 to 28 °C for P. arcticus), with an optimal growth temperature of 22 °C [Bakermans et al., 2006]. P. arcticus was first isolated from permafrost sediment cores in Siberia [Ayala-del Río et al., 2010]. It was also recently described that P. arcticus strain 273-4, which matched these samples, can develop biofilm under laboratory conditions and has a large adhesin for attachment to surfaces [Hinsa-Leasure et al., 2013], which may explain why permafrost bacteria can be found on the turkey vulture’s face. P. cryohalolentis K5 was first isolated from a cryopeg within permafrost in Siberia [Bakermans et al., 2006]. The plasmid of P. cryohalolentis K5 was also found in abundance with relatively high coverage, supporting the speculation that this organism dominates the vultures’ faces. The abundant Pseudomonas fluorescens in the species identification of sample GRG4227_FS could be a false positive, as the abundance of this organism was not confirmed in the strain identification. The plasmid of Acinetobacter baumannii AYE was additionally found in abundance in sample GRG4227_FS.

6.3.5.2 Gut intestine

In sample GRG4217_LI, the only bacterium found in abundance was Herbaspirillum seropedicae.
The most abundant microbe was Escherichia phage rv5, with 92.74% coverage and 38X depth, which did not stand out in the species identification because of its small sequence and its homology with other viruses. In sample GRG4227_LI, Clostridium perfringens was the most dominant microbe in the gut, along with its plasmid. Lactobacillus sakei, which is normally found as a psychrotrophic lactic acid bacterium in fresh meat and is used for biopreservation and food safety of fermented meat, was also found in abundance with high coverage. So was Hafnia alvei, which is commensal but sometimes causes disease in immunocompromised people. Even E. coli was found with only moderate coverage and abundance. Another copious bacterium was the pathogenic Enterococcus hirae, which causes septicemia in humans.

6.3.6 Virulence factors

Finding the virulence factors not only confirms the strain identification and the pathogenicity of a sample, but also confirms whether the strains in the sample are the virulent ones, especially when many strains of the same species were found with high confidence. Chainmapper virulence finding might help to alert doctors to the presence of certain toxins or other virulence factors produced by the bacteria. However, if no virulence gene is detected in a sample, it does not necessarily mean that the sample is free of toxins: the list of virulence factors is still incomplete, as not all virulence factors have been comprehensively studied or even annotated.

6.3.6.1 Virulence factors in urine samples containing no pathogens

No virulence factors were detected in urine #1, #4, #7, #8, and #16. This supports the earlier finding that these samples did not contain any pathogens.

6.3.6.2 Virulence factors in urine samples dominated by enterococci

Table A.1 shows the virulence factors in urine samples dominated by enterococci. Most of the known virulence factors of E. faecalis were found in abundance in urine #3, #31, and #34.
Hyaluronidases (EF3023 and EF0818), genes producing enzymes that aid the dissemination of toxins and bacteria from cell to cell, were the most plentiful spreading factors found, followed by the sprE serine protease, which has a similar virulence function. The ace genes, which encode adherence proteins, were also found. The finding of efaA, an adhesin in endocarditis and a manganese transporter, could alert the doctor to the risk of endocarditis and the possibility of manganese deficiency in the patient. All known genes for biofilm formation (bopD and fsrABC) were found, alerting the doctor to consider anti-biofilm therapy, as biofilms are resistant to antibiotics. The patient with urine #26 had a similar condition to the patient with urine #3, except that genes expressing bacterial capsules (cps) were also found. Special attention was needed for the patient with urine #33: not only gelatinase (gelE), which degrades hemoglobin, collagen, and fibrin, was found, but also cytolysin genes (cyl), which lyse erythrocytes, neutrophils, leukocytes, and macrophages. This suggests that doctors should keep an eye on the patient’s blood tests in case a blood transfusion or hyperbaric oxygen is needed. The absence of E. faecium virulence factors in urine #32 did not mean that no virulence was present; rather, research on E. faecium virulence was still at a very early stage. In addition, looking at the source organisms of the virulence factors in urines with E. faecalis, it was clear that E. faecalis V583 was present in urine #3, #26, #31, #33, and #34, resolving the uncertainty raised in subchapter 6.3.1.2. This additional virulence finding proved essential for confirming the occurrence or co-occurrence of a virulent strain with low coverage when an avirulent strain of the same species was dominant.

6.3.6.3 Virulence factors in urine samples dominated by E. coli

Table A.2 shows the virulence factors in urine samples dominated by E. coli. All virulence factors in urine samples #6, #20, and #21 belong to UPEC.
The virulence genes in urine sample #6 were dominated by pap genes, which encode P fimbriae, with PapG as the adhesin protein. P fimbriae are often associated with pyelonephritis, meaning that the urinary tract infection might ascend to the kidney. This alerts the doctor to check whether pyelonephritis has really occurred, as nitrofurantoin is not suitable for a patient with pyelonephritis, and the patient could develop kidney mucosal inflammation, sepsis, bacteremia, or even meningitis. Once pyelonephritis is confirmed, the bacteria must be killed rapidly, with the maximal suggested doses, before the condition worsens. The second most abundant virulence factors in this patient were iron acquisition genes: shuV, ireA, iucABD, iutA, chu, sit, irp1, irp2, ybtAEPQSTX, and fyuA. This informed the doctor to prescribe an iron supplement if necessary. Also, all fim and ecp genes, related to the formation of type 1 fimbriae and the common pilus, respectively, were found. The tia invasion determinant is another adhesin that might have been transferred horizontally from ETEC, adding to the adherence power and aiding invasion of the upper urinary tract epithelium of this patient. The sat genes, which trigger kidney epithelial cell autophagy, were also found in abundance. Meanwhile, urine #21 contained many iron acquisition virulence genes and genes related to the production of type 1 fimbriae. The gsp, ecp, ipaH, and pap genes were also found in urine #21. Urine sample #20 had many UPEC virulence factors, dominated by sfa and cnf1 genes. The sfa genes encode S fimbriae, which mediate adherence to bladder, kidney, erythrocytes, and endothelial cells. This adherence can cause pyelonephritis, sepsis, and meningitis. Another finding is that this urine sample contains many cytotoxic necrotizing factor-1 (cnf1) genes, which trigger necrosis of the epithelial cells and decrease bacterial phagocytosis. The agn43 genes, which are autotransporters that use the Type V secretion pathway, were also clearly found in the sample, suggesting bacterial autoaggregation, i.e.
reciprocal bacterial adherence. This type of adherence has been associated with biofilm formation and long-term bacterial colonization of the bladder. Genes producing the hemolysin hlyB, which lyses red blood cells, were also found, alerting doctors to keep an eye on the blood test. Besides those, urine #20 also has many iron and hemin acquisition genes, type 1 fimbriae, tia, gsp, and ecp genes. Urine sample #12 contained many ibe and sfa genes, although they did not dominate. The sfa genes contribute to the production of S fimbriae, which enable binding to the brain microvascular endothelial cells. Nonetheless, without ibe genes, invasion will not happen. The ibe genes mediate invasion of the brain microvascular endothelial cells, enabling traversal of the blood-brain barrier. Thus the patient was at risk of meningitis. Other than those, iron and heme uptake genes as well as genes encoding type 1 fimbriae were found in abundance. In urine samples #10 and #25 the enteroinvasive virulence factors dominated, confirming the non-UPEC pathogens found in the strain identification. The Shigella virulence factors in urine sample #10 might originate from the commensal/environmental E. coli ATCC-8739, from Enterobacteriaceae mobile elements, or from the abundant Citrobacter, whose virulence has not yet been well defined, but not from Shigella itself according to the species and strain identification. There are plenty of genes with unknown functions that are putatively related to virulence: 1) ipaH, one of Shigella’s invasion plasmid antigens, and 2) gsp genes, which are putatively related to Shigella’s Type II secretion system. The ecp genes were exceptionally dominant in urine sample #25, indicating the common pilus, which is used as a motility virulence factor. Also, iron uptake virulence genes are plentiful in both urine samples: the enterobactin genes fep and ent, the ABC transporter sit, and the yersiniabactin genes fyuA/psn and ybt. Urine samples #27 and #29 contain few virulence genes, although not a negligible number.
Urine sample #27 contains tia, ipaH, ecpE, gsp, fim, ibe, and chu, while urine sample #29 contains chu and shu (heme uptake), pap, sat, fim, and hly (hemolysin).

6.3.6.4 Virulence factors in urine samples dominated by other pathogens

Table A.3 shows the virulence factors in the urine samples dominated by other pathogens. The virulence database used for this study did not contain virulence factors of Prevotella timonensis and Proteus mirabilis, thus no virulence was found in urine samples #13 and #19. Meanwhile, urine sample #28 was dominated by Staphylococcus lugdunensis. Staphylococcus was included in the virulence database, but S. lugdunensis was not. The high number of capE virulence genes, whose expression could protect staphylococci with capsules, was perhaps due to genes shared between staphylococcal species. Urine sample #24 was dominated by P. mirabilis and Aerococcus urinae, but the virulence was dominated by UPEC’s iron uptake genes, genes encoding S fimbriae, F1C fimbriae, and type 1 fimbriae, cnf1, and tia, because the virulence factors of P. mirabilis and A. urinae were not in the virulence database used in this study and UPEC was the next most abundant organism. The same happened in urine sample #35: Stenotrophomonas maltophilia was not in the virulence database, so the virulence factors of the next most abundant organisms, E. faecalis and E. coli, were found in abundance in the sample.

6.3.6.5 Virulence factors in the sewer sample

No virulence factors were detected in the sewer sample. This does not necessarily mean that there were no virulence factors in the sample. As seen in the community profile, the sewer contains organisms whose genomes and genes have only recently been explored in depth.

6.3.6.6 Virulence factors in the ancient toilet

No virulence factors were found in the old fecal samples. Again, this does not mean that the feces contained no virulence factors.
However, Chainmapper found a high number of copies of the mycobacterial virulence factor aceA, a persistence factor that sustains intracellular infection of mycobacteria in inflammatory macrophages.

6.3.6.7 Virulence factors in the airplane toilet

Table 6.8, Table 6.9, and Table 6.10 show the virulence genes in the toilets of the airplanes departing from Aalborg, Bangkok, and Toronto, respectively. There were several low-depth Pseudomonas virulence factors in the Aalborg airplane fecal sample. The Bangkok and Toronto airplane fecal samples contained low-depth virulence factors related to type 1 fimbriae and adherence, respectively. No virulence factors were found in the other airplane toilets. These data do not mean that those were the only virulence factors present in the samples, as the study of virulence factors is still ongoing.

6.3.6.8 Virulence factors in the vulture samples

Table A.4, Table A.5, and Table A.6 show the virulence genes found in the face swabs and gut intestines of the vultures. Very few virulence genes were found in samples GRG4217_LI and GRG4227_FS, and none were found in sample GRG4217_FS. The ipaH gene in sample GRG4217_LI is one of Shigella’s invasion plasmid antigens, indicating the presence of Shigella or enteroinvasive E. coli. The algU gene found in sample GRG4227_FS is associated with antiphagocytosis virulence factors. Sample GRG4227_LI, however, is dominated by C. perfringens virulence factors. These are, ordered by abundance:

1. GroEL and fbp (fibronectin-binding protein) adherence factors.
2. plc, which produces alpha-toxin. Alpha-toxin has lethal, hemolytic, and dermonecrotic activities and helps gas gangrene to develop.
3. pfoA, which produces theta-toxin. Theta-toxin damages the host cell membranes by forming pores.
4. nag, which produces mu-toxin. Mu-toxin is a hyaluronidase, a spreading factor that helps C. perfringens to spread into deeper tissue.
5. colA, which produces kappa-toxin. Kappa-toxin actively degrades the host tissues, aiding the growth, survival, and spread of C. perfringens, as well as helping the diffusion of other toxins.
6. nanHIJ sialidase. Sialidase cleaves carbohydrate polymers and scavenges them as nutrients for C. perfringens. Sialidase also increases the attachment of bacteria and the binding of toxins to host cells.
7. hly hemolysin genes, which lyse the host’s red blood cells.

These are the E. coli or Shigella virulence factors found in the sample:

1. entD enterobactin factors and other iron chelation genes, which steal the host’s iron. The entD gene is the most abundant recognized virulence gene in this sample.
2. gsp, the Type II Secretion System, which translocates toxins to reach host cells.
3. fim genes, which promote the development of type 1 fimbriae.

6.3.7 Antibiotic Resistance

After identifying the exact strains causing the patient’s illness, perhaps with the additional help of virulence finding, the final step is to prescribe the right, targeted antibiotics. Finding the antibiotic resistance genes is important for choosing targeted, effective antibiotics for the patient. A diagnostician who prescribes antibiotics to which the infecting bacteria are resistant would harm the patient. Administering broad-spectrum antibiotics, which might also kill bacteria not responsible for the disease, should be avoided to limit the spread of antibiotic resistance. After narrowing down the list of possible antibiotics according to other conditions of the patient, e.g. allergy to penicillin, the doctor should pick the antibiotic to which the bacteria are susceptible and that carries minimum risk.

6.3.7.1 Resistance in urine samples dominated by enterococci

Since tetracycline, streptomycin, and kanamycin are not in the list of possible antibiotic therapies for UTI, the only information helpful for the patients was that all urines with E. faecalis infection, i.e. urine samples #3, #26, #31, #33, and #34, and with E. faecium infection, i.e. urine #32, had lsa(A) genes, which confer resistance to lincosamides and streptogramin A.
Streptogramin is typically used to treat enterococcal infections that are resistant to vancomycin. However, the administration of vancomycin itself should be avoided, since it is usually the last resort for enterococcal infection. From this partial information, Chainmapper at least helps healthcare personnel to remove streptogramin from the antimicrobial susceptibility test for those patients.

6.3.7.2 Resistance in urine samples dominated by E. coli

Putting aside tetracycline, streptomycin, and gentamicin, doctors can run the susceptibility tests on urine samples #12, #20, #27, and #29 without any suggestion from Chainmapper. However, the UPEC patient #21 was predicted to be resistant to beta lactams, due to the high copy number of blaTEM-1 genes. Patients with urine samples #6 and #25 were found to be resistant to beta lactams, due to the abundance of the blaTEM-1 gene, as well as to trimethoprim, inferred from the abundance of dfrA7 in urine #6 and dfrA14 in urine #25. With the help of Chainmapper, which finds potential resistance genes in only a few minutes, the antibiotics to which resistance is predicted can be safely removed from the susceptibility test. In urine #10, where E. coli and Citrobacter freundii were both found in abundance, resistance to beta lactams was possible, with gene coverage as low as 96.68%. Interestingly, the resistance genes found usually belong to C. freundii, supporting the hypothesis that C. freundii also contributes to the illness.

6.3.7.3 Resistance in urine samples dominated by other pathogens

Chainmapper did not suggest any resistance genes in urine sample #13 (Prevotella timonensis) and urine sample #19 (Proteus mirabilis). Since lincosamides, streptomycin, and kanamycin are not part of the treatment for UTI, the resistance prediction by Chainmapper for the patient with urine sample #32 was not necessarily informative. The same holds for the patient with urine sample #24, since chloramphenicol is not suggested for UTI patients.
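The reasoning in 6.3.7.1 and 6.3.7.2 amounts to a lookup: detected resistance genes are translated into antibiotic classes, classes that are not used for UTI are ignored, and the remaining classes can be left out of the susceptibility test. The sketch below illustrates that lookup using only gene-to-class pairs and sample findings stated in the text. The dictionary names and the function are hypothetical, and this is not Chainmapper's own code.

```python
# Gene-to-class pairs as stated in the surrounding text.
GENE_TO_CLASSES = {
    "lsa(A)": ["lincosamides", "streptogramin A"],
    "blaTEM-1": ["beta lactams"],
    "dfrA7": ["trimethoprim"],
    "dfrA14": ["trimethoprim"],
}

# Classes the text describes as not relevant for UTI therapy.
NOT_USED_FOR_UTI = {"tetracycline", "streptomycin", "kanamycin",
                    "gentamicin", "lincosamides", "chloramphenicol"}


def classes_to_drop(detected_genes):
    """Antibiotic classes that can be left out of the susceptibility test."""
    dropped = []
    for gene in detected_genes:
        for cls in GENE_TO_CLASSES.get(gene, []):  # unknown genes are skipped
            if cls not in NOT_USED_FOR_UTI and cls not in dropped:
                dropped.append(cls)
    return dropped


# Findings for four of the urine samples discussed above.
findings = {
    "urine #6": ["blaTEM-1", "dfrA7"],
    "urine #21": ["blaTEM-1"],
    "urine #25": ["blaTEM-1", "dfrA14"],
    "urine #32": ["lsa(A)"],
}
for sample, genes in findings.items():
    print(sample, "->", classes_to_drop(genes))
```

A gene that is missing from the lookup table contributes nothing, which mirrors the caveat made throughout this chapter: an empty result reflects the limits of the reference database, not proven susceptibility.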
Patient #28 was highly resistant to fusidic acid, but since fusidic acid is not a treatment for UTI, the concern was the next most abundant resistance gene, blaZ, which confers beta lactam resistance. Urine sample #35, which was dominated by E. faecalis and S. maltophilia, was found to be resistant to quinolones. The resistance genes found in the urine samples are listed in Table A.7.

6.3.7.4 Resistance in the hospital sewer

The resistance genes found in the sewer sample are listed in Table A.8. It is normal to find many resistance genes in hospital sewage [Guardabassi et al., 1998]. Aminoglycoside resistance genes were found in abundance, followed by macrolide resistance carried by A. baumannii, the class D beta-lactamase blaOXA-10, and the tetracycline efflux gene tet(39).

6.3.7.5 Resistance in the ancient toilet

No antibiotic resistance was found in the feces. However, the ole(C) gene, which confers resistance to oleandomycin, was found in both control soil samples.

6.3.7.6 Resistance in the airplane toilet

Table A.9 to Table A.13 list the resistance genes found in airplanes departing from five different cities. Samples from the toilets of five different airplanes were sequenced. In general, tetracycline resistance dominated in all of the samples. Additionally, all samples except the Aalborg sample are quite rich in tet(Q). In the toilet of the Bangkok airplane, clindamycin and kanamycin resistance were the most abundant after tetracycline resistance. Beta lactam resistance was found with lower gene coverage and depth (97.72%, 5X). The Aalborg airplane toilet contains fewer resistance genes, probably because of its low number of reads. The Newark airplane toilet has more beta lactam resistance genes: cfxA6, cfxA3, cfxA4, cfxA, and cfxA5. Clindamycin, erythromycin, macrolide, aminoglycoside, chloramphenicol, and lincomycin resistance genes were present too.
Beta lactam resistance was also found in moderate amounts in the Washington and Toronto airplane fecal samples.

6.3.7.7 Resistance in vulture’s face and intestine samples

Table A.14 and Table A.15 list the resistance genes found in the face swabs and gut intestines of the vultures. In sample GRG4217_FS, the face swab contained various resistance genes, conferring resistance to streptothricin (sat2A), macrolide-streptogramin (msrE), macrolide (mphE), florfenicol (floR), chloramphenicol (cmx), and aminoglycoside (aac(3)-IVa). Meanwhile, in the second gut intestine sample, more antibiotic resistance genes were found, as can be seen in Table 6.14, especially for tetracycline. These genes confer resistance to tetracycline (tetA(P), tetB(P), tet(W), tet(O), tet(M), tet(L), tet(D), tet(40)), streptomycin (strB, strA), macrolides (mef(A)), lincomycin (lnuC), florfenicol (floR, fexA), erythromycin (ermT), trimethoprim (dfrA1), beta lactam antibiotics (blaDHA), and aminoglycoside (aph(3’)-III). Tetracycline, lincomycin, and aminoglycoside resistance genes were the most plentiful in this sample. However, the second face sample and the first gut sample do not contain any known antibiotic resistance genes.

6.4 Discussion

This study provides further proof of concept that Chainmapper can profile the microbial community of various samples, as shown by the results from five different datasets. Strain identification is important to confirm species identification. The source organisms of virulence factors and resistance genes support the species and strain identification, but they are not the primary means of identifying organisms. The virulence finding is mainly important to support the speculation that virulent strains are present, especially when, according to the strain identification, both virulent and avirulent strains are likely to be present in the sample.
The antibiotic resistance finding is also important for identifying antibiotics that are likely to be less effective for the patient. However, Chainmapper comes with some limitations. First, it requires substantial computational resources to achieve a short runtime. Second, the databases of microbial genomes, virulence factors, and antimicrobial resistance genes are still incomplete. The more complete and thorough the reference databases are, the more accurate the results are. Because the reference databases are incomplete or outdated, harmful species, virulence factors, or resistance genes may go undetected even though the sample does contain virulent species or bacteria resistant to antibiotics not mentioned by Chainmapper. Possible further studies that would improve Chainmapper are:

1. Providing more proof of concept that Chainmapper works, by testing it with more samples.
2. Updating the databases of virulence factors and antibiotic resistance genes, as well as the genomes of organisms, so that Chainmapper can provide better identification.
3. Testing the sensitivity of Chainmapper.

Concluding remarks

In brief, there is great promise in harnessing whole genome shotgun sequencing and Chainmapper in real time for routine diagnosis and outbreak prevention and control. Shotgun sequencing offers a faster turnaround time for sequencing clinical samples and an affordable price for direct metagenomic sequencing, while Chainmapper offers rapid analysis of the microbial composition from the sequencing data. With the race to commercialize sequencing technologies, the cost of sequencing will most likely drop steadily, enabling clinical microbiological laboratories to perform whole genome shotgun sequencing and to use rapid microbial identification tools like Chainmapper.
Once the automated metagenomic analysis pipeline is fully implemented, metagenomic shotgun sequencing will soon be the standard practice in many clinical settings for routine diagnostics and real-time outbreak surveillance. Based on the precise and rapid taxonomy identification of the microbial community in a patient sample using Chainmapper, as well as the additional information about virulence factors and antibiotic resistance, medical personnel can diagnose more rapidly and administer targeted treatment. Furthermore, if the disease has considerable potential to start an outbreak, its dissemination could be prevented earlier and closely monitored.

Chapter 7 Conclusion

In this thesis two bioinformatics tools for identifying microorganisms in clinical samples, Reads2Type and Chainmapper, are presented. Reads2Type is a web-based tool that can be used to rapidly identify isolates via a whole genome sequencing (WGS) approach. The advantage of this tool is that it is easy to implement and use. Its drawbacks are that growing bacteria takes time, that the process is very costly, and that some types of bacteria are very difficult to cultivate. Chainmapper is a command-line tool for profiling the microbial community directly from sequenced clinical samples. Direct sequencing can identify uncultured organisms without prior microbial isolation. Thus more organisms are detected and, at the same time, long incubation times are avoided, which constitutes an advantage over a culture-dependent approach. Besides identifying the species and strains contained in the metagenomic samples, Chainmapper also finds antibiotic resistance and virulence genes. Nevertheless, Chainmapper requires huge computational resources, which are not yet ubiquitously available. Therefore its use is not yet practical.
With the development of more powerful computers, one would expect that, in the future, Chainmapper will prevail over Reads2Type, despite the fact that the latter is currently more practical to use.

Figure 7.1. MinION, a nanopore-based USB stick-sized DNA-sequencing machine

In this regard, it is worth mentioning that, thanks to nanopore technology (Figure 7.1), it is already possible to sequence DNA with a device as small as a USB memory stick. In fact, Oxford Nanopore has initiated an early-access program for scientists to test its MinION sequencer [McDougall, 2013]. Therefore, in the near future, one would expect high-throughput, ultra-long read sequence data to be produced with very high accuracy and a very short runtime. Because time and accuracy are key factors when dealing with infections and outbreaks, one would hope that tools such as Reads2Type and Chainmapper will be adopted by the medical authorities to speed up routine diagnostics and prevent the spread of diseases. The results from my PhD study show how a combination of NGS technology, bioinformatics, and real-time epidemiology can be very beneficial for routine diagnostics and public health epidemiology.

APPENDIX

A.1. Next generation sequencing

Figure 2.1. The Sanger-sequencing workflow, adopted from [Men et al., 2008]

The era of sequencing technology started in the 1970s with the development of the Sanger DNA-sequencing method [Sanger and Coulson, 1975]. This method defined the so-called first-generation sequencing technology, which others then followed. In the 1980s and 1990s the method was further improved thanks to experimental techniques such as fluorescence labeling [Prober et al., 1987] and capillary electrophoresis [Swerdlow and Gesteland, 1990] [Swerdlow et al., 1991]. Figure 2.1 shows the workflow of Sanger sequencing [Men et al., 2008].
The read length of Sanger sequencing reaches up to 1000 base pairs (bp), with a per-base raw accuracy of 99.999%. Unfortunately, first-generation Sanger sequencing has a low throughput and a high cost [Shendure and Ji, 2008]. In fact the first human genome sequence was completed in 2003, at a cost of 3 billion USD and after more than 10 years of work [Venter et al., 2001]. The need for re-sequencing humans stimulated the development of next-generation sequencing (NGS) methods.

Figure 2.2. The workflow of pyrosequencing [Metzker, 2009]

Thus the era of NGS began with the introduction of various sequencing machines that produce short DNA reads [Metzker, 2009]. NGS technology was further classified into second generation sequencing (SGS) and third generation sequencing (TGS) technologies, which produce short reads and significantly longer reads, respectively [Schadt et al., 2010]. SGS technology is still commonly used for sequencing, as TGS technology is still in a development phase, together with the Sanger sequencing approach, which is adopted for validation purposes [Shendure and Ji, 2008].

Figure 2.3. The workflow of Illumina Genome Analyzer [Ansorge, 2009]

With regard to SGS, the first commercialized SGS machines were the 454 pyrosequencing machines. With these it is possible to achieve a higher sequencing throughput at a lower cost than with the Sanger sequencing method [Margulies et al., 2005]. In 2007, Illumina/Solexa came up with the Genome Analyzer, and Applied Biosystems introduced the SOLiD machines. These types of machines became the technology of choice for whole genome sequencing (WGS), genome resequencing, chromatin immunoprecipitation sequencing (ChIP-seq), ribonucleic acid sequencing (RNA-seq), and metagenomic sequencing.
Also, sequencing on these machines follows the same type of steps: library preparation by fragmenting the DNA, ligation of adaptor sequences, clonal amplification, and sequencing cycles based on enzyme-driven biochemistry and data imaging. The workflows of 454 pyrosequencing and Illumina sequencing are illustrated in Figure 2.2 and Figure 2.3, respectively. One of the 454 pyrosequencing machines, the GS FLX Titanium XL+, yields read lengths up to 1 kbp with a typical throughput of 700 Mb and has a runtime of approximately 23 hours (source: http://www.454.com/). However, the characteristic artifact of 454 is errors in homopolymer runs, i.e., insertions of the same base, leading to a high error rate [Huse et al., 2007]. Meanwhile, Illumina reads are shorter than those from 454 pyrosequencing, i.e., 100 bp with Genome Analyzer machines and 250 bp at best with MiSeq machines; the Illumina throughput, however, can reach 600 Gb for the HiSeq machines. The runtime of the MiSeq machines is 1.5 hours for preparation and 4 hours for sequencing, if one uses the Nextera Sample Preparation Kit (source: http://www.illumina.com/).

Figure 2.4. The workflow of Ion Torrent PGM sequencing [Herper, 2010]

TGS technology is mostly based on a single-molecule, real-time (SMRT) system without the need to halt between read steps [Schadt et al., 2010]. It produces longer reads, each of which represents a single DNA molecule. The first TGS machine was the Ion Torrent semiconductor sequencer, which was released in 2010 [Rothberg et al., 2011] [Merriman et al., 2012]. It was based on ion detection, instead of dye-labeled oligonucleotides and expensive optics (see Figure 2.4 for the workflow). There are two types of commercially available Ion Torrent machines: Ion PGM and Ion Proton. The throughput of Ion Proton is 60 Gb per run, with reads reaching up to 200 bp and a run completed within 24 hours (source: http://www.lifetechnologies.com/dk/en/home/brands/iontorrent.html). In 2013 another TGS machine, the PacBio RS II, was launched.
It was anticipated that it could produce 50,000 reads per run, with read lengths up to or above 20,000 bp [Roberts et al., 2013]. Regrettably, the machine is too large and expensive, and has a very high error rate. However, the errors can be eliminated with the help of suitable algorithms and SGS technology [Koren et al., 2012], as demonstrated by Korlach, who generated a de novo assembled genome with 99.999% base concordance with its reference genome [Chin et al., 2013]. Illumina, too, “goes long” by acquiring Moleculo, whose technology synthesizes long reads from discrete pools of short Illumina reads. Nevertheless, Moleculo sequencing still requires small DNA fragments as inputs, can introduce biases, and still has GC-rich artefacts [Eisenstein, 2013]. Recently, Oxford Nanopore started an early-access program for scientists to test its USB stick-sized MinION sequencer [McDougall, 2013].
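
To put the platform figures quoted in this appendix side by side, the following minimal sketch derives two rough quantities from them: the expected number of miscalled bases in a Sanger read, and a lower bound on the number of reads per run for the GS FLX Titanium XL+ and the Ion Proton. The helper names are hypothetical and introduced only for illustration. The sketch assumes that base-calling errors are independent and that every read has the maximum quoted length, so the results are back-of-the-envelope estimates, not vendor specifications.

```python
# Rough estimates derived only from numbers quoted in the text above.
# Both helper functions are hypothetical names used for illustration.

def expected_errors_per_read(read_length_bp: int, per_base_accuracy: float) -> float:
    """Expected miscalled bases in one read, assuming independent per-base errors."""
    return read_length_bp * (1.0 - per_base_accuracy)


def min_reads_per_run(throughput_bp: float, max_read_length_bp: int) -> float:
    """Lower bound on reads per run, assuming every read has the maximum length."""
    return throughput_bp / max_read_length_bp


# Sanger: reads up to 1000 bp at 99.999% per-base raw accuracy.
sanger_errors = expected_errors_per_read(1000, 0.99999)

# 454 GS FLX Titanium XL+: about 700 Mb per run, reads up to 1 kbp.
gs_flx_reads = min_reads_per_run(700e6, 1000)

# Ion Proton: 60 Gb per run, reads up to 200 bp.
ion_proton_reads = min_reads_per_run(60e9, 200)

print(f"Sanger expected errors per 1000 bp read: {sanger_errors:.4f}")
print(f"GS FLX Titanium XL+ reads per run (lower bound): {gs_flx_reads:,.0f}")
print(f"Ion Proton reads per run (lower bound): {ion_proton_reads:,.0f}")
```

Under these assumptions, a Sanger read carries about 0.01 expected errors, while the two quoted throughputs correspond to at least 700,000 and 300 million reads per run, respectively. Real runs yield more reads than these bounds, because most reads are shorter than the maximum length.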