The Genomics of Emerging Infectious Disease
www.plos.org

A collection of essays, perspectives, and reviews from six PLoS Journals about how genomics can revolutionize our understanding of emerging infectious disease. Produced with support from Google.org. The PLoS Journal editors have sole responsibility for the content of this collection.

Image credits: Brindley et al., PLoS Neglected Tropical Diseases 3(10) e538. McHardy et al., PLoS Pathogens 5(10) e1000566. Salama et al., PLoS Pathogens 5(10) e1000544.

Editorial

Genomics of Emerging Infectious Disease: A PLoS Collection

Jonathan A. Eisen1*, Catriona J. MacCallum2*

1 University of California Davis, Davis, California, United States of America, 2 Public Library of Science, Cambridge, United Kingdom

Today, the Public Library of Science publishes a collection of essays, perspectives, and reviews about how genomics, with all its associated tools and techniques, can provide insights into our understanding of emerging infectious disease (http://ploscollections.org/emerginginfectiousdisease/) [1–13]. This collection, focused on human disease, is particularly timely as pandemic H1N1 2009 influenza (commonly referred to as swine flu) spreads around the globe, and government officials, the public, journalists, bloggers, and tweeters strive to find out more. People want to know if this flu poses more of a threat than other seasonal flu strains, how fast it’s spreading (and where), and what can be done to contain it. As this collection illustrates, the increasing speed at which complete genome sequences and other genome-scale data can be generated for individual isolates and strains of a pathogen provides tremendous opportunities to identify the molecular changes in these disease agents that will enable us to track their spread and evolution through time (e.g., [3,7,8]) and generate the vaccines and drugs necessary to combat them (e.g., [5–7]).
The collection also shines a spotlight on specific pathogens, some familiar and widespread, such as the influenza A virus (e.g., [9]); some “reemerging,” such as the Mycobacterium tuberculosis complex that causes tuberculosis [10]; and some identified only recently, as with the bacterium Helicobacter pylori (which causes peptic ulcers and gastric cancer [11]).

There is no simple definition of an emerging disease, but it can be loosely described as a disease that is novel in some way, for example, one that displays a change in geographic location, genetics, or function. Emerging infectious diseases are caused by a wide range of organisms, but they are perhaps best typified by zoonotic viral diseases that cross from animal to human hosts and can have a devastating impact on human health, causing a high disease burden and mortality [8]. These zoonotic diseases include monkeypox, Hendra virus, Nipah virus, and severe acute respiratory syndrome coronavirus (SARS-CoV), in addition to influenza A and the lentiviruses that cause AIDS. The apparently increased transmission of pathogens from animals to humans over recent decades has been attributed to the unintended consequences of globalization as well as environmental factors and changes in agricultural practices [8].

Generally, the burden of these diseases is most strongly felt by those in developing countries. Brindley et al. [12] point to the debilitating effects of the most common human infectious agent in such areas, helminths (parasitic worms), and the role that genomics plays in advancing our understanding of molecular and medical helminthology. Compounding the problem of emerging infectious diseases in developing countries is the reality that researchers in developing countries have often been unable to participate fully in genomics research, because of their technological isolation and limited resources. As Coloma and Harris emphasize [13], “collaborations, starting with capacity building in genomics research, need to be fostered so that countries that are currently excluded from the genomics revolution find an entry point for participation.”

This collection is a collaborative effort that combines financial support from Google.org (which has also sponsored research on emerging infectious disease through its Predict and Prevent initiative [14]) with PLoS’s editorial independence and rigor. Gupta et al. [1] provide Google.org’s perspective and vision for how systematic application of genomics, proteomics, and bioinformatics to infectious diseases could predict and prevent the next pandemic. To realize this vision, they urge the community to unite under an “Infectious Disease Genomics Project,” analogous to the Human Genome Project. This is, as the authors admit, a potentially “grandiose” and difficult proposition. Some researchers might justifiably argue that much is already being achieved, as demonstrated by this collection, and that the vision is naïve. However, as every article in the collection also points out, tremendous challenges remain if the potential of genomics in this field is to be realized.

One problem is that, although sequencing is now the method of choice for characterizing new disease agents, and substantially faster and cheaper sequencing methods are continually being produced, we still lack the range of computational tools necessary to analyze these sequences in sufficient detail [4]. It is possible to sequence the entire assemblage of viruses in a particular tissue type or host species [3] and to obtain complete or nearly complete genome sequences for large samples of bacteria [7]. Yet we remain in the early, albeit essential, stages of pathogen discovery (Box 1). These sequences can be interpreted fully only when integrated with relevant environmental, epidemiological, and clinical data (e.g., [3,4,8]).
And, despite the increased sequencing, truly comprehensive genome data are still available for only a few key pathogens, which further limits our understanding. For example, a full quantitative understanding of the processes that shape the epidemiology and evolution (the phylodynamics) of RNA viruses is currently possible only for HIV and influenza A virus [3].

Citation: Eisen J, MacCallum CJ (2009) Genomics of Emerging Infectious Disease: A PLoS Collection. PLoS Biol 7(10): e1000224. doi:10.1371/journal.pbio.1000224
Published October 26, 2009
Copyright: © 2009 Eisen, MacCallum. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected] (JAE); [email protected] (CJM)
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
PLoS Biology | www.plosbiology.org | October 2009 | Volume 7 | Issue 10 | e1000224

Box 1. A Field Guide to Microbes?

When an American robin (Turdus migratorius) showed up in London a few years ago, birders were rapidly all atwitter and many came flocking to town [22]. Why had this one bird created such a stir? For one main reason: it was out of place. This species is normally found in North America and only very rarely shows up on the other side of the “pond.” Amazingly, this rapid, collective response is not that unusual in the world of birding. When a bird is out of place, people notice quickly.

This story of the errant robin gets to the heart of the subject of this collection because being out of place in a metaphorical way is what defines an emerging infectious disease. Sometimes we have never seen anything quite like the organism or the disease before (e.g., SARS, Legionella). Or perhaps, as with many opportunistic pathogens, we have seen the organism before but it was not previously known to cause disease. In other cases, such as with pandemic H1N1 2009 or E. coli O157:H7, we have seen the organism cause disease before but a new form is causing far more trouble. And of course organisms can be literally out of place, by showing up in a location not expected (e.g., consider the anthrax letters [2]).

Historically, despite the metaphorical similarities with the robin case, the response to emerging infectious disease is almost always much slower. Clearly, there are many reasons for these differences, which we believe are instructive to consider. At least four factors are required for birders’ rapid responses to the arrival of a vagrant bird: (1) knowledge of the natural “fauna” in a particular place, (2) recognition that a specific bird may be out of place, (3) positive identification of the possibly out-of-place bird, and (4) examination of the “normal” place for relatives of the identified bird.

How are these requirements achieved? Mostly through the existence of high-quality field guides that allow one to place an organism such as a bird into the context of what is known about its relatives. This placement in turn is possible because of two key components of field guides. First, such guides contain information about the biological diversity of a group of organisms. This usually includes features such as a taxonomically organized list of species with details for each species on biogeography (distribution patterns across space and time, niche preferences, relative abundance), biological properties (e.g., behavior, size, shape), and genetic variation within the species (e.g., presence of subspecies). Second, a good field guide provides information on how to identify particular types (e.g., species) of those organisms. With such information, and with a network of interested observers, an out-of-place bird can be detected with relative ease.

In much the same way, a field guide to microbes would be valuable in the study of emerging infectious diseases. The articles in this collection describe what can be considered the beginnings of species-specific field guides for the microbial agents of emerging diseases. If we want to truly gain the benefits that can come from good field guides, it will be necessary to expand current efforts to include more organisms, more systematic biogeographical sampling, and more epidemiological and clinical data. But the current efforts are a great start.

Figure: The American Robin (Turdus migratorius). (Photo Credit: NASA). doi:10.1371/journal.pbio.1000224.g001

In this collection, you will find not only the views of leading researchers from several different disciplines, and a provocative vision from a funding agency, but also the contributions of six different PLoS journals (PLoS Biology, PLoS Medicine, PLoS Computational Biology, PLoS Genetics, PLoS Neglected Tropical Diseases, and PLoS Pathogens). The PLoS open-access model of publishing makes possible such a large multidisciplinary cross-journal collection, in which all articles are simultaneously available online for unrestricted reuse, regardless of venue (see also the podcast that accompanies the collection; http://ploscollections.org/podcast/emerginginfectiousdisease.mp3). Our aim is that this collection will add to other “open science” activities that have helped provide insights into infectious disease more quickly than would have been thought feasible only a few years ago.

This accelerated availability of research findings is exemplified by the recent response to the flu pandemic. Consider, for example, data access. Traditionally, scientists have released data after publishing a study.
Fortunately, in part due to experience from genome sequencing projects, prepublication flu sequence data have been released in a relatively unrestricted manner to the community [15]. This has in turn enabled anyone, not just those who collected the data, to carry out analyses while the epidemic is occurring (when in principle there is still time to save lives) rather than being forced to provide a posthumous account of the spread of infection. Such a response highlights both the importance of early data access and the removal of restrictions in the use of data (e.g., in many past cases data might be released but use of the data in presentations and publications would be limited). The value of open access to sequence data is helping to put pressure both on private organizations to release their sequence data [16,17] and on all agencies to release other information (e.g., metadata about strains) more rapidly. This pressure is not being brought to bear only on flu data: in this collection, Van Voorhis et al. [5] call on pharmaceutical companies to deposit the structural coordinates of drug targets from all globally important infectious disease organisms in public databases.

Of course, data about any infectious disease are not very useful unless placed in the scientific context of past studies (i.e., publications) specifically about the disease or about methods to analyze such data. It is also important to have access to information about other diseases and other organisms that might impact its spread or evolution.

Perhaps the most intriguing aspect of open science in response to flu has been the move toward prejournal publication release of findings. Many flu researchers took the available data, analyzed it, and posted results on blogs [18,19], wikis [20], and other sites. Although some view this “non-peer-reviewed” release as unseemly, it is clear that it has helped accelerate the science in the study of pandemic H1N1 2009 and led to some important journal papers [17]. Indeed, such advances helped provide one of the stimuli for PLoS’s most recent initiative, PLoS Currents: Influenza, a Google “Knol,” for the rapid communication of research results and ideas about flu vetted by expert moderators [21].

This is not to say there are no possible risks or drawbacks from more openness. For example, some governments may avoid releasing data because of fears about discrimination (as was seen in many aspects of the flu in Mexico). Others worry that complete openness might foster the spread of misinformation. However, as Fricke et al. argue in their article on the relationship between genomics and biopreparedness [2], open source genomic resources are actually of immense benefit to those in charge of our public health and biosecurity.

It is clear that “for all stages of combating emerging infections, from the early identification of the pathogen to the development and design of vaccines, application of sophisticated genomics tools is fundamental to success” [8]. It is equally clear that open science and open access to publications and data will be key to that success. Whatever one’s position has been on the various open science initiatives, there is no doubt that the “esoteric” label on some open science initiatives has largely been eliminated by the emergence of the H1N1 flu epidemic. The faster, cheaper, and more openly we can distribute the discoveries of science, the better for scientific progress and public health. As this collection emphasizes, managing the threat of novel, re-emerging, and longstanding infectious diseases is challenging enough even without barriers to scientific research.

We encourage you to make the most of this collection by sharing, rating, and annotating the articles using our online commenting tools. Better yet, join the discussion by providing your own vision to prevent the emergence and spread of the next rogue pathogen.

References

1. Gupta R, Michalski MH, Rijsberman FR (2009) Can an Infectious Diseases Genomics Project predict and prevent the next pandemic? PLoS Biol 7: e1000219. doi:10.1371/journal.pbio.1000219.
2. Fricke WF, Rasko DA, Ravel J (2009) The role of genomics in the identification, prediction, and prevention of biological threats. PLoS Biol 7: e1000217. doi:10.1371/journal.pbio.1000217.
3. Holmes EC, Grenfell BT (2009) Discovering the phylodynamics of RNA viruses. PLoS Comput Biol 5: e1000505. doi:10.1371/journal.pcbi.1000505.
4. Berglund EC, Nystedt B, Andersson SGE (2009) Computational resources in infectious disease: Limitations and challenges. PLoS Comput Biol 5: e1000481. doi:10.1371/journal.pcbi.1000481.
5. Van Voorhis WC, Hol WGJ, Myler PJ, Stewart LJ (2009) The role of medical structural genomics in discovering new drugs for infectious diseases. PLoS Comput Biol 5: e1000530. doi:10.1371/journal.pcbi.1000530.
6. Seib KL, Dougan G, Rappuoli R (2009) The key role of genomics in modern vaccine and drug design for emerging infectious diseases. PLoS Genet 5: e1000612. doi:10.1371/journal.pgen.1000612.
7. Falush D (2009) Toward the use of genomics to study microevolutionary change in bacteria. PLoS Genet 5: e1000627. doi:10.1371/journal.pgen.1000627.
8. Haagmans BL, Andeweg AC, Osterhaus ADME (2009) The application of genomics to emerging zoonotic viral diseases. PLoS Pathog 5: e1000557. doi:10.1371/journal.ppat.1000557.
9. McHardy AC, Adams B (2009) The role of genomics in tracking the evolution of influenza A virus. PLoS Pathog 5: e1000566. doi:10.1371/journal.ppat.1000566.
10. Comas I, Gagneux S (2009) The past and future of tuberculosis research. PLoS Pathog 5: e1000600. doi:10.1371/journal.ppat.1000600.
11. Dorer MS, Talarico S, Salama NR (2009) Helicobacter pylori’s unconventional role in health and disease. PLoS Pathog 5: e1000544. doi:10.1371/journal.ppat.1000544.
12. Brindley PJ, Mitreva M, Ghedin E, Lustigman S (2009) Helminth genomics: The implications for human health. PLoS Negl Trop Dis 3: e538. doi:10.1371/journal.ppat.1000538.
13. Coloma J, Harris E (2009) Molecular genomic approaches to infectious diseases in resource-limited settings. PLoS Med 6: e1000142. doi:10.1371/journal.pmed.1000142.
14. Google.org (2008) Predict and Prevent Initiative homepage. Available: http://www.google.org/predict.html. Accessed 16 September 2009.
15. National Center for Biotechnology Information (2009) Influenza Virus Resource. Available: http://www.ncbi.nlm.nih.gov/genomes/FLU/SwineFlu.html. Accessed 11 September 2009.
16. Butler D (2005) Flu researchers slam US agency for hoarding data. Nature 437: 458–459.
17. Smith GJD, Vijaykrishna D, Bahl J, Lycett SJ, Worobey M, et al. (2009) Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature 459: 1122–1125.
18. Porter S (2009) Did the California H1N1 swine flu come from Ohio? Discovering Biology in a Digital World blog. Available: http://scienceblogs.com/digitalbio/2009/04/did_the_california_h1n1_swine.php. Accessed 11 September 2009.
19. Koppstein D (2009) Swine flu phylogeny, part II. Koppology blog. Available: http://koppology.blogspot.com/2009/04/swine-flu-phylogeny-part-ii.html. Accessed 11 September 2009.
20. Rambaut A (2009) Human/Swine A/H1N1 Influenza Origins and Evolution. Available: http://tree.bio.ed.ac.uk/groups/influenza/. Accessed 11 September 2009.
21. Allen L (2009) Welcome to PLoS Currents: Influenza. PLoS Blog. Available: http://www.plos.org/cms/node/481. Accessed 8 September 2009.
22. Evans I (29 March 2009) American Robin Spotted in South London. Foxnews.com. Available: http://www.foxnews.com/story/0,2933,189510,00.html. Accessed 14 September 2009.
Essay

Molecular Genomic Approaches to Infectious Diseases in Resource-Limited Settings

Josefina Coloma1,2, Eva Harris1,2*

1 Division of Infectious Diseases and Vaccinology, School of Public Health, University of California Berkeley, Berkeley, California, United States of America, 2 Sustainable Sciences Institute, San Francisco, California, United States of America

Only half a century after the landmark discovery of the double helix structure of DNA, the human genome was sequenced and a new era of biomedical research was ushered in [1]. Parallel advances in comparative genomics, genetics, high-throughput biochemical techniques, and bioinformatics have provided researchers in wealthy nations with a repertoire of tools to analyze the sequence and functions of organisms at an unprecedented pace and level of detail. Since the beginning of the genomics era [2,3], however, it has been evident that researchers in many developing countries will not be participating fully in genomics research, mainly because of their technological isolation and their limited resources and capacity for genomics research combined with the urgency of many other health priorities. To share the benefits of this technology equitably worldwide, some have advocated that developed and developing countries alike should participate in genomics research to prevent widening of the already large gap in global health resources [4]. As most of the funding that has fueled the rapid advance of the field comes from developed country governments, private initiatives, and industry, however, not much has been done to enable poorer countries to participate as equals in genomics research.

Developing countries that are not directly participating in a genomics initiative can, nonetheless, gain from the discoveries of this field in a number of ways, as detailed below.
It remains to be seen, however, how the developing world will specifically benefit from the refined genetic information and the drugs and vaccines produced as a result of genomics initiatives. Information exchange and translation of knowledge must be carried out continually through fora accessible to researchers in developing countries. “North–South” collaborations, starting with capacity building in genomics research, need to be fostered so that countries that are currently excluded from the genomics revolution find an entry point for participation. “South–South” collaborations must be encouraged to allow countries with limited resources to pool their human and financial capital, learn from each other’s experience, and share in the benefits of genomics. Ensuring that the benefits of genomics-based medicine are shared by developing countries involves their inclusion in the discussion of ethical, legal, social, economic, and sovereignty issues (Box 1).

Summary Points

• Researchers in most developing countries lack the technology, resources, and capacity to participate fully in genomics research.
• Information exchange and knowledge translation must be carried out continually through “North–South” collaborations, starting with capacity building in genomics research; “South–South” collaborations must be encouraged to allow countries with limited resources to pool their human and financial capital and share in the benefits of genomics.
• Several emerging countries have made significant progress in the past decade by sequencing the genomes of organisms with little economic value in the developed world but of great local relevance.
• Molecular diagnostics and molecular epidemiology are the first frontier of genomics, with accessible tools that can be applied in resource-limited settings.
• Developing countries entering the genomics era should start by establishing their priorities and enacting appropriate legislation before embarking on large-scale projects.
• Access to training and capacity building of human resources in bioinformatics and data mining are crucial in the developing world.

Citation: Coloma J, Harris E (2009) Molecular Genomic Approaches to Infectious Diseases in Resource-Limited Settings. PLoS Med 6(10): e1000142. doi:10.1371/journal.pmed.1000142
Published October 26, 2009
Copyright: © 2009 Coloma, Harris. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: No specific funding was received for this study/essay.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
The Essay section contains opinion pieces on topics of broad interest to a general medical audience.
Provenance: Commissioned; externally peer-reviewed.
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
PLoS Medicine | www.plosmedicine.org | October 2009 | Volume 6 | Issue 10 | e1000142

Box 1. Societal and Ethical Issues in Genomics to Be Discussed with Full Participation of All Nations

• Issues of confidentiality, stigmatization, discrimination, and misuse of genetic information
• Dangers of a reductionist approach to health issues based only on genetic information that ignores multifactorial determinants
• Issues about intellectual property rights associated with the patentability of DNA sequences, the applications derived from them, and the implications for developing countries [45]
• The potential exploitation of developing-country populations by creating genetic databases for a price [46]
• The potential risk of breeding human beings by design [47]
• Issues about informed consent, standard of care, and availability and pricing of new drugs and vaccines being tested in developing countries [48]

Initiatives in the Developing World

In the developing world, the link between human genomics and infectious disease is particularly important. The influence of host genes on the differential susceptibility of individuals or populations to infection and the evolutionary influence of pathogens on the genetic composition of populations by selecting for resistant individuals through coevolution can now be dissected in more detail with genomics. An array of host–pathogen interactions are associated with particular human genes and loci, as best illustrated by the relationship of the malaria pathogen with host genetic evolution. As genetic information about larger populations becomes increasingly available, it is important to disseminate information relating genomics to disease as well as to devise intervention strategies for at-risk populations worldwide [5].

Because science and technology are increasingly recognized as vital components for national development, emerging economies and some developing countries are building their infrastructures to promote local innovation and to retain the value of their human, plant, and microbial genomic diversity and research. India, Thailand, South Africa, Indonesia, Brazil, and Mexico, for example, have devoted considerable resources to large-scale population genotyping projects that explore human genetic variation. The Institute for Genomic Medicine (INMEGEN) initiative in Mexico is the largest and most comprehensive, with a broad strategy for incorporating genomics into health care that includes infrastructure, strategic public–private partnerships, research and development in genomics relevant to local health problems, capacity building, and bioethics policy making [6,7].
Although it is unclear how Mexico will make the transition from early-phase investment to translation of knowledge into products and services with health and economic impacts, the country is taking important steps to address the challenges it and other emerging economies face, such as the shortage of trained professionals and the ability to retain local talent. For example, the National Council for Science and Technology (CONACYT) is making efforts to engage the Mexican scientific diaspora with expertise in genomics by offering repatriation packages tied to jobs at universities and research institutes, an approach that is also being adopted by Brazil.

Brazil’s Foundation for Research Support in São Paulo (FAPESP) genomics initiative is also considered a political and scientific achievement. Key to its success has been early investment in training young scientists by sponsoring scholarships abroad in areas related to genomics in which Brazil lacks expertise. To avoid brain drain, beneficiaries are required to return to Brazil for at least four years and must have a committed teaching position at a local university before they leave. One important principle of Brazil’s genomics initiative is that the projects are relevant to Brazil and the rest of the developing world but are low on the list of priorities of the US and Europe, thus providing both an important contribution to genomics and a benefit to Brazil’s economy and scientific endeavor [8]. FAPESP is in the process of sequencing the genes of the parasite that causes schistosomiasis, a disease that afflicts millions in Brazil. Another example in Brazil is the government-funded consortium Organization for Nucleotide Sequencing and Analysis (ONSA), formed to sequence and analyze the genome of the plant pathogen Xylella, which infects orange trees and has great economic impact [9].
This effort led to additional genomics projects on vectors of pathogens that cause major public health problems in Brazil, such as the sandfly Lutzomyia longipalpis, which transmits Leishmania spp., and the Triatominae bug species, which are vectors of Trypanosoma cruzi [10].

The impact of genomics on the developing world is also illustrated by multinational initiatives such as the one funded by the US National Institutes of Health (NIH), the UK’s Wellcome Trust, and private and public institutes in the US and Europe in collaboration with research centers in Brazil, Argentina, Venezuela, and Singapore to sequence the genomes of the parasites T. brucei, T. cruzi, and Leishmania major, which cause the deadly insect-borne diseases African sleeping sickness, Chagas disease, and leishmaniasis, respectively [11–13]. The potential new drug targets identified by these initiatives have great relevance in over 100 developing countries where the diseases take a significant toll on the economy and the quality of life of their citizens. Similar initiatives have resulted in sequencing of other pathogens important to medicine and agriculture. The data from these projects are usually freely available online for data mining and for bioinformatics analysis at remote locations, as most researchers follow the recommendation set by the Bermuda Accord to make DNA sequences (especially human) freely and openly available without delay [14].

Resource-limited countries can enter the genomics era by creating partnerships and regional centers for technology and resources [15]. For example, DNA sequencing technology, still unaffordable for many researchers and public laboratories because of low-use volume and high costs of equipment, reagents, and maintenance, can be affordable if a regional center provides services to a pool of laboratories and researchers within a country or geographical region.
As an illustration, using Brazilian infrastructure, Peru and Chile joined the global potato sequencing consortium, which will sequence different varieties of this important agricultural species [16]. Brazil has also generated several open-source bioinformatics tools for the annotation of bacterial and protozoan genomes that can be used by any researcher worldwide [17].

In Africa, the Center for Training in Functional Genomics of Insect Vectors of Human Disease (AFRO VECTGEN) was initiated by TDR (Special Programme for Research and Training in Tropical Diseases) at the World Health Organization (WHO) and the Department of Medical Entomology and Vector Ecology of the Malaria Research and Training Center in Mali to train young scientists in functional genomics who will ultimately use genome sequence data for research on insect vectors of human disease. The program triggers collaborative research with neighboring nations and the vector biology network in Mali, which was built around research grants funded by the US NIH and TDR/WHO [18]. The Malaria Genomic Epidemiology Network (MalariaGEN) uses a consortial approach that brings together researchers from 21 countries to overcome scientific, ethical, and practical challenges to conducting large-scale studies of genomic variation that could assist efforts in the fight against malaria [19].

Successful “North–South” partnerships that help scientists bridge the genomic gap usually involve a project of mutual interest. An example is the common effort of the International Livestock Research Institute (ILRI) in Nairobi and The Institute for Genomic Research (TIGR; now the J. Craig Venter Institute) to sequence and annotate the genome of Theileria parva, a cattle parasite that causes important economic losses to small farmers in Africa and elsewhere [20]. This effort has generated local human resources in genomics and infrastructure for the future.
Application of Molecular, Genetic, and Genomic Tools with Limited Resources
Although the genomics initiatives described above challenge the notion that developing countries must wait to import advances in science and technology that emerge from the developed world, poorer developing countries still do not have the resources to develop their own genomic projects on a large scale. However, implementing simpler molecular genetic approaches to solve health problems is very feasible in resource-limited settings. The decades preceding the human and microbial genome initiatives were highlighted by important developments in molecular and genetic methods applied to infectious diseases. These developments were enabled by increasingly available genetic information about many pathogens and their vectors and by molecular tools such as PCR and powerful sequencing technologies, which permitted rapid advances that were successfully introduced into the developing world with little delay. Molecular tools for diagnosis have gained a ready foothold because many poor countries do not have the facilities for traditional diagnosis and surveillance. Thus, diagnosis often relies on clinical observations or requires that a sample be sent out to foreign agencies such as the US Centers for Disease Control and Prevention (CDC) for confirmation. In addition, even when available, classic techniques based on serological, microscopic, and culture-based methods are often lengthy, of only moderate sensitivity, and not highly discriminatory at the level of species subtype or strain. By adapting DNA technologies to the existing infrastructure, using home-grown solutions to reduce their cost, and applying them to solve local health problems, molecular approaches to detect and type infectious agents on-site offer real value [21].
Fostering appropriate technology transfer and capacity-building in the “South” enables public health laboratories and research groups in less scientifically developed countries to participate in global genomics by contributing their findings and sharing their expertise with their peers [22,23]. For example, we and others adapted PCR-based molecular diagnostic techniques for infectious diseases such as leishmaniasis and dengue for cost-effective application in laboratories with minimum infrastructure and basic technical expertise; these techniques are now fully validated and used routinely throughout Latin America [21,24–30]. This approach relies on understanding the principles of the technologies, deconstructing them into their basic components, and rebuilding them on-site [21]. Another area where molecular tools have demonstrated their utility in resource-poor settings is in detecting drug resistance in a variety of pathogens. This has been facilitated in large part by successful “North–South” partnerships that have served to train scientists in developing countries in the use, implementation, and interpretation of modern molecular methods applied to emerging drug resistance (see [31]). This approach has been particularly successful with certain diseases, such as malaria, HIV/AIDS, tuberculosis, and drug-resistant bacterial infections (both nosocomial and community-based). Unfortunately, most studies of drug-resistant pathogens are performed independently of one another, so data on the prevalence of resistance markers are scattered in disparate databases or in unpublished studies without links to the clinical, laboratory, and pharmacokinetic data needed to relate the genetic information to relevant phenotypes.
To enable molecular markers of malaria drug resistance to realize their potential as public health tools, the World Antimalarial Resistance Network (WARN) database is being created with the dual goals of improving treatment of malaria by informed drug selection and use and providing a prompt warning when treatment protocols need to be changed [32,33]. By accelerating the identification and validation of markers for resistance to combination therapies, this global database should help prolong the useful therapeutic lives of important new drugs. The ultimate power of genetic tools in resource-limited settings is evident in the field of molecular epidemiology, where genetic information about the host or infectious agent is analyzed together with clinical and epidemiological data to derive and implement appropriate interventions. For example, molecular tools based on limited sequence information, such as molecular fingerprinting of a polymorphic marker, have made important contributions to strengthening control of tuberculosis in both developed and developing countries by enabling analysis of transmission patterns, helping identify phenotypic variation among strains, and facilitating evaluation of the global distribution, relative transmissibility, virulence, and immunogenicity of different lineages of M. tuberculosis [34–38]. Bacterial infections, food-borne outbreaks, and viral infections in developing countries, including the recent H1N1 influenza pandemic, are monitored using similar typing methodologies [39–41]. Molecular tools permit a refined case definition and thus have tremendous potential for decision-making support and informing targeted public health interventions in countries with high burdens of disease and limited technological capabilities and resources.
The trend to move beyond genetic marker analysis to full genome sequencing is growing, as complete genome data can provide a wealth of information about etiologic agents of disease that was previously unknown. Full-genome approaches are not always necessary, however. In molecular epidemiology of infectious diseases, nucleic acid fingerprinting can provide enough answers to important epidemiological questions to allow critical interventions to be designed (see above). In fact, too much genetic information, in some instances, can obscure the picture, as several closely related pathogenic variants might coexist in one individual or one outbreak that differ by only a few nucleotides but that nonetheless belong to the same strain or subtype, complicating the interpretation of results [42]. The relatively rapid transfer of DNA technology from developed to developing countries is an excellent example of what can be done by forging strong relationships between universities and research groups and public-health laboratories across the world. The validity of adapting these technologies relies on links with epidemiological data and translation into local public health interventions.

Setting Priorities
General international ethical and scientific guidelines for genomics have been created and are being adapted by nations participating in the field as it evolves. Governments and regulatory agencies in the “North” have prepared for the eventual implementation of genomics-based medicine in their respective countries. A critical problem faced by developing countries is the lack of national guidelines for genomics research and its ethical ramifications. Thus, a priority to be set by countries in the early steps of genomic applications is to draw up the necessary rules and legislation on genomics and to generate procedures for implementation.
Creating the necessary communication channels between researchers, social scientists, policy makers, and civil society organizations is also a critical step. Other key challenges facing emerging genomics researchers include proper informed consent and privacy protocols for research participants, protecting them against the potential discrimination that might emerge from genetic information, and ensuring that any benefit that comes to fruition from the research reaches them. In parallel, capacity building of scientists in clinical research and of ethics committees in these issues is essential. Past experience with “safari research,” in which biological samples are taken out-of-country for research that does not benefit local populations, has prompted countries such as Mexico, India, and Brazil to draw up legislation governing “sovereignty” over genomics material and data that restricts the export of biological materials for studies abroad and prioritizes national interests. Poorer countries currently lacking their own genomics initiatives could benefit from similar legislation balancing the protection of “genomic sovereignty” while fostering international collaborations that bring much-needed resources and increase local scientific capacity. Beyond the improvement of their basic genomics research capabilities, governments should engage their relevant ministries to develop a plan to integrate genetic and genomics products (including diagnostics, vaccines, therapies, and others) within the health system and public health programs with emphasis on accessibility and equity to improve health for all. A good example of priority setting in genomics is Mexico’s national genomics program over the last 15 years (see Box 2).

Sharing Know-How
To strengthen genomics globally, the tools necessary for analysis of genomics data are urgently needed in developing countries, where they are currently underutilized [43].
A problem with genomics is that much of the advanced knowledge is concentrated in individuals and a few research centers and companies rather than in textbooks or academia, restricting dissemination even though massive amounts of genomic data and software are openly accessible through the Internet. A conscious effort on the part of developed nations to transfer their knowledge of the use and analysis of genomic databases needs to be encouraged to help developing countries manage their own specific data on indigenous biological species, local epidemiology and infectious diseases, biodiversity, and other issues. Some successful programs and initiatives include the Wellcome Trust Sanger Institute training courses on bioinformatics and genomic analysis, the Sustainable Sciences Institute–Broad Institute bioinformatics workshops (Figure 1), and the TDR/WHO–South African Bioinformatics Institute (SANBI) regional training center. Online training like the S-star alliance bioinformatics courses and conferences such as the African Bioinformatics Conference (Afbix’09) with remote participation are becoming more widespread and are an excellent option for countries with limited resources. GARSA (Genomic Analysis Resources for Sequence Annotation) is a flexible Web-based system designed to analyze genomic data in the context of a data analysis pipeline. Hosted in Brazil, this free system aims to facilitate the analysis, integration, and presentation of genomic information, concatenating several bioinformatics tools and sequence databases with a simple user interface [44]. An alternative to on-site sequencing is to partner with colleagues in more-developed countries to have samples processed abroad in sequencing centers. This is possible only if local legislation allows for export of biological samples, and if true partnership and trust exist with a colleague(s) in the developed country.

Box 2. Building a Road toward Genomics: The Mexican Experience 1995–2009 [7]
• Increases in investment in science and technology (S&T) from 0.35% to 0.43% of the GNP and creation of national S&T legislation to increase regional funding
• Four-fold increase in number of students registered for doctoral-level programs
• Participation in international genomics efforts
• Creation of sequencing initiatives of organisms with local agricultural and health relevance
• Creation of a Genomics Sciences degree and two scientific societies in genomics
• Creation of the National Institute of Genomic Medicine (INMEGEN, 2004) with seed funding for modern infrastructure; a strategy for development that includes country-wide strategic alliances; high-level research and academic programs; ethical, legal, and social implications of genomic medicine; and translation of the scientific knowledge into public goods
• Establishment of genomics research priorities based on most prevalent local diseases
• Plans for creation of public–private partnerships to guarantee sustainability

Challenges for the Future
As developing countries reevaluate their role in the genomics era, they will continue to explore the unique opportunities that arise from the vast natural and genomic diversity that they embody.
As exemplified by the successes in Brazil, Mexico, and several African countries, it is possible to turn challenges and problems such as emerging and endemic infectious diseases into opportunities for unique scientific and economic growth. Access to sequencing facilities, open-source databases, and harmonized methodologies for genomic analysis are essential for the future of genomics in the developing world. However, unless a more concerted effort is made to include countries with limited scientific development and resources, it is unlikely that they will fully participate in genomics projects or use the technologies available other than by allowing their genetic material to be accessible to others. As emerging countries set their own priorities for genomics research and take ownership of its results, the main challenge across developing nations remains access to training and knowledge translation. Human resources and local capacity in genomics are thus central to development, as countries with these skills could participate in the potential benefits of the field with respect to health, food security, natural resource management, and other critical areas. “North–South” and “South–South” collaborations are a viable and extremely rewarding way to increase the capacities of developing countries to access genomic tools to address unique problems considered of little economic value outside these countries but of tremendous importance to the majority of the world’s population.

Author Contributions
ICMJE criteria for authorship read and met: JC EH. Wrote the first draft of the paper: JC. Contributed to the writing of the paper: JC EH.

Figure 1. Participants in a Bioinformatics/Genomics Analysis workshop in Managua, Nicaragua, in June 2008 (conducted by the Sustainable Sciences Institute and the Broad Institute). Photograph by Eva Harris. doi:10.1371/journal.pmed.1000142.g001

References
1. Venter JC (2003) A part of the human genome sequence. Science 299: 1183–1184.
2. Singer PA, Daar AS (2001) Harnessing genomics and biotechnology to improve global health equity. Science 294: 87–89.
3. Calva E, Cardosa MJ, Gavilondo JV (2002) Avoiding the genomics divide. Trends Biotechnol 20: 368–370.
4. Acharya T, Daar AS, Thorsteinsdottir H, Dowdeswell E, Singer PA (2004) Strengthening the role of genomics in global health. PLoS Med 1: e40. doi:10.1371/journal.pmed.0010040.
5. Manolio TA, Rodriguez LL, Brooks L, Abecasis G, Ballinger D, et al. (2007) New models of collaboration in genome-wide association studies: The Genetic Association Information Network. Nat Genet 39: 1045–1051.
6. Seguin B, Hardy BJ, Singer PA, Daar AS (2008) Genomics, public health and developing countries: The case of the Mexican National Institute of Genomic Medicine (INMEGEN). Nat Rev Genet 9 (Suppl 1): S5–9.
7. Jimenez-Sanchez G, Silva-Zolezzi I, Hidalgo A, March S (2008) Genomic medicine in Mexico: Initial steps and the road ahead. Genome Res 18: 1191–1198.
8. Castilla EE, Luquetti DV (2008) Brazil: Public Health Genomics. Public Health Genomics. E-pub ahead of print (3 Sept). doi:10.1159/000153424.
9. Simpson AJ, Reinach FC, Arruda P, Abreu FA, Acencio M, et al. (2000) The genome sequence of the plant pathogen Xylella fastidiosa. The Xylella fastidiosa Consortium of the Organization for Nucleotide Sequencing and Analysis. Nature 406: 151–159.
10. Davila AM, Majiwa PA, Grisard EC, Aksoy S, Melville SE (2003) Comparative genomics to uncover the secrets of tsetse and livestock-infective trypanosomes. Trends Parasitol 19: 436–439.
11. Berriman M, Ghedin E, Hertz-Fowler C, Blandin G, Renauld H, et al. (2005) The genome of the African trypanosome Trypanosoma brucei. Science 309: 416–422.
12. El-Sayed NM, Myler PJ, Bartholomeu DC, Nilsson D, Aggarwal G, et al. (2005) The genome sequence of Trypanosoma cruzi, etiologic agent of Chagas disease. Science 309: 409–415.
13. Ivens AC, Peacock CS, Worthey EA, Murphy L, Aggarwal G, et al. (2005) The genome of the kinetoplastid parasite, Leishmania major. Science 309: 436–442.
14. Bentley DR (1996) Genomic sequence information should be released immediately and freely in the public domain. Science 274: 533–534.
15. Rabinowicz PD (2001) Genomics in Latin America: Reaching the frontiers. Genome Res 11: 319–322.
16. Potato Genome Sequencing Consortium. Available: http://www.potatogenome.net. Accessed 19 July 2009.
17. Almeida LG, Paixao R, Souza RC, Costa GC, Almeida DF, et al. (2004) A new set of bioinformatics tools for genome projects. Genet Mol Res 3: 26–52.
18. Doumbia S, Chouong H, Traore SF, Dolo G, Toure AM, et al. (2007) Establishing an insect disease vector functional genomics training center in Africa. Afr J Med Med Sci 36 (Suppl): 31–33.
19. Malaria Genomic Epidemiology Network (2008) A global network for investigating the genomic epidemiology of malaria. Nature 456: 732–737.
20. Gardner MJ, Bishop R, Shah T, de Villiers EP, Carlton JM, et al. (2005) Genome sequence of Theileria parva, a bovine pathogen that transforms lymphocytes. Science 309: 134–137.
21. Harris E (1998) A low-cost approach to PCR: Appropriate transfer of biomolecular techniques. New York: Oxford University Press.
22. Coloma MJ, Harris E (2004) Innovative low cost technologies for biomedical research and diagnosis in developing countries. BMJ 329: 1160–1162.
23. Harris E (2004) Scientific capacity building in developing countries. EMBO Rep 5: 7–11.
24. Harris E, Tanner M (2000) Health technology transfer. BMJ 321: 817–820.
25. Aviles H, Belli A, Armijos R, Monroy FP, Harris E (1999) PCR detection and identification of Leishmania parasites in clinical specimens in Ecuador: A comparison with classical diagnostic methods. J Parasitol 85: 181–187.
26. Harris E, Kropp G, Belli A, Rodriguez B, Agabian N (1998) Single-step multiplex PCR assay for characterization of New World Leishmania complexes. J Clin Microbiol 36: 1989–1995.
27. Belli A, Rodriguez B, Aviles H, Harris E (1998) Simplified polymerase chain reaction detection of new world Leishmania in clinical specimens of cutaneous leishmaniasis. Am J Trop Med Hyg 58: 102–109.
28. Coloma J, Harris E (2008) Sustainable transfer of biotechnology to developing countries: fighting poverty by bringing scientific tools to developing-country partners. Ann N Y Acad Sci 1136: 358–368.
29. Miagostovich MP, Sequeira PC, Dos Santos FB, Maia A, Nogueira RM, et al. (2003) Molecular typing of dengue virus type 2 in Brazil. Rev Inst Med Trop Sao Paulo 45: 17–21.
30. Schriefer A, Schriefer AL, Goes-Neto A, Guimaraes LH, Carvalho LP, et al. (2004) Multiclonal Leishmania braziliensis population structure and its clinical implication in a region of endemicity for American tegumentary leishmaniasis. Infect Immun 72: 508–514.
31. Falush D (2009) Toward the use of genomics to study microevolutionary change in bacteria. PLoS Genet 5: e1000627. doi:10.1371/journal.pgen.1000627.
32. Plowe CV, Roper C, Barnwell JW, Happi CT, Joshi HH, et al. (2007) World Antimalarial Resistance Network (WARN) III: Molecular markers for drug resistant malaria. Malar J 6: 121.
33. Sibley CH, Barnes KI, Watkins WM, Plowe CV (2008) A network to monitor antimalarial drug resistance: a plan for moving forward. Trends Parasitol 24: 43–48.
34. Bifani PJ, Mathema B, Kurepina NE, Kreiswirth BN (2002) Global dissemination of the Mycobacterium tuberculosis W-Beijing family strains. Trends Microbiol 10: 45–52.
35. Filliol I, Driscoll JR, van Soolingen D, Kreiswirth BN, Kremer K, et al. (2003) Snapshot of moving and expanding clones of Mycobacterium tuberculosis and their global distribution assessed by spoligotyping in an international study. J Clin Microbiol 41: 1963–1970.
36. Manca C, Reed MB, Freeman S, Mathema B, Kreiswirth B, et al. (2004) Differential monocyte activation underlies strain-specific Mycobacterium tuberculosis pathogenesis. Infect Immun 72: 5511–5514.
37. Valway SE, Sanchez MP, Shinnick TF, Orme I, Agerton T, et al. (1998) An outbreak involving extensive transmission of a virulent strain of Mycobacterium tuberculosis. N Engl J Med 338: 633–639.
38. Gagneux S, Comas I (2009) The past and future of tuberculosis research. PLoS Pathog 5(10): e1000600. doi:10.1371/journal.ppat.1000600.
39. Poon LL, Chan KH, Smith GJ, Leung CS, Guan Y, et al. (2009) Molecular detection of a novel human influenza (H1N1) of pandemic potential by conventional and real-time quantitative RT-PCR assays. Clin Chem 55: 1555–1558.
40. Reis JN, Palma T, Ribeiro GS, Pinheiro RM, Ribeiro CT, et al. (2008) Transmission of Streptococcus pneumoniae in an urban slum community. J Infect 57: 204–213.
41. Vieira N, Bates SJ, Solberg OD, Ponce K, Howsmon R, et al. (2007) High prevalence of enteroinvasive Escherichia coli isolated in a remote region of northern coastal Ecuador. Am J Trop Med Hyg 76: 528–533.
42. Riley LW (2004) Molecular epidemiology of infectious diseases: Principles and practices. Herndon (Virginia): ASM Press.
43. Teufel A, Krupp M, Weinmann A, Galle PR (2006) Current bioinformatics tools in genomic biomedical research. Int J Mol Med 17: 967–973.
44. Davila AM, Lorenzini DM, Mendes PN, Satake TS, Sousa GR, et al. (2005) GARSA: Genomic analysis resources for sequence annotation. Bioinformatics 21: 4302–4303.
45. Cook-Deegan RM, McCormack SJ (2001) Intellectual property. Patents, secrecy, and DNA. Science 293: 217.
46. Burton B (2002) Proposed genetic database on Tongans opposed. BMJ 324: 443.
47. Pang T (2002) The impact of genomics on global health. Am J Public Health 92: 1077–1079.
48. Chokshi DA, Thera MA, Parker M, Diakite M, Makani J, et al. (2007) Valid consent for genomic epidemiology in developing countries. PLoS Med 4: e95. doi:10.1371/journal.pmed.0040095.

PLoS Medicine | www.plosmedicine.org | October 2009 | Volume 6 | Issue 10 | e1000142

Perspective
Can an Infectious Disease Genomics Project Predict and Prevent the Next Pandemic?
Rajesh Gupta¤*, Mark H. Michalski¤, Frank R. Rijsberman
Google.org, Mountain View, California, United States of America

We believe that there is great potential in the systematic application of genomics, proteomics, and bioinformatics to infectious diseases, and that this potential has yet to be fully realized. We suggest that the international community unite under an Infectious Disease Genomics Project, analogous to the Human Genome Project, with a goal of a comprehensive, open-access system of genomic information to accelerate scientific understanding and product development in the very settings where diseases have the highest probability of emerging. If properly structured, such an approach could fundamentally shift the global response to emerging infectious diseases.

Genomics Is Systematically Transforming Medicine
The “Genomic Revolution” has transformed our vision and understanding of how living organisms and systems interact with each other and with the environment [1]. Increasingly, the science of genomics serves as the foundation for translational research for advancing the management of many important diseases [2–7]. Decreasing costs and increasing throughput of new technologies have made possible multinational collaboration on large-scale projects such as the Human Microbiome Project and the 1000 Genomes Project [8–10]. Infectious disease management is also being transformed by molecular technologies, as seen in HIV [11,12], tuberculosis [13,14], malaria [15,16], and other neglected tropical diseases [17,18].
Discovering novel pathogens and elucidating the implications of genetic variation among existing pathogens [19,20] are critical for rapidly mitigating pandemic threats, as demonstrated recently with severe acute respiratory syndrome (SARS) [21,22] and avian (H5N1) and pandemic H1N1 2009 influenza (commonly referred to as “swine flu”) [23–26]. To fully harness the benefit of genomics in infectious diseases, a chain of overarching activities must occur. First, understanding the dynamics of infectious diseases through the genomics lens requires a tremendous amount of integrated comparative sequence, expression, epigenetic, and proteomic data from a variety of pathogens (bacteria, virus, protozoa, fungi), vectors (arthropod and avian sources), reservoirs (non-human mammals, environment) and human hosts. Second, generating, collating, organizing, and curating these data is an essential public health task. Third, translating this information to tools to improve surveillance and response mechanisms is critical to effectively impact disease management. If this bench-to-bedside chain of activities were optimized, we envision that the following could occur:
• Fully annotated genomes of all known pathogens, vectors, non-human hosts, and reservoir species, as well as a large number of candidate microbes in families that have a high risk of generating future pathogens, are held in public open-access databases such as GenBank.
• A “Genomic search” of all available contextual information, from sample origins through to published analyses, is as simple as a Google search.
• Sequencing and other molecular technologies are everyday tools-of-the-trade in every district hospital and laboratory in hotspots of emerging infectious disease, such as southeast Asia and sub-Saharan Africa.
• Automated molecular diagnostic assays are low-cost, reduced at least to the size of a smart mobile phone, and can return definitive diagnoses of a range of specialized known pathogen panels at the point of care.
• A range of products that use infectious disease genomic information routinely (such as vector maps, early warning systems, diagnostics, vaccines, and drugs) contribute to the prediction and prevention of epidemics.
While progress is occurring in each of these areas, the outputs, which are needed today, are far from complete.

Creating an Infectious Disease Genomics Project (IDGP)
We believe that accelerated advances in the area of infectious diseases can occur under a global collaborative framework composed of discrete and delineated activities between the public and private sectors among resource-wealthy and resource-limited settings. The Human Genome Project (HGP) was a pioneering international effort that helped unlock the power of genomics for human health [27,28]. This effort generated important information in part by having clear, targeted outcomes and by implementing a standard methodology across all participants. The HGP was a great impetus for progress seen thus far in genomics and health. Moreover, the HGP recognized that sequencing was just the first step in a much bigger process [26]. A similar effort for infectious diseases could, in our view, help predict and prevent the next pandemic.

Citation: Gupta R, Michalski MH, Rijsberman FR (2009) Can an Infectious Disease Genomics Project Predict and Prevent the Next Pandemic? PLoS Biol 7(10): e1000219. doi:10.1371/journal.pbio.1000219
Published October 26, 2009
Copyright: © 2009 Gupta et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: Google.org is financially supported through its parent company, Google.com. At the time this manuscript was developed, RG was an employee of Google.org and MM was a consultant to Google.org. The funder had no role in the decision to publish or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
¤ Current address: Stanford University, Stanford, California, United States of America
The Perspective section provides experts with a forum to comment on topical or controversial issues of broad interest.
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
PLoS Biology | www.plosbiology.org | October 2009 | Volume 7 | Issue 10 | e1000219

Author Summary
The world of genomics is transforming medicine, and is likely to influence the future development of new drugs, diagnostics, and vaccines. To date, the greater focus of genomics and medicine has been on conditions affecting resource-wealthy settings, primarily involving scientists and companies in those settings. However, we believe that it is possible to expand genomics into a more global technology that can also focus on diseases of resource-limited settings. This goal can be achieved if genomics is made a global priority. We feel one way to move in this direction is through a comprehensive approach to infectious diseases (i.e., an Infectious Disease Genomics Project) that would mirror the Human Genome Project. Without an active, unified effort specifically focused on allowing actors at any level to participate in the genomics revolution, infectious diseases that primarily affect the poor will likely not achieve the same level of scientific advancement as diseases affecting the wealthy.
To capitalize on existing successful efforts in the area of genomics and infectious diseases such as those by the Broad Institute, Genomics Standards Consortium, J. Craig Venter Institute, the National Institute of Allergy and Infectious Diseases, and the Wellcome Trust Sanger Institute (to name a few), we urge the international community to unite its numerous activities under an Infectious Disease Genomics Project (IDGP): a coordinated, large-scale, international effort focused on the genomes of pathogens, vectors, hosts, and reservoirs and linked to end-point surveillance and response systems. Such a project could coordinate activities in four specific areas: generating data, linking data, analyzing data, and applying data (Figure 1).

Generating Data
At the outset, the IDGP would need to determine what the world requires in terms of genomic information. A standard approach to generating depth and diversity in genomic data is essential; beyond this, continuous real-time surveillance and characterization of evolving pathogens can help effectively forestall future epidemics/pandemics. Frontline work by consortiums, genome research centers, and individual laboratories has yielded baseline approaches in this area and a wealth of critical genomic information for many important infectious agents [29–34]. While each actor in the genomics field brings its own priority for targeting particular pathogens or diseases, a clear roadmap to generating a complete genomic picture of all infectious agents, emerging threats, hosts, and reservoirs, incorporating a broad range of investigators with varied technological capacity, would enhance both data generation and application. Such a process allows for community-level priority setting, thereby enabling smaller-scale laboratories to tailor projects to fit the needs of local communities while contributing to global efforts.
Linking Data
The data collected must be connected to all relevant information and analytical tools in a single, easy-to-use, open-source, real-time interface. Such a system would improve on current systems by: gathering data across the public domain and working with companies/institutions to harness information in the private domain; linking accurate, annotated sequencing information to functional genomic and proteomic/functional proteomic information; attaching scientific literature associated with all levels of information; and including a self-sustaining financial mechanism potentially based on royalties from commercial products generated from the use of this system.

Analyzing Data
The data need to be linked via large-scale, dynamic databases held in virtual servers allowing for collaboration and sharing while maintaining originating information for data rights and sovereignty. Concurrently, these data should be associated with a centralized collection of open-source bioinformatics tools capable of real-time operation in low- and high-speed computers and varying levels of internet connectivity. A single interface also would bring various sample collections together in formally structured biobanks that capture geospatial and context data to allow efficient scientific collaboration to take place. Centralizing the entire spectrum of information and analytic tools also allows researchers in resource-limited settings to participate in the genomics revolution without prohibitively costly machines, laboratories, and sample accessibility. Although we fully acknowledge that internet connectivity is a requirement that is not currently available to all, rapid technical innovation and investment, from cheap netbook computers to new fiber optic cables in Africa, are changing that equation. This system could be facilitated by virtual community collaboration or crowd-sourcing, taking full advantage of networking tools such as Wikipedia, Facebook, Twitter, FusionTables, and PLoS.

Figure 1. A coordinated Infectious Disease Genomics Project (IDGP) could unify sequencing efforts, enhance data usability, and lead to essential tools for infectious disease management. doi:10.1371/journal.pbio.1000219.g001

Applying Data
Technological advances for basic scientific discovery (such as next-generation sequencers, microarrays, mass spectrometers, cell-based assay methods, and other tools for transcriptome, metabolome, and proteome discovery), novel techniques to increase throughput and/or decrease the cost of analysis, and applied clinical decision-making and surveillance tools (point-of-care diagnostics, rapid multi-pathogen assays) are in progress and should be supported actively. The IDGP should be informed by and incorporate emerging technology platforms to rapidly develop more accurate field diagnostics and to identify new opportunities for vaccine and drug development. An IDGP is attainable if others share this vision, show leadership, and see the added value resulting from a coordinated effort. The HGP certainly was a more targeted effort and we acknowledge that an IDGP will have additional obstacles to overcome. Scientific disagreement over targets is bound to occur. Complications resulting from the proposed level of data sharing should not be underestimated, and care must be taken to ensure proprietary rights and acknowledgement when warranted. Adapting molecular genetic technologies to resource-limited settings is a significant challenge, but is occurring with some success. Bringing together a community of scientists and donors, each with their own objectives and goals, to work under a single framework, is a difficult proposition.
Finally, there will be many who will find this perspective simply too grandiose. Leaps of progress also require big visions, however, and it may just be possible that the 2009 H1N1 influenza pandemic is enough of a reminder of what is at stake to provide a catalyst for action.

Google.org has supported global public health through its “Predict and Prevent” initiative with the aim of using the power of information and technology to address emerging infectious diseases by helping the world to know where to look for these diseases, find the threats earlier, and respond to them faster [35]. Google.org has focused its support on sequencing and pathogen discovery activities, bringing genomic technologies to resource-limited settings in East Africa, improving surveillance networks and systems, and exploring how our core competence in internet search can assist the infectious diseases community [36]. As firm supporters of the open access model for scientific publication [37], Google.org is pleased to support this series of essays, The Genomics of Emerging Infectious Disease, in partnership with the Public Library of Science (PLoS) journals (PLoS Biology, PLoS Computational Biology, PLoS Genetics, PLoS Medicine, PLoS Neglected Tropical Diseases, and PLoS Pathogens), not only to help define the current state of the art in pathogen genomics, but also, we hope, to stimulate debate on priorities for research and technology development.

Moving beyond Discourse into Action

References

1. Yudell M, DeSalle R (2002) The genomic revolution: Unveiling the unity of life. Washington (D.C.): Joseph Henry Press. 272 p.
2. Langston AA, Malone KE, Thompson JD, Daling JR, Ostrander EA (1996) BRCA1 mutations in a population-based sample of young women with breast cancer. N Engl J Med 334: 137–142.
3. Futreal P, Liu Q, Shattuck-Eidens D, Cochran C, Harshman K, et al. (1994) BRCA1 mutations in primary breast and ovarian carcinomas. Science 266: 120–122.
4. Helgadottir A, Manolescu A, Thorleifsson G, Gretarsdottir S, Jonsdottir H, et al. (2004) The gene encoding 5-lipoxygenase activating protein confers risk of myocardial infarction and stroke. Nature Genetics 36: 233–239.
5. Wellcome Trust C (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.
6. Consortium G (2007) New models of collaboration in genome-wide association studies: The Genetic Association Information Network. Nat Genet 39: 1045–1051.
7. Vigneri P, Wang J (2001) Induction of apoptosis in chronic myelogenous leukemia cells through nuclear entrapment of BCR-ABL tyrosine kinase. Nat Med 7: 228–234.
8. Gresham D, Kruglyak L (2008) Rise of the machines. PLoS Genet 4: e1000134. doi:10.1371/journal.pgen.1000134.
9. Spencer G (2008) Researchers establish international human microbiome consortium. NIH News. Available: http://www.nih.gov/news/health/oct2008/nhgri-16.htm. Accessed 19 September 2009.
10. Spencer G (2008) International consortium announces the 1000 Genomes Project. NIH News. Available: http://www.nih.gov/news/health/jan2008/nhgri-22.htm. Accessed 19 September 2009.
11. Martinez-Cajas JL, Wainberg MA (2008) Antiretroviral therapy: Optimal sequencing of therapy to avoid resistance. Drugs 68: 43–72.
12. Wilkinson KA, Gorelick RJ, Vasa SM, Guex N, Rein A, et al. (2008) High-throughput SHAPE analysis reveals structures in HIV-1 Genomic RNA strongly conserved across distinct biological states. PLoS Biol 6: e96. doi:10.1371/journal.pbio.0060096.
13. Smith CV, Sacchettini JC (2003) Mycobacterium tuberculosis: A model system for structural genomics. Curr Opin Struct Biol 13: 658–664.
14. Cockle PJ, Gordon SV, Lalvani A, Buddle BM, Hewinson RG, et al. (2002) Identification of novel Mycobacterium tuberculosis antigens with potential as diagnostic reagents or subunit vaccine candidates by comparative genomics. Infect Immun 70: 6996–7003.
15. Gonzales JM, Patel JJ, Ponmee N, Jiang L, Tan A, et al. (2008) Regulatory hotspots in the malaria parasite genome dictate transcriptional variation. PLoS Biol 6: e238. doi:10.1371/journal.pbio.0060238.
16. Ekland EH, Fidock DA (2007) Advances in understanding the genetic basis of antimalarial drug resistance. Curr Opin Microbiol 10: 363–370.
17. Beaty BJ, Prager DJ, James AA, Jacobs-Lorena M, Miller LH, et al. (2009) From Tucson to genomics and transgenics: The Vector Biology Network and the emergence of modern vector biology. PLoS Negl Trop Dis 3: e343. doi:10.1371/journal.pntd.0000343.
18. Hertz-Fowler C, Figueiredo LM, Quail MA, Becker M, Jackson A, et al. (2008) Telomeric expression sites are highly conserved in Trypanosoma brucei. PLoS ONE 3: e3527. doi:10.1371/journal.pone.0003527.
19. Wolfe N, Heneine W, Carr J, Garcia A, Shanmugam V, et al. (2005) Emergence of unique primate T-lymphotropic viruses among central African bushmeat hunters. Proc Natl Acad Sci U S A 102: 7994–7999.
20. Palacios G, Druce J, Du L, Tran T, Birch C, et al. (2008) A new arenavirus in a cluster of fatal transplant-associated diseases. N Engl J Med 358: 991–998.
21. Grant P, Garson J, Tedder R, Chan P, Tam J, et al. (2003) Detection of SARS coronavirus in plasma by real-time RT-PCR. N Engl J Med 349: 2468.
22. Marra M, Jones S, Astell C, Holt R, Brooks-Wilson A, et al. (2003) The genome sequence of the SARS-associated coronavirus. Science 300: 1399–1404.
23. Gu J, Xie Z, Gao Z, Liu J, Korteweg C, et al. (2007) H5N1 infection of the respiratory tract and beyond: A molecular pathology study. Lancet 370: 1137–1145.
24. Zhao Z-M, Shortridge KF, Garcia M, Guan Y, Wan X-F (2008) Genotypic diversity of H5N1 highly pathogenic avian influenza viruses. J Gen Virol 89: 2182–2193.
25. Garten RJ, Davis CT, Russell CA, Shu B, Lindstrom S, et al. (2009) Antigenic and genetic characteristics of swine-origin 2009 A(H1N1) influenza viruses circulating in humans. Science 325: 197–201.
26. Shinde V, Bridges CB, Uyeki TM, Shu B, Balish A, et al. (2009) Triple-reassortant swine influenza A (H1) in humans in the United States, 2005–2009. N Engl J Med 360: 2616–2625.
27. Consortium IHGS (2001) Initial sequencing and analysis of the human genome. Nature 409: 860–921.
28. Collins FS, Morgan M, Patrinos A (2003) The Human Genome Project: Lessons from large-scale biology. Science 300: 286–290.
29. Wellcome Trust Sanger Institute (2009) Pathogen genomics [Web site]. Available: http://www.sanger.ac.uk/Projects/Pathogens/. Accessed 11 August 2009.
30. National Institute of Allergy and Infectious Disease (2009) Microbial Genome Sequencing Centers: Completed NIAID-Supported Sequencing Projects. Available: http://www3.niaid.nih.gov/research/resources/mscs/completed.htm. Accessed 11 August 2009.
31. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, et al. (1998) Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393: 537–544.
32. Gardner MJ, Hall N, Fung E, White O, Berriman M, et al. (2002) Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 419: 498–511.
33. Greene JM, Collins F, Lefkowitz EJ, Roos D, Scheuermann RH, et al. (2007) National Institute of Allergy and Infectious Diseases bioinformatics resource centers: New assets for pathogen informatics. Infect Immun 75: 3212–3219.
34. Field D, Garrity G, Gray T, Morrison N, Selengut J (2008) The minimum information about a genome sequence (MIGS) specification. Nat Biotechnol 26: 541–547.
35. Google.org (2008) Predict and Prevent initiative. Available: http://www.google.org/predict.html. Accessed 19 September 2009.
36. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, et al. (2009) Detecting influenza epidemics using search engine query data. Nature 457: 1012–1014.
37. Gass A (2004) Open access as public policy. PLoS Biol 2: e353.
doi:10.1371/journal.pbio.0020353.

Perspective

The Role of Genomics in the Identification, Prediction, and Prevention of Biological Threats

W. Florian Fricke, David A. Rasko, Jacques Ravel*

Institute for Genome Sciences (IGS), University of Maryland School of Medicine, Baltimore, Maryland, United States of America

Since the publication in 1995 of the first complete genome sequence of a free-living organism, the bacterium Haemophilus influenzae [1], more than 1,000 genomes of species from all three domains of life (Bacteria, Archaea, and Eukarya) have been completed and a staggering 4,300 are in progress (not including an even larger number of viral genome projects) (GOLD, Genomes Online Database v. 2.0; http://www.genomesonline.org/gold.cgi, as of August 2009). Whole-genome shotgun sequencing remains the standard in biomedical, biotechnological, environmental, agricultural, and evolutionary genomics (http://genomesonline.org/gold_statistics.htm#aname). While next-generation sequencing technology is changing the field, this approach will continue to be used and lead to a previously unimaginable number of genome sequences, providing opportunities that could not have been thought of a few years ago. These opportunities include studying genomes in real-time to understand the evolution of known pathogens and predict the emergence of new infectious agents (Box 1).

With the introduction of next-generation sequencing platforms, cost has decreased dramatically, resulting in genomics no longer being an independent discipline, but becoming a tool routinely used in laboratories around the world to address scientific questions. This global sequencing effort has been focusing primarily on pathogenic organisms, which today are still the subject of the majority of genome projects [2].
Sequencing two to five strains of the same pathogen has, in recent years, afforded us not only a better understanding of evolution, virulence, and biology in general [3], but, taken to the next level (hundreds or thousands of strains) it will enable even more accurate diagnostics to support epidemiological studies, food safety improvements, public health protection, and forensics investigations, among others.

Biodefense Funding for Genomic Research

Since the anthrax letter attacks of 2001, when letters containing anthrax spores were mailed to several news media offices and two Democratic senators in the United States, killing five people and infecting 17 others, funding agencies in the US and other countries have prioritized research projects on organisms that might potentially challenge our security and economy should they be used as biological weapons. This has resulted in large amounts of funding dedicated to so-called “biodefense” research, totaling close to $50 billion between 2001 and 2009 [4]. Genomics has benefited greatly from this influx of research dollars and as a result, representatives of most major animal, plant, and human pathogens have been sequenced (http://www.pathogenportal.org/). Supported by federal funds from the National Institutes of Health (NIH), the National Institute of Allergy and Infectious Diseases (NIAID), and the US Department of Defense, research programs, such as the Microbial Sequencing Centers and the Bioinformatics Resource Centers (http://www3.niaid.nih.gov/topics/pathogenGenomics/PDF/genomicsinitiatives.htm), have been established that carry out genomics research on pathogenic organisms and have spearheaded a new phase of the genomics revolution. Similar programs were started in Europe, such as those at the Wellcome Trust Sanger Institute in the United Kingdom, and the multinational European effort, The Network of Excellence EuroPathoGenomics (http://www.noe-epg.uni-wuerzburg.de/epg_general.htm).
As an example of the success of these types of programs, the genome sequences of over 90,000 influenza viruses were rapidly generated and are now deposited in GenBank (http://www.ncbi.nlm.nih.gov/genomes/FLU/aboutdatabase.html). Because of the availability of large sequencing capacity and the large amount of information, the response to the 2009 H1N1 influenza pandemic was rapid and efficient (Box 2): Genomics information was generated within days and validated diagnostic tools were approved within weeks [5,6]. A global response was made possible through tremendous research efforts enabled by genomic research.

Access to and Documentation of Sequence Data

Open access to genomics resources (i.e., raw sequence data and associated publications) is an essential component of the nation’s preparedness for biological threats (biopreparedness), whether intentionally delivered or not. Although some consider open-source genomic resources a threat to security [7] because they make publicly available information that could facilitate the construction of dangerous infectious agents, we strongly disagree with this point of view. Rather, we and others [8] believe that it is an enabling tool more useful to those in charge of our public health and biosecurity than to those with ill intentions. Genomic sequence data can provide a starting point for the development of new vaccines, drugs, and diagnostic tests [9], hence improving public health capabilities and increasing our biopreparedness. Access to the organisms from which the sequences are derived should be restricted, not their genome sequences.

Citation: Fricke WF, Rasko DA, Ravel J (2009) The Role of Genomics in the Identification, Prediction, and Prevention of Biological Threats. PLoS Biol 7(10): e1000217. doi:10.1371/journal.pbio.1000217

Published October 26, 2009

Copyright: © 2009 Fricke et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Competing Interests: The authors have declared that no competing interests exist.

The Perspective section provides experts with a forum to comment on topical or controversial issues of broad interest.

* E-mail: [email protected]

This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).

PLoS Biology | www.plosbiology.org | October 2009 | Volume 7 | Issue 10 | e1000217

Author Summary

In all likelihood, it is only a matter of time before our public health system will face a major biological threat, whether intentionally dispersed or originating from a known or newly emerging infectious disease. It is necessary not only to increase our reactive “biodefense,” but also to be proactive and increase our preparedness. To achieve this goal, it is essential that the scientific and public health communities fully embrace the genomic revolution, and that novel bioinformatic and computing tools necessary to make great strides in our understanding of these novel and emerging threats be developed. Genomics has graduated from a specialized field of science to a research tool that soon will be routine in research laboratories and clinical settings. Because the technology is becoming more affordable, genomics can and should be used proactively to build our preparedness and responsiveness to biological threats. All pieces, including major continued funding, advances in next-generation sequencing technologies, bioinformatics infrastructures, and open access to data and metadata, are being set in place for genomics to play a central role in our public health system.
Now that genomics technologies are broadly available, there is the potential for commercial interests to hamper the release of genomic data in the public domain. Thus it is important that federally funded large-scale genome sequencing efforts have enforceable rapid release policies. This accessibility could afford further opportunities to capitalize on investments in genome sequencing by providing the necessary resources to biopreparedness. Whereas genome projects aimed at sequencing one, two, or three isolates of a pathogen seemed adequate a few years ago, it is now possible to rapidly sequence hundreds of individual genomes for each species. Access to relevant, well-curated culture collections [10] and DNA preparations suitable for sequencing may become a bottleneck in the future when sequencing resources are no longer limiting. More importantly, the impact of large genomic sequence datasets from clinical isolates will be limited without key clinical metadata that characterize these isolates, such as patients’ medical information, date of isolation, and the number of culture passages in the laboratory. Open access to large numbers of sequences and associated metadata allows for powerful comparative genomic analyses and thus provides major insights into the characteristics of a pathogen.

Box 1. Hot Spots for the Emergence of Infectious Disease

Can we define “hot spots” of microbial populations where new infectious diseases are more likely to evolve? Human contact with new types of infectious agents precedes the emergence of infectious diseases. Infectious agents can be new in the sense of not having previously infected humans or new in the sense that a combination of preexisting genetic factors (for example, mobile elements or regulatory elements) have reassembled to give rise to an infectious agent with a substantially altered genome. The Ebola virus, which first emerged by infecting humans in 1976 in Zaire [21], is an example of the former, whereas the acquisition of antimicrobial resistance by Acinetobacter baumannii [22] is an example of the latter. In both cases, a change in the selective pressure on an infectious agent allows its emergence from a specific setting. This selective pressure may be, for example, the new niche that the human host provides to the pathogen or the antimicrobial selection on a pathogen. Since both events rely on preexisting genetic resources and not on the de novo evolution of virulence factors, the potential of a setting to serve as a hot spot or reservoir for an emerging infectious disease is theoretically predictable from the examination of the total metagenome. In this scenario, traditional microbiological approaches that focus on single isolates of bacteria or viruses are limited in their predictive power since they lack a view of the complete genetic landscape. The potential infectious disease agent could, however, arise from an environment that only contains pieces of a “virulence puzzle,” i.e., individual virulence factors encoded within the genomes of different organisms (the metagenomic “gene soup”). These pieces would have to be assembled in one species for the new pathogen to emerge as an infectious agent.

Standardized vocabulary should be developed to describe these isolates and the genes they contain. Such efforts have already started, for example through the open-access journal Standards in Genome Sciences (SIGS) (http://standardsingenomics.org/index.php/sigen), but the dedicated resources are not adequate and highlight the lack of understanding of the importance of metadata in genomics. Initiatives such as those of the Genomics Standards Consortium have made great strides [11,12], but still need widespread implementation from the ever-expanding genomic community.
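To make the point about isolate metadata concrete, the following minimal Python sketch shows one way a record for a sequenced clinical isolate could be represented and checked for completeness. The field names (`isolate_id`, `isolation_date`, `culture_passages`, and so on) and the validation rules are assumptions chosen purely for illustration; they are not taken from MIGS or from any Genomics Standards Consortium specification.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class IsolateMetadata:
    """Hypothetical minimal metadata record for one sequenced isolate."""
    isolate_id: str
    species: str
    isolation_date: Optional[date]   # date the isolate was collected
    culture_passages: Optional[int]  # number of laboratory passages
    host: Optional[str] = None
    location: Optional[str] = None


def missing_fields(record: IsolateMetadata) -> List[str]:
    """Return the names of key fields that are absent or implausible."""
    problems = []
    if not record.isolate_id:
        problems.append("isolate_id")
    if not record.species:
        problems.append("species")
    if record.isolation_date is None:
        problems.append("isolation_date")
    if record.culture_passages is None or record.culture_passages < 0:
        problems.append("culture_passages")
    return problems


# Made-up example values, used only to exercise the check.
complete = IsolateMetadata("ISO-001", "Escherichia coli", date(2006, 9, 1), 2)
partial = IsolateMetadata("ISO-002", "Escherichia coli", None, None)
```

A check of this kind could run at submission time, so that sequences lacking a date of isolation or a passage count are flagged before they enter a shared database.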
Open access to the genomic DNA that has been sequenced or the culture from which the DNA was extracted and to the associated metadata is key to successful genome sequencing projects, whether on single or several hundred genomes or metagenomes. Well-documented genome sequence data will form a key growing resource for biodefense and other research fields.

Emerging New Bioinformatics Resources

As we enter a new era of modern genomics, the ever-expanding sequence datasets are becoming more challenging to analyze. Future analysts will require powerful new bioinformatics tools in conjunction with new computer systems engineered with genomic analysis in mind. New open-source bioinformatics software tools are being developed that exploit Web-based services and the increasing computing power provided by academic and commercial “cloud computing networks” (large computing resources provided as a service over the Internet). For example, “Science Clouds” (http://workspace.globus.org/clouds/) allow members of the scientific community to lease cloud computing resources free of charge. To leverage these capabilities, novel cloud-optimized bioinformatics tools are being developed, such as the genome sequence read mapper CloudBurst [13]. In addition, novel resources are currently under development to increase the availability of open-source bioinformatics tools for cloud computing (http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0949201; http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0844494). These emerging tools make access to the World Wide Web the only requirement to join the genomic revolution and achieve large-scale bioinformatics analyses that would not be possible on local servers. As a consequence, it is conceivable that in the future genomic research will increasingly move away from the large sequencing centers toward a more decentralized organization.

Box 2. Pandemic H1N1 2009 Influenza: A Recent Example of the Impact of Genomics on Biopreparedness

Genomics can be readily applied to follow outbreaks of infectious diseases. This was clearly illustrated during the severe acute respiratory syndrome (SARS) outbreak in 2002–2003 and the emergence and worldwide spread of the pandemic H1N1 2009 influenza virus this year. In both cases, genomics played a key role in the immediate response to the outbreak. Initially, very little was known about the virus responsible for the SARS outbreak. Pangenomic virus microarrays identified it as a coronavirus [23]; however, it was only through detailed sequencing that the specific genotype of this virus could be determined [24]. Comparative sequence analysis identified the SARS virus as distinct from other coronaviruses in terms of its encoded proteins responsible for antigen presentation. This finding ultimately led to the development of diagnostics [25] and potential therapeutics [26]. This example of a sequencing approach as a rapid response to a virus outbreak demonstrates that genomics can be a useful and important, if not essential, epidemiological tool. In the ongoing H1N1 influenza outbreak, the National Center for Biotechnology Information (NCBI) established the Influenza Virus Resource (a database and tool for flu sequence analysis, annotation, and submission to GenBank; http://www.ncbi.nlm.nih.gov/genomes/FLU/SwineFlu.html), containing 462 complete viral genome sequences from worldwide viral samples (as of September, 2009). Some of the genomic data was completed, compared, and released to the public within two weeks of isolation of the DNA. The rapid generation of genome sequence data is providing a paradigm shift in the analysis of infectious disease outbreaks, from more classical methods of isolation to the rapid molecular examination of the pathogen in question.

Decentralized rapid genome sequencing and bioinformatic analysis of infectious agents will enable near-real-time global surveillance, detection of new pathogens, new virulence factors, antimicrobial resistance determinants, or engineered organisms.

Population Genomics Applied to Single Cultures

Because the resources for affordable high-throughput sequencing, data processing, and analysis are available, the time is right to think about microbial population genomics and large-scale microbial metagenomics in the context of biodefense research (Box 3). Traditionally, the concept of population genomics has applied to variation within a species. However, a bacterial culture, even if derived from a single clone, is composed of millions of cells that are not necessarily identical at the genome sequence level, hence forming a population of genomes. Therefore we propose to apply the concept of population genomics to microbial cultures. The assemblage of genotypes defines what is called a “culture,” “culture stock,” or “reference strain.” Population genomics addresses the genomic diversity within these assemblages and has significant implications for many fields of research but, most importantly, for pathogen evolution, diagnostics, epidemiology, and microbial forensics.

For example, following the anthrax mail attacks of 2001, microbiologists and genomicists joined forces to characterize the unique genetic traits of the Bacillus anthracis spores recovered from the envelopes, which were quickly identified as the B. anthracis Ames strain (DAAR et al., unpublished data). Sequencing the genome of several single colonies obtained from the spores revealed that the entire chromosome and its associated plasmids were 100% identical to the genome sequence of the ancestral B. anthracis Ames strain that was stored for over 20 years in a military laboratory in Frederick, Maryland.
The only genotypic differences were found in a small, phenotypically and genetically distinct portion of cells grown from the spores used in the attacks. Genomic characterization of these phenotypic variants revealed a number of unique genetic alterations that together provided a characteristic DNA fingerprint of the spore population that could be unequivocally matched to the spore sample used in the attacks. Using this fingerprint, a genetic assay was developed to screen a B. anthracis spore repository, which identified the origin of the spores as a single spore stock of B. anthracis Ames. This stock was stored at the US Army Medical Research Institute for Infectious Diseases in Fort Detrick, Maryland, narrowing the pool of suspects to a manageable number (those who had access to the spore stock) for the investigative team. The police investigation that followed identified a potential suspect as the custodian of the spore stock. This was the first use of microbial genomics as an essential tool in a forensic investigation.

In the course of the investigation, scientists had to establish culture repositories from strains used in research in the US and build databases of genome sequences of all B. anthracis isolates. This work took several years and delayed the investigation significantly. A lesson to be learned from this investigation should therefore be that there is a need for comprehensive databases of unique DNA fingerprints of stocks of potentially threatening pathogens. In the event that another bioterror attack were to take place such genomic databases would be key in quickly establishing the source of the biological material.

The concept of population genomics also applies to epidemiological studies of outbreaks of infectious diseases such as those caused by food-borne or zoonotic pathogens, such as Salmonella spp.
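The idea of screening a repository against a DNA fingerprint, described above for the anthrax investigation, can be illustrated with a short Python sketch. This is not the assay used in the investigation: it assumes, for illustration only, that a fingerprint can be reduced to a set of variant labels and that a stock matches when it carries every variant seen in the evidence sample. The function name `screen_repository` and all stock and variant labels are hypothetical.

```python
from typing import Dict, FrozenSet, List

# Assumption: a "fingerprint" is simply a set of variant labels.
Fingerprint = FrozenSet[str]


def screen_repository(
    evidence: Fingerprint, repository: Dict[str, Fingerprint]
) -> List[str]:
    """Return IDs of stocks whose fingerprint contains every evidence variant."""
    return sorted(
        stock_id
        for stock_id, variants in repository.items()
        if evidence <= variants  # subset test: all evidence variants present
    )


# Made-up example data; the labels carry no biological meaning.
repository = {
    "stock_A": frozenset({"var1", "var2", "var3", "var4"}),
    "stock_B": frozenset({"var1", "var3"}),
    "stock_C": frozenset({"var2", "var4"}),
}
evidence = frozenset({"var1", "var2", "var3", "var4"})

matches = screen_repository(evidence, repository)
print(matches)  # ['stock_A']
```

With a fingerprint database already in place, a lookup of this kind is fast; the investigation described above was slowed by having to build the repository and the database first.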
Traditionally, epidemiologists and pathologists have used low-resolution methods such as pulsed-field gel electrophoresis (PFGE), multi-locus sequence typing (MLST), or multi-locus variable number tandem repeats analysis (MLVA) to trace an individual isolate from a patient back to a potentially infected food source or to isolates from other patients [14–17]. In 2006, for example, during an outbreak of pathogenic Escherichia coli O157:H7 infections in 26 states of the US, which was caused by contaminated spinach, isolates of the pathogen were recovered from cows and wild pigs (the zoonotic reservoirs), bags of spinach (the vehicle of transmission), and ill patients (http://www.cdc.gov/mmwr/preview/mmwrhtml/mm55d926a1.htm). One of these isolates was designated as the reference for the outbreak based on conserved PFGE patterns. Genome sequencing of several isolates from the same outbreak performed in our laboratory, however, revealed genomic variations that questioned a direct evolutionary link between all outbreak-associated isolates (Eppinger et al., unpublished data). Comparative genomics followed by whole-genome phylogenetic analyses based on single nucleotide polymorphisms demonstrated that these isolates were indeed closely related to one another and only distantly related to other E. coli O157:H7 isolates, hence linking all isolates to the same outbreak, something that was not possible using PFGE patterns. In this case, phylogenetic analyses suggest that several highly related genotypes were at the source of the outbreak, thus challenging the utility of assigning a single reference strain to a specific outbreak.

Box 3. Simple Genomics, Population Genomics, and Metagenomics

It is now technically possible and scientifically desirable to combine sequencing projects on single genomes, genome populations, and metagenomes to study genome evolution. Single-genome projects provide the greatest resolution for identifying genetic factors responsible for specific virulence phenotypes and provide answers to many important questions, such as: What is the minimal gene set in a pathogen required to cause a specific disease phenotype? What does the genetic context of virulence or antibiotic resistance factors tell us about their evolutionary origin or the mobility between different microbial species or even genera? Population-level genome sequencing projects provide us with information about the pangenomic gene pool and the potential of a species to evolve into a novel pathogen. Are certain bacterial species or strains more likely than others to evolve pathogenic traits? What distinguishes a commensal from a pathogenic isolate? What provides the trigger or ability to convert a commensal or opportunistic strain into a pathogen? What role does horizontal gene transfer play in species evolution? Is an infection always caused by an individual isolate or might infection be caused by a combination of individuals in a population that all have different attenuated infectious potentials? Metagenomics projects sample the genetic reservoir (the set of genes carried by all members of a community) within a specific environment or sample. This “gene soup” reflects the maximum genetic potential accessible to individual isolates by horizontal gene transfer.

ciated microbial communities (e.g., Vibrio cholerae, the etiologic agent of cholera) but potentially also by slight shifts in the proportions of different populations within the community that give an otherwise harmless species or strain an undesirable advantage over others, a similar situation to what is observed in bacterial vaginosis [20]. Probiotic dietary supplements of live microorganisms deliver beneficial bacteria that promote a healthy state of the targeted microbiota. As a completely hypothetical possibility, the opposite would also be plausible, where the healthy microbiota (skin, gut, or upper respiratory tract, among others) may be disturbed by introducing large amounts of “contrabiotics,” i.e., living nonpathogenic bacteria that would shift the microbiota away from a healthy state. A better understanding of the ecological principles that shape the composition of our microbiome might contribute to our biopreparedness for such a threat to public health.

The field of biodefense has thoroughly embraced genomics and made it a keystone for developing better identification technologies, diagnostic tools, and vaccines and improving our understanding of pathogen virulence and evolution. Enabling technologies and bioinformatics tools have shifted genomics from a separate research discipline to a tool so powerful that it can provide novel insights that were not imaginable a few years ago, including for example redefining the notion of strains or cultures in the context of biopreparedness or microbial forensics. Challenges remain, though, mostly in the form of the large amounts of data that are being generated, and will continue to be generated in the future, and are becoming difficult to manage. Better bioinformatic algorithms, access to faster computing capabilities, larger or novel and more efficient data storage devices, and better training in genomics are all in critical demand, and will be required to fully embrace the genomic revolution. Our nation’s preparedness for biological threats, whether they are deliberate or not, and our public health system would benefit greatly by leveraging these capabilities into better real-time diagnostics (in the environment as well as at the bedside), vaccines, a greater understanding of the evolutionary process that makes a friendly microbe become a pathogen (Box 3) (hence to better predict what microbial foes will be facing us in the near future), and better forensics and epidemiological tools. The time is right to be bold and capitalize on these enabling technological advances to sequence microbial species or complex microbial communities to the greatest level possible (that is, hundreds of genomes per species or samples), but let us not forget that informatics and computing resources are now becoming the bottleneck to actually making major progress in this field.

4. Franco C (2008) Billions for biodefense: Federal agency biodefense funding, FY2008-FY2009. Biosecur Bioterror 6: 131–146.
5. Rowe T, Abernathy RA, Hu-Primmer J, Thompson WW, Lu X, et al. (1999) Detection of antibody to avian influenza A (H5N1) virus in human serum by using a combination of serologic assays. J Clin Microbiol 37: 937–943.
6. Maurer-Stroh S, Ma J, Lee RT, Sirota FL, Eisenhaber F (2009) Mapping the sequence mutations of the 2009 H1N1 influenza A virus neuraminidase relative to drug and antibody binding sites. Biol Direct 4: 18.
7. Aldhous P (2001) Biologists urged to address risk of data aiding bioweapon design. Nature 414: 237–238.
8. Read TD, Parkhill J (2002) Restricting genome data won’t stop bioterrorism. Nature 417: 379.
9. Bambini S, Rappuoli R (2009) The use of genomics in microbial vaccine development. Drug Discov Today 14: 252–260.
10. Tindall BJ, Garrity GM (2008) Proposals to clarify how type strains are deposited and made available to the scientific community for the purpose of systematic research. Int J Syst Evol Microbiol 58: 1987–1990.
11. Garrity GM, Field D, Kyrpides N, Hirschman L, Sansone SA, et al. (2008) Toward a standards-
Instead, collecting and sequencing tens or hundreds of isolates from each source or patient linked to an outbreak would provide a better basis for understanding the genomic diversity within the outbreak population and would aid in defining the population dynamics of an outbreak. A New Concept: Contrabiotics Insufficient attention has been paid to the human microbiome (i.e., the consortium of microbes that inhabit the human body) as it relates to our efforts to increase biopreparedness. New analyses of the diversity and composition of the human microbiome are making it increasingly clear that human health depends on a delicate equilibrium between the microbial inhabitants and the human host [18,19]. Severe effects on health could be caused not only by the introduction of true pathogens in the traditional sense into these human-asso- Challenges for the Future References 1. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, et al. (1995) Wholegenome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496–512. 2. Guzman E, Romeu A, Garcia-Vallve S (2008) Completely sequenced genomes of pathogenic bacteria: A review. Enferm Infecc Microbiol Clin 26: 88–98. 3. Binnewies TT, Motro Y, Hallin PF, Lund O, Dunn D, et al. (2006) Ten years of bacterial genome sequencing: Comparative-genomicsbased discoveries. Funct Integr Genomics 6: 165–185. PLoS Biology | www.plosbiology.org 4 October 2009 | Volume 7 | Issue 10 | e1000217 12. 13. 14. 15. 16. compliant genomic and metagenomic publication record. OMICS 12: 157–160. Field D, Garrity GM, Sansone SA, Sterk P, Gray T, et al. (2008) Meeting report: The fifth Genomic Standards Consortium (GSC) workshop. OMICS 12: 109–113. Schatz MC (2009) CloudBurst: Highly sensitive read mapping with MapReduce. Bioinformatics 25: 1363–1369. Gerner-Smidt P, Hise K, Kincaid J, Hunter S, Rolando S, et al. (2006) PulseNet USA: A fiveyear update. Foodborne Pathog Dis 3: 9–19. 
Urwin R, Maiden MC (2003) Multi-locus sequence typing: A tool for global epidemiology. Trends Microbiol 11: 479–487. Keim P, Price LB, Klevytska AM, Smith KL, Schupp JM, et al. (2000) Multiple-locus variablenumber tandem repeat analysis reveals genetic relationships within Bacillus anthracis. J Bacteriol 182: 2928–2936. PLoS Biology | www.plosbiology.org 17. Boxrud D, Pederson-Gulrud K, Wotton J, Medus C, Lyszkowicz E, et al. (2007) Comparison of multiple-locus variable-number tandem repeat analysis, pulsed-field gel electrophoresis, and phage typing for subtype analysis of Salmonella enterica serotype Enteritidis. J Clin Microbiol 45: 536–543. 18. Gao Z, Tseng CH, Strober BE, Pei Z, Blaser MJ (2008) Substantial alterations of the cutaneous bacterial biota in psoriatic lesions. PLoS One 3: e2719. 19. Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, et al. (2006) An obesityassociated gut microbiome with increased capacity for energy harvest. Nature 444: 1027–1031. 20. Srinivasan S, Fredricks DN (2008) The human vaginal bacterial biota and bacterial vaginosis. Interdiscip Perspect Infect Dis 2008: 750479. 21. Pourrut X, Kumulungui B, Wittmann T, Moussavou G, Delicat A, et al. (2005) The 5 22. 23. 24. 25. 26. natural history of Ebola virus in Africa. Microbes Infect 7: 1005–1014. Peleg AY, Seifert H, Paterson DL (2008) Acinetobacter baumannii: Emergence of a successful pathogen. Clin Microbiol Rev 21: 538–582. Wang D, Urisman A, Liu YT, Springer M, Ksiazek TG, et al. (2003) Viral discovery and sequence recovery using DNA microarrays. PLoS Biol 1: e2. doi:10.1371/journal.pbio.0000002. Marra MA, Jones SJ, Astell CR, Holt RA, Brooks-Wilson A, et al. (2003) The genome sequence of the SARS-associated coronavirus. Science 300: 1399–1404. Zhu M (2004) SARS immunity and vaccination. Cell Mol Immunol 1: 193–198. Haagmans BL, Osterhaus AD (2006) Coronaviruses and their therapy. Antiviral Res 71: 397–403. 
Perspective

Discovering the Phylodynamics of RNA Viruses

Edward C. Holmes1,2*, Bryan T. Grenfell2,3

1 Center for Infectious Disease Dynamics, Department of Biology, The Pennsylvania State University, Mueller Laboratory, University Park, Pennsylvania, United States of America, 2 Fogarty International Center, National Institutes of Health, Bethesda, Maryland, United States of America, 3 Department of Ecology and Evolutionary Biology and Woodrow Wilson School, Princeton University, Princeton, New Jersey, United States of America

Phylodynamics: The Discovery Phase

The advent of extremely high throughput DNA sequencing ensures that genomic data from microbial organisms can be acquired in unprecedented quantities and with remarkable rapidity. Although this genomic revolution will affect all microbes alike, our focus here is on RNA viruses, as the rapidity of their evolution, which is observable over the time scale of human observation, allows phylodynamic inferences to be made with great precision. In the foreseeable future it is likely that complete genome sequencing will become the standard method of viral characterization, providing the highest possible resolution for phylogenetic studies. The rapidity with which genome sequence data were generated from the ongoing epidemic of swine-origin H1N1 influenza A virus [1] is testament to the power of this technology. Understandably, pathogen discovery is a major focus of this new-scale genome sequencing [2]. It is now possible to sequence the entire assemblage of viruses in a particular tissue type or host species [3–5], as well as all those viruses that are associated with specific disease syndromes [6,7]. In essence, this new era of metagenomics constitutes a crucial taxonomic discovery phase in virology and epidemiology that allows the genetic characterization of new viruses within hours of their isolation.
Assembling an inventory of viruses that may emerge in human populations is of major importance to public health and to students of biodiversity. However, it is only the first step in developing a full quantitative understanding of the processes that shape the epidemiology and evolution (the phylodynamics) of RNA virus infections [8]. To achieve this goal, we argue here that the field of viral phylodynamics requires its own discovery phase; that is, a comprehensive and quantitative analysis of the interaction between the ecological and evolutionary dynamics of all circulating RNA viruses from the molecular to the global scale. Such a marriage of phylogenetic and epidemiological dynamics is currently feasible, even in principle, only for the select few human viruses for which large genome sequence datasets have been acquired, such as HIV and influenza A virus, and even here fundamental gaps in our knowledge remain (see below). Indeed, it is striking that so few complete genome sequences are currently available for viruses whose epidemiological dynamics are known in exquisite detail, such as measles [9,10]; these sequences have been so sparsely sampled in both time and space that a full phylodynamic perspective has not yet been achieved. We contend that a better understanding of RNA virus phylodynamics will allow more directed attempts at pathogen surveillance, facilitate more accurate predictions of the epidemiological impact of newly emerged viruses, and assist in the control of those viruses that exhibit complex patterns of antigenic variation such as dengue and influenza. Just as PCR and first-generation DNA sequencing ushered in the science of molecular epidemiology, so next-generation sequencing may herald the age of phylodynamics. Box 1 lists a number of key questions that can be addressed within this phylodynamics research program.
Citation: Holmes EC, Grenfell BT (2009) Discovering the Phylodynamics of RNA Viruses. PLoS Comput Biol 5(10): e1000505. doi:10.1371/journal.pcbi.1000505
Editor: Ernest Fraenkel, Massachusetts Institute of Technology, United States of America
Published October 26, 2009
Copyright: © 2009 Holmes, Grenfell. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: BTG was supported by the RAPIDD program of the Science & Technology Directorate of the Department of Homeland Security and the National Institutes of Health (NIH), and National Science Foundation grant 0742373. ECH was supported by the NIH (grant GM080533). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
This article is part of the "Genomics of Emerging Infectious Disease" PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
PLoS Computational Biology | www.ploscompbiol.org | October 2009 | Volume 5 | Issue 10 | e1000505

Box 1. Key Research Questions in RNA Virus Phylodynamics

(1) What is the range of phylodynamic patterns observed in RNA viruses? Can they be categorized into specific groups? How do these patterns relate to other "life history" variables exhibited by RNA viruses?
(2) What epidemiological and evolutionary processes give rise to these phylodynamic patterns? What generalities can be drawn?
(3) How commonly does natural selection (compared to neutral evolutionary processes) determine the population dynamics of pathogens? On what scale does natural selection act? How does viral immune escape reduce herd immunity at the population level and allow the persistence of viral lineages in epidemic troughs?
(4) What is the range of spatial patterns exhibited by RNA viruses? What epidemiological factors are responsible for these patterns?
(5) How do different viral species (various respiratory viruses, for example) interact in host immunity?

A number of important advances are needed to meet our goal of a comprehensive catalog of the diversity of phylodynamic patterns in RNA viruses. Because answers to many of the most interesting research questions depend on sufficiently large sample sizes, we require large numbers of sequences that have been rigorously sampled according to strict temporal, spatial, and clinical criteria, and as many of these data as possible must be publicly accessible. A phylodynamic analysis has little value unless viral genomes are sampled on the same scale as the epidemiological processes under investigation. The only acute virus for which a suitably expansive genome dataset currently exists is influenza. In this case, the >4,000 complete genomes generated under the Influenza Genome Sequencing Project [11] have provided important new insights into the evolution and epidemiology of this major human pathogen [12]. To highlight one key insight here, these genome sequence data have revealed that multiple lineages of influenza virus are imported and circulate within specific geographic localities (even within relatively isolated populations), generating both frequent mixed infections [13] and reassortment events [14]. Even so, the sampling of these genome sequences (and associated epidemiological covariates) may not be dense enough to fully capture spatial dynamics [15]. There is also a marked absence of samples from asymptomatically infected patients (or those with mild disease), so it is impossible to link genetic variation to clinical syndrome. Such a bias against viruses sampled from individuals with asymptomatic infections is a common problem in molecular epidemiology.

Epidemiological Factors

It is also clear that for many RNA viruses we need to better understand a number of key epidemiological factors, such as the interaction between local persistence, epidemic dynamics in both time and space, the impact of measures to control the spread of infection, and the consequences of adaptive evolution in those viral genes that interact most intimately with the host immune response. It is instructive to imagine the ideal database for addressing these issues. In the case of acute infections, the goal would be to collect four parallel datasets on the appropriate scale of interest during outbreaks (Figure 1). This database would comprise, first, epidemic dynamics in time and space, ideally at a comparable or higher frequency than the generation time of individual infections. Second, and in parallel, our ideal study would collect viral genome sequence data at these time points, sampling both within and among infected hosts. Both disease incidence data (bolstered by contact tracing) and viral sequence data furnish information on the transmission network traced by an outbreak. Third, we would need to know the underlying contact network of susceptible individuals, which serves as fuel for the epidemic. This is a difficult structure to measure directly, although novel measurements of human interactions are increasingly shedding light on the problem [16]. Finally, measurements of the immunity structure of our contact network [17] (reflecting the past history of the virus in the population) are key for understanding both the dynamics of epidemic spread and the evolutionary pressures that shape virus diversity. The outbreak of foot-and-mouth disease (FMD, an RNA virus infection of cattle) in the UK in 2001 resulted in a database that is arguably closest to our ideal on the epidemiological scale [18,19].
Notwithstanding a variety of gaps in data from the epidemic [20], it is one of the best-documented large outbreaks in terms of the availability of spatiotemporal incidence data in parallel with contact tracing and the underlying spatial pattern of the susceptible farms as a measure of the contact network. In addition, analyses of viral sequences from relatively small samples of farms have drawn important conclusions about epidemic spread and allowed the testing of new methods to recover the spatiotemporal patterns written into sequence data [18,20]. Importantly, samples exist from over half the ~2,000 confirmed infected premises in 2001: sequencing whole FMD virus genomes from these samples would provide a vast resource for basic and applied developments in integrating epidemiological and phylogenetic information to dissect spatiotemporal spread. We suggest that achieving this task would be a huge contribution to understanding the phylodynamics of acute viruses. Another virtue of animal infections like FMD is that the relationship between the determinants of viral variability within and between hosts can also be dissected by experimental infections (see [21] for another example).

Figure 1. Sampling scales for acute RNA viruses and the associated phylodynamic processes that viral genome sequence data and host sampling can elucidate. doi:10.1371/journal.pcbi.1000505.g001

A parallel limitation of many phylogenetic approaches to viral epidemiology is that they have often proceeded in the absence of the necessary metadata, such as the precise time and place of sampling or those that relate to clinical syndrome [22].
A perhaps more challenging goal for phylodynamics is therefore to integrate phylogenetic patterns with other biological variables, such as the nature of antigenic variation, the capacity for drug resistance, or the clinical syndrome of the host, as well as the spatial host network data outlined above. Cohort studies may be the most productive way to link genomics with epidemiological variables. The lack of a synthesis of phylogenetic and phenotypic/epidemiological data is reflected in the current debate over the mode of antigenic evolution in human influenza A virus. Although it has long been known that the hemagglutinin (HA) and neuraminidase (NA) proteins of human influenza A virus evolve by strong natural selection to evade the host immune response, a process commonly called antigenic drift [23,24], the precise mechanisms by which such drift occurs are uncertain. From a phylodynamics perspective, the key observation is that over long time periods a single lineage of HA sequences from subtype A/H3N2 influenza viruses links epidemic to epidemic [23], although intensive sampling has revealed that single populations may harbor far higher levels of genetic diversity [25]. Rather different phylodynamic patterns are seen in other influenza viruses, including those sampled from birds (Figure 2). Three models have been proposed to explain the distinctive phylodynamic pattern observed in human A/H3N2 viruses: (i) that there is short-lived cross-immunity among viral strains [26], (ii) that the HA evolves in a punctuated manner among antigenic types that are linked by a network of neutrally evolving sites [27], and (iii) that the virus continually reuses a limited number of antigenic combinations [28].
Determining which combination of these models best explains influenza phylodynamics will require more expansive genome sequence data, as well as focused sampling and epidemiological surveillance in Southeast Asia, which is likely the global source population for the virus [29]. More importantly, it is also crucial that these phylogenetic data be combined with detailed, spatiotemporally disaggregated antigenic information. Indeed, it is remarkable that despite the abundance of information on the antigenic characteristics of individual influenza viruses, most notably through the use of the hemagglutination inhibition (HI) assay [17], these data have not been routinely linked to phylogenetic information. It is clear that both antigenic and phylogenetic analyses would greatly benefit from each other.

Figure 2. Phylodynamic patterns of human and avian influenza viruses. The left diagram shows the phylogeny of the hemagglutinin (HA) gene of human H3N2 influenza A viruses sampled between 1985 and 2005, revealing the "ladder-like" branching structure indicative of antigenic drift. By comparison, the phylogeny of the HA gene of human influenza B virus sampled over the same interval (center diagram) shows the cocirculation of the antigenically distinct "Victoria 1987" and "Yamagata 1988" lineages, as well as a shorter length from root to tip, reflecting a lower rate of evolutionary change. Finally, the phylogeny for the HA gene of H4 avian influenza virus (right diagram) reveals the deep geographic division between the Eurasian and Australian versus North American lineages of this virus. doi:10.1371/journal.pcbi.1000505.g002

New-Generation Computational Tools

Another important challenge for phylodynamics is to match the remarkable ongoing developments in genome sequencing technology with a corresponding increase in the power of the computational tools available to analyze these sequence data. Crucially, in phylogenetics, the size of the space of possible trees increases faster than exponentially with the number of sequences, such that the availability of datasets comprising thousands of complete genomes [30] presents a major combinatorial problem. This problem creates a growing discrepancy between our ability to generate genome sequence data and our capacity to analyze them using the most sophisticated methods. Redressing this balance should be the major goal of bioinformatics in the future, and in fact some progress has been made recently [31].

It is also clear that improvements need to be made to the methods that are available to analyze genome sequence data. A powerful set of research tools in this area comprises those based on coalescent theory, as this provides a natural link between the analysis of epidemiological and phylogenetic patterns [8,32]. In particular, the coalescent allows the demographic characteristics of viral populations (particularly population size and growth rate) to be inferred directly from gene sequence data. Coalescent analyses are especially powerful in the case of RNA viruses, because their rapid evolution means that temporal and spatial dynamics are discernible over the period of human observation [33] and can in theory be combined with time series epidemiological data. However, currently available coalescent methods are restricted by the limited scope of demographic models and their inability to fully incorporate spatial information. In particular, most acute RNA viruses have complex population dynamics that combine distinct periods of growth and decline.
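The way the coalescent ties genealogies to population size can be illustrated with a minimal sketch. It assumes the simplest case only: a neutral, constant-size (Kingman) coalescent, which is precisely the kind of restricted demographic model the text describes as a limitation. Under that assumption, the expected time during which k lineages persist is the scaled population size divided by k(k-1)/2. The function names and the example value of 10 time units are hypothetical and chosen for illustration; they are not taken from the article.

```python
from math import comb


def expected_coalescent_intervals(n_lineages, ne_tau):
    """Expected waiting times while k = n, n-1, ..., 2 lineages remain.

    Assumes a neutral, constant-size Kingman coalescent. `ne_tau` is the
    effective population size multiplied by the generation time, so the
    returned durations are in the same time units as `ne_tau`.
    """
    return {k: ne_tau / comb(k, 2) for k in range(n_lineages, 1, -1)}


def estimate_ne_tau(intervals):
    """Simple moment estimate of ne_tau from intervals {k: duration}.

    Each interval is rescaled by the number of lineage pairs, k(k-1)/2,
    and the rescaled values are averaged.
    """
    estimates = [duration * comb(k, 2) for k, duration in intervals.items()]
    return sum(estimates) / len(estimates)


# Hypothetical example: five sampled sequences, scaled population size 10.
intervals = expected_coalescent_intervals(5, ne_tau=10.0)
tmrca = sum(intervals.values())  # expected time to the most recent common ancestor
recovered = estimate_ne_tau(intervals)
```

Reading the relationship in reverse, as `estimate_ne_tau` does, is what allows population size to be inferred from the branching times of a sampled genealogy; real analyses must also account for uncertainty in the tree and for changing population size.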
The most commonly used phylodynamic tool available in such cases is the Bayesian skyline plot (and the related Bayesian "skyride" [34]), which represents a piecewise graphical depiction of changes in genetic diversity through time [32]. In the case of neutral evolution, such changes in genetic diversity also reflect underlying changes in the number of infected hosts. Although the Bayesian skyline plot can reveal unique features of epidemic dynamics (Figure 3) [30], precise estimates of parameters such as population growth rate are not yet possible. The coalescent methods commonly used to study RNA virus evolution focus largely on temporal dynamics (a natural function of the rapidity of viral evolution), with little consideration of patterns of spatial diffusion. Although these phylogeographic patterns are becoming increasingly well described for RNA viruses [35], few methods effectively recover the spatial component in genome sequence data. For example, commonly used parsimony-based approaches consider a single phylogenetic tree without an explicit spatial model (see, for example, [36]). In addition, these methods usually describe the place of origin and direction of spread of viral lineages without formal tests of competing spatial hypotheses. As a specific case in point, although gravity models (in which patterns of viral transmission reflect the size of and distance between population centers) have been applied successfully to morbidity and mortality data from human influenza A virus to describe its spread across the United States [37], they have yet to be interpreted within a phylogenetic setting. A clear push for the future should therefore be the development of coalescent tools that integrate the analysis of spatial and temporal dynamics within a single framework, with a focus on those that combine phylogenetic data and information on the dynamics of the host contact network of susceptible, infected, and immune individuals.
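The gravity models mentioned above can be made concrete with a short sketch. It assumes the generic textbook form, in which coupling between two population centers grows with their sizes and falls with the distance between them. The exponents, the function name, and the example populations are hypothetical placeholders rather than the fitted values or exact formulation used in [37].

```python
def gravity_coupling(pop_i, pop_j, distance, alpha=1.0, beta=1.0, gamma=2.0):
    """Relative transmission coupling between two population centers.

    Generic gravity form: pop_i**alpha * pop_j**beta / distance**gamma.
    The default exponents are illustrative assumptions, not estimates.
    """
    if distance <= 0:
        raise ValueError("distance must be positive")
    return (pop_i ** alpha) * (pop_j ** beta) / (distance ** gamma)


# Hypothetical example: the same pair of centers at two separations.
near = gravity_coupling(100, 200, distance=10)
far = gravity_coupling(100, 200, distance=20)
```

With a distance exponent of 2, doubling the separation cuts the coupling to a quarter. Interpreting such a model in a phylogenetic setting would mean asking whether lineage movements between locations follow the same size and distance dependence.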
Figure 3. Fluctuating genetic diversity of influenza A virus. The figure shows a Bayesian skyline plot of changing levels of genetic diversity through time for the HA gene (165 sequences) of A/H3N2 virus sampled from the state of New York, US, during the period 2001–2003. The y-axis depicts relative genetic diversity ($N_e\tau$, where $N_e$ is the effective population size and $\tau$ the generation time from infected host to infected host), which can be considered a measure of effective population size under strictly neutral evolution. Peaks of genetic diversity, reflecting the seasonal occurrence of influenza, are clearly visible. See [30] for a more detailed analysis. doi:10.1371/journal.pcbi.1000505.g003

Looking beyond the Consensus Sequence

The vast majority of studies of RNA virus evolution undertaken to date, particularly of those viruses that cause acute infections, rely on the analysis of consensus sequences in which the nucleotide shown for any given site is the most common among all the genomes within a patient. Although the use of consensus sequences is adequate for many aspects of molecular epidemiology, in which complete genomes may suffice to determine even tight transmission chains [20], there is growing evidence that key evolutionary processes occur beyond the consensus. In particular, extensive intra-host gene sequencing has revealed the existence of minor viral subpopulations within individual hosts that are not detected by consensus sequencing and that are sometimes of great phenotypic importance [38,39]. Given the intrinsically high mutation rates of RNA viruses, as well as the immense size of intra-host populations, such extensive genetic and phenotypic diversity is only to be expected.
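What a consensus sequence keeps and what it discards can be shown with a minimal sketch. It assumes the input is a set of already aligned, equal-length sequences from a single host. The function name, the 5% reporting threshold, and the toy reads are hypothetical and chosen only for illustration.

```python
from collections import Counter


def consensus_and_minor_variants(reads, min_freq=0.05):
    """Return the per-site majority sequence and the minor variants it hides.

    `reads` are aligned, equal-length sequences sampled from one host.
    Minor variants are non-consensus bases whose within-host frequency is
    at least `min_freq` (an arbitrary illustrative threshold).
    """
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads must be aligned to the same length")

    consensus = []
    minor = {}
    for site in range(length):
        counts = Counter(read[site] for read in reads)
        total = sum(counts.values())
        top_base, _ = counts.most_common(1)[0]  # most common base at this site
        consensus.append(top_base)
        hidden = {
            base: count / total
            for base, count in counts.items()
            if base != top_base and count / total >= min_freq
        }
        if hidden:
            minor[site] = hidden
    return "".join(consensus), minor


# Hypothetical example: one of four genomes carries a T at the second site.
cons, minor = consensus_and_minor_variants(["ACGT", "ACGT", "ACGT", "ATGT"])
```

In this toy case the consensus is identical to the majority genome, and the variant present in a quarter of the genomes appears only in the separate `minor` report, which is the information lost when only consensus sequences are deposited.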
A full description of the extent and structure of intra-host viral genetic variation is critical for understanding evolutionary dynamics, informing on such issues as the frequency of mixed infection, and hence the degree and extent of cross-immunity; the frequency with which antigenic variants are produced and whether antigenic evolution can occur on the time scale of individual infections; and the size of the population bottleneck that might accompany inter-host transmission. As a case in point, it is commonly assumed that viruses experience a severe population bottleneck as they are transmitted to new hosts, a phenomenon that greatly restricts the power of natural selection to fix advantageous mutations. Although this assumption appears to be true in some cases [40], whether this is a general property of RNA viruses is unclear; the evidence that multiple viral lineages can be transmitted among hosts argues against a narrow bottleneck in all cases [41]. To more accurately determine the size of the transmission bottleneck, analyses of intra-host genetic diversity along known transmission chains will be essential. On a larger scale, it is unclear whether phylodynamic patterns differ within and among hosts, and whether any differences among these scales of analysis are qualitative or quantitative.

Intra-host sequence data are also essential for understanding the process of cross-species virus transmission and emergence. Key parameters in determining whether a virus will adapt successfully to a new host species include the extent of intra-host genetic diversity, the fitness distribution of the mutations produced, and how many of these mutations will assist adaptation to new host species [41–43]. No such data are available for any acute RNA virus, so testing models for viral emergence is difficult.
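Why the size of the transmission bottleneck matters for minor variants can be illustrated with a back-of-the-envelope sketch. It rests on a strong simplifying assumption that is not made in the article: the founding virions are drawn independently and at random from the donor's intra-host population, with no selection during transmission. The function name and the numbers are hypothetical.

```python
def transmission_probability(variant_freq, bottleneck_size):
    """Probability that at least one founding virion carries a variant.

    Assumes `bottleneck_size` founders are sampled independently and at
    random from a donor population in which the variant has frequency
    `variant_freq`; selection during transmission is ignored.
    """
    if not 0.0 <= variant_freq <= 1.0:
        raise ValueError("variant_freq must lie between 0 and 1")
    if bottleneck_size < 1:
        raise ValueError("bottleneck_size must be at least 1")
    return 1.0 - (1.0 - variant_freq) ** bottleneck_size


# Hypothetical example: a variant at 5% frequency in the donor.
narrow = transmission_probability(0.05, bottleneck_size=1)
wide = transmission_probability(0.05, bottleneck_size=100)
```

Under these assumptions a variant at 5% frequency is passed on in only 5% of transmissions through a single-virion bottleneck, but almost always when 100 virions found the new infection, which is why sequencing along known transmission chains can constrain the bottleneck size.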
We believe, however, that understanding the mechanics of this adaptive process is at least as important as surveying for new emerging viruses. Challenges for the Future Our discussion has highlighted a number of key challenges for a successful phylodynamic research agenda. These challenges comprise data, theory, and methodological issues, and are briefly summarized as follows. First, with respect to data, it is clear that more genome sequences must be acquired and with increased temporal and spatial precision. For example, wherever possible, GenBank records should contain the exact day and precise latitude and longitude of sampling. In addition, it is essential that these sequence data be linked with the relevant metadata, such as the associated clinical syndrome and (if applicable) measure of antigenicity. Similarly, it is essential that equivalent genome sequence data be acquired from multiple time points within individual hosts. Second, in terms of theory, it is crucial that we fully integrate patterns of viral evolution across multiple epidemiological scales, from within hosts, to local outbreaks, and on to global pandemics. Although the coalescent is hugely useful in this respect, it is essential that its theoretical framework be extended to incorporate models of population growth and decline that most accurately reflect the population dynamics of acute RNA viruses, in particular the dynamics of the susceptible ‘‘denominator’’ that fuels epidemics. Sequencing of all available samples from the UK 2001 FMD epidemic would yield great scientific dividends here. Third and finally, with respect to methodology, new computational tools are needed to rapidly make phylodynamic inferences from genomic datasets that may contain thousands of sequences and that efficiently integrate genomic with other forms of biological data. We hope this review will stimulate research in all these areas. References 1. 
Novel Swine-Origin Influenza A (H1N1) Virus Investigation Team, Dawood FS, Jain S, Finelli L, Shaw MW, et al. (2009) Emergence of a novel swine-origin influenza A (H1N1) virus in humans. N Engl J Med 360: 2605–2615.
2. Lipkin WI (2009) Microbe hunting in the 21st century. Proc Natl Acad Sci U S A 106: 6–7.
3. Cox-Foster DL, Conlan S, Holmes EC, Palacios G, Evans JD, et al. (2007) A metagenomic survey of microbes in honey bee colony collapse disorder. Science 318: 283–287.
4. Finkbeiner SR, Allred AF, Tarr PI, Klein EJ, Kirkwood CD, et al. (2008) Metagenomic analysis of human diarrhea: viral detection and discovery. PLoS Pathog 4(2): e1000011. doi:10.1371/journal.ppat.1000011.
5. Zhang T, Breitbart M, Lee WH, Run JQ, Wei CL, et al. (2005) RNA viral community in human feces: Prevalence of plant pathogenic viruses. PLoS Biol 4(1): e3. doi:10.1371/journal.pbio.0040003.
6. Palacios G, Druce J, Du L, Tran T, Birch C, et al. (2008) A new arenavirus in a cluster of fatal transplant-associated diseases. N Engl J Med 358: 991–998.
7. Palmenberg AC, Spiro D, Kuzmickas R, Wang S, Djikeng A, et al. (2009) Sequencing and analyses of all known human rhinovirus genomes reveals structure and evolution. Science 324: 55–59.
8. Grenfell BT, Pybus OG, Gog JR, Wood JLN, Daly JM, et al. (2004) Unifying the epidemiological and evolutionary dynamics of pathogens. Science 303: 327–332.
9. Bjørnstad ON, Finkenstädt B, Grenfell BT (2002) Dynamics of measles epidemics. I. Estimating scaling of transmission rates using a time series SIR model. Ecol Monogr 72: 169–184.
10. Grenfell BT, Bjørnstad ON, Finkenstädt BF (2002) Dynamics of measles epidemics. II. Scaling noise, determinism and predictability with the time series SIR model. Ecol Monogr 72: 185–202.
11. Ghedin E, Sengamalay NA, Shumway M, Zaborsky J, Feldblyum T, et al. (2005) Large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution. Nature 437: 1162–1166.
12.
Nelson MI, Holmes EC (2007) The evolution of epidemic influenza. Nat Rev Genet 8: 196–205.
13. Ghedin E, Fitch A, Boyne A, DePasse J, Bera J, et al. (2009) Mixed infection and the genesis of influenza diversity. J Virol 83: 8832–8841.
14. Nelson MI, Simonsen L, Viboud C, Miller MA, Taylor J, et al. (2006) Stochastic processes are key determinants of the short-term evolution of influenza A virus. PLoS Pathog 2: e125. doi:10.1371/journal.ppat.0020125.
15. Nelson MI, Edelman L, Spiro DJ, Boyne AR, Bera J, et al. (2008) Molecular epidemiology of A/H3N2 and A/H1N1 influenza virus during a single epidemic season in the United States. PLoS Pathog 4(8): e1000133. doi:10.1371/journal.ppat.1000133.
16. Gonzalez MC, Hidalgo CA, Barabasi AL (2008) Understanding individual human mobility patterns. Nature 453: 779–782.
17. Smith DJ, Lapedes AS, de Jong JC, Bestebroer TM, Rimmelzwaan GF, et al. (2004) Mapping the antigenic and genetic evolution of influenza virus. Science 305: 371–376.
18. Cottam EM, Haydon DT, Paton DJ, Gloster J, Wilesmith JW, et al. (2006) Molecular epidemiology of the foot-and-mouth disease virus outbreak in the United Kingdom in 2001. J Virol 80: 11274–11282.
19. Keeling MJ, Woolhouse MEJ, Shaw DJ, Matthews L, Chase-Topping M, et al. (2001) Dynamics of the 2001 UK foot and mouth epidemic: stochastic dispersal in a heterogeneous landscape. Science 294: 813–817.
20. Cottam EM, Wadsworth J, Shaw AE, Rowlands RJ, Goatley L, et al. (2008) Transmission pathways of foot-and-mouth disease virus in the United Kingdom in 2007. PLoS Pathog 4(4): e1000050. doi:10.1371/journal.ppat.1000050.
21. Hoelzer K, Shackelton LA, Holmes EC, Parrish CR (2008) Within-host genetic diversity of endemic and emerging parvoviruses of cats and dogs. J Virol 82: 11096–11105.
22. Holmes EC (2007) Viral evolution in the genomic age. PLoS Biol 5(10): e278. doi:10.1371/journal.pbio.0050278.
23. Fitch WM, Leiter JME, Li X, Palese P (1991) Positive Darwinian evolution in human influenza A viruses. Proc Natl Acad Sci U S A 88: 4270–4274.
24. Webster RG, Laver WG, Air GM, Schild GC (1982) Molecular mechanisms of variation in influenza viruses. Nature 296: 115–121.
25. Holmes EC, Ghedin E, Miller N, Taylor J, Bao Y, et al. (2005) Whole genome analysis of human influenza A virus reveals multiple persistent lineages and reassortment among recent H3N2 viruses. PLoS Biol 3(9): e300. doi:10.1371/journal.pbio.0030300.
26. Ferguson NM, Galvani AP, Bush RM (2003) Ecological and immunological determinants of influenza evolution. Nature 422: 428–433.
27. Koelle K, Cobey S, Grenfell B, Pascual M (2006) Epochal evolution shapes the phylodynamics of interpandemic influenza A (H3N2) in humans. Science 314: 1898–1903.
28. Recker M, Pybus OG, Nee S, Gupta S (2007) The generation of influenza outbreaks by a network of host immune responses against a limited set of antigenic types. Proc Natl Acad Sci U S A 104: 7711–7716.
29. Russell CA, Jones TC, Barr IG, Cox NJ, Garten RJ, et al. (2008) The global circulation of seasonal influenza A (H3N2) viruses. Science 320: 340–346.
30. Rambaut A, Pybus OG, Nelson MI, Viboud C, Taubenberger JK, et al. (2008) The genomic and epidemiological dynamics of human influenza A virus. Nature 453: 615–619.
31. Suchard MA, Rambaut A (2009) Many-core algorithms for statistical phylogenetics. Bioinformatics 25: 1370–1376.
32. Drummond AJ, Rambaut A, Shapiro B, Pybus OG (2005) Bayesian coalescent inference of past population dynamics from molecular sequences. Mol Biol Evol 22: 1185–1192.
33. Drummond AJ, Pybus OG, Rambaut A, Forsberg R, Rodrigo AG (2003) Measurably evolving populations. Trends Ecol Evol 18: 481–488.
34. Minin VN, Bloomquist EW, Suchard MA (2008) Smooth skyride through a rough skyline: Bayesian coalescent-based inference of population dynamics. Mol Biol Evol 25: 1459–1471.
35.
Holmes EC (2008) The evolutionary history and phylogeography of human viruses. Annu Rev Microbiol 62: 307–328.
36. Wallace RG, Hodac H, Lathrop RH, Fitch WM (2007) A statistical phylogeography of influenza A H5N1. Proc Natl Acad Sci U S A 104: 4473–4478.
37. Viboud C, Bjørnstad ON, Smith DL, Simonsen L, Miller MA, et al. (2006) Synchrony, waves, and spatial hierarchies in the spread of influenza. Science 312: 447–451.
38. Aaskov J, Buzacott K, Thu HM, Lowry K, Holmes EC (2006) Long-term transmission of defective RNA viruses in humans and Aedes mosquitoes. Science 311: 236–238.
39. Jerzak G, Bernard KA, Kramer LD, Ebel GD (2005) Genetic variation in West Nile virus from naturally infected mosquitoes and birds suggests quasispecies structure and strong purifying selection. J Gen Virol 86: 2175–2183.
40. Keele BF, Giorgi EE, Salazar-Gonzalez JF, Decker JM, Pham KT, et al. (2008) Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. Proc Natl Acad Sci U S A 105: 7552–7557.
41. Holmes EC (2009) The evolution and emergence of RNA viruses. Oxford Series in Ecology and Evolution. Harvey PH, May RM, eds. Oxford: Oxford University Press.
42. Kuiken T, Holmes EC, McCauley J, Rimmelzwaan GF, Williams CS, et al. (2006) Host species barriers to influenza virus infections. Science 312: 394–397.
43. Parrish CR, Holmes EC, Morens DM, Park EC, Burke DS, et al. (2008) Cross-species viral transmission and the emergence of new epidemic diseases. Microbiol Mol Biol Rev 72: 457–470.

Perspective
Computational Resources in Infectious Disease: Limitations and Challenges
Eva C. Berglund, Björn Nystedt, Siv G. E.
Andersson*
Department of Molecular Evolution, Uppsala University, Uppsala, Sweden

Infectious diseases continue to be a major cause of death in the human population, with tuberculosis and malaria affecting 500 million people and causing 1–2 million deaths annually [1]. The situation is aggravated by the increasing prevalence of antibiotic-resistant bacteria and the risk that terrorists might use infectious organisms to attack target populations. During the past decade, we have also witnessed the emergence of many new pathogens not previously detected in humans, such as the avian influenza virus, the severe acute respiratory syndrome (SARS) coronavirus, and Ebola virus. The appearance of these novel agents and the reemergence of previously eradicated pathogens may be associated with the growing human population, flooding, and other environmental perturbations; global travel and migration; and animal trade and domestic animal husbandry practices. Simultaneously, we have seen an explosion of genome sequence data. Sequencing is now the method of choice for characterization of new disease agents, as exemplified by the rapid sequencing of the genome of the SARS virus, which was made available within a month of identification of the virus [2,3]. Like SARS, most newly emerging disease agents originate in animals and have been transmitted to humans recently at food markets, by insect bites, or through hunting [1]. The new sequencing technologies enable small academic research groups to create huge genome datasets at low cost. As a result, scientists with expertise in other fields of research, such as clinical microbiology and ecology, are just beginning to face the challenge of handling, comparing, and extracting useful information from millions of sequences. Here, we discuss the limitations of publicly available resources in the field of genomics of emerging bacterial pathogens, emphasizing areas where increased efforts in computational biology are urgently needed.
Genome Evolution in Emerging Bacterial Pathogens

A natural ecosystem of a bacterial population that incidentally infects humans provides a high-risk microenvironment for the establishment of this pathogen in the human population (Box 1; Figure 1). Comparative studies of the genomes of well-recognized human pathogens, incidental pathogens, and their closely related nonpathogenic species [4–11] are valuable for efforts to predict the propensity for host shifts and their consequences for human health. A successful infectious bacterium, whether it causes disease or not, must possess mechanisms for interacting with the host and evading the host immune system. The key players in these processes are often proteins on the surface of the bacterium, including secretion systems that release effector proteins into the surrounding medium or directly into the host cells. These host-interaction factors are often members of large protein families with many paralogs and are often encoded by long genes with internal repeats. Fluctuations in gene length and copy number occur through homologous recombination over these repeats [12–15]. Adding to the variability of the host-interaction genes is that they are often located on mobile elements such as plasmids or bacteriophages, which are easily gained and lost. Rapid sequence evolution of these genes may be driven by selection, because it often increases bacterial fitness by escaping the host immune system, creating a diverse set of binding structures or tuning effector proteins to a new host. As a consequence, host-interaction genes typically show extreme plasticity in both sequence and copy number, partly because they are under strong evolutionary pressure and partly because they are mechanistically prone to drastic mutational changes. Understanding these complex dynamics poses major challenges in many areas of computational biology, ranging from sequence assembly to epidemic risk assessment.
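The internal repeats described above can be made concrete with a small sketch. The Python functions below are illustrative only and are not taken from any tool cited in this article: they flag every k-mer that occurs more than once in a gene sequence and report the fraction of the sequence covered by such repeats. The function names and the default k-mer length of 12 are assumptions chosen for the example; a real analysis would also need to handle reverse complements and inexact repeats.

```python
from collections import defaultdict


def repeated_kmers(seq, k=12):
    """Return {k-mer: [start positions]} for every k-mer seen more than once.

    Hypothetical helper: exact, same-strand matches only.
    """
    positions = defaultdict(list)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


def repeat_fraction(seq, k=12):
    """Fraction of sequence positions covered by at least one repeated k-mer."""
    if len(seq) < k:
        return 0.0
    covered = [False] * len(seq)
    for starts in repeated_kmers(seq, k).values():
        for start in starts:
            for j in range(start, start + k):
                covered[j] = True
    return sum(covered) / len(seq)


if __name__ == "__main__":
    # Toy sequences, not real gene data: one tandem repeat, one repeat-free string.
    tandem = "ACGTTGCA" * 3
    unique = "ACGTACGGTCA"
    print(repeat_fraction(tandem, k=8))
    print(repeat_fraction(unique, k=8))
```

A gene whose repeat fraction is high is a plausible candidate for the recombination-driven length and copy-number changes discussed above, and is also likely to be hard to assemble from short reads.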
Citation: Berglund EC, Nystedt B, Andersson SGE (2009) Computational Resources in Infectious Disease: Limitations and Challenges. PLoS Comput Biol 5(10): e1000481. doi:10.1371/journal.pcbi.1000481
Editor: Ernest Fraenkel, Massachusetts Institute of Technology, United States of America
Published October 26, 2009
Copyright: © 2009 Berglund et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The authors are supported by grants to SGEA from the European Union (QLK3-CT2000-01079, EUWOL and EuroPathogenomics), the Swedish Research Council (http://www.vr.se/), the Göran Gustafsson Foundation (http://www.gustafssonsstiftelse.se/), the Swedish Foundation for Strategic Research (http://www.stratresearch.se/) and the Knut and Alice Wallenberg Foundation (http://www.wallenberg.com/kaw/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
This article is part of the "Genomics of Emerging Infectious Disease" PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
PLoS Computational Biology | www.ploscompbiol.org | October 2009 | Volume 5 | Issue 10 | e1000481

Box 1. Genomic Changes Associated with Host Shifts
The movement of a bacterial species from abundant animal hosts such as rodents, which are a major reservoir of infectious disease agents, to the relatively small human population is typically associated with decreased genome size and loss/alteration of the mobile gene pool [4,32–34]. One illustrative example can be found in the genus Mycobacterium, which contains several severe human pathogens, including the agents of tuberculosis (M. tuberculosis) and leprosy (M. leprae) and also the recently emerged M. ulcerans. M. ulcerans causes severe skin lesions; this disease, known as Buruli ulcer, is becoming a serious public health problem in West and Central Africa as well as in other parts of the tropics. Like many other recently emerged human pathogens [4,34–36], M. ulcerans appears to have switched from a generalist to a specialist lifestyle, starting from a progenitor very similar to the aquatic M. marinum. While M. marinum has been found both free-living and as an intracellular pathogen of fish and other species, M. ulcerans is thought to have a restricted host range and to be transmitted by insects (Figure 1A). The host switch was likely initiated by the uptake of a virulence plasmid, and proceeded through a series of "bottleneck events" (severe reductions in population size due to environmental circumstances). This process resulted in loss of about 1 Mb of the genome, major genomic rearrangements, extensive proliferation of insertion sequences, and a massive increase in the number of pseudogenes [37–39]. In particular, there was a massive reduction in the size of the two major surface protein gene families (a decrease of more than 250 genes compared to M. marinum). This gene loss is thought to have been crucial for the organism to evade the human immune system, by limiting the number of antigens on the bacterial surface [40]. The uptake of a new virulence plasmid producing an immunosuppressive substance called mycolactone is also thought to have played a key role in the evolution and host switch of M. ulcerans. This plasmid consists mainly of three unusually large and internally repeated genes (over 100 kb in total), and thus illustrates the concept of long and repeated virulence genes (Figure 1B) [41]. These genes appear to evolve rapidly by recombination and gene conversion, and new variants can be directly connected to variations in the chemical structure of mycolactone [42], which might be important for host specificity, immunosuppressive potency, and drug design.

Figure 1. Evolution of a new infectious disease agent. (A) Recent evolution of the specialist human pathogen M. ulcerans from the aquatic generalist pathogen M. marinum. (B) Arrangement of the three M. ulcerans plasmid-encoded repeated virulence genes (arrows from left to right: mlsA1 [51 kb], mlsA2 [7.6 kb], mlsB [43 kb]) coding for three polyketide synthases. The loading modules (labeled LM) and the 16 repeated modules depicted in purple (labeled 1–9 for mlsA1 and mlsA2, and 1–7 for mlsB) enable the serial buildup of the backbone carbon chain of the complex immunosuppressive substance mycolactone. doi:10.1371/journal.pcbi.1000481.g001

Complete Genome Assembly Remains Difficult

Despite the ease with which shotgun sequence data can be generated, assembling these data into a single genomic contig remains labor-intensive and time-consuming. This obstacle is primarily due to the difficulty of assembling repeated sequences. Hence, resequencing approaches, in which short sequence reads are directly mapped to an already completed reference genome, have become increasingly popular. Resequencing readily detects SNPs (single nucleotide polymorphisms) in single-copy genes, but performs very poorly in repeated and highly divergent regions of the genome. Genes involved in infection processes, with their complex repeat structures, high duplication frequency, and rapid evolution, are thus often left unresolved. Perhaps the most pressing need is not for improved assembly algorithms but for better ways to integrate data from diverse sources, including shotgun sequencing, paired-end sequencing, PCR experiments, fosmid and BAC (bacterial artificial chromosome) clone sequencing, physical mapping, and restriction fragment data.
A program integrating these different data should not only accurately assemble as much of the genome as possible, but also assist the researcher in designing additional experiments to resolve the remaining regions. Given the rapidly increasing number of incomplete genome sequences available, it would also be valuable to have a quality-scoring standard that not only provides quality scores at individual sites under the assumption that the assembly is correct, but also reflects the uncertainty of the actual assembly over specific regions. While assembly software development is struggling to keep up, the sequencing revolution shows no signs of slowing down. Perhaps the most important new development is real-time single molecule detection platforms with ultra-long sequencing reads [16]. Within the next few years, we can expect to see read lengths of 20 kb, which will help resolve many of the complex genomic features underlying host adaptation and pathogenicity.

Functional Annotation of Virulence and Host-Interaction Genes

Annotation is the process of assigning meaningful information, such as the location or function of genes, to raw sequence data. Reliable and consistent annotations are thus fundamental for analysis and interpretation of genome data. Since annotation of new genomes is usually based on homology searches (e.g., BLAST hits), errors and inconsistencies tend to propagate. One way to reduce error propagation is to functionally annotate a set of reference genomes based on experimentally determined information. Annotation of new genomes could then start with searches in this database, which would allow high-quality annotation of all well-conserved genes. The Gene Ontology's Reference Genome Project [17] and BioCyc [18] represent developments in this direction. However, the number of species included is still limited, and greater taxonomic breadth among bacteria, with one reference species per genus, would be desirable.
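The idea of starting annotation from experimentally supported reference genes can be sketched in a few lines. The Python example below is a minimal illustration under stated assumptions, not the procedure of the Reference Genome Project or BioCyc: homology hits are represented as plain dictionaries (a hypothetical format, not BLAST output parsing), the identity and coverage thresholds are arbitrary, and a function is transferred only when the reference entry carries one of the Gene Ontology experimental evidence codes. Transferred annotations are labeled ISS (inferred from sequence or structural similarity) so that they are never mistaken for experimental evidence in a later round of transfer.

```python
# Gene Ontology evidence codes that denote direct experimental support.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def transfer_annotation(hits, reference, min_identity=70.0, min_coverage=0.8):
    """Pick an annotation for one query gene from its homology hits.

    hits:      list of dicts with keys "subject", "identity" (percent) and
               "coverage" (0-1); a hypothetical, simplified hit format.
    reference: dict mapping subject id -> {"function": str, "evidence": str}.
    Thresholds are illustrative assumptions, not recommended values.
    """
    usable = []
    for hit in hits:
        ref = reference.get(hit["subject"])
        if ref is None or ref["evidence"] not in EXPERIMENTAL_CODES:
            continue  # skip references that were themselves inferred
        if hit["identity"] >= min_identity and hit["coverage"] >= min_coverage:
            usable.append((hit["identity"], hit["subject"], ref))
    if not usable:
        return {"function": "hypothetical protein", "source": None, "evidence": None}
    _, subject, ref = max(usable, key=lambda item: item[0])
    # Record provenance and mark the result as similarity-based.
    return {"function": ref["function"], "source": subject, "evidence": "ISS"}


if __name__ == "__main__":
    # Toy reference set with made-up identifiers.
    reference = {
        "ref_exp": {"function": "secretion system ATPase", "evidence": "IDA"},
        "ref_inferred": {"function": "adhesin", "evidence": "IEA"},
    }
    hits = [
        {"subject": "ref_inferred", "identity": 95.0, "coverage": 0.99},
        {"subject": "ref_exp", "identity": 82.0, "coverage": 0.90},
    ]
    print(transfer_annotation(hits, reference))
```

In this sketch the better-scoring hit is ignored because its own annotation was electronically inferred, which is one simple way to keep errors from propagating through successive genomes.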
Functional annotation of pathogen genomes is particularly important, because genes involved in host-interaction processes are among the most difficult to annotate. One problem is that different research groups often have studied homologous genes in various species, and given them different names that are not always logical or reflective of similarities in sequence and function. A manually curated database of protein families involved in host interactions that incorporates currently used gene names, sequence motifs, gene functions, and experimental results would substantially improve the situation. Much improved guidelines for how to annotate genes in large families with different combinations of sequence motifs would also be valuable. Comparative studies of very closely related genomes can help to distinguish functional genes from spurious ORFs (open reading frames) and pseudogenes, and thereby improve gene prediction. To this end, a tool to visualize all the fine details in comparisons of multiple closely related genomes is crucial. Such a tool was developed recently for genomes with a conserved order of genes, and it has been applied to analyze sequence deterioration in the typhus pathogen Rickettsia prowazekii and its closest relatives [10]. Future studies, however, will require software that can also handle multiple genome comparisons from highly rearranged genomes. Another limitation of currently available visualization tools is that, although multiple genomes can be included, only serial pairwise comparisons can be made. This limitation can be overcome by visualization of genome comparisons in "three dimensions" (3D visualization), enabling all-against-all comparisons to be viewed simultaneously (Figure 2). Just as 3D visualizations revolutionized the field of structural biology over the past decades, such developments might well revolutionize the field of comparative genomics in the years to come.
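The difference between serial pairwise and all-against-all comparison can be shown with a small sketch of the data that such a viewer would need. The Python example below is a simplification under stated assumptions: each genome is reduced to a set of gene-family identifiers (so homology assignment is assumed to have been done already), and the function name and strain labels are hypothetical. It computes the genes shared by all genomes, the genes unique to each genome, and the shared count for every pair of genomes rather than only for neighbors in a fixed order.

```python
from itertools import combinations


def compare_all(genomes):
    """All-against-all comparison of gene content.

    genomes: dict mapping genome name -> set of gene-family identifiers
             (a hypothetical, pre-clustered representation).
    Returns (core, unique, shared) where
      core   = families present in every genome,
      unique = {name: families found only in that genome},
      shared = {(name_a, name_b): number of families in common}.
    """
    core = set.intersection(*genomes.values())
    unique = {}
    for name, genes in genomes.items():
        others = set().union(*(g for n, g in genomes.items() if n != name))
        unique[name] = genes - others
    shared = {
        (a, b): len(genomes[a] & genomes[b])
        for a, b in combinations(sorted(genomes), 2)
    }
    return core, unique, shared


if __name__ == "__main__":
    # Toy gene-family sets for three made-up strains.
    genomes = {
        "strain_A": {"fam1", "fam2", "fam3"},
        "strain_B": {"fam1", "fam2", "fam4"},
        "strain_C": {"fam1", "fam5"},
    }
    core, unique, shared = compare_all(genomes)
    print(sorted(core))
    print({name: sorted(genes) for name, genes in unique.items()})
    print(shared)
```

With n genomes a serial layout shows only n - 1 of the n(n - 1)/2 pairs returned here, which is the gap that an all-against-all display is meant to close.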
Figure 2. New visualization tools for genome comparisons. Comparison of the genes in multiple genomes can be represented visually by using a 3D program. Each arrow represents one gene, and the grey shading between genes indicates homology. Red indicates genes that are unique to one genome. The difference between this approach and existing programs is that all genomes can be compared to each other simultaneously, rather than by pairwise comparisons. With multiple genomes, and with zooming, flipping, and selecting options, even this rudimentary 3D program would be of great help in genome analysis. doi:10.1371/journal.pcbi.1000481.g002

Molecular Diagnostics and Vaccine Development

Classification of infectious disease agents is typically based on multilocus sequence-typing (MLST) systems, by which new bacterial isolates are analyzed by sequencing five to seven predefined core genes [19]. With the increasing number of complete genome sequences of pathogenic and nonpathogenic strains, it will be possible to concatenate a much larger number of conserved genes and use this dataset to infer a tree to represent the underlying population structure [20]. However, while genotyping systems based on conserved genes can be useful for monitoring the spread of strains, they do not necessarily correlate with genomotypes defined by virulence properties [21]. This is because genes contributing to virulence are prone to horizontal gene transfer, gene duplications, and gene loss. Further complicating the development of molecular diagnostic methods is that homologs of virulence genes are often also present in nonpathogenic species, making it difficult to recognize pathogens solely from the gene content. Hence, classification and risk assessments for the emergence of novel infectious strains ultimately should be based on a combination of strain typing, gene content, and identification of virulence genes. Understanding the evolutionary dynamics of host-interaction genes in terms of both mechanisms and selective forces is also important in order to design drugs that will be effective in the long term. What good would the development of a new antibiotic or vaccine be if the intended target protein evolves beyond recognition before the drug reaches the market? One solution to this problem is to characterize the selective pressures on candidate vaccine targets, and then exclude genes or parts of genes based on their evolutionary dynamics [22]. However, current tools for measuring positive or diversifying selection are severely limited in that they assume that single-base mutations are the only underlying mechanism of sequence change. For reliable analyses of genes with a complex evolution, a new generation of evolutionary tests needs to be developed that acknowledge the importance of mutation by recombination (Figure 3) and multiple-base insertion/deletion events as well as point mutations. With the expected huge increase of complete and draft genomes for many strains of a species, there is a need for programs capable of screening a large set of alignments for recombination signals, with novel statistical and visualization tools to analyze the full set of results.

Predicting Risk for Disease Outbreaks

The next challenge is to place the genomic data within its ecological context, which has led to a new research field called molecular ecosystems biology [23]. This field focuses on dissecting the many complex molecular interactions between the bacterial population and its environment. This environment can be highly specialized, as in the case of bacteria adapted to a single host species, or very complex as for soil-, water-, or airborne bacteria.
The behavior of a pathogen thus depends on many ecological factors, such as seasonal fluctuations in temperature and nutrient availability, species richness, and host population density. To be able to integrate and evaluate these data, new software is needed. Imagine a program that can read sequence data from hundreds of bacterial isolates, infer the underlying population structure, and combine it with gene expression data, ecological factors, and clinical data such as the number of disease cases reported in various geographic areas. It should be possible to visualize global patterns in the data, such as abundance of particular strains and sequence variants and migration of infected hosts and vectors over geographic areas and seasons. Changes in taxonomic profiles, virulence genes, and metabolic pathways should be visualized in real time. This program could also be linked to a Web site where researchers can post daily updates of clinical cases, spread of virulence genes, appearance of new strains and new mutations, migration patterns, and news about genome and functional data. This site would be useful for estimating the risk of new epidemics emerging in the human population.

Figure 3. New methods for analyzing evolution by recombination. Improved models and visualization tools are needed to analyze recombination. Virulence genes, here exemplified by the acfD gene in the Vibrio cholerae pathogenicity island [43], often display complex recombination patterns. The aligned acfD genes (arrows) from three V. cholerae strains (M2140, M1567, and M1118) are plotted separately; a line connects each site where the nucleotides in two strains differ from the third strain. Noninformative sites were removed before plotting. doi:10.1371/journal.pcbi.1000481.g003

Analyzing Microbial Communities

Analyzing the behavior of complete pathogen ecosystems is an immediate priority.
Random shotgun sequencing projects of bacterial DNA from diverse environments number in the hundreds, and the amount of metagenomic sequence data already exceeds the available genomic sequences in public databases [24,25] (http://www.genomesonline.org). Several multinational projects on the human microbiome have been launched, which, together with studies of 16S rRNA amplicons, have provided new insights into the human intestinal [26–28], oral [29], and vaginal flora [30]. Comparison of the microbial flora in healthy and diseased people can be a powerful diagnostic tool and enable the discovery of both emerging pathogens and novel virulence factors, such as antibiotic resistance plasmids. An important technical development that holds great promise for associating the functional adaptation of the community as a whole with the metabolic pathways present in the individual strains is single-cell isolation followed by whole-genome amplification. Community sequencing also provides an excellent tool for epidemic surveillance of pathogenic strains and virulence genes in environments from which they may further spread to humans. The massive amount of data created by microbial community sequencing poses new challenges and will require extensive bioinformatics development [24]. Although the advent of longer sequence reads will have a large impact on the assembly of community data, the presence of many closely related species or strains in the same sample, along with horizontal gene transfer, will remain a daunting challenge. A whole new field of comparative algorithms needs to be developed, for example to provide meaningful comparisons between taxonomic profiles. New sequence databases will be essential for rapid access to both raw and processed data. Also, for fair comparisons between datasets, a certain level of standardization of sampling, experimental work, and statistics will be crucial [31].
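As one illustration of what a comparison between taxonomic profiles can look like, the sketch below computes the Bray-Curtis dissimilarity between two samples. This is a standard ecological measure that we use here only as an example; it is not a method proposed in this article, and the function names and taxon labels are hypothetical. Read counts are first converted to relative abundances, which is one simple (and assumed) way of standardizing samples sequenced to different depths.

```python
def relative_abundance(counts):
    """Convert {taxon: read count} into {taxon: fraction of all reads}."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("profile contains no reads")
    return {taxon: n / total for taxon, n in counts.items()}


def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two abundance profiles.

    0.0 means identical composition; 1.0 means no taxa in common.
    Taxa missing from one profile are treated as abundance 0.
    """
    taxa = set(profile_a) | set(profile_b)
    difference = sum(
        abs(profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) for t in taxa
    )
    total = sum(profile_a.get(t, 0.0) + profile_b.get(t, 0.0) for t in taxa)
    return difference / total if total else 0.0


if __name__ == "__main__":
    # Toy read counts for two made-up samples.
    sample_1 = relative_abundance({"taxon_A": 50, "taxon_B": 50})
    sample_2 = relative_abundance({"taxon_A": 25, "taxon_B": 25, "taxon_C": 50})
    print(bray_curtis(sample_1, sample_2))
```

A single number of this kind hides which taxa drive the difference, and it depends on how reads were assigned to taxa in the first place, which is why the standardization called for above matters.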
Bioinformatics skills combined with a deep biological understanding of the system under study are needed to use these complex sequence datasets to answer such questions as: Who is there? What are they doing? How are they communicating? And what is the risk for disease?

Challenges for the Future

The priority goals for the next decade within the area of emerging infectious diseases should be the study of complete pathogen ecosystems and the dissection of host–pathogen interaction communication pathways directly in the natural environment. To achieve these goals, investments in user-friendly software and improved visualization tools, along with excellent expertise in computational biology, will be of utmost importance. Unfortunately, too few undergraduate students in clinical microbiology and microbial ecology are trained in computational skills, and national governments and universities need to take action to address this deficiency to meet the demands of the near future. Often neglected by public and private funding is the monumental need for stable and standardized infrastructure at all levels, from the individual research group to the intergovernmental organization. Only with proper investments in everything from hardware and personnel for data handling, to the development of sensible and standardized file formats, can we ensure that the current developments can be fully exploited to more efficiently battle emerging infectious diseases. Currently, the slow transition from a scientific in-house program to the distribution of a stable and efficient software package is a major bottleneck in scientific knowledge sharing, preventing efficient progress in all areas of computational biology. Efforts to design, share, and improve software must receive increased funding, practical support, and, not least, scientific recognition.
Since microorganisms do not respect national borders, such initiatives are probably best started by intergovernmental organizations with close links to national centers. These centers have established communication networks that can distribute know-how and advances within each country and, in the other direction, spread new concepts and software to all members of the organization. Eventually, many of these initiatives may become community-driven. The example of Wikipedia, with more than 10 million entries written since the launch in 2001 and a current growth rate of thousands of articles daily (http://www.wikipedia.org), demonstrates the power of user-contributed initiatives.

Acknowledgments

We thank Eddie Persson for graphical work.

References
1. Rappuoli R (2004) From Pasteur to genomics: Progress and challenges in infectious diseases. Nat Med 10: 1177–1185.
2. Marra MA, Jones SJ, Astell CR, Holt RA, Brooks-Wilson A, et al. (2003) The genome sequence of the SARS-associated coronavirus. Science 300: 1399–1404.
3. Rota PA, Oberste MS, Monroe SS, Nix WA, Campagnoli R, et al. (2003) Characterization of a novel coronavirus associated with severe acute respiratory syndrome. Science 300: 1394–1399.
4. Parkhill J, Wren BW, Thomson NR, Titball RW, Holden MT, et al. (2001) Genome sequence of Yersinia pestis, the causative agent of plague. Nature 413: 523–527.
5. Welch RA, Burland V, Plunkett G 3rd, Redford P, Roesch P, et al. (2002) Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc Natl Acad Sci U S A 99: 17020–17024.
6. Dziejman M, Balon E, Boyd D, Fraser CM, Heidelberg JF, et al. (2002) Comparative genomic analysis of Vibrio cholerae: genes that correlate with cholera endemic and pandemic disease. Proc Natl Acad Sci U S A 99: 1556–1561.
7. Wolfgang MC, Kulasekara BR, Liang X, Boyd D, Wu K, et al.
(2003) Conservation of genome content and virulence determinants among clinical and environmental isolates of Pseudomonas aeruginosa. Proc Natl Acad Sci U S A 100: 8484–8489.
8. Seshadri R, Myers GS, Tettelin H, Eisen JA, Heidelberg JF, et al. (2004) Comparison of the genome of the oral pathogen Treponema denticola with other spirochete genomes. Proc Natl Acad Sci U S A 101: 5646–5651.
9. Gill SR, Fouts DE, Archer GL, Mongodin EF, Deboy RT, et al. (2005) Insights on evolution of virulence and resistance from the complete genome analysis of an early methicillin-resistant Staphylococcus aureus strain and a biofilm-producing methicillin-resistant Staphylococcus epidermidis strain. J Bacteriol 187: 2426–2438.
10. Fuxelius HH, Darby AC, Cho NH, Andersson SG (2008) Visualization of pseudogenes in intracellular bacteria reveals the different tracks to gene destruction. Genome Biol 9: R42.
11. Berglund EC, Frank AC, Calteau A, Vinnere Pettersson O, Granberg F, et al. (2009) Run-off replication of host-adaptability genes is associated with gene transfer agents in the genome of mouse-infecting Bartonella grahamii. PLoS Genet 5: e1000546. doi:10.1371/journal.pgen.1000546.
12. Deitsch KW, Moxon ER, Wellems TE (1997) Shared themes of antigenic variation and virulence in bacterial, protozoal, and fungal infections. Microbiol Mol Biol Rev 61: 281–293.
13. Brayton KA, Knowles DP, McGuire TC, Palmer GH (2001) Efficient use of a small genome to generate antigenic diversity in tick-borne ehrlichial pathogens. Proc Natl Acad Sci U S A 98: 4130–4135.
14. Nystedt B, Frank AC, Thollesson M, Andersson SG (2008) Diversifying selection and concerted evolution of a type IV secretion system in Bartonella. Mol Biol Evol 25: 287–300.
15. Bilek N, Ison CA, Spratt BG (2009) Relative contributions of recombination and mutation to the diversification of the opa gene repertoire of Neisseria gonorrhoeae.
J Bacteriol 191: 1878–1890.
16. Gupta PK (2008) Single-molecule DNA sequencing technologies for future genomics research. Trends Biotechnol 26: 602–611.
17. The Gene Ontology’s Reference Genome Project: A unified framework for functional annotation across species. PLoS Comput Biol 5: e1000431.
18. Karp PD, Ouzounis CA, Moore-Kochlacs C, Goldovsky L, Kaipa P, et al. (2005) Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res 33: 6083–6089.
19. Maiden MC, Bygraves JA, Feil E, Morelli G, Russell JE, et al. (1998) Multilocus sequence typing: A portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci U S A 95: 3140–3145.
20. Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, et al. (2006) Toward automatic reconstruction of a highly resolved tree of life. Science 311: 1283–1287.
21. Turner KM, Feil EJ (2007) The secret life of the multilocus sequence type. Int J Antimicrob Agents 29: 129–135.
22. Bambini S, Rappuoli R (2009) The use of genomics in microbial vaccine development. Drug Discov Today 14: 252–260.
23. Raes J, Bork P (2008) Molecular eco-systems biology: Towards an understanding of community function. Nat Rev Microbiol 6: 693–699.
24. Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P (2008) A bioinformatician’s guide to metagenomics. Microbiol Mol Biol Rev 72: 557–578.
25. Liolios K, Mavromatis K, Tavernarakis N, Kyrpides NC (2008) The Genomes On Line Database (GOLD) in 2007: Status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res 36: D475–479.
26. Dethlefsen L, Huse S, Sogin ML, Relman DA (2008) The pervasive effects of an antibiotic on the human gut microbiota, as revealed by deep 16S rRNA sequencing. PLoS Biol 6: e280. doi:10.1371/journal.pbio.0060280.
27. Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, et al. (2009) A core gut microbiome in obese and lean twins. Nature 457: 480–484.
28. Mahowald MA, Rey FE, Seedorf H, Turnbaugh PJ, Fulton RS, et al. (2009) Characterizing a model human gut microbiota composed of members of its two dominant bacterial phyla. Proc Natl Acad Sci U S A 106: 5859–5864.
29. Keijser BJ, Zaura E, Huse SM, van der Vossen JM, Schuren FH, et al. (2008) Pyrosequencing analysis of the oral microflora of healthy adults. J Dent Res 87: 1016–1020.
30. Spear GT, Sikaroodi M, Zariffard MR, Landay AL, French AL, et al. (2008) Comparison of the diversity of the vaginal microbiota in HIV-infected and HIV-uninfected women with or without bacterial vaginosis. J Infect Dis 198: 1131–1140.
31. Raes J, Foerstner KU, Bork P (2007) Get the most out of your metagenome: Computational analysis of environmental sequence data. Curr Opin Microbiol 10: 490–498.
32. Andersson SG, Kurland CG (1998) Reductive evolution of resident genomes. Trends Microbiol 6: 263–268.
33. Cole ST, Eiglmeier K, Parkhill J, James KD, Thomson NR, et al. (2001) Massive gene decay in the leprosy bacillus. Nature 409: 1007–1011.
34. Alsmark CM, Frank AC, Karlberg EO, Legault BA, Ardell DH, et al. (2004) The louse-borne human pathogen Bartonella quintana is a genomic derivative of the zoonotic agent Bartonella henselae. Proc Natl Acad Sci U S A 101: 9716–9721.
35. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, et al. (1998) Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393: 537–544.
36. Parkhill J, Sebaihia M, Preston A, Murphy LD, Thomson N, et al. (2003) Comparative analysis of the genome sequences of Bordetella pertussis, Bordetella parapertussis and Bordetella bronchiseptica. Nat Genet 35: 32–40.
37. Yip MJ, Porter JL, Fyfe JA, Lavender CJ, Portaels F, et al. (2007) Evolution of Mycobacterium ulcerans and other mycolactone-producing mycobacteria from a common Mycobacterium marinum progenitor. J Bacteriol 189: 2021–2029.
38.
Rondini S, Kaser M, Stinear T, Tessier M, Mangold C, et al. (2007) Ongoing genome reduction in Mycobacterium ulcerans. Emerg Infect Dis 13: 1008–1015.
39. Stinear TP, Seemann T, Pidot S, Frigui W, Reysset G, et al. (2007) Reductive evolution and niche adaptation inferred from the genome of Mycobacterium ulcerans, the causative agent of Buruli ulcer. Genome Res 17: 192–200.
40. Huber CA, Ruf MT, Pluschke G, Kaser M (2008) Independent loss of immunogenic proteins in Mycobacterium ulcerans suggests immune evasion. Clin Vaccine Immunol 15: 598–606.
41. Stinear TP, Mve-Obiang A, Small PL, Frigui W, Pryor MJ, et al. (2004) Giant plasmid-encoded polyketide synthases produce the macrolide toxin of Mycobacterium ulcerans. Proc Natl Acad Sci U S A 101: 1345–1349.
42. Pidot SJ, Hong H, Seemann T, Porter JL, Yip MJ, et al. (2008) Deciphering the genetic basis for polyketide variation among mycobacteria producing mycolactones. BMC Genomics 9: 462.
43. Tay CY, Reeves PR, Lan R (2008) Importation of the major pilin TcpA gene and frequent recombination drive the divergence of the Vibrio pathogenicity island in Vibrio cholerae. FEMS Microbiol Lett 289: 210–218.

Perspective

The Role of Medical Structural Genomics in Discovering New Drugs for Infectious Diseases

Wesley C. Van Voorhis (1), Wim G. J. Hol (2), Peter J. Myler (3,4,5)*, Lance J. Stewart (6)*

1 Department of Medicine, University of Washington, Seattle, Washington, United States of America
2 Department of Biochemistry, University of Washington, Seattle, Washington, United States of America
3 Seattle Biomedical Research Institute, Seattle, Washington, United States of America
4 Department of Global Health, University of Washington, Seattle, Washington, United States of America
5 Department of Medical Education and Biomedical Informatics, University of Washington, Seattle, Washington, United States of America
6 deCODE biostructures, Bainbridge Island, Washington, United States of America

Introduction

Whether we think of Alzheimer’s disease, microbial infection, or any other modern-day disease, new medicines are urgently needed. The number of new drugs registered since the advent of genomics, however, has not lived up to expectations. One recent review revealed that over 70 high-throughput biochemical screens against genetically validated drug targets in bacteria failed to yield a single candidate that could be tested in the clinic [1]. The reasons for the failure of high-throughput biochemical screens are not completely clear, but the failure could reflect the limited diversity of the chemical libraries used and/or the absence of structural information for many of the targets. Indeed, structure-based drug design is playing a growing role in modern drug discovery, with numerous approved drugs tracing their origins, at least in part, to the use of structural information from X-ray crystallography or nuclear magnetic resonance (NMR) analysis of protein targets and their ligand-bound complexes. Although it is beyond the scope of this brief overview to present a comprehensive list of structures that have led to useful drugs, Table 1 lists some examples in which protein structure information has provided insights into the design and development of new therapeutic entities.
These cases include both novel drug design based on native and ligand-bound structures and optimization of inhibitors based on the binding mode revealed by the structures of inhibitor–target complexes. These approaches have allowed increased affinity for the target and/or improvement of pharmacological properties while maintaining target affinity. With the increasing availability of complete human and pathogen genome sequences and the substantial progress in structure determination methods, it is no surprise that the field of “structural genomics” has emerged recently. Its aim is to solve as many useful protein structures as possible from the entire genome of a single organism or group of related organisms. Over the past ten years, over 20 structural genomics initiatives have begun around the world (Table 2). The impact of these efforts on structural biology has been substantial, both in the sheer number of new structures and, perhaps even more importantly, in the development of new methodologies, especially the use of robotics and informatics to generate and capture data in a systematic way [2]. Over the next five years, thousands of new protein structures, many bound to their ligands, will be elucidated, laying the groundwork for structure-based design and development of new and improved chemotherapeutic agents against pathogen proteins. Here, we will focus on the intersection of structural biology with chemistry and biology, a field called “medical structural genomics,” and particularly on how the structures of medically relevant drug targets in pathogens can serve as a starting point for inhibitor design and drug development.
We argue that the pharmaceutical industry should be persuaded to complement the publicly funded structural genomics initiatives by making public the structural coordinates of their drug targets for important infectious disease organisms in a timely fashion and by developing public–private partnerships to provide the maximal synergy between target validation, structure determination, and hit-to-lead development.

Citation: Van Voorhis WC, Hol WGJ, Myler PJ, Stewart LJ (2009) The Role of Medical Structural Genomics in Discovering New Drugs for Infectious Diseases. PLoS Comput Biol 5(10): e1000530. doi:10.1371/journal.pcbi.1000530
Editor: Ernest Fraenkel, Massachusetts Institute of Technology, United States of America
Published October 26, 2009
Copyright: © 2009 Van Voorhis et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by NIAID funding to the Seattle Structural Genomics Center for Infectious Disease (SSGCID), contract HHSN266200700057C; to the Medical Structural Genomics of Protozoan Pathogens (MSGPP), contract P01 AI067921; and to WCVV, grant 1R01AI080625. The funders had no role in preparation of the article.
Competing Interests: Co-author Lance Stewart is an employee of deCODE biostructures, which developed the Fragments-of-Life library presented in Figure 1 and discussed in sections titled ‘Fragment-based drug discovery’ and ‘Targeting oligomeric enzymes’. Fragments-of-Life™ is a technology trademarked by deCODE biostructures and chemistry (http://www.decodechembio.com/Capabilities/StructuralBiology/FragmentsofLife.aspx).
* E-mail: [email protected] (PJM); [email protected] (LJS)
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
PLoS Computational Biology | www.ploscompbiol.org 1 October 2009 | Volume 5 | Issue 10 | e1000530

Table 1. Examples of how target protein structure can assist drug discovery and development.

Source | Target Protein | Approach | Reference(s)
HIV | gp41 | Structure led to strategies that target viral entry. | [43–45]
HIV | Protease | Protease–inhibitor complexes allowed lead optimization. | [46–52]
HIV | Reverse transcriptase | Non-nucleoside inhibitor complexes led to drug design that targets pockets outside the enzyme’s active site. | [53–55]
Influenza virus | Neuraminidase | Complex with a transition state analog led to inhalable and orally active neuraminidase inhibitors. | [56–59]
Rhinovirus | Coat protein | Small fatty acid molecules bound in hydrophobic pocket led to new strategies of antiviral drug design. | [60]
Vibrio | Cholera toxin | Five receptor-binding sites provided inspiration for design of novel multivalent inhibitors. | [61]
Bacteria | Peptide deformylase | Protein–inhibitor complexes led to macrocyclic compounds with improved potency, selectivity and metabolic stability. | [62]
Trypanosoma | GAPDH | Novel adenosine analogs showed enhanced selectivity towards the parasite target versus human protein. | [63,64]
Human | Cyclophilin and calcineurin | A ternary complex with cyclosporine A led to insights into its immunosuppressive activity. | [65]
Human | Renin | The ligand-bound structure allowed design and improvement of orally active non-peptide inhibitors to regulate blood pressure. | [66]
Human | Coagulation factor Xa | Structure-based design led to improved pharmacological anticoagulant properties in a primate model. | [67]
Human | Adenosine deaminase | Optimization of a non-nucleoside inhibitor led to an orally active anti-inflammatory compound in a rat model. | [68]
Human | Kinases | Structures of kinases provided a basis to improve and design new therapeutics for various human diseases including cancer. | [69]

doi:10.1371/journal.pcbi.1000530.t001

Target Selection

A prerequisite of medical structural genomics is that the proteins whose structures are determined must be well-validated as good drug targets. The term “drugability” is often used to loosely describe how tractable any given target is for the development of a drug candidate. For infectious organisms, one key factor in defining drugability is that the target protein be essential for survival of the microbe. While essentiality has traditionally been defined using techniques such as “gene knockout” and RNA interference, these are not always feasible and should be complemented by chemical biology approaches (see below). Furthermore, the relevance of these experiments can often be difficult to assess, since the interplay of host and pathogen is complex and full of surprises. For example, tremendous effort has been devoted recently to the development of antagonists for targets in the fatty acid biosynthesis pathway of bacteria [3]. Potent drug-like molecules with high bioavailability have been developed that can effectively shut down bacterial replication in vitro. These compounds were found to be ineffective in subsequent animal testing, however, because fatty acids are quite abundant in vertebrates, so bacteria can secure these host molecules for their survival and growth even if their own fatty acid biosynthesis pathways are blocked [4]. Thus, to improve target selection for medical structural genomics, it will be important to collaborate with chemical biology groups to undertake screening campaigns to identify compounds that cause the death of a pathogen under the appropriate assay conditions [5]. If the target protein of a drug is known, medical structural genomics offers a rapid and efficient way to obtain ligand-bound structures by using high-throughput X-ray crystallography and/or NMR.
Conversely, when the target of a cell-active compound is unknown, medical structural genomics efforts provide purified protein for many potential drug targets that can be screened for interaction with the active compound by a number of biophysical methods (such as thermal stability [6]). The Medicinal Structural Genomics of Protozoan Pathogens (MSGPP, http://www.msgpp.org/) initiative has already begun such an effort by screening thousands of anti-malaria compounds against 67 potential Plasmodium falciparum targets expressed in bacteria (WC Van Voorhis, unpublished data). These approaches aim to generate knowledge about the biological effect of a small molecule on a target protein. Follow-up experiments are then needed to test the activity of this compound in live organisms in order to validate the target; this valuable “chemical validation” makes the target much more likely to be drugable, and thus worthy of more intensive effort. The future will likely see more medical structural genomics centers working with chemical biology groups that have collections of “phenotype-defined” compounds (i.e., those with known anti-pathogen activity). The result will be synergistic target validation and hit-to-lead development using structure-based drug design.

Fragment-Based Drug Discovery

Fragment-based drug discovery has rapidly gained interest within the pharmaceutical industry (reviewed in [7], with roots in the 128-compound cocktails of [8]) as an alternative to expensive and sometimes inefficient high-throughput screening methods for hit identification and optimization [9]. The general concept of fragment-based drug discovery involves screening libraries of “rule-of-three” compounds [10] against target macromolecules by using a variety of methods including X-ray crystallography, NMR, surface plasmon resonance, differential thermal denaturation, fluorescence polarization, and other techniques [7,11–14].
The rule of three consists of molecular weight <300 daltons, ≤3 rotatable bonds, ≤3 hydrogen bond donors/acceptors, and ClogP (calculated log of octanol/water partition coefficient) <3. These compounds generally include fragments or “building blocks” of available drugs, on the assumption that these fragments are more likely to be “drug-like.” Fragment-based drug discovery has been used by commercial and academic groups, including our own, and has led to a number of leads for further drug development [15]. At deCODE biostructures, a partner in the Seattle Structural Genomics Center for Infectious Disease (SSGCID, http://www.ssgcid.org/) consortium, the approach to assembling a fragment library has been somewhat different. The Fragments of Life (FOL) library (Figure 1) is a collection of approximately 1,400 structurally diverse small molecules found in the cellular environment, metabolites, natural products, and their derivatives or isosteres (molecules of similar size containing the same number and types of atoms).

Table 2. Structural genomics projects worldwide submitting to the Protein Data Bank.

Name | URL | Target Focus
Berkeley Structural Genomics Center (BSGC) | http://www.strgen.org/ | Near complete coverage of Mycoplasma genome
Center for Eukaryotic Structural Genomics (CESG) | http://www.uwstructuralgenomics.org/ | PSI Center: Eukaryotic bottlenecks, specifically solubility
Center for Structural Genomics of Infectious Disease (CSGID) | http://csgid.org/csgid/ | Medically relevant infectious disease targets
Center for Structure of Membrane Proteins (CSMP) | http://csmp.ucsf.edu/index.htm | PSI Center: Bacterial and human membrane proteins
Integrated Center for Structure and Function Innovation (ISFI) | http://techcenter.mbi.ucla.edu/ | PSI Center: Protein solubility and crystallization improvement
Israel Structural Proteomics Center | http://www.weizmann.ac.il/ISPC/ | Member of Structural Proteomics in Europe (see below)
Joint Center for Structural Genomics (JCSG) | http://www.jcsg.org/ | PSI Center: High-throughput pipeline development and operation
Marseilles Structural Genomics Program | http://www.afmb.univ-mrs.fr/rubrique93.html | Human health
Medical Structural Genomics of Pathogenic Protozoa (MSGPP) | http://www.msgpp.org/ | Structural and functional genomics of ten species of pathogenic protozoa
Montreal-Kingston Bacterial Structural Genomics Initiative (BSGI) | http://euler.bri.nrc.ca/brimsg/bsgi.html | ORFs from pathogenic and nonpathogenic bacterial strains
Mycobacterium Tuberculosis Structural Genomics Consortium (TBsgc) | http://www.doe-mbi.ucla.edu/TB/ | Mycobacterium tuberculosis: To understand pathogenesis and for structure-based drug design
Mycobacterium Tuberculosis Structural Proteomics Project (X-MTB) | http://webclu.bio.wzw.tum.de/binfo/proj/mtb/ | 35 Mycobacterium tuberculosis targets to identify five for drug development
New York SGX Research Center for Structural Genomics (NYSGXRC) | http://www.nysgrc.org/nysgrc/ | PSI Center: High-throughput pipeline development and operation
Ontario Center for Structural Proteomics (OCSP) | http://www.uhnres.utoronto.ca/centres/proteomics/ | Enzymatic activity characterization
Oxford Protein Production Facility | http://www.oppf.ox.ac.uk/OPPF/ | Human and pathogen targets of biomedical relevance
RIKEN Structural Genomics/Proteomics Initiative | http://www.rsgi.riken.jp/rsgi_e/ | Protein functional networks
Seattle Structural Genomics Center for Infectious Disease (SSGCID) | http://www.ssgcid.org/ | Medically relevant infectious disease targets
Southeast Collaboratory for Structural Genomics | http://www.secsg.org/ | High-throughput eukaryotic genome-scan methods development
Structural Genomics of Pathogenic Protozoa | http://www.sgpp.org/ | PSI Center: Three-dimensional structures of proteins from four major pathogenic protozoa
Structural Proteomics in Europe (SPINE) | http://www.spineurope.org/ | Structures of medically relevant proteins and protein complexes
Structural Proteomics in Europe 2-Complexes (SPINE2 - Complexes) | http://www.spine2.eu/SPINE2/ | Structures of protein complexes from medically relevant signaling pathways
Structural Genomics Consortium | http://www.thesgc.org/ | Medically relevant human and pathogen proteins
Structure 2 Function Project | http://s2f.umbi.umd.edu/ | Poorly characterized and hypothetical protein targets
The Accelerated Technologies Center for Gene to 3D Structure | http://atcg3d.org/default.aspx | PSI Center: Technologies development of X-ray source, synthetic gene design, and microfluidic crystallization
The Midwest Center for Structural Genomics (MCSG) | http://www.mcsg.anl.gov/ | PSI Center: High-throughput methods development and operation
The Northeast Structural Genomics Consortium (NESG) | http://www.nesg.org/ | PSI Center: Protein domains, network families, biomedical relevance

Note: Some centers with fewer than ten released structures in the PDB (www.rcsb.org/pdb/) are not shown. PSI, Protein Structure Initiative.
doi:10.1371/journal.pcbi.1000530.t002
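The rule-of-three filter described in this section can be illustrated with a short sketch. It reads the thresholds as molecular weight < 300 daltons, at most 3 rotatable bonds, at most 3 hydrogen bond donors and at most 3 acceptors (an assumption about how "donors/acceptors" is meant to be counted), and ClogP < 3. The `Fragment` record, the function name, and the descriptor values are hypothetical and purely illustrative; they are not taken from the Fragments of Life library or from any screening software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Precomputed descriptors for one candidate fragment (hypothetical record)."""
    name: str
    mol_weight: float       # daltons
    rotatable_bonds: int
    h_bond_donors: int
    h_bond_acceptors: int
    clogp: float            # calculated log of octanol/water partition coefficient


def passes_rule_of_three(frag: Fragment) -> bool:
    """Return True only if the fragment meets every rule-of-three threshold."""
    return (
        frag.mol_weight < 300
        and frag.rotatable_bonds <= 3
        and frag.h_bond_donors <= 3       # assumption: donors and acceptors
        and frag.h_bond_acceptors <= 3    # are each limited to three
        and frag.clogp < 3
    )


# Illustrative descriptor values only; these are not measured data.
library = [
    Fragment("example_small", 150.0, 1, 1, 2, 1.5),
    Fragment("example_heavy", 420.0, 2, 1, 3, 2.0),
    Fragment("example_flexible", 210.0, 6, 2, 2, 0.5),
]

selected = [f.name for f in library if passes_rule_of_three(f)]
print(selected)  # ['example_small']
```

In practice the descriptors would come from a cheminformatics toolkit rather than being typed in by hand; the sketch only shows how the thresholds combine into a single pass/fail filter.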
Also included in the FOL library is a series of biaryl small molecules (which contain two tethered five- or six-membered ring structures) that mimic protein secondary structure elements (e.g., α-helical turns). Thus, this fragment set is useful for targeting both the active sites of enzymes and more complex protein surfaces including allosteric small molecule binding sites and protein–protein interfaces [16].

Figure 1. Conceptual organization of the deCODE biostructures Fragments of Life library. The current ~1,400-compound library contains chemically tractable natural small molecule metabolites (FOL-Nat), metabolite-like compounds and their bioisosteres (FOL-NatD), and biaryl mimetics of protein architecture (FOL-Biaryl). The FOL-Nat members include any natural molecule of molecular weight <350 daltons that exists as a substrate, natural product, or allosteric regulator of any metabolic pathway in any cell type, such as the biosynthetic pathways for the neurotransmitter serotonin (1) and the plant hormone auxin (2). The FOL-Nat members also include secondary metabolites such as bestatin (3), a secondary metabolite of Streptomyces olivoreticuli [38]. FOL-NatD fragments are defined as heteroatom-containing derivatives, isosteres, or analogs of any FOL-Nat molecule. For example, fragments 4–7 contain the indole scaffold, which is known to be a privileged building block for drug molecules [39]. To emulate protein architecture, the FOL-Biaryl fragments were selected from a variety of biaryl compounds that are potential mimics of protein α, β, or γ turns [40–42]. These include a compound (8) whose structure in an energy-minimized state can be seen to mimic the architecture of an α-turn of a protein structure (here, residues Ser65-Ile66-Leu67-Lys68 of PDB ID:1RTP) and, similarly, a compound (9) whose structure mimics the β-turn of a protein structure (residues Ala20-Ala21-Asp22-Ser23).
doi:10.1371/journal.pcbi.1000530.g001

Targeting Oligomeric Enzymes

Protein–protein interactions and assemblies, ranging from simple dimers to extremely complex arrangements as seen in the ribosome or the nuclear pore complex, form the basis of most biological processes, and there are usually numerous points of contact between the macromolecules involved. Yet the protein–protein interfaces formed by oligomerization are not necessarily accompanied by a large gain in free energy, and small molecules have been shown to prevent critical protein–protein interactions [17]. These findings have prompted recent discussion of a structure-based approach aimed at developing novel small-molecule antibiotics that modulate protein activity by binding to an interface between subunits within multi-protein complexes [18]. The bacterial enzyme inorganic pyrophosphatase may serve as an example for this approach, since it exists in a hexameric state that requires conformational flexibility for its essential role in converting inorganic pyrophosphate into phosphate [19–21]. Moreover, whereas all bacterial inorganic pyrophosphatases function as a homohexamer, the eukaryotic cytosolic and mitochondrial inorganic pyrophosphatases function as homodimers [21]. Hence eukaryotic inorganic pyrophosphatases have different oligomeric interfaces than those of bacterial enzymes. This suggests that it may be possible to inhibit the bacterial inorganic pyrophosphatase safely by targeting its oligomeric state rather than its highly conserved active site. A similar approach has recently been used to identify species-specific modulators of porphobilinogen synthase (PBGS) activity [22].
SSGCID has solved the high-resolution X-ray crystal structure of inorganic pyrophosphatase from the pathogenic bacterium Burkholderia pseudomallei, and a subsequent FOL screen of this target identified several fragments that specifically bind at multiple oligomerization pockets in a molecular interface between the two trimers of the homohexamer (Figure 2). While these fragments remain to be validated in terms of their species-specific inhibition of inorganic pyrophosphatase activity, they represent potential starting points for the development of novel antibiotics.

Industry-Generated Structures and the Protein Data Bank

As we have seen above, protein structure information is the bread and butter of structure-based drug discovery. Structural genomics projects (Table 2) have substantially increased the number of protein structures solved and have made this information freely and openly available (i.e., at no cost and without restriction by copyright or other constraints) by depositing it in the Protein Data Bank (PDB) [23]. Most publishers have policies that require authors to deposit structural data in the PDB at the time of publication, so structures determined by academic researchers worldwide are, for the most part, well disseminated. By contrast, the pharmaceutical industry is sitting on a mountain of structural data for protein–ligand complexes from globally important pathogens, which is not available to the wider scientific community. The secrecy engendered by the current economic incentives driving drug discovery in the commercial sector has led to a substantial waste of precious resources through duplication of effort and inability to learn from others’ successes and failures. The situation is unlikely to change without a concerted effort to find ways to overcome the financial and intellectual property barriers that prevent dissemination of this information.
A recent publication suggested that open access industry–academia partnerships may provide one possible model [24]. We propose that the United States National Institutes of Health, along with other national and international research-funding agencies, issue calls for proposals that will fund the transfer of the highly valuable structural information from corporate databases into the PDB. Such an effort would obviously require discussion with industrial parties to negotiate mutually acceptable policies and mechanisms for the deposition of these structures in the public databases. These might include relaxation of release standards for industrial entities, such that structural information could be safely deposited in the PDB at the time of structure determination and released only at a later date more appropriate for protection of intellectual property.

Figure 2. B. pseudomallei inorganic pyrophosphatase with bound ligand at an oligomeric interface. Homo-hexameric bacterial inorganic pyrophosphatase is a dimer of trimers (blue and green). The illustration shows the hexamer structure in a complex with three ligand fragment molecules (red spheres and stick structures represent fragment FOL 110), each of which is located at one of three “dimer of trimer” interfaces (1.5 ligands per monomer) (PDBID:3EJ0). The location of one pyrophosphate substrate (cyan spheres) at the active site of one of the monomers is indicated here based on the superimposed structure of the hexamer with pyrophosphate bound in the active site (PDBID:3EIY). The binding sites of the ligands (red) are clearly seen in a pocket formed by the homo-oligomeric assemblage, which is distant from the active site where pyrophosphate (cyan) binds.
doi:10.1371/journal.pcbi.1000530.g002
Challenges for the Future

We are currently witnessing an explosion in technological and computational advances in structural genomics, with protein structures of hundreds or thousands of medically relevant targets from infectious disease organisms likely to be available over the next few years. This new information provides both academic and for-profit scientists with an unprecedented opportunity to accelerate the development of new and improved chemotherapeutic agents against these pathogens. One major challenge will be the adaptation of existing fragment-based drug design methods to match the scale of the structural genomics era. New high-throughput methods need to be developed for fragment screening to enhance the success rate for protein–ligand structure determination. Major attention must also be paid to the development of fully automated, very high throughput crystal growth screening methods to elucidate the binding of well-selected compounds to medically relevant targets. These screens need to cover many (up to 100) protein variants [25,26], 1,000–10,000 different small molecule compounds, and approximately 1,000 different crystal growth conditions [27], resulting in 10^8 to 10^9 conditions to be tested for a single drug target. Obviously, this will require development of even smaller volume assays than those currently in use [28–31], down to the low picoliters, and automated detection of crystals in the millions of crystallization chambers [32–34]. Further development of automated capillary crystallization methods [35] might provide another way to achieve the very high throughput crystal screening required for reaching the full power of medical structural genomics in the future. Cryoprotection of the crystals is a specific hurdle, although it might be possible to routinely collect and merge partial datasets from multiple crystals under non-cryo conditions.
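The screening-scale estimate in this section (up to 100 protein variants, 1,000 to 10,000 compounds, and roughly 1,000 crystal growth conditions) can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 10 pL drop volume is an assumed value chosen to represent "low picoliters", not a figure from the article.

```python
def screening_conditions(n_variants: int, n_compounds: int,
                         n_crystal_conditions: int) -> int:
    """Size of the full variant x compound x crystallization-condition grid."""
    return n_variants * n_compounds * n_crystal_conditions


def total_volume_liters(n_experiments: int, drop_volume_pl: float) -> float:
    """Total drop volume if every experiment uses one drop of the given size."""
    return n_experiments * drop_volume_pl * 1e-12  # 1 pL = 1e-12 L


low = screening_conditions(100, 1_000, 1_000)
high = screening_conditions(100, 10_000, 1_000)
print(low, high)  # 100000000 1000000000

# Hypothetical 10 pL drops, chosen only to illustrate "low picoliters".
print(total_volume_liters(high, 10.0))
```

Under that assumed drop size, the upper estimate of 10^9 experiments corresponds to about 10 mL of total drop volume, which gives a rough sense of why picoliter-scale assays make a grid of this size conceivable.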
Alternatively, the use of micromeshes [36,37] and further miniaturization of trays and other crystal screening tools may allow cryoprotection of many crystals simultaneously. In addition, existing databases will need to be modified to allow easy dissemination of the results from these fragment screens, and a serious effort should be made to persuade small and big pharma to release coordinates of drug targets from globally important infectious disease organisms. It will also be critical (but challenging) for structural biologists to collaborate with medicinal chemists and molecular biologists to turn these fragments from promising leads into effective drugs. Together, these steps should begin to release a flood of structures that provide a tremendous resource for improving health in rich and poor countries alike.

Acknowledgments

The authors wish to thank all the individuals who have dedicated themselves to the SSGCID and MSGPP projects. In particular, we thank Robin Stacy, Bart Staker, Alberto Napuli, Frank E. Zucker, Erkang Fan, Christophe Verlinde, Ethan Merritt, and Frederick Buckner, to name but a few.

References

1. Payne DJ, Gwynn MN, Holmes DJ, Pompliano DL (2007) Drugs for bad bugs: Confronting the challenges of antibacterial discovery. Nat Rev Drug Discov 6: 29–40.
2. Haquin S, Oeuillet E, Pajon A, Harris M, Jones AT, et al. (2008) Data management in structural genomics: An overview. Methods Mol Biol 426: 49–79.
3. Wright HT, Reynolds KA (2007) Antibacterial targets in fatty acid biosynthesis. Curr Opin Microbiol 10: 447–453.
4. Brinster S, Lamberet G, Staels B, Trieu-Cuot P, Gruss A, et al. (2009) Type II fatty acid synthesis is not a suitable antibiotic target for gram-positive pathogens. Nature 458: 83–86.
5. Hoon S, Smith AM, Wallace IM, Suresh S, Miranda M, et al. (2008) An integrated platform of genomic assays reveals small-molecule bioactivities.
Nat Chem Biol 4: 498–506.
6. Ericsson UB, Hallberg BM, Detitta GT, Dekker N, Nordlund P (2006) Thermofluor-based high-throughput stability optimization of proteins for structural studies. Anal Biochem 357: 289–298.
7. Congreve M, Chessari G, Tisi D, Woodhead AJ (2008) Recent developments in fragment-based drug discovery. J Med Chem 51: 3661–3689.
8. Verlinde CLMJ, Kim H, Bernstein BE, Mande SC, Hol WG (1997) Antitrypanosomiasis drug development based on structures of glycolytic enzymes. In: Veerapandian P, ed. Structure-based drug design. New York: Marcel Dekker. pp 365–394.
9. Rees DC, Congreve M, Murray CW, Carr R (2004) Fragment-based lead discovery. Nat Rev Drug Discov 3: 660–672.
10. Congreve M, Carr R, Murray C, Jhoti H (2003) A "rule of three" for fragment-based lead discovery? Drug Discov Today 8: 876–877.
11. Nienaber VL, Greer J (2000) Discovering novel ligands for macromolecules using X-ray crystallographic screening. Nature Biotechnol 18: 1105–1108.
12. Neumann T, Junker HD, Schmidt K, Sekul R (2007) SPR-based fragment screening: Advantages and applications. Curr Top Med Chem 7: 1630–1642.
13. Jhoti H, Cleasby A, Verdonk M, Williams G (2007) Fragment-based screening using X-ray crystallography and NMR spectroscopy. Curr Opin Chem Biol 11: 485–493.
14. Erlanson DA (2006) Fragment-based lead discovery: A chemical update. Curr Opin Biotechnol 17: 643–652.
15. Bosch J, Robien MA, Mehlin C, Boni E, Riechers A, et al. (2006) Using fragment cocktail crystallography to assist inhibitor design of Trypanosoma brucei nucleoside 2-deoxyribosyltransferase. J Med Chem 49: 5939–5946.
16. Davies DR, Mamat B, Magnusson OT, Christensen J, Haraldsson MH, et al. (2009) Discovery of leukotriene A4 hydrolase inhibitors using metabolomics biased fragment crystallography. J Med Chem 52: 4694–4715.
17. Liuzzi M, Deziel R, Moss N, Beaulieu P, Bonneau AM, et al. (1994) A potent peptidomimetic inhibitor of HSV ribonucleotide reductase with antiviral activity in vivo. Nature 372: 695–698.
18. Wells JA, McClendon CL (2007) Reaching for high-hanging fruit in drug discovery at protein-protein interfaces. Nature 450: 1001–1009.
19. Kankare J, Salminen T, Lahti R, Cooperman BS, Baykov AA, et al. (1996) Structure of Escherichia coli inorganic pyrophosphatase at 2.2 Å resolution. Acta Crystallogr D Biol Crystallogr 52: 551–563.
20. Oksanen E, Ahonen AK, Tuominen H, Tuominen V, Lahti R, et al. (2007) A complete structural description of the catalytic cycle of yeast pyrophosphatase. Biochemistry 46: 1228–1239.
21. Sivula T, Salminen A, Parfenyev AN, Pohjanjoki P, Goldman A, et al. (1999) Evolutionary aspects of inorganic pyrophosphatase. FEBS Lett 454: 75–80.
22. Lawrence SH, Ramirez UD, Tang L, Fazliyez F, Kundrat L, et al. (2008) Shape shifting leads to small-molecule allosteric drug discovery. Chem Biol 15: 586–596.
23. Berman H, Henrick K, Nakamura H, Markley JL (2007) The worldwide Protein Data Bank (wwPDB): Ensuring a single, uniform archive of PDB data. Nucleic Acids Res 35: D301–303.
24. Edwards AM, Bountra C, Kerr DJ, Willson TM (2009) Open access chemical and clinical probes to support drug discovery. Nat Chem Biol 5: 436–440.
25. Choi KH, Groarke JM, Young DC, Rossmann MG, Pevear DC, et al. (2004) Design, expression, and purification of a Flaviviridae polymerase using a high-throughput approach to facilitate crystal structure determination. Protein Sci 13: 2685–2692.
26. Graslund S, Sagemark J, Berglund H, Dahlgren LG, Flores A, et al. (2008) The use of systematic N- and C-terminal deletions to promote production and structural studies of recombinant proteins. Protein Expr Purif 58: 210–221.
27. Luft JR, Collins RJ, Fehrman NA, Lauricella AM, Veatch CK, et al. (2003) A deliberate approach to screening for initial crystallization conditions of biological macromolecules. J Struct Biol 142: 170–179.
28. Santarsiero BDYD, Lee CC, Spraggon G, Gu J, Scheibe D, Uber EC, Cornell EW, Nordmeyer RA, Kolbe WF, Jin J, Jones AL, Jaklevic JM, Schultz PG, Stevens RC (2002) An approach to rapid protein crystallization using nanodroplets. J Appl Crystallogr 35: 278–281.
29. Hansen CL, Skordalakes E, Berger JM, Quake SR (2002) A robust and scalable microfluidic metering method that allows protein crystal growth by free interface diffusion. Proc Natl Acad Sci U S A 99: 16531–16536.
30. Zheng B, Roach LS, Ismagilov RF (2003) Screening of protein crystallization conditions on a microfluidic chip using nanoliter-size droplets. J Am Chem Soc 125: 11170–11171.
31. Gerdts CJ, Elliott M, Lovell S, Mixon MB, Napuli AJ, et al. (2008) The plug-based nanovolume Microcapillary Protein Crystallization System (MPCS). Acta Crystallogr D Biol Crystallogr 64: 1116–1122.
32. Wilson J (2002) Towards the automated evaluation of crystallization trials. Acta Crystallogr D Biol Crystallogr 58: 1907–1914.
33. Pan S, Shavit G, Penas-Centeno M, Xu DH, Shapiro L, et al. (2006) Automated classification of protein crystallization images using support vector machines with scale-invariant texture and Gabor features. Acta Crystallogr D Biol Crystallogr 62: 271–279.
34. Liu R, Freund Y, Spraggon G (2008) Image-based crystal detection: A machine-learning approach. Acta Crystallogr D Biol Crystallogr 64: 1187–1195.
35. Fan E, Baker D, Fields S, Gelb MH, Buckner FS, et al. (2008) Structural genomics of pathogenic protozoa: An overview. Methods Mol Biol 426: 497–513.
36. Wagner A, Diez J, Schulze-Briese C, Schluckebier G (2009) Crystal structure of ultralente: A microcrystalline insulin suspension. Proteins 74: 1018–1027.
37. Thorne RESZ, Kmetko J, O'Niell J, Gillilan R (2003) Microfabricated mounts for high-throughput macromolecular cryocrystallography. J Applied Crystallography 36: 1455–1460.
38. Schorlemmer HU, Bosslet K, Dickneite G, Luben G, Sedlacek HH (1984) Studies on the mechanisms of action of the immunomodulator Bestatin in various screening test systems. Behring Inst Mitt: 157–173.
39. Costantino L, Barlocco D (2006) Privileged structures as leads in medicinal chemistry. Curr Med Chem 13: 65–85.
40. Biros SM, Moisan L, Mann E, Carella A, Zhai D, et al. (2007) Heterocyclic alpha-helix mimetics for targeting protein-protein interactions. Bioorg Med Chem Lett 17: 4641–4645.
41. Robinson JA (2008) Beta-hairpin peptidomimetics: design, structures and biological activities. Acc Chem Res 41: 1278–1288.
42. Saraogi I, Hamilton AD (2008) alpha-Helix mimetics as inhibitors of protein-protein interactions. Biochem Soc Trans 36: 1414–1417.
43. Root MJ, Steger HK (2004) HIV-1 gp41 as a target for viral entry inhibition. Curr Pharm Des 10: 1805–1825.
44. Weissenhorn W, Dessen A, Harrison SC, Skehel JJ, Wiley DC (1997) Atomic structure of the ectodomain from HIV-1 gp41. Nature 387: 426–430.
45. Ferrer M, Kapoor TM, Strassmaier T, Weissenhorn W, Skehel JJ, et al. (1999) Selection of gp41-mediated HIV-1 cell entry inhibitors from biased combinatorial libraries of nonnatural binding elements. Nat Struct Biol 6: 953–960.
46. Lapatto R, Blundell T, Hemmings A, Overington J, Wilderspin A, et al. (1989) X-ray analysis of HIV-1 proteinase at 2.7 Å resolution confirms structural homology among retroviral enzymes. Nature 342: 299–302.
47. Miller M, Schneider J, Sathyanarayana BK, Toth MV, Marshall GR, et al. (1989) Structure of complex of synthetic HIV-1 protease with a substrate-based inhibitor at 2.3 Å resolution. Science 246: 1149–1152.
48. Navia MA, Fitzgerald PM, McKeever BM, Leu CT, Heimbach JC, et al. (1989) Three-dimensional structure of aspartyl protease from human immunodeficiency virus HIV-1. Nature 337: 615–620.
49. Wlodawer A, Miller M, Jaskolski M, Sathyanarayana BK, Baldwin E, et al. (1989) Conserved folding in retroviral proteases: Crystal structure of a synthetic HIV-1 protease. Science 245: 616–621.
50. Wlodawer A, Vondrasek J (1998) Inhibitors of HIV-1 protease: A major success of structure-assisted drug design. Annu Rev Biophys Biomol Struct 27: 249–284.
51. Abdel-Rahman HM, Al-karamany GS, El-Koussi NA, Youssef AF, Kiso Y (2002) HIV protease inhibitors: Peptidomimetic drugs and future perspectives. Curr Med Chem 9: 1905–1922.
52. Chrusciel RA, Strohbach JW (2004) Non-peptidic HIV protease inhibitors. Curr Top Med Chem 4: 1097–1114.
53. Das K, Lewi PJ, Hughes SH, Arnold E (2005) Crystallography and the design of anti-AIDS drugs: Conformational flexibility and positional adaptability are important in the design of nonnucleoside HIV-1 reverse transcriptase inhibitors. Prog Biophys Mol Biol 88: 209–231.
54. Kohlstaedt LA, Wang J, Friedman JM, Rice PA, Steitz TA (1992) Crystal structure at 3.5 Å resolution of HIV-1 reverse transcriptase complexed with an inhibitor. Science 256: 1783–1790.
55. Smerdon SJ, Jager J, Wang J, Kohlstaedt LA, Chirino AJ, et al. (1994) Structure of the binding site for nonnucleoside inhibitors of the reverse transcriptase of human immunodeficiency virus type 1. Proc Natl Acad Sci U S A 91: 3911–3915.
56. Babu YS, Chand P, Bantia S, Kotian P, Dehghani A, et al. (2000) BCX-1812 (RWJ270201): Discovery of a novel, highly potent, orally active, and selective influenza neuraminidase inhibitor through structure-based drug design. J Med Chem 43: 3482–3486.
57. Bossart-Whitaker P, Carson M, Babu YS, Smith CD, Laver WG, et al. (1993) Three-dimensional structure of influenza A N9 neuraminidase and its complex with the inhibitor 2-deoxy 2,3-dehydro-N-acetyl neuraminic acid. J Mol Biol 232: 1069–1083.
58. Kim CU, Lew W, Williams MA, Liu H, Zhang L, et al. (1997) Influenza neuraminidase inhibitors possessing a novel hydrophobic interaction in the enzyme active site: Design, synthesis, and structural analysis of carbocyclic sialic acid analogues with potent anti-influenza activity. J Am Chem Soc 119: 681–690.
59. von Itzstein M, Wu WY, Kok GB, Pegg MS, Dyason JC, et al. (1993) Rational design of potent sialidase-based inhibitors of influenza virus replication. Nature 363: 418–423.
60. Hadfield AT, Lee W, Zhao R, Oliveira MA, Minor I, et al. (1997) The refined structure of human rhinovirus 16 at 2.15 Å resolution: Implications for the viral life cycle. Structure 5: 427–441.
61. Merritt EA, Zhang Z, Pickens JC, Ahn M, Hol WG, et al. (2002) Characterization and crystal structure of a high-affinity pentavalent receptor-binding inhibitor for cholera toxin and E. coli heat-labile enterotoxin. J Am Chem Soc 124: 8818–8824.
62. Hu X, Nguyen KT, Jiang VC, Lofland D, Moser HE, et al. (2004) Macrocyclic inhibitors for peptide deformylase: A structure-activity relationship study of the ring size. J Med Chem 47: 4941–4949.
63. Aronov AM, Verlinde CL, Hol WG, Gelb MH (1998) Selective tight binding inhibitors of trypanosomal glyceraldehyde-3-phosphate dehydrogenase via structure-based drug design. J Med Chem 41: 4790–4799.
64. Bressi JC, Choe J, Hough MT, Buckner FS, Van Voorhis WC, et al. (2000) Adenosine analogues as inhibitors of Trypanosoma brucei phosphoglycerate kinase: Elucidation of a novel binding mode for a 2-amino-N(6)-substituted adenosine. J Med Chem 43: 4135–4150.
65. Jin L, Harrison SC (2002) Crystal structure of human calcineurin complexed with cyclosporin A and human cyclophilin. Proc Natl Acad Sci U S A 99: 13522–13526.
66. Rahuel J, Rasetti V, Maibaum J, Rueger H, Goschke R, et al. (2000) Structure-based drug design: The discovery of novel nonpeptide orally active inhibitors of human renin. Chem Biol 7: 493–504.
67. Lam PY, Clark CG, Li R, Pinto DJ, Orwat MJ, et al. (2003) Structure-based design of novel guanidine/benzamidine mimics: Potent and orally bioavailable factor Xa inhibitors as novel anticoagulants. J Med Chem 46: 4405–4418.
68. Terasaka T, Kinoshita T, Kuno M, Seki N, Tanaka K, et al.
(2004) Structure-based design, synthesis, and structure-activity relationship studies of novel non-nucleoside adenosine deaminase inhibitors. J Med Chem 47: 3730–3743.
69. Noble ME, Endicott JA, Johnson LN (2004) Protein kinase inhibitors: Insights into drug design from structure. Science 303: 1800–1805.

Review

The Key Role of Genomics in Modern Vaccine and Drug Design for Emerging Infectious Diseases

Kate L. Seib1, Gordon Dougan2, Rino Rappuoli1*
1 Novartis Vaccines and Diagnostics, Siena, Italy, 2 The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom

Abstract: It can be argued that the arrival of the "genomics era" has significantly shifted the paradigm of vaccine and therapeutics development from microbiological to sequence-based approaches. Genome sequences provide a previously unattainable route to investigate the mechanisms that underpin pathogenesis. Genomics, transcriptomics, metabolomics, structural genomics, proteomics, and immunomics are being exploited to perfect the identification of targets, to design new vaccines and drugs, and to predict their effects in patients. Furthermore, human genomics and related studies are providing insights into aspects of host biology that are important in infectious disease. This ever-growing body of genomic data and new genome-based approaches will play a critical role in the future to enable timely development of vaccines and therapeutics to control emerging infectious diseases.

By controlling debilitating and often-lethal infectious diseases, vaccines and antibiotics have had an enormous impact on world health. Now, with the arrival of the "genomics era," a paradigm shift is occurring in the development of vaccines, and potentially also in the development of antibiotics, that is providing fresh impetus to this field. The world is still faced with a huge burden of infection, however, by classic pathogens (e.g., typhoid, measles), recently discovered causes of disease (e.g., Helicobacter pylori and hepatitis C virus [HCV]), and emerging infectious diseases (EIDs, e.g., H1N1 swine flu and severe acute respiratory syndrome coronavirus [SARS-CoV]). In addition, variant forms of previously identified infectious diseases are reemerging (e.g., Streptococcus pyogenes, also known as group A streptococcus [GAS], and dengue fever), along with antibiotic-resistant forms of microbes (e.g., methicillin-resistant Staphylococcus aureus [MRSA] and Mycobacterium tuberculosis) [1,2] (for a list of EIDs see http://www3.niaid.nih.gov/topics/emerging/list.htm). The World Health Organization (WHO) estimates that we can expect at least one such new pathogen to appear every year. The fact that an infectious disease has emerged or reemerged indicates immune naïvety in the infected population, or altered virulence potential or an increase in antibiotic/antiviral resistance in the pathogen population. The rapid development of vaccines and therapeutics that target these pathogens is therefore essential to limit their spread.

Traditional empirical approaches that screen for vaccines or drugs a few candidates at a time are time-consuming and have often proven insufficient to control many EIDs, particularly when the causative pathogens are antigenically diverse (e.g., HIV), cannot be cultivated in the laboratory (e.g., HCV), lack suitable animal models of infection (e.g., Neisseria spp.), have complex mechanisms of pathogenesis (e.g., retroviruses), and/or are controlled by mucosal or T cell–dependent immune responses rather than humoral immune responses (e.g., Shigella spp., M. tuberculosis) [3]. For many EIDs, the wealth of information emerging in the genome era has already had a significant impact on the way we approach vaccine and therapeutic development. For EIDs that appear in the near future, genomics will be in the first line of defense in terms of antigen identification, diagnostic development, and functional characterization.

Since the completion of the genome sequence of Haemophilus influenzae, the first finished bacterial genome sequence, in 1995 [4], advances in sequencing technology and bioinformatics have produced an exponential growth of genome sequence information. At least one genome sequence is now available for each major human pathogen. As of October 2009, over 1,000 bacterial genomes were "completed" (i.e., closed genomes and whole genome shotgun sequences) and more than 1,000 were ongoing; over 3,000 viral genomes were completed (http://www.genomesonline.org/gold.cgi, http://www.ncbi.nlm.nih.gov/genomes/MICROBES/microbial_taxtree.html, http://cmr.jcvi.org/tigr-scripts/CMR/shared/Genomes.cgi). For a bacterial pathogen, which may have more than 4,000 genes, the genome sequence provides the complete genetic repertoire of antigens or drug targets from which novel candidates can be identified. For viral pathogens that may possess fewer than 10 genes, genomics can be used to define the variability that may exist between isolates. Host genetic factors also play a role in infectious disease [5,6], however, and the availability of "complete" human genome sequences, as well as large-scale human genome projects (see http://www.1000genomes.org/), are valuable resources. Hence, the sequences of both pathogen and host genomes can facilitate identification of a growing number of potential vaccine and drug targets (Figure 1). It is estimated that 10–100 times more candidates can be identified in one to two years using genomics-based approaches than can be identified by conventional methods in the same time frame. Furthermore, genomics-based vaccine projects have substantially increased our understanding of microbial physiology, epidemiology, pathogenesis, and protein functions (see Box 1).

Citation: Seib KL, Dougan G, Rappuoli R (2009) The Key Role of Genomics in Modern Vaccine and Drug Design for Emerging Infectious Diseases. PLoS Genet 5(10): e1000612. doi:10.1371/journal.pgen.1000612
Editor: Nicholas J. Schork, University of California San Diego and The Scripps Research Institute, United States of America
Published October 26, 2009
Copyright: © 2009 Seib et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: KLS is the recipient of an Australian NHMRC CJ Martin Fellowship. GD is supported by The Wellcome Trust. KLS and RR are employed by Novartis Vaccines. The funders had no role in the preparation of the article.
Competing Interests: KLS and RR are employed by Novartis Vaccines.
* E-mail: [email protected]
This article is part of the "Genomics of Emerging Infectious Disease" PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).

Figure 1. Genomics-based approaches used in the control of EIDs from the outbreak of a disease to the development of a vaccine or drug. (A) The causative agent of a disease may first be identified from patient samples by using metagenomics.
(B) Vaccine and therapeutic targets can be identified from the pathogen genome using a variety of screening approaches that focus on the genome, transcriptome, proteome, immunome or structural genome. (C) The human genome can be screened to avoid homologies or similarities with pathogen vaccine and therapeutic targets, or to identify new targets. (D) Once candidate vaccine and therapeutic targets have been identified they must be shown to provide protection against disease and to be safe for use in patients. (E) The clinically tested vaccine or therapeutic can then be licensed for use. The clinical responses of a vaccine and/or therapeutic can be analyzed using human genome based studies (dotted arrows). The pathogen genome can also be used to analyze mutants that are able to evade the immune system in vaccinated subjects or organisms that develop antibiotic resistance. Examples of the approaches indicated are given in Table 1. doi:10.1371/journal.pgen.1000612.g001

Box 1: Reverse Vaccinology Drives the Discovery of New Protein Functions

Reverse vaccinology involves the in silico screening of the entire genome of a pathogen to find genes that encode proteins with the attributes of good vaccine targets, using either the genome of a single pathogenic isolate or the pan-genome (the genomic information from several isolates) of a pathogenic species.

Pili in pathogenic streptococci play a key role in virulence and are promising vaccine candidates

The identification of pili (long filamentous structures that extend from the bacterial surface) in the main pathogenic strains of streptococci is a good example of how genomics can lead to the discovery of protein functions and increased understanding of host–pathogen interactions. The pili of gram-negative bacteria are well-described virulence factors. Little was known, however, about pili in gram-positive bacteria before the sequencing and analysis of the genomes of S. pyogenes, S. agalactiae, and S. pneumoniae (reviewed in [72]). During analysis of eight S. agalactiae genome sequences, three protective antigens identified by pan-genomic reverse vaccinology [20] were found to contain LPXTG motifs typical of cell wall-anchored proteins and seen to assemble into pili [73]. Further bioinformatics analysis revealed three independent loci that encode structurally distinct pilus types, each of which contains two surface-exposed antigens capable of eliciting protective immunity in mice [75]. Because of the limited variability of S. agalactiae pili, it has been suggested that a combination of only three pilin subunits could lead to broad protective immunity [74]. Following the identification of S. agalactiae pili, typical pilus regions were identified in the available S. pyogenes genomes based on the presence of genes encoding LPXTG-containing proteins. In addition, a combination of recombinant pilus proteins was shown to confer protection in mice against mucosal challenge with virulent S. pyogenes isolates [75]. Falugi and colleagues have since found that S. pyogenes pili are encoded by nine different gene clusters, and they estimate that a vaccine comprising a combination of 12 backbone variants could provide protection against over 90% of circulating S. pyogenes strains [76]. The availability of multiple complete genome sequences for S. pneumoniae, and the increased understanding of pilus proteins in other pathogenic streptococci, led to the discovery of two pilus "islands" that encode proteins that play a role in adherence to lung epithelial cells and colonization in a murine model of infection, where they elicit host inflammatory responses [77,78]. In addition, the pilus subunits confer protection in passive and active immunization models [79]. The presence of pili that contain protective antigens in all three principal streptococcal pathogens indicates that these structures play an important role in virulence.

Reverse vaccinology leads to identification of the fHBP and its role in meningococcal species specificity

Serogroup B N. meningitidis (MenB) strains are responsible for the majority of meningococcal disease in the developed world, yet there is no comprehensive MenB vaccine available. Screening of the MenB genome for vaccine candidates by using reverse vaccinology led to the discovery of the meningococcal factor H-binding protein (fHBP) [15], which was recently suggested to play an important role in the species specificity of N. meningitidis [80]. fHBP is a component of the Novartis multivalent MenB vaccine that entered Phase III clinical testing in 2008 [16,17] and is also under investigation by Wyeth Vaccines (designated LP2086) [81] and other groups [82]. Initially identified as the genome-derived Neisseria antigen 1870 (GNA1870), a Neisseria-specific putative surface lipoprotein of unknown function, fHBP was renamed because of its ability to bind complement factor H (fH), a molecule that down-regulates activation of the complement alternative pathway. Hence, binding of fH to the surface of Neisseria allows the pathogen to evade complement-mediated killing by the innate immune system [83]. fHBP is expressed by all N. meningitidis strains studied [84]. It induces high levels of bactericidal antibodies in mice [16] and is important for survival of bacteria in human serum and blood [83,85,86]. The discovery that binding of fH to N. meningitidis is specific for human fH, and that human fH alone is able to downregulate complement activation and bactericidal activity leading to increased bacterial survival has significant implications for the study of this organism [80]. The administration of human fH to infant rats challenged with MenB led to a greater than 10-fold increase in survival of bacteria [80], providing an important insight into host–pathogen interactions that may lead to the development of new animal models of infection.

From the outbreak of a disease, metagenomics (the study of all the genetic material recovered directly from a sample) can be applied to diseased human samples to aid the rapid identification of the causative agent [7,8]. Once the complete genome sequence of the organism is available, high-throughput approaches can be used to screen for target molecules, as outlined below and in Table 1 [9,10]. Screening approaches vary depending on the nature of the pathogen but are based on several accepted principles and key requirements of vaccines and therapeutics, including the need for targets to be (i) expressed and accessible to the host immune system, or to a therapeutic agent, during human disease; (ii) genetically conserved; (iii) important for survival or pathogenesis; and (iv) free of measurable homology or similarity to host factors. Although many of the approaches described here focus on vaccine development, which involves screening of candidates for immunogenicity, they are largely applicable to drug development by altering the selection criteria used and screening candidates against compound libraries [11–13].

Reverse Vaccinology, Pan-genomics, and Comparative Genomics

The idea behind reverse vaccinology is to screen an entire pathogen genome to find genes that encode proteins with the attributes of good vaccine targets, such as bacterial surface associated proteins [14]. These proteins can then undergo normal laboratory evaluation for immunogenicity. The Neisseria meningitidis serogroup B (MenB) reverse vaccinology project provides the "proof of concept" for this type of approach.
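The in silico screening idea can be illustrated as a simple filter over annotated open reading frames, applying the four target requirements listed above: (i) accessibility, (ii) conservation, (iii) importance for survival or pathogenesis, and (iv) lack of similarity to host factors. This is a minimal sketch rather than the pipeline used in any of the cited projects. The `Orf` record, its field names, the example values, and both thresholds are hypothetical, and in practice each annotation would come from separate prediction tools and laboratory data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Orf:
    """Hypothetical annotation record for one predicted open reading frame."""
    name: str
    surface_exposed: bool        # (i) predicted surface-exposed or secreted
    conserved_fraction: float    # (ii) fraction of sequenced isolates carrying the gene
    needed_for_virulence: bool   # (iii) important for survival or pathogenesis
    host_similarity: float       # (iv) best sequence identity to a host protein, 0..1


def select_candidates(
    orfs: List[Orf],
    min_conserved: float = 0.9,        # assumed cutoff, for illustration only
    max_host_similarity: float = 0.3,  # assumed cutoff, for illustration only
) -> List[Orf]:
    """Keep only ORFs that satisfy all four target requirements."""
    return [
        orf
        for orf in orfs
        if orf.surface_exposed
        and orf.conserved_fraction >= min_conserved
        and orf.needed_for_virulence
        and orf.host_similarity <= max_host_similarity
    ]


# Made-up example data: each rejected ORF fails exactly one criterion.
example_orfs = [
    Orf("orf_A", True, 1.00, True, 0.05),    # passes all four criteria
    Orf("orf_B", False, 1.00, True, 0.05),   # not accessible
    Orf("orf_C", True, 0.40, True, 0.05),    # poorly conserved
    Orf("orf_D", True, 0.95, True, 0.60),    # resembles a host protein
    Orf("orf_E", True, 0.95, False, 0.10),   # not needed for survival/pathogenesis
]

candidates = select_candidates(example_orfs)
print([orf.name for orf in candidates])  # prints ['orf_A']
```

Candidates that pass such a computational filter would still need laboratory evaluation for immunogenicity, and relaxing or swapping the selection criteria is how the same screen could be redirected from vaccine antigens to drug targets.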
This project identified more novel vaccine candidates in 18 months than had been discovered in 40 years of conventional vaccinology [15]. Analysis of the genome sequence of the virulent MenB strain MC58 found 2,158 predicted open reading frames (ORFs); these were screened using bioinformatics tools to identify 570 ORFs that were predicted to encode surface-exposed or secreted proteins that might be accessible to the immune system [15]. Antigen screening continued on the basis of several criteria: the ability of antigens to be expressed in Escherichia coli as recombinant proteins (350 candidates); confirmation by ELISA and flow cytometry that the antigen is exposed on the cell surface (91 candidates); the ability of induced antibodies to elicit killing, as measured by serum bactericidal assay and/or passive protection in infant rat assays (28 candidates); and screening of a panel of diverse meningococcal isolates to determine whether the antigens are conserved. This approach resulted in the development of a multi-component recombinant MenB vaccine that entered Phase III clinical trials in 2008 [16,17].

Table 1. Approaches to identify vaccine and/or drug targets against EIDs in the genomic era.

Approach: Genomics/reverse vaccinology. Analysis of the genetic material of an organism in order to identify the repertoire of protein antigens/drug targets the organism has the potential to express.
Methods used: Bioinformatics screening of the genome sequence to identify ORFs predicted to be exposed on the surface of the pathogen or secreted, expression of recombinant proteins, generation of antibodies in mice to confirm surface exposure, and bactericidal activity [14].
Limitations of method: Prediction algorithms need to be validated. Non-protein antigens including polysaccharides or glycolipids, and post-translational modifications cannot be identified. High-throughput cloning and protein expression is required.
Example organism: Serogroup B N. meningitidis [15,16]
Disease: Major cause of septicemia and meningitis in the developed world.

Approach: Pan-genomics. Analysis of the genetic material of several organisms of a single species to identify conserved antigens/targets and ensure the chosen target covers the diversity of the organism.
Methods used: Similar to above, but ORFs are chosen by screening of multiple genomes with either direct sequencing or comparative genome hybridization [18].
Limitations of method: Sequences of multiple isolates of a species are required. Similar limitations as described above.
Example organism: S. agalactiae [20]
Disease: Leading cause of neonatal bacterial sepsis, pneumonia, and meningitis in the US and Europe.

Approach: Comparative genomics. Analysis of the genetic material of several individuals of a single species, to identify antigens/targets that are present in pathogenic strains but absent in commensal strains, and thus important for disease.
Methods used: Similar to pan-genomics, but ORFs are chosen by screening of genomes from multiple strains of pathogenic and commensal strains of a species [18,21].
Limitations of method: Similar limitations as for the above two approaches.
Example organism: E. coli [22]
Disease: Major cause of mild to severe diarrhea, hemolytic-uremic syndrome, and urinary tract infections.

Approach: Transcriptomics. Analysis of the set of RNA transcripts expressed by an organism under a specified condition.
Methods used: Gene expression is evaluated in vitro or in vivo using DNA microarrays or cDNA sequencing [24].
Limitations of method: There is no direct correlation between the levels of mRNA and protein. In vivo studies require relatively large amounts of mRNA.
Example organism: V. cholerae [26]
Disease: Causes diseases ranging from self-limiting to severe, life-threatening diarrhea, wound infections, and sepsis.

Approach: Functional genomics. Analysis of the role of genes and proteins in order to identify genes required for survival under specific conditions.
Methods used: Genes that are functionally essential in specific conditions in vitro or in vivo are determined by gene inhibition followed by screening of mutants in animal models or cell culture to identify attenuated clones [87].
Limitations of method: Genetic tools, acceptance of transposons, and natural competence of the pathogen are required.
Example organism: H. pylori [32]
Disease: Major cause of duodenal and gastric ulcers and stomach cancer as a result of chronic low-level inflammation of the stomach lining.

Approach: Proteomics. Analysis of the set of proteins expressed by an organism under a specified condition and/or in specific cellular locations (e.g., on the cell surface).
Methods used: 2D-PAGE, MS, and chromatographic techniques to identify proteins from whole cells, fractionated samples, or the cell surface [34].
Limitations of method: Proteins with low abundance and/or solubility and proteins that are only expressed in vivo may not be identified.
Example organism: S. pyogenes [36]
Disease: Cause of a range of diseases from mild pharyngitis to severe toxic shock syndrome, necrotizing fasciitis, and rheumatic fever.

Approach: Immunomics. Analysis of the subset of proteins/epitopes that interact with the host immune system.
Methods used: Analysis of seroreactive proteins, using 2D-PAGE, phage display libraries, or protein microarrays, probed with host sera [38]. Bioinformatics prediction of B cell and T cell epitopes [37].
Limitations of method: Potential bias against sequences that cannot be displayed. Large conformational epitopes made up of noncontiguous amino acids may not be detected. Prediction of B cell epitopes is difficult due to the need to identify conformational epitopes.
Example organism: S. aureus [39]
Disease: Cause of wound infections. Has emerged as a significant opportunistic pathogen due to antibiotic resistance.

Approach: Structural genomics. Analysis of the three-dimensional structure of an organism's proteins and how they interact with antibodies or therapeutics.
Methods used: NMR or crystallography to determine the structure of proteins in the presence/absence of antibodies or therapeutics [51].
Limitations of method: Poor understanding of determinants of immunogenicity, immunodominance, and structure-function relationships.
Example organism: HIV [53]
Disease: Causative agent of AIDS.

Approach: Vaccinomics/immunogenetics pharmacogenetics. Analysis of how the human immune system responds to a vaccine or drug.
Methods used: Investigation of genetic heterogeneity/polymorphisms in the host, at the individual or population level, that may alter immune responses to vaccines [68] or metabolism of therapeutics [71].
Limitations of method: Ethical issues of "personalized" medicine. Immense diversity of the human genome and, in particular, in the human immune response.
Example organism: Mumps virus [69]
Disease: Cause of disease ranging from self-limiting parotid inflammation to epididymo-orchitis, meningitis, and encephalitis.

doi:10.1371/journal.pgen.1000612.t001

As multiple genome sequences become available for a single species, the concept of pan-genomic reverse vaccinology is emerging as a powerful tool to identify vaccine candidates in antigenically diverse species [18]. Pan-genomics aims to identify the full complement of genes in a species, based on the superset of genes in several strains of the same species. Analysis of the genome sequences of eight Streptococcus agalactiae (also known as group B streptococcus) strains revealed substantial genetic heterogeneity and the extended gene repertoire of the species [19]. Screening found a total of 589 genes predicted to encode surface-exposed or secreted proteins in the S. agalactiae pan-genome (396 from the "core genome," that is, genes conserved in all strains, and 193 from the "dispensable genome," that is, genes that are present in two or more strains and are hence considered dispensable for survival). Based on further screening of this pool of candidates, including the ability of recombinant proteins to provide protection when used to immunize animals, a combination of four antigens, only one of which is in the core genome, was selected and shown to confer protection against a panel of S.
agalactiae strains [20].
Whereas genome sequencing projects have typically focused on pathogenic organisms, comparison of the genomes of pathogenic and nonpathogenic strains allows vaccine and drug targets to be identified on the basis of proteins that are specifically involved in pathogenesis [21]. Comparative studies of up to 17 commensal and pathogenic E. coli genomes identified genes unique to certain pathogenic strains that are largely absent in commensal strains. This filter decreases the pool of targets to be screened and potentially limits any detrimental effects of therapeutics on the composition of the commensal flora [22].
New sequencing technologies will also open up opportunities for monitoring pathogen vaccine escape by screening for evidence of immune selection in the genomes of pathogen populations before and after vaccine selection. By deep-sequencing of bacterial and viral populations it will be possible to identify antigens under immune selection by monitoring the clustering of single nucleotide polymorphisms (SNPs) and other mutations that affect protein sequence. This approach has already been used to search for evidence of antigenic variation/selection in populations of Salmonella enterica serovar Typhi [23], where variation is extremely limited. Similar sequencing strategies could be applied to populations of bacteria taken before or after a vaccine trial in a particular geographical region.
Beyond Genomics: Other -Omics Approaches to Study Pathogens
Pathogen genes that are up-regulated during infection and/or essential for microorganism survival or pathogenesis can be identified by using transcriptomics, i.e., the analysis of a near complete set of RNA transcripts expressed by the pathogen under a specified condition. Comprehensive DNA-based microarray chips (probed with cDNA generated from RNA by reverse transcription) [24] and ultra-high-throughput sequencing technologies that allow rapid sequencing and direct quantification of cDNA [25] enable the transcriptome of a pathogen to be characterized and particular types of gene product to be identified. For example, genes involved in the hyperinfectious state of Vibrio cholerae, which appears after passage through the human gastrointestinal tract, were identified through a comparison of the transcriptome of bacteria isolated directly from stool samples of cholera patients with that of V. cholerae grown in vitro [26]. Similarly, analysis of the transcription profile of M. tuberculosis during early infection in immune-competent (BALB/c) and severe combined immunodeficient (SCID) mice revealed a set of 67 genes activated exclusively in response to the host immune system [27].
Functional genomics (linking genotype, through transcriptomics and proteomics, to phenotype) has been applied to many pathogens to identify genes essential to survival or virulence that may be valid vaccine candidates. DNA microarrays can be used to screen comprehensive libraries of pathogen mutants, by comparing bacterial isolates from before and after passage through animal models or exposure to compound libraries to identify attenuated clones [28–30]. For example, these methods have been used to identify 65 novel MenB genes that are required for the pathogen to cause septicemia in infant rats [31], 47 genes essential for H. pylori gastric colonization of the gerbil [32], and genes contributing to M. tuberculosis persistence in the host [33].
Analysis of a pathogen’s proteome (the near complete set of proteins expressed under a specified condition) to reveal potential vaccine and drug candidates can add significant value to in silico approaches [34]. High-throughput proteomic analyses can be performed by using mass spectrometry (MS), chromatographic techniques, and protein microarrays [35]. A novel proteome-based approach has been applied to identify the surface proteins of GAS by making use of proteolytic enzymes to “shave” the bacterial surface, releasing exposed proteins and partially exposed peptides. Seventeen surface proteins of a virulent GAS strain were identified in this way by using MS and genome sequence analysis. Their location on the pathogen surface was confirmed by flow cytometry, and one of them provided protective immunity in a mouse model of the disease [36].
The proteome of a pathogen can also be screened to identify the immunome (the near complete set of pathogen proteins or epitopes that interact with the host immune system) using in vitro or in silico techniques [37,38]. In vitro identification and screening of the immunome are based on the idea that antibodies present in serum from a host that has been exposed to a pathogen represent a molecular “imprint” of the pathogen’s immunogenic proteins and can be used to identify vaccine candidates. As such, several techniques have been developed to allow the high-throughput display of pathogen proteins, and the subsequent screening for proteins that interact with antibodies in sera. Immunogenic surface proteins of several organisms have been identified, including S. aureus using 2D-PAGE, membrane blotting, and MS [39]; S. agalactiae, S. pyogenes, and Streptococcus pneumoniae using phage- or E. coli-based comprehensive genomic peptide expression libraries [38,40]; and Francisella tularensis (the causative agent of tularemia or rabbit fever) [41] and V. cholerae using protein microarray chips [42]. Protein microarrays, in which proteins from the pathogen are spotted onto a microarray chip, can also be used to characterize protein–drug interactions, as well as other protein–protein, protein–nucleic acid, ligand–receptor, and enzyme–substrate interactions [43].
The ability to predict in silico which pathogen epitopes will be recognized by B cells or T cells has greatly improved in recent years [44]. Large-scale screening of pathogens including HIV, Bacillus anthracis, M. tuberculosis, F. tularensis, Yersinia pestis (the causative agent of bubonic plague), flaviviruses, and influenza for B cell and T cell epitopes is currently underway [45,46]. Although epitope prediction is not foolproof, it can serve as a guide for further biological evaluation. T cell epitopes are presented by MHC/HLA proteins on the surface of antigen-presenting cells, which vary considerably between hosts, complicating the task of functional epitope prediction. Additionally, B cell epitopes can be both linear and conformational. The ultimate aim of researchers in this field of study would be to engineer a single peptide that represents defined epitope combinations from a protein or organism, enabling the genetic variability of both pathogen and host to be overcome [44].
Structural genomics (the study of the three-dimensional structures of the proteins produced by a species) is increasingly being applied to vaccine and drug development as a result of the explosion of genome and proteome data, and continuing improvements in the fields of protein expression, purification, and structural determination [47]. The structure-based design of antiviral therapeutics has led to the development of drugs directed at the active sites of the HIV-1 protease [48] and influenza neuraminidase [49]. More than 45,000 high-resolution protein structures are available in public databases (see http://www.wwpdb.org/stats.html), and several initiatives have been established to pursue high-throughput characterization of protein structures on a genome-wide scale [50], focusing on determining and understanding the structural basis of immune-dominant and immune-recessive antigens as well as protein active sites and potential drug-binding sites [51,52].
For example, structural characterization of the HIV envelope proteins gp120 and gp41 has revealed mechanisms used by the virus to evade host antibody responses, many of which involve hypervariability in immunodominant epitopes [53,54]. Based on this information, immune refocusing (e.g., by retargeted glycosylation, deletion, and/or substitution of amino acids) has been used to dampen the response to variable immunodominant epitopes of the envelope glycoprotein gp160, enabling the host to respond to previously subdominant epitopes [55]. High-throughput modification of proteins and their screening for immunogenicity and interaction with antimicrobials is predicted to become more common as techniques evolve [51].
The Contribution of Human Genomics
When designing new vaccines, one important consideration is the risk that the vaccine might generate “self” immune reactions against host epitopes; immune responses against a pathogen antigen can cross-react with host antigens if homologies exist in the primary amino acid sequence or structure, potentially leading to damage to the host tissue [56]. Drugs aimed at pathogen targets could also theoretically target similar host molecules. The availability of the human genome sequence combined with methods for predicting B cell and T cell epitopes will facilitate screening for the presence of homologies between candidate microbial vaccine antigens and proteins in humans, enabling issues of autoimmunity and cross-reactivity to be tackled [57]. As such, vaccine or drug targets identified using methods based on pathogen genomics should be screened for homology or similarity to human proteins in silico, using programs such as BLAST (Basic Local Alignment Search Tool; http://blast.ncbi.nlm.nih.gov/Blast.cgi) to query human genome databases. Interestingly, analysis of 30 viral genomes revealed that around 90% of viral pentapeptides, which could be components of epitopes, are identical to human peptides [58]. There is little homology, however, between validated immunogenic disease-associated peptides/epitopes and host peptides [57,59], suggesting that screening approaches that include prediction of immunogenicity could improve the pool of target candidates.
It is important to keep in mind that we do not fully understand how self-tolerance is broken, so we currently have no perfect way of predicting all potential autoimmune triggers that could be associated with vaccination. While many links have been made between autoimmune disease and vaccination, they have been confirmed in only a small number of cases (reviewed in [60]). For example, treatment-resistant Lyme arthritis is associated in certain patients with immune reactivity to the outer surface protein A (OspA) of the causative agent of Lyme disease, Borrelia burgdorferi, and an OspA epitope (OspA165–173) has homology to the human lymphocyte function-associated antigen (hLFA)-1αL [61]. As a result, the OspA-based Lyme disease vaccine (LYMErix) was taken off the market in 2002, but a recombinant OspA lacking the potentially autoreactive T cell epitope has been proposed as a replacement vaccine [62].
Rather than targeting drugs to pathogen enzymes, an alternative approach has focused on targeting the host-cell proteins that are exploited by pathogens for replication and survival. The use of techniques including microarray-based analysis of virus-induced host gene expression has revealed several possible targets [63,64]. The cholesterol-lowering statins, for example, have an anti-HIV effect that is believed to be mediated by preventing activation of the host protein Rho, which is activated by the HIV envelope protein and required for virus entry to the cell [65]. Furthermore, such studies can improve our understanding of the host immune responses that protect against a pathogen (i.e., innate, antibody, Th1, or Th2 responses), which will aid the selection of appropriate vaccine adjuvants. For example, induction of interferon signaling early in infection may be critical to confer protection against SARS-CoV, as determined from functional genomic studies of early host responses to SARS-CoV infection in the lungs of macaques [66].
Many of the genes of the human immune system are highly polymorphic, which enables the population as a whole to generate sufficient immunological diversity to combat EIDs. This variation also affects the outcome of vaccination and treatment. The International HapMap Project has identified over 3.1 million SNPs in 270 individuals [67] and the 1000 Genomes Project aims to identify even more genetic variants. The field of vaccinomics (also called immunogenetics) investigates heterogeneity in host genetic markers that results in variations in vaccine-induced immune responses, with the aim of predicting and minimizing vaccine failures or adverse events [68]. For example, polymorphisms of HLA and immunoregulatory cytokine receptor genes are associated with variable outcomes of vaccination against mumps [69]. Similarly, pharmacogenetics, which investigates genetic differences in the way individuals metabolize therapeutics, has found that human variability in the speed of metabolism of the common first-line tuberculosis drug isoniazid is associated with genetic variants, including SNPs, in the gene encoding arylamine N-acetyltransferase (NAT2) [70,71]. The ability to predict an individual’s response to a vaccine or drug may eventually allow physicians to determine whether a patient is genetically susceptible to a disease, the possible adverse effects of a vaccine or drug, and the appropriate schedule or dose to use.
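The in silico homology screening discussed in this section can be illustrated with a very small sketch. It counts how many of a candidate antigen's pentapeptides also occur in a set of host proteins, in the spirit of the pentapeptide-sharing analysis cited above. This is not BLAST and not the method of the cited studies; the function names (`kmers`, `shared_fraction`) and the sequences are hypothetical and invented purely for illustration, and a real screen would use a full human proteome and an alignment tool.

```python
def kmers(seq, k=5):
    """Return the set of all length-k peptides (k-mers) in a protein sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_fraction(candidate, host_proteins, k=5):
    """Fraction of the candidate's distinct k-mers found in any host protein."""
    cand = kmers(candidate, k)
    if not cand:
        # Candidate shorter than k: nothing to compare.
        return 0.0
    host = set()
    for protein in host_proteins:
        host |= kmers(protein, k)
    return len(cand & host) / len(cand)


# Toy sequences, invented for illustration only (not real proteins).
host_proteome = ["MKTAYIAKQR", "GGSLVPRGSH"]
candidate_antigen = "AYIAKQWWLVPRG"

# A high fraction would flag the candidate for closer inspection;
# any cutoff is an assumption to be chosen by the user.
print(shared_fraction(candidate_antigen, host_proteome))
```

Exact k-mer matching is used here only because it is simple to reason about; as the text notes, sequence identity alone is a weak predictor of cross-reactivity, so such a count would at most prioritize candidates for further evaluation.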
Challenges for the Future
We predict that genomics will greatly aid the control of EIDs because of the increased efficiency with which vaccine and therapeutic targets can be identified using the genome-based approaches described above. Furthermore, we anticipate the continual refinement and development of novel genome-based approaches as sequencing becomes faster and more affordable. Several challenges remain, however, in the identification of these targets and in the processes needed to bring a new vaccine or drug to the market. Understanding the molecular nature of epitopes, the mechanisms of action of adjuvants, and T cell and mucosal immunity are key priorities to be tackled in the coming years [3]. These issues can be addressed by improved structural studies of antigen epitopes and the compilation of databases containing information on structure, immunogenicity, and in silico B cell and T cell epitope predictions. Genome-based development of effective vaccines and therapeutics is still largely dependent on the availability of valid models to measure efficacy and protection against disease; however, the increased understanding of microbial pathogenesis that is emerging from genomics should greatly aid in this respect. Likewise, the continued development of animal models with knockout and allele-specific mutations in key components of the immune response will greatly increase understanding of the type of immune response needed to control disease and the ways in which the immune system can be programmed to protect the host against disease. Unfortunately, the stepwise series of prelicensure clinical trials (Phase I, II, and III) that are required to document the safety, immunogenicity, and efficacy of a vaccine are still highly time-consuming and costly. We can only hope that the increasingly “smart” identification and design of targets, and the fresh impetus given to the fields of vaccine and drug development by the arrival of genomics, will enable increased success of those vaccines and drugs that do make it into clinical development.
References
1. Dong J, Olano JP, McBride JW, Walker DH (2008) Emerging pathogens: Challenges and successes of molecular diagnostics. J Mol Diagn 10: 185–197.
2. Yang X, Yang H, Zhou G, Zhao GP (2008) Infectious disease in the genomic era. Annu Rev Genomics Hum Genet 9: 21–48.
3. Rappuoli R (2007) Bridging the knowledge gaps in vaccine design. Nat Biotechnol 25: 1361–1366.
4. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496–512.
5. Casanova JL, Abel L (2007) Human genetics of infectious diseases: A unified theory. EMBO J 26: 915–922.
6. Burgner D, Jamieson SE, Blackwell JM (2006) Genetic susceptibility to infectious diseases: Big is beautiful, but will bigger be even better? Lancet Infect Dis 6: 653–663.
7. Nakamura S, Yang CS, Sakon N, Ueda M, Tougan T, et al. (2009) Direct metagenomic detection of viral pathogens in nasal and fecal specimens using an unbiased high-throughput sequencing approach. PLoS ONE 4: e4219. doi:10.1371/journal.pone.0004219.
8. Bittar F, Richet H, Dubus JC, Reynaud-Gaubert M, Stremler N, et al. (2008) Molecular detection of multiple emerging pathogens in sputa from cystic fibrosis patients. PLoS ONE 3: e2908. doi:10.1371/journal.pone.0002908.
9. Rinaudo CD, Telford JL, Rappuoli R, Seib KL (2009) Vaccinology in the genome era. J Clin Invest 119: 2515–2525.
10. Kaushik DK, Sehgal D (2008) Developing antibacterial vaccines in genomics and proteomics era. Scand J Immunol 67: 544–552.
11.
Pucci MJ (2007) Novel genetic techniques and approaches in the microbial genomics era: identification and/or validation of targets for the discovery of new antibacterial agents. Drugs R D 8: 201–212.
12. Mills SD (2006) When will the genomics investment pay off for antibacterial discovery? Biochem Pharmacol 71: 1096–1102.
13. Van Voorhis WC, Hol WGJ, Myler PJ, Stewart LJ (2009) The role of medical structural genomics in discovering new drugs for infectious diseases. PLoS Comput Biol 5(10): e1000530. doi:10.1371/journal.pcbi.1000530.
14. Masignani V, Rappuoli R, Pizza M (2002) Reverse vaccinology: A genome-based approach for vaccine development. Expert Opin Biol Ther 2: 895–905.
15. Pizza M, Scarlato V, Masignani V, Giuliani MM, Arico B, et al. (2000) Identification of vaccine candidates against serogroup B meningococcus by whole-genome sequencing. Science 287: 1816–1820.
16. Giuliani MM, Adu-Bobie J, Comanducci M, Arico B, Savino S, et al. (2006) A universal vaccine for serogroup B meningococcus. Proc Natl Acad Sci U S A 103: 10834–10839.
17. Rappuoli R (2008) The application of reverse vaccinology, Novartis MenB vaccine developed by design. 16th International Pathogenic Neisseria Conference, Rotterdam, The Netherlands: http://www.IPNC2008.org. Abstr. 81 p.
18. Muzzi A, Masignani V, Rappuoli R (2007) The pan-genome: Towards a knowledge-based discovery of novel targets for vaccines and antibacterials. Drug Discov Today 12: 429–439.
19. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, et al. (2005) Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome.” Proc Natl Acad Sci U S A 102: 13950–13955.
20. Maione D, Margarit I, Rinaudo CD, Masignani V, Mora M, et al. (2005) Identification of a universal Group B streptococcus vaccine by multiple genome screen. Science 309: 148–150.
21. Bhagwat AA, Bhagwat M (2008) Methods and tools for comparative genomics of foodborne pathogens.
Foodborne Pathog Dis 5: 487–497.
22. Rasko DA, Rosovitz MJ, Myers GS, Mongodin EF, Fricke WF, et al. (2008) The pangenome structure of Escherichia coli: Comparative genomic analysis of E. coli commensal and pathogenic isolates. J Bacteriol 190: 6881–6893.
23. Holt KE, Parkhill J, Mazzoni CJ, Roumagnac P, Weill FX, et al. (2008) High-throughput sequencing provides insights into genome variation and evolution in Salmonella typhi. Nat Genet 40: 987–993.
24. Dhiman N, Bonilla R, O’Kane DJ, Poland GA (2001) Gene expression microarrays: A 21st century tool for directed vaccine design. Vaccine 20: 22–30.
25. Morozova O, Marra MA (2008) Applications of next-generation sequencing technologies in functional genomics. Genomics 92: 255–264.
26. Merrell DS, Butler SM, Qadri F, Dolganov NA, Alam A, et al. (2002) Host-induced epidemic spread of the cholera bacterium. Nature 417: 642–645.
27. Talaat AM, Lyons R, Howard ST, Johnston SA (2004) The temporal expression profile of Mycobacterium tuberculosis infection in mice. Proc Natl Acad Sci U S A 101: 4602–4607.
28. Scarselli M, Giuliani MM, Adu-Bobie J, Pizza M, Rappuoli R (2005) The impact of genomics on vaccine design. Trends Biotechnol 23: 84–91.
29. Saenz HL, Dehio C (2005) Signature-tagged mutagenesis: technical advances in a negative selection method for virulence gene identification. Curr Opin Microbiol 8: 612–619.
30. Sakata T, Winzeler EA (2007) Genomics, systems biology and drug development for infectious diseases. Mol Biosyst 3: 841–848.
31. Sun YH, Bakshi S, Chalmers R, Tang CM (2000) Functional genomics of Neisseria meningitidis pathogenesis. Nat Med 6: 1269–1273.
32. Kavermann H, Burns BP, Angermuller K, Odenbreit S, Fischer W, et al. (2003) Identification and characterization of Helicobacter pylori genes essential for gastric colonization. J Exp Med 197: 813–822.
33. Sassetti CM, Boyd DH, Rubin EJ (2003) Genes required for mycobacterial growth defined by high density mutagenesis.
Mol Microbiol 48: 77–84.
34. Zhu H, Bilgin M, Snyder M (2003) Proteomics. Annu Rev Biochem 72: 783–812.
35. Grandi G (2006) Genomics and proteomics in reverse vaccines. Methods Biochem Anal 49: 379–393.
36. Rodriguez-Ortega MJ, Norais N, Bensi G, Liberatori S, Capo S, et al. (2006) Characterization and identification of vaccine candidate proteins through analysis of the group A Streptococcus surface proteome. Nat Biotechnol 24: 191–197.
37. De Groot AS, McMurry J, Moise L (2008) Prediction of immunogenicity: in silico paradigms, ex vivo and in vivo correlates. Curr Opin Pharmacol 8: 620–626.
38. Meinke A, Henics T, Hanner M, Minh DB, Nagy E (2005) Antigenome technology: A novel approach for the selection of bacterial vaccine candidate antigens. Vaccine 23: 2035–2041.
39. Vytvytska O, Nagy E, Bluggel M, Meyer HE, Kurzbauer R, et al. (2002) Identification of vaccine candidate antigens of Staphylococcus aureus by serological proteome analysis. Proteomics 2: 580–590.
40. Giefing C, Meinke AL, Hanner M, Henics T, Bui MD, et al. (2008) Discovery of a novel class of highly conserved vaccine antigens using genomic scale antigenic fingerprinting of pneumococcus with human antibodies. J Exp Med 205: 117–131.
41. Eyles JE, Unal B, Hartley MG, Newstead SL, Flick-Smith H, et al. (2007) Immunodominant Francisella tularensis antigens identified using proteome microarray. Proteomics 7: 2172–2183.
42. Rolfs A, Montor WR, Yoon SS, Hu Y, Bhullar B, et al. (2008) Production and sequence validation of a complete full length ORF collection for the pathogenic bacterium Vibrio cholerae. Proc Natl Acad Sci U S A 105: 4364–4369.
43. Stoevesandt O, Taussig MJ, He M (2009) Protein microarrays: high-throughput tools for proteomics. Expert Rev Proteomics 6: 145–157.
44. De Groot AS, Moise L, McMurry JA, Martin W (2008) Epitope-based immunome-derived vaccines: a strategy for improved design and safety. In: Falus A, ed. Clinical Applications of Immunomics. New York: Springer. pp 39–69.
45.
Sette A, Fleri W, Peters B, Sathiamurthy M, Bui HH, et al. (2005) A roadmap for the immunomics of category A-C pathogens. Immunity 22: 155–161.
46. De Groot AS, Rivera DS, McMurry JA, Buus S, Martin W (2008) Identification of immunogenic HLA-B7 “Achilles’ heel” epitopes within highly conserved regions of HIV. Vaccine 26: 3059–3071.
47. Lundstrom K (2007) Structural genomics and drug discovery. J Cell Mol Med 11: 224–238.
48. Kaldor SW, Kalish VJ, Davies JF, 2nd, Shetty BV, Fritz JE, et al. (1997) Viracept (nelfinavir mesylate, AG1343): A potent, orally bioavailable inhibitor of HIV-1 protease. J Med Chem 40: 3979–3985.
49. Kim CU, Lew W, Williams MA, Liu H, Zhang L, et al. (1997) Influenza neuraminidase inhibitors possessing a novel hydrophobic interaction in the enzyme active site: Design, synthesis, and structural analysis of carbocyclic sialic acid analogues with potent anti-influenza activity. J Am Chem Soc 119: 681–690.
50. Todd AE, Marsden RL, Thornton JM, Orengo CA (2005) Progress of structural genomics initiatives: An analysis of solved target structures. J Mol Biol 348: 1235–1260.
51. Dormitzer PR, Ulmer JB, Rappuoli R (2008) Structure-based antigen design: A strategy for next generation vaccines. Trends Biotechnol 26: 659–667.
52. Nicola G, Abagyan R (2009) Structure-based approaches to antibiotic drug discovery. Curr Protoc Microbiol Chapter 17: Unit 17.2.
53. Zhou T, Xu L, Dey B, Hessell AJ, Van Ryk D, et al. (2007) Structural definition of a conserved neutralization epitope on HIV-1 gp120. Nature 445: 732–737.
54. Prabakaran P, Dimitrov AS, Fouts TR, Dimitrov DS, KuanTeh J (2007) Structure and function of the HIV envelope glycoprotein as entry mediator, vaccine immunogen, and target for inhibitors. In: Advances in Pharmacology. Academic Press. pp 33–97.
55. Tobin GJ, Trujillo JD, Bushnell RV, Lin G, Chaudhuri AR, et al. (2008) Deceptive imprinting and immune refocusing in vaccine design.
Vaccine 26: 6189–6199.
56. Ercolini AM, Miller SD (2009) The role of infections in autoimmune disease. Clin Exp Immunol 155: 1–15.
57. Amela I, Cedano J, Querol E (2007) Pathogen proteins eliciting antibodies do not share epitopes with host proteins: A bioinformatics approach. PLoS ONE 2: e512. doi:10.1371/journal.pone.0000512.
58. Kanduc D, Stufano A, Lucchese G, Kusalik A (2008) Massive peptide sharing between viral and human proteomes. Peptides 29: 1755–1766.
59. Kanduc D, Lucchese A, Mittelman A (2007) Non-redundant peptidomes from DAPs: Towards “the vaccine”? Autoimmun Rev 6: 290–294.
60. Wraith DC, Goldman M, Lambert PH (2003) Vaccination and autoimmune disease: What is the evidence? Lancet 362: 1659–1666.
61. Gross DM, Forsthuber T, Tary-Lehmann M, Etling C, Ito K, et al. (1998) Identification of LFA-1 as a candidate autoantigen in treatment-resistant Lyme arthritis. Science 281: 703–706.
62. Willett TA, Meyer AL, Brown EL, Huber BT (2004) An effective second-generation outer surface protein A-derived Lyme vaccine that eliminates a potentially autoreactive T cell epitope. Proc Natl Acad Sci U S A 101: 1303–1308.
63. Kellam P (2006) Attacking pathogens through their hosts. Genome Biol 7: 201.
64. Andeweg AC, Haagmans BL, Osterhaus AD (2008) Virogenomics: the virus-host interaction revisited. Curr Opin Microbiol 11: 461–466.
65. del Real G, Jimenez-Baranda S, Mira E, Lacalle RA, Lucas P, et al. (2004) Statins inhibit HIV-1 infection by down-regulating Rho activity. J Exp Med 200: 541–547.
66. de Lang A, Baas T, Teal T, Leijten LM, Rain B, et al. (2007) Functional genomics highlights differential induction of antiviral pathways in the lungs of SARS-CoV-infected macaques. PLoS Pathog 3: e112. doi:10.1371/journal.ppat.0030112.
67. International HapMap Consortium (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851–861.
68. Poland GA, Ovsyannikova IG, Jacobson RM (2009) Application of pharmacogenomics to vaccines.
Pharmacogenomics 10: 837–852.
69. Ovsyannikova IG, Jacobson RM, Dhiman N, Vierkant RA, Pankratz VS, et al. (2008) Human leukocyte antigen and cytokine receptor gene polymorphisms associated with heterogeneous immune responses to mumps viral vaccine. Pediatrics 121: e1091–1099.
70. Sim E, Lack N, Wang CJ, Long H, Westwood I, et al. (2008) Arylamine N-acetyltransferases: Structural and functional implications of polymorphisms. Toxicology 254: 170–183.
71. Baudhuin LM, Langman LJ, O’Kane DJ (2007) Translation of pharmacogenetics into clinically relevant testing modalities. Clin Pharmacol Ther 82: 373–376.
72. Telford JL, Barocchi MA, Margarit I, Rappuoli R, Grandi G (2006) Pili in gram-positive pathogens. Nat Rev Microbiol 4: 509–519.
73. Lauer P, Rinaudo CD, Soriani M, Margarit I, Maione D, et al. (2005) Genome analysis reveals pili in Group B Streptococcus. Science 309: 105.
74. Margarit I, Rinaudo CD, Galeotti CL, Maione D, Ghezzo C, et al. (2009) Preventing bacterial infections with pilus-based vaccines: The group B streptococcus paradigm. J Infect Dis 199: 108–115.
75. Mora M, Bensi G, Capo S, Falugi F, Zingaretti C, et al. (2005) Group A Streptococcus produce pilus-like structures containing protective antigens and Lancefield T antigens. Proc Natl Acad Sci U S A 102: 15641–15646.
76. Falugi F, Zingaretti C, Pinto V, Mariani M, Amodeo L, et al. (2008) Sequence variation in Group A Streptococcus pili and association of pilus backbone types with Lancefield T serotypes. J Infect Dis 198: 1834–1841.
77. Barocchi MA, Ries J, Zogaj X, Hemsley C, Albiger B, et al. (2006) A pneumococcal pilus influences virulence and host inflammatory responses. Proc Natl Acad Sci U S A 103: 2857–2862.
78. Bagnoli F, Moschioni M, Donati C, Dimitrovska V, Ferlenghi I, et al. (2008) A second pilus type in Streptococcus pneumoniae is prevalent in emerging serotypes and mediates adhesion to host cells. J Bacteriol 190: 5480–5492.
79.
Gianfaldoni C, Censini S, Hilleringmann M, Moschioni M, Facciotti C, et al. (2007) Streptococcus pneumoniae pilus subunits protect mice against lethal challenge. Infect Immun 75: 1059–1062. 80. Granoff DM, Welsch JA, Ram S (2009) Binding of complement factor H (fH) to Neisseria meningitidis is specific for human fH and inhibits complement activation by rat and rabbit sera. Infect Immun 77: 764–769. 81. McNeil LK, Murphy E, Zhao XJ, Guttmann S, Harris S, et al. (2009) Detection of LP2086 on the cell surface of Neisseria meningitidis and its accessibility in the presence of serogroup B capsular polysaccharide. Vaccine 27: 3417–3421. 82. Koeberling O, Seubert A, Granoff DM (2008) Bactericidal antibody responses elicited by a meningococcal outer membrane vesicle vaccine with overexpressed factor H-binding protein and genetically attenuated endotoxin. J Infect Dis 198: 262–270. 83. Madico G, Welsch JA, Lewis LA, McNaughton A, Perlman DH, et al. (2006) The meningococcal vaccine candidate GNA1870 binds the complement regulatory protein factor H and enhances serum resistance. J Immunol 177: 501–510. 84. Masignani V, Comanducci M, Giuliani MM, Bambini S, Adu-Bobie J, et al. (2003) Vaccination against Neisseria meningitidis using three variants of the lipoprotein GNA1870. J Exp Med 197: 789–799. 85. Welsch JA, Ram S, Koeberling O, Granoff DM (2008) Complement-dependent synergistic bactericidal activity of antibodies against factor H-binding protein, a sparsely distributed meningococcal vaccine antigen. J Infect Dis 197: 1053–1061. 86. Seib KL, Serruto D, Oriente F, Delany I, Adu-Bobie J, et al. (2009) Factor Hbinding protein is important for meningococcal survival in human whole blood and serum and in the presence of the antimicrobial peptide LL-37. Infect Immun 77: 292–299. 87. Mazurkiewicz P, Tang CM, Boone C, Holden DW (2006) Signature-tagged mutagenesis: Barcoding mutants for genome-wide screens. Nat Rev Genet 7: 929–939. 
Review

Toward the Use of Genomics to Study Microevolutionary Change in Bacteria

Daniel Falush*

Department of Microbiology, University College Cork, Environmental Research Institute, Lee Road, Cork, Ireland

Abstract: Bacteria evolve rapidly in response to the environment they encounter. Some environmental changes are experienced numerous times by bacteria from the same population, providing an opportunity to dissect the genetic basis of adaptive evolution. Here I discuss two examples in which the patterns of rapid change provide insight into medically important bacterial phenotypes, namely immune escape by Neisseria meningitidis and host specificity of Campylobacter jejuni. Genomic analysis of populations of bacteria from these species holds great promise but requires appropriate concepts and statistical tools.

Citation: Falush D (2009) Toward the Use of Genomics to Study Microevolutionary Change in Bacteria. PLoS Genet 5(10): e1000627. doi:10.1371/journal.pgen.1000627

Editor: David S. Guttman, University of Toronto, Canada

Published October 26, 2009

Copyright: © 2009 Daniel Falush. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: The author is funded by Science Foundation of Ireland grant number 05/FE1/B882. The funders had no role in the preparation of the article.

Competing Interests: The author has declared that no competing interests exist.

* E-mail: [email protected]

This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).

PLoS Genetics | www.plosgenetics.org | October 2009 | Volume 5 | Issue 10 | e1000627

Bacteria lack a natural reproductive system, comparable to meiosis in eukaryotes, that segregates genes randomly. Instead, they evolve progressively through mostly small genetic changes, a proportion of which have noteworthy phenotypic effects. Some phenotypes are intrinsically difficult to study in the laboratory: virulence in humans or adaptation to particular ecological niches, for example. For these traits in particular, a promising avenue for scientific investigation is to identify the genetic changes that have provided the basis for their evolution in natural populations.

Most human phenotypes are hard to study in vitro and, consequently, methods for relating differences amongst humans to natural genetic variation are well developed. Association studies were proposed as an effective way of identifying genes with small phenotypic effects more than a decade ago [1] and, although initially controversial [2], the recent development of arrays for genotyping hundreds of thousands of single nucleotide polymorphisms (SNPs) scattered across the whole genome has allowed the approach to be successfully applied to many different human diseases and other phenotypes [3]. This success should inspire the development of equivalent protocols within bacteriology.

One challenge in developing generally applicable protocols for mapping phenotypic traits in bacteria is that the processes by which microevolution occurs vary tremendously between species. For example, the human pathogen Mycobacterium tuberculosis, the causal agent of tuberculosis (TB), diverged recently from an obscure organism occasionally isolated from humans in Africa called Mycobacterium canettii [4]. M. tuberculosis shows very little variation and there is no evidence of strains acquiring DNA by import from other M. tuberculosis strains or indeed from any other organism, so that individuals are clones of each other, distinguished only by rare mutations or other small changes. By contrast, individual Helicobacter pylori, a cause of gastric cancer, acquire DNA from other members of the species at an extremely high rate. Consequently, as well as varying in gene content [5], strains isolated from different host individuals in the same ethnic group typically differ from each other at approximately 3% of nucleotides in core genes, and this diversity segregates nearly randomly [6]. The majority of bacterial species fall between these extremes, with their genomes showing signs of both clonal descent and DNA import from other strains.

In this essay, I will argue that the clonal mode of reproduction shared by all bacteria and Archaea, in which replication occurs by binary fission, in fact provides an extremely powerful context for association studies. These studies will require both appropriate technologies for genotyping and evolutionary analysis and judiciously chosen strain collections. I will here concentrate on two examples in which placing evolutionary changes in their clonal context provides the power to relate phenotype to genotype. Population-scale genome sequencing promises to allow a full and unbiased catalogue of variation within the same clonal context. This reconstruction will facilitate identification of loci that show correlations with phenotype or anomalous patterns that indicate natural selection, with minimal assumptions about the mechanisms by which phenotypes change.

Example 1: Immune Escape during Clonal Spread of Neisseria meningitidis

Neisseria meningitidis lives in the human nasopharynx and is best known for its role in meningitis and other forms of meningococcal disease. N. meningitidis is a major cause of morbidity and mortality in childhood in industrialised countries and is responsible for epidemics, principally in Africa and Asia. Many lineages persist stably within human populations, causing little disease. There are a handful of “hyperinvasive” lineages, however, that have a distinct epidemiology, spreading rapidly from location to location and causing clusters of disease cases but not persisting in any one place.

Mark Achtman and colleagues examined variation within a single hyperinvasive lineage of N. meningitidis, designated subgroup III, over a period of three decades [7]. The strains within subgroup III showed little diversity in most of their housekeeping and other genes surveyed. A few loci were identified that did show variation, however, allowing clonal relationships to be partially reconstructed. This reconstruction demonstrated that there were strong bottlenecks during geographical spread, with a single ancestor for each major wave of infection. It also showed that, notwithstanding the low overall level of variation, certain genes encoding specific antigens changed repeatedly in different countries and pandemic waves.

The most remarkable variation was found in the transferrin-binding protein B gene (tbpB), which encodes a protein responsible for iron uptake that is expressed on the surface of the bacterium. This gene had evolved on three occasions by nonsynonymous point mutations that altered the structure of the protein and on 21 occasions by import of different versions of the protein from a variety of sources, including from N. lactamica, a closely related and entirely noninvasive species that also colonizes humans (Figure 1). The import events vary: analysis of similar tbpB changes in a closely related lineage showed that between 2 kb and 10 kb of sequence was transferred, which often altered the sequence of the flanking genes as well as tbpB [8]. In each case, however, an effect of the imported DNA was to change the externally exposed part of the protein from the usual version (called the family 4 version) to one of two antigenically highly distinct versions (family 1 and family 3).

The fact that functionally equivalent changes to tbpB are achieved by heterogeneous genetic events shows that the large number of imports is not caused by a recombination mechanism that is specific to the locus.
Instead it reflects the amplifying effect of natural selection within the large number of bacteria that circulate during epidemics. Imports happen at a low rate throughout the genome, but those that cause an antigenic change at the tbpB locus have a selective advantage, meaning that they are observed at a much higher rate than imports elsewhere in the genome. High diversity at a particular antigen locus is usually explained by invoking a mechanism called negative frequency-dependent selection [9]. Hosts who have been exposed to a particular variant develop immune responses against this variant. Bacteria with antigenically distinct variants escape this response, giving them an advantage in colonizing that host. At the population level, this selection should lead to the persistence of multiple variants. Yet, despite this selection for rare variants within individual epidemics, the antigenic diversity of subgroup III did not increase progressively over time but was instead reset at the beginning of each new epidemic, which was started by a strain with a family 4 allele. The continuous generation of subgroup III strains with family 1 and 3 tbpB alleles is better explained by a mechanism called source–sink dynamics [10]. The source consists of an environment within which transmission of the bacterium is self-sustaining. Sinks consist of environments that bacteria can colonize effectively (perhaps by undergoing genetic modification) but from which onward transmission is ineffective. Here, the sink environment consists of individuals with acquired immunity to subgroup III strains that carry family 4 alleles, while the source is the remainder of the human population. 
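The source–sink idea can be made concrete with a deliberately simple numerical sketch. The model below is not taken from the article or from reference [10]; it is a toy discrete-generation recurrence in which every name and parameter value (`r_source`, `r_sink`, `escape_rate`, the starting counts) is a hypothetical choice made for illustration. It assumes that escape variants are generated from the source population at a small constant rate and that each variant leaves, on average, fewer than one descendant per generation.

```python
def simulate_source_sink(generations, source=1000.0, variants=0.0,
                         r_source=1.0, r_sink=0.5, escape_rate=0.01):
    """Toy source-sink recurrence; all parameter values are hypothetical.

    source   : carriers of the self-sustaining genotype (e.g. family 4 tbpB)
    variants : carriers of escape genotypes adapted to the sink
    r_source : mean onward transmissions per source carrier (>= 1)
    r_sink   : mean onward transmissions per variant carrier (< 1)
    escape_rate : variants generated per source carrier per generation
    """
    history = []
    for _ in range(generations):
        # New escape variants arise from the source by mutation or import.
        # Simplification: the source is not depleted by these events.
        new_variants = source * escape_rate
        # Existing variants transmit poorly, so they decline unless the
        # source keeps regenerating them.
        variants = variants * r_sink + new_variants
        # The source population sustains itself.
        source = source * r_source
        history.append((source, variants))
    return history


# Variants persist at a low level while the source keeps regenerating them.
with_input = simulate_source_sink(50)
# Without regeneration, an existing pool of variants dies out.
without_input = simulate_source_sink(50, variants=20.0, escape_rate=0.0)

print("variants with regeneration:", with_input[-1][1])
print("variants without regeneration:", without_input[-1][1])
```

Under these assumptions the variant pool settles near `source * escape_rate / (1 - r_sink)` rather than growing, which mirrors the qualitative pattern described here: variant genotypes are continually present but never take over or seed the next epidemic.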
The fact that the variant genotypes capable of colonizing the sink do not spread geographically but instead are repeatedly regenerated locally suggests that these strains have reduced overall transmission fitness in naïve hosts, which comprise the majority of individuals in populations where an epidemic has not occurred recently. Two other examples of sink environments are the lungs of immunocompromised patients for Pseudomonas aeruginosa, and the human urinary tract for Escherichia coli [10]; as for the N. meningitidis example, specific genetic changes have been identified that adapt strains of these bacteria to those environments but at the expense of overall transmission fitness, with the result that infections generally occur sporadically.

Figure 1. Acquisition of new tbpB genes by subgroup III Neisseria meningitidis during epidemic spread. Colours indicate the family of each tbpB allele, with red corresponding to family 4, green corresponding to family 1, and blue corresponding to family 3. The bars highlight the time frame, most common tbpB type, and geographical extent of each epidemic (in 1987, pilgrims from the Hajj briefly distributed the lineage worldwide). The circles correspond to variant genotypes. Small circles indicate that the variant allele was found in only one strain; large circles indicate it was found in between two and four strains. doi:10.1371/journal.pgen.1000627.g001

Example 2: Host Specificity in Campylobacter jejuni

Campylobacter jejuni is a gram-negative bacterium commonly found in animal feces. It is often associated with poultry and naturally colonises the GI tract of many bird species. C. jejuni is one of the most common causes of human gastroenteritis in the world. Infection caused by Campylobacter species can be severely debilitating but is rarely life-threatening. Human infection is sporadic and, although poorly prepared food is often thought to be implicated, it is generally difficult to track the source. There has therefore been a substantial effort to isolate bacteria from a wide variety of reservoirs and to genotype them using multilocus sequence typing (MLST), which involves obtaining the DNA sequence for each isolate at a standardized panel of genes (seven for Campylobacter) that are chosen because they have an essential function and are present in the vast majority of isolates in the species [11].

The C. jejuni strains acquired by chickens are distinct from those of the wild birds around them, even when the poultry are kept outdoors [12]. Within farm animals, certain lineages are found with very different frequencies in chickens and cattle, whereas several genotypes are found at high frequency in both (strains with the MLST type ST-21, for example) [13]. Strains from different farm animals are more similar to each other than they are to strains found, for example, in starlings (a native European bird that is also common in many other countries, including the US) [14].

The digestive system of chickens differs from that of cattle in multiple aspects, and their body temperature is several degrees higher than that of cattle. This raises the question of how some lineages are able to compete successfully in both hosts. Mechanisms facilitating rapid phenotypic adaptation include: (1) inbuilt regulatory mechanisms that allow individual bacteria to alter gene expression in response to new environments [15], (2) “contingency loci” that mutate rapidly, creating phenotypic variation amongst bacteria that are otherwise genetically identical [16], and (3) import of DNA from other strains that are already adapted to the current environment.

A first step toward understanding the evolution of host specificity is to establish whether it is possible to predict the host origin of strains based on their genome sequence.
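One simple way to frame host prediction computationally is to compare how often an isolate's allele at a given locus has been seen in each candidate host. The sketch below illustrates that idea only; it is not the published method of any study cited here. The function names, the allele labels, the isolate counts, and the pseudocount smoothing are all assumptions introduced for the example.

```python
from collections import Counter


def count_alleles_by_host(isolates):
    """Tally alleles at one locus per host from (host, allele) pairs."""
    counts = {}
    for host, allele in isolates:
        counts.setdefault(host, Counter())[allele] += 1
    return counts


def predict_host(allele, counts, pseudocount=1.0):
    """Return the host in which `allele` is relatively most frequent.

    Frequencies are smoothed with a pseudocount (an assumption of this
    sketch) so that an allele never seen in a host still gets a small,
    non-zero score.
    """
    all_alleles = {allele}
    for host_counts in counts.values():
        all_alleles.update(host_counts)
    scores = {}
    for host, host_counts in counts.items():
        total = sum(host_counts.values())
        scores[host] = (host_counts[allele] + pseudocount) / (
            total + pseudocount * len(all_alleles)
        )
    best_host = max(scores, key=scores.get)
    return best_host, scores


# Hypothetical reference isolates at a single MLST locus.
reference = (
    [("chicken", "allele_2")] * 8 + [("chicken", "allele_5")] * 2
    + [("cattle", "allele_5")] * 7 + [("cattle", "allele_2")] * 1
    + [("cattle", "allele_9")] * 2
)
counts = count_alleles_by_host(reference)

host_a, scores_a = predict_host("allele_5", counts)
host_b, scores_b = predict_host("allele_2", counts)
print(host_a, scores_a)
print(host_b, scores_b)
```

A real analysis would combine evidence across loci and would need to account for sampling bias between hosts; this sketch scores one locus at a time and ignores both.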
One approach to doing this uses phylogenetic relationships. For example, the program AdaptML (http://almlab.mit.edu/ALME/Software/Software.html) attempts to assign branches of the phylogenetic tree to preferred habitats based on where the strains on that branch were isolated [17]. For C. jejuni, habitat can, for example, be equated to host species. The observation of a group of phylogenetically related strains in a single host species might reflect the common ancestor of those strains acquiring the traits required to survive in that species.

Since C. jejuni recombines frequently, the genome composition of each strain is determined by the sources from which it has imported DNA, as well as by which strains it is phylogenetically related to. For example, ST-21, together with its variants, is a lineage analogous to subgroup III of N. meningitidis. Like subgroup III, the lineage has imported DNA from other strains on numerous occasions during its spread, with the result that many isolates have variant genotypes that differ from ST-21 at one or two of the seven MLST fragments. By convention, these strains are grouped with ST-21 into the ST-21 clonal complex. ST-21 itself has been found at high frequency in several agricultural species and elsewhere. Therefore, if a new strain is found to be ST-21, this provides little information on where it might have originated. However, for the variants of ST-21, Noel McCarthy and colleagues obtained significantly better than random assignment by predicting hosts based on the frequency with which the variant allele was found in chicken or cattle [13]. A useful signal of host-of-origin is thus provided by the DNA that each isolate has acquired (Figure 2). Furthermore, the high rate of recombination within particular hosts represents a mechanism by which complex adaptations to a particular host species can be acquired quickly subsequent to a host switch.

Figure 2. A schematic illustration of the evolution of the C. jejuni ST-21 clonal complex in cattle and chickens. The common ancestor of the complex occurred in chickens (red). During evolution, the lineage occasionally switched to a cattle host (indicated by a blue branch) and sometimes back to chicken. The bacteria acquired DNA by homologous recombination from other C. jejuni in the same host. Since recombination is assumed to occur from donors within the same host, the gene pool is determined by the genomic composition of the strains that colonize each host. The gene pools are illustrated for two separate loci (right and left facing arrows) in chickens and cattle. The gene pools contain alleles that occur at much higher frequency in one host than the other (shown in colour) and others that do not (shown in black). The former are informative about the host in which the recombination event occurred, while the latter are not. The recombination event labelled a introduces the left facing black arrow gene from the cattle gene pool and is phylogenetically informative because it defines a lineage that is largely restricted to cattle. The five recombination events labelled b are not phylogenetically informative, since they only affect a single strain in the sample. These events are nevertheless informative because they introduce alleles that are characteristic of the host species. The event labelled c is both phylogenetically informative and characteristic of host. The event labelled d is noninformative. doi:10.1371/journal.pgen.1000627.g002

The Power of Bacterial Genomics

Studies in bacteria have two major advantages over those in humans or other mammals when it comes to relating phenotype to genotype based on natural variation. The first is the magnifying effect of natural selection in enormous bacterial populations. This selection acts to rapidly increase the frequency of genotypes that give small fitness advantages in a particular environment, even if these genotypes are generated only rarely. Adaptation in bacteria is likely to be more frequent and to leave more distinctive genetic signatures than in species such as humans where signals of adaptation to local environments have proved to be remarkably subtle [18]. The second is the fact that evolution occurs in the context of progressively changing clonal backgrounds. This property can make it possible to identify strains that have extremely similar genomes but nevertheless differ phenotypically [19]. These strains represent the natural equivalent of an isogenic line and can allow precise inferences about the effects of natural variation and how different changes interact with each other.

In order to fully exploit the advantages of bacteria for detecting phenotypic associations, it is necessary to develop a conceptual and analytical framework within which rapid evolutionary change can be interpreted. One such framework is source–sink dynamics [10]. The Neisseria example illustrates the power of microevolutionary analysis in a source–sink ecological context to identify first the sink (hosts with immune responses to tbpB family 4 alleles) and second the loci under an immediate selective pressure to change within that sink (the tbpB gene). Source–sink dynamics cannot be applied to investigate host specificity within Campylobacter, because individual host species, e.g., chicken, cattle, and individual species of wild birds, each harbour large, viable populations of bacteria with high rates of within-species transmission and do not represent sinks. Nevertheless, there is a key similarity between the Neisseria and Campylobacter examples, namely that the strains are repeatedly challenged by an environment that is novel in the recent history of the strain.
In the Neisseria example, this challenge is repeatedly met by genetic changes at particular antigenic loci, which consequently have extremely atypical patterns of variation. In Campylobacter this challenge is met in the context of a high rate of import of DNA across the genome from other Campylobacter strains that already colonize the new host.

The availability of full genome sequences promises to enhance our understanding of the bacterial responses to new environments in a number of ways. First, phylogenetic relationships will be better resolved. In the Neisseria example, a well-resolved tree will elucidate patterns of transmission within epidemics and, for example, whether tbpB imports take place at the later stages of each wave and if strains with such imports ever reacquire family 4 alleles and seed later epidemics. In the Campylobacter example this will allow estimates of the number of occasions that the ST-21 lineage has jumped between host species and establish whether there are sublineages that are becoming progressively more adapted to single-host transmission.

Second, genomics will provide a complete catalogue of loci whose pattern of descent is atypical of the genome as a whole and therefore either associated with a particular phenotype or putatively affected by selection. In the Neisseria example, an elevated rate of change at particular loci and consistency in the nature of those changes would provide signs of selection. In the Campylobacter example, loci that are imported at very high frequency and/or that are highly differentiated between host species may be involved in adaptation to a new host. An isolate-by-isolate analysis of the patterns of import should establish whether the multi-host lifestyle of ST-21 and, by extension, of C. jejuni as a whole is facilitated by import of DNA from locally adapted strains.

Third, genomics will allow detection of epistasis between loci.
Epistasis occurs when the fitness effects of alleles at one gene are modified by the genotype at one or more additional genes. In outbreeding diploids, such as mammals, each allele has its fitness tested on a new genetic background in every generation, with the result that epistasis does not leave a distinctive signature in the frequency of particular combinations of alleles unless the loci are closely linked on the same chromosome or selection is very strong. In bacteria, combinations of alleles remain together for many generations wherever they occur in the genome, providing ample opportunity for epistasis to bring particular combinations of alleles to high frequency. For example, subgroup III strains that have imported variant tbpB alleles can potentially enhance their fitness by importing other parts of the genome that adapt other strains in the Neisseria population to having high fitness when carrying family 1 or family 3 alleles. These parts of the genome could be detected by identifying parallel changes that have occurred on the 21 occasions that a variant tbpB allele was imported during the spread of subgroup III strains. Fitness interactions establish functional relationships between loci and represent a central part of the evolutionary landscape, for example triggering the origin of species [20]. Genome sequencing of bacteria should provide key insights on the nature of these interactions in natural populations. In C. jejuni and other zoonoses, genomic analyses will facilitate a qualitative advance in our understanding of the epidemiology, ecology, and molecular biology of host switches. These developments will allow accurate delineation of the sources of human infection and an understanding of the factors promoting successful and pathogenic colonization of humans. In N. 
meningitidis and similar bacteria, we will gain a much better understanding of the genetic differences between invasive and noninvasive strains and the particular adaptive strategies that cause lineages to become invasive. These advances will together allow the design of targeted interventions that reduce the burden of human disease.

Challenges for the Future

Advances in sequencing technology mean that it is becoming economically feasible to obtain complete or nearly complete genome sequences for large samples of bacteria. To better exploit this technology to understand bacterial phenotypes, the field should emulate the research program of human genetics and (1) develop statistical tools that use sequence variation to infer mechanisms of evolution [21] and patterns of genetic relationship [22]; (2) collect and sequence samples of isolates in which bacteria that differ in phenotypes of interest are matched as far as possible in time and space [23]; and (3) design statistical tools for detecting phenotypic associations [24] and natural selection [25] by identifying patterns of relationship at particular loci that are atypical of the genome as a whole.

Acknowledgments

Mark Achtman, Jim Bull, Jana Haase, Riikka Haukkanen, and Daniel Stoebel provided insightful discussions and comments on the manuscript.

References

1. Risch N, Merikangas K (1996) The future of genetic studies of complex human diseases. Science 273: 1516–1517.
2. Weiss KM, Terwilliger JD (2000) How many diseases does it take to map a gene with SNPs? Nat Genet 26: 151–157.
3. Hardy J, Singleton A (2009) Genomewide association studies and human disease. N Engl J Med 360: 1759–1768.
4. Fabre M, Koeck JL, Le Fleche P, Simon F, Herve V, et al. (2004) High genetic diversity revealed by variable-number tandem repeat genotyping and analysis of hsp65 gene polymorphism in a large collection of “Mycobacterium canettii” strains indicates that the M. tuberculosis complex is a recently emerged clone of “M. canettii”. J Clin Microbiol 42: 3248–3255.
5. Gressmann H, Linz B, Ghai R, Pleissner KP, Schlapbach R, et al. (2005) Gain and loss of multiple genes during the evolution of Helicobacter pylori. PLoS Genet 1: e43. doi:10.1371/journal.pgen.0010043.
6. Suerbaum S, Maynard Smith J, Bapumia K, Morelli G, Smith NH, et al. (1998) Free recombination within Helicobacter pylori. Proc Natl Acad Sci U S A 95: 12619–12624.
7. Zhu P, van Der EA, Falush D, Brieske N, Morelli G, et al. (2001) Fit genotypes and escape variants of subgroup III Neisseria meningitidis during three pandemics of epidemic meningitis. Proc Natl Acad Sci U S A 98: 5234–5239.
8. Linz B, Schenker M, Zhu P, Achtman M (2000) Frequent interspecific genetic exchange between commensal Neisseriae and Neisseria meningitidis. Mol Microbiol 36: 1049–1058.
9. Brisson D, Dykhuizen DE (2004) ospC diversity in Borrelia burgdorferi: Different hosts are different niches. Genetics 168: 713–722.
10. Sokurenko EV, Gomulkiewicz R, Dykhuizen DE (2006) Source-sink dynamics of virulence evolution. Nat Rev Microbiol 4: 548–555.
11. Maiden MCJ, Bygraves JA, Feil E, Morelli G, Russell JE, et al. (1998) Multilocus sequence typing: A portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci U S A 95: 3140–3145.
12. Colles FM, Jones TA, McCarthy ND, Sheppard SK, Cody AJ, et al. (2008) Campylobacter infection of broiler chickens in a free-range environment. Environ Microbiol 10: 2042–2050.
13. McCarthy ND, Colles FM, Dingle KE, Bagnall MC, Manning G, et al. (2007) Host-associated genetic import in Campylobacter jejuni. Emerg Infect Dis 13: 267–272.
14. Colles FM, McCarthy ND, Howe JC, Devereux CL, Gosler AG, et al. (2009) Dynamics of Campylobacter colonization of a natural host, Sturnus vulgaris (European starling). Environ Microbiol 11: 258–267.
15. Coulson RM, Ouzounis CA (2003) The phylogenetic diversity of eukaryotic transcription. Nucleic Acids Res 31: 653–660.
16. Moxon R, Bayliss C, Hood D (2006) Bacterial contingency loci: The role of simple sequence DNA repeats in bacterial adaptation. Annu Rev Genet 40: 307–333.
17. Hunt DE, David LA, Gevers D, Preheim SP, Alm EJ, et al. (2008) Resource partitioning and sympatric differentiation among closely related bacterioplankton. Science 320: 1081–1085.
18. Coop G, Pickrell JK, Novembre J, Kudaravalli S, Li J, et al. (2009) The role of geography in human adaptation. PLoS Genet 5: e1000500. doi:10.1371/journal.pgen.1000500.
19. Beres SB, Richter EW, Nagiec MJ, Sumby P, Porcella SF, et al. (2006) Molecular genetic anatomy of inter- and intraserotype variation in the human bacterial pathogen group A Streptococcus. Proc Natl Acad Sci U S A 103: 7059–7064.
20. Coyne JA, Orr HA (2004) Speciation. Sunderland (MA): Sinauer Associates.
21. McVean GA, Myers SR, Hunt S, Deloukas P, Bentley DR, et al. (2004) The fine-scale structure of recombination rate variation in the human genome. Science 304: 581–584.
22. Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, et al. (2002) Genetic structure of human populations. Science 298: 2381–2385.
23. The Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.
24. Marchini J, Howie B, Myers S, McVean G, Donnelly P (2007) A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39: 906–913.
25. Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, et al. (2007) Genome-wide detection and characterization of positive selection in human populations. Nature 449: 913–918.

Review

The Application of Genomics to Emerging Zoonotic Viral Diseases

Bart L. Haagmans, Arno C. Andeweg, Albert D. M. E.
Osterhaus*
Department of Virology, Erasmus Medical Center, Rotterdam, The Netherlands

Citation: Haagmans BL, Andeweg AC, Osterhaus ADME (2009) The Application of Genomics to Emerging Zoonotic Viral Diseases. PLoS Pathog 5(10): e1000557. doi:10.1371/journal.ppat.1000557
Editor: Marianne Manchester, The Scripps Research Institute, United States of America
Published October 26, 2009
Copyright: © 2009 Haagmans et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: Supported by the VIRGO consortium, an Innovative Cluster approved by the Netherlands Genomics Initiative and partially funded by the Dutch Government (BSIK 03012), The Netherlands and the US National Institutes of Health, RO1 grant HL080621-O1A1. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
PLoS Pathogens | www.plospathogens.org | October 2009 | Volume 5 | Issue 10 | e1000557

Abstract: Interspecies transmission of pathogens may result in the emergence of new infectious diseases in humans as well as in domestic and wild animals. Genomics tools such as high-throughput sequencing, mRNA expression profiling, and microarray-based analysis of single nucleotide polymorphisms are providing unprecedented ways to analyze the diversity of the genomes of emerging pathogens as well as the molecular basis of the host response to them. By comparing and contrasting the outcomes of an emerging infection with those of closely related pathogens in different but related host species, we can further delineate the various host pathways determining the outcome of zoonotic transmission and adaptation to the newly invaded species. The ultimate challenge is to link pathogen and host genomics data with biological outcomes of zoonotic transmission and to translate the integrated data into novel intervention strategies that eventually will allow the effective control of newly emerging infectious diseases.

Emerging Zoonotic Viruses

Most of the well-known human viruses persist in the population for a relatively long time, and coevolution of the virus and its human host has resulted in an equilibrium characterized by coexistence, often in the absence of a measurable disease burden. When pathogens cross a species barrier, however, the infection can be devastating, causing a high disease burden and mortality. In recent years, several outbreaks of infectious diseases in humans linked to such an initial zoonotic transmission (from animal to human host) have highlighted this problem. Factors related to our increasingly globalized society have contributed to the apparently increased transmission of pathogens from animals to humans over the past decades; these include changes in human factors such as increased mobility, demographic changes, and exploitation of the environment (for a review see Osterhaus [1] and Kuiken et al. [2]). Environmental factors also play a direct role, and many examples exist. The recently increased distribution of the arthropod (mosquito) vector Aedes aegypti, for example, has led to massive outbreaks of dengue fever in South America and Southeast Asia. Intense pig farming in areas where frugivorous bats are common is probably the direct cause of the introduction of Nipah virus into pig populations in Malaysia, with subsequent transmission to humans. Bats are an important reservoir for a plethora of zoonotic pathogens: two closely related paramyxoviruses, Hendra virus and Nipah virus, cause persistent infections in frugivorous bats and have spread to horses and pigs, respectively [3].

The similarity between human and nonhuman primates permits many viruses to cross the species barrier between different primate species. The introduction into humans of HIV-1 and HIV-2 (the lentiviruses that cause AIDS), as well as other primate viruses, such as monkeypox virus and Herpesvirus simiae, provides dramatic examples of this type of transmission. Other viruses, such as influenza A viruses and severe acute respiratory syndrome coronavirus (SARS-CoV), may need multiple genetic changes to adapt successfully to humans as a new host species; these changes might include differential receptor usage, enhanced replication, evasion of innate and adaptive host immune defenses, and/or increased efficiency of transmission. Understanding the complex interactions between the invading pathogen on the one hand and the new host on the other as they progress toward a new host–pathogen equilibrium is a major challenge that differs substantially for each successful interspecies transmission and subsequent spread of the virus.

Genomics of Zoonotic Viruses and Their Hosts

New molecular techniques such as high-throughput sequencing, mRNA expression profiling, and array-based single nucleotide polymorphism (SNP) analysis provide ways to rapidly identify emerging pathogens (Nipah virus and SARS-CoV, for example) and to analyze the diversity of their genomes as well as the host responses against them. Essential to the process of identification and characterization of genome sequences is the exploitation of extensive databases that allow the alignment of viral genome sequences and the linkage of these genomics data to those obtained by classical viral culture and serological techniques, and epidemiological, clinical, and pathological studies [4]. Extensive genetic analysis of HIV-1, for example, has provided clues to the geography and time scale of the early diversification of HIV-1 strains when the virus emerged in humans. HIV-1 strains are divided into multiple clades, each of which has independently evolved from a simian immunodeficiency virus (SIV) that naturally infects chimpanzees in West and Central Africa. Current estimates date the common ancestor of HIV-1 to the beginning of the twentieth century [5].

Because zoonotic pathogens typically may cause variable clinical outcomes in human hosts that differ in age, nutritional status, genetic background, and immunological condition, deciphering the complex interactions between evolving pathogens and their hosts is a great challenge. The genome sequences of many host species have become available in the last decade, and with them a range of novel tools has become available to study virus–host interactions at the molecular level. This progress, together with advances in high-throughput sequencing technology and, not least, in (bio)informatics and statistics, allows us to analyze the “genome-wide” networks of gene interactions that control the host response to pathogens. By comparing and contrasting the outcomes of infection with closely related pathogens in different but related host species, we can further delineate the various host pathways involved in the different outcomes. The power of this approach was nicely demonstrated for SIV infection of various primate host species.
Natural reservoir hosts of SIV do not develop AIDS upon infection, whereas non-natural hosts, such as rhesus macaques and pig-tailed macaques, when infected experimentally with SIV, develop AIDS in a similar manner to HIV-infected humans. Transcriptional profiling indicates that SIV infection of these species produces a distinctive host response [6]. SIV-infected primates with symptoms of AIDS have a high viral load, immune activation, and loss of certain types of T cells, whereas SIV-infected sooty mangabeys (the species from which HIV-2 is thought to have originated) have substantially lower levels of innate immune activation than the symptomatic primates, partly due to the production of less interferon-α by plasmacytoid dendritic cells in response to SIV and other Toll-like receptor ligands [7]. Identification of host factors that restrict HIV infection may aid the development of effective intervention strategies.

Below, we elaborate on two other examples of recent important zoonotic events that led to sustained virus transmission in the human host, and the role that genomics has played in the elucidation of their pathogenesis thus far.

Influenza Virus

Influenza is caused by RNA viruses of the Orthomyxoviridae family. Whereas fever and coughs are the most frequent symptoms, in more serious cases a fatal pneumonia can develop, particularly in the young and the elderly. Typically, influenza is transmitted through the air by coughs or sneezes, creating aerosols containing the virus; but influenza can also be transmitted by bird droppings, saliva, feces, and blood.

Birds and pigs play an important role in the emergence of new influenza viruses in humans. Fecal sampling of migratory birds has revealed that they harbor a large range of different subtypes of influenza A viruses [8]. Some wild duck species, particularly mallards, are potential long-distance vectors of highly pathogenic avian influenza virus (H5N1), whereas others, including diving ducks, are more likely to act as “sentinel” species that die upon infection [9]. Following the introduction of a new pandemic influenza A virus subtype from an avian reservoir, either directly or via another mammalian species such as the pig, the virus may continue to circulate in humans in subsequent years as a seasonal influenza virus. In the past century, three major influenza epidemics resulted in the loss of many millions of lives. Spanish flu alone caused the deaths of more than 50 million people by the end of World War I in 1918. The 2009 outbreak of a new H1N1 virus (causing “swine flu”) that started in Mexico further illustrates the pandemic potential of influenza A viruses.

After introduction of a new influenza A virus from an avian or porcine reservoir into the human species, viral genomics studies are essential to identify critical mutations that enable the circulating virus to spread efficiently, interact with different receptors, and cause disease in the new host. For example, the importance of residue 627 of the PB2 protein of the viral polymerase in determining species restriction has been demonstrated through these kinds of approaches [10]. Furthermore, changes in the hemagglutinin molecules may allow influenza A viruses to switch receptor specificity. The hemagglutinin of avian H5N1 influenza viruses preferentially binds to oligosaccharides that terminate with a sialic acid–α-2,3-Gal disaccharide, whereas the hemagglutinins of mammalian influenza A viruses prefer oligosaccharides that terminate with sialic acid–α-2,6-Gal (Figure 1). Fatal viral pneumonia in humans infected with avian H5N1 viruses is partly due to the ability of these viruses to attach to and replicate in the cells of the lower respiratory tract, which have oligosaccharides that terminate in sialic acid–α-2,3-Gal disaccharide [11,12]. The sequence of the hemagglutinin protein may also affect its binding affinity for neutralizing antibodies. Understanding the relationship between genetic diversity and antigenic properties of these viruses [13] may help to predict the emergence of influenza viruses and to develop effective vaccines.

Figure 1. Zoonotic transmission of influenza A virus. The hemagglutinin of avian influenza A viruses (blue) preferentially binds to oligosaccharides that terminate in sialic acid–α-2,3-Gal (red), whereas the hemagglutinin on human influenza A viruses (green) prefers oligosaccharides that terminate in sialic acid–α-2,6-Gal (orange). Fatal viral pneumonia in humans infected with the H5N1 subtype of avian influenza A viruses is likely due to the ability of these viruses to attach to and replicate in the lower respiratory tract cells, which have sialic acid–α-2,3-Gal-terminated saccharides. The horizontal arrows indicate interspecies transmission, including the transmission from an avian or porcine reservoir into the human species. Image credit: Bart Haagmans, Erasmus MC. Original images (left to right, from top to bottom) by Roman Köhler, Alvesgaspar, Anton Holmquist, Joshua Lutz, and CDC. doi:10.1371/journal.ppat.1000557.g001

Microarray-assisted mRNA expression profiling of emerging zoonotic viral infections, including influenza A virus, is used to phenotype the host response in great detail. By comparing mRNA expression in individuals infected with an emerging virus to expression in individuals infected with a related established virus, researchers can generate a “molecular fingerprint” of the host response genes or pathways specifically involved in the often exuberant host responses to the emerging virus. By using genetically engineered influenza A viruses, a role for the nonstructural NS1 viral protein in evasion of the innate host response has been demonstrated [14]. Interestingly, the NS1 protein derived from the 1918 Spanish H1N1 pandemic influenza virus blocked expression of interferon-regulated genes more efficiently than did the NS1 protein from established seasonal influenza viruses [14]. Other genomics studies of genetically engineered influenza A viruses containing some or all of the gene segments from either the 1918 H1N1 virus or the highly pathogenic avian influenza A virus (H5N1) suggest that these highly pathogenic influenza viruses induce severe disease in mice and macaques through aberrant and persistent activation of proinflammatory cytokine and chemokine responses [15–18].

Application of genomics tools not only supports the elucidation of mechanisms underlying pathogenesis but may also help to identify leads for therapeutic intervention. In ferrets, H5N1 infection induced severe disease that was associated with strong expression of interferon response genes including the interferon-γ-induced cytokine CXCL10. Treatment of H5N1-infected ferrets with an antagonist of the CXCL10 receptor (CXCR3) reduced the severity of the flu symptoms and the viral titers compared to the controls [19], clearly demonstrating the potential of biological response modifiers for the clinical management of viral infections. The host evasion and evolution of influenza virus is further discussed in [20].

SARS-CoV

Coronaviruses (CoVs) primarily infect the upper respiratory and gastrointestinal tract of mammals and birds. Five different currently known CoVs infect humans and are believed to cause a significant percentage of all common colds in human adults. Surprisingly, recent studies revealed that approximately 6% of bats sampled in China were positive for CoVs [21]. Subsequent phylogenetic studies revealed that bat CoVs that resembled human SARS-CoV clustered in a putative group comprising one subgroup of bat CoVs and another of SARS-CoVs from humans and other mammalian hosts. According to the current hypothesis, SARS-CoV has arisen by recombination between two bat viruses. Phylogenetic analysis of SARS-CoV isolates from animals indicates that the resulting bat virus was transmitted first to palm civets (Paguma larvata), a wild cat-like animal hunted for its meat, and subsequently to humans at live animal markets in southern China [22]. Genome analyses have provided evidence that genetic variation in the spike gene of these viruses from civets is associated with increased transmission of the virus [21]. In addition, species-to-species variation in the sequence of the gene angiotensin-converting enzyme 2 (ACE2), which encodes the SARS-CoV receptor, also affects the efficiency by which the virus can enter cells [23].

By a combination of phylogenetic and bioinformatics analyses, chimeric gene design, and reverse genetics–aided generation of viruses that encode spike proteins of diverse isolates, researchers have reconstructed the events that led to the emergence of a virus able to spread efficiently in humans [24]. Structural modeling predicted that the SARS-CoV that caused the epidemic had an increased affinity for both civet and human ACE2 receptors due to adaptation (Figure 2). Subsequent functional genomics studies of these viruses in diverse species provided further insight into the role of specific host genes involved in the pathogenic response [25,26].

Figure 2. Zoonotic transmission of SARS-CoV. Genomic analyses provided evidence that genetic changes in the spike gene of SARS-CoV from bats (left) and civet cats (center) are essential for the animal-to-human transmission (horizontal arrows). Species-to-species genetic variation in the (thus far unidentified) viral receptor in bats and in the angiotensin-converting enzyme 2 (ACE2) gene, encoding the SARS-CoV receptor in civet cats and humans, also affects the efficiency with which the virus can enter cells (vertical arrows). The SARS-CoV that caused the epidemic evolved a high affinity for both civet (center) and human (right) ACE2 receptors (indicated by the single diagonal and the right side vertical arrow). Image credit: Bart Haagmans, Erasmus MC. Original images (left to right) by Dodoni, Paul Hilton, and Hoang Dinh Nam. doi:10.1371/journal.ppat.1000557.g002

The pathological changes observed in the lungs are initiated by a disproportionate innate immune response, illustrated by elevated levels of inflammatory cytokines and chemokines, such as CXCL10 (IP-10), CCL2 (MCP-1), interleukin (IL)-6, IL-8, IL-12, IL-1β, and interferon-γ [27]. These clinical data were confirmed experimentally by demonstrating that SARS-CoV infection of diverse cell types induces a range of cytokines and chemokines, thus providing a conceptual framework for SARS-CoV pathogenesis. Host genome expression analyses of various animal hosts and humans with different outcomes of infection indicated differential activation of innate immune genes in, for example, aged subjects compared to young subjects. Importantly, treatment of aged macaques with pegylated interferon-α (i.e., interferon-α covalently modified with polyethylene glycol polymer chains to enhance its bioavailability) reduced SARS-CoV replication and pathogenic responses [28]. Thus, host genomics analysis may provide markers of pathogenesis and leads for therapeutic intervention, as in this example of SARS-CoV infection.

Challenges for the Future

Rapid identification of newly emerging viruses through the use of genomics tools is one of the major challenges for the near future. In addition, the identification of critical mutations that enable viruses to spread efficiently, interact with different receptors, and cause disease in diverse hosts through, for instance, enhanced viral replication or circumvention of the innate and adaptive immune responses, needs to be further expanded. Although microarray-assisted transcriptional profiling can provide us with a wealth of information regarding host genes and gene-interacting networks in virus–host interactions, future research should focus on combining data obtained in different experimental settings. Therefore, the careful design of complementary sets of experiments using different formats of virus–host interactions is absolutely needed for successful genomics studies [29]. Special attention should be paid to the comparative analysis of the host response in diverse animal species. Thus far a limited number of laboratory animal species has been studied, but the recent elucidation of the genome of several other animal species will provide tools to decipher the virus–host interactions in the more relevant natural host. Recent developments in the sequencing of the RNA transcriptome may aid this development. Ultimately, microarray technology may also extend to genotyping of the human host by SNP analysis, to identify markers of host susceptibility and severity of disease that can be used in tailor-made clinical management of disease caused by emerging infections. Comparative analysis of host responses to emerging viruses may also point toward a similar dysregulated host response to a range of emerging virus infections, enabling the rational design of multipotent biological response modifiers to combat a variety of emerging viral infections.
By focusing on broad-acting intervention strategies rather than on the discovery of a newly emerging pathogen that is not characterized yet, we may be able to protect ourselves from several unexpectedly emerging infections with the same clinical manifestations. This approach may readily reduce the burden of disease, and time will be gained to design preventive pathogen-specific intervention strategies such as antiviral therapy or vaccination. Clearly, for all stages of combating emerging infections, from the early identification of the pathogen to the development and design of vaccines, application of sophisticated genomics tools is fundamental to success.

References
1. Osterhaus A (2001) Catastrophes after crossing species barriers. Philos Trans R Soc Lond B Biol Sci 356: 791–793.
2. Kuiken T, Leighton FA, Fouchier RA, LeDuc JW, Peiris JS, et al. (2005) Public health. Pathogen surveillance in animals. Science 309: 1680–1681.
3. Field HE, Mackenzie JS, Daszak P (2007) Henipaviruses: Emerging paramyxoviruses associated with fruit bats. Curr Top Microbiol Immunol 315: 133–159.
4. Rivers TM (1937) Viruses and Koch’s postulates. J Bacteriol 33: 1–12.
5. Worobey M, Gemmel M, Teuwen DE, Haselkorn T, Kunstman K, et al. (2008) Direct evidence of extensive diversity of HIV-1 in Kinshasa by 1960. Nature 455: 661–664.
6. Lederer S, Favre D, Walters KA, Proll S, Kanwar B, et al. (2009) Transcriptional profiling in pathogenic and non-pathogenic SIV infections reveals significant distinctions in kinetics and tissue compartmentalization. PLoS Pathog 5: e1000296. doi:10.1371/journal.ppat.1000296.
7. Mandl JN, Barry AP, Vanderford TH, Kozyr N, Chavan R, et al. (2008) Divergent TLR7 and TLR9 signaling and type I interferon production distinguish pathogenic and nonpathogenic AIDS virus infections. Nat Med 14: 1077–1087.
8. Munster VJ, Baas C, Lexmond P, Waldenström J, Wallensten A, et al. (2007) Spatial, temporal, and species variation in prevalence of influenza A viruses in wild migratory birds. PLoS Pathog 3: e61. doi:10.1371/journal.ppat.0030061.
9. Keawcharoen J, van Riel D, van Amerongen G, Bestebroer T, Beyer WE, et al. (2008) Wild ducks as long-distance vectors of highly pathogenic avian influenza virus (H5N1). Emerg Infect Dis 14: 600–607.
10. Hatta M, Gao P, Halfmann P, Kawaoka Y (2001) Molecular basis for high virulence of Hong Kong H5N1 influenza A viruses. Science 293: 1840–1842.
11. van Riel D, Munster VJ, de Wit E, Rimmelzwaan GF, Fouchier RA, et al. (2006) H5N1 virus attachment to lower respiratory tract. Science 312: 399.
12. Yamada S, Suzuki Y, Suzuki T, Le MQ, Nidom CA, et al. (2006) Haemagglutinin mutations responsible for the binding of H5N1 influenza A viruses to human-type receptors. Nature 444: 378–382.
13. Smith DJ, Lapedes AS, de Jong JC, Bestebroer TM, Rimmelzwaan GF, et al. (2004) Mapping the antigenic and genetic evolution of influenza virus. Science 305: 371–376.
14. Geiss GK, Salvatore M, Tumpey TM, Carter VS, Wang X, et al. (2002) Cellular transcriptional profiling in influenza A virus-infected lung epithelial cells: The role of the nonstructural NS1 protein in the evasion of the host innate defense and its potential contribution to pandemic influenza. Proc Natl Acad Sci U S A 99: 10736–10741.
15. Kobasa D, Jones SM, Shinya K, Kash JC, Copps J, et al. (2007) Aberrant innate immune response in lethal infection of macaques with the 1918 influenza virus. Nature 445: 319–323.
16. Baskin CR, Bielefeldt-Ohmann H, Tumpey TM, Sabourin PJ, Long JP, et al. (2009) Early and sustained innate immune response defines pathology and death in nonhuman primates infected by highly pathogenic influenza virus. Proc Natl Acad Sci U S A 106: 3455–3460.
17. Kash JC, Tumpey TM, Proll SC, Carter V, Perwitasari O, et al. (2006) Genomic analysis of increased host immune and cell death responses induced by 1918 influenza virus. Nature 443: 578–581.
18. Kash JC, Basler CF, García-Sastre A, Carter V, Billharz R, et al. (2004) Global host immune response: Pathogenesis and transcriptional profiling of type A influenza viruses expressing the hemagglutinin and neuraminidase genes from the 1918 pandemic virus. J Virol 78: 9499–9511.
19. Cameron CM, Cameron MJ, Bermejo-Martin JF, Ran L, Xu L, et al. (2008) Gene expression analysis of host innate immune responses during lethal H5N1 infection in ferrets. J Virol 82: 11308–11317.
20. McHardy AC, Adams B (2009) The role of genomics in tracking the evolution of influenza A virus. PLoS Pathog 5(10): e1000566. doi:10.1371/journal.ppat.1000566.
21. Tang XC, Zhang JX, Zhang SY, Wang P, Fan XH, et al. (2006) Prevalence and genetic diversity of coronaviruses in bats from China. J Virol 80: 7481–7490.
22. Song HD, Tu CC, Zhang GW, Wang SY, Zheng K, et al. (2005) Cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human. Proc Natl Acad Sci U S A 102: 2430–2435.
23. Li W, Zhang C, Sui J, Kuhn JH, Moore MJ, et al. (2005) Receptor and viral determinants of SARS-coronavirus adaptation to human ACE2. EMBO J 24: 1634–1643.
24. Sheahan T, Rockx B, Donaldson E, Sims A, Pickles R, et al. (2008) Mechanisms of zoonotic severe acute respiratory syndrome coronavirus host range expansion in human airway epithelium. J Virol 82: 2274–2285.
25. Rockx B, Baas T, Zornetzer GA, Haagmans B, Sheahan T, et al. (2009) Early upregulation of acute respiratory distress syndrome-associated cytokines promotes lethal disease in an aged-mouse model of severe acute respiratory syndrome coronavirus infection. J Virol 83: 7062–7074.
26. de Lang A, Baas T, Teal T, Leijten LM, Rain B, et al. (2007) Functional genomics highlights differential induction of antiviral pathways in the lungs of SARS-CoV-infected macaques. PLoS Pathog 3: e112. doi:10.1371/journal.ppat.0030112.
Baas T, Roberts A, Teal TH, Vogel L, Chen J, et al. (2008) Genomic analysis reveals age-dependent innate immune responses to severe acute respiratory syndrome coronavirus. J Virol 82: 9465–9476. Haagmans BL, Kuiken T, Martina BE, Fouchier RA, Rimmelzwaan GF, et al. (2004) Pegylated interferon-alpha protects type 1 pneumocytes against SARS coronavirus infection in macaques. Nat Med 10: 290–293. Andeweg AC, Haagmans BL, Osterhaus ADME (2008) Virogenomics: The virus –host interaction revisited. Curr Opin Microbiol 11: 1–6. October 2009 | Volume 5 | Issue 10 | e1000557 Review The Role of Genomics in Tracking the Evolution of Influenza A Virus Alice Carolyn McHardy1*, Ben Adams2 1 Computational Genomics and Epidemiology, Max Planck Institute for Informatics, Saarbruecken, Germany, 2 Department of Mathematical Sciences, University of Bath, United Kingdom fusing the virus membrane envelope with the host cell membrane, thus delivering the viral genome into the cell (Figure 1). Segment 6 encodes another surface glycoprotein called neuraminidase (N), which cleaves terminal sialic acid residues from glycoproteins and glycolipids on the host cell surface, thus releasing budding viral particles from an infected cell [10]. Influenza A viruses are further classified into distinct subtypes based on the genetic and antigenic characteristics of these two surface glycoproteins. Sixteen hemagglutinin (H1–16) and nine neuraminidase subtypes (N1–9) are known to exist, and they occur in various combinations in influenza viruses endemic in aquatic birds [10,11]. Viruses with the subtype composition H1N1 and H3N2 have been circulating in the human population for several decades. Of these two subtypes, H3N2 evolves more rapidly, and has until recently caused the majority of infections [1,12,13]. In the spring of 2009, however, a new H1N1 virus originating from swine influenza A viruses, and only distantly related to the H1N1 already circulating, gained hold in the human population. 
The emergence of this virus has initiated the first influenza pandemic of the twenty-first century [7,14,15]. Hemagglutinin is about five times more abundant than neuraminidase in the viral membrane and is the major target of the host immune response [16–18]. Following exposure to the virus, whether by infection or vaccination, the host immune system acquires the capacity to produce neutralizing antibodies against the viral surface glycoproteins. These antibodies participate in clearing an infection and may protect an individual from future infections for many decades [19]. Five exposed regions on the surface of hemagglutinin, called epitope sites, are predominantly recognized by such antibodies [16,17]. However, the human subtypes of influenza A continuously evolve and acquire genetic mutations that result in amino acid changes in the epitopes. These changes reduce the protective effect of antibodies raised against previously circulating viral variants. This ‘‘antigenic drift’’ necessitates frequent modification and readministration of the influenza vaccine to ensure efficient protection (Box 1). Abstract: Influenza A virus causes annual epidemics and occasional pandemics of short-term respiratory infections associated with considerable morbidity and mortality. The pandemics occur when new human-transmissible viruses that have the major surface protein of influenza A viruses from other host species are introduced into the human population. Between such rare events, the evolution of influenza is shaped by antigenic drift: the accumulation of mutations that result in changes in exposed regions of the viral surface proteins. Antigenic drift makes the virus less susceptible to immediate neutralization by the immune system in individuals who have had a previous influenza infection or vaccination. A biannual reevaluation of the vaccine composition is essential to maintain its effectiveness due to this immune escape. 
The study of influenza genomes is key to this endeavor, increasing our understanding of antigenic drift and enhancing the accuracy of vaccine strain selection. Recent large-scale genome sequencing and antigenic typing has considerably improved our understanding of influenza evolution: epidemics around the globe are seeded from a reservoir in East-Southeast Asia with year-round prevalence of influenza viruses; antigenically similar strains predominate in epidemics worldwide for several years before being replaced by a new antigenic cluster of strains. Future indepth studies of the influenza reservoir, along with largescale data mining of genomic resources and the integration of epidemiological, genomic, and antigenic data, should enhance our understanding of antigenic drift and improve the detection and control of antigenically novel emerging strains. Influenza is a single-stranded, negative-sense RNA virus that causes acute respiratory illness in humans. In temperate regions, winter influenza epidemics result in 250,000–500,000 deaths per year; in tropical regions, the burden is similar [1,2]. Influenza viruses of three genera or types (A, B, and C) circulate in the human population. Influenza viruses of the types B and C evolve slowly and circulate at low levels. Type A evolves rapidly and can evade neutralization by antibodies in individuals who have been previously infected with, or vaccinated against, the virus. As a result it regularly causes large epidemics. Furthermore, distinct reservoirs of influenza A exist in other mammals and in birds. Four times in the last hundred years these reservoirs have provided genetic material for novel viruses that have caused global pandemics [3–8]. The genome of influenza A viruses comprises eight RNA segments of 0.9–2.3 kb that together span approximately 13.5 kb and encode 11 proteins [9]. 
Segment 4 encodes the major surface glycoprotein called hemagglutinin (H), which is responsible for attaching the virus to sialic acid residues on the host cell surface and PLoS Pathogens | www.plospathogens.org Citation: McHardy AC, Adams B (2009) The Role of Genomics in Tracking the Evolution of Influenza A Virus. PLoS Pathog 5(10): e1000566. doi:10.1371/ journal.ppat.1000566 Editor: Marianne Manchester, The Scripps Research Institute, United States of America Published October 26, 2009 Copyright: ß 2009 McHardy et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: The authors received no specific funding for this work. Competing Interests: The authors have declared that no competing interests exist. * E-mail: [email protected] This article is part of the ‘‘Genomics of Emerging Infectious Disease’’ PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/). 1 October 2009 | Volume 5 | Issue 10 | e1000566 Figure 1. Schematic representation of an influenza A virion. Three proteins, hemagglutinin (HA, a trimer of three identical subunits), neuraminidase (NA, a tetramer of four identical subunits), and the M2 transmembrane proton channel (a tetramer of four identical subunits), are anchored in the viral membrane, which is composed of a lipid bilayer. The large, external domains of hemagglutinin and neuraminidase are the major targets for neutralizing antibodies of the host immune response. The M1 matrix protein is located below the membrane. The genome of the influenza A virus is composed of eight individual RNA segments (conventionally ordered by decreasing length, bottom row), which each encode one or two proteins. 
Inside the virion, the eight RNA segments are packaged in a complex with nucleoprotein (NP) and the viral polymerase complex, consisting of the PA, PB1, and PB2 proteins. doi:10.1371/journal.ppat.1000566.g001

To monitor for novel emerging strains, the World Health Organization (WHO) maintains a global surveillance program. A panel of experts meets twice a year to review antigenic, genetic, and epidemiological data and decides on the vaccine composition for the next winter season in the northern or southern hemisphere [20]. If an emerging antigenic variant is detected and judged likely to become predominant, an update of the vaccine strain is recommended. This “predict and produce” approach mostly results in efficient vaccines that substantially limit the morbidity and mortality of seasonal epidemics [21]. The recommendation has to be made almost a year before the season in which the vaccine is used, however, because of the time required to produce and distribute a new vaccine. Problems arise when an emerging variant is not identified early enough for an update of the vaccine composition [22–24]. Thus, gaining a detailed understanding of the evolution and epidemiology of the virus is of the utmost importance, as it may lead to earlier identification of novel emerging variants [20].

The development of high-throughput sequencing has recently provided large datasets of high-quality, complete genome sequences for viral isolates collected in a relatively unbiased manner, regardless of virulence or other unusual characteristics [9,25]. Analyses of the genome sequence data combined with large-scale antigenic typing [26,27] have given insights into the pattern of global spread, the genetic diversity during seasonal epidemics, and the dynamics of subtype evolution.
Influenza data repositories such as the NCBI Influenza Virus Resource (http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html) [28] and the Global Initiative on Sharing All Influenza Data (GISAID; http://platform.gisaid.org/) database [29] make the genomic information publicly available, together with epidemiological data for the sequenced isolates. The GISAID model for data sharing requires users to agree to collaborate with, and appropriately credit, all data contributors. A notable success of this initiative has been the contribution of countries, such as Indonesia and China, which

antigenic drift. Furthermore, the antigenic drift of H3N2 is not continuous but punctuated: antigenically homogenous clusters of strains predominate for an average of 3 years before being replaced by a new cluster. In accordance with the punctuated nature of antigenic drift, periods of predominantly neutral evolution alternate with periods of strong selection for antigenic change [13,36]. Phylogenetic trees illustrating the evolution of the hemagglutinin gene of H3N2 have a cactus-like shape with a strong temporal structure in which the trunk represents the succession of surviving viral lineages over time. Short side branches indicate that most strains are driven to extinction and that viral diversity at any given time is limited [31,34]. The underlying causes of this punctuated antigenic drift and limited viral diversity at a given point in time have been investigated in phylodynamic modeling studies (Box 2).

Major changes in antigenicity (antigenic shift) are associated with the introduction of novel viruses into the human population that have a hemagglutinin segment of an influenza A virus from another host species and can be transmitted efficiently among humans [5]. Such viruses may arise by segment reassortment between a human influenza A virus and an influenza A virus from another host species.
Alternatively, an entire virus from another host species may cross into the human population. The appearance of such viruses is rare, as it requires the viral genes encoded by the different segments to be compatible with each other and the virus to be capable of replication and transmission in the human population, which is also thought to be a polygenic trait [6,7,10,37,38]. Antigenic shift can have grave consequences because neutralizing antibodies against the viral surface proteins offer limited or no cross-protection across subtypes. Cross-protection can also be very limited between viruses of the same subtype that have evolved independently in different hosts for long periods of time [14]. Thus, a larger part of the population is susceptible to infection with such viruses than to infection with endemic viruses [10,14].

Antigenic shift caused three global pandemics in the twentieth century: the 1918 H1N1 pandemic, the 1957 H2N2 pandemic, and the 1968 H3N2 pandemic (reviewed in [3–5,8]). The 1918 pandemic had the most devastating impact, with an estimated 20–50 million deaths worldwide [39]. There is some uncertainty concerning the origin of the 1918 virus due to the lack of data from this time [6,40–43]. A recent phylogenetic study suggests that this virus may have been generated by reassortment of avian viruses with already circulating viruses in a mammalian host such as human or swine [44]. The H2N2 virus that caused the 1957 pandemic was a reassortant of five human H1N1 segments and avian segments encoding the viral surface proteins and the PB1 protein. Similarly, the reassortant H3N2 virus of the 1968 pandemic featured avian segments encoding hemagglutinin and PB1. H3N2 still circulates today, together with an H1N1 lineage introduced in 1977, which is similar to the H1N1 viruses circulating in the 1950s [4].

The first pandemic virus of the twenty-first century probably entered the human population in January or February of 2009 [15].
Phylogenetic analyses of the viral genome determined that the virus has a complex reassortment history with segments of “avian-like” Eurasian swine influenza A viruses (NA and M) that were first observed in Eurasian swine in 1979, and of a triple reassortant virus identified in North American swine after 1998. The segments derived from the triple reassortant themselves stem from human H3N2 (PB1), an avian influenza A virus (PA, PB2), and classical North American swine influenza A viruses (HA, NP, NS), which have a common ancestry with the 1918 H1N1 virus [14,45]. Experiments have shown that the new H1N1 virus replicates efficiently in mammalian model organisms such as

Box 1. Broadly Protective Vaccines

Current influenza vaccines are based on detergent-inactivated viruses. They elicit antibodies with a narrow range of protection that target predominantly the variable regions of the hemagglutinin protein. Accordingly, the seasonal influenza vaccine includes one strain with segments of the surface proteins for each of the A/H1N1, A/H3N2, and B viruses, and it is updated every 1–3 years to match the predominant variants of influenza. Research into vaccines that offer broader protection across diverse subtypes and antigenic drift variants is ongoing [21,59–61]. This research is particularly important with respect to the emergence of novel viruses with pandemic potential, such as the 2009 H1N1 virus. In such an event, the time period between the detection of the virus and the onset of a pandemic is too short to produce a specific vaccine for immediate vaccination of the population. Work in this area is focused on developing vaccines that elicit antibodies against conserved viral components, such as certain regions of hemagglutinin, neuraminidase, and the M2 proton channel in the viral membrane [60].
Other types of vaccines based on live attenuated viruses or plasmid DNA expression vectors, or supplemented with adjuvants, show promise in inducing a more broadly protective immune response [61].

have previously been reticent about placing data in the public domain. The WHO also supports the endeavor of rapid publication of all available sequences for influenza viruses and there is hope that comprehensive submission to public databases will soon become a reality [24,30]. In the future, mining these resources and establishing a statistical framework based on epidemiological, antigenic, and genetic information could provide further insights into the rules that govern the emergence and establishment of antigenically novel variants and improve the potential for influenza prevention and control.

Host Immune Evasion by Antigenic Drift and Shift

Influenza viruses can rapidly acquire genetic diversity because of high replication rates in infected hosts, an error-prone RNA polymerase (which introduces mutations during genome replication), and segment reassortment (Figure 2). Mutations that change amino acid residues appear significantly more often than silent mutations in the evolution of the hemagglutinin gene of human influenza A, particularly in the protein epitopes [31–34]. This observation indicates that selection for antigenic change of the virus is the driving force in the evolutionary “arms race” between the virus and the immunity of the human population [35]. Reassortment of the eight genome segments between two distinct viruses present simultaneously in a host cell can result in hybrid viruses with genome segments from two different progenitors.

Antigenic mapping allows researchers to generate a quantitative, two-dimensional representation of antigenic distances between genetically divergent strains [26]. This technique has revealed that the relationship between antigenic change and genetic change is nonlinear for the hemagglutinin of influenza A/H3N2.
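The comparison between amino-acid-changing and silent mutations described above can be illustrated with a minimal sketch. It assumes two aligned, in-frame coding sequences written in the DNA alphabet and uses the standard genetic code; the function name `count_codon_changes` and the example sequences are hypothetical. It is a crude per-codon count, not the site-normalized estimators used in the cited studies [31–34].

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_changes(seq_a, seq_b):
    """Count differing codons as synonymous (silent) or nonsynonymous.

    Both inputs are assumed to be aligned, gap-free, in-frame coding
    sequences of equal length (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    synonymous = 0
    nonsynonymous = 0
    for start in range(0, len(seq_a), 3):
        codon_a = seq_a[start:start + 3]
        codon_b = seq_b[start:start + 3]
        if codon_a == codon_b:
            continue  # identical codons carry no change
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1  # same amino acid: silent change
        else:
            nonsynonymous += 1  # amino acid replaced
    return synonymous, nonsynonymous


# Made-up 4-codon example: AAA->AAG and CTT->CTC are silent, GGT->GAT is not.
print(count_codon_changes("ATGAAAGGTCTT", "ATGAAGGATCTC"))
```

An excess of nonsynonymous over synonymous changes in epitope codons, after proper normalization, is the kind of signal the text interprets as selection for antigenic change.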
The rate of genetic change of the virus is almost constant over time, but some mutations exert a disproportionately large effect on the antigenic type, whereas others are “hitchhikers” with no phenotypic effect. Elucidating the effects of different mutations at individual sites on the antigenic type will improve our understanding of the overall genotype-to-phenotype mapping for

Figure 2. Generation of genetic diversity and antigenic drift in the evolution of human influenza A viruses. Blue and yellow viruses depict two antigenically similar strains of the same subtype circulating in the human population. The genetic diversity of the circulating viral population increases through mutation and reassortment. Single white arrows indicate relationships between ancestral and descendant viruses. White marks on the segments indicate neutral mutations and red marks indicate mutations that affect the antigenic regions of the surface proteins. Incoming pairs of orange arrows indicate the generation of reassortants with segments from two different ancestral viruses. As these viruses continue to circulate, immunity against them builds up in the host population, represented here by the narrowing of the bottleneck. In parallel, viruses with mutations affecting the antigenic regions of the surface proteins accumulate in the viral population. At some point a novel antigenic drift variant, indicated by a red colored virus, which is less affected by immunity in the human population, is generated. This variant is able to cause widespread infection and founds a new cluster of antigenically similar strains. doi:10.1371/journal.ppat.1000566.g002

selective sweeps caused by a novel antigenic drift variant rising to predominance reduce the genomic diversity of the circulating viral population, either genome-wide or for the hemagglutinin segment only [12].
Reassortment results in substantial differences in the evolutionary histories of individual segments. However, similarities in the histories of some segments indicate that besides the antigenic characteristics of hemagglutinin, the genomic context and compatibility of certain segment combinations might be an important contributor to viral fitness [12,51]. A case in point is the antigenically novel “Fujian” strain which became predominant in the 2003–2004 season, following a reassortment event that placed a hemagglutinin segment from a lineage that had been circulating at low levels for several years into a new genomic context [49]. The importance of other segments in the adaptive evolution of the virus is further supported by the observation that a number of other segments, including the one encoding neuraminidase, evolve at similar rates to the segment encoding hemagglutinin [12].

ferrets, mice, and cynomolgus macaques and is likely to be capable of long-term circulation in the human population, particularly in the event of further adaptive changes through mutation or reassortment [46–48]. The novel H1N1 appears, so far, to cause relatively mild human infections in comparison to other viruses such as the highly pathogenic H5N1 avian influenza A viruses that, since 1997, have repeatedly been transmitted to humans and caused severe disease but so far have not been capable of sustained transmission between humans.

The emergence of a novel pandemic virus, which may have been circulating undetected in swine for a decade [14,45], has highlighted the need for increased genomic surveillance of the viral populations in mammalian hosts such as swine. These hosts could be a vessel for mammalian adaptation of avian viruses, either by reassortment with human or swine viruses or through adaptive changes [8], but have been monitored less intensely than avian populations.
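To give a sense of the combinatorics behind segment reassortment, the following sketch enumerates every way eight segments could be drawn from two co-infecting parental viruses. The parent labels `"A"` and `"B"` and the helper names are hypothetical, and the enumeration is purely combinatorial: as the text notes, most combinations are not expected to be viable because the segments must be compatible with each other.

```python
from itertools import product

# The eight influenza A genome segments, in conventional order (segment 4 = HA).
SEGMENTS = ("PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS")


def segment_combinations(parent_a="A", parent_b="B"):
    """List every assignment of a parental origin to each of the eight segments."""
    return [
        dict(zip(SEGMENTS, origins))
        for origins in product((parent_a, parent_b), repeat=len(SEGMENTS))
    ]


def is_reassortant(genotype):
    """A genotype is a reassortant if its segments come from both parents."""
    return len(set(genotype.values())) > 1


genotypes = segment_combinations()
reassortants = [g for g in genotypes if is_reassortant(g)]

# 2**8 = 256 combinations in total; removing the two parental genotypes
# leaves 254 possible reassortants.
print(len(genotypes), len(reassortants))
```

In this toy framing, a pandemic reassortant such as the 1968 H3N2 virus corresponds to a genotype in which HA and PB1 carry one label and the remaining six segments the other.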
The latest emergence of a pandemic H1N1 virus has also underscored the vital importance of further research into the molecular factors that determine the host range and capacity for sustained human-to-human transmission of influenza A viruses.

Geographic Spread

Genomic analysis has led to profound insights into the global patterns of circulation and evolution of influenza A. Over the course of seasonal epidemics in temperate regions, little evidence has been found for selection for amino acid change and adaptive evolution in the antigenic regions of the surface proteins [36]. There is, however, substantial genetic diversity due to multiple introductions of distinct strains, wide spatial spread, and frequent

Reassortment in Subtype Evolution

Whole-genome studies have revealed that segment reassortment between different viruses of the same subtype is an important mechanism in the evolution of human-adapted subtypes and generates extensive genome-wide diversity [34,36,49–51]. Periodic

months before they emerge in Oceania, Europe, and North America and 12–18 months before they reach South America.

Box 2. Modeling Antigenic Evolution

There is a long history of the use of mathematical models to study epidemiological and evolutionary systems [63]. For rapidly evolving RNA viruses such as influenza the dynamics of these systems are densely interwoven, and recent work has sought to develop unified “phylodynamic” models to examine the processes underlying the observed epidemiological and evolutionary patterns (reviewed in [35]). A better understanding of the mechanisms driving viral evolution will enhance our capacity to accurately identify novel emerging strains. For influenza, phylodynamic models have been developed to probe the complex processes relating to viral persistence in the human population, antigenic turnover, and the limited genetic diversity at any given point in time.
The first models predicted that diversity increases exponentially unless long-term, partial cross-immunity between strains is supplemented by temporary broad immunity that lasts for several months and protects against all infections, regardless of the genetic or antigenic similarity of strains [64,65]. Subsequently, it has been proposed that a genotype-to-phenotype mapping defined by neutral networks underlies influenza evolution [66]. A neutral network is a set of genotypes linked by single mutations that all map to the same phenotype, in this case the antigenic characteristics of a virus. Hence, genetic divergence is not accompanied by antigenic divergence as long as the genotype remains in the same network. In certain genetic contexts, however, mutations can move a genotype onto an adjacent network, resulting in a significant change in the antigenic phenotype. Incorporating this evolutionary framework into an epidemiological model leads to both epidemiological and evolutionary patterns characteristic of human influenza A/H3N2.

Challenges for the Future

A key objective for research into the antigenic drift of influenza A is to improve the accuracy of vaccine strain choice, in particular for seasons preceding the establishment of novel antigenic drift variants. More intensive surveillance and sampling, particularly in East-Southeast Asia, could facilitate the early detection of novel emerging drift variants and alleviate problems related to the time required for vaccine production. A better understanding of the evolutionary and epidemiological rules governing antigenic drift, viral fitness, the role of the source region, and establishment of predominance would be particularly helpful for the selection of vaccine strains when considerable variation among antigenically novel strains is observed and it is unclear which, if any, will become predominant.
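The neutral-network idea described in Box 2 can be made concrete with a toy example. The sketch below represents genotypes as short binary strings and assumes, purely for illustration, that the antigenic phenotype is set by the first two sites; this mapping, the constant `EPITOPE_SITES`, and the function names are hypothetical and are not taken from the model in [66]. Unlike that model, the toy mapping does not make the effect of a mutation depend on genetic context.

```python
from collections import deque

# Hypothetical toy mapping: only these sites influence the antigenic phenotype.
EPITOPE_SITES = (0, 1)


def antigenic_phenotype(genotype):
    """Map a binary genotype string to its (toy) antigenic phenotype."""
    return "".join(genotype[site] for site in EPITOPE_SITES)


def single_mutants(genotype):
    """Yield every genotype that differs from the input at exactly one site."""
    for site, allele in enumerate(genotype):
        flipped = "1" if allele == "0" else "0"
        yield genotype[:site] + flipped + genotype[site + 1:]


def neutral_network(start):
    """Collect all genotypes reachable from `start` through single mutations
    that leave the antigenic phenotype unchanged (breadth-first search)."""
    phenotype = antigenic_phenotype(start)
    network = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in single_mutants(current):
            if neighbor not in network and antigenic_phenotype(neighbor) == phenotype:
                network.add(neighbor)
                queue.append(neighbor)
    return network


# Genotypes in this set differ genetically but share one antigenic phenotype.
print(sorted(neutral_network("0000")))
```

Mutations at the two non-epitope sites move a genotype around within its network (genetic change without antigenic change), whereas a mutation at an epitope site moves it onto an adjacent network with a different antigenic phenotype.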
Such insights are likely to come both from phylodynamic modeling studies and by mining genomic resources for genome-wide properties associated with viral fitness and predominance. Some molecular properties of hemagglutinin with predictive value for this task have already been identified [53–56], such as the number of changes at sites under positive selection or in the most extensively altered epitope, although the sites under selection might change over time [26]. It is notable that the lack of antigenic information for sequenced viral isolates in public repositories currently restricts the direct analysis of genetic determinants in antigenic drift [24]. If the World Health Organization were to establish similar policies for the deposition of antigenic information into public databases as exist for sequence data, this could create a valuable resource for research in this area. As existing databases grow, new statistical and computational techniques are being developed for interpretation of these large-scale, population-level genomic datasets in combination with epidemiological and phenotypic information [57]. Ultimately, the expert analysis of the WHO in the detection and control of antigenically novel emerging strains could be extensively supported by the development of a suitable predictive framework based on statistical learning that takes into consideration the population-level phylodynamics of antigenic change [57,58]. Such a framework could utilize epidemiological, genomic, and antigenic information and detailed knowledge of the genetic and epidemiological characteristics of antigenic drift to assess the likelihood of strains rising to predominance.

segment reassortment in seasonal epidemics [9,12,36,49,50]. The viral population circulating in one season does not directly seed the epidemic in the following one. Instead, gene flow and viral spread are global, with similar strains appearing in northern and southern hemisphere epidemics across several seasons.
There is a global reservoir of viral diversity from which seasonal epidemics in temperate regions are seeded [12,27,52]. This reservoir is located in East-Southeast Asia, where a region-wide network of temporally overlapping epidemics maintains infection incidence throughout the year [27]. Novel strains appear in this region on average 6–9

Acknowledgments

We thank Linus Roune for his help creating the figures.

References

1. WHO (2003) Fact sheet number 211. Available: http://www.who.int/mediacentre/factsheets/fs211/en/. Accessed 13 August 2009. 2. Viboud C, Alonso WJ, Simonsen L (2006) Influenza in tropical regions. PLoS Med 3: e89. doi:10.1371/journal.pmed.0030089. 3. Palese P (2004) Influenza: old and new threats. Nat Med 10: S82–87. 4. Kilbourne ED (2006) Influenza pandemics of the 20th century. Emerg Infect Dis 12: 9–14. 5. Cox NJ, Subbarao K (2000) Global epidemiology of influenza: past and present. Annu Rev Med 51: 407–421. 6. Morens DM, Taubenberger JK, Fauci AS (2009) The persistent legacy of the 1918 influenza virus. N Engl J Med 361: 225–229. 7. Neumann G, Noda T, Kawaoka Y (2009) Emergence and pandemic potential of swine-origin H1N1 influenza virus. Nature 459: 931–939. 8. Horimoto T, Kawaoka Y (2005) Influenza: Lessons from past pandemics, warnings from current incidents. Nat Rev Microbiol 3: 591–600. 9. Ghedin E, Sengamalay NA, Shumway M, Zaborsky J, Feldblyum T, et al. (2005) Large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution. Nature 437: 1162–1166. 10. Webster RG, Bean WJ, Gorman OT, Chambers TM, Kawaoka Y (1992) Evolution and ecology of influenza A viruses. Microbiol Rev 56: 152–179. 11. Fouchier RA, Munster V, Wallensten A, Bestebroer TM, Herfst S, et al. (2005) Characterization of a novel influenza A virus hemagglutinin subtype (H16) obtained from black-headed gulls. J Virol 79: 2814–2822. 12. Rambaut A, Pybus OG, Nelson MI, Viboud C, Taubenberger JK, et al.
(2008) The genomic and epidemiological dynamics of human influenza A virus. Nature 453: 615–619. 13. Wolf YI, Viboud C, Holmes EC, Koonin EV, Lipman DJ (2006) Long intervals of stasis punctuated by bursts of positive selection in the seasonal evolution of influenza A virus. Biol Direct 1: 34. 14. Garten RJ, Davis CT, Russell CA, Shu B, Lindstrom S, et al. (2009) Antigenic and genetic characteristics of swine-origin 2009 A(H1N1) influenza viruses circulating in humans. Science 325: 197–201. 15. Fraser C, Donnelly CA, Cauchemez S, Hanage WP, Van Kerkhove MD, et al. (2009) Pandemic potential of a strain of influenza A (H1N1): Early findings. Science 324: 1557–1561. 16. Wiley DC, Wilson IA, Skehel JJ (1981) Structural identification of the antibody-binding sites of Hong Kong influenza haemagglutinin and their involvement in antigenic variation. Nature 289: 373–378. 17. Wilson IA, Cox NJ (1990) Structural basis of immune recognition of influenza virus hemagglutinin. Annu Rev Immunol 8: 737–771. 18. Wilson IA, Skehel JJ, Wiley DC (1981) Structure of the haemagglutinin membrane glycoprotein of influenza virus at 3 Å resolution. Nature 289: 366–373. 19. Yu X, Tsibane T, McGraw PA, House FS, Keefer CJ, et al. (2008) Neutralizing antibodies derived from the B cells of 1918 influenza pandemic survivors. Nature 455: 532–536. 20. Russell CA, Jones TC, Barr IG, Cox NJ, Garten RJ, et al. (2008) Influenza vaccine strain selection and recent studies on the global migration of seasonal influenza viruses. Vaccine 26(Suppl 4): D31–34. 21. Karlsson Hedestam GB, Fouchier RA, Phogat S, Burton DR, Sodroski J, et al. (2008) The challenges of eliciting neutralizing antibodies to HIV-1 and to influenza virus. Nat Rev Microbiol 6: 143–155. 22.
de Jong JC, Beyer WE, Palache AM, Rimmelzwaan GF, Osterhaus AD (2000) Mismatch between the 1997/1998 influenza vaccine and the major epidemic A(H3N2) virus strain as the cause of an inadequate vaccine-induced antibody response to this strain in the elderly. J Med Virol 61: 94–99. 23. CDC (2004) Preliminary assessment of the effectiveness of the 2003–04 inactivated influenza vaccine—Colorado, December 2003. MMWR Morb Mortal Wkly Rep 53: 8–11. 24. Salzberg S (2008) The contents of the syringe. Nature 454: 160–161. 25. Obenauer JC, Denson J, Mehta PK, Su X, Mukatira S, et al. (2006) Large-scale sequence analysis of avian influenza isolates. Science 311: 1576–1580. 26. Smith DJ, Lapedes AS, de Jong JC, Bestebroer TM, Rimmelzwaan GF, et al. (2004) Mapping the antigenic and genetic evolution of influenza virus. Science 305: 371–376. 27. Russell CA, Jones TC, Barr IG, Cox NJ, Garten RJ, et al. (2008) The global circulation of seasonal influenza A (H3N2) viruses. Science 320: 340–346. 28. Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Zaslavsky L, et al. (2008) The influenza virus resource at the National Center for Biotechnology Information. J Virol 82: 596–601. 29. Enserink M (2007) Data sharing. New Swiss influenza database to test promises of access. Science 315: 923. 30. Bogner P, Capua I, Lipman DJ, Cox NJ, et al. (2006) A global initiative on sharing avian flu data. Nature 442: 981. 31. Fitch WM, Leiter JM, Li XQ, Palese P (1991) Positive Darwinian evolution in human influenza A viruses. Proc Natl Acad Sci U S A 88: 4270–4274. 32. Fitch WM, Bush RM, Bender CA, Cox NJ (1997) Long term trends in the evolution of H(3) HA1 human influenza type A. Proc Natl Acad Sci U S A 94: 7712–7718. 33. Bush RM, Fitch WM, Bender CA, Cox NJ (1999) Positive selection on the H3 hemagglutinin gene of human influenza virus A. Mol Biol Evol 16: 1457–1465. 34. Nelson MI, Holmes EC (2007) The evolution of epidemic influenza. Nat Rev Genet 8: 196–205. 35. 
Grenfell BT, Pybus OG, Gog JR, Wood JL, Daly JM, et al. (2004) Unifying the epidemiological and evolutionary dynamics of pathogens. Science 303: 327–332. 36. Nelson MI, Simonsen L, Viboud C, Miller MA, Taylor J, et al. (2006) Stochastic processes are key determinants of short-term evolution in influenza A virus. PLoS Pathog 2: e125. doi:10.1371/journal.ppat.0020125. 37. Lowen AC, Palese P (2007) Influenza virus transmission: Basic science and implications for the use of antiviral drugs during a pandemic. Infect Disord Drug Targets 7: 318–328. 38. Kuiken T, Holmes EC, McCauley J, Rimmelzwaan GF, Williams CS, et al. (2006) Host species barriers to influenza virus infections. Science 312: 394–397. 39. Johnson NP, Mueller J (2002) Updating the accounts: Global mortality of the 1918–1920 “Spanish” influenza pandemic. Bull Hist Med 76: 105–115. 40. Taubenberger JK, Reid AH, Lourens RM, Wang R, Jin G, et al. (2005) Characterization of the 1918 influenza virus polymerase genes. Nature 437: 889–893. 41. Reid AH, Taubenberger JK, Fanning TG (2004) Evidence of an absence: The genetic origins of the 1918 pandemic influenza virus. Nat Rev Microbiol 2: 909–914. 42. Antonovics J, Hood ME, Baker CH (2006) Molecular virology: Was the 1918 flu avian in origin? Nature 440: E9; discussion E9–10. 43. Taubenberger JK (2006) The origin and virulence of the 1918 “Spanish” influenza virus. Proc Am Philos Soc 150: 86–112. 44. Smith GJ, Bahl J, Vijaykrishna D, Zhang J, Poon LL, et al. (2009) Dating the emergence of pandemic influenza viruses. Proc Natl Acad Sci U S A 106: 11709–11712. 45. Smith GJ, Vijaykrishna D, Bahl J, Lycett SJ, Worobey M, et al. (2009) Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature 459: 1122–1125. 46. Maines TR, Jayaraman A, Belser JA, Wadford DA, Pappas C, et al. (2009) Transmission and pathogenesis of swine-origin 2009 A(H1N1) influenza viruses in ferrets and mice.
Science 325: 484–487. 47. Munster VJ, de Wit E, van den Brand JM, Herfst S, Schrauwen EJ, et al. (2009) Pathogenesis and transmission of swine-origin 2009 A(H1N1) influenza virus in ferrets. Science 325: 481–483. 48. Itoh Y, Shinya K, Kiso M, Watanabe T, Sakoda Y, et al. (2009) In vitro and in vivo characterization of new swine-origin H1N1 influenza viruses. Nature; E-pub ahead of print. doi:10.1038/nature08260. 49. Holmes EC, Ghedin E, Miller N, Taylor J, Bao Y, et al. (2005) Whole-genome analysis of human influenza A virus reveals multiple persistent lineages and reassortment among recent H3N2 viruses. PLoS Biol 3: e300. doi:10.1371/journal.pbio.0030300. 50. Nelson MI, Edelman L, Spiro DJ, Boyne AR, Bera J, et al. (2008) Molecular epidemiology of A/H3N2 and A/H1N1 influenza virus during a single epidemic season in the United States. PLoS Pathog 4: e1000133. doi:10.1371/journal.ppat.1000133. 51. Nelson MI, Viboud C, Simonsen L, Bennett RT, Griesemer SB, et al. (2008) Multiple reassortment events in the evolutionary history of H1N1 influenza A virus since 1918. PLoS Pathog 4: e1000012. doi:10.1371/journal.ppat.1000012. 52. Nelson MI, Simonsen L, Viboud C, Miller MA, Holmes EC (2007) Phylogenetic analysis reveals the global migration of seasonal influenza A viruses. PLoS Pathog 3: e131. doi:10.1371/journal.ppat.0030131. 53. Fitch WM, Bush RM, Bender CA, Subbarao K, Cox NJ (2000) The Wilhelmine E. Key 1999 Invitational lecture. Predicting the evolution of human influenza A. J Hered 91: 183–185. 54. Gupta V, Earl DJ, Deem MW (2006) Quantifying influenza vaccine efficacy and antigenic distance. Vaccine 24: 3881–3888. 55. Blackburne BP, Hay AJ, Goldstein RA (2008) Changing selective pressure during antigenic changes in human influenza H3. PLoS Pathog 4: e1000058. doi:10.1371/journal.ppat.1000058. 56. Kryazhimskiy S, Bazykin GA, Plotkin J, Dushoff J (2008) Directionality in the evolution of influenza A haemagglutinin. Proc Biol Sci 275: 2455–2464. 57.
Pybus OG, Rambaut A (2009) Modelling: Evolutionary analysis of the dynamics of viral infectious disease. Nat Rev Genet 10: 540–550. 58. Bishop CM (2006) Pattern recognition and machine learning. In: Jordan M, Kleinberg J, Schoellkopf B, eds. Singapore: Springer. 59. Sui J, Hwang WC, Perez S, Wei G, Aird D, et al. (2009) Structural and functional bases for broad-spectrum neutralization of avian and human influenza A viruses. Nat Struct Mol Biol 16: 265–273. 60. Gerhard W, Mozdzanowska K, Zharikova D (2006) Prospects for universal influenza virus vaccine. Emerg Infect Dis 12: 569–574. 61. Carrat F, Flahault A (2007) Influenza vaccine: The challenge of antigenic drift. Vaccine 25: 6852–6862. 62. Fisher RA (1999) The genetical theory of natural selection. Oxford (UK): Oxford University Press. pp 318. 63. Ross R (1910) The prevention of malaria. New York: E.P. Dutton. pp 669. 64. Ferguson NM, Galvani AP, Bush RM (2003) Ecological and immunological determinants of influenza evolution. Nature 422: 428–433. 65. Tria F, Lässig M, Peliti L, Franz S (2005) A minimal stochastic model for influenza evolution. J Stat Mech; doi:10.1088/1742-5468/2005/07/P07008. 66. Koelle K, Cobey S, Grenfell B, Pascual M (2006) Epochal evolution shapes the phylodynamics of interpandemic influenza A (H3N2) in humans. Science 314: 1898–1903.

Review

The Past and Future of Tuberculosis Research

Iñaki Comas, Sebastien Gagneux*
Division of Mycobacterial Research, MRC National Institute for Medical Research, London, United Kingdom

largely ineffective vaccine (Bacille Calmette-Guérin [BCG]), and just a few drugs that were decades old (streptomycin, rifampicin, isoniazid, ethambutol, pyrazinamide) [3]. Tragically, these are the tools still in use today in most parts of the world where TB remains one of the most important public health problems (Figure 1).
In addition to the lack of appropriate tools to control TB globally, much about the disease was unknown in the early 1990s and many dogmas were guiding the field at the time. These included the view that differences in the clinical manifestation of TB were primarily driven by host variables and the environment as opposed to bacterial factors, a notion reinforced by early DNA sequencing studies that reported very limited genetic diversity in MTBC compared with other bacterial pathogens [6]. According to other dogmas, TB was mainly a consequence of reactivation of latent infections rather than ongoing disease transmission, and that mixed infections and exogenous reinfections with different strains were very unlikely. The development of molecular techniques to differentiate between strains of MTBC made it possible to readdress some of these points. One of these methods, a DNA fingerprinting protocol based on the Mycobacterium insertion sequence IS6110, quickly evolved into the first international gold standard for genotyping of MTBC [7]. It also became a key component of pragmatic public health efforts, such as detecting disease outbreaks and ongoing TB transmission [8], and allowed differentiation between patients who relapsed due to treatment failure and those reinfected with a different strain [9]. This latter finding demonstrated for the first time that previous exposure to MTBC does not protect against subsequent exogenous reinfection and TB disease, which is a phenomenon with implications for vaccine design. Many other new insights were gained through these molecular epidemiological studies [10], which, for the most part, were performed in wealthy countries; corresponding data from most high-burden areas remained limited because of poor infrastructure and lack of funding. 
Routine genotyping of MTBC for public health purposes also revived discussions about the role of pathogen variation in Abstract: Renewed efforts in tuberculosis (TB) research have led to important new insights into the biology and epidemiology of this devastating disease. Yet, in the face of the modern epidemics of HIV/AIDS, diabetes, and multidrug resistance—all of which contribute to susceptibility to TB—global control of the disease will remain a formidable challenge for years to come. New highthroughput genomics technologies are already contributing to studies of TB’s epidemiology, comparative genomics, evolution, and host–pathogen interaction. We argue here, however, that new multidisciplinary approaches— especially the integration of epidemiology with systems biology in what we call ‘‘systems epidemiology’’—will be required to eliminate TB. Introduction Tuberculosis (TB) remains an important public health problem [1]. With close to 10 million new cases per year, and a pool of two billion latently infected individuals, control efforts are struggling in many parts of the world (Figure 1). Nevertheless, the renewed interest in research and improved funding for TB give reasons for optimism. Recently, the Stop TB Partnership, a network of concerned governments, organizations, and donors lead by the WHO (http://www.stoptb.org/stop_tb_initiative/), outlined a global plan to halve TB prevalence and mortality by 2015 and eliminate the disease as a public health problem by 2050 [2]. Attaining these goals will depend on both strong government commitment and increased interdisciplinary research and development. As existing diagnostics, drugs, and vaccines will be insufficient to achieve these objectives, a substantial effort in both basic science and epidemiology will be necessary to develop better tools and strategies to control TB [3]. Here we review the recent history of TB research and some of the latest insights into the evolutionary history of the disease. 
We then discuss ways in which we could benefit from a more comprehensive systems approach to control TB in the future. Recent History of the Field Citation: Comas I, Gagneux S (2009) The Past and Future of Tuberculosis Research. PLoS Pathog 5(10): e1000600. doi:10.1371/journal.ppat.1000600 TB is caused by several species of gram-positive bacteria known as tubercle bacilli or Mycobacterium tuberculosis complex (MTBC). MTBC includes obligate human pathogens such as Mycobacterium tuberculosis and Mycobacterium africanum, as well as organisms adapted to various other species of mammal. In the developed world, TB incidence declined steadily during the second half of the 20th century and so funds available for research and control of TB decreased substantially during that time [4]. When TB started to reemerge in the early 1990s, fuelled by the growing pandemic of HIV/AIDS (Box 1), scientists and public health officials were caught off-guard; billions of dollars of emergency funds were necessary to control TB outbreaks [5]. Moreover, long-term neglect of basic TB research and product development meant that global TB control relied on a 100-year-old diagnostic method (i.e. sputum smear microscopy) of poor sensitivity, an 80-year-old and Editor: Marianne Manchester, The Scripps Research Institute, United States of America PLoS Pathogens | www.plospathogens.org Published October 26, 2009 Copyright: ß 2009 Comas, Gagneux. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: Work in our laboratory is supported by the Medical Research Council, UK, and the US National Institutes of Health grants HHSN266200700022C and AI034238. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. 
Competing Interests: The authors have declared that no competing interests exist. * E-mail: [email protected] This article is part of the ‘‘Genomics of Emerging Infectious Disease’’ PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/). 1 October 2009 | Volume 5 | Issue 10 | e1000600 Figure 1. The global incidence of TB. The number of new TB cases per 100,000 population for the year 2007 according to WHO estimates (adapted from [1]). doi:10.1371/journal.ppat.1000600.g001 standardize, however, and whether MTBC genotype plays a meaningful role in TB severity remains controversial [14]. Comparative genomics of MTBC also yielded interesting insights into the evolution and geographic distribution of the organism. Because MTBC has essentially no detectible horizontal gene transfer [15,16], LSPs can be used as phylogenetic markers to trace the evolutionary relationships of different strain families. Following such an approach, studies have shown that humans did not, as previously believed, acquire MTBC from animals during the initiation of animal domestication, rather the human- and animal-adapted members of MTBC share a common ancestor, which might have infected humans even before the Neolithic transition [17,18]. LSPs also allowed researchers to define several discrete strain lineages within the humanadapted members of MTBC, which are associated with different human populations and geographical regions (Figures 2 and 3) [15,19,20]. Because of the lack of horizontal gene exchange in MTBC, phylogenetic trees derived using various molecular markers define the same phylogenetic groupings [21], and several studies based on single nucleotide polymorphisms (SNPs) and other molecular makers have gathered additional support for the highly phylogeographical population structure of MTBC [22–25]. outcome of infection and disease. 
Some strains of MTBC appeared over-represented in particular patient populations, which suggested that strain diversity may have epidemiological implications. The completion of the first whole genome sequence of M. tuberculosis in 1998 [11] and the development of DNA microarrays offered a new opportunity to address this question by interrogating the entire genome of multiple clinical strains of MTBC. These comparative genomics studies revealed that genomic deletions, also known as large sequence polymorphisms (LSPs), are an important source of genome plasticity in MTBC [12]. Furthermore, statistical analyses of patient data suggested possible associations between strain genomic content and disease severity in humans [13]. Clinical phenotypes in TB are difficult to Box 1. The Influence of Modern Epidemics on TB Incidence HIV/AIDS and diabetes are important comorbidities that dramatically increase the susceptibility to TB. The synergy between TB and HIV/AIDS is a particular problem in subSaharan Africa, while the impact of diabetes on TB is increasing in many rapidly growing world economies; it may already be a more important risk factor for TB than HIV/AIDS in places like India and Mexico. The emergence of multidrug-resistant strains represents an additional threat to global TB control. The strong association between HIV/AIDS and drug-resistant TB has been well established, but whether similar interactions exist between drug-resistant TB and diabetes needs to be explored further. PLoS Pathogens | www.plospathogens.org Ancient History of the Pathogen Although LSPs have proven very useful for defining different lineages within MTBC, these markers do not reflect actual genetic distances, and the mode of molecular evolution in MTBC cannot be easily inferred from them [21]. By contrast, DNA sequencebased methods can provide important clues about the evolutionary forces shaping bacterial populations. 
Multilocus sequence typing (MLST), in which fragments of seven structural genes are sequenced for each strain [26], has been used very successfully to define the genetic population structure of many bacterial species [27]. Because of the low degree of sequence polymorphisms in MTBC, however, standard MLST is uninformative [28]. A recent study of MTBC extended the traditional MLST scheme by sequencing 89 complete genes in 108 strains, covering 1.5% of the genome of each strain [29]. Phylogenetic analysis of this extended multilocus sequence dataset resulted in a tree that was highly congruent with that generated previously using LSPs (Figure 3). The new sequence-based data also revealed that the MTBC strains that are adapted to various animal species represent just a subset of the global genetic diversity of MTBC that affects different human populations [29]. Furthermore, by comparing the geographical distribution of various human MTBC strains with their position on the phylogenetic tree, it became evident that MTBC most likely originated in Africa and that human MTBC originally spread out of Africa together with ancient human migrations along land routes. This view is further supported by the fact that the so-called “smooth tubercle bacilli,” which are the closest relatives of the human MTBC, are highly restricted to East Africa [30].

Figure 2. Global distribution of the six main lineages of human MTBC. Each dot represents the most frequent lineage(s) circulating in a country. Colours correspond to the lineages defined in Figure 3 (adapted from [20]). doi:10.1371/journal.ppat.1000600.g002

The multilocus sequence data reported by Hershberg et al. [29] further suggested a scenario in which the three “modern” lineages of MTBC (purple, blue, and red in Figure 3) seeded Eurasia, which experienced dramatic human population expansion in more recent times.
These three lineages then spread globally out of Europe, India, and China, respectively, accompanying waves of colonization, trade and conquest. In contrast to the ancient human migrations, however, this more recent dispersal of human MTBC occurred primarily along water routes [29].

The availability of comprehensive DNA sequence data has also allowed researchers to address questions about the molecular evolution of MTBC. In-depth population genetic analyses by Hershberg et al. highlight the fact that purifying selection against slightly deleterious mutations in this organism is strongly reduced compared to other bacteria [29]. As a consequence, nonsynonymous SNPs tend to accumulate in MTBC, leading to a high ratio of nonsynonymous to synonymous mutations (also known as dN/dS). The authors hypothesized that the high dN/dS in MTBC compared to most other bacteria might indicate increased random genetic drift associated with serial population bottlenecks during past human migrations and patient-to-patient transmission. If confirmed, this would indicate that “chance,” not just natural selection, has been driving the evolution of MTBC. Although these kinds of fundamental evolutionary questions are often underappreciated by clinicians and biomedical researchers, studying the evolution of a pathogen ultimately allows for better epidemiological predictions by contributing to our understanding of basic biology, particularly with respect to antibiotic resistance.

A Vision for the Future

Thanks to recent increases in research funding for TB [4], substantial progress has been made in our understanding of the basic biology and epidemiology of the disease. Unfortunately, this increased knowledge has not yet had any noticeable impact on the current global trends of TB (Figure 1).
While TB incidence appears to have stabilized in many countries, the total number of cases is still increasing as a function of global human population growth [1]. Of particular concern are the ongoing epidemics of multidrug-resistant TB [31], as well as the synergies between TB and the ongoing epidemics of HIV/AIDS and other comorbidities such as diabetes (Box 1).

Figure 3. The global phylogeny of Mycobacterium tuberculosis complex (MTBC). The phylogenic relationships between various human- and animal-adapted strains and species are largely consistent when defined by using either (A) large sequence polymorphisms (LSPs) or (B) single nucleotide polymorphisms (SNPs) identified by sequencing 89 genes in 108 MTBC strains. Numbers inside the squares in (A) refer to specific lineage-defining LSPs. Colors indicate congruent lineages (adapted from [20] and [29]). doi:10.1371/journal.ppat.1000600.g003

As our understanding of TB improves, we would like to be able to make better predictions about the future trajectory of the disease and to develop new tools to control the disease better and ultimately reverse global trends. For this to be feasible, TB epidemiology needs to evolve into a more predictive, interdisciplinary endeavour; a discipline we might refer to as “systems epidemiology” (Figure 4). Systems biology is already a rapidly emerging field, in which cycles of mathematical modelling and experiments using various large-scale “-omics” datasets are integrated in an iterative manner [32]. Novel biological processes are being discovered through these systems approaches, which might not have been possible using more traditional methods [33–35]. Last year, Young et al. argued that systems biology approaches will be necessary to elucidate some of the key aspects of host–pathogen interactions in TB [36] and to develop new drugs, vaccines, and biomarkers to evaluate new interventions [3].
For example, according to another dogma in the TB field, latent TB infections are caused by physiologically dormant bacilli and can thus be differentiated from active disease where MTBC is actively growing and dividing [37]. In reality, however, the phenomenon of TB latency most likely reflects a whole spectrum of responses to TB infection, involving phenotypically distinct bacterial subpopulations and spanning various degrees of bacterial burden and associated host immune responses [38]. We agree with Young et al. [36] that TB latency and similar biological complexities will only be adequately addressed using systems approaches, and we argue further that to comprehend the current TB epidemic as a whole, and to better predict its future trajectory, a complementary systems epidemiology approach will be necessary (Figure 4).

Figure 4. A systems epidemiology approach to TB research. The spread of TB is influenced by social and biological factors. On the one hand, the new discipline of systems biology integrates approaches that address the host, the pathogen, and interactions between the two. On the other hand, epidemiology addresses the burden of the disease and the social, economic, and ecological causes of its frequency and distribution. There is little crosstalk between these two disciplines at the moment. “Systems epidemiology” is an attempt to take into account the interactions between these various fields of research. doi:10.1371/journal.ppat.1000600.g004

Mathematical models are already being used extensively to study the epidemiology of TB and to guide control policies [39]. Recent applications have shown that socioeconomic factors are key drivers of today’s TB epidemic [40]. In addition, much theoretical emphasis has been placed on trying to define the impact that drug resistance will have on the global TB epidemic [41].
Some of this theoretical work has become more complex by incorporating new biological insights obtained empirically and through targeted experimental studies. Early theoretical studies on the spread of drug-resistant MTBC were based on the assumption that all drug-resistant bacteria had an inherent fitness disadvantage compared to drug-susceptible strains [42]; however, as is becoming clear from experimental and molecular epidemiological investigation, substantial heterogeneity exists with respect to the reproductive success of drug-resistant strains [43–46]. Newer mathematical models account for some of this heterogeneity [47–49]. One could imagine an expansion of such mathematical approaches (much as systems biology operates) in which epidemiological modelling is combined with more comprehensive biological data related to the host, the pathogen, and their interactions (Figure 4). Of course, environmental and sociological data would also need to be considered [40]. As mathematical models become more finely tuned, they could in turn inform future experimental work to test some of the specific predictions.

The genomics revolution now offers the opportunity to study host–pathogen interactions at an unprecedented depth. To be able to make sense out of the current and upcoming deluge of -omics data, however, scientists will have to rely on a mathematically and statistically robust analytical framework. Ideally, some of these theoretical approaches will be able to accommodate increasingly diverse sets of data in order to capture the various biological, environmental, and social aspects of TB.

Among the newly emerging technologies, we believe that next-generation DNA sequencing will play an important role in improving our understanding of TB [50]. Whole-genome sequencing could potentially become the new gold standard for strain typing in routine molecular epidemiology [51].
For host genetics and TB susceptibility, too, de novo DNA sequencing based approaches could have advantages over traditional SNP typing [52]. For example, many of the human populations carrying the largest proportion of the global TB burden have not been sufficiently characterised genetically (Figure 1) [53,54], and screening for currently limited human SNP collections might have little relevance for these populations [55]. Furthermore, comprehensive DNA sequencing of TB patients and controls in various human populations could help unveil rare but biologically relevant mutations [56]. Another approach increasingly being used to study both the host and the pathogen is sequence-based transcriptomics, in which gene expression is measured by whole genome sequencing of RNA transcripts, a method referred to as RNA-seq [57]. One of the advantages of this approach over existing microarray-based methods is that changes in the expression of noncoding RNAs and other novel transcripts can be easily detected. RNA-seq is particularly useful for genome-wide studies of small regulatory RNAs, as such studies are more difficult to perform using standard DNA microarrays. Recent studies, for example, have reported a role for small regulatory RNAs in M. tuberculosis [58], and there is little doubt more regulatory RNAs will soon be identified by RNA-seq [57].

Challenges for the Future

Advances in TB research are hampered by the fact that MTBC is a Biosafety Level 3 pathogen with a long generation time, making it slow and complex to culture. Moreover, TB is a chronic disease that can develop over many years, and is characterised by extended periods of latency during which MTBC cannot be isolated from infected individuals. All of these factors complicate and prolong the development of new interventions and their assessment in clinical trials.

As we have already mentioned, the field has been marked by a number of dogmas that, in some cases, might have contributed to the slow progress in TB research. New insights are now questioning some of these views, but at the same time, new opinions could well evolve into new dogmas. For example, we and others have spent much of our scientific careers seeking convincing evidence for the role of MTBC strain diversity in human disease. Although some pieces of evidence have recently started to emerge [59–61], the subject needs more work. One of the problems has been that the macrophage and mouse infection models used in these studies relied on poorly characterised strains, and finding relevant links to human disease has been all but impossible [14,21].

In TB control, too, potential new dogmas might emerge to limit future progress. A strong T cell–derived interferon gamma (IFN-γ) response appears to be crucial for the immunological control of TB, and many MTBC antigens have been identified based on their capacity to elicit IFN-γ responses in TB patients or their infected contacts [62]. Some of these antigens are being developed into new TB diagnostics and vaccines, but the potential impact of MTBC diversity on immune responses is not generally being considered [21]. A recent study in The Gambia showed that IFN-γ responses to one of the key MTBC antigens differed in an MTBC lineage–specific manner [63]. Developing a universally effective vaccine might be the only way to eliminate TB in the future [3]. This is particularly true given the large reservoir of latently infected individuals in the world, which would be impossible to eliminate through prophylactic drug treatment. Considering that natural TB infection does not protect against exogenous reinfection and disease, however, mimicking natural infection using attenuated strains or a cocktail of traditional IFN-γ-inducing antigens might not necessarily be the most promising vaccine strategy. Indeed, the largely unsuccessful implementation of BCG vaccination might serve as a warning [64].
One of Acknowledgments We thank Peter Small and Douglas Young for comments on the manuscript. References 1. World Health Organization (2009) Global tuberculosis control - surveillance, planning, financing. Geneva, Switzerland: WHO. 2. Stop TB Partnership (2006) The global plan to stop TB 2006–2015. Geneva: WHO. 3. Young DB, Perkins MD, Duncan K, Barry CE (2008) Confronting the scientific obstacles to global control of tuberculosis. J Clin Invest 118: 1255–1265. 4. Kaufmann SH, Parida SK (2007) Changing funding patterns in tuberculosis. Nat Med 13: 299–303. 5. Frieden TR, Fujiwara PI, Washko RM, Hamburg MA (1995) Tuberculosis in New York City–turning the tide. N Engl J Med 333: 229–233. 6. Sreevatsan S, Pan X, Stockbauer KE, Connell ND, Kreiswirth BN, et al. (1997) Restricted structural gene polymorphism in the Mycobacterium tuberculosis complex indicates evolutionarily recent global dissemination. Proc Natl Acad Sci U S A 94: 9869–9874. 7. van Embden JD, Cave MD, Crawford JT, Dale JW, Eisenach KD, et al. (1993) Strain identification of Mycobacterium tuberculosis by DNA fingerprinting: recommendations for a standardized methodology. J Clin Microbiol 31: 406–409. 8. Small PM, Hopewell PC, Singh SP, Paz A, Parsonnet J, et al. (1994) The epidemiology of tuberculosis in San Francisco. A population-based study using conventional and molecular methods. N Engl J Med 330: 1703–1709. 9. Small PM, Shafer RW, Hopewell PC, Singh SP, Murphy MJ, et al. (1993) Exogenous reinfection with multidrug-resistant Mycobacterium tuberculosis in patients with advanced HIV infection. N Engl J Med 328: 1137–1144. 10. Mathema B, Kurepina NE, Bifani PJ, Kreiswirth BN (2006) Molecular epidemiology of tuberculosis: current insights. Clin Microbiol Rev 19: 658–685. 11. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, et al. (1998) Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393: 537–544. 12. 
Tsolaki AG, Hirsh AE, DeRiemer K, Enciso JA, Wong MZ, et al. (2004) Functional and evolutionary genomics of Mycobacterium tuberculosis: insights from genomic deletions in 100 strains. Proc Natl Acad Sci U S A 101: 4865–4870. 13. Kato-Maeda M, Rhee JT, Gingeras TR, Salamon H, Drenkow J, et al. (2001) Comparing genomes within the species Mycobacterium tuberculosis. Genome Res 11: 547–554. 14. Nicol MP, Wilkinson RJ (2008) The clinical consequences of strain diversity in Mycobacterium tuberculosis. Trans R Soc Trop Med Hyg 102: 955–65. 15. Hirsh AE, Tsolaki AG, DeRiemer K, Feldman MW, Small PM (2004) Stable association between strains of Mycobacterium tuberculosis and their human host populations. Proc Natl Acad Sci U S A 101: 4871–4876. PLoS Pathogens | www.plospathogens.org 16. Supply P, Warren RM, Banuls AL, Lesjean S, Van Der Spuy GD, et al. (2003) Linkage disequilibrium between minisatellite loci supports clonal evolution of Mycobacterium tuberculosis in a high tuberculosis incidence area. Mol Microbiol 47: 529–538. 17. Brosch R, Gordon SV, Marmiesse M, Brodin P, Buchrieser C, et al. (2002) A new evolutionary scenario for the Mycobacterium tuberculosis complex. Proc Natl Acad Sci U S A 99: 3684–3689. 18. Mostowy S, Cousins D, Brinkman J, Aranaz A, Behr MA (2002) Genomic deletions suggest a phylogeny for the Mycobacterium tuberculosis complex. J Infect Dis 186: 74–80. 19. Reed MB, Pichler VK, McIntosh F, Mattia A, Fallow A, et al. (2009) Major Mycobacterium tuberculosis lineages associate with patient country of origin. J Clin Microbiol 47: 1119–28. 20. Gagneux S, Deriemer K, Van T, Kato-Maeda M, de Jong BC, et al. (2006) Variable host-pathogen compatibility in Mycobacterium tuberculosis. Proc Natl Acad Sci U S A 103: 2869–2873. 21. Gagneux S, Small PM (2007) Global phylogeography of Mycobacterium tuberculosis and implications for tuberculosis product development. Lancet Infect Dis 7: 328–337. 22. 
Baker L, Brown T, Maiden MC, Drobniewski F (2004) Silent nucleotide polymorphisms and a phylogeny for Mycobacterium tuberculosis. Emerg Infect Dis 10: 1568–1577. 23. Gutacker MM, Mathema B, Soini H, Shashkina E, Kreiswirth BN, et al. (2006) Single-nucleotide polymorphism-based population genetic analysis of Mycobacterium tuberculosis strains from 4 geographic sites. J Infect Dis 193: 121–128. 24. Filliol I, Motiwala AS, Cavatore M, Qi W, Hernando Hazbon M, et al. (2006) Global phylogeny of Mycobacterium tuberculosis based on single nucleotide polymorphism (SNP) analysis: insights into tuberculosis evolution, phylogenetic accuracy of other DNA fingerprinting systems, and recommendations for a minimal standard SNP set. J Bacteriol 188: 759–772. 25. Brudey K, Driscoll JR, Rigouts L, Prodinger WM, Gori A, et al. (2006) Mycobacterium tuberculosis complex genetic diversity: mining the fourth international spoligotyping database (SpolDB4) for classification, population genetics and epidemiology. BMC Microbiol 6: 23. 26. Maiden MC, Bygraves JA, Feil E, Morelli G, Russell JE, et al. (1998) Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci U S A 95: 3140–3145. 27. Maiden MC (2006) Multilocus sequence typing of bacteria. Annu Rev Microbiol 60: 561–588. 6 October 2009 | Volume 5 | Issue 10 | e1000600 28. Achtman M (2008) Evolution, population structure, and phylogeography of genetically monomorphic bacterial pathogens. Annu Rev Microbiol 62: 53–70. 29. Hershberg R, Lipatov M, Small PM, Sheffer H, Niemann S, et al. (2008) High functional diversity in Mycobacterium tuberculosis driven by genetic drift and human demography. PLoS Biol 6: e311. 30. Gutierrez C, Brisse S, Brosch R, Fabre M, Omais B, et al. (2005) Ancient origin and gene mosaicism of the progenitor of Mycobacterium tuberculosis. PLoS Pathogens 1: 1–7. 31. 
World Health Organization (2008) Anti-tuberculosis drug resistance in the world report no. 4. Geneva, Switzerland: WHO. 32. Zak DE, Aderem A (2009) Systems biology of innate immunity. Immunol Rev 227: 264–282. 33. Gilchrist M, Thorsson V, Li B, Rust AG, Korb M, et al. (2006) Systems biology approaches identify ATF3 as a negative regulator of Toll-like receptor 4. Nature 441: 173–178. 34. Querec TD, Akondy RS, Lee EK, Cao W, Nakaya HI, et al. (2009) Systems biology approach predicts immunogenicity of the yellow fever vaccine in humans. Nat Immunol 10: 116–125. 35. Stuart LM, Boulais J, Charriere GM, Hennessy EJ, Brunet S, et al. (2007) A systems biology analysis of the Drosophila phagosome. Nature 445: 95–101. 36. Young D, Stark J, Kirschner D (2008) Systems biology of persistent infection: tuberculosis as a case study. Nat Rev Microbiol 6: 520–8. 37. Gill WP, Harik NS, Whiddon MR, Liao RP, Mittler JE, et al. (2009) A replication clock for Mycobacterium tuberculosis. Nat Med 15: 211–4. 38. Young DB, Gideon HP, Wilkinson RJ (2009) Eliminating latent tuberculosis. Trends Microbiol 17: 183–188. 39. Cohen T, Dye C, Colijn C, Murray M (2009) Mathematical models of the epidemiology and control of drug-resistant TB. Expert Rev Resp Med in press. 40. Lonnroth K, Jaramillo E, Williams BG, Dye C, Raviglione M (2009) Drivers of tuberculosis epidemics: The role of risk factors and social determinants. Soc Sci Med 68: 2240–6. 41. Borrell S, Gagneux S (2009) Infectiousness, reproductive fitness, and evolution of drug-resistant Mycobactyerium tuberculosis. Int J Tuberc Lung Dis in press. 42. Dye C, Williams BG, Espinal MA, Raviglione MC (2002) Erasing the world’s slow stain: strategies to beat multidrug-resistant tuberculosis. Science 295: 2042–2046. 43. Bottger EC, Springer B, Pletschette M, Sander P (1998) Fitness of antibioticresistant microorganisms and compensatory mutations. Nat Med 4: 1343–1344. 44. Gagneux S, Burgos MV, DeRiemer K, Encisco A, Munoz S, et al. 
(2006) Impact of bacterial genetics on the transmission of isoniazid-resistant Mycobacterium tuberculosis. PLoS Pathog 2: e61.
45. Gagneux S, Long CD, Small PM, Van T, Schoolnik GK, et al. (2006) The competitive cost of antibiotic resistance in Mycobacterium tuberculosis. Science 312: 1944–1946.
46. van Soolingen D, de Haas PE, van Doorn HR, Kuijper E, Rinder H, et al. (2000) Mutations at amino acid position 315 of the katG gene are associated with high-level resistance to isoniazid, other drug resistance, and successful transmission of Mycobacterium tuberculosis in the Netherlands. J Infect Dis 182: 1788–1790.
47. Cohen T, Murray M (2004) Modeling epidemics of multidrug-resistant M. tuberculosis of heterogeneous fitness. Nat Med 10: 1117–1121.
48. Blower SM, Chou T (2004) Modeling the emergence of the ‘hot zones’: tuberculosis and the amplification dynamics of drug resistance. Nat Med 10: 1111–1116.
49. Dye C (2009) Doomsday postponed? Preventing and reversing epidemics of drug-resistant tuberculosis. Nat Rev Microbiol 7: 81–87.
50. Mardis ER (2008) Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9: 387–402.
51. MacLean D, Jones JD, Studholme DJ (2009) Application of ‘next-generation’ sequencing technologies to microbial genetics. Nat Rev Microbiol 7: 287–296.
52. Hardy J, Singleton A (2009) Genomewide association studies and human disease. N Engl J Med 360: 1759–1768.
53. Tishkoff SA, Reed FA, Friedlaender FR, Ehret C, Ranciaro A, et al. (2009) The genetic structure and history of Africans and African Americans. Science 324: 1035–44.
54. Basu A, Mukherjee N, Roy S, Sengupta S, Banerjee S, et al. (2003) Ethnic India: a genomic view, with special reference to peopling and structure. Genome Res 13: 2277–2290.
55. Campbell MC, Tishkoff SA (2008) African Genetic Diversity: Implications for human demographic history, modern human origins, and complex disease mapping. Annu Rev Genomics Hum Genet 9: 403–33.
56. Goldstein DB (2009) Common genetic variation and human traits. N Engl J Med 360: 1696–1698.
57. Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10: 57–63.
58. Arnvig KB, Young DB (2009) Identification of small RNAs in Mycobacterium tuberculosis. Mol Microbiol 73: 397–408.
59. de Jong BC, Hill PC, Aiken A, Awine T, Antonio M, et al. (2008) Progression to active tuberculosis, but not transmission, varies by Mycobacterium tuberculosis lineage in the Gambia. J Infect Dis 198: 1037–43.
60. Caws M, Thwaites G, Dunstan S, Hawn TR, Thi Ngoc Lan N, et al. (2008) The influence of host and bacterial genotype on the development of disseminated disease with Mycobacterium tuberculosis. PLoS Pathog 4: e1000034.
61. Thwaites G, Caws M, Chau TT, D’Sa A, Lan NT, et al. (2008) The relationship between Mycobacterium tuberculosis genotype and the clinical phenotype of pulmonary and meningeal tuberculosis. J Clin Microbiol 46: 1363–8.
62. Ernst JD, Lewinsohn DM, Behar S, Blythe M, Schlesinger LS, et al. (2007) Meeting report: NIH workshop on the Tuberculosis Immune Epitope Database. Tuberculosis (Edinb) 88: 366–70.
63. de Jong BC, Hill PC, Brookes RH, Gagneux S, Jeffries DJ, et al. (2006) Mycobacterium africanum elicits an attenuated T Cell response to Early Secreted Antigenic Target, 6 kDa, in patients with tuberculosis and their household contacts. J Infect Dis 193: 1279–1286.
64. Andersen P, Doherty TM (2005) Opinion: The success and failure of BCG implications for a novel tuberculosis vaccine. Nat Rev Microbiol 3: 656–62.

Review

Helicobacter pylori’s Unconventional Role in Health and Disease

Marion S. Dorer, Sarah Talarico, Nina R. Salama*
Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America

countries has fallen dramatically, for unknown reasons, with a corresponding decrease in gastric cancer [7].
This public health success is tempered by the recent demonstration of an inverse relationship between H. pylori infection and esophageal adenocarcinoma, Barrett’s esophagus, and reflux esophagitis [8]. H. pylori has been with humans since our earliest days, thus it is not surprising that its relationship is that of both a commensal bacterium and a pathogen, causing some diseases and possibly protecting against others. In addition, it is genetically diverse, likely as a result of constant exposure to both environmental and immunological selection, suggesting that genetic diversification is a strategy for long-term colonization. Abstract: The discovery of a bacterium, Helicobacter pylori, that is resident in the human stomach and causes chronic disease (peptic ulcer and gastric cancer) was radical on many levels. Whereas the mouth and the colon were both known to host a large number of microorganisms, collectively referred to as the microbiome, the stomach was thought to be a virtual Sahara desert for microbes because of its high acidity. We now know that H. pylori is one of many species of bacteria that live in the stomach, although H. pylori seems to dominate this community. H. pylori does not behave as a classical bacterial pathogen: disease is not solely mediated by production of toxins, although certain H. pylori genes, including those that encode exotoxins, increase the risk of disease development. Instead, disease seems to result from a complex interaction between the bacterium, the host, and the environment. Furthermore, H. pylori was the first bacterium observed to behave as a carcinogen. The innate and adaptive immune defenses of the host, combined with factors in the environment of the stomach, apparently drive a continuously high rate of genomic variation in H. pylori. Studies of this genetic diversity in strains isolated from various locations across the globe show that H. pylori has coevolved with humans throughout our history. 
This long association has given rise not only to disease, but also to possible protective effects, particularly with respect to diseases of the esophagus. Given this complex relationship with human health, eradication of H. pylori in nonsymptomatic individuals may not be the best course of action. The story of H. pylori teaches us to look more deeply at our resident microbiome and the complexity of its interactions, both in this complex population and within our own tissues, to gain a better understanding of health and disease.

The Role of Infection in Disease Risk

H. pylori infection is generally acquired during childhood and, without specific antibiotic treatment, can persist for the lifetime of the host. Disease often does not develop until adulthood, after decades of infection, and H. pylori induces variable pathologies in the stomach. Duodenal ulcer disease is characterized by gastritis that is largely confined to the antrum (the distal compartment of the stomach), relatively low inflammation of the corpus (the middle, acid-secreting compartment), and high levels of stomach acid secretion (Figure 1A). Those with gastric ulcer or stomach cancer have high levels of inflammation of the corpus, multifocal gastric atrophy, and low levels of stomach acid secretion, due to the destruction of stomach acid–secreting parietal cells (Figure 1B) [9,10]. Some of this inflammatory response is controlled by the cytokine IL-1β, which is induced by H. pylori infection [11] and both elicits a proinflammatory response and inhibits secretion of gastric acid [12]. Polymorphisms in the interleukin gene cluster, including IL-1β, are risk factors for H. pylori–associated gastric cancer [13,14], and studies of the transcriptional response of both human and model hosts to H. pylori confirm induction of transcriptional regulators of proinflammatory programs. In

Citation: Dorer MS, Talarico S, Salama NR (2009) Helicobacter pylori’s Unconventional Role in Health and Disease.
PLoS Pathog 5(10): e1000544. doi:10.1371/journal.ppat.1000544

Common wisdom circa 1980 suggested that the stomach, with its low pH, was a sterile environment. Then, endoscopy of the stomach became common and, in 1984, pathologist Robin Warren and gastroenterologist Barry Marshall saw an extracellular, curved bacillus, often in dense sheets, lining the stomach epithelium of patients with gastritis (inflammation of the stomach) and ulcer disease [1]. Soon, the medical community understood that the gram-negative bacterium Helicobacter pylori, not stress, is the major cause of stomach inflammation, which, in some infected individuals, precedes peptic ulcer disease (10%–20%), distal gastric adenocarcinoma (1%–2%), and gastric mucosal-associated lymphoid tissue (MALT) lymphoma (<1%) [2–5]. Thus, H. pylori gained distinction as the only known bacterial carcinogen [6]. It is believed that half of the world’s population is infected with H. pylori; however, the burden of disease falls disproportionately on less-developed countries. The incidence of infection in developed

Editor: Marianne Manchester, The Scripps Research Institute, United States of America

Published October 26, 2009

Copyright: © 2009 Dorer et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: Work in the Salama lab is supported by National Institutes of Health grant AI054423. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
October 2009 | Volume 5 | Issue 10 | e1000544 Learning about Disease from H. pylori

H. pylori expands our view of how microbes survive at high levels while activating inflammatory responses and shows us that microbes may be underappreciated as an important factor in chronic disease pathogenesis. In the case of pathogens that cause acute infections, there is a massive inflammatory response, which often supports bacterial replication and transmission. Alternatively, some pathogens, such as Mycobacterium tuberculosis, persist in the host by manipulating the immune response to create a protected compartment. H. pylori introduces a third strategy; it actively replicates and maintains a continuous balance with the inflammatory response over years of infection with little evidence for increased H. pylori–related disease upon immune suppression [25]. As the role of chronic inflammation in many diseases including cardiovascular disease, diabetes mellitus, Alzheimer’s disease, and others is increasingly recognized, researchers are focusing on infectious agents as one possible source of this chronic inflammation.

Genomic Insights into the Biology of H. pylori

The study of H. pylori is strongly influenced by the genomic age. The sequencing of its genome was completed in 1997 [26], just 13 years after Marshall and Warren reported their discovery. However, almost a quarter (24%) of H. pylori genes have no sequence similarity with genes available in public databases [27], suggesting that lessons learned from well-studied bacteria like Escherichia coli would not necessarily apply to this evolutionarily distinct Epsilonproteobacteria. By using more advanced bioinformatic approaches, researchers are now identifying some pathways first thought absent in H. pylori. For example, H. pylori appeared to lack the E. coli recBCD pathway, which is involved in homologous recombination and DNA double-strand break repair.
More careful examination of conserved domains and motifs, however, identified the H. pylori addA and addB genes, which are present in most gram-positive and many gram-negative bacteria and whose protein products have enzymatic functions similar to those of the recBCD pathway [28]. By 1999, H. pylori was the first species to have complete genomes sequenced from two different strains—an important milestone, given its genetic diversity. Comparison of the two genomes revealed that 6%–7% of the genes were present in one strain but not in the other. There was also a high level of nucleotide diversity between the two strains, with only eight genes sharing at least 98% nucleotide identity; however, most nucleotide differences were synonymous changes [27]. Microarrays designed upon these sequences were then used for comparative genomic hybridization of H. pylori strains isolated from different ethnic groups and geographic areas [29,30]. These studies found that 25% of H. pylori genes are variably present among strains. Such genome-wide analyses have played an important role in dividing H. pylori genes into two classes: variable genes that are absent in some strains and core genes that are present in all strains analyzed. The variable genes are likely adaptive for different environmental niches, which for the human stomach–restricted H. pylori comprise genetically distinct hosts. The largest annotated class of variable genes encode proteins expressed on or that modify the bacterial cell surface (outer membrane proteins and proteins involved in lipopolysaccharide synthesis) [30], consistent with a function at the interface of the bacteria and host. The core genes have diverse functions. Some core genes are required for viability in culture. A genomic study that utilized microarray-based mapping of a genome-saturating transposon library (a collection of H.
pylori strains that includes transposon mutants randomly distributed throughout the genome) revealed that 23% of the genome is required for viability in culture because these genes could not tolerate transposon

Figure 1. Distinct pathologies of H. pylori–induced disease. (A) Duodenal ulcer disease correlates with high inflammation in the antrum (red bursts), lower levels of inflammation in the corpus, and high acid secretion (+). (B) Gastric ulcer or adenocarcinoma correlates with increased inflammation in the corpus, low acid secretion, and multifocal atrophy (wavy lines). doi:10.1371/journal.ppat.1000544.g001

addition, transcription profiles reveal induction of several chemokines and cytokines including those produced by nonlymphoid cells, and robust induction of innate immune defenses including iron sequestration proteins and antimicrobial peptides [15]. These studies suggest it would be wise to explore diverse functional classes of genes for host genetic variant associations with H. pylori disease progression. To this end, H. pylori researchers are eagerly awaiting an unbiased genome-wide association study of risk factors associated with progression to intestinal-type gastric cancer or peptic ulcer disease in patients infected with H. pylori. Such a study has been completed for sporadic diffuse-type gastric cancer, which can be associated with H. pylori infection, revealing two candidate loci, one that encodes a likely tumor suppressor (prostate stem cell antigen [PSCA]) [16]. Genomic studies of this sort will help elucidate host factors that synergize with H. pylori infection to cause disease. The association of H. pylori infection with gastric cancer raises the interesting question of whether H. pylori encodes one or more oncogenes. Oncogenic viruses initiate and promote cellular transformation by integrating virally encoded oncogenes into the host genome [17,18]. By contrast, H.
pylori remains primarily extracellular and does not integrate its genome into the host DNA. The bacterium can still affect the function of host cells, however, by translocating a bacterial protein, CagA, into host cells via a specialized secretion system called the cag Type IV secretion system (T4SS) [19,20]. In host cells, CagA interacts with a number of cellular complexes implicated in oncogenesis [21,22]. Despite elucidation of potentially transforming activities, transgenic expression of CagA in the mouse stomach is only weakly oncogenic [23]. As the cag T4SS also induces proinflammatory cytokines via the intracellular bacterial peptidoglycan recognition molecule Nod1, cancer progression may occur through synergy with the host inflammatory response [24]. While CagA may not promote cancer itself, exposure to CagA and inflammatory insults may select for heritable host cell changes (genetic or epigenetic) that together contribute to cancer progression.

to sample adaptive variants. HIV, for example, has a flexible reverse transcriptase that makes point mutations, insertions, deletions, transversions, and duplications that produce variants that may have a selective advantage [35]. Genetic variation in a microbe indicates constant selection by a dynamic environment, and H. pylori is a very genetically diverse species of bacteria [36–38]. Genetic diversification may help H. pylori to adapt to a new host after transmission, to different micro-niches within a single host, and to changing conditions in the host over time—for example, by avoiding clearance by host defenses. Genetic diversity arises from within-genome diversification as well as from reassortment by recombination with DNA from other infecting H. pylori, generating novel clones within the stomach (Figure 2).
Within-genome diversification can include point mutations, intragenomic recombination, and slipped-strand mispairing during DNA replication within repetitive sequences. Reassortment can occur by recombination with either DNA from a superinfecting H. pylori strain or a variant clone of the same strain. Central to this reassortment is H. pylori’s natural competence—the ability to take up exogenous DNA and incorporate it into its genome. Evidence from our lab shows that natural competence is induced by DNA damage, suggesting that H. pylori responds to stress by diversifying its genome (MSD and NRS, unpublished data). However, there are controls on this rampant genetic exchange: restriction-modification systems, which include a restriction endonuclease that cleaves a specific DNA sequence and a DNA methyltransferase that protects the bacterium’s own DNA from being cleaved by methylating the target DNA sequence. Genes that encode restriction-modification systems compose the second largest class of variably present genes with known function, so the complement of available restriction-modification systems varies between strains, giving a methylation code to the DNA from each strain. This mechanism serves to limit or prevent recombination between H. pylori strains as well as between H. pylori and other bacteria or eukaryotic cells [39]. The H. pylori genome encodes relatively few proteins that regulate transcription. Instead, some of the same processes that govern the generation of genetic diversity (i.e., slipped-strand mispairing, methyltransferase activity, and recombination) also play an important role in varying gene expression in response to environmental cues. There are 46 H. pylori genes that have long repeats of one or two nucleotides that are prone to slipped-strand mispairing during replication [26,27,40]. These genes are phase-variable because changes in the number of repeats can shift the reading frame of the gene, switching gene expression on or off (Figure 2).
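The on/off switching caused by repeat-length changes can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the toy gene, its poly-G tract, the reference repeat length of six, and the helper names build_gene, translate, and is_switched_on are hypothetical and are not taken from an H. pylori sequence or from any published tool. Only the codon table is the standard genetic code.

```python
# Toy model of phase variation by slipped-strand mispairing.
# Assumption: the gene below is invented for illustration only.

# Standard genetic code, codons enumerated in T, C, A, G order; '*' is a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

DOWNSTREAM = "CATGACCGATCCTAA"  # hypothetical coding sequence after the repeat
REFERENCE_REPEAT_LEN = 6        # hypothetical repeat length that keeps the frame


def build_gene(repeat_len):
    """Start codon, a homopolymeric G tract of the given length, then the rest."""
    return "ATG" + "G" * repeat_len + DOWNSTREAM


def translate(dna):
    """Translate from the first base until a stop codon or the end of the DNA."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


def is_switched_on(repeat_len):
    """The downstream frame is preserved only if the tract changed by a multiple of 3."""
    return (repeat_len - REFERENCE_REPEAT_LEN) % 3 == 0


for n in range(5, 10):
    state = "on" if is_switched_on(n) else "off"
    print(n, state, translate(build_gene(n)))
```

In this toy gene, gaining a single G (seven repeats) brings a stop codon into frame and truncates the product, whereas gaining three (nine repeats) restores the frame and only adds one glycine.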
In addition, many H. pylori promoters have mononucleotide repeats that regulate gene expression by changing the spacing between important regulatory sites in these promoters. Orphan methyltransferases, which have lost their corresponding restriction enzyme, may also regulate gene expression by methylating sequences in the promoter region of genes, and some of the methyltransferase genes are themselves subject to phase-variable expression. Recombination regulates gene expression through deletions and duplications that occur during gene conversion and locus switching. These mechanisms suggest that H. pylori survives by constantly generating variants that adapt its physiology to new environments. One example of how H. pylori’s genetic variability helps it adapt to new environments involves its adhesin genes, which encode proteins that bind to the Lewis human blood group antigens, which are carbohydrate-based epitopes [41]. The protein encoded by one of these adhesin genes, BabA, binds the Lewis-b antigen on the gastric mucosa, helping the bacterium adhere to the mucosa. The babA gene is silent in some H. pylori strains but can be

Box 1. Tracking Human Genealogy with H. pylori Genomics

Currently, a number of companies propose to predict your “genetic genealogy” from the DNA in a cheek swab. They do this by analyzing informatively variable parts of our genomes (such as the Y chromosome or mitochondrial DNA) that show characteristic differences between ethnic and geographic populations; thus, they can tell if you may be distantly related to Genghis Khan, for example. Unfortunately, population bottlenecks [51], small population sizes, and long generation times have limited the amount of genetic diversity in the human population that can be used for these analyses. It turns out, however, that genomic sequencing of the H. pylori strain harbored by an individual does a better job in resolving ancestry than the usual human genomic markers [52].
This is because of high genetic diversity among H. pylori strains [53], a restricted mode of transmission (primarily within families or households [54]), and the association of H. pylori with humans throughout our evolution [55]. A major source of H. pylori’s genetic diversity is recombination between strains [38], which blurs signatures of descent. Despite this confounding factor, Achtman and colleagues [53] identified evolutionary signatures in strain sequences from diverse geographic sources. These signatures, combined with new statistical tools that take into account admixture and recombination [55], have tracked ancient human migrations, such as our emergence from Africa [55], and more recent events such as colonization of the Pacific islands [56]. H. pylori gene sequences can even distinguish between the Buddhist and Muslim ethnic groups that have coexisted for at least 1,000 years in Ladakh [52]. The fact that H. pylori has maintained evolutionarily distinct strain signatures during many generations of contact suggests either that interracial interactions that promote transmission are very limited or that additional mechanisms prevent strains from one ethnic population from establishing a foothold in hosts of another ethnic population.

insertion [31]. Additional core genes are essential only in the context of host infection, and several groups have completed screens for transposon mutants that fail to colonize animal models of infection [32,33]. An example of such a colonization core gene is addA, which is required for recombinational repair of DNA double-strand breaks, presumably caused by the host inflammatory response [28]. The nucleotide sequence diversity in H. pylori’s core genes can distinguish between different ethnic and geographic human populations, demonstrating that passage of H. pylori between closely related humans has continued uninterrupted over tens of thousands of years (see Box 1).
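The division of genes into core and variable classes used throughout this review amounts to a simple rule applied to presence/absence calls, such as those obtained from comparative genomic hybridization. The Python sketch below is only an illustration under stated assumptions: the gene labels, strain labels, and calls in toy_calls are invented, and the function names classify_genes and variable_fraction are hypothetical. A gene is treated as core only if it is detected in every strain analyzed.

```python
# Minimal sketch: split genes into "core" and "variable" classes from
# presence/absence calls. All labels and calls below are invented.

def classify_genes(presence):
    """presence maps gene -> {strain: True if detected, False if absent}."""
    core, variable = [], []
    for gene, calls in presence.items():
        if all(calls.values()):
            core.append(gene)      # detected in every strain analyzed
        else:
            variable.append(gene)  # absent from at least one strain
    return sorted(core), sorted(variable)


def variable_fraction(presence):
    """Fraction of genes that are variably present among the strains."""
    _, variable = classify_genes(presence)
    return len(variable) / len(presence)


# Hypothetical calls for four genes across three strains.
toy_calls = {
    "gene_1": {"strain_A": True, "strain_B": True, "strain_C": True},
    "gene_2": {"strain_A": True, "strain_B": True, "strain_C": True},
    "gene_3": {"strain_A": True, "strain_B": False, "strain_C": True},
    "gene_4": {"strain_A": True, "strain_B": True, "strain_C": True},
}

core, variable = classify_genes(toy_calls)
print("core:", core)
print("variable:", variable)
print("variable fraction:", variable_fraction(toy_calls))
```

Note that the classification depends on the strains sampled: a gene called core here could become variable once a strain lacking it is added, which is why the text says core genes are those present in all strains analyzed.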
Different geographic and ethnic groups that have similar infection rates have quite varied relative risks of H. pylori–associated diseases such as gastric cancer [34]. Thus, in addition to host genetic and environmental exposures, differences among strains likely contribute to variation in disease risk. Consequently, studies of pathogenesis need to be reproduced in representative strain backgrounds to ensure that discoveries in one strain apply in strain populations with a diverse evolutionary history.

H. pylori Diversification during Persistent Infection

Genetic diversification can aid in the persistence of organisms that continue to replicate during chronic infection, allowing them

Figure 2. Mechanisms that create genetic diversity in H. pylori. Colored arrows represent different genes, and the correspondingly colored triangles, rectangles, and circles represent the proteins encoded by these genes. Diversification mechanisms (right side of figure) include spontaneous point mutations, slipped-strand mispairing, and intragenomic recombination. Allelic changes involving nonsynonymous point mutations and mosaic genes resulting from intragenomic recombination can alter the function and/or the antigenic epitopes of the encoded protein. Gene expression can also be regulated by gene conversion resulting from intragenomic recombination, and phase variation mediated by slipped-strand mispairing. Reassortment of genes (left side of figure) by natural transformation with exogenous DNA also contributes to genetic diversity. Natural transformation with DNA from a superinfecting strain, for example, can introduce new genes and new alleles of already present genes (horizontal gene transfer).
Similarly, natural transformation with DNA from a variant clone of the same strain can further propagate an advantageous allele acquired by within-genome diversification. doi:10.1371/journal.ppat.1000544.g002

expressed if it recombines with the babB gene, an event mediated by homologous sequences at the 5′ and 3′ ends of the two genes [42]. Thus, recombination can help H. pylori alter its adherence properties to adapt to selective pressures in the host. These selective pressures may include variation in the host receptors present or in conditions that favor a shift in the ratio of bacteria adherent to the gastric cell epithelium over those swimming freely in the mucus. Genetic variation may also be important for the ability of H. pylori to evade the host immune system. H. pylori further exploits the Lewis antigen system by “camouflaging” its surface lipopolysaccharide with its own Lewis-type antigen, which mimics that of the individual host. The bacterium adapts the spectrum of Lewis antigens it expresses by phase variation of the genes involved in their biosynthesis [43]. Furthermore, recombination among the many members of the large outer membrane protein (omp) gene family has the potential to create mosaic omp genes, generating antigenic variation that may keep H. pylori ahead of the ability of the host’s immune system to recognize these cell surface exposed epitopes.

H. pylori’s Interaction with the Microbiome

H. pylori shares its niche with the stomach microbiome, the collection of microorganisms living on and in us. Study of microorganisms was once limited to only those microbes that could be cultured in the laboratory. Advances in sequencing technology now allow us to study the collection of genes encoded by any group of organisms—so-called metagenomics—making it possible to characterize also the microbes that cannot be cultured but nevertheless affect our health. Given that H.
pylori engages in DNA exchange, the metagenome may serve as a repository for novel traits. When present, H. pylori dominates the microbiome in the stomach [44,45], although the effect of this dominance is not known. Perhaps H. pylori infection changes the composition of the stomach microbiome, with unknown consequences.

Challenges for the Future

H. pylori is considered pathogenic, even carcinogenic. With this simple view, eradication seems an obvious choice. In reality, however, the relationship between H. pylori and disease is more nuanced. A recent trial showed that, like the cancer risk associated with smoking, the cancer risk from H. pylori diminished measurably only 12 years after eradication of the infection [46]. Some studies suggest that infection may prevent diseases of the esophagus, and there is a debate in the literature concerning a relationship between H. pylori and childhood asthma [8,47,48]. There is clear consensus that H. pylori should be eliminated in cases of peptic ulcer disease, gastric MALT lymphoma, early gastric cancer, first-degree relatives of gastric cancer patients, and uninvestigated dyspepsia in high-prevalence populations. Despite its potential to prevent ulcer and cancer, universal eradication of H. pylori infection has not gained wide support, because of the mixture of positive and negative disease associations with infection, the lack of a definitive bacterial or host molecule accounting for disease causation, and poor success rates of treating non-ulcer dyspepsia by clearing H. pylori infection [49,50]. Thus a more detailed picture of this host–pathogen interaction is needed and likely will depend upon further advances in both endoscopy and genomics. We have a poor understanding of the immune responses to H. pylori and the reasons that most hosts fail to clear infection. The host restriction of H.
pylori to humans and some nonhuman primates has hampered development of robust animal models to study the disease process. Thus progress will require improvements in animal models and improved access to patient samples. Endoscopy of the upper gastrointestinal tract is an invasive procedure, so a major limitation to research is collection of bacterial and human tissue samples from infected people. Available samples are biased toward patients with severe dyspepsia, ulcer symptoms, and gastric cancer, and only a small fraction of the stomach can be sampled. Advances in less-invasive methods, such as capsule endoscopy, may allow increased sampling to monitor bacterial and tissue changes during chronic colonization, including isolation and phenotypic analysis of immune effector cells in infected tissue. Less-invasive methods would also provide an opportunity to study infection in asymptomatic individuals and transmission of H. pylori infection, conditions in which the selective pressures that drive the observed H. pylori genetic diversification likely operate. A major opportunity to increase our understanding of how H. pylori causes or prevents disease arises from recent advances in high-throughput sequencing technologies. Currently, several platforms allow researchers to accomplish in a single experiment sequencing or resequencing of tens of H. pylori genomes, characterization of host immune and epithelial cell types that change during infection with highly sensitive digital expression tag analysis, or analysis of the microbiome present in the stomach and esophagus through metagenomic sequencing or targeted bacterial or fungal small ribosomal subunit DNA sequencing. The sequence data generated by such experiments will address several important mysteries of H. pylori biology, including the timing and extent of H. pylori genetic diversification. 
While strains from unrelated individuals show dramatic variation in gene content and gene sequence, the extent of sequence variation among clones during persistent infection of a single host or upon transmission has not been adequately sampled. Whole-genome sequencing of multiple isolates of individual patients with dense spatial and temporal sampling would definitively establish when, where, and by what mechanisms genetic diversity is generated. This information will inform efforts to combat resistance to current antibiotics, to develop vaccines, and to understand H. pylori’s coevolution with humans. Exploration of the influence of H. pylori on the microbiome will identify organisms that collaborate with or can be antagonized by H. pylori. Such organisms may mediate some of the disease risks that have been associated with H. pylori presence and absence. Finally, the rapid pace of resequencing of H. pylori’s human host will provide a deeper understanding of genetic variation in the human population that may influence risk for H. pylori–associated pathologies and which, by association, could provide clues to the cellular pathways disrupted in disease. Thus, genomic approaches to study host response, the human microbiome, bacterial genetic variation, and, perhaps most importantly, the intersections among these components, will help researchers determine whether eradication is appropriate for all individuals in all populations.

Acknowledgments

We thank Olivier Humbert and Laura Sycuro for their critical comments on the manuscript and Laura Sycuro for providing H. pylori images.

References

12. El-Omar EM (2001) The importance of interleukin 1beta in Helicobacter pylori associated disease. Gut 48: 743–747.
13. El-Omar EM, Carrington M, Chow WH, McColl KE, Bream JH, et al. (2000) Interleukin-1 polymorphisms associated with increased risk of gastric cancer. Nature 404: 398–402.
14. Figueiredo C, Machado JC, Pharoah P, Seruca R, Sousa S, et al.
(2002) Helicobacter pylori and interleukin 1 genotyping: an opportunity to identify high-risk individuals for gastric carcinoma. J Natl Cancer Inst 94: 1680–1687.
15. Humbert O, Pinto-Santini DM, Salama NR (2008) Genomotyping of Helicobacter pylori and its host: microarray-based insights on gene variation, expression and function. In: Yamaoka Y, ed. Helicobacter pylori Molecular Genetics and Cellular Biology. Norfolk, UK: Caister Academic Press. pp 205–244.
16. Sakamoto H, Yoshimura K, Saeki N, Katai H, Shimoda T, et al. (2008) Genetic variation in PSCA is associated with susceptibility to diffuse-type gastric cancer. Nat Genet 40: 730–740.
17. Maeda N, Fan H, Yoshikai Y (2008) Oncogenesis by retroviruses: Old and new paradigms. Rev Med Virol 18: 387–405.
18. Howley PM, Livingston DM (2009) Small DNA tumor viruses: Large contributors to biomedical sciences. Virology 384: 256–259.
19. Segal ED, Cha J, Lo J, Falkow S, Tompkins LS (1999) Altered states: Involvement of phosphorylated CagA in the induction of host cellular growth changes by Helicobacter pylori. Proc Natl Acad Sci U S A 96: 14559–14564.
20. Stein M, Rappuoli R, Covacci A (2000) Tyrosine phosphorylation of the Helicobacter pylori CagA antigen after cag-driven host cell translocation. Proc Natl Acad Sci U S A 97: 1263–1268.
21. Bourzac KM, Guillemin K (2005) Helicobacter pylori-host cell interactions mediated by type IV secretion. Cell Microbiol 7: 911–919.
1. Marshall BJ, Warren JR (1984) Unidentified curved bacilli in the stomach of patients with gastritis and peptic ulceration. Lancet 1: 1311–1315.
2. Nomura A, Stemmermann GN, Chyou P, Kato I, Perez-Perez G, et al. (1991) Helicobacter pylori infection and gastric carcinoma among Japanese Americans in Hawaii. N Engl J Med 325: 1132–1136.
3. Parsonnet J, Friedman GD, Vandersteen DP, Chang Y, Vogelman JH, et al. (1991) Helicobacter pylori infection and the risk of gastric carcinoma. N Engl J Med 325: 1127–1131.
4.
Parsonnet J, Hansen S, Rodriguez L, Gelb AB, Warnke RA, et al. (1994) Helicobacter pylori infection and gastric lymphoma. N Engl J Med 330: 1267–1271.
5. Kusters JG, van Vliet AH, Kuipers EJ (2006) Pathogenesis of Helicobacter pylori infection. Clin Microbiol Rev 19: 449–490.
6. WHO (2006) Fact sheet No. 297, Cancer. World Health Organization.
7. Peek RM Jr, Blaser MJ (2002) Helicobacter pylori and gastrointestinal tract adenocarcinomas. Nat Rev Cancer 2: 28–37.
8. Anderson LA, Murphy SJ, Johnston BT, Watson RG, Ferguson HR, et al. (2008) Relationship between Helicobacter pylori infection and gastric atrophy and the stages of the oesophageal inflammation, metaplasia, adenocarcinoma sequence: Results from the FINBAR case-control study. Gut 57: 734–739.
9. Amieva MR, El-Omar EM (2008) Host-bacterial interactions in Helicobacter pylori infection. Gastroenterology 134: 306–323.
10. Rubin CE (1997) Are there three types of Helicobacter pylori gastritis? Gastroenterology 112: 2108–2110.
11. Basso D, Scrigner M, Toma A, Navaglia F, Di Mario F, et al. (1996) Helicobacter pylori infection enhances mucosal interleukin-1 beta, interleukin-6, and the soluble receptor of interleukin-2. Int J Clin Lab Res 26: 207–210.
39. Humbert O, Salama NR (2008) The Helicobacter pylori HpyAXII restriction-modification system limits exogenous DNA uptake by targeting GTAC sites but shows asymmetric conservation of the DNA methyltransferase and restriction endonuclease components. Nucleic Acids Res 36: 6893–6906.
40. Salaun L, Linz B, Suerbaum S, Saunders NJ (2004) The diversity within an expanded and redefined repertoire of phase-variable genes in Helicobacter pylori. Microbiology 150: 817–830.
41. Lloyd KO (2000) The chemistry and immunochemistry of blood group A, B, H, and Lewis antigens: Past, present and future. Glycoconj J 17: 531–541.
42.
Backstrom A, Lundberg C, Kersulyte D, Berg DE, Boren T, et al. (2004) Metastability of Helicobacter pylori bab adhesin genes and dynamics in Lewis b antigen binding. Proc Natl Acad Sci U S A 101: 16923–16928. 43. Wirth HP, Yang M, Peek RM Jr, Tham KT, Blaser MJ (1997) Helicobacter pylori Lewis expression is related to the host Lewis phenotype. Gastroenterology 113: 1091–1098. 44. Bik EM, Eckburg PB, Gill SR, Nelson KE, Purdom EA, et al. (2006) Molecular analysis of the bacterial microbiota in the human stomach. Proc Natl Acad Sci U S A 103: 732–737. 45. Andersson AF, Lindberg M, Jakobsson H, Backhed F, Nyren P, et al. (2008) Comparative analysis of human gut microbiota by barcoded pyrosequencing. PLoS ONE 3: e2836. doi:10.1371/journal.pone.0002836. 46. Mera R, Fontham ET, Bravo LE, Bravo JC, Piazuelo MB, et al. (2005) Long term follow up of patients treated for Helicobacter pylori infection. Gut 54: 1536–1540. 47. Raj SM, Choo KE, Noorizan AM, Lee YY, Graham DY (2009) Evidence against Helicobacter pylori being related to childhood asthma. J Infect Dis 199: 914–915; author reply 915–916. 48. Chen Y, Blaser MJ (2008) Helicobacter pylori colonization is inversely associated with childhood asthma. J Infect Dis 198: 553–560. 49. Chey WD, Wong BC (2007) American College of Gastroenterology guideline on the management of Helicobacter pylori infection. Am J Gastroenterol 102: 1808–1825. 50. Malfertheiner P, Megraud F, O’Morain C, Bazzoli F, El-Omar E, et al. (2007) Current concepts in the management of Helicobacter pylori infection: The Maastricht III Consensus Report. Gut 56: 772–781. 51. Cann RL, Stoneking M, Wilson AC (1987) Mitochondrial DNA and human evolution. Nature 325: 31–36. 52. Wirth T, Wang X, Linz B, Novick RP, Lum JK, et al. (2004) Distinguishing human ethnic groups by means of sequences from Helicobacter pylori: Lessons from Ladakh. Proc Natl Acad Sci U S A 101: 4746–4751. 53. Achtman M, Azuma T, Berg DE, Ito Y, Morelli G, et al. 
(1999) Recombination and clonal groupings within Helicobacter pylori from different geographical regions. Mol Microbiol 32: 459–470. 54. Schwarz S, Morelli G, Kusecek B, Manica A, Balloux F, et al. (2008) Horizontal versus familial transmission of Helicobacter pylori. PLoS Pathog 4: e1000180. doi:10.1371/journal.ppat.1000180. 55. Falush D, Wirth T, Linz B, Pritchard JK, Stephens M, et al. (2003) Traces of human migrations in Helicobacter pylori populations. Science 299: 1582–1585. 56. Moodley Y, Linz B, Yamaoka Y, Windsor HM, Breurec S, et al. (2009) The peopling of the Pacific from a bacterial perspective. Science 323: 527–530. 22. Hatakeyama M (2006) Helicobacter pylori CagA — A bacterial intruder conspiring gastric carcinogenesis. Int J Cancer 119: 1217–1223. 23. Ohnishi N, Yuasa H, Tanaka S, Sawa H, Miura M, et al. (2008) Transgenic expression of Helicobacter pylori CagA induces gastrointestinal and hematopoietic neoplasms in mouse. Proc Natl Acad Sci U S A 105: 1003–1008. 24. Viala J, Chaput C, Boneca IG, Cardona A, Girardin SE, et al. (2004) Nod1 responds to peptidoglycan delivered by the Helicobacter pylori cag pathogenicity island. Nat Immunol 5: 1166–1174. 25. Romanelli F, Smith KM, Murphy BS (2007) Does HIV infection alter the incidence or pathology of Helicobacter pylori infection? AIDS Patient Care STDS 21: 908–919. 26. Tomb JF, White O, Kerlavage AR, Clayton RA, Sutton GG, et al. (1997) The complete genome sequence of the gastric pathogen Helicobacter pylori [published erratum appears in Nature 1997 Sep 25;389(6649):412]. Nature 388: 539–547. 27. Alm RA, Ling LS, Moir DT, King BL, Brown ED, et al. (1999) Genomicsequence comparison of two unrelated isolates of the human gastric pathogen Helicobacter pylori. Nature 397: 176–180. 28. Amundsen SK, Fero J, Hansen LM, Cromie GA, Solnick JV, et al. (2008) Helicobacter pylori AddAB helicase-nuclease and RecA promote recombinationrelated DNA repair and survival during stomach colonization. 
Mol Microbiol 69: 994–1007. 29. Gressmann H, Linz B, Ghai R, Pleissner KP, Schlapbach R, et al. (2005) Gain and loss of multiple genes during the evolution of Helicobacter pylori. PLoS Genet 1: e43. doi:10.1371/journal.pgen.0010043. 30. Salama N, Guillemin K, McDaniel TK, Sherlock G, Tompkins L, et al. (2000) A whole-genome microarray reveals genetic diversity among Helicobacter pylori strains. Proc Natl Acad Sci U S A 97: 14668–14673. 31. Salama NR, Shepherd B, Falkow S (2004) Global transposon mutagenesis and essential gene analysis of Helicobacter pylori. J Bacteriol 186: 7926–7935. 32. Baldwin DN, Shepherd B, Kraemer P, Hall MK, Sycuro LK, et al. (2007) Identification of Helicobacter pylori genes that contribute to stomach colonization. Infect Immun 75: 1005–1016. 33. Kavermann H, Burns BP, Angermuller K, Odenbreit S, Fischer W, et al. (2003) Identification and characterization of Helicobacter pylori genes essential for gastric colonization. J Exp Med 197: 813–822. 34. Yamaguchi N, Kakizoe T (2001) Synergistic interaction between Helicobacter pylori gastritis and diet in gastric cancer. Lancet Oncol 2: 88–94. 35. Johnson WE, Desrosiers RC (2002) Viral persistance: HIV’s strategies of immune system evasion. Annu Rev Med 53: 499–518. 36. Israel DA, Salama N, Krishna U, Rieger UM, Atherton JC, et al. (2001) Helicobacter pylori genetic diversity within the gastric niche of a single human host. Proc Natl Acad Sci U S A 98: 14625–14630. 37. Salama NR, Gonzalez-Valencia G, Deatherage B, Aviles-Jimenez F, Atherton JC, et al. (2007) Genetic analysis of Helicobacter pylori strain populations colonizing the stomach at different times postinfection. J Bacteriol 189: 3834–3845. 38. Suerbaum S, Smith JM, Bapumia K, Morelli G, Smith NH, et al. (1998) Free recombination within Helicobacter pylori. Proc Natl Acad Sci U S A 95: 12619–12624. 
Review

Helminth Genomics: The Implications for Human Health

Paul J. Brindley1*, Makedonka Mitreva2, Elodie Ghedin3, Sara Lustigman4

1 Department of Microbiology, Immunology, and Tropical Medicine, George Washington University Medical Center, Washington, D.C., United States of America, 2 Genome Centre and Department of Genetics, Washington University School of Medicine, St. Louis, Missouri, United States of America, 3 Division of Infectious Diseases, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America, 4 New York Blood Center, Laboratory of Molecular Parasitology, New York, New York, United States of America

Abstract: More than two billion people (one-third of humanity) are infected with parasitic roundworms or flatworms, collectively known as helminth parasites. These infections cause diseases that are responsible for enormous levels of morbidity and mortality, delays in the physical development of children, loss of productivity among the workforce, and maintenance of poverty. Genomes of the major helminth species that affect humans, and many others of agricultural and veterinary significance, are now the subject of intensive genome sequencing and annotation. Draft genome sequences of the filarial worm Brugia malayi and two of the human schistosomes, Schistosoma japonicum and S. mansoni, are now available, among others. These genome data will provide the basis for a comprehensive understanding of the molecular mechanisms involved in helminth nutrition and metabolism, host-dependent development and maturation, immune evasion, and evolution. They are likely also to predict new potential vaccine candidates and drug targets. In this review, we present an overview of these efforts and emphasize the potential impact and importance of these new findings.

Citation: Brindley PJ, Mitreva M, Ghedin E, Lustigman S (2009) Helminth Genomics: The Implications for Human Health. PLoS Negl Trop Dis 3(10): e538. doi:10.1371/journal.pntd.0000538
Editor: Matty Knight, Biomedical Research Institute, United States of America
Published October 26, 2009
Copyright: © 2009 Brindley et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: Support from the NIH-NIAID, award numbers R01AI072773 (to PJB) and R01AI081803 (to MM) is gratefully acknowledged. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
This article is part of the "Genomics of Emerging Infectious Disease" PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
www.plosntds.org | October 2009 | Volume 3 | Issue 10 | e538

Helminth Infections: The Great Neglected Tropical Diseases

Helminth parasites are parasitic worms from the phyla Nematoda (roundworms) and Platyhelminthes (flatworms) (Figures 1 and 2); together, they comprise the most common infectious agents of humans in developing countries. The collective burden of the common helminth diseases, which range from the dramatic sequelae of elephantiasis and blindness to the more subtle but widespread effects on child development, pregnancy, and productivity, rivals that of the main high-mortality conditions such as HIV/AIDS or malaria [1]. For example, based on a recent analysis [2], 85% of the neglected tropical disease (NTD) burden for the poorest 500 million people living in sub-Saharan Africa (SSA) results from helminth infections. Hookworm infection occurs in almost half of the poorest people in SSA, including 40–50 million school-aged children and 7 million pregnant women, in whom it is a leading cause of anemia. Schistosomiasis (192 million cases) is the second most prevalent NTD after hookworm; these cases account for 93% of the world’s cases of schistosomiasis, and the disease is possibly associated with increased horizontal transmission of HIV/AIDS. Lymphatic filariasis (46–51 million cases) and onchocerciasis (37 million cases) are also widespread in SSA, each disease representing a significant cause of disability and reduction in the region’s agricultural productivity. The disease burden estimate in disability-adjusted life years (DALYs) for total helminth infections in SSA is 5.4–18.3 million, in comparison to 40.9 million DALYs for malaria and 9.3 million DALYs for tuberculosis. Yet, research into helminth infections has not received nearly the same level of support. This is partly because helminthiases are diseases of the poorest people in the poorest regions, but also because these pathogens are difficult to study in the laboratory by comparison to most model eukaryotes and many other pathogens. Standard tools and approaches, including cell lines, culture in vitro, and animal models, are generally lacking. In addition, the genomes of helminths are generally much more complex than those of model organisms like yeast and fruit flies [2].

Whereas helminth diseases are ancient scourges of humanity, with some known from biblical times, most can also be considered as re-emerging diseases in the sense that new outbreaks are reported routinely in response to environmental and sociopolitical changes [3]. For example, schistosomiasis has reemerged many times in Africa in recent times in response to hydrological changes (e.g., construction of dams, irrigation canals, and reservoirs) that establish suitable new environments for the intermediate host snails that transmit the parasites. Schistosomiasis has also reemerged in mountainous and hilly regions in Sichuan, China, where it had been controlled previously by intensive interventions [4]. Furthermore, new strains of schistosomes are emerging through natural hybridizations between human and cattle species of schistosomes [5].

Despite the difficulties with investigation of helminth parasites, new insights into fundamental helminth biology are accumulating through genome projects and the application of genome manipulation technologies including RNA interference and transgenesis (Figure 3). What’s more, research on the immunology of helminth infections has contributed enormously to our understanding of Th2 immune responses, the function of regulatory T cells, generation of alternatively activated macrophages, and the transmission dynamics of infectious agents. It is hoped that this progress can be translated into new and robust drugs, diagnostics, and vaccines for the helminth diseases of humanity and those of our livestock and companion species [1,6–10].

Figure 1. Montage of some of the major human helminth parasites, their developmental stages, and disease pathology. (A) Microfilaria of Brugia malayi in a thick blood smear, stained with Giemsa (http://www.dpd.cdc.gov/dpdx/html/frames/a-f/filariasis/body_Filariasis_mic1.htm); the microfilaria is about 250 µm in length. (B) Patient with lymphedema of the left leg due to lymphatic filariasis (http://www.cdc.gov/ncidod/dpd/parasites/lymphaticfilariasis/index.htm). (C) Hookworm egg passed in the stool of an infected person; the microscopic egg, barrel-shaped with a thin wall, is about 70 × 40 µm in dimension. (D) Longitudinal section through an adult hookworm attached to the wall of the small intestine, ingesting host blood and mucosal wall. The parasite is about 1 cm in length. (E) Eggs of Schistosoma mansoni. The egg is about 150 × 50 µm in dimension; the lateral spine is diagnostic for S. mansoni in comparison to the other human schistosome species. Fibrotic responses to schistosome eggs trapped in the intestines, liver, and other organs of the infected person are the cause of the schistosomiasis pathology and morbidity. (F) A pair of adult worms of the blood fluke Schistosoma mansoni; the more slender female worm resides in the gynecophoral canal of the thicker male. The worms are about 1.5 cm in length, and live for many years (http://www.dpd.cdc.gov/dpdx/HTML/ImageLibrary/Schistosomiasis_il.htm).
doi:10.1371/journal.pntd.0000538.g001

Genomics Approaches to Investigating Helminths

Over the past decade, increasing numbers of helminth-specific genome sequences have become available due to ever-improving techniques for obtaining biological material, extracting RNA and DNA, constructing complementary DNA (cDNA)/whole genome shotgun libraries, and, especially, major advances in the chemistry and instrumentation for DNA sequencing and its concomitant decreased cost. Helminth genomics began with the generation and analysis of transcribed sequences (expressed sequence tags [ESTs] [11]), which has proved to be a rapid and cost-effective route to discover genes in other eukaryotes. In April 2009, there were ~550,000 nematode and 450,000 platyhelminth ESTs in the dbEST division of GenBank, excluding those from the model nematode Caenorhabditis elegans. Of these, 60% were from parasites of humans and closely related animal pathogens used to study human infections (Table 1). These ESTs have many applications. They can be used to annotate helminth genomes (see below), to determine alternative splicing, to verify open reading frames, and to confirm exon/intron and gene boundaries. They are valuable also, for example, in functional genomics to design probes for expression microarrays (e.g., [12]) and to provide putative protein sequence information for proteomics methods (e.g., [13]), to name but a few applications. Quantitative analysis of ESTs (transcriptomics), including serial analysis of gene expression, can identify transcripts that are either over- or under-represented by comparison to other transcripts in various helminth life cycle stages or tissues (e.g., [14,15]), and the subset of genes evaluated with gene ontology programs provides insights into cellular and metabolic pathway functioning in the parasite (e.g., [16]). Furthermore, one can identify potential targets for interventions by applying a hierarchy of considerations including a matrix of biological, expression, and phenotypic data [17], by performing a pan-phylum analysis to identify conserved parasite-specific genes whose selective targeting will have low or no toxicity to the host [18,19], or by identifying genes that have diverged enough from the host counterpart to result in altered or absent functions [20].

The first multicellular genome sequenced was that of the free-living roundworm C. elegans [21]; reported in 1998, it is still the only metazoan for which the sequence of every nucleotide is known with high confidence. Today, the genome sequences of 22 species of helminths that either infect humans or are closely related parasites are completed or underway (Table 1). A comprehensive genome analysis has been published for several of them, including the lymphatic filarial nematode Brugia malayi [22], the dog hookworm Ancylostoma caninum [23], and the blood flukes Schistosoma japonicum and S. mansoni [24,25] (Figure 1; Table 1).

Figure 2. Phylogeny of the major taxa of human helminths (nematodes and platyhelminths) as established by maximum likelihood (ML) analysis of 18S ribosomal RNA from 18 helminth species. Sequences were aligned using ClustalX [93].
The topology of the tree was derived from a consensus tree by neighbor-joining–based bootstrapping, its branch lengths were computed using an ML-based method, and it was rooted with the orthologue from the brewer’s yeast, Saccharomyces cerevisiae. The branch lengths of the phylogenetic tree were computed using DNAML (PHYLIP package [94]) by allowing rate variation among sites. The headings Chromadorea, Enoplea, Trematoda, and Cestoda are major classes of the phyla Nematoda and Platyhelminthes. The GenBank accession numbers of aligned sequences are DQ118536.1 (Trichuris trichiura), AY851265.1 (Trichuris suis), AF036637.1 (Trichuris muris), AY497012.1 (Trichinella spiralis), U94366.1 (Ascaris lumbricoides), AF036587.1 (Ascaris suum), AF036588.1 (Brugia malayi), AJ920348.1 (Necator americanus), AJ920347.2 (Ancylostoma caninum), AF036597.1 (Nippostrongylus brasiliensis), X03680.1 (Caenorhabditis elegans), AF036605.1 (Strongyloides ratti), U81581.1 (Strongyloides ratti), AB453329.1 (Strongyloides ratti), AF279916.2 (Strongyloides stercoralis), AB453315.1 (Strongyloides stercoralis), M84229.1 (Strongyloides stercoralis), EU011664.1 (Saccharomyces cerevisiae), U27015.1 (Saccharomyces cerevisiae), DQ157224.1 (Taenia solium), AF229852.1 (Clonorchis sinensis), Z11590.1 (Schistosoma japonicum), Z11976.1 (Schistosoma haematobium), U65657.1 (Schistosoma mansoni).
doi:10.1371/journal.pntd.0000538.g002

Some of the main obstacles to research on human parasites are their life cycle complexity, tissue complexity, and the paucity of genetic and transgenic methods for manipulating genes of interest. Comparative genome analyses have also provided insights into the adaptations of various parasites to niches in their human (and vector) hosts as well as insights into the molecular basis of the mutualistic relationship between the filarial nematode B. malayi and its endosymbiont Wolbachia (see below). The genomes of the schistosomes S. japonicum and S.
mansoni are the first complete genomes reported for members of the Lophotrochozoa [24,25], a large taxon that includes about 50% of all metazoan phyla including the mollusks, annelids, brachiopods, nemerteans, bryozoans, platyhelminths, and others [26]. These schistosome genome sequences revealed remarkable features of the host–parasite relationship. Among these, the schistosome genome has lost numerous protein-encoding domains. Whereas the total number (~6,000) of protein families is broadly similar among schistosomes, humans, C. elegans, and fruit fly, about 1,000 protein domains have been abandoned by S. japonicum, including some involved in basic metabolic pathways and defense, implying that loss of these domains could be a consequence of the adoption of a parasitic way of life. If so, the remaining molecular repertoire must have evolved in parallel with this extensive domain loss to permit the pathogen to locate and infect humans efficiently, nourish itself, and interact with the external environment as well as with the host. On the other hand, despite extensive gene and domain loss, a number of schistosome gene families have expanded, and these provide insights into the requirements for a parasitic lifestyle. Among the expanded gene families, a metalloprotease called invadolysin (or leishmanolysin) has at least 12 putative family members in schistosomes compared to a single orthologue in the human, fruit fly, and C. elegans genomes and only three in the free-living flatworm Schmidtea mediterranea. This protease family may facilitate skin penetration and tissue invasion by the cercaria, the infective-stage larva of the schistosome [24,25].

Publication of genome sequences of filariae and schistosomes has underscored the pressing need to develop functional genomics approaches for these significant pathogens.
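The invadolysin comparison above can be made concrete with a short calculation. The following is a minimal sketch, not the analysis performed in the cited genome papers: it takes only the family sizes quoted in the text (at least 12 in schistosomes, one each in human, fruit fly, and C. elegans, and three in S. mediterranea) and reports the fold difference relative to each comparison genome. The names `INVADOLYSIN_COUNTS` and `fold_expansion` are hypothetical and introduced purely for illustration.

```python
# Invadolysin (leishmanolysin) family sizes as quoted in the text.
# The schistosome value is a lower bound ("at least 12").
INVADOLYSIN_COUNTS = {
    "schistosomes": 12,
    "human": 1,
    "fruit fly": 1,
    "C. elegans": 1,
    "S. mediterranea": 3,
}


def fold_expansion(counts, focal, baseline):
    """Return the family size in `focal` divided by the family size in `baseline`."""
    if counts[baseline] == 0:
        raise ValueError("baseline genome has no family members")
    return counts[focal] / counts[baseline]


if __name__ == "__main__":
    for species in INVADOLYSIN_COUNTS:
        if species != "schistosomes":
            ratio = fold_expansion(INVADOLYSIN_COUNTS, "schistosomes", species)
            print(f"schistosomes vs {species}: at least {ratio:.0f}-fold")
```

Because the schistosome count is a lower bound, each ratio should be read as a minimum; a real comparison would also need consistent orthology assignments across genomes, which this sketch does not attempt.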
Functional analyses, which use approaches such as RNA interference (RNAi) and translational studies, are essential to resolve uncertainties in the molecular physiology of helminths and to illuminate mechanisms of pathogenesis that may lead to development of new interventions to control and eliminate these parasites or the diseases. Progress in the functional genomics of helminths was reviewed recently [6,27,28]. In brief, RNAi has been used to inactivate the RNA products of several genes in schistosomes (e.g., [29–32]) and nematodes (e.g., [33]; reviewed in [8]). In addition, the recent genome sequences of S. mansoni and S. japonicum now make feasible genome-scale investigation of transgene integration into schistosome chromosomes. Gene therapy–like approaches to transform schistosomes include the use of the piggyBac transposon and pseudotyped murine leukemia retrovirus as transgene vectors [34–36] (Figure 3A), both of which offer a means to establish transgenic lines of schistosomes, to elucidate schistosome gene function and expression, and to advance functional genomics approaches for these parasites. Notably, progress is also being made to express reporter transgenes in parasitic nematodes including Strongyloides stercoralis [37], in which transgene approaches developed for use in C. elegans have recently been used to demonstrate that morphogenesis of infective larvae requires the DAF-16 orthologue FKTF-1 (Figure 3B) [38]. Progress is also being made with systems for analysis of promoter sequences of genes of parasitic nematodes (e.g., [39]).

Figure 3. Some recent approaches to expressing transgenes in human helminths. (A) Luciferase activity in Schistosoma mansoni larvae (schistosomules) after transduction with a pseudotyped retrovirus that expresses the luciferase reporter gene. Anti-luciferase antibody staining of schistosomules three days after exposure to pseudotyped lentivirus carrying the firefly luciferase transgene. Schistosomules examined by confocal laser microscopy; (i) bright field, (ii) fluorescence red channel, (iii) merged images. Control non-transformed worms showed only background levels of fluorescence (not shown; see [34–36] for relevant hypotheses and experimental methods). (B) Recent studies on transgenic Strongyloides stercoralis indicated that morphogenesis of the infective L3 stage larva requires the DAF-16 orthologue FKTF-1 [38]. L3s of this parasitic nematode were transfected with plasmids carrying the transgene fktf-1b::gfp::fktf-1b and examined by fluorescence microscopy. (i, ii) Transgenic first-stage larvae express green fluorescent protein (GFP) in the procorpus (arrow) of the pharynx. (iii, iv) A first-stage larva (L1) expresses the GFP::FKTF-1b(wt) transgene in the hypodermis. (v, vi) An infective L3 expresses the GFP::FKTF-1b(wt) fusion protein in the hypodermis and in a narrow band in the pharynx (arrow). Scale bars, 10 µm. Adapted from [38].
doi:10.1371/journal.pntd.0000538.g003

Many future discoveries resulting from the parasitic helminth genome information can be expected to emanate from the broader scientific community rather than from the laboratory originating a genome sequence project. For the specialized genome sequence labs, dissemination of sequence information in a way that is useful, consistent, centralized, and lasting has therefore been a key goal. Efforts have gone well beyond depositing raw data in public databases. Currently, helminthologists have available a number of specialized sites for sequence analysis. C. elegans information is easily accessible at http://www.wormbase.org [40].
Useful information about the organism includes genome sequence, genetic and physical maps, transcript data (EST, mRNA, SAGE, TEC-RED, ORFeome, expression patterns from reporter gene fusions, and microarrays), the developmental lineage of all cells, connectivity of the nervous system, mutant phenotypes and genetic markers, gene expression described at the level of single cells, 3D protein structures, NCBI Clusters of Orthologous Groups, and apoptosis and aging information. It also contains extensive information from large-scale genomics analyses, including precomputed sequence similarity searches, protein motif analyses, protein–protein interactions, findings from systematic RNAi screens, single nucleotide polymorphisms (SNPs), orthologous and paralogous relationships, and the assignment of Gene Ontology (GO) terms to gene products. These resources greatly aid in the interpretation of much of the sequence data emerging from parasitic helminths. However, accumulating evidence suggests that C. elegans is not a good model for all parasitic helminths, especially for the ones that are phylogenetically very distant, such as the basal nematode and zoonotic parasite Trichinella spiralis (e.g., [41]). The other specialized site is Nematode.net (http://www.nematode.net [42]), developed with a primary aim to disseminate the diverse collection of information for parasitic nematodes to the broader scientific community in a way that is useful, consistent, centralized, and enduring.
In addition to sequence data, the site hosts assembled NemaGene clusters in GBrowse views, characterizing composition and protein homology, functional Gene Ontology annotations presented via the AmiGO browser, KEGG-based graphical display of NemaGene clusters mapped to metabolic pathways, codon usage tables, NemFam protein families (which represent conserved nematode-restricted coding sequences not found in public protein databases), and a Web-based WU-BLAST search tool that allows complex querying and other assorted resources. Furthermore, Nematode.net, by connecting data across the entire phylum Nematoda, has made a substantial contribution toward integrating the historically separate fields of C. elegans, vertebrate parasitology, and plant parasitology research.

Table 1. Human parasitic helminths (and their close relatives) with genome sequencing projects completed or underway.

Phylum or Class | Species | Common Name / Disease | Primary host | Genome size, Mb | GenBank Project ID | cDNAs (3730 ABI), 1,000s | Genome Sequencing Status | Sequencing Institute (a)
Nematoda (roundworms), Clade V (b) | Necator americanus | Hookworm/necatoriasis | Human | - | 20369 | 5 | In progress | WUGC
Nematoda, Clade V | Ancylostoma caninum | Model hookworm | Dog | 344 | 12841 | 81 | Improving draft | WUGC
Nematoda, Clade V | Nippostrongylus brasiliensis | Model hookworm | Rat | - | 20445 | 14.7 | In progress | SI
Nematoda, Clade IV | Strongyloides stercoralis | Threadworm/strongyloidiasis | Human | - | - | 11.4 | In progress | SI
Nematoda, Clade IV | S. ratti | Model threadworm | Rat | - | - | 27.4 | In progress | SI/WUGC
Nematoda, Clade III | Ascaris lumbricoides | Large roundworm/ascariasis | Human | 230 | - | 1.8 | In progress | SI
Nematoda, Clade III | A. suum | Model large roundworm | Pig | 230 | - | 55.7 | Improving draft | WUGC/SI
Nematoda, Clade III | Brugia malayi | Filaria/lymphatic filariasis | Human | 96 | 9549 | 26.2 | Improving draft | TIGR/University of Pittsburgh
Nematoda, Clade III | Loa loa | Filaria/loaisis (cutaneous filariasis)/African eye worm | Human | - | - | 3.3 | In progress | BI
Nematoda, Clade III | Onchocerca volvulus | Filaria/river blindness | Human | 150 | - | 15 | In progress | SI
Nematoda, Clade III | Acanthocheilonema viteae | Model filaria | Rodent | - | 33239 | 0 | In progress | UMIGS
Nematoda, Clade I | Trichinella spiralis | Trichina worm/trichinosis | Pig to human | 71 | 12605 | 25.3 | Draft completed | WUGC
Nematoda, Clade I | Trichuris trichiura | Whipworm/trichuriasis | Human | - | - | 0 | In progress | SI
Nematoda, Clade I | T. muris | Model whipworm | Mouse | 96 | - | 7 | In progress | SI
Nematoda, Clade I | T. suis | Model whipworm | Pig | - | - | 0 | In progress | WUGC
Cestoda (tapeworms) | Echinococcus multilocularis | Tapeworm/alveolar hydatidosis | Rodent; larva infects humans | 150 | - | 1 | In progress | SI
Cestoda | E. granulosus | Tapeworm/unilocular hydatidosis | Canids; larva infects humans | 150 | 12620 | 10 | In progress | SI
Cestoda | Taenia solium | Pork tapeworm/taeniasis/cysticercosis | Human | 270 | 17815 | 25 | Draft completed | Mexico City
Trematoda (flukes) | Schistosoma mansoni | Blood fluke/intestinal schistosomiasis | Human | 390 | 12599 | 206 | Draft completed | SI/TIGR
Trematoda | S. haematobium | Blood fluke/urinary schistosomiasis | Human | - | 12616 | 0 | In progress | SI
Trematoda | S. japonicum | Blood fluke/intestinal schistosomiasis | Human | 400 | 29491 | 104 | Draft completed | CNHGC
Trematoda | Clonorchis sinensis | Liver fluke/clonorchiasis | Human | - | 17975 | 3 | In progress | SNUCM

(a) BI, Broad Institute; CNHGC, Chinese National HGC; SI, Sanger Institute; SNUCM, Seoul National University College of Medicine; TIGR, The Institute for Genomic Research (now JCVI); WUGC, Washington University’s Genome Center.
(b) Phylogeny based on Blaxter et al. [47].
doi:10.1371/journal.pntd.0000538.t001

Finally, Nembase (http://www.nematodes.org [43]) also offers access to parasite sequence and tools such as visualization of clusters by stage of expression.
While each of these databases has been challenged by the requirement to support the influx of new genomes and related data, they nonetheless provide user communities with innovative features and tools suited to their needs that are beyond the scope of the large sequence repositories.

For flatworms (Figure 2), it is notable that public genome annotation and analysis tools are already in place, including SchistoDB (http://schistoDB.net/), a genomic database for S. mansoni that incorporates sequences and annotation [44], and SjTPdb (http://function.chgc.sh.cn/sjproteome/index.htm), an integrated transcriptome and proteome database and analysis platform for S. japonicum [45]. The genome database for the planarian Schmidtea mediterranea, a model free-living platyhelminth, can be expected to be advantageous to comparative genome projects and specific research problems for the growing number of parasitic flatworms that now are or will be subjects of genome sequence analysis. In addition, because of the phylogenetic position of planarians as early bilaterian metazoans, SmedGD (http://smedgd.neuro.utah.edu) will prove useful not only to planarian research, but also to investigations on developmental and evolutionary biology, comparative genomics (specifically with parasitic flatworms including flukes and tapeworms), stem cell research, and regeneration [46].

Platyhelminthes, particularly in comparison to the fresh-water planarian S. mediterranea, a non-parasitic flatworm for which a draft genome is available [53]. Beyond the evolution of parasitism of humans and other vertebrate hosts, helminth parasite genome sequences will also facilitate evolutionary studies on the role that intermediate hosts and vectors, such as the snail in schistosome infections and the mosquito in filarial infections, have played in this evolution.
Host–Parasite Relationships

Investigations of regulatory networks involved in embryogenesis, organogenesis, development, and reproduction of helminths based on newly available genome sequences have revealed the presence of well-characterized signaling pathways, including those for Wnt, Notch, Hedgehog, and transforming growth factor β (TGF-β). These pathways can be recognized in the B. malayi and schistosome genomes [22,24,25]. The encoded components include endogenous hormones such as epidermal growth factor (EGF)-like and fibroblast growth factor (FGF)-like peptides. For example, predicted components of the Ras–Raf–MAPK and TGF-β–SMAD signaling pathways encoded by these genomes (including FGF and EGF receptors) share high sequence identity with their mammalian orthologs, implying that schistosomes or filarial worms, in addition to utilizing their own pathways, might exploit host growth factors as developmental signals.

Immune regulation by helminth parasites includes suppression, diversion, and alteration of the host immune response, resulting in an anti-inflammatory environment that is favorable to parasite survival. For example, chronic infections induce key changes in host immune cell populations including dominance of the T-helper 2 (Th2) cells and selective loss of effector T cell activity, against a background of regulatory T cells, alternatively activated macrophages, and Th2-inducing dendritic cells [54,55]. With advances in genomics, numerous parasite-derived proteins, including cytokine homologs, protease inhibitors, and an intriguing set of novel products, as well as glycoconjugates and small lipid moieties, have been discovered with known or hypothesized roles in immune interference [56–61]. These studies suggest that secreted parasite products interfere with different arms of the immune system by influencing the cytokine network and signal transduction pathways or by inhibiting essential enzymes.
Using bioinformatics to compare the predicted proteome of B. malayi to proteins implicated in the immune response (interleukins, chemokines, and other signaling molecules), potential immune modulators produced by the filarial have been identified, including genes encoding the macrophage migration inhibition family of signaling proteins [62]. Furthermore, the genome of the blood fluke S. mansoni encodes a large array of paralogues of fucosyl and xylosyltransferases [25] that are involved in generating novel glycans at the host–parasite interface and could have an important role in the subverting the host immune system. A recent comprehensive review summarizes our current understanding of the growing number of individual helminth mediators that target key receptors or pathways in the mammalian immune system [63]. Helminth infection can have a broad impact on the entire immune system. Infection with trematode and nematode parasites, for example, correlates with a reduced incidence of atopic, allergic-type disorders [64]. Thus, helminth infection might potentially be useful as a novel therapy for allergic or autoimmune diseases [65]. Recently, worms, eggs, or purified nematode parasite protein have been used in preclinical and clinical trials to protect humans from allergy and autoimmunity (reviewed in [66–70]), including Crohn’s disease and ulcerative colitis [71,72]. Evolution of Parasitism in Helminths Genomics research has helped our understanding of the evolution of helminths of humans and other hosts, certainly with regard to roundworms of the phylum Nematoda. The first comprehensive study of the molecular evolution of helminths was a phylogenetic analysis of the small subunit ribosomal DNA (ss rDNA) sequences from 53 roundworms [47]. This study included both major parasitic and free-living taxonomic groups. It identified five major clades within the Nematoda and suggested that parasitism of animal and plants arose independently multiple times. 
A more recent study included 339 nearly full-length ss rDNAs and proposed subdivision of the phylum into 12 clades [48]. This revealed that nematodes that feed on fungi occupy a basal position compared to their plant parasite relatives, confirming that the parasitic nematodes of plants arose from fungivorous ancestors. Phylogenetic methods are also being used to study evolution of parasitism-related protein-coding genes (such as the enzymes that degrade the plant cell wall in nematode parasites of plants [cellulases, pectate lyases, etc.]) to understand better the mechanisms underlying the evolution of parasitism (reviewed in [49]). Recent genome-wide analysis of two plant parasitic nematodes [50,51] provided a more complete picture of the acquisition of these cellulase genes, apparently by horizontal gene transfer (HGT) from prokaryotes. The subsequent expansion and diversification of HGT genes in these nematodes allow inferences about the evolutionary history of these parasites, and in addition present potential targets for anti-nematode drugs. When the genome of the necromenic nematode Pristionchus pacificus was reported recently, it became was clear that cellulases were not restricted to plant parasitic nematodes; their presence in this species indicated preadaptation for parasitism of animals [52], consistent with the intermediate evolutionary position of Pristionchus between the microbivorous C. elegans and the animal parasitic nematodes. 
In like fashion to evolution of parasitism among nematodes, we can predict that additional analyses of parasitic and free-living flatworm genomes will provide deeper insights into how and when parasitism evolved in the phylum www.plosntds.org 6 October 2009 | Volume 3 | Issue 10 | e538 Other studies have shown that substances produced by helminths, for example Ascaris suum, Nippostrongylus brasiliensis, and Acanthocheilonema viteae, can directly interfere with allergic responses or with development of allergen-specific Th2 responses [73–75]. ES-62, a molecule secreted by the filarial nematode A. viteae, directly inhibits the FceRI-induced release of mediators from mast cells, protects against mast-cell–dependent hypersensitivity in skin and lungs [76] and inhibits collagen-induced arthritis [77]. Research is underway to develop molecules that mimic the activity of ES-62 as drugs for allergic and autoimmune diseases [66]. Other helminth-derived products have the potential to reduce allergic responses. These products include schistosomal lysophosphatidylserine (lyso-PS) [61] and thioredoxin peroxidase from the liver fluke Fasciola hepatica [78]. These findings demonstrated that helminths produce products that can interfere with both the development of allergic responses and the workings of host effector mechanisms. Ankyrin domain–containing proteins are noteworthy because of their roles in protein–protein interactions in a variety of cellular processes. A number of other wBm molecules are of interest as potential drug targets. For example, glutathione biosynthesis genes may provide glutathione for the protection of the filaria from oxidative stress or human immunological effector molecules. Heme produced from wBm (all five synthesis genes are present) could be vital to worm embryogenesis, as there is evidence that molting and reproduction are controlled by ecdysteroid-like hormones [82], synthesis of which requires heme. 
Depletion of Wolbachia might therefore halt production of these hormones and block molting and/or embryogenesis in B. malayi. Most, if not all, nematodes, including B. malayi, appear to be unable to synthesize heme, but must obtain it from extraneous sources, such as the host, the food supply, or perhaps from endosymbionts. Challenges for the Future The ‘‘Dependent’’ Helminth The filarial and schistosome genome sequences now available provide the vanguard for assembly of a genome sequence catalog of the numerous other neglected helminth parasites (Table 1). Comparative genomics will likely be a dominant approach to organize, interpret, and utilize the vast amounts of genomic information anticipated from the genomes of these parasites (e.g. [83,84]). In terms of sequencing tools, the new generation of ‘‘massively parallel’’ sequencing platforms commercially available today, (such as the Roche/454 pyrosequencer [85], Illumina/ Solexa [86], and SOLiD [87]) offer of the order of 100- to 1,000fold increases in throughput over the Sanger sequencing technology [88] on capillary electrophoresis instruments. This rapid change to producing millions of DNA sequence reads in a short time will have a huge impact on research on NTDs. Each platform has a specific application: while the Roche/454 is optimal for in-depth analysis of whole transcriptomes and de novo sequencing of bacterial and small eukaryotic genomes, the Illumina and SOLiD systems are highly attractive for resequencing projects aimed at identifying genetic variants (mutations, insertions, deletions), profiling and discovering noncoding RNAs (ncRNAs), and studying epigenetic modifications of histones and DNA. With the increased read length and improved error rate of massively parallel pyrosequencing technology, de novo sequencing of helminth genomes has become possible at a fraction of earlier costs. 
In the next five years, projects at the Washington University’s Genome Center (http://www.genome.gov/ 10002154) and the Wellcome Trust Sanger Institute (http:// www.sanger.ac.uk/Projects/Helminths/) should increase the available sequence data on human helminths and their close relatives by an order of magnitude, adding more than 20 draft genomes to those listed in Table 1. Once these reference genomes become available, sequencing of clinical isolates is expected to accelerate. Sequencing of the clinical strains and strain-to-reference comparisons can be performed using platforms such as Illumina/Solexa and SOLiD to investigate genome-wide polymorphism and provide a comprehensive picture of natural helminth genome variation. These approaches should also be valuable for exploring genetic changes involved in resistance to anti-worm drugs and understanding the potential mechanisms of drug resistance in human parasites, and can be expected to facilitate development of genetic markers to monitor and manage any future appearance and spread of drug resistance. These phenomena are of tremendous importance, particularly since some major neglected helminth diseases are being targeted in mass drug treatment campaigns [89]. In addition, the new generation of sequencing technologies has also provided unprec- As a consequence of evolution of an obligatory parasitic existence, helminth parasites are dependent upon their intermediate and definitive hosts for many necessities including nutrients such as amino acids; filariae are dependent on insect vectors to transport them to the host. The newly available genome sequences for schistosomes and B. malayi have confirmed earlier biochemical studies that had revealed aspects of physiological/ biochemical dependence of these parasites on the host. 
For example, schistosomes cannot synthesize fatty acids de novo, or sterols, purines, and nine human essential amino acids plus arginine or tyrosine, and must catabolize complex precursors obtained from their hosts. Loss or degeneracy of fatty acid, sterol, and purine synthesis pathways in schistosomes likely relates to the adoption of a parasitic lifestyle; it is notable that genes encoding all the key enzymes for both the de novo fatty acid and purine syntheses are complete in the (free-living) planarian S. mediterranea. To obtain essential lipid nutrients, the schistosome genome encodes transporters, including apolipoproteins, low-density lipoprotein receptor, scavenger receptor, fatty-acid-binding protein, ATP-bindingcassette transporters and cholesterol esterase, to exploit fatty acids and cholesterol from host blood [25,79]. Many species of filarial nematodes are themselves infected by the endosymbiotic bacterium Wolbachia. The genome sequence of the Wolbachia species that infects the roundworm nematode B. malayi (wBm) [80] helped establish which metabolites the bacterium potentially provides to the nematode (riboflavin, flavine adenine dinucleotide, heme, and nucleotides, for example) and which are provided by the nematode to the endobacterium (notably, amino acids). This type of information has opened up the exciting possibility that drugs already registered for human use might inhibit key biochemical pathways in Wolbachia that could sterilize or kill the adult worms. Although the Wolbachia genome is even more degenerate than that of the related pathogen Rickettsia, it has retained more intact metabolic pathways than Rickettsia. This may be important in its biochemical contribution to host (i.e., filarial) viability and fecundity. The wBm genome encodes many more proteases and peptidases than Rickettsia, which likely degrade host proteins in the extracellular environment. 
Other proteins encoded by wBm include a common type IV secretion system, as used by some pathogenic gram-negative bacteria to transfer plasmids and proteins into surrounding host cells, and an abundance of ankyrin domain-containing proteins, which might regulate host gene expression, as suggested for Ehrlichia phagocytophilia AnkA [81], as well as several proteins predicted to localize on the cell surface. www.plosntds.org 7 October 2009 | Volume 3 | Issue 10 | e538 edented opportunities for high-throughput functional genomic research (reviewed in [90]) that awaits application to helminth research. Although some details of immunomodulation by helminth components have been characterized, we are just beginning to understand how these parasite products act on immune responses and to assemble fragmentary information on individual components into a comprehensive picture. Comparisons of helminth molecules with orthologues/paralogues from free-living relatives will strengthen efforts to decipher the strategies adopted by helminth parasites to evade and subvert their host immune responses. This information will be exploitable for development of drugs and vaccines against the parasites and potentially also novel therapeutic biologics for use in humans. Future studies might determine whether helminth proteins with unknown function might be the source for the intriguing regulatory effects helminth infections have on the host immune response. Treatment for helminthic infections, responsible for hundreds of thousands of deaths each year, depends almost exclusively on just two or three drugs: praziquantel, the benzimidazoles, and ivermectin. Vaccines and new drugs are needed, certainly because drug resistance in human helminth parasites such as schistosomes, whipworms, and filariae, to these compounds would present a major problem for current treatment and control strategies. Pharmacogenomics with the new helminth genomes represents a practicable route forward toward new drugs. 
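The logic of such a genome-guided drug screen can be pictured with a small sketch. The Python below is only an illustration under stated assumptions: the protein names, toy sequences, drug labels, and the 60% identity cutoff are all hypothetical, and the identity measure is a naive position-by-position comparison of equal-length, pre-aligned strings standing in for a real alignment tool. It is not the method used in the studies cited here.

```python
from typing import Dict, List, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percent of identical, non-gap positions in two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def shortlist_targets(
    parasite: Dict[str, str],
    drug_targets: Dict[str, Tuple[str, str]],
    min_identity: float = 60.0,  # hypothetical cutoff, chosen for illustration
) -> List[Tuple[str, str, str, float]]:
    """Pair parasite proteins with known drug targets above an identity cutoff.

    drug_targets maps a target name to (aligned sequence, drug label).
    Returns (parasite protein, target, drug, identity), best match first.
    """
    hits = []
    for p_name, p_seq in parasite.items():
        for t_name, (t_seq, drug) in drug_targets.items():
            if len(p_seq) != len(t_seq):
                continue  # this toy comparison only handles equal-length pairs
            identity = percent_identity(p_seq, t_seq)
            if identity >= min_identity:
                hits.append((p_name, t_name, drug, identity))
    return sorted(hits, key=lambda h: h[3], reverse=True)


# Hypothetical toy data: none of these names or sequences are real.
parasite_proteins = {
    "parasite_protein_A": "MKTAYIAKQR",
    "parasite_protein_B": "GGSSLLPPWW",
}
approved_drug_targets = {
    "human_target_X": ("MKTAYLAKQR", "drug_1"),
    "human_target_Y": ("AAAAAAAAAA", "drug_2"),
}

hits = shortlist_targets(parasite_proteins, approved_drug_targets)
for protein, target, drug, identity in hits:
    print(f"{protein} ~ {target} ({identity:.0f}% identity): candidate {drug}")
```

In practice the similarity step would be carried out with an established alignment or homology search tool, and any shortlisted protein would still need experimental validation as a target.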
For example, chemogenomics screening of the genome sequence of S. mansoni identified >20 parasite proteins for which drugs already approved for other human ailments are available [25]; indeed, in the case of the schistosome thioredoxin glutathione reductase, auranofin (an anti-arthritis medication) was shown recently to exhibit potent anti-schistosomal activity [91]. Finally, the vast new sequence information will undoubtedly allow revision of our understanding of the host–parasite relationship, its evolution, vector–pathogen and helminth–symbiont interactions, unique aspects of cell biology and biochemistry, phylogenetic relationships, intervention targets, research approaches (e.g., [92]), and so forth.

Acknowledgments

We thank Victoria Mann, Geoffrey Gobert and Gabriel Rinaldi for access to their unpublished findings on schistosomes transduced with pseudotyped virions.

References

1. Hotez PJ, Brindley PJ, Bethony JM, King CH, Pearce EJ, et al. (2008) Helminth infections: The great neglected tropical diseases. J Clin Invest 118: 1311–1321.
2. Hotez PJ, Kamath A (2009) Neglected tropical diseases in sub-Saharan Africa: Review of their prevalence, distribution, and disease burden. PLoS Negl Trop Dis 3: e412.
3. Patz JA, Graczyk TK, Geller N, Vittor AY (2000) Effects of environmental change on emerging parasitic diseases. Int J Parasitol 30: 1395–1405.
4. Liang S, Yang C, Zhong B, Qiu D (2006) Re-emerging schistosomiasis in hilly and mountainous areas of Sichuan, China. Bull WHO 84: 139–144.
5. Huyse T, Webster BL, Geldof S, Stothard JR, Diaw OT, et al. (2009) Bidirectional introgressive hybridization between a cattle and human schistosome species. PLoS Pathog 5: e1000571. doi:10.1371/journal.ppat.1000571.
6. Kalinna BH, Brindley PJ (2007) Manipulating the manipulators: Advances in parasitic helminth transgenesis. Trends Parasitol 23: 197–204.
7. Krasky A, Rohwer A, Schroeder J, Selzer PM (2007) A combined bioinformatics and chemoinformatics approach for the development of new antiparasitic drugs. Genomics 89: 36–43.
8. Mitreva M, Zarlenga DS, McCarter JP, Jasmer DP (2007) Parasitic nematodes: From genomes to control. Vet Parasitol 148: 31–42.
9. Berriman M, Lustigman S, McCarter JP (2007) Helminth initiative for drug discovery – Report of the informal consultation, genomics and emerging drug discovery technologies. Expert Opin Drug Discovery 2: S83–S89.
10. Lustigman S, Ford S, Crawford MJ (2008) RNA Interference: from functional genomics to validation of drug targets in helminths. In: RNA interference research progress. Lyland Roger T, Browning Irving B, eds. Nova Publishers. pp 135–162.
11. Franco GR, Adams MD, Soares MB, Simpson AJG, Venter JC, et al. (1995) Identification of new Schistosoma mansoni genes by the EST strategy using a directional cDNA library. Gene 152: 141–147.
12. Gobert GN, Moertel L, Brindley PJ, McManus DP (2009) Developmental gene expression profiles of the human pathogen Schistosoma japonicum. BMC Genomics 10: 128.
13. Robinson MW, Connolly B (2005) Proteomic analysis of the excretory-secretory proteins of the Trichinella spiralis L1 larva, a nematode parasite of skeletal muscle. Proteomics 5: 4525–4532.
14. Mitreva M, McCarter JP, Martin J, Dante M, Wylie T, et al. (2004) Comparative genomics of gene expression in the parasitic and free-living nematodes Strongyloides stercoralis and Caenorhabditis elegans. Genome Res 14: 209–220.
15. Taft AS, Vermeire JJ, Bernier J, Birkeland SR, Cipriano MJ, et al. (2009) Transcriptome analysis of Schistosoma mansoni larval development using serial analysis of gene expression (SAGE). Parasitology 136: 469–485.
16. Mitreva M, McCarter JP, Arasu P, Hawdon J, Martin J, et al. (2005) Investigating hookworm genomes by comparative analysis of two Ancylostoma species. BMC Genomics 6: 58.
17. McCarter JP (2004) Genomic filtering: An approach to discovering novel antiparasitics. Trends Parasitol 20: 462–468.
18. Wasmuth J, Schmid R, Hedley A, Blaxter M (2008) On the extent and origins of genic novelty in the phylum Nematoda. PLoS Negl Trop Dis 2: e258. doi:10.1371/journal.pntd.0000258.
19. Yin Y, Martin J, Abubucker S, Wang Z, Wyrwicz L, et al. (2009) Molecular determinants archetypical to the phylum Nematoda. BMC Genomics 10: 114.
20. Wang Z, Martin J, Abubucker S, Yin Y, Gasser R, et al. (2009) Systematic analysis of insertions and deletions specific to nematode proteins and their proposed functional and evolutionary relevance. BMC Evol Biol 9: 23.
21. The C. elegans Sequencing Consortium (1998) Genome sequence of the nematode C. elegans: A platform for investigating biology. Science 282: 2012–2018.
22. Ghedin E, Wang S, Spiro D, Caler E, Zhao Q, et al. (2007) Draft genome of the filarial nematode parasite Brugia malayi. Science 317: 1756–1760.
23. Abubucker S, Martin J, Yin Y, Fulton L, Yang S-P, et al. (2008) The canine hookworm genome: Analysis and classification of Ancylostoma caninum survey sequences. Mol Biochem Parasitol 157: 187–192.
24. Schistosoma japonicum Genome Sequencing and Functional Analysis Consortium, Liu F, Zhou Y, Wang ZQ, Lu G, et al. (2009) The Schistosoma japonicum genome reveals features of host-parasite interplay. Nature 460: 345–351.
25. Berriman M, Haas BJ, LoVerde PT, Wilson RA, Dillon GP, et al. (2009) The genome of the blood fluke Schistosoma mansoni. Nature 460: 352–358.
26. Dunn CW, Hejnol A, Matus DQ, Pang K, Browne WE, et al. (2008) Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 452: 745–749.
27. Krautz-Peterson G, Bhardwaj R, Faghiri Z, Tararam C, Skelly PJ (2009) RNA interference in schistosomes: machinery and methodology. Parasitology; E-pub ahead of print. doi:10.1017/S0031182009991168.
28. Mann VH, Morales ME, Kines KJ, Brindley PJ (2008) Transgenesis of schistosomes: approaches using mobile genetic elements. Parasitology 134: 1–13.
29. Freitas TC, Jung E, Pearce EJ (2007) TGF-beta signaling controls embryo development in the parasitic flatworm Schistosoma mansoni. PLoS Pathog 3: e52. doi:10.1371/journal.ppat.0030052.
30. Morales ME, Rinaldi G, Kines KJ, Gobert GN, Tort JF, et al. (2008) RNA interference targeting Schistosoma mansoni cathepsin D, the apical enzyme of the hemoglobin proteolysis cascade. Mol Biochem Parasitol 157: 160–168.
31. Rinaldi G, Morales ME, Alrefaei YN, Cancela M, Castillo E, et al. (2009) RNA interference targeting leucine aminopeptidases inhibits hatching of eggs of the human blood fluke, Schistosoma mansoni. Mol Biochem Parasitol 167: 118–126.
32. Faghiri Z, Skelly PJ (2009) The role of tegumental aquaporin from the human parasitic worm, Schistosoma mansoni, in osmoregulation and drug uptake. FASEB J 23: 2780–2789.
33. Ford L, Zhang J, Liu J, Hashmi S, Fuhrman JA, et al. (2009) Functional analysis of the cathepsin-like cysteine protease genes in adult Brugia malayi using RNA interference. PLoS Negl Trop Dis 3: e377. doi:10.1371/journal.pntd.0000377.
34. Morales ME, Mann VH, Kines KJ, Gobert GN, Kalinna BH, et al. (2007) piggyBac transposon mediated transgenesis of the human blood fluke, Schistosoma mansoni. FASEB J 21: 3479–3489.
35. Kines KJ, Mann VH, Morales ME, Shelby BD, Kalinna BH, et al. (2006) Transduction of Schistosoma mansoni by vesicular stomatitis virus glycoprotein-pseudotyped Moloney murine leukemia retrovirus. Exp Parasitol 112: 209–220.
36. Kines KJ, Morales ME, Mann VH, Gobert GN, Brindley PJ (2008) Integration of reporter transgenes into Schistosoma mansoni chromosomes mediated by pseudotyped murine leukemia virus. FASEB J 22: 2936–2948.
37. Li X, Massey HC, Jr., Nolan TJ, Schad GA, Kraus K, et al. (2006) Successful transgenesis of the parasitic nematode Strongyloides stercoralis requires endogenous non-coding control elements. Int J Parasitol 36: 671–679.
38. Castelletto ML, Massey HC, Jr., Lok JB (2009) Morphogenesis of Strongyloides stercoralis infective larvae requires the DAF-16 ortholog FKTF-1. PLoS Pathog 5: e1000370. doi:10.1371/journal.ppat.1000370.
39. de Oliveira A, Katholi CR, Unnasch TR (2008) Characterization of the promoter of the Brugia malayi 12 kDa small subunit ribosomal protein (RPS12) gene. Int J Parasitol 38: 1111–1119.
40. Rogers A, Antoshechkin I, Bieri T, Blasiar D, Bastiani C, et al. (2008) Wormbase 2007. Nucleic Acids Res 36(Database issue): D612–617.
41. Mitreva N, Appleton J, McCarter JP, Jasmer DP (2005) Expressed sequence tags from life cycle stages of Trichinella spiralis: Application to biology and parasite control. Vet Parasitol 132: 13–17.
42. Martin J, Abubucker S, Wylie T, Yin Y, Mitreva M (2009) Nematode.net update 2008: Improvements enabling more efficient data mining and comparative nematode genomics. Nucleic Acids Res 37(Database issue): D571–578.
43. Parkinson J, Whitton C, Schmid R, Thomson M, Blaxter M (2004) NEMBASE: A resource for parasitic nematode ESTs. Nucleic Acids Res 32: D427–430.
44. Zerlotini A, Heiges M, Wang H, Moraes RL, Dominitini AJ, et al. (2009) SchistoDB: A Schistosoma mansoni genome resource. Nucleic Acids Res 37(Database issue): D579–582.
45. Liu F, Chen P, Cui SJ, Wang ZQ, Han ZG (2008) SjTPdb: Integrated transcriptome and proteome database and analysis platform for Schistosoma japonicum. BMC Genomics 9: 304.
46. Robb SMC, Ross E, Sánchez Alvarado A (2008) SmedGD: The Schmidtea mediterranea genome database. Nucleic Acids Res 36(Database issue): D599–D606.
47. Blaxter ML, De Ley P, Garey JR, Liu LX, Scheldeman P, et al. (1998) A molecular evolutionary framework for the phylum Nematoda. Nature 392: 71–75.
48. Holterman M, van der Wurff A, van den Elsen S, van Megen H, Bongers T, et al. (2006) Phylum-wide analysis of SSU rDNA reveals deep phylogenetic relationships among nematodes and accelerated evolution toward crown clades. Mol Biol Evol 23: 1792–1800.
49. Mitreva M, Smant G, Helder J (2009) Role of horizontal gene transfer in the evolution of plant parasitism among nematodes. In: Horizontal Gene Transfer. Methods Mol Biol 532: 517–535.
50. Abad P, Gouzy J, Aury J-M, Castagnone-Sereno P, Danchin EG, et al. (2008) Genome sequence of the metazoan plant-parasitic nematode Meloidogyne incognita. Nat Biotech 26: 909–915.
51. Opperman CH, Bird DM, Williamson VM, Rokhsar DS, Burke M, et al. (2008) Sequence and genetic map of Meloidogyne hapla: A compact nematode genome for plant parasitism. Proc Natl Acad Sci U S A 105: 14802–14807.
52. Dieterich C, Clifton SW, Schuster LN, Chinwalla A, Delehaunty K, et al. (2008) The Pristionchus pacificus genome provides a unique perspective on nematode lifestyle and parasitism. Nat Genet 40: 1193–1198.
53. Robb SM, Ross E, Sánchez Alvarado A (2008) SmedGD: The Schmidtea mediterranea genome database. Nucleic Acids Res 6: D599–D606.
54. Maizels RM, Balic A, Gomez-Escobar N, Nair M, Taylor MD, et al. (2004) Helminth parasites–Masters of regulation. Immunol Rev 201: 89–116.
55. Ohnmacht C, Voehringer D (2009) Basophil effector function and homeostasis during helminth infection. Blood 113: 2816–2825.
56. Hartmann S, Kyewski B, Sonnenburg B, Lucius R (1997) A filarial cysteine protease inhibitor down-regulates T cell proliferation and enhances interleukin-10 production. Eur J Immunol 27: 2253–2260.
57. Hartmann S, Lucius R (2003) Modulation of host immune responses by nematode cystatins. Int J Parasitol 33: 1291–1302.
58. Harnett W, McInnes IB, Harnett MM (2004) ES-62, a filarial nematode-derived immunomodulator with anti-inflammatory potential. Immunol Lett 94: 27–33.
59. Gomez-Escobar N, Lewis E, Maizels RM (1998) A novel member of the transforming growth factor-beta (TGF-beta) superfamily from the filarial nematodes Brugia malayi and B. pahangi. Exp Parasitol 88: 200–209.
60. Gomez-Escobar N, Gregory WF, Maizels RM (2000) Identification of tgh-2, a filarial nematode homolog of Caenorhabditis elegans daf-7 and human transforming growth factor beta, expressed in microfilarial and adult stages of Brugia malayi. Infect Immun 68: 6402–6410.
61. van der Kleij D, Latz E, Brouwers JF, Kruize JC, Schmitz M, et al. (2002) A novel host-parasite lipid cross-talk. Schistosomal lyso-phosphatidylserine activates toll-like receptor 2 and affects immune polarization. J Biol Chem 277: 48122–48129.
62. Pastrana DV, Raghavan N, FitzGerald P, Eisinger SW, Metz C, et al. (1998) Filarial nematode parasites secrete a homologue of the human cytokine macrophage migration inhibitory factor. Infect Immun 66: 5955–5963.
63. Hewitson JP, Grainger JR, Maizels RM (2009) Helminth immunoregulation: The role of parasite secreted proteins in modulating host immunity. Mol Biochem Parasitol 167: 1–11.
64. Yazdanbakhsh M, van den Biggelaar A, Maizels RM (2001) Th2 responses without atopy: Immunoregulation in chronic helminth infections and reduced allergic disease. Trends Immunol 22: 372–377.
65. Imai S, Fujita K (2004) Molecules of parasites as immunomodulatory drugs. Curr Top Med Chem 4: 539–552.
66. Harnett W, Harnett MM (2008) Therapeutic immunomodulators from nematode parasites. Expert Rev Mol Med 10: e18.
67. Harnett W, Harnett MM (2008) Parasitic nematode modulation of allergic disease. Curr Allergy Asthma Rep 8: 392–397.
68. Johnston MJ, MacDonald JA, McKay DM (2009) Parasitic helminths: A pharmacopeia of anti-inflammatory molecules. Parasitology 136: 125–147.
69. McKay DM (2009) The therapeutic helminth? Trends Parasitol 25: 109–114.
70. Erb KJ (2009) Can helminths or helminth-derived products be used in humans to prevent or treat allergic diseases? Trends Immunol 30: 75–82.
71. Summers RW, Elliott DE, Urban JF, Jr., Thompson R, Weinstock JV (2005) Trichuris suis therapy in Crohn’s disease. Gut 54: 87–90.
72. Summers RW, Elliott DE, Urban JF, Jr., Thompson RA, Weinstock JV (2005) Trichuris suis therapy for active ulcerative colitis: A randomized controlled trial. Gastroenterology 128: 825–832.
73. Lima C, Perini A, Garcia ML, Martins MA, Teixeira MM, et al. (2002) Eosinophilic inflammation and airway hyper-responsiveness are profoundly inhibited by a helminth (Ascaris suum) extract in a murine model of asthma. Clin Exp Allergy 32: 1659–1566.
74. Schnoeller C, Rausch S, Pillai S, Avagyan A, Wittig BM, et al. (2008) A helminth immunomodulator reduces allergic and inflammatory responses by induction of IL-10-producing macrophages. J Immunol 180: 4265–4272.
75. Melendez AJ, Harnett MM, Pushparaj PN, Wong WS, Tay HK, et al. (2007) Inhibition of Fc epsilon RI-mediated mast cell responses by ES-62, a product of parasitic filarial nematodes. Nat Med 13: 1375–1381.
76. McInnes IB, Leung BP, Harnett M, Gracie JA, Liew FY, et al. (2003) A novel therapeutic approach targeting articular inflammation using the filarial nematode-derived phosphorylcholine-containing glycoprotein ES-62. J Immunol 171: 2127–2133.
77. Donnelly S, O’Neill SM, Sekiya M, Mulcahy G, Dalton JP (2005) Thioredoxin peroxidase secreted by Fasciola hepatica induces the alternative activation of macrophages. Infect Immun 73: 166–173.
78. Holland MJ, Harcus YM, Riches PL, Maizels RM (2000) Proteins secreted by the parasitic nematode Nippostrongylus brasiliensis act as adjuvants for Th2 responses. Eur J Immunol 30: 1977–1987.
79. Han ZG, Brindley PJ, Wang S, Chen Z (2009) Schistosome genomics: New perspectives on schistosome biology and host parasite interaction. Annu Rev Genomics Hum Genet 10: 211–240.
80. Foster J, Ganatra M, Kamal I, Ware J, Makarova K, et al. (2005) The Wolbachia genome of Brugia malayi: endosymbiont evolution within a human pathogenic nematode. PLoS Biol 3: e121. doi:10.1371/journal.pbio.0030121.
81. Park J, Kim KJ, Choi K-S, Grab DJ, Dumler JS (1993) Anaplasma phagocytophilum AnkA binds to granulocyte DNA and nuclear proteins. Cell Microbiol 6: 743–751.
82. Warbrick EV, Barker GC, Rees HH, Howells RE (1993) The effect of invertebrate hormones and potential hormone inhibitors on the third larval moult of the filarial nematode, Dirofilaria immitis, in vitro. Parasitology 107: 459–463.
83. Nisbet AJ, Cottee PA, Gasser RB (2008) Genomics of reproduction in nematodes: prospects for parasite intervention? Trends Parasitol 24: 89–95.
84. Dieterich C, Sommer RJ (2009) How to become a parasite - Lessons from the genomes of nematodes. Trends Genet 25: 203–209.
85. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376–380.
86. Bennett S (2004) Solexa Ltd. Pharmacogenomics 5: 433–438.
87. Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, et al. (2005) Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309: 1728–1732.
88. Sanger F, Nicklen S, Coulson A (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74: 5463–5467.
89. Fenwick A (2009) Host-parasite relations and implications for control. Adv Parasitol 68: 247–261.
90. Morozova O, Marra MA (2008) Applications of next-generation sequencing technologies in functional genomics. Genomics 92: 255–264.
91. Kuntz AN, Davioud-Charvet E, Sayed AA, Califf LL, Dessolin J, et al. (2007) Thioredoxin glutathione reductase from Schistosoma mansoni: An essential parasite enzyme and a key drug target. PLoS Med 4: e206. Erratum in: PLoS Med 2007, 4: e264.
92. Cosseau C, Azzi AH, Smith K, Freitag M, Mitta G, et al. (2009) Native chromatin immunoprecipitation (N-ChIP) and ChIP-Seq of Schistosoma mansoni: Critical experimental parameters. Mol Biochem Parasitol 166: 70–76.
93. Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, et al. (2003) Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res 31: 3497–3500.
94. Felsenstein J (1988) Phylogenies from molecular sequences: Inference and reliability. Ann Rev Genet 22: 521–565.