The Genomics of Emerging
Infectious Disease
www.plos.org
A collection of essays, perspectives,
and reviews from six PLoS Journals
about how genomics can revolutionize
our understanding of emerging
infectious disease.
Produced with support from Google.org.
The PLoS Journal editors have sole responsibility
for the content of this collection.
Image credits:
Brindley et al., PLoS Neglected Tropical Diseases 3(10) e538.
McHardy et al., PLoS Pathogens 5(10) e1000566.
Salama et al., PLoS Pathogens 5(10) e1000544.
Editorial
Genomics of Emerging Infectious Disease: A PLoS
Collection
Jonathan A. Eisen1*, Catriona J. MacCallum2*
1 University of California Davis, Davis, California, United States of America, 2 Public Library of Science, Cambridge, United Kingdom
Today, the Public Library of Science publishes a collection of
essays, perspectives, and reviews about how genomics, with all its
associated tools and techniques, can provide insights into our
understanding of emerging infectious disease
(http://ploscollections.org/emerginginfectiousdisease/) [1–13]. This collection, focused on
human disease, is particularly timely as pandemic H1N1 2009
influenza (commonly referred to as swine flu) spreads around the
globe, and government officials, the public, journalists, bloggers, and
tweeters strive to find out more. People want to know if this flu poses
more of a threat than other seasonal flu strains, how fast it’s
spreading (and where), and what can be done to contain it. As this
collection illustrates, the increasing speed at which complete genome
sequences and other genome-scale data can be generated for
individual isolates and strains of a pathogen provides tremendous
opportunities to identify the molecular changes in these disease
agents that will enable us to track their spread and evolution through
time (e.g., [3,7,8]) and generate the vaccines and drugs necessary to
combat them (e.g., [5–7]). The collection also shines a spotlight on
specific pathogens, some familiar and widespread, such as the
influenza A virus (e.g., [9]); some “reemerging,” such as the
Mycobacterium tuberculosis complex that causes tuberculosis [10]; and
some identified only recently, as with the bacterium Helicobacter pylori
(which causes peptic ulcers and gastric cancer [11]).
There is no simple definition of an emerging disease, but it can
be loosely described as a disease that is novel in some way—for
example, one that displays a change in geographic location,
genetics, or function. Emerging infectious diseases are caused by a
wide range of organisms, but they are perhaps best typified by
zoonotic viral diseases that cross from animal to human hosts and
can have a devastating impact on human health, causing a high
disease burden and mortality [8]. These zoonotic diseases include
monkeypox, Hendra virus, Nipah virus, and severe acute
respiratory syndrome coronavirus (SARS-CoV), in addition to
influenza A and the lentiviruses that cause AIDS. The apparently
increased transmission of pathogens from animals to humans over
the recent decades has been attributed to the unintended
consequences of globalization as well as environmental factors
and changes in agricultural practices [8]. Generally, the burden of
these diseases is most strongly felt by those in developing countries.
Brindley et al. [12] point to the debilitating effects of the most
common human infectious agent in such areas—helminths
(parasitic worms)—and the role that genomics plays in advancing
our understanding of molecular and medical helminthology.
Compounding the problem of emerging infectious diseases in
developing countries is the reality that researchers in developing
countries have often been unable to participate fully in genomics
research, because of their technological isolation and limited
resources. As Coloma and Harris emphasize [13], “collaborations—
starting with capacity building in genomics research—need to be
fostered so that countries that are currently excluded from the
genomics revolution find an entry point for participation.”
This collection is a collaborative effort that combines financial
support from Google.org (which has also sponsored research on
PLoS Biology | www.plosbiology.org
emerging infectious disease through its Predict and Prevent
initiative [14]) with PLoS’s editorial independence and rigor.
Gupta et al. [1] provide Google.org’s perspective and vision for
how systematic application of genomics, proteomics, and bioinformatics to infectious diseases could predict and prevent the next
pandemic. To realize this vision, they urge the community to unite
under an “Infectious Disease Genomics Project,” analogous to the
Human Genome Project. This is, as the authors admit, a
potentially “grandiose” and difficult proposition. Some researchers
might justifiably argue that much is already being achieved—as
demonstrated by this collection—and that the vision is naïve.
However, as every article in the collection also points out,
tremendous challenges remain if the potential of genomics in this
field is to be realized.
One problem is that, despite the fact that sequencing is now the
method of choice for characterizing new disease agents, and new
substantially faster and cheaper sequencing methods are continually being produced, we still lack the range of computational tools
necessary to analyze these sequences in sufficient detail [4]. It is
possible to sequence the entire assemblage of viruses in a particular
tissue type or host species [3] and to obtain complete or nearly
complete genome sequences for large samples of bacteria [7]. Yet
we remain in the early, albeit essential, stages of pathogen
discovery (Box 1). These sequences can be interpreted fully only
when integrated with relevant environmental, epidemiological,
and clinical data (e.g., [3,4,8]). And, despite the increased
sequencing, really comprehensive genome data are still only
available for a few key pathogens, which further limits our
understanding. For example, a full quantitative understanding of
the processes that shape the epidemiology and evolution—the
phylodynamics—of RNA viruses is currently possible only for HIV
and influenza A virus [3].
In this collection, you will find not only the views of leading
researchers from several different disciplines, and a provocative
vision from a funding agency, but also the contributions of six
different PLoS journals (PLoS Biology, PLoS Medicine, PLoS
Computational Biology, PLoS Genetics, PLoS Neglected Tropical Diseases,
and PLoS Pathogens). The PLoS open-access model of publishing
makes possible such a large multidisciplinary cross-journal
collection, in which all articles are simultaneously available online
Citation: Eisen J, MacCallum CJ (2009) Genomics of Emerging Infectious Disease:
A PLoS Collection. PLoS Biol 7(10): e1000224. doi:10.1371/journal.pbio.1000224
Published October 26, 2009
Copyright: © 2009 Eisen, MacCallum. This is an open-access article distributed
under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.
Competing Interests: The authors have declared that no competing interests
exist.
* E-mail: [email protected] (JAE); [email protected] (CJM)
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal
collection (http://ploscollections.org/emerginginfectiousdisease/).
October 2009 | Volume 7 | Issue 10 | e1000224
Box 1. A Field Guide to Microbes?
When an American robin (Turdus migratorius) showed up in London a few years ago, birders were rapidly all atwitter and many
came flocking to town [22]. Why had this one bird created such a stir? For one main reason—it was out of place. This species is
normally found in North America and only very rarely shows up on the other side of the “pond.” Amazingly, this rapid, collective
response is not that unusual in the world of birding. When a bird is out of place, people notice quickly.
This story of the errant robin gets to the heart of the subject of this collection because being out of place in a metaphorical way
is what defines an emerging infectious disease. Sometimes we have never seen anything quite like the organism or the disease
before (e.g., SARS, Legionella). Or perhaps, as with many opportunistic pathogens, we have seen the organism before but it was
not previously known to cause disease. In other cases, such as with pandemic H1N1 2009 or E. coli O157:H7, we have seen
the organism cause disease before but a new form is causing far more trouble. And of course organisms can be literally out of
place, by showing up in a location not expected (e.g., consider the anthrax letters [2]).
Historically, despite the metaphorical similarities with the robin case, the response to emerging infectious disease is almost
always much slower. Clearly, there are many reasons for these differences, which we believe are instructive to consider. At least
four factors are required for birders’ rapid responses to the arrival of a vagrant bird: (1) knowledge of the natural “fauna” in a
particular place, (2) recognition that a specific bird may be out of place, (3) positive identification of the possibly out-of-place
bird, and (4) examination of the “normal” place for relatives of the identified bird.
How are these requirements achieved? Mostly through the existence of high-quality field guides that allow one to place an
organism such as a bird into the context of what is known about its relatives. This placement in turn is possible because of two
key components of field guides. First, such guides contain information about the biological diversity of a group of organisms.
This usually includes features such as a taxonomically organized list of species with details for each species on biogeography
(distribution patterns across space and time, niche preferences, relative abundance), biological properties (e.g., behavior, size,
shape, etc.), and genetic variation within the species (e.g., presence of subspecies). Second, a good field guide provides
information on how to identify particular types (e.g., species) of those organisms. With such information, and with a network of
interested observers, an out-of-place bird can be detected with relative ease.
In much the same way, a field guide to microbes would be valuable in the study of emerging infectious diseases. The articles in
this collection describe what can be considered the beginnings of species-specific field guides for the microbial agents of
emerging diseases. If we want to truly gain the benefits that can come from good field guides it will be necessary to expand
current efforts to include more organisms, more systematic biogeographical sampling, and more epidemiological and clinical
data. But the current efforts are a great start.
Figure: The American Robin (Turdus migratorius). (Photo Credit: NASA).
doi:10.1371/journal.pbio.1000224.g001
for unrestricted reuse, regardless of venue (see also the podcast that
accompanies the collection;
http://ploscollections.org/podcast/emerginginfectiousdisease.mp3).
Our aim is that this collection will add to other “open science”
activities that have helped provide insights into infectious disease
more quickly than would have been thought feasible only a few
years ago. This accelerated availability of research findings is
exemplified by the recent response to the flu pandemic. Consider,
for example, data access. Traditionally, scientists have released
data after publishing a study. Fortunately, in part due to
experience from genome sequencing projects, prepublication flu
sequence data have been released in a relatively unrestricted
manner to the community [15]. This has in turn enabled
anyone—not just those who collected the data—to carry out
analyses while the epidemic is occurring (when in principle there is
still time to save lives) rather than being forced to provide a
posthumous account of the spread of infection. Such a response
highlights both the importance of early data access and the
removal of restrictions in the use of data (e.g., in many past cases
data might be released but use of the data in presentations and
publications would be limited).
The value of open access to sequence data is helping to put
pressure both on private organizations to release their sequence
data [16,17] and on all agencies to release other information (e.g.,
metadata about strains) more rapidly. This pressure is not being
brought to bear only on flu data—in this collection Van Voorhis
et al. [5] call on pharmaceutical companies to deposit the
structural coordinates of drug targets from all globally important
infectious disease organisms in public databases.
Of course, data about any infectious disease are not very useful
unless placed in the scientific context of past studies (i.e.,
publications) specifically about the disease or about methods to
analyze such data. It is also important to have access to
information about other diseases and other organisms that might
impact its spread or evolution. Perhaps the most intriguing aspect
of open science in response to flu has been the move toward
pre-journal publication release of findings. Many flu researchers took
the available data, analyzed it, and posted results on blogs [18,19],
wikis [20], and other sites. Although some view this
“non-peer-reviewed” release as unseemly, it is clear that it has helped
accelerate the science in the study of pandemic H1N1 2009 and
led to some important journal papers [17]. Indeed, such advances
helped provide one of the stimuli for PLoS’s most recent initiative,
PLoS Currents: Influenza, a Google “Knol,” for the rapid
communication of research results and ideas about flu vetted by
expert moderators [21].
This is not to say there are no possible risks or drawbacks from
more openness. For example, some governments may avoid
releasing data because of fears about discrimination (as was seen in
many aspects of the flu in Mexico). Others worry that complete
openness might foster the spread of misinformation. However, as
Fricke et al. argue in their article on the relationship between
genomics and biopreparedness [2], open source genomic resources
are actually of immense benefit to those in charge of our public
health and biosecurity.
It is clear that “for all stages of combating emerging infections,
from the early identification of the pathogen to the development
and design of vaccines, application of sophisticated genomics tools
is fundamental to success” [8]. It is equally clear that open science
and open access to publications and data will be key to that
success. Whatever one’s position has been on the various open
science initiatives, there is no doubt that the “esoteric” label on
some open science initiatives has largely been eliminated by the
emergence of the H1N1 flu epidemic.
The faster, cheaper, and more openly we can distribute the
discoveries of science, the better for scientific progress and public
health. As this collection emphasizes, managing the threat of
novel, re-emerging, and longstanding infectious diseases is
challenging enough even without barriers to scientific research.
We encourage you to make the most of this collection by sharing,
rating, and annotating the articles using our online commenting
tools. Better yet, join the discussion by providing your own vision
to prevent the emergence and spread of the next rogue pathogen.
References
1. Gupta R, Michalski MH, Rijsberman FR (2009) Can an Infectious Diseases
Genomics Project predict and prevent the next pandemic? PLoS Biol 7:
e1000219. doi:10.1371/journal.pbio.1000219.
2. Fricke WF, Rasko DA, Ravel J (2009) The role of genomics in the identification,
prediction, and prevention of biological threats. PLoS Biol 7: e1000217.
doi:10.1371/journal.pbio.1000217.
3. Holmes EC, Grenfell BT (2009) Discovering the phylodynamics of RNA viruses.
PLoS Comput Biol 5: e1000505. doi:10.1371/journal.pcbi.1000505.
4. Berglund EC, Nystedt B, Andersson SGE (2009) Computational resources in
infectious disease: Limitations and challenges. PLoS Comput Biol 5: e1000481.
doi:10.1371/journal.pcbi.1000481.
5. Van Voorhis WC, Hol WGJ, Myler PJ, Stewart LJ (2009) The role of medical
structural genomics in discovering new drugs for infectious diseases. PLoS Comput
Biol 5: e1000530. doi:10.1371/journal.pcbi.1000530.
6. Seib KL, Dougan G, Rappuoli R (2009) The key role of genomics in modern
vaccine and drug design for emerging infectious diseases. PLoS Genet 5:
e1000612. doi:10.1371/journal.pgen.1000612.
7. Falush D (2009) Toward the use of genomics to study microevolutionary change
in bacteria. PLoS Genet 5: e1000627. doi:10.1371/journal.pgen.1000627.
8. Haagmans BL, Andeweg AC, Osterhaus ADME (2009) The application of
genomics to emerging zoonotic viral diseases. PLoS Pathog 5: e1000557.
doi:10.1371/journal.ppat.1000557.
9. McHardy AC, Adams B (2009) The role of genomics in tracking the evolution of
influenza A virus. PLoS Pathog 5: e1000566. doi:10.1371/journal.ppat.1000566.
10. Comas I, Gagneux S (2009) The past and future of tuberculosis research. PLoS
Pathog 5: e1000600. doi:10.1371/journal.ppat.1000600.
11. Dorer MS, Talarico S, Salama NR (2009) Helicobacter pylori’s unconventional
role in health and disease. PLoS Pathog 5: e1000544.
doi:10.1371/journal.ppat.1000544.
12. Brindley PJ, Mitreva M, Ghedin E, Lustigman S (2009) Helminth genomics:
The implications for human health. PLoS Negl Trop Dis 3: e538.
doi:10.1371/journal.ppat.1000538.
13. Coloma J, Harris E (2009) Molecular genomic approaches to infectious diseases
in resource-limited settings. PLoS Med 6: e1000142.
doi:10.1371/journal.pmed.1000142.
14. Google.org (2008) Predict and Prevent Initiative homepage. Available:
http://www.google.org/predict.html. Accessed 16 September 2009.
15. National Center for Biotechnology Information (2009) Influenza Virus
Resource. Available: http://www.ncbi.nlm.nih.gov/genomes/FLU/SwineFlu.html.
Accessed 11 September 2009.
16. Butler D (2005) Flu researchers slam US agency for hoarding data. Nature 437:
458–459.
17. Smith GJD, Vijaykrishna D, Bahl J, Lycett SJ, Worobey M, et al. (2009) Origins
and evolutionary genomics of the 2009 swine-origin H1N1 influenza A
epidemic. Nature 459: 1122–1125.
18. Porter S (2009) Did the California H1N1 swine flu come from Ohio?
Discovering Biology in a Digital World blog. Available:
http://scienceblogs.com/digitalbio/2009/04/did_the_california_h1n1_swine.php.
Accessed 11 September 2009.
19. Koppstein D (2009) Swine flu phylogeny, part II. Koppology blog. Available:
http://koppology.blogspot.com/2009/04/swine-flu-phylogeny-part-ii.html. Accessed 11 September 2009.
20. Rambaut A (2009) Human/Swine A/H1N1 Influenza Origins and Evolution.
Available: http://tree.bio.ed.ac.uk/groups/influenza/. Accessed 11 September
2009.
21. Allen L (2009) Welcome to PLoS Currents: Influenza. PLoS Blog. Available:
http://www.plos.org/cms/node/481. Accessed 8 September 2009.
22. Evans I (29 March 2009) American Robin Spotted in South London.
Foxnews.com. Available: http://www.foxnews.com/story/0,2933,189510,00.html.
Accessed 14 September 2009.
Essay
Molecular Genomic Approaches to Infectious Diseases in
Resource-Limited Settings
Josefina Coloma1,2, Eva Harris1,2*
1 Division of Infectious Diseases and Vaccinology, School of Public Health, University of California Berkeley, Berkeley, California, United States of America, 2 Sustainable
Sciences Institute, San Francisco, California, United States of America
Only half a century after the landmark
discovery of the double helix structure of
DNA, the human genome was sequenced
and a new era of biomedical research was
ushered in [1]. Parallel advances in
comparative genomics, genetics, high-throughput biochemical techniques, and
bioinformatics have provided researchers
in wealthy nations with a repertoire of
tools to analyze the sequence and functions of organisms at an unprecedented
pace and level of detail. Since the
beginning of the genomics era [2,3],
however, it has been evident that researchers in many developing countries will not
be participating fully in genomics research,
mainly because of their technological
isolation and their limited resources and
capacity for genomics research combined
with the urgency of many other health
priorities. To share the benefits of this
technology equitably worldwide, some
have advocated that developed and developing countries alike should participate in
genomics research to prevent widening of
the already large gap in global health
resources [4]. As most of the funding that
has fueled the rapid advance of the field
comes from developed country governments, private initiatives, and industry,
however, not much has been done to
enable poorer countries to participate as
equals in genomics research. Developing
countries that are not directly participating
in a genomics initiative can, nonetheless,
gain from the discoveries of this field in a
number of ways, as detailed below. It
remains to be seen, however, how the
developing world will specifically benefit
from the refined genetic information and
the drugs and vaccines produced as a result
of genomics initiatives. Information exchange and translation of knowledge must
be carried out continually through fora
accessible to researchers in developing
countries. “North–South” collaborations—
starting with capacity building in genomics
research—need to be fostered so that
countries that are currently excluded from
the genomics revolution find an entry point
for participation. “South–South” collaborations must be encouraged to allow
countries with limited resources to pool
their human and financial capital, learn
from each other’s experience, and share in
the benefits of genomics. Ensuring that the
benefits of genomics-based medicine are
shared by developing countries involves
their inclusion in the discussion of ethical,
legal, social, economic, and sovereignty
issues (Box 1).
Summary Points
• Researchers in most developing countries lack the technology,
  resources, and capacity to participate fully in genomics research.
• Information exchange and knowledge translation must be carried out
  continually through “North–South” collaborations, starting with
  capacity building in genomics research; “South–South” collaborations
  must be encouraged to allow countries with limited resources to pool
  their human and financial capital and share in the benefits of genomics.
• Several emerging countries have made significant progress in the
  past decade by sequencing the genomes of organisms with little
  economic value in the developed world but of great local relevance.
• Molecular diagnostics and molecular epidemiology are the first
  frontier of genomics, with accessible tools that can be applied in
  resource-limited settings.
• Developing countries entering the genomics era should start by
  establishing their priorities and enacting appropriate legislation
  before embarking on large-scale projects.
• Access to training and capacity building of human resources in
  bioinformatics and data mining are crucial in the developing world.
Initiatives in the Developing
World
In the developing world, the link between
human genomics and infectious disease is
particularly important. The influence of
host genes on the differential susceptibility
of individuals or populations to infection
and the evolutionary influence of pathogens
on the genetic composition of populations
by selecting for resistant individuals through
coevolution can be now dissected in more
detail with genomics. An array of host–
pathogen interactions are associated with
particular human genes and loci, as best
illustrated by the relationship of the malaria
pathogen with host genetic evolution. As
genetic information about larger populations becomes increasingly available, it
is important to disseminate information
Citation: Coloma J, Harris E (2009) Molecular Genomic Approaches to Infectious Diseases in Resource-Limited
Settings. PLoS Med 6(10): e1000142. doi:10.1371/journal.pmed.1000142
Published October 26, 2009
Copyright: © 2009 Coloma, Harris. This is an open-access article distributed under the terms of the Creative
Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium,
provided the original author and source are credited.
Funding: No specific funding was received for this study/essay.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
The Essay section contains opinion pieces on topics
of broad interest to a general medical audience.
PLoS Medicine | www.plosmedicine.org
Provenance: Commissioned; externally peer-reviewed.
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection
(http://ploscollections.org/emerginginfectiousdisease/).
October 2009 | Volume 6 | Issue 10 | e1000142
Box 1. Societal and Ethical Issues in Genomics to Be Discussed
with Full Participation of All Nations
• Issues of confidentiality, stigmatization, discrimination, and misuse of genetic
  information
• Dangers of a reductionist approach to health issues based only on genetic
  information that ignores multifactorial determinants
• Issues about intellectual property rights associated with the patentability of
  DNA sequences, the applications derived from them, and the implications for
  developing countries [45]
• The potential exploitation of developing-country populations by creating
  genetic databases for a price [46]
• The potential risk of breeding human beings by design [47]
• Issues about informed consent, standard of care, and availability and pricing of
  new drugs and vaccines being tested in developing countries [48]
relating genomics to disease as well as to
devise intervention strategies for at-risk
populations worldwide [5].
Because science and technology are
increasingly recognized as vital components for national development, emerging
economies and some developing countries
are building their infrastructures to promote local innovation and to retain the
value of their human, plant, and microbial
genomic diversity and research. India,
Thailand, South Africa, Indonesia, Brazil,
and Mexico, for example, have devoted
considerable resources to large-scale population genotyping projects that explore
human genetic variation. The Institute for
Genomic Medicine (INMEGEN) initiative
in Mexico is the largest and most comprehensive, with a broad strategy for incorporating genomics into health care that
includes infrastructure, strategic public–
private partnerships, research and development in genomics relevant to local health
problems, capacity building, and bioethics
policy making [6,7]. Although it is unclear
how Mexico will make the transition from
early-phase investment to translation of
knowledge into products and services with
health and economic impacts, the country
is taking important steps to address the
challenges it and other emerging economies
face, such as the shortage of trained
professionals and the ability to retain local
talent. For example, the National Council
for Science and Technology (CONACYT)
is making efforts to engage the Mexican
scientific diaspora with expertise in genomics by offering repatriation packages tied
to jobs at universities and research institutes, an approach that is also being
adopted by Brazil.
Brazil’s Foundation for Research Support in São Paulo (FAPESP) genomics
initiative is also considered a political and
scientific achievement. Key to its success
has been early investment in training
young scientists by sponsoring scholarships
abroad in areas related to genomics in
which Brazil lacks expertise. To avoid
brain drain, beneficiaries are required to
return to Brazil for at least four years and
must have a committed teaching position
at a local university before they leave. One
important principle of Brazil’s genomics
initiative is that the projects are relevant to
Brazil and the rest of the developing world
but are low on the list of priorities of the
US and Europe, thus providing both an
important contribution to genomics and a
benefit to Brazil’s economy and scientific
endeavor [8]. FAPESP is in the process of
sequencing the genes of the parasite that
causes schistosomiasis, a disease that
afflicts millions in Brazil. Another example
in Brazil is the government-funded consortium Organization for Nucleotide Sequencing and Analysis (ONSA), formed to
sequence and analyze the genome of the
plant pathogen Xylella, which infects
orange trees and has great economic
impact [9]. This effort led to additional
genomics projects on vectors of pathogens
that cause major public health problems in
Brazil, such as the sandfly Lutzomyia longipalpis, which transmits Leishmania spp.,
and the Triatominae bug species, which
are vectors of Trypanosoma cruzi [10].
The impact of genomics on the developing world is also illustrated by multinational initiatives such as the one funded by
the US National Institutes of Health
(NIH), the UK’s Wellcome Trust, and
private and public institutes in the US and
Europe in collaboration with research
centers in Brazil, Argentina, Venezuela
and Singapore to sequence the genomes of
the parasites T. brucei, T. cruzi and
Leishmania major, which cause the deadly
insect-borne diseases African sleeping sickness, Chagas disease, and leishmaniasis,
respectively [11–13]. The potential new
drug targets identified by these initiatives
have great relevance in over 100 developing countries where the diseases take a
significant toll on the economy and the
quality of life of their citizens. Similar
initiatives have resulted in sequencing of
other pathogens important to medicine
and agriculture. The data from these
projects are usually freely available online
for data mining and for bioinformatics
analysis at remote locations, as most
researchers follow the recommendation
set by the Bermuda Accord to make
DNA sequences (especially human) freely
and openly available without delay [14].
Resource-limited countries can enter
the genomics era by creating partnerships
and regional centers for technology and
resources [15]. For example, DNA sequencing technology, still unaffordable for
many researchers and public laboratories
because of low-use volume and high costs
of equipment, reagents, and maintenance,
can be affordable if a regional center
provides services to a pool of laboratories
and researchers within a country or
geographical region. As an illustration,
using Brazilian infrastructure, Perú and
Chile joined the global potato sequencing
consortium, which will sequence different
varieties of this important agricultural
species [16]. Brazil has also generated
several open-source bioinformatics tools
for the annotation of bacterial and protozoan genomes that can be used by any
researcher worldwide [17]. In Africa, the
Center for Training in Functional Genomics of Insect Vectors of Human Disease
(AFRO VECTGEN) was initiated by
TDR (Special Programme in Research
and Training in Tropical Diseases) at the
World Health Organization (WHO) and
the Department of Medical Entomology
and Vector Ecology of the Malaria
Research and Training Center in Mali to
train young scientists in functional genomics who will ultimately use genome
sequence data for research on insect
vectors of human disease. The program
triggers collaborative research with neighboring nations and the vector biology
network in Mali, which was built around
research grants funded by the US NIH
and TDR/WHO [18]. The Malaria
Genomic Epidemiology Network (MalariaGEN) uses a consortial approach that
brings together researchers from 21 countries to overcome scientific, ethical, and
practical challenges to conducting large-scale studies of genomic variation that
could assist efforts in the fight against
malaria [19]. Successful “North–South”
partnerships that help scientists bridge
the genomic gap usually involve a project
of mutual interest. An example is the
common effort of the International Livestock Research Institute (ILRI) in Nairobi
and The Institute for Genomic Research
(TIGR; now the J. Craig Venter Institute) to sequence and annotate the genome
of Theileria parva, a cattle parasite that
causes important economic losses to small
farmers in Africa and elsewhere [20]. This
effort has generated local human resources
in genomics and infrastructure for the
future.
Application of Molecular,
Genetic, and Genomic Tools
with Limited Resources
Although the genomics initiatives described above challenge the notion that
developing countries must wait to import
advances in science and technology that
emerge from the developed world, poorer
developing countries still do not have the
resources to develop their own genomic
projects on a large scale. However,
implementing simpler molecular genetic
approaches to solve health problems is
very feasible in resource-limited settings.
The decades preceding the human and
microbial genome initiatives were highlighted by important developments in
molecular and genetic methods applied
to infectious diseases. These developments
were enabled by increasingly available
genetic information about many pathogens and their vectors and by molecular
tools such as PCR and powerful sequencing technologies, which permitted rapid
advances that were successfully introduced
into the developing world with little delay.
Molecular tools for diagnosis have
gained a ready foothold because many
poor countries do not have the facilities for
traditional diagnosis and surveillance.
Thus, diagnosis often relies on clinical
observations or requires that a sample be
sent out to foreign agencies such as the US
Centers for Disease Control and Prevention (CDC) for confirmation. In addition,
even when available, classic techniques
based on serological, microscopic, and
culture-based methods are often lengthy,
of only moderate sensitivity, and not
highly discriminatory at the level of species
subtype or strain. By adapting DNA
technologies to the existing infrastructure,
using home-grown solutions to reduce
their cost, and applying them to solve
local health problems, molecular approaches to detect and type infectious
agents on-site offer real value [21]. Fostering appropriate technology transfer and
capacity-building in the “South” enables
public health laboratories and research
groups in less scientifically developed
countries to participate in global genomics
by contributing their findings and sharing
their expertise with their peers [22,23].
For example, we and others adapted PCR-based molecular diagnostic techniques for
infectious diseases such as leishmaniasis
and dengue for cost-effective application
in laboratories with minimal infrastructure and basic technical expertise; these techniques
are now fully validated and used routinely
throughout Latin America [21,24–30].
This approach relies on understanding
the principles of the technologies, deconstructing them into their basic components, and rebuilding them on-site [21].
Another area where molecular tools have
demonstrated their utility in resource-poor
settings is in detecting drug resistance in a
variety of pathogens. This has been facilitated in large part by successful “North–South”
partnerships that have served to
train scientists in developing countries in the
use, implementation, and interpretation of
modern molecular methods applied to
emerging drug resistance (see [31]). This
approach has been particularly successful
with certain diseases, such as malaria, HIV/
AIDS, tuberculosis, and drug-resistant bacterial infections (both nosocomial and community-based). Unfortunately, most studies
of drug-resistant pathogens are performed
independently of one another, so data on the
prevalence of resistance markers are scattered
across disparate databases or unpublished
studies, without links to the clinical, laboratory,
and pharmacokinetic data needed to relate
the genetic information to relevant phenotypes. To enable molecular markers of
malaria drug resistance to realize their
potential as public health tools, the World Antimalarial Resistance Network (WARN)
database is being created with the dual goals
of improving treatment of malaria by
informed drug selection and use and
providing a prompt warning when treatment protocols need to be changed [32,33].
By accelerating the identification and validation of markers for resistance to combination therapies, this global database should
help prolong the useful therapeutic lives of
important new drugs.
The ultimate power of genetic tools in
resource-limited settings is evident in the
field of molecular epidemiology, where
genetic information about the host or
infectious agent is analyzed together with
clinical and epidemiological data to derive
and implement appropriate interventions.
For example, molecular tools based on
limited sequence information, such as
molecular fingerprinting of a polymorphic
marker, have made important contributions
to strengthening control of tuberculosis in
both developed and developing countries by
enabling analysis of transmission patterns,
helping identify phenotypic variation
among strains, and facilitating evaluation
of the global distribution, relative transmissibility, virulence, and immunogenicity of
different lineages of M. tuberculosis [34–38].
Bacterial infections, food-borne outbreaks,
and viral infections in developing countries,
including the recent H1N1 influenza pandemic, are monitored using similar typing
methodologies [39–41]. Molecular tools
permit a refined case definition and thus
have tremendous potential for decision-making support and for informing targeted
public health interventions in countries with
high burdens of disease and limited technological capabilities and resources.
The trend to move beyond genetic
marker analysis to full genome sequencing
is growing, as complete genome data can
provide a wealth of information about
etiologic agents of disease that was previously unknown. Full-genome approaches
are not always necessary, however. In
molecular epidemiology of infectious diseases, nucleic acid fingerprinting can
provide enough answers to important
epidemiological questions to allow critical
interventions to be designed (see above). In
fact, too much genetic information, in
some instances, can obscure the picture, as
several closely related pathogenic variants
might coexist in one individual or one
outbreak that differ by only a few
nucleotides but that nonetheless belong
to the same strain or subtype, complicating
the interpretation of results [42].
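The point about closely related variants can be made concrete with a small sketch. The Python below is illustrative only and is not a method described in this article: the isolate names, the toy aligned sequences, and the `max_diff` threshold are all hypothetical. It groups isolates whose aligned sequences differ by at most a chosen number of nucleotides, so that variants separated by only a few differences are reported together rather than as distinct strains.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def group_variants(seqs: dict[str, str], max_diff: int) -> list[set[str]]:
    """Single-linkage grouping: isolates joined by a chain of pairs that
    each differ by at most max_diff nucleotides end up in one group."""
    parent = {name: name for name in seqs}

    def find(name: str) -> str:
        # Follow parent links to the group representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if hamming(seqs[a], seqs[b]) <= max_diff:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=sorted)


# Hypothetical toy data: three near-identical variants and one distant isolate.
isolates = {
    "patient1_a": "ACGTACGTAC",
    "patient1_b": "ACGTACGTAT",
    "patient2": "ACGAACGTAC",
    "outgroup": "TTGTCCGAAC",
}

if __name__ == "__main__":
    for threshold in (0, 2):
        print(threshold, group_variants(isolates, threshold))
```

With a threshold of zero every variant is reported separately, which is the kind of over-resolution the text warns about; with a tolerance of two differences the three near-identical variants fall into one group and the distant isolate stays apart. Choosing the threshold is an epidemiological judgment, not something the code can decide.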
The relatively rapid transfer of DNA
technology from developed to developing
countries is an excellent example of what
can be done by forging strong relationships between universities and research
groups and public-health laboratories
across the world. The validity of adapting
these technologies relies on links with
epidemiological data and translation into
local public health interventions.
Setting Priorities
General international ethical and scientific guidelines for genomics have been
created and are being adapted by nations
participating in the field as it evolves.
Governments and regulatory agencies in
the “North” have prepared for the
eventual implementation of genomics-based medicine in their respective countries. A critical problem faced by developing countries is the lack of national
guidelines for genomics research and its
ethical ramifications. Thus, a priority to be
set by countries in the early steps of
genomic applications is to draw up the
necessary rules and legislation on genomics and to generate procedures for
implementation. Creating the necessary
communication channels between researchers, social scientists, policy makers,
and civil society organizations is also a
critical step. Other key challenges facing
emerging genomics researchers include
proper informed consent and privacy
protocols for research participants, protecting them against the potential discrimination that might emerge from genetic
information and ensuring that any benefit
that comes to fruition from the research
reaches them. In parallel, capacity building of scientists in clinical research and of
ethics committees in these issues is essential. Past experience with “safari research,”
in which biological samples are taken out of the country for research that does not
benefit local populations, has prompted
countries such as Mexico, India, and
Brazil to draw up legislation governing
“sovereignty” over genomics material and
data that restricts the export of biological
materials for studies abroad and prioritizes
national interests. Poorer countries currently lacking their own genomics initiatives could benefit from similar legislation
that protects “genomic
sovereignty” while still fostering international
collaborations that bring much-needed
resources and increase local scientific
capacity. Beyond the improvement of their
basic genomics research capabilities, governments should engage their relevant
ministries to develop a plan to integrate
genetic and genomics products (including
diagnostics, vaccines, therapies, and others) within the health system and public
health programs with emphasis on accessibility and equity to improve health for
all. A good example of priority setting in
genomics is Mexico’s national genomics
program over the last 15 years (see Box 2).
Sharing Know-How
To strengthen genomics globally, the
tools necessary for analysis of genomics
data are urgently needed in developing
countries, where they are currently underutilized [43]. A problem with genomics is
that much of the advanced knowledge is
concentrated in individuals and a few
research centers and companies rather
than in textbooks or academia, restricting
dissemination even though massive
amounts of genomic data and software
are openly accessible through the Internet.
A conscious effort on the part of developed
nations to transfer their knowledge of the
use and analysis of genomic databases
needs to be encouraged to help developing
countries manage their own specific data
on indigenous biological species, local
epidemiology and infectious diseases, biodiversity, and other issues. Some successful
programs and initiatives include the Wellcome Trust Sanger Institute training
courses on bioinformatics and genomic
analysis, the Sustainable Sciences Institute–Broad Institute bioinformatics workshops (Figure 1), and the TDR/WHO–South African Bioinformatics Institute
(SANBI) regional training center. Online
training like the S-star alliance bioinformatics courses and conferences such as the
African Bioinformatics Conference (Afbix’09) with remote participation are
becoming more widespread and are an
excellent option for countries with limited
resources. GARSA (Genomic Analysis
Resources for Sequence Annotation) is a
flexible Web-based system designed to
analyze genomic data in the context of a
data analysis pipeline. Hosted in Brazil,
this free system aims to facilitate the
analysis, integration, and presentation of
genomic information, concatenating several bioinformatics tools and sequence
databases with a simple user interface
[44]. An alternative to on-site sequencing
is to partner with colleagues in more-developed countries to have samples
processed abroad in sequencing centers.
This is possible only if local legislation
allows for export of biological samples,
and if true partnership and trust exist with
colleagues in the developed country.
Box 2. Building a Road toward Genomics: The Mexican
Experience 1995–2009 [7]
• Increases in investment in science and technology (S&T) from 0.35% to 0.43% of
the GNP and creation of national S&T legislation to increase regional funding
• Four-fold increase in number of students registered for doctoral-level programs
• Participation in international genomics efforts
• Creation of sequencing initiatives of organisms with local agricultural and
health relevance
• Creation of a Genomics Sciences degree and two scientific societies in
genomics
• Creation of the National Institute of Genomic Medicine (INMEGEN, 2004) with
seed funding for modern infrastructure; a strategy for development that
includes country-wide strategic alliances; high-level research and academic
programs; ethical, legal, and social implications of genomic medicine; and
translation of the scientific knowledge into public goods
• Establishment of genomics research priorities based on most prevalent local
diseases
• Plans for creation of public–private partnerships to guarantee sustainability
Challenges for the Future
As developing countries reevaluate their
role in the genomics era, they will continue
to explore the unique opportunities that
arise from the vast natural and genomic
diversity that they embody. As exemplified
by the successes in Brazil, Mexico, and
several African countries, it is possible to
turn challenges and problems such as
emerging and endemic infectious diseases
into opportunities for unique scientific and
economic growth. Access to sequencing
facilities, open-source databases, and harmonized methodologies for genomic analysis are essential for the future of genomics in
the developing world. However, unless a
more concerted effort is made to include
countries with limited scientific development
and resources, it is unlikely that they will
fully participate in genomics projects or use
the technologies available other than by
allowing their genetic material to be accessible to others. As emerging countries set
their own priorities for genomics research
and take ownership of its results, the main
challenge across developing nations remains
access to training and knowledge translation.
Human resources and local capacity in
genomics are thus central to development,
as countries with these skills could participate in the potential benefits of the field with
respect to health, food security, natural
resource management, and other critical
areas. “North–South” and “South–South”
collaborations are a viable and extremely
rewarding way to increase the capacities of
developing countries to access genomic tools
to address unique problems considered of
little economic value outside these countries
but of tremendous importance to the
majority of the world’s population.
Author Contributions
ICMJE criteria for authorship read and met: JC
EH. Wrote the first draft of the paper: JC.
Contributed to the writing of the paper: JC EH.
Figure 1. Participants in a Bioinformatics/Genomics Analysis workshop in Managua, Nicaragua, in June 2008 (conducted by the
Sustainable Sciences Institute and the Broad Institute). Photograph by Eva Harris.
doi:10.1371/journal.pmed.1000142.g001
References
1. Venter JC (2003) A part of the human genome
sequence. Science 299: 1183–1184.
2. Singer PA, Daar AS (2001) Harnessing genomics
and biotechnology to improve global health
equity. Science 294: 87–89.
3. Calva E, Cardosa MJ, Gavilondo JV (2002)
Avoiding the genomics divide. Trends Biotechnol
20: 368–370.
4. Acharya T, Daar AS, Thorsteinsdottir H,
Dowdeswell E, Singer PA (2004) Strengthening
the role of genomics in global health. PLoS Med
1: e40. doi:10.1371/journal.pmed.0010040.
5. Manolio TA, Rodriguez LL, Brooks L,
Abecasis G, Ballinger D, et al. (2007) New
models of collaboration in genome-wide association studies: The Genetic Association Information Network. Nat Genet 39: 1045–1051.
6. Seguin B, Hardy BJ, Singer PA, Daar AS (2008)
Genomics, public health and developing countries: The case of the Mexican National Institute
of Genomic Medicine (INMEGEN). Nat Rev
Genet 9 (Suppl 1): S5–9.
7. Jimenez-Sanchez G, Silva-Zolezzi I, Hidalgo A,
March S (2008) Genomic medicine in Mexico:
Initial steps and the road ahead. Genome Res 18:
1191–1198.
8. Castilla EE, Luquetti DV (2008) Brazil: Public
Health Genomics. Public Health Genomics.
E-pub ahead of print (3 Sept). doi:10.1159/000153424.
9. Simpson AJ, Reinach FC, Arruda P, Abreu FA,
Acencio M, et al. (2000) The genome sequence of
the plant pathogen Xylella fastidiosa. The Xylella
fastidiosa Consortium of the Organization for
Nucleotide Sequencing and Analysis. Nature
406: 151–159.
10. Davila AM, Majiwa PA, Grisard EC, Aksoy S,
Melville SE (2003) Comparative genomics to
uncover the secrets of tsetse and livestock-infective
trypanosomes. Trends Parasitol 19: 436–439.
11. Berriman M, Ghedin E, Hertz-Fowler C,
Blandin G, Renauld H, et al. (2005) The genome
of the African trypanosome Trypanosoma brucei.
Science 309: 416–422.
12. El-Sayed NM, Myler PJ, Bartholomeu DC,
Nilsson D, Aggarwal G, et al. (2005) The genome
sequence of Trypanosoma cruzi, etiologic agent of
Chagas disease. Science 309: 409–415.
13. Ivens AC, Peacock CS, Worthey EA, Murphy L,
Aggarwal G, et al. (2005) The genome of the
kinetoplastid parasite, Leishmania major. Science
309: 436–442.
14. Bentley DR (1996) Genomic sequence information should be released immediately and freely in the public domain. Science 274: 533–534.
15. Rabinowicz PD (2001) Genomics in Latin
America: Reaching the frontiers. Genome Res
11: 319–322.
16. Potato Genome Sequencing Consortium. Available: http://www.potatogenome.net. Accessed 19
July 2009.
17. Almeida LG, Paixao R, Souza RC, Costa GC,
Almeida DF, et al. (2004) A new set of bioinformatics tools for genome projects. Genet Mol Res
3: 26–52.
18. Doumbia S, Chouong H, Traore SF, Dolo G,
Toure AM, et al. (2007) Establishing an insect
disease vector functional genomics training center
in Africa. Afr J Med Med Sci 36 (Suppl): 31–33.
19. Malaria Genomic Epidemiology Network (2008)
A global network for investigating the genomic
epidemiology of malaria. Nature 456: 732–737.
20. Gardner MJ, Bishop R, Shah T, de Villiers EP,
Carlton JM, et al. (2005) Genome sequence of
Theileria parva, a bovine pathogen that transforms
lymphocytes. Science 309: 134–137.
21. Harris E (1998) A low-cost approach to PCR:
Appropriate transfer of biomolecular techniques.
New York: Oxford University Press.
22. Coloma MJ, Harris E (2004) Innovative low cost
technologies for biomedical research and diagnosis in developing countries. BMJ 329:
1160–1162.
23. Harris E (2004) Scientific capacity building in
developing countries. EMBO Rep 5: 7–11.
24. Harris E, Tanner M (2000) Health technology
transfer. BMJ 321: 817–820.
25. Aviles H, Belli A, Armijos R, Monroy FP,
Harris E (1999) PCR detection and identification
of Leishmania parasites in clinical specimens in
Ecuador: A comparison with classical diagnostic
methods. J Parasitol 85: 181–187.
26. Harris E, Kropp G, Belli A, Rodriguez B,
Agabian N (1998) Single-step multiplex PCR
assay for characterization of New World Leishmania complexes. J Clin Microbiol 36:
1989–1995.
27. Belli A, Rodriguez B, Aviles H, Harris E (1998)
Simplified polymerase chain reaction detection of
new world Leishmania in clinical specimens of
cutaneous leishmaniasis. Am J Trop Med Hyg 58:
102–109.
28. Coloma J, Harris E (2008) Sustainable transfer of
biotechnology to developing countries: fighting
poverty by bringing scientific tools to developing-country partners. Ann N Y Acad Sci 1136:
358–368.
29. Miagostovich MP, Sequeira PC, Dos Santos FB,
Maia A, Nogueira RM, et al. (2003) Molecular
typing of dengue virus type 2 in Brazil. Rev Inst
Med Trop Sao Paulo 45: 17–21.
30. Schriefer A, Schriefer AL, Goes-Neto A,
Guimaraes LH, Carvalho LP, et al. (2004)
Multiclonal Leishmania braziliensis population
structure and its clinical implication in a region
of endemicity for American tegumentary leishmaniasis. Infect Immun 72: 508–514.
31. Falush D (2009) Toward the use of genomics to
study microevolutionary change in bacteria. PLoS
Gen 5: e1000627. doi:10.1371/journal.pgen.1000627.
32. Plowe CV, Roper C, Barnwell JW, Happi CT,
Joshi HH, et al. (2007) World Antimalarial
Resistance Network (WARN) III: Molecular
markers for drug resistant malaria. Malar J 6:
121.
33. Sibley CH, Barnes KI, Watkins WM, Plowe CV
(2008) A network to monitor antimalarial drug
resistance: a plan for moving forward. Trends
Parasitol 24: 43–48.
34. Bifani PJ, Mathema B, Kurepina NE,
Kreiswirth BN (2002) Global dissemination of
the Mycobacterium tuberculosis W-Beijing family
strains. Trends Microbiol 10: 45–52.
35. Filliol I, Driscoll JR, van Soolingen D,
Kreiswirth BN, Kremer K, et al. (2003) Snapshot
of moving and expanding clones of Mycobacterium
tuberculosis and their global distribution assessed
by spoligotyping in an international study. J Clin
Microbiol 41: 1963–1970.
36. Manca C, Reed MB, Freeman S, Mathema B,
Kreiswirth B, et al. (2004) Differential monocyte
activation underlies strain-specific Mycobacterium
tuberculosis pathogenesis. Infect Immun 72:
5511–5514.
37. Valway SE, Sanchez MP, Shinnick TF, Orme I,
Agerton T, et al. (1998) An outbreak involving
extensive transmission of a virulent strain of
Mycobacterium tuberculosis. N Engl J Med 338:
633–639.
38. Gagneux S, Comas I (2009) The past and future
of tuberculosis research. PLoS Path 5(10): e1000600.
doi:10.1371/journal.ppat.1000600.
39. Poon LL, Chan KH, Smith GJ, Leung CS,
Guan Y, et al. (2009) Molecular detection of a
novel human influenza (H1N1) of pandemic
potential by conventional and real-time quantitative RT-PCR assays. Clin Chem 55: 1555–1558.
40. Reis JN, Palma T, Ribeiro GS, Pinheiro RM,
Ribeiro CT, et al. (2008) Transmission of
Streptococcus pneumoniae in an urban slum community. J Infect 57: 204–213.
41. Vieira N, Bates SJ, Solberg OD, Ponce K,
Howsmon R, et al. (2007) High prevalence of
enteroinvasive Escherichia coli isolated in a remote
region of northern coastal Ecuador. Am J Trop
Med Hyg 76: 528–533.
42. Riley LW (2004) Molecular epidemiology of
infectious diseases: Principles and practices.
Herndon (Virginia): ASM Press.
43. Teufel A, Krupp M, Weinmann A, Galle PR
(2006) Current bioinformatics tools in genomic
biomedical research. Int J Mol Med 17: 967–973.
44. Davila AM, Lorenzini DM, Mendes PN,
Satake TS, Sousa GR, et al. (2005) GARSA:
Genomic analysis resources for sequence annotation. Bioinformatics 21: 4302–4303.
45. Cook-Deegan RM, McCormack SJ (2001) Intellectual property. Patents, secrecy, and DNA.
Science 293: 217.
46. Burton B (2002) Proposed genetic database on
Tongans opposed. BMJ 324: 443.
47. Pang T (2002) The impact of genomics on global
health. Am J Public Health 92: 1077–1079.
48. Chokshi DA, Thera MA, Parker M, Diakite M,
Makani J, et al. (2007) Valid consent for genomic
epidemiology in developing countries. PLoS Med
4: e95. doi:10.1371/journal.pmed.0040095.
Perspective
Can an Infectious Disease Genomics Project Predict and
Prevent the Next Pandemic?
Rajesh Gupta¤*, Mark H. Michalski¤, Frank R. Rijsberman
Google.org, Mountain View, California, United States of America
We believe that there is great potential
in the systematic application of genomics,
proteomics, and bioinformatics to infectious diseases, and that this potential has
yet to be fully realized. We suggest that the
international community unite under an
Infectious Disease Genomics Project, analogous to the Human Genome Project,
with a goal of a comprehensive, open-access system of genomic information to
accelerate scientific understanding and
product development in the very settings
where diseases have the highest probability of emerging. If properly structured,
such an approach could shift fundamentally the global response to emerging
infectious diseases.
Genomics Is Systematically
Transforming Medicine
The “Genomic Revolution” has transformed our vision and understanding of
how living organisms and systems interact
with each other and with the environment
[1]. Increasingly, the science of genomics
serves as the foundation for translational
research for advancing the management of
many important diseases [2–7]. Decreasing costs and increasing throughput of new
technologies have made possible multinational collaboration on large-scale projects
such as the Human Microbiome Project
and the 1000 Genomes Project [8–10].
Infectious disease management is also
transforming thanks to molecular technologies as seen in HIV [11,12], tuberculosis
[13,14], malaria [15,16], and other neglected tropical diseases [17,18]. Discovering novel pathogens and elucidating the
implications of genetic variation among
existing pathogens [19,20] is critical for
rapidly mitigating pandemic threats, as
demonstrated recently with severe acute
respiratory syndrome (SARS) [21,22] and
avian (H5N1) and pandemic H1N1 2009
influenza (commonly referred to as “swine
flu”) [23–26].
To fully harness the benefit of genomics
in infectious diseases, a chain of overarching activities must occur. First, understanding the dynamics of infectious diseases through the genomics lens requires a
tremendous amount of integrated comparative sequence, expression, epigenetic,
and proteomic data from a variety of
pathogens (bacteria, virus, protozoa, fungi), vectors (arthropod and avian sources),
reservoirs (non-human mammals, environment) and human hosts. Second, generating, collating, organizing, and curating
these data is an essential public health task.
Third, translating this information to tools
to improve surveillance and response
mechanisms is critical to effectively impact
disease management.
If this bench-to-bedside chain of activities
were optimized, we envision that the
following could occur:
• Fully annotated genomes of all known
pathogens, vectors, non-human hosts,
and reservoir species, as well as a large
number of candidate microbes in
families that have a high risk of
generating future pathogens, are held
in public open-access databases such as
GenBank.
• A “Genomic search” of all available
contextual information, from sample
origins through to published analyses,
is as simple as a Google search.
• Sequencing and other molecular technologies are everyday tools-of-the-trade in every district hospital and
laboratory in hotspots of emerging
infectious disease, such as southeast
Asia and sub-Saharan Africa.
• Automated molecular diagnostic assays are low-cost, reduced at least to
the size of a smart mobile phone, and
can return definitive diagnoses of a
range of specialized known pathogen
panels at the point of care.
• A range of products that use infectious
disease genomic information routinely (such as vector maps, early warning
systems, diagnostics, vaccines, and
drugs) contribute to the prediction
and prevention of epidemics.
While progress is occurring in each of
these areas, the outputs—which are needed today—are far from complete.
Creating an Infectious Disease
Genomics Project (IDGP)
We believe that accelerated advances in
the area of infectious diseases can occur
under a global collaborative framework
composed of discrete and delineated
activities between the public and private
sectors among resource-wealthy and resource-limited settings. The Human Genome Project (HGP) was a pioneering
international effort that helped unlock the
power of genomics for human health
[27,28]. This effort generated important
information in part by having clear,
targeted outcomes and by implementing
a standard methodology across all participants. The HGP was a great impetus for
progress seen thus far in genomics and
health. Moreover, the HGP recognized
that sequencing was just the first step in
a much bigger process [26]. A similar
effort for infectious diseases could, in our
view, help predict and prevent the next
pandemic.
Author Summary
The world of genomics is transforming medicine, and is likely to influence the
future development of new drugs, diagnostics, and vaccines. To date, the greater
focus of genomics and medicine has been on conditions affecting resource-wealthy settings, primarily involving scientists and companies in those settings.
However, we believe that it is possible to expand genomics into a more global
technology that can also focus on diseases of resource-limited settings. This goal
can be achieved if genomics is made a global priority. We feel one way to move in
this direction is through a comprehensive approach to infectious diseases (i.e.,
an Infectious Disease Genomics Project) that would mirror the Human Genome
Project. Without an active, unified effort specifically focused on allowing actors at
any level to participate in the genomics revolution, infectious diseases that
primarily affect the poor will likely not achieve the same level of scientific
advancement as diseases affecting the wealthy.
Citation: Gupta R, Michalski MH, Rijsberman FR (2009) Can an Infectious Disease Genomics Project Predict and
Prevent the Next Pandemic? PLoS Biol 7(10): e1000219. doi:10.1371/journal.pbio.1000219
Published October 26, 2009
Copyright: © 2009 Gupta et al. This is an open-access article distributed under the terms of the Creative
Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium,
provided the original author and source are credited.
Funding: Google.org is financially supported through its parent company, Google.com. At the time this
manuscript was developed, RG was an employee of Google.org and MM was a consultant to Google.org. The
funder had no role in the decision to publish or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
¤ Current address: Stanford University, Stanford, California, United States of America
The Perspective section provides experts with a
forum to comment on topical or controversial issues
of broad interest.
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection
(http://ploscollections.org/emerginginfectiousdisease/).
PLoS Biology | www.plosbiology.org
October 2009 | Volume 7 | Issue 10 | e1000219
To capitalize on existing successful
efforts in the area of genomics and
infectious diseases such as those by the
Broad Institute, Genomic Standards Consortium, J. Craig Venter Institute, the
National Institute of Allergy and Infectious
Diseases, and the Wellcome Trust Sanger
Institute (to name a few), we urge the
international community to unite its numerous activities under an Infectious
Disease Genomics Project (IDGP): a
coordinated, large-scale, international effort focused on the genomes of pathogens,
vectors, hosts, and reservoirs and linked to
end-point surveillance and response systems. Such a project could coordinate
activities in four specific areas: generating
data, linking data, analyzing data, and
applying data (Figure 1).
Generating Data
At the outset, the IDGP would need to
determine what the world requires in
terms of genomic information. A standard
approach to generating depth and diversity in genomic data is essential; beyond
this, continuous real-time surveillance and
characterization of evolving pathogens can
help effectively forestall future epidemics/
pandemics. Frontline work by consortiums, genome research centers, and
individual laboratories has yielded baseline
approaches in this area and a wealth of
critical genomic information for many
important infectious agents [29–34].
While each actor in the genomics field
brings its own priority for targeting
particular pathogens or diseases, a clear
roadmap to generating a complete genomic picture of all infectious agents, emerging threats, hosts, and reservoirs, incorporating a broad range of investigators with
varied technological capacity, would enhance both data generation and application. Such a process allows for community-level priority setting, thereby enabling
smaller-scale laboratories to tailor projects
to fit the needs of local communities while
contributing to global efforts.
Linking Data
The data collected must be connected
to all relevant information and analytical
tools in a single, easy-to-use, open-source,
real-time interface. Such a system would
improve on current systems by: gathering
data across the public domain and working with companies/institutions to harness
information in the private domain; linking
accurate, annotated sequencing information to functional genomic and proteomic/functional proteomic information;
attaching scientific literature associated
with all levels of information; and including a self-sustaining financial mechanism
potentially based on royalties from commercial products generated from the use of
this system.
Analyzing Data
The data need to be linked via large-scale, dynamic databases held in virtual
servers allowing for collaboration and
sharing while maintaining originating
information for data rights and sovereignty. Concurrently, these data should be
associated with a centralized collection of
open-source bioinformatics tools capable
of real-time operation on low- and high-speed computers and with varying levels of
internet connectivity. A single interface
also would bring various sample collections together in formally structured biobanks that capture geospatial and context
data to allow efficient scientific collaboration to take place. Centralizing the entire
spectrum of information and analytic tools
also allows researchers in resource-limited
settings to participate in the genomics
revolution without prohibitively costly
machines, laboratories, and sample accessibility.
Figure 1. A coordinated Infectious Disease Genome Project (IDGP) could unify sequencing efforts, enhance data usability, and lead
to essential tools for infectious disease management.
doi:10.1371/journal.pbio.1000219.g001
PLoS Biology | www.plosbiology.org
October 2009 | Volume 7 | Issue 10 | e1000219
Although we fully acknowledge
that internet connectivity is a requirement
that is not currently available to all, rapid
technical innovation and investment from
cheap netbook computers to new fiber
optic cables in Africa are changing that
equation. This system could be facilitated
by virtual community collaboration or
crowd-sourcing, taking full advantage of
networking tools such as Wikipedia, Facebook, Twitter, FusionTables, and PLoS.
Applying Data
Technological advances for basic scientific discovery (such as next-generation
sequencers, microarrays, mass spectrometers, cell-based assay methods, and other
tools for transcriptome, metabolome, and
proteome discovery), novel techniques to
increase throughput and/or decrease the
cost of analysis, and applied clinical
decision-making and surveillance tools
(point-of-care diagnostics, rapid multipathogen assays) are in progress and
should be supported actively. The IDGP
should be informed by and incorporate
emerging technology platforms to rapidly
develop more accurate field diagnostics
and to identify new opportunities for
vaccine and drug development.
Moving beyond Discourse into Action
An IDGP is attainable if others share
this vision, show leadership, and see the
added value resulting from a coordinated
effort. The HGP certainly was a more
targeted effort and we acknowledge that
an IDGP will have additional obstacles to
overcome. Scientific disagreement over
targets is bound to occur. Complications
resulting from the proposed level of data
sharing should not be underestimated, and
care must be taken to ensure proprietary
rights and acknowledgement when warranted. Adapting molecular genetic technologies to resource-limited settings is a
significant challenge, but is occurring with
some success. Bringing together a community of scientists and donors, each with
their own objectives and goals, to work
under a single framework, is a difficult
proposition. Finally, there will be many
who will find this perspective simply too
grandiose. Leaps of progress also require
big visions, however, and it may just be
possible that the 2009 H1N1 influenza
pandemic is enough of a reminder of
what is at stake to provide a catalyst for
action.
Google.org has supported global public
health through its ‘‘Predict and Prevent’’
initiative with the aim of using the power
of information and technology to address
emerging infectious diseases by helping the
world to know where to look for these
diseases, find the threats earlier, and
respond to them faster [35]. Google.org
has focused its support on sequencing and
pathogen discovery activities, bringing
genomic technologies to resource-limited
settings in East Africa, improving surveillance networks and systems, and exploring
how our core competence in internet
search can assist the infectious diseases
community [36].
As a firm supporter of the open access
model for scientific publication [37],
Google.org is pleased to support this series
of essays, The Genomics of Emerging
Infectious Disease, in partnership with the
Public Library of Science (PLoS) journals
(PLoS Biology, PLoS Computational Biology,
PLoS Genetics, PLoS Medicine, PLoS Neglected
Tropical Diseases, and PLoS Pathogens), not
only to help define the current state of the
art in pathogen genomics, but also, we
hope, to stimulate debate on priorities for
research and technology development.
References
1. Yudell M, DeSalle R (2002) The genomic revolution: Unveiling the unity of life. Washington (D. C.): Joseph Henry Press. 272 p.
2. Langston AA, Malone KE, Thompson JD, Daling JR, Ostrander EA (1996) BRCA1 mutations in a population-based sample of young women with breast cancer. N Engl J Med 334: 137–142.
3. Futreal P, Liu Q, Shattuck-Eidens D, Cochran C, Harshman K, et al. (1994) BRCA1 mutations in primary breast and ovarian carcinomas. Science 266: 120–122.
4. Helgadottir A, Manolescu A, Thorleifsson G, Gretarsdottir S, Jonsdottir H, et al. (2004) The gene encoding 5-lipoxygenase activating protein confers risk of myocardial infarction and stroke. Nat Genet 36: 233–239.
5. Wellcome Trust C (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.
6. Consortium G (2007) New models of collaboration in genome-wide association studies: The Genetic Association Information Network. Nat Genet 39: 1045–1051.
7. Vigneri P, Wang J (2001) Induction of apoptosis in chronic myelogenous leukemia cells through nuclear entrapment of BCR-ABL tyrosine kinase. Nat Med 7: 228–234.
8. Gresham D, Kruglyak L (2008) Rise of the machines. PLoS Genet 4: e1000134. doi:10.1371/journal.pgen.1000134.
9. Spencer G (2008) Researchers establish international human microbiome consortium. NIH News. Available: http://www.nih.gov/news/health/oct2008/nhgri-16.htm. Accessed 19 September 2009.
10. Spencer G (2008) International consortium announces the 1000 Genomes Project. NIH News. Available: http://www.nih.gov/news/health/jan2008/nhgri-22.htm. Accessed 19 September 2009.
11. Martinez-Cajas JL, Wainberg MA (2008) Antiretroviral therapy: Optimal sequencing of therapy to avoid resistance. Drugs 68: 43–72.
12. Wilkinson KA, Gorelick RJ, Vasa SM, Guex N, Rein A, et al. (2008) High-throughput SHAPE analysis reveals structures in HIV-1 Genomic RNA strongly conserved across distinct biological states. PLoS Biol 6: e96. doi:10.1371/journal.pbio.0060096.
13. Smith CV, Sacchettini JC (2003) Mycobacterium tuberculosis: A model system for structural genomics. Curr Opin Struct Biol 13: 658–664.
14. Cockle PJ, Gordon SV, Lalvani A, Buddle BM, Hewinson RG, et al. (2002) Identification of novel Mycobacterium tuberculosis antigens with potential as diagnostic reagents or subunit vaccine candidates by comparative genomics. Infect Immun 70: 6996–7003.
15. Gonzales JM, Patel JJ, Ponmee N, Jiang L, Tan A, et al. (2008) Regulatory hotspots in the malaria parasite genome dictate transcriptional variation. PLoS Biol 6: e238. doi:10.1371/journal.pbio.0060238.
16. Ekland EH, Fidock DA (2007) Advances in understanding the genetic basis of antimalarial drug resistance. Curr Opin Microbiol 10: 363–370.
17. Beaty BJ, Prager DJ, James AA, Jacobs-Lorena M, Miller LH, et al. (2009) From Tucson to genomics and transgenics: The Vector Biology Network and the emergence of modern vector biology. PLoS Negl Trop Dis 3: e343. doi:10.1371/journal.pntd.0000343.
18. Hertz-Fowler C, Figueiredo LM, Quail MA, Becker M, Jackson A, et al. (2008) Telomeric expression sites are highly conserved in Trypanosoma brucei. PLoS ONE 3: e3527. doi:10.1371/journal.pone.0003527.
19. Wolfe N, Heneine W, Carr J, Garcia A, Shanmugam V, et al. (2005) Emergence of unique primate T-lymphotropic viruses among central African bushmeat hunters. Proc Natl Acad Sci U S A 102: 7994–7999.
20. Palacios G, Druce J, Du L, Tran T, Birch C, et al. (2008) A new arenavirus in a cluster of fatal transplant-associated diseases. N Engl J Med 358: 991–998.
21. Grant P, Garson J, Tedder R, Chan P, Tam J, et al. (2003) Detection of SARS coronavirus in plasma by real-time RT-PCR. N Engl J Med 349: 2468.
22. Marra M, Jones S, Astell C, Holt R, Brooks-Wilson A, et al. (2003) The genome sequence of the SARS-associated coronavirus. Science 300: 1399–1404.
23. Gu J, Xie Z, Gao Z, Liu J, Korteweg C, et al. (2007) H5N1 infection of the respiratory tract and beyond: A molecular pathology study. Lancet 370: 1137–1145.
24. Zhao Z-M, Shortridge KF, Garcia M, Guan Y, Wan X-F (2008) Genotypic diversity of H5N1 highly pathogenic avian influenza viruses. J Gen Virol 89: 2182–2193.
25. Garten RJ, Davis CT, Russell CA, Shu B, Lindstrom S, et al. (2009) Antigenic and genetic characteristics of swine-origin 2009 A(H1N1) influenza viruses circulating in humans. Science 325: 197–201.
26. Shinde V, Bridges CB, Uyeki TM, Shu B, Balish A, et al. (2009) Triple-reassortant swine influenza A (H1) in humans in the United States, 2005–2009. N Engl J Med 360: 2616–2625.
27. Consortium IHGS (2001) Initial sequencing and analysis of the human genome. Nature 409: 860–921.
28. Collins FS, Morgan M, Patrinos A (2003) The Human Genome Project: Lessons from large-scale biology. Science 300: 286–290.
29. Wellcome Trust Sanger Institute (2009) Pathogen genomics [Web site]. Available: http://www.sanger.ac.uk/Projects/Pathogens/. Accessed 11 August 2009.
30. National Institute of Allergy and Infectious Disease (2009) Microbial Genome Sequencing Centers: Completed NIAID-Supported Sequencing Projects. Available: http://www3.niaid.nih.gov/research/resources/mscs/completed.htm. Accessed 11 August 2009.
31. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, et al. (1998) Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393: 537–544.
32. Gardner MJ, Hall N, Fung E, White O, Berriman M, et al. (2002) Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 419: 498–511.
33. Greene JM, Collins F, Lefkowitz EJ, Roos D, Scheuermann RH, et al. (2007) National Institute of Allergy and Infectious Diseases bioinformatics resource centers: New assets for pathogen informatics. Infect Immun 75: 3212–3219.
34. Field D, Garrity G, Gray T, Morrison N, Selengut J (2008) The minimum information about a genome sequence (MIGS) specification. Nat Biotechnol 26: 541–547.
35. Google.org (2008) Predict and Prevent initiative. Available: http://www.google.org/predict.html. Accessed 19 September 2009.
36. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, et al. (2009) Detecting influenza epidemics using search engine query data. Nature 457: 1012–1014.
37. Gass A (2004) Open access as public policy. PLoS Biol 2: e353. doi:10.1371/journal.pbio.0020353.
Perspective
The Role of Genomics in the Identification, Prediction,
and Prevention of Biological Threats
W. Florian Fricke, David A. Rasko, Jacques Ravel*
Institute for Genome Sciences (IGS), University of Maryland School of Medicine, Baltimore, Maryland, United States of America
Since the publication in 1995 of the first
complete genome sequence of a free-living
organism, the bacterium Haemophilus influenzae [1], more than 1,000 genomes of
species from all three domains of life—
Bacteria, Archaea, and Eukarya—have
been completed and a staggering 4,300
are in progress (not including an even
larger number of viral genome projects)
(GOLD, Genomes Online Database v.
2.0; http://www.genomesonline.org/gold.cgi,
as of August 2009). Whole-genome
shotgun sequencing remains the standard
in biomedical, biotechnological, environmental, agricultural, and evolutionary genomics (http://genomesonline.org/gold_statistics.htm#aname).
While next-generation sequencing technology is
changing the field, this approach will
continue to be used and lead to a
previously unimaginable number of genome sequences, providing opportunities
that could not have been thought of a few
years ago. These opportunities include
studying genomes in real-time to understand the evolution of known pathogens
and predict the emergence of new infectious agents (Box 1). With the introduction
of next-generation sequencing platforms,
cost has decreased dramatically, resulting
in genomics no longer being an independent discipline, but becoming a tool
routinely used in laboratories around the
world to address scientific questions. This
global sequencing effort has been focusing
primarily on pathogenic organisms, which
today are still the subject of the majority of
genome projects [2]. Sequencing two to
five strains of the same pathogen has, in
recent years, afforded us not only a better
understanding of evolution, virulence, and
biology in general [3], but, taken to the
next level (hundreds or thousands of
strains) it will enable even more accurate
diagnostics to support epidemiological
studies, food safety improvements, public
health protection, and forensics investigations, among others.
Biodefense Funding for
Genomic Research
Since the anthrax letter attacks of 2001,
when letters containing anthrax spores
were mailed to several news media offices
and two Democratic senators in the
United States, killing five people and
infecting 17 others, funding agencies in
the US and other countries have prioritized research projects on organisms that
might potentially challenge our security
and economy should they be used as
biological weapons. This has resulted in
large amounts of funding dedicated to so-called ‘‘biodefense’’ research, totaling close
to $50 billion between 2001 and 2009 [4].
Genomics has benefited greatly from this
influx of research dollars and as a result,
representatives of most major animal, plant,
and human pathogens have been sequenced
(http://www.pathogenportal.org/). Supported by federal funds from the National
Institutes of Health (NIH), the National
Institute of Allergy and Infectious Diseases
(NIAID), and the US Department of Defense, research programs, such as the Microbial Sequencing Centers and the Bioinformatics Resource Centers
(http://www3.niaid.nih.gov/topics/pathogenGenomics/PDF/genomicsinitiatives.htm), have been
established that carry out genomics research on pathogenic organisms and have
spearheaded a new phase of the genomics
revolution. Similar programs were started
in Europe, such as those at the Wellcome
Trust Sanger Institute in the United
Kingdom, and the multinational European
effort, The Network of Excellence EuroPathoGenomics
(http://www.noe-epg.uni-wuerzburg.de/epg_general.htm). As
an example of the success of these types
of programs, the genome sequences of over
90,000 influenza viruses were rapidly
generated and are now deposited in
GenBank (http://www.ncbi.nlm.nih.gov/genomes/FLU/aboutdatabase.html).
Because of the availability of large sequencing
capacity and the large amount of information, the response to the 2009 H1N1
influenza pandemic was rapid and efficient
(Box 2): Genomics information was generated within days and validated diagnostic
tools were approved within weeks [5,6]. A
global response was made possible through
tremendous research efforts enabled by
genomic research.
Access to and Documentation
of Sequence Data
Open access to genomics resources (i.e.,
raw sequence data and associated publications) is an essential component of the
nation’s preparedness for biological threats
(biopreparedness), whether intentionally
delivered or not. Although some consider
open-source genomic resources a threat to
security [7] because they make publicly
available information that could facilitate
the construction of dangerous infectious
agents, we strongly disagree with this point
of view. Rather, we and others [8] believe
that it is an enabling tool more useful to
those in charge of our public health and
biosecurity than to those with ill intentions. Genomic sequence data can provide
a starting point for the development of
new vaccines, drugs, and diagnostic tests
[9], hence improving public health capabilities and increasing our biopreparedness. Access to the organisms from which
the sequences are derived should be
restricted, not their genome sequences.
Citation: Fricke WF, Rasko DA, Ravel J (2009) The Role of Genomics in the Identification, Prediction, and
Prevention of Biological Threats. PLoS Biol 7(10): e1000217. doi:10.1371/journal.pbio.1000217
Published October 26, 2009
Copyright: © 2009 Fricke et al. This is an open-access article distributed under the terms of the Creative
Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium,
provided the original author and source are credited.
Competing Interests: The authors have declared that no competing interests exist.
The Perspective section provides experts with a
forum to comment on topical or controversial issues
of broad interest.
PLoS Biology | www.plosbiology.org
* E-mail: [email protected]
This article is part of the ‘‘Genomics of Emerging Infectious Disease’’ PLoS Journal collection (http://
ploscollections.org/emerginginfectiousdisease/).
1
October 2009 | Volume 7 | Issue 10 | e1000217
Author Summary
In all likelihood, it is only a matter of time before our public health system will
face a major biological threat, whether intentionally dispersed or originating from
a known or newly emerging infectious disease. It is necessary not only to increase
our reactive ‘‘biodefense,’’ but also to be proactive and increase our
preparedness. To achieve this goal, it is essential that the scientific and public
health communities fully embrace the genomic revolution, and that novel
bioinformatic and computing tools necessary to make great strides in our
understanding of these novel and emerging threats be developed. Genomics has
graduated from a specialized field of science to a research tool that soon will be
routine in research laboratories and clinical settings. Because the technology is
becoming more affordable, genomics can and should be used proactively to
build our preparedness and responsiveness to biological threats. All pieces,
including major continued funding, advances in next-generation sequencing
technologies, bioinformatics infrastructures, and open access to data and
metadata, are being set in place for genomics to play a central role in our public
health system.
Now that genomics technologies are
broadly available, there is the potential
for commercial interests to hamper the
release of genomic data in the public
domain. Thus it is important that federally
funded large-scale genome sequencing
efforts have enforceable rapid release
policies. This accessibility could afford
further opportunities to capitalize on
investments in genome sequencing by
providing the necessary resources to biopreparedness.
Whereas genome projects aimed at
sequencing one, two, or three isolates of
a pathogen seemed adequate a few years
ago, it is now possible to sequence rapidly
hundreds of individual genomes for each
species. Access to relevant, well-curated
culture collections [10] and DNA preparations suitable for sequencing may become a bottleneck in the future when
sequencing resources are no longer limiting. More importantly, the impact of large
genomic sequence datasets from clinical
isolates will be limited without key clinical
metadata that characterize these isolates,
such as patients’ medical information,
date of isolation, and the number of
culture passages in the laboratory. Open
access to large numbers of sequences and
associated metadata allows for powerful
comparative genomic analyses and thus
provides major insights into the characteristics of a pathogen.
Box 1. Hot Spots for the Emergence of Infectious Disease
Can we define ‘‘hot spots’’ of microbial populations where new infectious
diseases are more likely to evolve? Human contact with new types of infectious
agents precedes the emergence of infectious diseases. Infectious agents can be
new in the sense of not having previously infected humans or new in the sense
that a combination of preexisting genetic factors (for example, mobile elements
or regulatory elements) has reassembled to give rise to an infectious agent with
a substantially altered genome. The Ebola virus, which first emerged by infecting
humans in 1976 in Zaire [21], is an example of the former, whereas the acquisition of
antimicrobial resistance by Acinetobacter baumannii [22] is an example of the
latter. In both cases, a change in the selective pressure on an infectious agent
allows its emergence from a specific setting. This selective pressure may be, for
example, the new niche that the human host provides to the pathogen or the
antimicrobial selection on a pathogen. Since both events rely on preexisting
genetic resources and not on the de novo evolution of virulence factors, the
potential of a setting to serve as a hot spot or reservoir for an emerging infectious
disease is theoretically predictable from the examination of the total metagenome. In this scenario, traditional microbiological approaches that focus on single
isolates of bacteria or viruses are limited in their predictive power since they lack a
view of the complete genetic landscape. The potential infectious disease agent
could, however, arise from an environment that only contains pieces of a
‘‘virulence puzzle,’’ i.e., individual virulence factors encoded within the genomes
of different organisms (the metagenomic ‘‘gene soup’’). These pieces would have
to be assembled in one species for the new pathogen to emerge as an infectious
agent.
Standardized vocabulary should be developed to describe these isolates and the genes they
contain. Such efforts have already started,
for example through the open-access
journal Standards in Genome Sciences
(SIGS) (http://standardsingenomics.org/
index.php/sigen), but the dedicated resources are not adequate and highlight the
lack of understanding of the importance of
metadata in genomics. Initiatives such as
those of the Genomics Standards Consortium have made great strides [11,12], but
still need widespread implementation
from the ever-expanding genomic community. Open access to the genomic DNA
that has been sequenced or the culture
from which the DNA was extracted and to
the associated metadata is key to successful genome sequencing projects, whether
on single or several hundred genomes or
metagenomes. Well-documented genome
sequence data will form a key growing
resource for biodefense and other research fields.
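The point about metadata can be made concrete with a small sketch. The record below is a hypothetical illustration, not the Genomics Standards Consortium checklist or any existing database schema: the class name, field names, and accession strings are all assumptions chosen to mirror the metadata items named above (date of isolation, number of culture passages, patient information). It simply reports which metadata fields are missing for a sequenced isolate.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List, Optional


@dataclass
class IsolateRecord:
    """Hypothetical minimal record linking a genome sequence to its metadata."""
    isolate_id: str
    organism: str
    sequence_accession: str                 # placeholder identifier, not a real accession
    isolation_date: Optional[date] = None
    culture_passages: Optional[int] = None  # passages in the laboratory before sequencing
    patient_summary: Optional[str] = None   # de-identified clinical information


def missing_metadata(record: IsolateRecord) -> List[str]:
    """Return the names of fields that were left empty (None or empty string)."""
    return [f.name for f in fields(record)
            if getattr(record, f.name) in (None, "")]


complete = IsolateRecord("ISO-001", "Escherichia coli O157:H7", "ACC0001",
                         date(2006, 9, 14), 2, "adult, hospitalized")
partial = IsolateRecord("ISO-002", "Escherichia coli O157:H7", "ACC0002")

print(missing_metadata(complete))  # no fields missing
print(missing_metadata(partial))   # the three optional fields are missing
```

A check of this kind, run at submission time, is one plausible way a repository could flag sequences whose comparative value is limited by absent metadata.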
Emerging New Bioinformatics
Resources
As we enter a new era of modern
genomics, the ever-expanding sequence
datasets are becoming more challenging to
analyze. Future analysts will require powerful
new bioinformatics tools in conjunction with
new computer systems engineered with
genomic analysis in mind. New open-source
bioinformatics software tools are being
developed that exploit Web-based services
and the increasing computing power provided by academic and commercial ‘‘cloud
computing networks’’ (large computing resources provided as a service over the
Internet). For example, ‘‘Science Clouds’’
(http://workspace.globus.org/clouds/) allow
members of the scientific community to lease
cloud computing resources free of charge.
To leverage these capabilities, novel cloud-optimized bioinformatics tools are being
developed, such as the genome sequence
read mapper CloudBurst [13]. In addition,
novel resources are currently under development to increase the availability of open-source bioinformatics tools for cloud computing (http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0949201;
http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0844494). These emerging
tools make access to the World Wide Web the
only requirement to join the genomic revolution
and achieve large-scale bioinformatics analyses
that would not be possible on local servers. As a
consequence, it is conceivable that in the future
genomic research will increasingly move away
from the large sequencing centers toward a
more decentralized organization.
Box 2. Pandemic H1N1 2009 Influenza: A Recent Example of the
Impact of Genomics on Biopreparedness
Genomics can be readily applied to follow outbreaks of infectious diseases. This was
clearly illustrated during the severe acute respiratory syndrome (SARS) outbreak
in 2002–2003 and by the emergence and worldwide spread of the pandemic H1N1
2009 influenza virus this year. In both cases, genomics played a key role in the
immediate response to the outbreak. Initially, very little was known about the
virus responsible for the SARS outbreak. Pangenomic virus microarrays identified
it as a coronavirus [23]; however, it was only through detailed sequencing that the
specific genotype of this virus could be determined [24]. Comparative sequence
analysis identified the SARS virus as distinct from other coronaviruses in terms of
its encoded proteins responsible for antigen presentation. This finding ultimately
led to the development of diagnostics [25] and potential therapeutics [26]. This
example of a sequencing approach as a rapid response to a virus outbreak
demonstrates that genomics can be a useful and important, if not essential,
epidemiological tool. In the ongoing H1N1 influenza outbreak, the National
Center for Biotechnology Information (NCBI) established the Influenza Virus
Resource (a database and tool for flu sequence analysis, annotation, and
submission to GenBank; http://www.ncbi.nlm.nih.gov/genomes/FLU/SwineFlu.html),
containing 462 complete viral genome sequences from worldwide viral
samples (as of September, 2009). Some of the genomic data was completed,
compared, and released to the public within two weeks of isolation of the DNA.
The rapid generation of genome sequence data is providing a paradigm shift in
the analysis of infectious disease outbreaks, from more classical methods of
isolation to the rapid molecular examination of the pathogen in question.
Decentralized rapid genome sequencing and bioinformatic
analysis of infectious agents will enable near-real-time global surveillance and detection of new
pathogens, new virulence factors, antimicrobial resistance determinants, or engineered
organisms.
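To illustrate why read mapping lends itself to the distributed, cloud-based computing described above, the following toy sketch splits the task into map, shuffle, and reduce steps. It is not CloudBurst's implementation: the function names, the exact-match seeding, and the tiny sequences are assumptions made purely for illustration, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, Tuple[str, int]]


def map_phase(name: str, seq: str, k: int) -> List[Pair]:
    """Emit (k-mer, (sequence name, offset)) for every k-mer in one sequence."""
    return [(seq[i:i + k], (name, i)) for i in range(len(seq) - k + 1)]


def shuffle(pairs: List[Pair]) -> Dict[str, List[Tuple[str, int]]]:
    """Group emitted values by k-mer, as a MapReduce framework would."""
    groups: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for kmer, value in pairs:
        groups[kmer].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[Tuple[str, int]]],
                 ref_name: str) -> Set[Tuple[str, int]]:
    """For each k-mer shared by reference and read, propose a read start position."""
    candidates: Set[Tuple[str, int]] = set()
    for values in groups.values():
        ref_offsets = [off for name, off in values if name == ref_name]
        read_hits = [(name, off) for name, off in values if name != ref_name]
        for ref_off in ref_offsets:
            for read_name, read_off in read_hits:
                start = ref_off - read_off
                if start >= 0:
                    candidates.add((read_name, start))
    return candidates


# Hypothetical toy data; real reads and genomes are orders of magnitude larger.
reference = "ACGTTGCAAGGCT"
reads = {"read1": "TTGCAA", "read2": "GGGGGG"}
k = 4

pairs = map_phase("ref", reference, k)
for read_name, read_seq in reads.items():
    pairs += map_phase(read_name, read_seq, k)

candidates = reduce_phase(shuffle(pairs), "ref")

# Keep only candidates where the whole read matches the reference exactly.
verified = sorted((name, start) for name, start in candidates
                  if reference[start:start + len(reads[name])] == reads[name])
print(verified)
```

Because each k-mer group can be processed independently in the reduce step, the work can in principle be spread across many leased machines, which is the property that makes such tools attractive for laboratories without local computing infrastructure.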
Population Genomics Applied
to Single Cultures
Because the resources for affordable
high-throughput sequencing, data processing, and analysis are available, the
time is right to think about microbial
population genomics and large-scale microbial metagenomics in the context of
biodefense research (Box 3). Traditionally, the concept of population genomics
has applied to variation within a species.
However, a bacterial culture, even if
derived from a single clone, is composed
of millions of cells that are not necessarily
identical at the genome sequence level,
hence forming a population of genomes.
Therefore we propose to apply the
concept of population genomics to microbial cultures. The assemblage of
genotypes defines what is called a ‘‘culture,’’ ‘‘culture stock,’’ or ‘‘reference
strain.’’ Population genomics addresses
the genomic diversity within these assemblages and has significant implications for
many fields of research but, most importantly, for pathogen evolution, diagnostics, epidemiology, and microbial forensics. For example, following the anthrax
mail attacks of 2001, microbiologists and
genomicists joined forces to characterize
the unique genetic traits of the Bacillus
anthracis spores recovered from the envelopes, which were quickly identified as
the B. anthracis Ames strain (DAAR et al.,
unpublished data). Sequencing the genome of several single colonies obtained
from the spores revealed that the entire
chromosome and its associated plasmids
were 100% identical to the genome
sequence of the ancestral B. anthracis
Ames strain that was stored for over 20
years in a military laboratory in Frederick, Maryland. The only genotypic differences were found in a small, phenotypically and genetically distinct portion
of cells grown from the spores used in the
attacks. Genomic characterization of
these phenotypic variants revealed a
number of unique genetic alterations that
together provided a characteristic DNA
fingerprint of the spore population that
could be unequivocally matched to the
spore sample used in the attacks. Using
this fingerprint, a genetic assay was
developed to screen a B. anthracis spore
repository, which identified the origin of
the spores as a single spore stock of B.
anthracis Ames. This stock was stored at
the US Army Medical Research Institute
for Infectious Diseases in Fort Detrick,
Maryland, narrowing the pool of suspects
to a manageable number (those who had
access to the spore stock) for the investigative team. The police investigation that
followed identified a potential suspect as
the custodian of the spore stock. This was
the first use of microbial genomics as an
essential tool in a forensic investigation.
In the course of the investigation, scientists had to establish culture repositories
from strains used in research in the US
and build databases of genome sequences
of all B. anthracis isolates. This work took
several years and delayed the investigation significantly. A lesson to be learned
from this investigation should therefore
be that there is a need for comprehensive
databases of unique DNA fingerprints of
stocks of potentially threatening pathogens. In the event that another bioterror
attack were to take place, such genomic
databases would be key in quickly
establishing the source of the biological
material.
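The idea of a fingerprint database can be sketched in a few lines. The example below is a hypothetical illustration and not the assay used in the investigation: the stock names, variant labels, and scoring rule are assumptions. Each stock is represented by the set of characteristic variants detected in it, and stocks are ranked by the fraction of the evidence sample's variants that they also carry.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical repository: stock name -> characteristic variants found in that stock.
REPOSITORY: Dict[str, FrozenSet[str]] = {
    "stock_A": frozenset({"var1", "var2", "var3", "var4"}),
    "stock_B": frozenset({"var1"}),
    "stock_C": frozenset(),
}


def rank_stocks(evidence: FrozenSet[str],
                repository: Dict[str, FrozenSet[str]]) -> List[Tuple[str, float]]:
    """Rank stocks by the fraction of evidence variants each one shares."""
    if not evidence:
        raise ValueError("evidence fingerprint must contain at least one variant")
    scores = [(name, len(evidence & variants) / len(evidence))
              for name, variants in repository.items()]
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


# Variants observed in the (hypothetical) evidence sample.
evidence = frozenset({"var1", "var2", "var3", "var4"})
ranking = rank_stocks(evidence, REPOSITORY)
print(ranking)
```

With such a database assembled in advance, the comparison itself takes seconds; the years of delay described above came from having to build the repositories and sequence collections after the fact.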
The concept of population genomics also
applies to epidemiological studies of outbreaks of infectious diseases such as those
caused by food-borne or zoonotic pathogens, such as Salmonella spp. Traditionally,
epidemiologists and pathologists have used
low-resolution methods such as pulsed-field
gel electrophoresis (PFGE), multi-locus
sequence typing (MLST), or multi-locus
variable number tandem repeats analysis
(MLVA) to trace an individual isolate from
a patient back to a potentially infected food
source or to isolates from other patients
[14–17]. In 2006, for example, during an
outbreak of pathogenic Escherichia coli
O157:H7 infections in 26 states of the
US, which was caused by contaminated
spinach, isolates of the pathogen were
recovered from cows and wild pigs (the
zoonotic reservoirs), bags of spinach (the
vehicle of transmission), and ill patients
(http://www.cdc.gov/mmwr/preview/
mmwrhtml/mm55d926a1.htm). One
of these isolates was designated as the
reference for the outbreak based on
conserved PFGE patterns. Genome
sequencing of several isolates from the
same outbreak performed in our laboratory, however, revealed genomic
variations that questioned a direct
evolutionary link between all outbreak-associated isolates (Eppinger
et al., unpublished data). Comparative
genomics followed by whole-genome
phylogenetic analyses based on single
nucleotide polymorphisms demonstrated that these isolates were indeed
closely related to one another and only
distantly related to other E. coli
O157:H7 isolates, hence linking all
isolates to the same outbreak, something that was not possible using PFGE
patterns. In this case, phylogenetic
analyses suggest that several highly
related genotypes were at the source
of the outbreak, thus challenging the
October 2009 | Volume 7 | Issue 10 | e1000217
Box 3. Simple Genomics, Population Genomics, and
Metagenomics
ciated microbial communities (e.g., Vibrio cholerae, the etiologic agent of cholera)
but potentially also by slight shifts in the
proportions of different populations within the community that give an otherwise harmless species or strain an undesirable advantage over others, a similar situation to what is observed in
bacterial vaginosis [20]. Probiotic dietary supplements of live microorganisms
deliver beneficial bacteria that promote
a healthy state of the targeted microbiota. Hypothetically, the opposite is
also plausible: the healthy microbiota
(skin, gut, or upper respiratory tract,
among others) could be disturbed by
introducing large amounts of ‘‘contrabiotics,’’ i.e., living nonpathogenic bacteria that would shift the microbiota
away from a healthy state. A better
understanding of the ecological principles that shape the composition of our
microbiome might contribute to our
biopreparedness for such a threat to
public health.
The field of biodefense has thoroughly
embraced genomics and made it a
keystone for developing better identification technologies, diagnostic tools, and
vaccines and improving our understanding of pathogen virulence and evolution.
Enabling technologies and bioinformatics tools have shifted genomics from
a separate research discipline to a tool so
powerful that it can provide novel
insights that were not imaginable a few
years ago, including for example redefining the notion of strains or cultures in the
context of biopreparedness or microbial
forensics. Challenges remain, though,
mostly because the large amounts of
data now being generated, and that will
continue to be generated,
are becoming difficult to manage.
Better bioinformatic algorithms, access to faster computing capabilities, larger or novel and more efficient
data storage devices, and better training
in genomics are all in critical demand
and will be required to fully embrace the
genomic revolution. Our nation’s preparedness for biological threats, whether
they are deliberate or not, and our public
health system would benefit greatly by
leveraging these capabilities into better
real-time diagnostics (in the environment
as well as at the bedside), vaccines, a
greater understanding of the evolutionary process that makes a friendly microbe
become a pathogen (Box 3) (hence to
better predict what microbial foes will be
facing us in the near future), and better
forensics and epidemiological tools. The
time is right to be bold and capitalize on
these enabling technological advances to
sequence microbial species or complex
microbial communities to the greatest
level possible (that is, hundreds of genomes per species or sample), but let us
not forget that informatics and computing resources are now becoming the
bottleneck to actually making major
progress in this field.
4. Franco C (2008) Billions for biodefense: Federal
agency biodefense funding, FY2008-FY2009.
Biosecur Bioterror 6: 131–146.
5. Rowe T, Abernathy RA, Hu-Primmer J,
Thompson WW, Lu X, et al. (1999) Detection
of antibody to avian influenza A (H5N1)
virus in human serum by using a combination of serologic assays. J Clin Microbiol 37:
937–943.
6. Maurer-Stroh S, Ma J, Lee RT, Sirota FL,
Eisenhaber F (2009) Mapping the sequence
mutations of the 2009 H1N1 influenza A virus
neuraminidase relative to drug and antibody
binding sites. Biol Direct 4: 18.
7. Aldhous P (2001) Biologists urged to address risk
of data aiding bioweapon design. Nature 414:
237–238.
8. Read TD, Parkhill J (2002) Restricting genome
data won’t stop bioterrorism. Nature 417: 379.
9. Bambini S, Rappuoli R (2009) The use of
genomics in microbial vaccine development.
Drug Discov Today 14: 252–260.
10. Tindall BJ, Garrity GM (2008) Proposals to clarify
how type strains are deposited and made available to
the scientific community for the purpose of systematic
research. Int J Syst Evol Microbiol 58: 1987–1990.
11. Garrity GM, Field D, Kyrpides N, Hirschman L,
Sansone SA, et al. (2008) Toward a standards-
It is now technically possible and scientifically desirable to combine sequencing
projects on single genomes, genome populations, and metagenomes to study
genome evolution. Single-genome projects provide the greatest resolution for
identifying genetic factors responsible for specific virulence phenotypes and
provide answers to many important questions, such as: What is the minimal gene
set in a pathogen required to cause a specific disease phenotype? What does the
genetic context of virulence or antibiotic resistance factors tell us about their
evolutionary origin or the mobility between different microbial species or even
genera? Population-level genome sequencing projects provide us with information about the pangenomic gene pool and the potential of a species to evolve
into a novel pathogen. Are certain bacterial species or strains more likely than
others to evolve pathogenic traits? What distinguishes a commensal from a
pathogenic isolate? What provides the trigger or ability to convert a commensal
or opportunistic strain into a pathogen? What role does horizontal gene transfer
play in species evolution? Is an infection always caused by an individual isolate or
might infection be caused by a combination of individuals in a population that all
have different attenuated infectious potentials? Metagenomics projects sample
the genetic reservoir (the set of genes carried by all members of a community)
within a specific environment or sample. This ‘‘gene soup’’ reflects the maximum
genetic potential accessible to individual isolates by horizontal gene transfer.
utility of assigning a single reference
strain to a specific outbreak. Instead,
collecting and sequencing tens or
hundreds of isolates from each source
or patient linked to an outbreak would
provide a better basis for understanding the genomic diversity within the
outbreak population and would aid in
defining the population dynamics of an
outbreak.
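The SNP-based comparison of outbreak isolates described above can be illustrated with a minimal sketch. It assumes the genomes have already been aligned to equal length; the isolate names and the short toy sequences are hypothetical and are not data from the spinach outbreak. Pairwise SNP counts of this kind are only the raw input to a whole-genome phylogenetic analysis, not a substitute for one.

```python
from itertools import combinations

def snp_distance(a: str, b: str) -> int:
    """Count sites where two aligned sequences carry different unambiguous bases."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Gaps and ambiguity codes are skipped rather than counted as differences.
    return sum(1 for x, y in zip(a, b) if x != y and x in "ACGT" and y in "ACGT")

def pairwise_snp_matrix(aligned: dict) -> dict:
    """Return SNP counts for every pair of isolates, keyed by sorted name pair."""
    return {
        (p, q): snp_distance(aligned[p], aligned[q])
        for p, q in combinations(sorted(aligned), 2)
    }

# Hypothetical toy alignment: two outbreak isolates and one unrelated strain.
isolates = {
    "spinach_1": "ACGTACGTAC",
    "patient_1": "ACGTACGTAT",
    "unrelated": "TCGAACTTAT",
}
distances = pairwise_snp_matrix(isolates)
```

In this toy alignment the two outbreak isolates differ at a single site while each differs from the unrelated strain at several, which is the pattern that would link isolates to a common outbreak population.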
A New Concept: Contrabiotics
Insufficient attention has been paid to
the human microbiome (i.e., the consortium of microbes that inhabit the human
body) as it relates to our efforts to
increase biopreparedness. New analyses
of the diversity and composition of the
human microbiome are making it increasingly clear that human health
depends on a delicate equilibrium between the microbial inhabitants and the
human host [18,19]. Severe effects on
health could be caused not only by the
introduction of true pathogens in the
traditional sense into these human-asso-
Challenges for the Future
References
1. Fleischmann RD, Adams MD, White O,
Clayton RA, Kirkness EF, et al. (1995) Whole-genome random sequencing and assembly of
Haemophilus influenzae Rd. Science 269:
496–512.
2. Guzman E, Romeu A, Garcia-Vallve S (2008)
Completely sequenced genomes of pathogenic
bacteria: A review. Enferm Infecc Microbiol Clin
26: 88–98.
3. Binnewies TT, Motro Y, Hallin PF, Lund O,
Dunn D, et al. (2006) Ten years of bacterial
genome sequencing: Comparative-genomics-based discoveries. Funct Integr Genomics 6:
165–185.
compliant genomic and metagenomic publication
record. OMICS 12: 157–160.
12. Field D, Garrity GM, Sansone SA, Sterk P,
Gray T, et al. (2008) Meeting report: The fifth
Genomic Standards Consortium (GSC) workshop. OMICS 12: 109–113.
13. Schatz MC (2009) CloudBurst: Highly sensitive
read mapping with MapReduce. Bioinformatics
25: 1363–1369.
14. Gerner-Smidt P, Hise K, Kincaid J, Hunter S,
Rolando S, et al. (2006) PulseNet USA: A five-year update. Foodborne Pathog Dis 3: 9–19.
15. Urwin R, Maiden MC (2003) Multi-locus sequence typing: A tool for global epidemiology.
Trends Microbiol 11: 479–487.
16. Keim P, Price LB, Klevytska AM, Smith KL,
Schupp JM, et al. (2000) Multiple-locus variable-number tandem repeat analysis reveals genetic
relationships within Bacillus anthracis. J Bacteriol
182: 2928–2936.
17. Boxrud D, Pederson-Gulrud K, Wotton J,
Medus C, Lyszkowicz E, et al. (2007) Comparison of multiple-locus variable-number tandem
repeat analysis, pulsed-field gel electrophoresis,
and phage typing for subtype analysis of Salmonella
enterica serotype Enteritidis. J Clin Microbiol 45:
536–543.
18. Gao Z, Tseng CH, Strober BE, Pei Z, Blaser MJ
(2008) Substantial alterations of the cutaneous
bacterial biota in psoriatic lesions. PLoS One 3:
e2719.
19. Turnbaugh PJ, Ley RE, Mahowald MA,
Magrini V, Mardis ER, et al. (2006) An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444: 1027–1031.
20. Srinivasan S, Fredricks DN (2008) The human
vaginal bacterial biota and bacterial vaginosis.
Interdiscip Perspect Infect Dis 2008: 750479.
21. Pourrut X, Kumulungui B, Wittmann T,
Moussavou G, Delicat A, et al. (2005) The
natural history of Ebola virus in Africa. Microbes
Infect 7: 1005–1014.
22. Peleg AY, Seifert H, Paterson DL (2008)
Acinetobacter baumannii: Emergence of a successful
pathogen. Clin Microbiol Rev 21: 538–582.
23. Wang D, Urisman A, Liu YT, Springer M,
Ksiazek TG, et al. (2003) Viral discovery and
sequence recovery using DNA microarrays. PLoS
Biol 1: e2. doi:10.1371/journal.pbio.0000002.
24. Marra MA, Jones SJ, Astell CR, Holt RA,
Brooks-Wilson A, et al. (2003) The genome
sequence of the SARS-associated coronavirus.
Science 300: 1399–1404.
25. Zhu M (2004) SARS immunity and vaccination.
Cell Mol Immunol 1: 193–198.
26. Haagmans BL, Osterhaus AD (2006) Coronaviruses and their therapy. Antiviral Res 71:
397–403.
Perspective
Discovering the Phylodynamics of RNA Viruses
Edward C. Holmes1,2*, Bryan T. Grenfell2,3
1 Center for Infectious Disease Dynamics, Department of Biology, The Pennsylvania State University, Mueller Laboratory, University Park, Pennsylvania, United States of
America, 2 Fogarty International Center, National Institutes of Health, Bethesda, Maryland, United States of America, 3 Department of Ecology and Evolutionary Biology
and Woodrow Wilson School, Princeton University, Princeton, New Jersey, United States of America
Phylodynamics: The Discovery
Phase
The advent of extremely high throughput DNA sequencing ensures that genomic
data from microbial organisms can be
acquired in unprecedented quantities and
with remarkable rapidity. Although this
genomic revolution will affect all microbes
alike, our focus here is on RNA viruses, as
the rapidity of their evolution, which is
observable over the time scale of human
observation, allows phylodynamic inferences to be made with great precision. In
the foreseeable future it is likely that
complete genome sequencing will become
the standard method of viral characterization, providing the highest possible resolution for phylogenetic studies. The rapidity with which genome sequence data were
generated from the ongoing epidemic of
swine-origin H1N1 influenza A virus [1] is
testament to the power of this technology.
Understandably, pathogen discovery is
a major focus of this new-scale genome
sequencing [2]. It is now possible to
sequence the entire assemblage of viruses
in a particular tissue type or host species
[3–5], as well as all those viruses that are
associated with specific disease syndromes
[6,7]. In essence, this new era of metagenomics constitutes a crucial taxonomic
discovery phase in virology and epidemiology that allows the genetic characterization of new viruses within hours of their
isolation.
Assembling an inventory of viruses that
may emerge in human populations is of
major importance to public health and to
students of biodiversity. However, it is only
the first step in developing a full quantitative understanding of the processes that
shape the epidemiology and evolution—
the phylodynamics—of RNA virus infections [8]. To achieve this goal, we argue
here that the field of viral phylodynamics
requires its own discovery phase; that is, a
comprehensive and quantitative analysis
of the interaction between the ecological
and evolutionary dynamics of all circulating RNA viruses from the molecular to the
global scale. Such a marriage of phylogenetic and epidemiological dynamics is
currently possible only for the
select few human viruses for which large
genome sequence datasets have been
acquired, such as HIV and influenza A
virus, and even here fundamental gaps in
our knowledge remain (see below). Indeed,
it is striking that so few complete genome
sequences are currently available for
viruses whose epidemiological dynamics
are known in exquisite detail, such as
measles [9,10]; these sequences have been
so sparsely sampled in both time and space
that a full phylodynamic perspective has
not yet been achieved. We contend that a
better understanding of RNA virus phylodynamics will allow more directed attempts at pathogen surveillance, facilitate
more accurate predictions of the epidemiological impact of newly emerged viruses,
and assist in the control of those viruses
that exhibit complex patterns of antigenic
variation such as dengue and influenza.
Just as PCR and first-generation DNA
sequencing ushered in the science of
molecular epidemiology, so next-generation sequencing may herald the age of
phylodynamics. Box 1 lists a number of
key questions that can be addressed within
this phylodynamics research program.
A number of important advances are
needed to meet our goal of a comprehensive catalog of the diversity of phylodynamic patterns in RNA viruses. Because
answers to many of the most interesting
research questions depend on sufficiently
large sample sizes, we require large
numbers of sequences that have been
rigorously sampled according to strict
temporal, spatial, and clinical criteria,
and as many of these data as possible must be publicly
accessible. A phylodynamic
analysis has little value unless viral genomes are sampled on the same scale as
the epidemiological processes under investigation.
The only acute virus for which a suitably
expansive genome dataset currently exists is
influenza. In this case, the .4,000 complete genomes generated under the Influenza Genome Sequencing Project [11]
have provided important new insights into
the evolution and epidemiology of this
major human pathogen [12]. To highlight
one key insight here, these genome sequence data have revealed that multiple
lineages of influenza virus are imported and
circulate within specific geographic localities (even within relatively isolated populations), generating both frequent mixed
infections [13] and reassortment events
[14]. Even so, the sampling of these
genome sequences (and associated epidemiological covariates) may not be dense
enough to fully capture spatial dynamics
[15]. There is also a marked absence of
samples from asymptomatically infected
patients (or those with mild disease), so it
is impossible to link genetic variation to
clinical syndrome. Such a bias against
viruses sampled from individuals with
asymptomatic infections is a common
problem in molecular epidemiology.
Epidemiological Factors
It is also clear that for many RNA
viruses we need to better understand a
Citation: Holmes EC, Grenfell BT (2009) Discovering the Phylodynamics of RNA Viruses. PLoS Comput
Biol 5(10): e1000505. doi:10.1371/journal.pcbi.1000505
Editor: Ernest Fraenkel, Massachusetts Institute of Technology, United States of America
Published October 26, 2009
Copyright: © 2009 Holmes, Grenfell. This is an open-access article distributed under the terms of the
Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any
medium, provided the original author and source are credited.
Funding: BTG was supported by the RAPIDD program of the Science & Technology Directorate of the
Department of Homeland Security and the National Institutes of Health (NIH), and National Science Foundation
grant 0742373. ECH was supported by the NIH (grant GM080533). The funders had no role in study design, data
collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
This article is part of the ‘‘Genomics of Emerging Infectious Disease’’ PLoS Journal collection (http://
ploscollections.org/emerginginfectiousdisease/).
PLoS Computational Biology | www.ploscompbiol.org
October 2009 | Volume 5 | Issue 10 | e1000505
Box 1. Key Research Questions in RNA Virus Phylodynamics
(1) What is the range of phylodynamic patterns observed in RNA viruses? Can they
be categorized into specific groups? How do these patterns relate to other ‘‘life
history’’ variables exhibited by RNA viruses?
(2) What epidemiological and evolutionary processes give rise to these phylodynamic
patterns? What generalities can be drawn?
(3) How commonly does natural selection (compared to neutral evolutionary
processes) determine the population dynamics of pathogens? On what scale does
natural selection act? How does viral immune escape reduce herd immunity at the
population level and allow the persistence of viral lineages in epidemic troughs?
(4) What is the range of spatial patterns exhibited by RNA viruses? What
epidemiological factors are responsible for these patterns?
(5) How do different viral species (various respiratory viruses, for example) interact
in host immunity?
number of key epidemiological factors,
such as the interaction between local
persistence, epidemic dynamics in both
time and space, the impact of measures to
control the spread of infection, and the
consequences of adaptive evolution in
those viral genes that interact most
intimately with the host immune response.
It is instructive to imagine the ideal
database for addressing these issues. In
the case of acute infections, the goal would
be to collect four parallel datasets on the
appropriate scale of interest during outbreaks (Figure 1). This database would
comprise, first, epidemic dynamics in time and
space, ideally at a comparable or higher
frequency than the generation time of
individual infections. Second, and in
parallel, our ideal study would collect viral
genome sequence data at these time points,
sampling both within and among infected
hosts. Both disease incidence data (bolstered by contact tracing) and viral
sequence data furnish information on the
transmission network traced by an outbreak. Third, we would need to know the
underlying contact network of susceptible
individuals, which serves as fuel for the
epidemic. This is a difficult structure to
measure directly, although novel measurements
of human interactions are increasingly shedding light on the problem [16].
Finally, measurements of the immunity
structure of our contact network [17]—
reflecting the past history of the virus in
the population—are key for understanding
both the dynamics of epidemic spread and
the evolutionary pressures that shape virus
diversity.
The outbreak of foot-and-mouth disease
(FMD, an RNA virus infection of cattle) in
the UK in 2001 resulted in a database that
is arguably closest to our ideal on the
epidemiological scale [18,19]. Notwithstanding a variety of gaps in data from
the epidemic [20], it is one of the
best-documented large outbreaks in terms
of the availability of spatiotemporal incidence data in parallel with contact tracing
and the underlying spatial pattern of the
susceptible farms as a measure of the
contact network. In addition, analyses of
viral sequences from relatively small samples of farms have drawn important
conclusions about epidemic spread and
allowed the testing of new methods to
recover the spatiotemporal patterns written into sequence data [18,20]. Importantly, samples exist from over half the
~2,000 confirmed infected premises in
2001: sequencing whole FMD virus genomes from these samples would provide a
vast resource for basic and applied developments
in integrating epidemiological
and phylogenetic information to dissect
spatiotemporal spread. We suggest that
achieving this task would be a huge
contribution to understanding the phylodynamics of acute viruses. Another virtue
of animal infections like FMD is that the
relationship between the determinants of
viral variability within and between hosts
can also be dissected by experimental
infections (see [21] for another example).
Figure 1. Sampling scales for acute RNA viruses and the associated phylodynamic processes that viral genome sequence data and
host sampling can elucidate.
doi:10.1371/journal.pcbi.1000505.g001
A parallel limitation of many phylogenetic approaches to viral epidemiology is
that they have often proceeded in the
absence of the necessary metadata, such as
the precise time and place of sampling or
those that relate to clinical syndrome [22].
A perhaps more challenging goal for
phylodynamics is therefore to integrate
phylogenetic patterns with other biological
variables, such as the nature of antigenic
variation, the capacity for drug resistance,
or the clinical syndrome of the host, as well
as the spatial host network data outlined
above. Cohort studies may be the most
productive way to link genomics with
epidemiological variables.
The lack of a synthesis of phylogenetic
and phenotypic/epidemiological data is
reflected in the current debate over the
mode of antigenic evolution in human
influenza A virus. Although it has long
been known that the hemagglutinin (HA)
and neuraminidase (NA) proteins of human
influenza A virus evolve by strong
natural selection to evade the host immune
response—a process commonly called
antigenic drift [23,24]—the precise mechanisms by which such drift occurs are
uncertain. From a phylodynamics perspective, the key observation is that over long
time periods a single lineage of HA
sequences from subtype A/H3N2 influenza viruses links epidemic to epidemic [23],
although intensive sampling has revealed
that single populations may harbor far
higher levels of genetic diversity [25].
Rather different phylodynamic patterns
are seen in other influenza viruses, including those sampled from birds (Figure 2).
Three models have been proposed to
explain the distinctive phylodynamic pattern observed in human A/H3N2 viruses:
(i) that there is short-lived cross-immunity
among viral strains [26], (ii) that the HA
evolves in a punctuated manner among
antigenic types that are linked by a
network of neutrally evolving sites [27],
and (iii) that the virus continually reuses a
limited number of antigenic combinations
[28].
To determine which combination of
these models best explains influenza phylodynamics will require more expansive
genome sequence data, as well as focused
sampling and epidemiological surveillance
in Southeast Asia, which is likely the global
source population for the virus [29]. More
importantly, it is also crucial that these
phylogenetic data are combined with
detailed, spatiotemporally disaggregated
antigenic information. Indeed, it is remarkable that despite the abundance of
information on the antigenic characteristics of individual influenza viruses, most
notably through the use of the hemagglutination inhibition (HI) assay [17], these data
have not been routinely linked to phylogenetic information. It is clear that both
antigenic and phylogenetic analyses would
greatly benefit from each other.
New-Generation Computational
Tools
Another important challenge for phylodynamics is to match the remarkable
ongoing developments in genome sequencing technology to the increase in
the power of the computational tools
available to analyze these sequence data.
Crucially, in phylogenetics, the size of the
space of possible trees increases faster than
exponentially with the number of sequences, such that the availability of datasets
comprising thousands of complete genomes [30] presents a major combinatorial problem. This problem creates a
growing discrepancy between our ability
to generate genome sequence data and our
capacity to analyze them using the most
sophisticated methods. Redressing this
balance should be the major goal of
bioinformatics in the future; and in fact
some progress has been made recently
[31].
Figure 2. Phylodynamic patterns of human and avian influenza viruses. The left diagram shows the phylogeny of the hemagglutinin (HA)
gene of human H3N2 influenza A viruses sampled between 1985 and 2005, revealing the ‘‘ladder-like’’ branching structure indicative of antigenic
drift. By comparison, the phylogeny of the HA gene of human influenza B virus sampled over the same interval (center diagram) shows the co-circulation of the antigenically distinct ‘‘Victoria 1987’’ and ‘‘Yamagata 1988’’ lineages, as well as a shorter length from root to tip, reflecting a lower rate
of evolutionary change. Finally, the phylogeny for the HA gene of H4 avian influenza virus (right diagram) reveals the deep geographic division
between the Eurasian and Australian versus North American lineages of this virus.
doi:10.1371/journal.pcbi.1000505.g002
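The combinatorial problem noted in the text can be made concrete with the standard counts of labeled bifurcating trees: (2n-5)!! unrooted and (2n-3)!! rooted topologies for n sequences. The short sketch below simply evaluates these double factorials; the function names are illustrative and are not taken from any phylogenetics package.

```python
from math import prod

def num_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating topologies for n labeled tips: (2n-5)!!."""
    if n < 3:
        return 1
    # Product of the odd integers 3, 5, ..., 2n-5 (empty product is 1 for n = 3).
    return prod(range(3, 2 * n - 4, 2))

def num_rooted_trees(n: int) -> int:
    """Number of rooted bifurcating topologies for n labeled tips: (2n-3)!!."""
    if n < 2:
        return 1
    # Product of the odd integers 3, 5, ..., 2n-3.
    return prod(range(3, 2 * n - 2, 2))

if __name__ == "__main__":
    for n in (4, 10, 50):
        # The digit count shows how quickly exhaustive search becomes hopeless.
        print(n, len(str(num_unrooted_trees(n))), "digits")
```

Ten sequences already admit more than two million unrooted topologies, so datasets of thousands of genomes can only be analyzed with heuristic or sampling-based tree searches.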
It is also clear that improvements need
to be made to the methods that are
available to analyze genome sequence
data. A powerful set of research tools in
this area comprises those based on coalescent theory, as this provides a natural link
between the analysis of epidemiological
and phylogenetic patterns [8,32]. In particular, the coalescent allows the demographic characteristics of viral populations
(particularly population size and growth
rate) to be inferred directly from gene
sequence data. Coalescent analyses are
especially powerful in the case of RNA
viruses, because their rapid evolution
means that temporal and spatial dynamics
are discernable over the period of human
observation [33] and can in theory be
combined with time series epidemiological
data. However, currently available coalescent methods are restricted by the limited
scope of demographic models and their
inability to fully incorporate spatial information. In particular, most acute RNA
viruses have complex population dynamics
that combine distinct periods of growth
and decline. The most commonly used
phylodynamic tool available in such cases
is the Bayesian skyline plot (and the related
Bayesian ‘‘skyride’’ [34]), which represents
a piecewise graphical depiction of changes
in genetic diversity through time [32]. In
the case of neutral evolution, such changes
in genetic diversity also reflect underlying
changes in the number of infected hosts.
Although the Bayesian skyline plot can
reveal unique features of epidemic dynamics (Figure 3) [30], precise estimates of
parameters such as population growth rate
are not yet possible.
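How genetic diversity through time is read from a genealogy can be sketched with the classic (non-Bayesian) skyline estimator, a simpler relative of the Bayesian skyline plot discussed above and not the method of [32] or [34]. Under the standard neutral coalescent, the expected waiting time while k lineages coexist is Neτ divided by k(k-1)/2, so each observed interval yields a piecewise estimate of Neτ. The interval data below are hypothetical, and a known, error-free genealogy is assumed.

```python
from math import comb

def classic_skyline(intervals):
    """Piecewise estimates of Ne*tau from coalescent intervals.

    intervals: list of (k, w) pairs, where k is the number of lineages present
    during an interval and w is its duration, in the same time units as the
    resulting Ne*tau estimates.
    """
    estimates = []
    for k, w in intervals:
        if k < 2 or w < 0:
            raise ValueError("need k >= 2 lineages and a non-negative duration")
        # Moment estimator: interval duration times the number of lineage pairs.
        estimates.append(w * comb(k, 2))
    return estimates

# Hypothetical intervals from a four-tip genealogy, ordered from the tips
# toward the root; durations lengthen as lineages coalesce.
toy_intervals = [(4, 0.5), (3, 1.0), (2, 3.0)]
toy_estimates = classic_skyline(toy_intervals)
```

Here every interval gives the same estimate, the signature of a constant population; seasonal epidemics would instead appear as estimates that rise and fall across intervals.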
The coalescent methods commonly
used to study RNA virus evolution focus
largely on temporal dynamics (a natural
function of the rapidity of viral evolution),
with little consideration of patterns of
spatial diffusion. Although these phylogeographic patterns are becoming increasingly well described for RNA viruses [35], few
methods effectively recover the spatial
component in genome sequence data.
For example, commonly used parsimonybased approaches consider a single phylogenetic tree without an explicit spatial
model (see, for example, [36]). In addition,
these methods usually describe the place of
origin and direction of spread of viral
lineages without formal tests of competing
spatial hypotheses. As a specific case in
point, although gravity models (in which
patterns of viral transmission reflect the
size of and distance between population
centers) have been applied successfully to
morbidity and mortality data from human
influenza A virus to describe its spread
across the United States [37], they have
yet to be interpreted within a phylogenetic
setting. A clear push for the future should
therefore be the development of coalescent
tools that integrate the analysis of spatial
and temporal dynamics within a single
framework, with a focus on those that
combine phylogenetic data and information on the dynamics of the host contact
network of susceptible, infected, and
immune individuals.
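The gravity models mentioned above can be summarized in a few lines: the coupling between two population centers rises with their sizes and falls with the distance between them. The sketch below is a generic form of such a model; the exponents, the function name, and the example populations are assumptions chosen for illustration and are not the fitted values of [37].

```python
def gravity_coupling(pop_i, pop_j, distance, alpha=1.0, beta=1.0, gamma=2.0):
    """Relative transmission coupling between two population centers.

    Generic gravity form: pop_i**alpha * pop_j**beta / distance**gamma.
    The default exponents are illustrative assumptions, not estimates.
    """
    if distance <= 0:
        raise ValueError("distance must be positive")
    return (pop_i ** alpha) * (pop_j ** beta) / (distance ** gamma)

# Hypothetical centers: a large nearby pair versus a small distant pair.
near_large = gravity_coupling(8_000_000, 1_500_000, 150.0)
far_small = gravity_coupling(200_000, 100_000, 2_000.0)
```

Integrating such a model with phylogenetic data would mean asking whether lineage movements between locations inferred from a tree are better predicted by these couplings than by competing spatial hypotheses.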
Looking beyond the Consensus
Sequence
The vast majority of studies of RNA
virus evolution undertaken to date, particularly of those viruses that cause acute
infections, rely on the analysis of consensus
sequences in which the nucleotide shown
for any given site is the most common
among all the genomes within a patient.
Although the use of consensus sequences is
adequate for many aspects of molecular
epidemiology, in which complete genomes
may suffice to determine even tight
transmission chains [20], there is growing
evidence that key evolutionary processes
occur beyond the consensus. In particular,
extensive intra-host gene sequencing has
revealed the existence of minor viral
subpopulations within individual hosts that
are not detected by consensus sequencing
and that are sometimes of great phenotypic importance [38,39]. Given the intrinsically high mutation rates of RNA
viruses, as well as the immense size of
intra-host populations, such extensive genetic and phenotypic diversity is only to be
expected.
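The distinction between a consensus sequence and the minor subpopulations it hides can be shown with a small sketch. It assumes a set of aligned, equal-length viral genomes sampled from a single host; the 5% reporting threshold and the toy sequences are hypothetical choices, and real deep-sequencing data would also need to account for sequencing error.

```python
from collections import Counter

def site_counts(seqs):
    """Return a Counter of bases for each aligned site."""
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences must be aligned to the same length")
    return [Counter(column) for column in zip(*seqs)]

def consensus(seqs):
    """Most common base at each site among all genomes within the host."""
    return "".join(c.most_common(1)[0][0] for c in site_counts(seqs))

def minor_variants(seqs, min_freq=0.05):
    """List (site, base, frequency) for non-consensus bases at or above min_freq."""
    n = len(seqs)
    found = []
    for pos, counts in enumerate(site_counts(seqs)):
        top = counts.most_common(1)[0][0]
        for base, count in counts.items():
            if base != top and count / n >= min_freq:
                found.append((pos, base, count / n))
    return found

# Hypothetical intra-host sample: 8 genomes of one type, 2 of a minor variant.
sample = ["ACGT"] * 8 + ["ACAT"] * 2
```

For this sample the consensus is ACGT, yet a fifth of the viral population carries A at the third site, a variant that consensus sequencing alone would not report.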
Figure 3. Fluctuating genetic diversity of influenza A virus. The figure shows a Bayesian skyline plot of changing levels of genetic diversity
through time for the HA gene (165 sequences) of A/H3N2 virus sampled from the state of New York, US, during the period 2001–2003. The y-axis
depicts relative genetic diversity (Neτ, where Ne is the effective population size and τ the generation time from infected host to infected host), which
can be considered a measure of effective population size under strictly neutral evolution. Peaks of genetic diversity, reflecting the seasonal
occurrence of influenza, are clearly visible. See [30] for a more detailed analysis.
doi:10.1371/journal.pcbi.1000505.g003
A full description of the extent and
structure of intra-host viral genetic variation is critical for understanding evolutionary dynamics, informing on such issues
as the frequency of mixed infection, and
hence the degree and extent of crossimmunity; the frequency with which
antigenic variants are produced and
whether antigenic evolution can occur on
the time scale of individual infections; and
the size of the population bottleneck that
might accompany inter-host transmission.
As a case in point, it is commonly assumed
that viruses experience a severe population
bottleneck as they are transmitted to new
hosts, a phenomenon that greatly restricts
the power of natural selection to fix
advantageous mutations. Although this
assumption appears to be true in some
cases [40], whether this is a general
property of RNA viruses is unclear; the
evidence that multiple viral lineages can
be transmitted among hosts argues against
a narrow bottleneck in all cases [41]. To
more accurately determine the size of the
transmission bottleneck, analyses of intra-host genetic diversity along known transmission chains will be essential. On a
larger scale, it is unclear whether phylodynamic patterns differ within and among
hosts, and whether any differences among
these scales of analysis are qualitative or
quantitative.
Intra-host sequence data are also essential for understanding the process of cross-species virus transmission and emergence.
Key parameters in determining whether a
virus will adapt successfully to a new host
species include the extent of intra-host
genetic diversity, the fitness distribution of
the mutations produced, and how many of
these mutations will assist adaptation to
new host species [41–43]. No such data
are available for any acute RNA virus, so
testing models for viral emergence is
difficult. We believe, however, that understanding the mechanics of this adaptive
process is at least as important as surveying
for new emerging viruses.
Challenges for the Future
Our discussion has highlighted a number of key challenges for a successful
phylodynamic research agenda. These
challenges comprise data, theory, and
methodological issues, and are briefly
summarized as follows. First, with respect
to data, it is clear that more genome
sequences must be acquired and with
increased temporal and spatial precision.
For example, wherever possible, GenBank
records should contain the exact day and
precise latitude and longitude of sampling.
In addition, it is essential that these
sequence data be linked with the relevant
metadata, such as the associated clinical
syndrome and (if applicable) measure of
antigenicity. Similarly, it is essential that
equivalent genome sequence data be
acquired from multiple time points within
individual hosts. Second, in terms of
theory, it is crucial that we fully integrate
patterns of viral evolution across multiple
epidemiological scales, from within hosts,
to local outbreaks, and on to global
pandemics. Although the coalescent is
hugely useful in this respect, it is essential
that its theoretical framework be extended
to incorporate models of population
growth and decline that most accurately
reflect the population dynamics of acute
RNA viruses, in particular the dynamics of
the susceptible “denominator” that fuels
epidemics. Sequencing of all available
samples from the UK 2001 FMD epidemic would yield great scientific dividends
here. Third and finally, with respect to
methodology, new computational tools are
needed to rapidly make phylodynamic
inferences from genomic datasets that
may contain thousands of sequences and
that efficiently integrate genomic with
other forms of biological data. We hope
this review will stimulate research in all
these areas.
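As an illustration of the kind of record the first challenge implies, the sketch below bundles a sequence with an exact sampling date, coordinates, and optional clinical metadata, and shows how repeated samples from one host could be ordered in time. The field names and validation rules are hypothetical and do not correspond to any actual GenBank schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class SampledSequence:
    """A sequence plus sampling metadata (all field names are hypothetical)."""
    accession: str
    sequence: str
    sampled_on: date                  # exact day, not just a year
    latitude: float                   # decimal degrees
    longitude: float                  # decimal degrees
    host_id: Optional[str] = None     # links repeated samples from one host
    clinical_syndrome: Optional[str] = None

    def __post_init__(self):
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError("latitude out of range")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError("longitude out of range")

def within_host_series(records, host_id):
    """Return one host's samples in chronological order."""
    return sorted((r for r in records if r.host_id == host_id),
                  key=lambda r: r.sampled_on)

# Made-up records for illustration only.
records = [
    SampledSequence("X2", "ACGT", date(2002, 1, 9), 42.65, -73.75, host_id="h1"),
    SampledSequence("X1", "ACGA", date(2002, 1, 7), 42.65, -73.75, host_id="h1"),
    SampledSequence("X3", "ACGT", date(2002, 1, 8), 40.71, -74.01, host_id="h2"),
]
print([r.accession for r in within_host_series(records, "h1")])
```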
References
1. Novel Swine-Origin Influenza A (H1N1) Virus
Investigation Team, Dawood FS, Jain S, Finelli L,
Shaw MW, et al. (2009) Emergence of a novel
swine-origin influenza A (H1N1) virus in humans.
N Engl J Med 360: 2605–2615.
2. Lipkin WI (2009) Microbe hunting in the 21st
century. Proc Natl Acad Sci U S A 106: 6–7.
3. Cox-Foster DL, Conlan S, Holmes EC,
Palacios G, Evans JD, et al. (2007) A metagenomic survey of microbes in honey bee colony
collapse disorder. Science 318: 283–287.
4. Finkbeiner SR, Allred AF, Tarr PI, Klein EJ,
Kirkwood CD, et al. (2008) Metagenomic analysis
of human diarrhea: viral detection and discovery.
PLoS Pathog 4(2): e1000011. doi:10.1371/journal.
ppat.1000011.
5. Zhang T, Breitbart M, Lee WH, Run JQ, Wei CL,
et al. (2005) RNA viral community in human feces:
Prevalence of plant pathogenic viruses. PLoS Biol
4(1): e3. doi:10.1371/journal.pbio.0040003.
6. Palacios G, Druce J, Du L, Tran T, Birch C, et al.
(2008) A new arenavirus in a cluster of fatal
transplant-associated diseases. N Engl J Med 358:
991–998.
7. Palmenberg AC, Spiro D, Kuzmickas R,
Wang S, Djikeng A, et al. (2009) Sequencing
and analyses of all known human rhinovirus
genomes reveals structure and evolution. Science 324: 55–59.
8. Grenfell BT, Pybus OG, Gog JR, Wood JLN,
Daly JM, et al. (2004) Unifying the epidemiological and evolutionary dynamics of pathogens.
Science 303: 327–332.
9. Bjørnstad ON, Finkenstädt B, Grenfell BT (2002)
Dynamics of measles epidemics. I. estimating
scaling of transmission rates using a time series
SIR model. Ecol Monogr 72: 169–184.
10. Grenfell BT, Bjornstad ON, Finkenstädt BF
(2002) Dynamics of measles epidemics. II. Scaling
noise, determinism and predictability with the
time series SIR model. Ecol Monogr 72:
185–202.
11. Ghedin E, Sengamalay NA, Shumway M,
Zaborsky J, Feldblyum T, et al. (2005) Large-scale sequencing of human influenza reveals the
dynamic nature of viral genome evolution.
Nature 437: 1162–1166.
12. Nelson MI, Holmes EC (2007) The evolution of
epidemic influenza. Nat Rev Genet 8: 196–205.
13. Ghedin E, Fitch A, Boyne A, DePasse J, Bera J,
et al. (2009) Mixed infection and the genesis of
influenza diversity. J Virol 83: 8832–8841.
14. Nelson MI, Simonsen L, Viboud C, Miller MA,
Taylor J, et al. (2006) Stochastic processes are key
determinants of the short-term evolution of
influenza A virus. PLoS Pathog 2: e125.
doi:10.1371/journal.ppat.0020125.
15. Nelson MI, Edelman L, Spiro DJ, Boyne AR,
Bera J, et al. (2008) Molecular epidemiology of
A/H3N2 and A/H1N1 influenza virus during a
single epidemic season in the United States. PLoS
Pathog 4(8): e1000133. doi:10.1371/journal.
ppat.1000133.
16. Gonzalez MC, Hidalgo CA, Barabasi AL (2008)
Understanding individual human mobility patterns. Nature 453: 779–782.
17. Smith DJ, Lapedes AS, de Jong JC,
Bestebroer TM, Rimmelzwaan GF, et al.
(2004) Mapping the antigenic and genetic
evolution of influenza virus. Science 305:
371–376.
18. Cottam EM, Haydon DT, Paton DJ, Gloster J,
Wilesmith JW, et al. (2006) Molecular epidemiology of the foot-and-mouth disease virus outbreak
in the United Kingdom in 2001. J Virol 80:
11274–11282.
19. Keeling MJ, Woolhouse MEJ, Shaw DJ,
Matthews L, Chase-Topping M, et al. (2001)
Dynamics of the 2001 UK foot and mouth
epidemic: stochastic dispersal in a heterogeneous
landscape. Science 294: 813–817.
20. Cottam EM, Wadsworth J, Shaw AE,
Rowlands RJ, Goatley L, et al. (2008) Transmission pathways of foot-and-mouth disease virus in
the United Kingdom in 2007. PLoS Pathog 4(4):
e1000050. doi:10.1371/journal.ppat.1000050.
21. Hoelzer K, Shackelton LA, Holmes EC,
Parrish CR (2008) Within-host genetic diversity
of endemic and emerging parvoviruses of cats
and dogs. J Virol 82: 11096–11105.
22. Holmes EC (2007) Viral evolution in the genomic
age. PLoS Biol 5(10): e278. doi:10.1371/journal.
pbio.0050278.
23. Fitch WM, Leiter JME, Li X, Palese P (1991)
Positive Darwinian evolution in human influenza
A viruses. Proc Natl Acad Sci U S A 88:
4270–4274.
24. Webster RG, Laver WG, Air GM, Schild GC
(1982) Molecular mechanisms of variation in
influenza viruses. Nature 296: 115–121.
25. Holmes EC, Ghedin E, Miller N, Taylor J, Bao Y,
et al. (2005) Whole genome analysis of human
influenza A virus reveals multiple persistent
lineages and reassortment among recent H3N2
viruses. PLoS Biol 3(9): e300. doi:10.1371/
journal.pbio.0030300.
26. Ferguson NM, Galvani AP, Bush RM (2003)
Ecological and immunological determinants of
influenza evolution. Nature 422: 428–433.
27. Koelle K, Cobey S, Grenfell B, Pascual M (2006)
Epochal evolution shapes the phylodynamics of
interpandemic influenza A (H3N2) in humans.
Science 314: 1898–1903.
28. Recker M, Pybus OG, Nee S, Gupta S (2007)
The generation of influenza outbreaks by a
network of host immune responses against a
limited set of antigenic types. Proc Natl Acad
Sci U S A 104: 7711–7716.
29. Russell CA, Jones TC, Barr IG, Cox NJ,
Garten RJ, et al. (2008) The global circulation
of seasonal influenza A (H3N2) viruses. Science
320: 340–346.
30. Rambaut A, Pybus OG, Nelson MI, Viboud C,
Taubenberger JK, et al. (2008) The genomic and
epidemiological dynamics of human influenza A
virus. Nature 453: 615–619.
31. Suchard MA, Rambaut A (2009) Many-core
algorithms for statistical phylogenetics. Bioinformatics 25: 1370–1376.
32. Drummond AJ, Rambaut A, Shapiro B,
Pybus OG (2005) Bayesian coalescent inference
of past population dynamics from molecular
sequences. Mol Biol Evol 22: 1185–1192.
33. Drummond AJ, Pybus OG, Rambaut A,
Forsberg R, Rodrigo AG (2003) Measurably
evolving populations. Trends Ecol Evol 18:
481–488.
34. Minin VN, Bloomquist EW, Suchard MA (2008)
Smooth skyride through a rough skyline: Bayesian
coalescent-based inference of population dynamics. Mol Biol Evol 25: 1459–1471.
35. Holmes EC (2008) The evolutionary history and
phylogeography of human viruses. Annu Rev
Microbiol 62: 307–328.
36. Wallace RG, Hodac H, Lathrop RH, Fitch WM
(2007) A statistical phylogeography of influenza A
H5N1. Proc Natl Acad Sci U S A 104:
4473–4478.
37. Viboud C, Bjornstad ON, Smith DL, Simonsen L,
Miller MA, et al. (2006) Synchrony, waves, and
spatial hierarchies in the spread of influenza.
Science 312: 447–451.
38. Aaskov J, Buzacott K, Thu HM, Lowry K,
Holmes EC (2006) Long-term transmission of
defective RNA viruses in humans and Aedes
mosquitoes. Science 311: 236–238.
39. Jerzak G, Bernard KA, Kramer LD, Ebel GD
(2005) Genetic variation in West Nile virus from
naturally infected mosquitoes and birds suggests
quasispecies structure and strong purifying selection. J Gen Virol 86: 2175–2183.
40. Keele BF, Giorgi EE, Salazar-Gonzalez JF,
Decker JM, Pham KT, et al. (2008) Identification
and characterization of transmitted and early
founder virus envelopes in primary HIV-1
infection. Proc Natl Acad Sci U S A 105:
7552–7557.
41. Holmes EC (2009) The evolution and emergence
of RNA viruses. Oxford Series in Ecology and
Evolution. Harvey PH, May RM, eds. Oxford:
Oxford University Press.
42. Kuiken T, Holmes EC, McCauley J,
Rimmelzwaan GF, Williams CS, et al. (2006)
Host species barriers to influenza virus infections.
Science 312: 394–397.
43. Parrish CR, Holmes EC, Morens DM, Park EC,
Burke DS, et al. (2008) Cross-species viral
transmission and the emergence of new epidemic
diseases. Microbiol Mol Biol Rev 72: 457–470.
Perspective
Computational Resources in Infectious Disease:
Limitations and Challenges
Eva C. Berglund, Björn Nystedt, Siv G. E. Andersson*
Department of Molecular Evolution, Uppsala University, Uppsala, Sweden
Infectious diseases continue to be a
major cause of death in the human
population, with tuberculosis and malaria
affecting 500 million people and causing
1–2 million deaths annually [1]. The
situation is aggravated by the increasing
prevalence of antibiotic-resistant bacteria
and the risk that terrorists might use
infectious organisms to attack target
populations. During the past decade, we
have also witnessed the emergence of
many new pathogens not previously detected in humans, such as avian
influenza virus, the coronavirus that causes severe acute respiratory
syndrome (SARS), and Ebola virus. The appearance of these novel agents and the
reemergence of previously eradicated
pathogens may be associated with the
growing human population, flooding, and
other environmental perturbations; global
travel and migration; and animal trade
and domestic animal husbandry practices.
Simultaneously, we have seen an explosion
of genome sequence data. Sequencing is
now the method of choice for characterization of new disease agents, as exemplified by the rapid sequencing of the
genome of the SARS virus, which was
made available within a month of identification of the virus [2,3]. Like SARS,
most newly emerging disease agents originate in animals and have been transmitted to humans recently at food markets, by
insect bites, or through hunting [1].
The new sequencing technologies enable
small academic research groups to create
huge genome datasets at low cost. As a
result, scientists with expertise in other
fields of research, such as clinical microbiology and ecology, are just beginning to
face the challenge of handling, comparing,
and extracting useful information from
millions of sequences. Here, we discuss
the limitations of publicly available resources in the field of genomics of emerging
bacterial pathogens, emphasizing areas
where increased efforts in computational
biology are urgently needed.
Genome Evolution in Emerging
Bacterial Pathogens
A natural ecosystem of a bacterial
population that incidentally infects humans
provides a high-risk microenvironment for the establishment of this pathogen in the human population (Box 1;
Figure 1). Comparative studies of the
genomes of well-recognized human pathogens, incidental pathogens, and their
closely related nonpathogenic species [4–11]
are valuable for efforts to predict the
propensity for host shifts and their consequences for human health.
A successful infectious bacterium,
whether it causes disease or not, must
possess mechanisms for interacting with
the host and evading the host immune
system. The key players in these processes
are often proteins on the surface of the
bacterium, including secretion systems
that release effector proteins into the
surrounding medium or directly into the
host cells. These host-interaction factors
are often members of large protein families
with many paralogs and often encoded by
long genes with internal repeats. Fluctuations in gene length and copy number
occur through homologous recombination
over these repeats [12–15].
Adding to their variability, host-interaction genes are often
located on mobile elements such as
plasmids or bacteriophages, which are
easily gained and lost. Rapid sequence
evolution of these genes may be driven by
selection, because it often increases bacterial fitness by enabling escape from the host immune
system, creating a diverse set of binding
structures, or tuning effector proteins to a
new host. As a consequence, host-interaction genes typically show extreme plasticity in both sequence and copy number,
partly because they are under strong
evolutionary pressure and partly because
they are mechanistically prone to drastic
mutational changes. Understanding these
complex dynamics poses major challenges
in many areas of computational biology,
ranging from sequence assembly to epidemic risk assessment.
Complete Genome Assembly
Remains Difficult
Despite the ease with which shotgun
sequence data can be generated, assembling these data into a single genomic
contig remains labor-intensive and time-consuming. This obstacle is primarily due
to the difficulty of assembling repeated
sequences. Hence, resequencing approaches—where short sequence reads
are directly mapped to an already completed reference genome—have become
increasingly popular. Resequencing readily detects SNPs (single nucleotide polymorphisms) in single-copy genes, but
performs very poorly in repeated and
highly divergent regions of the genome.
Genes involved in infection processes, with
their complex repeat structures, high
duplication frequency, and rapid evolution, are thus often left unresolved.
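The difficulty with repeats can be seen in a toy example: a short read that falls entirely inside a repeated segment matches the reference at more than one position, so neither an assembler nor a read mapper can place it uniquely. The sketch below uses exact string matching on a made-up reference sequence; real mappers tolerate mismatches and use indexed data structures, so this is only an illustration of the ambiguity.

```python
def match_positions(reference, read):
    """Return every 0-based start position where read matches reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)   # allow overlapping matches
    return positions

def classify_reads(reference, reads):
    """Label each read as 'unique', 'ambiguous' (inside a repeat), or 'unmapped'."""
    labels = {}
    for read in reads:
        hits = match_positions(reference, read)
        if not hits:
            labels[read] = "unmapped"
        elif len(hits) == 1:
            labels[read] = "unique"
        else:
            labels[read] = "ambiguous"
    return labels

# Made-up reference in which the segment GATTACA occurs twice.
reference = "TTCGGATTACACCGTAGATTACAGGA"
reads = ["CGGATT", "ATTACA", "CCGTAG", "AAAAAA"]
for read, label in classify_reads(reference, reads).items():
    print(read, label, match_positions(reference, read))
```

Reads that span the boundary between a repeat and its unique flank can still be placed, which is one reason longer reads help.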
Citation: Berglund EC, Nystedt B, Andersson SGE (2009) Computational Resources in Infectious Disease:
Limitations and Challenges. PLoS Comput Biol 5(10): e1000481. doi:10.1371/journal.pcbi.1000481
Editor: Ernest Fraenkel, Massachusetts Institute of Technology, United States of America
Published October 26, 2009
Copyright: © 2009 Berglund et al. This is an open-access article distributed under the terms of the Creative
Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium,
provided the original author and source are credited.
Funding: The authors are supported by grants to SGEA from the European Union (QLK3-CT2000-01079,
EUWOL and EuroPathogenomics), the Swedish Research Council (http://www.vr.se/), the Göran Gustafsson
Foundation (http://www.gustafssonsstiftelse.se/), the Swedish Foundation for Strategic Research (http://www.
stratresearch.se/) and the Knut and Alice Wallenberg Foundation (http://www.wallenberg.com/kaw/). The
funders had no role in study design, data collection and analysis, decision to publish, or preparation of the
manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal collection (http://
ploscollections.org/emerginginfectiousdisease/).
PLoS Computational Biology | www.ploscompbiol.org
1
October 2009 | Volume 5 | Issue 10 | e1000481
Box 1. Genomic Changes Associated with Host Shifts
The movement of a bacterial species from abundant animal hosts such as
rodents, which are a major reservoir of infectious disease agents, to the relatively
small human population is typically associated with decreased genome size and
loss/alteration of the mobile gene pool [4,32–34]. One illustrative example can be
found in the genus Mycobacterium, which contains several severe human
pathogens, including the agents of tuberculosis (M. tuberculosis) and leprosy (M.
leprae) and also the recently emerged M. ulcerans. M. ulcerans causes severe skin
lesions; this disease, known as Buruli ulcer, is becoming a serious public health
problem in West and Central Africa as well as in other parts of the tropics.
Like many other recently emerged human pathogens [4,34–36], M. ulcerans
appears to have switched from a generalist to a specialist lifestyle, starting from a
progenitor very similar to the aquatic M. marinum. While M. marinum has been
found both free-living and as an intracellular pathogen of fish and other species,
M. ulcerans is thought to have a restricted host range and to be transmitted by
insects (Figure 1A). The host switch was likely initiated by the uptake of a
virulence plasmid, and proceeded through a series of “bottleneck events”
(severe reductions in population size due to environmental circumstances). This
process resulted in loss of about 1 Mb of the genome, major genomic
rearrangements, extensive proliferation of insertion sequences, and a massive
increase in number of pseudogenes [37–39]. In particular, there was a massive
reduction in the size of the two major surface protein gene families (a decrease of
more than 250 genes compared to M. marinum). This gene loss is thought to have
been crucial for the organism to evade the human immune system, by limiting
the number of antigens on the bacterial surface [40].
The uptake of a new virulence plasmid producing an immunosuppressive
substance called mycolactone is also thought to have played a key role in the
evolution and host switch of M. ulcerans. This plasmid consists mainly of three
unusually large and internally repeated genes (over 100 kb in total), and thus
illustrates the concept of long and repeated virulence genes (Figure 1B) [41].
These genes appear to evolve rapidly by recombination and gene conversion,
and new variants can be directly connected to variations in the chemical structure
of mycolactone [42], which might be important for host specificity, immunosuppressive potency, and drug design.
Figure 1. Evolution of a new infectious disease agent. (A) Recent evolution of the specialist
human pathogen M. ulcerans from the aquatic generalist pathogen M. marinum. (B) Arrangement
of the three M. ulcerans plasmid–encoded repeated virulence genes (arrows from left to right:
mlsA1 [51 kb], mlsA2 [7.6 kb], mlsB [43 kb]) coding for three polyketide synthases. The loading
modules (labeled LM) and the 16 repeated modules depicted in purple (labeled 1–9 for mlsA1 and
mlsA2, and 1–7 for mlsB) enable the serial buildup of the backbone carbon chain of the complex
immunosuppressive substance mycolactone.
doi:10.1371/journal.pcbi.1000481.g001
Perhaps the most pressing need is not
for improved assembly algorithms but for
better ways to integrate data from diverse
sources, including shotgun sequencing,
paired-end sequencing, PCR experiments,
fosmid and BAC (bacterial artificial chromosome) clone sequencing, physical mapping, and restriction fragment data. A
program integrating these different data
should not only accurately assemble as
much of the genome as possible, but also
assist the researcher in designing additional experiments to resolve the remaining
regions. Given the rapidly increasing
number of incomplete genome sequences
available, it would also be valuable to have a
quality-scoring standard that not only
provides quality scores at individual sites
under the assumption that the assembly is
correct, but also reflects the uncertainty of
the actual assembly over specific regions.
While assembly software development is
struggling to keep up, the sequencing
revolution shows no signs of slowing down.
Perhaps the most important new development is real-time single molecule detection
platforms with ultra-long sequencing reads
[16]. Within the next few years, we can
expect to see read lengths of 20 kb, which
will help resolve many of the complex
genomic features underlying host adaptation and pathogenicity.
Functional Annotation of
Virulence and Host-Interaction
Genes
Annotation is the process of assigning
meaningful information, such as the location or function of genes, to raw sequence
data. Reliable and consistent annotations
are thus fundamental for analysis and
interpretation of genome data. Since
annotation of new genomes is usually
based on homology searches (e.g., BLAST
hits), errors and inconsistencies tend to
propagate. One way to reduce error
propagation is to functionally annotate a
set of reference genomes based on experimentally determined information. Annotation of new genomes could then start
with searches in this database, which
would allow high-quality annotation of
all well-conserved genes. The Gene Ontology’s Reference Genome Project [17]
and BioCyc [18] represent developments
in this direction. However, the number of
species included is still limited, and a
greater taxonomic breadth of bacteria,
with one reference species per genus,
would be desirable.
Functional annotation of pathogen genomes is particularly important, because
genes involved in host-interaction processes are among the most difficult to
annotate. One problem is that different
research groups often have studied homologous genes in various species, and given
them different names that are not always
logical or reflective of similarities in
sequence and function. A manually curated database of protein families involved in
host interactions that incorporates currently used gene names, sequence motifs,
gene functions, and experimental results
would substantially improve the situation.
Much improved guidelines for how to
annotate genes in large families with
different combinations of sequence motifs
would also be valuable.
Comparative studies of very closely
related genomes can help to distinguish
functional genes from spurious ORFs
(open reading frames) and pseudogenes,
and thereby improve gene prediction. To
this end, a tool to visualize all the fine
details in comparisons of multiple closely
related genomes is crucial. Such a tool was
developed recently for genomes with a
conserved order of genes, and it has been
applied to analyze sequence deterioration
in the typhus pathogen Rickettsia prowazekii
and its closest relatives [10]. Future
studies, however, will require software that
can also handle multiple genome comparisons from highly rearranged genomes.
Another limitation of currently available
visualization tools is that, although multiple genomes can be included, only serial
pairwise comparisons can be made. This
limitation can be overcome by visualization of genome comparisons in “three
dimensions” (3D visualization), enabling
all-against-all comparisons to be viewed
simultaneously (Figure 2). Just as 3D
visualizations revolutionized the field of
structural biology over the past decades,
such developments might well revolutionize the field of comparative genomics in
the years to come.
Figure 2. New visualization tools for genome comparisons. Comparison of the
genes in multiple genomes can be represented visually by using a 3D program. Each arrow
represents one gene, and the grey shading between genes indicates homology. Red
indicates genes that are unique to one genome. The difference between this approach and
existing programs is that all genomes can be compared to each other simultaneously,
rather than by pairwise comparisons. With multiple genomes, and with zooming, flipping,
and selecting options, even this rudimentary 3D program would be of great help in
genome analysis.
doi:10.1371/journal.pcbi.1000481.g002
Molecular Diagnostics and
Vaccine Development
Classification of infectious disease agents is typically based on multilocus
sequence-typing (MLST) systems, by which new bacterial isolates are analyzed
by sequencing five to seven predefined core genes [19]. With the increasing
number of complete genome sequences of pathogenic and nonpathogenic strains,
it will be possible to concatenate a much larger number of conserved genes and use
this dataset to infer a tree to represent the underlying population structure [20].
However, while genotyping systems based on conserved genes can be useful for
monitoring the spread of strains, they do not necessarily correlate with genomotypes
defined by virulence properties [21]. This is because genes contributing to virulence
are prone to horizontal gene transfer, gene duplications, and gene loss. Further
complicating the development of molecular diagnostic methods is that homologs of
virulence genes are often present also in nonpathogenic species, making it difficult
to recognize pathogens solely from the gene content. Hence, classification and
risk assessments for the emergence of novel infectious strains ultimately should
be based on a combination of strain typing, gene content, and identification
of virulence genes.
Understanding the evolutionary dynamics of host-interaction genes in terms of
both mechanisms and selective forces is also important in order to design drugs that will
be effective in the long term. What good would be the development of a new
antibiotic or vaccine if the intended target protein evolves beyond recognition before
the drug reaches the market? One solution to this problem is to characterize the
selective pressures on candidate vaccine targets, and then exclude genes or parts of
genes based on their evolutionary dynamics [22]. However, current tools for measuring
positive or diversifying selection are severely limited in that they assume that
single-base mutations are the only underlying mechanism of sequence change. For
reliable analyses of genes with a complex evolution, a new generation of evolutionary
tests needs to be developed that acknowledge the importance of mutation by
recombination (Figure 3) and multiple-base insertion/deletion events as well as point
mutations. With the expected huge increase of complete and draft genomes for many
strains of a species, there is a need for programs capable of screening a large set of
alignments for recombination signals, with novel statistical and visualization tools to
analyze the full set of results.
Predicting Risk for Disease
Outbreaks
The next challenge is to place the
genomic data within its ecological context,
which has led to a new research field
called molecular ecosystems biology [23].
This field focuses on dissecting the many
complex molecular interactions between
the bacterial population and its environment. This environment can be highly
specialized, as in the case of bacteria
adapted to a single host species, or very
complex as for soil-, water-, or airborne
bacteria. The behavior of a pathogen thus
depends on many ecological factors, such
as seasonal fluctuations in temperature
and nutritional availability, species richness and host population density.
To be able to integrate and evaluate these data, new software is needed. Imagine
a program that can read sequence data from hundreds of bacterial isolates, infer
the underlying population structure, and combine it with gene expression data,
ecological factors, and clinical data such as the number of disease cases reported in
various geographic areas. It should be possible to visualize global patterns in the
data, such as abundance of particular strains and sequence variants and migration
of infected hosts and vectors over geographic areas and seasons. Changes in
taxonomic profiles, virulence genes, and metabolic pathways should be visualized in
real time. This program could also be linked to a Web site where researchers can
post daily updates of clinical cases, spread of virulence genes, appearance of new
strains and new mutations, migration patterns, and news about genome and
functional data. This site would be useful for estimating the risk for new epidemics to
emerge in the human population.
Figure 3. New methods for analyzing evolution by recombination. Improved
models and visualization tools are needed to analyze recombination. Virulence genes, here
exemplified by the acfD gene in the Vibrio cholerae pathogenicity island [43], often
display complex recombination patterns. The aligned acfD genes (arrows) from three V.
cholerae strains (M2140, M1567, and M1118) are plotted separately; a line connects each
site where the nucleotides in two strains differ from the third strain. Noninformative sites
were removed before plotting.
doi:10.1371/journal.pcbi.1000481.g003
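The site-by-site comparison described in the Figure 3 caption can be sketched in a few lines. For three aligned sequences, a site is informative when exactly two strains share a nucleotide and the third differs; a switch along the gene in which strain is the odd one out hints at recombination. The strain labels are taken from the caption, but the sequences below are invented, and this is an illustrative sketch rather than the authors' plotting code.

```python
def informative_sites(seqs):
    """For three aligned sequences, map each informative site to the odd strain out.

    seqs: dict with exactly three name -> aligned sequence entries.
    Returns a list of (site, odd_one_out) tuples. Sites where all three
    agree, all three differ, or any sequence has a gap are skipped.
    """
    names = list(seqs)
    if len(names) != 3:
        raise ValueError("exactly three sequences are required")
    a, b, c = (seqs[n] for n in names)
    if not len(a) == len(b) == len(c):
        raise ValueError("sequences must be aligned to equal length")
    result = []
    for i, (x, y, z) in enumerate(zip(a, b, c)):
        if "-" in (x, y, z):
            continue
        if x == y != z:
            result.append((i, names[2]))
        elif x == z != y:
            result.append((i, names[1]))
        elif y == z != x:
            result.append((i, names[0]))
    return result

# Invented sequences: the left half groups M2140 with M1567, while the right
# half groups M1567 with M1118, the kind of switch recombination can produce.
seqs = {
    "M2140": "ACGTACGTACGT",
    "M1567": "ACGTACGTTCGA",
    "M1118": "TCGAACGTTCGA",
}
for site, odd in informative_sites(seqs):
    print(site, odd)
```

A statistical test would still be needed to decide whether such a switch is more than chance; the sketch only extracts the pattern that a plot like Figure 3 displays.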
Analyzing Microbial
Communities
Analyzing the behavior of complete
pathogen ecosystems is an immediate
priority. Random shotgun sequencing
projects of bacterial DNA from diverse
environments number in the hundreds, and
the amount of metagenomic sequence
data already exceeds the available genomic sequences in public databases [24,25]
(http://www.genomesonline.org). Several
multinational projects on the human
microbiome have been launched, which,
together with studies of 16S rRNA amplicons, have provided new insights into the
human intestinal [26–28], oral [29], and
vaginal flora [30]. Comparison of the
microbial flora in healthy and diseased
people can be a powerful diagnostic tool
and enable the discovery of both emerging
pathogens and novel virulence factors,
such as antibiotic resistance plasmids. An
important technical development that
holds great promise for associating the
functional adaptation of the community as
a whole with the metabolic pathways
present in the individual strains is single-
cell isolation followed by whole-genome
amplification. Community sequencing also
provides an excellent tool for epidemic
surveillance of pathogenic strains and
virulence genes in environments from
which they may further spread to humans.
The massive amount of data created by
microbial community sequencing poses
new challenges and will require extensive
bioinformatics development [24]. Although the advent of longer sequence
reads will have a large impact on the
assembly of community data, the presence
of many closely related species or strains in
the same sample, along with horizontal
gene transfer, will remain a daunting
challenge. A whole new field of comparative algorithms needs to be developed, for
example to provide meaningful comparisons between taxonomic profiles. New
sequence databases will be essential for
rapid access to both raw and processed
data. Also, for fair comparisons between
datasets, a certain level of standardization
of sampling, experimental work, and
statistics will be crucial [31]. Bioinformatics skills combined with a deep biological understanding of the system under
study are needed to use these complex
sequence datasets to answer such questions
as: Who is there? What are they doing?
How are they communicating? And what
is the risk for disease?
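As a concrete illustration of what a comparison between taxonomic profiles can look like, the following minimal sketch computes the Bray-Curtis dissimilarity between two samples. Bray-Curtis is one commonly used ecological measure chosen here for illustration; it is not a method proposed in this article. The function names, taxon labels, and read counts are all hypothetical. Counts are first converted to relative abundances so that samples sequenced to different depths can be compared, which is a simple instance of the standardization called for above.

```python
from typing import Dict


def to_relative(counts: Dict[str, float]) -> Dict[str, float]:
    """Convert raw per-taxon counts to relative abundances that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("profile has no counts")
    return {taxon: value / total for taxon, value in counts.items()}


def bray_curtis(profile_a: Dict[str, float], profile_b: Dict[str, float]) -> float:
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for no shared taxa."""
    taxa = set(profile_a) | set(profile_b)
    total = sum(profile_a.values()) + sum(profile_b.values())
    if total <= 0:
        raise ValueError("both profiles are empty")
    # Abundance shared by the two samples, summed over every taxon.
    shared = sum(min(profile_a.get(t, 0.0), profile_b.get(t, 0.0)) for t in taxa)
    return 1.0 - 2.0 * shared / total


# Hypothetical read counts per phylum (illustrative values only).
sample_1 = {"Bacteroidetes": 500, "Firmicutes": 400, "Proteobacteria": 100}
sample_2 = {"Bacteroidetes": 40, "Firmicutes": 60, "Proteobacteria": 100}

distance = bray_curtis(to_relative(sample_1), to_relative(sample_2))
print(f"Bray-Curtis dissimilarity: {distance:.2f}")
```

A measure like this treats taxa as unrelated categories; comparisons that account for phylogenetic relatedness or for strain-level variation would need more elaborate methods.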
Challenges for the Future
The priority goals for the next decade
within the area of emerging infectious
diseases should be the study of complete
pathogen ecosystems and the dissection of
host–pathogen interaction communication
pathways directly in the natural environment. To achieve these goals, investments
in user-friendly software and improved
visualization tools, along with excellent
expertise in computational biology, will be
of utmost importance. Unfortunately, too
few undergraduate students in clinical
microbiology and microbial ecology are
trained in computational skills, and national governments and universities need to
take action to address this deficiency to
meet the demands of the near future. Often
neglected by public and private funding is
the monumental need for stable and
standardized infrastructure at all levels,
from the individual research group to the
intergovernmental organization. Only with
proper investments in everything from
hardware and personnel for data handling,
to the development of sensible and standardized file formats, can we ensure that
the current developments can be fully
exploited to more efficiently battle emerging infectious diseases.
Currently, the slow transition from a
scientific in-house program to the distribution of a stable and efficient software
package is a major bottleneck in scientific
knowledge sharing, preventing efficient
progress in all areas of computational
biology. Efforts to design, share, and
improve software must receive increased
funding, practical support, and, not least,
scientific recognition. Since microorganisms do not respect national borders, such
initiatives are probably best started by
intergovernmental organizations that have close
links to national centers. Through their established
communication networks, these centers can distribute
know-how and advances within
each country and, in the other direction, carry
new concepts and software back to
all members of the organization. Eventually, many of these initiatives may become
community-driven. The example of Wikipedia, with more than 10 million entries
written since the launch in 2001 and a
current growth rate of thousands of articles
daily (http://www.wikipedia.org), demonstrates the power of user-contributed initiatives.
Acknowledgments
We thank Eddie Persson for graphical work.
References
1. Rappuoli R (2004) From Pasteur to genomics:
Progress and challenges in infectious diseases. Nat
Med 10: 1177–1185.
2. Marra MA, Jones SJ, Astell CR, Holt RA,
Brooks-Wilson A, et al. (2003) The genome
sequence of the SARS-associated coronavirus.
Science 300: 1399–1404.
3. Rota PA, Oberste MS, Monroe SS, Nix WA,
Campagnoli R, et al. (2003) Characterization of a
novel coronavirus associated with severe acute
respiratory syndrome. Science 300: 1394–1399.
4. Parkhill J, Wren BW, Thomson NR, Titball RW,
Holden MT, et al. (2001) Genome sequence of
Yersinia pestis, the causative agent of plague.
Nature 413: 523–527.
5. Welch RA, Burland V, Plunkett G 3rd,
Redford P, Roesch P, et al. (2002) Extensive
mosaic structure revealed by the complete
genome sequence of uropathogenic Escherichia
coli. Proc Natl Acad Sci U S A 99: 17020–17024.
6. Dziejman M, Balon E, Boyd D, Fraser CM,
Heidelberg JF, et al. (2002) Comparative genomic
analysis of Vibrio cholerae: genes that correlate with
cholera endemic and pandemic disease. Proc Natl
Acad Sci U S A 99: 1556–1561.
7. Wolfgang MC, Kulasekara BR, Liang X, Boyd D,
Wu K, et al. (2003) Conservation of genome
content and virulence determinants among clinical and environmental isolates of Pseudomonas
aeruginosa. Proc Natl Acad Sci U S A 100:
8484–8489.
8. Seshadri R, Myers GS, Tettelin H, Eisen JA,
Heidelberg JF, et al. (2004) Comparison of the
genome of the oral pathogen Treponema denticola
with other spirochete genomes. Proc Natl Acad
Sci U S A 101: 5646–5651.
9. Gill SR, Fouts DE, Archer GL, Mongodin EF,
Deboy RT, et al. (2005) Insights on evolution of
virulence and resistance from the complete
genome analysis of an early methicillin-resistant
Staphylococcus aureus strain and a biofilm-producing
methicillin-resistant Staphylococcus epidermidis strain.
J Bacteriol 187: 2426–2438.
10. Fuxelius HH, Darby AC, Cho NH, Andersson SG
(2008) Visualization of pseudogenes in intracellular bacteria reveals the different tracks to gene
destruction. Genome Biol 9: R42.
11. Berglund EC, Frank AC, Calteau A, Vinnere
Pettersson O, Granberg F, et al. (2009) Run-off replication of host-adaptability genes is
associated with gene transfer agents in the
genome of mouse-infecting Bartonella grahamii.
PLoS Genet 5: e1000546. doi:10.1371/journal.pgen.1000546.
12. Deitsch KW, Moxon ER, Wellems TE (1997)
Shared themes of antigenic variation and virulence in bacterial, protozoal, and fungal infections. Microbiol Mol Biol Rev 61: 281–293.
13. Brayton KA, Knowles DP, McGuire TC,
Palmer GH (2001) Efficient use of a small
genome to generate antigenic diversity in tick-borne ehrlichial pathogens. Proc Natl Acad
Sci U S A 98: 4130–4135.
14. Nystedt B, Frank AC, Thollesson M,
Andersson SG (2008) Diversifying selection and
concerted evolution of a type IV secretion system
in Bartonella. Mol Biol Evol 25: 287–300.
15. Bilek N, Ison CA, Spratt BG (2009) Relative
contributions of recombination and mutation to
the diversification of the opa gene repertoire of
Neisseria gonorrhoeae. J Bacteriol 191: 1878–1890.
16. Gupta PK (2008) Single-molecule DNA sequencing technologies for future genomics research.
Trends Biotechnol 26: 602–611.
17. Reference Genome Group of the Gene Ontology Consortium (2009) The Gene Ontology's Reference Genome Project: A unified framework for functional annotation across species. PLoS Comput Biol 5:
e1000431.
18. Karp PD, Ouzounis CA, Moore-Kochlacs C,
Goldovsky L, Kaipa P, et al. (2005) Expansion of
the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res 33:
6083–6089.
19. Maiden MC, Bygraves JA, Feil E, Morelli G,
Russell JE, et al. (1998) Multilocus sequence
typing: A portable approach to the identification
of clones within populations of pathogenic
microorganisms. Proc Natl Acad Sci U S A 95:
3140–3145.
20. Ciccarelli FD, Doerks T, von Mering C,
Creevey CJ, Snel B, et al. (2006) Toward
automatic reconstruction of a highly resolved
tree of life. Science 311: 1283–1287.
21. Turner KM, Feil EJ (2007) The secret life of the
multilocus sequence type. Int J Antimicrob
Agents 29: 129–135.
22. Bambini S, Rappuoli R (2009) The use of
genomics in microbial vaccine development.
Drug Discov Today 14: 252–260.
23. Raes J, Bork P (2008) Molecular eco-systems
biology: Towards an understanding of community function. Nat Rev Microbiol 6: 693–699.
24. Kunin V, Copeland A, Lapidus A, Mavromatis K,
Hugenholtz P (2008) A bioinformatician’s guide
to metagenomics. Microbiol Mol Biol Rev 72:
557–578.
25. Liolios K, Mavromatis K, Tavernarakis N,
Kyrpides NC (2008) The Genomes On Line
Database (GOLD) in 2007: Status of genomic
and metagenomic projects and their associated
metadata. Nucleic Acids Res 36: D475–479.
26. Dethlefsen L, Huse S, Sogin ML, Relman DA
(2008) The pervasive effects of an antibiotic on
the human gut microbiota, as revealed by deep
16S rRNA sequencing. PLoS Biol 6: e280.
doi:10.1371/journal.pbio.0060280.
27. Turnbaugh PJ, Hamady M, Yatsunenko T,
Cantarel BL, Duncan A, et al. (2009) A core
gut microbiome in obese and lean twins. Nature
457: 480–484.
28. Mahowald MA, Rey FE, Seedorf H,
Turnbaugh PJ, Fulton RS, et al. (2009) Characterizing a model human gut microbiota composed of members of its two dominant bacterial
phyla. Proc Natl Acad Sci U S A 106:
5859–5864.
29. Keijser BJ, Zaura E, Huse SM, van der
Vossen JM, Schuren FH, et al. (2008) Pyrosequencing analysis of the oral microflora of
healthy adults. J Dent Res 87: 1016–1020.
30. Spear GT, Sikaroodi M, Zariffard MR,
Landay AL, French AL, et al. (2008) Comparison
of the diversity of the vaginal microbiota in HIV-infected and HIV-uninfected women with or
without bacterial vaginosis. J Infect Dis 198:
1131–1140.
31. Raes J, Foerstner KU, Bork P (2007) Get the most
out of your metagenome: Computational analysis
of environmental sequence data. Curr Opin
Microbiol 10: 490–498.
32. Andersson SG, Kurland CG (1998) Reductive
evolution of resident genomes. Trends Microbiol
6: 263–268.
33. Cole ST, Eiglmeier K, Parkhill J, James KD,
Thomson NR, et al. (2001) Massive gene decay in
the leprosy bacillus. Nature 409: 1007–1011.
34. Alsmark CM, Frank AC, Karlberg EO,
Legault BA, Ardell DH, et al. (2004) The louse-borne human pathogen Bartonella quintana is a
genomic derivative of the zoonotic agent Bartonella henselae. Proc Natl Acad Sci U S A 101:
9716–9721.
35. Cole ST, Brosch R, Parkhill J, Garnier T,
Churcher C, et al. (1998) Deciphering the biology
of Mycobacterium tuberculosis from the complete
genome sequence. Nature 393: 537–544.
36. Parkhill J, Sebaihia M, Preston A, Murphy LD,
Thomson N, et al. (2003) Comparative analysis of
the genome sequences of Bordetella pertussis,
Bordetella parapertussis and Bordetella bronchiseptica.
Nat Genet 35: 32–40.
37. Yip MJ, Porter JL, Fyfe JA, Lavender CJ,
Portaels F, et al. (2007) Evolution of Mycobacterium
ulcerans and other mycolactone-producing mycobacteria from a common Mycobacterium marinum
progenitor. J Bacteriol 189: 2021–2029.
38. Rondini S, Kaser M, Stinear T, Tessier M,
Mangold C, et al. (2007) Ongoing genome
reduction in Mycobacterium ulcerans. Emerg Infect
Dis 13: 1008–1015.
39. Stinear TP, Seemann T, Pidot S, Frigui W,
Reysset G, et al. (2007) Reductive evolution and
niche adaptation inferred from the genome of
Mycobacterium ulcerans, the causative agent of Buruli
ulcer. Genome Res 17: 192–200.
40. Huber CA, Ruf MT, Pluschke G, Kaser M (2008)
Independent loss of immunogenic proteins in
Mycobacterium ulcerans suggests immune evasion.
Clin Vaccine Immunol 15: 598–606.
41. Stinear TP, Mve-Obiang A, Small PL, Frigui W,
Pryor MJ, et al. (2004) Giant plasmid-encoded
polyketide synthases produce the macrolide toxin
of Mycobacterium ulcerans. Proc Natl Acad Sci U S A
101: 1345–1349.
42. Pidot SJ, Hong H, Seemann T, Porter JL, Yip MJ,
et al. (2008) Deciphering the genetic basis for
polyketide variation among mycobacteria producing mycolactones. BMC Genomics 9: 462.
43. Tay CY, Reeves PR, Lan R (2008) Importation of
the major pilin TcpA gene and frequent recombination drive the divergence of the Vibrio
pathogenicity island in Vibrio cholerae. FEMS
Microbiol Lett 289: 210–218.
Perspective
The Role of Medical Structural Genomics in Discovering New Drugs for Infectious Diseases
Wesley C. Van Voorhis1, Wim G. J. Hol2, Peter J. Myler3,4,5*, Lance J. Stewart6*
1 Department of Medicine, University of Washington, Seattle, Washington, United States of America, 2 Department of Biochemistry, University of Washington, Seattle,
Washington, United States of America, 3 Seattle Biomedical Research Institute, Seattle, Washington, United States of America, 4 Department of Global Health, University of
Washington, Seattle, Washington, United States of America, 5 Department of Medical Education and Biomedical Informatics, University of Washington, Seattle,
Washington, United States of America, 6 deCODE biostructures, Bainbridge Island, Washington, United States of America
Introduction
Whether we think of Alzheimer’s disease, microbial infection, or any other
modern-day disease, new medicines are
urgently needed. The number of new
drugs registered since the advent of
genomics, however, has not lived up to
expectations. One recent review revealed
that over 70 high-throughput biochemical
screens against genetically validated drug
targets in bacteria failed to yield a single
candidate that could be tested in the clinic
[1]. The reasons for the failure of high-throughput biochemical screens are not
completely clear, but they could include the
limited diversity of chemical libraries used
and/or the absence of structural information for many of the targets. Indeed,
structure-based drug design is playing a
growing role in modern drug discovery,
with numerous approved drugs tracing
their origins, at least in part, to the use of
structural information from X-ray crystallography or nuclear magnetic resonance
(NMR) analysis of protein targets and their
ligand-bound complexes. Although it is
beyond the scope of this brief overview to
present a comprehensive list of structures
that have led to useful drugs, Table 1 lists
some examples in which protein structure
information has provided insights into the
design and development of new therapeutic entities. These cases include both novel
drug design based on native and ligand-bound structures and optimization of
inhibitors based on the binding mode
revealed by the structures of inhibitor–
target complexes. These approaches have
allowed increased affinity for the target
and/or improvement of pharmacological
properties while maintaining target
affinity.
With the increasing availability of
complete human and pathogen genome
sequences and the substantial progress in
structure determination methods, it is no
surprise that the field of "structural
genomics" has emerged recently. Its aim
is to solve as many useful protein structures as possible from the entire genome of
a single organism or group of related
organisms. Over the past ten years, more than
20 structural genomics initiatives have
begun around the world (Table 2). The
impact of these efforts on structural
biology has been substantial, both in the
sheer number of new structures and,
perhaps even more importantly, in the
development of new methodologies, especially the use of robotics and informatics to
generate and capture data in a systematic
way [2]. Over the next five years,
thousands of new protein structures, many
bound to their ligands, will be elucidated,
laying the groundwork for structure-based design and development of new
and improved chemotherapeutic agents
against pathogen proteins. Here, we will
focus on the intersection of structural
biology with chemistry and biology, a
field called "medical structural genomics," and particularly on how the structures
of medically relevant drug targets in
pathogens can serve as a starting point
for inhibitor design and drug development. We argue that the pharmaceutical
industry should be persuaded to complement the publicly funded structural genomics initiatives by making public the
structural coordinates of their drug targets
for important infectious disease organisms
in a timely fashion and by developing
public–private partnerships to provide the
maximal synergy between target validation, structure determination, and hit-to-lead development.
Target Selection
A prerequisite of medical structural
genomics is that the proteins whose
structures are determined must be well-validated as good drug targets. The term
"drugability" is often used loosely to
describe how tractable a given target is
for the development of a drug candidate.
For infectious organisms, one key factor in
defining drugability is that the target
protein be essential for survival of the
microbe. While essentiality has traditionally been defined using techniques such as
"gene knockout" and RNA interference,
these are not always feasible and should be
complemented by chemical biology approaches (see below). Furthermore, the
significance of these experiments is
often difficult to assess, since the
interplay of host and pathogen is complex
and full of surprises. For example, tremendous effort has been devoted recently
to the development of antagonists for
targets in the fatty acid biosynthesis
Citation: Van Voorhis WC, Hol WGJ, Myler PJ, Stewart LJ (2009) The Role of Medical Structural Genomics in
Discovering New Drugs for Infectious Diseases. PLoS Comput Biol 5(10): e1000530. doi:10.1371/journal.pcbi.1000530
Editor: Ernest Fraenkel, Massachusetts Institute of Technology, United States of America
Published October 26, 2009
Copyright: © 2009 Van Voorhis et al. This is an open-access article distributed under the terms of the
Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any
medium, provided the original author and source are credited.
Funding: This work was supported by NIAID funding to the Seattle Structural Genomics Center for
Infectious Disease (SSGCID), contract HHSN266200700057C; to the Medical Structural Genomics of Protozoan
Pathogens (MSGPP), contract P01 AI067921; and to WCVV, grant 1R01AI080625. The funders had no role in
preparation of the article.
Competing Interests: Co-author Lance Stewart is an employee of deCODE biostructures, which developed
the Fragments-of-Life library presented in Figure 1 and discussed in sections titled "Fragment-based drug
discovery" and "Targeting oligomeric enzymes". Fragments-of-Life™ is a technology trademarked by deCODE
biostructures and chemistry (http://www.decodechembio.com/Capabilities/StructuralBiology/FragmentsofLife.aspx).
* E-mail: [email protected] (PJM); [email protected] (LJS)
This article is part of the "Genomics of Emerging Infectious Disease" PLoS Journal collection (http://ploscollections.org/emerginginfectiousdisease/).
PLoS Computational Biology | www.ploscompbiol.org
October 2009 | Volume 5 | Issue 10 | e1000530
Table 1. Examples of how target protein structure can assist drug discovery and development.
Source | Target Protein | Approach | Reference(s)
HIV | gp41 | Structure led to strategies that target viral entry. | [43–45]
HIV | Protease | Protease–inhibitor complexes allowed lead optimization. | [46–52]
HIV | Reverse transcriptase | Non-nucleoside inhibitor complexes led to drug design that targets pockets outside the enzyme's active site. | [53–55]
Influenza virus | Neuraminidase | Complex with a transition state analog led to inhalable and orally active neuraminidase inhibitors. | [56–59]
Rhinovirus | Coat protein | Small fatty acid molecules bound in hydrophobic pocket led to new strategies of antiviral drug design. | [60]
Vibrio | Cholera toxin | Five receptor-binding sites provided inspiration for design of novel multivalent inhibitors. | [61]
Bacteria | Peptide deformylase | Protein–inhibitor complexes led to macrocyclic compounds with improved potency, selectivity and metabolic stability. | [62]
Trypanosoma | GAPDH | Novel adenosine analogs showed enhanced selectivity towards the parasite target versus human protein. | [63,64]
Human | Cyclophilin and calcineurin | A ternary complex with cyclosporine A led to insights into its immunosuppressive activity. | [65]
Human | Renin | The ligand-bound structure allowed design and improvement of orally active non-peptide inhibitors to regulate blood pressure. | [66]
Human | Coagulation factor Xa | Structure-based design led to improved pharmacological anticoagulant properties in a primate model. | [67]
Human | Adenosine deaminase | Optimization of a non-nucleoside inhibitor led to an orally active anti-inflammatory compound in a rat model. | [68]
Human | Kinases | Structures of kinases provided a basis to improve and design new therapeutics for various human diseases including cancer. | [69]
doi:10.1371/journal.pcbi.1000530.t001
pathway of bacteria [3]. Potent drug-like
molecules with high bioavailability have
been developed that can effectively shut
down bacterial replication in vitro. In
subsequent animal testing, however, these
compounds proved ineffective: fatty acids are abundant in
vertebrates, so bacteria can scavenge
host fatty acids for their survival and
growth even if their own fatty acid
biosynthesis pathways are blocked [4].
Thus, to improve target selection for
medical structural genomics, it will be
important to collaborate with chemical
biology groups to undertake screening
campaigns to identify compounds that
cause the death of a pathogen under the
appropriate assay conditions [5].
If the target protein of a drug is known,
medical structural genomics offers a rapid
and efficient way to obtain ligand-bound
structures by using high-throughput X-ray
crystallography and/or NMR. Conversely, when the target of a cell-active
compound is unknown, medical structural
genomics efforts provide purified protein
for many potential drug targets that can be
screened for interaction with the active
compound by a number of biophysical
methods (such as thermal stability [6]).
The Medical Structural Genomics of
Protozoan Pathogens (MSGPP, http://www.msgpp.org/) initiative has already
begun such an effort by screening thousands of anti-malaria compounds against
67 potential Plasmodium falciparum targets
expressed in bacteria (WC Van Voorhis,
unpublished data). These approaches aim
to generate knowledge about the biological
effect of a small molecule on a target
protein. Follow-up experiments are then
needed to test the activity of this compound in live organisms in order to
validate the target; this valuable "chemical
validation" makes the target much more
likely to be drugable, and thus worthy of
more intensive effort. The future will likely
see more medical structural genomics
centers working with chemical biology
groups that have collections of "phenotype-defined" compounds (i.e., those with
known anti-pathogen activity). The result
will be synergistic target validation and
hit-to-lead development using structure-based drug design.
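The thermal stability screen mentioned above rests on a simple comparison: a compound that binds a purified protein often raises its melting temperature. The sketch below shows that comparison in its most basic form, assuming melting temperatures have already been measured for each candidate target with and without the compound. The function name, the target labels, the temperatures, and the 2 °C cutoff are all hypothetical choices made for illustration and are not values from the MSGPP screen.

```python
from typing import Dict, List, Tuple


def thermal_shift_hits(
    tm_without: Dict[str, float],
    tm_with: Dict[str, float],
    min_shift: float = 2.0,  # assumed cutoff in degrees C, for illustration only
) -> List[Tuple[str, float]]:
    """Return (target, shift) pairs whose melting temperature rises by at
    least min_shift when the compound is present, largest shift first."""
    hits = []
    for target, tm_reference in tm_without.items():
        if target not in tm_with:
            continue  # target was not measured with the compound
        shift = tm_with[target] - tm_reference
        if shift >= min_shift:
            hits.append((target, shift))
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical melting temperatures in degrees C (not measured data).
tm_protein_alone = {"target_A": 48.0, "target_B": 55.5, "target_C": 61.0}
tm_with_compound = {"target_A": 53.5, "target_B": 55.9, "target_C": 63.0}

for target, shift in thermal_shift_hits(tm_protein_alone, tm_with_compound):
    print(f"{target}: melting temperature shift of {shift:.1f} degrees C")
```

A shift of this kind only suggests binding; as the text notes, follow-up experiments in live organisms are still needed to validate the target.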
Fragment-Based Drug Discovery
Fragment-based drug discovery has rapidly gained interest within the pharmaceutical
industry (reviewed in [7], with roots in the 128-compound cocktails of [8]) as an alternative
to expensive and sometimes inefficient high-throughput screening methods for hit identification and optimization [9]. The general
concept of fragment-based drug discovery
involves screening libraries of "rule-of-three"
compounds [10] against target macromolecules by using a variety of methods including
X-ray crystallography, NMR, surface plasmon resonance, differential thermal denaturation, fluorescence polarization, and other
techniques [7,11–14]. The rule of three
consists of molecular weight <300 daltons,
≤3 rotatable bonds, ≤3 hydrogen bond
donors/acceptors, and Clog P (calculated log
of octanol/water partition coefficient) <3.
These compounds generally include fragments or "building blocks" of available drugs,
on the assumption that these fragments are
more likely to be "drug-like." Fragment-based drug discovery has been used by
commercial and academic groups, including
our own, and has led to a number of leads for
further drug development [15]. At deCODE
biostructures, a partner in the Seattle Structural Genomics Center for Infectious Disease
(SSGCID, http://www.ssgcid.org/) consortium, the approach to assembling a fragment
library has been somewhat different. The
Fragments of Life (FOL) library (Figure 1) is a
collection of approximately 1,400 structurally
diverse small molecules found in the cellular
environment, metabolites, natural products,
and their derivatives or isosteres (molecules of
Table 2. Structural genomics projects worldwide submitting to the Protein Data Bank.
Name | URL | Target Focus
Berkeley Structural Genomics Center (BSGC) | http://www.strgen.org/ | Near complete coverage of Mycoplasma genome
Center for Eukaryotic Structural Genomics (CESG) | http://www.uwstructuralgenomics.org/ | PSI Center: Eukaryotic bottlenecks, specifically solubility
Center for Structural Genomics of Infectious Disease (CSGID) | http://csgid.org/csgid/ | Medically relevant infectious disease targets
Center for Structure of Membrane Proteins (CSMP) | http://csmp.ucsf.edu/index.htm | PSI Center: Bacterial and human membrane proteins
Integrated Center for Structure and Function Innovation (ISFI) | http://techcenter.mbi.ucla.edu/ | PSI Center: Protein solubility and crystallization improvement
Israel Structural Proteomics Center | http://www.weizmann.ac.il/ISPC/ | Member of Structural Proteomics in Europe (see below)
Joint Center for Structural Genomics (JCSG) | http://www.jcsg.org/ | PSI Center: High-throughput pipeline development and operation
Marseilles Structural Genomics Program | http://www.afmb.univ-mrs.fr/rubrique93.html | Human health
Medical Structural Genomics of Pathogenic Protozoa (MSGPP) | http://www.msgpp.org/ | Structural and functional genomics of ten species of pathogenic protozoa
Montreal-Kingston Bacterial Structural Genomics Initiative (BSGI) | http://euler.bri.nrc.ca/brimsg/bsgi.html | ORFs from pathogenic and nonpathogenic bacterial strains
Mycobacterium Tuberculosis Structural Genomics Consortium (TBsgc) | http://www.doe-mbi.ucla.edu/TB/ | Mycobacterium tuberculosis: To understand pathogenesis and for structure-based drug design
Mycobacterium Tuberculosis Structural Proteomics Project (X-MTB) | http://webclu.bio.wzw.tum.de/binfo/proj/mtb/ | 35 Mycobacterium tuberculosis targets to identify five for drug development
New York SGX Research Center for Structural Genomics (NYSGXRC) | http://www.nysgrc.org/nysgrc/ | PSI Center: High-throughput pipeline development and operation
Ontario Center for Structural Proteomics (OCSP) | http://www.uhnres.utoronto.ca/centres/proteomics/ | Enzymatic activity characterization
Oxford Protein Production Facility | http://www.oppf.ox.ac.uk/OPPF/ | Human and pathogen targets of biomedical relevance
RIKEN Structural Genomics/Proteomics Initiative | http://www.rsgi.riken.jp/rsgi_e/ | Protein functional networks
Seattle Structural Genomics Center for Infectious Disease (SSGCID) | http://www.ssgcid.org/ | Medically relevant infectious disease targets
Southeast Collaboratory for Structural Genomics | http://www.secsg.org/ | High-throughput eukaryotic genome-scan methods development
Structural Genomics of Pathogenic Protozoa | http://www.sgpp.org/ | PSI Center: Three-dimensional structures of proteins from four major pathogenic protozoa
Structural Proteomics in Europe (SPINE) | http://www.spineurope.org/ | Structures of medically relevant proteins and protein complexes
Structural Proteomics in Europe 2-Complexes (SPINE2 - Complexes) | http://www.spine2.eu/SPINE2/ | Structures of protein complexes from medically relevant signaling pathways
Structural Genomics Consortium | http://www.thesgc.org/ | Medically relevant human and pathogen proteins
Structure 2 Function Project | http://s2f.umbi.umd.edu/ | Poorly characterized and hypothetical protein targets
The Accelerated Technologies Center for Gene to 3D Structure | http://atcg3d.org/default.aspx | PSI Center: Technologies development of X-ray source, synthetic gene design, and microfluidic crystallization
The Midwest Center for Structural Genomics (MCSG) | http://www.mcsg.anl.gov/ | PSI Center: High-throughput methods development and operation
The Northeast Structural Genomics Consortium (NESG) | http://www.nesg.org/ | PSI Center: Protein domains, network families, biomedical relevance
Note: Some centers with fewer than ten released structures in the PDB (www.rcsb.org/pdb/) are not shown.
PSI, Protein Structure Initiative.
doi:10.1371/journal.pcbi.1000530.t002
similar size containing the same number and
types of atoms). Also included in the FOL
library are a series of biaryl small molecules
(which contain two tethered five- or six-membered ring structures) that mimic protein
secondary structure elements (e.g., α-helical
turns). Thus, this fragment set is useful for
targeting both the active sites of enzymes and
more complex protein surfaces including
allosteric small molecule binding sites and
protein–protein interfaces [16].
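The rule-of-three criteria described in this section can be expressed as a simple library filter. The sketch below reads the thresholds as molecular weight below 300 daltons, at most three rotatable bonds, at most three hydrogen bond donors, at most three hydrogen bond acceptors, and Clog P below 3. The `Fragment` record, the function names, and the descriptor values are hypothetical; in practice the descriptors would be calculated by a cheminformatics toolkit rather than typed in by hand. This is not the selection procedure used for the Fragments of Life library, which the text describes as assembled on a different basis.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Fragment:
    """Precomputed descriptors for one candidate fragment (hypothetical record)."""
    name: str
    mol_weight: float        # daltons
    rotatable_bonds: int
    h_bond_donors: int
    h_bond_acceptors: int
    clogp: float             # calculated log octanol/water partition coefficient


def passes_rule_of_three(fragment: Fragment) -> bool:
    """Apply the rule-of-three thresholds as read from the text above."""
    return (
        fragment.mol_weight < 300
        and fragment.rotatable_bonds <= 3
        and fragment.h_bond_donors <= 3
        and fragment.h_bond_acceptors <= 3
        and fragment.clogp < 3
    )


def filter_library(fragments: Iterable[Fragment]) -> List[Fragment]:
    """Keep only the fragments that satisfy every rule-of-three criterion."""
    return [f for f in fragments if passes_rule_of_three(f)]


# Hypothetical descriptor values, for illustration only.
library = [
    Fragment("fragment_small", 150.0, 2, 1, 2, 1.5),
    Fragment("fragment_heavy", 412.0, 2, 1, 2, 1.5),
    Fragment("fragment_flexible", 210.0, 6, 1, 2, 1.5),
    Fragment("fragment_greasy", 240.0, 1, 0, 1, 4.2),
]

for fragment in filter_library(library):
    print(fragment.name)
```

Keeping each threshold as a separate condition makes it easy to relax a single criterion when a library is deliberately built around other principles.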
Targeting Oligomeric Enzymes
Protein–protein interactions and assemblies, ranging from simple dimers to
extremely complex arrangements as seen
in the ribosome or the nuclear pore
complex, form the basis of most biological
processes, and there are usually numerous
points of contact between the macromolecules involved. Yet the protein–protein
interfaces formed by oligomerization are
not necessarily accompanied by a large
gain in free energy, and small molecules
have been shown to prevent critical
protein–protein interactions [17]. These
Figure 1. Conceptual organization of the deCODE biostructures Fragments of Life library. The current ~1,400-compound library contains
chemically tractable natural small molecule metabolites (FOL-Nat), metabolite-like compounds and their bioisosteres (FOL-NatD), and biaryl mimetics
of protein architecture (FOL-Biaryl). The FOL-Nat members include any natural molecule of molecular weight <350 daltons that exists as a substrate,
natural product, or allosteric regulator of any metabolic pathway in any cell type, such as the biosynthetic pathways for the neurotransmitter
serotonin (1) and the plant hormone auxin (2). The FOL-Nat members also include secondary metabolites such as bestatin (3), a secondary
metabolite of Streptomyces olivoreticuli [38]. FOL-NatD fragments are defined as heteroatom-containing derivatives, isosteres, or analogs of any FOL-Nat molecule. For example, fragments 4–7 contain the indole scaffold, which is known to be a privileged building block for drug molecules [39]. To
emulate protein architecture, the FOL-Biaryl fragments were selected from a variety of biaryl compounds that are potential mimics of protein α, β, or
γ turns [40–42]. These include a compound (8) whose structure in an energy-minimized state can be seen to mimic the architecture of an α-turn of a
protein structure (here, residues Ser65-Ile66-Leu67-Lys68 of PDB ID: 1RTP) and, similarly, a compound (9) whose structure mimics the β-turn of a
protein structure (residues Ala20-Ala21-Asp22-Ser23).
doi:10.1371/journal.pcbi.1000530.g001
findings have prompted recent discussion
of a structure-based approach aimed at
developing novel small-molecule antibiotics that modulate protein activity by
binding to an interface between subunits
within multi-protein complexes [18]. The
bacterial enzyme inorganic pyrophosphatase may serve as an example for this
approach, since it exists in a hexameric
state that requires conformational flexibility for its essential role in converting
inorganic pyrophosphate into phosphate
[19–21]. Moreover, whereas all bacterial
inorganic pyrophosphatases function as
homohexamers, the eukaryotic cytosolic
and mitochondrial inorganic pyrophosphatases function as homodimers [21].
Hence eukaryotic inorganic pyrophosphatases have different oligomeric interfaces
from those of bacterial enzymes. This
suggests that it may be possible to inhibit
the bacterial inorganic pyrophosphatase
safely by targeting its oligomeric state
rather than its highly conserved active
site. A similar approach has recently been
used to identify species-specific modulators
of porphobilinogen synthase (PBGS) activity [22]. SSGCID has solved the high-resolution X-ray crystal structure of inorganic pyrophosphatase from the pathogenic bacterium Burkholderia pseudomallei,
and a subsequent FOL screen of this target
identified several fragments that specifically bind at multiple oligomerization pockets
in a molecular interface between the two
trimers of the homohexamer (Figure 2).
While these fragments remain to be
validated in terms of their species-specific
inhibition of inorganic pyrophosphatase
activity, they represent potential starting
points for the development of novel
antibiotics.
Industry-Generated Structures and the Protein Data Bank
As we have seen above, protein structure information is the bread and butter of
structure-based drug discovery. Structural
genomics projects (Table 2) have substantially increased the number of protein
structures solved and have made this
information freely and openly available
(i.e., at no cost and without restriction by
copyright or other constraints) by depositing it in the Protein Data Bank (PDB)
[23]. Most publishers have policies that
require authors to deposit structural data
in the PDB at the time of publication, so
structures determined by academic researchers worldwide are, for the most
part, well disseminated. By contrast, the
pharmaceutical industry is sitting on a
mountain of structural data for protein–
ligand complexes from globally important
pathogens, which is not available to the
wider scientific community. The secrecy
engendered by the current economic
incentives driving drug discovery in the
commercial sector has led to a substantial
waste of precious resources through duplication of effort and inability to learn from
others’ successes and failures. The situation is unlikely to change without a
concerted effort to find ways to overcome
the financial and intellectual property
barriers that prevent dissemination of this
information. A recent publication suggested that open access industry–academia
partnerships may provide one possible
model [24]. We propose that the United
States National Institutes of Health, along
with other national and international
research-funding agencies, issue calls for
proposals that will fund the transfer of the
highly valuable structural information
from corporate databases into the PDB.
Such an effort would obviously require
discussion with industrial parties to negotiate mutually acceptable policies and
mechanisms for the deposition of these
structures in the public databases. These
might include relaxation of release standards for industrial entities, such that
structural information could be safely
deposited in PDB at the time of structure
determination and released only at a later
date more appropriate for protection of
intellectual property.
Figure 2. B. pseudomallei inorganic pyrophosphatase with bound ligand at an oligomeric interface. Homo-hexameric bacterial inorganic
pyrophosphatase is a dimer of trimers (blue and green). The illustration shows the hexamer structure in a complex with three ligand fragment
molecules (red spheres and stick structures represent fragment FOL 110), each of which is located at one of three “dimer of trimer” interfaces (1.5
ligands per monomer) (PDBID:3EJ0). The location of one pyrophosphate substrate (cyan spheres) at the active site of one of the monomers is
indicated here based on the superimposed structure of the hexamer with pyrophosphate bound in the active site (PDBID:3EIY). The binding sites of
the ligands (red) are clearly seen in a pocket formed by the homo-oligomeric assemblage, which is distant from the active site where pyrophosphate
(cyan) binds.
doi:10.1371/journal.pcbi.1000530.g002
Challenges for the Future
We are currently witnessing an explosion in technological and computational
advances in structural genomics, with
protein structures of hundreds or thousands of medically relevant targets from
infectious disease organisms likely to be
available over the next few years. This new
information provides both academic and
for-profit scientists with an unprecedented
opportunity to accelerate the development
of new and improved chemotherapeutic
agents against these pathogens. One major
challenge will be the adaptation of existing
fragment-based drug design methods to
match the scale of the structural genomics
era. New high-throughput methods need
to be developed for fragment-screening to
enhance the success rate for protein–
ligand structure determination.
Major attention must also be given to the
development of fully automated, very high
throughput crystal growth screening methods
to elucidate the binding of well-selected compounds to medically relevant
targets. These screens need to cover many
(up to 100) protein variants [25,26],
1,000–10,000 different small molecule
compounds, and approximately 1,000
different crystal growth conditions [27],
resulting in 10^8 to 10^9 conditions to be
tested for a single drug target. Obviously,
this will require development of even
smaller volume assays than those currently
in use [28–31]—down to the low picoliters—and automated detection of crystals
in the millions of crystallization chambers
[32–34]. Further development of automated capillary crystallization methods [35]
might provide another way to achieve the
very high throughput crystal screening
required for reaching the full power of
medical structural genomics in the future.
Cryoprotection of the crystals is a specific
hurdle, although it might be possible to
routinely collect and merge partial datasets
from multiple crystals under non-cryo
conditions. Alternatively, the use of micromeshes [36,37] and further miniaturization of trays and other crystal screening
tools may allow cryoprotection of many
crystals simultaneously.
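The size of the screening space quoted above follows from multiplying the three ranges given in the text. The short sketch below simply reproduces that arithmetic; the function name `screening_space` is an illustrative choice, and the inputs are the upper and lower figures stated in the paragraph rather than measured values.

```python
def screening_space(protein_variants, compounds, growth_conditions):
    """Number of variant x compound x condition combinations for one target."""
    return protein_variants * compounds * growth_conditions

# Figures taken from the text: up to 100 protein variants,
# 1,000 to 10,000 compounds, and roughly 1,000 crystal growth conditions.
low = screening_space(100, 1_000, 1_000)
high = screening_space(100, 10_000, 1_000)

print(f"{low:.0e} to {high:.0e} conditions per drug target")
```

With these inputs the product spans 10^8 to 10^9, which is why the text argues for picoliter-scale assays and automated crystal detection.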
In addition, existing databases will need
to be modified to allow easy dissemination
of the results from these fragment screens,
and a serious effort should be made to
persuade small and big pharma to release
coordinates of drug targets from globally
important infectious disease organisms. It
will also be critical (but challenging) for
structural biologists to collaborate with
medicinal chemists and molecular biologists to turn these fragments from promising leads into effective drugs. Together,
these steps should begin to release a flood
of structures that provide a tremendous
resource for improving health in rich and
poor countries alike.
Acknowledgments
The authors wish to thank all the individuals
who have dedicated themselves to the SSGCID
and MSGPP projects. In particular, we thank
Robin Stacy, Bart Staker, Alberto Napuli,
Frank E. Zucker, Erkang Fan, Christophe
Verlinde, Ethan Merritt, and Frederick Buckner, to name but a few.
References
1. Payne DJ, Gwynn MN, Holmes DJ,
Pompliano DL (2007) Drugs for bad bugs:
Confronting the challenges of antibacterial discovery. Nat Rev Drug Discov 6: 29–40.
2. Haquin S, Oeuillet E, Pajon A, Harris M,
Jones AT, et al. (2008) Data management in
structural genomics: An overview. Methods Mol
Biol 426: 49–79.
3. Wright HT, Reynolds KA (2007) Antibacterial
targets in fatty acid biosynthesis. Curr Opin
Microbiol 10: 447–453.
4. Brinster S, Lamberet G, Staels B, Trieu-Cuot P,
Gruss A, et al. (2009) Type II fatty acid synthesis
is not a suitable antibiotic target for gram-positive
pathogens. Nature 458: 83–86.
5. Hoon S, Smith AM, Wallace IM, Suresh S,
Miranda M, et al. (2008) An integrated platform
of genomic assays reveals small-molecule bioactivities. Nat Chem Biol 4: 498–506.
6. Ericsson UB, Hallberg BM, Detitta GT,
Dekker N, Nordlund P (2006) Thermofluor-based
high-throughput stability optimization of proteins
for structural studies. Anal Biochem 357:
289–298.
7. Congreve M, Chessari G, Tisi D, Woodhead AJ
(2008) Recent developments in fragment-based
drug discovery. J Med Chem 51: 3661–3689.
8. Verlinde CLMJ, Kim H, Bernstein BE,
Mande SC, Hol WG (1997) Antitrypanosomiasis
drug development based on structures of glycolytic enzymes. In: Veerapandian P, ed. Structure-based drug design. New York: Marcel Dekker. pp
365–394.
9. Rees DC, Congreve M, Murray CW, Carr R
(2004) Fragment-based lead discovery. Nat Rev
Drug Discov 3: 660–672.
10. Congreve M, Carr R, Murray C, Jhoti H (2003)
A “rule of three” for fragment-based lead
discovery? Drug Discov Today 8: 876–877.
11. Nienaber VL, Greer J (2000) Discovering novel
ligands for macromolecules using X-ray crystallographic screening. Nature Biotechnol 18:
1105–1108.
12. Neumann T, Junker HD, Schmidt K, Sekul R
(2007) SPR-based fragment screening: Advantages and applications. Curr Top Med Chem 7:
1630–1642.
13. Jhoti H, Cleasby A, Verdonk M, Williams G
(2007) Fragment-based screening using X-ray
crystallography and NMR spectroscopy. Curr
Opin Chem Biol 11: 485–493.
14. Erlanson DA (2006) Fragment-based lead discovery: A chemical update. Curr Opin Biotechnol
17: 643–652.
15. Bosch J, Robien MA, Mehlin C, Boni E,
Riechers A, et al. (2006) Using fragment cocktail
crystallography to assist inhibitor design of
Trypanosoma brucei nucleoside 2-deoxyribosyltransferase. J Med Chem 49: 5939–5946.
16. Davies DR, Mamat B, Magnusson OT,
Christensen J, Haraldsson MH, et al. (2009)
Discovery of leukotriene A4 hydrolase inhibitors
using metabolomics biased fragment crystallography. J Med Chem 52: 4694–4715.
17. Liuzzi M, Deziel R, Moss N, Beaulieu P,
Bonneau AM, et al. (1994) A potent peptidomimetic inhibitor of HSV ribonucleotide reductase
with antiviral activity in vivo. Nature 372:
695–698.
18. Wells JA, McClendon CL (2007) Reaching for
high-hanging fruit in drug discovery at protein-protein interfaces. Nature 450: 1001–1009.
19. Kankare J, Salminen T, Lahti R, Cooperman BS,
Baykov AA, et al. (1996) Structure of Escherichia
coli inorganic pyrophosphatase at 2.2 Å resolution. Acta Crystallogr D Biol Crystallogr 52:
551–563.
20. Oksanen E, Ahonen AK, Tuominen H, Tuominen V, Lahti R, et al. (2007) A complete structural description of the catalytic cycle of yeast pyrophosphatase. Biochemistry 46: 1228–1239.
21. Sivula T, Salminen A, Parfenyev AN, Pohjanjoki P, Goldman A, et al. (1999) Evolutionary aspects of inorganic pyrophosphatase. FEBS Lett 454: 75–80.
22. Lawrence SH, Ramirez UD, Tang L, Fazliyez F, Kundrat L, et al. (2008) Shape shifting leads to small-molecule allosteric drug discovery. Chem Biol 15: 586–596.
23. Berman H, Henrick K, Nakamura H, Markley JL (2007) The worldwide Protein Data Bank (wwPDB): Ensuring a single, uniform archive of PDB data. Nucleic Acids Res 35: D301–303.
24. Edwards AM, Bountra C, Kerr DJ, Willson TM (2009) Open access chemical and clinical probes to support drug discovery. Nat Chem Biol 5: 436–440.
25. Choi KH, Groarke JM, Young DC, Rossmann MG, Pevear DC, et al. (2004) Design, expression, and purification of a Flaviviridae polymerase using a high-throughput approach to facilitate crystal structure determination. Protein Sci 13: 2685–2692.
26. Graslund S, Sagemark J, Berglund H, Dahlgren LG, Flores A, et al. (2008) The use of systematic N- and C-terminal deletions to promote production and structural studies of recombinant proteins. Protein Expr Purif 58: 210–221.
27. Luft JR, Collins RJ, Fehrman NA, Lauricella AM, Veatch CK, et al. (2003) A deliberate approach to screening for initial crystallization conditions of biological macromolecules. J Struct Biol 142: 170–179.
28. Santarsiero BDYD, Lee CC, Spraggon G, Gu J, Scheibe D, Uber EC, Cornell EW, Nordmeyer RA, Kolbe WF, Jin J, Jones AL, Jaklevic JM, Schultz PG, Stevens RC (2002) An approach to rapid protein crystallization using nanodroplets. J Appl Crystallogr 35: 278–281.
29. Hansen CL, Skordalakes E, Berger JM, Quake SR (2002) A robust and scalable microfluidic metering method that allows protein crystal growth by free interface diffusion. Proc Natl Acad Sci U S A 99: 16531–16536.
30. Zheng B, Roach LS, Ismagilov RF (2003) Screening of protein crystallization conditions on a microfluidic chip using nanoliter-size droplets. J Am Chem Soc 125: 11170–11171.
31. Gerdts CJ, Elliott M, Lovell S, Mixon MB, Napuli AJ, et al. (2008) The plug-based nanovolume Microcapillary Protein Crystallization System (MPCS). Acta Crystallogr D Biol Crystallogr 64: 1116–1122.
32. Wilson J (2002) Towards the automated evaluation of crystallization trials. Acta Crystallogr D Biol Crystallogr 58: 1907–1914.
33. Pan S, Shavit G, Penas-Centeno M, Xu DH, Shapiro L, et al. (2006) Automated classification of protein crystallization images using support vector machines with scale-invariant texture and Gabor features. Acta Crystallogr D Biol Crystallogr 62: 271–279.
34. Liu R, Freund Y, Spraggon G (2008) Image-based crystal detection: A machine-learning approach. Acta Crystallogr D Biol Crystallogr 64: 1187–1195.
35. Fan E, Baker D, Fields S, Gelb MH, Buckner FS, et al. (2008) Structural genomics of pathogenic protozoa: An overview. Methods Mol Biol 426: 497–513.
36. Wagner A, Diez J, Schulze-Briese C, Schluckebier G (2009) Crystal structure of ultralente: A microcrystalline insulin suspension. Proteins 74: 1018–1027.
37. Thorne RESZ, Kmetko J, O’Niell J, Gillilan R (2003) Microfabricated mounts for high-throughput macromolecular cryocrystallography. J Applied Crystallography 36: 1455–1460.
38. Schorlemmer HU, Bosslet K, Dickneite G, Luben G, Sedlacek HH (1984) Studies on the mechanisms of action of the immunomodulator Bestatin in various screening test systems. Behring Inst Mitt: 157–173.
39. Costantino L, Barlocco D (2006) Privileged structures as leads in medicinal chemistry. Curr Med Chem 13: 65–85.
40. Biros SM, Moisan L, Mann E, Carella A, Zhai D, et al. (2007) Heterocyclic alpha-helix mimetics for targeting protein-protein interactions. Bioorg Med Chem Lett 17: 4641–4645.
41. Robinson JA (2008) Beta-hairpin peptidomimetics: design, structures and biological activities. Acc Chem Res 41: 1278–1288.
42. Saraogi I, Hamilton AD (2008) alpha-Helix mimetics as inhibitors of protein-protein interactions. Biochem Soc Trans 36: 1414–1417.
43. Root MJ, Steger HK (2004) HIV-1 gp41 as a target for viral entry inhibition. Curr Pharm Des 10: 1805–1825.
44. Weissenhorn W, Dessen A, Harrison SC, Skehel JJ, Wiley DC (1997) Atomic structure of the ectodomain from HIV-1 gp41. Nature 387: 426–430.
45. Ferrer M, Kapoor TM, Strassmaier T, Weissenhorn W, Skehel JJ, et al. (1999) Selection of gp41-mediated HIV-1 cell entry inhibitors from biased combinatorial libraries of nonnatural binding elements. Nat Struct Biol 6: 953–960.
46. Lapatto R, Blundell T, Hemmings A, Overington J, Wilderspin A, et al. (1989) X-ray analysis of HIV-1 proteinase at 2.7 Å resolution confirms structural homology among retroviral enzymes. Nature 342: 299–302.
47. Miller M, Schneider J, Sathyanarayana BK, Toth MV, Marshall GR, et al. (1989) Structure of complex of synthetic HIV-1 protease with a substrate-based inhibitor at 2.3 Å resolution. Science 246: 1149–1152.
48. Navia MA, Fitzgerald PM, McKeever BM, Leu CT, Heimbach JC, et al. (1989) Three-dimensional structure of aspartyl protease from human immunodeficiency virus HIV-1. Nature 337: 615–620.
49. Wlodawer A, Miller M, Jaskolski M, Sathyanarayana BK, Baldwin E, et al. (1989) Conserved folding in retroviral proteases: Crystal structure of a synthetic HIV-1 protease. Science 245: 616–621.
50. Wlodawer A, Vondrasek J (1998) Inhibitors of HIV-1 protease: A major success of structure-assisted drug design. Annu Rev Biophys Biomol Struct 27: 249–284.
51. Abdel-Rahman HM, Al-karamany GS, El-Koussi NA, Youssef AF, Kiso Y (2002) HIV protease inhibitors: Peptidomimetic drugs and future perspectives. Curr Med Chem 9: 1905–1922.
52. Chrusciel RA, Strohbach JW (2004) Non-peptidic HIV protease inhibitors. Curr Top Med Chem 4: 1097–1114.
53. Das K, Lewi PJ, Hughes SH, Arnold E (2005) Crystallography and the design of anti-AIDS drugs: Conformational flexibility and positional adaptability are important in the design of nonnucleoside HIV-1 reverse transcriptase inhibitors. Prog Biophys Mol Biol 88: 209–231.
54. Kohlstaedt LA, Wang J, Friedman JM, Rice PA, Steitz TA (1992) Crystal structure at 3.5 Å resolution of HIV-1 reverse transcriptase complexed with an inhibitor. Science 256: 1783–1790.
55. Smerdon SJ, Jager J, Wang J, Kohlstaedt LA, Chirino AJ, et al. (1994) Structure of the binding site for nonnucleoside inhibitors of the reverse transcriptase of human immunodeficiency virus type 1. Proc Natl Acad Sci U S A 91: 3911–3915.
56. Babu YS, Chand P, Bantia S, Kotian P, Dehghani A, et al. (2000) BCX-1812 (RWJ-270201): Discovery of a novel, highly potent, orally active, and selective influenza neuraminidase inhibitor through structure-based drug design. J Med Chem 43: 3482–3486.
57. Bossart-Whitaker P, Carson M, Babu YS, Smith CD, Laver WG, et al. (1993) Three-dimensional structure of influenza A N9 neuraminidase and its complex with the inhibitor 2-deoxy 2,3-dehydro-N-acetyl neuraminic acid. J Mol Biol 232: 1069–1083.
58. Kim CU, Lew W, Williams MA, Liu H, Zhang L, et al. (1997) Influenza neuraminidase inhibitors possessing a novel hydrophobic interaction in the enzyme active site: Design, synthesis, and structural analysis of carbocyclic sialic acid analogues with potent anti-influenza activity. J Am Chem Soc 119: 681–690.
59. von Itzstein M, Wu WY, Kok GB, Pegg MS, Dyason JC, et al. (1993) Rational design of potent sialidase-based inhibitors of influenza virus replication. Nature 363: 418–423.
60. Hadfield AT, Lee W, Zhao R, Oliveira MA, Minor I, et al. (1997) The refined structure of human rhinovirus 16 at 2.15 Å resolution: Implications for the viral life cycle. Structure 5: 427–441.
61. Merritt EA, Zhang Z, Pickens JC, Ahn M,
Hol WG, et al. (2002) Characterization and
crystal structure of a high-affinity pentavalent
receptor-binding inhibitor for cholera toxin and
E. coli heat-labile enterotoxin. J Am Chem Soc
124: 8818–8824.
62. Hu X, Nguyen KT, Jiang VC, Lofland D,
Moser HE, et al. (2004) Macrocyclic inhibitors
for peptide deformylase: A structure-activity
relationship study of the ring size. J Med Chem
47: 4941–4949.
63. Aronov AM, Verlinde CL, Hol WG, Gelb MH
(1998) Selective tight binding inhibitors of trypanosomal glyceraldehyde-3-phosphate dehydrogenase via structure-based drug design. J Med
Chem 41: 4790–4799.
64. Bressi JC, Choe J, Hough MT, Buckner FS, Van
Voorhis WC, et al. (2000) Adenosine analogues as
inhibitors of Trypanosoma brucei phosphoglycerate
kinase: Elucidation of a novel binding mode for a
2-amino-N(6)-substituted adenosine. J Med Chem
43: 4135–4150.
65. Jin L, Harrison SC (2002) Crystal structure of human calcineurin complexed with cyclosporin A and human cyclophilin. Proc Natl Acad Sci U S A 99: 13522–13526.
66. Rahuel J, Rasetti V, Maibaum J, Rueger H, Goschke R, et al. (2000) Structure-based drug design: The discovery of novel nonpeptide orally active inhibitors of human renin. Chem Biol 7: 493–504.
67. Lam PY, Clark CG, Li R, Pinto DJ, Orwat MJ, et al. (2003) Structure-based design of novel guanidine/benzamidine mimics: Potent and orally bioavailable factor Xa inhibitors as novel anticoagulants. J Med Chem 46: 4405–4418.
68. Terasaka T, Kinoshita T, Kuno M, Seki N, Tanaka K, et al. (2004) Structure-based design, synthesis, and structure-activity relationship studies of novel non-nucleoside adenosine deaminase inhibitors. J Med Chem 47: 3730–3743.
69. Noble ME, Endicott JA, Johnson LN (2004) Protein kinase inhibitors: Insights into drug design from structure. Science 303: 1800–1805.
Review
The Key Role of Genomics in Modern Vaccine and Drug
Design for Emerging Infectious Diseases
Kate L. Seib1, Gordon Dougan2, Rino Rappuoli1*
1 Novartis Vaccines and Diagnostics, Siena, Italy, 2 The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
Abstract: It can be argued that the arrival of the
“genomics era” has significantly shifted the paradigm of
vaccine and therapeutics development from microbiological to sequence-based approaches. Genome sequences
provide a previously unattainable route to investigate the
mechanisms that underpin pathogenesis. Genomics,
transcriptomics, metabolomics, structural genomics, proteomics, and immunomics are being exploited to perfect
the identification of targets, to design new vaccines and
drugs, and to predict their effects in patients. Furthermore, human genomics and related studies are providing
insights into aspects of host biology that are important in
infectious disease. This ever-growing body of genomic
data and new genome-based approaches will play a
critical role in the future to enable timely development of
vaccines and therapeutics to control emerging infectious
diseases.
By controlling debilitating and often-lethal infectious diseases,
vaccines and antibiotics have had an enormous impact on world
health. Now, with the arrival of the “genomics era,” a paradigm
shift is occurring in the development of vaccines, and potentially
also in the development of antibiotics, that is providing fresh
impetus to this field. The world still faces a huge burden of
infection, however, from classic pathogens (e.g., typhoid, measles),
recently discovered causes of disease (e.g., Helicobacter pylori and
hepatitis C virus [HCV]), and emerging infectious diseases (EIDs,
e.g., H1N1 swine flu and severe acute respiratory syndrome
coronavirus [SARS-CoV]). In addition, variant forms of previously identified infectious diseases are reemerging (e.g., Streptococcus
pyogenes, also known as group A streptococcus [GAS], and dengue
fever), along with antibiotic-resistant forms of microbes (e.g.,
methicillin-resistant Staphylococcus aureus [MRSA] and Mycobacterium
tuberculosis) [1,2] (for a list of EIDs see http://www3.niaid.nih.gov/topics/emerging/list.htm).
The World Health Organization
(WHO) estimates that we can expect at least one such new
pathogen to appear every year.
The fact that an infectious disease has emerged or reemerged
indicates immune naïvety in the infected population, or altered
virulence potential or an increase in antibiotic/antiviral resistance
in the pathogen population. The rapid development of vaccines
and therapeutics that target these pathogens is therefore essential
to limit their spread. Traditional empirical approaches that screen
for vaccines or drugs a few candidates at a time are time-consuming and have often proven insufficient to control many
EIDs, particularly when the causative pathogens are antigenically
diverse (e.g., HIV), cannot be cultivated in the laboratory (e.g.,
HCV), lack suitable animal models of infection (e.g., Neisseria spp.),
have complex mechanisms of pathogenesis (e.g., retroviruses),
and/or are controlled by mucosal or T cell–dependent immune
responses rather than humoral immune responses (e.g., Shigella
spp., M. tuberculosis) [3]. For many EIDs, the wealth of information
emerging in the genome era has already had a significant impact
on the way we approach vaccine and therapeutic development.
For EIDs that appear in the near future, genomics will be in the
first line of defense in terms of antigen identification, diagnostic
development, and functional characterization.
Since the completion of the genome sequence of Haemophilus
influenzae (the first finished bacterial genome sequence) in 1995 [4],
advances in sequencing technology and bioinformatics have
produced an exponential growth of genome sequence information.
At least one genome sequence is now available for each major
human pathogen. As of October 2009, over 1,000 bacterial genomes
were “completed” (i.e., closed genomes and whole genome shotgun
sequences) and more than 1,000 were ongoing; over 3,000 viral
genomes were completed (http://www.genomesonline.org/gold.cgi,
http://www.ncbi.nlm.nih.gov/genomes/MICROBES/microbial_taxtree.html,
http://cmr.jcvi.org/tigr-scripts/CMR/shared/Genomes.cgi).
For a bacterial pathogen, which may have more
than 4,000 genes, the genome sequence provides the complete
genetic repertoire of antigens or drug targets from which novel
candidates can be identified. For viral pathogens that may possess
fewer than 10 genes, genomics can be used to define the variability
that may exist between isolates. Host genetic factors also play a role
in infectious disease [5,6], however, and the availability of
“complete” human genome sequences, as well as large-scale human
genome projects (see http://www.1000genomes.org/), are valuable
resources. Hence, the sequences of both pathogen and host genomes
can facilitate identification of a growing number of potential vaccine
and drug targets (Figure 1). It is estimated that 10–100 times more
candidates can be identified in one to two years using genomics-based approaches than can be identified by conventional methods
in the same time frame. Furthermore, genomics-based vaccine
projects have substantially increased our understanding of microbial
physiology, epidemiology, pathogenesis, and protein functions (see
Box 1).
PLoS Genetics | www.plosgenetics.org
Citation: Seib KL, Dougan G, Rappuoli R (2009) The Key Role of Genomics in
Modern Vaccine and Drug Design for Emerging Infectious Diseases. PLoS
Genet 5(10): e1000612. doi:10.1371/journal.pgen.1000612
Editor: Nicholas J. Schork, University of California San Diego and The Scripps
Research Institute, United States of America
Published October 26, 2009
Copyright: © 2009 Seib et al. This is an open-access article distributed under
the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.
Funding: KLS is the recipient of an Australian NHMRC CJ Martin Fellowship. GD is
supported by The Wellcome Trust. KLS and RR are employed by Novartis Vaccines.
The funders had no role in the preparation of the article.
Competing Interests: KLS and RR are employed by Novartis Vaccines.
* E-mail: [email protected]
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal
collection (http://ploscollections.org/emerginginfectiousdisease/).
October 2009 | Volume 5 | Issue 10 | e1000612
Figure 1. Genomics-based approaches used in the control of EIDs from the outbreak of a disease to the development of a vaccine
or drug. (A) The causative agent of a disease may first be identified from patient samples by using metagenomics. (B) Vaccine and therapeutic
targets can be identified from the pathogen genome using a variety of screening approaches that focus on the genome, transcriptome, proteome,
immunome or structural genome. (C) The human genome can be screened to avoid homologies or similarities with pathogen vaccine and
therapeutic targets, or to identify new targets. (D) Once candidate vaccine and therapeutic targets have been identified they must be shown to
provide protection against disease and to be safe for use in patients. (E) The clinically tested vaccine or therapeutic can then be licensed for use. The
clinical responses of a vaccine and/or therapeutic can be analyzed using human genome based studies (dotted arrows). The pathogen genome can
also be used to analyze mutants that are able to evade the immune system in vaccinated subjects or organisms that develop antibiotic resistance.
Examples of the approaches indicated are given in Table 1.
doi:10.1371/journal.pgen.1000612.g001
Box 1: Reverse Vaccinology Drives the Discovery of New Protein Functions
Reverse vaccinology involves the in silico screening of the
entire genome of a pathogen to find genes that encode
proteins with the attributes of good vaccine targets, using
either the genome of a single pathogenic isolate or the pan-genome (the genomic information from several isolates) of a
pathogenic species.
Pili in pathogenic streptococci play a key role in
virulence and are promising vaccine candidates
The identification of pili (long filamentous structures that extend
from the bacterial surface) in the main pathogenic strains of
streptococci is a good example of how genomics can lead to
the discovery of protein functions and increased
understanding of host–pathogen interactions. The pili of
gram-negative bacteria are well-described virulence factors.
Little was known, however, about pili in gram-positive
bacteria before the sequencing and analysis of the genomes
of S. pyogenes, S. agalactiae, and S. pneumoniae (reviewed in
[72]).
During analysis of eight S. agalactiae genome sequences,
three protective antigens identified by pan-genomic reverse
vaccinology [20] were found to contain LPXTG motifs typical
of cell wall-anchored proteins and seen to assemble into pili
[73]. Further bioinformatics analysis revealed three independent loci that encode structurally distinct pilus types, each of
which contains two surface-exposed antigens capable of
eliciting protective immunity in mice [75]. Because of the
limited variability of S. agalactiae pili, it has been suggested
that a combination of only three pilin subunits could lead to
broad protective immunity [74].
Following the identification of S. agalactiae pili, typical pilus
regions were identified in the available S. pyogenes genomes
based on the presence of genes encoding LPXTG-containing
proteins. In addition, a combination of recombinant pilus
proteins was shown to confer protection in mice against
mucosal challenge with virulent S. pyogenes isolates [75].
Falugi and colleagues have since found that S. pyogenes pili
are encoded by nine different gene clusters, and they
estimate that a vaccine comprising a combination of 12
backbone variants could provide protection against over
90% of circulating S. pyogenes strains [76].
The availability of multiple complete genome sequences for
S. pneumoniae, and the increased understanding of pilus
proteins in other pathogenic streptococci, led to the
discovery of two pilus “islands” that encode proteins that
play a role in adherence to lung epithelial cells and
colonization in a murine model of infection, where they
elicit host inflammatory responses [77,78]. In addition, the
pilus subunits confer protection in passive and active
immunization models [79]. The presence of pili that contain
protective antigens in all three principal streptococcal
pathogens indicates that these structures play an important
role in virulence.
Reverse vaccinology leads to identification of the
fHBP and its role in meningococcal species specificity
Serogroup B N. meningitidis (MenB) strains are responsible
for the majority of meningococcal disease in the developed
world, yet there is no comprehensive MenB vaccine
available. Screening of the MenB genome for vaccine
candidates by using reverse vaccinology led to the
discovery of the meningococcal factor H-binding protein
(fHBP) [15], which was recently suggested to play an
important role in the species specificity of N. meningitidis
[80]. fHBP is a component of the Novartis multivalent MenB
vaccine that entered Phase III clinical testing in 2008 [16,17]
and is also under investigation by Wyeth Vaccines
(designated LP2086) [81] and other groups [82]. Initially
identified as the genome-derived Neisseria antigen 1870
(GNA1870), a Neisseria-specific putative surface lipoprotein
of unknown function, fHBP was renamed because of its
ability to bind complement factor H (fH), a molecule that
down-regulates activation of the complement alternative
pathway. Hence, binding of fH to the surface of Neisseria
allows the pathogen to evade complement-mediated killing
by the innate immune system [83]. fHBP is expressed by all N.
meningitidis strains studied [84]. It induces high levels of
bactericidal antibodies in mice [16] and is important for
survival of bacteria in human serum and blood [83,85,86].
The discovery that binding of fH to N. meningitidis is specific
for human fH, and that human fH alone is able to down-regulate complement activation and bactericidal activity
leading to increased bacterial survival has significant
implications for the study of this organism [80]. The
administration of human fH to infant rats challenged with
MenB led to a greater than 10-fold increase in survival of
bacteria [80], providing an important insight into host–
pathogen interactions that may lead to the development of
new animal models of infection.
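The LPXTG motif mentioned in Box 1 (Leu-Pro-any residue-Thr-Gly) lends itself to a simple pattern search over predicted protein sequences. The following is a minimal sketch of such a scan, not the pipeline used in the cited studies; the sequences and the function name `find_lpxtg` are invented for illustration, and a real analysis would typically also check for a signal peptide and for the motif's position near the C-terminus.

```python
import re

# LPXTG: Leu-Pro-(any residue)-Thr-Gly, written in one-letter amino acid codes.
LPXTG = re.compile(r"LP.TG")

def find_lpxtg(proteins):
    """Return {protein_id: [0-based start positions]} for sequences containing the motif."""
    hits = {}
    for protein_id, sequence in proteins.items():
        positions = [m.start() for m in LPXTG.finditer(sequence.upper())]
        if positions:
            hits[protein_id] = positions
    return hits

# Toy sequences, made up for this example; they are not real streptococcal proteins.
toy_proteins = {
    "orf_001": "MKKLAVSAGLPETGEEANSLLG",
    "orf_002": "MSTNPKPQRKTKRNTNRRPQDV",
    "orf_003": "MNKKTLLASLPQTGDNSAAVLGLLA",
}

hits = find_lpxtg(toy_proteins)
print(hits)
```

Here `orf_001` and `orf_003` are reported because they contain LPETG and LPQTG respectively, while `orf_002` has no match.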
From the onset of an outbreak, metagenomics (the study of all the
genetic material recovered directly from a sample) can be applied to
samples from affected patients to aid the rapid identification of the
causative agent [7,8]. Once the complete genome sequence of the
organism is available, high-throughput approaches can be used to
screen for target molecules, as outlined below and in Table 1 [9,10].
Screening approaches vary depending on the nature of the pathogen
but are based on several accepted principles and key requirements of
vaccines and therapeutics, including the need for targets to be (i)
expressed and accessible to the host immune system, or to a
therapeutic agent, during human disease; (ii) genetically conserved;
(iii) important for survival or pathogenesis; and (iv) free of measurable
homology or similarity to host factors. Although many of the
approaches described here focus on vaccine development, which
involves screening of candidates for immunogenicity, they are largely
applicable to drug development by altering the selection criteria used
and screening candidates against compound libraries [11–13].
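The four screening requirements listed above can be read as a simple filter over annotated candidates. The following is a minimal sketch only: the `Candidate` fields, the example names, and both thresholds (`min_conservation`, `max_host_identity`) are hypothetical choices made for illustration and are not taken from any of the cited studies.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    expressed_in_vivo: bool       # criterion (i): expressed and accessible during disease
    conservation: float           # criterion (ii): fraction of isolates carrying the gene
    required_for_virulence: bool  # criterion (iii): important for survival or pathogenesis
    host_identity: float          # criterion (iv): best identity to any host protein (0..1)


def passes_screen(c, min_conservation=0.9, max_host_identity=0.3):
    """Return True if a candidate satisfies all four criteria.

    Both thresholds are assumed values for this sketch, not published cut-offs.
    """
    return (
        c.expressed_in_vivo
        and c.conservation >= min_conservation
        and c.required_for_virulence
        and c.host_identity <= max_host_identity
    )


def shortlist(candidates, **thresholds):
    """Names of candidates that pass the screen, in input order."""
    return [c.name for c in candidates if passes_screen(c, **thresholds)]
```

For drug rather than vaccine targets, the same structure would apply with different fields and cut-offs, which is the sense in which the selection criteria can be altered.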
PLoS Genetics | www.plosgenetics.org
Reverse Vaccinology, Pan-genomics, and
Comparative Genomics
The idea behind reverse vaccinology is to screen an entire
pathogen genome to find genes that encode proteins with the
attributes of good vaccine targets, such as bacterial
surface-associated proteins [14]. These proteins can then undergo
normal laboratory evaluation for immunogenicity. The Neisseria
meningitidis serogroup B (MenB) reverse vaccinology project provides
the “proof of concept” for this type of approach. This project
identified more novel vaccine candidates in 18 months than had
been discovered in 40 years of conventional vaccinology [15].
Analysis of the genome sequence of the virulent MenB strain MC58
found 2,158 predicted open reading frames (ORFs); these were
screened using bioinformatics tools to identify 570 ORFs that were
predicted to encode surface-exposed or secreted proteins that might
be accessible to the immune system [15]. Antigen screening
October 2009 | Volume 5 | Issue 10 | e1000612
Table 1. Approaches to identify vaccine and/or drug targets against EIDs in the genomic era.
Columns: Approach; Methods Used; Limitations of Method; Example (Organism; Disease)

Approach: Genomics/reverse vaccinology. Analysis of the genetic material of an organism in order to identify the repertoire of protein antigens/drug targets the organism has the potential to express.
Methods used: Bioinformatics screening of the genome sequence to identify ORFs predicted to be exposed on the surface of the pathogen or secreted, expression of recombinant proteins, generation of antibodies in mice to confirm surface exposure, and bactericidal activity [14].
Limitations of method: Prediction algorithms need to be validated. Non-protein antigens including polysaccharides or glycolipids, and post-translational modifications cannot be identified. High-throughput cloning and protein expression is required.
Example organism: Serogroup B N. meningitidis [15,16]
Disease: Major cause of septicemia and meningitis in the developed world.

Approach: Pan-genomics. Analysis of the genetic material of several organisms of a single species to identify conserved antigens/targets and ensure the chosen target covers the diversity of the organism.
Methods used: Similar to above, but ORFs are chosen by screening of multiple genomes with either direct sequencing or comparative genome hybridization [18].
Limitations of method: Sequences of multiple isolates of a species are required. Similar limitations as described above.
Example organism: S. agalactiae [20]
Disease: Leading cause of neonatal bacterial sepsis, pneumonia, and meningitis in the US and Europe.

Approach: Comparative genomics. Analysis of the genetic material of several individuals of a single species, to identify antigens/targets that are present in pathogenic strains but absent in commensal strains, and thus important for disease.
Methods used: Similar to pan-genomics, but ORFs are chosen by screening of genomes from multiple strains of pathogenic and commensal strains of a species [18,21].
Limitations of method: Similar limitations as for the above two approaches.
Example organism: E. coli [22]
Disease: Major cause of mild to severe diarrhea, hemolytic-uremic syndrome, and urinary tract infections.

Approach: Transcriptomics. Analysis of the set of RNA transcripts expressed by an organism under a specified condition.
Methods used: Gene expression is evaluated in vitro or in vivo using DNA microarrays or cDNA sequencing [24].
Limitations of method: There is no direct correlation between the levels of mRNA and protein. In vivo studies require relatively large amounts of mRNA.
Example organism: V. cholerae [26]
Disease: Causes diseases ranging from self-limiting to severe, life-threatening diarrhea, wound infections, and sepsis.

Approach: Functional genomics. Analysis of the role of genes and proteins in order to identify genes required for survival under specific conditions.
Methods used: Genes that are functionally essential in specific conditions in vitro or in vivo are determined by gene inhibition followed by screening of mutants in animal models or cell culture to identify attenuated clones [87].
Limitations of method: Genetic tools, acceptance of transposons, and natural competence of the pathogen are required.
Example organism: H. pylori [32]
Disease: Major cause of duodenal and gastric ulcers and stomach cancer as a result of chronic low-level inflammation of the stomach lining.

Approach: Proteomics. Analysis of the set of proteins expressed by an organism under a specified condition and/or in specific cellular locations (e.g., on the cell surface).
Methods used: 2D-PAGE, MS, and chromatographic techniques to identify proteins from whole cells, fractionated samples, or the cell surface [34].
Limitations of method: Proteins with low abundance and/or solubility and proteins that are only expressed in vivo may not be identified.
Example organism: S. pyogenes [36]
Disease: Cause of a range of diseases from mild pharyngitis to severe toxic shock syndrome, necrotizing fasciitis, and rheumatic fever.

Approach: Immunomics. Analysis of the subset of proteins/epitopes that interact with the host immune system.
Methods used: Analysis of seroreactive proteins, using 2D-PAGE, phage display libraries, or protein microarrays, probed with host sera [38]. Bioinformatics prediction of B cell and T cell epitopes [37].
Limitations of method: Potential bias against sequences that cannot be displayed. Large conformational epitopes made up of noncontiguous amino acids may not be detected. Prediction of B cell epitopes is difficult due to the need to identify conformational epitopes.
Example organism: S. aureus [39]
Disease: Cause of wound infections. Has emerged as a significant opportunistic pathogen due to antibiotic resistance.

Approach: Structural genomics. Analysis of the three-dimensional structure of an organism’s proteins and how they interact with antibodies or therapeutics.
Methods used: NMR or crystallography to determine the structure of proteins in the presence/absence of antibodies or therapeutics [51].
Limitations of method: Poor understanding of determinants of immunogenicity, immunodominance, and structure-function relationships.
Example organism: HIV [53]
Disease: Causative agent of AIDS.

Approach: Vaccinomics/immunogenetics/pharmacogenetics. Analysis of how the human immune system responds to a vaccine or drug.
Methods used: Investigation of genetic heterogeneity/polymorphisms in the host, at the individual or population level, that may alter immune responses to vaccines [68] or metabolism of therapeutics [71].
Limitations of method: Ethical issues of “personalized” medicine. Immense diversity of the human genome and, in particular, in the human immune response.
Example organism: Mumps virus [69]
Disease: Cause of disease ranging from self-limiting parotid inflammation to epididymo-orchitis, meningitis, and encephalitis.
doi:10.1371/journal.pgen.1000612.t001
continued on the basis of several criteria: the ability of antigens to be
expressed in Escherichia coli as recombinant proteins (350 candidates);
confirmation by ELISA and flow cytometry that the antigen is
exposed on the cell surface (91 candidates); the ability of induced
antibodies to elicit killing, as measured by serum bactericidal assay
and/or passive protection in infant rat assays (28 candidates); and
screening of a panel of diverse meningococcal isolates to determine
whether the antigens are conserved. This approach resulted in the
development of a multi-component recombinant MenB vaccine
that entered Phase III clinical trials in 2008 [16,17].
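The attrition through the MenB screening stages can be made explicit with a little arithmetic. The sketch below uses only the counts quoted in the text (2,158, 570, 350, 91, and 28); the stage labels and the `funnel_report` helper are illustrative names, not part of the original study.

```python
# Candidate counts at each stage of the MenB screen, as quoted in the text.
FUNNEL = [
    ("predicted ORFs in strain MC58", 2158),
    ("predicted surface-exposed or secreted", 570),
    ("expressed in E. coli as recombinant protein", 350),
    ("surface exposure confirmed (ELISA, flow cytometry)", 91),
    ("bactericidal and/or passively protective antibodies", 28),
]


def funnel_report(stages):
    """For each stage return (label, count, fraction of previous stage, fraction of first stage)."""
    total = stages[0][1]
    previous = total
    rows = []
    for label, count in stages:
        rows.append((label, count, count / previous, count / total))
        previous = count
    return rows


if __name__ == "__main__":
    for label, count, step_fraction, overall_fraction in funnel_report(FUNNEL):
        print(f"{label:55s} {count:5d} {step_fraction:7.1%} {overall_fraction:7.1%}")
```

On these figures, roughly a quarter of predicted ORFs pass the in silico surface-exposure filter, and about 1.3% of the original ORFs (28 of 2,158) survive to the bactericidal or passive-protection stage.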
As multiple genome sequences become available for a single
species, the concept of pan-genomic reverse vaccinology is
emerging as a powerful tool to identify vaccine candidates in
antigenically diverse species [18]. Pan-genomics aims to identify
the full complement of genes in a species, based on the superset of
genes in several strains of the same species. Analysis of the genome
sequences of eight Streptococcus agalactiae (also known as group B
streptococcus) strains revealed substantial genetic heterogeneity
and the extended gene repertoire of the species [19]. Screening
found a total of 589 genes predicted to encode surface-exposed or
secreted proteins in the S. agalactiae pan-genome (396 from the
‘‘core genome’’—genes conserved in all strains—and 193 from the
‘‘dispensable genome’’—genes that are present in two or more
strains and are hence considered dispensable for survival). Based
on further screening of this pool of candidates, including the ability
of recombinant proteins to provide protection when used to
immunize animals, a combination of four antigens—only one of
which is in the core genome—was selected and shown to confer
protection against a panel of S. agalactiae strains [20].
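The core/dispensable split described above amounts to counting, for each gene, how many strains carry it. The following is a minimal sketch assuming that each strain has already been reduced to a set of gene-family identifiers (which in practice requires ortholog clustering, not shown here). The function name, the input layout, and the separate "strain-specific" bucket for genes found in exactly one strain are assumptions for illustration; the dispensable bucket follows the text's wording of genes present in two or more strains but not all.

```python
from collections import Counter


def partition_pan_genome(strains):
    """Split the pan-genome of `strains` into core, dispensable, and strain-specific genes.

    `strains` maps a strain name to an iterable of gene-family identifiers.
    core: present in every strain
    dispensable: present in at least two strains but not all
    strain_specific: present in exactly one strain
    """
    n = len(strains)
    if n < 2:
        raise ValueError("at least two strains are needed to define a pan-genome")
    # Count each gene once per strain, even if it is listed more than once.
    carriers = Counter(g for genes in strains.values() for g in set(genes))
    core = {g for g, k in carriers.items() if k == n}
    dispensable = {g for g, k in carriers.items() if 2 <= k < n}
    strain_specific = {g for g, k in carriers.items() if k == 1}
    return core, dispensable, strain_specific
```

Restricting each bucket to genes predicted to encode surface-exposed or secreted proteins would give counts analogous to the 396 core and 193 dispensable candidates reported for S. agalactiae.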
Whereas genome sequencing projects have typically focused on
pathogenic organisms, comparison of the genomes of pathogenic and
nonpathogenic strains allows vaccine and drug targets to be identified
on the basis of proteins that are specifically involved in pathogenesis
[21]. Comparative studies of up to 17 commensal and pathogenic E.
coli genomes identified genes unique to certain pathogenic strains that
are largely absent in commensal strains. This filter decreases the pool of
targets to be screened and potentially limits any detrimental effects of
therapeutics on the composition of the commensal flora [22].
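The comparative filter described here is essentially a set difference between the genes seen in pathogenic strains and those carried by commensal strains. A minimal sketch follows, again assuming precomputed gene-family sets per strain; `pathogen_associated` and the `max_commensal_fraction` tolerance (which allows for genes that are "largely" rather than strictly absent from commensals) are hypothetical names and values.

```python
def pathogen_associated(pathogenic, commensal, max_commensal_fraction=0.0):
    """Genes found in any pathogenic strain and carried by at most
    `max_commensal_fraction` of the commensal strains.

    Both arguments map strain names to collections of gene-family identifiers.
    Returns a sorted list so that the output is deterministic.
    """
    pathogen_genes = set()
    for genes in pathogenic.values():
        pathogen_genes.update(genes)

    n_commensal = len(commensal)
    selected = []
    for gene in pathogen_genes:
        carriers = sum(gene in set(genes) for genes in commensal.values())
        if n_commensal == 0 or carriers / n_commensal <= max_commensal_fraction:
            selected.append(gene)
    return sorted(selected)
```

With the default tolerance of zero the function returns only genes absent from every commensal strain; raising the tolerance trades a larger candidate pool against a higher risk of affecting the commensal flora.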
New sequencing technologies will also open up opportunities for
monitoring pathogen vaccine escape by screening for evidence of
immune selection in the genomes of pathogen populations before
and after vaccine selection. By deep-sequencing of bacterial and
viral populations it will be possible to identify antigens under
immune selection by monitoring the clustering of single nucleotide
polymorphisms (SNPs) and other mutations that affect protein
sequence. This approach has already been used to search for
evidence of antigenic variation/selection in populations of
Salmonella enterica serovar Typhi [23], where variation is extremely
limited. Similar sequencing strategies could be applied to
populations of bacteria taken before or after a vaccine trial in a
particular geographical region.
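One simple way to look for the clustering of protein-altering mutations mentioned above is a sliding-window count along a gene. The sketch below assumes that variant calling and annotation have already produced a list of 0-based nucleotide positions of nonsynonymous SNPs; the window size, step, and `min_snps` cut-off are arbitrary illustrative values. It flags dense windows only and is not a statistical test for selection.

```python
from bisect import bisect_left


def snp_windows(positions, length, window=300, step=100, min_snps=3):
    """Flag windows [start, start + window) containing at least `min_snps` SNPs.

    positions: 0-based coordinates of nonsynonymous SNPs within the gene
    length: gene length in nucleotides
    Returns a list of (start, end, count) tuples in ascending order of start.
    """
    ordered = sorted(positions)
    flagged = []
    last_start = max(length - window, 0)
    for start in range(0, last_start + 1, step):
        end = start + window
        # Number of SNP positions p with start <= p < end.
        count = bisect_left(ordered, end) - bisect_left(ordered, start)
        if count >= min_snps:
            flagged.append((start, end, count))
    return flagged
```

Comparing the flagged windows between isolates sampled before and after a vaccine trial would be one way to apply the same idea to vaccine escape, although a real analysis would also need to account for sampling depth and background mutation rate.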
Beyond Genomics: Other -Omics Approaches to
Study Pathogens
Pathogen genes that are up-regulated during infection and/or
essential for microorganism survival or pathogenesis can be
identified by using transcriptomics, i.e., the analysis of a near
complete set of RNA transcripts expressed by the pathogen under
a specified condition. Comprehensive DNA-based microarray
chips (probed with cDNA generated from RNA by reverse
transcription) [24] and ultra-high-throughput sequencing technologies
that allow rapid sequencing and direct quantification of
cDNA [25] enable the transcriptome of a pathogen to be
characterized and particular types of gene product to be identified.
For example, genes involved in the hyperinfectious state of Vibrio
cholerae, which appears after passage through the human
gastrointestinal tract, were identified through a comparison of
the transcriptome of bacteria isolated directly from stool samples
of cholera patients with that of V. cholerae grown in vitro [26].
Similarly, analysis of the transcription profile of M. tuberculosis
during early infection in immune-competent (BALB/c) and severe
combined immunodeficient (SCID) mice revealed a set of 67 genes
activated exclusively in response to the host immune system [27].
Functional genomics, which links genotype to phenotype through
transcriptomics and proteomics, has been applied to many
pathogens to identify genes essential to survival or virulence that
may be valid vaccine candidates. DNA microarrays can be used to
screen comprehensive libraries of pathogen mutants, by comparing
bacterial isolates from before and after passage through animal
models or exposure to compound libraries to identify attenuated
clones [28–30]. For example, these methods have been used to
identify 65 novel MenB genes that are required for the pathogen to
cause septicemia in infant rats [31], 47 genes essential for H. pylori
gastric colonization of the gerbil [32], and genes contributing to
M. tuberculosis persistence in the host [33].
Analysis of a pathogen’s proteome (the near complete set of
proteins expressed under a specified condition) to reveal potential
vaccine and drug candidates can add significant value to in silico
approaches [34]. High-throughput proteomic analyses can be
performed by using mass spectrometry (MS), chromatographic
techniques, and protein microarrays [35]. A novel proteome-based
approach has been applied to identify the surface proteins of GAS
by making use of proteolytic enzymes to “shave” the bacterial
surface, releasing exposed proteins and partially exposed peptides.
Seventeen surface proteins of a virulent GAS strain were identified
in this way by using MS and genome sequence analysis. Their
location on the pathogen surface was confirmed by flow
cytometry, and one of them provided protective immunity in a
mouse model of the disease [36].
The proteome of a pathogen can also be screened to identify the
immunome (the near complete set of pathogen proteins or
epitopes that interact with the host immune system) using in vitro
or in silico techniques [37,38]. In vitro identification and screening
of the immunome are based on the idea that antibodies present in
serum from a host, which has been exposed to a pathogen,
represent a molecular “imprint” of the pathogen’s immunogenic
proteins and can be used to identify vaccine candidates. As such,
several techniques have been developed to allow the high-throughput
display of pathogen proteins, and the subsequent
screening for proteins that interact with antibodies in sera.
Immunogenic surface proteins of several organisms have been
identified, including S. aureus using 2D-PAGE, membrane blotting,
and MS [39]; S. agalactiae, S. pyogenes, and Streptococcus pneumoniae
using phage- or E. coli-based comprehensive genomic peptide
expression libraries [38,40]; and Francisella tularensis (the causative
agent of tularemia or rabbit fever) [41] and V. cholerae using protein
microarray chips [42]. Protein microarrays, in which proteins
from the pathogen are spotted onto a microarray chip, can also
be used to characterize protein–drug interactions, as well as
other protein–protein, protein–nucleic acid, ligand–receptor, and
enzyme–substrate interactions [43].
The ability to predict in silico which pathogen epitopes will be
recognized by B cells or T cells has greatly improved in recent
years [44]. Large-scale screening of pathogens including HIV,
Bacillus anthracis, M. tuberculosis, F. tularensis, Yersinia pestis (the
causative agent of bubonic plague), flaviviruses, and influenza for
B cell and T cell epitopes is currently underway [45,46]. Although
epitope prediction is not foolproof, it can serve as a guide for
further biological evaluation. T cell epitopes are presented by
MHC/HLA proteins on the surface of antigen-presenting cells,
which vary considerably between hosts, complicating the task of
functional epitope prediction. Additionally, B cell epitopes can be
both linear and conformational. The ultimate aim of researchers
in this field of study would be to engineer a single peptide that
represents defined epitope combinations from a protein or
organism, enabling the genetic variability of both pathogen and
host to be overcome [44].
Structural genomics (the study of the three-dimensional
structures of the proteins produced by a species) is increasingly
being applied to vaccine and drug development as a result of the
explosion of genome and proteome data, and continuing
improvements in the fields of protein expression, purification,
and structural determination [47]. The structure-based design of
antiviral therapeutics has led to the development of drugs directed
at the active sites of the HIV-1 protease [48] and influenza
neuraminidase [49]. More than 45,000 high-resolution protein
structures are available in public databases (see http://www.wwpdb.org/stats.html),
and several initiatives have been established to pursue
high-throughput characterization of protein
structures on a genome-wide scale [50], focusing on determining
and understanding the structural basis of immune-dominant and
immune-recessive antigens as well as protein active sites and
potential drug-binding sites [51,52]. For example, structural
characterization of the HIV envelope proteins gp120 and gp41
has revealed mechanisms used by the virus to evade host antibody
responses, many of which involve hypervariability in immunodominant epitopes [53,54]. Based on this information, immune
refocusing (e.g., by retargeted glycosylation, deletion, and/or
substitution of amino acids) has been used to dampen the response
to variable immunodominant epitopes of the envelope glycoprotein gp160, enabling the host to respond to previously subdominant epitopes [55]. High-throughput modification of proteins and
their screening for immunogenicity and interaction with antimicrobials is predicted to become more common as techniques
evolve [51].
The Contribution of Human Genomics
When designing new vaccines, one important consideration is
the risk that the vaccine might generate “self” immune reactions
against host epitopes; immune responses against a pathogen
antigen can cross-react with host antigens if homologies exist in the
primary amino acid sequence or structure, potentially leading to
damage to the host tissue [56]. Drugs aimed at pathogen targets
could also theoretically target similar host molecules. The
availability of the human genome sequence combined with
methods for predicting B cell and T cell epitopes will facilitate
screening for the presence of homologies between candidate
microbial vaccine antigens and proteins in humans, enabling issues
of autoimmunity and cross-reactivity to be tackled [57]. As such,
vaccine or drug targets identified using methods based on
pathogen genomics should be screened for homology or similarity
to human proteins in silico, using programs such as BLAST (Basic
Local Alignment Search Tool; http://blast.ncbi.nlm.nih.gov/Blast.cgi)
to query human genome databases. Interestingly,
analysis of 30 viral genomes revealed that around 90% of viral
pentapeptides, which could be components of epitopes, are
identical to human peptides [58]. There is little homology,
however, between validated immunogenic disease-associated
peptides/epitopes and host peptides [57,59], suggesting that
screening approaches that include prediction of immunogenicity
could improve the pool of target candidates.
It is important to keep in mind that we do not fully understand
how self-tolerance is broken, so we currently have no perfect way
of predicting all potential autoimmune triggers that could be
associated with vaccination. While many links have been made
between autoimmune disease and vaccination, they have been
confirmed in only a small number of cases (reviewed in [60]). For
example, treatment-resistant Lyme arthritis is associated in certain
patients with immune reactivity to the outer surface protein A
(OspA) of the causative agent of Lyme disease, Borrelia burgdorferi,
and an OspA epitope (OspA165–173) has homology to the human
lymphocyte function-associated antigen (hLFA)-1αL [61]. As a
result, the OspA-based Lyme disease vaccine (LYMErix) was taken
off the market in 2002, but a recombinant OspA lacking the
potentially autoreactive T cell epitope has been proposed as a
replacement vaccine [62].
Rather than targeting drugs to pathogen enzymes, an
alternative approach has focused on targeting the host-cell proteins
that are exploited by pathogens for replication and survival. The
use of techniques including microarray-based analysis of
virus-induced host gene expression has revealed several possible targets
[63,64]. The cholesterol-lowering drugs statins, for example, have
an anti-HIV effect that is believed to be mediated by preventing
activation of the host protein Rho, which is activated by the HIV
envelope protein and required for virus entry to the cell [65].
Furthermore, such studies can improve our understanding of the
host immune responses that protect against a pathogen (i.e.,
innate, antibody, Th1, or Th2 responses), which will aid the
selection of appropriate vaccine adjuvants. For example, induction
of interferon signaling early in infection may be critical to confer
protection against SARS-CoV, as determined from functional
genomic studies of early host responses to SARS-CoV infection in
the lungs of macaques [66].
Many of the genes of the human immune system are highly
polymorphic, which enables the population as a whole to generate
sufficient immunological diversity to combat EIDs. This variation
also affects the outcome of vaccination and treatment. The
International HapMap Project has identified over 3.1 million
SNPs in 270 individuals [67] and the 1000 Genomes Project aims
to identify even more genetic variants. The field of vaccinomics
(also called immunogenetics) investigates heterogeneity in host
genetic markers that results in variations in vaccine-induced
immune responses, with the aim of predicting and minimizing
vaccine failures or adverse events [68]. For example, polymorphisms
of HLA and immunoregulatory cytokine receptor genes
are associated with variable outcomes of vaccination against
mumps [69]. Similarly, pharmacogenetics, which investigates
genetic differences in the way individuals metabolize therapeutics,
has found that human variability in the speed of metabolism of the
common first-line tuberculosis drug isoniazid is associated with
genetic variants, including SNPs, in the gene encoding arylamine
N-acetyltransferase (NAT2) [70,71]. The ability to predict an
individual’s response to a vaccine or drug may eventually allow
physicians to determine whether a patient is genetically susceptible
to a disease, the possible adverse effects of a vaccine or drug, and
the appropriate schedule or dose to use.
Challenges for the Future
We predict that genomics will greatly aid the control of EIDs
because of the increased efficiency with which vaccine and
therapeutic targets can be identified using the genome-based
approaches described above. Furthermore, we anticipate the
continual refinement and development of novel genome-based
approaches as sequencing becomes faster and more affordable.
Several challenges remain, however, in the identification of these
targets and in the processes needed to bring a new vaccine or drug
to the market. Understanding the molecular nature of epitopes,
the mechanisms of action of adjuvants, and T cell and mucosal
immunity are key priorities to be tackled in the coming years [3].
These issues can be addressed by improved structural studies of
antigen epitopes and the compilation of databases containing
information on structure, immunogenicity, and in silico B cell and
T cell epitope predictions. Genome-based development of effective
vaccines and therapeutics is still largely dependent on the
availability of valid models to measure efficacy and protection
against disease; however, the increased understanding of microbial
pathogenesis that is emerging from genomics should greatly aid in
this respect. Likewise, the continued development of animal
models with knockout and allele-specific mutations in key
components of the immune response will greatly increase
understanding of the type of immune response needed to control
disease and the ways in which the immune system can be
programmed to protect the host against disease. Unfortunately,
the stepwise series of prelicensure clinical trials (Phase I, II, and III)
that are required to document the safety, immunogenicity, and
efficacy of a vaccine are still highly time-consuming and costly. We
can only hope that the increasingly “smart” identification and
design of targets, and the fresh impetus given to the fields of
vaccine and drug development by the arrival of genomics, will
enable increased success of those vaccines and drugs that do make
it into clinical development.
References
1. Dong J, Olano JP, McBride JW, Walker DH (2008) Emerging pathogens:
Challenges and successes of molecular diagnostics. J Mol Diagn 10: 185–197.
2. Yang X, Yang H, Zhou G, Zhao GP (2008) Infectious disease in the genomic
era. Annu Rev Genomics Hum Genet 9: 21–48.
3. Rappuoli R (2007) Bridging the knowledge gaps in vaccine design. Nat
Biotechnol 25: 1361–1366.
4. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, et al. (1995)
Whole-genome random sequencing and assembly of Haemophilus influenzae Rd.
Science 269: 496–512.
5. Casanova JL, Abel L (2007) Human genetics of infectious diseases: A unified
theory. EMBO J 26: 915–922.
6. Burgner D, Jamieson SE, Blackwell JM (2006) Genetic susceptibility to infectious
diseases: Big is beautiful, but will bigger be even better? Lancet Infect Dis 6:
653–663.
7. Nakamura S, Yang CS, Sakon N, Ueda M, Tougan T, et al. (2009) Direct
metagenomic detection of viral pathogens in nasal and fecal specimens using an
unbiased high-throughput sequencing approach. PLoS ONE 4: e4219.
doi:10.1371/journal.pone.0004219.
8. Bittar F, Richet H, Dubus JC, Reynaud-Gaubert M, Stremler N, et al. (2008)
Molecular detection of multiple emerging pathogens in sputa from cystic fibrosis
patients. PLoS ONE 3: e2908. doi:10.1371/journal.pone.0002908.
9. Rinaudo CD, Telford JL, Rappuoli R, Seib KL (2009) Vaccinology in the
genome era. J Clin Invest 119: 2515–2525.
10. Kaushik DK, Sehgal D (2008) Developing antibacterial vaccines in genomics
and proteomics era. Scand J Immunol 67: 544–552.
11. Pucci MJ (2007) Novel genetic techniques and approaches in the microbial
genomics era: identification and/or validation of targets for the discovery of new
antibacterial agents. Drugs R D 8: 201–212.
12. Mills SD (2006) When will the genomics investment pay off for antibacterial
discovery? Biochem Pharmacol 71: 1096–1102.
13. Van Voorhis WC, Hol WGJ, Myler PJ, Stewart LJ (2009) The role of medical
structural genomics in discovering new drugs for infectious diseases. PLoS
Comput Biol 5(10): e1000530. doi:10.1371/journal.pcbi.1000530.
14. Masignani V, Rappuoli R, Pizza M (2002) Reverse vaccinology: A genome-based
approach for vaccine development. Expert Opin Biol Ther 2: 895–905.
15. Pizza M, Scarlato V, Masignani V, Giuliani MM, Arico B, et al. (2000)
Identification of vaccine candidates against serogroup B meningococcus by
whole-genome sequencing. Science 287: 1816–1820.
16. Giuliani MM, Adu-Bobie J, Comanducci M, Arico B, Savino S, et al. (2006) A
universal vaccine for serogroup B meningococcus. Proc Natl Acad Sci U S A
103: 10834–10839.
17. Rappuoli R (2008) The application of reverse vaccinology, Novartis MenB
vaccine developed by design. 16th International Pathogenic Neisseria Conference, Rotterdam, The Netherlands: http://www.IPNC2008.org. Abstr. 81 p.
18. Muzzi A, Masignani V, Rappuoli R (2007) The pan-genome: Towards a
knowledge-based discovery of novel targets for vaccines and antibacterials. Drug
Discov Today 12: 429–439.
19. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, et al. (2005) Genome
analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the
microbial ‘‘pan-genome.’’ Proc Natl Acad Sci U S A 102: 13950–13955.
20. Maione D, Margarit I, Rinaudo CD, Masignani V, Mora M, et al. (2005)
Identification of a universal Group B streptococcus vaccine by multiple genome
screen. Science 309: 148–150.
21. Bhagwat AA, Bhagwat M (2008) Methods and tools for comparative genomics of
foodborne pathogens. Foodborne Pathog Dis 5: 487–497.
22. Rasko DA, Rosovitz MJ, Myers GS, Mongodin EF, Fricke WF, et al. (2008) The
pangenome structure of Escherichia coli: Comparative genomic analysis of E. coli
commensal and pathogenic isolates. J Bacteriol 190: 6881–6893.
23. Holt KE, Parkhill J, Mazzoni CJ, Roumagnac P, Weill FX, et al. (2008) High-throughput
sequencing provides insights into genome variation and evolution in
Salmonella typhi. Nat Genet 40: 987–993.
24. Dhiman N, Bonilla R, O’Kane DJ, Poland GA (2001) Gene expression
microarrays: A 21st century tool for directed vaccine design. Vaccine 20: 22–30.
25. Morozova O, Marra MA (2008) Applications of next-generation sequencing
technologies in functional genomics. Genomics 92: 255–264.
26. Merrell DS, Butler SM, Qadri F, Dolganov NA, Alam A, et al. (2002) Host-induced
epidemic spread of the cholera bacterium. Nature 417: 642–645.
27. Talaat AM, Lyons R, Howard ST, Johnston SA (2004) The temporal expression
profile of Mycobacterium tuberculosis infection in mice. Proc Natl Acad Sci U S A
101: 4602–4607.
28. Scarselli M, Giuliani MM, Adu-Bobie J, Pizza M, Rappuoli R (2005) The
impact of genomics on vaccine design. Trends Biotechnol 23: 84–91.
29. Saenz HL, Dehio C (2005) Signature-tagged mutagenesis: technical advances in
a negative selection method for virulence gene identification. Curr Opin
Microbiol 8: 612–619.
30. Sakata T, Winzeler EA (2007) Genomics, systems biology and drug development
for infectious diseases. Mol Biosyst 3: 841–848.
31. Sun YH, Bakshi S, Chalmers R, Tang CM (2000) Functional genomics of
Neisseria meningitidis pathogenesis. Nat Med 6: 1269–1273.
32. Kavermann H, Burns BP, Angermuller K, Odenbreit S, Fischer W, et al. (2003)
Identification and characterization of Helicobacter pylori genes essential for gastric
colonization. J Exp Med 197: 813–822.
33. Sassetti CM, Boyd DH, Rubin EJ (2003) Genes required for mycobacterial
growth defined by high density mutagenesis. Mol Microbiol 48: 77–84.
34. Zhu H, Bilgin M, Snyder M (2003) Proteomics. Annu Rev Biochem 72:
783–812.
35. Grandi G (2006) Genomics and proteomics in reverse vaccines. Methods
Biochem Anal 49: 379–393.
36. Rodriguez-Ortega MJ, Norais N, Bensi G, Liberatori S, Capo S, et al. (2006)
Characterization and identification of vaccine candidate proteins through
analysis of the group A Streptococcus surface proteome. Nat Biotechnol 24:
191–197.
37. De Groot AS, McMurry J, Moise L (2008) Prediction of immunogenicity: in
silico paradigms, ex vivo and in vivo correlates. Curr Opin Pharmacol 8:
620–626.
38. Meinke A, Henics T, Hanner M, Minh DB, Nagy E (2005) Antigenome
technology: A novel approach for the selection of bacterial vaccine candidate
antigens. Vaccine 23: 2035–2041.
39. Vytvytska O, Nagy E, Bluggel M, Meyer HE, Kurzbauer R, et al. (2002)
Identification of vaccine candidate antigens of Staphylococcus aureus by serological
proteome analysis. Proteomics 2: 580–590.
40. Giefing C, Meinke AL, Hanner M, Henics T, Bui MD, et al. (2008) Discovery of
a novel class of highly conserved vaccine antigens using genomic scale antigenic
fingerprinting of pneumococcus with human antibodies. J Exp Med 205:
117–131.
41. Eyles JE, Unal B, Hartley MG, Newstead SL, Flick-Smith H, et al. (2007)
Immunodominant Francisella tularensis antigens identified using proteome
microarray. Proteomics 7: 2172–2183.
42. Rolfs A, Montor WR, Yoon SS, Hu Y, Bhullar B, et al. (2008) Production and
sequence validation of a complete full length ORF collection for the pathogenic
bacterium Vibrio cholerae. Proc Natl Acad Sci U S A 105: 4364–4369.
43. Stoevesandt O, Taussig MJ, He M (2009) Protein microarrays: high-throughput
tools for proteomics. Expert Rev Proteomics 6: 145–157.
44. De Groot AS, Moise L, McMurry JA, Martin W (2008) Epitope-based immunome-derived
vaccines: a strategy for improved design and safety. In: Falus A, ed. Clinical
Applications of Immunomics. New York: Springer. pp 39–69.
45. Sette A, Fleri W, Peters B, Sathiamurthy M, Bui HH, et al. (2005) A roadmap
for the immunomics of category A-C pathogens. Immunity 22: 155–161.
46. De Groot AS, Rivera DS, McMurry JA, Buus S, Martin W (2008) Identification
of immunogenic HLA-B7 ‘‘Achilles’ heel’’ epitopes within highly conserved
regions of HIV. Vaccine 26: 3059–3071.
47. Lundstrom K (2007) Structural genomics and drug discovery. J Cell Mol Med
11: 224–238.
48. Kaldor SW, Kalish VJ, Davies JF, 2nd, Shetty BV, Fritz JE, et al. (1997)
Viracept (nelfinavir mesylate, AG1343): A potent, orally bioavailable inhibitor of
HIV-1 protease. J Med Chem 40: 3979–3985.
49. Kim CU, Lew W, Williams MA, Liu H, Zhang L, et al. (1997) Influenza
neuraminidase inhibitors possessing a novel hydrophobic interaction in the
enzyme active site: Design, synthesis, and structural analysis of carbocyclic sialic
acid analogues with potent anti-influenza activity. J Am Chem Soc 119: 681–690.
50. Todd AE, Marsden RL, Thornton JM, Orengo CA (2005) Progress of structural
genomics initiatives: An analysis of solved target structures. J Mol Biol 348:
1235–1260.
51. Dormitzer PR, Ulmer JB, Rappuoli R (2008) Structure-based antigen design: A
strategy for next generation vaccines. Trends Biotechnol 26: 659–667.
52. Nicola G, Abagyan R (2009) Structure-based approaches to antibiotic drug
discovery. Curr Protoc Microbiol Chapter 17: Unit 17.2.
53. Zhou T, Xu L, Dey B, Hessell AJ, Van Ryk D, et al. (2007) Structural definition
of a conserved neutralization epitope on HIV-1 gp120. Nature 445: 732–737.
54. Prabakaran P, Dimitrov AS, Fouts TR, Dimitrov DS, KuanTeh J (2007)
Structure and function of the HIV envelope glycoprotein as entry mediator,
vaccine immunogen, and target for inhibitors. In: Advances in Pharmacology.
Academic Press. pp 33–97.
55. Tobin GJ, Trujillo JD, Bushnell RV, Lin G, Chaudhuri AR, et al. (2008)
Deceptive imprinting and immune refocusing in vaccine design. Vaccine 26:
6189–6199.
56. Ercolini AM, Miller SD (2009) The role of infections in autoimmune disease.
Clin Exp Immunol 155: 1–15.
57. Amela I, Cedano J, Querol E (2007) Pathogen proteins eliciting antibodies do
not share epitopes with host proteins: A bioinformatics approach. PLoS ONE 2:
e512. doi:10.1371/journal.pone.0000512.
58. Kanduc D, Stufano A, Lucchese G, Kusalik A (2008) Massive peptide sharing
between viral and human proteomes. Peptides 29: 1755–1766.
59. Kanduc D, Lucchese A, Mittelman A (2007) Non-redundant peptidomes from
DAPs: Towards ‘‘the vaccine’’? Autoimmun Rev 6: 290–294.
60. Wraith DC, Goldman M, Lambert PH (2003) Vaccination and autoimmune
disease: What is the evidence? Lancet 362: 1659–1666.
61. Gross DM, Forsthuber T, Tary-Lehmann M, Etling C, Ito K, et al. (1998)
Identification of LFA-1 as a candidate autoantigen in treatment-resistant Lyme
arthritis. Science 281: 703–706.
62. Willett TA, Meyer AL, Brown EL, Huber BT (2004) An effective second-generation outer surface protein A-derived Lyme vaccine that eliminates a
potentially autoreactive T cell epitope. Proc Natl Acad Sci U S A 101: 1303–1308.
63. Kellam P (2006) Attacking pathogens through their hosts. Genome Biol 7: 201.
64. Andeweg AC, Haagmans BL, Osterhaus AD (2008) Virogenomics: the virus-host interaction revisited. Curr Opin Microbiol 11: 461–466.
65. del Real G, Jimenez-Baranda S, Mira E, Lacalle RA, Lucas P, et al. (2004) Statins
inhibit HIV-1 infection by down-regulating Rho activity. J Exp Med 200: 541–547.
66. de Lang A, Baas T, Teal T, Leijten LM, Rain B, et al. (2007) Functional
genomics highlights differential induction of antiviral pathways in the lungs of
SARS-CoV-infected macaques. PLoS Pathog 3: e112. doi:10.1371/journal.
ppat.0030112.
67. International HapMap Consortium (2007) A second generation human
haplotype map of over 3.1 million SNPs. Nature 449: 851–861.
68. Poland GA, Ovsyannikova IG, Jacobson RM (2009) Application of pharmacogenomics to vaccines. Pharmacogenomics 10: 837–852.
69. Ovsyannikova IG, Jacobson RM, Dhiman N, Vierkant RA, Pankratz VS, et al.
(2008) Human leukocyte antigen and cytokine receptor gene polymorphisms
associated with heterogeneous immune responses to mumps viral vaccine.
Pediatrics 121: e1091–1099.
70. Sim E, Lack N, Wang CJ, Long H, Westwood I, et al. (2008) Arylamine N-acetyltransferases: Structural and functional implications of polymorphisms.
Toxicology 254: 170–183.
71. Baudhuin LM, Langman LJ, O’Kane DJ (2007) Translation of pharmacogenetics into clinically relevant testing modalities. Clin Pharmacol Ther 82:
373–376.
72. Telford JL, Barocchi MA, Margarit I, Rappuoli R, Grandi G (2006) Pili in
gram-positive pathogens. Nat Rev Microbiol 4: 509–519.
73. Lauer P, Rinaudo CD, Soriani M, Margarit I, Maione D, et al. (2005) Genome
analysis reveals pili in Group B Streptococcus. Science 309: 105.
74. Margarit I, Rinaudo CD, Galeotti CL, Maione D, Ghezzo C, et al. (2009)
Preventing bacterial infections with pilus-based vaccines: The group B
streptococcus paradigm. J Infect Dis 199: 108–115.
75. Mora M, Bensi G, Capo S, Falugi F, Zingaretti C, et al. (2005) Group A
Streptococcus produce pilus-like structures containing protective antigens and
Lancefield T antigens. Proc Natl Acad Sci U S A 102: 15641–15646.
76. Falugi F, Zingaretti C, Pinto V, Mariani M, Amodeo L, et al. (2008) Sequence
variation in Group A Streptococcus pili and association of pilus backbone types
with Lancefield T serotypes. J Infect Dis 198: 1834–1841.
77. Barocchi MA, Ries J, Zogaj X, Hemsley C, Albiger B, et al. (2006) A
pneumococcal pilus influences virulence and host inflammatory responses. Proc
Natl Acad Sci U S A 103: 2857–2862.
78. Bagnoli F, Moschioni M, Donati C, Dimitrovska V, Ferlenghi I, et al. (2008) A
second pilus type in Streptococcus pneumoniae is prevalent in emerging serotypes and
mediates adhesion to host cells. J Bacteriol 190: 5480–5492.
79. Gianfaldoni C, Censini S, Hilleringmann M, Moschioni M, Facciotti C, et al.
(2007) Streptococcus pneumoniae pilus subunits protect mice against lethal challenge.
Infect Immun 75: 1059–1062.
80. Granoff DM, Welsch JA, Ram S (2009) Binding of complement factor H (fH) to
Neisseria meningitidis is specific for human fH and inhibits complement activation
by rat and rabbit sera. Infect Immun 77: 764–769.
81. McNeil LK, Murphy E, Zhao XJ, Guttmann S, Harris S, et al. (2009) Detection
of LP2086 on the cell surface of Neisseria meningitidis and its accessibility in the
presence of serogroup B capsular polysaccharide. Vaccine 27: 3417–3421.
82. Koeberling O, Seubert A, Granoff DM (2008) Bactericidal antibody responses
elicited by a meningococcal outer membrane vesicle vaccine with overexpressed
factor H-binding protein and genetically attenuated endotoxin. J Infect Dis 198:
262–270.
83. Madico G, Welsch JA, Lewis LA, McNaughton A, Perlman DH, et al. (2006)
The meningococcal vaccine candidate GNA1870 binds the complement
regulatory protein factor H and enhances serum resistance. J Immunol 177:
501–510.
84. Masignani V, Comanducci M, Giuliani MM, Bambini S, Adu-Bobie J, et al.
(2003) Vaccination against Neisseria meningitidis using three variants of the
lipoprotein GNA1870. J Exp Med 197: 789–799.
85. Welsch JA, Ram S, Koeberling O, Granoff DM (2008) Complement-dependent
synergistic bactericidal activity of antibodies against factor H-binding protein, a
sparsely distributed meningococcal vaccine antigen. J Infect Dis 197:
1053–1061.
86. Seib KL, Serruto D, Oriente F, Delany I, Adu-Bobie J, et al. (2009) Factor H-binding protein is important for meningococcal survival in human whole blood
and serum and in the presence of the antimicrobial peptide LL-37. Infect
Immun 77: 292–299.
87. Mazurkiewicz P, Tang CM, Boone C, Holden DW (2006) Signature-tagged
mutagenesis: Barcoding mutants for genome-wide screens. Nat Rev Genet 7:
929–939.
Review
Toward the Use of Genomics to Study Microevolutionary
Change in Bacteria
Daniel Falush*
Department of Microbiology, University College Cork, Environmental Research Institute, Lee Road, Cork, Ireland
Abstract: Bacteria evolve rapidly in response to the
environment they encounter. Some environmental changes are
experienced numerous times by bacteria from the
same population, providing an opportunity to dissect the
genetic basis of adaptive evolution. Here I discuss two
examples in which the patterns of rapid change provide
insight into medically important bacterial phenotypes,
namely immune escape by Neisseria meningitidis and host
specificity of Campylobacter jejuni. Genomic analysis of
populations of bacteria from these species holds great
promise but requires appropriate concepts and statistical
tools.
Bacteria lack a natural reproductive system, comparable to
meiosis in eukaryotes, that segregates genes randomly. Instead,
they evolve progressively through mostly small genetic changes, a
proportion of which have noteworthy phenotypic effects. Some
phenotypes are intrinsically difficult to study in the laboratory:
virulence in humans or adaptation to particular ecological niches,
for example. For these traits in particular, a promising avenue for
scientific investigation is to identify the genetic changes that have
provided the basis for their evolution in natural populations.
Most human phenotypes are hard to study in vitro and,
consequently, methods for relating differences amongst humans to
natural genetic variation are well developed. Association studies
were proposed as an effective way of identifying genes with small
phenotypic effects more than a decade ago [1] and, although
initially controversial [2], the recent development of arrays for
genotyping hundreds of thousands of single nucleotide
polymorphisms (SNPs) scattered across the whole genome has allowed the
approach to be successfully applied to many different human
diseases and other phenotypes [3]. This success should inspire the
development of equivalent protocols within bacteriology.
One challenge in developing generally applicable protocols for
mapping phenotypic traits in bacteria is that processes by which
microevolution occurs vary tremendously between species. For
example, the human pathogen Mycobacterium tuberculosis, the causal
agent of tuberculosis (TB), diverged recently from an obscure
organism occasionally isolated from humans in Africa called
Mycobacterium canetti [4]. M. tuberculosis shows very little variation
and there is no evidence of strains acquiring DNA by import from
other M. tuberculosis strains or indeed from any other organism, so
that individuals are clones of each other, distinguished only by rare
mutations or other small changes. By contrast, individual
Helicobacter pylori, a cause of gastric cancer, acquire DNA from
other members of the species at an extremely high rate.
Consequently, as well as varying in gene content [5], strains
isolated from different host individuals in the same ethnic group
typically differ from each other at approximately 3% of
nucleotides in core genes, and this diversity segregates nearly
randomly [6]. The majority of bacterial species fall between these
extremes, with their genomes showing signs of both clonal descent
and DNA import from other strains.
In this essay, I will argue that the clonal mode of reproduction
shared by all bacteria and Archaea, in which replication occurs by
binary fission, in fact provides an extremely powerful context for
association studies. These studies will require both appropriate
technologies for genotyping and evolutionary analysis and
judiciously chosen strain collections. I will here concentrate on
two examples in which placing evolutionary changes in their
clonal context provides the power to relate phenotype to genotype.
Population-scale genome sequencing promises to allow a full and
unbiased catalogue of variation within the same clonal context.
This reconstruction will facilitate identification of loci that show
correlations with phenotype or anomalous patterns that indicate
natural selection, with minimal assumptions about the
mechanisms by which phenotypes change.
Example 1: Immune Escape during Clonal Spread
of Neisseria meningitidis
Neisseria meningitidis lives in the human nasopharynx and is best
known for its role in meningitis and other forms of meningococcal
disease. N. meningitidis is a major cause of morbidity and mortality
in childhood in industrialised countries and is responsible for
epidemics, principally in Africa and Asia. Many lineages persist
stably within human populations, causing little disease. There are
a handful of ‘‘hyperinvasive’’ lineages, however, that have a
distinct epidemiology, spreading rapidly from location to location
and causing clusters of disease cases but not persisting in any one
place.
Mark Achtman and colleagues examined variation within a
single hyperinvasive lineage of N. meningitidis, designated subgroup
III, over a period of three decades [7]. The strains within subgroup
III showed little diversity in most of their housekeeping and other
genes surveyed. A few loci were identified that did show variation,
however, allowing clonal relationships to be partially reconstructed.
This reconstruction demonstrated that there were strong
bottlenecks during geographical spread, with a single ancestor for
each major wave of infection. It also showed that, notwithstanding
the low overall level of variation, certain genes encoding specific
antigens changed repeatedly in different countries and pandemic
waves.
Citation: Falush D (2009) Toward the Use of Genomics to Study Microevolutionary Change in Bacteria. PLoS Genet 5(10): e1000627. doi:10.1371/journal.pgen.1000627
Editor: David S. Guttman, University of Toronto, Canada
Published October 26, 2009
Copyright: © 2009 Daniel Falush. This is an open-access article distributed
under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.
Funding: The author is funded by Science Foundation of Ireland grant number
05/FE1/B882. The funders had no role in the preparation of the article.
Competing Interests: The author has declared that no competing interests
exist.
* E-mail: [email protected]
This article is part of the ‘‘Genomics of Emerging Infectious Disease’’ PLoS Journal
collection (http://ploscollections.org/emerginginfectiousdisease/).
The most remarkable variation was found in the transferrin-binding protein B gene (tbpB), which encodes a protein responsible
for iron uptake that is expressed on the surface of the bacterium.
This gene had evolved on three occasions by nonsynonymous
point mutations that altered the structure of the protein and on 21
occasions by import of different versions of the protein from a
variety of sources, including from N. lactamica, a closely related and
entirely noninvasive species that also colonizes humans (Figure 1).
The import events vary: analysis of similar tbpB changes in a
closely related lineage showed that between 2 kb and 10 kb of
sequence was transferred, which often altered the sequence of the
flanking genes as well as tbpB [8]. In each case, however, an effect
of the imported DNA was to change the externally exposed part of
the protein from the usual version (called the family 4 version) to
one of two antigenically highly distinct versions (family 1 and
family 3).
The fact that functionally equivalent changes to tbpB are
achieved by heterogeneous genetic events shows that the large
number of imports is not caused by a recombination mechanism
that is specific to the locus. Instead it reflects the amplifying effect
of natural selection within the large number of bacteria that
circulate during epidemics. Imports happen at a low rate
throughout the genome, but those that cause an antigenic change
at the tbpB locus have a selective advantage, meaning that they are
observed at a much higher rate than imports elsewhere in the
genome.
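The amplifying effect of selection described above can be illustrated with a minimal sketch. It assumes a deterministic haploid model with non-overlapping generations in which the variant has relative fitness 1 + s; the function name, the starting frequency, and the selection coefficients are hypothetical values chosen for illustration and are not taken from the studies cited here.

```python
def variant_frequency(p0: float, s: float, generations: int) -> float:
    """Deterministic frequency of a variant with relative fitness 1 + s.

    Haploid model with non-overlapping generations. Genetic drift and
    recurrent import are ignored (simplifying assumptions).
    """
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    # The odds of carrying the variant grow by a factor (1 + s) per generation.
    odds = p0 / (1.0 - p0) * (1.0 + s) ** generations
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Hypothetical scenario: an import that starts in one cell per million.
    for s in (0.0, 0.05, 0.10):
        print(f"s={s:.2f}: frequency after 200 generations = "
              f"{variant_frequency(1e-6, s, 200):.6f}")
```

Under these assumptions a neutral import stays at its starting frequency, whereas an import with even a modest advantage can come to dominate within a few hundred generations, which is why rare events at a selected locus are observed far more often than equally rare events elsewhere.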
High diversity at a particular antigen locus is usually explained
by invoking a mechanism called negative frequency-dependent
selection [9]. Hosts who have been exposed to a particular variant
develop immune responses against this variant. Bacteria with
antigenically distinct variants escape this response, giving them an
advantage in colonizing that host. At the population level, this
selection should lead to the persistence of multiple variants. Yet,
despite this selection for rare variants within individual epidemics,
the antigenic diversity of subgroup III did not increase
progressively over time but was instead reset at the beginning of
each new epidemic, which was started by a strain with a family 4
allele.
The continuous generation of subgroup III strains with family 1
and 3 tbpB alleles is better explained by a mechanism called
source–sink dynamics [10]. The source consists of an environment
within which transmission of the bacterium is self-sustaining. Sinks
consist of environments that bacteria can colonize effectively
(perhaps by undergoing genetic modification) but from which
onward transmission is ineffective. Here, the sink environment
consists of individuals with acquired immunity to subgroup III
strains that carry family 4 alleles, while the source is the remainder
of the human population. The fact that the variant genotypes
capable of colonizing the sink do not spread geographically but
instead are repeatedly regenerated locally suggests that these
strains have reduced overall transmission fitness in naïve hosts,
which comprise the majority of individuals in populations where
an epidemic has not occurred recently.
Two other examples of sink environments are the lungs of
immunocompromised patients for Pseudomonas aeruginosa, and the
human urinary tract for Escherichia coli [10]; as for the N. meningitidis
example, specific genetic changes have been identified that adapt
strains of these bacteria to those environments but at the expense
of overall transmission fitness, with the result that infections
generally occur sporadically.
Example 2: Host Specificity in Campylobacter jejuni
Campylobacter jejuni is a gram-negative bacterium commonly
found in animal feces. It is often associated with poultry and
naturally colonises the gastrointestinal (GI) tract of many bird species. C. jejuni is one
of the most common causes of human gastroenteritis in the world.
Infection caused by Campylobacter species can be severely
debilitating but is rarely life-threatening. Human infection is
sporadic and, although poorly prepared food is often thought to be
implicated, it is generally difficult to track the source. There has
therefore been a substantial effort to isolate bacteria from a wide
variety of reservoirs and to genotype them using multilocus
sequence typing (MLST), which involves obtaining the DNA
sequence for each isolate at a standardized panel of genes (seven
for Campylobacter) that are chosen because they have an essential
function and are present in the vast majority of isolates in the
species [11].
Figure 1. Acquisition of new tbpB genes by subgroup III Neisseria meningitidis during epidemic spread. Colours indicate the family of
each tbpB allele, with red corresponding to family 4, green corresponding to family 1, and blue corresponding to family 3. The bars highlight the time
frame, most common tbpB type, and geographical extent of each epidemic (in 1987, pilgrims from the Hajj briefly distributed the lineage
worldwide). The circles correspond to variant genotypes. Small circles indicate that the variant allele was found in only one strain; large circles
indicate that it was found in between two and four strains.
doi:10.1371/journal.pgen.1000627.g001
The C. jejuni strains acquired by chickens are distinct from those
of the wild birds around them, even when the poultry are kept
outdoors [12]. Within farm animals, certain lineages are found
with very different frequencies in chickens and cattle, whereas
several genotypes are found at high frequency in both (strains with
the MLST type ST-21, for example) [13]. Strains from different
farm animals are more similar to each other than they are to
strains found, for example, in starlings (a native European bird
that is also common in many other countries, including the US)
[14].
The digestive system of chickens differs from that of cattle in
multiple aspects, and their body temperature is several degrees
higher than that of cattle. This raises the question of how some
lineages are able to compete successfully in both hosts.
Mechanisms facilitating rapid phenotypic adaptation include: (1)
inbuilt regulatory mechanisms that allow individual bacteria to
alter gene expression in response to new environments [15], (2)
‘‘contingency loci’’ that mutate rapidly, creating phenotypic
variation amongst bacteria that are otherwise genetically identical
[16], and (3) import of DNA from other strains that are already
adapted to the current environment.
A first step toward understanding the evolution of host
specificity is to establish whether it is possible to predict the host
origin of strains based on their genome sequence. One approach
to doing this uses phylogenetic relationships. For example, the
program AdaptML (http://almlab.mit.edu/ALME/Software/Software.html)
attempts to assign branches of the phylogenetic
tree to preferred habitats based on where the strains on that
branch were isolated [17]. For C. jejuni, habitat can, for example,
be equated to host species. The observation of a group of
phylogenetically related strains in a single host species might reflect
the common ancestor of those strains acquiring the traits required
to survive in that species.
Since C. jejuni recombines frequently, the genome composition
of each strain is determined by the sources from which it has
imported DNA, as well as by the strains to which it is phylogenetically
related. For example, ST-21, together with its variants, is a
lineage analogous to subgroup III of N. meningitidis. Like subgroup
III, the lineage has imported DNA from other strains on numerous
occasions during its spread, with the result that many isolates have
variant genotypes that differ from ST-21 at one or two of the seven
MLST fragments. By convention, these strains are grouped with
ST-21 into the ST-21 clonal complex.
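The grouping convention just described can be sketched in a few lines. The sketch assumes that each isolate is represented by a profile of seven allele numbers, one per MLST fragment, and that membership means differing from the founder profile at no more than two fragments; the function names and the allele numbers in the usage example are hypothetical placeholders, not real typing data.

```python
from typing import Sequence


def locus_differences(profile_a: Sequence[int], profile_b: Sequence[int]) -> int:
    """Count the MLST fragments at which two allelic profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def in_clonal_complex(profile: Sequence[int],
                      founder: Sequence[int],
                      max_differences: int = 2) -> bool:
    """Return True if the profile differs from the founder at no more than
    max_differences fragments (the threshold of two follows the text above)."""
    return locus_differences(profile, founder) <= max_differences


if __name__ == "__main__":
    founder = (1, 1, 1, 1, 1, 1, 1)      # placeholder founder profile
    variant = (1, 1, 9, 1, 1, 1, 1)      # one imported fragment
    unrelated = (4, 5, 6, 1, 7, 8, 1)    # differs at five fragments
    print(in_clonal_complex(variant, founder))    # True
    print(in_clonal_complex(unrelated, founder))  # False
```

Comparing allele numbers rather than nucleotide sequences treats an import that changes many bases in one fragment as a single event, which suits a frequently recombining species.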
ST-21 itself has been found at high frequency in several
agricultural species and elsewhere. Therefore, if a new strain is
found to be ST-21, this provides little information on where it
might have originated. However, for the variants of ST-21, Noel
McCarthy and colleagues obtained significantly better than
random assignment by predicting hosts based on the frequency
with which the variant allele was found in chicken or cattle [13]. A
useful signal of host-of-origin is thus provided by the DNA that
each isolate has acquired (Figure 2). Furthermore, the high rate of
recombination within particular hosts represents a mechanism by
which complex adaptations to a particular host species can be
acquired quickly subsequent to a host switch.
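The idea of predicting host from the frequency of a variant allele in each host can be illustrated with a deliberately simple sketch. It is not the method of McCarthy and colleagues [13]; it assumes only that a reference collection of alleles at one locus is available for each host and that the host in which the variant allele is relatively more frequent is the better guess. The function name, the pseudocount, and the allele numbers are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple


def predict_host(variant_allele: Hashable,
                 reference_alleles: Dict[str, List[Hashable]],
                 pseudocount: float = 1.0) -> Tuple[str, Dict[str, float]]:
    """Guess the host of origin of an imported allele.

    reference_alleles maps each host name to the alleles observed at the
    same locus in isolates from that host. Frequencies are smoothed with a
    pseudocount so that an allele never seen in a host still gets a small
    non-zero score. Ties go to the host listed first.
    """
    # Number of distinct alleles over all hosts, including the query allele.
    distinct = {a for alleles in reference_alleles.values() for a in alleles}
    distinct.add(variant_allele)
    scores = {}
    for host, alleles in reference_alleles.items():
        counts = Counter(alleles)
        scores[host] = ((counts[variant_allele] + pseudocount)
                        / (len(alleles) + pseudocount * len(distinct)))
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Hypothetical allele numbers at a single MLST fragment.
    reference = {"chicken": [7, 7, 7, 2, 3], "cattle": [4, 4, 2, 4, 7]}
    print(predict_host(7, reference))
    print(predict_host(4, reference))
```

A fuller treatment would combine evidence across all imported fragments of an isolate and would account for uneven sampling of the hosts.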
Figure 2. A schematic illustration of the evolution of the C.
jejuni ST-21 clonal complex in cattle and chickens. The common
ancestor of the complex occurred in chickens (red). During evolution,
the lineage occasionally switched to a cattle host (indicated by a blue
branch) and sometimes back to chicken. The bacteria acquired DNA by
homologous recombination from other C. jejuni in the same host. Since
recombination is assumed to occur from donors within the same host,
the gene pool is determined by the genomic composition of the strains
that colonize each host. The gene pools are illustrated for two separate
loci (right and left facing arrows) in chickens and cattle. The gene pools
contain alleles that occur at much higher frequency in
one host than the other (shown in colour) and others that do not (shown
in black). The former are informative about the host in which the
recombination event occurred, while the latter are not. The recombination event labelled a introduces the left facing black arrow gene from
the cattle gene pool and is phylogenetically informative because it
defines a lineage that is largely restricted to cattle. The five
recombination events labelled b are not phylogenetically informative,
since they only affect a single strain in the sample. These events are
nevertheless informative because they introduce alleles that are
characteristic of the host species. The event labelled c is both
phylogenetically informative and characteristic of host. The event
labelled d is noninformative.
doi:10.1371/journal.pgen.1000627.g002
The Power of Bacterial Genomics
Studies in bacteria have two major advantages over those in
humans or other mammals when it comes to relating phenotype to
genotype based on natural variation. The first is the magnifying
effect of natural selection in enormous bacterial populations. This
selection acts to rapidly increase the frequency of genotypes that
give small fitness advantages in a particular environment, even if
these genotypes are generated only rarely. Adaptation in bacteria
is likely to be more frequent and to leave more distinctive genetic
signatures than in species such as humans where signals of
adaptation to local environments have proved to be remarkably
subtle [18]. The second is the fact that evolution occurs in the
context of progressively changing clonal backgrounds. This
property can make it possible to identify strains that have
extremely similar genomes but nevertheless differ phenotypically
[19]. These strains represent the natural equivalent of an isogenic
line and can allow precise inferences about the effects of natural
variation and how different changes interact with each other.
In order to fully exploit the advantages of bacteria for detecting
phenotypic associations, it is necessary to develop a conceptual
and analytical framework within which rapid evolutionary change
can be interpreted. One such framework is source–sink dynamics
[10]. The Neisseria example illustrates the power of
microevolutionary analysis in a source–sink ecological context to identify first
the sink (hosts with immune responses to tbpB family 4 alleles) and
second the loci under an immediate selective pressure to change
within that sink (the tbpB gene).
Source–sink dynamics cannot be applied to investigate host
specificity within Campylobacter, because individual host species,
e.g., chicken, cattle, and individual species of wild birds, each
harbour large, viable populations of bacteria with high rates of
within-species transmission and do not represent sinks.
Nevertheless, there is a key similarity between the Neisseria and Campylobacter
examples, namely that the strains are repeatedly challenged by an
environment that is novel in the recent history of the strain. In the
Neisseria example, this challenge is repeatedly met by genetic
changes at particular antigenic loci, which consequently have
extremely atypical patterns of variation. In Campylobacter this
challenge is met in the context of a high rate of import of DNA
across the genome from other Campylobacter strains that already
colonize the new host.
The availability of full genome sequences promises to enhance
our understanding of the bacterial responses to new environments
in a number of ways. First, phylogenetic relationships will be better
resolved. In the Neisseria example, a well-resolved tree will
elucidate patterns of transmission within epidemics and, for
example, whether tbpB imports take place at the later stages of
each wave and if strains with such imports ever reacquire family 4
alleles and seed later epidemics. In the Campylobacter example this
will allow estimates of the number of occasions that the ST-21
lineage has jumped between host species and establish whether
there are sublineages that are becoming progressively more
adapted to single-host transmission.
Second, genomics will provide a complete catalogue of loci
whose pattern of descent is atypical of the genome as a whole and
therefore either associated with a particular phenotype or
putatively affected by selection. In the Neisseria example, an
elevated rate of change at particular loci and consistency in the
nature of those changes would provide signs of selection. In the
Campylobacter example, loci that are imported at very high
frequency and/or that are highly differentiated between host
species may be involved in adaptation to a new host. An isolate-by-isolate
analysis of the patterns of import should establish whether
the multi-host lifestyle of ST-21 and, by extension, of C. jejuni as a
whole is facilitated by import of DNA from locally adapted strains.
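One way to look for loci that are highly differentiated between host species is to compare allele frequencies locus by locus and rank the loci. The sketch below uses the total variation distance between the two frequency distributions, which is only one simple choice among many possible statistics and is an assumption of this example rather than a method proposed in the text. The function names, locus names, and allele numbers are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple


def allele_frequencies(alleles: List[Hashable]) -> Dict[Hashable, float]:
    """Relative frequency of each allele in a sample of isolates."""
    if not alleles:
        raise ValueError("need at least one isolate")
    counts = Counter(alleles)
    return {allele: n / len(alleles) for allele, n in counts.items()}


def differentiation(host_a: List[Hashable], host_b: List[Hashable]) -> float:
    """Total variation distance between allele frequencies in two hosts.

    0.0 means identical frequencies; 1.0 means no allele is shared.
    """
    freq_a = allele_frequencies(host_a)
    freq_b = allele_frequencies(host_b)
    alleles = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a.get(x, 0.0) - freq_b.get(x, 0.0)) for x in alleles)


def rank_loci(samples: Dict[str, Tuple[List[Hashable], List[Hashable]]]
              ) -> List[Tuple[str, float]]:
    """Sort loci from most to least differentiated between the two hosts."""
    scored = [(locus, differentiation(a, b)) for locus, (a, b) in samples.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical alleles per locus: (chicken isolates, cattle isolates).
    samples = {
        "locus_A": ([1, 1, 2, 2], [1, 1, 2, 2]),
        "locus_B": ([1, 1, 1, 2], [3, 3, 3, 3]),
    }
    for locus, score in rank_loci(samples):
        print(locus, score)
```

Loci near the top of such a ranking are candidates only; shared ancestry and uneven sampling can also produce differentiation, so the genome-wide distribution of the statistic is the appropriate point of comparison.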
Third, genomics will allow detection of epistasis between loci.
Epistasis occurs when the fitness effects of alleles at one gene are
modified by the genotype at one or more additional genes. In
outbreeding diploids, such as mammals, each allele has its fitness
tested on a new genetic background in every generation, with the
result that epistasis does not leave a distinctive signature in the
frequency of particular combinations of alleles unless the loci are
closely linked on the same chromosome or selection is very strong.
In bacteria, combinations of alleles remain together for many
generations wherever they occur in the genome, providing ample
opportunity for epistasis to bring particular combinations of alleles
to high frequency. For example, subgroup III strains that have
imported variant tbpB alleles can potentially enhance their fitness
by importing other genomic regions that, in other strains of
the Neisseria population, confer high fitness in combination with family
1 or family 3 alleles. These parts of the genome could be detected
by identifying parallel changes that have occurred on the 21
occasions that a variant tbpB allele was imported during the spread
of subgroup III strains. Fitness interactions establish functional
relationships between loci and represent a central part of the
evolutionary landscape, for example triggering the origin of species
[20]. Genome sequencing of bacteria should provide key insights
on the nature of these interactions in natural populations.
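The search for parallel changes accompanying tbpB imports can be framed as a simple counting problem. The sketch below assumes that the 21 import events are independent and that the background probability of a change at a second locus on any one lineage is known; it then asks how surprising it would be to see that locus change on at least k of the 21 occasions. The background probability and the count used in the example are hypothetical, and the binomial model is an assumption of this sketch.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Probability of at least k successes in n independent trials,
    each succeeding with probability p."""
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    n_imports = 21        # occasions on which a variant tbpB allele was imported
    background = 0.05     # hypothetical per-lineage chance of a change at locus X
    co_changes = 6        # hypothetical number of lineages where X also changed
    p_value = binomial_tail(co_changes, n_imports, background)
    print(f"P(at least {co_changes} co-changes by chance) = {p_value:.2e}")
```

A small tail probability would flag the second locus as a candidate for a fitness interaction with tbpB, although physical linkage to tbpB (imports of 2 kb to 10 kb often alter flanking genes) would need to be ruled out first.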
In C. jejuni and other zoonoses, genomic analyses will facilitate a
qualitative advance in our understanding of the epidemiology,
ecology, and molecular biology of host switches. These developments will allow accurate delineation of the sources of human
infection and an understanding of the factors promoting successful
and pathogenic colonization of humans. In N. meningitidis and
similar bacteria, we will gain a much better understanding of the
genetic differences between invasive and noninvasive strains and
the particular adaptive strategies that cause lineages to become
invasive. These advances will together allow the design of targeted
interventions that reduce the burden of human disease.
Challenges for the Future
Advances in sequencing technology mean that it is becoming
economically feasible to obtain complete or nearly complete
genome sequences for large samples of bacteria. To better exploit
this technology to understand bacterial phenotypes, the field
should emulate the research program of human genetics and (1)
develop statistical tools that use sequence variation to infer
mechanisms of evolution [21] and patterns of genetic relationship
[22]; (2) collect and sequence samples of isolates in which bacteria
that differ in phenotypes of interest are matched as far as possible
in time and space [23]; and (3) design statistical tools for detecting
phenotypic associations [24] and natural selection [25] by
identifying patterns of relationship at particular loci that are
atypical of the genome as a whole.
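Point (3) above can be illustrated with a minimal sketch. It assumes that a per-locus summary statistic (for example, a measure of differentiation between phenotypic groups) has already been computed for every locus. The locus names and values are invented for illustration, and the robust z-score used here is only one simple way to flag loci that are atypical of the genome as a whole; it is not the method of the cited studies.

```python
from statistics import median

def flag_atypical_loci(stat_by_locus, threshold=3.5):
    """Return loci whose statistic is an outlier relative to the
    genome-wide distribution, using a robust z-score (median and MAD)."""
    values = list(stat_by_locus.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that is not inflated
    # by the outliers we are trying to detect.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no measurable spread, so nothing can be called atypical
    # 0.6745 rescales the MAD so scores are comparable to ordinary z-scores.
    return sorted(
        locus for locus, v in stat_by_locus.items()
        if abs(0.6745 * (v - med) / mad) > threshold
    )

# Hypothetical per-locus values (not real data).
example = {
    "locus_01": 0.02, "locus_02": 0.03, "locus_03": 0.01, "locus_04": 0.04,
    "locus_05": 0.02, "locus_06": 0.45, "locus_07": 0.03,
}
print(flag_atypical_loci(example))
```

In practice the genome-wide background would come from thousands of loci, and the threshold would need to be calibrated against the expected false-positive rate.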
Acknowledgments
Mark Achtman, Jim Bull, Jana Haase, Riikka Haukkanen, and Daniel
Stoebel provided insightful discussions and comments on the manuscript.
References
1. Risch N, Merikangas K (1996) The future of genetic studies of complex human
diseases. Science 273: 1516–1517.
2. Weiss KM, Terwilliger JD (2000) How many diseases does it take to map a gene
with SNPs? Nat Genet 26: 151–157.
3. Hardy J, Singleton A (2009) Genomewide association studies and human
disease. N Engl J Med 360: 1759–1768.
4. Fabre M, Koeck JL, Le Fleche P, Simon F, Herve V, et al. (2004) High
genetic diversity revealed by variable-number tandem repeat genotyping and
analysis of hsp65 gene polymorphism in a large collection of "Mycobacterium
canettii" strains indicates that the M. tuberculosis complex is a recently emerged
clone of "M. canettii". J Clin Microbiol 42: 3248–3255.
5. Gressmann H, Linz B, Ghai R, Pleissner KP, Schlapbach R, et al. (2005) Gain
and loss of multiple genes during the evolution of Helicobacter pylori. PLoS Genet
1: e43. doi:10.1371/journal.pgen.0010043.
6. Suerbaum S, Maynard Smith J, Bapumia K, Morelli G, Smith NH, et al. (1998)
Free recombination within Helicobacter pylori. Proc Natl Acad Sci U S A 95:
12619–12624.
7. Zhu P, van der Ende A, Falush D, Brieske N, Morelli G, et al. (2001) Fit genotypes
and escape variants of subgroup III Neisseria meningitidis during three pandemics
of epidemic meningitis. Proc Natl Acad Sci U S A 98: 5234–5239.
8. Linz B, Schenker M, Zhu P, Achtman M (2000) Frequent interspecific genetic
exchange between commensal Neisseriae and Neisseria meningitidis. Mol Microbiol
36: 1049–1058.
9. Brisson D, Dykhuizen DE (2004) ospC diversity in Borrelia burgdorferi: Different
hosts are different niches. Genetics 168: 713–722.
10. Sokurenko EV, Gomulkiewicz R, Dykhuizen DE (2006) Source-sink dynamics of
virulence evolution. Nat Rev Microbiol 4: 548–555.
11. Maiden MCJ, Bygraves JA, Feil E, Morelli G, Russell JE, et al. (1998) Multilocus
sequence typing: A portable approach to the identification of clones within
populations of pathogenic microorganisms. Proc Natl Acad Sci U S A 95:
3140–3145.
12. Colles FM, Jones TA, McCarthy ND, Sheppard SK, Cody AJ, et al. (2008)
Campylobacter infection of broiler chickens in a free-range environment.
Environ Microbiol 10: 2042–2050.
13. McCarthy ND, Colles FM, Dingle KE, Bagnall MC, Manning G, et al. (2007)
Host-associated genetic import in Campylobacter jejuni. Emerg Infect Dis 13: 267–272.
14. Colles FM, McCarthy ND, Howe JC, Devereux CL, Gosler AG, et al. (2009)
Dynamics of Campylobacter colonization of a natural host, Sturnus vulgaris
(European starling). Environ Microbiol 11: 258–267.
15. Coulson RM, Ouzounis CA (2003) The phylogenetic diversity of eukaryotic
transcription. Nucleic Acids Res 31: 653–660.
16. Moxon R, Bayliss C, Hood D (2006) Bacterial contingency loci: The role of simple
sequence DNA repeats in bacterial adaptation. Annu Rev Genet 40: 307–333.
17. Hunt DE, David LA, Gevers D, Preheim SP, Alm EJ, et al. (2008) Resource
partitioning and sympatric differentiation among closely related
bacterioplankton. Science 320: 1081–1085.
18. Coop G, Pickrell JK, Novembre J, Kudaravalli S, Li J, et al. (2009) The role of
geography in human adaptation. PLoS Genet 5: e1000500. doi:10.1371/
journal.pgen.1000500.
19. Beres SB, Richter EW, Nagiec MJ, Sumby P, Porcella SF, et al. (2006)
Molecular genetic anatomy of inter- and intraserotype variation in the human
bacterial pathogen group A Streptococcus. Proc Natl Acad Sci U S A 103:
7059–7064.
20. Coyne JA, Orr HA (2004) Speciation. Sunderland (MA): Sinauer Associates.
21. McVean GA, Myers SR, Hunt S, Deloukas P, Bentley DR, et al. (2004) The
fine-scale structure of recombination rate variation in the human genome.
Science 304: 581–584.
22. Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, et al. (2002)
Genetic structure of human populations. Science 298: 2381–2385.
23. The Wellcome Trust Case Control Consortium (2007) Genome-wide association
study of 14,000 cases of seven common diseases and 3,000 shared controls.
Nature 447: 661–678.
24. Marchini J, Howie B, Myers S, McVean G, Donnelly P (2007) A new multipoint
method for genome-wide association studies by imputation of genotypes. Nat
Genet 39: 906–913.
25. Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, et al. (2007)
Genome-wide detection and characterization of positive selection in human populations.
Nature 449: 913–918.
PLoS Genetics | www.plosgenetics.org
5
October 2009 | Volume 5 | Issue 10 | e1000627
Review
The Application of Genomics to Emerging Zoonotic Viral
Diseases
Bart L. Haagmans, Arno C. Andeweg, Albert D. M. E. Osterhaus*
Department of Virology, Erasmus Medical Center, Rotterdam, The Netherlands
Abstract: Interspecies transmission of pathogens may
result in the emergence of new infectious diseases in
humans as well as in domestic and wild animals.
Genomics tools such as high-throughput sequencing,
mRNA expression profiling, and microarray-based analysis
of single nucleotide polymorphisms are providing
unprecedented ways to analyze the diversity of the genomes
of emerging pathogens as well as the molecular basis of
the host response to them. By comparing and contrasting
the outcomes of an emerging infection with those of
closely related pathogens in different but related host
species, we can further delineate the various host
pathways determining the outcome of zoonotic transmission
and adaptation to the newly invaded species. The
ultimate challenge is to link pathogen and host genomics
data with biological outcomes of zoonotic transmission
and to translate the integrated data into novel intervention
strategies that eventually will allow the effective
control of newly emerging infectious diseases.
Emerging Zoonotic Viruses
Most of the well-known human viruses persist in the population
for a relatively long time, and coevolution of the virus and its
human host has resulted in an equilibrium characterized by
coexistence, often in the absence of a measurable disease burden.
When pathogens cross a species barrier, however, the infection
can be devastating, causing a high disease burden and mortality.
In recent years, several outbreaks of infectious diseases in humans
linked to such an initial zoonotic transmission (from animal to
human host) have highlighted this problem. Factors related to our
increasingly globalized society have contributed to the apparently
increased transmission of pathogens from animals to humans over
the past decades; these include changes in human factors such as
increased mobility, demographic changes, and exploitation of the
environment (for a review see Osterhaus [1] and Kuiken et al. [2]).
Environmental factors also play a direct role, and many examples
exist. The recently increased distribution of the arthropod
(mosquito) vector Aedes aegypti, for example, has led to massive
outbreaks of dengue fever in South America and Southeast Asia.
Intense pig farming in areas where frugivorous bats are common is
probably the direct cause of the introduction of Nipah virus into
pig populations in Malaysia, with subsequent transmission to
humans. Bats are an important reservoir for a plethora of zoonotic
pathogens: two closely related paramyxoviruses (Hendra virus
and Nipah virus) cause persistent infections in frugivorous bats
and have spread to horses and pigs, respectively [3].
The similarity between humans and nonhuman primates permits
many viruses to cross the species barrier between different primate
species. The introduction into humans of HIV-1 and HIV-2 (the
lentiviruses that cause AIDS), as well as other primate viruses, such
as monkeypox virus and Herpesvirus simiae, provides dramatic
examples of this type of transmission. Other viruses, such as
influenza A viruses and severe acute respiratory syndrome
coronavirus (SARS-CoV), may need multiple genetic changes to
adapt successfully to humans as a new host species; these changes
might include differential receptor usage, enhanced replication,
evasion of innate and adaptive host immune defenses, and/or
increased efficiency of transmission. Understanding the complex
interactions between the invading pathogen on the one hand and
the new host on the other as they progress toward a new host–
pathogen equilibrium is a major challenge that differs substantially
for each successful interspecies transmission and subsequent
spread of the virus.
Genomics of Zoonotic Viruses and Their Hosts
New molecular techniques such as high-throughput sequencing,
mRNA expression profiling, and array-based single nucleotide
polymorphism (SNP) analysis provide ways to rapidly identify
emerging pathogens (Nipah virus and SARS-CoV, for example)
and to analyze the diversity of their genomes as well as the host
responses against them. Essential to the process of identification
and characterization of genome sequences is the exploitation of
extensive databases that allow the alignment of viral genome
sequences and the linkage of these genomics data to those obtained
by classical viral culture and serological techniques, and
epidemiological, clinical, and pathological studies [4]. Extensive
genetic analysis of HIV-1, for example, has provided clues to the
geography and time scale of the early diversification of HIV-1
strains when the virus emerged in humans. HIV-1 strains are
divided into multiple clades, each of which has independently
evolved from a simian immunodeficiency virus (SIV) that naturally
infects chimpanzees in West and Central Africa. Current estimates
date the common ancestor of HIV-1 to the beginning of the
twentieth century [5].
PLoS Pathogens | www.plospathogens.org
Citation: Haagmans BL, Andeweg AC, Osterhaus ADME (2009) The Application of
Genomics to Emerging Zoonotic Viral Diseases. PLoS Pathog 5(10): e1000557.
doi:10.1371/journal.ppat.1000557
Editor: Marianne Manchester, The Scripps Research Institute, United States of
America
Published October 26, 2009
Copyright: © 2009 Haagmans et al. This is an open-access article distributed
under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.
Funding: Supported by the VIRGO consortium, an Innovative Cluster approved
by the Netherlands Genomics Initiative and partially funded by the Dutch
Government (BSIK 03012), The Netherlands and the US National Institutes of
Health, RO1 grant HL080621-O1A1. The funders had no role in study design, data
collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests
exist.
* E-mail: [email protected]
This article is part of the ‘‘Genomics of Emerging Infectious Disease’’ PLoS Journal
collection (http://ploscollections.org/emerginginfectiousdisease/).
1
October 2009 | Volume 5 | Issue 10 | e1000557
Because zoonotic pathogens may cause variable
clinical outcomes in human hosts that differ in age, nutritional
status, genetic background, and immunological condition,
deciphering the complex interactions between evolving pathogens and
their hosts is a great challenge. The genome sequences of many
host species have become available in the last decade, and with them
a range of novel tools to study virus–host interactions
at the molecular level. This progress, together with advances in
high-throughput sequencing technology and, not least, in
(bio)informatics and statistics, allows us to analyze the
"genome-wide" networks of gene interactions that control the host response
to pathogens. By comparing and contrasting the outcomes of
infection with closely related pathogens in different but related
host species, we can further delineate the various host pathways
involved in the different outcomes. The power of this approach
was nicely demonstrated for SIV infection of various primate host
species. Natural reservoir hosts of SIV do not develop AIDS upon
infection, whereas non-natural hosts, such as rhesus macaques and
pig-tailed macaques, when infected experimentally with SIV,
develop AIDS in a similar manner to HIV-infected humans.
Transcriptional profiling indicates that SIV infection of these
species produces a distinctive host response [6]. SIV-infected
primates with symptoms of AIDS have a high viral load, immune
activation, and loss of certain types of T cells, whereas
SIV-infected sooty mangabeys (the species from which HIV-2 is
thought to have originated) have substantially lower levels of
innate immune activation than the symptomatic primates, partly
due to the production of less interferon-α by plasmacytoid
dendritic cells in response to SIV and other Toll-like receptor
ligands [7]. Identification of host factors that restrict HIV infection
may aid the development of effective intervention strategies.
Below, we elaborate on two other examples of recent important
zoonotic events that led to sustained virus transmission in the
human host, and the role that genomics has played in the
elucidation of their pathogenesis thus far.
Influenza Virus
Influenza is caused by RNA viruses of the Orthomyxoviridae
family. Whereas fever and coughs are the most frequent
symptoms, in more serious cases a fatal pneumonia can develop,
particularly in the young and the elderly. Typically, influenza is
transmitted through the air by coughs or sneezes, creating aerosols
containing the virus; but influenza can also be transmitted by bird
droppings, saliva, feces, and blood. Birds and pigs play an
important role in the emergence of new influenza viruses in
humans. Fecal sampling of migratory birds has revealed that they
harbor a large range of different subtypes of influenza A viruses
[8]. Some wild duck species, particularly mallards, are potential
long-distance vectors of highly pathogenic avian influenza virus
(H5N1), whereas others, including diving ducks, are more likely to
act as "sentinel" species that die upon infection [9]. Following the
introduction of a new pandemic influenza A virus subtype from an
avian reservoir, either directly or via another mammalian species
such as the pig, the virus may continue to circulate in humans in
subsequent years as a seasonal influenza virus. In the past century,
three major influenza epidemics resulted in the loss of many
millions of lives. Spanish flu alone caused the deaths of more than
50 million people by the end of World War I in 1918. The 2009
outbreak of a new H1N1 virus (causing "swine flu") that started in
Mexico further illustrates the pandemic potential of influenza A
viruses.
After introduction of a new influenza A virus from an avian or
porcine reservoir into the human species, viral genomics studies
are essential to identify critical mutations that enable the
circulating virus to spread efficiently, interact with different
receptors, and cause disease in the new host. For example, the
importance of residue 627 of the PB2 protein of the viral
polymerase in determining species restriction has been
demonstrated through these kinds of approaches [10]. Furthermore,
changes in the hemagglutinin molecules may allow influenza A
viruses to switch receptor specificity. The hemagglutinin of avian
H5N1 influenza viruses preferentially binds to oligosaccharides
that terminate with a sialic acid–α-2,3-Gal disaccharide, whereas
the hemagglutinins of mammalian influenza A viruses prefer
oligosaccharides that terminate with sialic acid–α-2,6-Gal
(Figure 1). Fatal viral pneumonia in humans infected with avian
H5N1 viruses is partly due to the ability of these viruses to attach
to and replicate in the cells of the lower respiratory tract, which
have oligosaccharides that terminate in sialic acid–α-2,3-Gal
disaccharide [11,12]. The sequence of the hemagglutinin protein
may also affect its binding affinity for neutralizing antibodies.
Understanding the relationship between genetic diversity and
antigenic properties of these viruses [13] may help to predict the
emergence of influenza viruses and to develop effective vaccines.
Microarray-assisted mRNA expression profiling of emerging
zoonotic viral infections, including influenza A virus, is used to
phenotype the host response in great detail. By comparing mRNA
expression in individuals infected with an emerging virus to
expression in individuals infected with a related established virus,
researchers can generate a "molecular fingerprint" of the host
response genes or pathways specifically involved in the
often-exuberant host responses to the emerging virus. By using
genetically engineered influenza A viruses, a role for the
nonstructural NS1 viral protein in evasion of the innate host
response has been demonstrated [14]. Interestingly, the NS1
protein derived from the 1918 Spanish H1N1 pandemic influenza
virus blocked expression of interferon-regulated genes more
efficiently than did the NS1 protein from established seasonal
influenza viruses [14]. Other genomics studies of genetically
engineered influenza A viruses containing some or all of the gene
segments from either the 1918 H1N1 virus or the highly
pathogenic avian influenza A virus (H5N1) suggest that these
highly pathogenic influenza viruses induce severe disease in mice
and macaques through aberrant and persistent activation of
proinflammatory cytokine and chemokine responses [15–18].
Application of genomics tools not only supports the elucidation
of mechanisms underlying pathogenesis but may also help to
identify leads for therapeutic intervention. In ferrets, H5N1
infection induced severe disease that was associated with strong
expression of interferon response genes including the
interferon-γ-induced cytokine CXCL10. Treatment of H5N1-infected ferrets
with an antagonist of the CXCL10 receptor (CXCR3) reduced the
severity of the flu symptoms and the viral titers compared to the
controls [19], clearly demonstrating the potential of biological
response modifiers for the clinical management of viral infections.
The host evasion and evolution of influenza virus are further
discussed in [20].
SARS-CoV
Coronaviruses (CoVs) primarily infect the upper respiratory and
gastrointestinal tract of mammals and birds. Five different
currently known CoVs infect humans and are believed to cause
a significant percentage of all common colds in human adults.
Surprisingly, recent studies revealed that approximately 6% of bats
sampled in China were positive for CoVs [21]. Subsequent
phylogenetic studies revealed that bat CoVs that resembled
human SARS-CoV clustered in a putative group comprising one
subgroup of bat CoVs and another of SARS-CoVs from humans
and other mammalian hosts. According to the current hypothesis,
SARS-CoV has arisen by recombination between two bat viruses.
Phylogenetic analysis of SARS-CoV isolates from animals indicates
that the resulting bat virus was transmitted first to palm civets
(Paguma larvata), a wild cat-like animal hunted for its meat, and
subsequently to humans at live animal markets in southern China
[22].
Figure 1. Zoonotic transmission of influenza A virus. The hemagglutinin of avian influenza A viruses (blue) preferentially binds to
oligosaccharides that terminate in sialic acid–α-2,3-Gal (red), whereas the hemagglutinin on human influenza A viruses (green) prefers
oligosaccharides that terminate in sialic acid–α-2,6-Gal (orange). Fatal viral pneumonia in humans infected with the H5N1 subtype of avian
influenza A viruses is likely due to the ability of these viruses to attach to and replicate in the lower respiratory tract cells, which have
sialic acid–α-2,3-Gal-terminated saccharides. The horizontal arrows indicate interspecies transmission, including the transmission from an avian or porcine reservoir
into the human species. Image credit: Bart Haagmans, Erasmus MC. Original images (left to right, from top to bottom) by Roman Köhler, Alvesgaspar,
Anton Holmquist, Joshua Lutz, and CDC.
doi:10.1371/journal.ppat.1000557.g001
Genome analyses have provided evidence that genetic variation
in the spike gene of these viruses from civets is associated with
increased transmission of the virus [21]. In addition,
species-to-species variation in the sequence of the gene angiotensin-converting
enzyme 2 (ACE2), which encodes the SARS-CoV receptor, also
affects the efficiency by which the virus can enter cells [23]. By a
combination of phylogenetic and bioinformatics analyses, chimeric
gene design, and reverse genetics–aided generation of viruses that
encode spike proteins of diverse isolates, researchers have
reconstructed the events that led to the emergence of a virus able
to spread efficiently in humans [24]. Structural modeling predicted
that the SARS-CoV that caused the epidemic had an increased
affinity for both civet and human ACE2 receptors due to
adaptation (Figure 2). Subsequent functional genomics studies of
these viruses in diverse species provided further insight into the
role of specific host genes involved in the pathogenic response
[25,26]. The pathological changes observed in the lungs are
initiated by a disproportionate innate immune response, illustrated
by elevated levels of inflammatory cytokines and chemokines, such
as CXCL10 (IP-10), CCL2 (MCP-1), interleukin (IL)-6, IL-8,
IL-12, IL-1β, and interferon-γ [27]. These clinical data were
confirmed experimentally by demonstrating that SARS-CoV
infection of diverse cell types induces a range of cytokines and
chemokines, thus providing a conceptual framework for
SARS-CoV pathogenesis. Host genome expression analyses of various
animal hosts and humans with different outcomes of infection
indicated differential activation of innate immune genes in, for
example, aged subjects compared to young subjects. Importantly,
treatment of aged macaques with pegylated interferon-α (i.e.,
interferon-α covalently modified with polyethylene glycol polymer
chains, to enhance its bioavailability) reduced SARS-CoV
replication and pathogenic responses [28]. Thus, host genomics
analysis may provide markers of pathogenesis and leads for
therapeutic intervention, as in this example of SARS-CoV
infection.
Figure 2. Zoonotic transmission of SARS-CoV. Genomic analyses provided evidence that genetic changes in the spike gene of SARS-CoV from
bats (left) and civet cats (center) are essential for the animal-to-human transmission (horizontal arrows). Species-to-species genetic variation in the
(thus far unidentified) viral receptor in bats and in the angiotensin-converting enzyme 2 (ACE2) gene, encoding the SARS-CoV receptor in civet cats and
humans, also affects the efficiency with which the virus can enter cells (vertical arrows). The SARS-CoV that caused the epidemic evolved a high
affinity for both civet (center) and human (right) ACE2 receptors (indicated by the single diagonal and the right side vertical arrow). Image credit: Bart
Haagmans, Erasmus MC. Original images (left to right) by Dodoni, Paul Hilton, and Hoang Dinh Nam.
doi:10.1371/journal.ppat.1000557.g002
Challenges for the Future
Rapid identification of newly emerging viruses through the use
of genomics tools is one of the major challenges for the near future.
In addition, the identification of critical mutations that enable
viruses to spread efficiently, interact with different receptors, and
cause disease in diverse hosts through, for instance, enhanced viral
replication or circumvention of the innate and adaptive immune
responses, needs to be further expanded. Although
microarray-assisted transcriptional profiling can provide us with a wealth of
information regarding host genes and gene-interacting networks in
virus–host interactions, future research should focus on combining
data obtained in different experimental settings. Therefore, the
careful design of complementary sets of experiments using
different formats of virus–host interactions is absolutely needed
for successful genomics studies [29]. Special attention should be
paid to the comparative analysis of the host response in
diverse animal species. Thus far a limited number of laboratory
animal species has been studied, but the recent elucidation of the
genome of several other animal species will provide tools to
decipher the virus–host interactions in the more relevant natural
host. Recent developments in the sequencing of the RNA
transcriptome may aid this development. Ultimately, microarray
technology may also extend to genotyping of the human host by
SNP analysis, to identify markers of host susceptibility and severity
of disease, that can be used in tailor-made clinical management of
disease caused by emerging infections. Comparative analysis of
host responses to emerging viruses may also point toward a similar
dysregulated host response to a range of emerging virus infections,
enabling the rational design of multipotent biological response
modifiers to combat a variety of emerging viral infections. By
focusing on broad-acting intervention strategies rather than on the
discovery of a newly emerging pathogen that is not characterized
yet, we may be able to protect ourselves from several unexpectedly
emerging infections with the same clinical manifestations. This
approach may readily reduce the burden of disease and gain
time to design preventive, pathogen-specific intervention
strategies such as antiviral therapy or vaccination. Clearly, for all
stages of combating emerging infections, from the early identification of the pathogen to the development and design of vaccines,
application of sophisticated genomics tools is fundamental to
success.
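The comparative expression profiling discussed in this review (hosts infected with an emerging virus versus hosts infected with a related established virus) can be sketched in a few lines. This is an illustrative simplification under stated assumptions: the gene names and expression values are hypothetical, the function name `fingerprint` is our own, and a bare log2 fold-change cutoff stands in for the replicate-aware statistics and multiple-testing correction that a real microarray analysis would require.

```python
from math import log2

def fingerprint(emerging, established, min_abs_log2fc=1.0, pseudocount=1.0):
    """Return {gene: log2 fold change} for genes whose mean expression
    differs between the two infection groups by at least the cutoff."""
    result = {}
    # Only genes measured in both groups can be compared.
    for gene in emerging.keys() & established.keys():
        mean_emerging = sum(emerging[gene]) / len(emerging[gene])
        mean_established = sum(established[gene]) / len(established[gene])
        # The pseudocount avoids division by zero for unexpressed genes.
        lfc = log2((mean_emerging + pseudocount) / (mean_established + pseudocount))
        if abs(lfc) >= min_abs_log2fc:
            result[gene] = lfc
    return result

# Hypothetical expression values per gene, one value per infected host.
emerging_virus = {"gene_A": [790, 810], "gene_B": [50, 50], "gene_C": [10, 14]}
established_virus = {"gene_A": [99, 101], "gene_B": [50, 50], "gene_C": [99, 101]}

for gene, lfc in sorted(fingerprint(emerging_virus, established_virus).items()):
    print(gene, "up" if lfc > 0 else "down", round(lfc, 2))
```

Genes that pass the cutoff in the emerging-virus group but not in the established-virus group form the candidate fingerprint; combining such lists across experimental settings is the integration step the text calls for.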
References
1. Osterhaus A (2001) Catastrophes after crossing species barriers. Philos Trans R Soc
Lond B Biol Sci 356: 791–793.
2. Kuiken T, Leighton FA, Fouchier RA, LeDuc JW, Peiris JS, et al. (2005) Public
health. Pathogen surveillance in animals. Science 309: 1680–1681.
3. Field HE, Mackenzie JS, Daszak P (2007) Henipaviruses: Emerging paramyxoviruses associated with fruit bats. Curr Top Microbiol Immunol 315: 133–159.
4. Rivers TM (1937) Viruses and Koch’s postulates. J Bacteriol 33: 1–12.
5. Worobey M, Gemmel M, Teuwen DE, Haselkorn T, Kunstman K, et al. (2008)
Direct evidence of extensive diversity of HIV-1 in Kinshasa by 1960. Nature
455: 661–664.
6. Lederer S, Favre D, Walters KA, Proll S, Kanwar B, et al. (2009)
Transcriptional profiling in pathogenic and non-pathogenic SIV infections
reveals significant distinctions in kinetics and tissue compartmentalization. PLoS
Pathog 5: e1000296. doi:10.1371/journal.ppat.1000296.
7. Mandl JN, Barry AP, Vanderford TH, Kozyr N, Chavan R, et al. (2008)
Divergent TLR7 and TLR9 signaling and type I interferon production
distinguish pathogenic and nonpathogenic AIDS virus infections. Nat Med 14:
1077–1087.
8. Munster VJ, Baas C, Lexmond P, Waldenström J, Wallensten A, et al. (2007)
Spatial, temporal, and species variation in prevalence of influenza A viruses in
wild migratory birds. PLoS Pathog 3: e61. doi:10.1371/journal.ppat.0030061.
9. Keawcharoen J, van Riel D, van Amerongen G, Bestebroer T, Beyer WE, et al.
(2008) Wild ducks as long-distance vectors of highly pathogenic avian influenza
virus (H5N1). Emerg Infect Dis 14: 600–607.
10. Hatta M, Gao P, Halfmann P, Kawaoka Y (2001) Molecular basis for high
virulence of Hong Kong H5N1 influenza A viruses. Science 293: 1840–1842.
11. van Riel D, Munster VJ, de Wit E, Rimmelzwaan GF, Fouchier RA, et al. (2006)
H5N1 virus attachment to lower respiratory tract. Science 312: 399.
12. Yamada S, Suzuki Y, Suzuki T, Le MQ, Nidom CA, et al. (2006)
Haemagglutinin mutations responsible for the binding of H5N1 influenza A
viruses to human-type receptors. Nature 444: 378–382.
13. Smith DJ, Lapedes AS, de Jong JC, Bestebroer TM, Rimmelzwaan GF, et al.
(2004) Mapping the antigenic and genetic evolution of influenza virus. Science
305: 371–376.
14. Geiss GK, Salvatore M, Tumpey TM, Carter VS, Wang X, et al. (2002) Cellular
transcriptional profiling in influenza A virus-infected lung epithelial cells: The
role of the nonstructural NS1 protein in the evasion of the host innate defense
and its potential contribution to pandemic influenza. Proc Natl Acad Sci U S A
99: 10736–10741.
15. Kobasa D, Jones SM, Shinya K, Kash JC, Copps J, et al. (2007) Aberrant innate
immune response in lethal infection of macaques with the 1918 influenza virus.
Nature 445: 319–323.
16. Baskin CR, Bielefeldt-Ohmann H, Tumpey TM, Sabourin PJ, Long JP, et al.
(2009) Early and sustained innate immune response defines pathology and death
in nonhuman primates infected by highly pathogenic influenza virus. Proc Natl
Acad Sci U S A 106: 3455–3460.
17. Kash JC, Tumpey TM, Proll SC, Carter V, Perwitasari O, et al. (2006) Genomic
analysis of increased host immune and cell death responses induced by 1918
influenza virus. Nature 443: 578–581.
18. Kash JC, Basler CF, García-Sastre A, Carter V, Billharz R, et al. (2004) Global
host immune response: Pathogenesis and transcriptional profiling of type A
influenza viruses expressing the hemagglutinin and neuraminidase genes from
the 1918 pandemic virus. J Virol 78: 9499–9511.
19. Cameron CM, Cameron MJ, Bermejo-Martin JF, Ran L, Xu L, et al. (2008)
Gene expression analysis of host innate immune responses during lethal H5N1
infection in ferrets. J Virol 82: 11308–11317.
20. McHardy AC, Adams B (2009) The role of genomics in tracking the evolution
of influenza A virus. PLoS Pathog 5: e1000566. doi:10.1371/journal.
ppat.1000566.
21. Tang XC, Zhang JX, Zhang SY, Wang P, Fan XH, et al. (2006) Prevalence and
genetic diversity of coronaviruses in bats from China. J Virol 80: 7481–7490.
22. Song HD, Tu CC, Zhang GW, Wang SY, Zheng K, et al. (2005) Cross-host
evolution of severe acute respiratory syndrome coronavirus in palm civet and
human. Proc Natl Acad Sci U S A 102: 2430–2435.
23. Li W, Zhang C, Sui J, Kuhn JH, Moore MJ, et al. (2005) Receptor and viral
determinants of SARS-coronavirus adaptation to human ACE2. EMBO J 24:
1634–1643.
24. Sheahan T, Rockx B, Donaldson E, Sims A, Pickles R, et al. (2008) Mechanisms
of zoonotic severe acute respiratory syndrome coronavirus host range expansion
in human airway epithelium. J Virol 82: 2274–2285.
25. Rockx B, Baas T, Zornetzer GA, Haagmans B, Sheahan T, et al. (2009) Early
upregulation of acute respiratory distress syndrome-associated cytokines
promotes lethal disease in an aged-mouse model of severe acute respiratory
syndrome coronavirus infection. J Virol 83: 7062–7074.
26. de Lang A, Baas T, Teal T, Leijten LM, Rain B, et al. (2007) Functional
genomics highlights differential induction of antiviral pathways in the lungs of
SARS-CoV-infected macaques. PLoS Pathog 3: e112. doi:10.1371/journal.
ppat.0030112.
27. Baas T, Roberts A, Teal TH, Vogel L, Chen J, et al. (2008) Genomic analysis
reveals age-dependent innate immune responses to severe acute respiratory
syndrome coronavirus. J Virol 82: 9465–9476.
28. Haagmans BL, Kuiken T, Martina BE, Fouchier RA, Rimmelzwaan GF, et al.
(2004) Pegylated interferon-alpha protects type 1 pneumocytes against SARS
coronavirus infection in macaques. Nat Med 10: 290–293.
29. Andeweg AC, Haagmans BL, Osterhaus ADME (2008) Virogenomics: The
virus–host interaction revisited. Curr Opin Microbiol 11: 1–6.
Review
The Role of Genomics in Tracking the Evolution of
Influenza A Virus
Alice Carolyn McHardy1*, Ben Adams2
1 Computational Genomics and Epidemiology, Max Planck Institute for Informatics, Saarbruecken, Germany, 2 Department of Mathematical Sciences, University of Bath,
United Kingdom
fusing the virus membrane envelope with the host cell membrane,
thus delivering the viral genome into the cell (Figure 1). Segment 6
encodes another surface glycoprotein called neuraminidase (N),
which cleaves terminal sialic acid residues from glycoproteins and
glycolipids on the host cell surface, thus releasing budding viral
particles from an infected cell [10]. Influenza A viruses are further
classified into distinct subtypes based on the genetic and antigenic
characteristics of these two surface glycoproteins. Sixteen hemagglutinin (H1–16) and nine neuraminidase subtypes (N1–9) are
known to exist, and they occur in various combinations in influenza
viruses endemic in aquatic birds [10,11]. Viruses with the subtype
composition H1N1 and H3N2 have been circulating in the human
population for several decades. Of these two subtypes, H3N2
evolves more rapidly, and has until recently caused the majority of
infections [1,12,13]. In the spring of 2009, however, a new H1N1
virus originating from swine influenza A viruses, and only distantly
related to the H1N1 already circulating, gained hold in the human
population. The emergence of this virus has initiated the first
influenza pandemic of the twenty-first century [7,14,15].
Hemagglutinin is about five times more abundant than
neuraminidase in the viral membrane and is the major target of
the host immune response [16–18]. Following exposure to the
virus, whether by infection or vaccination, the host immune system
acquires the capacity to produce neutralizing antibodies against
the viral surface glycoproteins. These antibodies participate in
clearing an infection and may protect an individual from future
infections for many decades [19]. Five exposed regions on the
surface of hemagglutinin, called epitope sites, are predominantly
recognized by such antibodies [16,17]. However, the human
subtypes of influenza A continuously evolve and acquire genetic
mutations that result in amino acid changes in the epitopes. These
changes reduce the protective effect of antibodies raised against
previously circulating viral variants. This ‘‘antigenic drift’’
necessitates frequent modification and readministration of the
influenza vaccine to ensure efficient protection (Box 1).
Abstract: Influenza A virus causes annual epidemics and
occasional pandemics of short-term respiratory infections
associated with considerable morbidity and mortality. The
pandemics occur when new human-transmissible viruses
that have the major surface protein of influenza A viruses
from other host species are introduced into the human
population. Between such rare events, the evolution of
influenza is shaped by antigenic drift: the accumulation of
mutations that result in changes in exposed regions of the
viral surface proteins. Antigenic drift makes the virus less
susceptible to immediate neutralization by the immune
system in individuals who have had a previous influenza
infection or vaccination. A biannual reevaluation of the
vaccine composition is essential to maintain its effectiveness due to this immune escape. The study of influenza
genomes is key to this endeavor, increasing our understanding of antigenic drift and enhancing the accuracy of
vaccine strain selection. Recent large-scale genome
sequencing and antigenic typing has considerably improved our understanding of influenza evolution: epidemics around the globe are seeded from a reservoir in
East-Southeast Asia with year-round prevalence of influenza viruses; antigenically similar strains predominate in
epidemics worldwide for several years before being
replaced by a new antigenic cluster of strains. Future in-depth studies of the influenza reservoir, along with large-scale data mining of genomic resources and the
integration of epidemiological, genomic, and antigenic
data, should enhance our understanding of antigenic drift
and improve the detection and control of antigenically
novel emerging strains.
Influenza is a single-stranded, negative-sense RNA virus that
causes acute respiratory illness in humans. In temperate regions,
winter influenza epidemics result in 250,000–500,000 deaths per
year; in tropical regions, the burden is similar [1,2]. Influenza
viruses of three genera or types (A, B, and C) circulate in the
human population. Influenza viruses of the types B and C evolve
slowly and circulate at low levels. Type A evolves rapidly and can
evade neutralization by antibodies in individuals who have been
previously infected with, or vaccinated against, the virus. As a
result it regularly causes large epidemics. Furthermore, distinct
reservoirs of influenza A exist in other mammals and in birds. Four
times in the last hundred years these reservoirs have provided
genetic material for novel viruses that have caused global
pandemics [3–8].
The genome of influenza A viruses comprises eight RNA
segments of 0.9–2.3 kb that together span approximately 13.5 kb
and encode 11 proteins [9]. Segment 4 encodes the major surface
glycoprotein called hemagglutinin (H), which is responsible for
attaching the virus to sialic acid residues on the host cell surface and
PLoS Pathogens | www.plospathogens.org
Citation: McHardy AC, Adams B (2009) The Role of Genomics in Tracking the
Evolution of Influenza A Virus. PLoS Pathog 5(10): e1000566. doi:10.1371/
journal.ppat.1000566
Editor: Marianne Manchester, The Scripps Research Institute, United States of
America
Published October 26, 2009
Copyright: © 2009 McHardy et al. This is an open-access article distributed
under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.
Funding: The authors received no specific funding for this work.
Competing Interests: The authors have declared that no competing interests
exist.
* E-mail: [email protected]
This article is part of the ‘‘Genomics of Emerging Infectious Disease’’ PLoS Journal
collection (http://ploscollections.org/emerginginfectiousdisease/).
October 2009 | Volume 5 | Issue 10 | e1000566
Figure 1. Schematic representation of an influenza A virion. Three proteins, hemagglutinin (HA, a trimer of three identical subunits),
neuraminidase (NA, a tetramer of four identical subunits), and the M2 transmembrane proton channel (a tetramer of four identical subunits), are
anchored in the viral membrane, which is composed of a lipid bilayer. The large, external domains of hemagglutinin and neuraminidase are the major
targets for neutralizing antibodies of the host immune response. The M1 matrix protein is located below the membrane. The genome of the influenza
A virus is composed of eight individual RNA segments (conventionally ordered by decreasing length, bottom row), which each encode one or two
proteins. Inside the virion, the eight RNA segments are packaged in a complex with nucleoprotein (NP) and the viral polymerase complex, consisting
of the PA, PB1, and PB2 proteins.
doi:10.1371/journal.ppat.1000566.g001
To monitor for novel emerging strains, the World Health
Organization (WHO) maintains a global surveillance program. A
panel of experts meets twice a year to review antigenic, genetic, and
epidemiological data and decides on the vaccine composition for the
next winter season in the northern or southern hemisphere [20]. If
an emerging antigenic variant is detected and judged likely to
become predominant, an update of the vaccine strain is recommended. This ‘‘predict and produce’’ approach mostly results in
efficient vaccines that substantially limit the morbidity and mortality
of seasonal epidemics [21]. The recommendation has to be made
almost a year before the season in which the vaccine is used,
however, because of the time required to produce and distribute a
new vaccine. Problems arise when an emerging variant is not
identified early enough for an update of the vaccine composition
[22–24]. Thus, gaining a detailed understanding of the evolution
and epidemiology of the virus is of the utmost importance, as it may
lead to earlier identification of novel emerging variants [20].
The development of high-throughput sequencing has recently
provided large datasets of high-quality, complete genome
sequences for viral isolates collected in a relatively unbiased
manner, regardless of virulence or other unusual characteristics
[9,25]. Analyses of the genome sequence data combined with
large-scale antigenic typing [26,27] have given insights into the
pattern of global spread, the genetic diversity during seasonal
epidemics, and the dynamics of subtype evolution. Influenza data
repositories such as the NCBI Influenza Virus Resource (http://
www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html) [28] and the
Global Initiative on Sharing All Influenza Data (GISAID; http://
platform.gisaid.org/) database [29] make the genomic information
publicly available, together with epidemiological data for the
sequenced isolates. The GISAID model for data sharing requires
users to agree to collaborate with, and appropriately credit, all
data contributors. A notable success of this initiative has been the
contribution of countries, such as Indonesia and China, which
antigenic drift. Furthermore, the antigenic drift of H3N2 is not
continuous but punctuated: antigenically homogenous clusters of
strains predominate for an average of 3 years before being
replaced by a new cluster. In accordance with the punctuated
nature of antigenic drift, periods of predominantly neutral
evolution alternate with periods of strong selection for antigenic
change [13,36]. Phylogenetic trees illustrating the evolution of the
hemagglutinin gene of H3N2 have a cactus-like shape with a
strong temporal structure in which the trunk represents the
succession of surviving viral lineages over time. Short side
branches indicate that most strains are driven to extinction and
that viral diversity at any given time is limited [31,34]. The
underlying causes of this punctuated antigenic drift and limited
viral diversity at a given point in time have been investigated in
phylodynamic modeling studies (Box 2).
Major changes in antigenicity (antigenic shift) are associated
with the introduction of novel viruses into the human population
that have a hemagglutinin segment of an influenza A virus from
another host species and can be transmitted efficiently among
humans [5]. Such viruses may arise by segment reassortment
between a human influenza A virus and an influenza A virus from
another host species. Alternatively, an entire virus from another
host species may cross into the human population. The
appearance of such viruses is rare, as it requires the viral genes
encoded by the different segments to be compatible with each
other and the virus to be capable of replication and transmission in
the human population, which is also thought to be a polygenic trait
[6,7,10,37,38]. Antigenic shift can have grave consequences
because neutralizing antibodies against the viral surface proteins
offer limited or no cross-protection across subtypes. Cross-protection can also be very limited between viruses of the same
subtype that have evolved independently in different hosts for long
periods of time [14]. Thus, a larger part of the population is
susceptible to infection with such viruses than to infection with
endemic viruses [10,14]. Antigenic shift caused three global
pandemics in the twentieth century: the 1918 H1N1 pandemic,
the 1957 H2N2 pandemic, and the 1968 H3N2 pandemic (reviewed
in [3–5,8]). The 1918 pandemic had the most devastating impact,
with an estimated 20–50 million deaths worldwide [39]. There is
some uncertainty concerning the origin of the 1918 virus due to
the lack of data from this time [6,40–43]. A recent phylogenetic
study suggests that this virus may have been generated by
reassortment of avian viruses with already circulating viruses in a
mammalian host such as human or swine [44]. The H2N2 virus
that caused the 1957 pandemic was a reassortant of five human
H1N1 segments and avian segments encoding the viral surface
proteins and the PB1 protein. Similarly, the reassortant H3N2
virus of the 1968 pandemic featured avian segments encoding
hemagglutinin and PB1. H3N2 still circulates today, together with
an H1N1 lineage introduced in 1977, which is similar to the H1N1
viruses circulating in the 1950s [4].
The first pandemic virus of the twenty-first century probably
entered the human population in January or February of 2009
[15]. Phylogenetic analyses of the viral genome determined that
the virus has a complex reassortment history with segments of
‘‘avian-like’’ Eurasian swine influenza A viruses (NA and M) that
were first observed in Eurasian swine in 1979, and of a triple
reassortant virus identified in North American swine after 1998.
The segments derived from the triple reassortant stem themselves
from human H3N2 (PB1), an avian influenza A virus (PA, PB2),
and classical North American swine influenza A viruses (HA, NP,
NS), which have a common ancestry with the 1918 H1N1 virus
[14,45]. Experiments have shown that the new H1N1 virus
replicates efficiently in mammalian model organisms such as
Box 1. Broadly Protective Vaccines
Current influenza vaccines are based on detergent-inactivated viruses. They elicit antibodies with a narrow
range of protection that target predominantly the variable
regions of the hemagglutinin protein. Accordingly, the
seasonal influenza vaccine includes one strain with
segments of the surface proteins for each of the A/H1N1,
A/H3N2 and B viruses, and it is updated every 1–3 years to
match the predominant variants of influenza. Research
into vaccines that offer broader protection across diverse
subtypes and antigenic drift variants is ongoing [21,59–61].
This research is particularly important with respect to the
emergence of novel viruses with pandemic potential, such
as the 2009 H1N1 virus. In such an event, the time period
between the detection of the virus and the onset of a
pandemic is too short to produce a specific vaccine for
immediate vaccination of the population. Work in this area
is focused on developing vaccines that elicit antibodies
against conserved viral components, such as certain
regions of hemagglutinin, neuraminidase, and the M2
proton channel in the viral membrane [60]. Other types of
vaccines based on live attenuated viruses or plasmid DNA
expression vectors, or supplemented with adjuvants, show
promise in inducing a more broadly protective immune
response [61].
have previously been reticent about placing data in the public
domain. The WHO also supports the endeavor of rapid
publication of all available sequences for influenza viruses and
there is hope that comprehensive submission to public databases
will soon become a reality [24,30]. In the future, mining these
resources and establishing a statistical framework based on
epidemiological, antigenic, and genetic information could provide
further insights into the rules that govern the emergence and
establishment of antigenically novel variants and improve the
potential for influenza prevention and control.
Host Immune Evasion by Antigenic Drift and Shift
Influenza viruses can rapidly acquire genetic diversity because
of high replication rates in infected hosts, an error-prone RNA
polymerase (which introduces mutations during genome replication), and segment reassortment (Figure 2). Mutations that change
amino acid residues appear significantly more often than silent
mutations in the evolution of the hemagglutinin gene of human
influenza A, particularly in the protein epitopes [31–34]. This
observation indicates that selection for antigenic change of the
virus is the driving force in the evolutionary ‘‘arms race’’ between
the virus and the immunity of the human population [35].
Reassortment of the eight genome segments between two distinct
viruses present simultaneously in a host cell can result in hybrid
viruses with genome segments from two different progenitors.
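The contrast drawn above between mutations that change amino acid residues and silent mutations can be made concrete with a small sketch. The Python below compares two aligned, in-frame coding sequences codon by codon and counts how many differing codons are silent and how many alter the encoded amino acid. The function name and the example sequences are illustrative assumptions, not taken from the studies cited. It is a raw count only: a full selection analysis would also normalize by the number of possible synonymous and nonsynonymous sites and would work along a phylogeny.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G ordering.
# Sequences are assumed to be cDNA-style (T rather than U), as in sequence databases.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_changes(ancestor: str, descendant: str) -> dict:
    """Count differing codons that are silent vs. amino-acid-changing.

    Hypothetical helper for illustration; both inputs must be aligned,
    in frame, gap-free, and of equal length.
    """
    if len(ancestor) != len(descendant) or len(ancestor) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and equal length")
    counts = {"synonymous": 0, "nonsynonymous": 0}
    for i in range(0, len(ancestor), 3):
        a = ancestor[i:i + 3].upper()
        d = descendant[i:i + 3].upper()
        if a == d:
            continue  # identical codon, nothing to classify
        if CODON_TABLE[a] == CODON_TABLE[d]:
            counts["synonymous"] += 1
        else:
            counts["nonsynonymous"] += 1
    return counts


# Made-up 9-nucleotide example: AAA>AAG (Lys, silent), GGT>GAT (Gly>Asp), CTG>CTA (Leu, silent)
print(classify_codon_changes("AAAGGTCTG", "AAGGATCTA"))
```

An excess of nonsynonymous over synonymous changes, once properly normalized, is the kind of signal the cited studies interpret as selection for antigenic change.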
Antigenic mapping allows researchers to generate a quantitative, two-dimensional representation of antigenic distances between genetically divergent strains [26]. This technique has
revealed that the relationship between antigenic change and
genetic change is nonlinear for the hemagglutinin of influenza A/
H3N2. The rate of genetic change of the virus is almost constant
over time, but some mutations exert a disproportionately large
effect on the antigenic type, whereas others are ‘‘hitchhikers’’ with
no phenotypic effect. Elucidating the effects of different mutations
at individual sites on the antigenic type will improve our
understanding of the overall genotype-to-phenotype mapping for
Figure 2. Generation of genetic diversity and antigenic drift in the evolution of human influenza A viruses. Blue and yellow viruses
depict two antigenically similar strains of the same subtype circulating in the human population. The genetic diversity of the circulating viral
population increases through mutation and reassortment. Single white arrows indicate relationships between ancestral and descendant viruses.
White marks on the segments indicate neutral mutations and red marks indicate mutations that affect the antigenic regions of the surface proteins.
Incoming pairs of orange arrows indicate the generation of reassortants with segments from two different ancestral viruses. As these viruses continue
to circulate, immunity against them builds up in the host population, represented here by the narrowing of the bottleneck. In parallel, viruses with
mutations affecting the antigenic regions of the surface proteins accumulate in the viral population. At some point a novel antigenic drift variant,
indicated by a red colored virus, which is less affected by immunity in the human population, is generated. This variant is able to cause widespread
infection and founds a new cluster of antigenically similar strains.
doi:10.1371/journal.ppat.1000566.g002
selective sweeps caused by a novel antigenic drift variant rising to
predominance reduce the genomic diversity of the circulating viral
population, either genome-wide or for the hemagglutinin segment
only [12]. Reassortment results in substantial differences in the
evolutionary histories of individual segments. However, similarities
in the histories of some segments indicate that besides the antigenic
characteristics of hemagglutinin, the genomic context and compatibility of certain segment combinations might be an important
contributor to viral fitness [12,51]. A case in point is the
antigenically novel ‘‘Fujian’’ strain which became predominant in
the 2003–2004 season, following a reassortment event that placed a
hemagglutinin segment from a lineage that had been circulating at
low levels for several years into a new genomic context [49]. The
importance of other segments in the adaptive evolution of the virus
is further supported by the observation that a number of other
segments, including the one encoding neuraminidase, evolve at
similar rates to the segment encoding hemagglutinin [12].
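To give a sense of how much genome-wide diversity reassortment can generate, the following sketch enumerates every segment constellation that two co-infecting viruses could in principle produce. The parent labels "A" and "B" and the function name are hypothetical. The sketch treats all combinations as equally possible, which is a simplifying assumption: as noted above, the compatibility of segment combinations appears to matter for viral fitness, so many constellations may never become established.

```python
from itertools import product

# The eight genome segments, conventionally ordered by decreasing length.
SEGMENTS = ("PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS")


def reassortants(parent_a: str, parent_b: str):
    """Yield every assignment of the eight segments to one of two parents."""
    for origins in product((parent_a, parent_b), repeat=len(SEGMENTS)):
        yield dict(zip(SEGMENTS, origins))


all_genotypes = list(reassortants("A", "B"))
# A hybrid carries segments from both parents, so the two pure parental
# constellations are excluded.
hybrids = [g for g in all_genotypes if len(set(g.values())) == 2]

print(len(all_genotypes), len(hybrids))
```

Two parents give 2^8 = 256 constellations, of which 254 are hybrids.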
ferrets, mice, and cynomolgus macaques and is likely to be capable
of long-term circulation in the human population, particularly in
the event of further adaptive changes through mutation or
reassortment [46–48]. The novel H1N1 appears, so far, to cause
relatively mild human infections in comparison to other viruses
such as the highly pathogenic H5N1 avian influenza A viruses
that, since 1997, have repeatedly been transmitted to humans and
caused severe disease but so far have not been capable of sustained
transmission between humans. The emergence of a novel
pandemic virus, which may have been circulating undetected in
swine for a decade [14,45], has highlighted the need for increased
genomic surveillance of the viral populations in mammalian hosts
such as swine. These hosts could be a vessel for mammalian
adaptation of avian viruses, either by reassortment with human or
swine viruses or through adaptive changes [8], but have been
monitored less intensely than avian populations. The latest
emergence of a pandemic H1N1 virus has also underscored the
vital importance of further research into the molecular factors that
determine the host range and capacity for sustained human-to-human transmission of influenza A viruses.
Geographic Spread
Genomic analysis has led to profound insights into the global
patterns of circulation and evolution of influenza A. Over the
course of seasonal epidemics in temperate regions, little evidence
has been found for selection for amino acid change and adaptive
evolution in the antigenic regions of the surface proteins [36].
There is, however, substantial genetic diversity due to multiple
introductions of distinct strains, wide spatial spread, and frequent
Reassortment in Subtype Evolution
Whole-genome studies have revealed that segment reassortment
between different viruses of the same subtype is an important
mechanism in the evolution of human-adapted subtypes and
generates extensive genome-wide diversity [34,36,49–51]. Periodic
months before they emerge in Oceania, Europe, and North
America and 12–18 months before they reach South America.
Box 2. Modeling Antigenic Evolution
There is a long history of the use of mathematical models
to study epidemiological and evolutionary systems [63].
For rapidly evolving RNA viruses such as influenza the
dynamics of these systems are densely interwoven, and
recent work has sought to develop unified ‘‘phylodynamic’’ models to examine the processes underlying the
observed epidemiological and evolutionary patterns (reviewed in [35]). A better understanding of the mechanisms
driving viral evolution will enhance our capacity to
accurately identify novel emerging strains. For influenza,
phylodynamic models have been developed to probe the
complex processes relating to viral persistence in the
human population, antigenic turnover, and the limited
genetic diversity at any given point in time. The first
models predicted that diversity increases exponentially
unless long-term, partial cross-immunity between strains is
supplemented by temporary broad immunity that lasts for
several months and protects against all infections,
regardless of the genetic or antigenic similarity of strains
[64,65]. Subsequently, it has been proposed that a
genotype-to-phenotype mapping defined by neutral
networks underlies influenza evolution [66]. A neutral
network is a set of genotypes linked by single mutations
that all map to the same phenotype, in this case the
antigenic characteristics of a virus. Hence, genetic divergence is not accompanied by antigenic divergence as long
as the genotype remains in the same network. In certain
genetic contexts, however, mutations can move a
genotype onto an adjacent network, resulting in a
significant change in the antigenic phenotype. Incorporating this evolutionary framework into an epidemiological
model leads to both epidemiological and evolutionary
patterns characteristic of human influenza A/H3N2.
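The neutral network idea can be illustrated with a toy example. In the sketch below, genotypes are short binary strings and the genotype-to-phenotype map is a made-up rule: the antigenic type changes only when sites 0 and 1 are both mutated. Everything here (the alphabet, the rule, the function names) is a hypothetical stand-in and not the model of [66]. It only shows how a set of genotypes linked by single mutations can share one phenotype, and how the same mutation can be neutral in one genetic context and phenotype-changing in another.

```python
from collections import deque

ALPHABET = "01"  # toy two-letter alphabet standing in for real residues


def phenotype(genotype: str) -> str:
    """Hypothetical map: the antigenic type is 'new' only if sites 0 and 1 are both '1'."""
    return "new" if genotype[0] == "1" and genotype[1] == "1" else "old"


def neighbors(genotype: str):
    """Yield every genotype exactly one point mutation away."""
    for i, current in enumerate(genotype):
        for letter in ALPHABET:
            if letter != current:
                yield genotype[:i] + letter + genotype[i + 1:]


def neutral_network(start: str) -> set:
    """Breadth-first search over single mutations that leave the phenotype unchanged."""
    target = phenotype(start)
    seen = {start}
    queue = deque([start])
    while queue:
        genotype = queue.popleft()
        for candidate in neighbors(genotype):
            if candidate not in seen and phenotype(candidate) == target:
                seen.add(candidate)
                queue.append(candidate)
    return seen


def escape_mutations(genotype: str) -> list:
    """Single mutations that move the genotype onto a different network."""
    return [n for n in neighbors(genotype) if phenotype(n) != phenotype(genotype)]


print(len(neutral_network("0000")))  # size of the 'old' network for 4-site genotypes
print(escape_mutations("0000"))      # no single mutation changes the phenotype here
print(escape_mutations("0100"))      # the mutation at site 0 now changes the phenotype
```

In this toy map the genotype 0000 can drift across its network without any antigenic change, whereas 0100 sits one mutation away from the adjacent network, mirroring the context dependence described in the box.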
Challenges for the Future
A key objective for research into the antigenic drift of influenza A is
to improve the accuracy of vaccine strain choice, in particular for
seasons preceding the establishment of novel antigenic drift variants.
More intensive surveillance and sampling, particularly in East-Southeast Asia, could facilitate the early detection of novel emerging
drift variants and alleviate problems related to the time required for
vaccine production. A better understanding of the evolutionary and
epidemiological rules governing antigenic drift, viral fitness, the role
of the source region, and establishment of predominance would be
particularly helpful for the selection of vaccine strains when
considerable variation among antigenically novel strains is observed
and it is unclear which, if any, will become predominant. Such
insights are likely to come both from phylodynamic modeling studies
and by mining genomic resources for genome-wide properties
associated with viral fitness and predominance. Some molecular
properties of hemagglutinin with predictive value for this task have
already been identified [53–56], such as the number of changes at
sites under positive selection or in the most extensively altered
epitope, although the sites under selection might change over time
[26]. It is notable that the lack of antigenic information for sequenced
viral isolates in public repositories currently restricts the direct analysis
of genetic determinants in antigenic drift [24]. If the World Health
Organization were to establish similar policies for the deposition of
antigenic information into public databases as exist for sequence data,
this could create a valuable resource for research in this area. As
existing databases grow, new statistical and computational techniques
are being developed for interpretation of these large-scale, population-level genomic datasets in combination with epidemiological and
phenotypic information [57]. Ultimately, the expert analysis of the
WHO in the detection and control of antigenically novel emerging
strains could be extensively supported by the development of a
suitable predictive framework based on statistical learning that takes
into consideration the population-level phylodynamics of antigenic
change [57,58]. Such a framework could utilize epidemiological,
genomic, and antigenic information and detailed knowledge of the
genetic and epidemiological characteristics of antigenic drift to assess
the likelihood of strains rising to predominance.
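One of the molecular properties mentioned above, the number of changes in the most extensively altered epitope, is simple to compute once epitope sites are defined. The sketch below is a minimal illustration under loud assumptions: the epitope names and residue positions are placeholders, not the real hemagglutinin epitope sites, and the two protein fragments are invented. A practical tool would take its site definitions from the structural literature and operate on aligned full-length sequences.

```python
# Hypothetical epitope definitions: 1-based residue positions per epitope.
# Placeholder values for illustration only, NOT the real hemagglutinin epitope sites.
EXAMPLE_EPITOPES = {
    "A": {2, 3},
    "B": {5, 6, 7},
    "C": {9},
}


def epitope_changes(reference: str, candidate: str, epitopes: dict) -> dict:
    """Count amino acid differences between two aligned proteins within each epitope."""
    if len(reference) != len(candidate):
        raise ValueError("protein sequences must be aligned and of equal length")
    return {
        name: sum(1 for pos in positions if reference[pos - 1] != candidate[pos - 1])
        for name, positions in epitopes.items()
    }


def most_altered_epitope(counts: dict) -> tuple:
    """Return (epitope, changes) for the epitope with the most changes.

    Ties are resolved alphabetically by epitope name.
    """
    name = max(sorted(counts), key=counts.get)
    return name, counts[name]


# Invented 10-residue fragments differing at positions 2, 5, and 7.
counts = epitope_changes("MKTIIALSYI", "MRTIVAFSYI", EXAMPLE_EPITOPES)
print(counts)
print(most_altered_epitope(counts))
```

A count like this could serve as one feature among many in the kind of statistical framework discussed above. On its own it says nothing about whether a strain will rise to predominance.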
segment reassortment in seasonal epidemics [9,12,36,49,50]. The
viral population circulating in one season does not directly seed the
epidemic in the following one. Instead, gene flow and viral spread
are global, with similar strains appearing in northern and southern
hemisphere epidemics across several seasons. There is a global
reservoir of viral diversity from which seasonal epidemics in
temperate regions are seeded [12,27,52]. This reservoir is located
in East-Southeast Asia, where a region-wide network of temporally
overlapping epidemics maintains infection incidence throughout
the year [27]. Novel strains appear in this region on average 6–9
Acknowledgments
We thank Linus Roune for his help creating the figures.
References
1. WHO (2003) Fact sheet number 211. Available: http://www.who.int/
mediacentre/factsheets/fs211/en/. Accessed 13 August 2009.
2. Viboud C, Alonso WJ, Simonsen L (2006) Influenza in tropical regions. PLoS
Med 3: e89. doi:10.1371/journal.pmed.0030089.
3. Palese P (2004) Influenza: old and new threats. Nat Med 10: S82–87.
4. Kilbourne ED (2006) Influenza pandemics of the 20th century. Emerg Infect Dis
12: 9–14.
5. Cox NJ, Subbarao K (2000) Global epidemiology of influenza: past and present.
Annu Rev Med 51: 407–421.
6. Morens DM, Taubenberger JK, Fauci AS (2009) The persistent legacy of the
1918 influenza virus. N Engl J Med 361: 225–229.
7. Neumann G, Noda T, Kawaoka Y (2009) Emergence and pandemic potential of
swine-origin H1N1 influenza virus. Nature 459: 931–939.
8. Horimoto T, Kawaoka Y (2005) Influenza: Lessons from past pandemics,
warnings from current incidents. Nat Rev Microbiol 3: 591–600.
9. Ghedin E, Sengamalay NA, Shumway M, Zaborsky J, Feldblyum T, et al. (2005)
Large-scale sequencing of human influenza reveals the dynamic nature of viral
genome evolution. Nature 437: 1162–1166.
10. Webster RG, Bean WJ, Gorman OT, Chambers TM, Kawaoka Y (1992)
Evolution and ecology of influenza A viruses. Microbiol Rev 56: 152–179.
11. Fouchier RA, Munster V, Wallensten A, Bestebroer TM, Herfst S, et al. (2005)
Characterization of a novel influenza A virus hemagglutinin subtype (H16)
obtained from black-headed gulls. J Virol 79: 2814–2822.
12. Rambaut A, Pybus OG, Nelson MI, Viboud C, Taubenberger JK, et al. (2008)
The genomic and epidemiological dynamics of human influenza A virus. Nature
453: 615–619.
13. Wolf YI, Viboud C, Holmes EC, Koonin EV, Lipman DJ (2006) Long intervals
of stasis punctuated by bursts of positive selection in the seasonal evolution of
influenza A virus. Biol Direct 1: 34.
14. Garten RJ, Davis CT, Russell CA, Shu B, Lindstrom S, et al. (2009) Antigenic
and genetic characteristics of swine-origin 2009 A(H1N1) influenza viruses
circulating in humans. Science 325: 197–201.
15. Fraser C, Donnelly CA, Cauchemez S, Hanage WP, Van Kerkhove MD, et al.
(2009) Pandemic potential of a strain of influenza A (H1N1): Early findings.
Science 324: 1557–1561.
16. Wiley DC, Wilson IA, Skehel JJ (1981) Structural identification of the antibody-binding sites of Hong Kong influenza haemagglutinin and their involvement in
antigenic variation. Nature 289: 373–378.
17. Wilson IA, Cox NJ (1990) Structural basis of immune recognition of influenza
virus hemagglutinin. Annu Rev Immunol 8: 737–771.
18. Wilson IA, Skehel JJ, Wiley DC (1981) Structure of the haemagglutinin
membrane glycoprotein of influenza virus at 3 Å resolution. Nature 289:
366–373.
19. Yu X, Tsibane T, McGraw PA, House FS, Keefer CJ, et al. (2008) Neutralizing
antibodies derived from the B cells of 1918 influenza pandemic survivors. Nature
455: 532–536.
20. Russell CA, Jones TC, Barr IG, Cox NJ, Garten RJ, et al. (2008) Influenza
vaccine strain selection and recent studies on the global migration of seasonal
influenza viruses. Vaccine 26(Suppl 4): D31–34.
21. Karlsson Hedestam GB, Fouchier RA, Phogat S, Burton DR, Sodroski J, et al.
(2008) The challenges of eliciting neutralizing antibodies to HIV-1 and to
influenza virus. Nat Rev Microbiol 6: 143–155.
22. de Jong JC, Beyer WE, Palache AM, Rimmelzwaan GF, Osterhaus AD (2000)
Mismatch between the 1997/1998 influenza vaccine and the major epidemic
A(H3N2) virus strain as the cause of an inadequate vaccine-induced antibody
response to this strain in the elderly. J Med Virol 61: 94–99.
23. CDC (2004) Preliminary assessment of the effectiveness of the 2003–04
inactivated influenza vaccine—Colorado, December 2003. MMWR Morb
Mortal Wkly Rep 53: 8–11.
24. Salzberg S (2008) The contents of the syringe. Nature 454: 160–161.
25. Obenauer JC, Denson J, Mehta PK, Su X, Mukatira S, et al. (2006) Large-scale
sequence analysis of avian influenza isolates. Science 311: 1576–1580.
26. Smith DJ, Lapedes AS, de Jong JC, Bestebroer TM, Rimmelzwaan GF, et al.
(2004) Mapping the antigenic and genetic evolution of influenza virus. Science
305: 371–376.
27. Russell CA, Jones TC, Barr IG, Cox NJ, Garten RJ, et al. (2008) The global
circulation of seasonal influenza A (H3N2) viruses. Science 320: 340–346.
28. Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Zaslavsky L, et al. (2008) The
influenza virus resource at the National Center for Biotechnology Information.
J Virol 82: 596–601.
29. Enserink M (2007) Data sharing. New Swiss influenza database to test promises
of access. Science 315: 923.
30. Bogner P, Capua I, Lipman DJ, Cox NJ, et al. (2006) A global initiative on
sharing avian flu data. Nature 442: 981.
31. Fitch WM, Leiter JM, Li XQ, Palese P (1991) Positive Darwinian evolution in
human influenza A viruses. Proc Natl Acad Sci U S A 88: 4270–4274.
32. Fitch WM, Bush RM, Bender CA, Cox NJ (1997) Long term trends in the
evolution of H(3) HA1 human influenza type A. Proc Natl Acad Sci U S A 94:
7712–7718.
33. Bush RM, Fitch WM, Bender CA, Cox NJ (1999) Positive selection on the H3
hemagglutinin gene of human influenza virus A. Mol Biol Evol 16: 1457–1465.
34. Nelson MI, Holmes EC (2007) The evolution of epidemic influenza. Nat Rev
Genet 8: 196–205.
35. Grenfell BT, Pybus OG, Gog JR, Wood JL, Daly JM, et al. (2004) Unifying the
epidemiological and evolutionary dynamics of pathogens. Science 303: 327–332.
36. Nelson MI, Simonsen L, Viboud C, Miller MA, Taylor J, et al. (2006) Stochastic
processes are key determinants of short-term evolution in influenza A virus.
PLoS Pathog 2: e125. doi:10.1371/journal.ppat.0020125.
37. Lowen AC, Palese P (2007) Influenza virus transmission: Basic science and
implications for the use of antiviral drugs during a pandemic. Infect Disord Drug
Targets 7: 318–328.
38. Kuiken T, Holmes EC, McCauley J, Rimmelzwaan GF, Williams CS, et al.
(2006) Host species barriers to influenza virus infections. Science 312: 394–397.
39. Johnson NP, Mueller J (2002) Updating the accounts: Global mortality of the
1918–1920 ‘‘Spanish’’ influenza pandemic. Bull Hist Med 76: 105–115.
40. Taubenberger JK, Reid AH, Lourens RM, Wang R, Jin G, et al. (2005)
Characterization of the 1918 influenza virus polymerase genes. Nature 437:
889–893.
41. Reid AH, Taubenberger JK, Fanning TG (2004) Evidence of an absence: The
genetic origins of the 1918 pandemic influenza virus. Nat Rev Microbiol 2:
909–914.
42. Antonovics J, Hood ME, Baker CH (2006) Molecular virology: Was the 1918 flu
avian in origin? Nature 440: E9; discussion E9–10.
43. Taubenberger JK (2006) The origin and virulence of the 1918 ‘‘Spanish’’
influenza virus. Proc Am Philos Soc 150: 86–112.
44. Smith GJ, Bahl J, Vijaykrishna D, Zhang J, Poon LL, et al. (2009) Dating the
emergence of pandemic influenza viruses. Proc Natl Acad Sci U S A 106:
11709–11712.
45. Smith GJ, Vijaykrishna D, Bahl J, Lycett SJ, Worobey M, et al. (2009) Origins
and evolutionary genomics of the 2009 swine-origin H1N1 influenza A
epidemic. Nature 459: 1122–1125.
46. Maines TR, Jayaraman A, Belser JA, Wadford DA, Pappas C, et al. (2009)
Transmission and pathogenesis of swine-origin 2009 A(H1N1) influenza viruses
in ferrets and mice. Science 325: 484–487.
47. Munster VJ, de Wit E, van den Brand JM, Herfst S, Schrauwen EJ, et al. (2009)
Pathogenesis and transmission of swine-origin 2009 A(H1N1) influenza virus in
ferrets. Science 325: 481–483.
48. Itoh Y, Shinya K, Kiso M, Watanabe T, Sakoda Y, et al. (2009) In vitro and in
vivo characterization of new swine-origin H1N1 influenza viruses. Nature;E-pub
ahead of print. doi:10.1038/nature08260.
49. Holmes EC, Ghedin E, Miller N, Taylor J, Bao Y, et al. (2005) Whole-genome
analysis of human influenza A virus reveals multiple persistent lineages and
reassortment among recent H3N2 viruses. PLoS Biol 3: e300. doi:10.1371/
journal.pbio.0030300.
50. Nelson MI, Edelman L, Spiro DJ, Boyne AR, Bera J, et al. (2008) Molecular
epidemiology of A/H3N2 and A/H1N1 influenza virus during a single epidemic
season in the United States. PLoS Pathog 4: e1000133. doi:10.1371/journal.ppat.1000133.
51. Nelson MI, Viboud C, Simonsen L, Bennett RT, Griesemer SB, et al. (2008)
Multiple reassortment events in the evolutionary history of H1N1 influenza A
virus since 1918. PLoS Pathog 4: e1000012. doi:10.1371/journal.ppat.1000012.
52. Nelson MI, Simonsen L, Viboud C, Miller MA, Holmes EC (2007) Phylogenetic
analysis reveals the global migration of seasonal influenza A viruses. PLoS
Pathog 3: e131. doi:10.1371/journal.ppat.0030131.
53. Fitch WM, Bush RM, Bender CA, Subbarao K, Cox NJ (2000) The Wilhelmine
E. Key 1999 Invitational lecture. Predicting the evolution of human influenza A.
J Hered 91: 183–185.
54. Gupta V, Earl DJ, Deem MW (2006) Quantifying influenza vaccine efficacy and
antigenic distance. Vaccine 24: 3881–3888.
55. Blackburne BP, Hay AJ, Goldstein RA (2008) Changing selective pressure
during antigenic changes in human influenza H3. PLoS Pathog 4: e1000058.
doi:10.1371/journal.ppat.1000058.
56. Kryazhimskiy S, Bazykin GA, Plotkin J, Dushoff J (2008) Directionality in the
evolution of influenza A haemagglutinin. Proc Biol Sci 275: 2455–2464.
57. Pybus OG, Rambaut A (2009) Modelling: Evolutionary analysis of the dynamics
of viral infectious disease. Nat Rev Genet 10: 540–550.
58. Bishop CM (2006) Pattern recognition and machine learning. In: Jordan M,
Kleinberg J, Schoellkopf B, eds. Singapore: Springer.
59. Sui J, Hwang WC, Perez S, Wei G, Aird D, et al. (2009) Structural and
functional bases for broad-spectrum neutralization of avian and human
influenza A viruses. Nat Struct Mol Biol 16: 265–273.
60. Gerhard W, Mozdzanowska K, Zharikova D (2006) Prospects for universal
influenza virus vaccine. Emerg Infect Dis 12: 569–574.
61. Carrat F, Flahault A (2007) Influenza vaccine: The challenge of antigenic drift.
Vaccine 25: 6852–6862.
62. Fisher RA (1999) The genetical theory of natural selection. Oxford (UK):
Oxford University Press. pp 318.
63. Ross R (1910) The prevention of malaria. New York: E.P. Dutton. pp 669.
64. Ferguson NM, Galvani AP, Bush RM (2003) Ecological and immunological
determinants of influenza evolution. Nature 422: 428–433.
65. Tria F, Lässig M, Peliti L, Franz S (2005) A minimal stochastic model for
influenza evolution. J Stat Mech;doi:10.1088/1742-5468/2005/07/P07008.
66. Koelle K, Cobey S, Grenfell B, Pascual M (2006) Epochal evolution shapes the
phylodynamics of interpandemic influenza A (H3N2) in humans. Science 314:
1898–1903.
Review
The Past and Future of Tuberculosis Research
Iñaki Comas, Sebastien Gagneux*
Division of Mycobacterial Research, MRC National Institute for Medical Research, London, United Kingdom
Abstract: Renewed efforts in tuberculosis (TB) research
have led to important new insights into the biology and
epidemiology of this devastating disease. Yet, in the face
of the modern epidemics of HIV/AIDS, diabetes, and
multidrug resistance (all of which contribute to susceptibility
to TB), global control of the disease will remain a
formidable challenge for years to come. New high-throughput
genomics technologies are already contributing to studies of
TB’s epidemiology, comparative genomics, evolution, and
host–pathogen interaction. We argue here, however, that new
multidisciplinary approaches, especially the integration of
epidemiology with systems biology in what we call “systems
epidemiology”, will be required to eliminate TB.
Citation: Comas I, Gagneux S (2009) The Past and Future of Tuberculosis
Research. PLoS Pathog 5(10): e1000600. doi:10.1371/journal.ppat.1000600
Editor: Marianne Manchester, The Scripps Research Institute, United States of
America
Published October 26, 2009
Copyright: © 2009 Comas, Gagneux. This is an open-access article distributed
under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.
Funding: Work in our laboratory is supported by the Medical Research Council,
UK, and the US National Institutes of Health grants HHSN266200700022C and
AI034238. The funders had no role in study design, data collection and analysis,
decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests
exist.
* E-mail: [email protected]
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal
collection (http://ploscollections.org/emerginginfectiousdisease/).
PLoS Pathogens | www.plospathogens.org
October 2009 | Volume 5 | Issue 10 | e1000600
Introduction
Tuberculosis (TB) remains an important public health problem
[1]. With close to 10 million new cases per year, and a pool of two
billion latently infected individuals, control efforts are struggling in
many parts of the world (Figure 1). Nevertheless, the renewed
interest in research and improved funding for TB give reasons for
optimism. Recently, the Stop TB Partnership, a network of
concerned governments, organizations, and donors led by the
WHO (http://www.stoptb.org/stop_tb_initiative/), outlined a
global plan to halve TB prevalence and mortality by 2015 and
eliminate the disease as a public health problem by 2050 [2].
Attaining these goals will depend on both strong government
commitment and increased interdisciplinary research and
development. As existing diagnostics, drugs, and vaccines will be
insufficient to achieve these objectives, a substantial effort in both
basic science and epidemiology will be necessary to develop better
tools and strategies to control TB [3]. Here we review the recent
history of TB research and some of the latest insights into the
evolutionary history of the disease. We then discuss ways in which
we could benefit from a more comprehensive systems approach to
control TB in the future.
Figure 1. The global incidence of TB. The number of new TB cases per 100,000 population for the year 2007 according to WHO estimates
(adapted from [1]).
doi:10.1371/journal.ppat.1000600.g001
Recent History of the Field
TB is caused by several species of gram-positive bacteria known
as tubercle bacilli or Mycobacterium tuberculosis complex (MTBC).
MTBC includes obligate human pathogens such as Mycobacterium
tuberculosis and Mycobacterium africanum, as well as organisms adapted
to various other species of mammal. In the developed world, TB
incidence declined steadily during the second half of the 20th
century and so funds available for research and control of TB
decreased substantially during that time [4]. When TB started to
reemerge in the early 1990s, fuelled by the growing pandemic of
HIV/AIDS (Box 1), scientists and public health officials were
caught off-guard; billions of dollars of emergency funds were
necessary to control TB outbreaks [5]. Moreover, long-term
neglect of basic TB research and product development meant that
global TB control relied on a 100-year-old diagnostic method (i.e.,
sputum smear microscopy) of poor sensitivity, an 80-year-old and
largely ineffective vaccine (Bacille Calmette-Guérin [BCG]), and
just a few drugs that were decades old (streptomycin, rifampicin,
isoniazid, ethambutol, pyrazinamide) [3]. Tragically, these are the
tools still in use today in most parts of the world where TB remains
one of the most important public health problems (Figure 1).
In addition to the lack of appropriate tools to control TB
globally, much about the disease was unknown in the early 1990s
and many dogmas were guiding the field at the time. These
included the view that differences in the clinical manifestation of
TB were primarily driven by host variables and the environment
as opposed to bacterial factors, a notion reinforced by early DNA
sequencing studies that reported very limited genetic diversity in
MTBC compared with other bacterial pathogens [6]. According
to other dogmas, TB was mainly a consequence of reactivation of
latent infections rather than ongoing disease transmission, and
mixed infections and exogenous reinfections with different strains
were very unlikely.
The development of molecular techniques to differentiate
between strains of MTBC made it possible to readdress some of
these points. One of these methods, a DNA fingerprinting protocol
based on the Mycobacterium insertion sequence IS6110, quickly
evolved into the first international gold standard for genotyping of
MTBC [7]. It also became a key component of pragmatic public
health efforts, such as detecting disease outbreaks and ongoing TB
transmission [8], and allowed differentiation between patients who
relapsed due to treatment failure and those reinfected with a
different strain [9]. This latter finding demonstrated for the first
time that previous exposure to MTBC does not protect against
subsequent exogenous reinfection and TB disease, which is a
phenomenon with implications for vaccine design. Many other
new insights were gained through these molecular epidemiological
studies [10], which, for the most part, were performed in wealthy
countries; corresponding data from most high-burden areas
remained limited because of poor infrastructure and lack of
funding.
Routine genotyping of MTBC for public health purposes also
revived discussions about the role of pathogen variation in
outcome of infection and disease. Some strains of MTBC
appeared over-represented in particular patient populations,
which suggested that strain diversity may have epidemiological
implications. The completion of the first whole genome sequence
of M. tuberculosis in 1998 [11] and the development of DNA
microarrays offered a new opportunity to address this question by
interrogating the entire genome of multiple clinical strains of
MTBC. These comparative genomics studies revealed that
genomic deletions, also known as large sequence polymorphisms
(LSPs), are an important source of genome plasticity in MTBC
[12]. Furthermore, statistical analyses of patient data suggested
possible associations between strain genomic content and disease
severity in humans [13]. Clinical phenotypes in TB are difficult to
standardize, however, and whether MTBC genotype plays a
meaningful role in TB severity remains controversial [14].
Comparative genomics of MTBC also yielded interesting insights
into the evolution and geographic distribution of the organism.
Because MTBC has essentially no detectable horizontal gene transfer
[15,16], LSPs can be used as phylogenetic markers to trace the
evolutionary relationships of different strain families. Following such
an approach, studies have shown that humans did not, as previously
believed, acquire MTBC from animals during the initiation of animal
domestication; rather, the human- and animal-adapted members of
MTBC share a common ancestor, which might have infected humans
even before the Neolithic transition [17,18]. LSPs also allowed
researchers to define several discrete strain lineages within the
human-adapted members of MTBC, which are associated with different
human populations and geographical regions (Figures 2 and 3)
[15,19,20]. Because of the lack of horizontal gene exchange in
MTBC, phylogenetic trees derived using various molecular markers
define the same phylogenetic groupings [21], and several studies based
on single nucleotide polymorphisms (SNPs) and other molecular
markers have gathered additional support for the highly
phylogeographical population structure of MTBC [22–25].
Box 1. The Influence of Modern Epidemics on TB Incidence
HIV/AIDS and diabetes are important comorbidities that
dramatically increase the susceptibility to TB. The synergy
between TB and HIV/AIDS is a particular problem in
sub-Saharan Africa, while the impact of diabetes on TB is
increasing in many rapidly growing world economies; it
may already be a more important risk factor for TB than
HIV/AIDS in places like India and Mexico. The emergence
of multidrug-resistant strains represents an additional
threat to global TB control. The strong association
between HIV/AIDS and drug-resistant TB has been well
established, but whether similar interactions exist between
drug-resistant TB and diabetes needs to be explored
further.
Figure 2. Global distribution of the six main lineages of human MTBC. Each dot represents the most frequent lineage(s) circulating in a
country. Colours correspond to the lineages defined in Figure 3 (adapted from [20]).
doi:10.1371/journal.ppat.1000600.g002
Ancient History of the Pathogen
Although LSPs have proven very useful for defining different
lineages within MTBC, these markers do not reflect actual genetic
distances, and the mode of molecular evolution in MTBC cannot
be easily inferred from them [21]. By contrast, DNA sequence-based
methods can provide important clues about the evolutionary
forces shaping bacterial populations. Multilocus sequence typing
(MLST), in which fragments of seven structural genes are
sequenced for each strain [26], has been used very successfully to
define the genetic population structure of many bacterial species
[27]. Because of the low degree of sequence polymorphisms in
MTBC, however, standard MLST is uninformative [28]. A recent
study of MTBC extended the traditional MLST scheme by
sequencing 89 complete genes in 108 strains, covering 1.5% of the
genome of each strain [29]. Phylogenetic analysis of this extended
multilocus sequence dataset resulted in a tree that was highly
congruent with that generated previously using LSPs (Figure 3).
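The typing idea behind MLST can be illustrated with a minimal sketch. It is not the scheme of [26] or [29]: the locus names, sequences, and function names below are hypothetical, and three toy loci stand in for the seven gene fragments. Each distinct sequence at a locus receives an allele number, and the tuple of allele numbers across loci forms a strain's profile. When strains carry identical sequences at every typed fragment, as is often the case in a low-diversity organism, all profiles collapse into one type and the scheme cannot separate them.

```python
from typing import Dict, List, Tuple

LOCI: List[str] = ["locus1", "locus2", "locus3"]  # hypothetical loci


def allele_profiles(
    strains: Dict[str, Dict[str, str]], loci: List[str]
) -> Dict[str, Tuple[int, ...]]:
    """Number alleles per locus in order of first appearance and
    return each strain's allelic profile as a tuple of allele numbers."""
    allele_ids: Dict[str, Dict[str, int]] = {locus: {} for locus in loci}
    profiles: Dict[str, Tuple[int, ...]] = {}
    for name, seqs in strains.items():
        profile = []
        for locus in loci:
            known = allele_ids[locus]
            seq = seqs[locus]
            if seq not in known:
                known[seq] = len(known) + 1  # new allele at this locus
            profile.append(known[seq])
        profiles[name] = tuple(profile)
    return profiles


def count_types(profiles: Dict[str, Tuple[int, ...]]) -> int:
    """Number of distinct allelic profiles (sequence types)."""
    return len(set(profiles.values()))


# Toy data: a diverse species versus a highly clonal one (both invented).
diverse = {
    "A": {"locus1": "ACGT", "locus2": "GGCA", "locus3": "TTAG"},
    "B": {"locus1": "ACGA", "locus2": "GGCA", "locus3": "TTAG"},
    "C": {"locus1": "ACGT", "locus2": "GGTA", "locus3": "TCAG"},
}
clonal = {
    name: {"locus1": "ACGT", "locus2": "GGCA", "locus3": "TTAG"}
    for name in ("X", "Y", "Z")
}

diverse_types = count_types(allele_profiles(diverse, LOCI))
clonal_types = count_types(allele_profiles(clonal, LOCI))
```

In this toy setting the diverse strains resolve into separate types while the clonal strains do not, which is the motivation for sequencing many more genes per strain.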
The new sequence-based data also revealed that the MTBC
strains that are adapted to various animal species represent just a
subset of the global genetic diversity of MTBC that affects different
human populations [29]. Furthermore, by comparing the
geographical distribution of various human MTBC strains with
their position on the phylogenetic tree, it became evident that
MTBC most likely originated in Africa and that human MTBC
originally spread out of Africa together with ancient human
migrations along land routes. This view is further supported by the
fact that the so-called “smooth tubercle bacilli,” which are the
closest relatives of the human MTBC, are highly restricted to East
Africa [30]. The multilocus sequence data reported by Hershberg
et al. [29] further suggested a scenario in which the three
“modern” lineages of MTBC (purple, blue, and red in Figure 3)
seeded Eurasia, which experienced dramatic human population
expansion in more recent times. These three lineages then spread
globally out of Europe, India, and China, respectively, accompanying waves of colonization, trade and conquest. In contrast to the
ancient human migrations, however, this more recent dispersal of
human MTBC occurred primarily along water routes [29].
The availability of comprehensive DNA sequence data has also
allowed researchers to address questions about the molecular
evolution of MTBC. In-depth population genetic analyses by
Hershberg et al. highlight the fact that purifying selection against
slightly deleterious mutations in this organism is strongly reduced
compared to other bacteria [29]. As a consequence, nonsynonymous SNPs tend to accumulate in MTBC, leading to a high ratio
of nonsynonymous to synonymous mutations (also known as dN/
dS). The authors hypothesized that the high dN/dS in MTBC
compared to most other bacteria might indicate increased random
genetic drift associated with serial population bottlenecks during
past human migrations and patient-to-patient transmission. If
confirmed, this would indicate that “chance,” not just natural
selection, has been driving the evolution of MTBC. Although these
kinds of fundamental evolutionary questions are often underappreciated by clinicians and biomedical researchers, studying the
evolution of a pathogen ultimately allows for better epidemiological predictions by contributing to our understanding of basic
biology, particularly with respect to antibiotic resistance.
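As a rough illustration of the dN/dS ratio discussed above, the following sketch uses a simplified counting approach in the style of Nei and Gojobori with a Jukes-Cantor correction. This is an assumption made for illustration; the analyses in [29] may have relied on a different estimator. The function names and all counts are hypothetical.

```python
import math


def jukes_cantor(p: float) -> float:
    """Correct a proportion of observed differences for multiple hits."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion of differences must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def dn_ds(nonsyn_diffs: float, nonsyn_sites: float,
          syn_diffs: float, syn_sites: float) -> float:
    """Ratio of nonsynonymous to synonymous substitutions per site."""
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    d_n = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    d_s = jukes_cantor(syn_diffs / syn_sites)
    if d_s == 0.0:
        raise ValueError("dS is zero; the ratio is undefined")
    return d_n / d_s


# Hypothetical counts: strong purifying selection removes many
# nonsynonymous changes, so dN/dS falls well below 1 ...
strong_purifying = dn_ds(nonsyn_diffs=15, nonsyn_sites=3000,
                         syn_diffs=10, syn_sites=1000)
# ... whereas relaxed purifying selection lets them accumulate,
# pushing the ratio towards 1.
relaxed_purifying = dn_ds(nonsyn_diffs=30, nonsyn_sites=3000,
                          syn_diffs=10, syn_sites=1000)
```

A ratio approaching 1 in this sketch corresponds to the situation described in the text, where slightly deleterious nonsynonymous mutations are removed less efficiently.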
A Vision for the Future
Thanks to recent increases in research funding for TB [4],
substantial progress has been made in our understanding of the
basic biology and epidemiology of the disease. Unfortunately, this
increased knowledge has not yet had any noticeable impact on the
current global trends of TB (Figure 1). While TB incidence
appears to have stabilized in many countries, the total number of
cases is still increasing as a function of global human population
growth [1]. Of particular concern are the ongoing epidemics of
multidrug-resistant TB [31], as well as the synergies between TB
and the ongoing epidemics of HIV/AIDS and other comorbidities
such as diabetes (Box 1).
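The distinction between a stable incidence rate and a rising number of cases is simple arithmetic, sketched below. The rate, population size, and growth figure are hypothetical round numbers chosen for illustration only; they are not WHO estimates.

```python
from typing import List


def annual_cases(rate_per_100k: float, population: float) -> float:
    """Absolute number of new cases implied by an incidence rate."""
    return rate_per_100k * population / 100_000


def project_cases(rate_per_100k: float, population: float,
                  growth: float, years: int) -> List[float]:
    """Cases per year when the rate stays constant but the population grows."""
    cases = []
    for _ in range(years):
        cases.append(annual_cases(rate_per_100k, population))
        population *= 1.0 + growth  # compound population growth
    return cases


# Hypothetical inputs: constant rate, 1.2% annual population growth.
projection = project_cases(rate_per_100k=140.0, population=6.7e9,
                           growth=0.012, years=5)
```

With an unchanged rate, each year's case count in `projection` exceeds the previous one by the population growth factor.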
Figure 3. The global phylogeny of Mycobacterium tuberculosis complex (MTBC). The phylogenetic relationships between various human- and
animal-adapted strains and species are largely consistent when defined by using either (A) large sequence polymorphisms (LSPs) or (B) single
nucleotide polymorphisms (SNPs) identified by sequencing 89 genes in 108 MTBC strains. Numbers inside the squares in (A) refer to specific
lineage-defining LSPs. Colors indicate congruent lineages (adapted from [20] and [29]).
doi:10.1371/journal.ppat.1000600.g003
As our understanding of TB improves, we would like to be able to
make better predictions about the future trajectory of the disease
and to develop new tools to control the disease better and ultimately
reverse global trends. For this to be feasible, TB epidemiology needs
to evolve into a more predictive, interdisciplinary endeavour, a
discipline we might refer to as “systems epidemiology” (Figure 4).
Systems biology is already a rapidly emerging field, in which cycles
of mathematical modelling and experiments using various large-scale
“-omics” datasets are integrated in an iterative manner [32].
Novel biological processes are being discovered through these
systems approaches, which might not have been possible using more
traditional methods [33–35].
Last year, Young et al. argued that systems biology approaches
will be necessary to elucidate some of the key aspects of host–
pathogen interactions in TB [36] and to develop new drugs,
vaccines, and biomarkers to evaluate new interventions [3]. For
example, according to another dogma in the TB field, latent TB
infections are caused by physiologically dormant bacilli and can
thus be differentiated from active disease where MTBC is actively
growing and dividing [37]. In reality, however, the phenomenon
of TB latency most likely reflects a whole spectrum of responses to
TB infection, involving phenotypically distinct bacterial subpopulations and spanning various degrees of bacterial burden and
associated host immune responses [38]. We agree with Young
et al. [36] that TB latency and similar biological complexities will
only be adequately addressed using systems approaches, and we
argue further that to comprehend the current TB epidemic as a
whole, and to better predict its future trajectory, a complementary
systems epidemiology approach will be necessary (Figure 4).
Figure 4. A systems epidemiology approach to TB research. The spread of TB is influenced by social and biological factors. On the one hand,
the new discipline of systems biology integrates approaches that address the host, the pathogen, and interactions between the two. On the other
hand, epidemiology addresses the burden of the disease and the social, economic, and ecological causes of its frequency and distribution. There is
little crosstalk between these two disciplines at the moment. “Systems epidemiology” is an attempt to take into account the interactions between
these various fields of research.
doi:10.1371/journal.ppat.1000600.g004
Mathematical models are already being used extensively to study
the epidemiology of TB and to guide control policies [39]. Recent
applications have shown that socioeconomic factors are key drivers
of today’s TB epidemic [40]. In addition, much theoretical emphasis
has been placed on trying to define the impact that drug resistance
will have on the global TB epidemic [41]. Some of this theoretical
work has become more complex by incorporating new biological
insights obtained empirically and through targeted experimental
studies. Early theoretical studies on the spread of drug-resistant
MTBC were based on the assumption that all drug-resistant
bacteria had an inherent fitness disadvantage compared to drug-susceptible strains [42]; however, as is becoming clear from
experimental and molecular epidemiological investigation, substantial heterogeneity exists with respect to the reproductive success of
drug-resistant strains [43–46]. Newer mathematical models account
for some of this heterogeneity [47–49].
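The role of relative fitness in such models can be sketched with a deliberately simplified two-strain transmission model. This is not any of the published models in [41–49]: it ignores latency, reinfection, and demography, and every parameter value and name is hypothetical. The resistant strain transmits at a rate scaled by its relative fitness but is cured more slowly under treatment, so which strain prevails depends on how large the fitness cost is.

```python
from typing import Tuple


def simulate(relative_fitness: float,
             beta: float = 1.0,            # transmission rate, sensitive strain
             cure_sensitive: float = 0.5,  # removal rate under treatment
             cure_resistant: float = 0.3,  # slower removal: treatment works less well
             years: float = 500.0,
             dt: float = 0.05) -> Tuple[float, float]:
    """Euler integration of a two-strain model without latency.

    Returns the final fractions of the population infected with the
    drug-susceptible and the drug-resistant strain.
    """
    sensitive, resistant = 0.01, 0.001  # initial infected fractions
    for _ in range(int(years / dt)):
        susceptible = 1.0 - sensitive - resistant
        d_sens = (beta * susceptible - cure_sensitive) * sensitive
        d_res = (relative_fitness * beta * susceptible - cure_resistant) * resistant
        sensitive += dt * d_sens
        resistant += dt * d_res
    return sensitive, resistant


# A large fitness cost: the resistant strain is outcompeted.
costly_sens, costly_res = simulate(relative_fitness=0.5)
# A small fitness cost: the resistant strain replaces the sensitive one.
mild_sens, mild_res = simulate(relative_fitness=0.9)
```

Allowing the relative fitness to vary between resistant strains, rather than fixing one value for all of them, is the kind of heterogeneity the newer models aim to capture.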
One could imagine an expansion of such mathematical
approaches—much as systems biology operates—in which epidemiological modelling is combined with more comprehensive
biological data related to the host, the pathogen, and their
interactions (Figure 4). Of course, environmental and sociological
data would also need to be considered [40]. As mathematical
models become more finely tuned, they could in turn inform
future experimental work to test some of the specific predictions.
The genomics revolution now offers the opportunity to study host–
pathogen interactions at an unprecedented depth. To be able to
make sense out of the current and upcoming deluge of -omics data,
however, scientists will have to rely on a mathematically and
statistically robust analytical framework. Ideally, some of these
theoretical approaches will be able to accommodate increasingly
diverse sets of data in order to capture the various biological,
environmental, and social aspects of TB.
Among the newly emerging technologies, we believe that next-generation
DNA sequencing will play an important role in
improving our understanding of TB [50]. Whole-genome
sequencing could potentially become the new gold standard for
strain typing in routine molecular epidemiology [51]. For host
genetics and TB susceptibility, too, de novo DNA sequencing-based
approaches could have advantages over traditional SNP
typing [52]. For example, many of the human populations
carrying the largest proportion of the global TB burden have
not been sufficiently characterised genetically (Figure 1) [53,54],
and screening for currently limited human SNP collections might
have little relevance for these populations [55]. Furthermore,
comprehensive DNA sequencing of TB patients and controls in
various human populations could help unveil rare but biologically
relevant mutations [56]. Another approach increasingly being
used to study both the host and the pathogen is sequence-based
transcriptomics, in which gene expression is measured by
genome-wide sequencing of RNA transcripts, a method referred to as
RNA-seq [57]. One of the advantages of this approach over
existing microarray-based methods is that changes in the
expression of noncoding RNAs and other novel transcripts can
be easily detected. RNA-seq is particularly useful for genome-wide
studies of small regulatory RNAs, as such studies are more difficult
to perform using standard DNA microarrays. Recent studies, for
example, have reported a role for small regulatory RNAs in M.
tuberculosis [58], and there is little doubt more regulatory RNAs will
soon be identified by RNA-seq [57].
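One common way to turn RNA-seq read counts into comparable expression values is transcripts per million (TPM). The sketch below shows that normalisation under stated assumptions: the transcript names, lengths, and read counts are hypothetical, and real pipelines add steps such as read mapping and effective-length correction that are omitted here.

```python
from typing import Dict


def tpm(counts: Dict[str, float], lengths_bp: Dict[str, float]) -> Dict[str, float]:
    """Transcripts per million: length-normalise counts, then rescale to 1e6."""
    rates = {name: counts[name] / lengths_bp[name] for name in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("no reads were counted")
    return {name: rate / total * 1_000_000 for name, rate in rates.items()}


# Hypothetical data: a short regulatory RNA with few reads can still be
# highly expressed once transcript length is taken into account.
counts = {"geneA": 300, "geneB": 600, "sRNA_x": 50}
lengths_bp = {"geneA": 3000, "geneB": 1500, "sRNA_x": 100}
expression = tpm(counts, lengths_bp)
```

In this toy example the short transcript `sRNA_x` receives the highest TPM despite having the fewest reads, which is why length normalisation matters when short regulatory RNAs are of interest.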
Challenges for the Future
Advances in TB research are hampered by the fact that MTBC
is a Biosafety Level 3 pathogen with a long generation time,
making it slow and complex to culture. Moreover, TB is a chronic
disease that can develop over many years, and is characterised by
extended periods of latency during which MTBC cannot be
isolated from infected individuals. All of these factors complicate
and prolong the development of new interventions and their
assessment in clinical trials. As we have already mentioned, the
field has been marked by a number of dogmas that, in some cases,
might have contributed to the slow progress in TB research. New
insights are now questioning some of these views, but at the same
time, new opinions could well evolve into new dogmas. For
example, we and others have spent much of our scientific careers
seeking convincing evidence for the role of MTBC strain diversity
in human disease. Although some pieces of evidence have recently
started to emerge [59–61], the subject needs more work. One of
the problems has been that the macrophage and mouse infection
models used in these studies relied on poorly characterised strains,
and finding relevant links to human disease has been all but
impossible [14,21].
In TB control, too, potential new dogmas might emerge to limit
future progress. A strong T cell–derived interferon gamma (IFN-γ)
response appears to be crucial for the immunological control of
TB, and many MTBC antigens have been identified based on
their capacity to elicit IFN-γ responses in TB patients or their
infected contacts [62]. Some of these antigens are being developed
into new TB diagnostics and vaccines, but the potential impact of
MTBC diversity on immune responses is not generally being
considered [21]. A recent study in The Gambia showed that IFN-γ
responses to one of the key MTBC antigens differed in an MTBC
lineage–specific manner [63]. Developing a universally effective
vaccine might be the only way to eliminate TB in the future [3].
This is particularly true given the large reservoir of latently
infected individuals in the world, which would be impossible to
eliminate through prophylactic drug treatment. Considering that
natural TB infection does not protect against exogenous
reinfection and disease, however, mimicking natural infection
using attenuated strains or a cocktail of traditional IFN-γ-inducing
antigens might not necessarily be the most promising vaccine
strategy. Indeed, the largely unsuccessful implementation of BCG
vaccination might serve as a warning [64].
Acknowledgments
We thank Peter Small and Douglas Young for comments on the
manuscript.
References
1. World Health Organization (2009) Global tuberculosis control - surveillance,
planning, financing. Geneva, Switzerland: WHO.
2. Stop TB Partnership (2006) The global plan to stop TB 2006–2015. Geneva:
WHO.
3. Young DB, Perkins MD, Duncan K, Barry CE (2008) Confronting the scientific
obstacles to global control of tuberculosis. J Clin Invest 118: 1255–1265.
4. Kaufmann SH, Parida SK (2007) Changing funding patterns in tuberculosis.
Nat Med 13: 299–303.
5. Frieden TR, Fujiwara PI, Washko RM, Hamburg MA (1995) Tuberculosis in
New York City–turning the tide. N Engl J Med 333: 229–233.
6. Sreevatsan S, Pan X, Stockbauer KE, Connell ND, Kreiswirth BN, et al. (1997)
Restricted structural gene polymorphism in the Mycobacterium tuberculosis complex
indicates evolutionarily recent global dissemination. Proc Natl Acad Sci U S A
94: 9869–9874.
7. van Embden JD, Cave MD, Crawford JT, Dale JW, Eisenach KD, et al. (1993)
Strain identification of Mycobacterium tuberculosis by DNA fingerprinting:
recommendations for a standardized methodology. J Clin Microbiol 31:
406–409.
8. Small PM, Hopewell PC, Singh SP, Paz A, Parsonnet J, et al. (1994) The
epidemiology of tuberculosis in San Francisco. A population-based study using
conventional and molecular methods. N Engl J Med 330: 1703–1709.
9. Small PM, Shafer RW, Hopewell PC, Singh SP, Murphy MJ, et al. (1993)
Exogenous reinfection with multidrug-resistant Mycobacterium tuberculosis in
patients with advanced HIV infection. N Engl J Med 328: 1137–1144.
10. Mathema B, Kurepina NE, Bifani PJ, Kreiswirth BN (2006) Molecular
epidemiology of tuberculosis: current insights. Clin Microbiol Rev 19: 658–685.
11. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, et al. (1998) Deciphering
the biology of Mycobacterium tuberculosis from the complete genome sequence.
Nature 393: 537–544.
12. Tsolaki AG, Hirsh AE, DeRiemer K, Enciso JA, Wong MZ, et al. (2004)
Functional and evolutionary genomics of Mycobacterium tuberculosis: insights from
genomic deletions in 100 strains. Proc Natl Acad Sci U S A 101: 4865–4870.
13. Kato-Maeda M, Rhee JT, Gingeras TR, Salamon H, Drenkow J, et al. (2001)
Comparing genomes within the species Mycobacterium tuberculosis. Genome Res
11: 547–554.
14. Nicol MP, Wilkinson RJ (2008) The clinical consequences of strain diversity in
Mycobacterium tuberculosis. Trans R Soc Trop Med Hyg 102: 955–65.
15. Hirsh AE, Tsolaki AG, DeRiemer K, Feldman MW, Small PM (2004) Stable
association between strains of Mycobacterium tuberculosis and their human host
populations. Proc Natl Acad Sci U S A 101: 4871–4876.
16. Supply P, Warren RM, Banuls AL, Lesjean S, Van Der Spuy GD, et al. (2003)
Linkage disequilibrium between minisatellite loci supports clonal evolution of
Mycobacterium tuberculosis in a high tuberculosis incidence area. Mol Microbiol 47:
529–538.
17. Brosch R, Gordon SV, Marmiesse M, Brodin P, Buchrieser C, et al. (2002) A
new evolutionary scenario for the Mycobacterium tuberculosis complex. Proc Natl
Acad Sci U S A 99: 3684–3689.
18. Mostowy S, Cousins D, Brinkman J, Aranaz A, Behr MA (2002) Genomic
deletions suggest a phylogeny for the Mycobacterium tuberculosis complex. J Infect
Dis 186: 74–80.
19. Reed MB, Pichler VK, McIntosh F, Mattia A, Fallow A, et al. (2009) Major
Mycobacterium tuberculosis lineages associate with patient country of origin. J Clin
Microbiol 47: 1119–28.
20. Gagneux S, Deriemer K, Van T, Kato-Maeda M, de Jong BC, et al. (2006)
Variable host-pathogen compatibility in Mycobacterium tuberculosis. Proc Natl Acad
Sci U S A 103: 2869–2873.
21. Gagneux S, Small PM (2007) Global phylogeography of Mycobacterium tuberculosis
and implications for tuberculosis product development. Lancet Infect Dis 7:
328–337.
22. Baker L, Brown T, Maiden MC, Drobniewski F (2004) Silent nucleotide
polymorphisms and a phylogeny for Mycobacterium tuberculosis. Emerg Infect Dis
10: 1568–1577.
23. Gutacker MM, Mathema B, Soini H, Shashkina E, Kreiswirth BN, et al. (2006)
Single-nucleotide polymorphism-based population genetic analysis of Mycobacterium tuberculosis strains from 4 geographic sites. J Infect Dis 193: 121–128.
24. Filliol I, Motiwala AS, Cavatore M, Qi W, Hernando Hazbon M, et al. (2006)
Global phylogeny of Mycobacterium tuberculosis based on single nucleotide
polymorphism (SNP) analysis: insights into tuberculosis evolution, phylogenetic
accuracy of other DNA fingerprinting systems, and recommendations for a
minimal standard SNP set. J Bacteriol 188: 759–772.
25. Brudey K, Driscoll JR, Rigouts L, Prodinger WM, Gori A, et al. (2006)
Mycobacterium tuberculosis complex genetic diversity: mining the fourth international spoligotyping database (SpolDB4) for classification, population genetics
and epidemiology. BMC Microbiol 6: 23.
26. Maiden MC, Bygraves JA, Feil E, Morelli G, Russell JE, et al. (1998) Multilocus
sequence typing: a portable approach to the identification of clones within
populations of pathogenic microorganisms. Proc Natl Acad Sci U S A 95:
3140–3145.
27. Maiden MC (2006) Multilocus sequence typing of bacteria. Annu Rev Microbiol
60: 561–588.
28. Achtman M (2008) Evolution, population structure, and phylogeography of
genetically monomorphic bacterial pathogens. Annu Rev Microbiol 62: 53–70.
29. Hershberg R, Lipatov M, Small PM, Sheffer H, Niemann S, et al. (2008) High
functional diversity in Mycobacterium tuberculosis driven by genetic drift and human
demography. PLoS Biol 6: e311.
30. Gutierrez C, Brisse S, Brosch R, Fabre M, Omais B, et al. (2005) Ancient origin
and gene mosaicism of the progenitor of Mycobacterium tuberculosis. PLoS
Pathogens 1: 1–7.
31. World Health Organization (2008) Anti-tuberculosis drug resistance in the world
report no. 4. Geneva, Switzerland: WHO.
32. Zak DE, Aderem A (2009) Systems biology of innate immunity. Immunol Rev
227: 264–282.
33. Gilchrist M, Thorsson V, Li B, Rust AG, Korb M, et al. (2006) Systems biology
approaches identify ATF3 as a negative regulator of Toll-like receptor 4. Nature
441: 173–178.
34. Querec TD, Akondy RS, Lee EK, Cao W, Nakaya HI, et al. (2009) Systems
biology approach predicts immunogenicity of the yellow fever vaccine in
humans. Nat Immunol 10: 116–125.
35. Stuart LM, Boulais J, Charriere GM, Hennessy EJ, Brunet S, et al. (2007) A
systems biology analysis of the Drosophila phagosome. Nature 445: 95–101.
36. Young D, Stark J, Kirschner D (2008) Systems biology of persistent infection:
tuberculosis as a case study. Nat Rev Microbiol 6: 520–8.
37. Gill WP, Harik NS, Whiddon MR, Liao RP, Mittler JE, et al. (2009) A
replication clock for Mycobacterium tuberculosis. Nat Med 15: 211–4.
38. Young DB, Gideon HP, Wilkinson RJ (2009) Eliminating latent tuberculosis.
Trends Microbiol 17: 183–188.
39. Cohen T, Dye C, Colijn C, Murray M (2009) Mathematical models of the
epidemiology and control of drug-resistant TB. Expert Rev Resp Med in press.
40. Lonnroth K, Jaramillo E, Williams BG, Dye C, Raviglione M (2009) Drivers of
tuberculosis epidemics: The role of risk factors and social determinants. Soc Sci
Med 68: 2240–6.
41. Borrell S, Gagneux S (2009) Infectiousness, reproductive fitness, and evolution of
drug-resistant Mycobacterium tuberculosis. Int J Tuberc Lung Dis in press.
42. Dye C, Williams BG, Espinal MA, Raviglione MC (2002) Erasing the world’s
slow stain: strategies to beat multidrug-resistant tuberculosis. Science 295:
2042–2046.
43. Bottger EC, Springer B, Pletschette M, Sander P (1998) Fitness of antibiotic-resistant microorganisms and compensatory mutations. Nat Med 4: 1343–1344.
44. Gagneux S, Burgos MV, DeRiemer K, Encisco A, Munoz S, et al. (2006) Impact
of bacterial genetics on the transmission of isoniazid-resistant Mycobacterium
tuberculosis. PLoS Pathog 2: e61.
45. Gagneux S, Long CD, Small PM, Van T, Schoolnik GK, et al. (2006) The
competitive cost of antibiotic resistance in Mycobacterium tuberculosis. Science 312:
1944–1946.
46. van Soolingen D, de Haas PE, van Doorn HR, Kuijper E, Rinder H, et al.
(2000) Mutations at amino acid position 315 of the katG gene are associated with
high-level resistance to isoniazid, other drug resistance, and successful
transmission of Mycobacterium tuberculosis in the Netherlands. J Infect Dis 182:
1788–1790.
47. Cohen T, Murray M (2004) Modeling epidemics of multidrug-resistant M.
tuberculosis of heterogeneous fitness. Nat Med 10: 1117–1121.
48. Blower SM, Chou T (2004) Modeling the emergence of the ‘hot zones’:
tuberculosis and the amplification dynamics of drug resistance. Nat Med 10:
1111–1116.
49. Dye C (2009) Doomsday postponed? Preventing and reversing epidemics of
drug-resistant tuberculosis. Nat Rev Microbiol 7: 81–87.
50. Mardis ER (2008) Next-generation DNA sequencing methods. Annu Rev
Genomics Hum Genet 9: 387–402.
51. MacLean D, Jones JD, Studholme DJ (2009) Application of ‘next-generation’
sequencing technologies to microbial genetics. Nat Rev Microbiol 7: 287–296.
52. Hardy J, Singleton A (2009) Genomewide association studies and human
disease. N Engl J Med 360: 1759–1768.
53. Tishkoff SA, Reed FA, Friedlaender FR, Ehret C, Ranciaro A, et al. (2009) The
genetic structure and history of Africans and African Americans. Science 324:
1035–44.
54. Basu A, Mukherjee N, Roy S, Sengupta S, Banerjee S, et al. (2003) Ethnic India:
a genomic view, with special reference to peopling and structure. Genome Res
13: 2277–2290.
55. Campbell MC, Tishkoff SA (2008) African Genetic Diversity: Implications for
human demographic history, modern human origins, and complex disease
mapping. Annu Rev Genomics Hum Genet 9: 403–33.
56. Goldstein DB (2009) Common genetic variation and human traits. N Engl J Med
360: 1696–1698.
57. Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for
transcriptomics. Nat Rev Genet 10: 57–63.
58. Arnvig KB, Young DB (2009) Identification of small RNAs in Mycobacterium
tuberculosis. Mol Microbiol 73: 397–408.
59. de Jong BC, Hill PC, Aiken A, Awine T, Antonio M, et al. (2008) Progression to
active tuberculosis, but not transmission, varies by Mycobacterium tuberculosis
lineage in the Gambia. J Infect Dis 198: 1037–43.
60. Caws M, Thwaites G, Dunstan S, Hawn TR, Thi Ngoc Lan N, et al. (2008) The
influence of host and bacterial genotype on the development of disseminated
disease with Mycobacterium tuberculosis. PLoS Pathog 4: e1000034.
61. Thwaites G, Caws M, Chau TT, D’Sa A, Lan NT, et al. (2008) The relationship
between Mycobacterium tuberculosis genotype and the clinical phenotype of
pulmonary and meningeal tuberculosis. J Clin Microbiol 46: 1363–8.
62. Ernst JD, Lewinsohn DM, Behar S, Blythe M, Schlesinger LS, et al. (2007)
Meeting report: NIH workshop on the Tuberculosis Immune Epitope Database.
Tuberculosis (Edinb) 88: 366–70.
63. de Jong BC, Hill PC, Brookes RH, Gagneux S, Jeffries DJ, et al. (2006)
Mycobacterium africanum elicits an attenuated T Cell response to Early Secreted
Antigenic Target, 6 kDa, in patients with tuberculosis and their household
contacts. J Infect Dis 193: 1279–1286.
64. Andersen P, Doherty TM (2005) Opinion: The success and failure of BCG - implications for a novel tuberculosis vaccine. Nat Rev Microbiol 3: 656–62.
Review
Helicobacter pylori’s Unconventional Role in Health and
Disease
Marion S. Dorer, Sarah Talarico, Nina R. Salama*
Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
countries has fallen dramatically, for unknown reasons, with a
corresponding decrease in gastric cancer [7]. This public health
success is tempered by the recent demonstration of an inverse
relationship between H. pylori infection and esophageal adenocarcinoma, Barrett’s esophagus, and reflux esophagitis [8]. H. pylori
has been with humans since our earliest days, thus it is not
surprising that its relationship is that of both a commensal
bacterium and a pathogen, causing some diseases and possibly
protecting against others. In addition, it is genetically diverse,
likely as a result of constant exposure to both environmental and
immunological selection, suggesting that genetic diversification is a
strategy for long-term colonization.
Abstract: The discovery of a bacterium, Helicobacter
pylori, that is resident in the human stomach and causes
chronic disease (peptic ulcer and gastric cancer) was
radical on many levels. Whereas the mouth and the colon
were both known to host a large number of microorganisms, collectively referred to as the microbiome, the
stomach was thought to be a virtual Sahara desert for
microbes because of its high acidity. We now know that H.
pylori is one of many species of bacteria that live in the
stomach, although H. pylori seems to dominate this
community. H. pylori does not behave as a classical
bacterial pathogen: disease is not solely mediated by
production of toxins, although certain H. pylori genes,
including those that encode exotoxins, increase the risk of
disease development. Instead, disease seems to result
from a complex interaction between the bacterium, the
host, and the environment. Furthermore, H. pylori was the
first bacterium observed to behave as a carcinogen. The
innate and adaptive immune defenses of the host,
combined with factors in the environment of the
stomach, apparently drive a continuously high rate of
genomic variation in H. pylori. Studies of this genetic
diversity in strains isolated from various locations across
the globe show that H. pylori has coevolved with humans
throughout our history. This long association has given
rise not only to disease, but also to possible protective
effects, particularly with respect to diseases of the
esophagus. Given this complex relationship with human
health, eradication of H. pylori in nonsymptomatic
individuals may not be the best course of action. The
story of H. pylori teaches us to look more deeply at our
resident microbiome and the complexity of its interactions, both in this complex population and within our
own tissues, to gain a better understanding of health and
disease.
The Role of Infection in Disease Risk
H. pylori infection is generally acquired during childhood and,
without specific antibiotic treatment, can persist for the lifetime of
the host. Disease often does not develop until adulthood, after
decades of infection, and H. pylori induces variable pathologies in
the stomach. Duodenal ulcer disease is characterized by gastritis
that is largely confined to the antrum (the distal compartment of
the stomach), relatively low inflammation of the corpus (the
middle, acid-secreting compartment), and high levels of stomach
acid secretion (Figure 1A). Those with gastric ulcer or stomach
cancer have high levels of inflammation of the corpus, multifocal
gastric atrophy, and low levels of stomach acid secretion, due to
the destruction of stomach acid–secreting parietal cells (Figure 1B)
[9,10]. Some of this inflammatory response is controlled by the
cytokine IL-1β, which is induced by H. pylori infection [11] and
both elicits a proinflammatory response and inhibits secretion of
gastric acid [12]. Polymorphisms in the interleukin gene cluster,
including IL-1β, are risk factors for H. pylori–associated gastric
cancer [13,14], and studies of the transcriptional response of both
human and model hosts to H. pylori confirm induction of
transcriptional regulators of proinflammatory programs. In
Citation: Dorer MS, Talarico S, Salama NR (2009) Helicobacter pylori’s
Unconventional Role in Health and Disease. PLoS Pathog 5(10): e1000544.
doi:10.1371/journal.ppat.1000544
Common wisdom circa 1980 suggested that the stomach, with
its low pH, was a sterile environment. Then, endoscopy of the
stomach became common and, in 1984, pathologist Robin
Warren and gastroenterologist Barry Marshall saw an extracellular, curved bacillus, often in dense sheets, lining the stomach
epithelium of patients with gastritis (inflammation of the stomach)
and ulcer disease [1]. Soon, the medical community understood
that the gram-negative bacterium Helicobacter pylori, not stress, is
the major cause of stomach inflammation, which, in some infected
individuals, precedes peptic ulcer disease (10%–20%), distal gastric
adenocarcinoma (1%–2%), and gastric mucosal-associated lymphoid tissue (MALT) lymphoma (<1%) [2–5]. Thus, H. pylori
gained distinction as the only known bacterial carcinogen [6]. It is
believed that half of the world’s population is infected with H.
pylori; however, the burden of disease falls disproportionately on
less-developed countries. The incidence of infection in developed
Editor: Marianne Manchester, The Scripps Research Institute, United States of
America
Published October 26, 2009
Copyright: © 2009 Dorer et al. This is an open-access article distributed under
the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.
Funding: Work in the Salama lab is supported by National Institutes of Health
grant AI054423. The funder had no role in study design, data collection and
analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests
exist.
* E-mail: [email protected]
This article is part of the “Genomics of Emerging Infectious Disease” PLoS Journal
collection (http://ploscollections.org/emerginginfectiousdisease/).
October 2009 | Volume 5 | Issue 10 | e1000544
Learning about Disease from H. pylori
H. pylori expands our view of how microbes survive at high levels
while activating inflammatory responses and shows us that microbes
may be underappreciated as an important factor in chronic disease
pathogenesis. In the case of pathogens that cause acute infections,
there is a massive inflammatory response, which often supports
bacterial replication and transmission. Alternatively, some pathogens, such as Mycobacterium tuberculosis, persist in the host by
manipulating the immune response to create a protected compartment. H. pylori introduces a third strategy; it actively replicates and
maintains a continuous balance with the inflammatory response
over years of infection with little evidence for increased H. pylori–
related disease upon immune suppression [25]. As the role of
chronic inflammation in many diseases including cardiovascular
disease, diabetes mellitus, Alzheimer’s disease, and others is
increasingly recognized, researchers are focusing on infectious
agents as one possible source of this chronic inflammation.
Genomic Insights into the Biology of H. pylori
The study of H. pylori is strongly influenced by the genomic age.
The sequencing of its genome was completed in 1997 [26], just 13
years after Marshall and Warren reported their discovery.
However, almost a quarter (24%) of H. pylori genes have no
sequence similarity with genes available in public databases [27],
suggesting that lessons learned from well-studied bacteria like
Escherichia coli would not necessarily apply to this evolutionarily
distinct Epsilonproteobacteria. By using more advanced bioinformatic approaches, researchers are now identifying some pathways
first thought absent in H. pylori. For example, H. pylori appeared to
lack the E. coli recBCD pathway, which is involved in homologous
recombination and DNA double-strand break repair. More careful
examination of conserved domains and motifs, however, identified
the H. pylori addA and addB genes, which are present in most gram-positive and many gram-negative bacteria and whose protein
products have enzymatic functions similar to those of the recBCD
pathway [28].
By 1999, H. pylori was the first species to have complete genomes
sequenced from two different strains—an important milestone,
given its genetic diversity. Comparison of the two genomes
revealed that 6%–7% of the genes were present in one strain but
not in the other. There was also a high level of nucleotide diversity
between the two strains, with only eight genes sharing at least 98%
nucleotide identity; however, most nucleotide differences were
synonymous changes [27]. Microarrays designed upon these
sequences were then used for comparative genomic hybridization
of H. pylori strains isolated from different ethnic groups and
geographic areas [29,30]. These studies found that 25% of H. pylori
genes are variably present among strains. Such genome-wide
analyses have played an important role in dividing H. pylori genes
into two classes: variable genes that are absent in some strains and
core genes that are present in all strains analyzed. The variable
genes are likely adaptive for different environmental niches, which
for the human stomach–restricted H. pylori comprise genetically
distinct hosts. The largest annotated class of variable genes encode
proteins expressed on or that modify the bacterial cell surface
(outer membrane proteins and proteins involved in lipopolysaccharide synthesis) [30], consistent with a function at the interface
of the bacteria and host. The core genes have diverse functions.
Some core genes are required for viability in culture. A genomic
study that utilized microarray-based mapping of a genome-saturating transposon library (a collection of H. pylori strains that
includes transposon mutants randomly distributed throughout the
genome) revealed that 23% of the genome is required for viability
in culture because these genes could not tolerate transposon
Figure 1. Distinct pathologies of H. pylori–induced disease. (A)
Duodenal ulcer disease correlates with high inflammation in the antrum
(red bursts), lower levels of inflammation in the corpus, and high acid
secretion (+). (B) Gastric ulcer or adenocarcinoma correlates with
increased inflammation in the corpus, low acid secretion, and multifocal
atrophy (wavy lines).
doi:10.1371/journal.ppat.1000544.g001
addition, transcription profiles reveal induction of several
chemokines and cytokines including those produced by nonlymphoid cells, and robust induction of innate immune defenses
including iron sequestration proteins and antimicrobial peptides
[15]. These studies suggest it would be wise to explore diverse
functional classes of genes for host genetic variant associations with
H. pylori disease progression. To this end, H. pylori researchers are
eagerly awaiting an unbiased genome-wide association study of
risk factors associated with progression to intestinal-type gastric
cancer or peptic ulcer disease in patients infected with H. pylori.
Such a study has been completed for sporadic diffuse-type gastric
cancer, which can be associated with H. pylori infection, revealing
two candidate loci, one that encodes a likely tumor suppressor
(prostate stem cell antigen [PSCA]) [16]. Genomic studies of this
sort will help elucidate host factors that synergize with H. pylori
infection to cause disease.
The association of H. pylori infection with gastric cancer raises
the interesting question of whether H. pylori encodes one or more
oncogenes. Oncogenic viruses initiate and promote cellular
transformation by integrating virally encoded oncogenes into the
host genome [17,18]. By contrast, H. pylori remains primarily
extracellular and does not integrate its genome into the host DNA.
The bacterium can still affect the function of host cells, however,
by translocating a bacterial protein, CagA, into host cells via a
specialized secretion system called the cag Type IV secretion
system (T4SS) [19,20]. In host cells, CagA interacts with a number
of cellular complexes implicated in oncogenesis [21,22]. Despite
elucidation of potentially transforming activities, transgenic
expression of CagA in the mouse stomach is only weakly
oncogenic [23]. As the cag T4SS also induces proinflammatory
cytokines via the intracellular bacterial peptidoglycan recognition
molecule Nod1, cancer progression may occur through synergy
with the host inflammatory response [24]. While CagA may not
promote cancer itself, exposure to CagA and inflammatory insults
may select for heritable host cell changes (genetic or epigenetic)
that together contribute to cancer progression.
to sample adaptive variants. HIV, for example, has a flexible
reverse transcriptase that makes point mutations, insertions,
deletions, transversions, and duplications that produce variants
that may have a selective advantage [35]. Genetic variation in a
microbe indicates constant selection by a dynamic environment,
and H. pylori is a very genetically diverse species of bacteria [36–
38]. Genetic diversification may help H. pylori to adapt to a new
host after transmission, to different micro-niches within a single
host, and to changing conditions in the host over time—for
example, by avoiding clearance by host defenses.
Genetic diversity arises from within-genome diversification as
well as from reassortment by recombination with DNA from other
infecting H. pylori, generating novel clones within the stomach
(Figure 2). Within-genome diversification can include point
mutations, intragenomic recombination, and slipped-strand mispairing during DNA replication within repetitive sequences.
Reassortment can occur by recombination with either DNA from
a superinfecting H. pylori strain or a variant clone of the same
strain. Central to this reassortment is H. pylori’s natural
competence—the ability to take up exogenous DNA and
incorporate it into its genome. Evidence from our lab shows that
natural competence is induced by DNA damage, suggesting that
H. pylori responds to stress by diversifying its genome (MSD and
NRS, unpublished data). However, there are controls on this
rampant genetic exchange: restriction-modification systems, which
include a restriction endonuclease that cleaves a specific DNA
sequence and a DNA methyltransferase that protects the
bacterium’s own DNA from being cleaved by methylating the
target DNA sequence. Genes that encode restriction-modification
systems compose the second largest class of variably present genes
with known function, so the complement of available restriction-modification systems varies between strains, giving a methylation
code to the DNA from each strain. This mechanism serves to limit
or prevent recombination between H. pylori strains as well as
between H. pylori and other bacteria or eukaryotic cells [39].
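The barrier that restriction-modification systems place on DNA exchange can be sketched as a toy model. The code below is a deliberate simplification and not a description of any real H. pylori system: the recognition motifs, the strain names, and the function `survives_restriction` are all invented for illustration, and the model assumes that a donor fully methylates every motif its own systems recognize and that one unprotected motif is enough for cleavage.

```python
# Toy model of a restriction-modification barrier to DNA uptake.
# Motifs and strains are hypothetical; real systems are more complex
# (strand specificity, incomplete methylation, enzyme kinetics).


def survives_restriction(dna: str, donor_systems: set, recipient_systems: set) -> bool:
    """Return True if incoming DNA escapes cleavage in the recipient.

    dna               -- sequence of the incoming fragment
    donor_systems     -- motifs methylated (protected) by the donor strain
    recipient_systems -- motifs cleaved by the recipient when unmethylated
    """
    for motif in recipient_systems:
        # A motif the recipient cuts is dangerous only if the donor
        # did not carry the matching methyltransferase.
        if motif in dna and motif not in donor_systems:
            return False
    return True


# Two hypothetical strains with different complements of systems.
strain_x = {"GATC", "CCGG"}
strain_y = {"GATC"}
fragment = "TTGATCAACCGGTT"  # contains both motifs

x_to_y = survives_restriction(fragment, strain_x, strain_y)  # X's DNA entering Y
y_to_x = survives_restriction(fragment, strain_y, strain_x)  # Y's DNA entering X
unmethylated = survives_restriction(fragment, set(), strain_y)  # foreign DNA entering Y
```

Under these assumptions the barrier is asymmetric: DNA from the strain with more systems is protected in the strain with fewer, but not the reverse, which is one way a strain-specific methylation pattern could limit exchange.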
The H. pylori genome encodes relatively few proteins that
regulate transcription. Instead, some of the same processes that
govern the generation of genetic diversity (i.e., slipped-strand
mispairing, methyltransferase activity, and recombination) also
play an important role in varying gene expression in response to
environmental cues. There are 46 H. pylori genes that have long
repeats of one or two nucleotides that are prone to slipped-strand
mispairing during replication [26,27,40]. These genes are phase-variable because changes in the number of repeats can shift the
reading frame of the gene, switching gene expression on or off
(Figure 2). In addition, many H. pylori promoters have mononucleotide repeats that regulate gene expression by changing the
spacing between important regulatory sites in these promoters.
Orphan methyltransferases, which have lost their corresponding
restriction enzyme, may also regulate gene expression by
methylating sequences in the promoter region of genes, and some
of the methyltransferase genes are themselves subject to phase-variable expression. Recombination regulates gene expression
through deletions and duplications that occur during gene
conversion and locus switching. These mechanisms suggest that
H. pylori survives by constantly generating variants that adapt its
physiology to new environments.
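The frame-shifting logic behind phase variation can be made concrete with a toy model. The sketch below is illustrative only: the sequence, the poly-C tract, and the function names `build_gene` and `expression_state` are invented for this example and do not correspond to any real H. pylori gene. It assumes a gene counts as "ON" only when translation from the start codon reaches the final stop codon without meeting a premature stop.

```python
# Toy model of phase variation by slipped-strand mispairing.
# The sequence is invented for illustration; it is not an H. pylori gene.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def build_gene(repeat_len: int) -> str:
    """Return a toy coding sequence containing a poly-C tract of the given length."""
    upstream = "ATGGCA"            # start codon plus one codon
    downstream = "GATAAGGTTTGA"    # GAT AAG GTT TGA when read in frame
    return upstream + "C" * repeat_len + downstream


def expression_state(seq: str) -> str:
    """Return "ON" if the first in-frame stop is the final codon, else "OFF"."""
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    for idx, codon in enumerate(codons):
        if codon in STOP_CODONS:
            full_length = idx == len(codons) - 1 and len(seq) % 3 == 0
            return "ON" if full_length else "OFF"
    return "OFF"  # frame shifted past the intended stop codon


# Slippage adds or removes single repeat units; only tract lengths that
# preserve the reading frame leave the toy gene switched on.
states = {n: expression_state(build_gene(n)) for n in range(8, 13)}
```

In this toy gene, tract lengths of 9 and 12 keep the frame intact, while 8, 10, and 11 shift it, so a one-nucleotide slip in either direction switches expression off and a compensating slip can switch it back on.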
One example of how H. pylori’s genetic variability helps it adapt
to new environments involves its adhesin genes, which encode
proteins that bind to the Lewis human blood group antigens,
which are carbohydrate-based epitopes [41]. The protein encoded
by one of these adhesin genes, BabA, binds the Lewis-b antigen on
the gastric mucosa, helping the bacterium adhere to the mucosa.
The babA gene is silent in some H. pylori strains but can be
Box 1. Tracking Human Genealogy with H.
pylori Genomics
Currently, a number of companies propose to predict your
“genetic genealogy” from the DNA in a cheek swab. They
‘‘genetic genealogy’’ from the DNA in a cheek swab. They
do this by analyzing informatively variable parts of our
genomes (such as the Y chromosome or mitochondrial
DNA) that show characteristic differences between ethnic
and geographic populations; thus, they can tell if you may
be distantly related to Genghis Khan, for example.
Unfortunately, population bottlenecks [51], small population sizes, and long generation times have limited the
amount of genetic diversity in the human population that
can be used for these analyses. It turns out, however, that
genomic sequencing of the H. pylori strain harbored by an
individual does a better job in resolving ancestry than the
usual human genomic markers [52]. This is because of high
genetic diversity among H. pylori strains [53], a restricted
mode of transmission (primarily within families or households [54]), and the association of H. pylori with humans
throughout our evolution [55]. A major source of H. pylori’s
genetic diversity is recombination between strains [38],
which blurs signatures of descent. Despite this confounding
factor, Achtman and colleagues [53] identified evolutionary
signatures in strain sequences from diverse geographic
sources. These signatures, combined with new statistical
tools that take into account admixture and recombination
[55], have tracked ancient human migrations, such as our
emergence from Africa [55], and more recent events such as
colonization of the Pacific islands [56]. H. pylori gene
sequences can even distinguish between the Buddhist and
Muslim ethnic groups that have coexisted for at least 1,000
years in Ladakh [52]. The fact that H. pylori has maintained
evolutionarily distinct strain signatures during many generations of contact suggests either that interracial interactions
that promote transmission are very limited or that
additional mechanisms prevent strains from one ethnic
population from establishing a foothold in hosts of another
ethnic population.
insertion [31]. Additional core genes are essential only in the
context of host infection and several groups have completed
screens for transposon mutants that fail to colonize animal models
of infection [32,33]. An example of such a colonization core gene
is addA, which is required for recombinational repair of DNA
double-strand breaks, presumably caused by the host inflammatory response [28].
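The division of genes into core and variable classes described above reduces to set operations on per-strain presence calls. The following is a minimal sketch under stated assumptions: the strain and gene names are placeholders, the presence calls are invented, and `classify_genes` is a hypothetical helper rather than the method used in the cited hybridization studies.

```python
# Minimal sketch: split genes into "core" (present in every strain analyzed)
# and "variable" (absent from at least one strain). All data are invented.


def classify_genes(presence: dict) -> tuple:
    """presence maps a strain name to the set of genes detected in it."""
    gene_sets = [set(genes) for genes in presence.values()]
    if not gene_sets:
        return set(), set()
    core = set.intersection(*gene_sets)
    variable = set.union(*gene_sets) - core
    return core, variable


# Hypothetical presence/absence calls for three strains.
hybridization_calls = {
    "strain_1": {"gene_a", "gene_b", "gene_c", "gene_d"},
    "strain_2": {"gene_a", "gene_b", "gene_d"},
    "strain_3": {"gene_a", "gene_b", "gene_c"},
}

core, variable = classify_genes(hybridization_calls)
variable_fraction = len(variable) / (len(core) + len(variable))
```

Because a gene is core only relative to the strains analyzed, adding strains to such an analysis can move genes from the core class to the variable class but never the other way.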
The nucleotide sequence diversity in H. pylori’s core genes can
distinguish between different ethnic and geographic human
populations, demonstrating that passage of H. pylori between
closely related humans has continued uninterrupted over tens of
thousands of years (see Box 1). Different geographic and ethnic
groups that have similar infection rates have quite varied relative
risks of H. pylori–associated diseases such as gastric cancer [34].
Thus, in addition to host genetic and environmental exposures,
differences among strains likely contribute to variation in disease
risk. Consequently, studies of pathogenesis need to be reproduced
in representative strain backgrounds to ensure that discoveries in
one strain apply in strain populations with a diverse evolutionary
history.
H. pylori Diversification during Persistent Infection
Genetic diversification can aid in the persistence of organisms
that continue to replicate during chronic infection, allowing them
Figure 2. Mechanisms that create genetic diversity in H. pylori. Colored arrows represent different genes, and the correspondingly colored
triangles, rectangles, and circles represent the proteins encoded by these genes. Diversification mechanisms (right side of figure) include
spontaneous point mutations, slipped-strand mispairing, and intragenomic recombination. Allelic changes involving nonsynonymous point
mutations and mosaic genes resulting from intragenomic recombination can alter the function and/or the antigenic epitopes of the encoded protein.
Gene expression can also be regulated by gene conversion resulting from intragenomic recombination, and phase variation mediated by slipped-strand mispairing. Reassortment of genes (left side of figure) by natural transformation with exogenous DNA also contributes to genetic diversity.
Natural transformation with DNA from a superinfecting strain, for example, can introduce new genes and new alleles of already present genes
(horizontal gene transfer). Similarly, natural transformation with DNA from a variant clone of the same strain can further propagate an advantageous
allele acquired by within-genome diversification.
doi:10.1371/journal.ppat.1000544.g002
H. pylori’s Interaction with the Microbiome
expressed if it recombines with the babB gene, an event mediated
by homologous sequences at the 5′ and 3′ ends of the two genes
[42]. Thus, recombination can help H. pylori alter its adherence
properties to adapt to selective pressures in the host. These
selective pressures may include variation in the host receptors
present or in conditions that favor a shift in the ratio of bacteria
adherent to the gastric cell epithelium over those swimming freely
in the mucus.
Genetic variation may also be important for the ability of H.
pylori to evade the host immune system. H. pylori further exploits
the Lewis antigen system by “camouflaging” its surface lipopolysaccharide with its own Lewis-type antigen, which mimics that of
the individual host. The bacterium adapts the spectrum of Lewis
antigens it expresses by phase variation of the genes involved in
their biosynthesis [43]. Furthermore, recombination among the
many members of the large outer membrane protein (omp) gene
family has the potential to create mosaic omp genes, generating
antigenic variation that may keep H. pylori ahead of the ability of
the host’s immune system to recognize these cell surface exposed
epitopes.
H. pylori shares its niche with the stomach microbiome, the
collection of microorganisms living on and in us. Study of
microorganisms was once limited to only those microbes that could
be cultured in the laboratory. Advances in sequencing technology
now allow us to study the collection of genes encoded by any
group of organisms—so-called metagenomics—making it possible
to characterize also the microbes that cannot be cultured but
nevertheless affect our health. Given that H. pylori engages in DNA
exchange, the metagenome may serve as a repository for novel
traits. When present, H. pylori dominates the microbiome in the
stomach [44,45], although the effect of this dominance is not
known. Perhaps H. pylori infection changes the composition of the
stomach microbiome, with unknown consequences.
Challenges for the Future
H. pylori is considered pathogenic, even carcinogenic. With this
simple view, eradication seems an obvious choice. In reality,
however, the relationship between H. pylori and disease is more
nuanced. A recent trial showed that, like the cancer risk associated
with smoking, the cancer risk from H. pylori diminished
measurably only 12 years after eradication of the infection [46].
Some studies suggest that infection may prevent diseases of the
esophagus, and there is a debate in the literature concerning a
relationship between H. pylori and childhood asthma [8,47,48].
There is clear consensus that H. pylori should be eliminated in cases
of peptic ulcer disease, gastric MALT lymphoma, and early gastric
cancer, as well as in first-degree relatives of gastric cancer patients and
in uninvestigated dyspepsia in high-prevalence populations. Despite
its potential to prevent ulcer and cancer, universal eradication of
H. pylori infection has not gained wide support, because of the
mixture of positive and negative disease associations with infection,
the lack of a definitive bacterial or host molecule accounting for
disease causation, and poor success rates of treating non-ulcer
dyspepsia by clearing H. pylori infection [49,50]. Thus a more
detailed picture of this host–pathogen interaction is needed and
likely will depend upon further advances in both endoscopy and
genomics.
We have a poor understanding of the immune responses to H.
pylori and the reasons that most hosts fail to clear infection. The
host restriction of H. pylori to humans and some nonhuman
primates has hampered development of robust animal models to
study the disease process. Thus progress will require improvements
in animal models and improved access to patient samples.
Endoscopy of the upper gastrointestinal tract is an invasive
procedure, so a major limitation to research is collection of
bacterial and human tissue samples from infected people.
Available samples are biased toward patients with severe
dyspepsia, ulcer symptoms, and gastric cancer, and only a small
fraction of the stomach can be sampled. Advances in less-invasive
methods, such as capsule endoscopy, may allow increased
sampling to monitor bacterial and tissue changes during chronic
colonization, including isolation and phenotypic analysis of
immune effector cells in infected tissue. Less-invasive methods
would also provide an opportunity to study infection in
asymptomatic individuals and transmission of H. pylori infection,
conditions in which the selective pressures that drive the observed
H. pylori genetic diversification likely operate.
A major opportunity to increase our understanding of how H.
pylori causes or prevents disease arises from recent advances in
high-throughput sequencing technologies. Currently, several
platforms allow researchers to accomplish in a single experiment
sequencing or resequencing of tens of H. pylori genomes,
characterization of host immune and epithelial cell types that
change during infection with highly sensitive digital expression tag
analysis, or analysis of the microbiome present in the stomach and
esophagus through metagenomic sequencing or targeted bacterial
or fungal small ribosomal subunit DNA sequencing. The sequence
data generated by such experiments will address several important
mysteries of H. pylori biology, including the timing and extent of H.
pylori genetic diversification. While strains from unrelated
individuals show dramatic variation in gene content and gene
sequence, the extent of sequence variation among clones during
persistent infection of a single host or upon transmission has not
been adequately sampled. Whole-genome sequencing of multiple
isolates from individual patients, with dense spatial and temporal
sampling would definitively establish when, where, and by what
mechanisms genetic diversity is generated. This information will
inform efforts to combat resistance to current antibiotics, to
develop vaccines, and to understand H. pylori’s coevolution with
humans. Exploration of the influence of H. pylori on the
microbiome will identify organisms that collaborate with or can
be antagonized by H. pylori. Such organisms may mediate some of
the disease risks that have been associated with H. pylori presence
and absence. Finally, the rapid pace of resequencing of H. pylori’s
human host will provide a deeper understanding of genetic
variation in the human population that may influence risk for H.
pylori–associated pathologies and which, by association, could
provide clues to the cellular pathways disrupted in disease. Thus,
genomic approaches to study host response, the human microbiome, bacterial genetic variation, and, perhaps most importantly,
the intersections among these components, will help researchers
determine whether eradication is appropriate for all individuals in
all populations.
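As a rough illustration of how sequence variation among isolates from a single host might be summarized once such data exist, the sketch below computes the average pairwise difference per site for a set of aligned sequences. The isolate names and the ten-base sequences are hypothetical toy data, not results from any study, and a real analysis would need proper alignment and handling of gaps and recombination.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def nucleotide_diversity(seqs: list[str]) -> float:
    """Average number of pairwise differences per aligned site."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    total = sum(hamming(a, b) for a, b in pairs)
    return total / (len(pairs) * length)


# Hypothetical toy alignment: isolates from two stomach sites and two time points.
isolates = {
    "antrum_t0": "ATGCGTACGT",
    "corpus_t0": "ATGCGTACGA",
    "antrum_t1": "ATGAGTACGA",
}
pi = nucleotide_diversity(list(isolates.values()))
print(f"pairwise diversity per site: {pi:.4f}")
```

Comparing this value within a host, between time points, and between unrelated hosts is one simple way to frame the question of when and where diversity is generated.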
Acknowledgments
We thank Olivier Humbert and Laura Sycuro for their critical comments
on the manuscript and Laura Sycuro for providing H. pylori images.
References
12. El-Omar EM (2001) The importance of interleukin 1beta in Helicobacter pylori
associated disease. Gut 48: 743–747.
13. El-Omar EM, Carrington M, Chow WH, McColl KE, Bream JH, et al. (2000)
Interleukin-1 polymorphisms associated with increased risk of gastric cancer.
Nature 404: 398–402.
14. Figueiredo C, Machado JC, Pharoah P, Seruca R, Sousa S, et al. (2002)
Helicobacter pylori and interleukin 1 genotyping: an opportunity to identify high-risk individuals for gastric carcinoma. J Natl Cancer Inst 94: 1680–1687.
15. Humbert O, Pinto-Santini DM, Salama NR (2008) Genomotyping of Helicobacter
pylori and its host: microarray-based insights on gene variation, expression and
function. In: Yamaoka Y, ed. Helicobacter pylori Molecular Genetics and Cellular
Biology. Norfolk, UK: Caister Academic Press. pp 205–244.
16. Sakamoto H, Yoshimura K, Saeki N, Katai H, Shimoda T, et al. (2008) Genetic
variation in PSCA is associated with susceptibility to diffuse-type gastric cancer.
Nat Genet 40: 730–740.
17. Maeda N, Fan H, Yoshikai Y (2008) Oncogenesis by retroviruses: Old and new
paradigms. Rev Med Virol 18: 387–405.
18. Howley PM, Livingston DM (2009) Small DNA tumor viruses: Large
contributors to biomedical sciences. Virology 384: 256–259.
19. Segal ED, Cha J, Lo J, Falkow S, Tompkins LS (1999) Altered states:
Involvement of phosphorylated CagA in the induction of host cellular growth
changes by Helicobacter pylori. Proc Natl Acad Sci U S A 96: 14559–14564.
20. Stein M, Rappuoli R, Covacci A (2000) Tyrosine phosphorylation of the
Helicobacter pylori CagA antigen after cag-driven host cell translocation. Proc Natl
Acad Sci U S A 97: 1263–1268.
21. Bourzac KM, Guillemin K (2005) Helicobacter pylori-host cell interactions
mediated by type IV secretion. Cell Microbiol 7: 911–919.
1. Marshall BJ, Warren JR (1984) Unidentified curved bacilli in the stomach of
patients with gastritis and peptic ulceration. Lancet 1: 1311–1315.
2. Nomura A, Stemmermann GN, Chyou P, Kato I, Perez-Perez G, et al. (1991)
Helicobacter pylori infection and gastric carcinoma among Japanese Americans in
Hawaii. N Engl J Med 325: 1132–1136.
3. Parsonnet J, Friedman GD, Vandersteen DP, Chang Y, Vogelman JH, et al.
(1991) Helicobacter pylori infection and the risk of gastric carcinoma. N Engl J Med
325: 1127–1131.
4. Parsonnet J, Hansen S, Rodriguez L, Gelb AB, Warnke RA, et al. (1994)
Helicobacter pylori infection and gastric lymphoma. N Engl J Med 330: 1267–1271.
5. Kusters JG, van Vliet AH, Kuipers EJ (2006) Pathogenesis of Helicobacter pylori
infection. Clin Microbiol Rev 19: 449–490.
6. WHO (2006) Fact sheet No. 297, Cancer. World Health Organization.
7. Peek RM Jr, Blaser MJ (2002) Helicobacter pylori and gastrointestinal tract
adenocarcinomas. Nat Rev Cancer 2: 28–37.
8. Anderson LA, Murphy SJ, Johnston BT, Watson RG, Ferguson HR, et al.
(2008) Relationship between Helicobacter pylori infection and gastric atrophy
and the stages of the oesophageal inflammation, metaplasia, adenocarcinoma sequence: Results from the FINBAR case-control study. Gut 57:
734–739.
9. Amieva MR, El-Omar EM (2008) Host-bacterial interactions in Helicobacter pylori
infection. Gastroenterology 134: 306–323.
10. Rubin CE (1997) Are there three types of Helicobacter pylori gastritis?
Gastroenterology 112: 2108–2110.
11. Basso D, Scrigner M, Toma A, Navaglia F, Di Mario F, et al. (1996) Helicobacter
pylori infection enhances mucosal interleukin-1 beta, interleukin-6, and the
soluble receptor of interleukin-2. Int J Clin Lab Res 26: 207–210.
39. Humbert O, Salama NR (2008) The Helicobacter pylori HpyAXII restriction-modification system limits exogenous DNA uptake by targeting GTAC sites but
shows asymmetric conservation of the DNA methyltransferase and restriction
endonuclease components. Nucleic Acids Res 36: 6893–6906.
40. Salaun L, Linz B, Suerbaum S, Saunders NJ (2004) The diversity within an
expanded and redefined repertoire of phase-variable genes in Helicobacter pylori.
Microbiology 150: 817–830.
41. Lloyd KO (2000) The chemistry and immunochemistry of blood group A, B, H,
and Lewis antigens: Past, present and future. Glycoconj J 17: 531–541.
42. Backstrom A, Lundberg C, Kersulyte D, Berg DE, Boren T, et al. (2004)
Metastability of Helicobacter pylori bab adhesin genes and dynamics in Lewis b
antigen binding. Proc Natl Acad Sci U S A 101: 16923–16928.
43. Wirth HP, Yang M, Peek RM Jr, Tham KT, Blaser MJ (1997) Helicobacter pylori
Lewis expression is related to the host Lewis phenotype. Gastroenterology 113:
1091–1098.
44. Bik EM, Eckburg PB, Gill SR, Nelson KE, Purdom EA, et al. (2006) Molecular
analysis of the bacterial microbiota in the human stomach. Proc Natl Acad
Sci U S A 103: 732–737.
45. Andersson AF, Lindberg M, Jakobsson H, Backhed F, Nyren P, et al. (2008)
Comparative analysis of human gut microbiota by barcoded pyrosequencing.
PLoS ONE 3: e2836. doi:10.1371/journal.pone.0002836.
46. Mera R, Fontham ET, Bravo LE, Bravo JC, Piazuelo MB, et al. (2005) Long
term follow up of patients treated for Helicobacter pylori infection. Gut 54:
1536–1540.
47. Raj SM, Choo KE, Noorizan AM, Lee YY, Graham DY (2009) Evidence
against Helicobacter pylori being related to childhood asthma. J Infect Dis 199:
914–915; author reply 915–916.
48. Chen Y, Blaser MJ (2008) Helicobacter pylori colonization is inversely associated
with childhood asthma. J Infect Dis 198: 553–560.
49. Chey WD, Wong BC (2007) American College of Gastroenterology guideline on
the management of Helicobacter pylori infection. Am J Gastroenterol 102:
1808–1825.
50. Malfertheiner P, Megraud F, O’Morain C, Bazzoli F, El-Omar E, et al. (2007)
Current concepts in the management of Helicobacter pylori infection: The
Maastricht III Consensus Report. Gut 56: 772–781.
51. Cann RL, Stoneking M, Wilson AC (1987) Mitochondrial DNA and human
evolution. Nature 325: 31–36.
52. Wirth T, Wang X, Linz B, Novick RP, Lum JK, et al. (2004) Distinguishing
human ethnic groups by means of sequences from Helicobacter pylori: Lessons from
Ladakh. Proc Natl Acad Sci U S A 101: 4746–4751.
53. Achtman M, Azuma T, Berg DE, Ito Y, Morelli G, et al. (1999) Recombination
and clonal groupings within Helicobacter pylori from different geographical regions.
Mol Microbiol 32: 459–470.
54. Schwarz S, Morelli G, Kusecek B, Manica A, Balloux F, et al. (2008) Horizontal
versus familial transmission of Helicobacter pylori. PLoS Pathog 4: e1000180.
doi:10.1371/journal.ppat.1000180.
55. Falush D, Wirth T, Linz B, Pritchard JK, Stephens M, et al. (2003) Traces of
human migrations in Helicobacter pylori populations. Science 299: 1582–1585.
56. Moodley Y, Linz B, Yamaoka Y, Windsor HM, Breurec S, et al. (2009) The
peopling of the Pacific from a bacterial perspective. Science 323: 527–530.
22. Hatakeyama M (2006) Helicobacter pylori CagA — A bacterial intruder conspiring
gastric carcinogenesis. Int J Cancer 119: 1217–1223.
23. Ohnishi N, Yuasa H, Tanaka S, Sawa H, Miura M, et al. (2008) Transgenic
expression of Helicobacter pylori CagA induces gastrointestinal and hematopoietic
neoplasms in mouse. Proc Natl Acad Sci U S A 105: 1003–1008.
24. Viala J, Chaput C, Boneca IG, Cardona A, Girardin SE, et al. (2004) Nod1
responds to peptidoglycan delivered by the Helicobacter pylori cag pathogenicity
island. Nat Immunol 5: 1166–1174.
25. Romanelli F, Smith KM, Murphy BS (2007) Does HIV infection alter the
incidence or pathology of Helicobacter pylori infection? AIDS Patient Care STDS
21: 908–919.
26. Tomb JF, White O, Kerlavage AR, Clayton RA, Sutton GG, et al. (1997) The
complete genome sequence of the gastric pathogen Helicobacter pylori [published
erratum appears in Nature 1997 Sep 25;389(6649):412]. Nature 388: 539–547.
27. Alm RA, Ling LS, Moir DT, King BL, Brown ED, et al. (1999) Genomic-sequence comparison of two unrelated isolates of the human gastric pathogen
Helicobacter pylori. Nature 397: 176–180.
28. Amundsen SK, Fero J, Hansen LM, Cromie GA, Solnick JV, et al. (2008)
Helicobacter pylori AddAB helicase-nuclease and RecA promote recombination-related DNA repair and survival during stomach colonization. Mol Microbiol
69: 994–1007.
29. Gressmann H, Linz B, Ghai R, Pleissner KP, Schlapbach R, et al. (2005) Gain
and loss of multiple genes during the evolution of Helicobacter pylori. PLoS Genet
1: e43. doi:10.1371/journal.pgen.0010043.
30. Salama N, Guillemin K, McDaniel TK, Sherlock G, Tompkins L, et al. (2000) A
whole-genome microarray reveals genetic diversity among Helicobacter pylori
strains. Proc Natl Acad Sci U S A 97: 14668–14673.
31. Salama NR, Shepherd B, Falkow S (2004) Global transposon mutagenesis and
essential gene analysis of Helicobacter pylori. J Bacteriol 186: 7926–7935.
32. Baldwin DN, Shepherd B, Kraemer P, Hall MK, Sycuro LK, et al. (2007)
Identification of Helicobacter pylori genes that contribute to stomach colonization.
Infect Immun 75: 1005–1016.
33. Kavermann H, Burns BP, Angermuller K, Odenbreit S, Fischer W, et al. (2003)
Identification and characterization of Helicobacter pylori genes essential for gastric
colonization. J Exp Med 197: 813–822.
34. Yamaguchi N, Kakizoe T (2001) Synergistic interaction between Helicobacter
pylori gastritis and diet in gastric cancer. Lancet Oncol 2: 88–94.
35. Johnson WE, Desrosiers RC (2002) Viral persistance: HIV’s strategies of
immune system evasion. Annu Rev Med 53: 499–518.
36. Israel DA, Salama N, Krishna U, Rieger UM, Atherton JC, et al. (2001)
Helicobacter pylori genetic diversity within the gastric niche of a single human host.
Proc Natl Acad Sci U S A 98: 14625–14630.
37. Salama NR, Gonzalez-Valencia G, Deatherage B, Aviles-Jimenez F,
Atherton JC, et al. (2007) Genetic analysis of Helicobacter pylori strain populations
colonizing the stomach at different times postinfection. J Bacteriol 189:
3834–3845.
38. Suerbaum S, Smith JM, Bapumia K, Morelli G, Smith NH, et al. (1998) Free
recombination within Helicobacter pylori. Proc Natl Acad Sci U S A 95:
12619–12624.
Review
Helminth Genomics: The Implications for Human Health
Paul J. Brindley1*, Makedonka Mitreva2, Elodie Ghedin3, Sara Lustigman4
1 Department of Microbiology, Immunology, and Tropical Medicine, George Washington University Medical Center, Washington, D. C., United States of America,
2 Genome Centre and Department of Genetics, Washington University School of Medicine, St. Louis, Missouri, United States of America, 3 Division of Infectious Diseases,
University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America, 4 New York Blood Center, Laboratory of Molecular Parasitology, New York,
New York, United States of America
received nearly the same level of support. This is partly because
helminthiases are diseases of the poorest people in the poorest
regions, but also because these pathogens are difficult to study in
the laboratory by comparison to most model eukaryotes and many
other pathogens. Standard tools and approaches, including cell
lines, culture in vitro, and animal models, are generally lacking. In
addition, the genomes of helminths are generally much more
complex than those of model organisms like yeast and fruit flies
[2].
Whereas helminth diseases are ancient scourges of humanity,
with some known from biblical times, most can also be considered
as re-emerging diseases in the sense that new outbreaks are
reported routinely in response to environmental and sociopolitical
changes [3]. For example, schistosomiasis has reemerged repeatedly
in Africa in recent times in response to hydrological changes, such as
the construction of dams, irrigation canals, and reservoirs, that
establish suitable new environments for the intermediate host
snails that transmit the parasites. Schistosomiasis has also
reemerged in mountainous and hilly regions in Sichuan, China,
where it had been controlled previously by intensive interventions
[4]. Furthermore, new strains of schistosomes are indeed emerging
through natural hybridizations between human and cattle species
of schistosomes [5].
Despite the difficulties with investigation of helminth parasites,
new insights into fundamental helminth biology are accumulating
through genome projects and the application of genome
manipulation technologies including RNA interference and
transgenesis (Figure 3). Moreover, research on the immunology
of helminth infections has contributed enormously to our
understanding of Th2 immune responses, the function of
regulatory T cells, generation of alternatively activated macrophages, and the transmission dynamics of infectious agents. It is
hoped that this progress can be translated into new and robust
drugs, diagnostics, and vaccines for the helminth diseases
Abstract: More than two billion people (one-third of
humanity) are infected with parasitic roundworms or
flatworms, collectively known as helminth parasites. These
infections cause diseases that are responsible for enormous levels of morbidity and mortality, delays in the
physical development of children, loss of productivity
among the workforce, and maintenance of poverty.
Genomes of the major helminth species that affect
humans, and many others of agricultural and veterinary
significance, are now the subject of intensive genome
sequencing and annotation. Draft genome sequences of
the filarial worm Brugia malayi and two of the human
schistosomes, Schistosoma japonicum and S. mansoni, are
now available, among others. These genome data will
provide the basis for a comprehensive understanding of
the molecular mechanisms involved in helminth nutrition
and metabolism, host-dependent development and
maturation, immune evasion, and evolution. They are
likely also to predict new potential vaccine candidates and
drug targets. In this review, we present an overview of
these efforts and emphasize the potential impact and
importance of these new findings.
Helminth Infections—The Great Neglected
Tropical Diseases
Helminth parasites are parasitic worms from the phyla
Nematoda (roundworms) and Platyhelminthes (flatworms)
(Figures 1 and 2); together, they comprise the most common
infectious agents of humans in developing countries. The collective
burden of the common helminth diseases—which range from the
dramatic sequelae of elephantiasis and blindness to the more
subtle but widespread effects on child development, pregnancy,
and productivity—rivals that of the main high-mortality conditions such as HIV/AIDS or malaria [1]. For example, based on a
recent analysis [2], 85% of the neglected tropical disease (NTD)
burden for the poorest 500 million people living in sub-Saharan
Africa (SSA) results from helminth infections. Hookworm infection
occurs in almost half of the poorest people in SSA, including 40–
50 million school-aged children and 7 million pregnant women, in
whom it is a leading cause of anemia. Schistosomiasis (192 million
cases) is the second most prevalent NTD after hookworm,
accounting for 93% of the world’s number of cases of
schistosomiasis and possibly associated with increased horizontal
transmission of HIV/AIDS. Lymphatic filariasis (46–51 million
cases) and onchocerciasis (37 million cases) are also widespread in
SSA, each disease representing a significant cause of disability and
reduction in the region’s agricultural productivity. The disease
burden estimate in disability-adjusted life years (DALYs) for total
helminth infections in SSA is 5.4–18.3 million in comparison to
40.9 million DALYs for malaria and 9.3 million DALYs for
tuberculosis. Yet, research into helminth infections has not
www.plosntds.org
Citation: Brindley PJ, Mitreva M, Ghedin E, Lustigman S (2009) Helminth
Genomics: The Implications for Human Health. PLoS Negl Trop Dis 3(10): e538.
doi:10.1371/journal.pntd.0000538
Editor: Matty Knight, Biomedical Research Institute, United States of America
Published October 26, 2009
Copyright: © 2009 Brindley et al. This is an open-access article distributed
under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.
Funding: Support from the NIH-NIAID, award numbers R01AI072773 (to PJB) and
R01AI081803 (to MM) is gratefully acknowledged. The funder had no role in study
design, data collection and analysis, decision to publish, or preparation of the
manuscript.
Competing Interests: The authors have declared that no competing interests
exist.
* E-mail: [email protected]
This article is part of the ‘‘Genomics of Emerging Infectious Disease’’ PLoS Journal
collection (http://ploscollections.org/emerginginfectiousdisease/).
1
October 2009 | Volume 3 | Issue 10 | e538
Figure 1. Montage of some of the major human helminth parasites, their developmental stages, and disease pathology. (A)
Microfilaria of Brugia malayi in a thick blood smear, stained with Giemsa (http://www.dpd.cdc.gov/dpdx/html/frames/a-f/filariasis/body_Filariasis_mic1.htm);
the microfilaria is about 250 µm in length. (B) Patient with lymphedema of the left leg due to lymphatic filariasis
(http://www.cdc.gov/ncidod/dpd/parasites/lymphaticfilariasis/index.htm). (C) Hookworm egg passed in the stool of an infected person; the microscopic egg,
barrel-shaped with a thin wall, is about 70 × 40 µm in dimension. (D) Longitudinal section through an adult hookworm attached to wall of small
intestine, ingesting host blood and mucosal wall. The parasite is about 1 cm in length. (E) Eggs of Schistosoma mansoni. The egg is about 150 × 50 µm
in dimension; the lateral spine is diagnostic for S. mansoni in comparison to the other human schistosome species. Fibrotic responses to schistosome
eggs trapped in the intestines, liver, and other organs of the infected person are the cause of the schistosomiasis pathology and morbidity. (F) A pair
of adult worms of the blood fluke Schistosoma mansoni; the more slender female worm resides in the gynecophoral canal of the thicker male. The
worms are about 1.5 cm in length, and live for many years (http://www.dpd.cdc.gov/dpdx/HTML/ImageLibrary/Schistosomiasis_il.htm).
doi:10.1371/journal.pntd.0000538.g001
protein sequence information for proteomics methods (e.g., [13]),
to name but a few applications. Quantitative analysis of ESTs
(transcriptomics), including serial expression of gene analysis, can
identify transcripts that are either over- or under-represented by
comparison to other transcripts in various helminth life cycle
stages or tissues (e.g., [14,15]), and the subset of genes evaluated
with gene ontology programs provide insights into cellular and
metabolic pathway functioning in the parasite (e.g., [16]).
Furthermore, one can identify potential targets for interventions
by applying a hierarchy of considerations including a matrix of
biological, expression, and phenotypic data [17] or by performing
a pan-phylum analysis to identify conserved parasite-specific genes
whose selective targeting will have low or no toxicity to the host
[18,19] or genes that have diverged enough from the host
counterpart, resulting in altered or absent functions [20].
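The pan-phylum filtering idea mentioned above can be sketched with simple set operations: keep gene families found in most or all parasite species surveyed, then discard any that also occur in the host. This is only a minimal illustration; the species labels, family identifiers, and the `min_fraction` threshold are hypothetical, and the cited studies use far richer criteria (expression, phenotype, and sequence divergence).

```python
def candidate_targets(presence: dict[str, set[str]],
                      parasites: list[str],
                      host: str,
                      min_fraction: float = 1.0) -> set[str]:
    """Families present in at least min_fraction of parasites and absent from the host."""
    if not parasites:
        raise ValueError("need at least one parasite species")
    counts: dict[str, int] = {}
    for species in parasites:
        for family in presence[species]:
            counts[family] = counts.get(family, 0) + 1
    needed = min_fraction * len(parasites)
    conserved = {family for family, n in counts.items() if n >= needed}
    # Dropping host families is a crude proxy for "low or no toxicity to the host".
    return conserved - presence[host]


# Hypothetical presence/absence table of gene families per genome.
presence = {
    "parasite_1": {"F1", "F2", "F3"},
    "parasite_2": {"F1", "F2", "F4"},
    "parasite_3": {"F1", "F2", "F3"},
    "host": {"F1", "F5"},
}
parasites = ["parasite_1", "parasite_2", "parasite_3"]
strict = candidate_targets(presence, parasites, "host")
relaxed = candidate_targets(presence, parasites, "host", min_fraction=0.6)
print(sorted(strict), sorted(relaxed))
```

Relaxing `min_fraction` trades breadth of coverage across parasites for a longer candidate list, which is the kind of prioritization choice the hierarchy-of-considerations approach is meant to make explicit.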
The first multicellular genome sequenced was that of the free-living roundworm C. elegans [21]; reported in 1998, it is still the
only metazoan for which the sequence of every nucleotide is
known with high confidence. Today, the genome sequences of 22
species of helminths that either infect humans or are closely related
parasites are completed or underway (Table 1). A comprehensive
genome analysis has been published for several of them, including
the lymphatic filarial nematode Brugia malayi [22], the dog
hookworm Ancylostoma caninum [23], and the blood flukes
Schistosoma japonicum and S. mansoni [24,25] (Figure 1; Table 1).
of humanity and those of our livestock and companion species
[1,6–10].
Genomics Approaches to Investigating Helminths
Over the past decade, increasing numbers of helminth-specific
genome sequences have become available due to ever-improving
techniques for obtaining biological material, extracting RNA and
DNA, constructing complementary DNA (cDNA)/whole genome
shotgun libraries, and, especially, major advances in the chemistry
and instrumentation for DNA sequencing and its concomitant
decreased cost. Helminth genomics began with the generation and
analysis of transcribed sequences (expressed sequence tags [ESTs]
[11]), which has proved to be a rapid and cost-effective route to
discover genes in other eukaryotes. In April 2009, there were
~550,000 nematode and 450,000 platyhelminth ESTs in the
dbEST division of GenBank, excluding those from the model
nematode Caenorhabditis elegans. Of these, 60% were from
parasites of humans and closely related animal pathogens used
to study human infections (Table 1). These ESTs have many
applications. They can be used to annotate helminth genomes (see
below) to determine alternative splicing, verify open reading
frames, and confirm exon/intron and gene boundaries. They are
valuable also, for example, in functional genomics to design probes
for expression microarrays (e.g., [12]) and to provide putative
Figure 2. Phylogeny of the major taxa of human helminths—nematodes and platyhelminths—as established by maximum
likelihood (ML) analysis of 18S ribosomal RNA from 18 helminth species. Sequences were aligned using ClustalX [93]. The topology of the
tree was derived from a consensus tree by neighbor-joining–based bootstrapping, its branch lengths were computed using a ML-based method, and
it was rooted with the orthologue from the brewer’s yeast, Saccharomyces cerevisiae. The branch lengths of the phylogenetic tree were computed
using DNAML (PHYLIP package [94]) by allowing rate variation among sites. The headings Chromadorea, Enoplea, Trematoda, and Cestoda are major
classes of the phyla Nematoda and Platyhelminthes. The GenBank accession numbers of aligned sequences are DQ118536.1 (Trichuris trichiura),
AY851265.1 (Trichuris suis), AF036637.1 (Trichuris muris), AY497012.1 (Trichinella spiralis), U94366.1 (Ascaris lumbricoides), AF036587.1 (Ascaris suum),
AF036588.1 (Brugia malayi), AJ920348.1 (Necator americanus), AJ920347.2 (Ancylostoma caninum), AF036597.1 (Nippostrongylus brasiliensis), X03680.1
(Caenorhabditis elegans), AF036605.1 (Strongyloides ratti), U81581.1 (Strongyloides ratti), AB453329.1 (Strongyloides ratti), AF279916.2 (Strongyloides
stercoralis), AB453315.1 (Strongyloides stercoralis), M84229.1 (Strongyloides stercoralis), EU011664.1 (Saccharomyces cerevisiae), U27015.1
(Saccharomyces cerevisiae), DQ157224.1 (Taenia solium), AF229852.1 (Clonorchis sinensis), Z11590.1 (Schistosoma japonicum), Z11976.1 (Schistosoma
haematobium), U65657.1 (Schistosoma mansoni).
doi:10.1371/journal.pntd.0000538.g002
Some of the main obstacles to research on human parasites are
their life cycle complexity, tissue complexity, and the paucity of
genetic and transgenic methods for manipulating genes of interest.
Comparative genome analyses have also provided insights into the
adaptations of various parasites to niches in their human (and
vector) hosts as well as insights into the molecular basis of the
mutualistic relationship between the filarial nematode B. malayi
and its endosymbiont Wolbachia (see below).
The genomes of the schistosomes S. japonicum and S. mansoni are
the first complete genomes reported for members of the
Lophotrochozoa [24,25], a large taxon that includes about 50%
of all metazoan phyla including the mollusks, annelids, brachiopods, nemerteans, bryozoans, platyhelminths, and others [26].
These schistosome genome sequences revealed remarkable
features of the host–parasite relationship. Among these, the
schistosome genome has lost numerous protein-encoding domains.
Whereas the total number (~6,000) of protein families is broadly
similar among schistosomes, humans, C. elegans, and fruit fly, about
1,000 protein domains have been abandoned by S. japonicum,
including some involved in basic metabolic pathways and defense,
implying that loss of these domains could be a consequence of the
adoption of a parasitic way of life. If so, the remaining molecular
repertoire must have evolved in parallel with this extensive domain
loss to permit the pathogen to locate and infect humans efficiently,
nourish itself, and interact with the external environment as well as
with the host. On the other hand, despite extensive gene and
domain loss, a number of schistosome gene families have
expanded and these provide insights into the requirements for a
parasitic lifestyle. Among the expanded gene families, a metalloprotease called invadolysin (or leishmanolysin) has at least 12
putative family members in schistosomes compared to a single
orthologue in the human, fruit fly, and C. elegans genomes and only
three in the free-living flatworm Schmidtea mediterranea. This protease
family may facilitate skin penetration and tissue invasion by the
cercaria, the infective-stage larva of the schistosome [24,25].
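To make the scale of this expansion concrete, the short sketch below divides the schistosome copy number by the copy number in each comparison genome, using only the counts quoted in the paragraph above. The function name is an illustrative choice, and because the text gives "at least 12" for schistosomes, the ratios should be read as lower bounds.

```python
def fold_expansion(copies: dict[str, int], focal: str) -> dict[str, float]:
    """Ratio of the focal genome's copy number to each other genome's copy number."""
    focal_n = copies[focal]
    return {sp: focal_n / n for sp, n in copies.items() if sp != focal and n > 0}


# Copy numbers as quoted in the text; 12 is a lower bound for schistosomes.
invadolysin = {
    "schistosome": 12,
    "human": 1,
    "fruit fly": 1,
    "C. elegans": 1,
    "S. mediterranea": 3,
}
ratios = fold_expansion(invadolysin, "schistosome")
for species, ratio in ratios.items():
    print(f"{species}: at least {ratio:.0f}-fold fewer copies than schistosomes")
```

The comparison with the free-living flatworm is the more informative one, since it separates expansion associated with parasitism from differences that are general to flatworms.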
Publication of genome sequences of filaria and schistosomes has
underscored the pressing need to develop functional genomics
approaches for these significant pathogens. Functional analyses—
which use approaches such as RNA interference (RNAi) and
translational studies—are essential to resolve uncertainties in the
molecular physiology of helminths and to illuminate mechanisms
of pathogenesis that may lead to development of new interventions
to control and eliminate these parasites or the diseases they cause. Progress in
the functional genomics of helminths was reviewed recently
[6,27,28]. In brief, RNAi has been used to inactivate the RNA
products of several genes in schistosomes (e.g., [29–32]) and
nematodes (e.g., [33]; reviewed in [8]). In addition, the recent
genome sequences of S. mansoni and S. japonicum now make feasible
genome-scale investigation of transgene integration into schistosome chromosomes. Gene therapy–like approaches to transform
schistosomes include the use of the piggyBac transposon and
pseudotyped murine leukemia retrovirus as transgene vectors
Figure 3. Some recent approaches to expressing transgenes in human helminths. (A) Luciferase activity in Schistosoma mansoni larvae
(schistosomules) after transduction with a pseudotyped retrovirus that expresses the luciferase reporter gene. Anti-luciferase antibody staining of
schistosomules three days after exposure to pseudotyped lentivirus carrying the firefly luciferase transgene. Schistosomules examined by confocal
laser microscopy; (i) bright field, (ii) fluorescence red channel, (iii) merged images. Control non-transformed worms showed only background levels of
fluorescence (not shown; see [34–36] for relevant hypotheses and experimental methods). (B) Recent studies on transgenic Strongyloides stercoralis
indicated that morphogenesis of the infective L3 stage larva requires the DAF-16 orthologue FKTF-1 [38]. L3s of this parasitic nematode were
transfected with plasmids carrying the transgene fktf-1b::gfp::fktf-1b and examined by fluorescence microscopy. (i, ii) Transgenic first-stage larvae
express green fluorescent protein (GFP) in the procorpus (arrow) of the pharynx. (iii, iv) A first-stage larva (L1) expresses the GFP::FKTF-1b(wt)
transgene in the hypodermis. (v, vi) An infective L3 expresses the GFP::FKTF-1b(wt) fusion protein in the hypodermis and in a narrow band in the
pharynx (arrow). Scale bars, 10 µm. Adapted from [38].
doi:10.1371/journal.pntd.0000538.g003
[34–36] (Figure 3A), both of which offer a means to establish
transgenic lines of schistosomes, to elucidate schistosome gene
function and expression, and to advance functional genomics
approaches for these parasites. Notably, progress is also being
made to express reporter transgenes in parasitic nematodes
including Strongyloides stercoralis [37], in which transgene approaches
developed for use in C. elegans have recently been used to
demonstrate that morphogenesis of infective larvae requires the
DAF-16 orthologue FKTF-1 (Figure 3B) [38]. Progress is also
being made with systems for analysis of promoter sequences of
genes of parasitic nematodes (e.g., [39]).
Many future discoveries arising from parasitic helminth
genome information can be expected to emanate from the broader
scientific community rather than from the laboratory that originated a
genome sequence project. For the specialized genome sequencing
labs, dissemination of sequence information in a way that is useful,
consistent, centralized, and lasting has therefore been a key goal.
Efforts have gone well beyond depositing raw data in public
databases. Currently, helminthologists have available a number of
specialized sites for sequence analysis. C. elegans information is
easily accessible at http://www.wormbase.org [40]. Useful
information about the organism includes genome sequence,
genetic and physical maps, transcript data (EST, mRNA, SAGE,
TEC-RED, ORFeome, expression patterns from reporter gene
fusions, and microarrays), the developmental lineage of all cells,
connectivity of the nervous system, mutant phenotypes and genetic
markers, gene expression described at the level of single cells, 3D
protein structures, NCBI Clusters of Orthologous Groups, and
apoptosis and aging information. It also contains extensive
information from large-scale genomics analyses, including precomputed sequence similarity searches, protein motif analyses,
protein–protein interactions, findings from systematic RNAi
screens, single nucleotide polymorphisms (SNPs), orthologous
and paralogous relationships, and the assignment of Gene
Ontology (GO) terms to gene products. These resources greatly
aid in the interpretation of much of the sequence data emerging
from parasitic helminths.
However, accumulating evidence suggests that C. elegans is not a
good model for all parasitic helminths, especially for those that
are phylogenetically very distant, such as the basal nematode and
zoonotic parasite Trichinella spiralis (e.g., [41]). Another
specialized site is Nematode.net (http://www.nematode.net
[42]), developed with a primary aim to disseminate the diverse
collection of information for parasitic nematodes to the broader
scientific community in a way that is useful, consistent, centralized,
and enduring. In addition to sequence data, the site hosts
assembled NemaGene clusters in GBrowse views, characterizing
composition and protein homology, functional Gene Ontology
annotations presented via the AmiGO browser, KEGG-based
graphical display of NemaGene clusters mapped to metabolic
pathways, codon usage tables, NemFam protein families (which
represent conserved nematode-restricted coding sequences not
found in public protein databases), and a Web-based WU-BLAST
search tool that allows complex querying and other assorted
resources. Furthermore, Nematode.net, by connecting data across
the entire phylum Nematoda, has made a substantial contribution
toward integrating the historically separate fields of C. elegans,
vertebrate parasitology, and plant parasitology research. Finally,
Nembase (http://www.nematodes.org [43]) also offers access to
parasite sequence and tools such as visualization of clusters by
stage of expression.
Table 1. Human parasitic helminths (and their close relatives) with genome sequencing projects completed or underway.
Phylum or Class | Species | Common Name / Disease | Primary host | Genome size, Mb | GenBank Project ID | cDNAs (3730 ABI), 1,000s | Genome Sequencing Status | Sequencing Institute(a)
Nematoda (roundworms), Clade V(b) | Necator americanus | Hookworm/necatoriasis | Human | - | 20369 | 5 | In progress | WUGC
Nematoda (roundworms), Clade V | Ancylostoma caninum | Model hookworm | Dog | 344 | 12841 | 81 | Improving draft | WUGC
Nematoda (roundworms), Clade V | Nippostrongylus brasiliensis | Model hookworm | Rat | - | 20445 | 14.7 | In progress | SI
Nematoda (roundworms), Clade IV | Strongyloides stercoralis | Threadworm/strongyloidiasis | Human | - | - | 11.4 | In progress | SI
Nematoda (roundworms), Clade IV | S. ratti | Model threadworm | Rat | - | - | 27.4 | In progress | SI/WUGC
Nematoda (roundworms), Clade III | Ascaris lumbricoides | Large roundworm/ascariasis | Human | 230 | - | 1.8 | In progress | SI
Nematoda (roundworms), Clade III | A. suum | Model large roundworm | Pig | 230 | - | 55.7 | Improving draft | WUGC/SI
Nematoda (roundworms), Clade III | Brugia malayi | Filaria/lymphatic filariasis | Human | 96 | 9549 | 26.2 | Improving draft | TIGR/University of Pittsburgh
Nematoda (roundworms), Clade III | Loa loa | Filaria/loaisis (cutaneous filariasis)/African eye worm | Human | - | - | 3.3 | In progress | BI
Nematoda (roundworms), Clade III | Onchocerca volvulus | Filaria/river blindness | Human | 150 | - | 15 | In progress | SI
Nematoda (roundworms), Clade III | Acanthocheilonema viteae | Model filaria | Rodent | - | 33239 | 0 | In progress | UMIGS
Nematoda (roundworms), Clade I | Trichinella spiralis | Trichina worm/trichinosis | Pig to human | 71 | 12605 | 25.3 | Draft completed | WUGC
Nematoda (roundworms), Clade I | Trichuris trichiura | Whipworm/trichuriasis | Human | - | - | 0 | In progress | SI
Nematoda (roundworms), Clade I | T. muris | Model whipworm | Mouse | 96 | - | 7 | In progress | SI
Nematoda (roundworms), Clade I | T. suis | Model whipworm | Pig | - | - | 0 | In progress | WUGC
Cestoda (tapeworms) | Echinococcus multilocularis | Tapeworm/alveolar hydatidosis | Rodent; larva infects humans | 150 | - | 1 | In progress | SI
Cestoda (tapeworms) | E. granulosus | Tapeworm/unilocular hydatidosis | Canids; larva infects humans | 150 | 12620 | 10 | In progress | SI
Cestoda (tapeworms) | Taenia solium | Pork tapeworm/taeniasis/cysticercosis | Human | 270 | 17815 | 25 | Draft completed | Mexico City
Trematoda (flukes) | Schistosoma mansoni | Blood fluke/intestinal schistosomiasis | Human | 390 | 12599 | 206 | Draft completed | SI/TIGR
Trematoda (flukes) | S. haematobium | Blood fluke/urinary schistosomiasis | Human | - | 12616 | 0 | In progress | SI
Trematoda (flukes) | S. japonicum | Blood fluke/intestinal schistosomiasis | Human | 400 | 29491 | 104 | Draft completed | CNHGC
Trematoda (flukes) | Clonorchis sinensis | Liver fluke/clonorchiasis | Human | - | 17975 | 3 | In progress | SNUCM
(a) BI, Broad Institute; CNHGC, Chinese National HGC; SI, Sanger Institute; SNUCM, Seoul National University College of Medicine; TIGR, The Institute for Genomic Research (now JCVI); WUGC, Washington University’s Genome Center.
(b) Phylogeny based on Blaxter et al. [47].
doi:10.1371/journal.pntd.0000538.t001
While each of these databases has been challenged by the
requirement to support the influx of new genomes and related
data, they nonetheless provide user communities with innovative
features and tools suited to their needs that are beyond the scope of
the large sequence repositories. For flatworms (Figure 2), it is
notable that public genome annotation and analysis tools are
already in place, including SchistoDB (http://schistoDB.net/), a
genomic database for S. mansoni that incorporates sequences and
annotation [44], and SjTPdb (http://function.chgc.sh.cn/sjproteome/index.htm), an integrated transcriptome and proteome
database and analysis platform for S. japonicum [45]. The genome
database for the planarian Schmidtea mediterranea, a model free-living
platyhelminth, can be expected to be advantageous to comparative
genome projects and specific research problems for the growing
number of parasitic flatworms that now are or will be subjects of
genome sequence analysis. In addition, because of the phylogenetic position of planarians as early bilaterian metazoans,
SmedGD (http://smedgd.neuro.utah.edu) will prove useful not
only to planarian research, but also to investigations on
developmental and evolutionary biology, comparative genomics
(specifically with parasitic flatworms including flukes and tapeworms), stem cell research, and regeneration [46].
Evolution of Parasitism in Helminths
Genomics research has advanced our understanding of the
evolution of helminths of humans and other hosts, certainly with
regard to roundworms of the phylum Nematoda. The first
comprehensive study of the molecular evolution of helminths
was a phylogenetic analysis of the small subunit ribosomal DNA (ss
rDNA) sequences from 53 roundworms [47]. This study included
both major parasitic and free-living taxonomic groups. It identified
five major clades within the Nematoda and suggested that
parasitism of animals and plants arose independently multiple
times. A more recent study included 339 nearly full-length ss
rDNAs and proposed subdivision of the phylum into 12 clades
[48]. This revealed that nematodes that feed on fungi occupy a
basal position compared to their plant parasite relatives,
confirming that the parasitic nematodes of plants arose from
fungivorous ancestors. Phylogenetic methods are also being used
to study evolution of parasitism-related protein-coding genes (such
as the enzymes that degrade the plant cell wall in nematode
parasites of plants [cellulases, pectate lyases, etc.]) to understand
better the mechanisms underlying the evolution of parasitism
(reviewed in [49]). Recent genome-wide analysis of two plant
parasitic nematodes [50,51] provided a more complete picture of
the acquisition of these cellulase genes, apparently by horizontal
gene transfer (HGT) from prokaryotes. The subsequent expansion
and diversification of HGT genes in these nematodes allow
inferences about the evolutionary history of these parasites, and in
addition present potential targets for anti-nematode drugs. When
the genome of the necromenic nematode Pristionchus pacificus was
reported recently, it became clear that cellulases were not
restricted to plant parasitic nematodes; their presence in this
species indicated preadaptation for parasitism of animals [52],
consistent with the intermediate evolutionary position of Pristionchus
between the microbivorous C. elegans and the animal
parasitic nematodes. As with the evolution of parasitism
among nematodes, we can predict that additional analyses of
parasitic and free-living flatworm genomes will provide deeper
insights into how and when parasitism evolved in the phylum
Platyhelminthes, particularly in comparison to the fresh-water
planarian S. mediterranea, a non-parasitic flatworm for which a draft
genome is available [53]. In addition to evolution of parasitism of
humans and other vertebrate hosts, helminth parasite genome
sequences will also facilitate evolutionary studies on the role of
intermediate hosts/vectors such as the snail in schistosome
infections and the mosquito in filarial infections in this evolution.
Host–Parasite Relationships
Investigations of regulatory networks involved in the embryonic
development, organogenesis, and reproduction of
helminths based on newly available genome sequences have
revealed the presence of well-characterized signaling pathways,
including those for Wnt, Notch, Hedgehog, and transforming
growth factor β (TGF-β). These pathways can be recognized in the
B. malayi and schistosome genomes [22,24,25]. These genomes also
encode endogenous hormones, including epidermal growth factor
(EGF)-like and fibroblast growth factor (FGF)-like peptides. For
example, predicted components of the Ras–Raf–MAPK and
TGF-β–SMAD signaling pathways (including FGF and EGF receptors)
encoded by these genomes share high
sequence identity with their mammalian orthologs, implying that
schistosomes or filarial worms, in addition to utilizing their own
pathways, might exploit host growth factors as developmental
signals.
Immune regulation by helminth parasites includes suppression,
diversion, and alteration of the host immune response, resulting in
an anti-inflammatory environment that is favorable to parasite
survival. For example, chronic infections induce key changes in
host immune cell populations including dominance of the T-helper
2 (Th2) cells and selective loss of effector T cell activity, against a
background of regulatory T cells, alternatively activated macrophages, and Th2-inducing dendritic cells [54,55]. With advances
in genomics, numerous parasite-derived proteins, including
cytokine homologs, protease inhibitors, and an intriguing set of
novel products, as well as glycoconjugates and small lipid moieties,
have been discovered with known or hypothesized roles in
immune interference [56–61]. These studies suggest that secreted
parasite products interfere with different arms of the immune
system by influencing the cytokine network and signal transduction
pathways or by inhibiting essential enzymes. Using bioinformatics
to compare the predicted proteome of B. malayi to proteins
implicated in the immune response (interleukins, chemokines, and
other signaling molecules), potential immune modulators produced
by the filarial parasite have been identified, including genes
encoding the macrophage migration inhibition family of signaling
proteins [62]. Furthermore, the genome of the blood fluke S.
mansoni encodes a large array of paralogues of fucosyl- and
xylosyltransferases [25] that are involved in generating novel
glycans at the host–parasite interface and could have an important
role in subverting the host immune system. A recent
comprehensive review summarizes our current understanding of
the growing number of individual helminth mediators that target
key receptors or pathways in the mammalian immune system [63].
Helminth infection can have a broad impact on the entire
immune system. Infection with trematode and nematode parasites,
for example, correlates with a reduced incidence of atopic,
allergic-type disorders [64]. Thus, helminth infection might
be useful as a novel therapy for allergic or autoimmune
diseases [65]. Recently, worms, eggs, or purified nematode
parasite protein have been used in preclinical and clinical trials
to protect humans from allergy and autoimmunity (reviewed in
[66–70]), including Crohn’s disease and ulcerative colitis [71,72].
Other studies have shown that substances produced by helminths,
for example Ascaris suum, Nippostrongylus brasiliensis, and Acanthocheilonema viteae, can directly interfere with allergic responses or with
development of allergen-specific Th2 responses [73–75]. ES-62, a
molecule secreted by the filarial nematode A. viteae, directly inhibits
the FcεRI-induced release of mediators from mast cells, protects
against mast-cell–dependent hypersensitivity in skin and lungs [76],
and inhibits collagen-induced arthritis [77]. Research is underway
to develop molecules that mimic the activity of ES-62 as drugs for
allergic and autoimmune diseases [66]. Other helminth-derived
products have the potential to reduce allergic responses. These
products include schistosomal lysophosphatidylserine (lyso-PS)
[61] and thioredoxin peroxidase from the liver fluke Fasciola
hepatica [78]. These findings demonstrate that helminths produce
products that can interfere with both the development of allergic
responses and the workings of host effector mechanisms.
The “Dependent” Helminth
As a consequence of evolution of an obligatory parasitic
existence, helminth parasites are dependent upon their intermediate
and definitive hosts for many necessities including nutrients
such as amino acids; filariae are dependent on insect vectors to
transport them to the host. The newly available genome sequences
for schistosomes and B. malayi have confirmed earlier biochemical
studies that had revealed aspects of physiological/biochemical
dependence of these parasites on the host. For example,
schistosomes cannot synthesize fatty acids, sterols, or purines de
novo, nor nine human essential amino acids plus arginine or
tyrosine, and must catabolize complex precursors obtained from
their hosts. Loss or degeneracy of fatty acid, sterol, and purine
synthesis pathways in schistosomes likely relates to the adoption of
a parasitic lifestyle; it is notable that the set of genes encoding all the key
enzymes for both de novo fatty acid and purine synthesis is
complete in the (free-living) planarian S. mediterranea. To obtain
essential lipid nutrients, the schistosome genome encodes transporters,
including apolipoproteins, low-density lipoprotein receptor,
scavenger receptor, fatty-acid-binding protein, ATP-binding-cassette
transporters and cholesterol esterase, to exploit fatty acids
and cholesterol from host blood [25,79].
Many species of filarial nematodes are themselves infected by
the endosymbiotic bacterium Wolbachia. The genome sequence of
the Wolbachia species that infects the nematode B.
malayi (wBm) [80] helped establish which metabolites the
bacterium potentially provides to the nematode (riboflavin, flavin
adenine dinucleotide, heme, and nucleotides, for example) and
which are provided by the nematode to the endobacterium
(notably, amino acids). This type of information has opened up the
exciting possibility that drugs already registered for human use
might inhibit key biochemical pathways in Wolbachia that could
sterilize or kill the adult worms. Although the Wolbachia genome is
even more degenerate than that of the related pathogen Rickettsia,
it has retained more intact metabolic pathways than Rickettsia. This
may be important in its biochemical contribution to host (i.e.,
filarial) viability and fecundity.
The wBm genome encodes many more proteases and
peptidases than Rickettsia, which likely degrade host proteins in
the extracellular environment. Other proteins encoded by wBm
include a common type IV secretion system, as used by some
pathogenic gram-negative bacteria to transfer plasmids and
proteins into surrounding host cells, and an abundance of ankyrin
domain-containing proteins, which might regulate host gene
expression, as suggested for Ehrlichia phagocytophila AnkA [81], as
well as several proteins predicted to localize on the cell surface.
Ankyrin domain–containing proteins are noteworthy because of
their roles in protein–protein interactions in a variety of cellular
processes. A number of other wBm molecules are of interest as
potential drug targets. For example, glutathione biosynthesis genes
may provide glutathione for the protection of the filaria from
oxidative stress or human immunological effector molecules.
Heme produced by wBm (all five synthesis genes are present)
could be vital to worm embryogenesis, as there is evidence that
molting and reproduction are controlled by ecdysteroid-like
hormones [82], synthesis of which requires heme. Depletion of
Wolbachia might therefore halt production of these hormones and
block molting and/or embryogenesis in B. malayi. Most, if not all,
nematodes, including B. malayi, appear to be unable to synthesize
heme, but must obtain it from extraneous sources, such as the host,
the food supply, or perhaps from endosymbionts.
Challenges for the Future
The filarial and schistosome genome sequences now available
provide the vanguard for assembly of a genome sequence catalog
of the numerous other neglected helminth parasites (Table 1).
Comparative genomics will likely be a dominant approach to
organize, interpret, and utilize the vast amounts of genomic
information anticipated from the genomes of these parasites (e.g.,
[83,84]). In terms of sequencing tools, the new generation of
“massively parallel” sequencing platforms commercially available
today (such as the Roche/454 pyrosequencer [85], Illumina/
Solexa [86], and SOLiD [87]) offers increases in throughput on the
order of 100- to 1,000-fold over the Sanger sequencing
technology [88] on capillary electrophoresis instruments. This
rapid change to producing millions of DNA sequence reads in a
short time will have a huge impact on research on NTDs. Each
platform has a specific application: while the Roche/454 is
optimal for in-depth analysis of whole transcriptomes and de novo
sequencing of bacterial and small eukaryotic genomes, the
Illumina and SOLiD systems are highly attractive for resequencing
projects aimed at identifying genetic variants (mutations, insertions,
deletions), profiling and discovering noncoding RNAs
(ncRNAs), and studying epigenetic modifications of histones and
DNA. With the increased read length and improved error rate of
massively parallel pyrosequencing technology, de novo sequencing
of helminth genomes has become possible at a fraction of earlier
costs. In the next five years, projects at the Washington
University’s Genome Center (http://www.genome.gov/10002154)
and the Wellcome Trust Sanger Institute (http://
www.sanger.ac.uk/Projects/Helminths/) should increase the
available sequence data on human helminths and their close
relatives by an order of magnitude, adding more than 20 draft
genomes to those listed in Table 1.
Once these reference genomes become available, sequencing of
clinical isolates is expected to accelerate. Sequencing of the clinical
strains and strain-to-reference comparisons can be performed
using platforms such as Illumina/Solexa and SOLiD to investigate
genome-wide polymorphism and provide a comprehensive picture
of natural helminth genome variation. These approaches should
also be valuable for exploring genetic changes involved in
resistance to anti-worm drugs and understanding the potential
mechanisms of drug resistance in human parasites, and can be
expected to facilitate development of genetic markers to monitor
and manage any future appearance and spread of drug resistance.
These phenomena are of tremendous importance, particularly
since some major neglected helminth diseases are being targeted in
mass drug treatment campaigns [89]. In addition, the new
generation of sequencing technologies has also provided unprecedented
opportunities for high-throughput functional genomic
research (reviewed in [90]) that awaits application to helminth
research.
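As a rough illustration of the strain-to-reference comparison described above, the following Python sketch lists single-nucleotide differences between a reference sequence and several isolates and counts how often each variant recurs. It assumes the sequences are already aligned to equal length; the function names (call_snps, snp_frequencies) and the toy sequences are hypothetical and are not taken from any sequencing pipeline. Real resequencing projects would rely on read mappers and dedicated variant callers rather than a direct string comparison.

```python
from collections import Counter


def call_snps(reference: str, isolate: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, isolate base) per mismatch.

    Assumes both sequences are pre-aligned and of equal length; positions
    holding a gap ('-') or an ambiguous base ('N') are skipped.
    """
    if len(reference) != len(isolate):
        raise ValueError("sequences must be aligned to the same length")
    snps = []
    pairs = zip(reference.upper(), isolate.upper())
    for pos, (ref_base, iso_base) in enumerate(pairs, start=1):
        if ref_base in "-N" or iso_base in "-N":
            continue
        if ref_base != iso_base:
            snps.append((pos, ref_base, iso_base))
    return snps


def snp_frequencies(reference: str, isolates: dict[str, str]) -> Counter:
    """Count in how many isolates each (position, ref, alt) variant occurs."""
    counts = Counter()
    for sequence in isolates.values():
        counts.update(call_snps(reference, sequence))
    return counts


# Hypothetical toy data, for illustration only.
reference = "ATGGCTTACGATCGA"
isolates = {
    "isolate_A": "ATGGCTTACGATCGA",
    "isolate_B": "ATGACTTACGATCGA",
    "isolate_C": "ATGACTTACGNTCGT",
}

freq = snp_frequencies(reference, isolates)
for (pos, ref, alt), n in sorted(freq.items()):
    print(f"position {pos}: {ref}>{alt} in {n}/{len(isolates)} isolates")
```

Variants that recur across isolates from treated populations would be natural candidates for the genetic markers of drug resistance mentioned above, although recurrence alone does not establish a causal role.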
Although some details of immunomodulation by helminth
components have been characterized, we are just beginning to
understand how these parasite products act on immune responses
and to assemble fragmentary information on individual components into a comprehensive picture. Comparisons of helminth
molecules with orthologues/paralogues from free-living relatives
will strengthen efforts to decipher the strategies adopted by
helminth parasites to evade and subvert their host immune
responses. This information will be exploitable for development of
drugs and vaccines against the parasites and potentially also novel
therapeutic biologics for use in humans. Future studies may
determine whether helminth proteins of unknown function
are the source of the intriguing regulatory effects helminth
infections have on the host immune response.
Treatment for helminthic infections, responsible for hundreds of
thousands of deaths each year, depends almost exclusively on just
two or three drugs: praziquantel, the benzimidazoles, and
ivermectin. Vaccines and new drugs are needed, not least because
resistance to these compounds in human helminth parasites such as
schistosomes, whipworms, and filariae would present a
major problem for current treatment and control strategies.
Pharmacogenomics with the new helminth genomes represents a
practicable route forward toward new drugs. For example,
chemogenomics screening of the genome sequence of S. mansoni
identified >20 parasite proteins for which potential drugs, already
approved for other human ailments, are available [25]. In the case of
the schistosome thioredoxin glutathione reductase, auranofin (an
anti-arthritis medication) was recently shown
to exhibit potent anti-schistosomal activity [91]. Finally,
the vast new sequence information will undoubtedly allow revision
of our understanding of the host–parasite relationship, its
evolution, vector–pathogen and helminth–symbiont interactions,
unique aspects of cell biology and biochemistry, phylogenetic
relationships, intervention targets, research approaches (e.g. [92]),
and so forth.
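The idea behind a chemogenomics screen, matching parasite proteins to targets of existing drugs, can be sketched in a few lines of Python. This is a deliberately simplified stand-in: it scores similarity with a Jaccard index over shared 3-residue k-mers, whereas published screens typically use alignment-based searches against curated drug-target databases. All protein names, sequences, drug names, and the 0.5 threshold below are hypothetical.

```python
def kmers(seq: str, k: int = 3) -> set[str]:
    """Return the set of overlapping k-residue substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard index of the two k-mer sets (0.0 if either set is empty)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def shortlist(parasite_proteins, drug_targets, threshold=0.5, k=3):
    """List (parasite protein, known target, drug, score) above a threshold."""
    hits = []
    for p_name, p_seq in parasite_proteins.items():
        for t_name, (t_seq, drug) in drug_targets.items():
            score = jaccard(p_seq, t_seq, k)
            if score >= threshold:
                hits.append((p_name, t_name, drug, round(score, 2)))
    # Highest-scoring candidates first.
    return sorted(hits, key=lambda hit: -hit[3])


# Hypothetical toy data, for illustration only.
parasite_proteins = {
    "parasite_protein_1": "MKVLAAGIVGLCAQ",
    "parasite_protein_2": "MSTNPKPQRKTKRN",
}
drug_targets = {
    "known_target_X": ("MKVLAAGIVGLSAQ", "drug_X"),
    "known_target_Y": ("MDEYFWHHGGPPLL", "drug_Y"),
}

hits = shortlist(parasite_proteins, drug_targets)
for p_name, t_name, drug, score in hits:
    print(f"{p_name} resembles {t_name} (score {score}); candidate: {drug}")
```

A shortlist of this kind only nominates candidates. As with auranofin and thioredoxin glutathione reductase, activity against the parasite still has to be shown experimentally.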
Acknowledgments
We thank Victoria Mann, Geoffrey Gobert and Gabriel Rinaldi for access
to their unpublished findings on schistosomes transduced with pseudotyped
virions.
References
1. Hotez PJ, Brindley PJ, Bethony JM, King CH, Pearce EJ, et al. (2008) Helminth
infections: The great neglected tropical diseases. J Clin Invest 118: 1311–1321.
2. Hotez PJ, Kamath A (2009) Neglected tropical diseases in sub-Saharan Africa:
Review of their prevalence, distribution, and disease burden. PLoS Negl Trop
Dis 3: e412.
3. Patz JA, Graczyk TK, Geller N, Vittor AY (2000) Effects of environmental
change on emerging parasitic diseases. Int J Parasitol 30: 1395–1405.
4. Liang S, Yang C, Zhong B, Qiu D (2006) Re-emerging schistosomiasis in hilly
and mountainous areas of Sichuan, China. Bull WHO 84: 139–144.
5. Huyse T, Webster BL, Geldof S, Stothard JR, Diaw OT, et al. (2009)
Bidirectional introgressive hybridization between a cattle and human schistosome species. PLoS Pathog 5: e1000571. doi:10.1371/journal.ppat.1000571.
6. Kalinna BH, Brindley PJ (2007) Manipulating the manipulators: Advances in
parasitic helminth transgenesis. Trends Parasitol 23: 197–204.
7. Krasky A, Rohwer A, Schroeder J, Selzer PM (2007) A combined bioinformatics
and chemoinformatics approach for the development of new antiparasitic drugs.
Genomics 89: 36–43.
8. Mitreva M, Zarlenga DS, McCarter JP, Jasmer DP (2007) Parasitic nematodes – From genomes to control. Vet Parasitol 148: 31–42.
9. Berriman M, Lustigman S, McCarter JP (2007) Helminth initiative for drug
discovery – Report of the informal consultation, genomics and emerging drug
discovery technologies. Expert Opin Drug Discovery 2: S83–S89.
10. Lustigman S, Ford S, Crawford MJ (2008) RNA Interference: from functional
genomics to validation of drug targets in helminths. In: Lyland RT, Browning IB,
eds. RNA interference research progress. Nova Publishers. pp
135–162.
11. Franco GR, Adams MD, Soares MB, Simpson AJG, Venter JC, et al. (1995)
Identification of new Schistosoma mansoni genes by the EST strategy using a
directional cDNA library. Gene 152: 141–147.
12. Gobert GN, Moertel L, Brindley PJ, McManus DP (2009) Developmental gene
expression profiles of the human pathogen Schistosoma japonicum. BMC Genomics
10: 128.
13. Robinson MW, Connolly B (2005) Proteomic analysis of the excretory-secretory
proteins of the Trichinella spiralis L1 larva, a nematode parasite of skeletal
muscle. Proteomics 5: 4525–4532.
14. Mitreva M, McCarter JP, Martin J, Dante M, Wylie T, et al. (2004)
Comparative genomics of gene expression in the parasitic and free-living
nematodes Strongyloides stercoralis and Caenorhabditis elegans. Genome Res 14:
209–220.
15. Taft AS, Vermeire JJ, Bernier J, Birkeland SR, Cipriano MJ, et al. (2009)
Transcriptome analysis of Schistosoma mansoni larval development using serial
analysis of gene expression (SAGE). Parasitology 136: 469–485.
16. Mitreva M, McCarter JP, Arasu P, Hawdon J, Martin J, et al. (2005)
Investigating hookworm genomes by comparative analysis of two Ancylostoma
species. BMC Genomics 6: 58.
17. McCarter JP (2004) Genomic filtering: An approach to discovering novel
antiparasitics. Trends Parasitol 20: 462–468.
18. Wasmuth J, Schmid R, Hedley A, Blaxter M (2008) On the extent and origins of
genic novelty in the phylum Nematoda. PLoS Negl Trop Dis 2: e258.
doi:10.1371/journal.pntd.0000258.
19. Yin Y, Martin J, Abubucker S, Wang Z, Wyrwicz L, et al. (2009) Molecular
determinants archetypical to the phylum Nematoda. BMC Genomics 10: 114.
20. Wang Z, Martin J, Abubucker S, Yin Y, Gasser R, et al. (2009) Systematic
analysis of insertions and deletions specific to nematode proteins and their
proposed functional and evolutionary relevance. BMC Evol Biol 9: 23.
21. The C. elegans Sequencing Consortium (1998) Genome sequence of the
nematode C. elegans: A platform for investigating biology. Science 282:
2012–2018.
22. Ghedin E, Wang S, Spiro D, Caler E, Zhao Q, et al. (2007) Draft genome of the
filarial nematode parasite Brugia malayi. Science 317: 1756–1760.
23. Abubucker S, Martin J, Yin Y, Fulton L, Yang S-P, et al. (2008) The canine
hookworm genome: Analysis and classification of Ancylostoma caninum survey
sequences. Mol Biochem Parasitol 157: 187–192.
24. Schistosoma japonicum Genome Sequencing and Functional Analysis Consortium,
Liu F, Zhou Y, Wang ZQ, Lu G, et al. (2009) The Schistosoma japonicum genome
reveals features of host-parasite interplay. Nature 460: 345–351.
25. Berriman M, Haas BJ, LoVerde PT, Wilson RA, Dillon GP, et al. (2009) The
genome of the blood fluke Schistosoma mansoni. Nature 460: 352–358.
26. Dunn CW, Hejnol A, Matus DQ, Pang K, Browne WE, et al. (2008) Broad
phylogenomic sampling improves resolution of the animal tree of life. Nature
452: 745–749.
27. Krautz-Peterson G, Bhardwaj R, Faghiri Z, Tararam C, Skelly PJ (2009) RNA
interference in schistosomes: machinery and methodology. Parasitology; E-pub
ahead of print. doi:10.1017/S0031182009991168.
28. Mann VH, Morales ME, Kines KJ, Brindley PJ (2008) Transgenesis of
schistosomes: approaches using mobile genetic elements. Parasitology 134: 1–13.
29. Freitas TC, Jung E, Pearce EJ (2007) TGF-beta signaling controls embryo
development in the parasitic flatworm Schistosoma mansoni. PLoS Pathog 3: e52.
doi:10.1371/journal.ppat.0030052.
30. Morales ME, Rinaldi G, Kines KJ, Gobert GN, Tort JF, et al. (2008) RNA
interference targeting Schistosoma mansoni cathepsin D, the apical enzyme of the
hemoglobin proteolysis cascade. Mol Biochem Parasitol 157: 160–168.
31. Rinaldi G, Morales ME, Alrefaei YN, Cancela M, Castillo E, et al. (2009) RNA
interference targeting leucine aminopeptidases inhibits hatching of eggs of the
human blood fluke, Schistosoma mansoni. Mol Biochem Parasitol 167: 118–126.
32. Faghiri Z, Skelly PJ (2009) The role of tegumental aquaporin from the human
parasitic worm, Schistosoma mansoni, in osmoregulation and drug uptake. FASEB J
23: 2780–2789.
33. Ford L, Zhang J, Liu J, Hashmi S, Fuhrman JA, et al. (2009) Functional analysis
of the cathepsin-like cysteine protease genes in adult Brugia malayi using RNA
interference. PLoS Negl Trop Dis 3: e377. doi: 10.1371/journal.pntd.0000377.
34. Morales ME, Mann VH, Kines KJ, Gobert GN, Kalinna BH, et al. (2007)
piggyBac transposon mediated transgenesis of the human blood fluke, Schistosoma
mansoni. FASEB J 21: 3479–3489.
35. Kines KJ, Mann VH, Morales ME, Shelby BD, Kalinna BH, et al. (2006)
Transduction of Schistosoma mansoni by vesicular stomatitis virus glycoprotein-pseudotyped Moloney murine leukemia retrovirus. Exp Parasitol 112: 209–220.
36. Kines KJ, Morales ME, Mann VH, Gobert GN, Brindley PJ (2008) Integration
of reporter transgenes into Schistosoma mansoni chromosomes mediated by
pseudotyped murine leukemia virus. FASEB J 22: 2936–2948.
37. Li X, Massey HC, Jr., Nolan TJ, Schad GA, Kraus K, et al. (2006) Successful
transgenesis of the parasitic nematode Strongyloides stercoralis requires endogenous
non-coding control elements. Int J Parasitol 36: 671–679.
65. Imai S, Fujita K (2004) Molecules of parasites as immunomodulatory drugs.
Curr Top Med Chem 4: 539–552.
66. Harnett W, Harnett MM (2008) Therapeutic immunomodulators from
nematode parasites. Expert Rev Mol Med 10: e18.
67. Harnett W, Harnett MM (2008) Parasitic nematode modulation of allergic
disease. Curr Allergy Asthma Rep 8: 392–397.
68. Johnston MJ, MacDonald JA, McKay DM (2009) Parasitic helminths: A
pharmacopeia of anti-inflammatory molecules. Parasitology 136: 125–147.
69. McKay DM (2009) The therapeutic helminth? Trends Parasitol 25: 109–114.
70. Erb KJ (2009) Can helminths or helminth-derived products be used in humans
to prevent or treat allergic diseases? Trends Immunol 30: 75–82.
71. Summers RW, Elliott DE, Urban JF, Jr., Thompson R, Weinstock JV (2005)
Trichuris suis therapy in Crohn’s disease. Gut 54: 87–90.
72. Summers RW, Elliott DE, Urban JF, Jr., Thompson RA, Weinstock JV (2005)
Trichuris suis therapy for active ulcerative colitis: A randomized controlled trial.
Gastroenterology 128: 825–832.
73. Lima C, Perini A, Garcia ML, Martins MA, Teixeira MM, et al. (2002)
Eosinophilic inflammation and airway hyper-responsiveness are profoundly
inhibited by a helminth (Ascaris suum) extract in a murine model of asthma. Clin
Exp Allergy 32: 1659–1666.
74. Schnoeller C, Rausch S, Pillai S, Avagyan A, Wittig BM, et al. (2008) A
helminth immunomodulator reduces allergic and inflammatory responses by
induction of IL-10-producing macrophages. J Immunol 180: 4265–4272.
75. Melendez AJ, Harnett MM, Pushparaj PN, Wong WS, Tay HK, et al. (2007)
Inhibition of Fc epsilon RI-mediated mast cell responses by ES-62, a product of
parasitic filarial nematodes. Nat Med 13: 1375–1381.
76. McInnes IB, Leung BP, Harnett M, Gracie JA, Liew FY, et al. (2003) A novel
therapeutic approach targeting articular inflammation using the filarial
nematode-derived phosphorylcholine-containing glycoprotein ES-62.
J Immunol 171: 2127–2133.
77. Donnelly S, O’Neill SM, Sekiya M, Mulcahy G, Dalton JP (2005) Thioredoxin
peroxidase secreted by Fasciola hepatica induces the alternative activation of
macrophages. Infect Immun 73: 166–173.
78. Holland MJ, Harcus YM, Riches PL, Maizels RM (2000) Proteins secreted by
the parasitic nematode Nippostrongylus brasiliensis act as adjuvants for Th2
responses. Eur J Immunol 30: 1977–1987.
79. Han ZG, Brindley PJ, Wang S, Chen Z (2009) Schistosome genomics: New
perspectives on schistosome biology and host parasite interaction. Annu Rev
Genomics Hum Genet 10: 211–240.
80. Foster J, Ganatra M, Kamal I, Ware J, Makarova K, et al. (2005) The Wolbachia
genome of Brugia malayi: endosymbiont evolution within a human pathogenic
nematode. PLoS Biol 3: e121. doi:10.1371/journal.pbio.0030121.
81. Park J, Kim KJ, Choi K-S, Grab DJ, Dumler JS (2004) Anaplasma phagocytophilum
AnkA binds to granulocyte DNA and nuclear proteins. Cell Microbiol 6:
743–751.
82. Warbrick EV, Barker GC, Rees HH, Howells RE (1993) The effect of
invertebrate hormones and potential hormone inhibitors on the third larval
moult of the filarial nematode, Dirofilaria immitis, in vitro. Parasitology 107:
459–463.
83. Nisbet AJ, Cottee PA, Gasser RB (2008) Genomics of reproduction in
nematodes: prospects for parasite intervention? Trends Parasitol 24: 89–95.
84. Dieterich C, Sommer RJ (2009) How to become a parasite - Lessons from the
genomes of nematodes. Trends Genet 25: 203–209.
85. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al. (2005) Genome
sequencing in microfabricated high-density picolitre reactors. Nature 437:
376–380.
86. Bennett S (2004) Solexa Ltd. Pharmacogenomics 5: 433–438.
87. Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, et al. (2005)
Accurate multiplex polony sequencing of an evolved bacterial genome. Science
309: 1728–1732.
88. Sanger F, Nicklen S, Coulson A (1977) DNA sequencing with chain-terminating
inhibitors. Proc Natl Acad Sci U S A 74: 5463–5467.
89. Fenwick A (2009) Host-parasite relations and implications for control. Adv
Parasitol 68: 247–261.
90. Morozova O, Marra MA (2008) Applications of next-generation sequencing
technologies in functional genomics. Genomics 92: 255–264.
91. Kuntz AN, Davioud-Charvet E, Sayed AA, Califf LL, Dessolin J, et al. (2007)
Thioredoxin glutathione reductase from Schistosoma mansoni: An essential parasite
enzyme and a key drug target. PLoS Med 4: e206. Erratum in: PLoS Med 2007,
4: e264.
92. Cosseau C, Azzi AH, Smith K, Freitag M, Mitta G, et al. (2009) Native
chromatin immunoprecipitation (N-ChIP) and ChIP-Seq of Schistosoma mansoni:
Critical experimental parameters. Mol Biochem Parasitol 166: 70–76.
93. Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, et al. (2003) Multiple
sequence alignment with the Clustal series of programs. Nucleic Acids Res 31:
3497–3500.
94. Felsenstein J (1988) Phylogenies from molecular sequences: Inference and
reliability. Ann Rev Genet 22: 521–565.
38. Castelletto ML, Massey HC, Jr., Lok JB (2009) Morphogenesis of Strongyloides
stercoralis infective larvae requires the DAF-16 ortholog FKTF-1. PLoS Pathog 5:
e1000370. doi:10.1371/journal.ppat.1000370.
39. de Oliveira A, Katholi CR, Unnasch TR (2008) Characterization of the
promoter of the Brugia malayi 12 kDa small subunit ribosomal protein (RPS12)
gene. Int J Parasitol 38: 1111–1119.
40. Rogers A, Antoshechkin I, Bieri T, Blasiar D, Bastiani C, et al. (2008) WormBase
2007. Nucleic Acids Res 36(Database issue): D612–617.
41. Mitreva M, Appleton J, McCarter JP, Jasmer DP (2005) Expressed sequence tags
from life cycle stages of Trichinella spiralis: Application to biology and parasite
control. Vet Parasitol 132: 13–17.
42. Martin J, Abubucker S, Wylie T, Yin Y, Mitreva M (2009) Nematode.net update
2008: Improvements enabling more efficient data mining and comparative
nematode genomics. Nucleic Acids Res 37(Database issue): D571–578.
43. Parkinson J, Whitton C, Schmid R, Thomson M, Blaxter M (2004) NEMBASE:
A resource for parasitic nematode ESTs. Nucleic Acids Res 32: D427–430.
44. Zerlotini A, Heiges M, Wang H, Moraes RL, Dominitini AJ, et al. (2009)
SchistoDB: A Schistosoma mansoni genome resource. Nucleic Acids Res
37(Database issue): D579–582.
45. Liu F, Chen P, Cui SJ, Wang ZQ, Han ZG (2008) SjTPdb: Integrated
transcriptome and proteome database and analysis platform for Schistosoma
japonicum. BMC Genomics 9: 304.
46. Robb SMC, Ross E, Sánchez Alvarado A (2008) SmedGD: The Schmidtea
mediterranea genome database. Nucleic Acids Res 36(Database issue): D599–D606.
47. Blaxter ML, De Ley P, Garey JR, Liu LX, Scheldeman P, et al. (1998) A
molecular evolutionary framework for the phylum Nematoda. Nature 392:
71–75.
48. Holterman M, van der Wurff A, van den Elsen S, van Megen H, Bongers T,
et al. (2006) Phylum-wide analysis of SSU rDNA reveals deep phylogenetic
relationships among nematodes and accelerated evolution toward crown clades.
Mol Biol Evol 23: 1792–1800.
49. Mitreva M, Smant G, Helder J (2009) Role of horizontal gene transfer in the
evolution of plant parasitism among nematodes. In: Horizontal Gene Transfer.
Methods Mol Biol 532: 517–535.
50. Abad P, Gouzy J, Aury J-M, Castagnone-Sereno P, Danchin EG, et al. (2008)
Genome sequence of the metazoan plant-parasitic nematode Meloidogyne incognita.
Nat Biotech 26: 909–915.
51. Opperman CH, Bird DM, Williamson VM, Rokhsar DS, Burke M, et al. (2008)
Sequence and genetic map of Meloidogyne hapla: A compact nematode genome for
plant parasitism. Proc Natl Acad Sci U S A 105: 14802–14807.
52. Dieterich C, Clifton SW, Schuster LN, Chinwalla A, Delehaunty K, et al. (2008)
The Pristionchus pacificus genome provides a unique perspective on nematode
lifestyle and parasitism. Nat Genet 40: 1193–1198.
53. Robb SM, Ross E, Sánchez Alvarado A (2008) SmedGD: The Schmidtea
mediterranea genome database. Nucleic Acids Res 36(Database issue): D599–D606.
54. Maizels RM, Balic A, Gomez-Escobar N, Nair M, Taylor MD, et al. (2004)
Helminth parasites–Masters of regulation. Immunol Rev 201: 89–116.
55. Ohnmacht C, Voehringer D (2009) Basophil effector function and homeostasis
during helminth infection. Blood 113: 2816–2825.
56. Hartmann S, Kyewski B, Sonnenburg B, Lucius R (1997) A filarial cysteine
protease inhibitor down-regulates T cell proliferation and enhances interleukin-10 production. Eur J Immunol 27: 2253–2260.
57. Hartmann S, Lucius R (2003) Modulation of host immune responses by
nematode cystatins. Int J Parasitol 33: 1291–1302.
58. Harnett W, McInnes IB, Harnett MM (2004) ES-62, a filarial nematode-derived
immunomodulator with anti-inflammatory potential. Immunol Lett 94: 27–33.
59. Gomez-Escobar N, Lewis E, Maizels RM (1998) A novel member of the
transforming growth factor-beta (TGF-beta) superfamily from the filarial
nematodes Brugia malayi and B. pahangi. Exp Parasitol 88: 200–209.
60. Gomez-Escobar N, Gregory WF, Maizels RM (2000) Identification of tgh-2, a
filarial nematode homolog of Caenorhabditis elegans daf-7 and human transforming
growth factor beta, expressed in microfilarial and adult stages of Brugia malayi.
Infect Immun 68: 6402–6410.
61. van der Kleij D, Latz E, Brouwers JF, Kruize JC, Schmitz M, et al. (2002) A
novel host-parasite lipid cross-talk. Schistosomal lyso-phosphatidylserine activates toll-like receptor 2 and affects immune polarization. J Biol Chem 277:
48122–48129.
62. Pastrana DV, Raghavan N, FitzGerald P, Eisinger SW, Metz C, et al. (1998)
Filarial nematode parasites secrete a homologue of the human cytokine
macrophage migration inhibitory factor. Infect Immun 66: 5955–5963.
63. Hewitson JP, Grainger JR, Maizels RM (2009) Helminth immunoregulation:
The role of parasite secreted proteins in modulating host immunity. Mol
Biochem Parasitol 167: 1–11.
64. Yazdanbakhsh M, van den Biggelaar A, Maizels RM (2001) Th2 responses
without atopy: Immunoregulation in chronic helminth infections and reduced
allergic disease. Trends Immunol 22: 372–377.