Research Update
TRENDS in Ecology & Evolution Vol.16 No.8 August 2001
The molecular natural history of the human genome
Michael Lynch
A remarkable pair of recently published
studies provides the first glimpse of the
fine-scale structure of the human genome
sequence. The data revealed by these
investigations, and their future refinements,
provide an infrastructure that will forever
transform the intellectual playing field for
evolutionary biologists.
The 15 February 2001 issue of Nature
displayed a landmark series of papers
outlining the fine-scale features of the draft
human genome sequence revealed by a
publicly funded consortium, the IHGSC
(International Human Genome Sequencing
Consortium) (Ref. 1). Just a day later, a parallel
series of reports appeared in Science, in this
case drawing from a privately funded project
(the Celera project) that enjoyed
unrestricted access to the publicly funded
database (Ref. 2). Because of the asymmetry of
information flow between the two projects,
Celera had the clear upper hand in terms of
power of analysis. The two groups employed
rather different sequencing and processing
strategies, and not surprisingly, there has
been a fair amount of posturing as to who
has produced the superior product, in spite
of the considerable amount of consensus
between the two reports.
http://tree.trends.com
The human genome is big, although by no
means the largest and, by my estimate, a
printing of the entire sequence would have
required approximately 300 000 pages of
Nature or Science. The pressure to publish
early was intense, and there are a few
important caveats to keep in mind. First,
only ~90% of the sequence has actually been
completed, with <20% of the genome being
represented in contigs >100 kb and half of it
falling in contigs <22 kb. This is significant
because the average human gene is
approximately 30 kb in length (i.e. larger
than the average contig). Thousands of gaps
remain to be filled, and although both
groups are hard at work trying to obtain the
finishing touches, many months will pass
before the finished product is available, and
some of the information will be essentially
unattainable by conventional sequencing
methods. Second, at least half of the protein-coding loci identified are simply candidates
suggested by computer algorithms and
hence await more rigorous evaluation.
Third, the genomic sequences are chimeric
(both within and between projects) in the
sense that they are derived from DNA
extracted from an array of individuals,
whose ethnic and geographical backgrounds
will remain permanently unknown. Thus,
strictly speaking, characterization of the
human genome is not complete, nor does it
accurately represent any single member of
our species. Rather, the Nature and Science
reports are best viewed as a pair of rapidly
assembled and hypothetical abstracts of a
documentary that has not yet been
completely written.
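As an illustrative aside, the observation that half of the sequence falls in contigs shorter than ~22 kb corresponds to what assemblers usually report as the N50 contig length. The minimal Python sketch below shows one way such a statistic can be computed. The function name `n50` and the toy contig lengths are hypothetical and are not taken from either project's data.

```python
def n50(contig_lengths):
    """Return the contig length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical toy assembly (lengths in kb), chosen only for illustration.
toy_contigs_kb = [40, 30, 25, 22, 20, 18, 15, 12, 10, 8, 6, 4]
toy_n50_kb = n50(toy_contigs_kb)

# The text gives ~30 kb as the average human gene length.
AVERAGE_GENE_KB = 30
typical_gene_spans_contigs = toy_n50_kb < AVERAGE_GENE_KB
```

In this toy case the N50 is smaller than the average gene length, which is the situation the caveat describes: a typical gene is expected to be split across more than one contig.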
Fairly well-curated sequences of two
human chromosomes (21 and 22) have been
available for well over a year, and assuming
that they are fairly typical, characterization
of the remainder of the genome was not
expected to lead to too many new
revelations. However, certain aspects of
genome structure can only be ascertained
after a nearly complete sequence has become
available. A few of the more interesting
observations are the subject of this article.
Our debris-laden genome
Approximately half of the human genome
consists of sequences that are obviously
associated with transposable-element
activity, and a large fraction of the
remaining noncoding DNA might be a
product of such activity but too divergent to
be recognized as such. So much for
intelligent design. The numbers and ages of
mobile elements in humans greatly exceed
those in Drosophila melanogaster and
Caenorhabditis elegans, and much of this
difference is the result of the huge
proliferation of two families (Alu and LINE1
elements), which together account for ~60%
of interspersed elements in humans. Part of
the reason for this large accumulation of
excess DNA could be that the rate of deletion
of nonfunctional DNA from the human
genome is extraordinarily low. By one
estimate, the half-life of such sequences in
modern humans is of the order of 800 million
years, compared to ~12 million years in
flies (Ref. 3).
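To give a feel for what these two half-lives imply, the sketch below computes the fraction of a nonfunctional sequence expected to survive after a given time. It assumes simple exponential (first-order) loss, which is an assumption made here for illustration; the function name and the 100-million-year interval are hypothetical, and only the two half-life values come from the text.

```python
def fraction_remaining(elapsed_myr, half_life_myr):
    """Fraction of nonfunctional DNA still present after elapsed_myr,
    assuming simple exponential decay with the given half-life."""
    return 0.5 ** (elapsed_myr / half_life_myr)


HUMAN_HALF_LIFE_MYR = 800  # order-of-magnitude estimate quoted in the text
FLY_HALF_LIFE_MYR = 12     # approximate estimate quoted in the text

elapsed_myr = 100  # hypothetical interval chosen for illustration
human_left = fraction_remaining(elapsed_myr, HUMAN_HALF_LIFE_MYR)
fly_left = fraction_remaining(elapsed_myr, FLY_HALF_LIFE_MYR)

print(f"human lineage: {human_left:.3f} of the sequence remains")
print(f"fly lineage:   {fly_left:.4f} of the sequence remains")
```

Under these assumptions, more than 90% of a dead element would persist in the human lineage after 100 million years, whereas well under 1% would persist in flies.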
0169–5347/01/$ – see front matter © 2001 Elsevier Science Ltd. All rights reserved. PII: S0169-5347(01)02242-X
The high incidence of ‘junk’ DNA in the
human genome is even more remarkable
when one considers an additional claim by
the IHGSC – that there has been a
substantial recent decline in the activity
(birth rate) of all transposons throughout the
human genome. The validity of this
conjecture is unclear. To evaluate the age
distribution of interspersed repeats, the
IHGSC derived a consensus sequence for
each of the major families and then simply
estimated the divergence of each extant
sequence from the consensus. Finding that
few elements were similar to the consensus,
they reasoned that all of the elements must
be quite old. However, the signature of a
newborn element is not the degree of
identity with a reconstructed consensus but
with another extant element. A more
rigorous assessment of the degree of activity
of mobile elements in humans will require a
proper phylogenetic analysis as well as a
population-level survey for site-specific
presence/absence polymorphisms.
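The distinction drawn above can be made concrete with a small sketch. Assuming a hypothetical toy alignment of five equal-length element copies, it compares each copy's divergence from a majority-rule consensus with its divergence from the closest other extant copy. All sequences and function names here are invented for illustration, and the sketch is not the IHGSC's procedure.

```python
from collections import Counter


def consensus(seqs):
    """Majority-rule consensus of equal-length, pre-aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def divergence(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nearest_copy_divergence(seqs):
    """For each sequence, its divergence from the closest other copy."""
    return [
        min(divergence(s, t) for j, t in enumerate(seqs) if j != i)
        for i, s in enumerate(seqs)
    ]


# Hypothetical family: three older copies plus a diverged source element
# and a copy of it that has only just been inserted.
toy_elements = [
    "AAAAAAAAAA",
    "AAAAAAAACA",
    "AAAAAAAAAG",
    "CCGTAAAAAA",  # diverged source element
    "CCGTAAAAAA",  # newborn copy of that source
]

cons = consensus(toy_elements)
to_consensus = [divergence(s, cons) for s in toy_elements]
to_nearest = nearest_copy_divergence(toy_elements)
```

In this toy family the last two copies are the most divergent from the consensus (0.4), and so would be scored as the oldest, yet they are identical to each other, which is the signature of a very recent insertion.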
An unexpectedly small gene count?
Although our genome size is approximately
30 times greater than that of C. elegans in
nucleotide number, the number of coding
genes might be no more than twofold
greater. The IHGSC and Celera papers both
estimate that the human genome encodes
30 000–40 000 genes, compared to the
estimate of 19 000 for C. elegans. However,
both sets of counts are to a large extent
based on indirect computational searches
rather than on direct observation. Although
gene-prediction algorithms have become
quite sophisticated, various aspects of the
human genome, including intron number
and size (commonly hundreds to thousands
of bases), present considerable
computational challenges. It will not be
surprising if the current counts are
underestimated by as much as twofold, and
a more recent computational screening
suggests the human number is more in the
range of 65 000 to 75 000 (Ref. 4).
Complete sequences from cDNA
libraries will help clarify which
hypothetical genes are actually
transcribed, but they will also almost
certainly reveal previously undetected
genes. For example, from a recent
screening of a cDNA library extracted from
just a single tissue (the brain) of
D. melanogaster, >10% of the observed
clones had not been previously identified by
whole-genome gene-prediction surveys for
this species (Ref. 5). cDNA analysis is unlikely to
illuminate all of our functional genes, as
transcripts of genes that are only active
during brief stages of development, rare
environmental circumstances, or in
restricted cell lineages may be very elusive.
In addition, because nongene sequences are
probably sometimes transcribed, an
unknown fraction of the members of a
cDNA library can also be expected to be
false positives. An alternative approach to
verifying gene expression – the use of tiled
microarrays (Ref. 6) – has similar limitations. The
most powerful tool for identifying
functional human genes (along with their
associated regulatory elements) will almost
certainly be comparative analysis with an
array of related species. Within the next
few months, nearly complete genomic
sequences will become available for mouse
Mus musculus, rat Rattus norvegicus, pufferfish
Fugu rubripes and zebrafish Danio rerio,
and we will enjoy those for numerous other
vertebrate species shortly thereafter.
Counting problems aside, the actual
number of unique transcripts in the human
genome is clearly of the order of 10⁵ or larger.
Roughly a third of all human genes appear
to experience alternative splicing
(i.e. different ways of stitching exon
sequences together). The average number of
transcripts per human gene is
approximately three, which contrasts with
the situation in C. elegans, where the
average is ~1.3. Thus, the number of distinct
proteins employed in humans could easily be
three or more times that in worms, and a
further increase in our genetic repertoire
must result from greater complexity in the
tissue-specificity of gene regulation. It is
becoming increasingly clear that these two
aspects of gene structure – alternative
splicing and regulatory-region complexity –
provide the molecular basis of pleiotropy, a
phenomenon that evolutionary biologists
have long regarded as a fundamental, but
mechanistically mystical, constraining
process in adaptive evolution. Finally, it
should be remembered that the number of
potential two-gene interactions increases
with the square of gene number, three-gene
interactions with the cube of gene number,
and so on. Thus, even a twofold difference
between species in gene number can
translate into a several-fold difference in
epistatic interactions.
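The scaling argument can be checked with a few lines of arithmetic. The sketch below counts the possible unordered two-gene and three-gene combinations for two gene numbers that differ twofold. The worm count is the estimate quoted in the text; the human count of 38 000 is a hypothetical value within the quoted 30 000–40 000 range, chosen to be exactly twice the worm count.

```python
from math import comb


def interaction_ratio(n_small, n_large, order):
    """Ratio of the number of possible `order`-gene combinations
    between a larger and a smaller gene set."""
    return comb(n_large, order) / comb(n_small, order)


worm_genes = 19_000   # estimate quoted in the text for C. elegans
human_genes = 38_000  # hypothetical: exactly twice the worm count

pair_ratio = interaction_ratio(worm_genes, human_genes, 2)
triple_ratio = interaction_ratio(worm_genes, human_genes, 3)

print(f"two-gene combinations:   {pair_ratio:.2f}-fold more")
print(f"three-gene combinations: {triple_ratio:.2f}-fold more")
```

A twofold difference in gene number thus gives roughly a fourfold difference in potential pairwise interactions and roughly an eightfold difference in three-way interactions. These are upper bounds on what is combinatorially possible, not counts of interactions that actually occur.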
A high rate of origin of new (hopeful) genes
The human genome harbors a very large
number of duplicate genes – at least 40% of
our genes are represented in two or more
copies (Ref. 7). Although many of these duplicates
probably arose before the origin of
mammals, the number of recently derived
copies is remarkably high. For example, at
least 5% of our genome is represented in
two or more locations, in the form of large
(1–200 kb) segmental duplications. Celera
located a total of 1077 segmental
duplications, 781 of which contained five or
more shared genes. These are very
conservative estimates, because assembly
problems almost certainly resulted in the
exclusion of many true duplicates. Indeed,
the wide distribution of recently derived
segmental duplications is one of the
primary impediments to accurate
reconstruction of chromosomal sequences
from small (approximately 1 kb) random
sequences, the main strategy of the Celera
group. The high incidence of segmental
duplications in humans is well beyond that
which is seen in D. melanogaster and
C. elegans, but not greatly different than
that which is observed in Arabidopsis (Ref. 8),
although the Arabidopsis genome has been
substantially patterned by one or more
ancient polyploidization events.
Because many of the interchromosomal
duplications in humans have a very high
degree of sequence similarity, IHGSC
suggests that they arose during a recent
period of intense duplication activity.
However, an alternative explanation for
such a pattern is that segmental
duplications have a high rate of eradication.
The distribution of sequence similarity
between duplicate segments must
ultimately reflect the joint processes of birth
and death of such segments (Refs 9,10). Many pairs
of human segmental duplicates are ≪1%
divergent, whereas comparative analysis of
homologous nucleotide positions throughout
the human genome implies an average
divergence of ~0.1% per nucleotide site. This
suggests that a substantial fraction of the
segmental duplications observed in these
studies might be younger than the mean
coalescence time of neutral nucleotide sites
and hence may not even be fixed in the
human population. Such presence/absence
polymorphisms are already known to exist
for duplicate olfactory receptor genes in the
human population (Ref. 11). Thus, although most of
our attention in evolutionary quantitative
genetics has been focused on nucleotide
polymorphisms, the possibility that
phenotypic variation depends in important
ways on individual differences in gene
content merits consideration.
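The reasoning behind this comparison can be sketched numerically. Under a strictly neutral molecular clock (an assumption made here for illustration), divergence between two duplicate copies and divergence between two alleles at a single site both accumulate at the same rate, so the ratio of the two divergences approximates the age of a duplication relative to the mean coalescence time, whatever the mutation rate. The function name and the toy divergence values are hypothetical; only the ~0.1% average allelic divergence comes from the text.

```python
AVERAGE_ALLELIC_DIVERGENCE = 0.001  # ~0.1% per nucleotide site, from the text


def relative_age(duplicate_divergence,
                 allelic_divergence=AVERAGE_ALLELIC_DIVERGENCE):
    """Age of a duplication as a multiple of the mean coalescence time,
    assuming both kinds of divergence accumulate at the same neutral rate."""
    return duplicate_divergence / allelic_divergence


# Hypothetical per-site divergences between pairs of duplicated segments.
toy_duplicate_divergences = [0.0002, 0.0005, 0.003, 0.01]

# Pairs younger than the mean coalescence time are candidates for
# presence/absence polymorphism, i.e. they may not yet be fixed.
possibly_unfixed = [
    d for d in toy_duplicate_divergences if relative_age(d) < 1
]
```

This sketch ignores gene conversion between copies, rate variation among sites, and the large variance in coalescence times among loci, so it flags candidates for a population survey rather than demonstrating polymorphism.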
The human segmental duplications are
not randomly distributed. Regions near the
centromeres consist almost entirely of such
segments, and telomeric regions are also
enriched. Moreover, several cases exist in
which a duplicated segment has been
reduplicated and redistributed to still
another location. Chromosome 19 appears to
be especially unusual in this regard, with
large blocks of genes having been moved by
duplicative transposition to virtually all of
the other chromosomes. It might be no
coincidence that chromosome 19 is
unusually rich in Alu elements, which may
promote interchromosomal exchange.
No analysis of duplicate genes in humans
would be complete without considering
Ohno’s hypothesis that one or more basal
polyploidization events provided the fuel for
the origin of morphological novelties in
vertebrates (Ref. 12). Although this hypothesis has
been treated with increasing skepticism and
ridicule over the past few years (Refs 13–15), and a
very limited analysis by the IHGSC fails to
support it, all of the phylogenetic tests of
Ohno’s hypothesis have several potential
shortcomings, including a shortage of well-defined sequences from basal chordates
known to contain single-copy genes, an
inability to discriminate secondary
duplications from putative events associated
with earlier polyploidization, and
inattention to information on map positions.
Because of the substantial amount of
sequence divergence, genome
rearrangement, and gene deletion that
occurs over a time span of approximately one
billion years (summing over two descendant
lineages), the challenges to testing Ohno’s
hypothesis in a proper fashion are
formidable, and a rigorous formal analysis
remains to be done.
Extinction of the Ecdysozoa?
With the (nearly) complete genomic
sequences now available for a nematode, a
fly, and a vertebrate, what about the
phylogenetic arrangement of these three
organisms? A definitive answer to this
question has significant implications for
our understanding of the origins of
morphological diversity in animals, as the
vast majority of work in development and
genetics is performed on members of these
three clades. Although classical
morphological analysis had historically
positioned nematodes as basal to the
protostome (e.g. fly) – deuterostome
(e.g. vertebrate) divergence, a recent
analysis of 18S rRNA sequences led to the
suggestion that all molting animals
(including flies and nematodes) are
members of the same monophyletic clade
(the Ecdysozoa) (Ref. 16). It is risky and no longer
necessary to base a deep phylogenetic
analysis on a single gene, and a study based
on 50 genes supports the traditional basal
position of nematodes (Ref. 17), as does another
study based on four highly conserved
proteins (Ref. 18). With sequences of thousands of
homologous genes now available in all
three lineages, it should be possible to
settle the matter once and for all with a
massively large-scale phylogenetic
analysis. Although the IHGSC notes that
the human genome appears to share about
1.5 times as many homologous genes with
D. melanogaster as with C. elegans, neither
this observation nor sequence comparisons
can be translated into measures of
phylogenetic affinity until key out-groups
have been entered into the analysis.
Fortunately, there already is a fungus
(Saccharomyces cerevisiae) and a plant
(Arabidopsis thaliana) to work with,
although a cnidarian or a sponge would
presumably be more informative.
The future
The IHGSC and Celera projects provide just
an opening glimpse of the structure of the
human genome, mainly providing a contrast
with single members of two of our sister
animal phyla (arthropods and nematodes).
Much of this century will be spent trying to
elucidate the sources of variation within our
own species. There is much to explain, as
heritabilities for morphological and
behavioral traits in humans are quite large
compared to those seen in other species.
Driven by the insatiable financial dreams of
the pharmaceutical and biotechnology
industries, progress is likely to be rapid.
Methods for sequencing DNA will change
radically over the next few years, perhaps to
the point that the reading of a megabase of
DNA can be performed on a time scale not
much greater than the reading of a megabyte
of data by current computers. Whole genome
sequences will then be available for
numerous members of our population and for
many of our closest living primate relatives,
and it is no longer far fetched to think that
comparative molecular data will be
attainable for our closest extinct relatives (Ref. 19).
Making the connection between
variation at the molecular and phenotypic
levels will require a lot of biology, and it is
clear that the rich array of tools from
quantitative genetics and evolutionary
biology will play a central role in this
endeavor. Molecular evolution will no
longer be focused entirely on single-gene
analyses but on explaining the origin,
proliferation, and occasional demise of
entire networks of genes. It is a privilege to
live during the time in which the essence
of the unique ingenuity that got us here
might actually become biologically
understandable.
Acknowledgements
I thank J. Crow, D. Hartl, P. Phillips, and
J. Postlethwait for helpful comments.
References
1 International Human Genome Sequencing
Consortium (2001) Initial sequencing and analysis
of the human genome. Nature 409, 860–921
2 Venter, J.C. et al. (2001) The sequence of the human
genome. Science 291, 1304–1351
3 Petrov, D.A. and Hartl, D.L. (1999) Patterns of
nucleotide substitution in Drosophila and mammalian
genomes. Proc. Natl. Acad. Sci. U. S. A. 96, 1475–1479
4 Wright, F.A. et al. A draft annotation and overview of
the human genome. Genome Biol. (in press)
5 Posey, K.L. et al. (2001) Survey of transcripts in the
adult Drosophila brain. Genome Biol. (in press)
6 Shoemaker, D.D. et al. (2001) Experimental
annotation of the human genome using microarray
technology. Nature 409, 922–927
7 Li, W-H. et al. (2001) Evolutionary analyses of the
human genome. Nature 409, 847–850
8 Vision, T.J. et al. (2000) The origins of genomic
duplications in Arabidopsis. Science 290, 2114–2117
9 Nei, M. et al. (1997) Evolution by the birth-and-death process in multigene families of the vertebrate
immune system. Proc. Natl. Acad. Sci. U. S. A. 94,
7799–7806
10 Lynch, M. and Conery, J. (2000) The evolutionary
fate and consequences of duplicate genes. Science
290, 1151–1154
11 Trask, B.J. et al. (1998) Members of the olfactory
receptor gene family are contained in large blocks of
DNA duplicated polymorphically near the ends of
human chromosomes. Hum. Mol. Genet. 7, 13–26
12 Ohno, S. (1970) Evolution by Gene Duplication,
Springer-Verlag
13 Skrabanek, L. and Wolfe, K.H. (1998) Eukaryote
genome duplication – where’s the evidence? Curr.
Opin. Genet. Dev. 8, 694–700
14 Hughes, A.L. (1999) Phylogenies of developmentally
important proteins do not support the hypothesis of
two rounds of genome duplication early in
vertebrate history. J. Mol. Evol. 48, 565–576
15 Martin, A. (2001) Is tetralogy true? Lack of support
for the ‘one-to-four rule’. Mol. Biol. Evol. 18, 89–93
16 Aguinaldo, A.M. et al. (1997) Evidence for a clade of
nematodes, arthropods and other moulting animals.
Nature 387, 489–493
17 Wang, D.Y. et al. (1999) Divergence time estimates
for the early history of animal phyla and the origin of
plants, animals and fungi. Proc. R. Soc. London B
Biol. Sci. 266, 163–171
18 Baldauf, S.L. et al. (2000) A kingdom-level
phylogeny of eukaryotes based on combined protein
data. Science 290, 972–977
19 Ovchinnikov, I.V. et al. (2000) Molecular analysis of
Neanderthal DNA from the northern Caucasus.
Nature 404, 490–493
Michael Lynch
Dept of Biology, Indiana University,
Bloomington, IN 47405, USA.
e-mail: [email protected]