Evolution of Closely Linked Gene Pairs in Vertebrate Genomes
Erik Franck,* Tim Hulsen, Martijn A. Huynen, Wilfried W. de Jong,* Nicolette H. Lubsen,* and
Ole Madsen*1
*Biomolecular Chemistry, 271 Nijmegen Center of Molecular Life Science, Radboud University Nijmegen, Nijmegen, The
Netherlands; and Centre for Molecular and Biomolecular Informatics, NCMLS, Radboud University Nijmegen Medical Centre,
Nijmegen, The Netherlands
The orientation of closely linked genes in mammalian genomes is not random: there are more head-to-head (h2h) gene
pairs than expected. To understand the origin of this enrichment in h2h gene pairs, we have analyzed the phylogenetic
distribution of gene pairs separated by less than 600 bp of intergenic DNA (gene duos). We show here that a lack of
head-to-tail (h2t) gene duos is an even more distinctive characteristic of mammalian genomes, with the platypus genome
as the only exception. In nonmammalian vertebrate and in nonvertebrate genomes, the frequency of h2h, h2t, and tail-to-tail (t2t) gene duos is close to random. In tetrapod genomes, the h2t and t2t gene duos are more likely to be part of
a larger gene cluster of closely spaced genes than h2h gene duos; in fish and urochordate genomes, the reverse is seen. In
human and mouse tissues, the expression profiles of gene duos were skewed toward positive coexpression, irrespective of
orientation. The organization of orthologs of both members of about 40% of the human gene duos could be traced in
other species, enabling a prediction of the organization at the branch points of gnathostomes, tetrapods, amniotes, and
euarchontoglires. The accumulation of h2h gene duos started in tetrapods, whereas that of h2t and t2t gene duos only
started in amniotes. The apparent lack of evolutionary conservation of h2t and t2t gene duos relative to that of h2h gene
duos is thus a result of their relatively late origin in the lineage leading to mammals; we show that once they are formed
h2t and t2t gene duos are as stable as h2h gene duos.
Introduction
The textbook view of a eukaryote gene is a solitary
functional entity, a monocistronic coding sequence of
which the expression is controlled by an autonomous promoter. In fact, there is a significant clustering of genes in the
mammalian genome where the genes in these clusters show
coordinate expression (Hurst et al. 2004; The FANTOM
Consortium 2005; Gierman et al. 2007; Purmann et al.
2007). Often coordinate expression is due to a similar regulation of autonomous promoters located within the same
active chromatin region, but it can also be the result of promoter cross talk (Hampf and Gossen 2007), of sharing of
a promoter (a bidirectional promoter; see also below), or of
transcriptional interference through a variety of mechanisms such as promoter occlusion, promoter competition,
or RNA polymerase collision (Callen et al. 2004; Leupin
et al. 2005; Shearwin et al. 2005). Nonautonomous expression, that is, expression of a gene coupled to the expression
of another gene, whether positively or negatively, is likely
to require close proximity of the genes.
The orientation of closely linked genes in the human
genome is not random: there are more closely linked head-to-head (h2h) genes, usually defined as genes divergently
transcribed from opposite strands separated by an intergenic region of 1 kb or less (Adachi and Lieber 2002;
Trinklein et al. 2004; Li et al. 2006), than expected. The
region between these h2h gene pairs is usually denoted
as a bidirectional promoter. Formal experimental proof that
expression of a h2h gene pair is regulated by a common and
shared bidirectional element is available for only a few of
such bidirectional promoters (see e.g., Hansen et al. 2003).
1 Present address: Animal Breeding and Genetics Group, University
of Wageningen, Wageningen, The Netherlands.
Key words: head-to-head gene, bidirectional promoter, coordinate
expression.
E-mail: [email protected].
Mol. Biol. Evol. 25(9):1909–1921. 2008
doi:10.1093/molbev/msn136
Advance Access publication June 19, 2008
© The Author 2008. Published by Oxford University Press on behalf of
the Society for Molecular Biology and Evolution. All rights reserved.
For permissions, please e-mail: [email protected]
However, close juxtaposition of 2 autonomous promoters
does result in promoter cross talk (Hampf and Gossen
2007), unless an insulator is interposed (see e.g., Xie et
al. 2007). It is therefore likely that the members of a closely
linked h2h gene pair are no longer independently expressed. Indeed, most (Trinklein et al. 2004; Li et al.
2006; Lin et al. 2007; Yang et al. 2007), but not all (Takai
and Jones 2004), expression analyses showed significant
correlation, both negative and positive, between the expression of h2h gene pair members.
The usual explanation for the evolutionary origin of
closely linked h2h pairs is that once created by chance,
it becomes difficult to separate the pair as insertion of intergenic DNA, such as a repetitive element, would disturb expression of both genes. H2h gene pairs would thus slowly
accumulate during evolution. This explanation is supported
by the higher than average evolutionary conservation of
h2h pairs (Koyanagi et al. 2005; Li et al. 2006) and a lack
of repetitive elements in the bidirectional promoter region
between human h2h pairs (Takai and Jones 2004). It is curious, however, that enrichment in closely linked h2h pairs
is reported to be limited to mammals (Koyanagi et al. 2005);
one would expect them to accumulate in all evolutionary
lineages. We have therefore examined the evolution of
closely linked h2h gene pairs and compared the dynamics
of the evolution of h2h gene pairs with that of closely linked
convergently transcribed antisense gene pairs (tail-to-tail;
t2t) and that of head-to-tail gene pairs (h2t; consecutive
genes transcribed from the same strand). We show here that
the enrichment in closely linked h2h gene pairs is not limited to mammalian genomes but is also seen in, for example,
those of chicken and Xenopus tropicalis. A distinguishing
feature of the mammalian gene organization, with the exception of platypus, as compared with that of other investigated vertebrates and lower eukaryotes, is the relative lack
of closely linked h2t gene pairs. By tracing the emergence
of the h2h, t2t, and h2t gene pairs closely linked in the human genome, we show that the accumulation of h2h pairs
predates that of h2t and t2t pairs. However, once formed,
the h2t and t2t pairs are as stable as the h2h pairs.
Methods
Data Sets and Gene Distribution
In Ensembl (version 40 and 46; Hubbard et al. 2007), the
following species were used: primates: Homo sapiens (Hs;
man), Pan troglodytes (Pt; chimpanzee), Macaca mulatta
(Mmu; macaque), and Otolemur garnettii (Og; bush baby);
Scandentia: Tupaia belangeri (Tb; tree shrew); rodents:
Rattus norvegicus (Rn; rat), Mus musculus (Mm; mouse),
Spermophilus tridecemlineatus (St; squirrel), and Cavia
porcellus (Cp; guinea pig); Lagomorpha: Oryctolagus
cuniculus (Oc; rabbit); Laurasiatheria: Canis familiaris
(Cf; dog), Felis catus (Fc; cat), Bos taurus (Bt; cow),
Myotis lucifugus (Ml; microbat), Erinaceus europaeus
(Ee; hedgehog), and Sorex araneus (Sa; shrew); Afrotheria:
Loxodonta africana (La; elephant) and Echinops telfairi (Et;
tenrec); Xenarthra: Dasypus novemcinctus (Dn; armadillo);
marsupial: Monodelphis domestica (Md; opossum); monotreme: Ornithorhynchus anatinus (Oa; platypus); bird:
Gallus gallus (Gg; chicken); amphibian: Xenopus tropicalis (Xt; pipid frog); fish: Tetraodon nigroviridis (Tn; spotted green puffer fish), Takifugu rubripes (Tr; fugu or
Japanese puffer fish), Danio rerio (Dr; zebrafish), Gasterosteus aculeatus (Ga; stickleback), and Oryzias latipes
(Ol; medaka); urochordates: Ciona intestinalis (Ci; sea
squirt) and Ciona savignyi (Cs; sea squirt); and other eukaryotes: Caenorhabditis elegans (Ce; nematode), Drosophila melanogaster (Dm; fruit fly), Anopheles gambiae (Ag;
mosquito), and Saccharomyces cerevisiae (Sc; yeast).
For all species, the number of h2h, h2t, and t2t gene
pairs was determined, together with the length of the intergenic region, by means of Python scripting on the Ensembl
gene annotation files in Ensmart (script available from the
authors). The intergenic region was defined as the number
of base pairs between the beginning and/or ends of the transcripts as annotated in Ensembl. In the better annotated genomes, this includes the 5′ and 3′ untranslated regions
(UTR); in poorly annotated genomes, information about
the UTRs may be incomplete, and the number of closely
linked gene pairs could be underestimated.
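The pair classification described above can be sketched in Python as follows. This is a minimal illustration and not the authors' script; the `Gene` record, its field names, and the toy coordinates are assumptions made for the example. Genes on one chromosome are sorted by start position, each neighboring pair is labeled from the strands of its two members, and overlapping neighbors are skipped.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    # Hypothetical minimal record; real Ensembl exports carry more fields.
    name: str
    start: int   # leftmost transcript coordinate (1-based, inclusive)
    end: int     # rightmost transcript coordinate (inclusive)
    strand: int  # +1 for the forward strand, -1 for the reverse strand


def classify_pairs(genes):
    """Return (left, right, orientation, intergenic_bp) for adjacent genes."""
    ordered = sorted(genes, key=lambda g: g.start)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        distance = right.start - left.end - 1
        if distance < 0:
            continue  # overlapping genes are not counted
        if left.strand == right.strand:
            orientation = "h2t"  # same strand: consecutive genes
        elif left.strand == -1:
            orientation = "h2h"  # <- ->  divergent, 5' ends adjacent
        else:
            orientation = "t2t"  # -> <-  convergent, 3' ends adjacent
        pairs.append((left.name, right.name, orientation, distance))
    return pairs


def gene_duos(pairs, cutoff=600):
    """Keep pairs separated by less than `cutoff` bp of intergenic DNA."""
    return [p for p in pairs if p[3] < cutoff]


# Toy chromosome (made-up coordinates, for illustration only).
toy = [
    Gene("A", 1000, 2000, -1),
    Gene("B", 2300, 3500, +1),
    Gene("C", 9000, 9800, +1),
    Gene("D", 10100, 11000, -1),
]
pairs = classify_pairs(toy)
duos = gene_duos(pairs)
```

In this toy example, A and B form an h2h duo and C and D a t2t duo, whereas B and C are an h2t pair that is too far apart to count as a duo.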
Conservation and Dynamics of Gene Pairs
The cross species homology data (orthology files in
Ensmart) were used to find orthologs of human gene pairs
with an intergenic distance <600 bp (gene duos) in other
species. When more than one possible ortholog was found,
the most probable gene pair was chosen, that is, the one
which most resembled the human situation in orientation
and/or distance. For the orthologs of each member of a human gene duo, the location and organization in other species were determined with possible outcome h2h, h2t, t2t,
or not linked. In case of location on the same chromosome,
the intergenic distance was determined as well as whether
or not the 2 genes were separated by other genes. By combining the data from different species, the most likely organization of the members of the Hs gene duos at the
primate–rodent divergence was then inferred. Similarly,
the putative organization of the orthologs could be inferred at the other branching points of a vertebrate tree
consisting of Hs, Mm, Rn, Gg, Xt, Tn, Tr, Dr, Ol, and
Ga, in which we placed Mm and Rn together in a rodent
group and the 5 fish species in a fish group. In analyzing
these data, the maximum parsimony principle was applied,
assuming the least chromosomal rearrangements. Gene
duos that were inferred to be closely linked at a branching
point, yet separated in a descendant, were considered to be
lost again (e.g., genes that are h2h gene duos in Hs, Mm,
and Gg but not in Rn). We could trace orthologs of 47%
(365) of the h2h, 27% (99) of the h2t, and 41% (185) of the
t2t Hs gene duos; orthologs of the remainder of the Hs
gene duos could not be found in a sufficient number of
species.
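The ancestral-state inference described above can be illustrated with Fitch's small-parsimony algorithm. The text does not name the algorithm that was used, so Fitch's method is swapped in here as one standard way to minimize the number of state changes on a tree; the two-state coding ("duo" versus "apart"), the nested-tuple tree, and the function name are assumptions made for this sketch.

```python
def fitch(tree, leaf_states):
    """Bottom-up Fitch pass.

    `tree` is either a leaf name (str) or a tuple of two subtrees.
    Returns (candidate ancestral states, minimum number of state changes).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left_set, left_cost = fitch(tree[0], leaf_states)
    right_set, right_cost = fitch(tree[1], leaf_states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # The children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# The example from the text: a gene duo in Hs, Mm, and Gg but not in Rn.
tree = (("Hs", ("Mm", "Rn")), "Gg")
states = {"Hs": "duo", "Mm": "duo", "Rn": "apart", "Gg": "duo"}
root_states, changes = fitch(tree, states)
```

Here the root is inferred to carry the duo with a single change on the tree, which corresponds to a loss of the duo in the rat lineage.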
Gene Expression
An expression data set consisting of a subset of normal human and mouse tissue samples from the Gene Logic
BioExpress Database product (http://www.genelogic.com/
genomics/bioexpress/) was used. The human data set consists of 115 tissue categories (compiled from 3,269 tissue
samples) and 44,792 cDNA fragments; and the mouse data
set consists of 25 tissue categories (compiled from 859 tissue samples) and 36,701 cDNA fragments (Hulsen et al.
2006). First, the Pearson correlations between the expression profiles of all cDNA fragments in the human set and all
genes in the mouse set were calculated (data available at
http://www.cmbi.ru.nl/~timhulse/orthocomp/). A perfect
correlation has a score of 1; a perfect anticorrelation has
a score of -1. Second, the Affymetrix fragment IDs of the
chip data were mapped to the Ensembl (version 40) IDs used
in our study, using Ensembl-Affymetrix mapping files from
the Ensembl FTP site (see also http://www.ensembl.org/info/about/docs/microarray_probe_set_mapping.html). When
one Ensembl ID was mapped to multiple Affymetrix fragment
IDs, the average of the multiple correlation coefficients
was used. Of the 18,553 Ensembl Hs gene IDs mapped
to 28,348 Affymetrix IDs, 10,497 map to a single ID,
4,854 to 2 IDs, 2,037 to 3 IDs, and 1,165 to 4 or more
Affymetrix IDs. Of the 16,269 Ensembl Mm gene IDs
mapped to 20,548 Affymetrix IDs, 11,146 map to a single
ID, 3,588 to 2 IDs, 1,136 to 3 IDs, and 399 to 4 or more
Affymetrix IDs. Finally, the correlation coefficients were
mapped for human and mouse Ensembl (version 40) h2h,
h2t, and t2t gene pairs and 3,249 (human) or 2,197
(mouse) randomly assembled gene pairs as control.
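The correlation step can be sketched as below, assuming each gene is represented by one or more probe-level expression profiles measured over the same tissue categories. The function names and the toy profiles are hypothetical; only the use of the Pearson correlation and the averaging over multiple fragment IDs come from the description above.

```python
from itertools import product
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def gene_pair_correlation(probes_a, probes_b):
    """Average the correlation over every probe combination of two genes."""
    scores = [pearson(p, q) for p, q in product(probes_a, probes_b)]
    return sum(scores) / len(scores)


# Toy data: gene A has one probe; gene B has two probes that disagree.
gene_a = [[1.0, 2.0, 3.0, 4.0]]
gene_b = [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
score = gene_pair_correlation(gene_a, gene_b)
```

The toy case shows a side effect of averaging: one perfectly correlated and one perfectly anticorrelated probe cancel out to a score near zero.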
Results
Organization of Closely Linked Gene Pairs
Ensembl (Hubbard et al. 2007) provides a comprehensive and integrated source of annotation of genome sequences and provides orthology links between genes in
annotated genomes. This makes this database particularly
suitable for analyzing the evolution of gene organization
within and between species. The total number of tandem
gene pairs in a genome is theoretically equal to the number
of genes. In practice, in well-assembled genomes, the
number counted can be slightly higher because more than
one potential pair can be formed when genes overlap. In
poorly assembled genomes, where contigs are still short,
the number of tandem gene pairs detected is significantly
smaller than the number of genes (compare e.g., Hs and
Dn in table 1).
If gene organization is random, then the frequency of
the 3 possible orientations of gene pairs should be 50% h2t,
25% h2h, and 25% t2t. This ratio is indeed more or less
observed in most of the 34 investigated eukaryotic genomes
(21 mammalian and 13 nonmammalian; table 1). However,
in some genomes, notably not only in that of the hedgehog
(Ee) but also, for example, in Cp, Sa, and La, we see considerably deviating frequencies. This is likely due to incomplete assembly of the genome; for reasons that we do not
understand, incomplete assembly tends to result in a bias
toward t2t pairs.
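The 50%/25%/25% expectation follows from enumerating the strand combinations of two neighboring genes, under the assumption (made here for illustration) that each gene lies on the forward or the reverse strand with equal probability and independently of its neighbor. The function name is hypothetical.

```python
from collections import Counter
from itertools import product


def orientation(left_strand, right_strand):
    """Orientation of two neighboring genes, left to right on the chromosome."""
    if left_strand == right_strand:
        return "h2t"
    return "h2h" if left_strand == "-" else "t2t"


# Four equally likely strand combinations: ++, +-, -+, --
counts = Counter(orientation(l, r) for l, r in product("+-", repeat=2))
expected = {name: n / 4 for name, n in counts.items()}
```

Two of the four combinations (++ and --) give an h2t pair, which is why h2t is expected twice as often as either h2h or t2t.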
The mammalian genome has been reported to be enriched in closely linked h2h gene pairs (Trinklein et al.
2004; Koyanagi et al. 2005; Li et al. 2006). Because the
sequences of a number of other vertebrate genomes, both mammalian and nonmammalian, have become
available in recent years, we reexamined whether enrichment in closely
linked h2h gene pairs is indeed a characteristic of mammalian genomes only. The frequency of h2h, h2t, and t2t gene
pairs relative to the intergenic distance between the members of those pairs in representatives of different vertebrate
groups is shown in figure 1 (data for all species examined
are shown in supplementary fig. S1 [Supplementary Material online]; note that we only selected protein-coding genes
and that overlapping genes were not taken into account).
With the notable exception of the opossum (Md) and platypus (Oa) genomes, all mammalian genomes do show an enrichment in closely linked h2h gene pairs and in some cases
also in closely linked t2t pairs. It is unlikely that the difference in organization of the Md and Oa genomes is due to
incomplete annotation as other mammalian genomes that
also contain very few closely linked gene pairs do nevertheless show at least some enrichment in h2h pairs (see e.g., Cp
and Et in supplementary fig. S2, Supplementary Material
online). Enrichment in closely linked h2h pairs is also seen
in the genome of G. gallus (Gg) and, to a lesser extent, in
that of X. tropicalis (Xt) but not in any of the 5 fish genomes. The puffer fish genomes (Tn and Tr) show an enrichment in h2t as do the nonvertebrate genomes.
As these data indicate that between vertebrate clades
genomes might differ not only in the enrichment of closely
linked h2h gene pairs but also in the frequency of closely
linked h2t and t2t gene pairs, we determined the number of
closely linked h2h, h2t, and t2t pairs (table 1; as the enrichment of h2h gene pairs in the mammalian genomes is seen
Table 1
The Number of Gene Pairs in Vertebrate and in Some Lower Eukaryotic Genomes (Ensembl version 46)

            Genome     Protein-Coding  Number of Gene Pairs (% of total)            Number of Gene Pairs with Intergenic Distance <600 bp (% of total)
Species(a)  Size (Mb)  Genes           h2h            h2t            t2t            h2h            h2t            t2t
Hs          3,253      22,701          5,809 (25%)    11,813 (50%)   5,792 (25%)    770 (49%)      363 (23%)      445 (28%)
Pt          2,929      20,572          5,266 (25%)    10,525 (50%)   5,271 (25%)    588 (53%)      185 (17%)      331 (30%)
Mmu         3,094      21,944          5,324 (25%)    10,844 (50%)   5,300 (25%)    421 (47%)      188 (21%)      293 (32%)
Og          1,969      15,444          1,671 (21%)    3,747 (48%)    2,433 (31%)    92 (46%)       17 (9%)        91 (45%)
Tb          2,137      15,458          1,539 (22%)    3,272 (46%)    2,233 (32%)    87 (42%)       28 (14%)       89 (44%)
Mm          3,378      24,118          5,989 (24%)    12,717 (52%)   5,943 (24%)    703 (50%)      303 (22%)      395 (28%)
Rn          2,507      22,993          5,711 (25%)    11,892 (51%)   5,695 (24%)    399 (43%)      199 (21%)      338 (36%)
St          1,913      14,828          1,484 (22%)    3,114 (47%)    2,085 (31%)    74 (40%)       18 (10%)       91 (50%)
Cp          1,950      14,064          783 (21%)      1,575 (43%)    1,336 (36%)    75 (41%)       30 (16%)       79 (43%)
Oc          2,076      15,439          1,011 (22%)    2,087 (45%)    1,579 (33%)    103 (52%)      19 (10%)       76 (38%)
Cf          2,385      19,305          4,762 (25%)    9,901 (51%)    4,750 (24%)    208 (39%)      95 (18%)       225 (43%)
Fc          1,643      14,839          1,780 (22%)    3,957 (48%)    2,422 (30%)    102 (42%)      33 (13%)       110 (45%)
Bt          3,247      21,755          5,529 (25%)    10,881 (50%)   5,508 (25%)    395 (47%)      159 (19%)      291 (34%)
Ml          1,674      16,229          1,557 (21%)    3,447 (47%)    2,347 (32%)    128 (48%)      31 (12%)       106 (40%)
Ee          2,133      14,588          424 (16%)      1,034 (38%)    1,244 (46%)    86 (44%)       13 (7%)        95 (49%)
Sa          1,833      13,290          623 (20%)      1,315 (42%)    1,166 (38%)    63 (48%)       15 (11%)       55 (41%)
La          2,296      15,717          903 (20%)      2,020 (45%)    1,589 (35%)    93 (53%)       19 (11%)       62 (36%)
Et          2,112      16,562          740 (20%)      1,524 (42%)    1,385 (38%)    111 (47%)      21 (9%)        103 (44%)
Dn          2,146      15,540          632 (22%)      1,231 (42%)    1,051 (36%)    76 (53%)       15 (10%)       53 (37%)
Md          3,502      19,520          4,828 (24%)    10,048 (51%)   4,833 (25%)    92 (34%)       58 (22%)       116 (44%)
Oa          1,918      18,527          2,259 (23%)    4,974 (50%)    2,613 (27%)    47 (23%)       99 (49%)       58 (28%)
Gg          1,051      16,736          4,344 (25%)    8,358 (49%)    4,354 (26%)    445 (43%)      373 (36%)      220 (21%)
Xt          1,511      18,025          3,761 (24%)    7,782 (51%)    3,843 (25%)    201 (41%)      224 (46%)      67 (14%)
Tn          342        28,005          6,240 (22%)    15,481 (55%)   6,239 (23%)    440 (14%)      1,883 (60%)    817 (26%)
Tr          393        21,880          4,517 (24%)    10,048 (52%)   4,541 (24%)    115 (12%)      628 (65%)      222 (23%)
Ga          447        20,791          4,893 (24%)    10,684 (52%)   4,875 (24%)    387 (26%)      653 (44%)      443 (30%)
Ol          700        20,131          4,577 (24%)    10,191 (52%)   4,596 (24%)    289 (34%)      290 (35%)      261 (31%)
Dr          1,527      21,322          4,691 (23%)    11,050 (54%)   4,699 (23%)    239 (43%)      160 (29%)      156 (28%)
Ci          173        14,180          2,704 (21%)    7,363 (58%)    2,729 (21%)    303 (14%)      1,344 (62%)    538 (25%)
Cs          177        11,604          2,478 (22%)    6,461 (56%)    2,526 (22%)    142 (12%)      695 (60%)      319 (28%)
Ce          100        20,069          4,647 (22%)    11,231 (55%)   4,692 (23%)    1,055 (16%)    3,525 (52%)    2,137 (32%)
Dm          133        14,086          3,829 (26%)    7,009 (48%)    3,745 (26%)    1,663 (31%)    2,323 (43%)    1,374 (26%)
Ag          278        12,457          3,192 (26%)    6,167 (49%)    3,186 (25%)    451 (23%)      786 (40%)      708 (36%)
Sc          12         6,697           1,822 (27%)    3,224 (47%)    1,808 (26%)    1,017 (21%)    2,262 (48%)    1,479 (31%)

(a) Species, see Methods for abbreviations.
FIG. 1.—Frequency of h2h (black diamonds), h2t (dotted gray line), and t2t (gray line) gene pairs as a function of the length of the intergenic
region in the species indicated in the panels. The frequency was calculated by dividing the number of gene pairs separated by a particular distance
(in intervals of 50 bp) by the total number of gene pairs in the genome. The arrows indicate the cutoff for gene duos used in subsequent analyses. Data
for other species are shown in supplementary figure S2 (Supplementary Material online).
for gene pairs with an intergenic distance of 600 bp or less
[arrows in fig. 1], we used 600 bp, rather than the 1,000 bp
used in other studies, as the cutoff for closely linked gene
pairs. We will refer below to such closely linked gene pairs
as gene duos). The percentages of the 3 possible orientations of gene duos relative to the total number of gene duos
in different vertebrate and nonvertebrate species are plotted
in figure 2. There is some variation in the pattern in tetrapods, but the overall trend is clear: there is an increase in
h2h gene duos not only in all mammals, except Oa, but also
in Gg and Xt. The t2t gene duos are also in excess in most
mammals but not in Gg and Xt, which actually have fewer t2t
gene duos than expected from a random distribution (note
that the excess of t2t gene duos is a feature of both well
[e.g., Rn] and poorly [e.g., Ee] assembled mammalian genomes and thus unlikely to be an assembly artifact). The
most noticeable and consistent feature is the marked lack
of h2t gene duos in mammalian genomes except again in
that of Oa. As can be expected from their much longer divergence times, the organization of the 5 fish genomes is
much more variable than that of the mammalian genomes:
Dr has a mammalian-like distribution with a lack of h2t
gene duos and an overrepresentation of h2h gene duos,
whereas in the Tn and Tr genomes, as in nonvertebrate genomes, the organization of the gene duos is more random.
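The percentages plotted in figure 2 can be recomputed from the gene duo counts in table 1 and compared with the random expectation. The sketch below is illustrative only: the function name and the observed/expected ratio are assumptions introduced here, not a statistic reported in the text. The counts are the Hs and Tn rows of table 1.

```python
# Expected fractions if gene orientation is random.
RANDOM_EXPECTATION = {"h2h": 0.25, "h2t": 0.50, "t2t": 0.25}


def duo_profile(counts):
    """Map each orientation to (rounded percentage, observed/expected ratio)."""
    total = sum(counts.values())
    profile = {}
    for name, n in counts.items():
        observed = n / total
        profile[name] = (round(100 * observed), observed / RANDOM_EXPECTATION[name])
    return profile


human = duo_profile({"h2h": 770, "h2t": 363, "t2t": 445})
puffer = duo_profile({"h2h": 440, "h2t": 1883, "t2t": 817})
```

In the human row, h2h duos come out at nearly twice the random expectation and h2t duos at less than half of it, whereas the Tn row stays much closer to the expected h2t share.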
Clustering of Gene Duos
If the location of genes were random, then gene density
must correlate with genome size and a more compact genome such as that of Dm or Ce is likely to have more gene
duos, merely due to a higher gene density. There is also
a significant correlation between the closely linked h2h
gene pair ratio and gene density in the human genome
(Li et al. 2006). To determine whether gene duos are in general part of a larger gene dense cluster, we determined
whether either one of the members of the gene pair is
flanked by another gene within 1,000 bp or less. As ex-
pected from their compact genomes, in Tn and Ci, about
one-third of the gene duos are part of a cluster of at least
3 genes, whereas in the larger Dr genome, this proportion is
only around 10% (table 2). In the tetrapod genomes examined on the average 24% of the h2t gene duos and 21% of
the t2t gene pairs are part of a gene cluster. The h2h gene
duos are more likely to be solitary, with an average of only
13% belonging to a larger dense gene cluster. This holds
true for both well-annotated genomes (Hs and Mm) and less
well-annotated genomes and is thus unlikely to be an artifact due to lack of information about the exact lengths of the
5′ and 3′ UTRs. A reason for the more solitary nature of
h2h gene duos in tetrapod genomes may be that a gene
neighboring the pair would be in an h2t or a t2t organization, and at least h2t gene duos are scarce in mammalian
genomes (fig. 2). Such gene pairs are also scarce in the
Dr genome, and h2h gene duos are also more likely to
be solitary in this genome than h2t and t2t gene duos. In
the Tn and Ci genomes, in which gene duos have a more
random organization, the h2h gene duos are more likely to
be part of a larger gene cluster than the h2t gene duos.
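The cluster test used above (is either member of a gene duo flanked by another gene within 1,000 bp?) can be sketched as follows. The data layout, a list of (start, end) tuples sorted by start for a single chromosome, and the function name are assumptions made for the example.

```python
def in_dense_cluster(genes, i, max_gap=1000):
    """Report whether the duo (genes[i], genes[i + 1]) has a close neighbor.

    `genes` is a list of (start, end) tuples for one chromosome, sorted by
    start and non-overlapping.
    """
    def gap(left, right):
        return right[0] - left[1] - 1  # intergenic base pairs between two genes

    upstream = i > 0 and gap(genes[i - 1], genes[i]) <= max_gap
    downstream = i + 2 < len(genes) and gap(genes[i + 1], genes[i + 2]) <= max_gap
    return upstream or downstream


# Toy chromosome: the first duo has a third gene 499 bp downstream,
# the last duo has no neighbor within 1,000 bp.
toy = [(1000, 2000), (2300, 3500), (4000, 5000), (20000, 21000), (21200, 22000)]
first_duo_clustered = in_dense_cluster(toy, 0)
last_duo_clustered = in_dense_cluster(toy, 3)
```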
Gene duplication often results in gene clusters; the
β-globin gene cluster is a prime example. To determine
whether gene duplication is a significant cause of gene
duos, we determined how many genes in the human genome are adjacent to a paralog gene. Paralogous genes were
identified via the paralogy link in Ensembl. As shown in
table 3, about 10% of all human protein-coding genes have
a paralog neighbor transcribed from the same strand, but
only 6% of the h2t gene duos are paralogous. About 8%
of the human protein-coding genes have a paralog neighbor
transcribed from the opposite strand, with an equal occurrence of divergent or convergent transcription. However,
for both the divergently and the convergently transcribed
gene duos, only 1% consists of paralogs (table 3). The
members of h2t gene duos are thus 3 times more likely
to be paralogs than those of h2h or t2t gene duos, but gene
duplication events have not significantly contributed
FIG. 2.—Frequency of h2h, h2t, or t2t gene duos relative to the total numbers of gene pairs with intergenic distance of <600 bp. Dotted lines
indicate 25% and 50% which is the expected distribution of h2h or t2t (25%) and h2t (50%) if gene organization is random. Species are indicated as
Hs = Homo sapiens, Pt = Pan troglodytes, Mmu = Macaca mulatta, Og = Otolemur garnettii, Tb = Tupaia belangeri, Mm = Mus musculus, Rn =
Rattus norvegicus, St = Spermophilus tridecemlineatus, Cp = Cavia porcellus, Oc = Oryctolagus cuniculus, Cf = Canis familiaris, Fc = Felis catus,
Bt = Bos taurus, Ml = Myotis lucifugus, Ee = Erinaceus europaeus, Sa = Sorex araneus, La = Loxodonta africana, Et = Echinops telfairi, Dn =
Dasypus novemcinctus, Md = Monodelphis domestica, Oa = Ornithorhynchus anatinus, Gg = Gallus gallus, Xt = Xenopus tropicalis, Tn = Tetraodon
nigroviridis, Tr = Takifugu rubripes, Ga = Gasterosteus aculeatus, Ol = Oryzias latipes, Dr = Danio rerio, Ci = Ciona intestinalis, Cs = Ciona
savignyi, Ce = Caenorhabditis elegans, Dm = Drosophila melanogaster, Ag = Anopheles gambiae, and Sc = Saccharomyces cerevisiae. For common
names of these species, see Methods. For clarity's sake, the 2 nonmammalian tetrapod species (Gg and Xt) and the 2 urochordates (Ci and Cs) are shown in
gray. The mammalian species are divided in separate groups shown in alternating black and striped bars (primates, Scandentia, Laurasiatheria, and
Xenarthra in black; rodents, Lagomorpha, Afrotheria, marsupials, and monotremes in striped bars). The different taxonomic groupings are indicated at the
bottom of the figure.
overall to the formation of gene duos. To have some idea as
to when the gene duplications occurred, we checked for the
presence of orthologs of all the Hs paralogous gene duos in
other species. As shown in table 3, 5 out of the 7 h2h paralogous gene duos likely predate the gnathostomes,
whereas only 11 out of the 22 h2t and 2 of the 4 t2t paralogous gene duos do so (see also below). Overall, at least
half of the Hs paralogous gene duos that could be traced
back in the vertebrate tree are the result of a gene duplication early in vertebrate evolution.
Table 2
The Extent to which h2h, h2t, or t2t Gene Duos Are Part of a Larger Dense Gene Cluster

            Percent of Gene Duos Having a Neighboring Gene within 1,000 bp
Species(a)  h2h (%)    h2t (%)    t2t (%)
Hs          13         27         22
Pt          12         26         21
Mm          13         22         20
Gg          14         22         23
Xt          11         23         21
Tn          37         24         33
Dr          8          13         11
Ci          38         32         31

(a) See Methods for abbreviations.
Table 3
The Contribution of Gene Duplication to the Formation of
Gene Duos in the Human Genome
Percent of human gene pairs in a particular organization consisting of paralogous genes
        h2h (%)    h2t (%)    t2t (%)
All     4          10         4
Duos    1          6          1
Age of human paralogous gene duos
h2h
h2t
t2t
(1)a
Human
1 (4)a
Euarchontoglires
(2)a
(1)a
Amniotes
1 (1)a
(4)a
Tetrapods
Gnathostomes
5
11
2
Total
7
22
4
a
Between brackets: number of gene duos that could not be traced back
further.
Conservation of Gene Duos
If close apposition of genes has consequences for the
regulation of expression of those genes, then one would expect conservation of gene organization. Previous studies
have shown that h2h gene pairs are significantly more likely
to have the same gene organization in other species than h2t
pairs (Koyanagi et al. 2005; Li et al. 2006; Sémon and Duret
2006). Sémon and Duret (2006) showed that genes convergently transcribed (t2t) are less likely to be linked in other
species than genes transcribed in the same (h2t) or divergent
direction (h2h). However, this study focused on gene clusters with no constraint on intergenic distance, and gene
pairs were still considered as pairs even if another gene
was interposed. We have therefore repeated this analysis
and compared the organization of the members of h2h,
h2t, and t2t human gene duos with that of their orthologs
in other species. Most orthologs were in the same orientation in tetrapod genomes; even in fish about half of the orthologous gene pairs have the same relative orientation as in
man (fig. 3, panel a). However, when the intergenic
distance is also considered, close apposition of orthologs
is only seen in primates and rodents (fig. 3, panel b). In other
vertebrates, the intergenic distance between the orthologs of
the human gene duos is usually greater than 600 bp.
When only the orthologs of the human gene duos are
considered, there is little difference between the human
h2h, h2t, and t2t gene duos with respect to similarity of their
organization in other species (fig. 3, panel b). However,
when the number of those orthologs in a species relative
to the number of gene duos in that species is also taken into
account, a different picture emerges (fig. 3, panel c). As expected, the larger the evolutionary distance is from man, the
fewer the orthologous gene duos found. When ortholog
gene duos are found, they are mostly h2h and not h2t or
t2t gene duos. Thus, in species more distant from man, there
are relatively fewer orthologous h2t and t2t gene duos than
h2h gene duos. Assuming that the evolutionary rate of generation of gene duos is the same for h2h, h2t, and t2t, there
are 2 possible explanations for the preponderance of orthologous h2h gene duos. One is that h2h gene duos were generated earlier in evolution than h2t and t2t genes and are
therefore more likely to be common; the other is that the
close linkage of h2h genes, once generated, is better conserved during evolution than that of h2t and t2t genes.
Dynamics of Formation of Gene Duos
To determine when a particular human gene duo was
formed during evolution, we need to trace the organization
of the genes in the ancestral species. To that end, we used
the data about the organization of the orthologs of the members of the human gene duos in other species. Of the human
gene duos, 365 h2h, 99 h2t, and 185 t2t were phylogenetically informative, that is, the organization of the orthologs
could be traced in a sufficient number of different vertebrate
species to be able to infer the most likely organization of
those orthologs at the branching points of the gnathostome,
tetrapod, amniote, and euarchontoglires (primates and rodents) lineages (nodes A–D in fig. 4; see also Methods).
The inferred rearrangements of the members of the human
h2h, h2t, and t2t duos are outlined in figure 5. For example,
in the genome of the common ancestor of the gnathostomes
(fig. 5, node A), 58 of the 365 human h2h gene duos were
already present as h2h gene duos, 153 were already gene
pairs but with a larger intervening distance, 34 were separated by intervening genes, 45 were linked but in the wrong
orientation, and 75 were dispersed. Between branching
points A and B, of these 75 dispersed pairs, 6 became linked
but in the wrong orientation, 4 became linked in the right
orientation but separated by intervening genes, 38 became
a gene pair separated by >600 bp, and 8 became a gene
duo, thus leaving 19 as dispersed gene pairs at node B.
Of the 45 gene pairs at node A, which were linked in
the wrong orientation, 2 became linked in the right orientation but still separated by intervening genes, 22 became
a gene pair separated by >600 bp, and 8 became a gene
duo. This left 13 linked gene pairs, but together with the
6 gene pairs that now became linked, this yields a total
of 19 at node B. Of the 34 gene pairs separated by intervening genes at node A, in 10 cases the intervening genes
were lost but leaving an intergenic distance >600 bp, and in
4 cases a gene duo was formed. Together with the 2 + 4
gene pairs that were gained, this gives a total of 26 at node
B. Of the 153 gene pairs separated by >600 bp at node A,
19 now became a gene duo, presumably due to loss of intergenic DNA, while 10 + 22 + 38 were added, which then
yielded 204 gene pairs at node B. Finally, to the 58 gene
duos at node A, 19 + 4 + 8 + 8 were added, giving 97 gene
duos at node B.

FIG. 3.—Percentage of orthologs of human h2h, h2t, and t2t gene duos present in other eukaryotes. On the left (panel a, gray bars), the percentage
of orthologs of the human gene duos with the same orientation is shown; in the middle (panel b, black bars), the percentage of orthologs of the human
gene duos with the same orientation and close proximity (<600 bp) is shown. The percentage of h2h genes is shown in the top panels, that of the h2t
genes in the middle panels, and that of the t2t genes in the bottom panels. Species are indicated as in the legend to figure 2. The percentage is relative to
the total number of orthologous gene pairs found (see also table 4), thus correcting for missing orthologs due to incomplete gene annotation and/or
orthology determination. On the right (panel c), a Venn diagram shows the overlap between Hs gene duos and those in other species. The left circle
represents the Hs gene duos, and the right circle the gene duos in the species (denoted as in the legend to fig. 2) indicated on the left. The overlap
represents the extent to which the gene duos are orthologous; the numbers are from left to right: the number of unique Hs gene duos, the number of
common (orthologous) gene duos, and the number of gene duos unique to the species indicated. The left column shows the data for the h2h gene duos,
the middle column those for the h2t gene duos, and the right column those for the t2t gene duos.

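The bookkeeping between nodes A and B for the 365 human h2h gene pairs can be checked with a short script. The following is a minimal sketch, not the authors' procedure: the counts are the ones quoted in the text above, while the category names and the function apply_transitions are labels introduced here for illustration only.

```python
# Organization of the 365 human h2h gene pairs at node A (counts from the text).
# The category names are labels chosen for this sketch.
node_a = {
    "dispersed": 75,
    "linked_wrong_orientation": 45,
    "intervening_gene": 34,
    "pair_gt600": 153,   # adjacent, proper orientation, intergenic distance >600 bp
    "duo": 58,           # adjacent, proper orientation, intergenic distance <=600 bp
}

# (from, to): number of gene pairs that changed organization between nodes A and B.
transitions = {
    ("dispersed", "linked_wrong_orientation"): 6,
    ("dispersed", "intervening_gene"): 4,
    ("dispersed", "pair_gt600"): 38,
    ("dispersed", "duo"): 8,
    ("linked_wrong_orientation", "intervening_gene"): 2,
    ("linked_wrong_orientation", "pair_gt600"): 22,
    ("linked_wrong_orientation", "duo"): 8,
    ("intervening_gene", "pair_gt600"): 10,
    ("intervening_gene", "duo"): 4,
    ("pair_gt600", "duo"): 19,
}


def apply_transitions(counts, moves):
    """Return the counts at the next node after moving pairs between categories."""
    result = dict(counts)
    for (src, dst), n in moves.items():
        result[src] -= n
        result[dst] += n
    return result


node_b = apply_transitions(node_a, transitions)
for category, count in node_b.items():
    print(f"{category:26s} {count}")
```

Because every transition only moves pairs from one category to another, the total of 365 pairs is conserved, which gives a quick consistency check on the numbers read from figure 5.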
Figure 5 illustrates that the mode of formation of the
gene duos, whether h2h, h2t, or t2t, is in general very similar: first genes happened to be rearranged such that they
are linked in the proper orientation, then intergenic DNA
was lost. For all 3 gene pair orientations, more than 80% of
the pair members were already organized in the right orientation without an intervening gene in tetrapods (fig. 5,
node B, and fig. 6, left panel). The most conspicuous difference between the h2h gene duos on the one hand and
the h2t and t2t gene duos on the other hand is that about
50% of the h2h gene duos (172 out of 365) predate amniotes, whereas only 14% of the h2t (15 out of 99) and
28% of the t2t gene duos (52 out of 185) do so (see
fig. 5, node C, and fig. 6, right panel). Formation of the
human h2h gene duos thus started in early tetrapod evolution,
whereas most of the human h2t and t2t gene duos
were formed in amniotes (fig. 6).
We have also attempted to estimate the rate of loss of
gene duos. In principle, the best measure is counting how
many gene duos appeared to be lost again later in evolution
(e.g., if orthologs form a gene duo in fish, Xt, rodents, and
man but not in Gg, the inference is that the gene duo is lost
in the Gg lineage). For that we need to know which gene
duos were present in the ancestral genome. The numbers
shown in figure 5 at a particular node are the sum of the
gene duos present at the previous node plus the gene duos
inferred to have been formed prior to divergence of the lineages deriving from that node. The latter gene duos thus
represent those that are inferred to be present at that node
because they are present in the descendant species. Hence,
by definition, loss of those gene duos cannot be detected.
That means that we can only determine whether a gene duo
is lost after a particular node if that gene duo was present at
the previous node. For example, the 172 h2h gene duos
present at node C (fig. 5) should also be present at node
D. Loss in the rodent lineage can then be inferred for those
172 h2h gene duos (table 4). The numbers are very small,
particularly in the case of h2t gene duos, and the reliability
is therefore not high. A rate of loss can also be calculated by
combining the data presented in figures 3 and 5 and knowing how many orthologs of the human gene duos can be
detected in other species (supplementary table S1, Supplementary Material online). From figure 5, it can be calculated
how many of the orthologs of the human gene duos present
in a particular species (supplementary table S1, Supplementary Material online) were likely to be already present at the
nearest branch point; the loss then follows from the number
of gene duos present in the genome of that species at this
time (for sample calculation, see supplementary fig. S2,
Supplementary Material online). For the h2h gene duos,
these 2 approaches yield very similar rates: a loss between
11% and 14% per 100 Myr in the mouse lineage and between 8% and 10% per 100 Myr in the chicken lineage.
For the h2t gene duos, the estimates for the mouse lineage
are a loss between 6% and 15% per 100 Myr; in the chicken
lineage, none would be lost. Finally, the loss of the t2t gene
duos is estimated to be between 0% and 19% per 100 Myr in
the mouse lineage and between 13% and 15% per 100 Myr
in the chicken lineage. From these estimates, it appears that
there is no major difference in stability of gene duos depending on orientation.

FIG. 4.—Phylogenetic tree of the vertebrate species used to reconstruct the dynamics of human gene pair formation during vertebrate
evolution. The branch points of gnathostomes (A), tetrapods (B),
amniotes (C), and euarchontoglires (D) and the endpoint Homo sapiens
(E) are indicated by black diamonds. Species are indicated as in the
legend to figure 2.

FIG. 5.—Dynamics of human gene pair formation during vertebrate evolution. The organization of the human gene pairs at the branch points of
gnathostomes (A), tetrapods (B), amniotes (C), and euarchontoglires (D) and in H. sapiens (E) was inferred from the organization in the available
vertebrate genomes. The various possible organizations are shown at the top and denoted as: dis = dispersed; linked = linked on the same chromosome
but in the wrong orientation; 2 arrows flanking a black box = linked in the proper orientation but separated by another gene; 2 arrows flanking a slash =
genes are adjacent in the proper orientation but the intergenic distance is larger than 600 bp; and 2 arrows = gene duo, that is, genes are adjacent in the
proper orientation and with an intergenic distance <600 bp. The inferred numbers of gene pairs in the various predicted organizations are given for the
branch points A–D; the arrows underneath each number indicate the number of those gene pairs that changed to a specific other type of organization
before the next branch point, ultimately resulting in the observed numbers of human gene duos at point E. The data for the 365 human h2h gene duos
are shown on the left, those for the 99 h2t gene duos in the middle, and those for the 185 t2t gene duos on the right.

FIG. 6.—Extent of formation of gene pairs during evolution. In the
left panel, the percentage of gene pairs, irrespective of intergenic distance,
present at the branch and endpoints as in figures 4 and 5 is shown. The
right panel shows the percentage of gene duos at these branch and
endpoints. Note that in this figure, the combined data of 5 fish species are
used; the data shown in figure 3 refer to a single species.
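The first approach (counting losses of gene duos known to be present at the preceding node) can be illustrated with the totals from table 4. The sketch below is hedged: the counts are taken from table 4, but the linear scaling to "per 100 Myr" and the branch length passed to loss_per_100_myr are assumptions made for illustration, since the divergence times used by the authors are not given in this section.

```python
# (orientation, lineage) -> (gene duos known at the node, total gene duos lost),
# copied from table 4.
TABLE4_TOTALS = {
    ("h2h", "Node D to Mm, Rn"): (172, 19),
    ("h2h", "Node C to Gg"): (97, 31),
    ("h2h", "Node B to Xt"): (58, 24),
    ("h2t", "Node D to Mm, Rn"): (15, 3),
    ("h2t", "Node C to Gg"): (10, 0),
    ("h2t", "Node B to Xt"): (10, 0),
    ("t2t", "Node D to Mm, Rn"): (52, 9),
    ("t2t", "Node C to Gg"): (35, 14),
    ("t2t", "Node B to Xt"): (29, 18),
}


def percent_lost(known, lost):
    """Percentage of the gene duos known at a node that were lost in a lineage."""
    return 100.0 * lost / known


def loss_per_100_myr(known, lost, branch_myr):
    """Scale the percentage lost to 100 Myr.

    Assumption of this sketch: loss accumulates linearly along the branch.
    branch_myr (branch length in Myr) must be supplied by the caller.
    """
    return percent_lost(known, lost) * 100.0 / branch_myr


for (orientation, lineage), (known, lost) in TABLE4_TOTALS.items():
    print(f"{orientation}  {lineage:18s} {lost:2d}/{known:3d} = {percent_lost(known, lost):.1f}%")

# Hypothetical branch length (placeholder, not a value from the paper).
example_rate = loss_per_100_myr(97, 31, branch_myr=300.0)
```

With so few gene duos per category, especially for h2t, a single additional loss changes the percentages substantially, which is why the text treats these estimates as having low reliability.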
Gene Expression
The closer 2 genes are, the more likely they are to be
located in the same expression cluster (Sémon and Duret
2006; Purmann et al. 2007). However, coexpression may
be more marked when gene pairs are considered. To test
this, we plotted the Pearson correlation coefficient of expression of gene pairs grouped by intergenic distance, that
is, a group of gene pairs that were nonlinked (control), one
distantly linked (intergenic distance >100 kb), one linked
with an intergenic distance between 10 and 100 kb, one
linked with an intergenic distance between 0.6 and 10 kb,
and the closely linked gene duos (less than 600 bp).
Although the overall results are very similar—the closer
the linkage, the higher the likelihood of coexpression—
whether we looked at human or mouse tissue or whether
we considered h2h, h2t, or t2t orientation, there are some
surprising differences (fig. 7A). For the human h2h genes,
there was very little difference in the likelihood of coexpression with decreasing distance when the intergenic distance was less than 10 kb; in mice, closely linked gene pairs
(<600 bp) are more likely to be coexpressed than more distantly linked gene pairs. The reverse was found for the human h2t pairs: close linkage in man is more likely to result
in coexpression than in mice. The effect of close linkage on
coordinate expression was least for t2t pairs: the shift toward coexpression is less than for h2h and h2t pairs and
slightly larger in mice than in man. To show the difference
in the likelihood of coexpression relative to the orientation of
closely linked gene pairs more directly, the curves for the
h2h, h2t, and t2t gene duos were superimposed (fig. 7B). This
clearly shows that for all orientations, close linkage correlates with coexpression but that this effect is largest for
h2t gene pairs in man and for h2h gene pairs in mice. Note
that all curves skew to the right, that is, toward a positive correlation
between the expression of both members of a gene duo. In
these experiments, we found no evidence for a negative correlation of expression of a significant number of gene pairs.
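The grouping and scoring described in this section can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' pipeline: the function names, the handling of values that fall exactly on a class boundary, and the example expression profiles are all hypothetical. Only the distance classes and the 0.2-unit increments of the correlation coefficient come from the text and the legend to figure 7.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def distance_class(distance_bp):
    """Assign a gene pair to one of the intergenic distance classes.

    None marks a nonlinked (control) pair. Treating boundary values as
    belonging to the closer class is an assumption of this sketch.
    """
    if distance_bp is None:
        return "nonlinked"
    if distance_bp <= 600:
        return "<600"
    if distance_bp <= 10_000:
        return "600-10k"
    if distance_bp <= 100_000:
        return "10-100k"
    return ">100k"


def correlation_histogram(coefficients, width=0.2):
    """Relative frequency of coefficients in bins of the given width over [-1, 1]."""
    n_bins = round(2 / width)
    counts = [0] * n_bins
    for r in coefficients:
        index = min(int(math.floor((r + 1) / width)), n_bins - 1)
        counts[index] += 1
    total = len(coefficients)
    return [c / total for c in counts]


# Hypothetical expression values of two genes across five tissues.
gene_1 = [2.0, 4.0, 6.0, 8.0, 10.0]
gene_2 = [1.0, 2.0, 3.0, 4.0, 5.0]
r = pearson(gene_1, gene_2)
```

A distribution skewed toward the rightmost bins for the "<600" class relative to the "nonlinked" class would correspond to the shift toward coexpression described above.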
Table 4
Loss of Previously Formed Gene Duos during Later Evolution

                                   Number of Known       Number of Gene Duos Lost
Orientation  Lineage               Gene Duos at Node(a)  Dispersed   >600 bp Intergenic  Inverted   Gene Insertion  Total Lost
h2h          Node D(b) to Mm, Rn(c)  172                 2 (1.2%)    16 (9.3%)           0 (0.0%)   1 (0.6%)        19 (11.0%)
h2h          Node C to Gg             97                 2 (2.1%)    24 (24.7%)          4 (4.1%)   1 (1.0%)        31 (31.9%)
h2h          Node B to Xt             58                 5 (8.6%)    16 (27.6%)          2 (3.4%)   1 (1.7%)        24 (41.4%)
h2t          Node D to Mm, Rn         15                 0 (0.0%)    3 (20.0%)           0 (0.0%)   0 (0.0%)        3 (20.0%)
h2t          Node C to Gg             10                 0 (0.0%)    0 (0.0%)            0 (0.0%)   0 (0.0%)        0 (0.0%)
h2t          Node B to Xt             10                 0 (0.0%)    0 (0.0%)            0 (0.0%)   0 (0.0%)        0 (0.0%)
t2t          Node D to Mm, Rn         52                 1 (1.9%)    8 (15.4%)           0 (0.0%)   0 (0.0%)        9 (17.3%)
t2t          Node C to Gg             35                 2 (5.7%)    11 (31.4%)          0 (0.0%)   1 (2.9%)        14 (40.0%)
t2t          Node B to Xt             29                 3 (10.3%)   12 (41.4%)          1 (3.4%)   2 (6.9%)        18 (62.1%)

(a) Note that the number of gene duos known to be present in the ancestral species at a particular node is the number present at the previous node.
(b) See figure 4.
(c) See Methods for abbreviations.

Discussion
Mammalian genomes have been reported to be enriched in divergently transcribed cis-antisense gene pairs
with an intergenic distance of less than 1 kb, the h2h genes
with a bidirectional promoter region. The exact number of
such genes in the human genome is still a matter of debate;
the reported numbers vary from 677 (Koyanagi et al. 2005)
to 1,262 (Li et al. 2006), 1,304 (Lin et al. 2007), 1,352
(Trinklein et al. 2004), and 5,653 (Yang et al. 2007). Most
of these differences can be explained by the precision with
which transcription start sites have been mapped, by
whether not only major but also minor start sites were taken
into account and by whether or not only protein-coding
genes were used. In addition, it is not always clear whether
or not overlapping genes were excluded. In the experiments
reported here, we have used only protein-coding genes, excluded overlapping transcription units, and restricted our
analysis to nonoverlapping transcription units separated
by an intergenic distance of 600 bp or less (dubbed gene
duos), where the transcription units were defined as annotated in Ensembl. Using this definition, we found 770 h2h,
363 h2t, and 445 t2t gene duos in the human genome (table 1).
The genomes of other mammals show a similar organization
with respect to gene duos: on average about 10% of the
h2h gene pairs and about 6% of the t2t gene pairs have an
intergenic distance of 600 bp or less (excluding the opossum and platypus genomes). One striking property of the
mammalian gene organization is that close linkage between h2t gene pairs appears to be avoided, with on
the average only about 2% of the gene pairs in this orientation having an intergenic distance of 600 bp or less. In
spite of this similarity in gene organization, many gene
duos are lineage specific (fig. 3). Our data thus emphasize
that the formation of gene duos is a dynamic process: they
have been continuously formed and also continuously lost
during evolution. The reconstruction of the formation of
the human gene duos showed that the present-day human
h2h gene duos originated earlier in evolution than the h2t
or t2t gene duos (figs. 5 and 6). A priori one would expect
that there is an equal chance of a DNA rearrangement
yielding an h2h, h2t, or t2t gene duo. The apparently later
formation of the present-day human h2t and t2t gene duos
would then be due to a lesser stability than that of the h2h
gene duos—h2t and t2t gene duos formed prior to the
emergence of the amniotes could just have been lost again
(note that only about 15% instead of the expected 50% of
the gene duos in eutherian genomes are h2t gene duos [fig.
2 and table 1], which suggests that fixation of an h2t gene
duo is a rare event). However, our data (table 4 and supplementary fig. S2 [Supplementary Material online]) indicate that there is no major difference in stability between
h2h, h2t, and t2t gene duos in the amniote lineage, which
would imply that h2t and t2t gene duos were selected against
earlier. Evolutionary changes in the acceptability of closely
linked genes are likely. For example, about one-third of the
gene pairs in the Dm genome are closely linked gene pairs; in
the Hs genome, only about 7% (table 1). Furthermore, in
nonmammalian genomes, the frequency of h2h, h2t, and
t2t gene duos is mostly close to random; in eutherian mammalian genomes, it is not (fig. 2).
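The gene duo definition used here (adjacent, nonoverlapping protein-coding transcription units separated by 600 bp or less) and the random expectation for the three orientations can be made concrete with a small sketch. The Gene class, the function names, and the 1-based inclusive coordinate convention are assumptions introduced for illustration; they are not taken from Ensembl or from the Methods of this study.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Gene:
    start: int    # leftmost coordinate (1-based, inclusive; assumed convention)
    end: int      # rightmost coordinate (inclusive)
    strand: str   # "+" or "-"


def orientation(left_strand, right_strand):
    """Classify two adjacent genes, given in left-to-right genomic order."""
    if left_strand == right_strand:
        return "h2t"   # same strand: head of one gene faces the tail of the other
    if left_strand == "-":
        return "h2h"   # divergent transcription: the 5' ends are adjacent
    return "t2t"       # convergent transcription: the 3' ends are adjacent


def intergenic_distance(left, right):
    """Number of bases between two genes; negative if they overlap."""
    return right.start - left.end - 1


def is_gene_duo(left, right, max_distance=600):
    """True for nonoverlapping adjacent genes at most max_distance bp apart."""
    return 0 <= intergenic_distance(left, right) <= max_distance


# Random expectation: enumerate the four equally likely strand combinations.
expected = Counter(orientation(a, b) for a, b in product("+-", repeat=2))
expected_fraction = {k: v / 4 for k, v in expected.items()}

# Hypothetical example pair: divergent genes separated by 300 bp.
left = Gene(start=100, end=200, strand="-")
right = Gene(start=501, end=900, strand="+")
print(orientation(left.strand, right.strand), is_gene_duo(left, right))
```

Two of the four strand combinations give an h2t pair, which is the basis for the expected 50% mentioned above, whereas h2h and t2t are each expected at 25% if orientation were random.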
A potential problem with h2t gene pairs is transcriptional read-through from the upstream gene into the downstream gene. This can cause promoter occlusion—the
elongating RNA polymerase could remove positive transcription factors; promoter activation—the elongating
RNA polymerase could remove repressors (Callen et al.
2004; Leupin et al. 2005; Shearwin et al. 2005); or result
in the synthesis of a read-through mRNA, which encodes
a chimeric protein (Parra et al. 2006). To prevent transcriptional interference, a strong transcription termination signal
is needed between the 2 genes. The exact mechanisms and
sequence motifs that signal termination by polymerase II
are still not exactly understood (for recent reviews, see
Buratowski 2005; Rosonina et al. 2006), but it is clear that
the cleavage that precedes polyA addition is a prerequisite.
In this respect, it is of interest to note that the polyA addition
signaling in yeast is more complex than in mammalian cells
(Zhao et al. 1999) and that the distance from the polyA addition site to transcription termination in yeast may be
shorter (about 0.1 kb; Russo and Sherman 1989) than that
in mammalian cells (>0.5 kb; Rosonina et al. 2006). Stringency of polyA addition and transcription termination
could be a factor in the maintenance of h2t gene duos,
and it would be of interest to examine if polyA addition
signaling is also more complex in, for example, fugu or urochordates than in eutherian mammalian cells.
If transcription termination poses a problem for h2t
gene pairs, why are t2t gene duos not depleted? One possible explanation is that collision could cause pausing of
RNA polymerase II, which in turn enhances termination
(Zhao et al. 1999; Buratowski 2005; Rosonina et al.
2006), thereby solving the termination problem. Another possibility is that the antisense transcripts serve
a presumably regulatory, as yet unknown role. Antisense
transcription abounds in the human genome (Yelin et al.
2003; Dahary et al. 2005; Sun et al. 2005; Engström
et al. 2006) and over 40% of the human or mouse transcription units may have an antisense transcript, usually noncoding (Engström et al. 2006). The functional consequences of
antisense transcription could be a factor in the selection
against t2t gene duos.
The depletion of h2t gene duos is not quite unique to
eutherian mammalian genomes; we saw this also in the genome of the zebrafish, although not in other genomes of
lower eukaryotes. What distinguishes the eutherian mammalian genomes from that of zebrafish is a nonuniform distribution of the distance between h2h genes (fig. 1). A
subset of h2h genes with a short intergenic region is also
seen in the Gg and Xt genomes. This would suggest that
the trend toward formation of such a subset of h2h gene
pairs started in tetrapods, before the mammalian divergence. However, the earliest offshoots in mammalian
evolution, the monotremes (platypus) and marsupials
(opossum), lack this subset of h2h gene pairs. The opossum
genome has very few gene duos, which could be the reason
that this subset is not detected. The opossum genome otherwise shares the eutherian mammalian characteristic of depletion of h2t gene duos and enrichment in h2h gene duos.
The lack of gene duos in the opossum genome does not
appear to be a problem of assembly of the genome sequence
as the number of gene pairs that can be formed is about
equal to the number of protein-coding genes (table 1). The
platypus genome assembly is not yet complete as the number of potential gene pairs is only half of the number of
protein-coding genes. The paucity of gene duos may be
the reason that a subset of h2h gene pairs is not seen, but it cannot explain why the organization of the gene duos in the
platypus genome is random.

FIG. 7.—Correlation of expression of members of h2h, h2t, and t2t gene pairs. The relative number of gene pairs with specific Pearson correlation
coefficients is plotted against the Pearson correlation coefficients in increments of 0.2 units. (A) Human and mouse distribution plots of h2h, h2t, and t2t
gene pairs in several intergenic distance intervals (0–600 bp = <600; 600 bp–10 kb = 600–10 k; 10–100 kb = 10–100 k; and 100 kb–∞ = >100 k).
The numbers of human gene pairs for h2h are, respectively, 599, 594, 1,638, and 624 pairs; for h2t, 247, 2,131, 3,513, and 1,015 pairs; and for t2t, 364,
1,328, 1,425, and 441 pairs. The numbers of mouse gene pairs for h2h are, respectively, 401, 548, 1,213, and 355 pairs; for h2t, 145, 1,727, 2,574, and
608 pairs; and for t2t, 252, 1,088, 926, and 247 pairs. The number of human randomly paired genes was 3,249 pairs; for mouse, 2,197 random gene
pairs were selected. (B) Pearson correlation distribution plot for human or mouse gene duos for which microarray data are available (human: 599 h2h,
247 h2t, and 364 t2t; mouse: 401 h2h, 145 h2t, and 252 t2t). Data are based on Ensembl Version 40.

Incompletely assembled eutherian mammalian genomes with only a few gene duos (see table 1) do show
the typical depletion of h2t genes (fig. 2). If the platypus
gene organization reflects that of the mammalian ancestor,
then we must conclude that the formation of a subset of
closely linked h2h genes in chicken and Xenopus is evolutionarily independent of the emergence of such a gene organization in mammalian genomes. The alternative is that
large rearrangements have taken place in the platypus genome. The latter alternative is the most likely as almost all
the gene duos likely to have been present in the last common ancestor are no longer gene duos in the platypus genome (data not shown). It is noteworthy that platypus is
a typical mammal with respect to the density of repetitive
elements in its genome; for at least one stretch even higher
than in the human genome (Margulies et al. 2005; see also
Warren et al. 2008). Continuous insertion of repetitive elements would tend to create gene-poor and gene-rich domains and has been suggested to be one of the factors
driving genes together (e.g., Takai and Jones 2004).
Whether the platypus genome indeed contains gene-poor
and gene-rich domains awaits further analysis of that genome; if so, the organization of the gene duos in the platypus genome would then show that compaction to a gene-rich domain does not necessarily lead to an enrichment
in closely linked h2h gene pairs. Amphibian and avian genomes are relatively poor in repetitive elements (Organ
et al. 2007) but do contain a subclass of closely linked
h2h genes. Hence, there is no strict correlation between
density of repetitive elements and enrichment in closely
linked h2h gene pairs. For the h2h gene duos, it has been
suggested that the sharing of regulatory elements provides
selection pressure to maintain the gene pair (Adachi and
Lieber 2002; Trinklein et al. 2004; Lin et al. 2007; Yang
et al. 2007). In the case of the rare eutherian h2t gene
duo, it could be the transcriptional coupling or the chimeric gene product that is favorable; for the t2t gene duo, the
antisense transcript could have a regulatory role (RIKEN
Genome Exploration Research Group and Genome Science Group [Genome Network Project Core Group] and
the FANTOM Consortium 2005). It could also just be
chance that gene duos stay together: insertion of DNA
in such a short intergenic region would be a rare event.
In the case of h2h or h2t gene duos, the target area for
DNA insertion would be even smaller as repetitive elements tend to be excluded from the first 300 bp of the promoter region (Takai and Jones 2004).
Supplementary Material
Supplementary table S1 and figures S1 and S2 are
available at Molecular Biology and Evolution online
(http://www.mbe.oxfordjournals.org/).
Acknowledgments
The authors thank Gene Logic Inc. for the use of a subset of normal human and mouse tissue samples from the
Gene Logic BioExpress Database product. This work
was financially supported by the Netherlands Organization
for Advancement of Pure Research (NWO).
Literature Cited
Adachi N, Lieber MR. 2002. Bidirectional gene organization:
a common architectural feature of the human genome. Cell.
109:807–809.
Buratowski S. 2005. Connections between mRNA 3′ end
processing and transcription termination. Curr Opin Cell
Biol. 17:257–261.
Callen BP, Shearwin KE, Egan JB. 2004. Transcriptional
interference between convergent promoters caused by elongation over the promoter. Mol Cell. 14:647–656.
Dahary D, Elroy-Stein O, Sorek R. 2005. Naturally occurring
antisense: transcriptional leakage or real overlap? Genome
Res. 15:364–368.
Engström PG, Suzuki H, Ninomiya N, et al. (24 co-authors). 2006.
Complex loci in human and mouse genomes. PLoS Genet. 2:e47.
Gierman HJ, Indemans MHG, Koster J, Goetze S, Seppen J,
Geerts D, van Driel R, Versteeg R. 2007. Domain-wide
regulation of gene expression in the human genome. Genome
Res. 17:1286–1295.
Hampf M, Gossen M. 2007. Promoter crosstalk effects on gene
expression. J Mol Biol. 365:911–920.
Hansen J, Bross P, Westergaard M, Nielsen M, Eiberg H,
Børglum A, Mogensen J, Kristiansen K, Bolund L,
Gregersen N. 2003. Genomic structure of the human
mitochondrial chaperonin genes: HSP60 and HSP10 are
localised head to head on chromosome 2 separated by
a bidirectional promoter. Hum Genet. 112:71–77.
Hubbard TJP, Aken BL, Beal K, et al. (58 co-authors). 2007.
Ensembl. Nucleic Acids Res. 35:D610–D617.
Hulsen T, Huynen M, de Vlieg J, Groenen P. 2006.
Benchmarking ortholog identification methods using functional genomics data. Genome Biol. 7:R31.
Hurst LD, Pal C, Lercher MJ. 2004. The evolutionary dynamics
of eukaryotic gene order. Nat Rev Genet. 5:299–310.
Koyanagi KO, Hagiwara M, Itoh T, Gojobori T, Imanishi T.
2005. Comparative genomics of bidirectional gene pairs and
its implications for the evolution of a transcriptional regulation system. Gene. 353:169–176.
Leupin O, Attanasio C, Marguerat S, Tapernoux M,
Antonarakis SE, Conrad B. 2005. Transcriptional activation
by bidirectional RNA polymerase II elongation over a silent
promoter. EMBO Rep. 6:956–960.
Li Y-Y, Yu H, Guo Z-M, Guo T-Q, Tu K, Li Y-X. 2006.
Systematic analysis of head-to-head gene organization:
evolutionary conservation and potential biological relevance.
PLoS Comput Biol. 2:e74.
Lin JM, Collins PJ, Trinklein ND, Fu Y, Xi H, Myers RM, Weng Z.
2007. Transcription factor binding and modified histones in
human bidirectional promoters. Genome Res. 17:818–827.
Margulies EH, NISC Comparative Sequencing Program,
Maduro VVB, Thomas PJ, Tomkins JP, Amemiya CT,
Luo M, Green ED. 2005. Comparative sequencing provides
insights about the structure and conservation of marsupial and
monotreme genomes. Proc Natl Acad Sci USA. 102:3354–3359.
Organ CL, Shedlock AM, Meade A, Pagel M, Edwards SV.
2007. Origin of avian genome size and structure in non-avian
dinosaurs. Nature. 446:180–184.
Parra G, Reymond A, Dabbouseh N, Dermitzakis ET, Castelo R,
Thomson TM, Antonarakis SE, Guigo R. 2006. Tandem
chimerism as a means to increase protein complexity in the
human genome. Genome Res. 16:37–44.
Purmann A, Toedling J, Schueler M, Carninci P, Lehrach H,
Hayashizaki Y, Huber W, Sperling S. 2007. Genomic
organization of transcriptomes in mammals: coregulation
and cofunctionality. Genomics. 89:580–587.
RIKEN Genome Exploration Research Group and Genome
Science Group (Genome Network Project Core Group) and
the FANTOM Consortium. 2005. Antisense transcription in
the mammalian transcriptome. Science. 309:1564–1566.
Rosonina E, Kaneko S, Manley JL. 2006. Terminating the
transcript: breaking up is hard to do. Genes Dev. 20:
1050–1056.
Russo P, Sherman F. 1989. Transcription terminates near the
poly(A) site in the CYC1 gene of the yeast Saccharomyces
cerevisiae. Proc Natl Acad Sci USA. 86:8348–8352.
Sémon M, Duret L. 2006. Evolutionary origin and maintenance
of coexpressed gene clusters in mammals. Mol Biol Evol.
23:1715–1723.
Shearwin KE, Callen BP, Egan JB. 2005. Transcriptional
interference—a crash course. Trends Genet. 21:339–345.
Sun M, Hurst LD, Carmichael GG, Chen J. 2005. Evidence for
a preferential targeting of 3′-UTRs by cis-encoded natural
antisense transcripts. Nucleic Acids Res. 33:5533–5543.
Takai D, Jones PA. 2004. Origins of bidirectional promoters:
computational analyses of intergenic distance in the human
genome. Mol Biol Evol. 21:463–467.
The FANTOM Consortium. 2005. The transcriptional landscape
of the mammalian genome. Science. 311:1709–1711.
Trinklein ND, Aldred SF, Hartman SJ, Schroeder DI, Otillar RP,
Myers RM. 2004. An abundance of bidirectional promoters in
the human genome. Genome Res. 14:62–66.
Warren WC, Hillier LW, Marshall Graves JA, et al. (99 co-authors). 2008. Genome analysis of the platypus reveals
unique signatures of evolution. Nature. 453:175–183.
Xie X, Mikkelsen TS, Gnirke A, Lindblad-Toh K, Kellis M,
Lander ES. 2007. Systematic discovery of regulatory motifs in
conserved regions of the human genome, including thousands
of CTCF insulator sites. Proc Natl Acad Sci USA.
104:7145–7150.
Yang MQ, Koehly LM, Elnitski LL. 2007. Comprehensive
annotation of bidirectional promoters identifies co-regulation
among breast and ovarian cancer genes. PLoS Comput Biol.
3:e72.
Yelin R, Dahary D, Sorek R, et al. (16 co-authors). 2003.
Widespread occurrence of antisense transcription in the
human genome. Nat Biotechnol. 21:379–386.
Zhao J, Hyman L, Moore C. 1999. Formation of mRNA 3′ ends
in eukaryotes: mechanism, regulation, and interrelationships
with other steps in mRNA synthesis. Microbiol Mol Biol Rev.
63:405–445.
Kenneth Wolfe, Associate Editor
Accepted June 12, 2008