Volume 16 Number 9 1988
Nucleic Acids Research
Close relationship between non-viral retroposons in Drosophila melanogaster
Pier Paolo Di Nocera
European Molecular Biology Laboratory, Meyerhofstrasse 1, D-6900, Heidelberg, FRG and
International Institute of Genetics and Biophysics, CNR, Via Marconi 10, Naples, Italy
Received February 12, 1988; Revised and Accepted March 29, 1988
Accession no.X06950
ABSTRACT
G elements constitute one of the several moderately repeated DNA families of the
Drosophila melanogaster genome. G elements lack terminal repetitions and structurally resemble
mammalian processed pseudogenes because they terminate at one end in oligo-A tracts of
variable length. G elements are mostly interspersed in the chromocentric heterochromatin with
other repeated DNA sequences. Nucleotide sequence analysis of G3A, a family member inserted
in a non-nucleolar rDNA unit, shows that functional G elements might have coding capacity for
two polypeptides; one has homology to reverse transcriptases, the other is reminiscent of RNA
binding proteins derived from the cleavage of retroviral gag polyproteins. Functionally related
polypeptides are similarly encoded by members of two other Drosophila repeated DNA families,
the F elements and the I factors. The similarity in structural organization and the relatedness of
their potential gene products favors the hypothesis that G, F and I sequences derive from a
common ancestor and result from processes based on the reverse transcription of RNA
intermediates that probably differ markedly from those ensuring the maintenance and dispersion
of copia-like elements.
INTRODUCTION
In addition to the integrated forms of mammalian retroviruses, many eukaryotic sequences
might originate from the reverse transcription of RNA intermediates (1-4). These sequences,
commonly referred to as retroposons, can be sorted, on the basis of common properties, into two
major groups. Viral retroposons include DNA sequences whose structural organization closely
resembles that of vertebrate retroviruses, like murine intracisternal A-particles and Drosophila
copia-like or yeast Ty elements (4-7). Non-viral retroposons comprise a heterogeneous set of
sequences that correspond to partial or complete copies of cellular RNA species. They include
processed pseudogenes, processed snRNA pseudogenes, highly repeated sequences such as Alu
or L1 elements and a variety of other sequences (2-4). Retroposons of this type exhibit no
consistent structural similarities, except for the frequent presence of oligo-A stretches at one end
(2-4). A major difference between viral and non-viral retroposons is that the former encode
enzymes responsible for their dispersion, but there are exceptions to this general rule. L1
sequences are a superfamily of oligo(A)-terminated sequences spread, as species-specific
versions, throughout mammals in more than 10^4 copies per haploid genome (reviewed in 8 and
9). Many L1 units are truncated at the 5' end and/or contain internal deletions or inversions;
however, sequence analysis of randomly selected clones indicates that complete family members
© IRL Press Limited, Oxford, England.
might encode a reverse transcriptase (10-13). Drosophila F elements (14,15) resemble L1
sequences in their structural organization, and potentially encode a homologous polypeptide
(16). L1-like reverse transcriptases are also encoded by Drosophila I factors (17), Bombyx mori
R1 and R2 elements (18,19) and Trypanosoma brucei ingi elements (20). Some of these
elements in addition encode polypeptides structurally related to cleavage products of retroviral
gag polyproteins (16-18).
Like F elements, Drosophila G elements terminate at one end in oligo-A tracts often preceded
by polyadenylylation signals (21-22). We show here that this structural similarity reflects a
common origin, because G and F elements have the same genetic organization and potentially
encode homologous polypeptides.
MATERIALS AND METHODS
DNA sequence analysis
The two Hind III fragments from lambda G3 (22) that include the G3A element (see fig. 1)
were subcloned and a detailed restriction map established. Suitable restriction fragments were
subcloned into the multilinker region of pEMBL8 (23), and their DNA sequence determined by
hybridizing 1-2 pmoles of plasmid DNA to 0.5 pmoles of either direct or reverse M13 primers as
described by Hattori et al. (24). Elongation of annealed primer molecules was catalyzed, in the
presence of [35S] dATP and appropriate mixtures of dNTPs and ddNTPs, by bacteriophage T7
encoded DNA polymerase (25) utilizing the Sequenase™ kit from USB (United States
Biochemical Corp.). Reaction products were resolved on 6% acrylamide-8M urea gels (26). The
sequencing strategy adopted is diagrammed in fig. 1. Restriction sites exploited for subcloning
were crossed at least on one strand; ambiguities were resolved by determining the nucleotide
sequence according to the chemical method of Maxam and Gilbert (26).
RESULTS
Sequence analysis of a complete D. melanogaster G element
G elements are predominantly associated with repeated DNA sequences (21,22). The
frequent association with rDNA and rDNA insertion sequences is a consequence of the
insertion of G family members at a specific site within the non-transcribed spacer region of
rDNA units located outside the nucleolus organizer (22). The DNA region cloned in lambda G3
is representative of the peculiar interspersion of G and ribosomal sequences (fig. 1). The two G
elements in lambda G3 probably represent full-length family members; both are framed by target
site duplications and, unlike other cloned specimens, are not interrupted by the insertion of foreign
DNA (22). The two Hind III fragments including the G3A element were cloned, and the
complete G3A nucleotide sequence determined (see fig. 1 and MATERIALS AND METHODS).
G3A is 4346 base pairs long (fig. 2). The element terminates at one end in a stretch of 19
adenosine residues and is flanked by 9 bp long target site duplications (22). The distribution of
Fig. 1. Structure of the G3A element. The D. melanogaster DNA region cloned in phage lambda
G3 (22) is schematically shown on top. Black boxes indicate rDNA non transcribed spacer
(NTS) sequences, the stippled box uncharacterized repeated DNA sequences. Relevant
restriction sites within the G3A element are shown; bars indicate subcloned DNA fragments
whose nucleotide sequence has been determined (see MATERIALS AND METHODS).
Restriction sites are as follows: A, Acc I; B, Bam HI; Bg, Bgl II; C, Cla I; D, Dra III; E, Eco
RI; H, Hind III; Ps, Pst I; Pv, Pvu II; S, Sac I; X, Xba I.
ATG and stop codons along the three G3A frames reading toward the terminal A-rich tract is shown
in fig. 3.
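A codon map of the kind summarized in fig. 3 can be produced for any sequence with a short scan. The Python sketch below is an illustration only and is not the procedure used for G3A: the function name `forward_frame_codons` and the toy sequence are hypothetical, and the standard stop codons TAA, TAG and TGA are assumed.

```python
# Assumption: the three standard stop codons of the universal genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def forward_frame_codons(seq):
    """Return ATG and stop codon positions for the three forward frames.

    The result is a list of three dicts (frames 0, 1, 2); each dict holds
    0-based start positions of ATG codons and of stop codons in that frame.
    """
    seq = seq.upper()
    frames = []
    for offset in range(3):
        atg_positions, stop_positions = [], []
        # Step through the sequence codon by codon in this frame.
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG":
                atg_positions.append(i)
            elif codon in STOP_CODONS:
                stop_positions.append(i)
        frames.append({"atg": atg_positions, "stop": stop_positions})
    return frames


# Toy sequence (hypothetical, not taken from G3A).
toy = "ATGAAATAGCATGCCCTGA"
for number, frame in enumerate(forward_frame_codons(toy)):
    print(number, frame)
```

Stretches between consecutive stop positions in one frame correspond to the open reading frames drawn between the vertical stop marks of such a map.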
Homology of orf E to reverse transcriptases
The structure of G3A and of other cloned G elements (21,22) suggests that these sequences
might represent, like processed pseudogenes, cDNA copies of polymerase II transcripts.
Recently, it has been suggested that long oligo-A terminated sequences, such as mammalian L1
and Drosophila F elements, might originate from the self-mediated cDNA conversion of
transcripts that encode reverse transcriptase-like polypeptides (10,11,12,13,16). The 210 amino
acid long G3A orf E (fig. 3) contains many amino acids identified by Toh et al. (27) as invariant
residues in viral reverse transcriptases. Segments from orf E are aligned with conserved domains
of known and hypothetical reverse transcriptases in fig. 4. Notably, the highest homology is
found between orf E and the reverse transcriptase encoded by Drosophila F elements (62%
similarity). Significant homologies are also found with the reverse transcriptases encoded by
Drosophila I factors, mammalian L1 elements, B. mori rDNA insertions and T. brucei ingi
elements (fig. 4), the number of amino acid matches ranging from 30 to 40%. Like all these
hypothetical polypeptides, orf E differs markedly from the reverse transcriptases encoded by
vertebrate retroviruses, both in terms of identical or chemically related amino acids and with
respect to the distances between conserved domains. As previously observed for F elements and
I factors (16,17), the homology between orf E and the reverse transcriptase encoded by the
[Fig. 2 sequence listing: the 4346 bp nucleotide sequence of G3A, numbered in rows of 100 residues, is not recoverable from the extracted text (accession no. X06950).]
Fig. 2. Nucleotide sequence of the G3A element. The complete nucleotide sequence of G3A
is shown. Broken and continuous lines below sequence residues denote G-orf0 and G-orf1,
respectively. A double dashed line marks the hypothetical G-orf2; dots denote frameshift
regions, asterisks stop codons (see text for a detailed description).
Drosophila 17.6 element, or by other copia-like elements (data not shown), is poor, and is even
lower than that found with retroviral polymerases.
Homology between G and F elements
The close relationship between G and F elements which emerged from the previous analysis
prompted us to search for additional matches between F-orf2, the 859 amino acid long open
reading frame encoded by F elements that includes the reverse transcriptase-like domain (16),
and other G3A encoded orfs. Segments from seven adjacent orfs (A to G in fig. 3) distributed
on the three reading frames of the G3A element indeed exhibit homology to F-orf2 (fig. 5). With
a small number of nucleotide changes, these segments can be joined into a long uninterrupted
orf, which we will refer to as G-orf2, that might be slightly longer than F-orf2 at both termini.
The continuous homology to F-orf2 allows us to define rather precisely where the insertion or
deletion of single nucleotides might have occurred, thereby breaking the continuity of G-orf2
(see figs. 2 and 5). Deletion of a few oligonucleotides might have caused the loss of amino acid
residues at the boundaries between the E and F and F and G orfs (fig. 5).
Taking into account only positional identities, the overall similarity of G-orf2 and F-orf2
exceeds 40% (fig. 6). The degree of similarity varies along the proposed alignment; in addition
to the reverse transcriptase-like domain, a second highly conserved domain is found at the amino
terminus of F-orf2 and G-orf2 (figs. 5 and 6). The homology between the two regions not only
supports the hypothesis of an evolutionary link between F and G elements, but also suggests
that constraints are operating to maintain a biological function.
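The percentage of positional identities quoted above, and its variation along the alignment, can be computed from any pair of gapped, equal-length aligned sequences. The following Python sketch is only an illustration under stated assumptions: the function names `positional_identity` and `windowed_identity` are hypothetical, gaps are written as "-", and gapped columns are excluded from the denominator (the text does not state how gaps were counted).

```python
def positional_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the ungapped aligned columns.

    Assumption: columns where either sequence has a gap are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def windowed_identity(aligned_a, aligned_b, window):
    """Identity per consecutive, non-overlapping window of alignment columns."""
    return [
        positional_identity(aligned_a[start:start + window],
                            aligned_b[start:start + window])
        for start in range(0, len(aligned_a), window)
    ]


# Toy aligned peptides (hypothetical, not taken from G-orf2 or F-orf2).
print(positional_identity("ACDE-F", "ACGEHF"))
print(windowed_identity("AAAACCCC", "AAAAGGGG", 4))
```

Plotting the values returned by `windowed_identity` against window position gives a profile of the kind shown in fig. 6, where conserved domains appear as runs of high-identity windows.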
G-ORF1
In most retroviruses, the pol gene is preceded by the gag gene. The primary translation
product of the gag gene is a polyprotein eventually cleaved into virion core structural proteins
(34). One of these is a nucleic acid binding protein (NBP) structurally characterized by the
presence of one or multiple adjacent copies of the amino acid motif Cx2Cx4Hx4C (34).
Cysteine-rich motifs of the same kind are reiterated within orfs preceding the hypothetical
[Fig. 3 diagram: map of ATG and stop codons in the three forward frames of G3A on a base-pair scale, with G-orf0, G-orf1 and orf segments A to G marked.]
Fig. 3. G3A orfs. The three forward frames of the G3A sequence are represented as horizontal
lines; vertical lines above and below indicate ATG and translational stop codons, respectively.
Stippled boxes denote segments of the hypothetical G-orf2 (see text).
[Fig. 4 alignment: conserved segments of G-orf E (row G) aligned with rows F, I, L1Hs, L1Md, R1Bm, R2Bm, ingi, HTLV-I, RSV, HBV, MoMLV and 17.6; residue-level data are not recoverable from the extracted text.]
Fig. 4. Similarities between orf E and reverse transcriptases. G-orf E segments are aligned with
homologous regions from known and hypothetical reverse transcriptases. The single-letter
amino acid code is used. Bold face characters denote identical residues, underlined characters
denote chemically related amino acids grouped as in Schwartz and Dayhoff (28). Dots and open
circles indicate positions respectively occupied by identical or similar amino acids among a large
group of viral reverse transcriptases (27). F, Drosophila Fw element, orf2 (16); I, Drosophila I
factor, orf2 (17); L1Hs, Homo sapiens L1 consensus orf sequence (12); L1Md, Mus
domesticus L1 element, clone L1Md-A2, orf2 (11); R1Bm, Bombyx mori rDNA type I
insertion, orf2 (18); R2Bm, Bombyx mori rDNA type II insertion (19); ingi, Trypanosoma
brucei ingi element (20); HTLV-I, human adult T-cell leukemia virus type I (29); RSV, Rous
sarcoma virus (30); HBV, hepatitis B virus (31); MoMLV, Moloney murine leukemia virus
(32); 17.6, Drosophila 17.6 element, orf2 (33). Numbers in brackets refer to amino acid
residues present between the reported regions.
reverse transcriptases in F elements, I factors and R1Bm elements (16-18). One copy of the
Cx2Cx4Hx4C motif, and two adjacent imperfect ones (fig. 7a) are found within G-orf1, a 241
amino acid long orf that partially overlaps G-orf2 (figs. 2 and 3). In fig. 7b the G-orf1 region
encompassing the cysteine-rich motifs is aligned with the homologous regions in F-orf1 and
I-orf1. The homology between G-orf1 and F-orf1 is about twice as high as that between G-orf1
and I-orf1 (45 versus 24% of positional identities).
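Perfect copies of the cysteine-rich motif can be located in a protein sequence with a regular expression in which each "x" of the motif stands for any residue. The Python sketch below is an illustration only: the function name `find_cys_motifs` and the toy peptide are hypothetical, and the pattern finds perfect copies only, so imperfect copies such as the two mentioned above would not be reported.

```python
import re

# C-x2-C-x4-H-x4-C, wrapped in a lookahead so that overlapping copies
# are also reported. Each "." matches any single residue.
CYS_MOTIF = re.compile(r"(?=(C.{2}C.{4}H.{4}C))")


def find_cys_motifs(protein):
    """Return (0-based start, matched segment) for each perfect motif copy."""
    return [(m.start(), m.group(1)) for m in CYS_MOTIF.finditer(protein.upper())]


# Toy peptide (hypothetical) containing one perfect copy of the motif.
print(find_cys_motifs("MKCFRCQGFGHTQRYCAA"))
```

Relaxing the pattern, for example by allowing a variable spacer length such as `.{3,5}`, is one possible way to screen for imperfect copies, at the cost of more spurious hits.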
In the 5' portion of G3A a 242 amino acid long orf (G-orf0) extends from residue 221 to 948
overlapping G-orf1 (figs. 2,3,7c). As hypothesized for G-orf2, it is possible that G-orf0 and
G-orf1 are part of a single frame in functional G elements and that mutations have interrupted their
continuity in G3A. Because of the 5' truncation of the Fw element (16) the amino terminus of
F-orf1 is unknown. However, amino acids homologous to the G-orf0 residues underlined in
fig. 7c are found at the expected position in F19, a full length F element (15; P.P. Di Nocera,
[Fig. 5 alignment: segments of G3A orfs A through G (rows labelled G) aligned with F-orf2 (rows labelled F); residue-level data are not recoverable from the extracted text.]
Fig. 5. Alignment of G orf segments to F-orf2. Segments from G3A orfs A through G (see fig.
3) are aligned with the 859 amino acid long F-orf2 encoded by the Fw element (16). Slashes
through the line above amino acid residues signal where frameshifts need to be introduced to
adjoin segments from different G3A orfs. Dots denote amino acid residues that cannot be
unambiguously assigned to either one of the adjacent orfs because of the lack of homology to
F-orf2. Dashes indicate amino acid gaps. Filled circles and crosses below sequence lines denote
identical residues and favored amino acid substitutions grouped as in fig. 4, respectively. Stop
codons are indicated by asterisks.
[Fig. 6 histogram: percentage of amino acid identities (vertical axis) against residue number (horizontal axis), with the boundaries of orf segments A to G marked above.]
Fig. 6. Amino acid identities between G-orf2 and F-orf2.
[Fig. 7 panels: A) alignment of Cys motifs, including rows labelled Copia, RSV, MoMSV and HTLV-I, with the consensus C..C....H....C; B) alignment of G-orf1 to F-orf1 and I-orf1; C) amino acid sequences of G-orf0 and G-orf1. Residue-level data are not recoverable from the extracted text.]
Fig. 7. Cysteine-rich motifs within G-orf1. A) Alignment of Cys motifs. F, Fw element, orf1
(16); I, I factor orf1 (17); R1Bm, R1Bm element, orf1 (18); copia, copia element, orf1 (35);
RSV, Rous sarcoma virus, gag P12 (30); HTLV-I, human adult T-cell leukemia virus type I,
gag P15 (29); MoMLV, Moloney murine leukemia virus, gag P10 (32). B) Alignment of G-orf1
to F-orf1 and I-orf1. Similarities between amino acid residues are outlined as in fig. 5. C)
Amino acid sequences of G-orf0 and G-orf1. Possible initiator methionines are marked by
triangles. For underlined amino acid residues in G-orf0 see text.
unpublished). Four of the five ATG codons within G-orf0 might correspond, according to the
consensus established by Kozak (36), to an initiator methionine (fig. 7c). The break in
homology between G-orf0 and its counterpart in F19 upstream of the signalled region favors the
hypothesis that translation of the NBP-like protein might start from the second ATG in G-orf0.
DNA homologies between G and F elements
Using the algorithm of Wilbur and Lipman (37) we searched for the registers of comparison that
have the largest number of short perfect matches between the DNA sequences of the G3A and
the Fw elements. By combining the results of this analysis with those independently obtained
from the amino acid matches shown in figs. 5 and 6 we derived an overall alignment of the two
elements up to their terminal A-rich 3' ends (fig. 8). Representative examples of the degree of
homology of G3A and Fw DNAs are shown in fig. 8. Differences between the two elements
along their coding regions, except those leading to the frameshifts illustrated in fig. 5, are due to
insertions or deletions of triplets. The two elements show a major divergence at the junction
region between orf1 and orf2, where a segment of about 100 bp is absent in G3A (fig. 8). The
length of the region encompassing this site is similar in G3B and other G family members (22;
data not shown). Further analyses might clarify whether this observation has any functional
implication.
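The search for registers of comparison with the largest number of short perfect matches can be pictured as counting shared k-tuples along each diagonal (offset) of the two sequences. The Python sketch below is a simplified illustration of that idea only, not the published algorithm of Wilbur and Lipman in full, which goes on to align the sequences within a band around the best registers. The function name `best_registers`, the tuple length `k=4` and the toy sequences are assumptions.

```python
from collections import Counter, defaultdict


def best_registers(seq_a, seq_b, k=4, top=3):
    """Rank registers (offsets i - j) by the number of shared perfect k-tuples.

    A register d means that position i of seq_a is compared with
    position i - d of seq_b.
    """
    # Index every k-tuple of seq_b by its start positions.
    index = defaultdict(list)
    for j in range(len(seq_b) - k + 1):
        index[seq_b[j:j + k]].append(j)

    # For every k-tuple of seq_a, credit the diagonal of each match in seq_b.
    diagonals = Counter()
    for i in range(len(seq_a) - k + 1):
        for j in index.get(seq_a[i:i + k], ()):
            diagonals[i - j] += 1
    return diagonals.most_common(top)


# Toy sequences (hypothetical): seq_b occurs in seq_a shifted by 3 residues.
print(best_registers("TTTACGTACGGA", "ACGTACGG"))
```

In this toy case the register with offset 3 collects the most k-tuple matches, which is the diagonal along which the two sequences would then be aligned.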
[Fig. 8 histogram and alignments: percentage of DNA sequence homology (vertical axis, 20 to 80) in 50 bp intervals, followed by three aligned G3A/Fw sequence segments; residue-level data are not recoverable from the extracted text.]
Fig. 8. DNA sequence homology between G3A and Fw. The histogram reports the
percentage of DNA sequence homology derived from the alignment of the G3A element, from
residue 950 to the oligo-A rich end, to the entire Fw element (16). Bars correspond to 50 bp
intervals. The region devoid of bars corresponds to a segment of the Fw sequence whose
counterpart is absent in G3A. Three examples of the alignment between G3A and Fw DNA
sequences are shown. G3A sequences are at the top. Vertical lines denote base identities.
DISCUSSION
DNA sequence analysis of G3A, a member of the G family inserted in a non-nucleolar rDNA
unit (22), further supports the hypothesis that in Drosophila mobile elements other than copia
and copia-like sequences might transpose via RNA intermediates (16,17). G elements potentially
encode polypeptides homologous to reverse transcriptases and nucleic acid binding proteins
derived from retroviral gag polyproteins. It is therefore likely that these elements, as previously
proposed for Drosophila F elements and I factors (16,17), and other repeated DNA sequences
from different species (10,11,12,18,19,20), originate from the self-mediated cDNA conversion
of RNA moieties. The similarity in structural and functional organization of these three
retroposons suggests that they derive from a common ancestor. According to this view, the
relative percentage of positional identities shared by the hypothetical gene products (figs.
4,5,6,7,9; see also ref. 16) suggests that divergence between G and F elements and I factors
must have occurred very early in evolution. The relationship between F and G elements is also
supported by the alignment of their nucleotide sequences (fig. 8).
Whereas F elements and I factors are found at different chromosomal sites in Drosophila
strains, and mutants associated with their insertions have been described (6,16,17), evidence of
retroposition for G elements is poor. The G family members characterized thus far are associated
with repeated DNA (21,22), and in situ hybridization experiments indicate that G elements are
restricted to the chromocenter (22). The degree of homology of the polymerases potentially
encoded by F and G is however a strong indication that a selective constraint is operating to
preserve a function; in addition, whole Southern analysis reveals a few qualitative differences in
the genomic distribution of G elements among laboratory fly stocks (22). While further
investigations are needed to clarify this issue, both observations can be taken as an indication
that the process of de novo formation of G elements might still operate. It is also possible that
functional G elements are predominantly present, like I factors (38), only in a few fly
populations, and/or are similarly mobilized only in certain genetic backgrounds.
The molecular mechanisms leading to the dispersal of this type of retroposon are currently
poorly understood. It has been suggested that these elements are capable of further rounds of
retroposition because they contain an internal promoter (17); alternatively most family members
might represent functionless copies originating from a few intact master elements. The
cysteine-rich polypeptides might interact with the reverse transcriptase and play a structural role in the
process of cDNA conversion; alternatively they might have a regulatory function, since cysteine
and histidine residues, though in a different relative spatial arrangement, have been shown to
constitute the functional domain of a variety of eukaryotic DNA binding proteins (39).
In many respects, G elements are reminiscent of ribosomal insertions. Ribosomal type I and
II insertions are non homologous sequences occurring at nearby sites within the 28S portion of
more than 40% of Drosophila rDNA genes (reviewed in 40). rDNA insertions lack terminal
repeats, and type I sequences have oligo-A tails at one terminus (40). Type I insertions also
occur in the chromocenter, arranged as G elements in tandem arrays (14,41). Many G elements
(for example G3A) are inserted in the non transcribed spacer of non-nucleolar rDNA units, at a
site that is remarkably homologous to the 28S gene interval targeted by ribosomal insertions
(22). The notion of a close relationship between rDNA insertions and G elements is further
strengthened by the finding that rDNA insertions in B. mori have the same genetic organization
as G elements (18,19; see also figs. 4 and 7a). It is tempting therefore to speculate that the
similarity of the integration sites of these sequences might be the consequence of a relatively
sequence-specific endonucleolytic activity associated with one of their hypothetical gene
products.
The Drosophila genome presumably harbours additional families of non viral retroposons.
Jockey elements have no terminal repetitions (42); doc (43) and D (44) elements feature oligo-A
tracts at one end. It would not be surprising if any of these elements were shown to be related to
G and F elements.
ACKNOWLEDGMENTS
We wish to thank Drs. Thomas Eickbush and David Finnegan for communicating results
prior to publication, and Giovanna Grimaldi and Graham Tebb for critical reading of the
manuscript. The work done in Italy was supported by Progetto Finalizzato Ingegneria Genetica e
Basi Molecolari delle Malattie Ereditarie of the C.N.R.
REFERENCES
1. Rogers, J.E. (1983) Nature 301, 460.
2. Rogers, J.H. (1985) Int. Rev. Cytol. 93, 187-279.
3. Wilde, C.D. (1986) Crit. Rev. Bioch. 19, 323-352.
4. Weiner, A.M., Deininger, P.L. and Efstratiadis, A. (1986) Ann. Rev. Biochem. 55, 631-661.
5. Finnegan, D.J. (1986) Int. Rev. Cytol. 93, 281-326.
6. Finnegan, D.J. and Fawcett, D.H. (1986) In Oxford Surveys on Eucaryotic genes, Oxford
University Press, Oxford, Vol. 3, pp. 1-62.
7. Boeke, J.D., Garfinkel, D.J., Styles, G.A. and Fink, G.R. (1985) Cell 40, 491-500.
8. Singer, M.F. (1982) Int. Rev. Cytol. 76, 67-112.
9. Singer, M.F. and Skowronski, J. (1985) Trends in Biol. Sci. 10, 119-122.
10. Skowronski, J. and Singer, M.F. (1986) Cold Spring Harbor Symp. Quant. Biol. 51,
457-464.
11. Loeb, D.D., Padgett, R.W., Hardies, S.C., Shehee, W.R., Comer, M.B., Edgell, M.H.
and Hutchison, C.A. (1986) Mol. Cell. Biol. 6, 168-182.
12. Hattori, M., Kuhara, S., Takenaka, O. and Sakaki, Y. (1986) Nature 321, 625-628.
13. Fanning, T. and Singer, M. F. (1987) Nucleic Acids Res. 15, 2251-2260.
14. Dawid, I.B., Long, E.O., Di Nocera, P.P. and Pardue, M.L. (1981) Cell 25, 399-408.
15. Di Nocera, P.P., Digan, M.E. and Dawid, I.B. (1983) J. Mol. Biol. 168, 715-728.
16. Di Nocera, P.P. and Casari, G. (1987) Proc. Natl. Acad. Sci. USA 84, 5843-5847.
17. Fawcett, D.H., Lister, C.K., Kellett, E. and Finnegan, D.J. (1986) Cell 47, 1007-1015.
18. Xiong, Y. and Eickbush, T.H. (1988) Mol. Cell. Biol. 8, 114-123.
19. Burke, W.D., Calalang, C.C. and Eickbush, T.H. (1987) Mol. Cell. Biol. 7, 2221-2230.
20. Kimmel, B.E., Ole-Moiyoi, O.K. and Young, J.R. (1987) Mol. Cell. Biol. 7, 1465-1475.
21. Di Nocera, P.P. and Dawid, I.B. (1983) Nucleic Acids Res. 11, 5475-5482.
22. Di Nocera, P.P., Graziani, F. and Lavorgna, G. (1986) Nucleic Acids Res. 14, 675-691.
23. Dente, L., Cesareni, G. and Cortese, R. (1983) Nucleic Acids Res. 14, 1267-1277.
24. Hattori, M. and Sakaki, Y. (1986) Analyt. Biochem. 152, 232-238.
25. Tabor, S. and Richardson, C.C. (1987) Proc. Natl. Acad. Sci. USA 84, 4767-4771.
26. Maxam, A. and Gilbert, W. (1980) Methods in Enzymology 65, 499-560.
27. Toh, H., Kikuno, R., Hayashida, H., Miyata, T., Kugimiya, W., Inouye, S., Yuki, S.
and Saigo, K. (1985) EMBO J. 4, 1267-1272.
28. Schwartz, R.M. and Dayhoff, M.O. (1978) In Atlas of protein sequence and structure,
National Biomedical Research Foundation, Washington, Vol. 5, pp. 353-358.
29. Seiki, M., Hattori, S., Hirayama, Y. and Yoshida, M. (1983) Proc. Natl. Acad. Sci. USA
80, 3618-3622.
30. Schwartz, D.E., Tizard, R. and Gilbert, W. (1983) Cell 32, 853-869.
31. Galibert, F., Mandart, E., Fitoussi, F., Tiollais, P. and Charnay, P. (1979) Nature 281,
646-650.
32. Shinnick, T.M., Lerner, R.A. and Sutcliff, J.G. (1981) Nature 293, 543-548.
33. Saigo, K., Kugimiya, W., Matsuo, Y., Inouye, S., Yoshioka, K. and Yuki, S.C. (1984)
Nature 312, 659-661.
34. Covey, S.N. (1986) Nucleic Acids Res. 14, 623-633.
35. Mount, S.M. and Rubin, G.M. (1985) Mol. Cell. Biol. 5, 1630-1638.
36. Kozak, M. (1984) Nucleic Acids Res. 12, 857-872.
37. Wilbur, W.J. and Lipman, D.J. (1983) Proc. Natl. Acad. Sci. USA 80, 726-730.
38. Bregliano, J.C. and Kidwell, M.G. (1983) In J. Shapiro (ed.), Mobile Genetic Elements,
Academic Press, New York, pp. 363-410.
39. Berg, J.M. (1986) Science 232, 485-487.
40. Beckingham, K. (1982) In Busch, H. and Rothblum, L. (eds.) The cell nucleus, Academic
Press, New York, Vol. X, part A, pp. 205-269.
41. Roiha, H., Miller, J.R., Woods, L.C. and Glover, D.M. (1981) Nature 290, 749-753.
42. Mizrokhi, L.J., Obolenkova, L.A., Priimagi, A.F., Ilyin, Y.V., Gerasimova, T. and
Georgiev, G.P. (1985) EMBO J. 4, 3781-3787.
43. Schneuwly, S., Kuroiwa, A. and Gehring, W.J. (1987) EMBO J. 6, 201-206.
44. Pittler, S.J. and Davis, R.L. (1987) Mol. Gen. Genet. 208, 325-328.