Volume 16 Number 9 1988 Nucleic Acids Research

Close relationship between non-viral retroposons in Drosophila melanogaster

Pier Paolo Di Nocera

European Molecular Biology Laboratory, Meyerhofstrasse 1, D-6900, Heidelberg, FRG and International Institute of Genetics and Biophysics, CNR, Via Marconi 10, Naples, Italy

Received February 12, 1988; Revised and Accepted March 29, 1988. Accession no. X06950

ABSTRACT

G elements constitute one of the several moderately repeated DNA families of the Drosophila melanogaster genome. G elements lack terminal repetitions and structurally resemble mammalian processed pseudogenes because they terminate at one end in oligo-A tracts of variable length. G elements are mostly interspersed in the chromocentric heterochromatin with other repeated DNA sequences. Nucleotide sequence analysis of G3A, a family member inserted in a non-nucleolar rDNA unit, shows that functional G elements might have coding capacity for two polypeptides; one has homology to reverse transcriptases, the other is reminiscent of RNA binding proteins derived from the cleavage of retroviral gag polyproteins. Functionally related polypeptides are similarly encoded by members of two other Drosophila repeated DNA families, the F elements and the I factors. The similarity in structural organization and the relatedness of their potential gene products favor the hypothesis that G, F and I sequences derive from a common ancestor and result from processes based on the reverse transcription of RNA intermediates that probably differ markedly from those ensuring the maintenance and dispersion of copia-like elements.

INTRODUCTION

In addition to the integrated forms of mammalian retroviruses, many eukaryotic sequences might originate from the reverse transcription of RNA intermediates (1-4). These sequences, commonly referred to as retroposons, can be sorted, on the basis of common properties, into two major groups.
Viral retroposons include DNA sequences whose structural organization closely resembles that of vertebrate retroviruses, like murine intracisternal A-particles and Drosophila copia-like or yeast Ty elements (4-7). Non-viral retroposons comprise a heterogeneous set of sequences that correspond to partial or complete copies of cellular RNA species. They include processed pseudogenes, processed snRNA pseudogenes, highly repeated sequences such as Alu or L1 elements and a variety of other sequences (2-4). Retroposons of this type exhibit no consistent structural similarities, except for the frequent presence of oligo-A stretches at one end (2-4).

A major difference between viral and non-viral retroposons is that the former encode enzymes responsible for their dispersion, but there are exceptions to this general rule. L1 sequences are a superfamily of oligo(A)-terminated sequences spread, as species-specific versions, throughout mammals in more than 10^4 copies per haploid genome (reviewed in 8 and 9). Many L1 units are truncated at the 5' end and/or contain internal deletions or inversions; however, sequence analysis of randomly selected clones indicates that complete family members might encode a reverse transcriptase (10-13). Drosophila F elements (14,15) resemble L1 sequences in their structural organization, and potentially encode a homologous polypeptide (16). L1-like reverse transcriptases are also encoded by Drosophila I factors (17), Bombyx mori R1 and R2 elements (18,19) and Trypanosoma brucei ingi elements (20). Some of these elements in addition encode polypeptides structurally related to cleavage products of retroviral gag polyproteins (16-18).

© IRL Press Limited, Oxford, England.

Like F elements, Drosophila G elements terminate at one end in oligo-A tracts often preceded by polyadenylylation signals (21,22).
We show here that this structural similarity reflects a common origin, because G and F elements have the same genetic organization and potentially encode homologous polypeptides.

MATERIALS AND METHODS

DNA sequence analysis

The two Hind III fragments from lambda G3 (22) that include the G3A element (see fig. 1) were subcloned and a detailed restriction map established. Suitable restriction fragments were subcloned into the multilinker region of pEMBL8 (23), and their DNA sequence determined by hybridizing 1-2 pmoles of plasmid DNA to 0.5 pmoles of either direct or reverse M13 primers as described by Hattori et al. (24). Elongation of annealed primer molecules was catalyzed, in the presence of [35S]dATP and appropriate mixtures of dNTPs and ddNTPs, by bacteriophage T7 encoded DNA polymerase (25) utilizing the Sequenase™ kit from USB (United States Biochemical Corp.). Reaction products were resolved on 6% acrylamide-8 M urea gels (26). The sequencing strategy adopted is diagrammed in fig. 1. Restriction sites exploited for subcloning were crossed on at least one strand; ambiguities were resolved by determining the nucleotide sequence according to the chemical method of Maxam and Gilbert (26).

RESULTS

Sequence analysis of a complete D. melanogaster G element

G elements are predominantly associated with repeated DNA sequences (21,22). The frequent association with rDNA and rDNA insertion sequences is a consequence of the insertion of G family members at a specific site within the non-transcribed spacer region of rDNA units located outside the nucleolus organizer (22). The DNA region cloned in lambda G3 is representative of the peculiar interspersion of G and ribosomal sequences (fig. 1). The two G elements in lambda G3 probably represent full-length family members; both are framed by target site duplications and, unlike other cloned specimens, are not interrupted by the insertion of foreign DNA (22).
The two Hind III fragments including the G3A element were cloned, and the complete G3A nucleotide sequence determined (see fig. 1 and MATERIALS AND METHODS). G3A is 4346 base pairs long (fig. 2). The element terminates at one end in a stretch of 19 adenosine residues and is flanked by 9 bp long target site duplications (22). The distribution of ATG and stop codons along the three G3A reading frames oriented toward the terminal A-rich tract is shown in fig. 3.

Fig. 1. Structure of the G3A element. The D. melanogaster DNA region cloned in phage lambda G3 (22) is schematically shown on top. Black boxes indicate rDNA non-transcribed spacer (NTS) sequences, the stippled box uncharacterized repeated DNA sequences. Relevant restriction sites within the G3A element are shown; bars indicate subcloned DNA fragments whose nucleotide sequence has been determined (see MATERIALS AND METHODS). Restriction sites are as follows: A, Acc I; B, Bam HI; Bg, Bgl II; C, Cla I; D, Dra III; E, Eco RI; H, Hind III; Ps, Pst I; Pv, Pvu II; S, Sac I; X, Xba I.

Homology of orf E to reverse transcriptases

The structure of G3A and of other cloned G elements (21,22) suggests that these sequences might represent, like processed pseudogenes, cDNA copies of polymerase II transcripts. Recently, it has been suggested that long oligo-A terminated sequences, such as mammalian L1 and Drosophila F elements, might originate from the self-mediated cDNA conversion of transcripts that encode reverse transcriptase-like polypeptides (10,11,12,13,16). The 210 amino acid long G3A orf E (fig. 3) contains many amino acids identified by Toh et al. (27) as invariant residues in viral reverse transcriptases. Segments from orf E are aligned with conserved domains of known and hypothetical reverse transcriptases in fig. 4. Notably, the highest homology is found between orf E and the reverse transcriptase encoded by Drosophila F elements (62% similarity).
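The percentages quoted for these comparisons are match counts over aligned positions. As a rough illustration only, the sketch below computes positional identity between two pre-aligned sequences. The function name `percent_identity`, the choice to skip gapped positions, and the toy peptides are assumptions made for this example; it counts identical residues only and does not reproduce the chemically related groupings used for the similarity figures in the paper.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percentage of aligned positions that carry the same residue.

    Positions where either sequence has a gap are skipped. This is one
    common convention and an assumption here, not the paper's stated method.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # ignore columns containing a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * matches / compared


# Toy aligned peptides, invented for illustration (not taken from fig. 4).
toy_a = "YADDTAF-LA"
toy_b = "YADDLVLGAR"
print(round(percent_identity(toy_a, toy_b), 1))  # 4 matches over 9 compared columns
```

Applying the same function to consecutive slices of a long alignment would give a windowed profile of the kind plotted as a histogram in figs. 6 and 8.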
Significant homologies are also found with the reverse transcriptases encoded by Drosophila I factors, mammalian L1 elements, B. mori rDNA insertions and T. brucei ingi elements (fig. 4), the number of amino acid matches ranging from 30 to 40%. Like all these hypothetical polypeptides, orf E differs markedly from the reverse transcriptases encoded by vertebrate retroviruses, both in terms of identical or chemically related amino acids and with respect to the distances between conserved domains. As previously observed for F elements and I factors (16,17), the homology between orf E and the reverse transcriptase encoded by the Drosophila 17.6 element, or by other copia-like elements (data not shown), is poor, and is even lower than that found with retroviral polymerases.

[Figure: nucleotide sequence of G3A, residues 1 to 4346, ending in the terminal oligo-A tract.]

Fig. 2. Nucleotide sequence of the G3A element. The complete nucleotide sequence of G3A is shown. Broken and continuous lines below sequence residues denote G-orf0 and G-orf1, respectively. A double dashed line marks the hypothetical G-orf2; dots denote frameshift regions, asterisks stop codons (see text for a detailed description).

Homology between G and F elements

The close relationship between G and F elements which emerged from the previous analysis prompted us to search for additional matches between F-orf2, the 859 amino acid long open reading frame encoded by F elements that includes the reverse transcriptase-like domain (16), and other G3A encoded orfs. Segments from seven adjacent orfs (A to G in fig. 3) distributed on the three reading frames of the G3A element indeed exhibit homology to F-orf2 (fig. 5). With a small number of nucleotide changes, these segments can be joined into a long uninterrupted orf, which we will refer to as G-orf2, that might be slightly longer than F-orf2 at either terminus. The continuous homology to F-orf2 allows us to define rather precisely where the insertion or deletion of single nucleotides might have occurred, thereby breaking the continuity of G-orf2 (see figs. 2 and 5). Deletion of a few oligonucleotides might have caused the loss of amino acid residues at the boundaries between the E and F and the F and G orfs (fig. 5).

Taking into account only positional identities, the overall similarity of G-orf2 and F-orf2 exceeds 40% (fig. 6). The degree of similarity varies along the proposed alignment; in addition to the reverse transcriptase-like domain, a second highly conserved domain is found at the amino terminus of F-orf2 and G-orf2 (figs. 5 and 6). The homology between the two regions not only supports the hypothesis of an evolutionary link between F and G elements, but also suggests that constraints are operating to maintain a biological function.

G-orf1

In most retroviruses, the pol gene is preceded by the gag gene. The primary translation product of the gag gene is a polyprotein eventually cleaved into the core structural proteins of the virion (34).
One of these is a nucleic acid binding protein (NBP) structurally characterized by the presence of one or multiple adjacent copies of the amino acid motif Cx2Cx4Hx4C (34). Cysteine-rich motifs of the same kind are reiterated within orfs preceding the hypothetical reverse transcriptases in F elements, I factors and R1Bm elements (16-18).

Fig. 3. G3A orfs. The three forward frames of the G3A sequence are represented as horizontal lines; vertical lines above and below indicate ATG and translational stop codons, respectively. Stippled boxes denote segments of the hypothetical G-orf2 (see text).

Fig. 4. Similarities between orf E and reverse transcriptases. G-orf E segments are aligned with homologous regions from known and hypothetical reverse transcriptases. The single-letter amino acid code is used. Bold face characters denote identical residues, underlined characters denote chemically related amino acids grouped as in Schwartz and Dayhoff (28). Dots and open circles indicate positions respectively occupied by identical or similar amino acids among a large group of viral reverse transcriptases (27). F, Drosophila Fw element, orf 2 (16); I, Drosophila I factor, orf 2 (17); L1Hs, Homo sapiens L1 consensus orf sequence (12); L1Md, Mus domesticus L1 element, clone L1Md-A2, orf 2 (11); R1Bm, Bombyx mori rDNA type I insertion, orf 2 (18); R2Bm, Bombyx mori rDNA type II insertion (19); ingi, Trypanosoma brucei ingi element (20); HTLV-I, human adult T-cell leukemia virus type I (29); RSV, Rous sarcoma virus (30); HBV, hepatitis B virus (31); MoMLV, Moloney murine leukemia virus (32); 17.6, Drosophila 17.6 element, orf 2 (33). Numbers in brackets refer to amino acid residues present between the reported regions.
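The Cx2Cx4Hx4C spacing described above maps directly onto a regular expression, which gives a quick way to screen a translated orf for candidate copies of the motif. The sketch below is a minimal illustration and not the procedure used in the paper; the names `CYS_MOTIF` and `find_cys_motifs` are hypothetical, and the test peptide is a short stretch modelled on the G-orf1 segment shown in fig. 7. Only perfect copies are reported, so the imperfect copies mentioned in the text would need a relaxed pattern.

```python
import re

# Cx2Cx4Hx4C: Cys, any 2 residues, Cys, any 4 residues, His, any 4 residues, Cys.
# The lookahead wrapper lets overlapping occurrences be reported as well.
CYS_MOTIF = re.compile(r"(?=(C.{2}C.{4}H.{4}C))")


def find_cys_motifs(protein: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched segment) for each perfect motif copy."""
    return [(m.start() + 1, m.group(1)) for m in CYS_MOTIF.finditer(protein.upper())]


# Short peptide modelled on the G-orf1 region of fig. 7 (illustrative input).
peptide = "DVPQCFRCQGFGHTQRYCF"
for start, segment in find_cys_motifs(peptide):
    print(start, segment)
```

Reporting 1-based positions keeps the output comparable with residue numbering in protein alignments.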
One copy of the Cx2Cx4Hx4C motif, and two adjacent imperfect ones (fig. 7a), are found within G-orf1, a 241 amino acid long orf that partially overlaps G-orf2 (figs. 2 and 3). In fig. 7b the G-orf1 region encompassing the cysteine-rich motifs is aligned with the homologous regions in F-orf1 and I-orf1. The homology between G-orf1 and F-orf1 is about twice as high as that between G-orf1 and I-orf1 (45 versus 24% of positional identities). In the 5' portion of G3A a 242 amino acid long orf (G-orf0) extends from residue 221 to 948, overlapping G-orf1 (figs. 2, 3, 7c). As hypothesized for G-orf2, it is possible that G-orf0 and G-orf1 are part of a single frame in functional G elements and that mutations have interrupted their continuity in G3A. Because of the 5' truncation of the Fw element (16) the amino terminus of F-orf1 is unknown. However, amino acids homologous to the G-orf0 residues underlined in fig. 7c are found at the expected position in F19, a full-length F element (15; P.P. Di Nocera, unpublished). Four of the five ATG codons within G-orf0 might correspond, according to the consensus established by Kozak (36), to an initiator methionine (fig. 7c). The break in homology between G-orf0 and its counterpart in F19 upstream of the signalled region favors the hypothesis that translation of the NBP-like protein might start from the second ATG in G-orf0.

Fig. 5. Alignment of G orf segments to F-orf2. Segments from G3A orfs A through G (see fig. 3) are aligned with the 859 amino acid long F-orf2 encoded by the Fw element (16). Slashes through the line above amino acid residues signal where frameshifts need to be introduced to adjoin segments from different G3A orfs. Dots denote amino acid residues that cannot be unambiguously assigned to either one of the adjacent orfs because of the lack of homology to F-orf2. Dashes indicate amino acid gaps. Filled circles and crosses below sequence lines denote identical residues and favored amino acid substitutions grouped as in fig. 4, respectively. Stop codons are indicated by asterisks.

Fig. 6. Amino acid identities between G-orf2 and F-orf2.

Fig. 7. Cysteine-rich motifs within G-orf1. A) Alignment of Cys motifs. F, Fw element, orf 1 (16); I, I factor, orf 1 (17); R1Bm, R1Bm element, orf 1 (18); copia, copia element, orf 1 (35); RSV, Rous sarcoma virus, gag P12 (30); HTLV-I, human adult T-cell leukemia virus type I, gag P15 (29); MoMLV, Moloney murine leukemia virus, gag P10 (32). B) Alignment of G-orf1 to F-orf1 and I-orf1. Similarities between amino acid residues are outlined as in fig. 5. C) Amino acid sequences of G-orf0 and G-orf1. Possible initiator methionines are marked by triangles. For underlined amino acid residues in G-orf0 see text.

DNA homologies between G and F elements

Using the algorithm of Wilbur and Lipman (37) we searched for the registers of comparison that have the largest number of short perfect matches between the DNA sequences of the G3A and the Fw elements. By combining the results of this analysis with those independently obtained from the amino acid matches shown in figs. 5 and 6 we derived an overall alignment of the two elements up to their terminal A-rich 3' ends (fig. 8). Representative examples of the degree of homology of G3A and Fw DNAs are shown in fig. 8. Differences between the two elements along their coding regions, except those leading to the frameshifts illustrated in fig. 5, are due to insertions or deletions of triplets. The two elements show a major divergence at the junction region between orf1 and orf2, where a segment of about 100 bp is absent in G3A (fig. 8). The length of the region encompassing this site is similar in G3B and other G family members (22; data not shown). Further analyses might clarify whether this observation has any functional implication.

Fig. 8. DNA sequence homology between G3A and Fw. The histogram reports the percentage of DNA sequence homology derived from the alignment of the G3A element, from residue 950 to the oligo-A rich end, to the entire Fw element (16). Bars correspond to 50 bp intervals. The region devoid of bars corresponds to a segment of the Fw sequence whose counterpart is absent in G3A. Three examples of the alignment between G3A and Fw DNA sequences are shown. G3A sequences are at the top. Vertical lines denote base identities.

DISCUSSION

DNA sequence analysis of G3A, a member of the G family inserted in a non-nucleolar rDNA unit (22), further supports the hypothesis that in Drosophila mobile elements other than copia and copia-like sequences might transpose via RNA intermediates (16,17). G elements potentially encode polypeptides homologous to reverse transcriptases and nucleic acid binding proteins derived from retroviral gag polyproteins. It is therefore likely that these elements, as previously proposed for Drosophila F elements and I factors (16,17), and other repeated DNA sequences from different species (10,11,12,18,19,20), originate from the self-mediated cDNA conversion of RNA moieties. The similarity in structural and functional organization of these three retroposons suggests that they derive from a common ancestor. According to this view, the relative percentage of positional identities shared by the hypothetical gene products (figs. 4,5,6,7,9; see also ref. 16) suggests that divergence between G and F elements and I factors must have occurred very early in evolution. The relationship between F and G elements is also supported by the alignment of their nucleotide sequences (fig. 8). Whereas F elements and I factors are found at different chromosomal sites in Drosophila strains, and mutants associated with their insertions have been described (6,16,17), evidence of retroposition for G elements is poor.
The G family members characterized thus far are associated with repeated DNA (21,22), and in situ hybridization experiments indicate that G elements are restricted to the chromocenter (22). The degree of homology of the polymerases potentially encoded by F and G is, however, a strong indication that a selective constraint is operating to preserve a function; in addition, whole Southern analysis reveals a few qualitative differences in the genomic distribution of G elements among laboratory fly stocks (22). While further investigations are needed to clarify this issue, both observations can be taken as an indication that the process of de novo formation of G elements might still operate. It is also possible that functional G elements are predominantly present, like I factors (38), only in a few fly populations, and/or are similarly mobilized only in certain genetic backgrounds.

The molecular mechanisms leading to the dispersal of this type of retroposon are currently poorly understood. It has been suggested that these elements are capable of further rounds of retroposition because they contain an internal promoter (17); alternatively, most family members might represent functionless copies originating from a few intact master elements. The cysteine-rich polypeptides might interact with the reverse transcriptase and play a structural role in the process of cDNA conversion; alternatively, they might have a regulatory function, since cysteine and histidine residues, though in a different relative spatial arrangement, have been shown to constitute the functional domain of a variety of eukaryotic DNA binding proteins (39).

In many respects, G elements are reminiscent of ribosomal insertions. Ribosomal type I and II insertions are non-homologous sequences occurring at nearby sites within the 28S portion of more than 40% of Drosophila rDNA genes (reviewed in 40). rDNA insertions lack terminal repeats, and type I sequences have oligo-A tails at one terminus (40).
Type I insertions also occur in the chromocenter, arranged, like G elements, in tandem arrays (14,41). Many G elements (for example G3A) are inserted in the non-transcribed spacer of non-nucleolar rDNA units, at a site that is remarkably homologous to the 28S gene interval targeted by ribosomal insertions (22). The notion of a close relationship between rDNA insertions and G elements is further strengthened by the finding that rDNA insertions in B. mori have the same genetic organization as G elements (18,19; see also figs. 4 and 7a). It is tempting therefore to speculate that the similarity of the integration sites of these sequences might be the consequence of a relatively sequence-specific endonucleolytic activity associated with one of their hypothetical gene products.

The Drosophila genome presumably harbours additional families of non-viral retroposons. Jockey elements have no terminal repetitions (42); doc (43) and D (44) elements feature oligo-A tracts at one end. It would not be surprising if any of these elements were shown to be related to G and F elements.

ACKNOWLEDGMENTS

We wish to thank Drs. Thomas Eickbush and David Finnegan for communicating results prior to publication, and Giovanna Grimaldi and Graham Tebb for critical reading of the manuscript. The work done in Italy was supported by Progetto Finalizzato Ingegneria Genetica e Basi Molecolari delle Malattie Ereditarie of the C.N.R.

REFERENCES

1. Rogers, J.E. (1983) Nature 301, 460.
2. Rogers, J.H. (1985) Int. Rev. Cytol. 93, 187-279.
3. Wilde, C.D. (1986) Crit. Rev. Bioch. 19, 323-352.
4. Weiner, A.M., Deininger, P.L. and Efstratiadis, A. (1986) Ann. Rev. Biochem. 55, 631-661.
5. Finnegan, D.J. (1986) Int. Rev. Cytol. 93, 281-326.
6. Finnegan, D.J. and Fawcett, D.H. (1986) In Oxford Surveys on Eucaryotic genes, Oxford University Press, Oxford, Vol. 3, pp. 1-62.
7. Boeke, J.D., Garfinkel, D.J., Styles, G.A. and Fink, G.R. (1985) Cell 40, 491-500.
8. Singer, M.F. (1982) Int. Rev. Cytol. 76, 67-112.
9. Singer, M.F. and Skowronski, J. (1985) Trends in Biol. Sci. 10, 119-122.
10. Skowronski, J. and Singer, M.F. (1986) Cold Spring Harbor Symp. Quant. Biol. 51, 457-464.
11. Loeb, D.D., Padgett, R.W., Hardies, S.C., Shehee, W.R., Comer, M.B., Edgell, M.H. and Hutchison, C.A. (1986) Mol. Cell. Biol. 6, 168-182.
12. Hattori, M., Kuhara, S., Takenaka, O. and Sakaki, Y. (1986) Nature 321, 625-628.
13. Fanning, T. and Singer, M.F. (1987) Nucleic Acids Res. 15, 2251-2260.
14. Dawid, I.B., Long, E.O., Di Nocera, P.P. and Pardue, M.L. (1981) Cell 25, 399-408.
15. Di Nocera, P.P., Digan, M.E. and Dawid, I.B. (1983) J. Mol. Biol. 168, 715-728.
16. Di Nocera, P.P. and Casari, G. (1987) Proc. Natl. Acad. Sci. USA 84, 5843-5847.
17. Fawcett, D.H., Lister, C.K., Kellett, E. and Finnegan, D.J. (1986) Cell 47, 1007-1015.
18. Xiong, Y. and Eickbush, T.H. (1988) Mol. Cell. Biol. 8, 114-123.
19. Burke, W.D., Calalang, C.C. and Eickbush, T.H. (1987) Mol. Cell. Biol. 7, 2221-2230.
20. Kimmel, B.E., Ole-Moiyoi, O.K. and Young, J.R. (1987) Mol. Cell. Biol. 7, 1465-1475.
21. Di Nocera, P.P. and Dawid, I.B. (1983) Nucleic Acids Res. 11, 5475-5482.
22. Di Nocera, P.P., Graziani, F. and Lavorgna, G. (1986) Nucleic Acids Res. 14, 675-691.
23. Dente, L., Cesareni, G. and Cortese, R. (1983) Nucleic Acids Res. 14, 1267-1277.
24. Hattori, M. and Sakaki, Y. (1986) Analyt. Biochem. 152, 232-238.
25. Tabor, S. and Richardson, C.C. (1987) Proc. Natl. Acad. Sci. USA 84, 4767-4771.
26. Maxam, A. and Gilbert, W. (1980) Methods in Enzymology 65, 499-560.
27. Toh, H., Kikuno, R., Hayashida, H., Miyata, T., Kugimiya, W., Inouye, S., Yuki, S. and Saigo, K. (1985) EMBO J. 4, 1267-1272.
28. Schwartz, R.M. and Dayhoff, M.O. (1978) In Atlas of protein sequence and structure, National Biomedical Research Foundation, Washington, Vol. 5, pp. 353-358.
29. Seiki, M., Hattori, S., Hirayama, Y. and Yoshida, M. (1983) Proc. Natl. Acad. Sci. USA 80, 3618-3622.
30. Schwartz, D.E., Tizard, R. and Gilbert, W. (1983) Cell 32, 853-869.
31. Galibert, F., Mandart, E., Fitoussi, F., Tiollais, P. and Charnay, P. (1979) Nature 281, 646-650.
32. Shinnick, T.M., Lerner, R.A. and Sutcliff, J.G. (1981) Nature 293, 543-548.
33. Saigo, K., Kugimiya, W., Matsuo, Y., Inouye, S., Yoshioka, K. and Yuki, S.C. (1984) Nature 312, 659-661.
34. Covey, S.N. (1986) Nucleic Acids Res. 14, 623-633.
35. Mount, S.M. and Rubin, G.M. (1985) Mol. Cell. Biol. 5, 1630-1638.
36. Kozak, M. (1984) Nucleic Acids Res. 12, 857-872.
37. Wilbur, W.J. and Lipman, D.J. (1983) Proc. Natl. Acad. Sci. USA 80, 726-730.
38. Bregliano, J.C. and Kidwell, M.G. (1983) In J. Shapiro (ed.), Mobile Genetic Elements, Academic Press, New York, pp. 363-410.
39. Berg, J.M. (1986) Science 232, 485-487.
40. Beckingham, K. (1982) In Busch, H. and Rothblum, L. (eds.) The cell nucleus, Academic Press, New York, Vol. X, part A, pp. 205-269.
41. Roiha, H., Miller, J.R., Woods, L.C. and Glover, D.M. (1981) Nature 290, 749-753.
42. Mizrokhi, L.J., Obolenkova, L.A., Priimagi, A.F., Ilyin, Y.V., Gerasimova, T. and Georgiev, G.P. (1985) EMBO J. 4, 3781-3787.
43. Schneuwly, S., Kuroiwa, A. and Gehring, W.J. (1987) EMBO J. 6, 201-206.
44. Pittler, S.J. and Davis, R.L. (1987) Mol. Gen. Genet. 208, 325-328.