Lectures in Computational Virology
Bioinformatic Studies on the Evolution, Structure and Function of RNA-based Life Forms

Marcella A. McClure, Ph.D.
Department of Microbiology and the Center for Computational Biology
Montana State University, Bozeman MT
[email protected]

Summary Lecture II
1) Introduction to Retroid Agents
2) The Genome Parsing Suite
3) Retroid Agents in the Human Genome
4) Discovery-based Hypothesis Generation

Retroid Agents
Retroviruses, retrotransposons, pararetroviruses, retroposons, retroplasmids, retrointrons, and retrons

[Figure: flow of genetic information among DNA, RNA and protein (McClure, 2000). Labels: RNA viruses, e.g., Ebola, rabies, influenza, polio (replication by RNA-dependent RNA polymerase); Retroid agents (reverse transcriptase mediated replication or transposition, RNA to DNA); all cellular systems and most DNA viruses (replication by DNA-dependent DNA polymerase); transcription (DNA to RNA); translation (RNA to protein); snRNAs, ribozymes, tRNA, rRNA; PROTEIN SYNTHESIS.]

Distribution of Retroid Agents among Eukaryotes and Eubacteria
Columns: Eubacteria, Eukaryotes, Human, Vertebrates, Invertebrates, Plants, Fungi, Protists, Slime Mold, Alga, Plastids, Baculovirus Genome, Oomycetes, Archaea, Conjugative transposons, Protozoa
Rows, with marks in source order:
Retroviruses: + + +a
Pararetroviruses:
  Caulimoviruses: +
  Badnaviruses: +
  Hepadnaviruses: + +
Transposons:
  Retrotransposons:
    Gypsy-: +
    DIRS1-: + +
    Copia- / Retroposons: +b + + + + + + +b + + + + + +
  Retrointrons / Retroplasmids: + + + + + + + + + + + + + + + + + +
Retrons: +
Retrophages: + +

Variable features of Retroid genomes
Columns: LTRs; PBS; DNA synthesis primer (host, self, protein); Integration specificity (site, regional, structural, other)
Rows, with entries in source order:
Retroviruses: + + tRNA - -
Pararetroviruses (Plant, Animal): +a + - tRNA - - RT self - ; integration: - - - + NAb NAc
Transposons:
  Retrotransposons:
    Gypsy- (Gypsy, Tf1): + + + + + + + + - tRNA - RNA -
    DIRS-: ITRs - DNA ? ? + ? ? ? ?
    Copia-: + + tRNA - - + + + + +
  Retroposons, Retrointrons: -d - - DNA DNA - - ? +e + ? + ? + ? + ?
Retroplasmids (Mitochondrial, Fungal): NA - - ? tRNA ? ?
Retrons: - - - RNA - ? ? ? ? ?
Retrophages: - - - RNA - ? ? ? ? ?
[Figure: Phylogenetic tree based on 65 RT sequences, with gene maps drawn to scale (1000 to 4000 nucleotides) (McClure, 2000). Groups: retroviruses, orphan class, gypsy-like retrotransposons, caulimoviruses, hepadnaviruses, copia-like retrotransposons, retroposons, Group II introns, plasmids, retrons, TERT. Sequences labeled: HIV-1, DIRS-1, 17.6, CaMV, HBV, Copia, LIN-H, CIN4, R2Bm, I-FAC, INGI, INT-SC1, MAUP, MX65. Gene labels: MA, C, NC. Legend: RT = reverse transcriptase; RH = ribonuclease H; H-C/IN = integrase; PR = aspartic acid protease.]

[Figure: Conserved motifs of the Retroid enzymes.
RNA-dependent DNA Polymerase (Reverse Transcriptase) and Ribonuclease H: motifs 1 to 6 and 1 to 4; subdomains fingers, palm, thumb, connection; conserved residues labeled K, D, P, DD, KG, D, E, D, NX3D.
Aspartic Acid Protease: motifs 1 to 3; conserved residues DTG, G, ILG.
Integrase: motifs 1 to 4; Hx4H CX2C (zinc-binding), D, D, E (core), DNA-binding.]

Roles of Retroid Agents:
1) Disease:
   a) retroviruses:
      1) exogenous infectious: HIV, HTLV
      2) endogenous associations: breast cancer, testicular tumors, insulin-dependent diabetes, multiple sclerosis, rheumatoid arthritis, schizophrenia and systemic lupus erythematosus
   b) LINEs insertional mutagenesis:
      1) Hemophilia A
      2) muscular dystrophies: Duchenne and Fukuyama-type congenital
      3) X-linked disorders: Alport Syndrome-Diffuse Leiomyomatosis and Chronic Granulomatous Disease
2) Regulation of cellular genes and reproduction
3) Telomere maintenance
4) Repair of broken dsDNA
5) Exchange of genetic information among and between organisms

[Figure: Possible function of HERV-W. Labels: Trophoblast, Syncytiotrophoblast, HERV-W, Endometrium, Syncytin.]

[Figure labels: Predicted functional RT; Predicted Retroid genome; Real Contig; Real Chromosome.]

What is the "host" genomic environment of active Retroid Agents?
Disease, Reproduction, Development

Mapping Genomic Retroid Agents
Query Sequences Database: 22 RT sequences
Data categories: The Human Genome; By Subgroup; By Chromosome
Significant BLAST hits from 22 queries on 24 chromosomes
Probable RT function determined by: E-value, OSM score and gene architecture
Probable active Retroid agents determined by:
1) genomic boundaries
2) genome architecture
3) identification of OSM in PR/RH and IN sequences
4) presence of non-enzymatic Retroid genes
Map host gene environment of Retroid genome
Determine total versus potentially active Retroid Agents in the Human Genome
What is the distribution of active Retroid Agents in the Human Genome?
Hypothesis testing regarding the Function and Evolution of Retroid Agents

The seven major steps of GPS
1) BLAST using 22 RT consensus sequences
2) Remove duplicates and overlaps
3) Evaluate OSM of RT
4) Select RTs to annotate
5) Extract genome based on RT type
6) Annotate using consensus library
7) Analyze the entire Retroid Agent

[Figure repeated: conserved motifs of Reverse Transcriptase, Ribonuclease H, Aspartic Acid Protease and Integrase.]

The score of a given motif is calculated by

\[ M_{\mathrm{score}} = \frac{M + M_1 + M_2}{M_{\mathrm{length}}} \]

M, M1 and M2 are based on the number of amino acids in a motif found in common between a known RT query sequence and the potential RT:
M is a count of amino acid identities
M1 is a count of conservative substitutions (ILMV, AG, ST, DE, NQ, FY, RK)
M2 accounts for older substitutions (LIMV, AGST, DENQ, FYW, RKH)

The overall OSM (ordered series of motifs) score is calculated by

\[ \mathrm{OSM}_{\mathrm{score}} = \frac{\sum_{i=1}^{T_{\mathrm{motifs}}} M_{\mathrm{score},i}}{T_{\mathrm{motifs}}} \]

T motifs is the number of motifs comprising the OSM.

Status of the Human Genome Project
• 3,200,000 Kbp of the euchromatic portion of the human chromosomes are being sequenced
• Heterochromatic portion is not being
done
• As of January 5, 2003:
  – Non-redundant sequence only
  – 98.8% of euchromatic portion has been done
  – 3.0% is completed to the working draft level
  – 95.8% has been completed to 99% accuracy

[Figure: (A) chromosome size in megabasepairs (axis 0 to 250) and (B) unique RTs (axis 0 to 8000), per chromosome for chromosomes 1 to 22, X and Y.]
Fluctuation in nucleotides per chromosome (A) and unique BLAST RT hits per chromosome (B) over the last four freezes. The bar codes are as follows: black, November 2002; right-hatched, June 2002; gray, April 2002; and left-hatched, December 2001.

Chr     Size    Raw hits  Unique RTs  w/6 motifs  Intact OSM  Full LINE  Perfect LINE
chr1    221.3   16450     6202        595         207         124        17
chr2    237.5   17573     6810        712         259         162        15
chr3    194.3   16858     6243        650         243         136        12
chr4    187.7   17479     6463        709         264         151        10
chr5    177.7   15793     5937        676         232         135        17
chr6    166.8   5611      2119        266         99          65         4
chr7    153.8   4465      1860        154         57          39         4
chr8    142.4   10488     4018        401         149         95         7
chr9    117     9063      3601        323         126         70         5
chr10   131     8396      3216        295         106         63         5
chr11   132.2   10465     3943        415         152         90         9
chr12   128.4   9598      3563        396         130         86         6
chr13   95.2    4446      1815        135         50          30         3
chr14   88.1    2842      1131        125         43          23         1
chr15   82.9    5742      2297        172         64          42         5
chr16   80.6    3892      1575        132         50          31         8
chr17   79.7    3554      1453        85          35          20         2
chr18   74.6    5852      2257        170         79          46         5
chr19   56.3    3368      1215        61          30          15         0
chr20   59.4    2061      851         59          21          11         2
chr21   33.9    1456      581         33          11          9          0
chr22   33.8    1397      602         27          12          7          2
chrX    147.3   22249     8148        921         336         186        13
chrY    22.7    4311      1230        132         40          20         1
Totals  2845    203409    77130       7644        2795        1656       153

Distribution of significant BLAST hits retrieved by 22 RT protein query sequences per chromosome. Chromosomal size from the Nov. 2002 HGD freeze is given in megabase pairs. Other column designations are described in the text. The significant raw and unique hits are from all 22 queries. The RTs with six motifs are significant hits retrieved by LINEs, HERVs, MMLV and TERT queries.
Intact OSMs are found only in LINEs, HERVs and the TERT. The last two columns report the full-length LINEs with all components and perfect LINEs, respectively.

Classification of 1656 whole LINEs

No. in HG  Stop codons  Frame-shifts  Details
153        0            0             Perfect
86         1            0             43 in LZ / 15 in T
80         0            1             11 En / 2 intra ORF
1337       Multiple     Multiple      Many cases
1656 (total)

A total of 153 LINEs appear to be perfect, while 86 contain a single stop codon and 80 a single frame-shift.

Distribution of significant BLAST hits per query sequence.

Query      Hits                      Query    Hits
H-LIN      170260/69692/7345/2760    RTBV     60/12/0/0
HERV-K     2982/496/86/22            CMV      174/11/0/0
HERV-L     8208/2910/208/12          Copia    104/9/0/0
MMLV       4559/2108/4/0             Gypsy    334/14/0/0
MPMV       3506/52/0/0               DIRO     97/12/0/0
HIV        903/8/0/0                 IPAO     27/13/0/0
FIV        1505/15/0/0               PMAUP    19/18/0/0
HTLV       3232/51/0/0               RECO     9/9/0/0
Snakehead  3109/39/0/0               H_TERT   1857/1581/1/1
SPUMA      2369/17/0/0               R_TERT   26/21/0/0
HBV        58/31/0/0                 Archaea  11/11/0/0

Values indicate raw hits/unique hits/RTs with 6 motifs/Perfect OSMs. The 22 representative sequences used to query the HGD. Sequences, excluding the HERVs and human TERT, are the representative mean sequences for over 600 RTs from eight different classes of Retroid Agents.
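The motif score and OSM score that GPS uses to decide which hits carry an intact OSM can be sketched in Python as follows. This is a minimal sketch, not the GPS implementation: it assumes the two motifs are already aligned without gaps, and it assumes each position is counted once, in the first category it satisfies (identity, then the M1 groups, then the M2 groups), with unit weight. The function names and the example sequences are hypothetical.

```python
# Similarity groups as quoted on the scoring slide.
M1_GROUPS = ["ILMV", "AG", "ST", "DE", "NQ", "FY", "RK"]
M2_GROUPS = ["LIMV", "AGST", "DENQ", "FYW", "RKH"]


def same_group(a, b, groups):
    """Return True if residues a and b fall in the same similarity group."""
    return any(a in g and b in g for g in groups)


def motif_score(query_motif, candidate_motif):
    """Compute (M + M1 + M2) / M_length for one aligned, ungapped motif pair.

    Assumption (not stated in the source): each position is counted once,
    in the first category it satisfies.
    """
    if len(query_motif) != len(candidate_motif) or not query_motif:
        raise ValueError("motifs must be aligned, equal in length and non-empty")
    m = m1 = m2 = 0
    for q, c in zip(query_motif.upper(), candidate_motif.upper()):
        if q == c:
            m += 1                      # identity
        elif same_group(q, c, M1_GROUPS):
            m1 += 1                     # conservative substitution
        elif same_group(q, c, M2_GROUPS):
            m2 += 1                     # broader (M2) substitution
    return (m + m1 + m2) / len(query_motif)


def osm_score(motif_pairs):
    """Mean motif score over the T_motifs motifs that make up the OSM."""
    scores = [motif_score(q, c) for q, c in motif_pairs]
    return sum(scores) / len(scores)


# Hypothetical two-motif example; the sequences are invented for illustration.
pairs = [("YMDD", "YVDD"), ("LG", "LW")]
print(osm_score(pairs))
```

The three counters are kept separate so that different weights could be applied to M, M1 and M2 if the actual scoring scheme requires them.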
Distribution of the 482 Low Frequency Reverse Transcriptase hits with remnants of at least one motif.

Queries: RTBV, CMV, Gypsy, HBV, HTLV, IPAO, FIV, HIV, MPMV, Snakehead, Spuma, TERT, PMAUP, DIRO

Chromosome  Total
1           41/5
2           25/7
3           25/9
4           23/8
5           33/12
6           17/1
7           13/2
8           10/3
9           19/2
10          22/3
11          19/4
12          19/5
13          11/4
14          10/1
15          20/3
16          30/4
17          26/2
18          17/3
19          35/6
20          10/2
21          10/2
22          16/2
X           20/12
Y           11/6
Total       482/108

Values indicate the number of Low Frequency hits/number of hits with a minimum of one recognizable motif. Of the 482 hits, 108 have at least one recognizable RT motif. The remaining 374 hits have remnants of at least one motif and were conserved enough to be scored by GPS.
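As a quick consistency check on the low-frequency table, the per-chromosome totals can be summed to confirm the 482/108 figures and the 374 remnant-only hits quoted in the caption. The sketch below is illustrative only; the values are copied from the Total column, and the variable names are hypothetical.

```python
# Chromosomes in table order: 1..22, X, Y.
chromosomes = [str(n) for n in range(1, 23)] + ["X", "Y"]

# "hits/recognizable" cells from the Total column of the table.
totals = (
    "41/5 25/7 25/9 23/8 33/12 17/1 13/2 10/3 19/2 22/3 19/4 19/5 "
    "11/4 10/1 20/3 30/4 26/2 17/3 35/6 10/2 10/2 16/2 20/12 11/6"
).split()


def parse_cell(cell):
    """Split a 'hits/recognizable' cell into two integers."""
    hits, recognizable = cell.split("/")
    return int(hits), int(recognizable)


per_chromosome = dict(zip(chromosomes, map(parse_cell, totals)))

total_hits = sum(h for h, _ in per_chromosome.values())
total_recognizable = sum(r for _, r in per_chromosome.values())
remnant_only = total_hits - total_recognizable

print(total_hits, total_recognizable, remnant_only)  # 482 108 374
```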
[Table: RT motifs (K, D, QG, DD, G-K, LG) found in the low-frequency hits for the HIV, MPMV, Spuma and TERT queries, by chromosome (1 to 22, X and Y). Cell entries are counts tagged R or C.]

Looking at the environment of each Retroid Agent
[Figure: TPTE gene on Chromosome 21, contig NT_029490. Labels: Truncated LINE inserted into Intron 6; Truncated L1MB1 inserted into Intron 6; Truncated L1PA5 inserted into Intron 8; Truncated LINE inserted into Intron 18.]

Figure 3: Looking at the environment of each Retroid Genome. In this example, four truncated LINEs are found within three different introns of a putative Tyrosine Phosphatase gene (TPTE). Insertions of Retroid genomes into introns may have little effect on a gene, or may allow for gene shuffling. In this case none of the coding region of the gene was disrupted, which suggests that Retroid sequence information may be utilized to make introns, that selection favors insertions that do not disrupt coding capacity, or that introns provide the preferential target site for transposition. The black lines represent the exons of the TPTE gene.

A. RepeatMasker Information
Name: L1PA4
Family: L1
Class: LINE
Divergence: 7.6%
Deletions: 0.7%
Insertions: 1.0%
Chromosome: 21
Band: 21p11.1
Strand: +
SW Score: 39577
Begin in repeat: 2
End in repeat: 6147
Left in repeat: -8
Begin in chr: 10249606
Begin in Chr: 10249607
End in chr: 10255769
Genomic Size: 6163

B.
[Figure: LINE structure. 5′UTR, ORF I (~300 amino acids), ORF II (~1300 amino acids), 3′UTR. Below, the annotated element (0 to 6126 nucleotides; genomic positions 249607 to 255769) is drawn in reading frames 1 to 3 with components 5′UTR, LZ, EN, Un, RT, T, RH and 3′UTR.]

Chromosome: 21
Contig: NT_001715
Pos: 10249607-10255769
Strand: (+)
Retroid Agent: L1PA4 LINE
Length: Whole (6124)
Environ: No known genes.

Frameshifts

Gene  Len(AA)  RF  Shift@  Shift to
LZ    334      +1  323     +3
EN    242      +2  70      +1
Un    210      +1  103     +3
Un    210      +1  180     +2
Un    210      +1  189     +1
RT    411      +1  none    none
T     199      +1  none    none
RH    132      +2  none    none

Positions

Comp   Start   End
5UTR   249607  250367
LZ     250646  251658
EN     251725  252449
Un     252450  253082
RT     253083  254315
T      254316  254918
RH     254919  255315
3UTR   255639  255769

Distribution of Retroid Agents on Human Chromosomes (November 2002 Freeze)

Query: 22 distinct reverse transcriptase sequences representing 18 subgroups were used to query NCBI's Human Genome Database.

Results:
1) Retroid Agents are not randomly distributed on Human Chromosomes.
2) Chromosomes X and Y have the highest percent Retroid Agent sequence.
3) Of those remaining, Chromosome 4 has the most, while Chromosome 20 has the least percent Retroid Agents.

Only two chromosomes, 19 and 21, are without at least one intact and potentially active LINE.

Using exact sequence lengths for each hit of each category indicated in the table of data, the November freeze of the human genome contains at least 1.01% unique RT sequences, 0.35% full-length LINEs and 0.032% active LINEs.

New hypotheses from discovery-based research
1) Low frequency RT-like sequences (not from LINEs or ERVs) are discernible in the Human Genome.
2) Human low frequency RT-like sequences are remnants of ancient invasions.
3) Human low frequency RT-like sequences are remnants of failed invasions.
4) The pattern of low frequency RT-like sequences is unique in each organismal genome.
5) Both unique and trans-organismal patterns of low frequency RT-like sequences are found in Eukaryotes.

What mechanisms could be maintaining these signals?
1) Gene conversion, an event without a mechanism.
2) Transcriptional inactivation due to methylation of CpG regions.
3) Translational recoding.
4) Complementation.

Eric Donaldson, B.S., Bioinformatician II
Dustin Lee, M.S., Bioinformatics Programmer
Aaron Juntunen, Undergraduate programmer
Crystal Hepp, Undergraduate
Kendal Harwood, Undergraduate
Dr. Marcella McClure, P.I. (Marcie)