Lectures in
Computational Virology
Bioinformatic Studies on the Evolution, Structure
and Function of RNA-based Life Forms
Marcella A. McClure, Ph.D.
Department of Microbiology and the
Center for Computational Biology
Montana State University, Bozeman MT
[email protected]
Summary Lecture II
1) Introduction to Retroid Agents
2) The Genome Parsing Suite
3) Retroid Agents in the Human Genome
4) Discovery-based Hypothesis Generation
Retroid Agents
Retroviruses, retrotransposons,
pararetroviruses, retroposons,
retroplasmids, retrointrons, and retrons
Figure: flow of genetic information among RNA, DNA and protein (McClure, 2000)
RNA viruses (e.g., Ebola, rabies, influenza, polio): replication by RNA-dependent RNA polymerase
All cellular systems and most DNA viruses: replication by DNA-dependent DNA polymerase
RNA to DNA: reverse transcriptase mediated replication or transposition
DNA to RNA: transcription
RNA to protein: translation (protein synthesis)
Other RNAs shown: snRNAs, ribozymes, tRNA, rRNA
Distribution of Retroid Agents among Eukaryotes and Eubacteria
Table: presence (+) of each class of Retroid agent in each host group; some entries carry footnote marks (a, b).
Host groups: Eubacteria; Eukaryotes (Human, Vertebrates, Invertebrates, Plants, Fungi, Protists, Slime Mold, Alga, Oomycetes, Protozoa); Plastids; Baculovirus Genome; Archaea; Conjugative transposons
Retroid agents:
Retroviruses
Pararetroviruses: Caulimoviruses, Badnaviruses, Hepadnaviruses
Transposons: Retrotransposons (Gypsy-, DIRS1-, Copia-), Retroposons
Retrointrons
Retroplasmids
Retrons
Retrophages
Variable features of Retroid genomes
Table: features of each class of Retroid agent; entries are +, -, ?, NA, or a primer type (tRNA, RNA, DNA), some with footnote marks (a to e).
Features: LTRs (ITRs for DIRS-); PBS; DNA synthesis primer (host, self; protein, RT); Integration specificity (self, other site, regional, structural)
Retroid agents:
Retroviruses
Pararetroviruses: Plant, Animal
Transposons: Retrotransposons (Gypsy-: Gypsy, Tf1; DIRS-; Copia-), Retroposons
Retrointrons
Retroplasmids: Mitochondrial, Fungal
Retrons
Retrophages
Phylogenetic Tree based on 65 RT sequences, with Gene Maps (figure; McClure, 2000)
Lineages and the gene maps shown for them: retroviruses (HIV-1); orphan class (DIRS-1); gypsy-like retrotransposons (17.6); caulimoviruses (CaMV); hepadnaviruses (HBV); copia-like retrotransposons (Copia); retroposons (LIN-H, CIN4, R2Bm, I-FAC, INGI); Group II introns (INT-SC1); plasmids (MAUP); retrons (MX65); TERT
Gene map labels: MA, C, NC
RT = reverse transcriptase
RH = ribonuclease H
H-C/IN = integrase
PR = aspartic acid protease
Scale: nucleotides (1000 to 4000)
Figure: conserved motifs of the Retroid enzymes
RNA-dependent DNA Polymerase (Reverse Transcriptase): motifs 1 to 6 across the fingers, palm and thumb subdomains, followed by the connection; conserved residues shown: K, D, P, DD, KG
Ribonuclease H: motifs 1 to 4; conserved residues shown: D, E, D, NX3D
Aspartic Acid Protease: motifs 1 to 3; conserved residues shown: DTG, ILG, G
Integrase: motifs 1 to 4; zinc-binding region (HX4H, CX2C), core (D, D, E) and DNA-binding region
Roles of Retroid Agents:
1) Disease:
a) retroviruses:
1) exogenous infectious: HIV, HTLV
2) endogenous associations: breast cancer, testicular tumors, insulin-dependent diabetes, multiple sclerosis, rheumatoid arthritis, schizophrenia and systemic lupus erythematosus
b) LINEs insertional mutagenesis:
1) Hemophilia A
2) muscular dystrophies: Duchenne and Fukuyama-type congenital
3) X-linked disorders: Alport Syndrome-Diffuse Leiomyomatosis and Chronic Granulomatous Disease
2) Regulation of cellular genes and reproduction
3) Telomere maintenance
4) Repair of broken dsDNA
5) Exchange of genetic information among and between organisms
Possible function of HERV-W
Trophoblast
Syncytiotrophoblast
HERV-W
Endometrium
Syncytin
Predicted functional RT
Predicted Retroid genome
Real Contig
Real Chromosome
What is the “host” genomic environment of active Retroid Agents ?
Disease
Reproduction
Development
Mapping Genomic Retroid Agents
Query Sequences
Database
22 RT sequences
Data categories
The Human Genome
By Subgroup
By Chromosome
Significant BLAST hits from 22 queries on 24 chromosomes
Probable RT function determined by: E-value, OSM score and gene architecture
Probable active Retroid agents determined by:
1) genomic boundaries
2) genome architecture
3) identification of OSM in PR/RH and IN sequences
4) presence of non-enzymatic Retroid genes
Map host gene environment of Retroid genome
Determine total versus potentially
Active Retroid Agents in Human Genome
What is the distribution of active Retroid
Agents in the Human Genome
Hypothesis Testing regarding the
Function and Evolution of Retroid Agents
The seven major steps of GPS
1) BLAST using 22 RT consensus sequences
2) Remove duplicates and overlaps
3) Evaluate OSM of RT
4) Select RTs to annotate
5) Extract genome based on RT type
6) Annotate using consensus library
7) Analyze the entire Retroid Agent
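The slides do not say how GPS removes duplicates and overlaps, so the following is only a minimal sketch of one plausible approach. It assumes that hits on the same chromosome whose coordinates overlap are treated as one locus, and that the hit with the lowest E-value represents that locus. The `Hit` record and the `unique_hits` function are hypothetical names, and strand is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST hit (hypothetical record; fields are assumptions)."""
    chrom: str
    start: int      # 1-based, inclusive
    end: int        # inclusive
    evalue: float
    query: str


def unique_hits(hits):
    """Collapse exact duplicates and overlapping hits into one hit per locus.

    Assumption: the hit with the lowest E-value represents the locus.
    """
    clusters = []  # list of (best_hit, cluster_end)
    # set() drops exact duplicates; sorting groups hits by position.
    for hit in sorted(set(hits), key=lambda h: (h.chrom, h.start, h.end)):
        if clusters:
            best, cluster_end = clusters[-1]
            if best.chrom == hit.chrom and hit.start <= cluster_end:
                # Overlaps the current locus: keep the better hit, extend the locus.
                if hit.evalue < best.evalue:
                    best = hit
                clusters[-1] = (best, max(cluster_end, hit.end))
                continue
        clusters.append((hit, hit.end))
    return [best for best, _ in clusters]


example = [
    Hit("chr21", 100, 400, 1e-50, "H-LIN"),
    Hit("chr21", 100, 400, 1e-50, "H-LIN"),    # exact duplicate
    Hit("chr21", 350, 700, 1e-80, "HERV-K"),   # overlaps the first hit
    Hit("chr21", 900, 1200, 1e-10, "MMLV"),
    Hit("chrX", 100, 400, 1e-5, "H-LIN"),
]
for hit in unique_hits(example):
    print(hit.chrom, hit.start, hit.end, hit.query)
```

Raw and unique hit counts, as reported in the tables of this lecture, would correspond to `len(hits)` before and after a step of this kind.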
The score of a given motif is calculated by
$$M_{\mathrm{score}} = \frac{M + M_1 + M_2}{M_{\mathrm{length}}}$$
M, M1 and M2 are based on the number of amino acids in a motif found
in common between a known RT query sequence and the potential RT
M is a count of amino acid identities
M1 is a count of conservative substitutions (ILMV, AG, ST, DE, NQ, FY, RK)
M2 accounts for older substitutions (LIMV, AGST, DENQ, FYW, RKH)
The overall OSM score is calculated by
$$\mathrm{OSM}_{\mathrm{score}} = \frac{\sum_{i=1}^{T_{\mathrm{motifs}}} M_{\mathrm{score},i}}{T_{\mathrm{motifs}}}$$
T motifs is the number of motifs comprising the OSM
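The two formulas can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the GPS implementation: the slide gives no weights, so every matching position counts as 1; each position is counted once, in the first category it satisfies (identity, then M1 group, then M2 group); and M length is taken to be the motif length. The function names and example motifs are hypothetical.

```python
M1_GROUPS = ["ILMV", "AG", "ST", "DE", "NQ", "FY", "RK"]   # conservative substitutions
M2_GROUPS = ["LIMV", "AGST", "DENQ", "FYW", "RKH"]         # older substitutions


def _same_group(a, b, groups):
    """True if residues a and b belong to the same substitution group."""
    return any(a in group and b in group for group in groups)


def motif_score(query_motif, candidate_motif):
    """M score = (M + M1 + M2) / M length, for two aligned motifs of equal length."""
    if not query_motif or len(query_motif) != len(candidate_motif):
        raise ValueError("motifs must be non-empty and of equal length")
    m = m1 = m2 = 0
    for q, c in zip(query_motif.upper(), candidate_motif.upper()):
        if q == c:
            m += 1                          # identity
        elif _same_group(q, c, M1_GROUPS):
            m1 += 1                         # conservative substitution
        elif _same_group(q, c, M2_GROUPS):
            m2 += 1                         # older substitution (assumed unweighted)
    return (m + m1 + m2) / len(query_motif)


def osm_score(motif_pairs):
    """OSM score = mean of the motif scores over the T motifs of the OSM."""
    scores = [motif_score(q, c) for q, c in motif_pairs]
    return sum(scores) / len(scores)


# Hypothetical two-motif example: (query motif, candidate motif)
pairs = [("YVDD", "FIDE"), ("KG", "HP")]
print(motif_score(*pairs[0]))   # all four positions are identities or M1 matches
print(motif_score(*pairs[1]))   # K/H matches only under M2, G/P does not match
print(osm_score(pairs))
```

If GPS weights M1 and M2 below identities, only the three increments inside `motif_score` would need to change.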
Status of the
Human Genome Project
• 3,200,000 Kbp of the euchromatic portion of the human
chromosomes are being sequenced
• Heterochromatic portion is not being done
• As of January 5, 2003:
– Non-redundant sequence only
– 98.8% of euchromatic portion has been done
– 3.0% is completed to the working draft level
– 95.8% has been completed to 99% accuracy
Figure: (A) nucleotides per chromosome, in megabasepairs (axis 0 to 250), and (B) unique RTs per chromosome (axis 0 to 8000), for chromosomes 1 to 22, X and Y.
Fluctuation in nucleotides per chromosome (A) and unique BLAST RT hits per chromosome (B)
over the last four freezes. The bar codes are as follows: black, November 2002; right-hatched,
June 2002; gray, April 2002; and left-hatched, December 2001.
Chr     Size   Raw hits  Unique  RTs w/6 motifs  Intact OSM  Full LINE  Perfect LINE
chr1    221.3  16450     6202    595             207         124        17
chr2    237.5  17573     6810    712             259         162        15
chr3    194.3  16858     6243    650             243         136        12
chr4    187.7  17479     6463    709             264         151        10
chr5    177.7  15793     5937    676             232         135        17
chr6    166.8  5611      2119    266             99          65         4
chr7    153.8  4465      1860    154             57          39         4
chr8    142.4  10488     4018    401             149         95         7
chr9    117    9063      3601    323             126         70         5
chr10   131    8396      3216    295             106         63         5
chr11   132.2  10465     3943    415             152         90         9
chr12   128.4  9598      3563    396             130         86         6
chr13   95.2   4446      1815    135             50          30         3
chr14   88.1   2842      1131    125             43          23         1
chr15   82.9   5742      2297    172             64          42         5
chr16   80.6   3892      1575    132             50          31         8
chr17   79.7   3554      1453    85              35          20         2
chr18   74.6   5852      2257    170             79          46         5
chr19   56.3   3368      1215    61              30          15         0
chr20   59.4   2061      851     59              21          11         2
chr21   33.9   1456      581     33              11          9          0
chr22   33.8   1397      602     27              12          7          2
chrX    147.3  22249     8148    921             336         186        13
chrY    22.7   4311      1230    132             40          20         1
Totals  2845   203409    77130   7644            2795        1656       153
Distribution of significant BLAST hits retrieved by 22 RT protein query sequences per
chromosome. Chromosomal size from the Nov. 2002 HGD freeze is given in megabase pairs. Other column
designations are described in the text. The significant raw and unique hits are from all 22 queries. The RTs with
six motifs are significant hits retrieved by LINEs, HERVs, MMLV and TERT queries. Intact OSMs are found
only in LINEs, HERVs and the TERT. The last two columns report the full length LINEs with all components
and perfect LINEs, respectively.
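As a rough way to compare chromosomes of different sizes, the unique hit counts can be divided by chromosome size. The sketch below uses the Size and Unique columns of the per-chromosome table; the function name is hypothetical. Hits per megabase pair is only a proxy and is not the same measure as the percent of sequence discussed later in the lecture, which was computed from exact hit lengths, so the two rankings need not agree.

```python
# (size in Mbp, unique RT hits) per chromosome, from the per-chromosome table
TABLE = {
    "chr1": (221.3, 6202), "chr2": (237.5, 6810), "chr3": (194.3, 6243),
    "chr4": (187.7, 6463), "chr5": (177.7, 5937), "chr6": (166.8, 2119),
    "chr7": (153.8, 1860), "chr8": (142.4, 4018), "chr9": (117.0, 3601),
    "chr10": (131.0, 3216), "chr11": (132.2, 3943), "chr12": (128.4, 3563),
    "chr13": (95.2, 1815), "chr14": (88.1, 1131), "chr15": (82.9, 2297),
    "chr16": (80.6, 1575), "chr17": (79.7, 1453), "chr18": (74.6, 2257),
    "chr19": (56.3, 1215), "chr20": (59.4, 851), "chr21": (33.9, 581),
    "chr22": (33.8, 602), "chrX": (147.3, 8148), "chrY": (22.7, 1230),
}


def hits_per_mbp(table):
    """Unique RT hits per megabase pair for each chromosome."""
    return {chrom: unique / size for chrom, (size, unique) in table.items()}


density = hits_per_mbp(TABLE)
# Print chromosomes from highest to lowest hit density.
for chrom, value in sorted(density.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{chrom:6s} {value:6.1f} unique RT hits per Mbp")
```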
Classification of 1656 whole LINEs
No. in HG  Stop codons  Frame-shifts  Details
153        0            0             Perfect
86         1            0             43 in LZ/ 15 in T
80         0            1             11 En/2 intra ORF
1337       Multiple     Multiple      Many cases
1656
A total of 153 LINEs appear to be perfect,
while 86 contain a single stop codon and 80 a single frame-shift.
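The classification above amounts to a simple rule on two counts per LINE. A minimal sketch follows, assuming stop codons and frame-shifts have already been counted for each full-length LINE; the function name and category labels are hypothetical, and every LINE outside the first three rows is assigned to the last category, which is an assumption about how the slide's "Multiple" row was defined.

```python
from collections import Counter


def classify_line(stop_codons: int, frameshifts: int) -> str:
    """Assign a full-length LINE to one of the four categories of the table."""
    if stop_codons < 0 or frameshifts < 0:
        raise ValueError("counts must be non-negative")
    if stop_codons == 0 and frameshifts == 0:
        return "perfect"
    if stop_codons == 1 and frameshifts == 0:
        return "single stop codon"
    if stop_codons == 0 and frameshifts == 1:
        return "single frame-shift"
    return "multiple"   # assumption: everything else falls in the last row


# Hypothetical (stop codons, frame-shifts) counts for five LINEs
counts = [(0, 0), (1, 0), (0, 1), (3, 2), (1, 1)]
tally = Counter(classify_line(stops, shifts) for stops, shifts in counts)
print(tally)
```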
Distribution of significant BLAST hits per query sequence.
Query      Hits                    Query    Hits
H-LIN      170260/69692/7345/2760  RTBV     60/12/0/0
HERV-K     2982/496/86/22          CMV      174/11/0/0
HERV-L     8208/2910/208/12        Copia    104/9/0/0
MMLV       4559/2108/4/0           Gypsy    334/14/0/0
MPMV       3506/52/0/0             DIRO     97/12/0/0
HIV        903/8/0/0               IPAO     27/13/0/0
FIV        1505/15/0/0             PMAUP    19/18/0/0
HTLV       3232/51/0/0             RECO     9/9/0/0
Snakehead  3109/39/0/0             H_TERT   1857/1581/1/1
SPUMA      2369/17/0/0             R_TERT   26/21/0/0
HBV        58/31/0/0               Archaea  11/11/0/0
Values indicate raw hits/unique hits/RTs with 6 motifs/Perfect OSMs. The 22
representative sequences used to query the HGD. Sequences, excluding the
HERVs and human TERT, are the representative mean sequences for over
600 RTs from eight different classes of Retroid Agents.
Table: low frequency RT hits per chromosome. The original table also has one column per query (RTBV, CMV, Gypsy, HBV, HTLV, IPAO, FIV, HIV, MPMV, Snakehead, Spuma, TERT, PMAUP, DIRO); the per-chromosome totals are:
Chromosome  Total
1           41/5
2           25/7
3           25/9
4           23/8
5           33/12
6           17/1
7           13/2
8           10/3
9           19/2
10          22/3
11          19/4
12          19/5
13          11/4
14          10/1
15          20/3
16          30/4
17          26/2
18          17/3
19          35/6
20          10/2
21          10/2
22          16/2
X           20/12
Y           11/6
Total       482/108
Distribution of the 482 Low Frequency Reverse Transcriptase hits with remnants of at least one motif.
Number of Low Frequency hits/Number of hits with a minimum of one recognizable motif. Of the 482 hits,
108 have at least one recognizable RT motif. The remaining 374 hits have remnants of at least one motif and
were conserved enough to be scored by GPS.
Table: RT motifs (K, D, QG, DD, G-K, LG) found in the low frequency hits to the HIV, TERT, Spuma and MPMV queries, listed by chromosome (1 to 22, X, Y). Entries are counts marked C or R.
Looking at the environment of each Retroid Agent
Figure labels (TPTE gene, chromosome 21 contig NT_029490): truncated LINE inserted into intron 6; truncated L1MB1 inserted into intron 6; truncated L1PA5 inserted into intron 8; truncated LINE inserted into intron 18
Figure 3: Looking at the environment of each Retroid Genome. In this example, four
truncated LINEs are found within three different introns of a putative Tyrosine Phosphatase
gene (TPTE). Insertions of Retroid genomes into introns may have little effect on a gene,
or may allow for gene shuffling. In this case none of the coding region of the gene was
disrupted, which is consistent with several explanations: Retroid sequence information may
be utilized to make introns, selection may favor insertions that do not disrupt coding
capacity, or introns may provide the preferential target site for transposition. The black
lines represent the exons of the TPTE gene.
RepeatMasker Information
A.
Chromosome: 21   Begin in Chr: 10249607   End in Chr: 10255769   Genomic Size: 6163
Name: L1PA4
Family: L1
Class: LINE
Band: 21p11.1
Strand: +
SW Score: 39577
Divergence: 7.6%
Deletions: 0.7%
Insertions: 1.0%
Begin in repeat: 2
End in repeat: 6147
Left in repeat: -8
Begin in chr: 10249606
End in chr: 10255769
B.
LINE structure: 5′UTR, ORF I (~300 Amino Acids), ORF II (~1300 Amino Acids), 3′UTR
Annotated components, drawn across three reading frames (scale 0 to 6126; positions 249607 to 255769): 5UTR, LZ, EN, Un, RT, T, RH, 3UTR
Chromosome: 21
Contig: NT_001715
Pos: 10249607-10255769
Strand: (+)
Retroid Agent: L1PA4 LINE
Length: Whole (6124)
Environ: No known genes.
Frameshifts
Gene  Len(AA)  RF  Shift@  Shift to
LZ    334      +1  323     +3
EN    242      +2  70      +1
Un    210      +1  103     +3
Un    210      +1  180     +2
Un    210      +1  189     +1
RT    411      +1  none    none
T     199      +1  none    none
RH    132      +2  none    none
Positions
Comp  Start   End
5UTR  249607  250367
LZ    250646  251658
EN    251725  252449
Un    252450  253082
RT    253083  254315
T     254316  254918
RH    254919  255315
3UTR  255639  255769
Distribution of Retroid Agents on
Human Chromosomes
(November, 2002 Freeze)
Query:
22 distinct reverse transcriptase sequences representing 18 subgroups were used
to query the NCBI’s Human Genome Database
Results:
1) Retroid Agents are not randomly distributed on Human Chromosomes.
2) Chromosomes X and Y have the highest percent Retroid Agent sequence.
3) Of those remaining, Chromosome 4 has the highest and Chromosome 20
the lowest percent Retroid Agent sequence.
Only two chromosomes, 19 and 21, are without at least one intact and potentially
active LINE. Using exact sequence lengths for each hit of each category indicated in
the table of data, the November freeze of the human genome contains at least 1.01%
unique RT sequences, 0.35% full-length LINEs and 0.032% active LINEs.
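The LINE percentages can be roughly checked from the table totals. The lecture used the exact length of every hit; the sketch below instead assumes a nominal length of 6,000 bp per full-length LINE, in line with the roughly 6.1 kb L1PA4 example shown earlier, so it is an approximation only. The function and constant names are hypothetical.

```python
GENOME_MBP = 2845        # total size from the per-chromosome table (Nov. 2002 freeze)
FULL_LINES = 1656        # full-length LINEs, from the same table
PERFECT_LINES = 153      # perfect (potentially active) LINEs
ASSUMED_LINE_BP = 6000   # assumption: nominal length of a full-length LINE


def percent_of_genome(count, element_bp, genome_mbp):
    """Percent of the genome covered by `count` elements of `element_bp` base pairs."""
    return 100.0 * count * element_bp / (genome_mbp * 1_000_000)


full = percent_of_genome(FULL_LINES, ASSUMED_LINE_BP, GENOME_MBP)
perfect = percent_of_genome(PERFECT_LINES, ASSUMED_LINE_BP, GENOME_MBP)
print(f"full-length LINEs: {full:.2f}% of the genome")
print(f"perfect LINEs:     {perfect:.3f}% of the genome")
```

Under this assumption the results round to the 0.35% and 0.032% figures quoted above.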
New hypotheses from discovery-based research
1) Low frequency RT-like sequences (not from LINEs or ERVs) are discernible in the Human Genome.
2) Human low frequency RT-like sequences are remnants of ancient invasions.
3) Human low frequency RT-like sequences are remnants of failed invasions.
4) The pattern of low frequency RT-like sequences is unique in each organismal genome.
5) Both unique and trans-organismal patterns of low frequency RT-like sequences are found in Eukaryotes.
What mechanisms could be maintaining these signals?
1) Gene conversion, an event without a mechanism.
2) Transcriptional inactivation due to methylation of CpG regions.
3) Translational recoding.
4) Complementation.
Eric Donaldson, B.S., Bioinformatician II
Dustin Lee, M.S., Bioinformatics Programmer
Aaron Juntunen, Undergraduate programmer
Crystal Hepp, Undergraduate
Kendal Harwood, Undergraduate
Dr. Marcella McClure, P.I. (Marcie)