Searching for Mobile Genetic Elements in the Genome of the Tasmanian Devil
(Sarcophilus harrisii)
German Lagunas-Robles & Peter Arensburger
California State Polytechnic University, Pomona
Abstract
All eukaryotic genomes contain mobile DNA segments known as transposable elements (TEs), which can transpose between nonhomologous sites. Various diseases, including cancers, as well as changes in gene expression can be associated with TEs, which makes their annotation in genomes important. The Tasmanian devil (Sarcophilus harrisii) population is facing extinction as a result of a transmissible cancer, devil facial tumor disease (DFTD), the origin of which is still poorly understood. Given the possible links between TEs and various cancers, I undertook the annotation of one class of TEs (class II TEs) in the recently sequenced Tasmanian devil genome. The TE component of the Tasmanian devil genome was annotated de novo using a variety of bioinformatics tools. Using this list as well as a previously published Tasmanian devil TE list by Gallus et al. (2015), I used the RepeatMasker program to screen the Tasmanian devil genome for low complexity DNA sequences and various repeats, including TEs. I wrote custom analysis scripts in the Perl programming language to analyze the results. Sixteen potential transposable elements were identified and scored for their likelihood of being real TEs. In this poster I present a detailed description of my bioinformatics search methodology as well as a summary of the novel TE sequences I discovered.

Introduction
Transposable elements (TEs) make up a significant percentage of the genome in all organisms. These elements are mobile and, if allowed to transpose, can affect the organism's gene expression. When the relationship between TEs and the host genome is examined, TEs can sometimes act as parasites that attempt to replicate (increase their copy number within the host genome) and transpose, either within the genome or to a new genome altogether via a vector species (e.g. viruses; Munoz-Lopez and Garcia-Perez 2010). The movement of a TE into the genome of a different species is known as horizontal transfer.

TEs are classified into two main groups: Class I and Class II. Class I TEs (a.k.a. retrotransposons) can be similar to retroviruses in structure and lifestyle (Munoz-Lopez and Garcia-Perez 2010). Class II TEs (DNA transposons) transpose by a cut-and-paste mechanism. A full-length Class II TE has two terminal inverted repeats (TIRs) flanking the element (Munoz-Lopez and Garcia-Perez 2010). Previous TE activity can still be seen in the genome, where remnants of TIRs allow sequences to be classified as potential fragmented TEs. Finding a complete, intact TE is unlikely, but finding remnants of TEs is quite likely.

Results
Table 1. Possible transposable element rankings. Significant similarities are the elements to which each sequence is most similar when aligned.

Possible Element Name | Element Probability | Significant Similarities (Repbase Database)
rnd-3_family-432_Sh | not a transposable element | N/A
rnd-4_family-65_Sh | possible fragmented transposable element | Charlie1
rnd-4_family-509_Sh | possible transposable element | hat-1_MeU
rnd-5_family-155_Sh | not a transposable element | N/A
rnd-5_family-671_Sh | not a transposable element | N/A
rnd-5_family-1106_Sh | not a transposable element | N/A
rnd-5_family-1563_Sh | possible transposable element | hat-1_MeU
rnd-6_family-59_Sh | possible fragmented transposable element | Blackjack
rnd-6_family-332_Sh | not a transposable element | N/A
rnd-6_family-1271_Sh | not a transposable element | N/A
rnd-6_family-1583_Sh | not likely a transposable element | Blackjack
rnd-6_family-1913_Sh | possible transposable element | Cheshire-2_MD
Charlie1b_Sh | not a transposable element | N/A
Charlie24_Sh | not likely a transposable element | Charlie24
Mariner1_MD_Sh | possible transposable element | Mariner1_MD
Mariner3_MD_Sh | possible transposable element | Mariner3_MD

Figure 1. Charlie24_Sh BLAST run against the nr database, used to judge whether this possible TE is a strong candidate.
Figure 2. Charlie24_Sh BLAST run against the Repbase database, used to judge whether this possible TE is a strong candidate.
Graph 1. Charlie24_Sh sequence lengths, used to determine which sequences are candidates for the BLAST run. (Length of Sequence, 0 to 2000, plotted against Sequence Number, 1 to 149.)
Graph 2. Rnd-4_family-65_Sh sequence lengths, used to determine which sequences are candidates for the BLAST run. (Length of Sequence, 0 to 600, plotted against Sequence Number, 1 to 1749.)
Graph 3. Rnd-6_family-1913_Sh sequence lengths, used to determine which sequences are candidates for the BLAST run. (Length of Sequence, 0 to 1400, plotted against Sequence Number, 1 to 553.)
Photo Credit: Bonorong Wildlife Sanctuary

Discussion
After the possible transposable element sequences were compared to the Repbase database, they were still only potential transposable element sequences. To determine whether each sequence is truly a transposable element, I ran a BLAST comparison against the non-redundant (nr) BLAST database to check that the sequences flagged as possible TEs had not been previously identified. The possibility that a matching nr sequence itself contains a transposable element must also be considered. Therefore external sources were consulted in order to decide whether an element is indeed a likely TE.

At first glance, Charlie24_Sh exhibited a relatively low e-value when compared to the Repbase sequence of Charlie24. This suggested that Charlie24_Sh was a possible TE, but when it was run against the nr BLAST database, an MHC I gene came up as the best match to this sequence. The devil genome has been found to have low complexity of MHC I genes. This might explain the large number of Charlie24_Sh sequences, so Charlie24_Sh might not represent a true TE sequence.

Rnd-4_family-65_Sh, a possible TE indicated by the initial RepeatModeler run, was significantly similar to the Charlie1 sequence when compared to the Repbase database. When it was compared to the nr BLAST database, it was similar to sequences on human chromosomes 7 and 8 (two sequences were compared). Upon further examination, one of the two chromosomes was annotated for transposable elements, one of which was Charlie1. This qualified rnd-4_family-65_Sh as a possible transposable element.

Rnd-6_family-1913_Sh, a possible TE indicated by the initial RepeatModeler run, exhibited a relatively low e-value when compared to the Repbase sequence of Cheshire-2_MD. When rnd-6_family-1913_Sh was compared to the nr BLAST database, a sequence in the tammar wallaby (Macropus eugenii) matched it. Previous annotation of the tammar wallaby genome did not indicate any TEs. There is a strong possibility that this is in fact a TE that was passed through the marsupial lineage by way of horizontal transfer.

Further studies and analyses should be done to expand on these findings. One finding of particular interest is the possibility of horizontal transfer between marsupials.

Methods
The objective of this research project was to identify class II transposable elements in the Tasmanian devil genome. The genome used in the analysis was assembled by the Center for Comparative Genomics and Bioinformatics at Pennsylvania State University. RepeatModeler, a program that identifies repeat sequences within a genome, was used to obtain a list of possible DNA transposable element sequences. This list was compared, using the BLAST alignment algorithm, to a list of transposable element sequences from Repbase (a database of repetitive DNA sequences) supplemented with specific transposable elements identified by Gallus et al. (2015). The BLAST output was parsed with a custom-made Perl script that recorded element names along with their corresponding base pair positions if the e-value (a metric implemented in BLAST for how good the alignment was; lower values indicate a more significant alignment) was less than or equal to 0.0005. A custom-made Perl script then wrote the sequence names and corresponding base pair positions in BED format, a format that allows for flexibility when defining data lines.

Using Bedtools (a suite of programs for genome arithmetic; Quinlan and Hall 2010) and the BED-formatted file of sequence names and base pair positions, two files containing the sequences of all potential class II TEs from the Tasmanian devil genome were created: one whose headers gave the possible TE name of each sequence, and a second that recorded each sequence's location in the genome. These two files were merged into a single FASTA-formatted file using a custom-made Perl script. The FASTA file was then filtered to keep only sequences of 100 base pairs or more. The resulting list contained 16 possible transposable element names, and the sequence lengths for each name were graphed separately. The peaks of interest were the highest peaks on each graph (2 to 3 sequences per graph). These peaks, which represent the longest sequences belonging to each graph's element, were later compared to the non-redundant (nr) BLAST database as well as to the Repbase sequences to rank the likelihood of the element being a real element and to obtain the possible TE name, its location in the genome, and its corresponding sequence.
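The e-value filter and BED conversion described in the Methods can be sketched in Python as follows. This is an illustration only, not the Perl script used for the poster. It assumes the BLAST results are in the standard 12-column tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), that the query is the TE element and the subject is a genome scaffold, and that the BED score column can be left at 0. The function name and the scaffold names in the example are hypothetical.

```python
E_VALUE_CUTOFF = 0.0005  # threshold stated in the Methods


def blast_hits_to_bed(lines, cutoff=E_VALUE_CUTOFF):
    """Turn BLAST tabular lines into BED lines, keeping hits with e-value <= cutoff.

    Assumes 12 tab-separated columns with the subject coordinates in
    columns 9-10 and the e-value in column 11 (an assumption about the
    BLAST output layout, which the poster does not specify).
    """
    bed = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        element, scaffold = fields[0], fields[1]
        sstart, send = int(fields[8]), int(fields[9])
        evalue = float(fields[10])
        if evalue > cutoff:
            continue  # alignment not significant enough
        start = min(sstart, send) - 1  # BED start is 0-based
        end = max(sstart, send)        # BED end is exclusive
        strand = "+" if sstart <= send else "-"
        bed.append(f"{scaffold}\t{start}\t{end}\t{element}\t0\t{strand}")
    return bed


# Hypothetical hits: two pass the cutoff, one (e-value 0.01) does not.
example_hits = [
    "Charlie24_Sh\tscaffold_1\t98.5\t200\t3\t0\t1\t200\t1001\t1200\t1e-50\t350",
    "Charlie24_Sh\tscaffold_2\t80.0\t50\t10\t0\t1\t50\t500\t451\t0.0005\t40",
    "Mariner1_MD_Sh\tscaffold_3\t75.0\t30\t7\t1\t1\t30\t10\t39\t0.01\t25",
]

for bed_line in blast_hits_to_bed(example_hits):
    print(bed_line)
```

BLAST reports 1-based, inclusive coordinates and lists minus-strand hits with the start greater than the end, whereas BED uses 0-based, half-open intervals. That is why the sketch subtracts 1 from the smaller coordinate and records the orientation in the strand column.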
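The length filter (100 base pairs or more) and the selection of the highest peaks for each element, as described in the Methods, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Perl script used for the poster. The FASTA header layout "element|location" is hypothetical, as are the function names, and the choice of three sequences per element reflects the "2 to 3 sequences per graph" mentioned in the Methods.

```python
from collections import defaultdict

MIN_LENGTH = 100  # minimum sequence length stated in the Methods


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_per_element(records, min_length=MIN_LENGTH, top_n=3):
    """Group sequences by element name and keep the top_n longest of each.

    Headers are assumed to look like "element|location" (hypothetical
    layout). Sequences shorter than min_length are discarded first.
    """
    by_element = defaultdict(list)
    for header, seq in records:
        if len(seq) < min_length:
            continue
        element, _, location = header.partition("|")
        by_element[element].append((len(seq), location))
    return {
        element: sorted(hits, key=lambda hit: hit[0], reverse=True)[:top_n]
        for element, hits in by_element.items()
    }


# Hypothetical records: the 50 bp sequence is dropped by the length filter.
example_fasta = [
    ">Charlie24_Sh|scaffold_1:1000-1200", "A" * 200,
    ">Charlie24_Sh|scaffold_2:450-500", "C" * 50,
    ">Charlie24_Sh|scaffold_4:10-160", "G" * 150,
    ">rnd-4_family-65_Sh|scaffold_3:0-120", "T" * 60, "T" * 60,
]

peaks = longest_per_element(read_fasta(example_fasta))
for element, hits in peaks.items():
    print(element, hits)
```

Sorting each element's sequences by length and taking the first few corresponds to reading the tallest peaks off a per-element length graph, so the same candidates can be selected without plotting.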
References
Munoz-Lopez, M., and J. L. Garcia-Perez. 2010. "DNA Transposons: Nature and Applications in Genomics." Current Genomics 11 (2): 115–128.
Quinlan, A. R., and I. M. Hall. 2010. "BEDTools: A Flexible Suite of Utilities for Comparing Genomic Features." Bioinformatics 26 (6): 841–842. doi:10.1093/bioinformatics/btq033.

Photo Credit: Amie Hindson, Healesville Sanctuary