Download Viral Metagenome Analysis Nicholas Upton Introduction A

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

RNA-Seq wikipedia , lookup

DNA vaccination wikipedia , lookup

NEDD9 wikipedia , lookup

Viral phylodynamics wikipedia , lookup

Genetic code wikipedia , lookup

DNA virus wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Vectors in gene therapy wikipedia , lookup

Helitron (biology) wikipedia , lookup

Protein moonlighting wikipedia , lookup

Sequence alignment wikipedia , lookup

Point mutation wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Genomics wikipedia , lookup

Metagenomics wikipedia , lookup

Transcript
Viral Metagenome Analysis
Nicholas Upton
Introduction
A metagenome - a genomic sample taken directly from the environment - can be an
extremely useful tool in biological studies, providing valuable data for ecology, genomics, and
systematics. Metagenomic studies have shown promise in finding better estimates of
biodiversity compared to cultivation techniques, with one study showing that less than 1% of
bacteria and archaea may be sequenced by culture-based studies compared to metagenomic
samples1. Other studies have given estimates of bacterial and viral biodiversity and discovered
new species and strains; a study of the Sargasso Sea found near 150 new bacterial species2, and a
metagenome of two marine environments found an estimate of up to 7000 viral species, up to
65% of which were uncharacterized3.
This analysis focuses on a metagenome sequenced from two alkali hot springs in
Yellowstone National Park, Octopus and Bear Paw, by Schoenfeld et al. The hot springs are
populated by bacteria and archaea and the viruses which infect them4. The techniques used to
isolate and sequence the viral DNA may be found in the article. The sequences found were
placed into NCBI’s GenBank, and most remain uncharacterized.
Since any characterization of a read would increase the body of biological and virological
knowledge, it was decided to attempt to identify a series of viral reads. Two random reads were
chosen (OctHS.ATYB2334-b2 and OctHS.ATYB2334-g2, from Octopus hot spring), although
due to time constraints only one (OctHS.ATYB2334-b2, see appendix I for sequence) was
thoroughly analyzed. Due to the sequencing technique, fragments of the primer and linker used
in sequencing remained on the reads5; a technique was developed by Jeff Elhai to remove these
unwanted parts and for the remainder of the analysis the edited reads were used.
From a read to a contig
Since the length of the selected read is 969 nucleotides (nt), and the average viral gene
length is 800-1500nt6, it's not very likely that the viral read has at least one full gene within it. So
the first major step in characterizing the read is to extend it into a longer contiguous read, or
contig. The primary tool used for this – and most of the rest of the analysis – was BioBIKE,
specifically the iteration for virology known as ViroBIKE7. BioBIKE includes a function,
SEQUENCE-SIMILAR-TO, which uses NCBI's BLAST to compare an input sequence to a
series of sequences in a collection.
In order to build a contig, overlaps with other sequences must be found, and they must
meet certain criteria; one or two matching nucleotides, after all, are not enough overlap to
1
2
3
4
5
6
7
Impact of Culture-Independent Studies on the Emerging Phylogenetic View of Bacterial Diversity. J
Bacteriol. Hugenholtz et al. 1998.
Environmental Genome Shotgun Sequencing of the Sargasso Sea. Science. Venter et al. 2004.
Genomic analysis of uncultured marine viral communities. Proc Natl Acad Sci USA. Breitbart et al 2002.
Assembly of Viral Metagenomes from Yellowstone Hot Springs. Appl Environ Microbiol. Schoenfeld et
al. 2008.
Ibid.
1/f Correlations in viral genomes – a Fast-Fourier Transformation (FFT) study. Indian J Biochem &
Biophys. Rekha and Mitra 2006.
http://biobike.csbc.vcu.edu:9003/biologin
consider the sequences connected. 95% identity over at least 20nt is the standard mentioned in
Schoenfeld's article8, and was used here as well. Additionally, an overlap should occur at one
end of one read and the other end of the second; if they occur in the middle on both yet do not
extend to the ends of either, the sequences are most likely not good matches.
Using SEQUENCE-SIMILAR-TO with the read as input and searching over the edited
reads of the hot spring metagenome, fitting results were visually aligned with the query sequence
and, if sufficiently matching, the two were joined. As an example, for the initial read
(OctHSe.ATYB23340b2), three matches had an overlap of 221nt with 99.55% identity and
matched at the opposite end from the query sequence. The longest of these (OctHSe.ATYB2327b2) was matched and joined. This process was then repeated with the new matching sequence,
and continued until the read was extended to 9 reads combined to a contig of 5546nt (see
appendix II for sequence).
747nt
767nt
67nt
689nt
47nt
368nt
72nt
Read
221nt
Overlap
5
773
923
990
1
690
893
941
1
369
875
947
1
222
964
7 (ATYB6595-g2)
OctHSe.ATYB4255-g2(inverse)
OctHSe.ATYB3926-g2
OctHSe.ATYB2327-g2
OctHSe.ATYB2334-b2
200 (APNO828-b2)
OctHSe.ATYB6595-g2
OctHSe.ATYB6782-g2
OctHSe.APNO1640-b2
OctHSe.ATYB2327-b2(inverse)
OctHSe.APNO838-b2
964
1
1
68
270
959
1
48
476
844
1
73
763
984
947 (APNO828-b2)
745 (ATYB6595-g2)
Fig. 1: map of the extended contig
Finding significance
Once a long contig is found, the next step is to attempt to determine if there are any
significant gene products within the contig. From this, a comparison can be made to known
proteins to determine the relatedness of the sequence to other organisms. In order to establish
potential proteins, one should first attempt to find open reading frames that could contain coding
sequences. A good tool for this is GeneMark9, which will take a sequence and predict open
reading frames from it.
Running the contig through GeneMark’s prediction program, several open reading frames
are found, all of which are interestingly on the + strand. The contig was then run through
BLAST’s translated-nucleotide vs. protein comparison to identify similarities to any known
protein products. Translated-nucleotide vs. protein was used because of the high rate of mutation
of DNA compared to the protein product; as many codons may code for a single amino acid, the
specific DNA sequence can vary widely and still produce largely similar proteins. Following
this, BioBIKE’s SEQUENCE-SIMILAR-TO function was again used, this time comparing to the
database of *known-viruses*, by translated-nucleotide vs. protein comparison.
8
9
Op cit. 4
http://exon.biology.gatech.edu/
Hypothetical protein, S.
islandicus rudivirus,
2420-3128 (29.7%ID)
*known-viruses*
Hypothetical
glycosyltransferase,
Stygiolobus virus, 68-335
(26%ID)
GenBank
Hypo. .protein, S.
islandicus fil. virus,
2420-3128 (30.7%ID)
Hypo. glycosyltransferase,
S. islandicus rudivirus, 68355 (23%ID)
Hypothetical protein,
Stygiolobus virus, 20483139 (24-30%ID)
Hypothetical
glycosyltransferase,
Acidianus virus, 68-1105
(17-26%ID)
Hypothetical protein, S.
islandicus rudivirus, 20583151 (24-29%ID)
Hypothetical protein,
Acidianus virus, 20583151 (25-34%ID)
1-1069
1085-2025
2289-2423
2426-3163
3189-3461
4520-4891
4966-5448
1-5546
Fig. 2: map of open reading frames, compared to known protein products
There are several things of note on this graph. First, nearly all matches overlap the
predicted open reading frames. There are several potential reasons for this. Open reading frame
prediction is never perfect; GeneMark may have left out some areas which could code for
proteins. Additionally, there may be (and most likely are) differences between the organisms
coding for the known protein and the organism from which the sequence is taken. Furthermore,
the sequence may be (and most likely is) imperfect, leading to occasional insertions, deletions,
and wrong nucleotides which may hinder the open reading frame prediction.
Next, one may notice the organisms which have protein products similar to the contig.
They are all viruses, infecting Sulfolobus islandicus, Acidianus spp., and Stygiolobus spp.
Sulfolobus, Stygiolobus, and Acidianus are all genera of Sulfolobaceae, a family of thermophilic
archaea known to exist in hot springs10. Since the viruses were isolated from hot springs which
are known to contain Archaean bacteria, it’s no surprise that the closest matches should be hot
spring bacteriophages.
Also of note is the low sequence identity, in the range of 20-35%. If indeed the protein is
of a similar function, its amino-acid identity may seem like it should be higher. However, many
amino acids have similar chemical properties to others, so while changes between them would
alter the sequence, it might not alter the function or structure of the protein. The protein may,
too, be similarly derived but of somewhat different properties.
The matching proteins with a theoretical function are putative glycosyltransferases,
which is a large family of proteins of various biological functions; a glycosyltransferase attaches
a sugar to a substrate by glycosydic linkage, but the sugar, the substrate, and the purpose of the
reaction vary widely11. So the contig may code for a glycosyltransferase with similar function to
10
11
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=118883
www.cazy.org/fam/acc_GT.html
the known viruses’ glycosyltransferase, or for a different protein of different function with a few
somewhat conserved domain, or may not even code for a protein at all.
It is important to note that the matches found are all partial matches; they do not fully
match any known gene product, even ignoring their low identity. So it may also be that the
contig’s gene product has a few similar domains, such as a transmembrane section or receptor
site, while the rest of the protein may have entirely different function.
At any rate, with some matches to known organisms, the next step is to compare
sequences to try and determine interrelatedness, to see where the contig may fit in the “tree of
life.”
Phylogenetic analysis
One of the most useful products of phylogenetic analysis is the phylogenetic tree, the
well-known forking diagram showing potential lines of descent based on various evidence, in
this case the differences between genetic sequences. BioBIKE’s TREE-OF function assembles
such a tree from an alignment of sequences. From these trees we can determine the organisms or
sequences most related to a given sequence, such as our contig.
A TREE-OF the 20 sequences within the hot springs metagenome most similar to the
contig (excluding the sequences used to assemble the contig itself) was assembled. This tree
could be used to determine what sequences may be from the same virus, or a similar virus, or to
assemble a more general consensus sequence of the contig that may assist in finding other
matches to known viruses.
Fig. 3: most related sequences to contig
It is interesting to note that all the sequences – like all the sequences used to build the
contig itself – are from the Octopus hot spring. In fact, on the order of 99% of the sequences
similar to the contig are from Octopus, with very few from Bear Paw. This is somewhat contrary
to Schoenfeld’s conclusion, which indicates a high degree of crossover between the two hot
springs12. It may be that the virus which the contig derives from only occurs in Octopus, or that
the segment of the virus that the contig shows is highly variable between the two hot springs.
Another use of the phylogenetic tree for this case is to determine the interrelatedness of
the known viruses which are similar to the contig. SEQUENCE-SIMILAR-TO the contig in the
GenBank database by translated DNA vs. protein was used to find the reported proteins of
similar viruses.
Fig. 4: tree of actual viruses similar to contig (modified to improve readability)
Note the main two clades are filamentous and rod-shaped viruses, rather than division by
the known host organism. This suggests differentiation occurred by morphology before occurring
by host. Since the contig is highly related to both these forms, it is not obvious to which
morphological pattern – if either – the contig’s source belongs to.
12
Op cit. 4
Conclusions
Through the various analytical techniques described above, we have added new
information to the body of viral sequences, identifying potential genes and identities for an
unknown viral sequence. From the relatively simple techniques used, similarity to
Sulfolobaceae-infecting bacteriophages was found, suggesting that the sequence itself may be
from a Sulfolobaceae-infecting phage. Additionally, similarity – although fairly slight – was
found to putative glycosyltransferase proteins, indicating the possible presence of a
glycosyltransferase in the virus from which the sequence was taken.
More generally, the exercises above have demonstrated the ability of metagenomic
analysis to provide contributions to systematics and genomics, by providing information usable
in evolutionary and genetic studies. With effort, based on the bioinformatic processes used
above, significant advances may be made to the field of biology.
Appendix I
Sequence of initial read
>OctHSe.ATYB2334-b2 from 1 to 969
TAGGTCAAGAGGTCACTGTAGTTGACATACCGCCACAGAATGACGACTAC
ATAGTGGAGGGCTATGTAGACCTTTCGTCGCTTCGACAAAACGACGTGCT
TATAGTTACTGAGTATATGGCTCTAGACGGCGTGAACTATAAGCCATTTG
CTAGCTATTCGTTCAGCGGACCGTTGCTTGCGCCTATTTACAGATTTCAC
ACTAAGACTCTGTATAGGGAGATGGCTTACAGGGTTGCCGTGAAACTTGT
CGCAGGCACGACGCCTAAGTCTGTGCCATACGCGTTCATTGTAGAAGTTT
TAGCGTCAGTTTAGCCGAAGATGTTTACATGTTTTTTTGGTTGTCTCCTA
ACGGATAAAATTTATTAATTTGCGCAGTATTTACGCCGATGTACAAAGAT
ACGTCAAAGGGGACTTCTGTGTCTACTGTGTACAAATTCGTCGATGTAGA
TGGTCATATTGCCAAGTTGTTAGAGCAAGGCGAAAAGATTGACGACGTCG
TAAAACAGTTTAAGGACTTCCTGGTTGGCATAAGGCGTATCAGGGGAAAT
GTGTTTTTGCGCGAGGGGGTGAACTTCATCTGGAGAGCAATAACCGGAGA
GTCGGGGTTGACATATTTCGGCCCAAATTCATGTATCGGCGTCGGGGATG
GCACAAGCGACCCTAGTCCGTCTCAGACTGGTCTGACTGGGACTAACAAA
ACGTATAAGCAAGTTGATAACGGATACCCCCAAATAATGGACAACAGGAT
AGTGTTCCGTGCTACGTTTGGTCCCGACGATGCTAACCATGCCTGGNCAG
AATGGACCGTTGCGAATGGCTGTGGCGATNCATACATAAACATAAACGGC
AGGAAAGAAAACTGNGCGTAAATCTAGCGGTCTACATGGTGCTGGAGTGA
CGTTGTCATAATTAACGGATAAAAACATAGTTAGACCAGGTGATTTTGCA
CTGCCTGCCAGCGATCTCC
Appendix II
Sequence of contiguous read
>contig from 1 to 5546
GAAGATGCGCGGTGAATTCTTATCGATGGTGTATGAAAACGTCGAGAACA
AAAGCGTCAGGATACCAACGCCGTGGCAACTTGCGACGCTTGCCGAGTAC
ATGGCAATACCGCAGTCGCTTGTCGCTAAGTCAATAGCTGAGTACGGTCT
CGACCGTGACTTCGCCGAATTCGTCGCCAGGTACATAGCCGTCAAACCGC
TTAAGGCGGACTACAGGCAAGTCATCACGAGCGCGATTAGGGCGCTTAGA
TACGGGGCAATTACTAGGGACACGTTTGACAACATCTTGAAGAACGCGCT
GAACTACGGTTTCACAGAGCCCGAAATTGCGCTGATCCAGATGAGGGCTG
AGCTGGACCTTGCAGTCGATTCGGCAAGAGAGTACGTACCAACGCCGACC
ATGCTGGCAACCATAGCAGAAGTCTTGCCGGAGGTACGCGACTACATTAA
GGAGGTGTTTGAGGCTCGTCGGATACGTGGCGTCTGGGCTGAGCTGTGGG
CTAAGTACATCTACTTGAGACCCGTAATCGACGAAGTCAGAAGGTGGCTG
AACGCTGTAGTTGCGCTGGTGGAGCGCAACATATTGCCCATGGACGCGCT
CAGCGATGTGTTCAAGATACTGTCGACGTATGGTTACGAGAAACTTGAGC
AGGAGATAATTACGCGGACTGTGCTGTCAAATAGGGCGAGACGCGCTTGG
AATGAGTTGCTAGGCACTGCGCGGAACCTGACTANCATGGCGCGCTACAG
CGACGTAGCAGCAGACATGGCGTGGACCCGCGTCGCCAAGATAGTAGACG
CGTTGCCCGTCGACAACACAACTAAAGACCTAATNCAGACCATGTGGAAA
CAGTACATAATCCACTACCAGAACTACCCCGAAAATAGGGCGTATGTGAC
TGAGCTCGTGACGAGCTACGCCTACGGGGTCATCGACGAGACAGCGTTTG
AGAGGGAGTTTGCATACTTTGAAACACTCGGCGTCCCCGAGATGACTATA
GCGCTAATAAAGCGCAGGGCGCGACTGAGAAGAATGAGAATTATATCAAC
ACGACGCCCGCGCGGCTAAGCTTAATATCTCTTAATGCAATTTTCACATG
TCTTCGAATCGAATCAAATCACTGACGACATACTCAAAGAGGTAAACGAA
GTTTTAGACCAGCACTTAAAGGAGACTGAGAAACACATTGTCGCGCTCCA
GCAGTCGGCGCAACAGCTATCGCGGTTAAGCGACGAGCTCATGAAGATGG
CTCAGGAACTGTCTAAACCACTACAGCCACAGACACAGCAACAACAAAGT
AGGAGGTAATGAGTTGGCGCGGTCTGCGTTATAAGGGGACTGAGGTGTTG
CGTAGCACGGAGTGGAACGCCGTAATCGATGCACTGAACGATCTCTATGG
GATGTTCACGTCTGGACACAGCAAGCTGACTGTGGACGAACTGCACGCAC
GTGTTGCACGTTTTAGAGAGCGACCGAGCGTGGGAAACAGGCCAGTGCTC
CTCGACGACGACCCGGTGCACATTGCCAGCTTTTACGATACAGCAAAGCG
GGACCTGGCTGAGTCGATCAACAGCTCAAACGTCGCAACGTATCTAGCCG
ACATACGCGGACAAGTCGTCAAGATATCTATAGACGACTACGGTCGCGTC
GGCGTAAGAATCGTGGAGCCCGTAGACCGCCATGGTAGAGTGCTTATGGC
TATGCCTCAAGAGCTGATAGACGAGATGTCCCCCGTGTATACTGCGGGCT
CCCTCCCAGGGCCGTCTAACACAAGCGGATTTAGCTTAACGCTTATGAAA
GGTGGCCGGCCGTATGTCAACATCTACTACGTTTTGGGAGGCCCAGGCAC
GGTATACGTAGAGGTATCACTAGACGGACAGACGTGGCGCCTATTGGACA
CAATCACTCTGTCTTCTGCAGGGTCTGGCTTGAAGATATACCAAGGCGTC
GCCTACCCACACGTCAGAGTCCGCACGAACGCTACTGGGGTGGACGTCTA
CTTTGAAATAGTTGCGTCAAGGTGACAAACGTGTATATGCGCAATGGCCC
GTTGAAGATACTCTACGTGTACCAATTTACGCATACTACGTCGTTTACCA
TAGTGTCGAAGAAACATATTGAATACATCCCGCAGTTAGGTCTTGCACAT
GTAGACGAGCTGTCGATAAACGCGTTCCCCTGGGTTAAGCCGACGCGACG
GTACGTGGCGGTGCTCCACCCGTTTCTAATTACGTGGACTTAAGCTGTCG
CACTCTATGAGCAGGTCTTCGGAGGTGACCGGAAACATATGCTAGAAGAG
CTGAAGAGTTGGTATGAGAAGGTAGTGGGAATCTATGTATGTGACAGCGA
CGCANTATCGAAAGAAGCTGTTGAAGTGCTTAATGAAACAGATGTGCTCA
TAGTCCCGTCGGAGTTTGCATAGACGTGTACAGGAGGTGTGGAGTATCCA
AGCCTGTCTTTCAAGTGCAACACGGCGTAGACCCCGAATGGTACGCTATG
AAAAGCGTCTGGAACGGAGGAGCTGTAGCTGGCATGGTCCCTGCACTGAT
ACGCCTATACCAGTACAAGAAGTATACGGGGAAGAAACTACTGTTGTGGT
GGTTCTGGCATTCGTGGCTAAGGAAAGGAGGTCCATGGGTTAAGGCTGTT
TATGAGAGACTTATCAGAAAGAGAAACGACGTGAAACTCGTAGTAAAGAC
GGGCCAATTCAATATGCACGAGCTAGGCGAATTAATGGGGTTAGGCATAG
TGCAGATCTACGGCTGGCTTACGGAACATGAAAAGATGGCTCTGTACGAC
ATGGCAGACATTACTTTGATGTTCAGCGTCGGTGGGGCGTTTGAGCTGAA
CGGGCTTGAATCTTTGGCGCGGGGCGTGCCTGCCGTGGCGGTGGACTGGG
GCCCGTGGACTGAGTACATTCCGCCTTTCTTGAGGGTTAAGCCAGGCGAA
AAAGTGAGACCCCTGCCAGACAACAGCGTCCACACTGGTTATGGCTACGC
AGTTGATGTGGAGGACGCGGTGGCGAAAATTGAAGATATTCTAGATAACT
ATGAAGAGTATAAAGCTAAGGTAGATGAGTGGCGTTGGAAGACATTGTAC
CCCCAGTACCGGTGGGACGTAGTGGCGAAACAGCTAATTTCCGCCATCTC
TATTAATACTTAAATACTAATTATGAAGATGTATAAACATGCCAGACTTC
GCAATTGTGCCTGAACTCCCTGCCGGGTATCCTATTGACAACAGAAGGTC
GGTCGCGAGGATTGTAAGCGTGCAGACAGTGATCGTGAGCTCGATACCTG
AAGTAACGATACCTACAAGTGAGGTTGGTGCACATTCGTACGCCGTAGCC
ATGGACGCTACAGATGCCAGGCCTAGCGACGCTCCTGGCGTAGACGTAGC
GACTCCATCTGTCTCTGTTCAAGCAGTAGAAACCGCGTCGGCTGAAACGC
GTGTGCGTTAGCATGTCGGCGAGTCGCATATTGGTGGGTCGCAAGACTGT
AATTGCTGAGGTTACGTTGAGGGGTTTCTATGTGCTAGACAGCGGTTTGT
TGATCCGCAATAAGTTCACCCGCAACATGGCCTACATTATAGACGGACTT
CTGAGGAGGGCTGACGTAAGCGTCTACGACGAAGGAGGGGCTACCAGGAC
CGTAAGGACGAGCGACGGCGGGCTGACGAGGAGCGCAATTAATAGGCCCG
ATCAGTGGACACCCGTAGATCCAAACAATTTAAGTGAACAAGGCCCACCG
TTTTTCATACATCTGGGCACAGGCACTGCGCAACCTACCGTGAACGAACA
CAACTTGGCGAATCCGTCGGCACATCTGCCGACGTCGTATATCGACATCA
TTGAGGATACCNACACACAGGTATACTTGTGGCGTCTCGCTACGCTCTGT
CATCGCCTGTAAACGTTATGGAGATCGGCCTGAAGTACCTTTTGACACGG
CCTACAACTATGCGACTTTGTGTCCCGCGCCATGTTACCATCGGCAGTCG
GCAGGTCGCCGTACACCGTATATTTCGACGGTTATCAGCTCAACTTCCCA
ACGACGTTTACCCGCTGGTTCGTGCGTGCTCTGTACTGCGCGATGGTTGG
ACACCGCAAACGCCCAACAATATGTCTGTCAGCTAAGGCGACGAACGGCT
CTGACTACTCAATACAAAGCGGGACGCCATTTGCTGGGTCGCCCGATGTA
GCCATTGGTTCGGACAACGCGCCAGTTTCGCCGACTGATGTAAATTTGCG
CTCCCCAATCGCTTCGCTGTCTAATCAGACACATACGGTAGAAGTTGATA
CGGCGCTCAACGAAGCGCGTATCGTGCGCACTGGCATATATACGCCAACA
TCGACTGTGCAGCTAGGCGAAATAGGGCTGTTTGCCAACATCAACGGCTA
CGTCGGCGCAACAGTCGCGGGAAGGAGGACGTTGATAGTCAGGGTAGCCC
TGCCCCAGCCGATCACTCTCAACGCCGGGACTACGTACACCATCGGCATA
ACGCTGAAGTTCGCCTGAGATGCCAGTAATAGTTCTTGCTAAATATAACT
ACACTGGCGTTACAAATATCACAAGCATAGGTCAAGAGGTCACTGTAGTT
GACATACCGCCACAGAATGACGACTACATAGTGGAGGGCTATGTAGACCT
TTCGTCGCTTCGACAAAACGACGTGCTTATAGTTACTGAGTATATGGCTC
TAGACGGCGTGAACTATAAGCCATTTGCTAGCTATTCGTTCAGCGGACCG
TTGCTTGCGCCTATTTACAGATTTCACACTAAGACTCTGTATAGGGAGAT
GGCTTACAGGGTTGCCGTGAAACTTGTCGCAGGCACGACGCCTAAGTCTG
TGCCATACGCGTTCATTGTAGAAGTTTTAGCGTCAGTTTAGCCGAAGATG
TTTACATGTTTTTTTGGTTGTCTCCTAACGGATAAAATTTATTAATTTGC
GCAGTATTTACGCCGATGTACAAAGATACGTCAAAGGGGACTTCTGTGTC
TACTGTGTACAAATTCGTCGATGTAGATGGTCATATTGCCAAGTTGTTAG
AGCAAGGCGAAAAGATTGACGACGTCGTAAAACAGTTTAAGGACTTCCTG
GTTGGCATAAGGCGTATCAGGGGAAATGTGTTTTTGCGCGAGGGGGTGAA
CTTCATCTGGAGAGCAATAACCGGAGAGTCGGGGTTGACATATTTCGGCC
CAAATTCATGTATCGGCGTCGGGGATGGCACAAGCGACCCTAGTCCGTCT
CAGACTGGTCTGACTGGGACTAACAAAACGTATAAGCAAGTTGATAACGG
ATACCCCCAAATAATGGACAACAGGATAGTGTTCCGTGCTACGTTTGGTC
CCGACGATGCTAACCATGCCTGGNCAGAATGGACCGTTGCGAATGGCTGT
GGCGATNCATACATAAACATAAACGGCAGGAAAGAAAACTGNGCGTAAAT
CTAGCGGTCTACATGGTGCTGGAGTGACGTTGTCATAATTAACGGATAAA
AACATAGTTAGACCAGGTGATTTTGCACTGCCTGCCAGCGATCTCC