Viral Metagenome Analysis

Nicholas Upton

Introduction

A metagenome (a genomic sample taken directly from the environment) can be an extremely useful tool in biological studies, providing valuable data for ecology, genomics, and systematics. Metagenomic studies have shown promise in giving better estimates of biodiversity than cultivation techniques, with one study suggesting that culture-based studies may capture less than 1% of the bacteria and archaea found in metagenomic samples [1]. Other studies have estimated bacterial and viral biodiversity and discovered new species and strains: a study of the Sargasso Sea found nearly 150 new bacterial species [2], and a metagenome of two marine environments yielded an estimate of up to 7000 viral species, up to 65% of which were uncharacterized [3].

This analysis focuses on a metagenome sequenced from two alkaline hot springs in Yellowstone National Park, Octopus and Bear Paw, by Schoenfeld et al. The hot springs are populated by bacteria and archaea and by the viruses which infect them [4]. The techniques used to isolate and sequence the viral DNA are described in that article. The sequences found were placed into NCBI's GenBank, and most remain uncharacterized. Since any characterization of a read would add to the body of biological and virological knowledge, it was decided to attempt to identify a series of viral reads. Two random reads were chosen (OctHS.ATYB2334-b2 and OctHS.ATYB2334-g2, from Octopus hot spring), although due to time constraints only one (OctHS.ATYB2334-b2, see appendix I for sequence) was thoroughly analyzed. Due to the sequencing technique, fragments of the primer and linker used in sequencing remained on the reads [5]; Jeff Elhai developed a technique to remove these unwanted parts, and the edited reads were used for the remainder of the analysis.

From a read to a contig

Since the selected read is 969 nucleotides (nt) long, and the average viral gene length is 800-1500 nt [6], it is not very likely that the read contains even one full gene. So the first major step in characterizing the read is to extend it into a longer contiguous read, or contig. The primary tool used for this, and for most of the rest of the analysis, was BioBIKE, specifically the iteration for virology known as ViroBIKE [7]. BioBIKE includes a function, SEQUENCE-SIMILAR-TO, which uses NCBI's BLAST to compare an input sequence to the sequences in a collection.

[1] Impact of Culture-Independent Studies on the Emerging Phylogenetic View of Bacterial Diversity. J Bacteriol. Hugenholtz et al. 1998.
[2] Environmental Genome Shotgun Sequencing of the Sargasso Sea. Science. Venter et al. 2004.
[3] Genomic analysis of uncultured marine viral communities. Proc Natl Acad Sci USA. Breitbart et al. 2002.
[4] Assembly of Viral Metagenomes from Yellowstone Hot Springs. Appl Environ Microbiol. Schoenfeld et al. 2008.
[5] Ibid.
[6] 1/f Correlations in viral genomes: a Fast-Fourier Transformation (FFT) study. Indian J Biochem & Biophys. Rekha and Mitra 2006.
[7] http://biobike.csbc.vcu.edu:9003/biologin

In order to build a contig, overlaps with other sequences must be found, and they must meet certain criteria; one or two matching nucleotides, after all, are not enough overlap to consider the sequences connected. 95% identity over at least 20 nt is the standard mentioned in Schoenfeld's article [8], and was used here as well. Additionally, an overlap should occur at one end of one read and at the opposite end of the other; if it falls in the middle of both reads and does not extend to the end of either, the sequences are most likely not a good match.
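
The overlap criteria above can be made concrete with a small Python sketch. It is only an illustration of the 95% identity over at least 20 nt rule and of the end-to-end requirement, not the BioBIKE or BLAST procedure used in this analysis: it compares sequences without gaps, and the function names (`percent_identity`, `best_end_overlap`, `join_reads`, `reverse_complement`) are hypothetical helpers introduced here.

```python
from typing import Optional, Tuple

# Translation table for complementing nucleotides (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement, for reads that match in inverse orientation."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def percent_identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_end_overlap(
    left: str, right: str, min_len: int = 20, min_identity: float = 0.95
) -> Optional[Tuple[int, float]]:
    """Longest suffix of `left` that matches a prefix of `right`.

    Ungapped comparison only (an assumption of this sketch).
    Returns (overlap_length, identity), or None if no overlap qualifies.
    """
    max_len = min(len(left), len(right))
    for length in range(max_len, min_len - 1, -1):
        identity = percent_identity(left[-length:], right[:length])
        if identity >= min_identity:
            return length, identity
    return None


def join_reads(left: str, right: str, **criteria) -> Optional[str]:
    """Join two reads at a qualifying end overlap, keeping the overlap once."""
    hit = best_end_overlap(left, right, **criteria)
    if hit is None:
        return None
    length, _identity = hit
    return left + right[length:]
```

A read that matches in inverse orientation could be passed through `reverse_complement` before calling `join_reads`. Because only the suffix of one read is compared with the prefix of the other, a match that sits in the middle of both reads is rejected automatically.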

SEQUENCE-SIMILAR-TO was run with the read as input, searching over the edited reads of the hot spring metagenome. Fitting results were visually aligned with the query sequence and, if they matched sufficiently, the two were joined. As an example, for the initial read (OctHSe.ATYB2334-b2), three matches had an overlap of 221 nt with 99.55% identity and matched at the opposite end from the query sequence. The longest of these (OctHSe.ATYB2327-b2) was joined to the read. This process was then repeated with the newly joined sequence, and continued until 9 reads had been combined into a contig of 5546 nt (see appendix II for sequence).

Fig. 1: map of the extended contig
[Figure not reproduced: reads and their overlaps. Reads shown: OctHSe.ATYB4255-g2 (inverse), OctHSe.ATYB3926-g2, OctHSe.ATYB2327-g2, OctHSe.ATYB2334-b2, OctHSe.ATYB6595-g2, OctHSe.ATYB6782-g2, OctHSe.APNO1640-b2, OctHSe.ATYB2327-b2 (inverse), OctHSe.APNO838-b2.]

Finding significance

Once a long contig is found, the next step is to determine whether it contains any significant gene products. From these, a comparison can be made to known proteins to determine the relatedness of the sequence to other organisms. In order to establish potential proteins, one should first attempt to find open reading frames that could contain coding sequences. A good tool for this is GeneMark [9], which takes a sequence and predicts open reading frames from it. When the contig was run through GeneMark's prediction program, several open reading frames were found, all of which, interestingly, are on the + strand. The contig was then run through BLAST's translated-nucleotide vs. protein comparison to identify similarities to any known protein products. Translated-nucleotide vs.
protein was used because DNA sequences diverge faster than the proteins they encode; as many codons may code for a single amino acid, the specific DNA sequence can vary widely and still produce largely similar proteins. Following this, BioBIKE's SEQUENCE-SIMILAR-TO function was again used, this time comparing to the database of *known-viruses*, by translated-nucleotide vs. protein comparison.

[8] Op. cit. 4
[9] http://exon.biology.gatech.edu/

Fig. 2: map of open reading frames, compared to known protein products
[Figure not reproduced. Contig: 1-5546. Predicted open reading frames: 1-1069, 1085-2025, 2289-2423, 2426-3163, 3189-3461, 4520-4891, 4966-5448. Matches shown, from the *known-viruses* and GenBank searches:
Hypothetical protein, S. islandicus rudivirus, 2420-3128 (29.7% ID)
Hypothetical glycosyltransferase, Stygiolobus virus, 68-335 (26% ID)
Hypothetical protein, S. islandicus filamentous virus, 2420-3128 (30.7% ID)
Hypothetical glycosyltransferase, S. islandicus rudivirus, 68-355 (23% ID)
Hypothetical protein, Stygiolobus virus, 2048-3139 (24-30% ID)
Hypothetical glycosyltransferase, Acidianus virus, 68-1105 (17-26% ID)
Hypothetical protein, S. islandicus rudivirus, 2058-3151 (24-29% ID)
Hypothetical protein, Acidianus virus, 2058-3151 (25-34% ID)]

There are several things of note on this graph. First, nearly all matches overlap the predicted open reading frames rather than falling neatly within a single one. There are several potential reasons for this. Open reading frame prediction is never perfect; GeneMark may have left out some areas which could code for proteins. Additionally, there may be (and most likely are) differences between the organisms coding for the known protein and the organism from which the sequence is taken. Furthermore, the sequence may be (and most likely is) imperfect, with occasional insertions, deletions, and wrong nucleotides which may hinder the open reading frame prediction. Next, one may notice the organisms which have protein products similar to the contig. They are all viruses, infecting Sulfolobus islandicus, Acidianus spp., and Stygiolobus spp.
Sulfolobus, Stygiolobus, and Acidianus are all genera of Sulfolobaceae, a family of thermophilic archaea known to exist in hot springs [10]. Since the viruses were isolated from hot springs which are known to contain archaea, it is no surprise that the closest matches should be viruses of hot spring archaea.

Also of note is the low sequence identity, in the range of 20-35%. If the protein does have a similar function, one might expect its amino-acid identity to be higher. However, many amino acids have chemical properties similar to those of others, so while a change between them alters the sequence, it may not alter the function or structure of the protein. The protein may also share an origin with the known proteins but have somewhat different properties. The matching proteins with a predicted function are putative glycosyltransferases, a large family of proteins with various biological functions; a glycosyltransferase attaches a sugar to a substrate by glycosidic linkage, but the sugar, the substrate, and the purpose of the reaction vary widely [11]. So the contig may code for a glycosyltransferase with similar function to the known viruses' glycosyltransferase, or for a different protein of different function with a few somewhat conserved domains, or may not even code for a protein at all.

[10] http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=118883
[11] www.cazy.org/fam/acc_GT.html

It is important to note that the matches found are all partial matches; they do not fully match any known gene product, even ignoring their low identity. So it may also be that the contig's gene product has a few similar domains, such as a transmembrane section or receptor site, while the rest of the protein has an entirely different function.
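
The reason given earlier in this section for searching at the protein level, that many codons code for a single amino acid, can be illustrated with a short Python sketch. It assumes the standard genetic code, and the two 12 nt sequences used in the usage note are made-up examples, not taken from the contig; `translate` and `identity` are hypothetical helper names.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA string in frame 1; unknown codons (e.g. with N) become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

For instance, `GCTCTTAGAAGT` and `GCGTTGCGCTCC` agree at only 4 of 12 nucleotide positions, yet both translate to the same four amino acids (`ALRS`). A nucleotide-level search could miss such a pair, while a translated search would score it as a perfect match.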

At any rate, with some matches to known organisms, the next step is to compare sequences to try to determine interrelatedness, to see where the contig may fit in the "tree of life."

Phylogenetic analysis

One of the most useful products of phylogenetic analysis is the phylogenetic tree, the well-known forking diagram showing potential lines of descent based on various evidence, in this case the differences between genetic sequences. BioBIKE's TREE-OF function assembles such a tree from an alignment of sequences. From these trees we can determine the organisms or sequences most related to a given sequence, such as our contig. A TREE-OF the 20 sequences within the hot springs metagenome most similar to the contig (excluding the sequences used to assemble the contig itself) was assembled. This tree could be used to determine which sequences may be from the same virus or from a similar virus, or to assemble a more general consensus sequence of the contig that may assist in finding other matches to known viruses.

Fig. 3: most related sequences to contig

It is interesting to note that all the sequences, like all the sequences used to build the contig itself, are from the Octopus hot spring. In fact, on the order of 99% of the sequences similar to the contig are from Octopus, with very few from Bear Paw. This is somewhat contrary to Schoenfeld's conclusion, which indicates a high degree of crossover between the two hot springs [12]. It may be that the virus from which the contig derives only occurs in Octopus, or that the segment of the virus that the contig shows is highly variable between the two hot springs.

Another use of the phylogenetic tree in this case is to determine the interrelatedness of the known viruses which are similar to the contig. SEQUENCE-SIMILAR-TO was run with the contig against the GenBank database, by translated DNA vs. protein comparison, to find the reported proteins of similar viruses.

Fig. 4: tree of actual viruses similar to contig (modified to improve readability)

Note that the two main clades are filamentous and rod-shaped viruses, rather than being divided by known host organism. This suggests that differentiation occurred by morphology before occurring by host. Since the contig is highly related to both of these forms, it is not obvious to which morphological pattern, if either, the contig's source belongs.

[12] Op. cit. 4

Conclusions

Through the various analytical techniques described above, we have added new information to the body of viral sequences, identifying potential genes and identities for an unknown viral sequence. From the relatively simple techniques used, similarity to Sulfolobaceae-infecting viruses was found, suggesting that the sequence itself may be from a Sulfolobaceae-infecting virus. Additionally, similarity (although fairly slight) was found to putative glycosyltransferase proteins, indicating the possible presence of a glycosyltransferase in the virus from which the sequence was taken. More generally, the exercises above have demonstrated the ability of metagenomic analysis to contribute to systematics and genomics by providing information usable in evolutionary and genetic studies. With effort, the bioinformatic processes used above may lead to significant advances in the field of biology.

Appendix I

Sequence of initial read

>OctHSe.ATYB2334-b2 from 1 to 969
TAGGTCAAGAGGTCACTGTAGTTGACATACCGCCACAGAATGACGACTAC ATAGTGGAGGGCTATGTAGACCTTTCGTCGCTTCGACAAAACGACGTGCT TATAGTTACTGAGTATATGGCTCTAGACGGCGTGAACTATAAGCCATTTG CTAGCTATTCGTTCAGCGGACCGTTGCTTGCGCCTATTTACAGATTTCAC ACTAAGACTCTGTATAGGGAGATGGCTTACAGGGTTGCCGTGAAACTTGT CGCAGGCACGACGCCTAAGTCTGTGCCATACGCGTTCATTGTAGAAGTTT TAGCGTCAGTTTAGCCGAAGATGTTTACATGTTTTTTTGGTTGTCTCCTA ACGGATAAAATTTATTAATTTGCGCAGTATTTACGCCGATGTACAAAGAT ACGTCAAAGGGGACTTCTGTGTCTACTGTGTACAAATTCGTCGATGTAGA TGGTCATATTGCCAAGTTGTTAGAGCAAGGCGAAAAGATTGACGACGTCG TAAAACAGTTTAAGGACTTCCTGGTTGGCATAAGGCGTATCAGGGGAAAT GTGTTTTTGCGCGAGGGGGTGAACTTCATCTGGAGAGCAATAACCGGAGA GTCGGGGTTGACATATTTCGGCCCAAATTCATGTATCGGCGTCGGGGATG GCACAAGCGACCCTAGTCCGTCTCAGACTGGTCTGACTGGGACTAACAAA ACGTATAAGCAAGTTGATAACGGATACCCCCAAATAATGGACAACAGGAT AGTGTTCCGTGCTACGTTTGGTCCCGACGATGCTAACCATGCCTGGNCAG AATGGACCGTTGCGAATGGCTGTGGCGATNCATACATAAACATAAACGGC AGGAAAGAAAACTGNGCGTAAATCTAGCGGTCTACATGGTGCTGGAGTGA CGTTGTCATAATTAACGGATAAAAACATAGTTAGACCAGGTGATTTTGCA CTGCCTGCCAGCGATCTCC

Appendix II

Sequence of contiguous read

>contig from 1 to 5546
GAAGATGCGCGGTGAATTCTTATCGATGGTGTATGAAAACGTCGAGAACA AAAGCGTCAGGATACCAACGCCGTGGCAACTTGCGACGCTTGCCGAGTAC ATGGCAATACCGCAGTCGCTTGTCGCTAAGTCAATAGCTGAGTACGGTCT CGACCGTGACTTCGCCGAATTCGTCGCCAGGTACATAGCCGTCAAACCGC TTAAGGCGGACTACAGGCAAGTCATCACGAGCGCGATTAGGGCGCTTAGA TACGGGGCAATTACTAGGGACACGTTTGACAACATCTTGAAGAACGCGCT GAACTACGGTTTCACAGAGCCCGAAATTGCGCTGATCCAGATGAGGGCTG AGCTGGACCTTGCAGTCGATTCGGCAAGAGAGTACGTACCAACGCCGACC ATGCTGGCAACCATAGCAGAAGTCTTGCCGGAGGTACGCGACTACATTAA GGAGGTGTTTGAGGCTCGTCGGATACGTGGCGTCTGGGCTGAGCTGTGGG CTAAGTACATCTACTTGAGACCCGTAATCGACGAAGTCAGAAGGTGGCTG AACGCTGTAGTTGCGCTGGTGGAGCGCAACATATTGCCCATGGACGCGCT CAGCGATGTGTTCAAGATACTGTCGACGTATGGTTACGAGAAACTTGAGC AGGAGATAATTACGCGGACTGTGCTGTCAAATAGGGCGAGACGCGCTTGG AATGAGTTGCTAGGCACTGCGCGGAACCTGACTANCATGGCGCGCTACAG CGACGTAGCAGCAGACATGGCGTGGACCCGCGTCGCCAAGATAGTAGACG CGTTGCCCGTCGACAACACAACTAAAGACCTAATNCAGACCATGTGGAAA
CAGTACATAATCCACTACCAGAACTACCCCGAAAATAGGGCGTATGTGAC TGAGCTCGTGACGAGCTACGCCTACGGGGTCATCGACGAGACAGCGTTTG AGAGGGAGTTTGCATACTTTGAAACACTCGGCGTCCCCGAGATGACTATA GCGCTAATAAAGCGCAGGGCGCGACTGAGAAGAATGAGAATTATATCAAC ACGACGCCCGCGCGGCTAAGCTTAATATCTCTTAATGCAATTTTCACATG TCTTCGAATCGAATCAAATCACTGACGACATACTCAAAGAGGTAAACGAA GTTTTAGACCAGCACTTAAAGGAGACTGAGAAACACATTGTCGCGCTCCA GCAGTCGGCGCAACAGCTATCGCGGTTAAGCGACGAGCTCATGAAGATGG CTCAGGAACTGTCTAAACCACTACAGCCACAGACACAGCAACAACAAAGT AGGAGGTAATGAGTTGGCGCGGTCTGCGTTATAAGGGGACTGAGGTGTTG CGTAGCACGGAGTGGAACGCCGTAATCGATGCACTGAACGATCTCTATGG GATGTTCACGTCTGGACACAGCAAGCTGACTGTGGACGAACTGCACGCAC GTGTTGCACGTTTTAGAGAGCGACCGAGCGTGGGAAACAGGCCAGTGCTC CTCGACGACGACCCGGTGCACATTGCCAGCTTTTACGATACAGCAAAGCG GGACCTGGCTGAGTCGATCAACAGCTCAAACGTCGCAACGTATCTAGCCG ACATACGCGGACAAGTCGTCAAGATATCTATAGACGACTACGGTCGCGTC GGCGTAAGAATCGTGGAGCCCGTAGACCGCCATGGTAGAGTGCTTATGGC TATGCCTCAAGAGCTGATAGACGAGATGTCCCCCGTGTATACTGCGGGCT CCCTCCCAGGGCCGTCTAACACAAGCGGATTTAGCTTAACGCTTATGAAA GGTGGCCGGCCGTATGTCAACATCTACTACGTTTTGGGAGGCCCAGGCAC GGTATACGTAGAGGTATCACTAGACGGACAGACGTGGCGCCTATTGGACA CAATCACTCTGTCTTCTGCAGGGTCTGGCTTGAAGATATACCAAGGCGTC GCCTACCCACACGTCAGAGTCCGCACGAACGCTACTGGGGTGGACGTCTA CTTTGAAATAGTTGCGTCAAGGTGACAAACGTGTATATGCGCAATGGCCC GTTGAAGATACTCTACGTGTACCAATTTACGCATACTACGTCGTTTACCA TAGTGTCGAAGAAACATATTGAATACATCCCGCAGTTAGGTCTTGCACAT GTAGACGAGCTGTCGATAAACGCGTTCCCCTGGGTTAAGCCGACGCGACG GTACGTGGCGGTGCTCCACCCGTTTCTAATTACGTGGACTTAAGCTGTCG CACTCTATGAGCAGGTCTTCGGAGGTGACCGGAAACATATGCTAGAAGAG CTGAAGAGTTGGTATGAGAAGGTAGTGGGAATCTATGTATGTGACAGCGA CGCANTATCGAAAGAAGCTGTTGAAGTGCTTAATGAAACAGATGTGCTCA TAGTCCCGTCGGAGTTTGCATAGACGTGTACAGGAGGTGTGGAGTATCCA AGCCTGTCTTTCAAGTGCAACACGGCGTAGACCCCGAATGGTACGCTATG AAAAGCGTCTGGAACGGAGGAGCTGTAGCTGGCATGGTCCCTGCACTGAT ACGCCTATACCAGTACAAGAAGTATACGGGGAAGAAACTACTGTTGTGGT GGTTCTGGCATTCGTGGCTAAGGAAAGGAGGTCCATGGGTTAAGGCTGTT TATGAGAGACTTATCAGAAAGAGAAACGACGTGAAACTCGTAGTAAAGAC GGGCCAATTCAATATGCACGAGCTAGGCGAATTAATGGGGTTAGGCATAG TGCAGATCTACGGCTGGCTTACGGAACATGAAAAGATGGCTCTGTACGAC 
ATGGCAGACATTACTTTGATGTTCAGCGTCGGTGGGGCGTTTGAGCTGAA CGGGCTTGAATCTTTGGCGCGGGGCGTGCCTGCCGTGGCGGTGGACTGGG GCCCGTGGACTGAGTACATTCCGCCTTTCTTGAGGGTTAAGCCAGGCGAA AAAGTGAGACCCCTGCCAGACAACAGCGTCCACACTGGTTATGGCTACGC AGTTGATGTGGAGGACGCGGTGGCGAAAATTGAAGATATTCTAGATAACT ATGAAGAGTATAAAGCTAAGGTAGATGAGTGGCGTTGGAAGACATTGTAC CCCCAGTACCGGTGGGACGTAGTGGCGAAACAGCTAATTTCCGCCATCTC TATTAATACTTAAATACTAATTATGAAGATGTATAAACATGCCAGACTTC GCAATTGTGCCTGAACTCCCTGCCGGGTATCCTATTGACAACAGAAGGTC GGTCGCGAGGATTGTAAGCGTGCAGACAGTGATCGTGAGCTCGATACCTG AAGTAACGATACCTACAAGTGAGGTTGGTGCACATTCGTACGCCGTAGCC ATGGACGCTACAGATGCCAGGCCTAGCGACGCTCCTGGCGTAGACGTAGC GACTCCATCTGTCTCTGTTCAAGCAGTAGAAACCGCGTCGGCTGAAACGC GTGTGCGTTAGCATGTCGGCGAGTCGCATATTGGTGGGTCGCAAGACTGT AATTGCTGAGGTTACGTTGAGGGGTTTCTATGTGCTAGACAGCGGTTTGT TGATCCGCAATAAGTTCACCCGCAACATGGCCTACATTATAGACGGACTT CTGAGGAGGGCTGACGTAAGCGTCTACGACGAAGGAGGGGCTACCAGGAC CGTAAGGACGAGCGACGGCGGGCTGACGAGGAGCGCAATTAATAGGCCCG ATCAGTGGACACCCGTAGATCCAAACAATTTAAGTGAACAAGGCCCACCG TTTTTCATACATCTGGGCACAGGCACTGCGCAACCTACCGTGAACGAACA CAACTTGGCGAATCCGTCGGCACATCTGCCGACGTCGTATATCGACATCA TTGAGGATACCNACACACAGGTATACTTGTGGCGTCTCGCTACGCTCTGT CATCGCCTGTAAACGTTATGGAGATCGGCCTGAAGTACCTTTTGACACGG CCTACAACTATGCGACTTTGTGTCCCGCGCCATGTTACCATCGGCAGTCG GCAGGTCGCCGTACACCGTATATTTCGACGGTTATCAGCTCAACTTCCCA ACGACGTTTACCCGCTGGTTCGTGCGTGCTCTGTACTGCGCGATGGTTGG ACACCGCAAACGCCCAACAATATGTCTGTCAGCTAAGGCGACGAACGGCT CTGACTACTCAATACAAAGCGGGACGCCATTTGCTGGGTCGCCCGATGTA GCCATTGGTTCGGACAACGCGCCAGTTTCGCCGACTGATGTAAATTTGCG CTCCCCAATCGCTTCGCTGTCTAATCAGACACATACGGTAGAAGTTGATA CGGCGCTCAACGAAGCGCGTATCGTGCGCACTGGCATATATACGCCAACA TCGACTGTGCAGCTAGGCGAAATAGGGCTGTTTGCCAACATCAACGGCTA CGTCGGCGCAACAGTCGCGGGAAGGAGGACGTTGATAGTCAGGGTAGCCC TGCCCCAGCCGATCACTCTCAACGCCGGGACTACGTACACCATCGGCATA ACGCTGAAGTTCGCCTGAGATGCCAGTAATAGTTCTTGCTAAATATAACT ACACTGGCGTTACAAATATCACAAGCATAGGTCAAGAGGTCACTGTAGTT GACATACCGCCACAGAATGACGACTACATAGTGGAGGGCTATGTAGACCT TTCGTCGCTTCGACAAAACGACGTGCTTATAGTTACTGAGTATATGGCTC TAGACGGCGTGAACTATAAGCCATTTGCTAGCTATTCGTTCAGCGGACCG 
TTGCTTGCGCCTATTTACAGATTTCACACTAAGACTCTGTATAGGGAGAT GGCTTACAGGGTTGCCGTGAAACTTGTCGCAGGCACGACGCCTAAGTCTG TGCCATACGCGTTCATTGTAGAAGTTTTAGCGTCAGTTTAGCCGAAGATG TTTACATGTTTTTTTGGTTGTCTCCTAACGGATAAAATTTATTAATTTGC GCAGTATTTACGCCGATGTACAAAGATACGTCAAAGGGGACTTCTGTGTC TACTGTGTACAAATTCGTCGATGTAGATGGTCATATTGCCAAGTTGTTAG AGCAAGGCGAAAAGATTGACGACGTCGTAAAACAGTTTAAGGACTTCCTG GTTGGCATAAGGCGTATCAGGGGAAATGTGTTTTTGCGCGAGGGGGTGAA CTTCATCTGGAGAGCAATAACCGGAGAGTCGGGGTTGACATATTTCGGCC CAAATTCATGTATCGGCGTCGGGGATGGCACAAGCGACCCTAGTCCGTCT CAGACTGGTCTGACTGGGACTAACAAAACGTATAAGCAAGTTGATAACGG ATACCCCCAAATAATGGACAACAGGATAGTGTTCCGTGCTACGTTTGGTC CCGACGATGCTAACCATGCCTGGNCAGAATGGACCGTTGCGAATGGCTGT GGCGATNCATACATAAACATAAACGGCAGGAAAGAAAACTGNGCGTAAAT CTAGCGGTCTACATGGTGCTGGAGTGACGTTGTCATAATTAACGGATAAA AACATAGTTAGACCAGGTGATTTTGCACTGCCTGCCAGCGATCTCC