CS 251 Introduction to Bioinformatics
Laboratory 2: Dabbling in Bioinformatics

Today, we will take our first real crack at using bioinformatic tools. We will follow the flow of Chapter 3 of Bioinformatics for Dummies (BFD), in which the authors use a single gene (dUTPase, deoxyuridine 5’ triphosphate nucleotidylhydrolase) as a vehicle for touring several genome databases and for learning some basic terminology and search tools. To make this exercise more interesting for you, we will substitute a gene of our choosing, the mutS/hMSH2 DNA repair gene. We will also ask you to perform some additional steps (e.g., blastp) and answer a variety of questions as you navigate this “road rally” through databases, genomes, and search tools.

First, the essential gene terminology:

mutS is the name given to the prokaryotic (bacterial) version of this universal defender of the genome. (“mut” is an abbreviation that reflects the increased rate at which DNA mutations accumulate in cells that lack this critical gene.)

MSH2 is the name given to the eukaryotic (algae and fungi, plants, and animals) version of this gene. (“MSH” is an abbreviation that means “MutS Homolog”.) The term “homolog” means that the MSH2 gene looks and acts like the mutS gene, i.e., its structure (DNA and protein sequence) is similar to mutS, and it plays a similar role in preventing mutations from occurring.

hMSH2: the prefix “h” in front of a gene name indicates that it is the human version of the gene.

For some background, please obtain the PubMed abstracts of these two recent research articles about the mutS/hMSH2 genes:

Ainsworth P, Koscinski D, Fraser B, Stuart J. Family cancer histories predictive of a high risk of hereditary non-polyposis colorectal cancer associate significantly with a genomic rearrangement in hMSH2 or hMLH1. Clin Genet. 2004 Sep;66(3):183-188. PMID: 15324316 [PubMed - as supplied by publisher]

Watson ME Jr, Burns JL, Smith AL. Hypermutable Haemophilus influenzae with mutations in mutS are found in cystic fibrosis sputum. Microbiology. 2004 Sep;150(Pt 9):2947-58. PMID: 15347753 [PubMed - in process]

Please answer the following questions here:

From the abstract by Ainsworth P, Koscinski D, Fraser B, Stuart J.: HNPCC is a hereditary form of colon cancer caused by defects in DNA repair genes, most notably the hMSH2 gene. About 1 in 200 of us will develop this cancer because we carry a defective copy of the hMSH2 gene. Are there any bioinformatic tools, described in this paper, for predicting risk for this defect in human populations? What is the name of this tool, and its location? At what institution was this tool developed and housed?

From the abstract by Watson ME Jr, Burns JL, Smith AL: normally, bacteria lacking the mutS gene are at a distinct disadvantage owing to the rapid accumulation of deleterious mutations in their DNA. Why might this defect in DNA repair provide an advantage for human bacterial pathogens?

Procedure: follow pp. 78-84 in BFD

Objective: Locate and study the E. coli mutS gene

Go to the GenBank entry tool at http://www.ncbi.nlm.nih.gov/entrez/

a. From the “Search” pull-down menu, choose “Gene”.

b. Type the term ‘mutS E.coli’ in the “For” window and click “Go”.

c. Entries for a number of human versions of this gene are listed. However, nowhere on this list will you find the E. coli mutS gene (strangely?!). Instead, scroll down the page until you find the 14th entry. This will provide you with annotation for the mutS protein not from E. coli, but from another bacterium, Yersinia pestis. This lethal bacterium is the causative agent of bubonic plague (the “Black Death”, made infamous by wiping out 1/3 of the population of Europe in the 14th century). Open this ‘mutS’ hyperlink.

14: mutS methyl-directed mismatch repair protein [Yersinia pestis KIM] GeneID: 1145782

d. You will see a variety of information about the Y.
pestis mutS gene, such as its chromosomal location, neighboring genes, links to PubMed references, etc.

1: mutS methyl-directed mismatch repair protein [Yersinia pestis KIM]
GeneID: 1145782   Locus tag: y0835   updated 11-Sep-2004

Transcripts and products: (shown on reverse complement genome) RefSeq below
Genomic context:
Gene type: protein coding
Gene name: mutS
RefSeq status: Provisional
Organism: Yersinia pestis KIM (strain: KIM)
Lineage: Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Yersinia

Bibliography: Gene References into Function (GeneRIF)

General protein information
Name: methyl-directed mismatch repair protein
Comment: NP_668169: residues 5 to 851 of 851 are 84.09 pct identical to residues 5 to 853 of 853 from E. coli K12 : B2733; residues 5 to 851 of 851 are 84.09 pct identical to residues 5 to 853 of 853 from GenPept : >gb|AAG57842.1|AE005501_11 (AE005501) methyl-directed mismatch repair [Escherichia coli O157:H7 EDL933]

NCBI Reference Sequences (RefSeq)
Product: NP_668169 methyl-directed mismatch repair protein [Yersinia pestis KIM]

Conserved Domains (5) summary
COG0249: MutS; Mismatch repair ATPase (MutS family) [DNA replication, recombination, and repair]. Location: 8-849. Blast Score: 2541
smart00533: MUTSd; DNA-binding domain of DNA mismatch repair MUTS family. Location: 286-591. Blast Score: 596
smart00534: MUTSac; ATPase domain of DNA mismatch repair MUTS family. Location: 607-794. Blast Score: 725
pfam01624: MutS_I; MutS domain I. Location: 11-123. Blast Score: 441
pfam05188: MutS_II; MutS domain II. Location: 131-256. Blast Score: 329

Related Sequences
           Nucleotide   Protein
Genomic    AE013686     AAM84420

Q1: How many papers about the Y. pestis mutS gene have been published?
One:

Deng W, Burland V, Plunkett G 3rd, Boutin A, Mayhew GF, Liss P, Perna NT, Rose DJ, Mau B, Zhou S, Schwartz DC, Fetherston JD, Lindler LE, Brubaker RR, Plano GV, Straley SC, McDonough KA, Nilles ML, Matson JS, Blattner FR, Perry RD. Genome sequence of Yersinia pestis KIM. J Bacteriol. 2002 Aug;184(16):4601-11. PMID: 12142430 [PubMed - indexed for MEDLINE]

Q2: Does it appear that the Y. pestis genome has been completely sequenced? YES

Q3: How large is the Y. pestis genome, and how many proteins can it encode? SIZE = 4.6 Mb, or 4.6 million base pairs

e. Back to the search for the E. coli mutS gene: go back to the Y. pestis mutS front page (“mutS methyl-directed mismatch repair protein”), and scroll to the bottom. Click on the “Protein” link, AAM84420, to obtain the amino acid (aa) sequence of the Y. pestis mutS protein.

Q4: How many aa long is this protein? 851 amino acids

f. Open a new window, and go to the NCBI homepage http://www.ncbi.nlm.nih.gov/

g. From the dark blue line above the search window, choose “BLAST”, and then choose “Protein-protein BLAST (blastp)”. You will now perform your first BLAST search, using the Basic Local Alignment Search Tool. This tool allows you to rapidly search the entirety of GenBank to locate genes and proteins that are related to your “Query” sequence. In this case your query sequence will be the Y. pestis mutS protein. Let’s see if we can use it to find the E. coli mutS protein.

h. Copy/paste the entire Y. pestis mutS protein sequence into the Search window on the BLAST page. Don’t worry about the numbers and the extra spaces; BLAST knows how to ignore them.
        1 mknndkldsh tpmmqqylrl kaqhpeillf yrmgdfyelf ysdakrasql ldisltkrga
       61 sagepipmag vpyhsienyl aklvqlgesa aiceqigdpa tskgpverkv vrivtpgtis
      121 deallqerqd nllaaiwqda kgfgyatldi ssgrfrvaep adletmaael qrtnpaelly
      181 penfepmsli ehrhglrrrp lwefeldtak qqlnlqfgtr dligfgveqa hlalraagcl
      241 lqyvkdtqrt slphirgltm erqqdgiimd aatrrnlelt qnlsggsent laaildcsvt
      301 pmgsrmlkrw lhmpirdirv ltdrqqaigg lqdiaaelqt plrqvgdler ilarlalrta
      361 rprdlarmrh afqqlpeihr llqpidvphv qnllsqvgqf delqdllera ivetppvlvr
      421 dggviasgyn aeldewrala dgatdyldrl eirereklgl dtlkvgfngv hgyyiqvsrg
      481 qshlvpihyv rrqtlknaer yiipelkeye dkvltskgka laiekglyee ifdlllphlp
      541 elqlsanala eldvlanlae raetlnyscp tlsdkpgiki mggrhpvveq vlkepfisnp
      601 ltlspqrrml iitgpnmggk stymrqtali vllahlgsyv padqatigpi driftrvgaa
      661 ddlasgrstf mvemtetani lhnateqslv lmdeigrgts tydglslawa caenlasrik
      721 amtlfathyf elttlpekme gvvnvhldal ehgetiafmh svqegaasks yglavaalag
      781 vprdvikrar qklkelesls nnaaastidg sqmtllneei ppavealeal dpdslsprqa
      841 lewiyrlknm v

i. Click the blue “BLAST!” button to begin the search. A new screen will appear shortly thereafter. On this new screen, click “FORMAT”. This will bring up a new window, and within 1-4 minutes the completed BLAST search report should appear.

j. Interpreting the BLAST results:

(1) The first window will contain a graphical display of “hits” showing the relative similarity between your query sequence and genes to which it is related. Suffice it to say that hits shown in red represent proteins that are extremely similar to Y. pestis mutS.

(2) The second element of the report contains a list of the top 100 hits, in descending order of similarity to your query protein. Each entry is listed on a single line, with a GenBank accession number for each homolog hyperlinked so that you can get to it.

(3) The third element of the report contains alignments of your query protein (top line) to similar “subject” proteins (bottom line).
Entries between the query line and subject line indicate aa residues that are identical between the two proteins, and also aa residues that are conservatively substituted between the two proteins (indicated by a ‘+’ sign). The meaning of “conservative substitution” will be explained.

k. Go to the third entry line in the report, and copy/paste the GenBank accession number here:

Q5: GB Accession #
gi|1592569|gb|AAB97931.1|

l. Go to the third alignment in the report, and copy/paste it here (preserve the alignment by using a COURIER font at size 10). Make sure that the entry includes the top line with GenBank accession numbers and other descriptors.

Q6: Paste in the alignment here

gi|1592569|gb|AAB97931.1| DNA mismatch repair protein [Escherichia coli]
          Length = 853

 Score = 1320 bits (3416), Expect = 0.0
 Identities = 684/853 (80%), Positives = 750/853 (87%), Gaps = 2/853 (0%)

Query: 1    MKNNDKLDSHTPMMQQYLRLKAQHPEILLFYRMGDFYELFYSDAKRASQLLDISLTKRGA 60
            M   +  D+HTPMMQQYL+LKAQHPEILLFYRMGDFYELFY DAKRASQLLDISLTKR A
Sbjct: 1    MSAIENFDAHTPMMQQYLKLKAQHPEILLFYRMGDFYELFYDDAKRASQLLDISLTKRSA 60

Query: 61   SAGEPIPMAGVPYHSIENYLAKLVQLGESAAICEQIGDPATSKGPVERKVVRIVTPGTIS 120
            SAGEPIPMAG+PYH++ENYLAKLV  GES AICEQIGDPATSKGPVERKVVRIVTPGTIS
Sbjct: 61   SAGEPIPMAGIPYHAVENYLAKLVNQGESVAICEQIGDPATSKGPVERKVVRIVTPGTIS 120

Query: 121  DEALLQERQDNLLAAIWQDAKGFGYATLDISSGRFRVAEPADLETMAAELQRTNPAELLY 180
            DEALLQERQDNLLAAIWQD+KGF YATLDISSGRFR++EPAD ETMAAELQRTNPAELLY
Sbjct: 121  DEALLQERQDNLLAAIWQDSKGFAYATLDISSGRFRLSEPADRETMAAELQRTNPAELLY 180

Query: 181  PENFEPMSLIEHRHGLRRRPLWEFELDTAKQQLNLQFGTRDLIGFGVEQAHLALRAAGCL 240
             E+F  MSLIE R GLRRRPLWEFE+DTA+QQLNLQFGTRDL+GFGVE A   L AAGCL
Sbjct: 181  AEDFAEMSLIEGRRGLRRRPLWEFEIDTARQQLNLQFGTRDLVGFGVENAPRGLCAAGCL 240

Query: 241  LQYVKDTQRTSLPHIRGLTMERQQDGIIMDAATRRNLELTQNLSGGSENTLAAILDCSVT 300
            LQY KDTQRT+LPHIR +TMER+QD IIMDAATRRNLE+TQNL+GG+ENTLA++LDC+VT
Sbjct: 241  LQYAKDTQRTTLPHIRSITMEREQDSIIMDAATRRNLEITQNLAGGAENTLASVLDCTVT 300

Query: 301  PMGSRMLKRWLHMPIRDIRVLTDRQQAIGGLQDIAAELQTPLRQVGDXXXXXXXXXXXXX 360
            PMGSRMLKRWLHMP+R  RVL +RQQ IG LQD  AELQ  LRQVGD
Sbjct: 301  PMGSRMLKRWLHMPVRHTRVLLERQQTIGALQDFTAELQPVLRQVGDLERILARLALRTA 360

Query: 361  XXXXXXXMRHAFQQLPEIHRLLQPIDVPHVQNLLSQVGQFDELQDLLERAIVETPPVLVR 420
                   MRHAFQQLPE+   L+ +D   VQ L  ++G+F EL+DLLERAI++TPPVLVR
Sbjct: 361  RPRDLARMRHAFQQLPELRAQLETVDSAPVQALREKMGEFAELRDLLERAIIDTPPVLVR 420

Query: 421  DGGVIASGYNAELDEWRALADGATDYLDRLEIREREKLGLDTLKVGFNGVHGYYIQVSRG 480
            DGGVIASGYN ELDEWRALADGATDYL+RLE+RERE+ GLDTLKVGFN VHGYYIQ+SRG
Sbjct: 421  DGGVIASGYNEELDEWRALADGATDYLERLEVRERERTGLDTLKVGFNAVHGYYIQISRG 480

Query: 481  QSHLVPIHYVRRQTLKNAERYIIPELKEYEDKVLTSKGKALAIEKGLYEEIFDXXXXXXX 540
            QSHL PI+Y+RRQTLKNAERYIIPELKEYEDKVLTSKGKALA+EK LYEE+FD
Sbjct: 481  QSHLAPINYMRRQTLKNAERYIIPELKEYEDKVLTSKGKALALEKQLYEELFDLLLPHLE 540

Query: 541  XXXXSANALAELDVLANLAERAETLNYSCPTLSDKPGIKIMGGRHPVVEQVLKEPFISNP 600
                SA+ALAELDVL NLAERA TLNY+CPT  DKPGI+I  GRHPVVEQVL EPFI+NP
Sbjct: 541  ALQQSASALAELDVLVNLAERAYTLNYTCPTFIDKPGIRITEGRHPVVEQVLNEPFIANP 600

Query: 601  LTLSPQRRMLIITGPNMGGKSTYMRQTALIVLLAHLGSYVPADQATIGPIDRIFTRVGAA 660
            L LSPQRRMLIITGPNMGGKSTYMRQTALI L+A++GSYVPA +  IGPIDRIFTRVGAA
Sbjct: 601  LNLSPQRRMLIITGPNMGGKSTYMRQTALIALMAYIGSYVPAQKVEIGPIDRIFTRVGAA 660

Query: 661  DDLASGRSTFMVEMTETANILHNATEQSLVLMDEIGRGTSTYDGLSLAWACAENLASRIK 720
            DDLASGRSTFMVEMTETANILHNATE SLVLMDEIGRGTSTYDGLSLAWACAENLA++IK
Sbjct: 661  DDLASGRSTFMVEMTETANILHNATEYSLVLMDEIGRGTSTYDGLSLAWACAENLANKIK 720

Query: 721  AMTLFATHYFELTTLPEKMEGVVNVHLDALEHGETIAFMHSVQEGAASKSYGLAVAALAG 780
            A+TLFATHYFELT LPEKMEGV NVHLDALEHG+TIAFMHSVQ+GAASKSYGLAVAALAG
Sbjct: 721  ALTLFATHYFELTQLPEKMEGVANVHLDALEHGDTIAFMHSVQDGAASKSYGLAVAALAG 780

Query: 781  VPRDVIKRARQKLKELESLSNNAAASTIDGSQMTLLN--EEIPPAVEALEALDPDSLSPR 838
            VP++VIKRARQKL+ELES+S NAAA+ +DG+QM+LL+  EE  PAVEALE LDPDSL+PR
Sbjct: 781  VPKEVIKRARQKLRELESISPNAAATQVDGTQMSLLSVPEETSPAVEALENLDPDSLTPR 840

Query: 839  QALEWIYRLKNMV 851
            QALEWIYRLK++V
Sbjct: 841  QALEWIYRLKSLV 853

Q7: What species does the subject sequence come from?
The bacterium Escherichia coli

Q8: Are the two proteins the same length? If not, what is the length of each?
Y. pestis mutS = 851 aa
E. coli mutS = 853 aa

Q9: Do the two proteins appear to be related? Does the alignment report contain a quantitative indicator of relatedness? If so, what is the measure of their relatedness?

The two proteins are very closely related, having derived from a common ancestral gene, and being conserved in each species through time. The quantitative measures of relatedness are summarized in the top line of the GenBank alignment:

Identities = 684/853 (80%), Positives = 750/853 (87%), Gaps = 2/853 (0%)

The two proteins are identical at 684 out of 853 positions (80%). By including conservative amino acid substitutions (+), the proteins are similar at 750 out of 853 positions, or 87% similar. This optimal alignment can be created by introducing only two gap positions into the alignment. Can you find them?

m. Click on the annotation link to obtain the sequence of the mutS homolog from E. coli. This will give you a page of annotation about the protein only. We would like to know about the DNA sequence as well. This additional information can be obtained by clicking on the hyperlink associated with the DBSOURCE.

Q10: Paste the full page of results here: (reduce to 10 point font)

1: U69873. Escherichia coli ...[gi:2822121]

LOCUS       ECU69873     2736 bp    DNA     linear   BCT 29-JAN-1998
DEFINITION  Escherichia coli DNA mismatch repair protein (mutS) gene, complete cds.
ACCESSION   U69873
VERSION     U69873.1  GI:2822121
KEYWORDS    .
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Escherichia.
REFERENCE   1  (bases 1 to 2736)
  AUTHORS   LeClerc,J.E., Li,B., Payne,W.L. and Cebula,T.A.
  TITLE     High mutation frequencies among Escherichia coli and Salmonella pathogens
  JOURNAL   Science 274 (5290), 1208-1211 (1996)
  MEDLINE   97053611
   PUBMED   8895473
REFERENCE   2  (bases 1 to 2736)
  AUTHORS   Li,B.
  TITLE     Direct Submission
  JOURNAL   Submitted (06-SEP-1996) Molecular Biology Branch (HFS-237), FDA, 200 C. Street SW, Washington, DC 20204, USA
REFERENCE   3  (bases 1 to 2736)
  AUTHORS   Li,B.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-JAN-1998) Molecular Biology Branch (HFS-237), FDA, 200 C. Street SW, Washington, DC 20204, USA
  REMARK    Sequence update by submitter
COMMENT     On Jan 29, 1998 this sequence version replaced gi:1592568.
FEATURES             Location/Qualifiers
     source          1..2736
                     /organism="Escherichia coli"
                     /mol_type="genomic DNA"
                     /strain="O157:H7"
                     /db_xref="taxon:562"
     gene            88..2649
                     /gene="mutS"
     CDS             88..2649
                     /gene="mutS"
                     /function="methyl directed mismatch repair"
                     /codon_start=1
                     /evidence=experimental
                     /transl_table=11
                     /product="DNA mismatch repair protein"
                     /protein_id="AAB97931.1"
                     /db_xref="GI:1592569"
                     /translation="MSAIENFDAHTPMMQQYLKLKAQHPEILLFYRMGDFYELFYDDA
                     KRASQLLDISLTKRSASAGEPIPMAGIPYHAVENYLAKLVNQGESVAICEQIGDPATS
                     KGPVERKVVRIVTPGTISDEALLQERQDNLLAAIWQDSKGFAYATLDISSGRFRLSEP
                     ADRETMAAELQRTNPAELLYAEDFAEMSLIEGRRGLRRRPLWEFEIDTARQQLNLQFG
                     TRDLVGFGVENAPRGLCAAGCLLQYAKDTQRTTLPHIRSITMEREQDSIIMDAATRRN
                     LEITQNLAGGAENTLASVLDCTVTPMGSRMLKRWLHMPVRHTRVLLERQQTIGALQDF
                     TAELQPVLRQVGDLERILARLALRTARPRDLARMRHAFQQLPELRAQLETVDSAPVQA
                     LREKMGEFAELRDLLERAIIDTPPVLVRDGGVIASGYNEELDEWRALADGATDYLERL
                     EVRERERTGLDTLKVGFNAVHGYYIQISRGQSHLAPINYMRRQTLKNAERYIIPELKE
                     YEDKVLTSKGKALALEKQLYEELFDLLLPHLEALQQSASALAELDVLVNLAERAYTLN
                     YTCPTFIDKPGIRITEGRHPVVEQVLNEPFIANPLNLSPQRRMLIITGPNMGGKSTYM
                     RQTALIALMAYIGSYVPAQKVEIGPIDRIFTRVGAADDLASGRSTFMVEMTETANILH
                     NATEYSLVLMDEIGRGTSTYDGLSLAWACAENLANKIKALTLFATHYFELTQLPEKME
                     GVANVHLDALEHGDTIAFMHSVQDGAASKSYGLAVAALAGVPKEVIKRARQKLRELES
                     ISPNAAATQVDGTQMSLLSVPEETSPAVEALENLDPDSLTPRQALEWIYRLKSLV"
ORIGIN
        1 ctccggtatc atgtgcgcct tatgtgatta caacgaaaat aaaaaccatc acaccccatt
       61 taatatcagg gaaccggaca taaccccatg agtgcaatag aaaatttcga cgcccatacg
      121 cccatgatgc agcagtatct caagctgaaa gcccagcatc ccgagatcct gctgttttac
      181 cggatgggtg atttttatga actgttttat gacgacgcaa aacgcgcgtc gcaactgctg
      241 gatatttcac tgaccaaacg cagtgcttcg gcgggagagc cgatcccgat ggcggggatt
      301 ccctaccatg cggtggaaaa ctacctcgcc aaactggtga atcagggcga gtccgttgcc
      361 atctgcgaac aaattggcga tccggcgacc agcaaaggtc cggttgagcg caaagttgtg
      421 cgtatcgtta cgccaggcac catcagcgat gaagccctgt tgcaggagcg tcaggacaac
      481 ctgctggcgg ctatctggca ggacagcaaa ggtttcgcct acgcgacgct ggatatcagt
      541 tccggtcgtt ttcgcctgag cgaaccggct gaccgcgaaa cgatggcggc agaactgcaa
      601 cgcactaatc ctgcggaact gctgtatgca gaagattttg ctgaaatgtc gttaattgaa
      661 ggccgtcgcg gcctgcgccg tcgcccgctg tgggagtttg aaatcgacac cgcgcgccag
      721 cagttgaatc tgcaatttgg gacccgcgat ctggtcggtt ttggcgtcga gaacgcgccg
      781 cgcggacttt gtgctgccgg ttgtctgttg cagtatgcga aagataccca acgtacgact
      841 ctgccgcata ttcgttccat caccatggaa cgtgagcagg acagcatcat tatggatgcc
      901 gcgacgcgtc gtaatctgga aatcacccag aacctggcgg gtggtgcgga aaatacgctg
      961 gcttctgtgc tcgactgcac cgtcacgccg atgggcagcc gtatgctgaa acgctggctg
     1021 catatgccag tgcgccatac ccgcgtgttg cttgagcgcc agcaaactat tggcgcattg
     1081 caggatttca ccgccgagtt gcagccggta ctacgtcagg tcggcgacct ggaacgtatt
     1141 ctggcgcgtc tggctttacg aaccgctcgc ccacgcgatc tggcccgtat gcgtcacgct
     1201 ttccagcaac tgccggagct gcgtgcgcag ttagaaactg tcgatagtgc accggtacag
     1261 gcgctacgtg agaagatggg cgagtttgcc gagctgcgcg atctgctgga gcgagcaatc
     1321 atcgacacac cgccggtgct ggtacgcgac ggtggtgtta tcgcatcagg ctataacgaa
     1381 gagctggatg agtggcgcgc gctggctgac ggcgcgaccg attatctgga gcgtctggaa
     1441 gtccgcgagc gtgaacgtac cggcctggac acgctgaaag ttggctttaa tgcggtgcac
     1501 ggctactaca ttcaaatcag ccgtgggcaa agccatctgg cacctatcaa ctatatgcgt
     1561 cgccagacgc tgaaaaacgc cgagcgctac atcattccag agctaaaaga gtacgaagat
     1621 aaagtcctca cttcaaaagg caaagcactg gctctggaaa aacagcttta tgaagagctg
     1681 ttcgacctgc tgttgccgca tctggaagcg ttgcaacaga gcgcgagcgc gctggcggaa
     1741 ctcgacgtgc tggtgaacct ggcggaacgg gcctataccc tgaactacac ctgcccgacc
     1801 ttcattgata aaccgggcat tcgcattacc gaaggccgcc atccggtggt tgaacaggtg
     1861 ctgaacgagc catttatcgc caacccgctg aatctgtcac cgcagcgccg gatgttgatt
     1921 atcaccggtc cgaacatggg cggtaaaagt acctatatgc gccagaccgc actgattgcg
     1981 ctgatggcct acatcggcag ctacgtaccg gcgcaaaaag tcgagattgg cccgattgac
     2041 cgcatcttta cccgcgtagg cgcggcagat gatctggcgt ccgggcgttc aacctttatg
     2101 gtggagatga ccgaaaccgc taatattctg cataacgcca ccgagtacag tctggtgctg
     2161 atggatgaga ttgggcgcgg aacgtccact tacgatggtc tgtcgctggc gtgggcgtgc
     2221 gcggaaaatc tggcgaataa gattaaggcg ttgacgctgt ttgccaccca ctatttcgag
     2281 ctgacccagt taccggagaa aatggaaggc gtcgccaacg tgcatctcga tgcactggag
     2341 cacggcgaca ccattgcctt tatgcatagc gtgcaggatg gcgcggcgag caaaagctac
     2401 ggcctggcgg ttgcagctct ggccggcgtg ccaaaagagg ttattaagcg cgcacggcaa
     2461 aaactgcgtg agctggaaag catttcgccg aacgccgccg ctacgcaagt ggatggtacg
     2521 caaatgtctt tgctgtcagt accagaagaa acttcgcctg cggtcgaagc tctggaaaat
     2581 cttgatccgg attcactcac cccgcgtcag gcgctggaat ggatttatcg cttgaagagt
     2641 ctggtgtaat aataattccc gatagtcttt tgctatcggg aatattaacg ataactgacg
     2701 aataaaataa aaataccctg tataatagga aagctt
//

Q11: Reading the header of a prokaryotic GenBank entry. Following the outline on p. 80, record below the LOCUS, DEFINITION, ACCESSION, VERSION, KEYWORDS, SOURCE, ORGANISM, REFERENCE, and COMMENT for the E. coli mutS gene:

LOCUS       AAB97931     853 aa    linear   BCT 29-JAN-1998
DEFINITION  DNA mismatch repair protein [Escherichia coli].
ACCESSION   AAB97931
VERSION     AAB97931.1  GI:1592569
DBSOURCE    locus ECU69873 accession U69873.1
KEYWORDS    .
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Escherichia.
REFERENCE   1  (residues 1 to 853)
  AUTHORS   LeClerc,J.E., Li,B., Payne,W.L. and Cebula,T.A.
  TITLE     High mutation frequencies among Escherichia coli and Salmonella pathogens
  JOURNAL   Science 274 (5290), 1208-1211 (1996)
  MEDLINE   97053611
   PUBMED   8895473
REFERENCE   2  (residues 1 to 853)
  AUTHORS   Li,B.
  TITLE     Direct Submission
  JOURNAL   Submitted (06-SEP-1996) Molecular Biology Branch (HFS-237), FDA, 200 C. Street SW, Washington, DC 20204, USA
REFERENCE   3  (residues 1 to 853)
  AUTHORS   Li,B.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-JAN-1998) Molecular Biology Branch (HFS-237), FDA, 200 C. Street SW, Washington, DC 20204, USA
  REMARK    Sequence update by submitter

Q12: (p. 82, BFD) What does the term CDS mean?
CDS = Coding Sequence, which is synonymous with “coding region” or open reading frame (ORF)

Q13: At what nucleotide number does the CDS begin in this GenBank entry?
The ORF begins at nt 88 and ends at nt 2649

Q14: Paste in the first 300 nt of the nucleotide sequence here, and highlight the start codon:

        1 ctccggtatc atgtgcgcct tatgtgatta caacgaaaat aaaaaccatc acaccccatt
       61 taatatcagg gaaccggaca taaccccATG agtgcaatag aaaatttcga cgcccatacg
      121 cccatgatgc agcagtatct caagctgaaa gcccagcatc ccgagatcct gctgttttac
      181 cggatgggtg atttttatga actgttttat gacgacgcaa aacgcgcgtc gcaactgctg
      241 gatatttcac tgaccaaacg cagtgcttcg gcgggagagc cgatcccgat ggcggggatt

n. We now know the nucleotide and protein sequences of the E. coli mutS gene, but the annotation provided does not include the upstream regulatory sequences, i.e., the promoter recognized by RNA polymerase, and the Ribosome Binding Site (RBS) by which the ribosome joins with the mRNA to begin translating the protein. Let’s go and find these upstream regulatory elements. To do so, we will need access to a much larger chunk of the E. coli genome, as follows:

o. To begin, follow the three steps at the bottom of p. 82 and top of p. 83, BFD, to convert the nucleotide sequence to a more universally acceptable format, called “FASTA”.

p. Paste the entire nucleotide sequence into a Nucleotide-nucleotide BLAST search (go back to your BLAST search page, paste the sequence into the Search window, and choose “Nucleotide” from the blue menu bar, then begin the search by clicking the blue “BLAST” button). “FORMAT” the BLAST search to retrieve the results in a new window, as you did with the previous Protein-protein BLAST search.

q. The results will show many independent GenBank entries containing the mutS DNA region.
This will illustrate just how redundant GenBank can be (cf. p. 84, BFD). The database often contains many different entries of the same information, usually because of independent submissions by different authors. Most of the top entries in this case correspond to GenBank files that contain the entire E. coli genome or large chunks of the genome.

r. In this case, open AE016765.1, a “manageable” chunk consisting of 305,000 base pairs. This is section 11 out of 18 of the complete genome. Give it a moment to fully load; this file contains a lot of information! As you scroll down, you will see sequential translations of every Open Reading Frame (ORF), i.e., every potential gene, that is encoded in this large segment of DNA. At the bottom of the file, all 305,000 nucleotides are listed. To make this a bit easier, we did some groundwork for you, and discovered that the mutS gene lies between nt 115,840 and 118,752, and the ATG codon is at nt 116,191.

Q15: Copy/paste this entire sequence below (Courier 10 pt), and highlight (bold) the Start codon.

   115801                                           c gcTTGATAaa gatcgatcat
   115861 ttttttaatc gtcagttttt cacgagagat acgcttgccg gacatgctgc ctccatactc
   115921 ttaaggtgca ttTATATTac aacttAattt taaaggggac tatgtctctg aaggAGGAGG
   115981 Tttcttcaac cgacgaattt ggcgatcttg tttcatcgtc atcacgcaaa agctgcaaag
   116041 cagcattttt cgcggaaccg acatcaagaa ctcacctttc gcgtcttccc ctgaaatgat
   116101 taactccggt atcatgtgcg ccttatgtga ttacaacgaa aataaaaacc atcacactac
   116161 atttaacatc agggagccgg acttaacccc ATGagtacaa tagaaaattt cgacgcccat
   116221 acgcccatga tgcagcagta tctcaagctg aaagcccagc atcccgagat cctgctgttt
   116281 taccggatgg gcgactttta tgaactgttt tatgacgacg caaaacgggc gtcgcaactg
   116341 ctggatattt cactgaccaa acgcggtgct tcagcgggag agccaatccc gatggcgggg
   116401 attccctacc atgcggtgga aaactacctc gccaaactgg tgaaccaggg cgaatcggtt
   116461 gctatttgtg aacagattgg cgatccggcg accagcaaag ggccggtcga gcgcaaagtt
   116521 gtccgtatcg ttacgccggg caccatcagc gatgaagcgc tgttgcagga acgtcaggac
   116581 aacctgctgg cggctatctg gcaggacagt aaaggtttcg gctacgcaac gctggatatc
   116641 agttccggtc gttttcgcct gagcgaaccg gcggaccgcg aaacgatggc ggcagaactg
   116701 caacgtacga atccagcgga attactgtat gcggaagatt tcgccgagat gtcgctgatt
   116761 gaaggtcgcc gcggcctgcg tcgtcgcccg ctgtgggagt ttgaaatcga caccgctcgc
   116821 cagcaattga acctgcaatt tggcacccga gatctggtcg gttttggtgt ggagaacgca
   116881 ccacgcggac tttgtgctgc cggttgtctg ttgcagtatg cgaaagatac ccaacgcacg
   116941 accctgccgc atattcgttc tatcactatg gaacgtcagc aggacagcat cattatggat
   117001 gccgcgacgc gtcgtaacct ggaaattact cagaacctgg ccggcggtgc ggaaaatacg
   117061 ctggcttccg tactcgactg cactgtaacg ccgatgggta gtcgtatgct gaaacgctgg
   117121 ctgcatatgc cagtgcgcga tacccgcgtg ttgcttgagc gccagcaaac tattggcgca
   117181 ttgcaggatt tcaccgccga gttacagccg gtactgcgtc aggtgggcga cctggaacgt
   117241 attctggcgc gtctggcgtt gcgtaccgct cgcccgcgcg atctggcccg tatgcgtcac
   117301 gctttccagc aactgccaga gttgcgtgcg cagttagaaa atgtcgatag tgcaccggta
   117361 caagcgctgc gtgagaagat gggcgagttt gccgagctgc gcgatctgct ggagcgagca
   117421 atcatcgaca caccaccggt gctggtacgc gacggtggtg ttatcgcatc gggctataac
   117481 gaagagctgg atgagtggcg cgcgctggct gacggcgcga ccgattatct ggagcgtctg
   117541 gaggtccgcg agcgtgaacg taccggcctg gacacgctga aagttggctt taatgcggtg
   117601 cacggctact acattcaaat cagccgtggg caaagccatc tggcaccaat caactacatg
   117661 cgtcgccaga cgctgaaaaa cgccgagcgt tacattattc cggagctgaa agagtacgaa
   117721 gataaagttc tcacctcaaa aggcaaagca ctggctctgg aaaaacaact ttatgaagag
   117781 ctgttcgacc tgctgttgcc gcatctggaa gcgttgcaac agagcgcgag cgcgctggcg
   117841 gaactcgacg tgctggtgaa cctggcggaa cgggcctata ccctgaacta cacctgcccg
   117901 acctttattg ataaacctgg cattcgcatt accgaaggtc gccatccggt agttgaacaa
   117961 gtactgaatg agccgtttat cgctaacccg ctaaacctgt cgccgcaacg tcgcatgttg
   118021 atcattaccg gtccgaacat gggcggtaaa agtacctata tgcgccagac cgcgttgatt
   118081 gcgctgatgg cgtatatcgg cagctacgta ccggcgcaaa aagtcgagat tggcccgatt
   118141 gaccgtatct ttacccgcgt aggtgctgcg gatgatctgg cttccgggcg ttcaaccttt
   118201 atggtggaga tgaccgaaac cgccaatatt ttacataacg ccaccgaata cagtctggtg
   118261 ttgatggatg agatcgggcg tggaacgtcc acttacgatg gtctgtcgct ggcgtgggca
   118321 tgtgcggaaa atctggcaaa taagatcaaa gcgttgacgc tgtttgctac ccactatttc
   118381 gagctgaccc agttaccgga gaaaatggaa ggcgtcgcca acgtgcatct cgatgcactg
   118441 gagcacggcg acaccattgc ctttatgcat agcgtgcagg atggcgcagc gagcaaaagc
   118501 tacggcctgg cggttgcagc tctggcgggt gtgccaaaag aggttattaa gcgcgcacgg
   118561 caaaaactgc gtgagctgga aagcatttcg ccgaacgccg ctgctacgca agtggatggt
   118621 acacaaatgt ctttgctgtc cgtaccggaa gaaacttcgc ctgcggtcga ggcactggaa
   118681 aacctcgacc cggattcact gactccgcgt caggcgctgg agtggattta tcgcttgaag
   118741 agtctggtgt aa

Q16: Locate the –35 promoter sequence. Highlight it (Bold) in the sequence above, and list its sequence here (keep in mind that this sequence may not perfectly match the consensus sequence, but it will probably differ by no more than one base from the consensus).
TTGATA, one base different from the consensus TTGACA

Q17: Locate the –10 sequence, keeping in mind that it also may not follow the exact consensus. Highlight it (Bold) in the sequence above, and list it here:
TATATT, one base different from the consensus TATAAT

Q18: Propose a likely start site for transcription, and highlight its location above. At what nucleotide (type of base and nt #) does transcription probably start?
A, 8 nt downstream of the –10 sequence, at nt 115946. Transcription could also start at A 115947.

Q19: Locate the Ribosome Binding Site (RBS), which has the consensus sequence AGGAGGU in the mRNA. Highlight it (Bold). In what region of the mRNA transcript is the RBS found? How far is the RBS from the start codon?
AGGAGGT, the Ribosome Binding Site, is located between the transcription start site and the start codon, in the 5’ untranslated region (5’ UTR) of the mRNA.

Q20: As oriented above, does this DNA sequence represent the coding strand or the template strand? Is the DNA sequence as shown oriented from the 5’ to the 3’ end, or from the 3’ end to the 5’ end, of the E. coli mutS gene?
This sequence is oriented from the 5’ to the 3’ end and represents the coding, or sense, strand of the DNA.
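
The promoter and RBS searches above (find a hexamer that differs from the consensus by at most one base) can also be done programmatically. The following is a minimal sketch, not part of BFD or of any NCBI tool: the function names are made up for illustration, and UPSTREAM is assumed to be the 150 nt running from nt 115,841 to nt 115,990 of the worksheet sequence, retyped here by hand.

```python
def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_motif(seq: str, consensus: str, max_mismatches: int = 1, first_nt: int = 1):
    """Slide a window along seq and report every window that differs from
    consensus at no more than max_mismatches positions.

    Returns a list of (nt_number, window, mismatch_count) tuples, where
    nt_number is the coordinate of the first base of the window, given that
    the first base of seq is numbered first_nt.
    """
    seq = seq.upper()
    consensus = consensus.upper()
    k = len(consensus)
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        d = mismatches(window, consensus)
        if d <= max_mismatches:
            hits.append((first_nt + i, window, d))
    return hits


# Assumption: nt 115,841 to 115,990 as listed in the worksheet (coding strand).
UPSTREAM_START = 115841
UPSTREAM = (
    "gcTTGATAaa" "gatcgatcat"
    "ttttttaatc" "gtcagttttt" "cacgagagat" "acgcttgccg" "gacatgctgc" "ctccatactc"
    "ttaaggtgca" "ttTATATTac" "aacttAattt" "taaaggggac" "tatgtctctg" "aaggAGGAGG"
    "Tttcttcaac"
)

# Candidate -35 and -10 boxes: allow one mismatch to the consensus.
for name, consensus in (("-35", "TTGACA"), ("-10", "TATAAT")):
    for nt, window, d in find_motif(UPSTREAM, consensus, 1, UPSTREAM_START):
        print(name, nt, window, "mismatches:", d)

# RBS: exact match to the DNA form of AGGAGGU.
print("RBS", find_motif(UPSTREAM, "AGGAGGT", 0, UPSTREAM_START))
```

A one-mismatch scan usually reports several candidate windows, not only the ones chosen in the answers above, so the output is a starting list that still has to be judged by eye (for example, by how the candidates are positioned relative to each other and to the start codon).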