CS 251 Introduction to Bioinformatics: Laboratory 2:
Dabbling in Bioinformatics:
Today, we will take our first real crack at using bioinformatic tools.
We will follow the flow of Chapter 3, Bioinformatics for Dummies (BFD), in which the authors pilot a
single gene (dUTPase, deoxyuridine 5’ triphosphate nucleotidylhydrolase) as a vehicle for touring
several genome databases and for learning some basic terminology and search tools.
To make this exercise more interesting for you, we will substitute a gene of our choosing, the
mutS/hMSH2 DNA repair gene, for this exercise. And, we will ask you to perform some additional
steps (e.g., Blastp) and answer a variety of questions, as you navigate this “road rally” through
databases, genomes, and search tools.
First, the essential gene terminology:
mutS is the name given to the prokaryotic (bacterial) version of this universal defender of the genome.
(“mut” is an abbreviation that reflects the increased rate at which DNA mutations accumulate in cells that
lack this critical gene.)
MSH2 is the name given to the eukaryotic (algae and fungi, plants, and animals) version of this gene.
(“MSH” is an abbreviation that means “MutS Homolog”). The term “homolog” means that the MSH2
gene looks and acts like the mutS gene, i.e., its structure (DNA and protein sequence) is similar to
mutS, and it plays a similar role in preventing mutations from occurring.
hMSH2: the prefix ‘h’ in front of a gene name indicates that it is the human version of the gene.
For some background, please obtain the PubMed abstracts of these two recent research articles
about the mutS/hMSH2 genes.
Ainsworth P, Koscinski D, Fraser B, Stuart J.
Family cancer histories predictive of a high risk of hereditary non-polyposis colorectal cancer
associate significantly with a genomic rearrangement in hMSH2 or hMLH1.
Clin Genet. 2004 Sep;66(3):183-188.
PMID: 15324316 [PubMed - as supplied by publisher]
Watson ME Jr, Burns JL, Smith AL.
Hypermutable Haemophilus influenzae with mutations in mutS are found in cystic fibrosis sputum.
Microbiology. 2004 Sep;150(Pt 9):2947-58.
PMID: 15347753 [PubMed - in process]
Please answer the following questions here:
From the abstract by Ainsworth P, Koscinski D, Fraser B, Stuart J.: HNPCC is a hereditary form
of colon cancer caused by defects in DNA repair genes, most notably the hMSH2 gene. About 1 in
200 of us will develop this cancer because we carry a defective copy of the hMSH2 gene. Are there
any bioinformatic tools, described in this paper, for predicting risk for this defect in human
populations? What is the name of this tool, and its location? At what institution was this tool
developed and housed?
From the abstract by Watson ME Jr, Burns JL, Smith AL: normally, bacteria lacking the mutS
gene are at a distinct disadvantage owing to the rapid accumulation of deleterious mutations in their
DNA. Why might this defect in DNA repair provide an advantage for human bacterial pathogens?
Procedure:
Follow pp. 78-84 in BFD.
Objective: Locate and study the E. coli mutS gene
Go to the GenBank entry tool at http://www.ncbi.nlm.nih.gov/entrez/
a. From the “Search” pull down menu, choose “Gene”.
b. Type the term ‘mutS E.coli’ in the “For” window and click “Go”.
c. Entries for a number of human versions of this gene are listed. However, nowhere on this list
will you find the E. coli mutS gene (strangely?!). Instead, scroll down the page until you find
the 14th entry. This will provide you with annotation for the mutS protein not from E.
coli, but from another bacterium, Yersinia pestis. This lethal bacterium is the causative agent
of Bubonic Plague (the “Black Death” made infamous by wiping out 1/3 of the population of
Europe in the 14th century). Open this ‘mutS’ hyperlink.
14: mutS Links
methyl-directed mismatch repair protein [Yersinia pestis KIM]
GeneID: 1145782
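The same Gene search can also be scripted. As a minimal sketch that is not part of the BFD walkthrough, the snippet below only builds an NCBI E-utilities esearch URL for the Gene database; it sends no request. The helper name build_gene_search_url and the retmax value are illustrative assumptions.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities endpoint for database searches
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_gene_search_url(term, retmax=20):
    """Return an esearch URL that queries the NCBI Gene database for `term`.

    `build_gene_search_url` is a hypothetical helper for this handout.
    """
    params = {"db": "gene", "term": term, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)

# Same search term as typed into the "For" window in step b
url = build_gene_search_url("mutS E.coli")
print(url)
```

Opening the printed URL in a browser should return an XML list of matching Gene IDs, which can be compared with the entries seen on the web page.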
d. You will see a variety of information about the Y. pestis mutS gene, such as its chromosomal
location, neighboring genes, links to PubMed references, etc.
1: mutS methyl-directed mismatch repair protein [Yersinia pestis KIM]
GeneID: 1145782 Locus tag: y0835
updated 11-Sep-2004
Transcripts and products: (shown on reverse complement genome) RefSeq below
Genomic context:
Gene type: protein coding
Gene name: mutS
RefSeq status: Provisional
Organism: Yersinia pestis KIM (strain: KIM)
Lineage: Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae;
Yersinia
Bibliography:
Gene References into Function (GeneRIF):
General protein information
Name: methyl-directed mismatch repair protein
Comment:
NP_668169: residues 5 to 851 of 851 are 84.09 pct identical to residues 5 to 853 of 853 from E. coli
K12 : B2733; residues 5 to 851 of 851 are 84.09 pct identical to residues 5 to 853 of 853 from GenPept
: >gb|AAG57842.1|AE005501_11 (AE005501) methyl-directed mismatch repair [Escherichia coli
O157:H7 EDL933]
NCBI Reference Sequences (RefSeq)
Product NP_668169 methyl-directed mismatch repair protein [Yersinia pestis KIM]
Conserved Domains (5) summary
COG0249: MutS; Mismatch repair ATPase (MutS family) [DNA replication, recombination, and repair]
    Location: 8-849      Blast Score: 2541
smart00533: MUTSd; DNA-binding domain of DNA mismatch repair MUTS family
    Location: 286-591    Blast Score: 596
smart00534: MUTSac; ATPase domain of DNA mismatch repair MUTS family
    Location: 607-794    Blast Score: 725
pfam01624: MutS_I; MutS domain I
    Location: 11-123     Blast Score: 441
pfam05188: MutS_II; MutS domain II
    Location: 131-256    Blast Score: 329
Related Sequences
            Nucleotide    Protein
Genomic     AE013686      AAM84420
Q1: How many papers about the Y. pestis mutS gene have been published?
One:
Deng W, Burland V, Plunkett G 3rd, Boutin A, Mayhew GF, Liss P, Perna NT, Rose
DJ, Mau B, Zhou S, Schwartz DC, Fetherston JD, Lindler LE, Brubaker RR, Plano GV,
Straley SC, McDonough KA, Nilles ML, Matson JS, Blattner FR, Perry RD.
Genome sequence of Yersinia pestis KIM.
J Bacteriol. 2002 Aug;184(16):4601-11.
PMID: 12142430 [PubMed - indexed for MEDLINE]
Q2: Does it appear that the Y. pestis genome has been completely sequenced?: YES
Q3: How large is the Y. pestis genome, and how many proteins can it encode?:
SIZE = 4.6 Mb or 4.6 million base pairs
e. Back to the search for the E. coli mutS gene: go back to the Y. pestis mutS front page
(“mutS methyl-directed mismatch repair protein”), and scroll to the bottom. Click on the
“Protein” link, AAM84420, to obtain the amino acid (aa) sequence of the Y. pestis mutS protein.
Q4: How many aa long is this protein?: 851 amino acids
f. Open a new window, and go to the NCBI homepage http://www.ncbi.nlm.nih.gov/
g. From the dark blue line above the search window, choose “BLAST”, and then choose
“Protein-protein BLAST (blastp)”. You will now perform your first BLAST search, using the
Basic Local Alignment Search Tool. This tool allows you to rapidly search the entirety of
GenBank to locate genes and proteins that are related to your “Query” sequence. In this
case your query sequence will be the Y. pestis mutS protein. Let’s see if we can use it to find
the E. coli mutS protein.
h. Copy/paste the entire Y. pestis mutS protein sequence into the Search window on the BLAST
page. Don’t worry about the numbers and the extra spaces – BLAST knows how to ignore
them.
  1 mknndkldsh tpmmqqylrl kaqhpeillf yrmgdfyelf ysdakrasql ldisltkrga
 61 sagepipmag vpyhsienyl aklvqlgesa aiceqigdpa tskgpverkv vrivtpgtis
121 deallqerqd nllaaiwqda kgfgyatldi ssgrfrvaep adletmaael qrtnpaelly
181 penfepmsli ehrhglrrrp lwefeldtak qqlnlqfgtr dligfgveqa hlalraagcl
241 lqyvkdtqrt slphirgltm erqqdgiimd aatrrnlelt qnlsggsent laaildcsvt
301 pmgsrmlkrw lhmpirdirv ltdrqqaigg lqdiaaelqt plrqvgdler ilarlalrta
361 rprdlarmrh afqqlpeihr llqpidvphv qnllsqvgqf delqdllera ivetppvlvr
421 dggviasgyn aeldewrala dgatdyldrl eirereklgl dtlkvgfngv hgyyiqvsrg
481 qshlvpihyv rrqtlknaer yiipelkeye dkvltskgka laiekglyee ifdlllphlp
541 elqlsanala eldvlanlae raetlnyscp tlsdkpgiki mggrhpvveq vlkepfisnp
601 ltlspqrrml iitgpnmggk stymrqtali vllahlgsyv padqatigpi driftrvgaa
661 ddlasgrstf mvemtetani lhnateqslv lmdeigrgts tydglslawa caenlasrik
721 amtlfathyf elttlpekme gvvnvhldal ehgetiafmh svqegaasks yglavaalag
781 vprdvikrar qklkelesls nnaaastidg sqmtllneei ppavealeal dpdslsprqa
841 lewiyrlknm v
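Step h says BLAST ignores the position numbers and spaces in a pasted sequence. A minimal sketch of that clean-up, followed by FASTA formatting, is shown below; it is an illustration under stated assumptions, not the code BLAST itself runs. The function names and the FASTA header text are hypothetical, and only the last two displayed lines of the protein are used to keep the example short.

```python
import re
import textwrap

def clean_sequence(raw):
    """Drop digits and whitespace from a GenBank-style numbered sequence block."""
    return re.sub(r"[\d\s]", "", raw)

def to_fasta(header, sequence, width=60):
    """Format a sequence as FASTA: one '>' header line, then wrapped residues."""
    return ">" + header + "\n" + "\n".join(textwrap.wrap(sequence, width)) + "\n"

# Last two lines of the Y. pestis MutS protein as displayed by GenBank
raw = """
781 vprdvikrar qklkelesls nnaaastidg sqmtllneei ppavealeal dpdslsprqa
841 lewiyrlknm v
"""
tail = clean_sequence(raw)
print(len(tail))                                   # residues in this fragment
print(to_fasta("AAM84420 C-terminal fragment", tail))
```

Running clean_sequence on the full 15-line block and taking len() of the result is a quick way to confirm the protein length reported in Q4.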
i. Click the blue “BLAST!” button to begin the search. A new screen will appear shortly
thereafter. On this new screen, click “FORMAT”. This will bring up a new window, and within
1-4 minutes the completed BLAST search report should appear.
j. Interpreting the BLAST results:
(1) The first window will contain a graphical display of “hits” showing the relative similarity
between your query sequence and genes to which it is related. Suffice it to say that if the hits
are in red color, they represent proteins that are extremely similar to Y. pestis mutS.
(2) The second element of the report contains a list of the top 100 hits, in descending order of
similarity to your query protein. Each entry is listed on a single line, with a GenBank
accession number for each homolog hyperlinked so that you can get to it.
(3) The third element of the report contains alignments of your query protein (top line) to
similar “subject” proteins (bottom line). Entries between the query line and subject line
indicate aa residues that are identical between the two proteins, and also aa residues that are
conservatively substituted between the two proteins (indicated by a ‘+’ sign). The meaning
of “conservative substitution” will be explained.
k. Go to the third entry line in the report, and copy/paste the GenBank accession number here:
Q5: GB Accession # gi|1592569|gb|AAB97931.1|
l. Go to the third alignment in the report, and copy/paste it here (preserve the alignment by using
a COURIER font at size 10). Make sure that the entry includes the top line with GenBank
accession numbers and other descriptors.
Q6: Paste in the alignment here
gi|1592569|gb|AAB97931.1| DNA mismatch repair protein [Escherichia coli]
Length = 853
Score = 1320 bits (3416), Expect = 0.0
Identities = 684/853 (80%), Positives = 750/853 (87%), Gaps = 2/853 (0%)

Query: 1    MKNNDKLDSHTPMMQQYLRLKAQHPEILLFYRMGDFYELFYSDAKRASQLLDISLTKRGA 60
            M   +  D+HTPMMQQYL+LKAQHPEILLFYRMGDFYELFY DAKRASQLLDISLTKR A
Sbjct: 1    MSAIENFDAHTPMMQQYLKLKAQHPEILLFYRMGDFYELFYDDAKRASQLLDISLTKRSA 60

Query: 61   SAGEPIPMAGVPYHSIENYLAKLVQLGESAAICEQIGDPATSKGPVERKVVRIVTPGTIS 120
            SAGEPIPMAG+PYH++ENYLAKLV  GES AICEQIGDPATSKGPVERKVVRIVTPGTIS
Sbjct: 61   SAGEPIPMAGIPYHAVENYLAKLVNQGESVAICEQIGDPATSKGPVERKVVRIVTPGTIS 120

Query: 121  DEALLQERQDNLLAAIWQDAKGFGYATLDISSGRFRVAEPADLETMAAELQRTNPAELLY 180
            DEALLQERQDNLLAAIWQD+KGF YATLDISSGRFR++EPAD ETMAAELQRTNPAELLY
Sbjct: 121  DEALLQERQDNLLAAIWQDSKGFAYATLDISSGRFRLSEPADRETMAAELQRTNPAELLY 180

Query: 181  PENFEPMSLIEHRHGLRRRPLWEFELDTAKQQLNLQFGTRDLIGFGVEQAHLALRAAGCL 240
             E+F  MSLIE R GLRRRPLWEFE+DTA+QQLNLQFGTRDL+GFGVE A   L AAGCL
Sbjct: 181  AEDFAEMSLIEGRRGLRRRPLWEFEIDTARQQLNLQFGTRDLVGFGVENAPRGLCAAGCL 240

Query: 241  LQYVKDTQRTSLPHIRGLTMERQQDGIIMDAATRRNLELTQNLSGGSENTLAAILDCSVT 300
            LQY KDTQRT+LPHIR +TMER+QD IIMDAATRRNLE+TQNL+GG+ENTLA++LDC+VT
Sbjct: 241  LQYAKDTQRTTLPHIRSITMEREQDSIIMDAATRRNLEITQNLAGGAENTLASVLDCTVT 300

Query: 301  PMGSRMLKRWLHMPIRDIRVLTDRQQAIGGLQDIAAELQTPLRQVGDXXXXXXXXXXXXX 360
            PMGSRMLKRWLHMP+R  RVL +RQQ IG LQD  AELQ  LRQVGD
Sbjct: 301  PMGSRMLKRWLHMPVRHTRVLLERQQTIGALQDFTAELQPVLRQVGDLERILARLALRTA 360

Query: 361  XXXXXXXMRHAFQQLPEIHRLLQPIDVPHVQNLLSQVGQFDELQDLLERAIVETPPVLVR 420
                   MRHAFQQLPE+   L+ +D   VQ L  ++G+F EL+DLLERAI++TPPVLVR
Sbjct: 361  RPRDLARMRHAFQQLPELRAQLETVDSAPVQALREKMGEFAELRDLLERAIIDTPPVLVR 420

Query: 421  DGGVIASGYNAELDEWRALADGATDYLDRLEIREREKLGLDTLKVGFNGVHGYYIQVSRG 480
            DGGVIASGYN ELDEWRALADGATDYL+RLE+RERE+ GLDTLKVGFN VHGYYIQ+SRG
Sbjct: 421  DGGVIASGYNEELDEWRALADGATDYLERLEVRERERTGLDTLKVGFNAVHGYYIQISRG 480

Query: 481  QSHLVPIHYVRRQTLKNAERYIIPELKEYEDKVLTSKGKALAIEKGLYEEIFDXXXXXXX 540
            QSHL PI+Y+RRQTLKNAERYIIPELKEYEDKVLTSKGKALA+EK LYEE+FD
Sbjct: 481  QSHLAPINYMRRQTLKNAERYIIPELKEYEDKVLTSKGKALALEKQLYEELFDLLLPHLE 540

Query: 541  XXXXSANALAELDVLANLAERAETLNYSCPTLSDKPGIKIMGGRHPVVEQVLKEPFISNP 600
                SA+ALAELDVL NLAERA TLNY+CPT  DKPGI+I  GRHPVVEQVL EPFI+NP
Sbjct: 541  ALQQSASALAELDVLVNLAERAYTLNYTCPTFIDKPGIRITEGRHPVVEQVLNEPFIANP 600

Query: 601  LTLSPQRRMLIITGPNMGGKSTYMRQTALIVLLAHLGSYVPADQATIGPIDRIFTRVGAA 660
            L LSPQRRMLIITGPNMGGKSTYMRQTALI L+A++GSYVPA +  IGPIDRIFTRVGAA
Sbjct: 601  LNLSPQRRMLIITGPNMGGKSTYMRQTALIALMAYIGSYVPAQKVEIGPIDRIFTRVGAA 660

Query: 661  DDLASGRSTFMVEMTETANILHNATEQSLVLMDEIGRGTSTYDGLSLAWACAENLASRIK 720
            DDLASGRSTFMVEMTETANILHNATE SLVLMDEIGRGTSTYDGLSLAWACAENLA++IK
Sbjct: 661  DDLASGRSTFMVEMTETANILHNATEYSLVLMDEIGRGTSTYDGLSLAWACAENLANKIK 720

Query: 721  AMTLFATHYFELTTLPEKMEGVVNVHLDALEHGETIAFMHSVQEGAASKSYGLAVAALAG 780
            A+TLFATHYFELT LPEKMEGV NVHLDALEHG+TIAFMHSVQ+GAASKSYGLAVAALAG
Sbjct: 721  ALTLFATHYFELTQLPEKMEGVANVHLDALEHGDTIAFMHSVQDGAASKSYGLAVAALAG 780

Query: 781  VPRDVIKRARQKLKELESLSNNAAASTIDGSQMTLLN--EEIPPAVEALEALDPDSLSPR 838
            VP++VIKRARQKL+ELES+S NAAA+ +DG+QM+LL+  EE  PAVEALE LDPDSL+PR
Sbjct: 781  VPKEVIKRARQKLRELESISPNAAATQVDGTQMSLLSVPEETSPAVEALENLDPDSLTPR 840

Query: 839  QALEWIYRLKNMV 851
            QALEWIYRLK++V
Sbjct: 841  QALEWIYRLKSLV 853
Q7: What species does the subject sequence come from? The bacterium Escherichia coli
Q8: Are the two proteins the same length? If not, what is the length of each?
Y. pestis mutS = 851 aa
E. coli mutS = 853 aa
Q9: Do the two proteins appear to be related? Does the alignment report contain a
quantitative indicator of relatedness? If so, what is the measure of their relatedness?
The two proteins are very closely related, having derived from a common
ancestral gene, and being conserved in each species through time. The
quantitative measures of relatedness are summarized in the top line of the
GenBank alignment:
Identities = 684/853 (80%), Positives = 750/853 (87%), Gaps = 2/853 (0%)
The two proteins are identical at 684 out of 853 positions. By including
conservative amino acid substitutions (+), the proteins are similar at 750 out of
853 positions, or 87% similar. This optimal alignment can be created by
introducing only two gaps in the alignment. Can you find these gaps?
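The counts in the Identities, Positives, and Gaps line can be checked by hand or with a few lines of code. The sketch below, offered as an illustration rather than a description of how BLAST computes its statistics, counts identical columns and gap columns in one block of the alignment (Query 781-838 against Sbjct 781-840) and recomputes the whole-alignment percentages from the reported counts. The function name alignment_stats is an assumption made for this example.

```python
def alignment_stats(query, subject):
    """Count identical columns and list gap columns in two aligned strings."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    pairs = list(zip(query, subject))
    identities = sum(q == s and q != "-" for q, s in pairs)
    gap_columns = [i for i, (q, s) in enumerate(pairs) if "-" in (q, s)]
    return identities, gap_columns

# One block copied from the alignment above
query = "VPRDVIKRARQKLKELESLSNNAAASTIDGSQMTLLN--EEIPPAVEALEALDPDSLSPR"
sbjct = "VPKEVIKRARQKLRELESISPNAAATQVDGTQMSLLSVPEETSPAVEALENLDPDSLTPR"
identities, gap_columns = alignment_stats(query, sbjct)
print(identities, gap_columns)   # gap columns are 0-based positions in this block

# Whole-alignment percentages from the reported counts, truncated to whole numbers
print(100 * 684 // 853, 100 * 750 // 853)
```

The gap columns it reports are the two positions where the E. coli protein has residues (V and P) that the Y. pestis protein lacks, which also accounts for the 853 versus 851 aa length difference.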
m. Click on the annotation link to obtain the sequence of the mutS homolog from E. coli. This will
give you a page of annotation about the protein only. We would like to know about the DNA
sequence as well. This additional information can be obtained by clicking on the hyperlink
associated with the DBSOURCE.
Q10: Paste the full page of results here: (reduce to 10 point font)
1: U69873. Escherichia coli ...[gi:2822121]
LOCUS       ECU69873                2736 bp    DNA     linear   BCT 29-JAN-1998
DEFINITION  Escherichia coli DNA mismatch repair protein (mutS) gene, complete
            cds.
ACCESSION   U69873
VERSION     U69873.1  GI:2822121
KEYWORDS    .
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
            Enterobacteriaceae; Escherichia.
REFERENCE   1  (bases 1 to 2736)
  AUTHORS   LeClerc,J.E., Li,B., Payne,W.L. and Cebula,T.A.
  TITLE     High mutation frequencies among Escherichia coli and Salmonella
            pathogens
  JOURNAL   Science 274 (5290), 1208-1211 (1996)
  MEDLINE   97053611
   PUBMED   8895473
REFERENCE   2  (bases 1 to 2736)
  AUTHORS   Li,B.
  TITLE     Direct Submission
  JOURNAL   Submitted (06-SEP-1996) Molecular Biology Branch (HFS-237), FDA,
            200 C. Street SW, Washington, DC 20204, USA
REFERENCE   3  (bases 1 to 2736)
  AUTHORS   Li,B.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-JAN-1998) Molecular Biology Branch (HFS-237), FDA,
            200 C. Street SW, Washington, DC 20204, USA
  REMARK    Sequence update by submitter
COMMENT     On Jan 29, 1998 this sequence version replaced gi:1592568.
FEATURES             Location/Qualifiers
     source          1..2736
                     /organism="Escherichia coli"
                     /mol_type="genomic DNA"
                     /strain="O157:H7"
                     /db_xref="taxon:562"
     gene            88..2649
                     /gene="mutS"
     CDS             88..2649
                     /gene="mutS"
                     /function="methyl directed mismatch repair"
                     /codon_start=1
                     /evidence=experimental
                     /transl_table=11
                     /product="DNA mismatch repair protein"
                     /protein_id="AAB97931.1"
                     /db_xref="GI:1592569"
                     /translation="MSAIENFDAHTPMMQQYLKLKAQHPEILLFYRMGDFYELFYDDA
                     KRASQLLDISLTKRSASAGEPIPMAGIPYHAVENYLAKLVNQGESVAICEQIGDPATS
                     KGPVERKVVRIVTPGTISDEALLQERQDNLLAAIWQDSKGFAYATLDISSGRFRLSEP
                     ADRETMAAELQRTNPAELLYAEDFAEMSLIEGRRGLRRRPLWEFEIDTARQQLNLQFG
                     TRDLVGFGVENAPRGLCAAGCLLQYAKDTQRTTLPHIRSITMEREQDSIIMDAATRRN
                     LEITQNLAGGAENTLASVLDCTVTPMGSRMLKRWLHMPVRHTRVLLERQQTIGALQDF
                     TAELQPVLRQVGDLERILARLALRTARPRDLARMRHAFQQLPELRAQLETVDSAPVQA
                     LREKMGEFAELRDLLERAIIDTPPVLVRDGGVIASGYNEELDEWRALADGATDYLERL
                     EVRERERTGLDTLKVGFNAVHGYYIQISRGQSHLAPINYMRRQTLKNAERYIIPELKE
                     YEDKVLTSKGKALALEKQLYEELFDLLLPHLEALQQSASALAELDVLVNLAERAYTLN
                     YTCPTFIDKPGIRITEGRHPVVEQVLNEPFIANPLNLSPQRRMLIITGPNMGGKSTYM
                     RQTALIALMAYIGSYVPAQKVEIGPIDRIFTRVGAADDLASGRSTFMVEMTETANILH
                     NATEYSLVLMDEIGRGTSTYDGLSLAWACAENLANKIKALTLFATHYFELTQLPEKME
                     GVANVHLDALEHGDTIAFMHSVQDGAASKSYGLAVAALAGVPKEVIKRARQKLRELES
                     ISPNAAATQVDGTQMSLLSVPEETSPAVEALENLDPDSLTPRQALEWIYRLKSLV"
ORIGIN
        1 ctccggtatc atgtgcgcct tatgtgatta caacgaaaat aaaaaccatc acaccccatt
       61 taatatcagg gaaccggaca taaccccatg agtgcaatag aaaatttcga cgcccatacg
      121 cccatgatgc agcagtatct caagctgaaa gcccagcatc ccgagatcct gctgttttac
      181 cggatgggtg atttttatga actgttttat gacgacgcaa aacgcgcgtc gcaactgctg
      241 gatatttcac tgaccaaacg cagtgcttcg gcgggagagc cgatcccgat ggcggggatt
      301 ccctaccatg cggtggaaaa ctacctcgcc aaactggtga atcagggcga gtccgttgcc
      361 atctgcgaac aaattggcga tccggcgacc agcaaaggtc cggttgagcg caaagttgtg
      421 cgtatcgtta cgccaggcac catcagcgat gaagccctgt tgcaggagcg tcaggacaac
      481 ctgctggcgg ctatctggca ggacagcaaa ggtttcgcct acgcgacgct ggatatcagt
      541 tccggtcgtt ttcgcctgag cgaaccggct gaccgcgaaa cgatggcggc agaactgcaa
      601 cgcactaatc ctgcggaact gctgtatgca gaagattttg ctgaaatgtc gttaattgaa
      661 ggccgtcgcg gcctgcgccg tcgcccgctg tgggagtttg aaatcgacac cgcgcgccag
      721 cagttgaatc tgcaatttgg gacccgcgat ctggtcggtt ttggcgtcga gaacgcgccg
      781 cgcggacttt gtgctgccgg ttgtctgttg cagtatgcga aagataccca acgtacgact
      841 ctgccgcata ttcgttccat caccatggaa cgtgagcagg acagcatcat tatggatgcc
      901 gcgacgcgtc gtaatctgga aatcacccag aacctggcgg gtggtgcgga aaatacgctg
      961 gcttctgtgc tcgactgcac cgtcacgccg atgggcagcc gtatgctgaa acgctggctg
     1021 catatgccag tgcgccatac ccgcgtgttg cttgagcgcc agcaaactat tggcgcattg
     1081 caggatttca ccgccgagtt gcagccggta ctacgtcagg tcggcgacct ggaacgtatt
     1141 ctggcgcgtc tggctttacg aaccgctcgc ccacgcgatc tggcccgtat gcgtcacgct
     1201 ttccagcaac tgccggagct gcgtgcgcag ttagaaactg tcgatagtgc accggtacag
     1261 gcgctacgtg agaagatggg cgagtttgcc gagctgcgcg atctgctgga gcgagcaatc
     1321 atcgacacac cgccggtgct ggtacgcgac ggtggtgtta tcgcatcagg ctataacgaa
     1381 gagctggatg agtggcgcgc gctggctgac ggcgcgaccg attatctgga gcgtctggaa
     1441 gtccgcgagc gtgaacgtac cggcctggac acgctgaaag ttggctttaa tgcggtgcac
     1501 ggctactaca ttcaaatcag ccgtgggcaa agccatctgg cacctatcaa ctatatgcgt
     1561 cgccagacgc tgaaaaacgc cgagcgctac atcattccag agctaaaaga gtacgaagat
     1621 aaagtcctca cttcaaaagg caaagcactg gctctggaaa aacagcttta tgaagagctg
     1681 ttcgacctgc tgttgccgca tctggaagcg ttgcaacaga gcgcgagcgc gctggcggaa
     1741 ctcgacgtgc tggtgaacct ggcggaacgg gcctataccc tgaactacac ctgcccgacc
     1801 ttcattgata aaccgggcat tcgcattacc gaaggccgcc atccggtggt tgaacaggtg
     1861 ctgaacgagc catttatcgc caacccgctg aatctgtcac cgcagcgccg gatgttgatt
     1921 atcaccggtc cgaacatggg cggtaaaagt acctatatgc gccagaccgc actgattgcg
     1981 ctgatggcct acatcggcag ctacgtaccg gcgcaaaaag tcgagattgg cccgattgac
     2041 cgcatcttta cccgcgtagg cgcggcagat gatctggcgt ccgggcgttc aacctttatg
     2101 gtggagatga ccgaaaccgc taatattctg cataacgcca ccgagtacag tctggtgctg
     2161 atggatgaga ttgggcgcgg aacgtccact tacgatggtc tgtcgctggc gtgggcgtgc
     2221 gcggaaaatc tggcgaataa gattaaggcg ttgacgctgt ttgccaccca ctatttcgag
     2281 ctgacccagt taccggagaa aatggaaggc gtcgccaacg tgcatctcga tgcactggag
     2341 cacggcgaca ccattgcctt tatgcatagc gtgcaggatg gcgcggcgag caaaagctac
     2401 ggcctggcgg ttgcagctct ggccggcgtg ccaaaagagg ttattaagcg cgcacggcaa
     2461 aaactgcgtg agctggaaag catttcgccg aacgccgccg ctacgcaagt ggatggtacg
     2521 caaatgtctt tgctgtcagt accagaagaa acttcgcctg cggtcgaagc tctggaaaat
     2581 cttgatccgg attcactcac cccgcgtcag gcgctggaat ggatttatcg cttgaagagt
     2641 ctggtgtaat aataattccc gatagtcttt tgctatcggg aatattaacg ataactgacg
     2701 aataaaataa aaataccctg tataatagga aagctt
//
Q11: Reading the header of a prokaryotic GenBank entry. Following the outline on p.80,
record below the LOCUS, DEFINITION, ACCESSION, VERSION, KEYWORDS,
SOURCE, ORGANISM, REFERENCE, and COMMENT for the E. coli mutS gene:
LOCUS       AAB97931                 853 aa            linear   BCT 29-JAN-1998
DEFINITION  DNA mismatch repair protein [Escherichia coli].
ACCESSION   AAB97931
VERSION     AAB97931.1  GI:1592569
DBSOURCE    locus ECU69873 accession U69873.1
KEYWORDS    .
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
            Enterobacteriaceae; Escherichia.
REFERENCE   1  (residues 1 to 853)
  AUTHORS   LeClerc,J.E., Li,B., Payne,W.L. and Cebula,T.A.
  TITLE     High mutation frequencies among Escherichia coli and Salmonella
            pathogens
  JOURNAL   Science 274 (5290), 1208-1211 (1996)
  MEDLINE   97053611
   PUBMED   8895473
REFERENCE   2  (residues 1 to 853)
  AUTHORS   Li,B.
  TITLE     Direct Submission
  JOURNAL   Submitted (06-SEP-1996) Molecular Biology Branch (HFS-237), FDA,
            200 C. Street SW, Washington, DC 20204, USA
REFERENCE   3  (residues 1 to 853)
  AUTHORS   Li,B.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-JAN-1998) Molecular Biology Branch (HFS-237), FDA,
            200 C. Street SW, Washington, DC 20204, USA
  REMARK    Sequence update by submitter
Q12: p. 82, BFD – what does the term CDS mean?
CDS = Coding Sequence, which is synonymous with “coding region” or open reading frame
(ORF)
Q13: At what nucleotide number does the CDS begin in this GenBank entry?
The ORF begins at nt #88, and ends at nt 2649
Q14: Paste in the first 300 nt of the nucleotide sequence here, and highlight the start codon
(shown here in upper case, at nt 88-90):
  1 ctccggtatc atgtgcgcct tatgtgatta caacgaaaat aaaaaccatc acaccccatt
 61 taatatcagg gaaccggaca taaccccATG agtgcaatag aaaatttcga cgcccatacg
121 cccatgatgc agcagtatct caagctgaaa gcccagcatc ccgagatcct gctgttttac
181 cggatgggtg atttttatga actgttttat gacgacgcaa aacgcgcgtc gcaactgctg
241 gatatttcac tgaccaaacg cagtgcttcg gcgggagagc cgatcccgat ggcggggatt
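The answers to Q13 and Q14 can be cross-checked against the CDS feature (88..2649). The sketch below is an illustration, not the tool used in the lab: it pulls the codon at nt 88-90 out of the first 120 nt of U69873, translates the first ten codons with the standard genetic code, and derives the protein length from the CDS coordinates. The codon table is built from the conventional TCAG ordering; the variable and function names are assumptions for this example.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate DNA codon by codon from its first base (no stop handling)."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

# First 120 nt of U69873, copied from the entry above
first_120 = (
    "ctccggtatcatgtgcgccttatgtgattacaacgaaaataaaaaccatcacaccccatt"
    "taatatcagggaaccggacataaccccatgagtgcaatagaaaatttcgacgcccatacg"
)
cds_start, cds_end = 88, 2649                 # 1-based, inclusive, from the CDS feature

start_codon = first_120[cds_start - 1:cds_start + 2]
first_residues = translate(first_120[cds_start - 1:117])
protein_length = (cds_end - cds_start + 1) // 3 - 1   # codons minus the stop codon

print(start_codon, first_residues, protein_length)
```

The translated residues can be compared with the start of the /translation qualifier, and the computed length with the 853 aa reported for the E. coli protein.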
n. We now know the nucleotide and protein sequences of the E. coli mutS gene, but the
annotation provided does not include the upstream regulatory sequences, i.e., the promoter for
recognition by RNA polymerase, and the Ribosome Binding Site (RBS) by which the ribosome
joins with the mRNA to begin translating the protein.
Let’s go and find these upstream regulatory elements. To do so, we will need access to a
much larger chunk of the E. coli genome, as follows:
o. To begin, follow the three steps at the bottom of p.82 and top of p. 83, BFD, to convert the
nucleotide sequence to a more universally acceptable format, called “FASTA”.
p. Paste the entire nucleotide sequence into a Nucleotide-nucleotide BLAST search: go back to
your BLAST search page, paste the sequence into the Search window, and choose
“Nucleotide” from the blue menu bar, then begin the search by clicking the blue
“BLAST” button. “FORMAT” the BLAST search to retrieve the results in a new window, as
you did with the previous Protein-protein BLAST search.
q. The results will show many independent GenBank entries containing the mutS DNA region.
This will illustrate just how redundant GenBank can be (cf. p. 84, BFD). The database often
contains many different entries of the same information, usually because of independent
submissions by different authors. Most of the top entries in this case correspond to GenBank
files that contain the entire E. coli genome or large chunks of the genome.
r. In this case, open AE016765.1, a “manageable” chunk consisting of 305,000 base pairs. This
is section 11 out of 18 of the complete genome. Give it a moment to fully load….this file
contains a lot of information!
As you scroll down, you will see sequential translations of every Open Reading Frame (ORF),
i.e., every potential gene, that is encoded in this large segment of DNA. At the bottom of the
file, all 305,000 nucleotides are listed.
To make this a bit easier, we did some groundwork for you, and discovered that the mutS
gene lies between nt 115,840 and 118,752, and the ATG codon is at nt 116,191.
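GenBank coordinates are 1-based and inclusive, which is easy to get wrong by one when slicing a sequence in code. The sketch below, a hedged illustration using only the coordinates quoted above, converts such coordinates to a Python slice and works out how much of the region lies upstream of the ATG. The helper to_slice and the toy string are hypothetical.

```python
def to_slice(start, end, offset=1):
    """Convert 1-based inclusive coordinates into a 0-based Python slice.

    `offset` is the coordinate of the first base in the string being sliced.
    """
    return slice(start - offset, end - offset + 1)

REGION_START, REGION_END = 115_840, 118_752   # region given in the text
ATG_POSITION = 116_191                        # first base of the start codon

region_length = REGION_END - REGION_START + 1
upstream_length = ATG_POSITION - REGION_START     # bases before the ATG
coding_length = REGION_END - ATG_POSITION + 1     # ATG through the end of the region

print(region_length, upstream_length, coding_length)

# Toy string standing in for a sequence (hypothetical data)
toy = "nnnnnATGnnn"
print(toy[to_slice(6, 8)])
```

The coding length computed here can be compared with the CDS length in U69873 (88..2649) as a consistency check between the two GenBank entries.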
Q15: Copy/paste this entire sequence below (Courier 10 pt), and highlight (bold) the Start
codon.
                                                 c gcTTGATAaa gatcgatcat
115861 ttttttaatc gtcagttttt cacgagagat acgcttgccg gacatgctgc ctccatactc
115921 ttaaggtgca ttTATATTac aacttAattt taaaggggac tatgtctctg aaggAGGAGG
115981 Tttcttcaac cgacgaattt ggcgatcttg tttcatcgtc atcacgcaaa agctgcaaag
116041 cagcattttt cgcggaaccg acatcaagaa ctcacctttc gcgtcttccc ctgaaatgat
116101 taactccggt atcatgtgcg ccttatgtga ttacaacgaa aataaaaacc atcacactac
116161 atttaacatc agggagccgg acttaacccc ATGagtacaa tagaaaattt cgacgcccat
116221 acgcccatga tgcagcagta tctcaagctg aaagcccagc atcccgagat cctgctgttt
116281 taccggatgg gcgactttta tgaactgttt tatgacgacg caaaacgggc gtcgcaactg
116341 ctggatattt cactgaccaa acgcggtgct tcagcgggag agccaatccc gatggcgggg
116401 attccctacc atgcggtgga aaactacctc gccaaactgg tgaaccaggg cgaatcggtt
116461 gctatttgtg aacagattgg cgatccggcg accagcaaag ggccggtcga gcgcaaagtt
116521 gtccgtatcg ttacgccggg caccatcagc gatgaagcgc tgttgcagga acgtcaggac
116581 aacctgctgg cggctatctg gcaggacagt aaaggtttcg gctacgcaac gctggatatc
116641 agttccggtc gttttcgcct gagcgaaccg gcggaccgcg aaacgatggc ggcagaactg
116701 caacgtacga atccagcgga attactgtat gcggaagatt tcgccgagat gtcgctgatt
116761 gaaggtcgcc gcggcctgcg tcgtcgcccg ctgtgggagt ttgaaatcga caccgctcgc
116821 cagcaattga acctgcaatt tggcacccga gatctggtcg gttttggtgt ggagaacgca
116881 ccacgcggac tttgtgctgc cggttgtctg ttgcagtatg cgaaagatac ccaacgcacg
116941 accctgccgc atattcgttc tatcactatg gaacgtcagc aggacagcat cattatggat
117001 gccgcgacgc gtcgtaacct ggaaattact cagaacctgg ccggcggtgc ggaaaatacg
117061 ctggcttccg tactcgactg cactgtaacg ccgatgggta gtcgtatgct gaaacgctgg
117121 ctgcatatgc cagtgcgcga tacccgcgtg ttgcttgagc gccagcaaac tattggcgca
117181 ttgcaggatt tcaccgccga gttacagccg gtactgcgtc aggtgggcga cctggaacgt
117241 attctggcgc gtctggcgtt gcgtaccgct cgcccgcgcg atctggcccg tatgcgtcac
117301 gctttccagc aactgccaga gttgcgtgcg cagttagaaa atgtcgatag tgcaccggta
117361 caagcgctgc gtgagaagat gggcgagttt gccgagctgc gcgatctgct ggagcgagca
117421 atcatcgaca caccaccggt gctggtacgc gacggtggtg ttatcgcatc gggctataac
117481 gaagagctgg atgagtggcg cgcgctggct gacggcgcga ccgattatct ggagcgtctg
117541 gaggtccgcg agcgtgaacg taccggcctg gacacgctga aagttggctt taatgcggtg
117601 cacggctact acattcaaat cagccgtggg caaagccatc tggcaccaat caactacatg
117661 cgtcgccaga cgctgaaaaa cgccgagcgt tacattattc cggagctgaa agagtacgaa
117721 gataaagttc tcacctcaaa aggcaaagca ctggctctgg aaaaacaact ttatgaagag
117781 ctgttcgacc tgctgttgcc gcatctggaa gcgttgcaac agagcgcgag cgcgctggcg
117841 gaactcgacg tgctggtgaa cctggcggaa cgggcctata ccctgaacta cacctgcccg
117901 acctttattg ataaacctgg cattcgcatt accgaaggtc gccatccggt agttgaacaa
117961 gtactgaatg agccgtttat cgctaacccg ctaaacctgt cgccgcaacg tcgcatgttg
118021 atcattaccg gtccgaacat gggcggtaaa agtacctata tgcgccagac cgcgttgatt
118081 gcgctgatgg cgtatatcgg cagctacgta ccggcgcaaa aagtcgagat tggcccgatt
118141 gaccgtatct ttacccgcgt aggtgctgcg gatgatctgg cttccgggcg ttcaaccttt
118201 atggtggaga tgaccgaaac cgccaatatt ttacataacg ccaccgaata cagtctggtg
118261 ttgatggatg agatcgggcg tggaacgtcc acttacgatg gtctgtcgct ggcgtgggca
118321 tgtgcggaaa atctggcaaa taagatcaaa gcgttgacgc tgtttgctac ccactatttc
118381 gagctgaccc agttaccgga gaaaatggaa ggcgtcgcca acgtgcatct cgatgcactg
118441 gagcacggcg acaccattgc ctttatgcat agcgtgcagg atggcgcagc gagcaaaagc
118501 tacggcctgg cggttgcagc tctggcgggt gtgccaaaag aggttattaa gcgcgcacgg
118561 caaaaactgc gtgagctgga aagcatttcg ccgaacgccg ctgctacgca agtggatggt
118621 acacaaatgt ctttgctgtc cgtaccggaa gaaacttcgc ctgcggtcga ggcactggaa
118681 aacctcgacc cggattcact gactccgcgt caggcgctgg agtggattta tcgcttgaag
118741 agtctggtgt aa
Q16: Locate the –35 promoter sequence. Highlight it (Bold) in the sequence above, and list its
sequence here (keep in mind that this sequence may not perfectly match the consensus
sequence, but it will probably differ by no more than one base from the consensus).
TTGATA, one base different from the consensus TTGACA
Q17: Locate the –10 sequence, keeping in mind that it also may not follow the exact consensus.
Highlight it (Bold) in the sequence above, and list it here:
TATATT, one base different from the consensus TATAAT
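Comparing a candidate promoter element with its consensus amounts to counting mismatched positions. The sketch below, an illustration with hypothetical function names rather than an established promoter-finding method, counts those mismatches for the –35 and –10 answers and scans the highlighted 10-nt block from the sequence for windows within one base of the –10 consensus.

```python
def mismatches(site, consensus):
    """Number of positions at which two equal-length sequences differ."""
    if len(site) != len(consensus):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(site.upper(), consensus.upper()))

def best_matches(sequence, consensus, max_mismatches=1):
    """Return (0-based offset, site, mismatch count) for windows near the consensus."""
    sequence = sequence.upper()
    k = len(consensus)
    hits = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        d = mismatches(window, consensus)
        if d <= max_mismatches:
            hits.append((i, window, d))
    return hits

print(mismatches("TTGATA", "TTGACA"))        # -35 candidate vs consensus
print(mismatches("TATATT", "TATAAT"))        # -10 candidate vs consensus
print(best_matches("ttTATATTac", "TATAAT"))  # scan the highlighted block
```

A scan like this over a longer upstream region will usually return several near-consensus windows, so spacing between the –35 and –10 elements still has to be judged separately.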
Q18: Propose a likely startsite for transcription, and highlight its location above. At what
nucleotide (type of base and nt #) does transcription probably start?
A, 8 nt downstream of the –10 sequence, at nt 115946. Transcription could also
start at A 115947.
Q18: Locate the Ribosome Binding Site (RBS), which has the consensus sequence AGGAGGU
in the mRNA. Highlight it (Bold). In what region of the mRNA transcript is the RBS found?
How far is the RBS from the start codon?
AGGAGGT, the Ribosome Binding Site, is located between the transcription
start site and the start codon, in the 5’ untranslated region (5’ UTR) of the mRNA.
Q19: As oriented above, does this DNA sequence represent the coding strand or the template
strand? Is the DNA sequence as shown oriented from the 5’ to the 3’ end, or from the 3’
end to the 5’ end, of the E. coli mutS gene?
This sequence is oriented from 5’ to 3’ end and represents the coding,
or sense strand of the DNA
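The relationship between the coding strand, the template strand, and the mRNA can be made concrete with a short sketch. It is an illustration with hypothetical function names: the mRNA reads like the coding strand with U in place of T, and the template strand is the reverse complement of the coding strand.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def transcribe(coding_strand):
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")

rbs_dna = "AGGAGGT"                   # RBS as it appears on the coding strand
print(transcribe(rbs_dna))            # the same site as it appears in the mRNA
print(reverse_complement(rbs_dna))    # template strand, written 5' to 3'
```

This is why the RBS is listed as AGGAGGT in the DNA answer but as AGGAGGU in the question, which refers to the mRNA.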