CHAPTER 2
Genome Sequence Acquisition and Analysis
2.1 How Are Genomes Sequenced?
1. Read the sequence from the real X-ray film in Figure 2.2. Record the sequence for
both strands of DNA, with the top strand containing the sequence on the X-ray film.
Be sure to keep track of 5′ and 3′ ends for both strands.
5′ GCA CTT GTT TCT CGGGG CTC AGC TGT ATC AGCC ACGT GCC TAC AAC AAT CTG CCCCT 3′
3′ CGT GAA CAA AGA GCCCC GAG TCG ACA TAG TCGG TGCA CGG ATG TTG TTA GAC GGGGA 5′
Perform a BLASTn (nucleotide sequence) search with the top strand of DNA.
BLASTn searches allow you to query the constantly updated database of all DNA
sequences to find the best matches from the database for your query sequence (see
Math Minute 1.1). Read the top “hit” from the BLAST results. What gene did you
sequence? Now try a BLASTn search with the bottom strand (remember to enter it 5′
to 3′). Do you retrieve the same gene?
LOCUS       CREEZYA 1904 bp mRNA PLN 13-OCT-1993
DEFINITION  Chlamydomonas reinhardtii ezy-1 mRNA, complete cds.
ACCESSION   L20945
VERSION     L20945.1 GI:299182
Bottom strand in 5′ to 3′ orientation.
5′ AGGGG CAG ATT GTT GTA GGC ACGT GGCT GAT ACA GCT GAG CCCCG AGA AAC AAG TGC 3′
Yes, you get the same result.
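The bottom strand can also be derived programmatically. The following is a minimal Python sketch, not part of the original exercise; the function name reverse_complement is an illustrative choice. It reverse-complements the top strand from Question 1 and should reproduce the bottom strand written 5′ to 3′.

```python
# Complement table for unambiguous DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the opposite strand of seq, written 5' to 3'."""
    bases = seq.replace(" ", "")  # drop the spacing used for readability
    return bases.translate(COMPLEMENT)[::-1]


# Top strand as read from the X-ray film in Question 1.
top = ("GCA CTT GTT TCT CGGGG CTC AGC TGT ATC AGCC ACGT "
       "GCC TAC AAC AAT CTG CCCCT")
bottom = reverse_complement(top)
print("5' " + bottom + " 3'")
```

Because reverse-complementing twice returns the original strand, the same function can be used to check a hand-written answer in either direction.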
2. To get a complete understanding of the sequencing process, join two students who
tour the Genome Sequencing Center at Washington University in St. Louis.
As of January 2006, the Washington University Genome Sequencing Center employs 230 people.
3. Go to the Chromat 1 web page and examine the entire sequence. Don’t bother trying
to read the letters yet. Can you tell which end is the 5′ end?
The chromat itself does not label either end as 5′. However, we know that the shortest fragments, which
end nearest the 5′ end of the sequenced strand, migrate fastest on the gel, so the 5′ end must appear
first on the chromat. Furthermore, by convention, we always write DNA with the 5′ end on the top left
and the 3′ end on the bottom right.
4. Beginning at base 80, read 50 bases of the sequence and write down both strands of
the DNA, with the top strand being the one on the chromat.
5′ atgct ctggc cacgg cactt gcgga tccca (30) TGATC TGTGC ACCTG CGATA (50) 3′
3′ tacga caccg gtgcc gtgaa cgcct agggt      ACTAG ACACG TGGAC GCTAT      5′
5. Perform a BLASTn search of the DNA in Chromat 1, but use only the first 30 bases of
the 50. What was your best match? Record the E-value (measures quality of BLAST
hits) presented in the right column. Now BLASTn all 50 bases and compare the new
results with the search that used only 30 bases. Explain what happened to the E-value
and why. You can read Math Minute 1.1 to understand why the E-value changed for
the two BLAST results.
October, 2005 with 30 bases; top hit with E-value of 0.56:
LOCUS       XM_531892 1739 bp mRNA linear MAM 30-AUG-2005
DEFINITION  PREDICTED: Canis familiaris similar to Lamin B1 (LOC474663), mRNA.
October, 2005 with 50 bases; top hit with E-value of 2e-09:
LOCUS       AF154499 2432 bp mRNA linear PLN 12-JUL-1999
DEFINITION  Thalassiosira weissflogii sexually induced protein 1 (Sig1) mRNA, complete cds.
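The E-value decreased substantially. The E-value estimates how many hits with an equal or better score
would be expected by chance in a database of this size. A short query can easily match some database
sequence by chance, but a longer match earns a higher score, and the number of chance hits expected at
that score falls exponentially as the score rises, so the match to the 50-base query is very unlikely to
be due to chance. The statistics from the BLASTn report are shown: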
Number of letters in database: 1,580,765,460
Number of sequences in database: 3,542,931
Lambda  K      H
1.37    0.711  1.31
Gapped
Lambda  K      H
1.37    0.711  1.31
Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
Number of Sequences: 3542931
Number of Hits to DB: 905748
Number of extensions: 43510
Number of successful extensions: 10650
Number of sequences better than 10: 2
Number of HSP’s better than 10 without gapping: 2
Number of HSP’s gapped: 10650
Number of HSP’s successfully gapped: 2
Number of extra gapped extensions for HSPs above 10: 10646
Length of query: 50
Length of database: 15599103720
Length adjustment: 20
Effective length of query: 30
Effective length of database: 15528245100
Effective search space: 465847353000
Effective search space used: 465847353000
A: 0
X1: 11 (21.8 bits)
X2: 15 (29.7 bits)
X3: 25 (49.6 bits)
S1: 12 (24.3 bits)
S2: 18 (36.2 bits)
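The link between score and E-value in the report above can be sketched with the standard Karlin-Altschul relation, E = K · (search space) · e^(−λS), and the bit-score conversion (λS − ln K) / ln 2. This is an illustrative Python sketch, not part of the original manual. Lambda, K, and the effective search space are copied from the report; the raw scores 30 and 50 are assumed values standing in for hypothetical perfect matches under the +1 match score, and they are not meant to reproduce the E-values of the two actual searches, which used different queries and search spaces.

```python
import math

# Values copied from the BLASTn report above.
LAMBDA = 1.37
K = 0.711
SEARCH_SPACE = 465_847_353_000  # "Effective search space"


def bit_score(raw_score):
    """Convert a raw alignment score to a bit score."""
    return (LAMBDA * raw_score - math.log(K)) / math.log(2)


def e_value(raw_score):
    """Expected number of chance hits scoring at least raw_score."""
    return K * SEARCH_SPACE * math.exp(-LAMBDA * raw_score)


# 12 and 18 are S1 and S2 from the report; 30 and 50 are assumed
# raw scores for hypothetical perfect matches of those lengths.
for raw in (12, 18, 30, 50):
    print(raw, round(bit_score(raw), 1), "%.2g" % e_value(raw))
```

With these rounded parameters, the computed bit scores for S1 and S2 land within a few tenths of the 24.3 and 36.2 bits printed in the report, and each additional point of raw score divides the E-value by e^1.37, roughly a factor of four.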
6. Go to Ensembl (European version of NCBI) and click on “Information” under the
“Docs and downloads” menu on the left side. Click on “Download data files.” Are the
genome sequences submitted as one single file? What level of organization has been
used to post the DNA sequences?
Data are posted under headings of DNA, cDNA, peptides, and then three different data formats (EMBL,
GenBank, and MySQL). If you click on one of the FTP links, you will see that the data are bundled into a
series of compressed files to expedite file transfer.
7. Do mammals or amphibians have larger genomes, as revealed on the less expensive
web site? Why does the answer seem counterintuitive?
Amphibians have up to 100-fold more DNA (10^9–10^11 bp) compared to humans (3 × 10^9 bp). This seems
counterintuitive, because we think of ourselves as more complex, but many amphibians have polyploid
genomes, which makes sequencing those genomes very difficult and expensive.
8. Go to the Assembly Archive (chromat database) to view some chromats from an
anthrax sequencing effort. Click on Bacillus anthracis str. (strain) Kruger B. You will
INSTRUCTOR’S MANUAL
see a list of many assemblies; click on Contig ID 607. When the new window opens, you
will see three frames. The top frame shows the coverage of this 15.3 kb segment of
Anthrax genome. The middle frame shows the individual sequences used to assemble the
15.3 kb contig. What does the graph in the top frame summarize from the middle frame?
It shows the fold coverage for this particular portion of the genome. The higher the blue trace, the
greater the fold coverage.
The bottom frame shows you the tiling path of the individual clones that span the
entire 15.3 kb contig. The small red dashes indicate marker sequences used to help create the overlapping tiling path of clones. Mouse over the large blue segment that is centered on the 2 kb tick mark with trace ID ti494464459. When you see this trace ID
number, click on it. A new window will open to show you the sequence, but click on the
box next to “in color” and hit the “Show” button. This will produce a color graph that
indicates the quality assessment score produced by PHRED. Where does the quality
tend to be best? Scroll to the first two regions with quality scores between 0 and 20.
Quality is best in the middle region of the sequence. Lowest scores appear very early in the sequence.
Now change the menu from “FASTA” (plain text format) to “Trace” and hit the
“Show” button. You should see the chromat for this sequencing read. Next to the “in
color” button should be a new option for the applet size. Change “Normal” to “Big”
and hit the “Show” button. Right above the chromat is a “confidence” option; turn
that on. On the far left is a scroll bar; move it down and up to see more and less of the
chromat, respectively. The confidence is indicated as bar graphs for each base, with
higher-confidence bases having longer bars; bar colors match the base colors, not quality assessment values. Find the regions of low quality scores and determine why the
scores were so low.
Low peak height and poor peak spacing are the most common reasons for poor PHRED scores. Note that the
chromat reports more bases than were shown in the FASTA version. Bases whose PHRED scores are too low
are left out of the FASTA-format sequence.
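To make the quality scale concrete: a PHRED score Q is defined from the estimated probability P that a base call is wrong, as Q = −10 · log10(P). The short Python sketch below (function names are illustrative, not from the manual) converts in both directions, so a score of 20 corresponds to a 1-in-100 chance of error.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with PHRED score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """PHRED score for an estimated base-calling error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print("Q%d: error probability %g" % (q, phred_to_error_prob(q)))
```

On this scale, the regions scoring between 0 and 20 in the exercise are those where the estimated error rate is 1% or worse.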
9. Go to the Human Genome Browser and locate section chr19:8,584,715–8,601,616 by
typing it into the search window. Click on the large black box in the gap row and read
how gaps are depicted. Click the “Back” button once on your browser, and scroll
down below the image. Click on “hide all”, except modify these individual options:
base position full; chromosome band dense; gap pack; Ensembl genes dense.
Be sure to click the “Refresh” button at the top of the display options to implement
your modifications; these settings will speed up subsequent navigation.
Black boxes indicate this portion has never been cloned for sequencing. A bridged gap indicates the
gap will be filled in the near future. Gaps such as this one are resistant to cloning or sequencing, often
due to highly repetitive DNA.
Click on the “base” button to the right of the 10X zoom-in button. This will show you
the consensus sequence where known, and an x where there is no sequence information. Below the DNA sequence, all three reading frames are translated with red boxes
marking stop codons and green boxes marking start codons. Zoom out 10X three
times. Is this gap near a gene?
Yes, a gene composed of many exons is to the left of this gap.
Do you think this gap affected the nearest neighboring gene annotation?
It is possible the gene extends into the gap and we don’t know it.
Continue zooming out until you see a second gap. Now hit the button until you
find a third gap. Continue to move through the third gap to define the extent
of this gap. Which gap is bigger, the first one you looked at or this third one? What
chromosomal structure(s) are in the area of the bigger gap?
The third gap is much, much bigger. It spans the centromere.
10. Go to the Finishing web page and determine the order of DNA fragments needed to
build the largest possible contig. How many gaps remain after you have created the
largest possible contigs?
This sequence was taken from the yeast gene MAL13/YGR288W. These 13 fragments can be assembled into 3 contigs with two gaps. However, the last line is composed of overlapping sequence that can
join the first contig. Ultimately, there are two contigs with one gap.
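The manual assembly in Question 10 can be imitated with a simple greedy overlap merge. The Python sketch below is an illustration only: the four fragments are hypothetical toy sequences (not the MAL13/YGR288W fragments from the Finishing web page), the minimum overlap of 4 bases is an assumed threshold, and real assemblers also handle sequencing errors and reverse-complement reads, which this sketch ignores.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def assemble(fragments, min_len=4):
    """Greedily merge the pair with the longest overlap until none remain."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Hypothetical fragments: two overlapping pairs with no overlap between pairs.
fragments = ["ATGGCGTACG", "GTACGTTAGC", "GGATCCTTAA", "CTTAAGCAGT"]
contigs = assemble(fragments)
print(contigs)
print("gaps remaining:", len(contigs) - 1)
```

When the contigs lie along one region, the number of gaps is one less than the number of contigs, which is the same bookkeeping used in the answer above.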
11. Imagine you’re a finisher working on the DNA you assembled in Discovery Question 10.
How might you have isolated the gapped DNA if you knew the entire region of DNA
was 20 kb long?
Extended PCR, or screening a genomic library with probes from the two edges.
12. Go to the genomic DNA #1 (gDNA1) web site, where you will see three pieces of
DNA sequence. Copy and paste one of the sequences and then click on the
“TestCode” button. One at a time, submit the three segments of gDNA to TestCode to
find which one harbors the ORF.
Sequence #1 gives an intermediate score, indicating uncertain coding potential. Sequence #2 gives a low
score. The third sequence comes from human alpha actin and scores very high.
13. Copy and paste the real ORF from Discovery Question 12 into the scramble web site
to have a Perl script generate a scrambled version of the same DNA. Take the scrambled version and resubmit it to TestCode. Does the randomized version of the coding
DNA look like coding DNA? Would you expect it to?
No. Even though the scrambled sequence contains exactly the same bases in a different order, it does not
appear to be coding DNA. This shows that certain DNA patterns, such as codons, can be recognized. We
would not expect random DNA to look like coding DNA when analyzed by TestCode.
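The scrambling step in Question 13 is done by a Perl script on the course web site; the following is a minimal Python sketch of the same idea, not that script. The short ORF is a made-up example. The point it illustrates is that shuffling keeps the base composition identical while destroying the order that coding-detection methods rely on.

```python
import random
from collections import Counter


def scramble(seq, seed=None):
    """Return the bases of seq in random order (same composition)."""
    rng = random.Random(seed)
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


orf = "ATGGCCAAGGTGCTGACCTAA"  # hypothetical toy ORF, not from gDNA1
scrambled = scramble(orf, seed=1)
print(orf)
print(scrambled)
print(Counter(orf) == Counter(scrambled))  # composition is unchanged
```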
14. Calculate the average percent nucleotide identity for the three COX gene regions
from your BLAST2 alignment.
72.6% identity using XM_051899 and NM_000962.
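Percent identity is simply the number of matching positions divided by the alignment length. The Python sketch below shows the calculation for one aligned pair and for several regions combined; the aligned strings and the per-region counts are hypothetical placeholders, not the COX alignment values, and pooling by length (rather than averaging the three percentages directly) is an assumption about how one might combine regions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns where both sequences have the same base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b)
                  if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


def pooled_identity(regions):
    """Length-weighted identity over (identities, aligned_length) pairs."""
    identities = sum(i for i, _ in regions)
    length = sum(n for _, n in regions)
    return 100.0 * identities / length


# Hypothetical aligned pair with one gap and one mismatch.
print(round(percent_identity("ACGT-ACGT", "ACGTTACCT"), 1))
# Hypothetical (identities, length) counts for three aligned regions.
print(pooled_identity([(45, 50), (30, 50), (75, 100)]))
```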
15. Go back to BLAST2 Sequences and enter the two protein accession numbers for
COX2, “NP_000954”, and COX1, “NP_000953”. Be sure to change the search from
BLASTn to BLASTp. Verify that the top blank contains NP_000954, so COX2 will be
the query in the resulting page.
a. What is the overall amino acid identity? Is this higher or lower than the overall
nucleotide identity?
The overall amino acid identity is 64%. The alignment extends over more of the protein than it did over
the full-length mRNA sequences.
b. Notice that a separate percentage is calculated for similarities (called “Positives”),
which takes into account the similar structures of some amino acids. What is the
percent similarity?
80% similarity. The additional positions are not exact matches but conservative substitutions. See Math Minute 2.3 for details.
c. Which parts of the proteins appear to be poorly conserved? Look at the
sequence alignment that uses the single amino acid code and find where one
protein has several Xs in a row, to mark areas of low complexity (see Math
Minute 2.3).
Most of the variation appears in the first half of the proteins, including the additional amino acids found
in Cox-1.
d. Use your browser’s “Find” function to locate the amino acid sequence GAPFS.
Serine (S) is the amino acid modified by aspirin. Is GAPFS in a region of high
sequence identity or similarity? (See pages 336–337 for details.)
16. Go to HGNC (Human Genome Nomenclature Committee) and perform a quick gene
search for cyclooxygenase to see how many cyclooxygenase genes there are in the
human genome. HGNC is a good quick way to perform a gene search with links to
many other databases. However, compare the HGNC results with an OMIM search
using the gene name COX3. Do you find any surprises?
HGNC shows 2 genes. OMIM shows 3 genes.
17. Mouse over the domain boxes to determine the number of different CDs from your
search. Don’t just count the number of boxes, but determine the types of domains
revealed when you mouse over each box. Notice the E-values are provided when you
mouse over each box.
There are four major domains in the dystrophin protein: calponin homology, zinc finger, spectrin, and a
WW domain. Hyperlinks from the results page allow students to define each domain. The only surprise
is the zinc finger, which we normally associate with transcription factors.
18. Click on the “Show” Domain Relatives button and see what hits you get. At what
protein have you been looking?
We have been looking at dystrophin.
19. Go back and click on “gnl/CDD/7333”, to the left of “smart00291,ZnF_ZZ, . . .” Read
the text at the top of your screen. Does this domain have an important function?
Explain your answer.
It is unlikely that a cytoplasmic structural protein such as dystrophin also regulates transcription.
However, there is evidence that muscular dystrophy has a signaling component, and it might
be possible that the dystrophin gene produces truncated proteins that might be functional DNA
binding proteins. Therefore, we should not conclude that dystrophin does or does not have a
transcriptional function. The computer program points out conserved domains so we can be aware
of possible functions that we might not have noticed without this information. See Chapter 10 for
details.
20. Copy and paste this uncharacterized protein (as of spring 2005) amino acid sequence
into web sites of your choosing to characterize the protein’s possible functions.
Determine as much as you can about its structure and function.
Ezy2 in Chlamydomonas reinhardtii (unicellular green alga). From a 2002 PubMed link: Ezy2 is a candidate
participant in the uniparental inheritance of chloroplast DNA. A Kyte-Doolittle plot predicts that it is
an integral membrane protein.
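A Kyte-Doolittle prediction of the kind mentioned above averages per-residue hydropathy values over a sliding window. The Python sketch below uses the published Kyte-Doolittle scale; the 19-residue window and the 1.6 cutoff are a commonly cited rule of thumb for membrane-spanning segments and are assumptions here, and the test peptide is invented, not the Ezy2 sequence.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every window-length stretch of the protein."""
    scores = [KD[aa] for aa in protein.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def has_candidate_tm_segment(protein, window=19, threshold=1.6):
    """True if any window exceeds the assumed hydropathy threshold."""
    return any(v > threshold for v in hydropathy_profile(protein, window))


# Invented peptide: charged ends around a 19-residue leucine stretch.
peptide = "DKRE" + "L" * 19 + "KDER"
profile = hydropathy_profile(peptide)
print(round(max(profile), 2), has_candidate_tm_segment(peptide))
```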
21. How can a single protein yield more than one answer to the why, what, and where
questions to describe its roles in cells?
Why: Some proteins might be involved in more than one large “objective.” For example, tubulin might
be described as providing cytoskeletal structure or movement of flagella. Leptin might be even more
complex since it seems to be involved in many processes.
What: It has been demonstrated that some ion channels are also kinases. The biochemical activities
are very distinct and different so these proteins would need two answers to answer the question “What?”
Where: Hormone receptors that act as transcription factors are located in the cytoplasm and nucleus as
they respond to the absence or presence of hormones, respectively.
22. Go to WormBase (C. elegans gene database), search for the gene “pmr-1”, and learn
its biological process, molecular function, and cellular components (about half way
down, next to Gene Ontology listing). Does pmr-1 have more than one biological
process, molecular function, and cellular component?
There is conflicting information on this protein. It is called a Ca2+ pump. It also pumps Mn2+ but later this
site says Mn2+ inhibits the pump. It seems to be located in the Golgi, but the biological process is uncertain. The main point of this question is to illustrate the gaps in our knowledge and to reiterate that there
is much work to be done even after a genome is sequenced.
23. Search NCBI’s Human Map Viewer using the term “obesity”. You will get hits for
every locus that has obesity associated with it. How many loci do you see? Are they
clustered or distributed throughout the genome?
There are 14 hits in all. They appear to be scattered randomly. The only apparent clustering is due to
the co-listing of mouse genes.
24. Click on the blue number “10” below the cartoon of chromosome 10. What gene did
you identify?
OB10 is neither leptin nor its receptor. It is not known what function this particular locus plays in the
susceptibility to obesity.
*603188
OBESITY, SUSCEPTIBILITY TO, ON CHROMOSOME 10; OB10 Gene map locus 10p
TEXT
Epidemiologic studies suggest that 30 to 70% of the variation in body weight may be attributable to
genetic factors. Hager et al. (1998) undertook a genomewide scan in affected sib pairs to identify chromosomal regions linked to obesity in a collection of French families. Model-free multipoint linkage
analyses revealed evidence for linkage to a region on 10p (MLS 4.85). Two further loci on chromosomes 5q and 2p showed suggestive evidence for linkage of serum leptin levels in a genomewide context. The peak on chromosome 2 coincided with the region containing the proopiomelanocortin gene
(POMC; 176830), a locus previously linked to leptin levels and fat mass in a Mexican-American population (Comuzzie et al., 1997) and shown to be mutated in obese humans (Krude et al., 1998). The findings suggested that there is a major gene on 10p implicated in the development of obesity and other
loci influencing leptin levels.
25. Click on the “Maps & Options” button to modify the view. From the new window, you
can choose from the list in the left window; your choices are displayed in the right window. Modify the display until only “Gene”, “Morbid/Disease”, and “Ideogram” are
displayed. Click on “Morbid/Disease” and then on the “Make Master” button, followed by “Apply”. The ideogram on the far left shows how much of the chromosome
you are viewing. You can zoom in or out as needed. This database allows you to search
for your favorite disease or condition and track down all this information. You could
place an order for this DNA or amplify it yourself using PCR.
This database allows you to search for your favorite disease or condition and track down all this
information. You could place an order for this DNA or amplify it yourself using PCR.
26. Go to Electronic PCR and enter this accession number, “M18533”, in the big open
box to determine if there are any STSs in this sequence. What gene have you located,
and how many STS markers are there? Click on one of the blue links and see how
much information is there. Do you have all the information you need to amplify this
STS? What else did you learn, other than sequences of the primers?
The dystrophin gene has 16 STSs. You know how big the PCR product should be, but you do not know
the temperatures or magnesium concentrations.
27. BLASTn this mystery sequence and select “est_mouse” from the “Choose database”
menu. What can you learn based on the hits you obtained? For example, what gene
have you identified? Scroll down and see how many tissues are described in this
search. Imagine you were studying obesity in mice; how might this help your efforts?
(See Chapter 5 for details.)
A sequence with “similarity to mouse leptin” was identified in this search:
<http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=11520484&dopt=GenBank>
Among other information, we see this:
/organism=“Mus musculus”
/sex=“male”
/tissue_type=“salivary gland”
/dev_stage=“5 months”
You’d know at least one tissue where this gene is expressed.
28. Did the EST database provide you with more than just sequence identification information? With the completion of many genomes, is there any utility for the EST
databases? Support your answer with specific examples.
Yes. The additional information would be useful if you wanted to know when and where a particular
gene was expressed. This could save you a lot of time if you were interested in a particular proteome,
for example.
There is still a significant value in the EST database. Genomic sequence does not inform us where
and when any given gene is expressed. Only data such as ESTs tell us how genes are regulated in a
living organism. In addition, ESTs allow us to discover sequence variations (e.g., SNPs) in the population as well as RNA splicing variants.
29. Go to the UniGene Statistics page to read the latest information on human ESTs.
Based on this information alone, calculate the average number of ESTs for each
human gene (assuming 23,000 genes).
As of 12/2006, there are 5,160,228 human ESTs. From this extensive collection, it would be impossible
to calculate the number of genes, but we can calculate the average number of ESTs per gene:
5,160,228 / 23,000 ≈ 224 ESTs per gene. This may hint at the large number of alternatively spliced
mRNAs as well as mRNAs with alternative 5′ ends due to alternative start sites.
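The division behind the 224 figure is shown below as a small Python sketch, using the December 2006 EST count quoted in the answer and the 23,000-gene assumption from the question. Students working with a newer UniGene count would change only the first number.

```python
human_ests = 5_160_228   # UniGene count quoted in the answer (12/2006)
assumed_genes = 23_000   # gene number assumed in the question

ests_per_gene = human_ests / assumed_genes
print("average ESTs per gene: %.0f" % ests_per_gene)
```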
30. What does GeneCards say about Nox1? Is it an NADPH oxidase? Go to the Human
MapViewer for the human genome, enter “Nox1”, and hit “Find”. You will see a red
dash next to the X chromosome. Click on the blue “X” under the ideogram and notice
the other genes in the area. Next to the gene “NOX1”, click on the link “sv”. You
should see a graphical version of this gene.
GeneCards, and its many mirrored sites, is a great way to learn all about any given gene or protein. On
this page, we learn that Nox1 is both an NADPH oxidase and an ion channel. It also tells us there are
three alternatively spliced forms called Nox1-long, Nox1-short, and Nox1-long variant. You can see the
full description at this URL: <http://thr.cit.nih.gov/cgi-bin/cards/carddisp?NOX1>.
a. How many different mRNAs (listed as CDS for coding sequences) are produced
by this one gene? The color code is just below the expanded view of this gene.
Notice that only the first exon is shown in the sequence (as denoted by the red
bracket in the cartoon above).
REFSEQ proteins (3 alternative transcripts):
NP_008983.1 NP_039248.1 NP_039249.1
b. Use the navigation button icon to zoom out by clicking once on the “” sign. How
many genes are in this region of the X chromosome? Do any other genes produce
more than one mRNA?
Entrez Gene cytogenetic band: Xq22
Ensembl cytogenetic band: Xq22.1
Entrez shows 3 splice variants while Ensembl shows only 2.
31. Take a few minutes to draw a flowchart illustrating the steps you would take to annotate a newly sequenced genome (define the genes; describe each protein’s biological
process, molecular function, and cellular component; summarize the major metabolic
pathways the organism needs to survive and evolve).
1. Search for coding DNA. You could do this using methods such as TestCode, or genome alignment
programs comparing two evolutionarily similar species.
2. See if you can BLAST coding DNA and get hits with known genes. Searching for ORFs and ESTs
may simplify this process. Alternative splicing products may be identified by EST matches.
3. Look up Gene Ontology for complete descriptions of “functions”.
4. You might link out from GO to find metabolites at places such as KEGG in Japan.
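The ORF search mentioned in step 2 can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a gene finder: it scans only the three forward reading frames, looks for an ATG followed in frame by a stop codon, ignores introns and the reverse strand, and uses a made-up 13-base test sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for ATG...stop ORFs in the forward frames.

    start and end are 0-based slice indices; end includes the stop codon.
    min_codons counts the codons before the stop, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


toy = "CCATGGCTTGAAA"  # hypothetical sequence with one short ORF in frame 2
for start, end, frame in find_orfs(toy):
    print(frame, toy[start:end])
```

In a real annotation pipeline the min_codons threshold would be set far higher, since short ORFs arise frequently by chance.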
32. Analyze the dystrophin gene with the Genome Browser. Enter the name “dystrophin”
and click the “Submit” button. You will get a list of hits. Click on the first “dystrophin”
option to see a graphic view of the human dystrophin gene. Use the 3X zoom-out button until you can see the entire gene. You may have to modify the view using the
options below the display to answer the following questions.
a. Are there any STS markers in this gene?
Yes, many of them.
b. Look at the Gap and Coverage lines. Has the public HGP sequenced all of the
chromosome in this region?
Many of the draft sequence gaps have been closed. There are still some in this arm of the X chromosome, but not many. The two major ones are at the telomere and centromere.
c. Change the “Coverage” option to “full” and then count how many BACs span the
DMD gene. Gray BACs are draft-quality sequencing and black BACs are finished.
Can you determine the minimum number of BACs required to span DMD based
on coverage?
32 BACs span DMD.
Only 23 are required to span this region. The other 9 are redundant.
d. How many DMD mRNAs use more than one exon? What can you infer from the
number of alternatively spliced mRNA?
There are over 35 mRNAs reported. At least 23 different mRNAs use more than one exon.
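Part c of this question asks for the minimum number of BACs needed to span DMD. One way to reason about it is as an interval-covering problem: repeatedly pick, among the clones that overlap the point already reached, the one extending farthest. The Python sketch below uses six hypothetical clone coordinates (in kb), not the real DMD BACs, so it illustrates the method rather than the 23-of-32 result.

```python
def min_tiling_path(clones, start, end):
    """Greedy minimum set of (clone_start, clone_end) intervals covering start..end."""
    path = []
    covered = start
    while covered < end:
        # Clones that begin at or before the covered point and extend past it.
        candidates = [c for c in clones if c[0] <= covered < c[1]]
        if not candidates:
            raise ValueError("gap in coverage at position %d" % covered)
        best = max(candidates, key=lambda c: c[1])
        path.append(best)
        covered = best[1]
    return path


# Hypothetical clone coordinates in kb.
clones = [(0, 150), (100, 300), (120, 220), (200, 400), (280, 450), (390, 500)]
path = min_tiling_path(clones, 0, 500)
print(path)
print("required:", len(path), "redundant:", len(clones) - len(path))
```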
33. In the “position” box, enter “7p15.2” and then hit the “jump” button. At the bottom of
the page, hide all features, except set to full “Known Genes” and then set to dense
every species “Net” (e.g., Fugu Net) under “Comparative Genomics”. Then hit
“refresh”. You are looking at 12 Hox genes, which are critical to body development.
Center the Hox genes and zoom in to see the near universal conservation of DNA,
especially the Hox exons.
a. Which species highlight the conserved exons the best, closely related species or
more distant?
Fugu and chicken highlight the exons best—the most distantly related species. The mammals have
more intron conservation and thus the exons do not stand out as much.
b. Set repeat masker to “dense” and refresh this diagram. Do Hox genes contain a lot
of repeats (black indicates repeat sequences) compared to portions outside the
Hox genes? Do you think this is significant? Explain your answer.
At a low magnification, the repeat masker may look like a solid black bar. Encourage students to
zoom in (use 1.5X button to slowly zoom in) and to navigate left or right in order to keep the Hox
genes centered on their displays. When they zoom in, they will realize that the repeat masker shows
repetitive DNA in the human genome except where the coding regions of the Hox genes are located.
This is a good illustration of selection pressure maintaining the integrity of critical regions of the
genome.
c. Set the SNPs (single nucleotide polymorphisms, which can be thought of as point
mutations) option to dense. Do Hox genes have more or fewer SNPs than the surrounding area? (See Color Key.) Do you think the Hox frequency of SNPs is significant? Explain your answer. For a comparison, click on the “Move ”
button.
There are relatively few SNPs in the human Hox genes as you might expect for genes that are critical
for embryonic development. There appears to be less tolerance of genomic variation in this region
compared to neighboring regions. This question foreshadows issues addressed in Chapter 4.
In these three questions, we are working with only one strain of mice, not two.
34. If the Igf2 gene were deleted in the sperm, predict the phenotype in the offspring.
What would the phenotype be if the egg’s Igf2 gene were deleted?
The male’s offspring might be smaller than normal. The female’s offspring would be normal sized.
35. If the maternally expressed gene Igf2r were deleted from eggs, predict the phenotype
in the offspring. What would the phenotype be if the sperm lacked an Igf2r gene?
The female’s offspring would be smaller than normal while the male’s would be normal sized.
36. What would you predict for the offspring if the sperm’s Igf2 gene and the egg’s Igf2r
gene were deleted?
We would expect them to be abnormally small, if they are viable at all.
37. Do you think methylation is an ancient mechanism or one limited to vertebrates?
How could you answer this question?
Students may answer either way with this one. Encourage them to offer methods to answer this
question. The next Discovery Question will help them figure out the answer.
38. Go to GenBank and search “methyltransferase”. Can you find any DNA methyltransferases in organisms other than vertebrates?
Methyltransferases are found in archaea, bacteria, plants, and other organisms. It is a very ancient mechanism for regulating DNA.
39. Jackson-Grusby et al. found that most of the genes with altered expression increased
their expression levels when the methyltransferase was deleted, as you might expect
since methylation normally silences genes. Explain how some genes could be
repressed when hypomethylated.
Loss of the methyltransferase might result in the increased production of transcriptional repressors and
thus some genes might be repressed even though they were not methylated.
2.2 What Have We Learned from Unicellular Genomes?
40. Perform an NCBI Gene search for PPA1880 (use the pull-down menu on the NCBI
main page to select “protein”). Click on the first link, then click on the “CDS” link
to see the DNA. Copy the DNA sequence into the GC calculator to determine the
%GC. P. acnes has an average of 60% GC and human has an average of 41%. Which
genome does PPA1880 more closely match?
At 57% GC, this sequence looks more like P. acnes than human.
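The web-based GC calculator can be mirrored with a short Python sketch. The function names and the toy sequence are illustrative; the 60% and 41% reference values are the genome averages given in the question. The comparison simply reports which average the measured %GC is closer to.

```python
def gc_percent(seq):
    """Percent of A/C/G/T bases in seq that are G or C."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G, or T bases")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


def closer_genome(gc, p_acnes=60.0, human=41.0):
    """Name the genome whose average %GC is nearer to the given value."""
    return "P. acnes" if abs(gc - p_acnes) < abs(gc - human) else "human"


print(round(gc_percent("GGCCAT"), 1))  # toy sequence, not PPA1880
print(closer_genome(57.0))             # %GC reported in the answer above
```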
41. Find AAH14236.1 from NCBI’s protein pulldown menu. Copy the DNA sequence
and determine the %GC. Does this sequence look more human-like or P. acnes-like?
What interesting annotation did you uncover? How could this cDNA get into a
human cDNA library?
With a GC content of 56%, this “human” cDNA looks like a P. acnes coding sequence. The NCBI hit
(September, 2005) says, “This record was removed because the sequence was determined to be an
artifact. Please contact [email protected] for further details.” It was isolated from a human cDNA
library that was probably contaminated with bacteria from the technician’s skin.
42. Now determine the GC content for one rRNA gene and one tRNA gene. How do
they compare to the genome average of 60% GC?
16S and 23S rRNA were 56%, close to the overall GC content. Percent GC for tRNAs varied from low
60s to low 70s.
43. Go to the P. acnes genome view, enter 1 into the “Start from” box, and hit “Go”. Notice
that the first gene is DnaA (accession number YP_054724). Click the blue arrow pointing
to the left. Do you notice anything peculiar about this region upstream of DnaA?
Compare this region to any other by clicking somewhere on the genome map to see any
other region.
The region prior to DnaA was devoid of any annotated genes. This is in stark contrast to the rest of the
genome. However, this is an artifact of the display. If you click on this region of the map, you will not find
any portion of the genome that lacks genes.
44. Find the Conserved Domain of LPXTG. What does this domain help proteins do?
From cd00004: “Gram-positive bacteria, cleaves surface proteins at the LPXTG motif between Thr and
Gly and catalyzes the formation of an amide bond between the carboxyl group of Thr and the amino
group of cell-wall crossbridges. In two different classes of sortases the N-terminus either functions as
both a signal peptide for secretion and a stop-transfer signal for membrane anchoring, or it contains a
signal peptide only and the C-terminus serves as a membrane anchor.”
45. Perform an NCBI Medical Subject Heading (MeSH) search for “CAMP Factor”.
You should see a hit called “CAMP protein, Streptococcus [Substance Name]”.
On the far right side, click on the “link” link and choose “NLM [National
Library of Medicine] MeSH Browser”. What can CAMP factors do to our
blood cells?
It can induce hemolysis but is used “for rapid identification of group B streptococci strains.”
46. Search Google Scholar for autoinducer-2. Do you see any evidence that autoinducer-2
plays a role in communication? Is this protein expressed in many species?
CHAPTER 2 Genome Sequence Acquisition and Analysis
From one abstract: “While the discovery of a diffusible Escherichia coli signaling pheromone, termed
autoinducer 2 (AI-2), has been made along with several quorum sensing genes . . . ” Several species
contain an AI-2 gene.
47. Search for “biofilm” in OMIM and click on the one hit. Perform a find function with
your browser for “biofilm” and see what this protein has to do with preventing biofilm
formation.
“Lactoferrin, a ubiquitous and abundant constituent of human external secretions, blocks biofilm development by the opportunistic pathogen Pseudomonas aeruginosa. This occurs at lactoferrin concentrations below those that kill or prevent growth. By chelating iron, lactoferrin stimulates twitching, a
specialized form of surface motility, causing the bacteria to wander across the surface instead of
forming cell clusters and biofilms.”
48. Conduct PubMed and Google searches for “Blue Light Acne” and see what you can
learn about a novel method to combat acne. Do you think genome sequences can help
us understand this method better? Explain your answer.
It appears that P. acnes is sensitive to 420 nm light, which can lead to its death and thus reduce acne.
This is particularly good for pregnant women who cannot take certain antibiotics. Genome sequencing
provided the clues that suggested and explained this clinical therapy.
Bonus Material: A study of the tetanus genome is available on the book’s
web site.
49. Go to the Microbial List web page and click on “Bacteroides thetaiotaomicron VPI-5482”. Then click on the link “4778” to produce a list of all the proteins in the proteome, ordered the way the genes appear on the chromosome. Find “mannosidase” as
many times as you can (you can stop when you get tired). This basic search illustrates
the high level of polysaccharide utilization enzymes in the proteome—and you only
searched for a single sugar.
Mannosidase appears 18 times on this page.
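The browser find function used here amounts to a case-insensitive count, which can be sketched as follows. The function name `count_matches` and the four-entry list (including its locus tags) are hypothetical stand-ins for the real 4,778-entry page, and this sketch counts matching lines rather than individual occurrences.

```python
def count_matches(annotations, term):
    """Count annotation lines that mention term, ignoring case."""
    term = term.lower()
    return sum(1 for line in annotations if term in line.lower())


# Hypothetical excerpt with made-up locus tags; not copied from the real list.
protein_list = [
    "geneA hypothetical protein",
    "geneB putative alpha-1,2-mannosidase",
    "geneC beta-mannosidase precursor",
    "geneD acetyl-CoA synthetase",
]

print(count_matches(protein_list, "mannosidase"))
```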
50. Go back to the list of 4,778 proteins and do a find function for “one-component” (the
sensor and signal-transduction proteins are fused). Each time you find one, look to see
if a sugar-metabolizing protein is nearby. Perhaps the placement of a sensor gene and
a metabolizing gene is not random. Propose a reason why evolution might have selected for these two types of genes to be neighbors.
One-component appears 22 times in this proteome annotation. Examples of sugar-metabolizing
genes adjacent to one-component genes include: BT2629 putative alpha-1,2-mannosidase; BT2898
endo-1,4-beta-xylanase D precursor; BT2924 acetyl-CoA synthetase; BT4179 polysaccharide
deacetylase. If the sensor signals for sugar regulation, then there might have been selection pressure to keep these two genes close together, so that recombination is less likely to separate them
and reduce fitness.
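Checking each one-component hit for a sugar-metabolizing neighbor can also be done programmatically once the gene list is held in chromosome order. The sketch below is an illustration under stated assumptions: the function name, the keyword tuple `SUGAR_TERMS` (taken from the examples in the answer above), and the six-gene toy list are all hypothetical.

```python
# Illustrative keywords drawn from the example annotations in the answer.
SUGAR_TERMS = ("mannosidase", "xylanase", "deacetylase")


def sensors_with_sugar_neighbors(genes, window=1):
    """Return (sensor_tag, neighbor_tag) pairs.

    genes is a list of (tag, product) tuples in chromosome order. A pair is
    reported when a one-component gene lies within `window` positions of a
    gene whose product matches one of SUGAR_TERMS.
    """
    pairs = []
    for i, (tag, product) in enumerate(genes):
        if "one-component" not in product.lower():
            continue
        lo = max(0, i - window)
        hi = min(len(genes), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            neighbor_tag, neighbor_product = genes[j]
            if any(t in neighbor_product.lower() for t in SUGAR_TERMS):
                pairs.append((tag, neighbor_tag))
    return pairs


# Hypothetical gene order; tags and products are made up for the example.
toy_genes = [
    ("geneA", "hypothetical protein"),
    ("geneB", "one-component system sensor protein"),
    ("geneC", "putative alpha-1,2-mannosidase"),
    ("geneD", "ribosomal protein"),
    ("geneE", "one-component system sensor protein"),
    ("geneF", "hypothetical protein"),
]

print(sensors_with_sugar_neighbors(toy_genes))
```

Widening `window` relaxes the definition of "nearby"; comparing the number of pairs found against a shuffled gene order would be one way to ask whether the placement is non-random.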
51. Look at mannose metabolism, then find Bacteroides in the pull-down menu, and click
on the “Go” button. (The fastest way to do this is to open the menu and start typing
Bacteroides fairly quickly. The species will be highlighted as you spell out the genus.)
Next, click on the oval labeled “Galactose metabolism”. Can you verify that our symbiont is well suited to help us digest sugars?
The green boxes indicate which enzymes are encoded in the displayed pathway. Both mannose (20
enzymes) and galactose (22 enzymes) metabolism are well represented in the B. thetaiotaomicron genome.
52. Search Scientific American for “pylori” to read a surprising proposal published
in 2005 that even the ulcer-causing bacterium H. pylori might produce beneficial
consequences for living in our stomachs. Are we harming ourselves when we take
antibiotics unnecessarily?
“As H. pylori has retreated, the rates of peptic ulcers and stomach cancer have dropped. But at the
same time, diseases of the esophagus—including acid reflux disease and a particularly deadly type of
esophageal cancer—have increased dramatically, and a wide body of evidence indicates that the rise
of these illnesses is also related to the disappearance of H. pylori.” From a February 2005 article written
by Martin J. Blaser.
53. Go to the TIGR CMR web site, then choose “Align Whole Genomes”. Choose from
the two pull-down menus H. influenzae and M. genitalium with a minimum alignment of
20 nt (this display looks best if viewed in Netscape or Firefox browsers). Do these two
genomes look like they evolved from a common ancestor in the recent past? You can
increase the sensitivity by changing the minimum alignment to 15 nt; see if this helps.
The figure above displays a plot of maximally unique matching subsequences (MUMs) between
genomes as identified by the MUMmer program. The minimum alignment length shown here is 20 bp.
Alignments with the same orientation are shown in red and alignments with opposite orientations are
shown in green. Genes are shown along the axis for each genome and are colored by role category.
H. influenzae is on the x-axis and M. genitalium is on the y-axis. There are no obvious patterns that indicate they have a recent common ancestor.
With a 15 bp window, the degree of sequence conservation is more apparent in the figure above, but
gene order does not appear to be conserved.
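Why a shorter minimum alignment reveals more matches can be illustrated with a toy example. The sketch below is not the MUMmer algorithm (which finds maximal unique matches and also handles the reverse strand); it only reports fixed-length k-mers that occur exactly once in each sequence, and the two short sequences and function names are hypothetical.

```python
from collections import Counter


def unique_kmers(seq, k):
    """Return {kmer: start} for k-mers that occur exactly once in seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}


def shared_unique_kmers(a, b, k):
    """Return sorted (pos_in_a, pos_in_b) for k-mers unique in both a and b."""
    ua, ub = unique_kmers(a, k), unique_kmers(b, k)
    return sorted((ua[kmer], ub[kmer]) for kmer in ua.keys() & ub.keys())


# Hypothetical toy sequences sharing the 6-base stretch GATTAC.
seq_x = "GATTACAGGC"
seq_y = "CCGATTACTT"

for k in (7, 6, 4):
    print(k, shared_unique_kmers(seq_x, seq_y, k))
```

Each pair is a dot on the kind of plot described above. With the longest minimum length nothing is shared, while shorter lengths recover the common stretch; on real genomes a shorter minimum also admits more chance matches.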
54. Go to the M. genitalium genome page and choose the “Searches” option from the top
menu; then choose “Name”, search for the genes MG064 and MG101 (see Figure 2.15),
and follow the “GenBank” link to retrieve protein sequences. Does BLAST provide
any insights now?
AAC71282/MG064: Conserved domain shows it has two permease domains that suggest it may help
transport molecules across the cell membrane; COG0577: BLASTp results indicate this protein is an
“ABC-type antimicrobial peptide transport system, permease component [Mycoplasma genitalium
G-37]”, indicating that additional information has been added to the database in the last 10 years.
AAC71319/MG101: Conserved domain indicates helix_turn_helix gluconate operon transcriptional
repressor; BLASTp reports “transcription regulator [Mesoplasma florum L1]”.
55. Go to DEG and search for entry number 1169 (DEG10060038 or clpB) from
M. genitalium. Does the name of this gene lead you to believe it might be an essential
gene? Copy the ORF sequence from the DEG link and perform a BLASTx search
(submit DNA sequence and search for protein matches) against all DEG entries using
DEG’s BLASTx program. Then perform a BLASTx search at NCBI. Which search is
more informative and why?
ATP-dependent Clp protease, ATPase subunit (clpB): as an “ATPase subunit”, this protein appears to be
the energy-yielding portion of a larger protein structure that acts as a protease.
The NCBI BLASTx gave many hits with E-values of 0.0, all of which were clear orthologs. The DEG
BLASTx was not working reliably.
56. Go to the M. genitalium Genome Browser at Genome.Net. Click on “KEGG”
(metabolism database) at the bottom to see an interactive genome map. At the
bottom is a genome alignment tool that will show you genes in your query species
aligned with the species of your choice (select E. coli, then click on “Exec”[ute]). You
will see colored bands for orthologs of the two species and their location in the query
genome. Mouse over the genome, and the green bar will show you on which portion it
will zoom; click on the section with the most color (red). On the new page, click on the
“ORF Color” button to understand the colors, and “View Genome map” to see the
position of the conserved genes. What category of genes are in this area? Are the
genes clustered or widely distributed?
I have selected the region with the most red (see above). Many genes associated with translation are
clustered together in this region.
57. Perform an NCBI Entrez search for Nanoarchaeum equitans. Click on the “MeSH”
link at the bottom to learn about this species. Go back to the full Entrez results, click
on the “Genome” link, and follow this until you see the circular map of the smallest
cellular genome in the world (as of September 2005). Click on “GenePlot” to align
N. equitans and M. genitalium by changing the default species for the pulldown menu
on the right, then click on “compare selected pair”. How many protein-coding genes
are conserved between these two smallest genomes (number of “bets” [best hits] listed
below the 2D alignment maps)? You can navigate around by clicking and changing
the zoom button. Identify the gene they have in common that is located nearest to
their 0–0 origin of the graph.
MeSH results: Nanoarchaeota—A phylum of hyperthermophilic ARCHAEA found in diverse
environments. Year introduced: 2005.
73 genes are conserved.
The gene pair closest to the origin is MG035 and NEQ102. According to the MG035 link, this gene
encodes histidyl-tRNA synthetase.
58. In Discovery Question 57, you determined that both species have histidyl-tRNA
synthetase, but if you studied the whole genome of N. equitans, you would discover it
lacks a tRNAHis for the codon GUG. Go to the tRNAHis web page and BLASTn each
of the three sequences. Note the first hits when you submit each half and then the difference when you submit the full sequence. Do any of these BLAST results indicate
they might be a tRNA gene? Why doesn’t the full sequence give a better score than
just half of the tRNA gene?
>5′ tRNAHis Anti-codon underlined.
TTGCCCCCGTAGCTTAGTGGCAGAGCGCCGGGCTGTGG
The only significant hit is this portion of the N. equitans genome.
>3′ tRNAHis
ACCCGGAGGTCCCGGGTTCGAATCCCGGCGGGGGCC
There are many significant hits, but none of them are very informative.
>Combined tRNA
TTGCCCCCGTAGCTTAGTGGCAGAGCGCCGGGCTGTGGACCCGGAGGTCCCGGGTTCGAATCCCGGCGGGGGCC
Still, it is hard to find a biologically significant match because the annotation is lacking. If you
perform a find function for “tRNAArg”, though, you will see that the best ortholog on this page is a tRNA gene.
The database is still not yielding very good results because genome sequences are poorly
annotated for RNA-encoding genes.
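Before submitting the three queries, it can help to confirm that the combined sequence really is the 5′ half followed directly by the 3′ half. The short check below uses the three sequences exactly as printed above; the variable names are hypothetical.

```python
# Sequences copied from the answer above.
five_prime = "TTGCCCCCGTAGCTTAGTGGCAGAGCGCCGGGCTGTGG"
three_prime = "ACCCGGAGGTCCCGGGTTCGAATCCCGGCGGGGGCC"
combined = (
    "TTGCCCCCGTAGCTTAGTGGCAGAGCGCCGGGCTGTGG"
    "ACCCGGAGGTCCCGGGTTCGAATCCCGGCGGGGGCC"
)

# The combined query should be the two halves joined with nothing in between.
print(combined == five_prime + three_prime)
print(len(five_prime), len(three_prime), len(combined))
```

Because the full query is just the two halves end to end, any difference between its BLAST results and those of the halves reflects how the halves align to the database, not extra sequence in the query.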
59. Perform a Genome NCBI search for “Mimivirus” and click on the link. Change the
view of the genome to show only tRNAs and hit “Refresh”. How many are in this
genome?
Mimivirus has 6 tRNA genes as shown above.
60. In the left frame there is a link called “Protein view”. Click on it to get a list of
Mimivirus protein-coding genes as they appear on the genome. In the “Start from”
field, enter these nucleotide numbers to find three genes and then click on “Go”:
(1) 234000, R194; (2) 267000, L221; and (3) 633000, R480 (L and R refer to genes
pointing to the left or right, respectively). Click on each of the three boxes to find out
the family of proteins. Which one looks most like a eukaryotic protein, based on the
COG sequence similarity results you get with each click of the boxes?
R194 COG3569: Topoisomerase IB
L221 COG0550: Topoisomerase IA
R480 DNA topoisomerase II [Dictyostelium discoideum]
R480 is most similar to eukaryote genes of these three topoisomerase genes.
61. If you were going to construct a minimum genome, would you choose a virus or a
bacterium? Explain why.
You might start with mimivirus and add to it until you had a self-replicating organism. This might be the
easiest way to go. Alternatively, you could try to reduce the size of Nanoarchaeum equitans as a way to
begin with a self-replicating organism and get rid of “extraneous genes”. This question has no right
answer, but was designed to engage students in a thoughtful way with a topic that is currently being
investigated.
62. Go to the Oak Ridge National Laboratory (ORNL) Microbial Genome web site,
choose the Prochlorococcus marinus sp. MED4 genome from the “Finished
Eubacteria” pull-down menu, and note the %GC. Compare this percentage with the
genomes of P. marinus MIT9313 and Synechococcus sp. WH8102. Do the two P. marinus ecotypes look like two genomes of the same species to you?
P. marinus MED4 31.64% GC
P. marinus MIT9313 52.17% GC
Synechococcus sp. WH8102 60.2% GC
It is impossible to tell from %GC alone, but it is striking to see the two Prochlorococcus
ecotypes with more dissimilar GC content than MIT9313 to Synechococcus.
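The comparison in this answer is simple arithmetic on the three percentages listed above, sketched here in Python. The function name `pairwise_gc_gaps` is hypothetical; the %GC values are the ones given in the answer.

```python
from itertools import combinations

# %GC values listed in the answer above.
GC = {
    "P. marinus MED4": 31.64,
    "P. marinus MIT9313": 52.17,
    "Synechococcus sp. WH8102": 60.2,
}


def pairwise_gc_gaps(gc):
    """Return {(genome_a, genome_b): absolute %GC difference} for every pair."""
    return {
        (a, b): round(abs(gc[a] - gc[b]), 2)
        for a, b in combinations(gc, 2)
    }


for pair, gap in pairwise_gc_gaps(GC).items():
    print(pair, gap)
```

The two Prochlorococcus ecotypes differ by about 20.5 percentage points, while MIT9313 and Synechococcus differ by only about 8.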
63. Go back to the P. marinus sp. MED4 genome and click on the “View genome in WebArtemis” link. It will take a while to load this Java applet, but it is worth the wait.
Warning: Do not close the web page that simply says “loading entry—done” or you will
lose the applet. You will see the full, annotated genome in three frames. The top and middle frames are duplicates, but the vertical slider bars allow you to adjust magnification,
with the default showing the top frame in medium scale and the middle frame in high-resolution scale. Thin vertical lines in the top frames represent stop codons. The bottom
frame is the complete list of every gene and annotated feature (e.g., transmembrane
domain, signal peptide, etc.). A “c” in the gene list indicates the “Crick” strand of DNA.
Your browser should display a few new menus that add extensively to the Artemis
viewing and analysis. Under the “View” menu, choose “Show CDS Genes and
Products” and under the “GoTo” menu, choose “Navigator . . .”. Check the “Goto
Feature with This Qualifier Value:”, and search for “cobA” (see Figure 2.19), telling
the navigator to pay attention to case. Double-click on the highlighted gene in the list
of “Genes and Products”, and you will see the ORF displayed in the main graphic
window, complete with DNA and protein sequences. What does cobA do and is it on
the Crick or Watson strand?
CobA is in the first reading frame of the bottom strand (Crick strand). CobA is a putative uroporphyrin-III
C-methyltransferase.
Now click on the main graphic window to make sure it is the active one (a Java
requirement). Under the “Graph” menu, choose “GC content (%)” to see how the
GC content shifts with different genes. In the list of features in the bottom frame,
scroll down until you see a tRNA gene (light green box). Double-click on the box
and look at the %GC. Try a few more genes and describe the pattern you observe.
Artemis is very powerful, so feel free to explore and make new discoveries on your
own.
For the two tRNA genes shown above, the %GC rises above the overall average, which is consistent
with RNA-coding genes in species with low overall GC content.
64. Go back to the ORNL Microbial Genome page, select Synechococcus and note its
%GC again. Now launch the Artemis viewer (wait . . . ) and view the following regions
with attention to the GC graph and how many genes are in these AT-rich sections:
(1) 427233; (2) 622199; (3) 912098; (4) 2379778. Did you notice one of the world’s
longest prokaryote ORFs in one of these sections? Compare this long ORF to the
average gene on the Synechococcus statistics page. You have just examined four areas
with different codon bias and GC content. Propose an explanation for these four
apparent anomalies.
Overall 60.2% GC. Average gene length is 871 bases.
It helps to zoom the top frame out some to see the landscape of these AT-rich regions.
18 genes in the region beginning with base 427233. Note only one gene on the bottom strand.
7 genes in the region beginning with base 622199 (see below).
9 genes in the region beginning with base 912098, including a very long gene on the top strand (over
32 kb long).
13 genes in the region beginning with base 2379778.
One possibility is these regions are sites of horizontal transfer of DNA from the genome of an AT-rich
species.
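The GC graph used to spot these AT-rich sections is essentially a sliding-window %GC calculation. The sketch below is not Artemis's implementation: the function names, the toy sequence, the window size, and the 15-point `drop` threshold are all assumptions chosen for illustration, and only the 60.2% overall figure comes from the answer above.

```python
def gc_windows(seq, window, step):
    """Yield (start, %GC) for each full window along seq."""
    seq = seq.upper()
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        yield start, 100.0 * gc / window


def at_rich_windows(seq, window, step, genome_gc, drop=15.0):
    """Return starts of windows whose %GC is at least `drop` points below genome_gc."""
    return [
        start
        for start, pct in gc_windows(seq, window, step)
        if pct <= genome_gc - drop
    ]


# Hypothetical toy sequence: a GC-rich stretch, an AT-rich island, GC-rich again.
toy = "GCGCGCGCGC" + "ATATATATAT" + "GCGCGCGCGC"

print(list(gc_windows(toy, window=10, step=10)))
print(at_rich_windows(toy, window=10, step=10, genome_gc=60.2))
```

Windows flagged this way are candidates for horizontally transferred DNA, in line with the explanation proposed above; a real analysis would use windows of hundreds or thousands of bases.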
65. Perform a PubMed search for the term “selenocysteine” and find out what this is.
Does it matter functionally whether a protein incorporates a cysteine or a
selenocysteine?
An August 2005 abstract by J. Kohrle summarizes one negative consequence of failure to incorporate
the modified cysteine amino acid. “Limited or inadequate supply of both trace elements, iodine and
selenium, leads to complex rearrangements of thyroid hormone metabolism enabling adaptation to
unfavorable conditions.”
66. Search Google for “isoprenoid Wiki” and select the link for Wikipedia to read what
isoprenoids are. Explain why loss of the apicoplast would be lethal given it’s the
source of isoprenoids.
“Isoprene is formed naturally in plants and animals and is generally the most common hydrocarbon
found in the human body. . . . Also derived from isoprene are phytol, retinol (vitamin A), tocopherol
(vitamin E), dolichols, and squalene. Heme A has an isoprenoid tail.” It is a substrate for the production
of many vital metabolites.
67. Go to KEGG Pathway web site and click on the “ATP synthesis” link to see a model
of ATP synthase. Change the pull-down menu from “Reference pathway” to
“Plasmodium falciparum” located near the top of this long list of species, then click on
“Go”. Does Plasmodium have all the parts necessary to synthesize ATP from an
H+-ion gradient? Explain your answer.
No, Plasmodium lacks several of the subunits for ATP synthesis as indicated by the white boxes in the
figure below for eukaryote F0 and F1 portions of the ATP-synthase.
68. Return to the KEGG Pathway, choose “glycolysis”, and see which enzymes Plasmodium
has. Follow the pathway from β-D-glucose to pyruvate and see if any steps are missing. Compare Plasmodium to Saccharomyces cerevisiae (baker’s yeast) to see which
one has the more robust metabolic capacity. Finally, look at Aminoacyl-tRNA biosynthesis on the list of maps and see if Plasmodium is missing any enzymes needed to
synthesize tRNAs coupled with their amino acids (aminoacyl-tRNA). Would you
predict that a parasite might depend on the host for any of these enzymes? Explain
your prediction and then test it by searching the database.
Plasmodium has all the enzymes required for converting glucose to pyruvate. Yeast has more side
reaction enzymes, but glycolysis is essentially identical. Plasmodium has all the aminoacyl-tRNA synthetase enzymes. This would be expected since the intracellular parasite inside a genetically silent RBC
host could not obtain charged tRNAs from its host.
69. Go to PlasmoDB and view bases 400,000–450,000 on chromosome 1. Below the busy
graphics, choose to hide all options except “%AT”, “BLASTx”, “Genefinder”, and “Pf
Annotation”, all of which should be set to “show one line”. Hit the “Update” button. Do
the BLASTx (DNA query against protein database) hits align with the annotated
genes? Did the predictive software Genefinder identify every exon correctly?
There are some BLASTx hits that did not align with the annotations, but Genefinder did align closely
with the annotated genes. However, the annotation and Genefinder differ on some of the exons and
whether some exons were in the same gene or two separate genes. Mousing over the exons produces
text that explains some of the annotations.
Now go back to PlasmoDB and choose to view the mitochondrial genome at the
bottom of the page. Change the display so that all RNAs and genes are displayed as
“show—expanded”. How many genes and how many RNAs are encoded on this
organellar genome? Explain why the two numbers are different.
The mitochondrion encodes 19 RNAs and 3 proteins. The RNAs are ribosomal RNAs (rRNAs), while the three
proteins are involved in energy synthesis.
70. Go to the Saccharomyces Genome Database (SGD) web site and perform a Quick Search for maltose-metabolizing genes Mal1, Mal2, Mal3, Mal4, Mal6,
Mal10, Mal12, and Mal13. Determine which of these genes are true paralogs or phenotypes with uncertain genomic information. Which gene or genes have the most
detailed information?
The answer to this question is more complex than it first appears.
Mal1: not in systematic sequence of S288C (the reference genome strain)
Mal2: not in systematic sequence of S288C
Mal3: not in systematic sequence of S288C
Mal4: not in systematic sequence of S288C
Mal6: not in systematic sequence of S288C
Mal10: not a gene name recognized by SGD
Mal12: YGR292W, this is a valid, annotated gene. Physically interacts with Mal32 with which it shares
100% amino acid identity. This appears to be a duplicated paralog.
Mal13: YGR288W. This appears to be a paralog with YBR297W (Mal33) but they have only 65% amino
acid identity.
It is interesting to note Mal12 and Mal13 are near each other on chromosome VII, but they do not
appear to be paralogs.
An SGD curator explained the “not in systematic sequence of S288C” response as follows: “SGD is
based on the genomic sequence of strain S288C, which has been the “official” source for the Yeast
Genome Sequencing project. The MAL genes are highly variable among yeast strains and they happen
to be absent from the genome of that particular strain. But these genes do exists in many other laboratory strains of S. cerevisiae and people do study them. That’s why we at SGD have Locus Pages for
them and we do collect whatever literature data we come across.”
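The amino acid identity figures quoted above for the Mal12/Mal32 and Mal13/Mal33 pairs come from pairwise alignments. A minimal sketch of the calculation is shown below. It assumes the two sequences are already aligned to equal length with "-" for gaps and that gap columns are excluded from the denominator; tools differ on that convention, and the peptide strings are hypothetical, not MAL protein sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of non-gap alignment columns in which both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap
    ]
    if not columns:
        raise ValueError("no aligned residues to compare")
    same = sum(1 for x, y in columns if x == y)
    return 100.0 * same / len(columns)


# Hypothetical aligned peptides for illustration only.
print(percent_identity("MKTAYIAK", "MKTAYIAK"))
print(round(percent_identity("MKT-YIAK", "MRTAYLAK"), 2))
```

Two duplicated copies such as Mal12 and Mal32 would score 100 by this measure, whereas a diverged pair scores lower.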
71. At the bottom of the Mal13 page, click on the “MIPS” (German genomics database) link
to see a different source; click “Protein Info” to see details about Mal13p (p for protein).
Are the two databases identical in content, or do they present different information?
MIPS results for Mal13p show more protein information on the results page, while SGD shows less
information but many more links to this type of information. Also, SGD shows some expression
information but MIPS does not.
72. Go to SGD’s Advanced Search to get an up-to-date count on the number of ORFs,
ncRNAs (non-protein-coding), pseudogenes, rRNAs, and tRNAs. The search takes a
couple of minutes.
As of October 2005, there are 6,946 ORFs, ncRNAs, pseudogenes, rRNAs, and tRNAs combined.
73. Compare the yeast gene Sir2 in all the Model Organisms and determine if this gene
is widely conserved. Compare this result with Mfa1. Why might you get such different
results with two genes?
Sir2 is conserved in insects, worms, plants and humans; it is a NAD+-dependent histone deacetylase.
Mfa1 is found only in yeast but it is a mating factor gene and thus would be expected to be restricted to
yeast.
74. Go to the yeast metabolism map; click on the citric acid cycle, then the “More detail”
button until you cannot zoom in any more. Move around the circle until you see the
two isocitrate dehydrogenase genes (Idh1 and Idh2). Mouse over the enzymes to see
the chemical reaction, then click on the EC number 1.1.1.41. On the resulting page, go
down and click on Idh1 and Idh2 to find their chromosomal locations. Are these
redundant genes located next to each other in the genome?
Idh1 is near base 557,000 on chromosome XIV, while Idh2 is located near base 579,000 on chromosome XV.
They are not near each other.
2.3 What Have We Learned from Metazoan Genomes?
75. Enter the BruinFly database created through the original research of many undergraduates at UCLA. First, search for the term “misshapen”. Click on the link in the
first column and view the eyes. Read the description and then zoom in by clicking on
the images. Compare the eye phenotype of misshapen to the patched eye phenotype.
Click on the name “patched” to see information about this gene from FlyBase. Go
back and click on the “P-element insertion site in the genome” to see where the transposable element landed to cause the patched phenotype. Is the insertion in a coding or
noncoding portion of this gene? Explain how this insertion could lead to a mutant
phenotype. If you look at other genes, notice some of the quirky names used by fly
biologists (e.g., Ken and Barbie, Sunday driver, and deadpan).
Both patched and misshapen are involved in signal transduction pathways. However, the patched phenotype is either pupal lethal or cell lethal in the eye. The P-element inserted into the first intron of
patched and thus probably disrupts normal splicing to cause the phenotype. Because the mutation is
outside exons, this may explain why some flies die but others do not.
76. Search Entrez for the largest fly gene, called Kakapo, then click on the “gene” database option. Notice the polytene band location on the results page, and then click on
the link. What name was given to this gene based on its mutant phenotype? How
many different mRNAs are produced from this one gene? While on the gene page,
click on the “link” link in the top right corner and choose the map viewer option. You
should see the gene highlighted with its location shown on the drawing of the polytene
chromosome on the left side. Notice how long this gene is. Click on the “hm” (homology)
link to the right of the gene name. Is this gene found in many different species?
Chromosome: 2R; Location: 50C6-50C12.
Also called short stop: Short Stop (abbreviated Shot) provides an essential link between F-actin and
microtubules during axon extension. Shot associates with the cytoplasmic faces of the basal hemiadherens junction and with the EB1/APC1 complex, and mediates tendon stress resistance by the organization of a compact microtubule network at the muscle-tendon junction.
6 different mRNAs are produced from this one gene.
The gene is over 17 kb long and is conserved in “bilateria” such as mammals, birds, insects, and
C. elegans.
77. Go to Ensembl and click on the fruit fly button to access the fly database. Enter
“P450” in the top text box preceded by the word “with” and click on the top “Look
up” button. Only a few hits will be displayed out of a large number possible; click
CG10093, which may be the first one listed. The gene Cyp313a3 should be displayed
with other genes nearby. Do you see any that may be clustered paralogs? Scroll down
to determine how many exons are used in the mRNA as shown in a diagram.
P450 is alternatively spliced to produce 4 different versions. Nearby paralogs include: Cyp313a2
(probable cytochrome P450 313a2) and Cyp313a5 (probable cytochrome P450 313a5).
78. Go to BDGP and click on the “Expression Patterns” link to see where a particular
mRNA is produced. Search for bicoid (abbreviated bcd) and follow the links until you
see a bar graph and images of the blue-stained mRNA. When and where is bcd
transcribed? Compare bcd expression pattern to the pattern for Mkp3 to see how
differently some genes are transcribed.
Bicoid is transcribed mostly during the first 3 hours (see below) and the mRNA is localized to the anterior 5% of the developing embryo. Furthermore, it appears there are 4 different mRNA splice variants.
Mkp3 was transcribed at all time points during development, but at lower amounts. However, its localization
is much different from bcd (above). Notice the change in embryo localization for mRNA in only 2 hours.
79. Search OMIM for “transferrin” to see what this human protein does. On the page, do
a find function for the word “Alzheimer” to see one possible medical role for this
gene. Now perform a search on FlyBase for “Tsf1” to see if transferrin could be studied in the fly as a possible model for its influence on human Alzheimer’s disease.
From OMIM: A C-to-T substitution at codon 570 replaced proline (in C1) with serine (in C2). The
results showed that each of the 2 variants (second variant is hemochromatosis gene [HFE] missense
mutation C282Y) was associated with an increased risk of AD only in the presence of the other. Neither
allele alone had any effect. Furthermore, carriers of these 2 alleles plus APOE4 (see 107741) were at
still higher risk of AD: of the 14 carriers of the 3 variants identified in this study, 12 had AD and 2 had
mild cognitive impairment.
From FlyBase: Flies do have a transferrin gene, but it is most similar to the human lactotransferrin
gene with 27% amino acid identity and 41% amino acid conservation.
80. Go to the IRGSP web page and click on the status tab to see where the project is currently. “PLN” means that the finished sequence has been submitted to the PLaNt
database. Click on the finished bar graph for chromosome 5 to see a clone-by-clone
list of all the DNA sequenced for this chromosome. Click on the P0668H12 clone link
called “INE” (for INtegrated rice genome Explorer) to see an interactive version of
the rice genome (this launches a Java applet, so be patient). Move down to about
5.5 cM on the chromosome to locate the marker “S14158”; mouse over this text to see
how many different maps it is on, or not on. Click on the text to launch a new window
containing information about this marker. How were these data used by the
sequencers during the assembly phase of the project?
S14158 is not on the EST map but is on the physical and PAC/BAC maps. STS markers such as these
can be used to find overlaps between clones and to build contigs: if two fragments both contain the
same STS, which is known to be unique in the genome, the investigators can verify that the two
fragments overlap.
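The STS logic in this answer can be sketched as a simple containment test. Everything in the example is hypothetical: the function name, the clone names, the toy sequences, and the 8-base marker (real STS markers are much longer and are usually detected by PCR rather than string search). The sketch also assumes the marker is read on the same strand in every clone.

```python
def clones_sharing_sts(clones, sts):
    """Return the sorted names of clones whose sequence contains the STS."""
    return sorted(name for name, seq in clones.items() if sts in seq)


# Hypothetical clones and marker for illustration only.
toy_clones = {
    "cloneA": "TTGACCGTAGGCTAACGT",
    "cloneB": "GGCTAACGTCCATG",
    "cloneC": "AAAACCCCGGGG",
}
toy_sts = "GGCTAACG"

print(clones_sharing_sts(toy_clones, toy_sts))
```

Because the marker is assumed to be unique in the genome, the two clones that contain it must cover the same locus and can be placed in the same contig.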
81. On the clone-by-clone list page, click on “RiceGAAS:Rice Genome Automated
Annotation System” for the same P0668H12 clone you explored by INE. This view
gives you a very different insight into the same DNA. For example, does this portion
of chromosome 5 contain any tRNA genes? Does the repeat masker identify highly
repetitive DNA in genes or outside of genes? Many genes have been predicted by the
various computer programs, but how many of these failed to yield BLASTx, cDNA, or
EST hits and therefore could not be verified by biological evidence? You can change
the view by altering the default preferences in the bottom frame. If you find a segment
you really like, you can select the “MAP Download” link and print out a copy to hang
on your wall.
No, P0668H12 does not contain any tRNA genes. Occasionally, the repeat masker finds repetitive DNA
within the Rice cDNA segments, but most are outside the exons. It appears that BLASTx missed more
than half of the annotated genes. cDNA or EST identified nearly all of the annotated genes missed by
BLASTx.
82. Go to the Rice Annotation Database (RAD) and click on the link to gene length
distribution; then choose, in this order, chromosomes 1, 4, 10, 3. What characteristic
changes the most? Does this characteristic correlate with chromosome length?
Return to the RAD main page and determine the source of the trend you
detected for these four chromosomes by studying the ideogram of the
chromosomes.
There does not appear to be a correlation of gene size and chromosome length. However, the percentage of finished sequence is correlated with gene size, which probably indicates genes have been
mistakenly truncated due to incomplete coverage (as of 10/2005). Blue is finished and annotated DNA
while black is non-finished DNA.
34
I N S T RU C TO R ’ S M A N UA L
83. Go to the Chromosome 8 link and you will get basically a blank page. Click on one of
the black areas in the ideogram at the top, then scroll to the right and left until you see
some content. What area have you landed in, based on what you do not see and the
physical location within the chromosome? Explain why you did not see anything when
you first explored the chromosome.
We have landed on a gap and because gaps have no physical link to DNA, changing the scale from
50 kbp to 2 Mbp does not change your view. Because it is near the center, this may be the centromere
which is often difficult to clone and sequence.
84. Go to BGI’s Rice Information System (RIS), click on the “ComView” tab at the top,
and then click on the “Refresh” button on the right side if the settings indicate the base
organism is 9311 (indica), chromosome 1, the first 1 Mb. Explore the first 5 Mb of chromosomal synteny in jumps of 1,000,000 using the windows at the bottom of the page.
Which section has the lowest level of synteny? Between 9311 (indica) bases 2,000,000
and 3,000,000, find cDNA OsJRFA 107843 and click on the japonica link. How many
SNPs (single nucleotide polymorphisms) did you find? Click on the Mapviewer link to
the right of one SNP’s information (you may have to click on the refresh button to see
the full display). Can you determine if this SNP alters the encoded protein sequence?
Explain what you see and what the limitations are with this visualization of data.
The first Mb has good synteny; the second shows compression in the japonica genome with one cDNA
out of synteny; the third shows more genes out of synteny while the fourth is collinear; the fifth Mb is
mostly syntenic but there are a few cDNA exceptions. OsJRFA 107843 has 3 SNPs but the mapview
does not display with enough clarity to tell if the base view is within coding regions or not. You can click
on the SNP bar graphs to see the largest variations, but indels are also included in this display. This
database is rich with information but the display impedes your ability to determine the effect of variation
on the encoded proteins.
85. Many academics are concerned about free access to genome information. Go to
Syngenta’s web page and read the second paragraph. Follow the links to see how
quickly you can access the data. Compare this effort with the BGI databases in the
preceding Discovery Questions and draw your own conclusions about the availability
of data.
“Torrey Mesa Research Institute (TMRI) has closed its doors, but TMRIs affiliate, SBI, is making the
rice genome sequence available to external scientists. Please send requests for a CD copy of the
rice genome sequence to [email protected].” Students will not be able to analyze these
data.
86. Look at Figure 2.39 and consider chromosome 13. Was it duplicated, or was it unlucky
enough never to have been duplicated? If 13 was duplicated, describe what happened
to the duplicated version. Which chromosome pairs have been duplicated and
retained nearly intact?
Chromosome 13 appears to be duplicated but now split into chromosomes 19 and 5. We cannot tell for
sure if 13 predates 19 and 5; the smaller 2 may have duplicated and fused. None of the duplicated
chromosomes are perfectly preserved, but 5 pairs have retained remarkable amounts of synteny: 2 & 3;
4 & 12; 9 & 11; 7 & 16; and 10 & 14.
87. Find the location of the human genes oligophrenin and arrestin, using MapViewer. Now
go to the Tetraodon Genome Browser and search for oligophrenin and arrestin. Can
you detect the interleaving genes shown in Figure 2.40 and the genome duplication in
Figure 2.39 using these genes? Do they have paralogs near each other in the puffer fish?
oligophrenin on X: There are apparent paralogs on Tetraodon chromosomes 16, 18, and 7.
arrestin on 2: There are 5 apparent paralogs. One on chromosome 1 but the other 4 (plus some splice
variants) are on unassigned chromosomes, which indicates they have not been assembled into a
contig of a known chromosome. Neither of these genes matches Figure 2.40, but oligophrenin on chromosomes 16 and 7 matches the data in Figure 2.39, and arrestin on chromosome 1 and potentially several
others is consistent with the poorly conserved synteny of chromosome 1.
88. Go to the Human Genome Browser and search for “sarcospan”. Is sarcospan highly
conserved in the diverse species shown in the browser? You should see the
Tetraodon Net line; click on the Tetraodon exons until you get a tabular report
showing the gene’s summary statistics. What is the size difference in the human and
fish genes?
Human sarcospan is on chromosome 12. Yes, sarcospan is highly conserved in the diverse species
(see figure). The tabular report shows these differences:
Human size: 670,434
Tetraodon size: 47,601
Difference 622,833 bp
89. Go back to Tetraodon Genome Browser and examine the ideograms. Is the genome
finished yet? Click on a gap and show the resolution at 50 kb. Change the viewing
options so that all are hidden, except turn on “DNA/GC content”, “Gap: All
Sequence Gap”, “Genescan”, “Hox Genes”, “Takifugu ecotigs”, and “Tetraodon
cDNAs”. What effect do gaps have on the number of genes predicted by Genescan?
Compare the predicted genes to the number of cDNAs to see if any validating
sequences are available. If so, how well did Genescan predict the genes’ correct
sizes? Now search for HoxA, zoom out to 200 kb, and change the settings to highlight mouse, human, Takifugu, and Tetraodon gene conservation. Which HoxA gene
is in Tetraodon but not the mammals?
No, the genome is not finished as of 10/2005 since there are many white gaps in a background of blue
finished sequence. If you look at this region (chr10:6654906 . . . 6704905), you will see that a small gap
has resulted in the loss of one exon compared to cDNA. Genescan finds some of the verifiable genes,
but it also predicts some that cannot be verified and even spans some gaps. Predicting gene size is not
very reliable in Genescan. HoxA7 is found in the two puffer fish but not human or mouse.
Bonus Material: A study of the chicken genome is available on the book’s web site.
90. Let’s do some quick estimations about our DNA using these numbers: haploid
genome of 3,289,000,000 bp, 23,000 genes; and the numbers from Table 2.7.
a. What percentage of your genome is spent on genes? Exons? Introns?
18.9% in genes (includes non-transcribed DNA), 0.937% in exons, 2.35% in introns.
b. What percentage of your genes is spent on exons? Introns?
5.0% on exons, 12.5% on introns and the rest is regulatory in nature.
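The arithmetic behind both parts can be sketched in a few lines. The genome size and gene count come from the question. The three per-gene averages are assumptions standing in for Table 2.7, which is not reproduced in this manual; they were chosen because they give the 5.0% and 12.5% quoted in part b.

```python
# Inputs stated in the question.
GENOME_BP = 3_289_000_000
N_GENES = 23_000

# Assumed per-gene averages standing in for Table 2.7 (not given in this manual).
GENE_BP = 27_000     # assumed mean genomic extent of one gene
EXON_BP = 1_340      # assumed mean exon (coding) sequence per gene
INTRON_BP = 3_365    # assumed mean intron sequence counted per gene


def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Part a: share of the whole genome.
genome_in_genes = percent(N_GENES * GENE_BP, GENOME_BP)
genome_in_exons = percent(N_GENES * EXON_BP, GENOME_BP)
genome_in_introns = percent(N_GENES * INTRON_BP, GENOME_BP)

# Part b: share of an average gene.
gene_in_exons = percent(EXON_BP, GENE_BP)
gene_in_introns = percent(INTRON_BP, GENE_BP)

print(f"genome: genes {genome_in_genes:.1f}%, exons {genome_in_exons:.3f}%, "
      f"introns {genome_in_introns:.2f}%")
print(f"gene:   exons {gene_in_exons:.1f}%, introns {gene_in_introns:.1f}%")
```

Students who use different values from the table will get different percentages, but the two parts should stay consistent: the exon share of the genome divided by the gene share of the genome equals the exon share of a gene.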
91. Are any chromosomes missing from Figure 2.43b?
The Y chromosome is missing.
92. Given the information in Figure 2.45, name one aspect that makes humans different
from other species.
The number of proteins dedicated to transcription control is a critical difference. This is not unexpected
since we contain more cell types that require unique combinations of proteins to perform their different
cellular roles.
93. Go to the BLAST2 web page and perform a protein alignment with human sarcoglycan delta (NP_758447) and sarcoglycan gamma (NP_000222) by pasting these protein
accession numbers into the smaller accession boxes and choosing “blastp” from the
program menu. What are the percent identity and percent similarity between these
two proteins? Now align the C. elegans sarcoglycan ortholog sgn-1 (CAA92663) with
both of the human sarcoglycan genes and determine which one is more similar (compare identity, similarity, and gaps).
sarcoglycan delta v. sarcoglycan gamma: Identities 112/211 (53%), Positives 145/211 (68%)
sgn-1 v. sarcoglycan delta: Identities 87/236 (36%), Positives 141/236 (58%), Gaps 2/236 (0%)
sgn-1 v. sarcoglycan gamma: Identities 95/269 (35%), Positives 160/269 (59%), Gaps 7/269 (2%)
94. Go to UCSC’s Genome Browser and search for “syntrophin”. How many syntrophin
genes do humans have? Click on the link for “(NM_009228) syntrophin, acidic 1”.
Change the view to hide all except “Known genes” in full view, and all the different
species “Net” views in dense, then refresh. Is this gene’s structure (combination of
exons and introns) highly conserved? Click on the dog alignment twice until you have
a nongraphic report for this locus of the dog genome, then click on the “Open Dog
browser” link. Set the species “Blat” or “Net” views to full, and “Conservation” to full,
then refresh. Is the conservation confined to the exons only? Explain the significance
of the conservation in the non-coding regions.
Humans have alpha 1, beta 1, beta 2 (two splice variants), gamma 1, gamma 2, and another gamma 2
apparent paralog by the same annotated name (very odd). Syntrophin alpha 1 is highly conserved in all
species listed. Many parts of the introns are also conserved (see figure). Splice sites and mRNA binding
proteins may require sequence conservation even if this does not directly affect the protein sequence.
95. Go to the Comparative Maps web page, and click on “Chromosome 9” under the “Rat
and Mouse compared to Human” column. Using the “Region Shown” box in the left
frame, enter “122M” in the top box and “123M” in the bottom box. Locate the human
gene called PTGS1, which is the cyclooxygenase 1 gene and the target of painkillers
such as aspirin (see Chapter 9 for details). Use the “Maps & Options” button, activate
the “Show Connections”, and hit “Apply”. Do all 3 species have PTGS1? Are these
regions syntenic, or are the human genes not linearly related to those in the two
rodents? What pattern in the alignment of genes is evident just below PTGS1? Click
on the Rat ortholog and choose the “ev” (evidence) link. Do you think there is sufficient data to support the annotation that this is a true ortholog?
All 3 species contain a Cox1 gene and overall, these regions appear syntenic. There is some gene
rearrangement just below the PTGS1 locus. If you follow the links, eventually you will find some rat
mRNAs that have identical structure to the human gene and this is very good evidence for the Cox1
annotation in rat.
96. Go to UCSC’s ENCODE web site and choose “Alpha Globin” from the menu in the
left frame. Alpha Globin is abbreviated HBA, but you can see several genes that begin
with HB. How many can you identify? (You may want to modify the view and zoom in
to help you focus on the HB genes.) Considering the conservation in other species,
how many of these HB genes are conserved in vertebrates? Do all of these genes produce functional protein? Click on “HBA1” to find the answers in text. Click on the
“AceView” link in the “Tools and Databases” table to see more information, including
graphic depictions of alternative splicing for these genes.
You can see Hbz, HBQ1, HBM (predicted), HBA2, and HBA1. All of these are conserved, including
HBM. The report says:
“The human alpha globin gene cluster located on chromosome 16 spans about 30 kb and includes the
following five loci: 5′- zeta - pseudozeta - pseudoalpha-1 - alpha-2 - alpha-1 -3′. The alpha-2 (HBA2)
and alpha-1 (HBA1) coding sequences are identical. These genes differ slightly over the 5′ untranslated
regions and the introns, but they differ significantly over the 3′ untranslated regions. Two alpha chains
plus two beta chains constitute HbA, which in normal adult life comprises about 97% of the total hemoglobin; alpha chains combine with delta chains to constitute HbA-2, which with HbF (fetal hemoglobin)
makes up the remaining 3% of adult hemoglobin. Alpha thalassemias result from deletions of each of
the alpha genes as well as deletions of both HBA2 and HBA1 respectively; some nondeletion alpha
thalassemias have also been reported.”
There are 5 alternatively spliced forms shown in the AceView.
Now go to Ensembl’s Multispecies ENCODE web site and click on the “MultiSpecies
alignment” for the cystic fibrosis gene “CFTR” to display 1 Mb of human, mouse, and
chicken genome alignments. On what chromosomes are CFTR orthologs located for
each species? Scroll down to examine the alignment and notice the difference for
chicken. What must have happened to produce the converging lines seen in the chicken chromosome? Follow the blue line from human through chicken CFTR (the blue
line is centered on the whole gene, not the 5′ end). Is CFTR conserved in chicken, or is
the gene truncated? You can recenter and zoom in to see the genes in better detail.
Human Chromosome 7; mouse 6; chicken 1; Chimp 6; Rat 4; Dog 14. The chicken gene
is inverted compared to the mammalian. The converging lines indicate loss of DNA so the gene is compacted compared to human. If you zoom in to a resolution showing bases, you can see the exon structure of the CFTR orthologs, but it is still difficult to be certain with this resolution for such a large gene,
though some exons do appear absent in chicken. Introns are clearly reduced.
(Figure: human and chicken alignment panels.)
97. Perform a Gene search for the human ITGB3 (integrin beta 3) gene, which is illustrated in
Figure 2.47c. Is the correct version of the gene in the database, or is the number of exons
incorrect? Click on the KEGG pathway for “regulation of actin cytoskeleton”. Where is
integrin located in cells (its cellular component, in GO terminology)? Click on the
highlighted box labeled “ITG” and use your browser’s find function to locate the amino
acids “RNRD”. Does this database have the old or new amino acid sequence?
The corrected version is in the NCBI database with 15 exons. Integrin alpha 11 (ITG) is an integral
membrane protein in the plasma membrane. KEGG also has the corrected sequence RNRDAPEGG . . .
Math Minute 2.1
What Can You Learn from a Dot Plot?
1. Suppose that two sequences are identical except that a segment is inverted in one
sequence, relative to the other. Explain how such an inversion would be displayed in a
dot plot.
The inversion would create a black streak in the dot plot, running diagonally from top right to bottom left.
An example is shown below, obtained by comparing the protein query in Discovery Question 10 in
Chapter 1 to the same sequence with amino acids 8–20 reversed. This dot plot was made with dotplot.xls.
Other examples can be seen in Figures 2.4 and 2.20. Note that inversions are not identifiable in the
sliding window view because the sequences are no longer similar along a diagonal segment from top
left to bottom right, the order in which the sliding window measures similarity.
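For instructors who want a text-only demonstration, the pattern can be sketched as follows. This is a minimal illustration, not a replacement for dotplot.xls: the 12-letter sequence is hypothetical and uses each letter once so that every dot is unambiguous.

```python
def identity_dots(rows, cols):
    """Return the set of (i, j) cells where rows[i] == cols[j]."""
    return {(i, j) for i, a in enumerate(rows) for j, b in enumerate(cols) if a == b}


def invert_segment(seq, start, end):
    """Return seq with seq[start:end] reversed (0-based, end exclusive)."""
    return seq[:start] + seq[start:end][::-1] + seq[end:]


# Hypothetical sequence with no repeated letters.
original = "ACDEFGHIKLMN"
inverted = invert_segment(original, 3, 9)   # reverse positions 3..8

dots = identity_dots(original, inverted)

# Outside the inversion the dots sit on the main diagonal (i == j).
main_diagonal = sorted((i, j) for i, j in dots if i == j)
# Inside it they sit on an anti-diagonal, where i + j is constant.
anti_diagonal = sorted((i, j) for i, j in dots if i != j)

# Print the plot: rows are the original sequence, columns the inverted one.
for i, a in enumerate(original):
    print(a, "".join("#" if (i, j) in dots else "." for j in range(len(inverted))))
```

The printed grid shows the main diagonal broken in the middle by a short streak running the opposite way, which is the signature of the inverted segment.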
2. Suppose that two sequences are identical except that a segment is deleted from the
middle of one sequence and not from the other. Explain how such a deletion would be
displayed in a dot plot.
In this situation, there would be a diagonal segment of black, then a horizontal jump (if deletion in
sequence labeling the columns) or a vertical jump (if deletion in sequence labeling the rows), after
which the diagonal segment would pick up again. An example is shown below, obtained by comparing
the sequence from Discovery Question 10 in Chapter 1 to the same sequence with amino acids 8–20
deleted in the sequence labeling the rows. Unlike inversions (see Math Minute Discovery Question 1)
deletions are identifiable in both the identity and sliding window similarity dot plots.
3. What would be the value of using a dot plot to compare a sequence to a second
sequence, as well as the reverse complement of that second sequence?
By comparing to both the original sequence and the reverse complement of the sequence, homology
could be detected on both strands. For example, if one sequence codes for a gene, and the other is the
complementary strand for a homologous gene, this homology could still be detected.
4. Dot plots can detect many interesting sequence features by using the exact same
sequence on both the horizontal and vertical axes. Sketch the dot plot of a 100 kb
sequence in which a 20 kb segment is duplicated.
The dot plot would be symmetric, with a solid diagonal line (because the sequence is identical to itself).
The repeat shows up as lines, above and below the main diagonal, and 20% the length of the main
diagonal. The basic pattern should be like the following:
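The expected sketch can be checked with a scaled-down model. In the sketch below, which is an illustration under stated assumptions, each symbol stands for 1 kb, integer labels keep every kb distinct, and the positions of the two copies (10 to 30 kb and 60 to 80 kb) are hypothetical.

```python
# Scaled-down stand-in for the 100 kb sequence: 100 symbols, one per kb.
# Integer labels keep every kb distinct, so only the duplication adds dots.
seq = list(range(100))
seq[60:80] = seq[10:30]          # hypothetical placement of the 20 kb duplicate

# Self comparison: a dot wherever two positions carry the same symbol.
dots = {(i, j) for i in range(100) for j in range(100) if seq[i] == seq[j]}

main = {(i, j) for i, j in dots if i == j}
one_side = sorted((i, j) for i, j in dots if j > i)     # repeat line on one side
other_side = sorted((i, j) for i, j in dots if j < i)   # its mirror image

print("main diagonal dots:", len(main))
print("off-diagonal dots per side:", len(one_side), len(other_side))
print("offset of the repeat line from the diagonal:", {j - i for i, j in one_side})
```

Each off-diagonal line has one fifth as many dots as the main diagonal, and its distance from the diagonal equals the spacing between the two copies.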
5. Sketch the dot plot of a 1 kb sequence in which a motif of approximately 50 consecutive bases appears six times in the sequence.
The properties of this sketch are similar to the sketch in Math Minute Discovery Question 4, but with shorter
sequences repeated more often. Multiple repeats result in a checkerboard type pattern such as the following:
6. What would be the value in using a dot plot to compare a sequence to its own reverse
complement?
Basic RNA structure can be detected, since hairpin loops show up as identical on a dot plot comparing
a sequence to its own reverse complement. The bottom two plots in dotplot.xls allow students to
explore this possibility. For example, the following plot of identity was made by comparing the sequence
ACGTGGCCATATATCGCCACGT to its own reverse complement. Notice that the last 7 nt are the reverse complement of the first 7 nt,
which can be seen by the diagonal strip in the sliding window plot.
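The same comparison can be reproduced outside the spreadsheet. The sketch below is a minimal stand-in for the dotplot.xls plots: it builds the reverse complement and marks identities along the main diagonal only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


seq = "ACGTGGCCATATATCGCCACGT"
rc = reverse_complement(seq)

# Mark identities along the main diagonal of the seq-versus-rc dot plot.
diagonal = "".join("|" if a == b else "." for a, b in zip(seq, rc))

print(seq)
print(diagonal)
print(rc)
```

The runs of seven identities at each end correspond to the stem: the first 7 nt match the reverse complement of the last 7 nt. The shorter run in the middle arises because ATATAT is its own reverse complement.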
Math Minute 2.2
How Do You Find Motifs?
1. Go to JASPAR and select “Browse profiles by class”. Scroll down to the TATA box
and click on the “View” button in this row. Verify that the values in Table MM2.1 are
displayed in this window. Explain how the sequence logo represents the information
in Table MM2.1.
The sequence logo represents the frequency of each base at a position by the relative height of that
letter in the stack. The extent to which the position is conserved, i.e., contains the same base, is represented by the overall height of the stack of letters at that position.
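One common way to turn base counts into logo heights is sketched below. It assumes the standard information-content convention (stack height = 2 bits minus the column entropy, with no small-sample correction) and uses hypothetical columns rather than the values in Table MM2.1.

```python
import math


def logo_heights(counts):
    """Return (stack_height_bits, {base: letter_height}) for one column of counts."""
    total = sum(counts.values())
    freqs = {base: c / total for base, c in counts.items()}
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    stack = 2.0 - entropy            # 2 bits is the maximum for a 4-letter alphabet
    return stack, {base: p * stack for base, p in freqs.items()}


# Hypothetical columns, not the values from Table MM2.1.
conserved = {"A": 0, "C": 0, "G": 0, "T": 100}   # always T
mixed = {"A": 50, "C": 0, "G": 0, "T": 50}       # A or T equally often
uniform = {"A": 25, "C": 25, "G": 25, "T": 25}   # no preference

for name, column in [("conserved", conserved), ("mixed", mixed), ("uniform", uniform)]:
    stack, letters = logo_heights(column)
    print(name, round(stack, 2), {b: round(h, 2) for b, h in letters.items()})
```

A fully conserved column gives a single tall letter, a two-way split gives two half-height letters in a shorter stack, and a column with no preference has no height at all.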
2. Return to the JASPAR “Browse profiles by class” page. Find transcription factors with
ID numbers MA0040, MA0041, and MA0047. By looking at the sequence logos,
explain which of these three transcription factors in rat is most likely to bind to DNA
containing the motif TGTTTA.
MA0040 is most likely to bind to DNA containing TGTTTA because it has the tallest representations of these
bases at positions 4–9. Therefore, this is a strongly preferred binding site for this transcription factor.
3. Use the spreadsheet pwm.xls to compute the total TATA PWM score of the following
three sequences, and determine which one is most likely to be a true TATA box:
ATATATATAGGCTGG, CTATATATATGCTGG, CTATAAATAGGCCGG.
Using the TATA box PWM matrix:
ATATATATAGGCTGG produces a total PWM score of 10.95
CTATATATATGCTGG produces a total PWM score of 10.55
CTATAAATAGGCCGG produces a total PWM score of 14.99
Because it has the highest score, the third sequence, CTATAAATAGGCCGG, is the most likely to be a
true TATA box.
4. Use the spreadsheet pwm.xls to compute the total TATA PWM score of
CCGGCCTATTTATAG. Explain why the score is so high, even though this sequence does
not look like a true TATA box. (Hint: How is this sequence related to one of the
sequences in Math Minute Discovery Question 3?)
The reverse complement of this sequence has a high score (14.99). It is the same score as the third
sequence in Question 3, because it is precisely the reverse complement of that sequence.
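The strand relationship can be verified directly. The sketch below assumes that the spreadsheet reports the better of the two strand scores, and it uses a toy matrix (+1 for the consensus base, -1 otherwise) built from the third sequence in Question 3. The toy matrix is hypothetical and is not the TATA PWM in pwm.xls, so its scores will not match the values above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def toy_pwm(consensus, match=1.0, mismatch=-1.0):
    """Build a stand-in PWM: one {base: score} dict per position."""
    return [{b: (match if b == c else mismatch) for b in "ACGT"} for c in consensus]


def pwm_score(pwm, seq):
    """Sum the per-position scores of seq, which must be as long as the PWM."""
    return sum(column[base] for column, base in zip(pwm, seq))


def both_strand_score(pwm, seq):
    """Score a site on both strands and keep the better strand."""
    return max(pwm_score(pwm, seq), pwm_score(pwm, reverse_complement(seq)))


pwm = toy_pwm("CTATAAATAGGCCGG")    # hypothetical matrix, not the one in pwm.xls
query = "CCGGCCTATTTATAG"

print("reverse complement:", reverse_complement(query))
print("forward strand only:", pwm_score(pwm, query))
print("best of both strands:", both_strand_score(pwm, query))
```

The forward strand scores poorly, but the reverse complement is identical to the Question 3 sequence, so a two-strand scan gives the query the same score as that sequence.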
Math Minute 2.3 What Are “Positives” and What Do They Have
to Do with E-values?
1. What are the smallest and largest values in the BLOSUM62 matrix? Explain why the
diagonal entries of the table are not all the same, even though they all correspond to
matches.
−4 is the smallest score, and 11 is the largest score. The diagonal entries are not all the same because
some amino acids resist mutation more than others.
2. Repeat the BLAST2 protein alignment you performed in Discovery Question 15, but
with COX1 (NP_000953) as the query this time. By looking at the resulting alignment
and the BLOSUM62 matrix, explain the difference in raw scores in this comparison
and your original comparison. Focus your search on the low-complexity regions. To
verify that the low-complexity regions are the only difference between using COX1 or
COX2 as the query, repeat the two comparisons with filtering turned off. The filter
check box is just above the box for sequence 1 accession number.
The bit score in the original search, with COX2 as the query, is 745 bits. But when COX1 is the query,
the bit score is 738 bits. The difference is due to screening out, or filtering, low complexity regions.
These regions are marked with X in the alignment. For example, when COX2 is the query, amino acids
182–198 are filtered out as low complexity. By clicking on the word Filter on the BLAST2 page, you can
read about the source of the filtering algorithm (SEG program of Wootton & Federhen) (Computers and
Chemistry, 1993).
Repeating the alignments with filtering turned off results in a bit score of 787, regardless of which
sequence is the query.
3. Repeat the BLAST2 search you did to compare COX1 and COX2 at the nucleotide
level, but this time change “Reward for a match” to 2 and “Penalty for a mismatch” to
3. (Blanks for these numbers are just below where you select the blastn program.)
What similarities and differences do you see in your results, as compared to your original search? Now repeat the search with the mismatch score remaining at 3, but the
match score returned to the default of 1. Are there any unexpected outcomes?
Explain the changes in hits for the three different nucleotide alignments, and how
your results relate to the three different BLOSUM matrices described earlier.
The original search found 3 distinct regions of similarity, as shown in the snapshot on the left. The modified alignment, with the match reward set at 2 and the mismatch set to 3, found a single, longer
region of similarity, as shown in the snapshot on the right. More bases are included in the second, modified, alignment, because the penalty for a mismatch is much smaller, relative to the reward for a match.
Thus, many more mismatches are accepted in the alignment, allowing the three regions of stronger
similarity to be stitched together into one long region.
When the penalty for a mismatch is increased to 3, without a corresponding increase in the reward
for a match (leaving it at 1), there are no significantly similar regions in the alignment. The default
scores used in the first alignment are comparable to the general purpose BLOSUM62 matrix. The
scores used in the second alignment are comparable to the BLOSUM45 matrix, best for comparing
more divergent sequences. The scores used in the third alignment are comparable to the BLOSUM80
matrix, best for comparing more highly conserved sequences, since it does not return hits unless the
sequences are strongly similar.
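The stitching effect can be illustrated with a toy calculation. The sketch below is not BLAST: it finds the highest-scoring ungapped segment in a hypothetical alignment of two 10-base matching blocks separated by 6 mismatches, and it assumes a default mismatch penalty of 2. It shows only whether the blocks are joined, not whether a hit is statistically significant.

```python
def best_segment(pattern, reward, penalty):
    """Return (score, start, end) of the highest-scoring ungapped segment.

    pattern uses '|' for a matching column and 'x' for a mismatch; end is
    exclusive. This is a maximum-subarray scan, not the BLAST algorithm.
    """
    best = (0, 0, 0)
    running, start = 0, 0
    for i, ch in enumerate(pattern):
        running += reward if ch == "|" else -penalty
        if running <= 0:
            # A segment that has dropped to zero is abandoned; start afresh.
            running, start = 0, i + 1
        elif running > best[0]:
            best = (running, start, i + 1)
    return best


# Hypothetical alignment: two 10-base matching blocks separated by 6 mismatches.
pattern = "|" * 10 + "x" * 6 + "|" * 10

for reward, penalty in [(1, 2), (2, 3), (1, 3)]:   # (1, 2) is an assumed default
    score, start, end = best_segment(pattern, reward, penalty)
    print(f"reward {reward}, penalty {penalty}: score {score}, columns {start}-{end}")
```

Only the reward 2, penalty 3 setting makes it worthwhile to pay for the mismatches and join the two blocks into one segment spanning all 26 columns.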
4. You can change the nucleotide scoring parameters in a regular BLASTn search, too.
Go back to the sequence you read from the chromatogram in Discovery Question 4,
and enter the 50 bases as your query sequence in a BLASTn search. Before submitting your search, scroll down to the “Other Advanced” window and type “-r 2” to
set the match score to 2. (You can click on the link next to this window to see what
other parameters you can change.) Now submit your query. How are your hits
different from the hits you got in Discovery Question 5 (with default value of 1), and
why? Find the numerical evidence of the change you made at the bottom of the
BLAST report.
Most of the hits are to the same sequences, and have the same alignments. However, the raw scores,
bit scores, and E-values are different because of the modified reward for a match. The parameters
lambda and K change to adjust for the change in the match score.
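The role of lambda and K can be made concrete with the standard conversions from raw score to bit score and E-value. In the sketch below the formulas are the usual Karlin-Altschul relations, but the lambda and K values and the database size are placeholders; the real values should be read from the bottom of each BLAST report.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to bits: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, lam, k, query_len, db_len):
    """Expected number of chance hits: E = K*m*n*exp(-lambda*S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


QUERY_LEN = 50            # the 50-base query from the question
DB_LEN = 1_000_000_000    # hypothetical database size

# A perfect 50-base match under two reward settings.
# The lambda and K values are placeholders, not values from a real report.
settings = {
    "reward 1": {"raw": 50 * 1, "lam": 1.3, "k": 0.6},
    "reward 2": {"raw": 50 * 2, "lam": 0.6, "k": 0.3},
}

for name, s in settings.items():
    bits = bit_score(s["raw"], s["lam"], s["k"])
    e = e_value(s["raw"], s["lam"], s["k"], QUERY_LEN, DB_LEN)
    print(f"{name}: raw={s['raw']} bits={bits:.1f} E={e:.2g}")
```

Doubling the match reward doubles the raw score of the same alignment, so lambda must shrink to keep bit scores and E-values on a comparable scale.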
Math Minute 2.4
Can You Estimate the Number of Inversions in a Dot Plot?
1. Go to the GRIMM site and enter the mouse gene order from the top line of
Table MM2.6 into the “Source genome” box. You can leave the “Destination genome”
box empty, because the program assumes the default target order of positive numbers
1 through 11. Select “multichromosomal or undirected” and “signed” options before
hitting the “run” button. Does the program sort by reversals in the same number of
steps as in Table MM2.6? Explain the differences in the reversal steps between the
GRIMM site and Table MM2.6. How do these differences affect how you interpret
the results of sorting by reversals?
Yes, the program sorts by reversals in the same number of steps (7) as in the table. However, the steps
are different. In the GRIMM program, the first two reversals are of single genes (6 and then 9), whereas
the first two steps in the table are reversals of 7 and 6 genes, respectively. The single gene reversals
occur in steps 3 and 4 in the table. It is important to realize that the number of reversals may be minimized by more than one reversal process. In other words, the predicted inversion history is not unique.
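The non-uniqueness is easy to demonstrate on a tiny case. The sketch below uses a hypothetical two-gene order, not the order in Table MM2.6, and a brute-force breadth-first search in place of the algorithm used by GRIMM; it is only practical for very short gene orders.

```python
from collections import deque


def reverse(order, i, j):
    """Signed reversal of order[i:j]: flip the segment and negate every gene in it."""
    return order[:i] + tuple(-g for g in reversed(order[i:j])) + order[j:]


def apply_history(order, steps):
    """Apply a list of (i, j) reversals in turn and return the final order."""
    for i, j in steps:
        order = reverse(order, i, j)
    return order


def min_reversals(start, target):
    """Brute-force search for the fewest signed reversals from start to target."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        order, steps = queue.popleft()
        if order == target:
            return steps
        for i in range(len(order)):
            for j in range(i + 1, len(order) + 1):
                nxt = reverse(order, i, j)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + 1))
    return None


start = (2, 1)            # hypothetical two-gene example
target = (1, 2)
history_a = [(0, 1), (0, 2), (0, 1)]   # begins by flipping the first gene
history_b = [(1, 2), (0, 2), (1, 2)]   # begins by flipping the second gene

print(apply_history(start, history_a), apply_history(start, history_b))
print("minimum number of reversals:", min_reversals(start, target))
```

Both histories reach the target in the minimum number of steps, yet they share only the middle reversal, so the count of reversals is better supported than any particular sequence of events.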
2. At the GRIMM site, select “Human Mouse (123 genes)” from the “choose sample
data” drop-down menu. Scroll down to see the results. What operations other than
reversals have been performed? Why are these additional operations needed?
Fusions and translocations are also included in the history. These are needed because multiple chromosomes are being compared between mouse and human, and each mouse chromosome contains genes
from multiple human chromosomes, and vice versa. To achieve the human ordering from the mouse ordering, genes must be moved across chromosomes as well as inverted on an individual chromosome.
Math Minute 2.5
How Do You Fit a Line to Data?
1. One possible explanation for the choice of line (1) over lines (2) and (3) is that the
regression line was constrained to go through the origin. Why would it make sense
for the line to pass through the origin?
If a chromosome truly had 0 CpG islands per Mb, it would be a very simple chromosome. Because you would
expect at least one CpG island to appear just by chance on a sequence as long as a chromosome, the
case where there are 0 CpG islands could be thought of as a “zero” or “empty” chromosome. An
“empty” chromosome, one without enough sequence complexity to have even one CpG island, might
also be expected to have zero genes on it.
2. Redraw lines (1), (2), and (3) with the additional restriction that each line must go
through the origin. Under this restriction, do you agree with the investigators that
line (1) is the best fit?
The line of best fit of type (1) under the restriction that the line must go through the origin would look
something like the solid line in the figure below. The line of type (2) through the origin would look something like the dotted line, and the line of type (3) through the origin would look something like the
dashed line. Lines (1) and (2) are both reasonable explanations. Line (3) misses so much of the cluster
of points that it does not seem reasonable. When the line is not required to go through the origin, line (3)
is much more reasonable, going through the cloud of points and near the four supposed outliers
(something like the red line in the figure below).
This example shows that assumptions are key to interpreting lines of best fit. If you believe the line
should go through the origin, you would probably say (1) is the best, but if you do not believe the line
should go through the origin, you might say (3) is the best because it is a model for all the available
data.
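The effect of the origin constraint can also be shown numerically. The sketch below fits the same points with and without an intercept using ordinary least squares; the four data points are hypothetical and are not read from the figure.

```python
def fit_through_origin(xs, ys):
    """Least-squares slope for y = b*x, with the intercept fixed at 0."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_with_intercept(xs, ys):
    """Ordinary least-squares (slope, intercept) for y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical points (CpG islands per Mb, genes per Mb); not the figure's data.
xs = [5.0, 10.0, 15.0, 20.0]
ys = [7.0, 9.0, 11.0, 13.0]

slope_origin = fit_through_origin(xs, ys)
slope_free, intercept_free = fit_with_intercept(xs, ys)

print(f"through origin: y = {slope_origin:.3f} x")
print(f"free intercept: y = {intercept_free:.3f} + {slope_free:.3f} x")
```

The two fits give noticeably different slopes for the same points, which is why the choice of whether to force the line through the origin has to be justified biologically before the slopes are compared.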