CHAPTER 2 Genome Sequence Acquisition and Analysis
2.1 How Are Genomes Sequenced?
1. Read the sequence from the real X-ray film in Figure 2.2. Record the sequence for both strands of DNA, with the top strand containing the sequence on the X-ray film. Be sure to keep track of 5′ and 3′ ends for both strands.
5′ GCA CTT GTT TCT CGGGG CTC AGC TGT ATC AGCC ACGT GCC TAC AAC AAT CTG CCCCT 3′
3′ CGT GAA CAA AGA GCCCC GAG TCG ACA TAG TCGG TGCA CGG ATG TTG TTA GAC GGGGA 5′
Perform a BLASTn (nucleotide sequence) search with the top strand of DNA. BLASTn searches allow you to query the constantly updated database of all DNA sequences to find the best matches from the database for your query sequence (see Math Minute 1.1). Read the top “hit” from the BLAST results. What gene did you sequence? Now try a BLASTn search with the bottom strand (remember to enter it 5′ to 3′). Do you retrieve the same gene?
LOCUS CREEZYA 1904 bp mRNA PLN 13-OCT-1993
DEFINITION Chlamydomonas reinhardtii ezy-1 mRNA, complete cds.
ACCESSION L20945
VERSION L20945.1 GI:299182
Bottom strand in 5′ to 3′ orientation:
5′ AGGGG CAG ATT GTT GTA GGC ACGT GGCT GAT ACA GCT GAG CCCCG AGA AAC AAG TGC 3′
Yes, you get the same result.
2. To get a complete understanding of the sequencing process, join two students who tour the Genome Sequencing Center at Washington University in St. Louis.
As of January 2006, the Washington University Genome Sequencing Center employs 230 people.
3. Go to the Chromat 1 web page and examine the entire sequence. Don’t bother trying to read the letters yet. Can you tell which end is the 5′ end?
From Chromat 1 alone, it is impossible to determine which end is 5′. However, we know that the shortest fragments, which end nearest the 5′ end of the sequenced strand, migrate fastest on the gel, so the 5′ end must appear first on the chromat. Furthermore, by convention, we always write DNA with the 5′ end of the top strand on the left and its 3′ end on the right.
4. Beginning at base 80, read 50 bases of the sequence and write down both strands of the DNA, with the top strand being the one on the chromat.
5′ atgct ctggc cacgg cactt gcgga tccca (30) TGATC TGTGC ACCTG CGATA (50) 3′
3′ tacga caccg gtgcc gtgaa cgcct agggt ACTAG ACACG TGGAC GCTAT 5′
5. Perform a BLASTn search of the DNA in Chromat 1, but use only the first 30 bases of the 50. What was your best match? Record the E-value (measures quality of BLAST hits) presented in the right column. Now BLASTn all 50 bases and compare the new results with the search that used only 30 bases. Explain what happened to the E-value and why. You can read Math Minute 1.1 to understand why the E-value changed for the two BLAST results.
October, 2005 with 30 bases; top hit with E-value of 0.56:
LOCUS XM_531892 1739 bp mRNA linear MAM 30-AUG-2005
DEFINITION PREDICTED: Canis familiaris similar to Lamin B1 (LOC474663), mRNA.
October, 2005 with 50 bases; top hit with E-value of 2e-09:
LOCUS AF154499 2432 bp mRNA linear PLN 12-JUL-1999
DEFINITION Thalassiosira weissflogii sexually induced protein 1 (Sig1) mRNA, complete cds.
The E-value decreased substantially because the 50-base query matched perfectly, and a perfect 50-base match is far less likely to arise by chance than the short match found with the 30-base query. Although a longer query slightly enlarges the search space, the expected number of chance hits falls exponentially as the alignment score rises.
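The effect of query length on the E-value can be sketched numerically. This is a minimal illustration, assuming the standard Karlin-Altschul relation E = K·m·n·e^(−λS) with the ungapped blastn parameters λ = 1.37 and K = 0.711 and a score of +1 per matching base. The function name, the database length, and the assumption that both queries match perfectly are hypothetical choices made for illustration, so the output is not expected to reproduce the E-values of 0.56 and 2e-09 recorded for the actual searches.

```python
import math

def expected_hits(raw_score, query_len, db_len, lam=1.37, k=0.711):
    """Karlin-Altschul estimate of chance hits: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Hypothetical database size in bases, chosen only for illustration.
DB_LEN = 15_000_000_000

# Assumption: a perfect match over the whole query at +1 per matching base,
# so the raw score S equals the query length.
e30 = expected_hits(raw_score=30, query_len=30, db_len=DB_LEN)
e50 = expected_hits(raw_score=50, query_len=50, db_len=DB_LEN)

print(f"30-base perfect match: E = {e30:.2e}")
print(f"50-base perfect match: E = {e50:.2e}")
print(f"fold decrease in E:    {e30 / e50:.2e}")
```

The search space m·n grows only linearly with query length, while the e^(−λS) term shrinks exponentially with the score, which is why the longer match yields the much smaller E-value.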
The statistics from the BLASTn report are shown:
Number of letters in database: 1,580,765,460
Number of sequences in database: 3,542,931
Lambda K H
1.37 0.711 1.31
Gapped
Lambda K H
1.37 0.711 1.31
Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
Number of Sequences: 3542931
Number of Hits to DB: 905748
Number of extensions: 43510
Number of successful extensions: 10650
Number of sequences better than 10: 2
Number of HSP’s better than 10 without gapping: 2
Number of HSP’s gapped: 10650
Number of HSP’s successfully gapped: 2
Number of extra gapped extensions for HSPs above 10: 10646
Length of query: 50
Length of database: 15599103720
Length adjustment: 20
Effective length of query: 30
Effective length of database: 15528245100
Effective search space: 465847353000
Effective search space used: 465847353000
A: 0
X1: 11 (21.8 bits)
X2: 15 (29.7 bits)
X3: 25 (49.6 bits)
S1: 12 (24.3 bits)
S2: 18 (36.2 bits)
6. Go to Ensembl (European version of NCBI) and click on “Information” under the “Docs and downloads” menu on the left side. Click on “Download data files.” Are the genome sequences submitted as one single file? What level of organization has been used to post the DNA sequences?
Data are posted under headings of DNA, cDNA, peptides, and then three different data formats (EMBL, Genbank, and MySQL). If you click on one of the FTP links, you will see they are bundled into a series of compressed files to expedite file transfer.
7. Do mammals or amphibians have larger genomes, as revealed on the less expensive web site? Why does the answer seem counterintuitive?
Amphibians have up to 100-fold more DNA (10^9–10^11 bp) compared to humans (3 × 10^9 bp). This seems counterintuitive, because we think of ourselves as more complex, but many amphibians have polyploid genomes, which makes sequencing those genomes very difficult and expensive.
8. Go to the Assembly Archive (chromat database) to view some chromats from an anthrax sequencing effort.
Click on Bacillus anthracis str. (strain) Kruger B. You will see a list of many assemblies; click on Contig ID 607. When the new window opens, you will see three frames. The top frame shows the coverage of this 15.3 kb segment of Anthrax genome. The middle frame shows the individual sequences used to assemble the 15.3 kb contig. What does the graph in the top frame summarize from the middle frame?
It shows the fold coverage for this particular portion of the genome. The higher the blue trace, the greater the fold coverage.
The bottom frame shows you the tiling path of the individual clones that span the entire 15.3 kb contig. The small red dashes indicate marker sequences used to help create the overlapping tiling path of clones. Mouse over the large blue segment that is centered on the 2 kb tick mark with trace ID ti494464459. When you see this trace ID number, click on it. A new window will open to show you the sequence, but click on the box next to “in color” and hit the “Show” button. This will produce a color graph that indicates the quality assessment score produced by PHRED. Where does the quality tend to be best? Scroll to the first two regions with quality scores between 0 and 20.
Quality is best in the middle region of the sequence. Lowest scores appear very early in the sequence.
Now change the menu from “FASTA” (plain text format) to “Trace” and hit the “Show” button. You should see the chromat for this sequencing read. Next to the “in color” button should be a new option for the applet size. Change “Normal” to “Big” and hit the “Show” button. Right above the chromat is a “confidence” option; turn that on. On the far left is a scroll bar; move it down and up to see more and less of the chromat, respectively. The confidence is indicated as bar graphs for each base, with higher-confidence bases having longer bars; bar colors match the base colors, not quality assessment values.
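PHRED quality scores such as those in the color graph are conventionally defined as Q = −10·log10(P), where P is the estimated probability that a base call is wrong. The sketch below shows the conversion in both directions; the function names are hypothetical and chosen only for illustration.

```python
import math

def phred_to_error_prob(q):
    """Convert a PHRED quality score Q to the estimated base-call error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to a PHRED quality score."""
    return -10 * math.log10(p)

# Print the approximate error rate implied by a few common score thresholds.
for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: about 1 error in {round(1 / p):,} bases")
```

Under this definition, a region scoring between 0 and 20 contains base calls with an estimated error probability of at least 1 in 100.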
Find the regions of low quality scores and determine why the scores were so low.
Low peak height and bad spacing are the most common reasons for poor PHRED scores.
Note that the chromat reports more bases than were shown in the FASTA version.
Bases with PHRED scores too low are not converted to FASTA quality data.
9. Go to the Human Genome Browser and locate section chr19:8,584,715–8,601,616 by typing it into the search window. Click on the large black box in the gap row and read how gaps are depicted. Click the “Back” button once on your browser, and scroll down below the image. Click on “hide all”, except modify these individual options: base position full; chromosome band dense; gap pack; Ensembl genes dense. Be sure to click the “Refresh” button at the top of the display options to implement your modifications; these settings will speed up subsequent navigation.
Black boxes indicate this portion has never been cloned for sequencing. A bridged gap indicates the gap will be filled in the near future. Gaps such as this one are resistant to cloning or sequencing, often due to highly repetitive DNA.
Click on the “base” button to the right of the 10X zoom-in button. This will show you the consensus sequence where known, and an x where there is no sequence information. Below the DNA sequence, all three reading frames are translated with red boxes marking stop codons and green boxes marking start codons. Zoom out 10X three times. Is this gap near a gene?
Yes, a gene composed of many exons is to the left of this gap.
Do you think this gap affected the nearest neighboring gene annotation?
It is possible the gene extends into the gap and we don’t know it.
Continue zooming out until you see a second gap. Now hit the button until you find a third gap. Continue to move through the third gap to define the extent of this gap. Which gap is bigger, the first one you looked at or this third one?
What chromosomal structure(s) are in the area of the bigger gap?
The third gap is much, much bigger. It spans the centromere.
10. Go to the Finishing web page and determine the order of DNA fragments needed to build the largest possible contig. How many gaps remain after you have created the largest possible contigs?
This sequence was taken from the yeast gene MAL13/YGR288W. These 13 fragments can be assembled into 3 contigs with two gaps. However, the last line is composed of overlapping sequence that can join the first contig. Ultimately, there are two contigs with one gap.
11. Imagine you’re a finisher working on the DNA you assembled in Discovery Question 10. How might you have isolated the gapped DNA if you knew the entire region of DNA was 20 kb long?
Extended PCR, or screening a genomic library with probes from the two edges.
12. Go to the genomic DNA #1 (gDNA1) web site, where you will see three pieces of DNA sequence. Copy and paste one of the sequences and then click on the “TestCode” button. One at a time, submit the three segments of gDNA to TestCode to find which one harbors the ORF.
Sequence #1 gives an intermediate score of uncertain coding possibility. Sequence #2 gives a low score. The third sequence comes from human alpha actin and scores very high.
13. Copy and paste the real ORF from Discovery Question 12 into the scramble web site to have a Perl script generate a scrambled version of the same DNA. Take the scrambled version and resubmit it to TestCode. Does the randomized version of the coding DNA look like coding DNA? Would you expect it to?
No, even though it contains exactly the same bases in a different order, the scrambled sequence does not appear to be coding DNA. This shows that certain DNA patterns can be recognized, such as codons. We would not expect random DNA to look like coding DNA when analyzed by TestCode.
14. Calculate the average percent nucleotide identity for the three COX gene regions from your BLAST2 alignment.
72.6% identity using XM_051899 and NM_000962.
15. Go back to BLAST2 Sequences and enter the two protein accession numbers for COX2, “NP_000954”, and COX1, “NP_000953”. Be sure to change the search from BLASTn to BLASTp. Verify that the top blank contains NP_000954, so COX2 will be the query in the resulting page.
a. What is the overall amino acid identity? Is this higher or lower than the overall nucleotide identity?
The overall amino acid identity is 64%. There is greater alignment over the entire protein than was seen in the full-length mRNA sequences.
b. Notice that a separate percentage is calculated for similarities (called “Positives”), which takes into account the similar structures of some amino acids. What is the percent similarity?
80% similarity. These are not exact matches but conservative substitutions. See MM 2.3 for details.
c. Which parts of the proteins appear to be poorly conserved? Look at the sequence alignment that uses the single amino acid code and find where one protein has several Xs in a row, to mark areas of low complexity (see Math Minute 2.3).
Most of the variation appears in the first half of the proteins, including the additional amino acids found in Cox-1.
d. Use your browser’s “Find” function to locate the amino acid sequence GAPFS. Serine (S) is the amino acid modified by aspirin. Is GAPFS in a region of high sequence identity or similarity?
(See pages 336–337 for details.)
16. Go to HGNC (Human Genome Nomenclature Committee) and perform a quick gene search for cyclooxygenase to see how many cyclooxygenase genes there are in the human genome. HGNC is a good quick way to perform a gene search with links to many other databases. However, compare the HGNC results with an OMIM search using the gene name COX3. Do you find any surprises?
HGNC shows 2 genes. OMIM shows 3 genes.
17. Mouse over the domain boxes to determine the number of different CDs from your search.
Don’t just count the number of boxes, but determine the types of domains revealed when you mouse over each box. Notice the E-values are provided when you mouse over each box.
There are four major domains in the dystrophin protein: calponin homology, zinc finger, spectrin, and a WW domain. Hyperlinks from the results page allow students to define each domain. The only surprise is the zinc finger, which we normally associate with transcription factors.
18. Click on the “Show” Domain Relatives button and see what hits you get. At what protein have you been looking?
We have been looking at dystrophin.
19. Go back and click on “gnl/CDD/7333”, to the left of “smart00291,ZnF_ZZ, . . .” Read the text at the top of your screen. Does this domain have an important function? Explain your answer.
It is unlikely that a cytoplasmic structural protein such as dystrophin also regulates transcription. However, there is evidence that muscular dystrophy has a signaling component, and it might be possible that the dystrophin gene produces truncated proteins that might be functional DNA binding proteins. Therefore, we should not conclude that dystrophin does or does not have a transcriptional function. The computer program points out conserved domains so we can be aware of possible functions that we might not have noticed without this information. See Chapter 10 for details.
20. Copy and paste this uncharacterized protein (as of spring 2005) amino acid sequence into web sites of your choosing to characterize the protein’s possible functions. Determine as much as you can about its structure and function.
Ezy2 in Chlamydomonas reinhardtii (unicellular green alga). From a 2002 PubMed link: Ezy2 is a candidate participant in the uniparental inheritance of chloroplast DNA. Kyte-Doolittle predicts it is an integral membrane protein.
21. How can a single protein yield more than one answer to the why, what, and where questions to describe its roles in cells?
Why: Some proteins might be involved in more than one large “objective.” For example, tubulin might be described as providing cytoskeletal structure or movement of flagella. Leptin might be even more complex since it seems to be involved in many processes.
What: It has been demonstrated that some ion channels are also kinases. The two biochemical activities are very distinct, so these proteins would need two answers to the question “What?”
Where: Hormone receptors that act as transcription factors are located in the cytoplasm and nucleus as they respond to the absence or presence of hormones, respectively.
22. Go to WormBase (C. elegans gene database), search for the gene “pmr-1”, and learn its biological process, molecular function, and cellular components (about half way down, next to Gene Ontology listing). Does pmr-1 have more than one biological process, molecular function, and cellular component?
There is conflicting information on this protein. It is called a Ca2+ pump. It also pumps Mn2+ but later this site says Mn2+ inhibits the pump. It seems to be located in the Golgi, but the biological process is uncertain. The main point of this question is to illustrate the gaps in our knowledge and to reiterate that there is much work to be done even after a genome is sequenced.
23. Search NCBI’s Human Map Viewer using the term “obesity”. You will get hits for every locus that has obesity associated with it. How many loci do you see? Are they clustered or distributed throughout the genome?
There are 14 hits in all. They appear to be scattered randomly. The only apparent clustering is due to the co-listing of mouse genes.
24. Click on the blue number “10” below the cartoon of chromosome 10. What gene did you identify?
OB10 is neither leptin nor its receptor.
It is not known what function this particular locus plays in the susceptibility to obesity.
*603188 OBESITY, SUSCEPTIBILITY TO, ON CHROMOSOME 10; OB10
Gene map locus 10p
TEXT
Epidemiologic studies suggest that 30 to 70% of the variation in body weight may be attributable to genetic factors. Hager et al. (1998) undertook a genomewide scan in affected sib pairs to identify chromosomal regions linked to obesity in a collection of French families. Model-free multipoint linkage analyses revealed evidence for linkage to a region on 10p (MLS 4.85). Two further loci on chromosomes 5q and 2p showed suggestive evidence for linkage of serum leptin levels in a genomewide context. The peak on chromosome 2 coincided with the region containing the proopiomelanocortin gene (POMC; 176830), a locus previously linked to leptin levels and fat mass in a Mexican-American population (Comuzzie et al., 1997) and shown to be mutated in obese humans (Krude et al., 1998). The findings suggested that there is a major gene on 10p implicated in the development of obesity and other loci influencing leptin levels.
25. Click on the “Maps & Options” button to modify the view. From the new window, you can choose from the list in the left window; your choices are displayed in the right window. Modify the display until only “Gene”, “Morbid/Disease”, and “Ideogram” are displayed. Click on “Morbid/Disease” and then on the “Make Master” button, followed by “Apply”. The ideogram on the far left shows how much of the chromosome you are viewing. You can zoom in or out as needed.
This database allows you to search for your favorite disease or condition and track down all this information. You could place an order for this DNA or amplify it yourself using PCR.
26. Go to Electronic PCR and enter this accession number, “M18533”, in the big open box to determine if there are any STSs in this sequence. What gene have you located, and how many STS markers are there? Click on one of the blue links and see how much information is there. Do you have all the information you need to amplify this STS? What else did you learn, other than sequences of the primers?
The dystrophin gene has 16 STSs. You know how big the PCR product should be, but you do not know the temperatures or magnesium concentrations.
27. BLASTn this mystery sequence and select “est_mouse” from the “Choose database” menu. What can you learn based on the hits you obtained? For example, what gene have you identified? Scroll down and see how many tissues are described in this search. Imagine you were studying obesity in mice; how might this help your efforts? (See Chapter 5 for details.)
A sequence with “similarity to mouse leptin” was identified in this search <http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=11520484&dopt=GenBank>. Among other information, we see this:
/organism=“Mus musculus”
/sex=“male”
/tissue_type=“salivary gland”
/dev_stage=“5 months”
You’d know at least one tissue where this gene is expressed.
28. Did the EST database provide you with more than just sequence identification information? With the completion of many genomes, is there any utility for the EST databases? Support your answer with specific examples.
Yes. The additional information would be useful if you wanted to know when and where a particular gene was expressed. This could save you a lot of time if you were interested in a particular proteome, for example. There is still a significant value in the EST database. Genomic sequence does not inform us where and when any given gene is expressed. Only data such as ESTs tell us how genes are regulated in a living organism.
In addition, ESTs allow us to discover sequence variations (e.g., SNPs) in the population as well as RNA splicing variants.
29. Go to the UniGene Statistics page to read the latest information on human ESTs. Based on this information alone, calculate the average number of ESTs for each human gene (assuming 23,000 genes).
As of 12/2006, there are 5,160,228 human ESTs. From this extensive collection, it would be impossible to calculate the number of genes, but we can calculate the average number of ESTs per gene: 5,160,228 / 23,000 ≈ 224 ESTs per gene. This may hint at the large number of alternatively spliced mRNAs as well as mRNAs with alternative 5′ ends due to alternative start sites.
30. What does GeneCards say about Nox1? Is it an NADPH oxidase? Go to the Human MapViewer for the human genome, enter “Nox1”, and hit “Find”. You will see a red dash next to the X chromosome. Click on the blue “X” under the ideogram and notice the other genes in the area. Next to the gene “NOX1”, click on the link “sv”. You should see a graphical version of this gene.
GeneCards, and its many mirrored sites, is a great way to learn all about any given gene or protein. On this page, we learn that Nox1 is both an NADPH oxidase and an ion channel. It also tells us there are three alternatively spliced forms called Nox1-long, Nox1-short, and Nox1-long variant. You can see the full description at this URL: <http://thr.cit.nih.gov/cgi-bin/cards/carddisp?NOX1>.
a. How many different mRNAs (listed as CDS for coding sequences) are produced by this one gene? The color code is just below the expanded view of this gene. Notice that only the first exon is shown in the sequence (as denoted by the red bracket in the cartoon above).
REFSEQ proteins (3 alternative transcripts): NP_008983.1 NP_039248.1 NP_039249.1
b. Use the navigation button icon to zoom out by clicking once on the “” sign. How many genes are in this region of the X chromosome? Do any other genes produce more than one mRNA?
Entrez Gene cytogenetic band: Xq22
Ensembl cytogenetic band: Xq22.1
Entrez shows 3 splice variants while Ensembl shows only 2.
31. Take a few minutes to draw a flowchart illustrating the steps you would take to annotate a newly sequenced genome (define the genes; describe each protein’s biological process, molecular function, and cellular component; summarize the major metabolic pathways the organism needs to survive and evolve).
1. Search for coding DNA. You could do this using methods such as TestCode, or genome alignment programs comparing two evolutionarily similar species. 2. See if you can BLAST coding DNA and get hits with known genes. Searching for ORFs and ESTs may simplify this process. Alternative splicing products may be identified by EST matches. 3. Look up Gene Ontology for complete descriptions of “functions”. 4. You might link out from GO to find metabolites at places such as KEGG in Japan.
32. Analyze the dystrophin gene with the Genome Browser. Enter the name “dystrophin” and click the “Submit” button. You will get a list of hits. Click on the first “dystrophin” option to see a graphic view of the human dystrophin gene. Use the 3X zoom-out button until you can see the entire gene. You may have to modify the view using the options below the display to answer the following questions.
a. Are there any STS markers in this gene?
Yes, many of them.
b. Look at the Gap and Coverage lines. Has the public HGP sequenced all of the chromosome in this region?
Many of the draft sequence gaps have been closed. There are still some in this arm of the X chromosome, but not many. The two major ones are at the telomere and centromere.
c. Change the “Coverage” option to “full” and then count how many BACs span the DMD gene. Gray BACs are draft-quality sequencing and black BACs are finished. Can you determine the minimum number of BACs required to span DMD based on coverage?
32 BACs span DMD.
Only 23 are required to span this region. The other 9 are redundant.
d. How many DMD mRNAs use more than one exon? What can you infer from the number of alternatively spliced mRNA?
There are over 35 mRNAs reported. At least 23 different mRNAs use more than one exon.
33. In the “position” box, enter “7p15.2” and then hit the “jump” button. At the bottom of the page, hide all features, except set to full “Known Genes” and then set to dense every species “Net” (e.g., Fugu Net) under “Comparative Genomics”. Then hit “refresh”. You are looking at 12 Hox genes, which are critical to body development. Center the Hox genes and zoom in to see the near universal conservation of DNA, especially the Hox exons.
a. Which species highlight the conserved exons the best, closely related species or more distant?
Fugu and chicken, the most distantly related species, highlight the exons best. The mammals have more intron conservation and thus the exons do not stand out as much.
b. Set repeat masker to “dense” and refresh this diagram. Do Hox genes contain a lot of repeats (black indicates repeat sequences) compared to portions outside the Hox genes? Do you think this is significant? Explain your answer.
At a low magnification, the repeat masker may look like a solid black bar. Encourage students to zoom in (use 1.5X button to slowly zoom in) and to navigate left or right in order to keep the Hox genes centered on their displays. When they zoom in, they will realize that the repeat masker shows repetitive DNA in the human genome except where the coding regions of the Hox genes are located. This is a good illustration of selection pressure maintaining the integrity of critical regions of the genome.
c. Set the SNPs (single nucleotide polymorphisms, which can be thought of as point mutations) option to dense. Do Hox genes have more or fewer SNPs than the surrounding area? (See Color Key.)
Do you think the Hox frequency of SNPs is significant? Explain your answer. For a comparison, click on the “Move ” button.
There are relatively few SNPs in the human Hox genes as you might expect for genes that are critical for embryonic development. There appears to be less tolerance of genomic variation in this region compared to neighboring regions. This question foreshadows issues addressed in Chapter 4.
In these three questions, we are working with only one strain of mice, not two.
34. If the Igf2 gene were deleted in the sperm, predict the phenotype in the offspring. What would the phenotype be if the egg’s Igf2 gene were deleted?
The male’s offspring might be smaller than normal. The female’s offspring would be normal sized.
35. If the maternally expressed gene Igf2r were deleted from eggs, predict the phenotype in the offspring. What would the phenotype be if the sperm lacked an Igf2r gene?
The female’s offspring would be smaller than normal while the male’s would be normal sized.
36. What would you predict for the offspring if the sperm’s Igf2 gene and the egg’s Igf2r gene were deleted?
We would expect them to be abnormally small, if they are viable at all.
37. Do you think methylation is an ancient mechanism or one limited to vertebrates? How could you answer this question?
Students may answer either way with this one. Encourage them to offer methods to answer this question. The next Discovery Question will help them figure out the answer.
38. Go to GenBank and search “methyltransferase”. Can you find any DNA methyltransferases in organisms other than vertebrates?
Methyltransferases are found in archaea, bacteria, plants. . . . It is a very ancient mechanism for regulating DNA.
39. Jackson-Grusby et al. found that most of the genes with altered expression increased their expression levels when the methyltransferase was deleted, as you might expect since methylation normally silences genes. Explain how some genes could be repressed when hypomethylated.
Loss of the methyltransferase might result in the increased production of transcriptional repressors and thus some genes might be repressed even though they were not methylated.
2.2 What Have We Learned from Unicellular Genomes?
40. Perform an NCBI Gene search for PPA1880 (use the pull-down menu on the NCBI main page to select “protein”). Click on the first link, then click on the “CDS” link to see the DNA. Copy the DNA sequence into the GC calculator to determine the %GC. P. acnes has an average of 60% GC and human has an average of 41%. Which genome does PPA1880 more closely match?
At 57% GC, this sequence looks more like P. acnes than human.
41. Find AAH14236.1 from NCBI’s protein pulldown menu. Copy the DNA sequence and determine the %GC. Does this sequence look more human-like or P. acnes-like? What interesting annotation did you uncover? How could this cDNA get into a human cDNA library?
With a GC content of 56%, this “human” cDNA looks like a P. acnes coding sequence. The NCBI hit (September, 2005) says, “This record was removed because the sequence was determined to be an artifact. Please contact [email protected] for further details.” It was isolated from a human cDNA library that was probably contaminated with bacteria from the technician’s skin.
42. Now determine the GC content for one rRNA gene and one tRNA gene. How do they compare to the genome average of 60% GC?
16S and 23S rRNA were 56%, close to the overall GC content. Percent GC for tRNAs varied from low 60s to low 70s.
43. Go to the P. acnes genome view, enter 1 into the “Start from” box, and hit “Go”. Notice that the first gene is DnaA (accession number YP_054724). Click the blue arrow pointing to the left. Do you notice anything peculiar about this region upstream of DnaA? Compare this region to any other by clicking somewhere on the genome map to see any other region.
The region prior to DnaA was devoid of any annotated genes.
This is in stark contrast to the rest of the genome. However, this is an artifact of the display. If you click on this region of the map, you will not find any portion of the genome that lacks genes.
44. Find the Conserved Domain of LPXTG. What does this domain help proteins do?
From cd00004: “Gram-positive bacteria, cleaves surface proteins at the LPXTG motif between Thr and Gly and catalyzes the formation of an amide bond between the carboxyl group of Thr and the amino group of cell-wall crossbridges. In two different classes of sortases the N-terminus either functions as both a signal peptide for secretion and a stop-transfer signal for membrane anchoring, or it contains a signal peptide only and the C-terminus serves as a membrane anchor.”
45. Perform an NCBI Medical Subject Heading (MeSH) search for “CAMP Factor”. You should see a hit called “CAMP protein, Streptococcus [Substance Name]”. On the far right side, click on the “link” link and choose “NLM [National Library of Medicine] MeSH Browser”. What can CAMP factors do to our blood cells?
It can induce hemolysis but is used “for rapid identification of group B streptococci strains.”
46. Search Google Scholar for autoinducer-2. Do you see any evidence that autoinducer-2 plays a role in communication? Is this protein expressed in many species?
From one abstract: “While the discovery of a diffusible Escherichia coli signaling pheromone, termed autoinducer 2 (AI-2), has been made along with several quorum sensing genes . . . ” Several species contain an AI-2 gene.
47. Search for “biofilm” in OMIM and click on the one hit. Perform a find function with your browser for “biofilm” and see what this protein has to do with preventing biofilm formation.
“Lactoferrin, a ubiquitous and abundant constituent of human external secretions, blocks biofilm development by the opportunistic pathogen Pseudomonas aeruginosa.
This occurs at lactoferrin concentrations below those that kill or prevent growth. By chelating iron, lactoferrin stimulates twitching, a specialized form of surface motility, causing the bacteria to wander across the surface instead of forming cell clusters and biofilms.”

48. Conduct PubMed and Google searches for “Blue Light Acne” and see what you can learn about a novel method to combat acne. Do you think genome sequences can help us understand this method better? Explain your answer.
It appears that P. acnes is sensitive to 420 nm light, which can lead to its death and thus reduce acne. This is particularly good for pregnant women who cannot take certain antibiotics. Genome sequencing provided the clues that suggested and explained this clinical therapy.

Bonus Material: A study of the tetanus genome is available on the book’s web site.

49. Go to the Microbial List web page and click on “Bacteroides thetaiotaomicron VPI5482”. Then click on the link “4778” to produce a list of all the proteins in the proteome, ordered the way the genes appear on the chromosome. Find “mannosidase” as many times as you can (you can stop when you get tired). This basic search illustrates the high level of polysaccharide utilization enzymes in the proteome—and you only searched for a single sugar.
Mannosidase appears 18 times on this page.

50. Go back to the list of 4,778 proteins and do a find function for “one-component” (the sensor and signal-transduction proteins are fused). Each time you find one, look to see if a sugar-metabolizing protein is nearby. Perhaps the placement of a sensor gene and a metabolizing gene is not random. Propose a reason why evolution might have selected for these two types of genes to be neighbors.
One-component appears 22 times in this proteome annotation.
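The browser "find" steps in Questions 49 and 50 can also be sketched in code once the protein list has been saved locally. The example below is only an illustration under stated assumptions: it assumes the list is available as (locus, product) pairs in chromosome order, the locus tags and product names are made-up placeholders rather than real B. thetaiotaomicron annotations, and `count_keyword` and `neighbors_with` are hypothetical helper names.

```python
def count_keyword(proteins, keyword):
    """Count entries whose product description contains keyword."""
    keyword = keyword.lower()
    return sum(1 for _, product in proteins if keyword in product.lower())


def neighbors_with(proteins, anchor_kw, neighbor_kws, distance=1):
    """Return (anchor_locus, neighbor_locus) pairs within `distance` genes."""
    anchor_kw = anchor_kw.lower()
    hits = []
    for i, (locus, product) in enumerate(proteins):
        if anchor_kw not in product.lower():
            continue
        # Look at the genes immediately before and after the anchor gene.
        lo, hi = max(0, i - distance), min(len(proteins), i + distance + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            n_locus, n_product = proteins[j]
            if any(kw.lower() in n_product.lower() for kw in neighbor_kws):
                hits.append((locus, n_locus))
    return hits


# Placeholder data in chromosome order (hypothetical, for illustration only).
toy_proteome = [
    ("X0001", "alpha-1,2-mannosidase"),
    ("X0002", "one-component system sensor protein"),
    ("X0003", "hypothetical protein"),
    ("X0004", "beta-mannosidase"),
    ("X0005", "one-component system sensor protein"),
    ("X0006", "ribosomal protein L2"),
]

print(count_keyword(toy_proteome, "mannosidase"))
print(neighbors_with(toy_proteome, "one-component", ["mannosidase", "xylanase"]))
```

Widening `distance` relaxes what counts as "nearby"; the choice of 1 (immediately adjacent genes) is an assumption, not something the question specifies.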
Examples of sugar-metabolizing genes adjacent to one-component genes include: BT2629 putative alpha-1,2-mannosidase; BT2898 endo-1,4-beta-xylanase D precursor; BT2924 acetyl-CoA synthetase; BT4179 polysaccharide deacetylase. If the sensor signals for sugar regulation, then there might have been selection pressure to keep these two genes nearby, minimizing recombination that would separate them and reduce fitness.

51. Look at mannose metabolism, then find Bacteroides in the pull-down menu, and click on the “Go” button. (The fastest way to do this is to open the menu and start typing Bacteroides fairly quickly. The species will be highlighted as you spell out the genus.) Next, click on the oval labeled “Galactose metabolism”. Can you verify that our symbiont is well suited to help us digest sugars?
The green boxes indicate which enzymes are encoded in the displayed pathway. Both mannose (20 enzymes) and galactose (22 enzymes) metabolism are well represented in the B. thetaiotaomicron genome.

52. Search Scientific American for “pylori” to read a surprising proposal published in 2005 that even the ulcer-causing bacterium H. pylori might produce beneficial consequences for living in our stomachs. Are we harming ourselves when we take antibiotics unnecessarily?
“As H. pylori has retreated, the rates of peptic ulcers and stomach cancer have dropped. But at the same time, diseases of the esophagus—including acid reflux disease and a particularly deadly type of esophageal cancer—have increased dramatically, and a wide body of evidence indicates that the rise of these illnesses is also related to the disappearance of H. pylori.” From the February 2005 article written by Martin J. Blaser.

53. Go to the TIGR CMR web site, then choose “Align Whole Genomes”. Choose from the two pull-down menus H. influenzae and M. genitalium with a minimum alignment of 20 nt (this display looks best if viewed in Netscape or Firefox browsers).
Do these two genomes look like they evolved from a common ancestor in the recent past? You can increase the sensitivity by changing the minimum alignment to 15 nt; see if this helps.
The figure above displays a plot of maximally unique matching subsequences (MUMs) between genomes as identified by the MUMmer program. The minimum alignment length shown here is 20 bp. Alignments with the same orientation are shown in red and alignments with opposite orientations are shown in green. Genes are shown along the axis for each genome and are colored by role category.
H. influenzae is on the x-axis and M. genitalium is on the y-axis. There are no obvious patterns that indicate they have a recent common ancestor. With a 15 bp window, the degree of sequence conservation is more apparent in the figure above, but gene order does not appear to be conserved.

54. Go to the M. genitalium genome page and choose the “Searches” option from the top menu; then choose “Name”, search for the genes MG064 and MG101 (see Figure 2.15), and follow the “GenBank” link to retrieve protein sequences. Does BLAST provide any insights now?
AAC71282/MG064: Conserved domain shows it has two permease domains that suggest it may help transport molecules across the cell membrane; COG0577: BLASTp results indicate this protein is an “ABC-type antimicrobial peptide transport system, permease component [Mycoplasma genitalium G-37]”, indicating additional information has been added to the database in the last 10 years.
AAC71319/MG101: Conserved domain indicates helix_turn_helix gluconate operon transcriptional repressor; BLASTp reports “transcription regulator [Mesoplasma florum L1]”.

55. Go to DEG and search for entry number 1169 (DEG10060038 or clpB) from M. genitalium. Does the name of this gene lead you to believe it might be an essential gene?
Copy the ORF sequence from the DEG link and perform a BLASTx search (submit DNA sequence and search for protein matches) against all DEG entries using DEG’s BLASTx program. Then perform a BLASTx search at NCBI. Which search is more informative and why?
ATP-dependent Clp protease, ATPase subunit (clpB): as an “ATPase subunit”, this protein appears to be the energy-yielding portion of a larger protein structure that acts as a protease.
The NCBI BLASTx gave many hits with E-values of 0.0, all of which were clear orthologs. The DEG BLASTx was not working on a reliable basis.

56. Go to the M. genitalium Genome Browser at Genome.Net. Click on “KEGG” (metabolism database) at the bottom to see an interactive genome map. At the bottom is a genome alignment tool that will show you genes in your query species aligned with the species of your choice (select E. coli, then click on “Exec”[ute]). You will see colored bands for orthologs of the two species and their location in the query genome. Mouse over the genome, and the green bar will show you on which portion it will zoom; click on the section with the most color (red). On the new page, click on the “ORF Color” button to understand the colors, and “View Genome map” to see the position of the conserved genes. What category of genes are in this area? Are the genes clustered or widely distributed?
I have selected the region with the most red (see above). Many genes associated with translation are clustered together in this region.

57. Perform an NCBI Entrez search for Nanoarchaeum equitans. Click on the “MeSH” link at the bottom to learn about this species. Go back to the full Entrez results, click on the “Genome” link, and follow this until you see the circular map of the smallest cellular genome in the world (as of September 2005). Click on “GenePlot” to align N. equitans and M.
genitalium by changing the default species for the pulldown menu on the right, then click on “compare selected pair”. How many protein-coding genes are conserved between these two smallest genomes (number of “bets” [best hits] listed below the 2D alignment maps)? You can navigate around by clicking and changing the zoom button. Identify the gene they have in common that is located nearest to their 0–0 origin of the graph.
MeSH results: Nanoarchaeota—A phylum of hyperthermophilic ARCHAEA found in diverse environments. Year introduced: 2005.
73 genes are conserved. The gene pair closest to the origin is MG035 and NEQ102. According to the MG035 link, this gene encodes histidyl-tRNA synthetase.

58. In Discovery Question 57, you determined that both species have histidyl-tRNA synthetase, but if you studied the whole genome of N. equitans, you would discover it lacks a tRNAHis for the codon GUG. Go to the tRNAHis web page and BLASTn each of the three sequences. Note the first hits when you submit each half and then the difference when you submit the full sequence. Do any of these BLAST results indicate they might be a tRNA gene? Why doesn’t the full sequence give a better score than just half of the tRNA gene?
>5′ tRNAHis (anti-codon underlined)
TTGCCCCCGTAGCTTAGTGGCAGAGCGCCGGGCTGTGG
The only significant hit is this portion of the N. equitans genome.
>3′ tRNAHis
ACCCGGAGGTCCCGGGTTCGAATCCCGGCGGGGGCC
There are many significant hits, but none of them are very informative.
>Combined tRNA
TTGCCCCCGTAGCTTAGTGGCAGAGCGCCGGGCTGTGGACCCGGAGGTCCCGGGTTCGAATCCCGGCGGGGGCC
Still, it is hard to find a biologically significant match because the annotation is lacking. However, if you perform a find function for “tRNAArg”, you will see the best ortholog on this page is a tRNA gene. Even so, the database still is not yielding very good results due to poor annotation of genome sequences for RNA-encoding genes.

59.
Perform a Genome NCBI search for “Mimivirus” and click on the link. Change the view of the genome to show only tRNAs and hit “Refresh”. How many are in this genome?
Mimivirus has 6 tRNA genes as shown above.

60. In the left frame there is a link called “Protein view”. Click on it to get a list of Mimivirus protein-coding genes as they appear on the genome. In the “Start from” field, enter these nucleotide numbers to find three genes and then click on “Go”: (1) 234000, R194; (2) 267000, L221; and (3) 633000, R480 (L and R refer to genes pointing to the left or right, respectively). Click on each of the three boxes to find out the family of proteins. Which one looks most like a eukaryotic protein, based on the COG sequence similarity results you get with each click of the boxes?
R194: COG3569, Topoisomerase IB
L221: COG0550, Topoisomerase IA
R480: DNA topoisomerase II [Dictyostelium discoideum]
Of these three topoisomerase genes, R480 is the most similar to eukaryotic genes.

61. If you were going to construct a minimum genome, would you choose a virus or a bacterium? Explain why.
You might start with mimivirus and add to it until you had a self-replicating organism. This might be the easiest way to go. Alternatively, you could try to reduce the size of Nanoarchaeum equitans as a way to begin with a self-replicating organism and get rid of “extraneous genes”. This question has no right answer, but was designed to engage students in a thoughtful way with a topic that is currently being investigated.

62. Go to the Oak Ridge National Laboratory (ORNL) Microbial Genome web site, choose the Prochlorococcus marinus sp. MED4 genome from the “Finished Eubacteria” pull-down menu, and note the %GC. Compare this percentage with the genomes of P. marinus MIT9313 and Synechococcus sp. WH8102. Do the two P. marinus ecotypes look like two genomes of the same species to you?
P. marinus MED4 31.64% GC
P. marinus MIT9313 52.17% GC
Synechococcus sp.
WH8102 60.2% GC
It is impossible to tell from %GC alone, but it is striking that the two Prochlorococcus ecotypes differ more from each other in GC content than MIT9313 differs from Synechococcus.

63. Go back to the P. marinus sp. MED4 genome and click on the “View genome in WebArtemis” link. It will take a while to load this Java applet, but it is worth the wait. Warning: Do not close the web page that simply says “loading entry—done” or you will lose the applet. You will see the full, annotated genome in three frames. The top and middle frames are duplicates, but the vertical slider bars allow you to adjust magnification, with the default showing the top frame in medium scale and the middle frame in high-resolution scale. Thin vertical lines in the top frames represent stop codons. The bottom frame is the complete list of every gene and annotated feature (e.g., transmembrane domain, signal peptide, etc.). A “c” in the gene list indicates the “Crick” strand of DNA.
Your browser should display a few new menus that add extensively to the Artemis viewing and analysis. Under the “View” menu, choose “Show CDS Genes and Products” and under the “GoTo” menu, choose “Navigator . . .”. Check the “Goto Feature with This Qualifier Value:”, and search for “cobA” (see Figure 2.19), telling the navigator to pay attention to case. Double-click on the highlighted gene in the list of “Genes and Products”, and you will see the ORF displayed in the main graphic window, complete with DNA and protein sequences. What does cobA do and is it on the Crick or Watson strand?
CobA is in the first reading frame of the bottom strand (Crick strand). CobA is a putative uroporphyrin-III C-methyltransferase.
Now click on the main graphic window to make sure it is the active one (a Java requirement). Under the “Graph” menu, choose “GC content (%)” to see how the GC content shifts with different genes.
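A GC content graph of this kind can be approximated offline with a sliding window. The sketch below is a minimal illustration and is not Artemis's actual algorithm; the function names `gc_percent` and `sliding_gc`, the window and step sizes, and the toy sequence are all assumptions chosen for the example.

```python
def gc_percent(seq):
    """Return the percentage of G and C bases in seq (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


def sliding_gc(seq, window=20, step=10):
    """Yield (start, %GC) for each full window along seq."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, gc_percent(seq[start:start + window])


# Toy sequence (hypothetical): an AT-rich stretch followed by a GC-rich one.
toy = "ATATTTAAATATTAAATTTA" + "GCGGCCGCGGGCCCGCGGCC"
for start, pct in sliding_gc(toy):
    print(start, round(pct, 1))
```

With a real genome, a larger window smooths the curve, so short features such as tRNA genes stand out best with a window close to their own length.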
In the list of features in the bottom frame, scroll down until you see a tRNA gene (light green box). Double-click on the box and look at the %GC. Try a few more genes and describe the pattern you observe. Artemis is very powerful, so feel free to explore and make new discoveries on your own.
For the two tRNA genes shown above, the %GC goes above the overall average, which is consistent with RNA-coding genes in species with overall low GC content.

64. Go back to the ORNL Microbial Genome page, select Synechococcus and note its %GC again. Now launch the Artemis viewer (wait . . . ) and view the following regions with attention to the GC graph and how many genes are in these AT-rich sections: (1) 427233; (2) 622199; (3) 912098; (4) 2379778. Did you notice one of the world’s longest prokaryote ORFs in one of these sections? Compare this long ORF to the average gene on the Synechococcus statistics page. You have just examined four areas with different codon bias and GC content. Propose an explanation for these four apparent anomalies.
Overall 60.2% GC. Average gene length is 871 bases. It helps to zoom the top frame out some to see the landscape of these AT-rich regions.
18 genes in the region beginning with base 427233. Note only one gene on the bottom strand.
7 genes in the region beginning with base 622199 (see below).
9 genes in the region beginning with base 912098, including a very long gene on the top strand (over 32 kb long).
13 genes in the region beginning with base 2379778.
One possibility is that these regions are sites of horizontal transfer of DNA from the genome of an AT-rich species.

65. Perform a PubMed search for the term “selenocysteine” and find out what this is. Does it matter functionally whether a protein incorporates a cysteine or a selenocysteine?
An August 2005 abstract by J. Kohrle summarizes one negative consequence of failure to incorporate the modified cysteine amino acid.
“Limited or inadequate supply of both trace elements, iodine and selenium, leads to complex rearrangements of thyroid hormone metabolism enabling adaptation to unfavorable conditions.”

66. Search Google for “isoprenoid Wiki” and select the link for Wikipedia to read what isoprenoids are. Explain why loss of the apicoplast would be lethal given it’s the source of isoprenoids.
“Isoprene is formed naturally in plants and animals and is generally the most common hydrocarbon found in the human body. . . . Also derived from isoprene are phytol, retinol (vitamin A), tocopherol (vitamin E), dolichols, and squalene. Heme A has an isoprenoid tail.” It is a substrate for the production of many vital metabolites.

67. Go to KEGG Pathway web site and click on the “ATP synthesis” link to see a model of ATP synthase. Change the pull-down menu from “Reference pathway” to “Plasmodium falciparum” located near the top of this long list of species, then click on “Go”. Does Plasmodium have all the parts necessary to synthesize ATP from an H+-ion gradient? Explain your answer.
No, Plasmodium lacks several of the subunits for ATP synthesis, as indicated by the white boxes in the figure below for the eukaryote F0 and F1 portions of the ATP synthase.

68. Return to the KEGG Pathway, choose “glycolysis”, and see which enzymes Plasmodium has. Follow the pathway from β-D-Glucose to pyruvate and see if any steps are missing. Compare Plasmodium to Saccharomyces cerevisiae (baker’s yeast) to see which one has the more robust metabolic capacity. Finally, look at Aminoacyl-tRNA biosynthesis on the list of maps and see if Plasmodium is missing any enzymes needed to synthesize tRNAs coupled with their amino acids (aminoacyl-tRNA). Would you predict that a parasite might depend on the host for any of these enzymes? Explain your prediction and then test it by searching the database.
Plasmodium has all the enzymes required for converting glucose to pyruvate. Yeast has more side reaction enzymes, but glycolysis is essentially identical. Plasmodium has all the aminoacyl-tRNA synthetase enzymes. This would be expected since the intracellular parasite inside a genetically silent RBC host could not obtain charged tRNAs from its host.

69. Go to PlasmoDB and view bases 400,000–450,000 on chromosome 1. Below the busy graphics, choose to hide all options except “%AT”, “BLASTx”, “Genefinder”, and “Pf Annotation”, all of which should be set to “show one line”. Hit the “Update” button. Do the BLASTx (DNA query against protein database) hits align with the annotated genes? Did the predictive software Genefinder identify every exon correctly?
There are some BLASTx hits that did not align with the annotations, but Genefinder did align closely with the annotated genes. However, the annotation and Genefinder differ on some of the exons and whether some exons were in the same gene or two separate genes. Mousing over the exons produces text that explains some of the annotations.
Now go back to PlasmoDB and choose to view the mitochondrial genome at the bottom of the page. Change the display so that all RNAs and genes are displayed as “show—expanded”. How many genes and how many RNAs are encoded on this organellar genome? Explain why the two numbers are different.
The mitochondrion encodes 19 RNAs and 3 proteins. The RNAs are ribosomal RNAs while the three proteins are involved in energy production.

70. Go to the Saccharomyces Genome Database (SGD) web site and perform a Quick Search for maltose-metabolizing genes Mal1, Mal2, Mal3, Mal4, Mal6, Mal10, Mal12, and Mal13. Determine which of these genes are true paralogs or phenotypes with uncertain genomic information. Which gene or genes have the most detailed information?
The answer to this question is more complex than it first appears.
Mal1: not in systematic sequence of S288C (the reference genome strain)
Mal2: not in systematic sequence of S288C
Mal3: not in systematic sequence of S288C
Mal4: not in systematic sequence of S288C
Mal6: not in systematic sequence of S288C
Mal10: not a gene name recognized by SGD
Mal12: YGR292W, this is a valid, annotated gene. Physically interacts with Mal32, with which it shares 100% amino acid identity. This appears to be a duplicated paralog.
Mal13: YGR288W. This appears to be a paralog of YBR297W (Mal33), but they have only 65% amino acid identity.
It is interesting to note Mal12 and Mal13 are near each other on chromosome VII, but they do not appear to be paralogs. An SGD curator explained the “not in systematic sequence of S288C” response as follows: “SGD is based on the genomic sequence of strain S288C, which has been the “official” source for the Yeast Genome Sequencing project. The MAL genes are highly variable among yeast strains and they happen to be absent from the genome of that particular strain. But these genes do exists in many other laboratory strains of S. cerevisiae and people do study them. That’s why we at SGD have Locus Pages for them and we do collect whatever literature data we come across.”

71. At the bottom of the Mal13 page, click on the “MIPS” (German genomics database) link to see a different source; click “Protein Info” to see details about Mal13p (p for protein). Are the two databases identical in content, or do they present different information?
MIPS results for Mal13p show more protein information on the results page, while SGD shows less information but many more links to this type of information. Also, SGD shows some expression information but MIPS does not.

72. Go to SGD’s Advanced Search to get an up-to-date count on the number of ORFs, ncRNAs (non-protein-coding), pseudogenes, rRNAs, and tRNAs. The search takes a couple of minutes.
As of October 2005, the total is 6,946 ORFs, ncRNAs, pseudogenes, rRNAs, and tRNAs.

73.
Compare the yeast gene Sir2 in all the Model Organisms and determine if this gene is widely conserved. Compare this result with Mfa1. Why might you get such different results with two genes?
Sir2 is conserved in insects, worms, plants and humans; it is a NAD+-dependent histone deacetylase. Mfa1 is found only in yeast, but it is a mating factor gene and thus would be expected to be restricted to yeast.

74. Go to the yeast metabolism map; click on the citric acid cycle, then the “More detail” button until you cannot zoom in any more. Move around the circle until you see the two isocitrate dehydrogenase genes (Idh1 and Idh2). Mouse over the enzymes to see the chemical reaction, then click on the EC number 1.1.1.41. On the resulting page, go down and click on Idh1 and Idh2 to find their chromosomal locations. Are these redundant genes located next to each other in the genome?
Idh1 is near base 557,000 on chromosome XIV, while Idh2 is located near base 579,000 on chromosome XV. They are not near each other.

2.3 What Have We Learned from Metazoan Genomes?

75. Enter the BruinFly database created through the original research of many undergraduates at UCLA. First, search for the term “misshapen”. Click on the link in the first column and view the eyes. Read the description and then zoom in by clicking on the images. Compare the eye phenotype of misshapen to the patched eye phenotype. Click on the name “patched” to see information about this gene from FlyBase. Go back and click on the “P-element insertion site in the genome” to see where the transposable element landed to cause the patched phenotype. Is the insertion in a coding or noncoding portion of this gene? Explain how this insertion could lead to a mutant phenotype. If you look at other genes, notice some of the quirky names used by fly biologists (e.g., Ken and Barbie, Sunday driver, and deadpan).
Both patched and misshapen are involved in signal transduction pathways. However, the patched phenotype is either pupal lethal or cell lethal in the eye. The P-element inserted into the first intron of patched and thus probably disrupts normal splicing to cause the phenotype. Because the mutation is outside exons, this may explain why some flies die but others do not.

76. Search Entrez for the largest fly gene, called Kakapo, then click on the “gene” database option. Notice the polytene band location on the results page, and then click on the link. What name was given to this gene based on its mutant phenotype? How many different mRNAs are produced from this one gene? While on the gene page, click on the “link” link in the top right corner and choose the map viewer option. You should see the gene highlighted with its location shown on the drawing of the polytene chromosome on the left side. Notice how long this gene is. Click on the “hm” (homology) link to the right of the gene name. Is this gene found in many different species?
Chromosome: 2R; Location: 50C6-50C12. Also called short stop: Short Stop (abbreviated Shot) provides an essential link between F-actin and microtubules during axon extension. Shot associates with the cytoplasmic faces of the basal hemiadherens junction and with the EB1/APC1 complex, and mediates tendon stress resistance by the organization of a compact microtubule network at the muscle-tendon junction. 6 different mRNAs are produced from this one gene. The gene is over 17 kb long and is conserved in “bilateria” such as mammals, birds, insects, and C. elegans.

77. Go to Ensembl and click on the fruit fly button to access the fly database. Enter “P450” in the top text box preceded by the word “with” and click on the top “Look up” button. Only a few hits will be displayed out of a large number possible; click CG10093, which may be the first one listed. The gene Cyp313a3 should be displayed with other genes nearby.
Do you see any that may be clustered paralogs? Scroll down to determine how many exons are used in the mRNA as shown in a diagram.
P450 is alternatively spliced to produce 4 different versions. Nearby paralogs include: Cyp313a2 (probable cytochrome P450 313a2) and Cyp313a5 (probable cytochrome P450 313a5).

78. Go to BDGP and click on the “Expression Patterns” link to see where a particular mRNA is produced. Search for bicoid (abbreviated bcd) and follow the links until you see a bar graph and images of the blue-stained mRNA. When and where is bcd transcribed? Compare bcd expression pattern to the pattern for Mkp3 to see how differently some genes are transcribed.
Bicoid is transcribed mostly during the first 3 hours (see below) and the mRNA is localized to the anterior 5% of the developing embryo. Furthermore, it appears there are 4 different mRNA splice variants.
Mkp3 was transcribed at all time points during development, but at lower amounts. However, its localization is much different from bcd (above). Notice the change in embryo localization for mRNA in only 2 hours.

79. Search OMIM for “transferrin” to see what this human protein does. On the page, do a find function for the word “Alzheimer” to see one possible medical role for this gene. Now perform a search on FlyBase for “Tsf1” to see if transferrin could be studied in the fly as a possible model for its influence on human Alzheimer’s disease.
From OMIM: A C-to-T substitution at codon 570 replaced proline (in C1) with serine (in C2). The results showed that each of the 2 variants (second variant is hemochromatosis gene [HFE] missense mutation C282Y) was associated with an increased risk of AD only in the presence of the other. Neither allele alone had any effect. Furthermore, carriers of these 2 alleles plus APOE4 (see 107741) were at still higher risk of AD: of the 14 carriers of the 3 variants identified in this study, 12 had AD and 2 had mild cognitive impairment.
From FlyBase: Flies do have a transferrin gene, but it is most similar to the human lactotransferrin gene with 27% amino acid identity and 41% amino acid conservation.

80. Go to the IRGSP web page and click on the status tab to see where the project is currently. “PLN” means that the finished sequence has been submitted to the PLaNt database. Click on the finished bar graph for chromosome 5 to see a clone-by-clone list of all the DNA sequenced for this chromosome. Click on the P0668H12 clone link called “INE” (for INtegrated rice genome Explorer) to see an interactive version of the rice genome (this launches a Java applet, so be patient). Move down to about 5.5 cM on the chromosome to locate the marker “S14158”; mouse over this text to see how many different maps it is on, or not on. Click on the text to launch a new window containing information about this marker. How were these data used by the sequencers during the assembly phase of the project?
S14158 is not on the EST map but is on the physical and PAC/BAC maps. STS markers such as these can be used to find redundant overlaps and contigs, which would allow the investigators to verify they had assembled two overlapping fragments if they both contained the same STS, which is known to be unique in the genome.

81. On the clone-by-clone list page, click on “RiceGAAS:Rice Genome Automated Annotation System” for the same P0668H12 clone you explored by INE. This view gives you a very different insight into the same DNA. For example, does this portion of chromosome 5 contain any tRNA genes? Does the repeat masker identify highly repetitive DNA in genes or outside of genes? Many genes have been predicted by the various computer programs, but how many of these failed to yield BLASTx, cDNA, or EST hits and therefore could not be verified by biological evidence? You can change the view by altering the default preferences in the bottom frame.
If you find a segment you really like, you can select the “MAP Download” link and print out a copy to hang on your wall.
No, P0668H12 does not contain any tRNA genes. Occasionally, the repeat masker finds repetitive DNA within the rice cDNA segments, but most are outside the exons. It appears that BLASTx missed more than half of the annotated genes. cDNA or EST identified nearly all of the annotated genes missed by BLASTx.

82. Go to the Rice Annotation Database (RAD) and click on the link to gene length distribution; then choose, in this order, chromosomes 1, 4, 10, 3. What characteristic changes the most? Does this characteristic correlate with chromosome length? Return to the RAD main page and determine the source of the trend you detected for these four chromosomes by studying the ideogram of the chromosomes.
There does not appear to be a correlation of gene size and chromosome length. However, the percentage of finished sequence is correlated with gene size, which probably indicates genes have been mistakenly truncated due to incomplete coverage (as of 10/2005). Blue is finished and annotated DNA while black is non-finished DNA.

83. Go to the Chromosome 8 link and you will get basically a blank page. Click on one of the black areas in the ideogram at the top, then scroll to the right and left until you see some content. What area have you landed in, based on what you do not see and the physical location within the chromosome? Explain why you did not see anything when you first explored the chromosome.
We have landed on a gap and because gaps have no physical link to DNA, changing the scale from 50 kbp to 2 Mbp does not change your view. Because it is near the center, this may be the centromere, which is often difficult to clone and sequence.

84.
Go to BGI’s Rice Information System (RIS), click on the “ComView” tab at the top, and then click on the “Refresh” button on the right side if the settings indicate the base organism is 9311 (indica), chromosome 1, the first 1 Mb. Explore the first 5 Mb of chromosomal synteny in jumps of 1,000,000 using the windows at the bottom of the page. Which section has the lowest level of synteny? Between 9311 (indica) bases 2,000,000 and 3,000,000, find cDNA OsJRFA 107843 and click on the japonica link. How many SNPs (single nucleotide polymorphisms) did you find? Click on the Mapviewer link to the right of one SNP’s information (you may have to click on the refresh button to see the full display). Can you determine if this SNP alters the encoded protein sequence? Explain what you see and what the limitations are with this visualization of data.
The first Mb has good synteny; the second shows compression in the japonica genome with one cDNA out of synteny; the third shows more genes out of synteny while the fourth is collinear; the fifth Mb is mostly syntenic but there are a few cDNA exceptions. OsJRFA 107843 has 3 SNPs, but the mapview does not display with enough clarity to tell if the base view is within coding regions or not. You can click on the SNP bar graphs to see the largest variations, but indels are also included in this display. This database is rich with information, but the display impedes your ability to determine the effect of variation on the encoded proteins.

85. Many academics are concerned about free access to genome information. Go to Syngenta’s web page and read the second paragraph. Follow the links to see how quickly you can access the data. Compare this effort with the BGI databases in the preceding Discovery Questions and draw your own conclusions about the availability of data.
“Torrey Mesa Research Institute (TMRI) has closed its doors, but TMRIs affiliate, SBI, is making the rice genome sequence available to external scientists.
Please send requests for a CD copy of the rice genome sequence to [email protected].” Students will not be able to analyze these data.
86. Look at Figure 2.39 and consider chromosome 13. Was it duplicated, or was it unlucky enough never to have been duplicated? If 13 was duplicated, describe what happened to the duplicated version. Which chromosome pairs have been duplicated and retained nearly intact?
Chromosome 13 appears to be duplicated but now split into chromosomes 19 and 5. We cannot tell for sure if 13 predates 19 and 5; the smaller 2 may have duplicated and fused. None of the duplicated chromosomes are perfectly preserved, but 5 pairs have retained remarkable amounts of synteny: 2 & 3; 4 & 12; 9 & 11; 7 & 16; and 10 & 14.
87. Find the location of the human genes oligophrenin and arrestin, using MapViewer. Now go to the Tetraodon Genome Browser and search for oligophrenin and arrestin. Can you detect the interleaving genes shown in Figure 2.40 and the genome duplication in Figure 2.39 using these genes? Do they have paralogs near each other in the puffer fish?
oligophrenin on X: There are apparent paralogs on Tetraodon chromosomes 16, 18, and 7.
arrestin on 2: There are 5 apparent paralogs. One is on chromosome 1 but the other 4 (plus some splice variants) are on unassigned chromosomes, which indicates they have not been assembled into a contig of a known chromosome.
Neither of these genes matches Figure 2.40, but oligophrenin on chromosomes 16 and 7 matches the data in Figure 2.39, and arrestin on chromosome 1 and potentially several others is consistent with the poorly conserved synteny of chromosome 1.
88. Go to the Human Genome Browser and search for “sarcospan”. Is sarcospan highly conserved in the diverse species shown in the browser? You should see the Tetraodon Net line; click on the Tetraodon exons until you get a tabular report showing the gene’s summary statistics.
What is the size difference in the human and fish genes?
Human sarcospan is on chromosome 12. Yes, sarcospan is highly conserved in the diverse species (see figure). The tabular report shows these differences:
Human size: 670,434 bp
Tetraodon size: 47,601 bp
Difference: 622,833 bp
89. Go back to Tetraodon Genome Browser and examine the ideograms. Is the genome finished yet? Click on a gap and show the resolution at 50 kb. Change the viewing options so that all are hidden, except turn on “DNA/GC content”, “Gap: All Sequence Gap”, “Genescan”, “Hox Genes”, “Takifugu ecotigs”, and “Tetraodon cDNAs”. What effect do gaps have on the number of genes predicted by Genescan? Compare the predicted genes to the number of cDNAs to see if any validating sequences are available. If so, how well did Genescan predict the genes’ correct sizes? Now search for HoxA, zoom out to 200 kb, and change the settings to highlight mouse, human, Takifugu, and Tetraodon gene conservation. Which HoxA gene is in Tetraodon but not the mammals?
No, the genome is not finished as of 10/2005 since there are many white gaps in a background of blue finished sequence. If you look at this region (chr10:6654906 . . . 6704905), you will see that a small gap has resulted in the loss of one exon compared to cDNA. Genescan finds some of the verifiable genes, but it also predicts some that cannot be verified and even spans some gaps. Predicting gene size is not very reliable in Genescan. HoxA7 is found in the two puffer fish but not human or mouse.
Bonus Material: A study of the chicken genome is available on the book’s web site.
90. Let’s do some quick estimations about our DNA using these numbers: haploid genome of 3,289,000,000 bp, 23,000 genes; and the numbers from Table 2.7.
a. What percentage of your genome is spent on genes? Exons? Introns?
18.9% in genes (includes non-transcribed DNA), 0.937% in exons, 2.35% in introns.
b. What percentage of your genes is spent on exons?
Introns?
5.0% on exons, 12.5% on introns and the rest is regulatory in nature.
91. Are any chromosomes missing from Figure 2.43b?
The Y chromosome is missing.
92. Given the information in Figure 2.45, name one aspect that makes humans different from other species.
The number of proteins dedicated to transcription control is a critical difference. This is not unexpected since we contain more cell types that require unique combinations of proteins to perform their different cellular roles.
93. Go to the BLAST2 web page and perform a protein alignment with human sarcoglycan delta (NP_758447) and sarcoglycan gamma (NP_000222) by pasting these protein accession numbers into the smaller accession boxes and choosing “blastp” from the program menu. What are the percent identity and percent similarity between these two proteins? Now align the C. elegans sarcoglycan ortholog sgn-1 (CAA92663) with both of the human sarcoglycan genes and determine which one is more similar (compare identity, similarity, and gaps).
sarcoglycan delta v. sarcoglycan gamma: Identities 112/211 (53%), Positives 145/211 (68%)
sgn-1 v. sarcoglycan delta: Identities 87/236 (36%), Positives 141/236 (58%), Gaps 2/236 (0%)
sgn-1 v. sarcoglycan gamma: Identities 95/269 (35%), Positives 160/269 (59%), Gaps 7/269 (2%)
94. Go to UCSC’s Genome Browser and search for “syntrophin”. How many syntrophin genes do humans have? Click on the link for “(NM_009228) syntrophin, acidic 1”. Change the view to hide all except “Known genes” in full view, and all the different species “Net” views in dense, then refresh. Is this gene’s structure (combination of exons and introns) highly conserved? Click on the dog alignment twice until you have a nongraphic report for this locus of the dog genome, then click on the “Open Dog browser” link. Set the species “Blat” or “Net” views to full, and “Conservation” to full, then refresh.
Is the conservation confined to the exons only? Explain the significance of the conservation in the non-coding regions.
Humans have alpha 1, beta 1, beta 2 (two splice variants), gamma 1, gamma 2, and another gamma 2 apparent paralog by the same annotated name (very odd). Syntrophin alpha 1 is highly conserved in all species listed. Many parts of the introns are also conserved (see figure). Splice sites and mRNA binding proteins may require sequence conservation even if this does not directly affect the protein sequence.
95. Go to the Comparative Maps web page, and click on “Chromosome 9” under the “Rat and Mouse compared to Human” column. Using the “Region Shown” box in the left frame, enter “122M” in the top box and “123M” in the bottom box. Locate the human gene called PTGS1, which is the cyclooxygenase 1 gene and the target of painkillers such as aspirin (see Chapter 9 for details). Use the “Maps & Options” button, activate the “Show Connections”, and hit “Apply”. Do all 3 species have PTGS1? Are these regions syntenic, or are the human genes not linearly related to those in the two rodents? What pattern in the alignment of genes is evident just below PTGS1? Click on the Rat ortholog and choose the “ev” (evidence) link. Do you think there is sufficient data to support the annotation that this is a true ortholog?
All 3 species contain a Cox1 gene and overall, these regions appear syntenic. There is some gene rearrangement just below the PTGS1 locus. If you follow the links, eventually you will find some rat mRNAs that have identical structure to the human gene and this is very good evidence for the Cox1 annotation in rat.
96. Go to UCSC’s ENCODE web site and choose “Alpha Globin” from the menu in the left frame. Alpha Globin is abbreviated HBA, but you can see several genes that begin with HB. How many can you identify? (You may want to modify the view and zoom in to help you focus on the HB genes.)
Considering the conservation in other species, how many of these HB genes are conserved in vertebrates? Do all of these genes produce functional protein? Click on “HBA1” to find the answers in text. Click on the “AceView” link in the “Tools and Databases” table to see more information, including graphic depictions of alternative splicing for these genes.
You can see Hbz, HBQ1, HBM (predicted), HBA2, and HBA1. All of these are conserved, including HBM. The report says: “The human alpha globin gene cluster located on chromosome 16 spans about 30 kb and includes the following five loci: 5′- zeta - pseudozeta - pseudoalpha-1 - alpha-2 - alpha-1 -3′. The alpha-2 (HBA2) and alpha-1 (HBA1) coding sequences are identical. These genes differ slightly over the 5′ untranslated regions and the introns, but they differ significantly over the 3′ untranslated regions. Two alpha chains plus two beta chains constitute HbA, which in normal adult life comprises about 97% of the total hemoglobin; alpha chains combine with delta chains to constitute HbA-2, which with HbF (fetal hemoglobin) makes up the remaining 3% of adult hemoglobin. Alpha thalassemias result from deletions of each of the alpha genes as well as deletions of both HBA2 and HBA1 respectively; some nondeletion alpha thalassemias have also been reported.” There are 5 alternatively spliced forms shown in the AceView.
Now go to Ensembl’s Multispecies ENCODE web site and click on the “MultiSpecies alignment” for the cystic fibrosis gene “CFTR” to display 1 Mb of human, mouse, and chicken genome alignments. On what chromosomes are CFTR orthologs located for each species? Scroll down to examine the alignment and notice the difference for chicken. What must have happened to produce the converging lines seen in the chicken chromosome? Follow the blue line from human through chicken CFTR (the blue line is centered on the whole gene, not the 5′ end).
Is CFTR conserved in chicken, or is the gene truncated? You can recenter and zoom in to see the genes in better detail.
Human chromosome 7; mouse 6; chicken 1; chimp 6; rat 4; dog 14. The chicken gene is inverted compared to the mammalian. The converging lines indicate loss of DNA so the gene is compacted compared to human. If you zoom in to a resolution showing bases, you can see the exon structure of the CFTR orthologs, but it is still difficult to be certain with this resolution for such a large gene, though some exons do appear absent in chicken. Introns are clearly reduced.
(Figure panels: Human, Chicken)
97. Perform a Gene search for the human ITGB3 (integrin beta 3) gene, which is illustrated in Figure 2.47c. Is the correct version of the gene in the database, or is the number of exons incorrect? Click on the KEGG pathway for “regulation of actin cytoskeleton”. Where is integrin located in cells (its cellular component, in GO terminology)? Click on the highlighted box labeled “ITG” and use your browser’s find function to locate the amino acids “RNRD”. Does this database have the old or new amino acid sequence?
The corrected version is in the NCBI database with 15 exons. Integrin alpha 11 (ITG) is an integral membrane protein in the plasma membrane. KEGG also has the corrected sequence RNRDAPEGG . . .
Math Minute 2.1 What Can You Learn from a Dot Plot?
1. Suppose that two sequences are identical except that a segment is inverted in one sequence, relative to the other. Explain how such an inversion would be displayed in a dot plot.
The inversion would create a black streak in the dot plot, running diagonally from top right to bottom left. An example is shown below, obtained by comparing the protein query in Discovery Question 10 in Chapter 1 to the same sequence with amino acids 8–20 reversed. This dot plot was made with dotplot.xls. Other examples can be seen in Figures 2.4 and 2.20.
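The inversion streak described above can be reproduced with a few lines of code. This is only a minimal sketch, not the dotplot.xls spreadsheet: the function name `dot_plot` and the ten-letter sequence are hypothetical, chosen so that every letter is unique and the only dots come from true matches.

```python
def dot_plot(row_seq, col_seq):
    """Return one text row per residue of row_seq.

    '*' marks a position where the row residue equals the column residue,
    '.' marks a mismatch. This is an identity plot with no sliding window.
    """
    return ["".join("*" if r == c else "." for c in col_seq) for r in row_seq]


# Hypothetical sequence (not from the manual); every letter appears once.
original = "ABCDEFGHIJ"
# Reverse positions 4-7 (DEFG becomes GFED) to mimic an inverted segment.
inverted = original[:3] + original[3:7][::-1] + original[7:]

# Rows are labeled by the inverted sequence, columns by the original.
for line in dot_plot(inverted, original):
    print(line)
```

In the printed grid, the unchanged ends lie on the main diagonal (top left to bottom right), while the reversed block runs from top right to bottom left, which is the pattern the answer describes.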
Note that inversions are not identifiable in the sliding window view because the sequences are no longer similar along a diagonal segment from top left to bottom right, the order in which the sliding window measures similarity.
2. Suppose that two sequences are identical except that a segment is deleted from the middle of one sequence and not from the other. Explain how such a deletion would be displayed in a dot plot.
In this situation, there would be a diagonal segment of black, then a horizontal jump (if deletion in sequence labeling the columns) or a vertical jump (if deletion in sequence labeling the rows), after which the diagonal segment would pick up again. An example is shown below, obtained by comparing the sequence from Discovery Question 10 in Chapter 1 to the same sequence with amino acids 8–20 deleted in the sequence labeling the rows. Unlike inversions (see Math Minute Discovery Question 1), deletions are identifiable in both the identity and sliding window similarity dot plots.
3. What would be the value of using a dot plot to compare a sequence to a second sequence, as well as the reverse complement of that second sequence?
By comparing to both the original sequence and the reverse complement of the sequence, homology could be detected on both strands. For example, if one sequence codes for a gene, and the other is the complementary strand for a homologous gene, this homology could still be detected.
4. Dot plots can detect many interesting sequence features by using the exact same sequence on both the horizontal and vertical axes. Sketch the dot plot of a 100 kb sequence in which a 20 kb segment is duplicated.
The dot plot would be symmetric, with a solid diagonal line (because the sequence is identical to itself). The repeat shows up as lines, above and below the main diagonal, and 20% the length of the main diagonal. The basic pattern should be like the following:
5.
Sketch the dot plot of a 1 kb sequence in which a motif of approximately 50 consecutive bases appears six times in the sequence.
The properties of this sketch are similar to the sketch in Math Minute Discovery Question 4, but with shorter sequences repeated more often. Multiple repeats result in a checkerboard type pattern such as the following:
6. What would be the value in using a dot plot to compare a sequence to its own reverse complement?
Basic RNA structure can be detected, since hairpin loops show up as identical on a dot plot comparing a sequence to its own reverse complement. The bottom two plots in dotplot.xls allow students to explore this possibility. For example, the following plot of identity was made by comparing the sequence ACGTGGCCATATATCGCCACGT to itself. Notice that the last 7 nt are complementary to the first 7 nt, which can be seen by the diagonal strip in the sliding window plot.
Math Minute 2.2 How Do You Find Motifs?
1. Go to JASPAR and select “Browse profiles by class”. Scroll down to the TATA box and click on the “View” button in this row. Verify that the values in Table MM2.1 are displayed in this window. Explain how the sequence logo represents the information in Table MM2.1.
The sequence logo represents the frequency of each base at a position by the relative height of that letter in the stack. The extent to which the position is conserved, i.e., contains the same base, is represented by the overall height of the stack of letters at that position.
2. Return to the JASPAR “Browse profiles by class” page. Find transcription factors with ID numbers MA0040, MA0041, and MA0047. By looking at the sequence logos, explain which of these three transcription factors in rat is most likely to bind to DNA containing the motif TGTTTA.
MA0040 is most likely to bind to DNA containing TGTTTA because it has the tallest representations of these bases at positions 4–9.
Therefore, this is a strongly preferred binding site for this transcription factor.
3. Use the spreadsheet pwm.xls to compute the total TATA PWM score of the following three sequences, and determine which one is most likely to be a true TATA box: ATATATATAGGCTGG, CTATATATATGCTGG, CTATAAATAGGCCGG.
Using the TATA box PWM matrix:
ATATATATAGGCTGG produces a total PWM score of 10.95
CTATATATATGCTGG produces a total PWM score of 10.55
CTATAAATAGGCCGG produces a total PWM score of 14.99
Because it has the highest score, the third sequence, CTATAAATAGGCCGG, is the most likely to be a true TATA box.
4. Use the spreadsheet pwm.xls to compute the total TATA PWM score of CCGGCCTATTTATAG. Explain why the score is so high, even though this sequence does not look like a true TATA box. (Hint: How is this sequence related to one of the sequences in Math Minute Discovery Question 3?)
The reverse complement of this sequence has a high score (14.99). It is the same score as the third sequence in Question 3, because it is precisely the reverse complement of that sequence.
Math Minute 2.3 What Are “Positives” and What Do They Have to Do with E-values?
1. What are the smallest and largest values in the BLOSUM62 matrix? Explain why the diagonal entries of the table are not all the same, even though they all correspond to matches.
−4 is the smallest score, and 11 is the largest score. The diagonal entries are not all the same because some amino acids resist mutation more than others.
2. Repeat the BLAST2 protein alignment you performed in Discovery Question 15, but with COX1 (NP_000953) as the query this time. By looking at the resulting alignment and the BLOSUM62 matrix, explain the difference in raw scores in this comparison and your original comparison. Focus your search on the low-complexity regions.
To verify that the low-complexity regions are the only difference between using COX1 or COX2 as the query, repeat the two comparisons with filtering turned off. The filter check box is just above the box for sequence 1 accession number.
The bit score in the original search, with COX2 as the query, is 745 bits. But when COX1 is the query, the bit score is 738 bits. The difference is due to screening out, or filtering, low complexity regions. These regions are marked with X in the alignment. For example, when COX2 is the query, amino acids 182–198 are filtered out as low complexity. By clicking on the word Filter on the BLAST2 page, you can read about the source of the filtering algorithm (SEG program of Wootton & Federhen) (Computers and Chemistry, 1993). Repeating the alignments with filtering turned off results in a bit score of 787, regardless of which sequence is the query.
3. Repeat the BLAST2 search you did to compare COX1 and COX2 at the nucleotide level, but this time change “Reward for a match” to 2 and “Penalty for a mismatch” to 3. (Blanks for these numbers are just below where you select the blastn program.) What similarities and differences do you see in your results, as compared to your original search? Now repeat the search with the mismatch score remaining at 3, but the match score returned to the default of 1. Are there any unexpected outcomes? Explain the changes in hits for the three different nucleotide alignments, and how your results relate to the three different BLOSUM matrices described earlier.
The original search found 3 distinct regions of similarity, as shown in the snapshot on the left. The modified alignment, with the match reward set at 2 and the mismatch set to 3, found a single, longer region of similarity, as shown in the snapshot on the right. More bases are included in the second, modified, alignment, because the penalty for a mismatch is much smaller, relative to the reward for a match.
Thus, many more mismatches are accepted in the alignment, allowing the three regions of stronger similarity to be stitched together into one long region. When the penalty for a mismatch is increased to 3, without a corresponding increase in the reward for a match (leaving it at 1), there are no significantly similar regions in the alignment. The default scores used in the first alignment are comparable to the general purpose BLOSUM62 matrix. The scores used in the second alignment are comparable to the BLOSUM45 matrix, best for comparing more divergent sequences. The scores used in the third alignment are comparable to the BLOSUM80 matrix, best for comparing more highly conserved sequences, since it does not return hits unless the sequences are strongly similar.
4. You can change the nucleotide scoring parameters in a regular BLASTn search, too. Go back to the sequence you read from the chromatogram in Discovery Question 4, and enter the 50 bases as your query sequence in a BLASTn search. Before submitting your search, scroll down to the “Other Advanced” window and type “–r 2” to set the match score to 2. (You can click on the link next to this window to see what other parameters you can change.) Now submit your query. How are your hits different from the hits you got in Discovery Question 5 (with default value of 1), and why? Find the numerical evidence of the change you made at the bottom of the BLAST report.
Most of the hits are to the same sequences, and have the same alignments. However, the raw scores, bit scores, and E-values are different because of the modified reward for a match. The parameters lambda and K change to adjust for the change in the match score.
Math Minute 2.4 Can You Estimate the Number of Inversions in a Dot Plot?
1. Go to the GRIMM site and enter the mouse gene order from the top line of Table MM2.6 into the “Source genome” box.
You can leave the “Destination genome” box empty, because the program assumes the default target order of positive numbers 1 through 11. Select “multichromosomal or undirected” and “signed” options before hitting the “run” button. Does the program sort by reversals in the same number of steps as in Table MM2.6? Explain the differences in the reversal steps between the GRIMM site and Table MM2.6. How do these differences affect how you interpret the results of sorting by reversals?
Yes, the program sorts by reversals in the same number of steps (7) as in the table. However, the steps are different. In the GRIMM program, the first two reversals are of single genes (6 and then 9), whereas the first two steps in the table are reversals of 7 and 6 genes, respectively. The single gene reversals occur in steps 3 and 4 in the table. It is important to realize that the number of reversals may be minimized by more than one reversal process. In other words, the predicted inversion history is not unique.
2. At the GRIMM site, select “Human Mouse (123 genes)” from the “choose sample data” drop-down menu. Scroll down to see the results. What operations other than reversals have been performed? Why are these additional operations needed?
Fusions and translocations are also included in the history. These are needed because multiple chromosomes are being compared between mouse and human, and each mouse chromosome contains genes from multiple human chromosomes, and vice versa. To achieve the human ordering from the mouse ordering, genes must be moved across chromosomes as well as inverted on an individual chromosome.
Math Minute 2.5 How Do You Fit a Line to Data?
1. One possible explanation for the choice of line (1) over lines (2) and (3) is that the regression line was constrained to go through the origin. Why would it make sense for the line to pass through the origin?
If a chromosome truly had 0 CpG islands per Mb, it is a very simple chromosome.
Because you would expect at least one CpG island to appear just by chance on a sequence as long as a chromosome, the case where there are 0 CpG islands could be thought of as a “zero” or “empty” chromosome. An “empty” chromosome, one without enough sequence complexity to have even one CpG island, might also be expected to have zero genes on it.
2. Redraw lines (1), (2), and (3) with the additional restriction that each line must go through the origin. Under this restriction, do you agree with the investigators that line (1) is the best fit?
The line of best fit of type (1) under the restriction that the line must go through the origin would look something like the solid line in the figure below. The line of type (2) through the origin would look something like the dotted line, and the line of type (3) through the origin would look something like the dashed line. Lines (1) and (2) are both reasonable explanations. Line (3) misses so much of the cluster of points that it does not seem reasonable. When the line is not required to go through the origin, line (3) is much more reasonable, going through the cloud of points and near the four supposed outliers (something like the red line in the figure below). This example shows that assumptions are key to interpreting lines of best fit. If you believe the line should go through the origin, you would probably say (1) is the best, but if you do not believe the line should go through the origin, you might say (3) is the best because it is a model for all the available data.
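The effect of the through-the-origin assumption discussed in Math Minute 2.5 can be seen numerically. The sketch below is an illustration only: the function names are hypothetical, the data points are invented (they are not the values plotted in the figure), and the choice of CpG islands per Mb as x and genes per Mb as y is an assumption about the figure's axes. It fits the same points with an ordinary least-squares line and with a least-squares line forced through (0, 0).

```python
def fit_through_origin(xs, ys):
    """Least-squares slope for the model y = m*x (no intercept term)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_with_intercept(xs, ys):
    """Ordinary least-squares (slope, intercept) for the model y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented example data, assumed to be (CpG islands per Mb, genes per Mb).
cpg_per_mb = [2.0, 4.0, 6.0, 8.0]
genes_per_mb = [5.0, 7.0, 9.0, 11.0]

slope_origin = fit_through_origin(cpg_per_mb, genes_per_mb)
slope_free, intercept_free = fit_with_intercept(cpg_per_mb, genes_per_mb)

print("through origin: slope =", slope_origin)
print("unconstrained:  slope =", slope_free, "intercept =", intercept_free)
```

With these invented points the unconstrained fit has slope 1 and intercept 3, while forcing the line through the origin raises the slope to 1.5. The same data therefore support visibly different "best" lines depending on the assumption, which is the point the answer makes.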
