Name that Gene Project

The National Center for Biotechnology Information (NCBI) Web site is the primary repository of biological information in the US. The NCBI site includes the Basic Local Alignment Search Tool (BLAST), which can be used to search the GenBank database of DNA sequences and find regions of local similarity. This lab, based on exercises developed at NCBI and the University of New Hampshire, will familiarize you with the use of BLAST.

From the main NCBI web site, click on the BLAST link (along the top bar). There are many BLAST variants. The most commonly used variants are:

BLAST variant    Query sequence type            Database type
BLASTN           DNA                            DNA
BLASTP           Protein                        Protein
BLASTX           DNA (translated to protein)    Protein
TBLASTN          Protein                        Translated DNA
TBLASTX          Translated DNA                 Translated DNA

As you can see on the web page, there are many parameters you can set for BLAST searches. Interpreting the results is tricky without guidance.

NOTE: There is an excellent BLAST tutorial on the NCBI site that is strongly recommended if you want to learn more about what BLAST does.
http://www.ncbi.nih.gov/Education/BLASTinfo/information3.html

“Jurassic Park” Dino-DNA Analysis

In 1990, Michael Crichton published the book Jurassic Park, about the resurrection of dinosaurs using blood from the stomachs of insects that had been encased in amber. At one point in the book, Dr. Henry Wu is asked to explain some of the DNA techniques used in reconstructing the extinct dinosaur genomes. Dr. Wu describes the use of restriction enzymes and how the fragmented pieces of “dino DNA” can be spliced together with these enzymes. He also alludes to the fact that they don't have the entire genome but that they "filled in the gaps" with modern-day frog DNA. At one point during his discussion he points to a computer screen and remarks, "Here you see the actual structure of a small fragment of dinosaur DNA."

>DinoDNA "Dinosaur DNA" from Crichton JURASSIC PARK p. 
103 nt 1-1200 GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGC GGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCG TGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGC TGCTCACGCTGTACCTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTG CCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAA AGTAGGACAGGTGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAG ATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGTCACT CCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCT GGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATGATTCTTCTCGCTTCCGGCGG CCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAA CGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCG CACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAA CAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA GCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGG CTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTG ACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCA ACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCC GCGGTGCATGGAGCCGGGCCACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGG CCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGG CCATCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT In 1992 Dr. Mark Boguski at NCBI entered this sequence into a text editor and searched all of the known DNA sequences at the time. Dr. Boguski wrote up his findings and submitted a manuscript to the journal BioTechniques, as a tongue-in-cheek joke. His manuscript was accepted and published. (Boguski, M.S. A Molecular Biologist Visits Jurassic Park. (1992) BioTechniques 12(5):668-669). You will reproduce this experiment using BLAST. BLAST stands for Basic Local Alignment Search Tool. It allows you to search a database of nucleotide or protein sequences. You enter a query, which is a nucleotide or protein sequence. It is not a text term. 
BLAST then compares your character string (nucleotide sequence or protein sequence) against all the sequences in the target database. The program uses rigorous statistics to identify statistically significant matches.

EXERCISE 1:

From the main BLAST page, select Nucleotide BLAST. This brings up a web page where you can specify your query sequence along with various parameters. Copy and paste the above "dinosaur DNA" sequence into the window labeled Enter Query Sequence, and then click the BLAST button at the bottom of the page to start the search. After a short delay, the results of your search will be displayed on the page.

The most obvious feature of the resulting page is the graphic near the top, which depicts the "hits" or database matches for your query sequence. The number of hits depends on the degree of similarity found between your input sequence and the sequences in the database. The uppermost red line in the graphic represents your query sequence. The colored lines below represent the "hits", or sequences that closely match your query sequence. Lavender lines represent close or identical matches, while green, blue and black lines are less perfect matches.

The text immediately below the graphic describes the DNA sequences represented by the lines in the graphic, with the best matches presented first. The hyperlink at the start of each line of text will take you to an entry in the DNA sequence database that corresponds to the gene named in that line of text.

A) For each of the top three matches, click on the Accession link to the right. This will take you to a page with more information about the sequence. Report in the table on the data sheet the entry that appears next to the heading SOURCE ORGANISM.

B) In the world of molecular biology, what is a vector? Answer on the data sheet.

C) If a sequence does not correspond to a natural organism but instead represents a man-made construct, the SOURCE ORGANISM entry will be identified as an artificial sequence. How many of the top ten matches are artificial sequences? Identify any actual organisms in the top ten and list them in the table on the data sheet.

In practice, researchers rarely have complete and exact DNA samples. Some mistakes will undoubtedly occur in extracting sequences from samples, and gaps may occur as pieces of a sample are lost or incorrectly combined. This is why BLAST reports multiple matches and provides matching information via the colored lines and overall score. Advanced users of BLAST can specify additional search parameters to control how similar a match must be in order to be reported.

“The Lost World” Dino-DNA Analysis

Mark Boguski's published article was brought to Crichton's attention. For his second book, "The Lost World", Mr. Crichton used Dr. Boguski as a consultant. Dr. Boguski constructed an interesting sequence from existing species and also embedded a message in the protein translation of the DNA sequence, which he submitted for use in the book. Here is the sequence Dr. Boguski gave Crichton for the book "The Lost World":

>DinoDNA "Dinosaur DNA" from Crichton THE LOST WORLD p. 
135 GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGATAAGGACG GACGTGTGGCAGCTCCCGCAGAGGATTCACTGGAAGTGCATTACCTATCCCATGGGAGCC ATGGAGTTCGTGGCGCTGGGGGGGCCGGATGCGGGCTCCCCCACTCCGTTCCCTGATGAA GCCGGAGCCTTCCTGGGGCTGGGGGGGGGCGAGAGGACGGAGGCGGGGGGGCTGCTGGCC TCCTACCCCCCCTCAGGCCGCGTGTCCCTGGTGCCGTGGGCAGACACGGGTACTTTGGGG ACCCCCCAGTGGGTGCCGCCCGCCACCCAAATGGAGCCCCCCCACTACCTGGAGCTGCTG CAACCCCCCCGGGGCAGCCCCCCCCATCCCTCCTCCGGGCCCCTACTGCCACTCAGCAGC GGGCCCCCACCCTGCGAGGCCCGTGAGTGCGTCATGGCCAGGAAGAACTGCGGAGCGACG GCAACGCCGCTGTGGCGCCGGGACGGCACCGGGCATTACCTGTGCAACTGGGCCTCAGCC TGCGGGCTCTACCACCGCCTCAACGGCCAGAACCGCCCGCTCATCCGCCCCAAAAAGCGC CTGCTGGTGAGTAAGCGCGCAGGCACAGTGTGCAGCCACGAGCGTGAAAACTGCCAGACA TCCACCACCACTCTGTGGCGTCGCAGCCCCATGGGGGACCCCGTCTGCAACAACATTCAC GCCTGCGGCCTCTACTACAAACTGCACCAAGTGAACCGCCCCCTCACGATGCGCAAAGAC GGAATCCAAACCCGAAACCGCAAAGTTTCCTCCAAGGGTAAAAAGCGGCGCCCCCCGGGG GGGGGAAACCCCTCCGCCACCGCGGGAGGGGGCGCTCCTATGGGGGGAGGGGGGGACCCC TCTATGCCCCCCCCGCCGCCCCCCCCGGCCGCCGCCCCCCCTCAAAGCGACGCTCTGTAC GCTCTCGGCCCCGTGGTCCTTTCGGGCCATTTTCTGCCCTTTGGAAACTCCGGAGGGTTT TTTGGGGGGGGGGCGGGGGGTTACACGGCCCCCCCGGGGCTGAGCCCGCAGATTTAAATA ATAACTCTGACGTGGGCAAGTGGGCCTTGCTGAGAAGACAGTGTAACATAATAATTTGCA CCTCGGCAATTGCAGAGGGTCGATCTCCACTTTGGACACAACAGGGCTACTCGGTAGGAC CAGATAAGCACTTTGCTCCCTGGACTGAAAAAGAAAGGATTTATCTGTTTGCTTCTTGCT GACAAATCCCTGTGAAAGGTAAAAGTCGGACACAGCAATCGATTATTTCTCGCCTGTGTG AAATTACTGTGAATATTGTAAATATATATATATATATATATATATCTGTATAGAACAGCC TCGGAGGCGGCATGGACCCAGCGTAGATCATGCTGGATTTGTACTGCCGGAATTC EXERCISE 2: Once again, invoke Nucleotide BLAST and copy and paste all or part this new "Lost World" sequence into the Search window and submit it to BLAST. Click the link to the accession number of the highest-scoring match in the list of sequence. (A) Which organism is this DNA sequence from? Answer on the data sheet. (B) To see more about the organism click on the organism’s name to go to another information page. Do the same thing for the second-highest-scoring match and identify the organism. 
Record the required information in the table on the data sheet.

(C) Look around at the different information available on these pages. Is either of these organisms related to dinosaurs? How do you know? Answer on the data sheet.

EXERCISE 3:

From the main BLAST page, click on the link for Translated query vs. protein database (blastx). Copy and paste this same "Lost World" sequence into the Search window and submit it to BLAST. Make sure to include the entire sequence for this exercise.

On the results page, look at the second-best alignment (erythroid-specific transcription factor) by clicking on the score value for the alignment (364) in the right-hand column. The resulting page will show the query sequence written as a protein (using the 20 letters corresponding to amino acids). The matching sequence of amino acids from the database is shown below the query sequence, with dashes representing gaps. Dr. Boguski's message is hidden in the query sequence in the positions corresponding to dashes in the subject sequence. What is his message? Answer on the data sheet.

EXERCISE 4:

You will be assigned a nucleotide sequence found in real human DNA that is associated with a genetic disease when mutated. Your job is to compare the sequence you are given with the nucleotide sequences of most known genes, using the BLAST tool to search genetic databases.

1. Go to the homepage for the NCBI ( www.ncbi.nlm.nih.gov ).
2. Click on the word "BLAST" located in the list of links on the right of the page.
3. Scroll down until you find the heading "nucleotide blast" and click the link.
4. This time we will practice using a part of the gene sequence that codes for hemoglobin. When a mutation occurs in this gene, a person can develop Sickle Cell Anemia. Copy and paste the sequence below into the BLAST window.

GGG ATG AAT AAG GCA TAT GCA TCA GGG GCT GTT GCC AAT GTG CAT TAG CTG TTT GCA GCC TCA CCT TCT TTC ATG GAG TTT AAG ATA

5. When you have finished entering your sequence, click on the BLAST button.
6. You may see a screen asking you to wait 10-20 seconds. Do not click on anything; just relax and wait patiently for the search to conclude.
7. After the search has ended, scroll down past the box with all of the pink bars in it and find the words "Sequences Producing Significant Alignment".
8. Listed in order are the closest matches with your DNA sequence. You should notice that the BLAST search will return results for all genomes currently mapped (several prokaryotes, humans, rats, chimpanzees, cows, pigs, chickens and puffer fish, to name a few). Please take a moment to be awed by the similarity found in the DNA code despite the outward physical diversity of organisms.
9. Click on the link of the first item in the list, “Homo sapiens hemoglobin, beta (HBB) gene, complete cds”. This will take you to the sequence alignment section, which displays the comparison between your sequence and the particular sequence in the database. (Note: You can also get to the Alignments section by scrolling down past the list of sequences producing significant alignment.)
10. Once in the Alignments section, look for a “Gene” link and click on it. It does not matter that it is not the first alignment in the list. This will take you to a page with summary information about the gene that contains your sequence.
11. This is the Gene Summary. It tells you that this sequence is officially recognized. HBB is the official symbol for this gene. If you read further, it will tell you that the gene is found in humans and that a mutated version of it causes Sickle Cell Anemia.
12. On the right you will see a link to “Phenotypes”. Click the link and look for information about Hb SS Disease under the summary from GeneReviews. This will open more information about sickle cell.
13. Next, go back to the previous page and click on the link to the “Map Viewer” in the right column.
14. Map Viewer page. Across the top of the page, you will notice the numbers 1-22 and X and Y. These represent the chromosomes found in humans.
15. While we are here, let’s take a moment to go over chromosomal nomenclature. 11p15.5: This means that the HBB gene is found on chromosome 11, on the short (p) arm, in region 15.5. The fancy word for this is locus, which is science talk for location.
16. Your next step will be to click on the OMIM link in the pink information bar. This will take you to a page with lots of information about your gene and what it does. Another way to access the OMIM database is to go back to the NCBI homepage, click on the dropdown menu and select “OMIM”, then type the topic in the search bar and click “Search.”

So that is your basic tour of the NCBI. There is a lot of other information about genes and DNA posted on this website, some of which requires 3 or 4 PhDs to understand, so we are going to move on.

Project Instructions:

Using the databases you just became familiar with, you are going to explore the human genome. You will be assigned a gene from the attached list. You are to gather and report the information you discover about your assigned gene sequence. You must include at least the following information:

• Name of gene involved
• Symbol of gene
• The sequence of your gene
• How many base pairs long is the gene?
• Location of gene on its corresponding chromosome
• What proportion of the gene is your sequence?
• What happens if there is an error/mutation in the gene? (i.e., what disease results?)
• What is the error/mutation in the gene that leads to this disease?
• How prevalent in the human population is this mutation?
• Can the disease allele of this gene be inherited? If so, what is the pattern of inheritance?
• Describe the symptoms of the related disease or disorder.
• What are the treatments for the disease? What do they involve?
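The chromosome notation from step 15 and the question "What proportion of the gene is your sequence?" can also be checked with a few lines of code. The following is a minimal Python sketch and is not part of the original activity. The names parse_locus, clean_sequence and proportion_of_gene are made up for illustration, and the gene length must be supplied by you from the NCBI Gene page; no real gene length is assumed here.

```python
import re
from typing import NamedTuple

# Plain-language names for the two chromosome arms (p = short, q = long).
ARM_NAMES = {"p": "short", "q": "long"}


class Locus(NamedTuple):
    chromosome: str  # "1" to "22", "X" or "Y"
    arm: str         # "p" or "q"
    band: str        # region label, e.g. "15.5"


def parse_locus(text: str) -> Locus:
    """Split a location such as '11p15.5' into chromosome, arm and region."""
    pattern = r"([1-9]|1[0-9]|2[0-2]|X|Y)([pq])(\d+(?:\.\d+)?)"
    match = re.fullmatch(pattern, text.strip())
    if match is None:
        raise ValueError(f"not a recognized locus: {text!r}")
    return Locus(*match.groups())


def clean_sequence(raw: str) -> str:
    """Remove spaces and line breaks from a pasted sequence and upper-case it."""
    return "".join(raw.split()).upper()


def proportion_of_gene(sequence: str, gene_length_bp: int) -> float:
    """Fraction of the whole gene covered by the assigned sequence."""
    if gene_length_bp <= 0:
        raise ValueError("gene length must be a positive number of base pairs")
    return len(clean_sequence(sequence)) / gene_length_bp


if __name__ == "__main__":
    locus = parse_locus("11p15.5")
    print(f"chromosome {locus.chromosome}, {ARM_NAMES[locus.arm]} arm, "
          f"region {locus.band}")
    # Hypothetical numbers: a 12-base fragment of a 48-base "gene".
    print(proportion_of_gene("GGG ATG AAT AAG", 48))
```

To use it for your report, replace the hypothetical 12-base fragment with your assigned sequence and the placeholder length with the gene length you look up yourself.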
Format:

• Paper must be typed with 12 pt., Times New Roman or Arial type font; black ink
• Paper must be double-spaced, with no large gaps between sections or between paragraphs
• Paper must have 1-inch margins
• Spelling and grammar should be correct
• Do not use contractions (it’s, can’t, etc.)
• Any figures or tables must have a figure legend (Table 1. Title) and be referred to in the text
• It should not just be a list of answers to the above questions. The above list should not be headings in your paper. This should be written as a research paper about your gene sequence.
• All text must be in your own words!
• Information gathered and included in the report must be correctly cited using APA format with endnotes and a Works Cited page at the end. If you need assistance with APA format, there is a document on the class website under “frequently used documents” to help.

Gene Sequences:

You will be assigned 1 or 2 of the following sequences to BLAST.

Sequence #1
ATGGCGACCCTGGAAAAGCTGATGAAGGCCTTCGAGTCCCTCAAGTCCTTCCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG CAGCAGCAGCAGC

Sequence #2
ATGGCGGGGTCTGACGGCGGCGGCCCCGCGGCCCGGAGTCCTCCTGCTCCTGCTGTCCATCCTCCACCCCTCTCGGCCTGGAGGGGTCC TTGGGGCCATTCCTGGTGGAGTTCCTGGAGGAGTCTT

Sequence #3
ATGCTCACATTCATGGCCTCTGACAGCGAGGAAGAAGTGTGTGATGAGCGGACGTCCCTAATGTCGGCCGAGAGCCCCACGCCGCGCTCC TGCCAGGAGGGCAGGCAGGGCCCAGAGGATGGAG

Sequence #4
ATGTTTTATACAGGTGTAGCCTGTAAGAGATGAAGCCTGGTATTTATAGAAATTGACTTATTTTATTCTCATATTTACATGTGCATAATTTTCCA TATGCCAGAAAAGTTGAATAGTATCAGATTCCAAATCT

Sequence #5
ATGCGTCGAGGGCGTCTGCTGGAGATCGCCCTGGGATTTACCGTGCTTTTAGCGTCCTACACGAGCCATGGGGCGGACGCCAATTTGGAG GCTGGGAACGTGAAGGAAACCAGAGCCAGTCGGGCC

Sequence #6
ATGCCGCCCAAAACCCCCCGAAAAACGGCCGCCACCGCCGCCGCTGCCGCCGCGGAACCCCCGGCACCGCCGCCGCCGCCCCCTCCTG AGGAGGACCCAGAGCAGGACAGCGGCCCGGAGGAC

Sequence #7
ATGTTGTGCAATATCCATCTACTGTAGTTAAGATATTCAGTAGTTTGTTTTTCATAAGCATGTAATTGATCATATTTCTGCCAGGGATGTGCCTT CAACTTTATAATTATAGTGTTGTAAAATATTTTTGTCTG

Sequence #8
ATGCCATCTTCCTTGATGTTGGAGGTACCTGCTCTGGCAGATTTCAACCGGGCTTGGACAGAACTTACCGACTGGCTTTCTCTGCTTGATCA AGTTATAAAATCACAGAGGGTGATGGTGGGTGACCTT Sequence #9 CCTCGCCTCCGTTACAACGGCCTACGGTGCTGGAGGATCCTTCTGCGCACGCGCACAGCCTCCGGCCGGCTATTTCCGCGAGCGCGTTCC ATCCTCTACCGAGCGCGCGCGAAGACTACGGAGGTCGACTCGGGAGCGCGCACGCAGCTCCGCCCCGCGTCCGACCCGCGGATCCCGCG GCGTCCGGCCCGGGTGGTCTGGATCGCGGAGGGAATGCCCCGGAGGGCGGAGAACTGGGACGAGGCCGAGGTAGGCGCGGAGGAGGC AGGCGTCGAAGAGTACGGCCCTGAAGAAGACGGCGGGGAGGAGTCGGGCGCCGAGGAGTCCGGCCCGGAAGAGTCCGGCCCGGA Sequence #10 CTTAGCGGTAGCCCCTTGGTTTCCGTGGCAACGGAAAAGCGCGGGAATTACAGATAAATTAAAACTGCGACTGCGCGGCGTGAGCTCGCTG AGACTTCCTGGACGGGGGACAGGCTGTGGGGTTTCTCAGATAACTGGGCCCCTGCGCTCAGGAGGCCTTCACCCTCTGCTCTGGGTAAAG TTCATTGGAACAGAAAGAAATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAAATCTTAGAGTGTCCC ATCTGTCTGGAGTTGATCAAGGAACCTGTCTCCACAAAGTGTGACCACATATTTTGCAAATTTTGCATGCTGAAA Sequence #11 GAAGAGCAAGCGCCATGTTGAAGCCATCATTACCATTCACATCCCTCTTATTCCTGCAGCTGCCCCTGCTGGGAGTGGGGCTGAACACGAC AATTCTGACGCCCAATGGGAATGAAGACACCACAGCTGGTGGGAAATCTGGGACTGGAGGGGGCTGGTGAGAAGGGTGGCTGTGGGAAG GGGCCGTACAGAGATCTGGTGCCTGCCACTGGCCATTACAATCATGTGGGCAGAATTGAAAAGTGGAGTGGGAAGGGCAAGGGGGAGGGT TCCCTGCCTCACGCTACTTCTTCTTTCTTTCTTGTTTGTTTGTTTCTTTCTTTCTTTTGAGGCAGGGTCTCACTATGTTG Sequence #12 CCGGAGCCCGAGCCGAAGGGCGAGCCGCAAACGCTAAGTCGCTGGCCATTGGTGGACATGGCGCAGGCGCGTTTGCTCCGACGGGCCGA ATGTTTTGGGGCAGTGTTTTGAGCGCGGAGACCGCGTGATACTGGATGCGCATGGGCATACCGTGCTCTGCGGCTGCTTGGCGTTGCTTCT TCCTCCAGAAGTGGGCGCTGGGCAGTCACGCAGGGTTTGAACCGGAAGCGGGAGTAGGTAGCTGCGTGGCTAACGGAGAAAAGAAGCCG TGGCCGCGGGAGGAGGCGAGAGGAGTCGGGATCTGCGCTGCAGCCACCGCCGCGGTTGATACTACTTTGACCTTCCGAGTG Sequence #13 CAGCCGCCCCTCCTGCGGCCGCTGCGGGGGCCGCCGCCTGACTTCGGACACCGGCCCCGCACCCGCCAGGAGGGGAGGGAAGGGGAG GCGGGGAGAGCGACGGCGGGGGGCGGGCGGTGGACCCCGCCTCCCCCGGCACAGCCTGCTGAGGGGAAGAGGGGGTCTCCGCTCTTC CTCAGTGCACTCTCTGACTGAAGCCCGGCGCGTGGGGTGCAGCGGGAGTGCGAGGGGACTGGACAGGTGGGAAGATGGGAATGAGGACC GGGCGGCGGGAATGTTCTCACTTCTCCGGATTCCACCGGGATGCAGGACTCTAGCTGCCCAGCCGCACCTGCGAAGAGACTACACTT Sequence #14 
GTTTGGGGCCAGAGTGGGCGAGGCGCGGAGGTCTGGCCTATAAAGTAGTCGCGGAGACGGGGTGCTGGTTTGCGTCGTAGTCTCCTGCA GCGTCTGGGGTTTCCGTTGCAGTCCTCGGAACCAGGACCTCGGCGTGGCCTAGCGAGTTATGGCGACGAAGGCCGTGTGCGTGCTGAAGG GCGACGGCCCAGTGCAGGGCATCATCAATTTCGAGCAGAAGGAAAGTAATGGACCAGTGAAGGTGTGGGGAGCATTAAAGGACTGACTGA AGGCCTGCATGGATTCCATGTTCATGAGTTTGGAGATAATACAGCAGGCTGTACCAGTGCAGGTCCTCACTTTAATCCTC

Modified from Wefer, Stephen H. (Oct. 2003). "Name That Gene- An Authentic Classroom Activity Incorporating Bioinformatics". The American Biology Teacher, Vol. 65 No. 8. p. 610 with permission to use from the National Association of Biology Teachers (NABT).

Jurassic Park Data Sheet

Exercise 1: Jurassic Park

A) Top Three Matches for >DinoDNA "Dinosaur DNA" from Crichton JURASSIC PARK

    Description              Accession Number         Source Organism
1.
2.
3.

B) In the world of molecular biology, what is a vector?

C) Identification of actual organisms in the top ten list of matches.

    Description              Accession Number         Source Organism
1.
2.
3.

Exercise 2: The Lost World

A) Which organism is this DNA sequence from?

B) Top Two Matches for >DinoDNA "Dinosaur DNA" from Crichton THE LOST WORLD

    Description              Accession Number         Source Organism
1.
2.

C) Look around at the different information available on these pages. Is either of these organisms related to dinosaurs? How do you know?

Exercise 3: What is Dr. Boguski's message?
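Optional: Exercise 3 depends on blastx translating the DNA query into protein before it searches the protein database. The Python sketch below illustrates only that translation step, assuming the standard genetic code. It is not NCBI's implementation and it does no database searching, scoring or alignment. The function names (translate, reverse_complement, six_frame_translations) are made up for this illustration.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def clean(dna: str) -> str:
    """Remove spaces and line breaks and upper-case the sequence."""
    return "".join(dna.split()).upper()


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return clean(dna).translate(COMPLEMENT)[::-1]


def translate(dna: str, frame: int = 0) -> str:
    """Translate one reading frame (offset 0, 1 or 2) into one-letter amino acids.

    Codons containing anything other than A, C, G or T become "X".
    """
    dna = clean(dna)
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> dict:
    """Translate three frames on each strand, as a translated search must."""
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(reverse, offset)
    return frames


if __name__ == "__main__":
    # First four codons of the hemoglobin practice sequence from Exercise 4.
    print(translate("GGG ATG AAT AAG"))  # prints GMNK
```

Because a DNA query can be read in three frames on each strand, a translated search has to consider all six translations; only some of them will resemble a real protein.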
