Bio/CS-251 Laboratory 2: Examining A Single DNA Sequence
Christine Frielle
January 31, 2007

For this first lab using the bioinformatics tools found on the web, we will follow the last part of Chapter 5 of Bioinformatics for Dummies, henceforth abbreviated as BFD. The first part of the chapter deals with “cleaning up” a sequence of DNA that a microbiologist may have collected in the lab and also with designing PCR primers. We will discuss this latter topic at a future date as we approach our “wet lab” exercise. For now, we will “borrow” a known data sequence from the NCBI web page:

http://www.ncbi.nlm.nih.gov/

The gene that we choose is the mutS/hMSH2 DNA repair gene. In addition to following the readings and guided steps on pages 151-159, we will ask you to answer some questions related to your findings. First we give some background on this gene.

mutS is the name given to a prokaryotic (bacterial) defender of the genome. (“mut” is an abbreviation that reflects the increased rate at which DNA mutations accumulate in cells that do not have this critical gene.) This gene is universal in that it is found in virtually every organism, both prokaryotic and eukaryotic.

MSH2 is the name given to the eukaryotic (algae and fungi, plants and animals) version of this gene. (“MSH” is an abbreviation that means “MutS Homolog”.) The term “homolog” means that MSH2 looks and acts like the mutS gene, i.e., its structure is similar to mutS and it plays a similar role in preventing mutations from occurring.

hMSH2: the prefix h in front of the gene name indicates that it is the human version of the gene.

Before we begin the lab, read Analyzing DNA Composition on page 151 of BFD and answer these questions.

Q1: We are analyzing a single sequence of DNA that represents the entire sample of DNA that was obtained in the hypothetical lab. This sequence obviously represents only one strand of the DNA that was extracted in the lab.
The single strand of DNA is denoted as cDNA (complementary DNA). How is cDNA created? HINT: read the preface to Chapter 5 or GOOGLE cDNA.

cDNA stands for complementary DNA. It is a DNA copy of messenger RNA, made by the enzyme reverse transcriptase, and is single stranded.

Q2: Why is the pairing between guanosine and cytosine nucleotides more stable than the pairing between adenosine and thymidine?

The G-C pairings are more stable because they are connected by three hydrogen bonds. The A-T pairings are connected by only two hydrogen bonds.

Q3: If we know the G+C count, can we find the frequency of all of the bases in the sample of DNA that was obtained in the lab? How is this done?

Pairings can be either G-C or A-T. If the frequency of G-C pairings is known, the remainder of the sample must be composed of A-T pairings. In double-stranded DNA, G and C therefore each make up half of the G+C fraction, and A and T each make up half of the remainder. Computer programs such as EMBOSS can also be used.

OK, now on to the lab procedures.

Procedure: Collect your sequence from NCBI

Go to the NCBI web site for GenBank given in the URL at the top of this page.

a. From the “Search” pull-down menu, choose “Gene”.
b. In the “For” window, type “hMSH2” and click “Go”.
c. Several references to the human versions of this gene are listed. Choose the second entry, MSH2, and click on it.
d. You will be taken to a page that contains a variety of information about research that has been done on this gene. Peruse this page.

Q4: What is the complete name of this entry?

MSH2 mutS homolog 2, colon cancer, nonpolyposis type 1 (E. coli) from Homo sapiens

Q5: How many papers have been written about this particular entry? HINT: You will need to go to PubMed for this information. Follow the Links!

168 papers have been written about this entry.

Q6: As you scroll down the page you will come to a link to the GenBank page that contains the DNA sequence itself. How many base pairs long is the sequence for this entry?
80098 base pairs

e. Scroll back up to the top of the GenBank page and, from the Display pull-down menu, choose FASTA.
f. A new page will appear that contains the name of the entry and the listing of the nucleotides in sequential order, but in a different format from the one at the bottom of the previous page. Copy all of this information into a Word document that you will save in your workspace as MSH2.doc. You are now ready to begin your analysis.

Procedure: Follow page 152 in BFD, Counting Words in DNA Sequences

Purpose: To find the count for each of the nucleotides found in this sequence and also to find the count for each of four significant triplets found in the sequence.

g. After you obtain your result, which will be formatted like Figure 5-4, copy and paste it on a new page of your MSH2.doc.

Screen shot attached at end.

Q7: What is the total G+C count for this sequence? Why are the percentages of G and C that are shown so different? Is this a violation of Chargaff’s rules?

The G+C content is 41.60%. The percentage for G is 21.33% and the percentage for C is 20.27%. This is not a violation of Chargaff’s rules because the sequence is a single-stranded cDNA sequence. If the sequence were a double-stranded DNA segment, the two percentages would be expected to be equal.

Q8: Give the total count for each of the nucleotides in the strand of DNA represented by this sequence.

A: 20890
C: 16235
G: 17088
T: 25885

Q9: As you will learn, the triplets ATG, TAA, TAG, and TGA can have a special significance in DNA sequences. What is the frequency of each of these triplets in the sequence that you just processed?
ATG: 1344
TAA: 1551
TAG: 1173
TGA: 1474

Procedure: Follow the instructions on pages 153-154 of BFD

Purpose: To search our sequence for the occurrence of any highly unusual repeat of a long word (> 3 nucleotides in length)

The people who did the statistical analysis for the program BLAST (which we will begin using next week) said that it is extremely improbable for any sequence of length 11 to be repeated solely by random assignment of the four letters A, C, G, or T. Therefore, we may conclude that the repeat of an 11-letter word is a significant finding in our sequence. We will look for a repeated sequence, but not push it as far as 11. We will go with 5.

h. Follow the instructions on pages 153-154 using a word length of 5. You will have to recopy the sequence for MSH2 that you saved in your Word document.

NOTE: In instruction 3 there is no link at “Codon usage, composition”. Just find that section on the web page and go to instruction 4 on page 153.

Q10: How many 5-letter words are repeated 200 times or more in the sequence for MSH2?

There are 32 five-letter words that are repeated 200 times or more in the sequence.

Q12: List (Copy and Paste) these sequence(s).

Word    Count
TTTTT   1270
AAAAA    589
ATTTT    482
TTTTA    415
TTTTG    346
TATTT    343
TTTGT    289
TTTAA    283
TTTCT    277
TTTAT    273
AATTT    270
TGTTT    269
TTATT    266
CTTTT    266
GTTTT    255
GCTGG    245
AAAAT    245
TTTTC    242
TCTTT    242
TTCTT    240
CCTCC    235
TTGTT    235
TAATT    234
TTAAA    226
TAAAA    226
CTGGG    217
AAATT    214
GCCTC    212
TGGGA    205
TATAT    205
TTTGA    202
TCCCA    201

Procedure: Using a Dot-Plot to spot long words in a sequence

Purpose: To provide a streamlined visual method to perform the task of the previous procedure.

i. Follow the instructions on pages 155 and 156 of BFD. The web page will not download with your graph, so scroll up so that the entire graph appears on your screen. Then press ALT and Print Scrn at the same time.
This will copy the window that displays the graph. Paste (Ctrl and V) this on a new page of your Word document. Save this document in a folder called Lab 1 on your H drive. You should also save this completed Lab worksheet in that folder.

Sheet views are attached with different window sizes.

Q13: Does this dot plot show any repeated word of significant length? Think carefully before you answer this question.

The dark areas on the dot plot show words that are repeated a significant number of times. The darker the area, the more times the word is repeated. This makes sense because, earlier, 32 five-letter words were found to be repeated 200 times or more.

An example of a repeated sequence with tragic consequences

Procedure: Using OMIM (Online Mendelian Inheritance in Man) to examine a genetic disease caused by repeat sequences

Purpose: Learn how to navigate OMIM

j. Go to http://www.ncbi.nlm.nih.gov/. Under “Search”, choose “Gene”, and type “HD” into the search box. Open Link #2, “HD”. Read the Summary, and then scroll to the bottom of the page. Under NCBI Reference Sequences (RefSeq), open the link to the mRNA sequence (NM_002111), then under “Display”, choose FASTA.
k. Examine the first six lines of the mRNA, and in the space below, record a triplet sequence that is repeated in tandem more than 10 times.

Q14: Record your triplet repeat here:

The triplet CAG is repeated, in tandem, more than 10 times.

Q15: How many times is the triplet repeated (how many copies of the triplet)?

The triplet is repeated 21 times.

l. Return to the NCBI Entrez Gene page for the HD gene. Under “Additional Links”, select MIM:143100, and open this link to the OMIM database for the HD gene. You will find that this is a long and detailed summary of everything that is known about the HD gene and its pathology. Answer each of the following questions briefly.
Q16: What disease is caused by alterations in the HD gene? What organ system is affected by this disease? (You may wish to view the “Clinical Synopsis” from the Table of Contents along the left border of the page.)

Huntington disease. It affects the central nervous system and also has behavioral and psychiatric manifestations.

Q17: From the Table of Contents, select “Allelic Variants”, read this section, and answer the following question: What is the molecular genetic basis for the disease? Explain how repeat sequence variation is responsible for this disease.

The nucleotide sequence CAG is located in the coding region of the gene for Huntington disease. The sequence is repeated between 9 and 37 times in normal individuals, but between 37 and 86 times in affected individuals. Because the sequence is repeated in a coding section of the gene, the protein will contain extra amino acids (glutamines, which CAG encodes) that would not normally be present. These extra amino acids affect the protein and its functions in the cells.

I affirm that I have upheld the highest principles of honesty and integrity in my academic work and have not witnessed a violation of the Honor Code.

Christine Frielle
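
Note: the arithmetic behind Q7 and Q8, the word counting in Q10, and the tandem-repeat check in Q14 and Q15 can also be reproduced offline. What follows is a minimal Python sketch, not the web tools used in the lab. The function names (composition, count_words, longest_tandem_run) and the toy sequence are made up for illustration; only the four base counts come from the answer to Q8. Words are assumed to be counted in overlapping windows, which may or may not match the setting of the web tool.

```python
import re
from collections import Counter


def composition(counts):
    """Return each base's fraction of the total, given raw base counts."""
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


def count_words(seq, k):
    """Count every overlapping word of length k in seq (assumed convention)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def longest_tandem_run(seq, unit):
    """Return the largest number of back-to-back copies of unit in seq.

    Uses a left-to-right, non-overlapping scan, which is adequate for a
    unit such as CAG that cannot overlap itself.
    """
    runs = re.findall(f"(?:{re.escape(unit)})+", seq.upper())
    return max((len(run) // len(unit) for run in runs), default=0)


# Base counts reported in the answer to Q8.
msh2_counts = {"A": 20890, "C": 16235, "G": 17088, "T": 25885}
fractions = composition(msh2_counts)
gc_percent = 100 * (fractions["G"] + fractions["C"])

print(sum(msh2_counts.values()))  # sequence length: 80098
print(round(gc_percent, 2))       # G+C content in percent: 41.6

# Toy sequence, made up for illustration (not taken from GenBank).
toy = "ATGCAGCAGCAGTTTTTTTAA"
print(count_words(toy, 5).most_common(3))
print(longest_tandem_run(toy, "CAG"))
```

The four counts sum to the 80098 bases reported in Q6, and the G+C percentage agrees with the 41.60% given in Q7. To repeat Q10 or Q15 this way, the FASTA header line and the line breaks would have to be stripped from the saved sequence before it is passed to count_words or longest_tandem_run.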