DNA ANALYSIS: Public vs private access to the human genome

Spring 2007
Biology 212 General Genetics
Bioinformatics Workshop

THE GOALS OF THIS TUTORIAL ARE:
- to demonstrate the powerful tools for analyzing gene and protein sequences, many of which are free to the public
- to illustrate programs for analyzing the zebrafish cDNA clones you are working with
- to see how genome analysis of the zebrafish can be extended to help understand human genes

KEYWORDS:
BCM Search Launcher: A set of programs available at the Baylor College of Medicine web site for DNA sequence analysis.
bioinformatics: The use of computing to analyze and store gene and protein sequences.
BLAST: Programs which compare nucleotide or protein sequences and look for similarities. For example, BLAST can be used to find a human gene like that of a known mouse or fruit fly gene. In drug discovery, regions of newly identified proteins found to be similar to existing proteins can help suggest new drug targets.
database: Stored DNA or protein sequence files or protein structure files.
GenBank: A publicly accessible database consisting of DNA sequences. Currently administered by the National Center for Biotechnology Information.
genomics: Study of the genomes (DNA sequences) of organisms.
proteomics: Study of the structure and function of proteins, the products of genes.
NCBI: National Center for Biotechnology Information. A US government sponsored site which allows public access to research articles (PubMed), to databases such as GenBank (Entrez) and to nucleotide and protein sequence analysis programs and databases (BLAST).
NEBcutter: Utility that identifies restriction sites in a DNA sequence.

Use this tutorial to learn more about analyzing DNA sequences using tools available on the internet. Then apply what you learn to the particular cDNA sequence you have been given for your lab project.

1. SEQUENCE ANALYSIS TO IDENTIFY GENES AND LOCATE NEW DRUG TARGETS
Many genes that are discovered are similar in some regions to previously studied genes or proteins.
Discovering these similarities using computer analyses saves time and money and can give companies a competitive edge in identifying new products and possible drug targets.

A. TUTORIAL: HOW DO I FIND THE SEQUENCE FOR A PARTICULAR GENE?
For the first part of the tutorial, go to the following web site: http://www.ncbi.nlm.nih.gov/

This is the site for the National Center for Biotechnology Information. It provides access to software and databases for DNA sequence analysis, human genetic diseases and mapped human genes, and allows you to search for scientific articles in PubMed.

Under the top panel where it says Search, select nucleotide from the pull-down menu, ENTER THE ACCESSION NUMBER FOR THE GENE YOU WERE ASSIGNED (highlighted in yellow or see your lab instructor) and click on GO.
NOTE: THERE IS A LIST OF THE ACCESSION NUMBERS FOR THE ZEBRAFISH cDNAS ON THE LAST PAGE OF THE TUTORIAL; do not use the Z number.
Click on the accession number (underlined and in blue) to open the database file.

Assignment Part 1.
a. Print out a copy of your assigned zebrafish cDNA sequence. To obtain a hard copy of your data, your computer must be connected to a printer. Click on the print button located in the top center of your web browser. If you do not have this shortcut button, you may click on file and then click print and ok.
b. Locate the following information about your sequence and circle or list it:
i) What does the cDNA encode? What is the function of the protein product? (Note: to fully answer this, you may need to do further searches.)
ii) Genus and species name of the organism the cDNA is from.
iii) The vector used for cloning the cDNA.

B. TUTORIAL: HOW DO I LOCATE SIMILAR GENES?
To locate similar genes, programs such as blastn are used. Blastn tries to match up your sequence with all the available DNA sequences stored in the databases. For example, you could use blastn to identify a human gene or cDNA related to the zebrafish cDNA you are characterizing.
These results then might help to solve the structure or suggest the function of a gene involved in a genetic disease.

From the NCBI home page double-click on BLAST in the top panel. You will be taken to the page: http://www.ncbi.nlm.nih.gov/blast/
Scroll down in the Nucleotide box to Nucleotide-nucleotide BLAST (blastn) (the third item under nucleotide) and click on it. In the large open box, type in the accession number of your cDNA. This is the sequence you are searching with.

You can now choose a database to search against. Each database stores a different subset of DNA sequences. Select “Others” (nr etc.). Your search is more likely to succeed if you choose one of the databases from the table below. Scroll down the list to select the desired database. Click on BLAST.

database      sequence subset
nr            All unique sequences (not always a good thing; can take a long time)
est           Expressed sequence tags (cDNAs) from all species
est_human     Human expressed sequence tags
est_others    Expressed sequence tags from species other than human or mouse (would have zebrafish sequences)
htgs          High throughput human genomic sequences (human genome draft sequences)

When the page comes up, click on Format! to view your results. Please be patient, as it will take several minutes at least to complete the search.

The results list the gene or cDNA records from the databases you searched. The colored bars indicate how much of your sequence matched each existing record. The sequences most similar to your sample appear first in the results. Red bars are good and indicate a high probability of homology. Weak homologies may also be found; these are indicated by short blue and black bars and are unlikely to be significant.

To obtain a hard copy of your data, your computer must be connected to a printer. Click on file on the upper left of your browser window.
In the dialog box, select pages and enter the specific pages you want to print (see below), then click print. DO NOT PRINT OUT THE ENTIRE FILE, AS THE OUTPUTS CAN BE 20 PAGES LONG.

Assignment Part 2.
a. Print out the first 3-5 pages of any search that produced a significant homology (red bars). You will not receive full credit for printouts that are not meaningful. Make sure the region of homology consists of at least 50 nucleotides and that the reported probability value is very low (10^-1 or smaller).
b. Give the sequence database you searched. What kinds of sequences are found in that database?
c. What did you find? Annotate your printout to identify sequences that pertain to one or two of the following questions.
- Did you find additional zebrafish cDNAs related to your cDNA?
- Did you find any zebrafish genes or genomic regions related to your cDNA?
- Did you find any human cDNAs related to your zebrafish cDNA? What does that mean?
- Did you find any human genes related to your zebrafish cDNA?
- Did you find any cDNAs from other species related to your zebrafish cDNA?

Assignment Part 3.
Find some additional information about your gene. For full credit, follow up on at least two different leads or carry out two different analyses from part a or b, and provide annotated printouts.
a. Use various links at the main NCBI site to find out more. For example:
- Search this site for the location of your gene on a particular chromosome in humans, zebrafish or other organisms using MapViewer. MapViewer can be found in the right hand menu. Be sure to use the protein name, not the accession number, in your search.
- Locate other information on the structure, function, map location, and/or association of the gene with a human disease using the link for OMIM (Online Mendelian Inheritance in Man) in the top panel.
- Find a reference to more information on the structure or function of the gene using the PubMed link in the top panel.
Use a keyword to search, such as the name of the protein product, not the accession number.
b. Use some additional software to analyze your sequence further, following the instructions below. Some other types of analyses include:
- Identifying restriction sites
- Locating open reading frames
- Designing PCR primers
TO CARRY OUT ANY OF THESE ANALYSES, YOU WILL OFTEN NEED TO CUT AND PASTE YOUR SEQUENCE INTO OTHER PROGRAMS.
Document your analysis by providing printouts of the results and include additional notes on what analysis was done and what you could learn from it.
To cut and paste your sequence: open your sequence file using the NCBI site. Use the mouse to highlight the beginning of your sequence and scroll down to the end of your sequence. Use control C to copy the sequence, or use copy from the pull-down menu. Use control V to paste the sequence into the large box that appears in the other program.

2. HOW DO I IDENTIFY RESTRICTION SITES IN MY DNA SEQUENCE?
In order to modify or further characterize a gene, you would probably want to identify useful restriction sites to serve as landmarks. There are a number of free programs available on the internet that will enable you to do restriction analysis.
Go to the NEBcutter site at the following URL: http://tools.neb.com/NEBcutter2/index.php
To analyze your DNA sequence, either type in the GenBank accession number in the appropriate place or cut and paste your sequence into the large box provided. Click on SUBMIT. A linear map of the sequence and the restriction sites will be displayed.
a. Print out a copy of the output from this program using the main options on the lower left. Print the GIF version unless your computer has an Adobe reader for viewing/printing the PDF file.
b. Unique 6 bp restriction sites within a gene or just flanking a gene are often among the most useful. Use the lower menu list to identify 1 or more unique sites in the sequence (1 cutters). This page can also be printed.
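To make the idea of a unique site (a "1 cutter") concrete, here is a minimal Python sketch. It is not how NEBcutter works internally. The enzyme table is a small hand-picked subset of well-known 6 bp recognition sequences, the demo sequence is made up, and the function names are hypothetical. All four sites are palindromic, so scanning one strand is enough.

```python
# Toy scan for "1 cutters": enzymes whose recognition site occurs exactly once.
# Assumption: a tiny illustrative enzyme subset, not NEBcutter's full database.
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
    "PstI": "CTGCAG",
}


def find_sites(seq, site):
    """Return the 0-based start position of every occurrence of `site` in `seq`."""
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        # Step forward by one base so overlapping occurrences are also found.
        start = seq.find(site, start + 1)
    return positions


def unique_cutters(seq, enzymes=ENZYMES):
    """Map enzyme name -> site position for enzymes that cut exactly once."""
    result = {}
    for name, site in enzymes.items():
        hits = find_sites(seq, site)
        if len(hits) == 1:
            result[name] = hits[0]
    return result


# Made-up 40-base sequence: one EcoRI site, two HindIII sites, one BamHI site.
demo = "ATGGAATTCCGTAAGCTTGGCTAAGCTTACGGATCCTTAA"
print(unique_cutters(demo))
```

In this demo HindIII is excluded because its site occurs twice, and PstI because its site does not occur at all.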
Circle on your printed map, or list, an enzyme that has a unique site in your DNA.

3. HOW DO I DETERMINE WHERE THE PROTEIN CODING REGION IS ON MY cDNA?
Sequence utilities enable you to carry out simple operations on the DNA, such as translating the sequence, identifying repeated regions, selecting PCR primers, or determining the complementary sequence. We will use a web interface at the Baylor College of Medicine that offers these utilities as well as access to a variety of other programs, such as BLAST. We will use this site to translate a DNA sequence conceptually in all three reading frames in both directions (six frames in total). This identifies “open reading frames” (ORFs), which are the possible regions that encode proteins. Since we are translating cDNA sequences, the coding region for the protein should not be interrupted by introns.

To open the BCM Search Launcher, go to http://searchlauncher.bcm.tmc.edu
Where it says choose a type of search from the pull-down menu, scroll down, and then click on sequence utilities. From the BCM Search Launcher, choose the 6 frame translation to locate possible open reading frames in the sequence. Click on the [O] box on the right. Copy and paste your sequence into the box, then scroll down and click on perform conversion.
The correct frame for the protein is usually the largest segment of continuous amino acids without a stop (*) codon, usually beginning with M (methionine). Print out this analysis and circle or highlight the largest open reading frame among the six translations.

4. HOW COULD I DESIGN PCR PRIMERS UNIQUE TO MY cDNA?
Often a smaller region of a cDNA sequence, for example just the protein coding region, is needed. There are many software tools for designing primers once a DNA sequence is known. The program we will use is called PrimerQuest, and is available from a commercial supplier of primers, Integrated DNA Technologies.
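A primer pair defines the region that PCR copies, and therefore the size of the product. The sketch below illustrates that relationship only. It does not design primers and is not PrimerQuest's method. The template and the 8-base primers are made up (real primers are usually longer), the function names are hypothetical, and exact matching is assumed.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def amplicon(template, forward, reverse):
    """Return the PCR product for a primer pair, or None if a primer does not match.

    Both primers are written 5'->3'. The forward primer matches the template
    strand as written. The reverse primer anneals to that strand, so it is its
    reverse complement that appears in the template text, downstream of the
    forward primer.
    """
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


# Made-up template: 4 flanking bases, forward site, 20-base insert,
# reverse-primer site, 8 flanking bases.
template = "GGCC" + "ATGCGTAC" + "GTTAGCAAATTTGGGCCCTT" + "GACCGATA" + "GCTAGGTT"
forward_primer = "ATGCGTAC"
reverse_primer = "TATCGGTC"  # reverse complement of GACCGATA

product = amplicon(template, forward_primer, reverse_primer)
print(len(product), product)
```

The product runs from the first base of the forward primer to the last base of the reverse primer's binding site, so its length includes both primers.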
Go to http://www.idtdna.com/SciTools/SciTools.aspx
Select the program PrimerQuest from the menu on the left (fourth item down). Type in a name for your sequence, select PCR detection from the application list, and either type in your accession number where it says NCBI ID# and click on “GET SEQUENCE” or cut and paste your sequence into the box. Make sure the option “Design for PCR primers” is selected, keep the default setting for USE PARAMETER SET as “PCR primers”, and click on the button below labeled “CALCULATE”.
a. Print the first page of the output, with the first set of designed primers. The first set of primers is the one the software judges most suitable for amplifying a portion of your cDNA from sites within the zebrafish sequence.
b. Circle on the printout the size of the PCR product that would be produced in PCR using this primer set on your zebrafish cDNA template.

SUMMARY OF ZEBRAFISH cDNAs

Clone ID    Clone ID    GenBank Accession #
Z1          Z27         AW466657
Z2          Z28         AW466660
Z3          Z29         AW423262
Z4          Z30         AW423173
Z5          Z31         AW423225
Z6          Z32         AW423235
Z7          Z33         AW423239
Z8          Z34         AW466513
Z9          Z35         AW422974
Z10         Z36         AW466671
Z11         Z37         AW423023
Z12         Z38         AW423266
Z13         Z39         AW422897
Z14         Z40         AW466555
Z15         Z41         AW466529
Z16         Z42         AW466686
Z17         Z43         AW466677
Z18         Z44         AW423006
Z19         Z45         AW466503
Z20         Z46         AW466541
Z21         Z47         AW466689
Z22         Z48         AW422876
Z23         Z49         AW422883
Z24         Z50         AW423264
Z25         Z51         AW422931
Z26         Z52         AW422881

PLEASE NOTE: Keep in mind that the sequence file does not necessarily represent the cDNA you worked with in the PCR lab and that the sequence is not of the full length cDNA insert.

Assignment: Annotated computer results (10 pts). Hand in printouts with labels indicating what they represent to receive credit for this lab. Due by last day of classes.
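As a supplement to section 3, the six-frame translation and the "largest segment without a stop, beginning with M" rule can be sketched in Python. This is an illustration under stated assumptions, not the BCM Search Launcher's code. It uses the standard genetic code, the function names are hypothetical, the demo sequence is made up, and a segment that runs to the end of the sequence without a stop is still counted (cDNA clones are often partial).

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement; characters other than A/C/G/T are left as-is."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


def longest_orf(seq):
    """Return (frame, peptide) for the longest stop-free stretch starting at M."""
    best = ("", "")
    for frame, protein in six_frames(seq).items():
        for segment in protein.split("*"):
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best[1]):
                best = (frame, segment[start:])
    return best


# Made-up sequence: a short ORF (ATG GCT AAA GGT TGA) placed in frame +3.
demo = "CC" + "ATGGCTAAAGGTTGA" + "GT"
for frame, protein in six_frames(demo).items():
    print(frame, protein)
print(longest_orf(demo))
```

On a real cDNA, the frame this picks should be checked against the printout from the 6 frame translation, since a long stop-free stretch is only a candidate coding region.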
