Exploring Protein Structure and Function Using Bioinformatics Tools

Adapted from: Biology 3055 Laboratory, April E. Bednarski, Sarah C.R. Elgin, and Himadri B. Pakrasi; Studying the Genetic Basis of Disease Using Web-Based Bioinformatics Tools. Available at http://www.nslc.wustl.edu/elgin/genomics/Bio3055/bio3055.html.

In its early days cell biology was primarily an observational science, with microscopy playing a vital role in our early understanding of how cells work. With the discovery of the central role played by genes and proteins in cell structure and function, cell biologists became interested in the roles played by these macromolecules in cells, and thus molecular cell biology was born. While important discoveries are still made by 'looking' at cells using microscopy (particularly with fluorescent stains), current cell biology research often involves studying cell function at the molecular level.

Elucidation of a protein's function can be approached from several angles. Often a protein of interest is identified based on an observed mutant phenotype. Using a genetic approach, the gene responsible for the mutant phenotype is identified and cloned. The gene can then be mutated and the effect of the mutation evaluated within an isolated cell, in a model organism, or in vitro. Another approach is to purify a protein based on some biochemical characteristic and study its function under various conditions in vitro. Additionally, a partial amino acid sequence can be obtained for the purified protein and used to 'fish' out the gene. Until recently most cell biology researchers would focus on a single cellular process and try to identify and characterize the proteins involved in that process. In some cases dozens of labs or more have dedicated decades to the study of a single cellular process or even a single protein.
The advent of powerful techniques for studying proteins and genes, such as automated DNA sequencing, high-throughput protein identification methods and protein structure determination, has exponentially increased the information available to cell biologists. Currently NCBI reports that sequencing has been completed for 381 species, and another 805 projects are ongoing (NCBI, July 26, 2006). Of course the bulk of these are bacterial genomes, which are relatively small, but almost 20 percent of the genomes fully or partially sequenced are from eukaryotic species. As of July 26, 2006 the RCSB Protein Data Bank was reported to contain 37,874 protein structures. For the cell biologist the potential uses of this massive and growing body of sequence and structural information are very exciting. No longer are we tied exclusively to painstaking genetic approaches of hunting down a single gene or isolating a single protein for study. Now, in the age of 'omics', we can consider studying a species' entire genome (genomics), all the mRNA expressed in a single cell (transcriptomics), or even all the proteins present in a single cell at one time (proteomics). It is difficult to adequately summarize the ways in which these new data are currently being exploited to understand cellular processes. In the next few weeks we are going to access a fraction of the resources freely available to anyone with a computer and an internet connection. Hopefully by the end of these labs you will have an appreciation for the resources available and how to use some of them.

Project Scenario

Many diseases are caused by proteins with amino acid sequences different from what we might call the 'normal' protein. The mutant protein may no longer be able to function at all, or it may have a reduced or altered function. Often these disease-causing proteins differ by only a single amino acid from the non-disease-causing form.
Understanding how the mutation causes disease can lead to a better understanding of the 'normal' protein's function and may lead to new ideas for treatment of the disease. When we talk about 'normal' and 'mutant' genes or proteins, it is important to remember that polymorphisms which do not cause disease may exist among 'normal' proteins. Depending on the particular job of a particular amino acid in a given protein's tertiary structure, some amino acid substitutions will conserve the protein's function, but others will not. We know this is the case because homologous proteins in distantly related species are not 100% identical, yet they have the same function.

There is disagreement in the scientific community as to what constitutes a normal protein and a mutant protein. Some consider only the most prevalent sequence of a protein to be the 'normal' protein and all other forms to be mutants, even if they are not associated with disease. Others think there must be evidence of altered protein function for an allele to be called a mutant gene. We will get around such distinctions by calling the most prevalent sequence normal and referring to any alternate sequence as a variant. A variant may have as little as a single amino acid substitution (the amino acid normally seen in the sequence is changed to another amino acid), or it may have gained or lost one or more amino acids.

The central goal of this project is to predict whether a variant protein sequence is likely to result in an observable alteration of the protein's normal function. Such an alteration could include some reduction in activity, a change that gives the protein a new activity, or a complete loss of detectable activity. To make this prediction you will need to look at a variety of available information concerning the normal protein's structure and function, paying particular attention to the amino acid which is altered in the variant sequence.
A summary of the many aspects of the normal and variant sequences which you will need to consider is given below (you may not have access to all of this information for your protein).

First, about the 'normal' protein:
• What is the cellular environment?
• What does it do? Is anything known about the amino acids involved in its function? (Such information could be based on mutational studies or structural studies.)
• What protein family does it belong to? (This gives you functional information.)
• Are any post-translational modifications expected?

Then, about the particular amino acid which is variant:
• Is the position conserved across orthologues and/or paralogues? (Conservation suggests the amino acid is important to the protein's function.)
• Where does the amino acid occur in the secondary structure? Could the mutation affect the secondary structure?
• Is the amino acid involved in protein function? For example, is it part of an active site, and might it make contacts with a substrate?
• If the structure of the protein or a closely related protein has been resolved, what does it tell you about the role of the normal amino acid in the protein's structure and function?

Biol 288 2006

In the process of working through the above considerations, you will be exposed to many bioinformatics resources and tools. You will be given the amino acid sequence for a variant protein and the name of the corresponding normal protein, and you will then follow these steps.

1. Look up the normal protein in the secondary databases Gene (at NCBI) and SwissProt (at EBI). Read about the disease associated with your protein in the OMIM database. Depending on your project you may read some scientific literature concerning the role of your protein in normal and diseased cells.
2. Use the BLAST tool at NCBI to find five homologous sequences in other species.
3. Create a multiple sequence alignment (MSA) of the normal and mutant human protein sequences and the five homologous sequences.
4. Look at a secondary structure prediction from the program PSIPRED to see what structural role the normal and mutant amino acids may play.
5. Use the three-dimensional structure of either your normal protein or a closely related protein to help predict what role the normal and mutant amino acids may have. This will require the use of a program called DeepView (installed on the computer hard drive) and the resolved structure from the Protein Data Bank (PDB).
6. If your protein is an enzyme you will check the KEGG database to see what is known about its role in the biochemistry of cells.

Much of the above requires that you have some understanding of amino acid properties. I have placed a paper written by Betts and Russell (2003) on WebCT. These authors provide a very good overview of the thinking process that goes into diagnosing the possible effects of an amino acid substitution on protein function. You will also be completing a tutorial on protein structure.

How the labs will work

For the bioinformatics labs I have used the symbol to indicate steps you need to carry out on the computer. All questions you need to answer are numbered sequentially. Extra information will be found in side boxes.

Each student will be assigned one of several alternative projects to work on. Each alternative project deals with a different mutant protein involved in a different disease. We will be spending three to four weeks in a computer lab working on various aspects of your projects. Throughout the course of the laboratories you will be completing several in-lab assignments which must be handed in before you leave. Additionally, there will be a few pre-lab assignments that you will have to hand in before a particular lab begins (some will require extra reading). Finally, you will prepare a short summary paper to hand in after all labs are completed.
The details for all assignments and the summary paper will be handed out before we begin the first lab. Each student is expected to complete their own assignments independently; however, you are encouraged to consult with your peers during labs. Additionally, it is very important that each student carry out all their own computer work! You will not learn how to use the various tools if you do not ‘drive’ the mouse yourself. For those of you who are a bit afraid of the computer: we are here to help, so please do not be embarrassed to ask us or your neighbor for help. Bioinformatics is a “learn by doing” enterprise, so don’t be afraid to dive in.

All databases we will use are easily accessible via the web, and all tools we will use are freely available. Some are used directly in a web browser and some are programs which are downloaded and installed locally, but they are free. The computers in the lab have been custom set up for us. Some of the work we will be doing can be done from any computer on campus, but for other jobs you can only use the specially set up computers. There are instructions on WebCT for how to set up your own computer to do the same things.

You will have two handouts for the bioinformatics labs. The thicker one (this one) contains the information that applies to all projects, and the second is a guide sheet that applies to your particular project. The assignment questions in the manual and guide sheets are numbered sequentially. Either answer the questions directly in a separately posted Word file (see WebCT) or, if you prefer, write the answers on loose leaf.

Saving and accessing your files

By now I hope you are familiar with your network drive (the I drive). It is usually best to save files on the desktop while working on them and then move them to the ‘I drive’ at the end of the lab. Create a desktop folder titled “Bioinformatics yourname”.
Save all local and downloaded files here during the lab. At the end of the lab, move this folder to your “I drive”, or burn it if that works better for you. When you come back for the next lab, move the folder back to the desktop and continue your work.

Laboratory Checklist

Beginning of lab:
• Hand in any pre-lab assignments.
• Log in to your computer and transfer your folder from the I drive to the desktop.
• Open the Firefox browser and go to the web page http://uregina.ca/~lintottl/288_bioinformatics_homepg.html (link available from WebCT or my web site). Make this your home page (Tools>Options>General>HomePage, then click "use current page"). This will make it easy for you to get back here. It helps to use tabbed browsing; this way you can have files from several databases open at once and quickly switch back and forth to find information.

During the lab:
• Answer the lab assignment questions as you go. You have two options:
  1. Type your answers in as you go and save the Word doc to your desktop folder. At the end of the lab print out the pages you have completed and hand them in. Make sure you have put your name on the sheet (you can do this in the header if you know how, or you can write it on afterwards).
  2. Write out your answers on loose leaf.
• Save any new files to your desktop folder.

End of the lab:
• Move your desktop folder to your I drive. This is very important; otherwise you will lose your work.
• Hand in your lab assignment (check your lab schedule for which questions are due each lab).
• Log out.

Databases and Bioinformatics Tools

There are countless bioinformatics resources available today, and we will simply not have time to learn to use even a few of them in any depth. Instead, we will give you a very brief background and focus on using only a few databases and tools to achieve our specific goal. There are two basic components of bioinformatics resources: databases and mining tools.
Databases can be divided into two types: primary databases contain actual sequence (or other raw) data obtained from experiments, while secondary databases are curated (they include interpretations of the primary data). Primary sequence data usually refers to nucleotide sequences, which could be of genomic or cDNA origin. Protein sequence databases contain protein sequences primarily obtained by conceptual translation of DNA sequences (i.e. protein sequences are not usually obtained by direct sequencing of the polypeptide). There are hundreds of bioinformatics databases available online (see the Jan 2006 issue of Nucleic Acids Research for a glimpse of what is available), but fortunately there are two major centers where most of the databases are gathered together (or linked to) for easier access: in the United States at NCBI (National Center for Biotechnology Information), and in Europe at EBI (European Bioinformatics Institute). NCBI and EBI have similar but not identical resources. Both have essentially the same primary databases (indeed they exchange primary sequence information on a daily basis, so that any sequence information submitted to one organization will be available from both). The data mining tools used at the two sites are different, and each site builds its own secondary databases. We will primarily use NCBI resources, although we will look at some EBI resources as well.

Guided overview of resources and tools we will use

NCBI resources

Go to the NCBI web site home page. The search bar that you see is NCBI's Entrez search tool. Entrez allows you to search any or all available databases by selecting the database of choice from the drop-down list. Quick links to some often-used sites are above the search window.

Click the helix from any NCBI page to get to the home page.
Quick link toolbar: here is where you find the link to BLAST.
Search toolbar: this is the gateway to NCBI's search tool, called Entrez. You can search all databases at the same time for any term.
Or, you can narrow your search to a single database. Just select the database you want from the drop-down menu and enter your search term.

Click on the link "All Databases". Here you will see a list of all databases available at the NCBI site. Let's learn a little about the databases and tools we will be using. Click the link “Protein”.

Protein database at NCBI

1. What are the sources for sequences available in this database? Define any acronyms (find them with Google: type in the acronym followed by "database" to quickly find the source).

Gene database at NCBI

Go back to the database page and click on Gene. In the left column you should see an "about" link which will take you to the paper Entrez Gene: gene-centered information at NCBI (Maglott et al. 2005). Read the abstract and introduction to answer the following questions.

Database Redundancy
Sequences are submitted by the originating lab directly to databases such as GenBank or SwissProt. It is not unusual for the same sequence to be submitted more than once. For example, different labs may submit sequences for the same gene; one might contain slightly more sequence than the other, and the two entries will each get their own accession number. Later, a genome sequencing project submits the sequence for an entire chromosome, including the gene which has already been submitted twice, resulting in three separate entries for that gene. For historical reasons it is important that all sequences be maintained, even though this creates redundancy in the databases. The RefSeq project was begun to deal with the problem of redundancy.

2. Is this a primary or a secondary database? Explain your answer.
3. What subset of known sequences is included in this database?
4. What is a RefSeq? Go to Books (in the toolbar menu at the top of the screen) > The NCBI Handbook, and search for "RefSeq project".

Search Gene for the PTGS1 sequence.
Take a quick look at the kinds of information available from this site. Answer the following questions.

5. What is the official name of this gene?
6. What types of reactions does it catalyze?
7. In what metabolic pathway does this enzyme function?
8. How many isoenzymes of PTGS are there?
9. Where is this gene located in the human genome?

OMIM

Go back to the database listing and select OMIM. Answer the following questions (hint: you may have to look around a bit to find all the answers).

10. What information is available at this site?
11. Is this a primary or a secondary database? Why?
12. What book is the information in OMIM based on?
13. Would you expect to find information on an infectious disease such as herpes in this database? Why or why not?

BLAST (at NCBI)

Go back to the NCBI home page and click on BLAST.

14. What does this acronym stand for and what is it used for?

Notice there are several different BLAST tools; we will focus on the tools for searching protein sequences. Protein BLAST (blastp) searches the protein database for matches to an input amino acid sequence (in FASTA format). Select blastp and, on the input form, click on the word "search"; this should take you to information on how to input a sequence into blastp.

15. Describe what FASTA format is. It is very important to understand FASTA format, as it is the input format most commonly required by databases and tools.

EBI resources

Click on the EBI link on WebCT. You should see a tree map showing most of the resources at EBI, some of which look similar to NCBI offerings. We will be using the EBI resources Swiss-Prot and ClustalW. See the 288 bioinformatics web site for detailed guides on reading output from Swiss-Prot and other databases.

Swiss-Prot

Swiss-Prot is listed under protein databases. Click on "UniProtKB/Swiss-Prot" to get a description of this database.

16. Is this a primary or secondary database? Why?
17. Would you expect to see the same protein sequence represented more than once? Why or why not?
ClustalW

ClustalW is a program used to align two or more sequences. Such alignments can be used for a variety of purposes, such as investigating the evolution of a protein and determining which regions of a protein are most critical for function. There are many alignment programs available; which one you use depends on what you are trying to determine. For our purposes ClustalW will work well and is the easiest to use. To access ClustalW go back to the EBI resource list. ClustalW falls under Toolbox tree > sequence analysis. Once you get to ClustalW click the link “new users please read the FAQ”. Answer the following questions:

18. What basic information can be gleaned from a multiple sequence alignment of related proteins?
19. What are three ways the information gleaned can be used?
20. What types of sequences can be aligned by ClustalW?

There is an example ClustalW input page under Guides on the bioinformatics web page. Note that there are many modifiable input fields; for the most part we will use the default settings. Please do not change any settings unless you are clearly told to do so. As for BLAST, the sequence input will be in FASTA format; make sure you know what this means.

Other resources

PSIPRED protein structure prediction server

This server is an input point for several different programs that predict protein structure based on the primary amino acid sequence. We will be using this site to predict the secondary structure of the normal human proteins. Because obtaining output from this site can take hours to days, I have already obtained the predictions, and they are available from WebCT. If you want or need more information on the algorithms behind the prediction, or on interpreting the output, click the "more…" button on the PSIPRED home page.

RCSB Protein Data Bank

This is the database where all coordinate files for tertiary protein structures are stored.
DeepView

DeepView is a program used to both view and manipulate three-dimensional macromolecule structures. Of all the bioinformatics tools and programs we are using, this is the only one that actually resides on your computer. We are using this particular program because it offers one important feature that other free programs do not: it allows us to mutate an amino acid and investigate what the new protein will look like. DeepView is a very powerful program which is a bit harder to use than other structure viewers, but we will guide you through the steps you will need. If you would like to learn to use DeepView in more depth, see Gale Rhodes’s excellent tutorial at http://www.usm.maine.edu/~rhodes/SPVTut/index.html.

The GIMP

This is a fully featured open source program for image manipulation. This program has all the power of Photoshop, but the price tag is much easier to take. If you are already familiar with Photoshop it can take a bit of time to learn the GIMP, as the commands often have different names, but there is plenty of help on the Web. We will be using the GIMP to take screen shots of DeepView structures, and I will provide step-by-step instructions for how to do this.

Useful Links (available on web site)

EBI resources: http://www.ebi.ac.uk/services/
NCBI resources: http://www.ncbi.nlm.nih.gov/Sitemap/index.html
PSIPRED server: http://bioinf.cs.ucl.ac.uk/psipred/
RCSB-PDB: http://www.rcsb.org/pdb/Welcome.do
GIMP home: http://www.gimp.org/

References

NCBI, Entrez Genome Project. Genome sequencing projects statistics. Available at http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html, accessed July 26, 2006.
RCSB Protein Data Bank. An information portal to biological macromolecular structures. Available at http://www.rcsb.org/pdb/Welcome.do, accessed July 26, 2006.
Maglott, D., J. Ostell, K. D. Pruitt, and T. Tatusova. Entrez Gene: gene-centered information at NCBI.
Nucleic Acids Res. 2005 January 1; 33(Database Issue): D54–D58. Available online at http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=15608257.

Finally, let’s start your projects

My project name _____________________________________________________.

Please keep your project sheets handy.

Step 1. Getting sequences and learning about your protein/disease

The mutant sequence

Go to your project page on WebCT. Download your mutant protein sequence to your desktop folder (right-click the link, choose "save as" and save to your desktop folder).

The NCBI Gene entry for your normal protein

Go to the NCBI homepage. Select “Gene” from the database pull-down menu. In the search box type the name of your protein as given in your guide sheet. Answer the questions (hint: go to the summary):

21. What is the official Gene symbol?
22. What is the gene name?
23. What is the GeneID number?
24. Where in the human genome is this gene located?
25. What is the function of the protein? (For example: enzyme, structural, cell signaling, etc. A protein may have more than one function.)
26. What specific role does it play in the cell? (For example: if an enzyme, what type and what reaction does it catalyze; if structural, what structure.)
27. What is the RefSeq accession number for the mRNA of this gene?
28. What is the RefSeq accession number for the protein sequence (it is the product of the gene)?

Get the sequence of the normal protein

Click the RefSeq protein (product) accession number; this will open the GenBank file specific to this protein. At the top of the file is a drop-down menu called “Display”; click the arrow and choose FASTA. This is the sequence of the normal protein in FASTA format, ready to copy and paste. Copy the sequence and paste it into a new Word file. Make sure you have included the line that begins with the > sign. Save this new Word file in your desktop folder as a text file, under the title “normal myproteinname”.
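As an optional check that a saved text file really is in FASTA format (a header line beginning with > followed by one or more sequence lines), a short Python sketch along the following lines could be used. This is only an illustration, not part of the lab procedure: the function name read_fasta and the example header and sequence are made up for the demonstration, and in practice you would read the text from your own saved file.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs.

    The header is returned without the leading '>' character, and the
    sequence lines belonging to one record are joined into one string.
    """
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before any '>' header line")
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical example input; the header and residues are placeholders.
example = """>example_protein hypothetical header text
MKTAYIAKQR
QISFVKSHFS
"""

records = read_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

If the header line was accidentally left out when copying, the function raises an error instead of silently accepting the file, which is the mistake the instructions above warn about.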
The Swiss-Prot entry for your normal protein

Open a new tab in your browser and go to the EBI site (by using tabbed browsing you can keep all your different protein records easily accessible). Under Protein Databases select the UniProtKB/Swiss-Prot database, then scroll down a bit until you see "Access the UniProtKB/Swiss-Prot Database". Click the link to get to a search page. Make sure you use the second search menu, UniProt Power Search, and follow your guide sheet instructions for specifics on what to search for.

29. What is the SwissProt accession number? (Hint: for proteins it usually starts with a P.)
30. What are the alternate names for this protein?
31. Where in the cell is this protein located? (If it is known it will be in the comments; if it is not listed your answer is “unknown”.)
32. Does this protein exist as a monomer, a dimer (homo or hetero), etc.?

OMIM entry

Find the entry for your disease in the OMIM database: either search the OMIM database directly, or navigate to the OMIM entry for this disease by clicking the OMIM link that appears to the right of the Gene entry you just looked up.

33. What protein is deficient in the disease you are studying?

Project Specific Reading

In addition to the above database information you may have some reading material for your disease.

34. In your project-specific manual there will be a list of questions specific to your disease. These questions can be answered using any of the databases or readings you are already familiar with. Answers to the project-specific questions are due at the beginning of next week’s lab. You may need to finish these up outside of lab time.

Step 2. Finding related sequences and setting up a multisequence FASTA file

To begin our analysis of the variant protein we need to determine where the mutation lies in the protein, and how conserved this region of the protein is among homologous proteins from other species.
To do this we will first need to find homologous sequences, using the NCBI BLAST tool to search a non-redundant protein database.

Go to the NCBI home page. Click on the BLAST link. Under the protein subheading, choose protein-protein BLAST (blastp). This will take you to a submission form. Open the Word file containing the normal protein sequence. Highlight and copy the sequence, including the header (>……). Paste the sequence into the large window next to the word ‘search’. We are not changing any default settings, so you can simply click the “BLAST” button. Very shortly you will be presented with a window indicating your request has been submitted. Click the Format button. In a few seconds a new window containing the results will open (the time depends on the sequence you submit and how busy the server is).

In the appendix of this guide is a picture of a BLAST output with the various parts highlighted. Take some time to familiarize yourself with what you see. (I recommend you save this output in your desktop folder as an HTML page for future reference.) For our purposes here the graphical view and the summary list of hits will not be very useful. Scroll down until you get to the first alignment (this may be a fair distance if you have a lot of hits). By default the alignments are presented from highest to lowest score, so as you scroll down you will find hits with progressively less similarity to your query sequence.

Go through the alignments looking for sequences from different species which have < 100% identity. Immediately following the sequence name, in square brackets, is the species name. Often, if the sequence is human, no species name will follow. When you find one that looks promising, click on its accession number. This will take you to the GenBank entry for that sequence. Look at the organism name to make sure it is not human (clicking on the species name will take you to a taxonomy page with more details).
If you decide this sequence is a good choice, select FASTA from the display menu (at the top). Copy and paste the sequence to a new Word doc and save it as text only, under the title “myproteinname related seq”. Make sure you get the header (>…..). Go back to the BLAST results (use the browser back arrow) and continue looking for homologous sequences from a variety of species, adding the FASTA formatted sequences to the Word doc. Try to obtain sequences from distantly related species (i.e. choose only one rodent sequence, and try to get at least two non-mammalian sequences). The more distantly related the homologous sequences are, the more informative the MSA will be. Once you have five non-human sequences, add your variant and normal protein sequences to the beginning of the list of sequences in your Word doc.

To make for a nicer ClustalW output, we are going to modify the FASTA title for each sequence. ClustalW names each sequence in the output file according to the FASTA title (including all the characters up to the first space in the title). If you look at your default FASTA titles, as a whole they are quite informative, but somewhat ugly. (Will gi|109019190|ref|XP_001107538.1| mean anything to you once you print out the alignment?) For each sequence insert a common name, such as monkey, right after the > character, and leave a space between the new name and the existing title. Make sure this common name is unique for each sequence, or ClustalW will not carry out the alignment (it thinks you have given it the same sequence twice). Label the first two (human) sequences mutH and norH.

35. Make sure you have saved the sequence file as a ‘text only file’. Print the file and hand it in with today’s lab assignment.

Step 3. Creating the MSA using ClustalW

From the EBI home page, select ClustalW (under Toolbox>SeqAnalysis).
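The header relabelling described in Step 2 (a short, unique common name inserted right after the > character and separated from the existing title by a space) can also be sketched in Python. This is a minimal illustration only; the function name relabel_fasta, the example headers other than the gi|109019190|ref|XP_001107538.1| title quoted above, and all of the short sequences are placeholders invented for the example.

```python
def relabel_fasta(records, labels, width=60):
    """Return FASTA text in which each header starts with a short unique label.

    records: list of (header, sequence) pairs, header without the leading '>'
    labels:  one short name per record, e.g. "mutH", "norH", "monkey"
    width:   number of residues written per sequence line
    """
    if len(records) != len(labels):
        raise ValueError("need exactly one label per sequence")
    if len(set(labels)) != len(labels):
        # Duplicate names are the problem the manual warns about.
        raise ValueError("labels must be unique")
    lines = []
    for label, (header, sequence) in zip(labels, records):
        if not label or any(ch.isspace() for ch in label):
            raise ValueError(f"label {label!r} must be non-empty and contain no spaces")
        # New name, one space, then the original title.
        lines.append(f">{label} {header}")
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Placeholder records: variant and normal human sequences first, then a homologue.
records = [
    ("variant_header placeholder", "MKTAYIVKQR"),
    ("normal_header placeholder", "MKTAYIAKQR"),
    ("gi|109019190|ref|XP_001107538.1| placeholder", "MKTAYLAKQR"),
]
labels = ["mutH", "norH", "monkey"]

relabeled = relabel_fasta(records, labels)
print(relabeled)
```

Because only the characters up to the first space are used as the sequence name, the check for spaces inside a label keeps each name intact in the alignment output.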
See the Guides on my web site for an example input form for ClustalW (keep in mind the input page sometimes changes, but all the fields I describe should still be there). Use this to help you set up the job. You can get your set of FASTA formatted sequences to ClustalW in one of two ways: either copy the sequences from your Word file to the ClustalW input window, or upload your file by browsing to it (in the area below the window you paste into). Change OUTPUT ORDER to “input” instead of “aligned”. This change will keep the human sequences at the top of the alignment; trust me, this will work better. Other than that, leave the default settings alone and click Run to submit. You will get a new window indicating your job is running. If there are any problems with your input they will show up fairly soon. You can go ahead and work on other things while the job is running, but don’t close the running window.

When the MSA is completed, it will pop up in the run window. Make sure there is only a single amino acid difference between your normal and mutant protein. (If you see more than one difference, you have probably got the wrong normal sequence.)

Save the output. First, click on the button View Alignment File (scroll down to the alignment to see this button). In the new window, select and copy the entire alignment. Paste it into a new Word file. It will look a bit strange; do not panic. Select all and change the font to Courier, then change the left and right page margins to 0.75 inches (File>Page Setup). The alignment file should now look like it did on the web site. Adjust the page breaks so that the alignment lines stay together: all seven sequences and the row under them (stars and dots) need to stay together, so don’t let them split across pages (see example below). Don't forget to save the file.
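The check that the normal and mutant proteins differ by only a single amino acid can also be done directly on the two sequences. The Python sketch below is an illustration under simple assumptions: the two sequences have the same length (a substitution, not a gained or lost residue), and each difference is reported as wild-type residue, position, variant residue using three-letter codes. The function name find_substitutions and the two short sequences are placeholders invented for the example.

```python
# One-letter to three-letter codes for the 20 standard amino acids.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def find_substitutions(normal, variant):
    """List every position where two equal-length sequences differ.

    Positions are numbered from 1. Residues outside the 20 standard
    amino acids are reported as 'Xaa'.
    """
    if len(normal) != len(variant):
        raise ValueError("sequences differ in length; align them before comparing")
    changes = []
    for position, (wild, mutant) in enumerate(zip(normal, variant), start=1):
        if wild != mutant:
            wild_name = THREE_LETTER.get(wild, "Xaa")
            mutant_name = THREE_LETTER.get(mutant, "Xaa")
            changes.append(f"{wild_name}{position}{mutant_name}")
    return changes


# Placeholder sequences that differ only at position 7 (A in normal, V in variant).
normal = "MKTAYIAKQR"
variant = "MKTAYIVKQR"

changes = find_substitutions(normal, variant)
print(changes)  # prints ['Ala7Val']
```

A result with more than one entry would suggest, as noted above, that the wrong normal sequence was used.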
monkey  YGENCSTPEFLTRIKLLLKPTPNTVHYILTHFKGFWNVVNNIPFLRNAIMSYVLTSRSHL 360
dog     YGENCSTPEFLTRIKLYLKPTPNTVHYILTHFKGVWNIVNNIPFLRNTIMKYVLTSRSHL 109
        **************** *****************.**:*********:**.*********

36. Print the output to hand in at the end of today’s lab. Also answer the following questions.
37. What is the mutation? Write it in the format "Res123Res", where the first Res is the three-letter code for the amino acid in the un-mutated (wild type) protein and the second Res is the amino acid in the mutated protein. In place of "123" put the amino acid residue number of the mutation.
38. Is the mutation in a region of conservation? How do you know?

Interlude
In cells, proteins do not exist as a linear chain of amino acids; instead they assume a conformation that is dictated by the primary sequence. Before we begin to look at what the protein's secondary and tertiary structures can tell us, we need to review the properties of amino acids and what those properties mean for a protein's three-dimensional structure. Please link to the online tutorial Introduction to Protein Structure at http://www.paccd.cc.ca.us/instadmn/physcidv/chem_dp/chemweb/protein/index.html and answer the questions on the associated handout (we will give this out in lab and it is also available on WebCT). This tutorial must be completed and the question sheets handed in (with answers) before you can proceed to the next step. Additionally, please read sections 14.1 to 14.4.3 in Amino Acid Properties and Consequences of Substitutions (Betts and Russell, 2003) and answer the following questions.

39. What type(s) of amino acid(s) would you expect to predominate on the surface of the following kinds of proteins: An integral membrane protein? A protein that interacts with DNA? A soluble protein?
40. What two processes can give rise to a family of homologous proteins?
41. What are the names given to the proteins arising from each process?
42.
If you have access to a large family of homologous proteins, what might a region that is highly conserved across the family suggest?
43. What do we mean by post-translational modification?
44. What are the two most common types of post-translational modifications?
45. Why is it important to consider the possibility that a particular amino acid is modified?
46. Why is a single classification system of amino acid properties not satisfactory when considering the possible effects of an amino acid substitution on a protein's function?
47. What are three ways in which amino acids can be classified, as represented in the Venn diagram that integrates the various properties of amino acids (Taylor, 1986)?
48. What properties of amino acids are not represented in the Venn diagram?
49. Based on the Venn diagram of amino acid properties, what properties differ between the mutant and normal amino acid in your protein?

Step 4. Secondary structure prediction
So far you have considered possible consequences to protein function based on the properties of the normal and variant amino acid in the primary sequence. To further build your hypothesis you also need to consider the role of the normal amino acid in the secondary, tertiary, and even quaternary structure of the protein. Understanding the role the normal amino acid may play in protein structure will help to narrow down the possible effects of the mutation. Unfortunately, the bulk of proteins have not yet had their structures resolved, but we can still consider possible structural roles by looking at the protein's secondary structure. Since secondary structure is dictated by the primary amino acid sequence, it stands to reason that even when a protein’s structure has not yet been resolved, we should be able to predict secondary structure.
However, as you have just learned, each of the twenty amino acids can play several roles in a protein depending on its position in the overall structure and its microenvironment (the neighboring amino acids). Despite these complications, several secondary structure prediction programs are available today, with reported accuracies of up to about 80%. We will be using the program PSIPRED. Secondary structure predictions take some time, so I have already obtained them and placed them on WebCT. On your project page, right click the secondary structure link and save it to your desktop folder. Open the file on your computer. Above the primary protein sequence are letters designating the predicted secondary structure: H = helix, E = strand (β-sheet), C = coil (random coil). On your MSA print out (or you could do this in Word), write the symbol H for helix and B for β-sheet above the alignment; write nothing for coils. See example below:

        H--------H B------------B H---HB--------B
monkey  YGENCSTPEFLTRIKLLLKPTPNTVHYILTHFKGFWNVVNNIPFLRNAIMSYVLTSRSHL 360
dog     YGENCSTPEFLTRIKLYLKPTPNTVHYILTHFKGVWNIVNNIPFLRNTIMKYVLTSRSHL 109
        **************** *****************.**:*********:**.*********

50. What is the secondary structure predicted for the region containing the mutation?
51. What type of secondary structure(s) is the most common in the alignment? If it seems an even mix between α-helices and β-sheets, state that.
52. Do you think that the mutation in your protein may alter the secondary structure? Why or why not?
53. How could you test your prediction?

Step 5. Analysis of the 3D structure
Background reading from Biochemistry 5th Edition (Berg et al.
2002)
Bond distances: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&db=books&doptcmdl=GenBookHL&term=bond+distance+AND+stryer%5Bbook%5D+AND+215023%5Buid%5D&rid=stryer.section.156#157
Conservation of tertiary structure: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&db=books&doptcmdl=GenBookHL&term=protein+tertiary+structure+AND+stryer%5Bbook%5D+AND+215485%5Buid%5D&rid=stryer.section.944#945

In this step we will look at the 3D structure of either the normal protein or a closely related structure. In order to look at a 3D protein structure we need two things: a file containing coordinates for the atoms in a structure, and a program that not only converts these coordinates into a 3D image but also allows us to manipulate the structure. It is important to remember when looking at 3D structures of proteins that what we are looking at is a model, based on experimental data. As with any scientific interpretation of data, the 3D protein structure can have errors. Generally structures are determined in one of two ways, either using X-ray crystallography to resolve the atom coordinates in a protein crystal, or using NMR to resolve structures in solution. Protein structure resolution is a very painstaking endeavor, and often a single structure resolution can take several person-years (or six years of a PhD candidate’s life). Regardless of the methods used, once the data is collected and analyzed and a satisfactory level of resolution is achieved, a file containing the atomic coordinates for most atoms in the protein (excluding hydrogens) is submitted to the Protein Data Bank (PDB) as a PDB file. Using the program DeepView, we will first investigate the structure of the normal protein, paying particular attention to the amino acid that is altered in the variant protein.
You should be able to see where the amino acid is located in the structure (surface, interior, active site or binding site) and to analyze possible interactions between the atoms of the ‘normal’ amino acid and other amino acids or ligands. Then you will mutate the amino acid, and look to see if any of the predicted interactions are affected.

Resolution
Like any measuring method, measuring the position of atoms in a protein has limits, dictated by the method and equipment used in the experiment. The accuracy with which the atoms of a particular structure are measured is referred to as its resolution and is usually given in angstroms. That is, if the stated resolution of a structure is 3 angstroms, and two atoms in the structure are stated as 6 angstroms apart, the actual distance could be as little as 3 and as much as 9 angstroms.

DeepView is a very powerful program, and learning to use it would take many hours which we don’t have. The following is a step-by-step guide to complete some very specific tasks. Get the PDB file name for your protein from your project guide. It should look something like 1AGP. Go to the RCSB PDB home page and enter the file name into the search window. Click search and you will be directed to the structure summary for this file. Note the following information:

54. What experimental method was used to obtain the data?
55. What is the resolution of the structure?
56. What species was the protein obtained from?
57. List any ligands, cofactors or metal ions included in the structure (particularly important for enzymes):
58. Answer any project specific questions.

Download the PDB file (click on Download, then click PDB, choose download and download to your desktop folder). Launch DeepView and then open the PDB file (File>Open PDB…). When you have opened the file, close the information window that pops up (we don’t need it). Your screen should have 3 floating windows, something like the figure below.
[Figure: DeepView screen showing the Toolbar, the Control Panel (if missing, open with Window>Control P..) and the Graphic Window.]

Summary of ToolBar buttons
Zoom & Center
Re-center: the display re-centers around the atom you select, and will rotate around that atom when you use the rotate tool.
Rotate, Transverse, Zoom: Tab will switch to the next movement tool.
Distance
Identity: the residue type and # will appear on the display for the atom you select.
Mutate: used to mutate a side chain (click the button, then the atom you wish to mutate, and select the new amino acid from the list; the new amino acid will appear in the lowest energy conformation).

Take a few minutes to try out some of the tools (don’t worry; the program will not let you alter the original structure file).

Control Panel
Chain column: click to select the entire chain.
Group names: amino acids are shown as the 3 letter code and residue #.
Secondary structure: h = helix, b = beta sheet.
Show group
Show amino acid side chain
Label group (code and #)
Show selected groups as ribbons
Checks indicate what is shown in the display.

59. Compare the actual secondary structure of your protein (shown in the Control Panel) with that predicted by PSIPRED. Note any differences.
60. PSIPRED states its predictions are ~80% correct. Do you agree this is a good estimate of the accuracy?

Since many of the DeepView steps you will follow depend on the particular structure you are looking at, the remaining instructions for completing step 5 are located in your project specific guide sheet.

Step 6. If your protein is an enzyme, look it up in the KEGG database
Search for your protein directly in the KEGG database, or link to the KEGG database from one of the other databases you have already accessed.

Final Report
You need to submit a final report in which you summarize what you have discovered about your variant protein and the disease it may cause.
Even though we did not do any bench work, we can still use a formal research report to report our findings. Since the end of the semester is nigh, we will leave out Materials and Methods, but if you were writing a real research report it would be vital to list all databases and tools you used and how you used them. As before, the guidelines below are in addition to the general guidelines for writing reports.

Required sections (page limit; weight)

Introduction (1–3 pages; 30%) to include:
As part of the background, describe the function of the normal protein and the disease caused when the protein is deficient.
This project is a study, not an experiment; therefore you will not have a hypothesis. However, you do have a specific goal. What is it, and how did you approach achieving this goal? (You don't need to describe details of how to use each tool; just give an overview of the steps.)

Results (30%) should include figures of the various data you collected:
The MSA with the predicted and actual secondary structures indicated above the sequence. Make sure to highlight the mutation (it is important to clearly identify all sequences used).
A DeepView screen shot of the whole protein, rendered in 3D.
A DeepView screen shot of the normal amino acid, showing its microenvironment.
A DeepView screen shot of the mutant amino acid, showing its microenvironment.

Discussion (2–3 pages; 30%)
Using all the information you gathered, propose a mechanism for how the mutation could cause a deficiency in the protein. How you justify your proposed mechanism is much more important than the actual mechanism. Can you propose any experiments that would allow you to test this mechanism?

References etc. (10%)
As per usual.