Exploring Protein Structure and Function
Using Bioinformatics Tools
Adapted from:
Biology 3055 Laboratory, April E. Bednarski, Sarah C.R. Elgin, and Himadri B. Pakrasi; Studying the Genetic
Basis of Disease Using Web-Based Bioinformatics Tools. Available at
http://www.nslc.wustl.edu/elgin/genomics/Bio3055/bio3055.html.
In its early days cell biology was primarily an observational science with microscopy
playing a vital role in our early understanding of how cells work. With the discovery and
understanding of the central role played by genes and proteins in cell structure and
function, cell biologists became interested in the roles played by these macromolecules in
cells and thus molecular cell biology was born. While there are still important discoveries
made by 'looking' at cells using microscopy (particularly using fluorescent stains), current
cell biology research often involves studying cell function at the molecular level.
Elucidation of a protein's function can be approached from several angles. Often a
protein of interest is identified based on an observed phenotypic mutation. Using a
genetic approach the gene responsible for the mutated phenotype is identified and
cloned. The gene can then be mutated and the effect of the mutation can be evaluated
within an isolated cell, a model organism or in vitro. Another approach is to purify a
protein based on some biochemical characteristic and study the function of the protein
under various conditions in vitro. Additionally, a partial amino acid sequence can be
obtained for the purified protein, and this sequence used to 'fish' out the gene.
Until recently most cell biology researchers would focus on a single cellular process and
try to identify and characterize proteins involved in said process. In some cases dozens
or more labs have dedicated decades to the study of a single cellular process or even a
single protein.
The advent of powerful techniques for studying proteins and genes such as automated
DNA sequencing, high throughput protein identification methods and protein structure
determinations has exponentially increased the information available to cell biologists.
Currently NCBI reports that sequencing has been completed for 381 species, and
another 805 projects are ongoing (NCBI, July 26, 2006). Of course, the bulk of these are
bacterial genomes, which are relatively small, but almost 20 percent of the genomes fully
or partially sequenced are eukaryotic species. As of July 26, 2006, the RCSB Protein Data
Bank was reported to contain 37874 protein structures.
For the cell biologist, the potential uses of this massive and growing body of sequence and
structural information are very exciting. No longer are we tied exclusively to painstaking
genetic approaches of hunting down a single gene or isolating a single protein for study.
Now, in the age of omics, we can consider studying a species' entire genome (genomics),
all the mRNA expressed in a single cell (transcriptomics), or even all the proteins
present in a single cell at one time (proteomics).
It is difficult to adequately summarize the ways in which these new data are currently being
exploited for the understanding of cellular processes. In the next few weeks we are going
to access a fraction of the resources freely available to anyone with a computer and an
internet connection. Hopefully by the end of these labs you will have an appreciation for
the resources available and how to use some of them.
Project Scenario
Many diseases are caused by proteins with amino acid sequences different from what
we might call the 'normal' protein. The mutant protein may no longer be able to function
at all or it may have a reduced or altered function. Often these disease causing proteins
differ only by a single amino acid from the non-disease causing form. Understanding
how the mutation causes disease can lead to better understanding of the 'normal'
protein's function and may lead to new ideas for treatment of the disease.
When we talk about 'normal' and 'mutant' genes or proteins it is important to remember
that polymorphisms may exist among 'normal' proteins which do not cause disease.
Depending on the particular job of a particular amino acid in a given protein's tertiary
structure, some amino acid substitutions will conserve the protein's function, but others
will not. We know this is the case because homologous proteins in distantly related
species are not 100% identical, yet they have the same function.
There is disagreement in the scientific community as to what constitutes a normal
protein and a mutant protein. Some consider only the most prevalent sequence of a
protein to be the 'normal' protein and all other forms to be mutants even if they are not
associated with disease. Others think there must be evidence of altered protein function
for an allele to be called a mutant gene. We will get around such distinctions by calling
the most prevalent sequence normal and refer to any alternate sequence as variant.
A variant may have as little as a single amino acid substitution (the amino acid normally
seen in the sequence is changed to another amino acid) or it may have gained or lost one
or more amino acids.
The central goal of this project is to predict if a variant protein sequence is likely to result
in an observable alteration of the protein's normal function. Such an alteration could
include some reduction in activity, a change that gives the protein a new activity, or a
complete loss of detectable activity. To make this prediction you will need to look at a
variety of available information concerning the normal protein's structure and function,
paying particular attention to the amino acid which is altered in the variant sequence. A
summary of the many aspects of the normal and variant sequences which you will need
to consider is given below (you may not have access to all this information for your
protein):
First about the 'normal' protein:
• What is the cellular environment?
• What does it do? Is anything known about the amino acids involved in its function?
(Such information could be based on mutational studies or structural studies).
• What protein family does it belong to? (Gives you functional information.)
• Are any post-translational modifications expected?
Then about the particular amino acid which is variant:
• Is the position conserved across orthologues and/or paralogues? (Conservation suggests
the amino acid is important to the protein's function.)
• Where does the amino acid occur in the secondary structure? Could the mutation
affect the secondary structure?
• Is the amino acid involved in protein function? For example is it part of an active
site and might it make contacts with a substrate?
• If the structure of the protein or a closely related protein has been resolved, what
does it tell you about the role of the normal amino acid in the protein's
structure/function?
Biol 288 2006
In the process of working through the above considerations, you will be exposed to
many bioinformatics resources and tools. You will be given the amino acid sequence for
a variant protein and the name of the corresponding normal protein and you will then
follow these steps.
1. Look up the normal protein in the secondary databases Gene (at NCBI) and
SwissProt (at EBI). Read about the disease associated with your protein in the
OMIM database. Depending on your project you may read some scientific
literature concerning the role of your protein in normal and diseased cells.
2. Use the BLAST tool at NCBI to find five homologous sequences in other species.
3. Create a multiple sequence alignment (MSA) of the normal and mutant human
protein sequences, and the five homologous sequences.
4. Look at a secondary structure prediction from the program PSIPRED to see what
structural role the normal and mutant amino acids may play.
5. Use the three dimensional structure of either your normal protein or a closely
related protein to help predict what role the normal and mutant amino acids may
have. This will require using a program called DeepView (installed on
the computer hard drive) and the resolved structure from the Protein Data
Bank (PDB).
6. If your protein is an enzyme you will check the KEGG database to see what is
known about its role in the biochemistry of cells.
Much of the above requires that you have some understanding of amino acid properties.
I have placed a paper written by Betts and Russell (2003) on WebCT. These authors
provide a very good overview of the thinking process that goes into diagnosing the
possible effects of an amino acid substitution on protein function.
You will also be completing a tutorial on protein structure.
How the labs will work
For the bioinformatics labs I have used the symbol to indicate steps you need to carry
out on the computer. All questions you need to answer are numbered sequentially.
Extra information will be found in side boxes, like this one.
Each student will be assigned one of several alternative projects to
work on. Each alternative project deals with a different mutant protein
involved in a different disease. We will be spending three to four
weeks in a computer lab working on various aspects of your projects.
Throughout the course of the laboratories you will be completing
several in-lab assignments which must be handed in before you
leave. Additionally, there will be a few pre-lab assignments that you
will have to hand in before a particular lab begins (some will require
extra reading). And finally, you will prepare a short summary paper to
hand in after all labs are completed. The details for all assignments
and the summary paper will be handed out before we begin the first
lab.
Each student is expected to independently complete their own
assignments; however, you are encouraged to consult with your peers during labs.
Additionally, it is very important that each student carry out all their own computer work!
You will not learn how to use the various tools if you do not ‘drive’ the mouse yourself.
For those of you who are a bit afraid of the computer – we are here to help, please do
not be embarrassed to ask us or your neighbor for help. Bioinformatics is a “learn by
doing” enterprise so don’t be afraid to dive in.
All databases we will use are easily accessible via the web and all tools we will use are
freely available. Some are used directly in a web browser and some are programs which
are downloaded and installed locally, but they are free. The computers in the lab we are
using have been custom set up for us to use. Some of the work we will be doing can be
done from any computer on campus but for other jobs you can only use the specially set
up computers. There are instructions on WebCT for how to set up your own computer to
do the same things.
You will have two handouts for the bioinformatics labs. The thickest (this one) contains
the information that applies to all projects and the second is a guide sheet that applies to
your particular project. The assignment questions in the manual and guide sheets are
numbered sequentially. Either answer the questions directly in a separately posted Word
file (see WebCT) or if you prefer write the answers on loose leaf.
Saving and accessing your files
By now I hope you are familiar with your network drive (the I drive). It is usually best to save files
on the desktop while working on them and then move them to the I drive at the end of
the lab. Create a desktop folder titled “Bioinformatics yourname”. Save all local and
downloaded files here during the lab. At the end of the lab, move this folder to your I
drive or burn it to a disc if that works better for you. When you come back for the next lab, move
the folder back to the desktop and continue your work.
Laboratory Checklist
Beginning of lab:
hand in any pre lab assignments
login to your computer and transfer your folder from the I-drive to the desktop.
Open the Firefox browser and then go to the web page
http://uregina.ca/~lintottl/288_bioinformatics_homepg.html (link available from WebCT
or my web site).
Make this your home page (Tools>Options>General>HomePage – click "use current
page"). This will make it easy for you to get back here.
It helps to use tabbed browsing – this way you can have files from several databases
up at once and quickly switch back and forth to find information.
During the lab:
Answer the lab assignment questions as you go. You have 2 options. Option 1: type your
answers in as you go and save the Word doc to your desktop folder. At the end of the lab
print out the pages you have completed and hand them in. Make sure you have put
your name on the sheet (you can do this in the header if you know how, or you can
write it on after). Option 2: write out your answers on loose leaf.
save any new files to your desktop folder
End of the lab:
move your desktop folder to your I-drive (very important, or you will lose your
work)
hand in your lab assignment (check your lab schedule for what questions are due
each lab)
logout
Databases and Bioinformatics Tools
There are countless bioinformatics resources available today, and we will simply not
have time to learn to use even a few of them in any depth. Instead, we will give you a
very brief background and focus on using only a few databases and tools to achieve our
specific goal.
There are two basic components of bioinformatics resources: databases and mining
tools. Databases can be divided into two types: primary databases contain actual
sequence (or other raw) data obtained from experiments, while secondary databases are
curated (they include interpretations of the primary data). Primary sequence data usually
refers to nucleotide sequences which could be of genomic or cDNA origin. Protein
sequence databases contain protein sequences primarily obtained by conceptual
translation of DNA sequences (i.e. protein sequences are not usually obtained by direct
sequencing of the polypeptide). There are hundreds of bioinformatics databases
available online (see the Jan 2006 issue of Nucleic Acids Research for a glimpse of what
is available) but fortunately there are two major centers where most of the databases are
gathered together (or linked to) for easier access; in the United States at NCBI (National
Center for Biotechnology Information), and in Europe at EBI (European Bioinformatics
Institute).
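The "conceptual translation" mentioned above can be illustrated with a few lines of Python. This is only a minimal sketch, not the pipeline any database actually uses: it assumes the standard genetic code, reads the DNA in the first reading frame, and stops at the first stop codon. The function name `translate` and the 12-base example sequence are made up for illustration.

```python
# Standard genetic code, with codons enumerated in the order T, C, A, G
# for each of the three positions (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(cds):
    """Translate coding DNA (reading frame 1) up to the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept mRNA-style input as well
    protein = []
    for start in range(0, len(cds) - 2, 3):
        residue = CODON_TABLE.get(cds[start:start + 3], "X")  # X = unknown codon
        if residue == "*":  # stop codon: end of the protein
            break
        protein.append(residue)
    return "".join(protein)


# Hypothetical sequence: ATG GCC TGG TAA
print(translate("ATGGCCTGGTAA"))
```

Real database entries also record which reading frame and which genetic code were used, which this sketch ignores.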
NCBI and EBI have similar but not identical resources. Both have essentially the same
primary databases (indeed they exchange primary sequence information on a daily
basis, so that any sequence information submitted to one organization will always be
available to both). Data mining tools used at the two sites are different, and they each
build their own secondary databases. We will primarily use NCBI resources, although we
will look at some EBI resources as well.
Guided overview of resources and tools we will use
NCBI resources
Go to the NCBI web site home page.
The search bar that you see is NCBI's Entrez search tool. Entrez allows you to search
any or all available databases by selecting the database of choice from the drop down list.
Quick links to some often used sites are above the search window.
Callouts on the NCBI home page:
• Click the helix from any NCBI page to get to the home page.
• Quick link toolbar. Here is where you find the link to BLAST.
• Search toolbar. This is the gateway to NCBI's search tool called Entrez. You can search
all databases at the same time for any term. Or, you can narrow your search to a single
database. Just select the database you want from the drop down menu and enter your
search term.
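In the lab we will use the Entrez search form in the browser. For the curious, the same "pick a database, enter a term" idea can also be expressed as a web address using NCBI's E-utilities service, which this manual does not otherwise cover. The sketch below only builds the address and does not contact NCBI; the function name `esearch_url` and the default of 20 results are my own choices for illustration.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search service at NCBI.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(database, term, max_results=20):
    """Build a search address: 'db' picks the database, 'term' is the query."""
    params = {"db": database, "term": term, "retmax": max_results}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Equivalent to choosing "Gene" in the drop down menu and typing PTGS1.
print(esearch_url("gene", "PTGS1"))
```

Pasting the printed address into a browser should return a list of matching record identifiers rather than the formatted pages you see on the web site.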
Click on the link "All Databases". Here you will see a list of all databases available at
the NCBI site. Let's learn a little about the databases and tools we will be using.
Click the link “Protein”
Protein database at NCBI
1. What are the sources for sequences available in this database? Define any
acronyms (find them with Google - type in the acronym followed by database to
quickly find the source).
Gene database at NCBI
Go back to the database page and click on Gene. In the left column you should see
an "about" link which will take you to the paper Entrez Gene: gene-centered
information at NCBI (Maglott et al. 2005). Read the abstract and introduction to
answer the following questions.
Database Redundancy
Sequences are submitted by the originating lab directly to databases such as GenBank or
SwissProt. It is not unusual for the same sequence to be submitted more than once; for
example, different labs may submit sequences for the same gene, one might contain
slightly more sequence than the other, and the two entries will each get their own
accession number. Later, a genome sequencing project submits sequence for an entire
chromosome including the gene which already has been submitted twice, resulting in three
separate entries for said gene. For historical reasons it is important that all sequences
be maintained, even though this creates redundancy in the databases. The RefSeq project
was begun to deal with the problem of redundancy.
2. Is this a primary or a secondary database? Explain your
answer.
3. What subset of known sequences is included in this
database?
4. What is a RefSeq? Go to Books (in tool bar menu at the
top of the screen)>The NCBI Handbook, search for
RefSeq project.
Search Gene for the PTGS1 sequence. Take a quick look
at the kinds of information available from this site. Answer
the following questions.
5. What is the official name of this gene?
6. What types of reactions does it catalyze?
7. In what metabolic pathway does this enzyme function?
8. How many isoenzymes of PTGS are there?
9. Where is this gene located in the human genome?
OMIM
Go back to the database listing and select OMIM. Answer
the following questions (hint: you may have to look around
a bit to find all the answers).
10. What information is available at this site?
11. Is this a primary or a secondary database? Why?
12. What book is the information in OMIM based on?
13. Would you expect to find information on an infectious disease such as Herpes in this
database? Why or why not?
BLAST (at NCBI)
Go back to the NCBI home page and click on BLAST.
14. What does this acronym stand for and what is it used for?
Notice there are several different BLAST tools; we will focus on the tools for searching
protein sequences. Protein BLAST (blastp) searches the protein database for matches
to an input amino acid sequence (in FASTA format).
Select blastp and on the input form, click on search – this should take you to
information on how to input sequence into blastp.
15. Describe what FASTA format is. It is very important to understand FASTA format, as it
is the input format most commonly required by sequence databases and tools.
EBI resources
Click on the EBI link on WebCT. You should see a tree map showing most of the
resources at EBI some of which look similar to NCBI offerings. We will be using the
EBI resources SwissProt and ClustalW.
See the 288 bioinformatics web site for detailed guides on reading output from
Swiss-Prot and other databases.
Swiss-Prot
Swiss-Prot is listed under protein databases. Click on
"UniProtKB/Swiss-Prot" to get a description of this database.
16. Is this a primary or secondary database? Why?
17. Would you expect to see the same protein sequence represented
more than once? Why or why not?
ClustalW
ClustalW is a program used to align two or more sequences. Such alignments can be
used for a variety of purposes, such as, investigating the evolution of a protein, and
determining what regions of a protein are most critical for function. There are many
alignment programs available; which one you use depends on what you are trying to
determine. For our purposes ClustalW will work well and is the easiest to use.
To access ClustalW go back to the EBI resource list. ClustalW falls under Toolbox tree >
sequence analysis. Once you get to ClustalW click the link “new users please read the
FAQ”. Answer the following questions:
18. What basic information can be gleaned from a multiple sequence alignment of
related proteins?
19. What are three ways the information gleaned can be used?
20. What types of sequences can be aligned by ClustalW?
There is an example ClustalW input page under Guides on the bioinformatics web page.
Note there are many modifiable input fields; for the most part we will use the default
settings. Please do not change any settings unless you are clearly told to do so.
As with BLAST input, the sequence input will be in FASTA format; make sure you know
what this means.
Other resources
PSIPRED protein structure prediction server
This server is an input point for using several different programs that will predict protein
structure based on the primary amino acid sequence. We will be using this site to predict
the secondary structure of the normal human proteins. Because obtaining output from
this site can take hours to days I have already obtained the predictions and they are
available from WebCT. If you want/need more information on the algorithms behind the
prediction, or on interpreting the output click the more… button on PSIPRED home
page.
RCSB Protein Data Bank
This is the database where all coordinate files for tertiary protein structures are stored.
DeepView
DeepView is a program used to both see and manipulate three dimensional
macromolecule structures. Of all the Bioinformatics tools and programs we are using,
this is the only one that actually resides on your computer. We are using this particular
program because it offers one important feature that other free programs do not: it will
allow us to mutate an amino acid and investigate what the new protein will look like.
DeepView is a very powerful program which is a bit harder to use than other structure
viewers, but we will guide you through the steps you will need to use. If you would like to
learn to use DeepView in more depth see Gale Rhodes’s excellent tutorial at
http://www.usm.maine.edu/~rhodes/SPVTut/index.html.
The GIMP
This is a fully featured open source program for image manipulation. This program has
all the power of Photoshop, but the price tag is much easier to take. If you are already
familiar with Photoshop it can take a bit of time to learn the GIMP as the commands
often have different names, but there is plenty of help on the Web. We will be using the
GIMP to take screen shots of DeepView structures, and I will provide step-by-step
instructions for how to do this.
Useful Links (available on web site)
EBI resources http://www.ebi.ac.uk/services/
NCBI resources http://www.ncbi.nlm.nih.gov/Sitemap/index.html
PSIPRED server http://bioinf.cs.ucl.ac.uk/psipred/
RCSB-PDB http://www.rcsb.org/pdb/Welcome.do
GIMP home http://www.gimp.org/
References
NCBI, Entrez Genome Project. Genome sequencing projects statistics. Available at
http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html, accessed July 26, 2006.
RCSB Protein Data Bank. An information portal to biological macromolecular structures.
Available at http://www.rcsb.org/pdb/Welcome.do, accessed July 26, 2006.
Maglott, D., J. Ostell, K. D. Pruitt, and T. Tatusova. Entrez Gene: gene-centered
information at NCBI. Nucleic Acids Res. 2005 January 1; 33(Database Issue): D54–
D58. Available online at
http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=15608257.
Finally – let’s start your projects
My project name _____________________________________________________.
Please keep your project sheets handy.
Step 1. Getting sequences and learning about your protein/disease
The mutant sequence
Go to your project page on WebCT. Download your mutant protein sequence to your
desktop folder (right click the link, choose save as and save to your desktop folder).
The NCBI Gene entry for your normal protein
Go to the NCBI homepage.
Select “Gene” from the database pull down menu.
In the search box type the name of your protein as given in your guide sheet.
Answer the questions (hint: go to the summary):
21. What is the official Gene symbol?
22. What is the gene name?
23. What is the GeneID number?
24. Where in the human genome is this gene located?
25. What is the function of the protein? (For example: enzyme, structural, cell signaling
etc. – a protein may have more than one function)
26. What specific role does it play in the cell? (For example: if an enzyme, what type and
what reaction does it catalyze, if structural, what structure.)
27. What is the RefSeq accession number for the mRNA of this gene?
28. What is the RefSeq accession number for the protein sequence (it is the product of
the gene)?
Get the sequence of the normal protein
Click the RefSeq protein (product) accession number, this will open the GenBank file
specific for this protein.
At the top of the file is a drop down menu called “Display”, click the arrow and choose
FASTA. This is the sequence of the normal protein in FASTA format, ready to copy
and paste.
Copy the sequence and paste it into a new Word file. Make sure you have included
the line that begins with the > sign. Save this new Word file in your desktop folder as a
text file, under the title “normal myproteinname”.
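If you want to check that the text you saved really is valid FASTA, a short script can read it back. The sketch below is only an illustration, assuming the usual layout of one title line beginning with > followed by one or more lines of sequence. The function name `parse_fasta` and the example record are made up and are not your project sequence.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA-formatted text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if title is not None:  # finish the previous record
                records.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        elif title is None:
            raise ValueError("sequence data found before a '>' title line")
        else:
            chunks.append(line)  # sequence may be wrapped over many lines
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


# Hypothetical record, for illustration only.
example = """>example_protein hypothetical record for illustration
MKTAYIAKQR
QISFVKSHFS
"""
for title, sequence in parse_fasta(example):
    print(title, len(sequence))
```

If the title line was left out when copying, the script raises an error instead of silently returning nothing, which is the mistake this step warns about.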
The Swiss-Prot entry for your normal protein
Open a new tab in your browser and go to the EBI site (by using tabbed browsing you
can keep all your different protein records easily accessible).
Under Protein Databases select the UniProtKB/Swiss-Prot database, scroll down a
bit until you see Access the UniProtKB/Swiss-Prot Database. Click the link to get to a
search page.
Make sure you use the second search menu, UniProt Power Search and follow your
guide sheet instructions for specifics on what to search for.
29. What is the SwissProt accession number? (Hint: for proteins it always starts with a
P).
30. What are the alternate names for this protein?
31. Where in the cell is this protein located? (If it is known it will be in the comments – if it
is not listed your answer is “unknown”.)
32. Does this protein exist as a monomer, dimer (homo or hetero) etc.?
OMIM entry
Find the entry for your disease in the OMIM database: either search the OMIM
database directly or navigate to the OMIM entry for this disease by clicking the
OMIM link that appears to the right of the Gene entry you just looked up.
33. What protein is deficient in the disease you are studying?
Project Specific Reading
In addition to the above database information you may have some reading material for
your disease.
34. In your project specific manual there will be a list of questions specific to your
disease. These questions can be answered using any of the databases or readings
you are already familiar with. Answers to the project specific questions are due at the
beginning of next week’s lab. You may need to finish these up outside of lab time.
Step 2. Finding related sequences and setting up a multisequence FASTA file.
To begin our analysis of the variant protein we need to determine where the mutation
lies in the protein, and how conserved this region of the protein is among homologous
proteins from other species. To do this we will need to first find homologous sequences
using the NCBI BLAST tool to search a non-redundant protein database.
Go to the NCBI home page.
Click on the BLAST link.
Under the protein subheading, choose protein-protein blast (blastp). This will take
you to a submission form.
Open the word file containing the normal protein sequence.
Highlight and copy the sequence, including the header (>……).
Paste the sequence into the large window next to the word ‘search’.
We are not changing any default settings so you can simply click the “BLAST” button.
Very shortly you will be presented with a window indicating your request has been
submitted. Click the Format button. In a few seconds a new window containing the
results will open (time depends on sequence you submit and how busy the server is).
In the appendix of this guide is a picture of a BLAST output with the various parts
highlighted. Take some time to familiarize yourself with what you see. (I recommend you
save this output in your desktop folder as an HTML page for future reference.)
For our purposes here the graphical view and the summary list of hits will not be very
useful. Scroll down until you get to the first alignment (may be a fair distance if you
have a lot of hits). By default the alignments are presented from highest to lowest
score, so as you scroll down you will find hits with progressively less similarity to your
query sequence.
Go through the alignments looking for sequences from different species which have
< 100% identity. Immediately following the sequence name, in square brackets, is the
species name. Often, if the sequence is human, no species name will follow. When
you find one that looks promising, click on its accession number. This will take you to
the GenBank entry for that sequence. Look at the organism name to make sure it is
not human (clicking on the species name will take you to a taxonomy page with more
details).
If you decide this sequence is a good choice select FASTA from the display menu (at
the top). Copy and paste the sequence to a new Word doc, save it as text only,
“myproteinname related seq”. Make sure you get the header (>…..).
Go back to the BLAST results (use the browser back arrow) and continue looking for
homologous sequences from a variety of species and adding the FASTA formatted
sequences to the Word doc. Try and obtain sequences from distantly related species
(i.e. only choose one rodent sequence, and try to get at least two non-mammalian
sequences). The more distantly related the homologous sequences are, the more
informative the MSA will be.
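The selection rule described above (read the species name in square brackets, skip human hits, take no more than one sequence per species) can be sketched in Python. This is only an illustration of the bookkeeping, assuming the hit titles are already listed in BLAST order. The function names and the example titles are hypothetical, and the script cannot judge how distantly related two species are, so the "only one rodent" decision is still yours.

```python
import re


def species_of(title):
    """Return the text inside the last pair of square brackets, or None."""
    matches = re.findall(r"\[([^\[\]]+)\]", title)
    return matches[-1] if matches else None


def pick_one_per_species(titles, exclude=("Homo sapiens",), wanted=5):
    """Keep the first hit for each new species, skipping excluded species."""
    chosen = []
    seen = set(exclude)
    for title in titles:  # assumed to be in BLAST order (best score first)
        species = species_of(title)
        if species is None or species in seen:
            continue  # no species given (often human) or already represented
        seen.add(species)
        chosen.append(title)
        if len(chosen) == wanted:
            break
    return chosen


# Hypothetical hit titles, for illustration only.
hits = [
    "hit1 example protein [Homo sapiens]",
    "hit2 example protein [Mus musculus]",
    "hit3 example protein isoform 2 [Mus musculus]",
    "hit4 example protein",
    "hit5 example protein [Gallus gallus]",
]
for title in pick_one_per_species(hits):
    print(title)
```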
Once you have five non human sequences, add your variant and normal protein
sequence to the beginning of the list of sequences in your Word doc.
To make for a nicer ClustalW output, we are going to modify the FASTA title for each
sequence. ClustalW names each sequence in the output file according to the FASTA
title (including all the characters up to the first space in the title). If you look at your
default FASTA titles, as a whole they are quite informative, but somewhat ugly. (Will
gi|109019190|ref|XP_001107538.1| mean anything to you once you print out the
alignment?) For each sequence insert a common name, such as monkey, right after
the > character, and leave a space between the new name and the existing title.
Make sure this common name is unique for each sequence, or ClustalW will not carry
out the alignment (it thinks you have given it the same sequence twice). For the first
two human sequences, label as mutH and norH.
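You will do this relabeling by hand in Word, but the rule is simple enough to write down as a script. The sketch below inserts one label after each > character, leaves a space before the existing title, and refuses labels that are repeated or contain spaces. The function name `relabel_fasta` and the two short sequences are hypothetical.

```python
def relabel_fasta(text, labels):
    """Insert one label after each '>' and return the new FASTA text."""
    if len(set(labels)) != len(labels):
        raise ValueError("labels must be unique")
    if any(not label or any(ch.isspace() for ch in label) for label in labels):
        raise ValueError("labels must be non-empty and contain no spaces")
    remaining = list(labels)
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            if not remaining:
                raise ValueError("more sequences than labels")
            # new name, a space, then the original title
            line = ">" + remaining.pop(0) + " " + line[1:]
        out.append(line)
    if remaining:
        raise ValueError("more labels than sequences")
    return "\n".join(out) + "\n"


# Hypothetical two-record file; the sequences are made up.
original = ">gi|109019190|ref|XP_001107538.1| example title\nMKT\n>another_title\nMKS\n"
print(relabel_fasta(original, ["monkey", "dog"]))
```

Because the name used in the output runs up to the first space, the new label becomes the row name in the alignment while the original title is kept for your records.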
35. Make sure you have saved the sequence file as a ‘text only file’. Print the file and
hand it in with today’s lab assignment.
Step 3. Creating the MSA – using ClustalW
From the EBI home page, select ClustalW (under Toolbox>SeqAnalysis). See the
Guides on my web site for an example input form for ClustalW (keep in mind the input
page sometimes changes but all the fields I will describe should still be there). Use
this to help you set up the job.
You can get your set of FASTA formatted sequences to ClustalW in one of two ways.
Either copy the sequences from your word file to the ClustalW input window, or
upload your file by browsing to it (area below the window you paste into).
Change OUTPUT ORDER to “input” instead of “aligned”. This change will keep the
human sequences at the top of the alignment, which makes them easier to compare.
Other than that, leave the default settings alone and click Run to submit. You will get a
new window indicating your job is running. If there are any problems with your input
they will show up fairly soon. You can go ahead and work on other things while the
job is running but don’t close the running window.
When the MSA is completed, it will pop up in the run window. Make sure there is only
a single amino acid difference between your normal and mutant protein. (If you see
more than one difference, you have probably got the wrong normal sequence).
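If you would rather not check this by eye, the comparison can be sketched in Python. The function name alignment_differences is hypothetical, and the sketch assumes you have pasted the complete aligned rows for norH and mutH (all blocks joined together, gaps written as "-") into two strings of equal length.

```python
def alignment_differences(row_a, row_b):
    """List the columns where two aligned rows differ.

    Each entry is (alignment column, residue number in row_a, residue in
    row_a, residue in row_b). Columns and residue numbers start at 1. If
    row_a has a gap in a column, the residue number is that of the last
    residue seen before the gap.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    differences = []
    residue_number = 0
    for column, (a, b) in enumerate(zip(row_a, row_b), start=1):
        if a != "-":
            residue_number += 1
        if a != b:
            differences.append((column, residue_number, a, b))
    return differences


# Hypothetical short rows standing in for the norH and mutH lines.
normal = "MKTAYIAK"
mutant = "MKTAYVAK"
print(alignment_differences(normal, mutant))
```

A list with exactly one entry is what you expect; more than one entry suggests the wrong normal sequence was used.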
Save the output. First, click on the button View Alignment File (scroll down to the
alignment to see this button). In the new window, select and copy the entire
alignment. Paste it into a new Word file. It will look a bit strange, do not panic. Select
all, change the font to Courier, and set the left and right page margins to 0.75
inches (File>Page Setup). The alignment file should now look like it did on the web
site. Adjust the page breaks so that the alignment lines stay together: all seven
sequences and the row under them (stars and dots) need to stay together; don’t let them
split across pages (see example below). Don't forget to save the file.
monkey  YGENCSTPEFLTRIKLLLKPTPNTVHYILTHFKGFWNVVNNIPFLRNAIMSYVLTSRSHL 360
dog     YGENCSTPEFLTRIKLYLKPTPNTVHYILTHFKGVWNIVNNIPFLRNTIMKYVLTSRSHL 109
        **************** *****************.**:*********:**.*********
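The row of stars and dots under each block is ClustalW's conservation line: "*" marks a column that is identical in every sequence, ":" a conserved substitution, "." a semi-conserved substitution, and a blank a column with no conservation. The Python sketch below looks up the symbol for one column of a block. The names CONSERVATION and conservation_at are hypothetical, and the column number counts from the first residue of the block, not from the start of the protein.

```python
# Meaning of each symbol in the ClustalW conservation line.
CONSERVATION = {
    "*": "identical in all sequences",
    ":": "conserved substitution",
    ".": "semi-conserved substitution",
    " ": "not conserved",
}


def conservation_at(conservation_line, column):
    """Describe the conservation symbol at a 1-based column of one block."""
    if column < 1:
        raise ValueError("column numbers start at 1")
    # Trailing blanks are often lost when copying, so pad the line first.
    padded = conservation_line.ljust(column)
    return CONSERVATION[padded[column - 1]]


# The conservation line from the monkey/dog example, written compactly.
line = ("*" * 16 + " " + "*" * 17 + "." + "**" + ":" + "*" * 9
        + ":" + "**" + "." + "*" * 9)
print(conservation_at(line, 17))
```

Pass only the symbols themselves, without the leading spaces that line the row up under the sequences.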
36. Print the output and hand in one copy at the end of today’s lab. Also answer the following
questions.
37. What is the mutation? Write it in the following format "Res123Res" where Res is the
three-letter code for the amino acid in the un-mutated (wild type) protein and the
second Res is the amino acid in the mutated protein. In place of "123" put the amino
acid residue number of the mutation.
38. Is the mutation in a region of conservation – how do you know?
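The "Res123Res" notation asked for in question 37 can be generated from the two sequences. The following Python sketch uses the hypothetical name mutation_name and assumes that both sequences are full length, ungapped, start at residue 1, and differ at exactly one position.

```python
# One-letter to three-letter codes for the twenty standard amino acids.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def mutation_name(normal_seq, variant_seq):
    """Return the single substitution between two sequences as Res123Res."""
    if len(normal_seq) != len(variant_seq):
        raise ValueError("sequences must have the same length")
    differences = [
        (position, normal, variant)
        for position, (normal, variant)
        in enumerate(zip(normal_seq, variant_seq), start=1)
        if normal != variant
    ]
    if len(differences) != 1:
        raise ValueError(f"expected 1 difference, found {len(differences)}")
    position, normal, variant = differences[0]
    return f"{THREE_LETTER[normal]}{position}{THREE_LETTER[variant]}"


# Hypothetical short sequences; real input would be norH and mutH.
print(mutation_name("MKTAYIAK", "MKTAYVAK"))
```

The wild type residue comes first, followed by the residue number and then the residue found in the variant.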
Interlude
In cells, proteins do not exist as a linear chain of amino acids, but instead they assume a
conformation that is dictated by the primary sequence. Before we begin to look at what
the protein's secondary and tertiary structures can tell us we need to review the
properties of amino acids and what the properties mean to a protein's three dimensional
structure. Please link to the online tutorial Introduction to Protein Structure at
http://www.paccd.cc.ca.us/instadmn/physcidv/chem_dp/chemweb/protein/index.html and answer
the questions on the associated handout (we will give this out in lab and it is also
available on WebCT). This tutorial must be completed and the question sheets handed
in (with answers) before you can proceed to the next step.
Additionally, please read section 14.1 to 14.4.3 in Amino Acid Properties and
Consequences of Substitutions (Betts and Russell, 2003) and answer the following
questions.
39. What type(s) of amino acid(s) would you expect to predominate on the surface of the
following kinds of proteins:
An integral membrane protein?
A protein that interacts with DNA?
A soluble protein?
40. What two processes can give rise to a family of homologous proteins?
41. What are the names given to the proteins arising from each process?
42. If you have access to a large family of homologous proteins, what might a region that
is highly conserved across the family suggest?
43. What do we mean by post-translational modification?
44. What are the two most common types of post-translational modifications?
45. Why is it important to consider the possibility that a particular amino acid is modified?
46. Why is a single classification system of amino acid properties not satisfactory when
considering the possible effects of an amino acid substitution on a protein's function?
47. What are three ways in which amino acids can be classified, as represented in the
Venn diagram that integrates the various properties of amino acids (Taylor, 1986)?
Biol 288 2006
48. What properties of amino-acids are not represented in the Venn diagram?
49. Based on the Venn diagram of amino acid properties, what properties differ between
the mutant and normal amino acid in your protein?
Step 4. Secondary structure prediction
So far you have considered possible consequences to protein function based on the
properties of the normal and variant amino acid in the primary sequence. To further build
your hypothesis you need to also consider the role of the normal amino acid in the
secondary, tertiary, and even quaternary structure of the protein. Understanding the role
the normal amino acid may play in protein structure will help to narrow down the possible
effects of the mutation.
Unfortunately, the bulk of proteins have not yet had their structures resolved, but we can
still consider possible structural roles by looking at the protein's secondary structure.
Since secondary structure is dictated by the primary amino acid sequence it stands to
reason that even when a protein’s structure has not yet been resolved, we should be
able to predict secondary structure. However, as you have just learned, each of the
twenty amino acids can play several roles in a protein depending on its position in the
overall structure and its microenvironment (the neighboring amino acids). Despite these
complications, several secondary structure prediction programs are available today and
report accuracies of up to about 80%. We will be using
the program PSIPRED.
Secondary structure predictions take some time so I have already obtained them and
placed them on WebCT. On your project page, right click the secondary structure link,
and save it to your desktop folder.
Open the file on your computer. Above the primary protein sequence are letters
designating the predicted secondary structure; H=helix, E=strand(β-sheet),
C=coil(random coils).
On your MSA print out (or you could do this in Word) write the symbols H for helix and
B for β-sheet above the alignment, write nothing for coils. See example below:
        H--------H     B------------B      H---HB--------B
monkey  YGENCSTPEFLTRIKLLLKPTPNTVHYILTHFKGFWNVVNNIPFLRNAIMSYVLTSRSHL 360
dog     YGENCSTPEFLTRIKLYLKPTPNTVHYILTHFKGVWNIVNNIPFLRNTIMKYVLTSRSHL 109
        **************** *****************.**:*********:**.*********
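The annotation row can be produced from the PSIPRED letters rather than drawn by hand. This Python sketch uses the hypothetical name annotation_row. It assumes the prediction string has one letter (H, E or C) per residue and that the sequence it is placed above has no gaps in that block; if the aligned row contains gaps, the output would have to be shifted by hand.

```python
from itertools import groupby


def annotation_row(prediction):
    """Turn a PSIPRED string into the H----H / B----B marking style.

    Helix runs (H) are drawn as H, dashes, H; strand runs (E) are drawn
    with B for beta sheet; coil (C) and anything else is left blank.
    """
    pieces = []
    for state, run in groupby(prediction):
        length = len(list(run))
        if state == "H":
            letter = "H"
        elif state == "E":
            letter = "B"
        else:
            pieces.append(" " * length)
            continue
        if length == 1:
            pieces.append(letter)
        else:
            # Mark both ends of the run and fill the middle with dashes.
            pieces.append(letter + "-" * (length - 2) + letter)
    return "".join(pieces)


# Hypothetical prediction for a 15-residue stretch.
print(annotation_row("CCHHHHHHCCEEEEC"))
```

The output has the same length as the prediction, so it lines up residue for residue when printed in Courier above the sequence.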
50. What is the secondary structure predicted for the region containing the mutation?
51. What type of secondary structure(s) is the most common in the alignment? If it
seems an even mix between α-helices and β-sheets, state that.
52. Do you think that the mutation in your protein may alter the secondary structure? Why
or why not?
53. How could you test your prediction?
Step 5. Analysis of the 3D structure
Background reading from Biochemistry 5th Edition (Berg et al. 2002)
Bond distances
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&db=books&doptcmdl=GenBookHL&term=bond+distance+AND+stryer%5Bbook%5D+AND+215023%5Buid%5D&rid=stryer.section.156#157
Conservation of tertiary structure
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&db=books&doptcmdl=GenBookHL&term=protein+tertiary+structure+AND+stryer%5Bbook%5D+AND+215485%5Buid%5D&rid=stryer.section.944#945
In this step we will look at the 3D structure of either the normal protein, or a closely
related structure. In order to look at a 3D protein structure we need two things: a file
containing coordinates for the atoms in a structure, and a program that not only converts
these coordinates into a 3D image, but allows us to manipulate the structure. It is
important to remember when looking at 3D structures of proteins that what we are
looking at is a model, based on experimental data. As with any scientific interpretation of
data the 3D protein structure can have errors.
Generally structures are determined in one of two ways: either using X-ray crystallography to
resolve the atom coordinates in a protein crystal, or using NMR to resolve structures in
solution. Protein structure determination is a very painstaking endeavor, and a single
structure can often take several person-years (or six years of a PhD candidate’s
life). Regardless of the method used, once the data is collected and analyzed and a
satisfactory level of resolution is achieved, a file containing the atomic coordinates for most
atoms in the protein (excluding hydrogens) is submitted to the Protein Data Bank (PDB)
as a PDB file.
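To make the idea of a coordinate file concrete: in the PDB text format each atom is one ATOM line, with the atom name, residue name, chain, residue number and x, y, z coordinates (in angstroms) in fixed columns. The Python sketch below reads those columns and measures the distance between two atoms. The names parse_atoms, distance and demo_line are hypothetical, the demo coordinates are invented, and only ATOM lines are read (HETATM lines for ligands are ignored).

```python
import math


def parse_atoms(pdb_text):
    """Read the fixed-column ATOM records of a PDB file into dictionaries."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            atoms.append({
                "name": line[12:16].strip(),    # columns 13-16
                "res_name": line[17:20],        # columns 18-20
                "chain": line[21],              # column 22
                "res_num": int(line[22:26]),    # columns 23-26
                "xyz": (float(line[30:38]),     # columns 31-38
                        float(line[38:46]),     # columns 39-46
                        float(line[46:54])),    # columns 47-54
            })
    return atoms


def distance(atom_a, atom_b):
    """Straight-line distance between two parsed atoms, in angstroms."""
    return math.dist(atom_a["xyz"], atom_b["xyz"])


def demo_line(serial, name, res, chain, num, x, y, z):
    """Hypothetical helper: lay out one ATOM record in the right columns."""
    return (f"ATOM  {serial:5d} {name:<4s} {res:3s} {chain}{num:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


# Two invented atoms, only to show the functions working together.
text = "\n".join([
    demo_line(1, "CA", "GLY", "A", 12, 0.0, 0.0, 0.0),
    demo_line(2, "CA", "ALA", "A", 13, 3.0, 4.0, 0.0),
])
atoms = parse_atoms(text)
print(atoms[0]["res_name"], atoms[1]["res_name"], distance(atoms[0], atoms[1]))
```

A structure viewer does essentially this when it reports the distance between two atoms you click on.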
Using the program DeepView, we will first investigate the structure of the normal protein;
paying particular attention to the amino acid that is altered in the variant protein. You
should be able to see where the amino acid is located in the structure (surface, interior,
active site or binding site) and to analyze possible interactions between the atoms of the
‘normal’ amino acid and other amino acids or with ligands. Then you will mutate the
amino acid, and look to see if any of the predicted interactions are affected.
Resolution
Like any measuring method, measuring the position of atoms in a protein has limits,
dictated by the method and equipment used in the experiment. The accuracy with which
the atoms of a particular structure are measured is referred to as its resolution and is
usually given in angstroms. That is, if the stated resolution of a structure is 3 angstroms,
and two atoms in the structure are stated as 6 angstroms apart, the actual distance could
be as little as 3 and as much as 9 angstroms.
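The arithmetic in this example can be written out as a small Python function. This is only the rule of thumb as stated here (stated distance plus or minus the resolution), not a rigorous error model, and the name distance_range is hypothetical.

```python
def distance_range(stated_distance, resolution):
    """Return (smallest, largest) plausible distance in angstroms.

    Applies the rule of thumb: the true distance may differ from the
    stated one by up to the resolution. A distance cannot be negative,
    so the lower bound stops at zero.
    """
    lowest = max(0.0, stated_distance - resolution)
    highest = stated_distance + resolution
    return lowest, highest


# The example from the text: 6 angstroms apart at 3 angstrom resolution.
print(distance_range(6, 3))
```

Keep this range in mind later when judging whether two atoms are close enough to interact.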
DeepView is a very powerful program and learning to use it
would take many hours which we don’t have. The following is a
step-by-step guide to complete some very specific tasks.
Get the PDB file name for your protein from your project
guide. It should look something like 1AGP.
Go to the RCSB PDB home page and enter the file name
into the search window. Click search and you will be directed
to the structure summary for this file. Note the following
information:
54. What experimental method was used to obtain the data?
55. What is the resolution of the structure?
56. What species was the protein obtained from?
57. List any ligands, cofactors or metal ions included in the
structure (particularly important for enzymes):
58. Answer any project specific questions.
Download the PDB file (click on Download, and then click PDB, choose download
and download to your desktop folder).
Launch DeepView and then open the PDB file (File>Open PDB…).
When you have opened the file, close the information window that pops up (we don’t
need it).
Your screen should have 3 floating windows, something like the figure below.
Figure: the DeepView screen, showing the Toolbar, the Control Panel (if missing, open with
Window>Control P..) and the Graphic Window.
Summary of ToolBar buttons
Zoom & Center
Re-center – display re-centers around the atom you select, and will rotate around the
atom when you use the rotate tool
Rotate
Transverse
Zoom
(Tab will switch to the next movement tool)
Distance
Identity – residue type and # will appear on the display for the atom you select
Mutate – Used to mutate a side chain (click button, then atom you wish to mutate, select the
new amino acid from the list. The new amino acid will appear in the lowest energy
conformation.)
Take a few minutes to try out some of the tools (don’t worry; the program will not let
you alter the original structure file).
Control Panel
Show group
Label group (code and #)
Group names – amino acids are 3 letter code, and residue #
Show selected groups as ribbons
Chain column – click to select entire chain
Secondary structure – h = helix, b = beta sheet
Show amino acid side chain
Checks indicate what is shown in display.
59. Compare the actual secondary structure of your protein (shown in Control Panel)
with that predicted by PSIPRED. Note any differences.
60. PSIPRED states its predictions are ~80% correct. Do you agree this is a good
estimate of the accuracy?
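One way to put a number on the comparison in questions 59 and 60 is to count the residues where the prediction and the structure agree. The Python sketch below uses the hypothetical names to_three_state and fraction_agreeing. It assumes you have typed the Control Panel letters into a string with one character per residue (h for helix, b for beta sheet, a space for anything else) covering the same residues as the PSIPRED string.

```python
def to_three_state(symbols, helix="h", strand="b"):
    """Convert per-residue symbols to PSIPRED letters H, E and C."""
    return "".join(
        "H" if s == helix else "E" if s == strand else "C"
        for s in symbols
    )


def fraction_agreeing(predicted, observed):
    """Fraction of residues given the same state in both strings."""
    if len(predicted) != len(observed):
        raise ValueError("both strings must cover the same residues")
    if not predicted:
        raise ValueError("nothing to compare")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)


# Hypothetical ten-residue stretch.
predicted = "CCHHHHCCEE"            # from PSIPRED
observed = to_three_state("  hhhhh bb")  # typed in from the Control Panel
print(fraction_agreeing(predicted, observed))
```

If the structure file covers only part of the protein, trim the PSIPRED string to the same residues before comparing.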
Since many of the DeepView steps you will follow depend on the particular structure you
are looking at, the remaining instructions for completing step 5 are located in your
project-specific guide sheet.
Step 6. If your protein is an enzyme look it up in the KEGG database
Search for your protein directly in the KEGG database, or link to the KEGG database
from one of the other databases you have already accessed.
Final Report
You need to submit a final report in which you summarize what you have discovered
about your variant protein and the disease it may cause. Even though we did not do any
bench work, we can still use a formal research report to report our findings. Since the
end of the semester is nigh, we will leave out Materials and Methods, but if you were
writing a real research report it would be vital to list all databases and tools you used and
how you used them.
As before the guidelines below are in addition to the general guidelines for writing
reports.
Required sections (with page limit and weight)
Introduction (page limit 1–3, weight 30%)
As part of the background, describe the function of the normal protein and the disease
caused when the protein is deficient.
This project is a study, not an experiment; therefore you will not have a hypothesis.
However, you do have a specific goal. What is it and how did you approach achieving this
goal (you don't need to describe details of how to use each tool – just give an overview of
the steps).
Results (weight 30%)
Should include figures of the various data you collected:
The MSA with the predicted and actual secondary structures indicated above the
sequence, make sure to highlight the mutation (it is important to clearly identify all
sequences used).
A DeepView screen shot of the whole protein, rendered in 3D.
A DeepView screen shot of the normal amino acid, showing its microenvironment.
A DeepView screen shot of the mutant amino acid, showing its microenvironment.
Discussion (page limit 2-3, weight 30%)
Using all the information you gathered, propose a mechanism for how the mutation could
cause a deficiency in the protein. How you justify your proposed mechanism is much more
important than the actual mechanism. Can you propose any experiments that would allow
you to test this mechanism?
References etc. (weight 10%)
As per usual.