Download Steven`s project - The University of Texas at Dallas

Rosetta
Steven Bitner
A Need for Rosetta
There is a need to compute the complex workings of amino acid chains. These chains, known as
proteins, fold in many different ways, and the forces that drive folding are not yet fully
understood. A protein gains its functionality within the organism only in its folded state, and
the final shape of the folded protein determines its function. Predicting these folding patterns
has therefore become very important. In the future it may also be possible to work out what
shape of protein is needed to allow for a certain function. This could help with very specific
medication: synthetic proteins designed to prevent, reduce, or remove diseased protein
interactions.
Enter Rosetta. Rosetta software is capable of both tasks above, predicting a fold and designing
a protein for a given shape, with a great deal of accuracy. Rosetta has won the Critical
Assessment of Structure Prediction (CASP) competition held at the Lawrence Livermore Lab in
Livermore, California, an important competition for determining the best available de novo
protein predictor. De novo protein prediction is prediction done without any classification
information for the given protein. Many software packages used for protein folding prediction
require knowledge of which protein family the protein belongs to, or of which portions of the
protein strand are rigid or flexible, found using techniques such as the pebble game. The CASP
competition gives an unpublished protein sequence to each team in the competition. Each team
then uses its own software to predict the folding of the given protein, and a metric such as
Root Mean Square Deviation (RMSD) is used to determine which prediction is closest to the
actual protein. Rosetta has also been used to create synthetic proteins such as Top-7 (fig.1).
This is done with a combinatorial approach built on its folding prediction abilities: many
different amino acid sequences are tried and the fold of each is calculated. When a sequence
folds close to the desired shape (approximately 2-3 Å RMSD), the protein can be synthesized
using that amino acid sequence.
Fig.1: Top-7 Protein, the first synthetic protein, synthesized in November 2003.
The Top-7 protein shown above was, as already described, created using the Rosetta software.
This protein has not yet been found in nature, and may never be. It was created to have the
shape shown above. The initial, or target, shape was (almost) arbitrarily chosen and then the
computing began to find the necessary sequence. This synthesis was completed in November of
2003, and the Rosetta team became the first to synthesize a protein designed computationally to
take a shape not found in nature. For this synthesis, the Baker Lab won the coveted AAAS
Newcomb Cleveland Prize. As stated earlier, this ability to synthesize proteins with given
shapes may one day help scientists create specific medicines based on a desired reaction that
may reduce or remove symptoms of various diseases.
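The RMSD metric mentioned earlier in this section can be sketched in a few lines of Python. This is only a minimal illustration, not the CASP assessors' code: it assumes the two structures are already superimposed and that atom i in one list corresponds to atom i in the other. The function name and the sample coordinates are made up for the example.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z) points.

    Assumption: the structures are already superimposed and paired point by point.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical three-atom example: only the middle atom is displaced, by 2 Å.
predicted = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
actual = [(0.0, 0.0, 0.0), (1.5, 2.0, 0.0), (3.0, 0.0, 0.0)]
print(round(rmsd(predicted, actual), 3))
```

A lower value means the prediction is closer to the reference structure; identical structures give 0.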
Rosetta - Steven Bitner
2
The Baker Lab
Rosetta was created by a team of students and faculty at the University of Washington in the
United States. David Baker, the head of the lab, is the University of Washington faculty member
who gathered this talent and has led the lab since the early 1990s. Information on the Baker
Lab can be found at http://www.bakerlab.org/.
The Many Faces of Rosetta
Although this paper covers only the protein prediction abilities of Rosetta, Rosetta is capable
of more. It can be used to model protein to protein interactions, known as docking, as well as
the synthesis already mentioned. The Rosetta team is also working on computing predictions for
all known amino acid sequences, because determining structures with NMR and X-ray
crystallography techniques is expensive. These computations require a great deal of processing
power, much more than is present on any supercomputer. So, Rosetta has joined forces with the
World Community Grid in Seattle, Washington to create something called the Human Proteome
Folding Project under the name Rosetta@Home.
How Rosetta Works
Rosetta, like many other software packages for protein folding, uses a series of energy
minimization functions [2]. These functions help to determine whether a configuration of the
protein is plausible (through weights assigned according to probabilities), or even possible
(through penalties for steric collisions). Software could be written to determine the exact
best conformation of a protein by simply looking at all possible conformations and taking the
one with the lowest energy, but this would take far too much time to ever be practical. So,
Rosetta takes several shortcuts to speed up the algorithm and make it useful for actual
implementation and use.
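A rough count shows why looking at all possible conformations is impractical. The sketch below is back-of-the-envelope arithmetic only. The numbers in it are my own assumptions, not values from Rosetta: three allowed states per torsion angle, two free torsion angles per residue, and a machine that scores 10^12 conformations per second.

```python
def conformation_count(num_residues, states_per_angle=3, angles_per_residue=2):
    """Number of conformations if every torsion angle takes one of a few discrete states."""
    return states_per_angle ** (angles_per_residue * num_residues)

def years_to_enumerate(num_residues, evaluations_per_second=1e12):
    """Years needed to score every conformation once at the assumed evaluation rate."""
    seconds = conformation_count(num_residues) / evaluations_per_second
    return seconds / (365.25 * 24 * 3600)

# A modest 100-residue protein under the assumptions above.
print(conformation_count(100))
print(years_to_enumerate(100))
```

Even under these generous assumptions the running time is astronomically long, which is why the shortcuts described in the rest of this section are needed.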
Rosetta uses a concept of global folding that helps to distinguish it from some other protein
folding software packages. The ideal folding of a subset of the backbone does not necessarily
occur. The ideal folding of the subset simply provides a local preference value to the overall
folding algorithm. This local preference may help to determine the local fold, but this preference
is not the only part of the equation. This is due to the possibility of an ideal local fold causing a
higher energy conformation in other parts of the protein by way of steric collisions and other
energy function penalties. Global preferences also include things such as β-strand pairings and
the preference for hydrophobic residues to be buried rather than exposed at the surface of the
protein.
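The interplay between local and global preferences can be illustrated with a toy weighted score. This is a sketch under my own assumptions, not Rosetta's scoring function from [2]: the term names, the weights, the 3 Å clash distance and the clash penalty are all hypothetical.

```python
import math
from itertools import combinations

def count_clashes(points, min_distance=3.0):
    """Count pairs of points closer than min_distance (a crude steric collision test)."""
    return sum(1 for p, q in combinations(points, 2) if math.dist(p, q) < min_distance)

def total_score(term_values, weights, points, clash_penalty=10.0):
    """Weighted sum of energy terms plus a fixed penalty per steric clash (lower is better)."""
    weighted = sum(weights[name] * value for name, value in term_values.items())
    return weighted + clash_penalty * count_clashes(points)

weights = {"local_fragment": 1.0, "strand_pairing": 1.0, "hydrophobic_burial": 1.0}

# Fold A has the better local term but two of its points collide.
fold_a_terms = {"local_fragment": -5.0, "strand_pairing": -1.0, "hydrophobic_burial": -1.0}
fold_a_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (8.0, 0.0, 0.0)]

# Fold B has a worse local term but no collisions.
fold_b_terms = {"local_fragment": -3.0, "strand_pairing": -2.0, "hydrophobic_burial": -2.0}
fold_b_points = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0)]

print(total_score(fold_a_terms, weights, fold_a_points))
print(total_score(fold_b_terms, weights, fold_b_points))
```

Fold A wins on the local term alone, yet fold B has the lower total, which is the sense in which an ideal local fold does not necessarily occur.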
The first step Rosetta uses to speed up computation is side chain reduction. Side chains make
the folding of a protein much more complicated because they add many atoms to the protein. So,
Rosetta sets them aside and deals with them after the backbone has been determined. This is
done through the use of side chain centroids. A side chain centroid approximates an entire side
chain by a single point: a pseudo-atom placed at the probabilistic center of mass for that
specific side chain. So, if a given side chain appears in 1,000 configurations in the Rosetta
library, then those 1,000 configurations are averaged together to find the predicted center of
mass. The location of this center of mass relative to the Cα atom on the backbone chain is used
in the energy computations during Rosetta’s run.
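The centroid averaging described above can be sketched as follows. This is an illustration of the idea only: how Rosetta actually stores and weights its library observations is not documented here, so the data layout (a list of atoms per observation, each with a mass and a position, plus the Cα position) is my own assumption.

```python
def center_of_mass(atoms):
    """atoms: list of (mass, (x, y, z)). Returns the mass-weighted mean position."""
    total_mass = sum(mass for mass, _ in atoms)
    return tuple(
        sum(mass * position[axis] for mass, position in atoms) / total_mass
        for axis in range(3)
    )

def side_chain_centroid(observations):
    """Average the side-chain center of mass, relative to C-alpha, over all observations.

    observations: list of (ca_position, side_chain_atoms) pairs.
    """
    if not observations:
        raise ValueError("need at least one observed configuration")
    offsets = []
    for ca_position, atoms in observations:
        com = center_of_mass(atoms)
        offsets.append(tuple(com[axis] - ca_position[axis] for axis in range(3)))
    count = len(offsets)
    return tuple(sum(offset[axis] for offset in offsets) / count for axis in range(3))

# Two hypothetical observations of the same two-atom side chain.
observations = [
    ((0.0, 0.0, 0.0), [(12.0, (1.0, 0.0, 0.0)), (12.0, (3.0, 0.0, 0.0))]),
    ((1.0, 1.0, 1.0), [(12.0, (1.0, 3.0, 1.0)), (12.0, (1.0, 5.0, 1.0))]),
]
print(side_chain_centroid(observations))
```

The first observation has its center of mass 2 Å from the Cα along x and the second 3 Å along y, so the averaged centroid sits at (1.0, 1.5, 0.0) relative to the Cα.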
The major part of the Rosetta folding algorithm involves two steps: fragment insertion and
fragment assembly. These steps involve replacing sections of the protein with segments from a
library stored in the Rosetta database. Based on empirical testing, Rosetta uses nine-residue
and three-residue substitution segments. The segments in the library are gathered from ‘known’
proteins in the PDB. A library segment replaces a section of the protein by fixing that
section's torsion angles to the segment's values. So, in effect, the eight torsion angles
connecting nine residues become fixed, and the segment is treated as one rigid body. This
increases the algorithm speed for two reasons. First, since the torsion angles used in the
folding are obtained from a finite library, torsion angles cease being continuous and become
discrete variables, and thus much faster computation becomes possible. Second, since windows
are used to create rigid segments with fixed torsion angles, very few individual torsion angles
need to be computed.
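Replacing a window of the protein with a library segment amounts to overwriting a slice of torsion angles. The sketch below assumes, for illustration only, that the backbone is stored as one (phi, psi) pair per residue; this is my own simplification and not necessarily how Rosetta represents it.

```python
def insert_fragment(torsions, fragment, start):
    """Return a copy of `torsions` with the window beginning at `start` set to `fragment`.

    torsions and fragment are lists of (phi, psi) pairs in degrees, one pair per residue.
    """
    end = start + len(fragment)
    if start < 0 or end > len(torsions):
        raise ValueError("fragment does not fit inside the chain at this position")
    return torsions[:start] + list(fragment) + torsions[end:]

# Hypothetical 12-residue chain in an extended conformation.
extended_chain = [(-150.0, 150.0)] * 12
# Hypothetical three-residue fragment with helix-like angles.
helix_fragment = [(-60.0, -45.0)] * 3

folded = insert_fragment(extended_chain, helix_fragment, start=4)
print(folded[3:8])
```

Because every angle in the window comes from the library entry, the search only has to choose which fragment goes where, rather than a continuous value for each angle.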
For fragment insertion, we check the library and place the segments which are a good fit into a
list. More specifically, nine-residue and three-residue windows are checked against the library
and, for each window size, a list of the best 200 fits is obtained. The best fits are
based on a subset of the energy functions given in [2].
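Building the list of best fits is a ranking problem, sketched below. The fit score here is a stand-in that I made up (the number of mismatched residue letters between the window and the fragment's source sequence). The real ranking uses a subset of the energy functions in [2], which this example does not attempt to reproduce.

```python
import heapq

def sequence_mismatch(window_sequence, fragment):
    """Toy fit score: positions where the residue letters differ (lower is better)."""
    fragment_sequence, _torsions = fragment
    return sum(1 for a, b in zip(window_sequence, fragment_sequence) if a != b)

def best_fragments(window_sequence, library, keep=200, score=sequence_mismatch):
    """Return the `keep` best-fitting library fragments with the same length as the window."""
    same_length = [f for f in library if len(f[0]) == len(window_sequence)]
    return heapq.nsmallest(keep, same_length, key=lambda f: score(window_sequence, f))

# Hypothetical library entries: (source sequence, list of (phi, psi) pairs).
library = [
    ("AAA", [(-60.0, -45.0)] * 3),
    ("AGA", [(-120.0, 130.0)] * 3),
    ("GGG", [(-150.0, 150.0)] * 3),
]

top_two = best_fragments("AGA", library, keep=2)
print([sequence for sequence, _ in top_two])
```

Running this once per window position, for both window sizes, would give the per-window lists that fragment assembly draws from.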
Fragment assembly is the assembly of the protein as a whole. First, a nine-residue segment is
randomly chosen from the top 25 in the nine-residue list created during fragment insertion. This
segment is then used to replace the corresponding nine residues in the protein. The scoring
function is then used to calculate the change in energy. If the energy went down, then this
replacement is kept. If not, it is set aside with its score. This repeats until either a
negative change in score is obtained, or the program has determined that the likelihood of a
negative change is low, in which case the lowest positive change is kept. In each folding
simulation, Rosetta chooses a random start segment from the list and attempts 28,000
nine-residue replacements. Next, a similar process is performed for three-residue segments, but
only 8,000 replacements are attempted.
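The assembly loop can be sketched as a randomized search. This is a simplified reading of the description above and not Rosetta's code: it keeps a replacement only when the energy goes down and leaves out the bookkeeping for set-aside moves. The energy function, the fragment lists and the attempt count in the example are all hypothetical.

```python
import random

def assemble(torsions, fragments_by_start, energy, attempts, top=25, rng=None):
    """Repeatedly try random fragment replacements, keeping those that lower the energy.

    fragments_by_start maps a window start index to its ranked list of fragments,
    each fragment being a list of (phi, psi) pairs.
    """
    rng = rng or random.Random()
    current = list(torsions)
    current_energy = energy(current)
    starts = sorted(fragments_by_start)
    for _ in range(attempts):
        start = rng.choice(starts)
        fragment = rng.choice(fragments_by_start[start][:top])
        trial = current[:start] + list(fragment) + current[start + len(fragment):]
        trial_energy = energy(trial)
        if trial_energy < current_energy:
            current, current_energy = trial, trial_energy
    return current, current_energy

def toy_energy(torsions):
    """Made-up energy: squared distance of every angle pair from helix-like values."""
    return sum((phi + 60.0) ** 2 + (psi + 45.0) ** 2 for phi, psi in torsions)

start_chain = [(-150.0, 150.0)] * 9
candidates = [[(-60.0, -45.0)] * 3, [(-120.0, 130.0)] * 3]
fragments_by_start = {0: candidates, 3: candidates, 6: candidates}

final_chain, final_energy = assemble(
    start_chain, fragments_by_start, toy_energy, attempts=200, rng=random.Random(0)
)
print(final_energy)
```

In the terms of the text, this function would be called once with the nine-residue lists and 28,000 attempts, then again with the three-residue lists and 8,000 attempts.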
Side chains are added last. During backbone folding, Rosetta accounts for side chains only
through the centroid. This probabilistic approach does seem to obtain a high level of accuracy,
but can most likely be improved. We discuss potential improvements later in the section titled
“Nobody’s Perfect”. A randomized Monte Carlo approach is used to check the known rotamers of
the residue being analyzed. A replacement is made and Rosetta verifies whether or not steric
clashes have occurred. If they have, then the algorithm tries another rotamer; if not, then the
algorithm moves along to the next residue.
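The rotamer check described above can be sketched as follows. This is a heavily simplified illustration under my own assumptions: each rotamer is reduced to a single point, a clash is any pair of points closer than 3 Å, and the known rotamers are tried in a random order rather than sampled with Rosetta's actual Monte Carlo scheme.

```python
import math
import random

def place_side_chains(rotamers_by_residue, min_distance=3.0, rng=None):
    """Pick, residue by residue, the first randomly ordered rotamer that causes no clash.

    rotamers_by_residue: one list of candidate (x, y, z) positions per residue.
    Returns one chosen position per residue, or None where every rotamer clashes.
    """
    rng = rng or random.Random()
    chosen = []
    placed = []  # positions accepted so far, used for the clash test
    for rotamers in rotamers_by_residue:
        pick = None
        for candidate in rng.sample(rotamers, len(rotamers)):
            if all(math.dist(candidate, other) >= min_distance for other in placed):
                pick = candidate
                break
        chosen.append(pick)
        if pick is not None:
            placed.append(pick)
    return chosen

# Hypothetical rotamer positions for three residues.
rotamers_by_residue = [
    [(0.0, 0.0, 0.0)],
    [(1.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
    [(6.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
]
print(place_side_chains(rotamers_by_residue, rng=random.Random(1)))
```

In this example only one rotamer per residue avoids a clash with the residues already placed, so the result is the same whatever the random order.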
Obtaining a Copy of Rosetta
Getting a copy is quite easy, but not entirely necessary. To obtain a copy, go to [5] and click
the link for ‘Rosetta Licensing Information’. Follow the appropriate links and then the
directions in the email that will be sent to you. Installation help is provided below in ‘Using
Rosetta’. Getting a copy of the software is not entirely necessary because online servers are
available. One such server, called ROBETTA, is available via a link from the Baker Lab website.
The server is down as of the writing of this report, but is intended to come back online
sometime in 2006. Another server is available through the University of North Carolina by going
to [1]. This server is for academic use only and limits the user to prediction using 200
residues per trial. Another limitation is that only protein prediction
can be done via this server. One cannot use the docking capabilities of Rosetta through this web
server. I am unaware of the capabilities and limitations of the ROBETTA Server since it has
been out of commission for the duration of this project term.
Using Rosetta
Since there are two different places to go (three if you include the ROBETTA server), we must
discuss how to use both versions.
Using the web server
Firstly, let’s discuss the web server available at [1]. This web server uses a standard GUI,
which makes its use quite straightforward. There is a link for documentation on the persistent
navigation sidebar on the left side of the page. This sidebar also contains links for
registration, logging in, submitting jobs and checking the status of jobs. Registration is not
necessary for submitting jobs or using any other features of the site. Registering does,
however, get you email notification of job completion and stores jobs under the username that
you select, making them easier to locate among the sometimes lengthy queue. To submit a job,
you must first download a PDB file to your computer. Upload this file to the web server by
clicking on the browse button. The main submit screen (see figure 2) gives you a few options.
After you have uploaded the desired PDB file, you can opt to use a resfile that you have
created, or create one using a simple web form, by selecting ‘upload your own list’ or ‘create a
list’ respectively. Of course, Rosetta can compute packing for all residues in the PDB file if
you select the ‘all residues’ option. The exception to this is a protein containing more than
200 residues. In this case, the web server will force you to create a resfile if one is not
uploaded. If it is absolutely
necessary to repack all residues of the protein and it contains more than 200 residues, then you
must follow the instructions above for obtaining the software and then follow the instructions
below for running the downloaded software. It is also possible when submitting a job to place a
smaller emphasis on the importance of the repulsive energy functions. Another important option
available via the submit job screen is the option to run the job multiple times. As described
above in the section entitled ‘How Rosetta Works’, Rosetta uses a random start point during
simulation. Therefore, different conformations of the same protein may be obtained. Using
multiple simulations allows you to see multiple results for potential conformations of the protein.
Fig.2: The submit job screen for the UNC online Rosetta server
After you have submitted a job, you must check the queue by selecting the queue link from the
navigation bar on the left for your results. Your job has completed when “Complete” appears
next to your job. If “Unsubmitted” appears next to your job, then you must resubmit your job.
There seems to be a glitch in the server that sometimes fails to submit a job. This glitch is
sometimes activated by clicking on the ‘view resfile’ link after you have submitted a job. Even
though you have selected submit job, it may not always be the case that it has been submitted.
Be sure to check the queue for a status of “Processing” or “Complete”.
Using the downloaded software
The downloaded software is much more difficult to use and does not have a user-friendly GUI.
Whenever possible, a novice user should opt for the online version as described
above. However, if you intend to repack proteins quite regularly, or need to repack in excess of
200 residues, then you have no other choice. The downloaded version is also necessary if you
intend to use the docking portions of the Rosetta software.
The uses of the software as well as sample command lines are given in the various README
documents given in the Rosetta package. There are a few important things worth noting at this
time to make use of the software run more smoothly. Firstly, Rosetta is not supported on all
operating system platforms. See the README_platforms text file in the Rosetta package for a
list of supported platforms. If using this software at UTDallas, you must use the software on a
machine in the UNIX lab, since the school’s Apache server runs Sun Solaris 9 (an unsupported
platform). Unpack the Rosetta package using the command “tar -zxvf filename” where filename
is the name under which you have previously stored the package. Change directories to get into
the rosetta++ directory. In this directory, you can check out the README file for compiling
instructions. If you don’t intend to change the code in any way, then the simplest way to compile
is to type “make gcc”. The system that you are using must have the GNU compiler and make
installed. If the correct software is installed and you are using a supported platform, compiling
will take about twenty minutes for the optimized version (this is the version that will be created
by using the command “make gcc”).
After compiling, one change must be made before you can use the software. In the folder
rosetta++, you must alter the file “paths.txt” by changing the line that begins “data files” to
contain the following “../rosetta_database/” instead of the default location. This is the only
change that is necessary. Now you are ready to run jobs as you please by using the commands
and options in the README file located in the rosetta++ directory.
Interpreting Results
http://rosettadesign.med.unc.edu/documentation.html#II gives a field by field breakdown of the
output file format. The most important fields for basic use are the output coordinates. These
coordinates are given in the same format as the PDB files needed as input. This output file can
be used with visualization software such as PyMol to see the final conformation without making
any changes to the output file. The most important lines after the coordinates are the overall
score that was assigned to the conformation by the Rosetta scoring functions and the
rms_to_start showing the RMSD between the start and finish conformations. In both cases, the
lower the better. The fields that remain after the total are the individual function values used
to compute the total energy. It is worth noting that in the files created by using the C++
version of the software the energy abbreviations are generalized from LJ (Lennard-Jones) and LK
(Lazaridis-Karplus) to E (Energy). For example, the output file format given above shows LJatr
for the Lennard-Jones attractive function, but in the C++ version, the output file contains Eatr
for the Energy attractive formula.
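Because the output coordinates use the PDB format, they can be read with a few lines of code. The sketch below pulls the Cα coordinates out of ATOM records using the fixed column positions of the PDB format and computes an RMSD between two structures. It is an illustration only: I am assuming no superposition and Cα atoms only, which may not match how Rosetta computes rms_to_start. The helper atom_line exists only to build sample input for the example.

```python
import math

def read_ca_coordinates(pdb_lines):
    """Extract (x, y, z) of every C-alpha atom from PDB ATOM records."""
    coords = []
    for line in pdb_lines:
        # Atom name is in columns 13-16; x, y, z are in columns 31-38, 39-46, 47-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return coords

def ca_rmsd(start_lines, output_lines):
    """RMSD over paired C-alpha atoms, assuming the structures are already aligned."""
    start = read_ca_coordinates(start_lines)
    output = read_ca_coordinates(output_lines)
    if len(start) != len(output) or not start:
        raise ValueError("structures must have the same non-zero number of CA atoms")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(start, output))
    return math.sqrt(total / len(start))

def atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a minimal ATOM record with PDB column alignment (sample data helper)."""
    return (f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} {chain}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")

# Hypothetical two-residue input and output structures.
start_lines = [
    atom_line(1, " N", "MET", "A", 1, -1.0, 0.5, 0.0),
    atom_line(2, " CA", "MET", "A", 1, 0.0, 0.0, 0.0),
    atom_line(3, " CA", "GLN", "A", 2, 3.8, 0.0, 0.0),
]
output_lines = [
    atom_line(1, " N", "MET", "A", 1, -1.0, 0.5, 0.0),
    atom_line(2, " CA", "MET", "A", 1, 0.0, 0.0, 0.0),
    atom_line(3, " CA", "GLN", "A", 2, 3.8, 1.0, 0.0),
]
print(round(ca_rmsd(start_lines, output_lines), 3))
```

For real files, the lines would come from opening the input and output PDB files and reading them line by line.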
The figure below shows the input (before) and output (after) of a sample run, visualized using
PyMol. The input file was the accepted conformation for the protein 1ubq, and the output is
slightly different. This is because Rosetta does not assume the input file to be the known
conformation. The score for the conformation below is 113, and the RMSD is 1.6 Å.
Fig.3: Input and output conformations for sample run of Rosetta using the protein 1ubq
Nobody’s Perfect
Rosetta is no Superman, and would not claim to be. All software has its imperfections, and
Rosetta is no different. One problem with Rosetta is that it tends to group together atoms with
similar chemical properties. The problem with this is that hydrophilic residues may form
hydrogen bonds with each other and allow for hydrophobic portions of the protein to appear on
or near the surface of the protein. If you notice this problem occurring, you can steer Rosetta
away by omitting folding in these regions of the protein by creating a resfile. Rosetta does not
yet have a filter installed to prevent this placement of hydrophobic atoms near the surface, and as
such it is the user’s responsibility to watch out for and correct these situations if they occur.
The side chain simplification portion of Rosetta is also an area that may need some
improvement. Currently, Rosetta uses the centroid replacement scheme described earlier. The
problem with this centroid replacement is that when the side chains are later added, we must use
the Monte Carlo method of replacement, and may find many conflicts. It could be better to use a
different replacement structure. Using spheres that contain the common intersection of the
known side chain rotamers may be a better method. A better fit could potentially be obtained by
using the 3D convex hull that circumscribes the intersection of the known rotamers. This will
account for rotamers that may have a longer major axis and thus aren’t well represented by a
sphere. Documentation is lacking in this area, but it seems as though Rosetta forces rotamer
configurations around the backbone chain and as a result, developers of Rosetta wanted to use a
small structure to represent the side chain to minimize conflicts during backbone folding. The
idea being that the backbone should be free to do as it pleases and later the side chains will be
forced into the best rotamer fit available. The view of the importance of side chains in backbone
folding is still disputed, so this matter will not be a focus unless it is found that side chains do
play a large part in the folding of the entire protein.
Another problem comes in one of the smallest parts of the algorithm. After the appropriate nine
and three residue fragments have been assembled, Rosetta iteratively calculates the individual
unknown torsion angles for the rest of the protein. The problem with the way that this is done is
that a bad torsion angle in one bond may be offset by a torsion angle in another bond, making the
bad local change a good global choice. Rosetta concentrates mostly on the coarse (nine residue)
and fine (three residue) adjustments, and overlooks the finishing touches needed for the
remainder of the protein.
The biggest problem of all is the lack of guidance. The documentation for Rosetta is thin and
poor at best. There is no clear user’s manual other than a single page of text with sample
command lines, and a sample output file. The output file has little description of what each item
means, and although [2] contains the scoring functions, there is no description about how these
functions are combined to attain a final score. There is also no useful description of the
functions, just the equations themselves. Basically, if you want to learn a lot about Rosetta, you
had better already know it.
Citations
[1] Rosetta Design Web Server http://rosettadesign.med.unc.edu/documentation.html
[2] Protein Structure Prediction Using Rosetta, Numerical Computer Methods, C.A. Rohl, C.E.
Strauss, K.M. Misura, D. Baker, pp. 66-93, 2004
[3] README documentation included with rosetta2.0.1
[4] Rosetta Website https://www.rosettacommons.org/
[5] David Baker Lab Homepage http://www.bakerlab.org/