PyMod
User’s Guide
PyMod Documentation
(Version 2.1, September 2011)
http://schubert.bio.uniroma1.it/pymod/
Emanuele Bramucci & Alessandro Paiardini, Francesco Bossa, Stefano Pascarella, Department of
Biochemical Sciences “A. Rossi Fanelli”, Sapienza University of Rome, Italy
Table of Contents
1 Introduction
2 Installation
2.1 Windows (XP/Vista/Seven)
2.2 Mac OS (10.5+)
2.3 Linux (Ubuntu 10+)
3 PyMod Overview
3.1 Components
3.1.1 Similarity search
3.1.2 Alignment of sequences and structures
3.1.3 Homology Modeling
4 Usage Example
4.1 Modeling the dihydrofolate reductase from Mycobacterium avium
5 References
1 Introduction
PyMod is a simple and intuitive interface between the popular molecular graphics system PyMOL [1]
and several other tools: (PSI-)BLAST [2], MUSCLE [3], ClustalW [4], CEalign [5] and
MODELLER [6]. It was developed to show how integrating the individual steps required
for homology modeling and sequence/structure analysis within the PyMOL framework can greatly
simplify these tasks. PyMod implements sequence similarity searches, generation and editing of
multiple sequence and structural alignments, and merging of sequence and structure alignments,
with the aim of providing a simple yet powerful tool for sequence and structure analysis and for
building homology models.
2 Installation
2.1 Windows (XP/Vista/Seven)
1. First, check which Python version your PyMOL uses. Type
import sys; print sys.version
in the PyMOL console and note the leading version number (e.g. "2.7").
2. Retrieve the Windows installer for your Python version (from step 1) from the
Download page (http://schubert.bio.uniroma1.it/pymod/download.html).
3. The installer will guide you through the installation process. Remember to register
MODELLER to get a license key.
4. When the installation has finished, PyMod will be available from the Plugin menu of PyMOL.
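The version check in step 1 can also be written so that only the relevant number is printed. The following is a minimal sketch using the standard sys.version_info tuple; the helper name python_major_minor is an illustrative choice and is not part of PyMOL or PyMod.

```python
import sys


def python_major_minor():
    """Return the interpreter version as 'major.minor', e.g. '2.7'."""
    return "%d.%d" % (sys.version_info[0], sys.version_info[1])


# Typed in the PyMOL console, this prints the number needed to pick the installer.
print(python_major_minor())
```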
2.2 Mac OS (10.5+)
(Beta test - some functions may be missing)
If you have Mac OS X 10.5 you need to use PyMOL 1.4.
1. First, check which Python version your PyMOL uses. Type
import sys; print sys.version
in the PyMOL console and note the leading version number (e.g. "2.7").
2. Retrieve the Mac package for your Python version (from step 1) from the Download
page.
3. Unzip the package and copy the contents of its "modules" and "startup" directories
into your PyMOL "modules" and "startup" folders, respectively. These can usually be found at:
PyMOLX11Hybrid.app/pymol/modules/pmg_tk/startup
4. Download and install ClustalW (ftp://ftp.ebi.ac.uk/pub/software/clustalw2/).
5. Download and install MODELLER (http://salilab.org/modeller/download_installation.html).
Remember to register to get a license key. If you have installed PyMOL 1.4 you need
MODELLER version 9.9 or greater.
6. (Not required if you have PyMOL 1.4 and Python version 2.7 [from step 1].) The final
step is the setup of the CEAlign module. You can compile ccealign from source (a) or try
the "quick and dirty" method (b):
a. Go to your ".../pmg_tk/startup/pymod/cealign" directory. Open a shell and type
sudo python setup.py build
The compiler generates a folder named "build". Inside this folder there is a
directory with a name based on your OS and Python version (e.g. "lib.linux-x86_64-2.6").
Copy the file "ccealign.so" from this directory into ".../startup/pymod".
b. Go to your ".../pmg_tk/startup/pymod/cealign" directory, rename the file
"ccealign-version10.X.so" (10.X is the version of your OS) to "ccealign.so" and copy it into
".../startup/pymod".
2.3 Linux (Ubuntu 10+)
(Beta test - some functions may be missing)
1. Retrieve the Linux package from the Download page and unzip all the files into the "startup"
folder of PyMOL. It might be under:
/var/lib/python-support/python2.X/pmg_tk/startup/
2. Open the Synaptic package manager (System--->Administration--->Synaptic package
manager) and download these packages:
a. Clustalw
b. Biopython
c. Python-dev (this is important for the last step)
3. Download and install MODELLER (http://salilab.org/modeller/download_installation.html).
Remember to register to get a license key. If you have installed PyMOL 1.4 you need
MODELLER version 9.9 or greater.
4. The final step is the setup of the CEAlign module. You have to compile ccealign from
source (this is why you installed Python-dev in step 2):
a. Go to your ".../pmg_tk/startup/pymod/cealign" directory.
b. Open a shell and type "sudo python setup.py build".
c. The compiler generates a folder named "build". Inside this folder there is a
directory with a name based on your OS and Python version (e.g. "lib.linux-x86_64-2.6").
d. Copy the file "ccealign.so" from this directory into ".../startup/pymod".
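The last two sub-steps (finding the platform-specific "lib.*" directory and copying "ccealign.so") can be scripted. The sketch below is only an illustration under stated assumptions: the function name install_built_module is hypothetical, and the build and destination paths must be supplied by you, since the full paths are installation-specific.

```python
import glob
import os
import shutil


def install_built_module(build_dir, dest_dir, filename="ccealign.so"):
    """Copy build_dir/lib.*/filename into dest_dir and return the new path.

    Raises IOError if the compiled file is not found, which usually means
    that the build step did not succeed.
    """
    matches = sorted(glob.glob(os.path.join(build_dir, "lib.*", filename)))
    if not matches:
        raise IOError("no %s found under %s" % (filename, build_dir))
    target = os.path.join(dest_dir, filename)
    shutil.copy(matches[0], target)
    return target
```

Run from inside the cealign directory after the build, a call such as install_built_module("build", os.pardir) would copy the module one level up, assuming the parent directory is the pymod folder as in the paths above.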
3 PyMod Overview
[Figure 1 flowchart. Boxes: Sequence; Database search ((PSI-)BLAST); Sequences; Sequence alignment (MUSCLE - ClustalW); Structures; Structural alignment (CE align); Structure-based multiple sequence alignment; Homology Modeling (MODELLER); 3D-Structure.]
Figure 1. Flowchart representing the PyMod workflow. Every step can be run standalone; e.g. you don’t need to
use BLAST (for sequence retrieval) before aligning two or more sequences (with ClustalW or MUSCLE). The
algorithms used are highlighted in red.
3.1 Components
PyMod offers rich functionality built around its core window for sequence alignment, clustering and
editing. Its components are described in the following sub-sections.
3.1.1 Similarity search

BLAST - ( http://blast.ncbi.nlm.nih.gov/Blast.cgi )
The BLAST algorithm is a heuristic program, which means that it relies on some smart shortcuts
to perform the search faster. BLAST performs "local" alignments. Most proteins are modular in
nature, with functional domains often being repeated within the same protein as well as across
different proteins from different species. The BLAST algorithm is tuned to find these domains or
shorter stretches of sequence similarity (McEntyre J, Ostell J: The NCBI Handbook,
http://www.ncbi.nlm.nih.gov/books/NBK21097/)

PSI-BLAST (http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE=Proteins&PROGRAM=blastp&RUN_PSIBLAST=on)
Position-Specific Iterated (PSI)-BLAST is the most sensitive BLAST program, making it useful
for finding very distantly related proteins or new members of a protein family. Use PSI-BLAST
when your standard protein-protein BLAST search failed to find significant hits.
The first round of PSI-BLAST is a standard protein-protein BLAST search. The program builds a
position-specific scoring matrix (PSSM or profile) from a multiple alignment of the sequences
returned with Expect values better (lower) than the inclusion threshold (default=0.005). The
PSSM will be used to evaluate the alignment in the next iteration of search. Any new database hits
below the inclusion threshold are included in the construction of the new PSSM. A PSI-BLAST
search is said to have converged when no more matches to new database sequences are found in
subsequent iterations
(http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=ProgSelectionGuide#tab31).
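The iteration and convergence logic described above can be illustrated with a toy loop. This is a sketch of the control flow only, not of BLAST itself: the "search" argument is a hypothetical stand-in for one search round, taking the set of sequences currently in the profile and returning a mapping from hit identifier to Expect value. No PSSM is actually computed here.

```python
def iterate_until_converged(search, query, inclusion_threshold=0.005,
                            max_iterations=10):
    """Toy PSI-BLAST-style loop.

    Returns (included_ids, rounds_run, converged). A round adds every hit
    whose Expect value is below the inclusion threshold; the loop is
    considered converged when a round adds no new sequence.
    """
    included = set([query])
    for round_number in range(1, max_iterations + 1):
        hits = search(frozenset(included))
        new_hits = set(
            seq_id for seq_id, e_value in hits.items()
            if e_value < inclusion_threshold and seq_id not in included
        )
        if not new_hits:
            return included, round_number, True
        included |= new_hits
    return included, max_iterations, False
```

In a real search the profile built from the included sequences changes which database entries score well, which is why later rounds can recover relatives that the first round missed.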
3.1.2 Alignment of sequences and structures

MUSCLE - ( http://www.drive5.com/muscle/ )
MUSCLE is a program for creating multiple alignments of amino acid or nucleotide sequences. A
range of options is provided that give you the choice of optimizing accuracy, speed, or some
compromise between the two (http://www.drive5.com/muscle/muscle.html#_Toc81224823).

ClustalW - ( http://www.ebi.ac.uk/Tools/msa/clustalw2/ )
ClustalW2 is a general purpose multiple sequence alignment program for DNA or proteins. It
attempts to calculate the best match for the selected sequences, and lines them up so that the
identities, similarities and differences can be seen
(http://www.ebi.ac.uk/Tools/msa/clustalw2/help/index.html).

Cealign - ( http://cl.sdsc.edu/ce.html )
CE is a method for calculating pairwise structure alignments. CE aligns two polypeptide chains
using characteristics of their local geometry as defined by vectors between C alpha positions.
Matches are termed aligned fragment pairs (AFPs). Heuristics are used in defining a set of optimal
paths joining AFPs with gaps as needed. The path with the best RMSD is subject to dynamic
programming to achieve an optimal alignment (http://cl.sdsc.edu/ce/ce_help.html).
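The RMSD used to rank candidate paths is the root-mean-square distance between paired atom positions. The sketch below computes it for two equally long lists of C alpha coordinates that are assumed to be already superposed; it does not perform the superposition or the AFP search, and the function name is an illustrative choice.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired (x, y, z) coordinates.

    Both lists must have the same, non-zero length and must already be in
    the same frame of reference (no superposition is done here).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```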
3.1.3 Homology Modeling

Modeller – (http://salilab.org/modeller/)
MODELLER is used for homology or comparative modeling of protein three-dimensional
structures. The user provides an alignment of a sequence to be modeled with known related
structures and MODELLER automatically calculates a model containing all non-hydrogen atoms.
MODELLER implements comparative protein structure modeling by satisfaction of spatial
restraints, and can perform many additional tasks, including de novo modeling of loops in protein
structures, optimization of various models of protein structure with respect to a flexibly defined
objective function, multiple alignment of protein sequences and/or structures, clustering, searching
of sequence databases, comparison of protein structures, etc.
(http://salilab.org/modeller/about_modeller.html)
4 Usage Example
4.1 Modeling the dihydrofolate reductase from Mycobacterium avium

Go to the NCBI web site and search for the dihydrofolate reductase from Mycobacterium
avium (GI: 2586159 - http://www.ncbi.nlm.nih.gov/protein/2586159). Download the sequence
file in FASTA format.
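Before importing the file it can be useful to check that it really is in FASTA format, i.e. a header line starting with ">" followed by one or more sequence lines. The reader below is a minimal sketch using only the standard library; the function name read_fasta is illustrative, and it is not the parser PyMod uses internally.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

A downloaded file can be passed directly as an open file object, for example list(read_fasta(open("sequence.fasta"))), where the file name is whatever you chose when saving.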

Launch PyMOL and select PyMod from the PyMOL Plugin menu. From the main window of
PyMod select File → Sequences → Add from file and choose the FASTA file that you
downloaded before. The sequence will be imported into the plugin, as shown in fig. 2.
Figure 2. PyMod main window.
The next step is a database search for homologous sequences that have an
experimentally solved 3D structure. To perform this task we will use the BLAST function:


Select the sequence by left-clicking on its header (in the PyMod left panel - it will become
green).
From the Tools menu select BLAST; a preferences window will appear (fig. 3). Several
parameters can be modified; in this tutorial, however, we simply keep the default values
and submit. This operation can take several minutes, depending on the sequence length and
the speed of your internet connection.
Figure 3. BLAST Preferences window.
After the database search has finished, the results window will show up; here, you can
choose to import one or more sequences (fig. 4). As you can see in this example, the first entry
has 100% identity with our query sequence; this is because the structure of the dihydrofolate
reductase of Mycobacterium avium has already been experimentally solved. We will ignore
this entry and use it later to validate our results. For this tutorial, we will choose two proteins
as templates for the modeling task, i.e., dihydrofolate reductase from Bacillus anthracis (PDB
code: 3JW3; 33.94% sequence identity with our query) and dihydrofolate reductase from
Moritella profunda (PDB code: 2ZZA; 40.80% sequence identity with our query). Select these
proteins using the checkboxes and press Submit.
Figure 4. BLAST output window.
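The sequence identity values reported for the templates are percentages of identical residues in a pairwise alignment. The sketch below shows one common way to compute such a value from two aligned strings, counting only columns where both sequences have a residue. This denominator is an assumption made for illustration; other conventions exist, so the numbers reported by BLAST may differ slightly.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue
        compared += 1
        if res_a.upper() == res_b.upper():
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared
```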

Your selected sequences will be imported into the PyMod main window and clustered with your
query sequence. You can expand or collapse this cluster by clicking on the “ + ” button
beside your query sequence.

Expand your cluster and download the corresponding PDB structures by right-clicking on each
sequence header and selecting Get PDB File (fig. 5). After a few seconds PyMod will
automatically import the structures into PyMOL and split them by chain in the PyMod
main window (fig. 6).
Figure 5. Get PDB File function.
Figure 6. Structures imported in PyMOL and split by chain.

You can select all the sequences that you don’t want to work with (by left-clicking them) and
then delete the selection through the pop-up menu on the left panel of the PyMod window (you
can see this option in fig. 5; there it was not clickable because only one sequence was
selected). Here we will keep the “A” chains and delete the others.

Although the gain in accuracy from using multiple structural templates is still a
matter of debate, it has been claimed over the years that this approach better
captures the variability and divergence of natural structures [7]. When modeling with multiple
templates, the first step is to superpose them, and then to derive a structure-based
sequence alignment. To accomplish this task, select the headers of the two template proteins
(3JW3 and 2ZZA) and click on Tools → CE struct alignment (fig. 7).
Figure 7. CE align function.
A dialog box will appear, asking if you want to use sequence information in the Combinatorial
Extension algorithm. Using sequence information will increase the probability that similar
amino acids will be structurally superposed. Press YES in the dialog box. After a few seconds
the structures will be superposed in PyMOL and the derived structure-based sequence
alignment will be shown in PyMod (fig. 8).
Figure 8. Structural alignment performed with the Combinatorial Extension algorithm.

After the structural templates have been aligned, add the query sequence to the alignment. To
accomplish this task you can choose between two different tools: ClustalW and MUSCLE. Here
we will use ClustalW. Select all the sequences by left-clicking on their
headers and click Tools → ClustalW. As usual, the preferences window will appear, allowing
you to modify some of the most important parameters of the algorithm. We can just keep
the default values and submit. A dialog box will appear asking whether we want to keep the
previously obtained structural alignment. Since we would like to keep the structural alignment
“in-frame” (i.e., adding indels, when necessary, to both templates), click Yes. At this point the
structural and sequence alignments will be merged together.
As a refinement step we want to delete the C-terminal overhang; right-click the query sequence
and select Edit Sequence. In the “Edit sequence” window just delete the last amino acids as
shown in fig. 9 and press Submit. Edit the other sequences to delete their overhangs.
Figure 9. Sequence editor window.
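The manual trimming above removes residues at the C-terminus that have no counterpart in the other sequences. One possible rule for doing this programmatically is sketched below: cut every sequence after the last alignment column in which all sequences have a residue. This rule and the function name are assumptions for illustration; PyMod's editor leaves the choice of what to delete to the user.

```python
def trim_c_terminal_overhang(alignment, gap="-"):
    """Cut an alignment after the last column without any gap.

    alignment maps sequence names to aligned strings of equal length.
    A new dictionary with the shortened strings is returned.
    """
    sequences = list(alignment.values())
    if not sequences:
        return {}
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    end = length
    # Walk back from the C-terminus while any sequence has a gap there.
    while end > 0 and any(seq[end - 1] == gap for seq in sequences):
        end -= 1
    return dict((name, seq[:end]) for name, seq in alignment.items())
```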

After a multiple alignment has been obtained, we can proceed with the last step of the
flowchart, model building. Just before performing this last task, however, we will manually check
the alignment to pinpoint potentially misaligned regions. Indeed, scrolling the alignment to the
C-terminal region (approximately near ASP 130 of the query sequence) we notice four
consecutive ASP residues that are not present in the structural templates. This suggests a
possible indel in this region. Modify the alignment as shown in Fig. 10, by left-clicking on a
sequence and dragging to the right or to the left to create or remove an insertion, respectively.
Figure 10. Refining the alignment.

The next step is the homology model building. Select the query sequence and click Tools →
Modeller. In the options window (Fig. 11) choose both templates and set the optimization
level to High. Make sure to include heteroatoms (i.e., ligands or cofactors) during the
model building. Click SUBMIT. This operation can take several minutes.
Figure 11. Modeller option window.
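For readers who want to see what a comparable run looks like outside the plugin, the sketch below uses MODELLER's own Python interface (the automodel class). It is not PyMod's internal code: the mapping of the "High" optimization level to the slow schedules, the function name build_model and all argument names are assumptions. It expects a PIR-format alignment file and requires a licensed MODELLER installation.

```python
def build_model(alignment_file, template_codes, target_code, n_models=1):
    """Build homology models with MODELLER and return (file name, DOPE score) pairs.

    The MODELLER imports are done inside the function so that it can be
    defined on a machine where MODELLER is not installed.
    """
    from modeller import environ
    from modeller.automodel import automodel, assess, autosched, refine

    env = environ()
    env.io.hetatm = True  # read HETATM records (ligands, cofactors) from the templates

    model = automodel(env,
                      alnfile=alignment_file,
                      knowns=tuple(template_codes),
                      sequence=target_code,
                      assess_methods=(assess.DOPE,))
    model.starting_model = 1
    model.ending_model = n_models
    # Thorough optimization settings (assumed here to resemble a "High" level).
    model.library_schedule = autosched.slow
    model.max_var_iterations = 300
    model.md_level = refine.slow
    model.make()

    return [(out["name"], out["DOPE score"])
            for out in model.outputs if out["failure"] is None]
```

Note that for heteroatoms to appear in the final model, the alignment itself must also contain the corresponding placeholder residues for the target; setting hetatm alone only makes MODELLER read them from the template files.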

When Modeller has finished, the homology model will be automatically imported into the PyMOL
main window (Fig. 12) and a DOPE score-based graph will appear for an energetic validation
of the model (Fig. 13).
Figure 12. Homology model imported in the PyMOL main window.
Figure 13. DOPE score-based graph.

Now we can compare the obtained model with the experimentally solved 3D structure. Click
on Plugin → PDB Loader Service from the PyMOL menu and type 2W3W. Now click on the “A”
near the 2w3w code and choose Align → to molecule → 1_gi_2586159. The structures will be
superposed as shown in Fig. 14.
Figure 14. Superposition of the obtained model with the experimentally-solved 3D structure. In white: model of
the dihydrofolate reductase of Mycobacterium avium. In cyan: experimentally-solved dihydrofolate reductase of
Mycobacterium avium (PDB code: 2W3W).
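The same superposition can be requested from a script through PyMOL's cmd.align function, which also reports the RMSD. The sketch below assumes that both objects are already loaded under the names used in this tutorial and that it runs inside PyMOL; whether the menu entry uses exactly the same selection is an assumption, and the function name is illustrative.

```python
def superpose_on_model(mobile="2w3w and chain A", target="1_gi_2586159"):
    """Superpose the experimental structure onto the model inside PyMOL.

    Returns the RMSD (in Angstrom) after refinement, which is the first
    element of the list returned by cmd.align.
    """
    from pymol import cmd  # only available inside a PyMOL session or install

    result = cmd.align(mobile, target)
    return result[0]
```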

As we can see, our model contains only a few errors in the external loops and is highly
consistent with the experimentally solved structure in the core region and the active site. It is
also important to stress that building the model with heteroatoms included allows the side
chains in the active site to be oriented correctly, as shown in fig. 15.
Figure 15. Correct orientation of the side chains that interact with the cofactor in the
active site of the protein. In white: model of the dihydrofolate reductase of Mycobacterium avium. In cyan:
experimentally-solved dihydrofolate reductase of Mycobacterium avium (PDB code: 2W3W).
5 References
1. DeLano WL: The PyMOL Molecular Graphics System. San Carlos, CA: DeLano Scientific
2002.
2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search
tool. J. Mol. Biol. 1990, 215, 403-410.
3. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high
throughput. Nucleic Acids Res. 2004, 32(5):1792-1797.
4. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of
progressive multiple sequence alignment through sequence weighting, position-specific
gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22, 4673-4680.
5. Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial
extension (CE) of the optimal path. Protein Eng. 1998, 11(9):739-747.
6. Eswar N, Marti-Renom MA, Webb B, Madhusudhan MS, Eramian D, Shen M, Pieper U, Sali
A: Comparative Protein Structure Modeling With MODELLER. Current Protocols in
Bioinformatics 2006, Supplement 15, 5.6.1-5.6.30.
7. Venclovas Č, Zemla A, Fidelis K, Moult J: Assessment of progress over the CASP
experiments. Proteins 2003, 53, Suppl 6:585-595.