Reliscript User Guide

About this User Guide

This user guide is a practical guide to using Reliscript, a command-line interface which allows access to PDB data and Relibase+ search methods from within the Python scripting language environment. It includes information on how to access data from the Relibase+ database and provides example scripts and tutorial scripts to illustrate how to set up a search.

Use the < and > navigational buttons above to move between pages of the user guide and the TOC and Index buttons to access the full table of contents and index. Additional on-line Relibase+ resources can be accessed by clicking on the links on the right hand side of any page.

A set of tutorials is available for Reliscript. Tutorials can be accessed by clicking on the Tutorials link on the right hand side of any page.

1 How to Use This Manual

If you are completely new to Reliscript and Relibase+, start by reading the following sections in the order given:

    2 Basic Introduction (see page 2)
    3 Reliscript Overview (see page 6)

If you already know about Relibase+, you can get familiar with Reliscript by doing the tutorial (see Appendix C: Reliscript Tutorials, page 101). Alternatively, just read through some of the example scripts:

    9 Example Scripts (see page 75)

If you have already used Reliscript and want to look up particular details of objects and functions, the key reference sections are:

    4 Accessing Protein Data: Data Objects (see page 23)
    5 Storing and Manipulating Collections of Objects: Container Objects (see page 54)
    6 Doing Searches and Other Calculations: Operation Objects (see page 61)
    7 Global Utility Functions (see page 72)

If you are an advanced Reliscript user and want to extend the functionality of the language, e.g. by writing your own objects for searching Relibase+ data, the key section is:

    8 Extending the Functionality of Reliscript (see page 73)

2 Basic Introduction

2.1 What is Relibase+?
Relibase+ (http://www.ccdc.cam.ac.uk/products/life_sciences/relibase/) is a tool for searching and analysing protein-ligand structures. It features:

• A browser-based graphical user interface
• A fast database-search engine
• 3D visualisation using AstexViewer (embedded) or Hermes (client)
• Local installation for confidential searching
• The ability to search both the PDB and proprietary databases of protein-ligand complexes
• Text searching
• 2D substructure searching
• 3D substructure searching
• 3D searching for protein-ligand interactions
• Similarity searching for ligands
• Sequence searching
• Automatic superimposing of related binding sites
• Logical combination of hitlists
• Exploration of protein crystal packing
• A water structure information module containing detailed information about the water structure in each entry
• A cavity information module for detecting similarities (unexpected or otherwise) amongst protein cavities (e.g. active sites) that share little or no sequence homology
• A secondary structure information module for searching and displaying secondary structure

2.2 What is Reliscript?

Reliscript is a command-line interface to Relibase+ (see Section 2.1, page 2). It allows access to the Relibase+ enhanced PDB data and search methods from within the Python scripting language environment (see Section 2.3, page 3). It can be used to construct more complex queries than are available through the Relibase+ web-browser interface. Hits from Reliscript searches can be saved as Relibase+ hitlists for subsequent viewing in the Relibase+ web interface. Conversely, hitlists from Relibase+ searches can be read into Reliscript for further manipulation. Reliscript can be used in conjunction with many other libraries and applications using powerful interface facilities provided by Python (see Section 2.3, page 3).
The Reliscript Overview (see Section 3, page 6) gives more details of the objects, functions and mode of use of Reliscript.

2.3 What is Python?

The following is quoted from the Python web site (http://www.python.org):

Python is an interpreted, interactive, object-oriented programming language. It is often compared to Tcl, Perl, Scheme or Java. Python combines remarkable power with very clear syntax. It has modules, classes, exceptions, very high level dynamic data types, and dynamic typing. There are interfaces to many system calls and libraries, as well as to various windowing systems (X11, Motif, Tk, Mac, MFC). New built-in modules are easily written in C or C++. Python is also usable as an extension language for applications that need a programmable interface.

The Appendix A: Glossary (see page 77) includes, amongst other things, a brief overview of basic Python features and terminology. Beyond that, an excellent Python tutorial can be found at the following web address:

• http://www.python.org/doc/current/tut/tut.html

2.4 Quick Python Primer

The following is intended as a quick primer to Python. Open up an interactive session by typing python at the operating system command-line prompt.

The code below illustrates how to create a simple 'Hello world' program in Python.

    >>> print 'Hello world'
    Hello world

The code below illustrates the data types integer, float and string.

    >>> a = 1    # an integer
    >>> b = 3    # another integer
    >>> a + b
    4
    >>> a / b    # Careful! integer division!
    0
    >>> c = 3.0  # a float
    >>> a / c
    0.33333333333333331
    >>> d = 'string'
    >>> d[0]     # access first letter (as if it was a list, see below)
    's'
    >>> e = "another string"
    >>> d + e
    'stringanother string'
    >>> print a, c, e  # note that whitespaces are added automatically
    1 3.0 another string

The code below illustrates the data structures list, tuple and dictionary.
    >>> l = [1, 3.3, 't']  # a list can hold different data types
    >>> l.append('s')      # append another value to the list
    >>> l[0]               # first value
    1
    >>> l[-1]              # last value
    's'
    >>> l[0] = 4.5         # change first value from 1 to 4.5
    >>> for value in l: print value
    ...
    4.5
    3.3
    t
    s
    >>>
    >>> t = (1, 3.3, 't')  # a tuple can hold different data types
    >>> t[0]
    1
    >>> t[0] = 4.5         # Error! tuples are not mutable
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: object does not support item assignment
    >>>
    >>> d = {'a': 1, 'b': 2.2}  # a dictionary
    >>> d['c'] = 't'            # update/append value of entry 'c'
    >>> d['a']
    1
    >>> d['a'] = 3.3
    >>> for key, value in d.iteritems(): print key, value
    ...
    a 3.3
    c t
    b 2.2

See the script python_primer.py for a quick introduction to functions, for loops, and writing to and reading from files.

Please note that Python is very sensitive to indentation, as that is how it determines where functions, for loops and conditional statements begin and end. It is therefore a good habit to indent with spaces (preferably four per indentation level) instead of tabs.

2.5 A Simple Reliscript Example

A very simple Reliscript example is as follows:

    import reliscript
    pdb_set = reliscript.set('pdb')
    ser_search = reliscript.text_search('SERINE', field='header')
    ser_search(pdb_set)
    pdb_set.save_to_hitlist('serine search')

This searches all PDB entries to find those containing the string SERINE in the header record. The resulting list of entries is stored persistently (i.e. on disk) as the Relibase+ hitlist serine search and hence can be viewed in the Relibase+ web-browser interface. A more detailed explanation of this script is given elsewhere (see Section 3.2, page 12).

3 Reliscript Overview

Reliscript is based on Python (see Section 2.3, page 3), which is an object-oriented language. Consequently, Reliscript itself is object-oriented.
Although it provides some global utility functions (see Section 7, page 72), virtually all data-access and search-and-analysis functionality is presented in the form of objects. Working in Reliscript essentially involves creating the appropriate objects, manipulating them as desired, and writing out information contained in the manipulated objects.

3.1 Running Reliscript and Writing and Debugging Scripts

Python in general, and Reliscript in particular, can be used either interactively or by running a Python (.py) script file in batch mode. You will probably want to run Reliscript interactively, using a small test hitlist, when writing and debugging a new script (see Section 3.1.8, page 11). Reliscript jobs on the full PDB database can take several hours to run, so once a script has been debugged, it is usually better to run it in batch mode to get the final results.

3.1.1 Setting up the Reliscript Environment

Before Reliscript can be run, a Reliscript environment has to be set up. Start by moving into the Relibase install directory:

    cd <Relibase install directory>

Source the script relibase.setup.sh in the bin directory:

    # for sh/bash users:
    . bin/relibase.setup.sh
    # for csh/tcsh users:
    source bin/relibase.setup

This sets up a number of Relibase+ environment variables and defines the setup_reliscript alias. Run the setup_reliscript command:

    setup_reliscript

This sets up the environment variables required for Reliscript.

3.1.2 Setting up a Reliscript Client

If you intend to make use of Reliscript, it is worth considering creating a Reliscript client installation on a different machine to your Relibase server. This has the advantage of keeping the (memory and CPU hungry) Python process that Reliscript uses away from other Relibase processes such as the main server and the database search engine. It also means that potential users do not need to be able to log in to the Relibase server.
Before a Reliscript client can be used, the database security needs to be relaxed so that remote Reliscript clients may connect. To do this, edit the $RELIBASE_ROOT/derby/derby.properties file on the server, and remove the '#' from the front of the first line so it reads:

    derby.drda.host=0.0.0.0

For this to take effect the database needs to be restarted:

    relibase -database stop
    relibase -database start

To create the Reliscript client, log in to the Relibase server and move into the Relibase root directory:

    cd $RELIBASE_ROOT

Read the documentation of the reliscript_client.sh script:

    ./bin/reliscript_client.sh -h

Run the reliscript_client.sh script:

    ./bin/reliscript_client.sh

This creates a file called reliscript_client.tar.gz. Copy this file to the target system. Log in to the target system and unpack the reliscript_client.tar.gz file:

    tar -zxvf reliscript_client.tar.gz

Move into the reliscript_client directory:

    cd reliscript_client

Read the README file:

    cat README

To set up the Reliscript client environment run the commands:

    env RELIBASE_ROOT=$PWD bin/update_config.sh -reliscript
    # For bash users
    . bin/relibase.setup.sh
    # For csh/tcsh users
    source bin/relibase.setup
    setup_reliscript

The Reliscript client is now installed and Python Reliscript scripts can be executed on the target machine. The first time the Reliscript client is installed it is worth creating a fast lookup table (see Section 3.1.5, page 9):

    python python/reliscript/create_fast_lookup.py

3.1.3 Starting Python in Interactive Mode

To run interactively, type python at the operating system command-line prompt (or, depending on your installation, you may need to type an alias instead):

    bash-3.1.17$ python
    Python 2.5.2 (r252:60911, Jul 23 2008, 17:11:49)
    [GCC 3.2.3 20030502 (Red Hat Linux 3.2.3-59)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>>

You must then import the Reliscript module by entering the command import reliscript (see Section 3.1.4, page 8); Reliscript commands may then be typed and executed.

Typing the following:

    python -i my_script.py

at the Unix prompt will open the Python interpreter, run the commands in my_script.py, and leave you in interactive mode as if you had just typed the commands manually.

3.1.4 Importing the Reliscript Module and Other Python Modules

To import the Reliscript module, enter the command import reliscript. This will produce output such as:

    >>> import reliscript
    Starting the JVM -Xms128m -Xmx512m -Xmn64m
    Imported psyco for python speed optimization
    >>>

The command sets up a "namespace" called reliscript which allows access to all the Reliscript functionality, e.g. commands such as the following may then be executed:

    import reliscript
    reliscript.use_workspace('fred')
    lig_set = reliscript.set('ligand')

The namespace can be aliased to something else, e.g.:

    import reliscript as rs
    rs.use_workspace('fred')
    lig_set = rs.set('ligand')

Other Python modules (see page 82) may be similarly loaded, e.g. import re will load the Python module for handling regular expressions.

3.1.5 Reliscript Fast Lookup Script

When Reliscript is started it needs to create an internal fast lookup table that is used when obtaining one type of Reliscript object from another, for example, getting the ligand objects associated with a PDB object. Creating this fast lookup information can take quite a while. To speed up Reliscript startup and also reduce its memory requirements, it is possible to precalculate this lookup information.

To do this, go to the location of the reliscript.py file in the Reliscript hierarchy and locate the file create_fast_lookup.py. Execute this Python script using the same version of Python as used for normal Reliscript.
This script duplicates the fast lookup creation in Reliscript but saves the information to a Python file so it can be used in preference to recalculating the lookup information. The name of the file created will be of the form:

    reliscript_fast_lookup_xxxxxxxxxx.py

where each x is a numeric digit. These digits represent the total size of all the Relibase+ databases. This total size is used by reliscript.py to check that the correct fast lookup information is available and can be loaded.

When you add updates provided by the CCDC, update your own in-house databases, or change which set of database files is to be used, you will need to re-run the create_fast_lookup.py script to create a lookup file that matches the new database configuration.

3.1.6 Using Alternative Databases

By default all databases are used by Reliscript, i.e. the PDB database reli as well as any in-house databases. To use only an in-house database, first create a fast lookup table for that database:

    create_fast_lookup.py mydb1

The database can then be set using the command:

    >>> reliscript.set_database('mydb1')

To use multiple in-house databases, first generate a fast lookup table for the databases of interest:

    create_fast_lookup.py mydb1:mydb2

The databases of interest can then be set using the command:

    >>> reliscript.set_database('mydb1:mydb2')

3.1.7 Useful Interactive Aids; Browsing and Autocompletion

When using Reliscript interactively, the up and down arrow keys can be used to browse the command history list. In addition the <TAB> key can be used to auto-complete a command. For example:

    >>> consensus_search = reliscript.con<TAB pressed>

produces:

    >>> consensus_search = reliscript.consensus_search

and:

    >>> pdb_1sq4 = reliscript.create('1qs4')
    >>> print pdb_1sq4.<TAB pressed>

lists all the attributes of pdb_1sq4, while:

    >>> print pdb_1sq4.a<TAB pressed>

lists all the attributes starting with the letter "a".
Note: This feature should be activated by default in the Reliscript Python initialization file, identified by the PYTHONSTARTUP environment variable in $RELIBASE_ROOT/python/reliscript/reliscript.setup. In order to deactivate this feature comment out the appropriate line.

3.1.8 Creating and Loading Hit Lists

It is easy to create and load hitlists representing small subsets of the entire database. This is useful when writing and debugging scripts, since calculations on the full database can take several hours. The command for loading a PDB hitlist is:

    pdb_set = reliscript.set('pdb', 'name_of_hitlist')

Similarly, hitlists can be created from sets:

    pdb_set.save_to_hitlist('name_of_hitlist')

The command above will raise an exception if the name of the hitlist is already used.

A PDB set of interest can be created by performing a search (see Section 6, page 61). Alternatively, a set of PDB codes can be read in from a text file where each line contains one PDB code, using the script create_hitlist_from_text_file.py. Note that hitlists created using the Relibase+ GUI can also be loaded in Reliscript.

3.1.9 Running Reliscript Jobs in Batch Mode

A Reliscript Python script called myscript.py can be run in the foreground by typing the following at the operating system command-line prompt:

    python myscript.py

(You may need to replace python by an alias if one has been set in your local installation.) To run in background mode, type:

    python myscript.py > myscript.out &

Your Python script must import the reliscript module in order to use Reliscript functionality (see Section 3.1.4, page 8).
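
Section 3.1.8 mentions a text file in which each line contains one PDB code. The following is a minimal sketch of reading such a file in plain Python. It is not the supplied create_hitlist_from_text_file.py script, whose contents are not shown in this guide; the function names parse_pdb_codes and read_pdb_codes are hypothetical, and the treatment of blank lines and '#' comment lines is an assumption of this sketch.

```python
def parse_pdb_codes(lines):
    """Return the PDB codes found in an iterable of text lines.

    Assumptions of this sketch (not taken from the guide): blank lines and
    lines starting with '#' are ignored. Codes are returned exactly as
    written, because in-house codes are described as case sensitive.
    """
    codes = []
    for line in lines:
        code = line.strip()
        if not code or code.startswith('#'):
            continue
        codes.append(code)
    return codes


def read_pdb_codes(path):
    """Read PDB codes from a text file holding one code per line."""
    handle = open(path)
    try:
        return parse_pdb_codes(handle)
    finally:
        handle.close()
```

Each code in the returned list could then be handed to reliscript.create, as described in Section 4.1.1, when only a handful of entries is needed for debugging.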
3.2 Walking Through a Simple Reliscript Example

Consider the following script (lines beginning with # are comments):

    import reliscript
    pdb_set = reliscript.set('pdb')
    ser_search = reliscript.text_search('SERINE', field='header')
    ser_search(pdb_set)
    pdb_set.save_to_hitlist('serine search')

Line 1: Imports the reliscript Python module (see Section 3.1.4, page 8), which provides access to all the classes and functions in Reliscript.

Line 2: The reliscript.set command creates a set object (called, in this case, pdb_set), which is a container for holding data objects (see Section 5.2, page 55). Specifically, the argument 'pdb' instructs Reliscript to create a set containing all the PDB entries in the Relibase+ database. Each PDB entry will be held as a PDB data object (see Section 4.1, page 23). For example (though it is not relevant for the above script) pdb_set[0] (the first object in the pdb_set container) would be the object holding the data for the first PDB entry in the database (index numbers in Python begin at zero, not one).

Line 3: The reliscript.text_search command creates an operation object (called, in this case, ser_search) for performing a text search. The arguments specify that the operation object is to search for the text string SERINE in the header data of PDB objects (equivalent to the HEADER record in a pdb file). There are several other types of operation objects (see Section 6, page 61), e.g. reliscript.sequence_search would create an operation object for searching protein sequences.

Line 4: This line applies the operation object ser_search to the set object pdb_set. It could equally well be written as:

    pdb_set(ser_search)

The practical outcome of the command is that each PDB object in pdb_set is subjected to the search defined in the ser_search operation object, i.e. the HEADER record is searched for the presence of the string SERINE.
Only those PDB objects passing this test will be retained in the set pdb_set; the others will be eliminated.

Line 5: pdb_set now contains all PDB entries in the Relibase+ database that contain SERINE in the HEADER record. The command pdb_set.save_to_hitlist calls the save_to_hitlist function, which is a function available to any set object. In this example, it writes out the hits from the search as a Relibase+ hitlist called serine search. This hitlist may then be loaded into Relibase+ for viewing (see Section 3.7.3, page 18).

3.3 Introduction to Objects in Reliscript

Objects in Reliscript fall into three main categories:

• Data objects, which provide access to all the data that you would expect to find in a database of 3D protein structures, e.g. atomic coordinates, experimental conditions, ligand chemical structures, chain sequences, etc. (see Section 3.3.1, page 13).
• Container objects, for storing collections of data objects, e.g. the hits from a search (see Section 3.3.2, page 14).
• Operation objects, for performing searches, superimpositions, etc. (see Section 3.3.3, page 15).

3.3.1 Introduction to Data Objects

Available data objects are PDB, Chain, NucleicAcid, Ligand, Solvent, Residue, Atom, Bond, BindingSite and PackBindingSite (see Section 4, page 23). The PDB object is the most important, since it contains within it objects of the other types (i.e. a PDB object will contain Chain, NucleicAcid, Ligand and Solvent objects).

Data objects allow access to all the data stored in Relibase+. This access is either provided directly via attributes (see page 77) of the data object or indirectly by providing access to other, related data objects that themselves have access to the required information.
For example, a PDB object would provide direct access to the temperature of the structure determination, which is an attribute of the PDB class:

    # Load structure using the PDB code
    pdb_object = reliscript.create('1xp0')
    # Access the temperature
    temperature = pdb_object.temp

However, access to the compound name of a ligand contained within the PDB entry would require the Ligand object to be created, and it would be this latter object that would provide access to the name:

    # Access a list of all the ligands
    ligand_list = pdb_object.ligands
    # Access the first ligand
    ligand_object = ligand_list[0]
    # Access the ligand name
    ligand_name = ligand_object.compound_name

The values of some data-object attributes can be over-ridden by the user; the most obvious case where this would be useful is if you wished to overwrite the default (Sybyl) atom type of an atom with some other atom type.

3.3.2 Introduction to Container Objects

A container object is an object that holds within it other objects: for example, a list of Ligand objects (list = container object, Ligand = data object). Python provides several types of containers, such as tuples (see page 83), lists (see page 80) and dictionaries (see page 78), and Reliscript makes extensive use of these for returning data. In some programming languages lists are referred to as arrays, and dictionaries are sometimes called "associative arrays" or "hashes". For more information on how to interact with the different containers of Reliscript objects see the script container_example.py.

There is also a Reliscript-specific container object called a set which provides extra functionality (see Section 5.2, page 55). Only some types of data objects can be stored in sets (viz. PDB, Chain, NucleicAcid, Ligand, Solvent). A set containing one type of object can be converted to a set containing another type of object, e.g.
a set of PDB objects can be transformed to a set of Ligand objects:

    # Create a set containing all PDB objects in database
    pdb_set = reliscript.set('pdb')
    # Create a set containing all ligands in the PDB entries
    # in pdb_set
    lig_set = reliscript.set('ligand', pdb_set)

The main use of sets is to store and manipulate collections of data objects on which some sort of search or geometrical transformation is to be applied using an operation object (see Section 3.3.3, page 15).

3.3.3 Introduction to Operation Objects

These objects (see Section 6, page 61) are used to perform tests or geometrical transformations on data objects such as protein chains or ligands. There are different types of operation objects for, e.g.:

• performing text searches;
• performing substructure searches using a SMILES string;
• superimposing protein chains.

It is also possible to write customised operation objects for performing specialist tasks (see Section 8, page 73).

An operation object is applied to a set of data objects. The result is the same container holding a modified collection of data objects, e.g. data objects satisfying a particular test. In addition, the operation object may add extra attributes (see page 77) to the data objects that it processes, e.g.

    # Create a set containing all ligands in database(s)
    lig_set = reliscript.set('ligand')
    # Print number of ligands in lig_set (len returns "length" of
    # set, i.e. number of items it contains)
    print len(lig_set)
    26544
    # Create an operation object called sim_lig_search to do a
    # similar-ligand search, comparing each ligand in lig_set
    # with a previously-created ligand object called a_ligand_obj
    sim_lig_search = reliscript.similar_ligand_search(a_ligand_obj)
    # Apply operation object to ligand set
    sim_lig_search(lig_set)
    # lig_set now contains only those ligands whose similarity
    # to the reference ligand exceeds the default threshold value
    print len(lig_set)
    242
    # The sim_lig_search operation object added a new attribute to
    # each ligand, viz. its similarity with the reference ligand
    print lig_set[0].ligand_similarity['value']
    0.95

Operation objects can be applied in two ways. Specifically, if search_object is an operation object for performing a particular search and a_set is a set of data objects, then both of the following will produce identical results:

    search_object(a_set)

and

    a_set(search_object)

In both cases, a_set will end up containing just those data objects that satisfy the search defined by search_object.

3.4 Introduction to Functions in Reliscript

Functions in Reliscript fall into two categories:

• Functions that are global, i.e. available throughout Reliscript (see Section 7, page 72), e.g.

      d = reliscript.distance(atom1, atom2)

  (computes distance between two atoms, or minimum distance between two groups of atoms).

• Functions that are only available to particular types of objects, e.g.

      atom_list = pdbobject.pdb_atoms(include_pack=0)

  pdb_atoms is a function available to PDB objects that writes atoms as a list of string objects. Most object-specific functions are for writing out structural data (see Section 3.7.4, page 19) or for applying or clearing geometrical transformations (see Section 3.5, page 17).
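
Section 3.4 describes reliscript.distance as computing the distance between two atoms, or the minimum distance between two groups of atoms. The geometry behind that description can be sketched in plain Python as below. This is an illustration only, not Reliscript's implementation: the function names point_distance and min_group_distance are hypothetical, and they operate on bare (x, y, z) tuples rather than on Atom objects.

```python
import math


def point_distance(p, q):
    """Euclidean distance between two (x, y, z) coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def min_group_distance(group1, group2):
    """Smallest pairwise distance between two groups of coordinates.

    Every point of group1 is compared with every point of group2, so the
    cost grows with len(group1) * len(group2).
    """
    return min(point_distance(p, q) for p in group1 for q in group2)
```

The all-pairs comparison is the simplest way to express "minimum distance between two groups"; for very large groups a spatial index would be a more efficient choice.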
3.5 Geometrical Transformations

The chain superimposition object (see Section 6.7, page 71) can be used to superimpose each of a set of protein chains onto a reference chain. This implicitly involves applying a rotation-translation operation to each chain. If chains are transformed in this way, then the same transformation will automatically be applied to ligands retrieved by use of the adjacent_ligands attribute of Chain objects (see Section 4.2.3, page 28). However, other objects (e.g. the solvent around the transformed chain) will not have the geometric transformation applied by default when they are retrieved from the Relibase+ database. A function (transform) is therefore provided which will allow the transformation to be applied explicitly. More details can be found in the sections on individual data objects (see Section 4, page 23).

The function clear_transform can be used to re-set objects back to their original orientation (i.e. as stored in the Relibase+ database).

3.6 Search Databases and Database Identifiers

Relibase+ and Reliscript give access to structural data of protein-ligand complexes in the Protein Data Bank. This data is stored in the database named reli. The entries derived from the reli database are referred to by the database identifier pdb. In addition, Relibase+ system administrators can set up in-house databases containing proprietary protein structures. Each such database will be assigned its own identifier, e.g. mydb.

These identifiers are used, e.g. when printing out Reliscript objects; for example, printing out a Chain object might result in output like:

    Chain<pdb:1a01:A>

or

    Chain<mydb:1a01:A>

depending on which database the chain came from.

By default, Relibase+ and Reliscript will access and search all available databases, i.e. reli and any in-house databases.
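
When post-processing printed output, it can be handy to split labels such as Chain<pdb:1a01:A> back into their components. The sketch below does this with a regular expression. It assumes that a label always has the form type<field:field:...> with the database identifier first, which is inferred from the two examples above rather than from a documented grammar; parse_label is a hypothetical helper, not part of Reliscript.

```python
import re

# Matches labels such as 'Chain<pdb:1a01:A>': a type name followed by
# colon-separated fields inside angle brackets (assumed layout).
_LABEL = re.compile(r'^(\w+)<([^<>]+)>$')


def parse_label(text):
    """Split an object label into (object_type, database_id, other_fields)."""
    match = _LABEL.match(text.strip())
    if match is None:
        raise ValueError('not an object label: %r' % (text,))
    fields = match.group(2).split(':')
    # First field is the database identifier; the rest (PDB code, chain
    # identifier, ...) are returned as a list in their original order.
    return match.group(1), fields[0], fields[1:]
```

For the first example above, parse_label('Chain<pdb:1a01:A>') gives the type 'Chain', the database identifier 'pdb', and the remaining fields ['1a01', 'A'].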
3.7 Writing Output from Reliscript

Reliscript provides facilities for printing results to standard output or a file, transferring data to and from Relibase+, and writing structure files.

3.7.1 Printing to Standard Output

Results from a Python job can be printed to standard output, e.g.

    print pdb_object.year
    1997

3.7.2 Opening Files for Saving Results

Reliscript data objects provide functions for exporting structural data in a number of formats (see Section 3.7.4, page 19). In addition, Python itself provides options for opening files to which results may then be written, e.g.

    # Open file for writing
    out_file = open('tmp_out.txt', 'w')
    # Write something to file
    out_file.write('something')
    # Close file
    out_file.close()

See the script ouput_example.py for more information on how to write a pdb object to a file.

3.7.3 Communicating with the Relibase+ Graphical User Interface

The key method for communicating results between Reliscript and Relibase+ (e.g. to view hits from Reliscript jobs in 3D) is to convert Reliscript sets (see Section 5.2, page 55) to Relibase+ hitlists, or vice versa. PDB and ligand sets can be saved as Relibase+ hitlists by commands such as:

    pdb_set.save_to_hitlist('my_search')

(see Section 5.2.4, page 56). Hitlists saved in a Relibase+ session can be read into Reliscript as sets by commands such as:

    hit_list_lig_set = reliscript.set('ligand', 'my_hitlistname')

(see Section 5.2.2, page 55). If the type specified for the set (e.g. 'ligand' above) is not the same as the type of the hitlist, an automatic conversion will occur. For example, if 'my_hitlistname' in the example above is a PDB hitlist, the resulting Reliscript set will contain all the Ligand objects in the PDB entries contained in the hitlist.

3.7.4 Exporting Structural Data

Most types of data objects have functions that enable the structural data they contain (atom coordinates, etc.) to be written out to file or as a list of string objects.
Availability of these functions is as follows:

    pdb_atoms   returns ATOM records as string objects
    pdb_line    returns one ATOM record as a string object
    save_pdb    writes object in pdb format
    save_mol2   writes object in mol2 format

Data object                                    pdb_atoms  pdb_line  save_pdb  save_mol2
PDB (see Section 4.1.4, page 26)               yes        no        yes       no
Chain (see Section 4.2.4, page 29)             yes        no        yes       no
NucleicAcid (see Section 4.3.4, page 32)       yes        no        yes       no
Ligand (see Section 4.4.4, page 36)            yes        no        yes       yes
Solvent (see Section 4.5.4, page 39)           yes        no        yes       no
Residue (see Section 4.6.4, page 42)           yes        no        yes       no
Atom (see Section 4.7.4, page 46)              no         yes       no        no
BindingSite (see Section 4.9.4, page 50)       yes        no        yes       yes
PackBindingSite (see Section 4.10.4, page 53)  yes        no        yes       no

If a data object has been modified in a Reliscript job (e.g. had an attribute added or changed, or been subjected to a geometrical transformation), the data that will be written out by save_pdb will, by default, be that of the modified object. The data corresponding to the original object, as retrieved from the Relibase+ database, can usually be written out by setting modified=0 in the save_pdb parameter list (this facility is not available for a small number of data objects).

4 Accessing Protein Data: Data Objects

The above data objects (see Section 3.3.1, page 13) are available.

4.1 PDB Objects

A PDB object holds information about a complete PDB entry (or an entry in an in-house database of protein-ligand structures), e.g. author names, the experimental conditions of the structure determination (if available), etc. It also allows access to the 3D results of the structure determination by returning lists of the Chain, NucleicAcid, Ligand, Solvent and Atom objects that it contains.

4.1.1 Creation of PDB Objects

In most cases, PDB objects will be created as members of a container object (see Section 5, page 54), e.g.
    # Create a set containing all PDB objects in database
    # and print first member
    pdb_set = reliscript.set('pdb')
    print pdb_set[0]
    PDB<pdb:1a01>

It is also possible to create a particular, individual PDB object, e.g.

    # Create a PDB object for Protein Data Bank entry 1A01
    pdb_object = reliscript.create('1A01')
    # Create a PDB object for the entry 1XYZ in the in-house
    # database DBID
    pdb_object = reliscript.create('DBID:1XYZ')

Note that the entries in the reli database (entries from the Protein Data Bank) are stored in lower case. The PDB code argument of the create function is case insensitive for these entries. However, it is case sensitive for entries from in-house databases.

4.1.2 Textual Representation of PDB Objects

Print operations, etc., on PDB objects (e.g. print pdb_object) will produce output such as:

    PDB<pdb:1a01>

Colons separate the contents of the angle brackets into components. The first component is the database identifier (see Section 3.6, page 17), followed by the PDB code. If the PDB object is from an in-house database (i.e. not derived from the main Protein Data Bank) the output will be, e.g.:

    PDB<dbid:1ax1>

where dbid is the identifier of the in-house database.

4.1.3 Attributes of PDB Objects

The attributes (see page 77) of a PDB object are:

Name                Type                                Description
a                   float                               Cell length a, in Å.
alpha               float                               Cell angle alpha, in degrees.
atoms               list of Atoms                       List of Atom objects, one for each atom in the entry, in the order: ligand atoms, chain atoms, solvent atoms.
author              string                              Author field as a single string containing all authors delimited by commas or spaces, e.g. P.J.B.PEREIRA,A.BERGNER,S.MACEDO-RIBEIRO,R.HUBER
authors             list of strings                     List of authors, each author stored as a separate string, e.g. ['P.J.B.PEREIRA', 'A.BERGNER', 'S.MACEDO-RIBEIRO', 'R.HUBER']
b                   float                               Cell length b, in Å.
beta                float                               Cell angle beta, in degrees.
binding_sites list of BindingSites List of BindingSite objects, one for each bound ligand. bonds list of Bonds List of Bond objects, one for each bond in the entry. c float Cell length c, in Å. chains list of Chains List of Chain objects, one for each unique protein chain in the entry. compound string Contents of the PDB compound record (COMPND) 24 Reliscript User Guide crystal dictionary Dictionary containing crystallographic information (this information is also accessible via other attributes). Dictionary is of the form, e.g. {'space_group': 'P41', 'z_value': 16, 'cell': (82.93, 82.93, 172.86), 'angles': (90.0, 90.0, 90.0)} where 'cell' refers to the a, b and c cell lengths and 'angles' refers to the alpha, beta and gamma cell angles. Cell lengths of 1.0, 1.0, 1.0 and angles of 90.0, 90.0, 90.0 will be returned for structures determined by NMR. date string Deposition date as stored in the PDB file, e.g. 12. 3. 97 exptl_method string String describing the method used to determine the protein structure, as taken from the EXPDTA record of the PDB file, e.g. X-RAY DIFFRACTION gamma float Cell angle gamma, in degrees. header string Contents of the PDB header record, e.g. SERINE PROTEINASE ligands list of Ligands List of Ligand objects, one for each bound ligand. nucleic_acids list of NucleicAcids List of NucleicAcid objects, one for each unique nucleic acid chain in the entry. pack_binding_sites list of PackBindingSites List of PackBindingSite objects, one for each bound ligand. ph float pH value (returned as -1.0 if no pH value available). r_value float Crystallographic R value (returned as -1.0 if no R-value available). resolution float Crystallographic resolution, in Å (returned as -1.0 if no resolution value available). solvent list containing one Solvent object List containing one Solvent object. This single object will contain information on all the solvent atoms in the entry. Relibase+ solvent data refers only to water molecules. 
Reliscript User Guide 25 source string Contents of the source field, e.g. MOL_ID: 1 ORGANISM_SCIENTIFIC: HOMO SAPIENS ORGANISM_COMMON: HUMAN ORGAN: LUNG CELL: MAST CELL space_group string Crystallographic space group, e.g. P41 temp float Temperature of the study (Kelvin; returned as -1.0 if no temperature available). title string Contents of the title field, e.g. HUMAN BETA-TRYPTASE: A RING-LIKE TETRAMER WITH ACTIVE SITES FACING A CENTRAL PORE year integer Year of the study, e.g. 1997 z_value integer Number of polymeric chains in unit cell. 4.1.4 Functions of PDB Objects The functions (see page 80) of a PDB object are: pdb_atoms(include_pack=0) Returns a list of string objects containing the ATOM records of the PDB entry. Arguments: • include_pack (integer): By default (i.e. if pdb_atoms() is called with no argument), the list will not contain atoms generated by crystallographic symmetry, i.e. the atoms in the PackBindingSite objects (see Section 4.10, page 51) that would be returned by the PDB pack_binding_sites attribute. However, these can be included by passing a positive include_pack value. save_pdb(filename, include_pack=0, modified=1) Saves the PDB entry as a file in pdb format. Arguments: • filename (string): Filename to be used for the output pdb file. • include_pack (integer): By default, the file will not contain atoms generated by crystallographic symmetry, i.e. the atoms in the PackBindingSite objects (see Section 4.10, page 51) that would be returned by the PDB pack_binding_sites attribute. However, these can be included by passing a positive include_pack value. 26 Reliscript User Guide • modified (integer): By default, the output file will include any changes (geometrical transformations, changed or added attributes) that may have been made to the object by Reliscript. To write out the original, unmodified object, set modified=0. 
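As a minimal sketch of how the include_pack and modified arguments of save_pdb combine, the helper below writes three variants of one entry. The helper name save_variants and the output filename pattern are hypothetical and used only for illustration; the only Reliscript call assumed is save_pdb with the arguments documented above, and pdb_object is assumed to be a PDB object obtained as in Section 4.1.1.

```python
def save_variants(pdb_object, stem):
    """Write three pdb files for one PDB object and return their names.

    save_variants and the filename pattern are illustrative only; the
    save_pdb arguments are those documented for PDB objects.
    """
    current = stem + "_current.pdb"
    original = stem + "_original.pdb"
    packed = stem + "_packed.pdb"

    # Default call: current (possibly modified) data, no symmetry atoms.
    pdb_object.save_pdb(current)
    # modified=0 requests the data as retrieved from the database.
    pdb_object.save_pdb(original, modified=0)
    # A positive include_pack adds atoms generated by crystallographic symmetry.
    pdb_object.save_pdb(packed, include_pack=1)
    return [current, original, packed]
```

In a Reliscript session this might be called as save_variants(reliscript.create('1a01'), '1a01'). Writing the original and current data side by side is a simple way to check what a transformation has changed.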
transform(object)
Applies the same (rotation + translation) geometrical transformation to the PDB object as has already been applied to the object passed in as an argument.
Arguments:
• object (Reliscript data object): For example, this could be a Chain object that has been subjected to a geometrical transformation in order to superimpose it on another chain.

clear_transform()
Clears any geometrical transformation that has been applied to the PDB object so that all atom coordinates return to their original values.

4.2 Chain Objects

A Chain object holds information about a protein chain, e.g. its sequence. It also allows access to the Residue objects that make up the chain, the PDB object to which it belongs, etc.

4.2.1 Creation of Chain Objects

Chain objects can be created either by accessing a member of a chain container object, e.g.

    # Create set containing all chains in database
    chain_set = reliscript.set('chain')
    # Get first member
    chain_obj0 = chain_set[0]

or from a PDB object, e.g.

    pdb_object = reliscript.create('1qs4')
    # Get second chain in PDB entry 1qs4
    chain_obj1 = pdb_object.chains[1]

These two steps may be combined in a single line, e.g.

    chain_obj1 = reliscript.create('1qs4').chains[1]

4.2.2 Textual Representation of Chain Objects

Print operations, etc., on Chain objects (e.g. print chain_obj) will produce output such as:

    Chain<pdb:1a01:A>

Colons separate the contents of the angle brackets into components. The first component is the database identifier (see Section 3.6, page 17), followed by the PDB code; the final component is the chain identifier. If the Chain object is from an in-house database (i.e. not derived from the main PDB) the output will be, e.g.:

    Chain<dbid:1ax1:A>

where dbid is the identifier of the in-house database.
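The colon-separated layout described above can be taken apart with ordinary string handling. The following is a minimal sketch, assuming that the printed form of an object is available as a plain string (for instance via str(chain_obj), which is an assumption and not something this guide states). The function name parse_label is hypothetical. The pattern covers labels such as PDB<pdb:1a01> and Chain<pdb:1a01:A>, but not labels whose type name carries extra characters before the angle bracket.

```python
import re

# A type name followed by colon-separated components in angle brackets,
# e.g. "Chain<pdb:1a01:A>".
_LABEL = re.compile(r"^(\w+)<([^<>]+)>$")


def parse_label(text):
    """Return (object_type, database_id, pdb_code, remaining_components)."""
    match = _LABEL.match(text.strip())
    if match is None:
        raise ValueError("unrecognised label: %r" % (text,))
    parts = match.group(2).split(":")
    if len(parts) < 2:
        raise ValueError("label has no PDB code: %r" % (text,))
    # First component: database identifier; second: PDB code;
    # anything after that identifies the object within the entry.
    return match.group(1), parts[0], parts[1], parts[2:]


print(parse_label("Chain<pdb:1a01:A>"))
print(parse_label("PDB<dbid:1ax1>"))
```

Where an attribute such as chain_id or pdb is available, reading it directly is more robust than parsing printed text; the sketch is mainly useful for log files that contain only the printed labels.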
4.2.3 Attributes of Chain Objects

The attributes (see page 77) of a Chain object are:

Name | Type | Description
adjacent_ligands | list of Ligands | List of ligands whose BindingSites include at least one residue from this chain. The list is ordered by the size of the chain/ligand interaction, i.e. when there are two or more ligands in the list, the first will have more of this chain’s residues involved in its binding site than will the second. If the chain has been subjected to a geometrical transformation, e.g. using the superimpose_chain operation object, then all ligands in the list will be transformed in the same way.
atoms | list of Atoms | The Atom objects in the chain.
bonds | list of Bonds | The Bond objects in the chain.
chain_id | string | Chain identifier, e.g. A (returns the single-character string “-” if no identifier available).
n_atom | integer | Number of atoms in the chain.
n_unit | integer | Number of units (i.e. residues) in chain; equivalent to len(chain).
pdb | PDB | The PDB object that contains the chain.
residues | list of Residues | The Residue objects in the chain. Residue index numbers in this list may not be the same as their numbers in the protein SEQRES sequence (see Section 4.2.5, page 30).
sequence | string | String containing the amino-acid sequence as one-letter codes, e.g. IVGTRVTYLDWIHHYVPKK. The sequence is that given in the PDB SEQRES records. When the chain has been created from a BindingSite or PackBindingSite object, this attribute will return an empty string.
sequence_3d | string | String containing the amino-acid sequence as one-letter codes, e.g. IVGTRVTYLDWIHHYVPKK. The sequence returned is that determined from the residues in the PDB ATOM records, not the sequence defined in PDB SEQRES records. These may differ, e.g. the experimental sequence would not include residues whose 3D atomic positions were not determined because of crystallographic disorder.
type | string | String that identifies this object type. For a Chain object, this will be the string protein_chain

4.2.4 Functions of Chain Objects

The functions (see page 80) of a Chain object are:

pdb_atoms()
Returns a list of string objects containing the ATOM records of the chain.

save_pdb(filename, modified=1)
Saves the chain as a file in pdb format.
Arguments:
• filename (string): Filename to be used for the output pdb file.
• modified (integer): By default, the output file will include any changes (geometrical transformations, changed or added attributes) that may have been made to the object by Reliscript. To write out the original, unmodified object, set modified=0.

transform(object, on_original=0)
Applies the same (rotation + translation) geometrical transformation to the Chain object as has already been applied to the object passed in as an argument.
Arguments:
• object (Reliscript data object): A data object (e.g. a PDB, Chain, NucleicAcid, Ligand, Solvent, or Residue object) that has been subjected to a geometrical transformation.
• on_original (integer): By default, the transformation will be applied to the Chain object in its current orientation, which may already be the result of a previous transformation. To transform the original, untransformed atomic positions, set the on_original flag to a nonzero value.

clear_transform()
Clears any geometrical transformation that has been applied to the Chain object so that all atom coordinates return to their original values.

4.2.5 Accessing the Residues in a Chain; Residue Numbering

Chain objects have internal functions that allow them to act both like Python lists (see page 80) and Python dictionaries (see page 78) for the purposes of accessing the Residue objects that they contain. The script accessing_residues.py shows how these access functions can be used; in each case, the comment indicates how the equivalent access could be made by using the residues attribute (see Section 4.2.3, page 28).
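The listing of accessing_residues.py is not reproduced in this text. As a rough, self-contained stand-in for the two access styles, the sketch below uses a toy class, ToyChain, which is hypothetical and not part of Reliscript: it stores only residue label strings and mimics list-style and key-style indexing. The labels are made up, and the four statements are illustrations written for this sketch rather than the contents of the original script.

```python
class ToyChain(object):
    """Hypothetical stand-in for a Chain object; NOT part of Reliscript.

    Each residue is represented only by its PDB label string.
    """

    def __init__(self, labels):
        # A plain Python list, like the documented residues attribute.
        self.residues = list(labels)

    def __len__(self):
        return len(self.residues)

    def __iter__(self):
        return iter(self.residues)

    def __getitem__(self, key):
        if isinstance(key, str):
            # Dictionary-style access by PDB residue label.
            for label in self.residues:
                if label == key:
                    return label
            raise KeyError(key)
        # List-style access: integer index or slice, counted from 0.
        return self.residues[key]


# Thirty residues with made-up labels '16' to '45'.
chain = ToyChain([str(number) for number in range(16, 46)])

looped = [res for res in chain]  # same as looping over chain.residues
first = chain[0]                 # same as chain.residues[0]
window = chain[4:21]             # same as chain.residues[4:21]
by_label = chain['17']           # key access; chain.residues has no equivalent

print("first=%s window=%s..%s (%d residues) by_label=%s"
      % (first, window[0], window[-1], len(window), by_label))
```

Note that the label '17' belongs to the residue at index 1 in this toy chain, which is why numerical indices and PDB labels should not be used interchangeably.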
The crucial point is that, for loops and numerical indexing, the residues are considered to run from residue 0 to residue N-1 (where N is the number of residues in the chain). Thus, the third example would return the 5th to the 21st residues in the chain, not the 4th to the 20th. However, within a chain residues have string labels associated with them from the PDB, such as 16, -1 or 27B; normally (though not invariably) these will be the residue sequence numbers as used conventionally by a protein chemist. The Chain object provides access to residues via this label by allowing the index into the chain to be a string “key”, as in the final example above. This method of access is not possible via the residues attribute, which is a pure Python list and therefore does not support key access.

4.3 NucleicAcid Objects

4.3.1 Creation of NucleicAcid Objects

NucleicAcid objects can be created either by accessing a member of a nucleic_acid container object, e.g.

    # Create set containing all nucleic acids in database
    nucleic_acid_set = reliscript.set('nucleic_acid')
    # Get first member
    nucleic_acid_obj0 = nucleic_acid_set[0]

or from a PDB object, e.g.

    pdb_object = reliscript.create('100d')
    # Get first nucleic acid chain in PDB entry 100d
    nucleic_acid_obj1 = pdb_object.nucleic_acids[0]

These two steps may be combined in a single line, e.g.

    nucleic_acid_obj1 = reliscript.create('100d').nucleic_acids[0]

4.3.2 Textual Representation of NucleicAcid Objects

Print operations, etc., on NucleicAcid objects (e.g. print nucleic_acid_obj) will produce output such as:

    NucleicAcid<pdb:100d:A>

Colons separate the contents of the angle brackets into components. The first component is the database identifier (see Section 3.6, page 17), followed by the PDB code; the final component is the nucleic acid identifier. If the NucleicAcid object is from an in-house database (i.e. not derived from the main PDB) the output will be, e.g.:

    NucleicAcid<dbid:1xxx:A>

where dbid is the identifier of the in-house database.

4.3.3 Attributes of NucleicAcid Objects

The attributes (see page 77) of a NucleicAcid object are:

Name | Type | Description
adjacent_ligands | list of Ligands | List of ligands whose BindingSites include at least one residue from this nucleic acid. The list is ordered by the size of the nucleic acid/ligand interaction, i.e. when there are two or more ligands in the list, the first will have more of this nucleic acid’s residues involved in its binding site than will the second. If the nucleic acid has been subjected to a geometrical transformation then all ligands in the list will be transformed in the same way.
atoms | list of Atoms | The Atom objects in the nucleic acid.
bonds | list of Bonds | The Bond objects in the nucleic acid.
chain_id | string | Nucleic acid chain identifier, e.g. B (returns the single-character string “-” if no identifier available).
n_atom | integer | Number of atoms in the nucleic acid.
n_unit | integer | Number of units (i.e. residues) in the nucleic acid chain; equivalent to len(nucleic_acid).
pdb | PDB | The PDB object that contains the nucleic acid.
residues | list of Residues | The Residue objects in the nucleic acid.
sequence_3d | string | String containing the nucleic acid sequence as one-letter codes, e.g. ATTAGTA. The sequence returned is that determined from the residues in the PDB ATOM records, not the sequence defined in PDB SEQRES records. These may differ, e.g. the experimental sequence would not include residues whose 3D atomic positions were not determined because of crystallographic disorder.
type | string | String that identifies this object type. For a NucleicAcid object, this will be the string nucleic_acid

4.3.4 Functions of NucleicAcid Objects

The functions (see page 80) of a NucleicAcid object are:

pdb_atoms()
Returns a list of string objects containing the ATOM records of the nucleic acid.
save_pdb(filename, modified=1)
Saves the nucleic acid as a file in pdb format.
Arguments:
• filename (string): Filename to be used for the output pdb file.
• modified (integer): By default, the output file will include any changes (geometrical transformations, changed or added attributes) that may have been made to the object by Reliscript. To write out the original, unmodified object, set modified=0.

transform(object, on_original=0)
Applies the same (rotation + translation) geometrical transformation to the NucleicAcid object as has already been applied to the object passed in as an argument.
Arguments:
• object (Reliscript data object): A data object (e.g. a PDB, Chain, NucleicAcid, Ligand, Solvent, or Residue object) that has been subjected to a geometrical transformation.
• on_original (integer): By default, the transformation will be applied to the NucleicAcid object in its current orientation, which may already be the result of a previous transformation. To transform the original, untransformed atomic positions, set the on_original flag to a nonzero value.

clear_transform()
Clears any geometrical transformation that has been applied to the NucleicAcid object so that all atom coordinates return to their original values.

4.3.5 Looping around the Contents of a NucleicAcid Object

Like Chain objects, NucleicAcid objects can simulate certain list and dictionary operations (see Section 4.2.5, page 30). Looping around the contents of a NucleicAcid object will produce Residue objects. For an example see the script looping_around_nucleic_acids.py.

4.4 Ligand Objects

A Ligand object holds information about a protein-bound ligand, e.g. compound name, molecular weight. It also allows access to the binding site to which it is bound, other nearby chains generated by crystallographic symmetry, the PDB object to which it belongs, etc. Each ligand is divided up into a number of units. Often there will be only one unit, but some ligands - for example small peptides - are divided into multiple units. For consistency with protein chains, each unit of a ligand is stored as a Residue object.

4.4.1 Creation of Ligand Objects

Ligand objects can be created either by accessing a member of a ligand container object, e.g.

    # Create set containing all ligands in database
    ligand_set = reliscript.set('ligand')
    # Get second member
    ligand_obj1 = ligand_set[1]

or from a PDB object, e.g.

    pdb_object = reliscript.create('1qs4')
    # Get first ligand in PDB entry 1qs4
    ligand_obj0 = pdb_object.ligands[0]

These two steps may be combined in a single line, e.g.

    ligand_obj0 = reliscript.create('1qs4').ligands[0]

4.4.2 Textual Representation of Ligand Objects

Print operations, etc., on Ligand objects (e.g. print lig_object) will produce output such as:

    Ligand<pdb:1a01:APA_301-A>

Colons separate the contents of the angle brackets into components. The first component is the database identifier (see Section 3.6, page 17), followed by the PDB code; the final component is the internal Relibase+ ligand identifier, which is based on the nomenclature of the ligands in the original PDB file. If the Ligand object is from an in-house database (i.e. not derived from the main PDB) the output will be, e.g.:

    Ligand<dbid:1ax1:ALA_ARG_VAL_50>

where dbid is the identifier of the in-house database.

4.4.3 Attributes of Ligand Objects

The attributes (see page 77) of a Ligand object are:

Name | Type | Description
adjacent_chains | list of Chains | List of all chains that have at least one residue in the ligand’s BindingSite. The list is ordered by the size of the chain/ligand interaction, i.e. when there are two or more chains in the list, the first will have more of its residues involved in the ligand binding site than will the second.
adjacent_nucleic_acids | list of NucleicAcids | List of all nucleic acids that have at least one residue in the ligand’s BindingSite. The list is ordered by the size of the nucleic acid/ligand interaction, i.e. when there are two or more nucleic acids in the list, the first will have more of its residues involved in the ligand binding site than will the second.
atoms | list of Atoms | The Atom objects in the ligand.
binding_site | BindingSite | The BindingSite object associated with the ligand.
bonds | list of Bonds | The Bond objects in the ligand.
cofactor | Boolean | Returns 1 if the ligand is a cofactor or comprises cofactor building blocks. The method is based on the ligand full_name attribute and checks for the following building blocks: ADP, AMP, ATP, B12, BTN, COA, FAD, FMN, FS3, FS4, HEM, NAD, NAP, PLP, TPP
compound_name | string | The compound name of the ligand as given in the PDB file.
covalently_bound | Boolean | Returns 1 if the ligand is covalently bound to the protein, otherwise 0.
full_name | string | A list of the ligand building blocks, as defined in the PDB file, which can be one or more than one, e.g. MQI or NAS-GLY-PAP-PIP
mol_wt | float | Molecular weight.
n_atom | integer | Number of atoms in the ligand.
n_unit | integer | Number of units (i.e. residues) in the ligand; equivalent to len(ligand).
pack_binding_site | PackBindingSite | The PackBindingSite associated with the ligand (i.e. nearby atoms in the crystal-packing environment).
pdb | PDB | The PDB object containing the ligand.
peptide | Boolean | Returns 1 if the ligand contains at least one natural amino acid building block. The method checks the ligand full_name attribute, e.g. NAS-GLY-PAP-PIP returns 1.
pure_peptide | Boolean | Returns 1 if the ligand contains only natural amino acid building blocks. The method checks the ligand full_name attribute, e.g. GLY-ARG-PHE returns 1.
residues | list of Residues | A list of the Residue objects in the ligand. The term residue is used for consistency with Chain objects. In reality, they are simply the component sections of the ligand, which may or may not be peptide units.
sugar | Boolean | Returns 1 if the ligand comprises carbohydrate building blocks. The method is based on the ligand full_name attribute and checks for the following building blocks: ARA, ARB, FUC, GAL, GLU, MAN
type | string | String that identifies this object type. For a Ligand object, this will be the string ligand

4.4.4 Functions of Ligand Objects

The functions (see page 80) of a Ligand object are:

pdb_atoms()
Returns a list of string objects containing the ATOM records of the ligand.

save_pdb(filename, modified=1)
Saves the ligand as a file in pdb format.
Arguments:
• filename (string): Filename to be used for the output pdb file.
• modified (integer): By default, the output file will include any changes (geometrical transformations, changed or added attributes) that may have been made to the object by Reliscript. To write out the original, unmodified object, set modified=0.

save_mol2(filename, modified=1)
Saves the ligand as a file in mol2 format (Tripos Inc., St Louis, USA).
Arguments:
• filename (string): Filename to be used for the output mol2 file.
• modified (integer): By default, the output file will include any changes (geometrical transformations, changed or added attributes) that may have been made to the object by Reliscript. To write out the original, unmodified object, set modified=0.

transform(object, on_original=0)
Applies the same (rotation + translation) geometrical transformation to the Ligand object as has already been applied to the object passed in as an argument.
Arguments:
• object (Reliscript data object): A data object (e.g. a PDB, Chain, NucleicAcid, Ligand, Solvent, or Residue object) that has been subjected to a geometrical transformation. For example, this might be a Chain object in the ligand’s binding site that has been rotated and translated to superimpose it on another, similar chain.
• on_original (integer): By default, the transformation will be applied to the Ligand object in its current orientation, which may already be the result of a previous transformation. To transform the original, untransformed atomic positions, set the on_original flag to a nonzero value.

clear_transform()
Clears any geometrical transformation that has been applied to the Ligand object so that all atom coordinates return to their original values.

4.4.5 Looping around the Contents of a Ligand Object

Like Chain objects, Ligand objects can simulate certain list and dictionary operations (see Section 4.2.5, page 30). Looping around the contents of a Ligand object will produce Residue objects. In many cases there will only be one residue object in the whole ligand, but some ligands (particularly peptides) contain several. For an example see the script looping_around_ligands.py.

4.5 Solvent Objects

A Solvent object holds information about the water molecules in a protein structure. It allows access to the water Atom objects and to the parent PDB object. There is one Solvent object for each PDB object. Each water molecule within a Solvent object is treated as a separate “residue”. While slightly artificial, this helps maintain consistency with Chain and Ligand objects.

4.5.1 Creation of Solvent Objects

Solvent objects can be created either by accessing a member of a solvent container object, e.g.

    # Create set containing all Solvent objects in database
    solvent_set = reliscript.set('solvent')
    # Get last member
    solvent_obj_last = solvent_set[-1]

or from a PDB object, e.g.

    pdb_object = reliscript.create('1qs4')
    # Get first (and only!) Solvent object for PDB entry 1qs4
    solvent_obj0 = pdb_object.solvent[0]

These two methods may be combined in a single line, e.g.

    solvent_obj0 = reliscript.create('1qs4').solvent[0]

4.5.2 Textual Representation of Solvent Objects

Print operations, etc., on Solvent objects (e.g. print solv_object) will produce output such as:

    Solvent<pdb:1a01:SOLV>

Colons separate the contents of the angle brackets into components. The first component is the database identifier (see Section 3.6, page 17), followed by the PDB code; there will only be one Solvent object per PDB entry, so the final component will always be SOLV. If the Solvent object is from an in-house database (i.e. not derived from the main PDB) the output will be, e.g.:

    Solvent<dbid:1ax1:SOLV>

where dbid is the identifier of the in-house database.

4.5.3 Attributes of Solvent Objects

The attributes (see page 77) of a Solvent object are:

Name | Type | Description
atoms | list of Atoms | The solvent Atom objects (in effect, the water oxygens in the structure, if no hydrogen-atom coordinates are present).
bonds | list of Bonds | The solvent Bond objects (will usually be an empty list, as solvent Atom objects will normally be disconnected water oxygens).
n_atom | integer | Number of solvent atoms.
n_unit | integer | Number of units (i.e. residues) in solvent; equivalent to len(solvent). As each water is treated as a separate residue, and water hydrogen atoms are usually missing, the value of this attribute is generally identical to that of n_atom.
pdb | PDB | The PDB object containing the solvent.
residues | list of Residues | The solvent Residue objects. The term residue is used for consistency with Chain objects. In practice, each water molecule is treated as a separate residue.
type | string | String that identifies this object type. For a Solvent object, this will be the string solvent

4.5.4 Functions of Solvent Objects

The functions (see page 80) of a Solvent object are:

pdb_atoms()
Returns a list of string objects containing the ATOM records of the Solvent object (i.e. all solvent atoms in the PDB entry).

save_pdb(filename, modified=1)
Saves the Solvent object as a file in pdb format.
Arguments:
• filename (string): Filename to be used for the output pdb file.
• modified (integer): By default, the output file will include any changes (geometrical transformations, changed or added attributes) that may have been made to the object by Reliscript. To write out the original, unmodified object, set modified=0.

transform(object, on_original=0)
Applies the same (rotation + translation) geometrical transformation to the Solvent object as has already been applied to the object passed in as an argument.
Arguments:
• object (Reliscript data object): A data object (e.g. a PDB, Chain, NucleicAcid, Ligand, Solvent, or Residue object) that has been subjected to a geometrical transformation.
• on_original (integer): By default, the transformation will be applied to the Solvent object in its current orientation, which may already be the result of a previous transformation. To transform the original, untransformed atomic positions, set the on_original flag to a nonzero value.

clear_transform()
Clears any geometrical transformation that has been applied to the Solvent object so that all atom coordinates return to their original values.

4.5.5 Looping around the Contents of a Solvent Object

To maintain consistency with Chain and Ligand objects, there is an intermediate Residue object that is produced either when the residues attribute is retrieved or if the object is accessed via the supported list functions. Thus, looping around the contents of a Solvent object will produce Residue objects, although each Residue object will usually contain just one water-oxygen atom. The script looping_around_solvent_objects.py prints the coordinates of the oxygen atom of all solvent residues.

4.6 Residue Objects

A Residue object holds information about a unit of a Chain, NucleicAcid, Ligand or Solvent object. It allows access to the Atom objects it contains and the parent object to which it belongs. The term residue is really applicable only to chains and some ligands (i.e. peptides and related compounds) but is used throughout so that scripts can be written which will work in the same way, regardless of whether the parent object is a Chain, NucleicAcid, Ligand or Solvent.

4.6.1 Creation of Residue Objects

Residue objects are created by requesting them from (or looping around the contents of) Chain, NucleicAcid, Ligand or Solvent objects, or their packed equivalents (see Section 4.10.5, page 54), e.g.

    # Get first chain in PDB entry 1mmb
    pdb_obj = reliscript.create('1mmb')
    chain_obj = pdb_obj.chains[0]
    # Get first residue in chain
    res1 = chain_obj[0]
    # This can be done with one command:
    res2 = reliscript.create('1mmb').chains[0][0]
    # Now get first residue in a pack binding site chain
    ligand_obj = reliscript.create('1qs4').ligands[0]
    pack_bs_obj = ligand_obj.pack_binding_site
    res3 = pack_bs_obj.chains[0][0]

4.6.2 Textual Representation of Residue Objects

Print operations, etc., on Residue objects (e.g. print res_object) will produce output such as:

    Residue<pdb:1a01:A:'16B'>
    Residue<pdb:1a01:100_1004:'1004'>
    Residue<pdb:1a01:SOLV:'500'>

Colons separate the contents of the angle brackets into components. The first component is the database identifier (see Section 3.6, page 17), followed by the PDB code; the third component is the identifier of the object that contains the residue and the final part is the residue identifier (this is the contents of the number field relating to that residue in the original PDB file). If the Residue object is from an in-house database (i.e. not derived from the main PDB) the output will be, e.g.:

    Residue<dbid:1a01:A:'16B'>

where dbid is the identifier of the in-house database.

4.6.3 Attributes of Residue Objects

The attributes (see page 77) of a Residue object are given in the table below.

Name | Type | Description
atoms | list of Atoms | The Atom objects in the residue.
bonds | list of Bonds | The Bond objects in the residue, including any bonds linking this residue to other residues in the same chain or ligand, but excluding disulphide bridges and bonds between proteins and covalently-bound ligands.
chain_id | string | Chain identifier; if no chain identifier is set, this will return the one-character string “-”.
index_no | integer | The index number of the residue in the Chain, NucleicAcid, Ligand or Solvent object to which it belongs. The first residue in a chain would have an index_no of 0, etc.
n_atom | integer | Number of atoms in the residue. This will be the number whose positions were determined experimentally, i.e. given on PDB ATOM records.
n_atom_ideal | integer | For amino acid residues, the number of atoms that the residue should ideally contain. This may be different from n_atom, e.g. if one or more atoms in the residue were not located experimentally.
name | string | For peptidic residues, the amino-acid name, e.g. SER. For solvent residues, returns HOH.
one_letter_code | string | One letter code of the residue; for non-peptide residues, this will return the one-character string “*”.
pdb | PDB | The PDB object containing the residue.
sequence_no | string | The residue sequence identifier, e.g. 10 or 123. For solvents, the Relibase+ internal count.
type | string | String that identifies the type of object that the residue is part of. This will be either amino acid, nucleic acid, ligand or solvent. If the residue is part of a peptidic ligand, the type will be returned as amino acid rather than ligand.

4.6.4 Functions of Residue Objects

The functions (see page 80) of a Residue object are:

pdb_atoms()
Returns a list of string objects containing the ATOM records of the residue.

save_pdb(filename)
Saves the residue as a file in pdb format. If the residue has been modified in any way (e.g. subjected to a geometric transformation), the modified data will be written out, not the original data as retrieved from the Relibase+ database.
42 Reliscript User Guide Arguments: • filename (string): Filename to be used for the output pdb file. transform(object, on_original=0) Applies the same (rotation + translation) geometrical transformation to the Residue object as has already been applied to the object passed in as an argument. All other atoms in the object to which the residue belongs will also have the same transformation applied, e.g. if the residue belongs to a chain, the whole chain will be transformed. Arguments: • object (Reliscript data object): A data object (e.g. a PDB, Chain, NucelicAcid, Ligand, Solvent, or Residue object) that has been subjected to a geometrical transformation. • on_original (integer): By default, the transformation will be applied to the Residue object in its current orientation, which may already be the result of a previous transformation. To transform the original, untransformed atomic positions, set the on_original flag to a nonzero value. clear_transform() Clears any geometrical transformation that has been applied to the Residue object so that all atom coordinates return to their original values. All other atoms in the object to which the residue belongs will also be returned to their original positions, e.g. if the residue is part of a chain, the whole chain will be reset. 4.6.5 Looping around the Contents of a Residue Object Internal functions allow the Residue object to be treated as a list, where the list contains the atoms stored in the residue. Thus, looping around the contents of a Residue object will produce Atom objects, for an example see the script looping_around_residue_objects.py. 4.6.6 Residue Numbering Chain objects provide some internal functions which allow the Residue objects they contain to be referred to by the residue identifier used in the original PDB file, which will normally be the position of the residue in the protein SEQRES sequence, e.g. 
    # Retrieve residue labelled 17
    res = chain_obj['17']

This is the recommended method for accessing particular residues in a protein sequence. Other methods are available for accessing the residues of a chain, but they do not necessarily use biologically meaningful numbering schemes (see Section 4.2.5, page 30).

4.7 Atom Objects

An Atom object holds information about a particular atom in a protein chain, a nucleic acid, a ligand, or a solvent molecule (e.g. positional coordinates). It also allows access to the Residue and PDB objects to which it belongs. Atom and Bond objects are the smallest 3D-structural components of a PDB entry.

4.7.1 Creation of Atom Objects

Atom objects are created by requesting them from Residue objects or Bond objects, e.g.

    atom0 = res_obj[0]
    atom1 = bond_obj[0]

In addition, all data objects containing Residue objects (i.e. PDB, Chain, NucleicAcid, Ligand, Solvent, BindingSite, PackBindingSite, etc.) will produce a list of the atoms they contain if requested, e.g.

    # Get second atom in PDB object pdb_obj
    atom1 = pdb_obj.atoms[1]
    # Get last atom in Chain object chain_obj
    atom2 = chain_obj.atoms[-1]
    # Get first atom in BindingSite object bindingsite_obj
    atom3 = bindingsite_obj.atoms[0]

4.7.2 Textual Representation of Atom Objects

Print operations, etc., on Atom objects (e.g. print atom_object) will produce output such as:

    Atom(N)<pdb:1a01:A:'16B':121>
    Atom(Cl)<pdb:1a01:100_1004:'1004':31>
    Atom(O)<pdb:1a01:SOLV:'500':500>

Colons separate the contents of the angle brackets into components. The first component is the database identifier (see Section 3.6, page 17), followed by the PDB code; the third part is the identifier of the object that contains the atom; the fourth, quoted, part is the sequence identifier of the residue containing the atom; and the final part is the atom number from the original PDB file. If the Atom object is from an in-house database (i.e.
not derived from the main PDB) the output will be, e.g.:

    Atom(N)<dbid:1a01:A:'16B':121>

where dbid is the identifier of the in-house database.

4.7.3 Attributes of Atom Objects

The attributes (see page 77) of an Atom object are:

Name | Type | Description
b_factor | float | Temperature (B) factor.
bonds | list of Bonds | List of all bonds in which this atom is involved, including bonds to atoms in other residues, if any such bonds exist, but excluding bonds that correspond to covalent protein-ligand linkages.
coords | tuple | Tuple of three floating point numbers containing the orthogonal x, y, z coordinates of the atom, e.g. (59.589, 58.943, 86.473).
element_no | integer | The elemental atomic number of the atom.
index_no | integer | The integer index number of the atom within the Chain, NucleicAcid, Ligand or Solvent object of which it is part, i.e. its position in the list of Atom objects produced by the atoms attribute of the Chain, NucleicAcid, Ligand or Solvent object.
name | string | The PDB label of the atom, e.g. N, CA, CB.
pdb_atom_number | integer | Number of the atom in the PDB entry of which it is part.
occupancy | float | Site occupancy.
pdb | PDB | The PDB object containing the atom.
residue | Residue | The Residue object containing the atom.
symbol | string | The element symbol of the atom, e.g. C, N, Cl. This string is included in the textual representation of the atom.
sybyl_type | string | The Sybyl atom type (see page 83) of the atom, e.g. N.2. Returned as UNK if unknown. These types may not be set reliably, especially if the atom has an uncertain protonation state.
x | float | The atomic x Cartesian coordinate.
y | float | The atomic y Cartesian coordinate.
z | float | The atomic z Cartesian coordinate.

4.7.4 Functions of Atom Objects

The functions (see page 80) of an Atom object are:

pdb_line()
Returns (as a string) the PDB ATOM line relating to this atom.
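The string returned by pdb_line() is a fixed-column PDB ATOM record, so several of the attributes listed in Section 4.7.3 can in principle be recovered from it by slicing. The following is a minimal sketch in plain Python and is not part of Reliscript: the helpers format_atom_line and parse_atom_line are hypothetical, the column positions are those of the standard PDB ATOM record, and all sample values other than the coordinates are invented for illustration.

```python
def format_atom_line(serial, name, res_name, chain_id, res_seq,
                     x, y, z, occupancy, b_factor, symbol):
    """Build a fixed-column PDB ATOM record (hypothetical helper)."""
    return ("ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f"
            "          %2s" % (serial, name, res_name, chain_id, res_seq,
                               x, y, z, occupancy, b_factor, symbol))


def parse_atom_line(line):
    """Slice a PDB ATOM record into a dictionary (hypothetical helper)."""
    return {
        "pdb_atom_number": int(line[6:11]),      # columns 7-11
        "name": line[12:16].strip(),             # columns 13-16
        "residue_name": line[17:20].strip(),     # columns 18-20
        "chain_id": line[21],                    # column 22
        "sequence_no": line[22:27].strip(),      # columns 23-27, kept as text
        "coords": (float(line[30:38]),           # x, columns 31-38
                   float(line[38:46]),           # y, columns 39-46
                   float(line[46:54])),          # z, columns 47-54
        "occupancy": float(line[54:60]),         # columns 55-60
        "b_factor": float(line[60:66]),          # columns 61-66
        "symbol": line[76:78].strip(),           # columns 77-78
    }


# Invented sample atom; only the coordinates come from the table above.
line = format_atom_line(121, " N", "ILE", "A", 16,
                        59.589, 58.943, 86.473, 1.00, 20.50, "N")
atom = parse_atom_line(line)
```

In a Reliscript session the same values are available directly as attributes of the Atom object, so parsing of this kind would only be needed when working with the text records outside Reliscript.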
4.8 Bond Objects

A Bond object holds information about the chemical bond between two atoms in a protein chain, nucleic acid, ligand, or solvent molecule. Bond and Atom objects are the smallest 3D-structural components of a PDB entry.

4.8.1 Creation of Bond Objects

Bond objects are created by requesting them from data objects such as Atom, Residue, Ligand, etc., e.g.

    bond1 = atom_obj.bonds[1]
    bond2 = residue_obj.bonds[-1]
    bond3 = ligand_obj.bonds[0]

4.8.2 Textual Representation of Bond Objects

Print operations, etc., on Bond objects (e.g. print bond_object) will produce output such as:

    Bond(SINGLE)<pdb:1qs4:CHN-A:(Atom(C) <'65':75>, Atom(S) <'65':76>)>

Colons separate the contents of the angle brackets into components. The first component is the database identifier (see Section 3.6, page 17), followed by the PDB code; the third part is the identifier of the object that contains the bond and the final part is a tuple identifying the atoms involved in the bond. If the Bond object is from an in-house database (i.e. not derived from the main PDB) the output will be, e.g.:

    Bond(SINGLE)<dbid:1qs4:CHN-A:(Atom(C) <'65':75>, Atom(S) <'65':76>)>

where dbid is the identifier of the in-house database.

4.8.3 Attributes of Bond Objects

The attributes (see page 77) of a Bond object are:

Name | Type | Description
atoms | list containing two Atom objects | List containing the two Atom objects that form the bond.
bond_type | string | Bond type, one of SINGLE, DOUBLE, TRIPLE, AROMATIC, or AMIDE (i.e. amide or peptide).

4.8.4 Functions of Bond Objects

The functions (see page 80) of a Bond object are:

other_atom(atom_object)
Returns (as an Atom object) the other atom involved in the bond, assuming that atom_object itself is involved in the bond. Returns None if atom_object is not involved in the bond.
Arguments:
• atom_object (Atom): An atom.
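The behaviour of other_atom can be illustrated with a small stand-in class. This is a sketch in plain Python, not the Reliscript implementation: the class SketchBond is hypothetical, and short strings are used in place of Atom objects purely for illustration.

```python
class SketchBond(object):
    """Hypothetical stand-in for a bond between two atoms."""

    def __init__(self, first_atom, second_atom, bond_type="SINGLE"):
        self.atoms = [first_atom, second_atom]
        self.bond_type = bond_type

    def other_atom(self, atom):
        """Return the partner of atom, or None if atom is not in the bond."""
        if atom == self.atoms[0]:
            return self.atoms[1]
        if atom == self.atoms[1]:
            return self.atoms[0]
        return None


bond = SketchBond("C75", "S76")
partner = bond.other_atom("C75")   # the atom at the other end of the bond
missing = bond.other_atom("N1")    # "N1" is not part of this bond
```

Because the result is None for an atom that is not in the bond, a script walking along a chain of bonds should check the return value before using it.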
4.8.5 Looping around the Contents of a Bond Object

Internal functions allow the Bond object to behave like a Python list containing the two atoms involved in the bond, i.e.

    first_atom = bond_obj[0]
    second_atom = bond_obj[1]

Looping around the contents of a Bond therefore produces Atom objects; see the script looping_around_bond_objects.py. The contents of a Bond object can also be tested for membership, e.g.

    if atom_obj in bond_obj:
        print('atom is connected by bond')

4.9 BindingSite Objects

A BindingSite object holds information about the atoms surrounding a bound ligand. These atoms may belong to protein chains, nucleic acids, solvent molecules or other ligands. There is exactly one BindingSite object for each Ligand object, i.e. each BindingSite is defined with respect to a particular Ligand object.

A BindingSite object is similar to a PDB object in that it has the attributes chains, nucleic_acids, solvent and ligands. In the PDB object, these refer to the contents of the complete protein; in a BindingSite object, they refer to:
• chains: All protein chain residues that have at least one atom within 7Å of the ligand defining the BindingSite object.
• nucleic_acids: All nucleic acid residues that have at least one atom within 7Å of the ligand defining the BindingSite object.
• solvent: All solvent atoms within 7Å of the ligand defining the BindingSite object.
• ligands: All other ligands that have at least one atom within 7Å of the ligand defining the BindingSite object.

A BindingSite object does not contain any atoms generated by crystallographic symmetry; these can be accessed via a PackBindingSite object (see Section 4.10, page 51).

4.9.1 Creation of BindingSite Objects

A BindingSite object can only be created from the associated Ligand object or the corresponding PackBindingSite object, e.g.
    bindingsite1 = ligand_obj.binding_site
    bindingsite2 = pack_bindingsite_obj.binding_site

A list of BindingSite objects is also available from the PDB object, e.g.

    # Get third binding site in PDB object pdb_obj
    bindingsite3 = pdb_obj.binding_sites[2]

4.9.2 Textual Representation of BindingSite Objects

Print operations, etc., on BindingSite objects (e.g. print bindingsite_object) will produce output such as:

    BindingSite<pdb:1a01:APA_301-A>

Colons separate the contents of the angle brackets into components. The first component is the database identifier (see Section 3.6, page 17), followed by the PDB code; the final component is the internal Relibase+ identifier of the bound ligand (see Section 4.4.2, page 34). If the BindingSite object is from an in-house database (i.e. not derived from the main PDB) the output will be, e.g.:

    BindingSite<dbid:1a01:APA_301-A>

where dbid is the identifier of the in-house database.

4.9.3 Attributes of BindingSite Objects

The attributes (see page 77) of a BindingSite object are:

Name | Type | Description
atoms | list of Atoms | The Atom objects in the binding site, not including the atoms of the bound ligand.
bonds | list of Bonds | The Bond objects in the binding site. In the case of chains, this includes only those bonds between atoms in the binding site, i.e. it does not include bonds from residues in the binding site to residues outside the binding site.
bound_ligand | Ligand | The Ligand object of which this is the binding site.
chains | list of BindingSiteChains | The BindingSiteChain objects in the binding site. Each of these objects will contain only those Residue objects that have at least one atom within 7Å of the bound ligand.
nucleic_acids | list of BindingSiteNucleicAcids | The BindingSiteNucleicAcid objects in the binding site.
ligands | list of Ligands | A list of other Ligand objects (excluding the bound ligand) that are contained in the binding site.
pack_binding_site | PackBindingSite | The associated PackBindingSite object (i.e. nearby atoms in the crystallographic environment).
pdb | PDB | The PDB object containing this binding site.
solvent | list containing one BindingSiteSolvent object | List containing one BindingSiteSolvent object. This single object contains the solvent atoms in the binding site.

4.9.4 Functions of BindingSite Objects

The functions (see page 80) of a BindingSite object are:

pdb_atoms()
Returns a list of string objects containing the ATOM records of the binding site.

save_pdb(filename)
Saves the binding site as a file in pdb format. If the binding site has been modified in any way (e.g. subjected to a geometric transformation), the modified data will be written out, not the original data as retrieved from the Relibase+ database.
Arguments:
• filename (string): Filename to be used for the output pdb file.

save_mol2(filename, radius=7.0, modified=1, include_pack=1)
Saves the binding site as a file in mol2 format (Tripos Inc., St Louis, USA). Will not work if the binding site has been subjected to any geometrical transformation, e.g. as a result of chain superimposition (save_pdb can be used instead).
Arguments:
• filename (string): Filename to be used for the output mol2 file.
• radius (float): Distance criterion which determines how much of the binding site will be written out. Default is 7Å, i.e. all residues that have at least one atom within 7Å of at least one atom in the ligand will be included.
• modified (integer): By default, the output file will include any changes (geometrical transformations, changed or added attributes) that may have been made to the object by Reliscript. To write out the original, unmodified object, set modified=0.
• include_pack (integer): By default, the output file will include the associated PackBindingSite object. If and only if modified=1, the PackBindingSite data can be excluded by setting include_pack=0.
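The radius criterion used by save_mol2 can be pictured as a simple distance test. The sketch below is plain Python and makes no claim about how Reliscript performs the selection internally: the functions distance and residues_within, the dictionary layout, the residue labels and all coordinates are hypothetical, and the treatment of atoms lying exactly on the cut-off is an assumption.

```python
import math


def distance(a, b):
    """Euclidean distance between two (x, y, z) tuples, in angstroms."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def residues_within(residues, ligand_coords, radius=7.0):
    """Return labels of residues that have at least one atom within
    radius of at least one ligand atom.

    residues maps a residue label to a list of (x, y, z) atom coordinates.
    """
    selected = []
    for label, atom_coords in residues.items():
        if any(distance(atom, lig) <= radius
               for atom in atom_coords
               for lig in ligand_coords):
            selected.append(label)
    return sorted(selected)


# Invented coordinates for illustration only.
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
residues = {
    "SER 195": [(3.0, 4.0, 0.0), (12.0, 0.0, 0.0)],  # one atom 5.0 away
    "GLY 40": [(20.0, 0.0, 0.0)],                    # far from the ligand
}

default_site = residues_within(residues, ligand)        # 7.0 cut-off
tight_site = residues_within(residues, ligand, 4.0)     # smaller cut-off
```

Note that a residue is selected as a whole as soon as any one of its atoms satisfies the test, which is why SER 195 is kept at the default radius even though its second atom lies well outside it.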
transform(object, on_original=0)
Applies the same (rotation + translation) geometrical transformation to the BindingSite object as has already been applied to the object passed in as an argument.
Arguments:
• object (Reliscript data object): A data object (e.g. a PDB, Chain, NucleicAcid, Ligand, Solvent, or Residue object) that has been subjected to a geometrical transformation.
• on_original (integer): By default, the transformation will be applied to the BindingSite object in its current orientation, which may already be the result of a previous transformation. To transform the original, untransformed atomic positions, set the on_original flag to a nonzero value.

clear_transform()
Clears any geometrical transformation that has been applied to the BindingSite object so that all atom coordinates return to their original values.

4.9.5 BindingSiteChain, BindingSiteNucleicAcid and BindingSiteSolvent Objects

These objects (see Section 4.9.3, page 49) have the same representation, attributes and functions as their non-binding-site equivalents with the following exceptions:
• They will only return information on the atoms, bonds and residues that are in the binding site (e.g. the atoms attribute of a BindingSiteChain object will not include Atom objects that belong to residues in the chain that lie outside the binding site).
• BindingSiteChain objects have no sequence or sequence_3d attributes.
• BindingSiteNucleicAcid objects have no sequence_3d attributes.
• The textual representations of BindingSiteChain, BindingSiteNucleicAcid and BindingSiteSolvent objects are slightly different from those of their non-binding-site equivalents, e.g. the name component is BindingSiteChain rather than Chain and BindingSiteSolvent rather than Solvent.

4.10 PackBindingSite Objects

Like the BindingSite object (see Section 4.9, page 48), a PackBindingSite object contains information on a ligand’s surroundings.
The difference between the two is that a PackBindingSite holds data about protein chains, nucleic acids, ligands and solvent molecules that are within range of a protein-bound ligand because of crystallographic packing; in other words, atoms that are generated by crystallographic symmetry. There is exactly one PackBindingSite object for each Ligand object (and, therefore, for each BindingSite object), i.e. each PackBindingSite is defined with respect to a particular Ligand object.

A PackBindingSite object is similar to a PDB object in that it has the attributes chains, nucleic_acids, solvent and ligands. In the PDB object, these refer to the contents of the complete protein; in a PackBindingSite object, they refer to:
• chains: All protein chain residues generated by crystallographic symmetry that have at least one atom within 7Å of the ligand defining the PackBindingSite object.
• nucleic_acids: All nucleic acid residues generated by crystallographic symmetry that have at least one atom within 7Å of the ligand defining the PackBindingSite object.
• solvent: All solvent atoms generated by crystallographic symmetry that are within 7Å of the ligand defining the PackBindingSite object.
• ligands: All other ligands generated by crystallographic symmetry that have at least one atom within 7Å of the ligand defining the PackBindingSite object.

A PackBindingSite object only contains atoms generated by crystallographic symmetry; for the primary binding site of a ligand, use the BindingSite object (see Section 4.9, page 48).

4.10.1 Creation of PackBindingSite Objects

A PackBindingSite object can only be created from the associated Ligand object or the corresponding BindingSite object, e.g.

    pack_bs_obj1 = ligand_obj.pack_binding_site
    pack_bs_obj2 = bindingsite_obj.pack_binding_site

A list of PackBindingSite objects is also available from the PDB object, e.g.
    # Get first pack binding site in PDB object pdb_obj
    pack_bs_obj3 = pdb_obj.pack_binding_sites[0]

4.10.2 Textual Representation of PackBindingSite Objects

Print operations, etc., on PackBindingSite objects (e.g. print pbsite_object) will produce output such as:

    PackBindingSite<pdb:1a01:APA_301-A>

Colons separate the contents of the angle brackets into components. The first component is the database identifier (see Section 3.6, page 17), followed by the PDB code; the final component is the internal Relibase+ identifier of the associated ligand (see Section 4.4.2, page 34). If the PackBindingSite object is from an in-house database (i.e. not derived from the main PDB) the output will be, e.g.:

    PackBindingSite<dbid:1a01:APA_301-A>

where dbid is the identifier of the in-house database.

4.10.3 Attributes of PackBindingSite Objects

The attributes (see page 77) of a PackBindingSite object are:

Name | Type | Description
atoms | list of Atoms | The Atom objects in the PackBindingSite, not including the atoms of the ligand used to define the PackBindingSite.
binding_site | BindingSite | The associated BindingSite object.
bonds | list of Bonds | The Bond objects in the PackBindingSite. This includes only those bonds between atoms in the PackBindingSite, i.e. it does not include bonds from residues in the PackBindingSite to residues outside the PackBindingSite.
bound_ligand | Ligand | The Ligand object used to define the PackBindingSite.
chains | list of Chains | The Chain objects in the PackBindingSite. Each Chain object will contain only those residues that have at least one atom within 7Å of the bound ligand.
nucleic_acids | list of NucleicAcids | The NucleicAcid objects in the PackBindingSite. Each NucleicAcid object will contain only those residues that have at least one atom within 7Å of the bound ligand.
ligands | list of Ligands | The Ligand objects that are contained in the PackBindingSite (these will not include the ligand used to define the PackBindingSite).
pdb | PDB | The PDB object associated with this PackBindingSite.
solvent | list containing one Solvent object | List containing one Solvent object. This single object contains the solvent atoms in the PackBindingSite.

4.10.4 Functions of PackBindingSite Objects

The functions (see page 80) of a PackBindingSite object are:

pdb_atoms()
Returns a list of string objects containing the ATOM records of the PackBindingSite object.

save_pdb(filename)
Saves the PackBindingSite object as a file in pdb format. If the PackBindingSite has been modified in any way (e.g. subjected to a geometric transformation), the modified data will be written out, not the original data as retrieved from the Relibase+ database.
Arguments:
• filename (string): Filename to be used for the output pdb file.

transform(object, on_original=0)
Applies the same (rotation + translation) geometrical transformation to the PackBindingSite object as has already been applied to the object passed in as an argument.
Arguments:
• object (Reliscript data object): A data object (e.g. a PDB, Chain, NucleicAcid, Ligand, Solvent, or Residue object) that has been subjected to a geometrical transformation.
• on_original (integer): By default, the transformation will be applied to the PackBindingSite object in its current orientation, which may already be the result of a previous transformation. To transform the original, untransformed atomic positions, set the on_original flag to a nonzero value.

clear_transform()
Clears any geometrical transformation that has been applied to the PackBindingSite object so that all atom coordinates return to their original values.

4.10.5 Chain, NucleicAcid, Ligand and Solvent Objects derived from PackBindingSite Objects

Chain, NucleicAcid, Ligand and Solvent objects derived from PackBindingSite objects have some exceptional features:
• They will only return information on the atoms, bonds and residues that are in the PackBindingSite (e.g.
the atoms attribute of a Chain object will not include Atom objects that belong to residues in the chain that lie outside the PackBindingSite).
• Chain objects do not have sequence or sequence_3d attributes.
• NucleicAcid objects do not have sequence_3d attributes.
• Ligand objects do not have a save_mol2 function.

5 Storing and Manipulating Collections of Objects: Container Objects

Container objects are used to store and manipulate collections of data objects, allowing access to individual members of the collection in a simple and consistent manner (see Section 3.3.2, page 14). Python itself provides several types of container object, some of which are used in Reliscript. In addition, Reliscript has one customised container object, the set, which offers features such as converting one type of data object to another (e.g. a set of PDB objects to a set of Ligand objects).

5.1 Using Standard Python Containers to Hold Data Objects

Reliscript data objects can be stored in standard Python container objects (see Section 3.3.2, page 14) such as lists (see page 80) and dictionaries (see page 78). This is illustrated in the script container_example2.py.

5.2 Set Objects

Set objects are similar to Relibase+ hitlists except that it is possible to maintain a given order in a set (so, for example, it is possible to sort the members of a set into a particular order depending on the value of a particular attribute). No Reliscript data object can appear more than once in a set.

5.2.1 Types of Sets

A set will have a specific type depending on the data objects it contains. The five possible types are 'pdb', 'chain', 'nucleic_acid', 'ligand' and 'solvent', referring to sets which contain, respectively, PDB, Chain, NucleicAcid, Ligand or Solvent objects (see Section 4, page 23). Set types can be specified when a set is created (see Section 5.2.2, page 55).
Although Chain and Solvent objects can be contained in Reliscript sets (and these sets can be stored persistently in files), they cannot currently be stored in Relibase+ hitlists.

5.2.2 Creating Sets

To create a set containing all objects of a given type (see Section 5.2.1, page 55) in the Relibase+ database(s), enter a command such as:

    # Create set containing all PDB objects in database
    pdb_set = reliscript.set('pdb')

To create an empty set of a specific type, enter a command such as:

    empty_ligand_set = reliscript.set('ligand', [])

To construct a set based on a hitlist that has been created in a web-based Relibase+ session and stored in a user’s Relibase+ workspace (see Section 3.7.3, page 18), enter commands such as:

    reliscript.use_workspace('myname')
    hitlist_ligand_set = reliscript.set('ligand', 'my_hitlistname')

If the type specified for the set (e.g. 'ligand' above) is not the same as the type of the hitlist, an automatic conversion will occur. For example, if my_hitlistname in the example above is a PDB hitlist, the resulting Reliscript set will contain all the Ligand objects in the PDB entries contained in the hitlist. By default, Reliscript will use the login username of the user for the workspace identifier.

In the above example, if no hitlist called my_hitlistname is found in the relevant workspace, Reliscript will assume that the name refers to a file (hitlists may be saved to file as well as stored in Relibase+ workspaces). In this case, if the name does not include a file extension, the extension .rbs will be added.

The conversion of one type of set to another always leads to the creation of a new set (see Section 5.2.5, page 56).

5.2.3 Copying Sets

Sets have a copy function that produces a full copy of the set, e.g.

    set1 = set2.copy()

5.2.4 Saving Sets

PDB and ligand sets (but not chain or solvent sets) can be saved as Relibase+ hitlists (which are then accessible in the Relibase+ graphical user interface), e.g.
    hitlist_ligand_set.save_to_hitlist('hitlistname')

These may then be read into Relibase+ sessions, e.g. for 3D viewing. Any type of set can be saved as a file, e.g.

    ligand_set.save('/home/user/myligset')

The extension .rbs will be added if no file extension is specified, e.g. the above command will save the set into the file myligset.rbs in the /home/user directory.

5.2.5 Converting One Type of Set to Another

Sets of all types (pdb, chain, nucleic_acid, ligand, solvent) can be interconverted. Conversion of one type of set into another may change the number of data objects, since a given PDB object may contain different numbers of chains and ligands. Set type conversions can take place on request or automatically. For example, if the set pdb_set contains PDB objects and we want to produce a set called lig_set containing all the ligands for the entries in pdb_set, we would use the command:

    lig_set = reliscript.set('ligand', pdb_set)

Alternatively, if we have a set of PDB objects, pdb_set, and a set of Ligand objects, lig_set, and we require a new set containing all PDB objects in pdb_set that do not have ligands in lig_set, we would use:

    pdb_set2 = pdb_set - lig_set

(see Section 5.2.6, page 57). When this command is executed, an automatic conversion of lig_set to a temporary set containing the corresponding PDB objects will occur, so that the subtraction can then take place between sets of the same type.

5.2.6 Logical Operations on Sets

It is possible to apply logical operators to sets. The & operator performs a logical AND of two sets, i.e. the resulting set will contain only those objects that appear in both sets. For example:

    pdb_set3 = pdb_set1 & pdb_set2
    # pdb_set3 contains only those PDB objects that occur in both
    # pdb_set1 and pdb_set2

The | operator performs a logical OR of two sets, i.e. the resulting set will contain all objects that occur in either (or both) of the sets involved.
This is equivalent to the use of the addition operator. For example:

    lig_set2 = lig_set1 | pdb_set1
    # lig_set2 contains all the Ligand objects in lig_set1 and all the
    # Ligand objects contained in the PDB objects in pdb_set1

The ^ operator performs a logical XOR (exclusive OR) of two sets, i.e. the resulting set will contain only those objects that appear in one of the sets but not the other. For example:

    chain_set3 = chain_set1 ^ chain_set2
    # chain_set3 contains only those Chain objects in chain_set1 that
    # are not in chain_set2, plus those in chain_set2 that are not in
    # chain_set1

The other operators that are allowable between sets are the addition (+) and subtraction (-) operators. Addition of sets produces identical results to the | (i.e. OR) operator. Subtraction of sets produces a set containing only those objects in the first set that do not occur in the second set (i.e. NOT operation).

The precedence of operators in Python (governing the order in which they are executed in an expression without brackets) is: + and - are done before &, which is done before ^, which is done before |. The safest rule is to use brackets to ensure that operations will be done in the order you expect, e.g.

    # Operator in brackets will be executed before operator
    # outside brackets:
    set4 = set1 + (set2 ^ set3)

In all operations, the first set has priority. This means that:
• The resulting set will always be of the same type as the first set, e.g.

    new_set = ligand_set & chain_set
    # new_set is a ligand set, not a chain set

• When the operation is such that the resulting set could contain objects from both of the original sets, the order of objects in the resulting set will be: objects from first set followed by objects from second set. This means that, if either of the original sets was ordered according to some attribute, the sort would need to be reapplied to the new set to get the correct overall order.
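The precedence rules can be checked with Python's built-in set type, which supports the -, &, ^ and | operators. This is only an illustration of the Python operators: a built-in set is unordered, does not support +, and performs none of the type conversion that Reliscript sets do. The PDB codes below are made up.

```python
set1 = {"1a01", "1qs4", "2b03"}
set2 = {"1qs4", "2b03", "3c04"}
set3 = {"2b03", "3c04", "4d05"}

# - binds more tightly than &, so this is evaluated as (set1 - set2) & set3
no_brackets = set1 - set2 & set3

# Brackets force the intersection to be taken first
with_brackets = set1 - (set2 & set3)

# & binds more tightly than ^, which binds more tightly than |
mixed = set1 | set2 ^ set3 & set1
explicit = set1 | (set2 ^ (set3 & set1))
```

Here no_brackets is empty, whereas with_brackets contains 1a01 and 1qs4, which shows how easily an unbracketed expression can give an unintended result.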
5.2.7 Indexing, Accessing and Deleting Members of a Set

Set objects mimic Python lists for the purposes of accessing members of the set. The script set_example.py illustrates the implemented list commands. In addition, objects in PDB sets can be indexed by four-letter PDB codes, e.g.

    pdb_obj = pdb_set['1qs4']

5.2.8 Sorting Members of a Set into a Particular Order

The initial order of entries in a set is alphabetical, based on the complete Relibase+ identifier for the stored object. Examples of these identifiers are:

    PDB1A0L (PDB object)
    100_1004_PDB1QS4_1 (Ligand)
    PDB1A0L-A_1 (Chain)
    SOLV_PDB1A0L_1 (Solvent)

If one type of set is converted to another type, the order of objects in the original set will be preserved. Because of the different locations of the PDB code within the identifier, this may mean that the resulting set is not ordered alphabetically even if the original set was.

To reverse the order of objects in a set, use a command such as:

    pdb_set.reverse()

To sort a set on the object identifiers, use a command such as:

    pdb_set.sort()

To sort on an attribute, pass the attribute name as a string to the sort function (the order is that for a normal Python sort, i.e. lowest first), e.g.

    # Sort the PDB objects by crystallographic resolution
    pdb_set.sort('resolution')

For a more complex sort, it is possible to pass a function into the sort function, e.g.

    pdb_set.sort(my_function)

my_function must take two arguments, each being an object of the type stored in the set (e.g. PDB objects for a set of type 'pdb'), and must return -1, 0 or 1 depending on whether the first argument is considered smaller than, equal to, or larger than the second argument.

Note that sorting an entire set may be very time consuming. It is therefore recommended that sorting is performed as a last step after any filtering has taken place.
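A comparison function of the kind described above might look as follows. This sketch runs in plain Python on stand-in records rather than on Reliscript objects: the Entry tuple, its field values and the function compare_entries are hypothetical, and functools.cmp_to_key is used only so that the same two-argument function can be handed to Python's built-in sorted.

```python
from collections import namedtuple
from functools import cmp_to_key

# Hypothetical stand-in for a PDB object with two sortable attributes
Entry = namedtuple("Entry", ["code", "resolution", "year"])


def compare_entries(first, second):
    """Order by resolution, then by year; return -1, 0 or 1."""
    key_first = (first.resolution, first.year)
    key_second = (second.resolution, second.year)
    if key_first < key_second:
        return -1
    if key_first > key_second:
        return 1
    return 0


entries = [
    Entry("1abc", 2.5, 1999),
    Entry("2def", 1.8, 2001),
    Entry("3ghi", 1.8, 1997),
]

# Lowest resolution value first; ties are broken by the earlier year
ordered = sorted(entries, key=cmp_to_key(compare_entries))
codes = [entry.code for entry in ordered]
```

Because a comparison function is called many times during a sort, a sort of this kind is best left until the set has been reduced by filtering.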
For example, if we wanted to sort urokinase entries by year, it would be inefficient to first sort the whole pdb set by year and then search for urokinase entries. The preferred style would filter out the urokinase entries first and then sort by year:

    # Create the set
    pdb_set = rs.set('pdb')
    # Filter
    search = rs.text_search(field='title', searchstring='UROKINASE')
    search(pdb_set)
    # Sort after filtering
    pdb_set.sort('year')

5.2.9 Appending Objects to a Set

It is possible to append objects to a set using the append function; the extend function can be used in exactly the same way. The item to be appended can be defined as an object or an identifier string; other sets, lists and tuples of objects can also be appended. For example, the following code uses identifier strings to create a set containing three PDB objects:

    # Initialise a list of 3 PDB identifiers
    my_list = ['1ab2', '2b03', '3c04']
    # Create an empty set
    new_pdb_set = reliscript.set('pdb', [])
    # Append each PDB entry to set
    new_pdb_set.append(my_list)

The last line could also have been written using a for loop, appending the individual PDB identifiers to new_pdb_set one at a time; see the script set_example2.py. The script set_example3.py shows the use of the append command to divide the objects in one set into two new sets based on some criterion, in this case the date.

Before an object is appended to a set, it is tested to see if it is of the same type as the objects already in the set. If not (e.g. if we try to append a PDB object to a set containing Ligand objects), the object to be appended is converted to the correct type (in the example just given, this would mean generating all the Ligand objects contained in the PDB object and adding them to the set).
5.2.10 Summary of Set Functions

Sets have the following functions:
• copy (see Section 5.2.3, page 56)
• save (see Section 5.2.4, page 56)
• save_to_hitlist (see Section 5.2.4, page 56)
• sort (see Section 5.2.8, page 58)
• reverse (see Section 5.2.8, page 58)
• append or extend (see Section 5.2.9, page 60)

In addition, sets can be:
• Interconverted (one type of set to another) (see Section 5.2.5, page 56)
• Subjected to logical operations (see Section 5.2.6, page 57)
• Indexed (for accessing items within the set) in various ways (see Section 5.2.7, page 58)

and there are options for deleting items from a set and taking “slices” (i.e. subsets) of a set (see Section 5.2.7, page 58).

6 Doing Searches and Other Calculations: Operation Objects

Operation objects (see Section 3.3.3, page 15) are available for performing the above tasks. In addition, customised operation objects may be written for doing searches and other calculations not in the above list (see Section 8, page 73).

6.1 Text and Keyword Searching

The text search class can be used to filter a set of data objects so that only those objects containing a specified textual search term will be kept. The class can handle both simple text searches and more complex searches involving regular expressions. It is also possible to limit the search to particular text fields, e.g. the author field.

6.1.1 Creating a Text Search Object; Initialization Parameters

An operation object for performing a text search can be created with a command of the form:

    text_search_object = reliscript.text_search(parameters)

The first parameter must specify what is to be searched for. This can be either a string (see Section 6.1.2, page 62) or a compiled regular expression object (see Section 6.1.3, page 62). Other, optional, parameters are:

case
• When a text string is used the search will, by default, ignore the case of the string while searching.
By passing the option case=’match’ the search is forced to match the case of the search string. This option will be ignored if the search is for a regular expression. component or components (either spelling can be used) • This allows the searching of text strings within particular components of the data object. For example, if we have a ligand set but wish to do a search on text strings in the PDB object associated with the ligand, we would use the option component=’pdb’. Conversely, if we have a PDB set, but wish to search the text strings of the chains, nucleic_acids, ligands or solvent molecules in the PDB entries, we would use component=’chains’, component=’nucleic_acids’, component=’ligands’ or Reliscript User Guide 61 component=’solvent’, respectively. An option such as components=[’chains’,’ligands’] with a PDB set would search the text-string attributes of both the chain and ligand objects associated with each PDB object. field or fields or attribute or attributes (any spelling can be used) • By default, the search will be over all the text-string attributes within the object or its nominated component(s). The precision and speed of the search can be improved by specifying which object attribute(s) are to be searched. Examples are: field=’authors’ and attributes=[’header’,’method’]. type • If a text string is used, then, by default, the search will simply look for this string. Passing the argument type=’re’ will treat the passed string as a regular expression definition string and will create an internal regular expression object. This can be more convenient than setting up a regular expression object yourself before creating the text search object, but is not as flexible. 6.1.2 Example Text Search Please refer to example_text_search.py to view the example script. 6.1.3 Example Regular Expression Text Search Please refer to example_regular_expression_search.py to view the example script. 
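As a rough illustration of the case, type and field options described in Section 6.1.1, the matching rules can be mimicked with Python's standard re module. This is only a sketch under stated assumptions: it does not call Reliscript, the helper text_matches and the sample records are hypothetical, and the real text search object may behave differently in detail.

```python
import re

def text_matches(record, term, case=None, search_type=None, fields=None):
    """Return True if term is found in any selected text field of record.

    Hypothetical stand-in for the behaviour described in the manual;
    the parameter names only mirror the documented options.
    """
    if search_type == 're':
        # Regular expression: the case option is ignored
        pattern = re.compile(term)
    elif case == 'match':
        # Literal string, case-sensitive
        pattern = re.compile(re.escape(term))
    else:
        # Default: literal string, case ignored
        pattern = re.compile(re.escape(term), re.IGNORECASE)
    # Default: search every text field of the record
    names = fields if fields is not None else sorted(record)
    return any(pattern.search(record[name]) for name in names if name in record)

# Made-up records standing in for data objects
records = [
    {'title': 'UROKINASE-TYPE PLASMINOGEN ACTIVATOR', 'authors': 'A.SMITH'},
    {'title': 'Thrombin inhibitor complex', 'authors': 'B.Urokin'},
]

default_hits = [r for r in records if text_matches(r, 'urokin')]
case_hits = [r for r in records if text_matches(r, 'urokin', case='match')]
title_hits = [r for r in records if text_matches(r, 'urokin', fields=['title'])]
regex_hits = [r for r in records
              if text_matches(r, r'^UROKINASE', search_type='re', fields=['title'])]
```

Restricting the search to the title field removes the second record, which only matched through its author name; this is the gain in precision that the field option is meant to give.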
6.2 Numeric Searching

The numeric search class can be used to filter a set of data objects on the numerical value of a particular attribute, e.g. resolution, mol_wt, etc.

6.2.1 Creating a Numeric Search Object; Initialization Parameters

An operation object for performing a numeric search can be created with a command of the form:

    numeric_search_object = reliscript.numeric_search(parameters)

The first parameter must specify the attribute whose numeric value is to be tested, e.g. the resolution of a PDB object. Other, optional, parameters are:

min
• Specifies the minimum acceptable value of the attribute.

max
• Specifies the maximum acceptable value of the attribute.

component
• This allows the searching of attributes within particular components of the data object. For example, if we have a ligand set but wish to do a numeric search on the resolution of the parent PDB objects, we would use the option component='pdb'.

6.2.2 Example Numeric Search

Please go to example_numeric_search.py to view the example script.

6.3 Sequence Searching

The sequence search class can be used to filter a set of Chain objects so that only those objects are kept whose percentage sequence identity to a user-specified Chain object (or a sequence defined by one-letter amino-acid codes) falls within a given range. By default, the percentage identity is set to 100, i.e. the search will find exact sequence matches only. The search object reorders the set so that the most similar chains are at the front. The sequence search class has an additional use beyond simple chain similarity as it is used as the basis for determining similar binding sites (see Section 6.7.3, page 72).
6.3.1 Creating a Sequence Search Object; Initialization Parameters

An operation object for performing a FASTA (http://fasta.bioch.virginia.edu/) sequence search can be created with a command of the form:

    seq_search_object = reliscript.sequence_search(parameters)

The first parameter must be the sequence to be searched for, either as a string of one-letter codes or as a Chain object (see Section 6.3.3, page 64). Other, optional, parameters are:

minidentity and maxidentity
• These should be floating point numbers in the range 0.0 to 100.0, with maxidentity greater than or equal to minidentity. Their purpose is to define how closely the specified search sequence must be matched in order for a data object to be considered a hit. The search will calculate an identity value for each data-object sequence compared with the requested search sequence. Only those objects for which the identity value is greater than or equal to minidentity and smaller than or equal to maxidentity will be regarded as hits. The default values are minidentity = 100.0, maxidentity = 100.0, i.e. only exact matches.

attribute_name
• By default, an attribute called sequence_similarity is added to each hit (i.e. data object found in a sequence search) (see Section 6.3.2, page 64). It contains information about how similar the hit is to the search sequence. A different name can be specified for this attribute, e.g. by using attribute_name = seqsim as an option when creating the sequence search object. This would be useful if two or more sequence searches were performed on the same set of data objects. By using different attribute names for each search, it would be possible to distinguish the results of the searches later.

align_identity
• This retrieves the sequence similarity of two chains from different proteins. It is an "on the fly" ALIGN calculation which does a one-to-one sequence alignment.
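The effect of minidentity and maxidentity can be pictured with a small stand-alone sketch. It makes a simplifying assumption that is not how Reliscript works: identity is counted position by position on sequences of equal length, whereas the real search relies on FASTA to align the sequences first. The function names and the sequences are hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of identical positions in two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError('this sketch only handles equal-length sequences')
    same = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * same / len(seq_a)

def identity_filter(query, candidates, minidentity=100.0, maxidentity=100.0):
    """Keep candidates whose identity lies in [minidentity, maxidentity]."""
    hits = []
    for name, sequence in candidates:
        value = percent_identity(query, sequence)
        if minidentity <= value <= maxidentity:
            hits.append((name, value))
    # Most similar sequences first, as the manual describes for the real search
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits

# Made-up one-letter-code sequences
query = 'IVGGKVCPKG'
candidates = [
    ('chain_a', 'IVGGKVCPKG'),   # identical to the query
    ('chain_b', 'IVGGRVCPKG'),   # one substitution in ten residues
    ('chain_c', 'AAAAKAAAAA'),   # one identical position only
]

exact_only = identity_filter(query, candidates)
loose = identity_filter(query, candidates, minidentity=80.0, maxidentity=100.0)
```

With the default limits only the identical chain is kept; widening the range to 80.0 to 100.0 also admits the chain with a single substitution.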
6.3.2 Attributes Created by Sequence Search Objects

The following attribute will be added to each data object passing a sequence search:

sequence_similarity
• Dictionary object containing two values whose keys are homology and score. These relate to the FASTA-calculated homology value and score, respectively.
• The default name of this attribute, sequence_similarity, can be over-ridden (see Section 6.3.1, page 63).
• An example showing how the homology value can be accessed is:

    # Examine the similarity value for the third-closest hit
    print chain_set[2].sequence_similarity['homology']
    90.0

6.3.3 Example Sequence Search

Please go to example_sequence_search.py to view the example script.

6.4 Consensus Motif Searching

The consensus motif search class can be used to filter a set of data objects so that only those data objects matching a specified consensus motif (an amino acid sequence containing one or more variable residues) will be kept. It is different from the sequence search object (see Section 6.3, page 63) in two ways:
• The sequence specified can include the character X, which will match any residue.
• The search will only find exact matches (apart from the variability implied by the use of the symbol X), i.e. there is no option to specify a homology range, as there is in the normal sequence search.

6.4.1 Creating a Consensus Motif Search Object; Initialization Parameters

An operation object for performing a consensus motif search can be created with a command of the form:

    con_motif_object = reliscript.consensus_search(parameters)

The first parameter must be the sequence to be searched for, either as a string of one-letter codes, in which X can be used to mean any amino acid (see Section 6.4.3, page 66), or as a Python regular expression string or search object (see Section 6.4.4, page 66).
Other, optional, parameters are:

attribute_name
• By default, an attribute called consensus_search is added to each data object found in a consensus motif search. It contains additional information about the results of the search (see Section 6.4.2, page 65). A different name can be specified for this attribute, e.g. by using attribute_name = consim as an option when creating the consensus motif search object. This would be useful if two or more consensus motif searches were performed on the same set of data objects. By using different attribute names for each search, it would be possible to distinguish the results of the searches later.

6.4.2 Attributes Created by Consensus Motif Search Objects

The following attribute will be added to each data object passing a consensus motif search:

consensus_search
• Dictionary object containing one item (a tuple of tuples) whose key is locations. Each inner tuple contains two values: the first is the starting position of the match in the sequence (note that the first residue in a chain is number 0, not 1), the second is the length of the match, i.e. the number of residues in the matched sequence. The latter is relevant if a regular expression has been used which may produce sequence matches involving variable numbers of residues (see Section 6.4.4, page 66).
• The default name of this attribute, consensus_search, can be over-ridden (see Section 6.4.1, page 65).
• An example showing how the location data can be accessed is:

    # Examine the matching location(s) for the third hit
    print chain_set[2].consensus_search['locations']
    ((34,6),)

6.4.3 Example Consensus Motif Search

Please go to example_consensus_motif_search.py to view the example code.
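To make the X wildcard and the (start, length) tuples of Sections 6.4.1 and 6.4.2 concrete, the sketch below translates a motif into a regular expression with Python's standard re module and reports zero-based match locations. It is an illustration only: motif_locations and the sequence are hypothetical, and nothing here is taken from the Reliscript implementation.

```python
import re

def motif_locations(sequence, motif):
    """Return a tuple of (start, length) pairs for each motif match.

    X in the motif matches any single residue; start positions are
    zero-based, in line with the convention described in the manual.
    """
    # Translate X into the regular expression wildcard, escape the rest
    pattern_text = ''.join('.' if ch == 'X' else re.escape(ch) for ch in motif)
    pattern = re.compile(pattern_text)
    # finditer reports non-overlapping matches from left to right
    return tuple((m.start(), len(m.group(0))) for m in pattern.finditer(sequence))

# Made-up sequence of one-letter amino-acid codes
sequence = 'MKTAGSGKSTLLNAL'
locations = motif_locations(sequence, 'GXGKS')
no_match = motif_locations(sequence, 'WXW')
```

Here the motif GXGKS matches the residues GSGKS starting at index 4, so the result has the same shape as the locations value shown in Section 6.4.2.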
6.4.4 Example Consensus Motif Search Using a Regular Expression

A search for a protein sequence beginning with I, then having 3, 4 or 5 G or V residues, followed by one Q, followed by 2 or 3 residues of any type, and ending with the sequence PRS can be achieved using Python's regular expression module. Please go to example_consensus_motif_search_using_regular_expression.py to view the example code.

6.5 SMILES and SMARTS Searching

The SMILES search class can be used to filter a set of data objects so that only those objects containing a particular substructure, as defined by a SMILES string (http://www.daylight.com/smiles/index.html), will be kept. The following information is helpful if you use SMILES in Reliscript:
• Information about charges, isotopes and stereochemistry is ignored.
• Hydrogens are only allowed in brackets together with a heavy atom, e.g. [NH3] or [OH].
• Hydrogens can be used to fill up valencies, e.g. C(=O)[NH2] will find only carbamoyl groups, and not, e.g., peptide linkages.
• Reliscript supports the bond-type any (use the one-character symbol '~').
• Reliscript supports three types of atom "wildcards", viz:
    • *: any atom
    • A: any aliphatic atom
    • a: any aromatic atom
• Aromatic bonds are only supported for 6-membered aromatic rings; use single and double bonds for other unsaturated rings.
• Reliscript does not support tautomeric states; use bonds of type any (SMILES code ~).
• Queries using '.' are not supported.

SMARTS Searching

The SMARTS search class is analogous to SMILES except SMARTS are used to represent substructures rather than entire molecules (http://www.daylight.com/dayhtml_tutorials/languages/smarts/index.html). The implementation of SMARTS in Relibase+ is not comprehensive; limitations are primarily due to the way in which ligands are stored in Relibase+.
The following should be taken into consideration when using SMARTS:
• Relibase+ assumes bond types given in the SMARTS query match Relibase+ conventions. In particular:
    • Six-membered aromatic rings have aromatic bond types.
    • Five-membered rings are non-aromatic unless pi bonded to a metal (e.g. ferrocenes).
• Due to the nature of the data source, hydrogen counts on atoms other than carbon are not reliable; for heteroatoms, use of the Dn atom constraint (number of non-hydrogen connections) is recommended rather than Xn (total number of connections).

Unsupported features (general):
• Dot disconnected fragments, e.g. (C).(C)
• Recursive SMARTS, e.g. [$(CC);$(CCC)]
• Reaction SMARTS, e.g. [CC>>CC].

Unsupported features (atom properties):
• Some atom constraints (where n is an integer):
    • v<n>: valency constraint.
    • x<n>: number of ring connections constraint.
    • h<n>: implicit hydrogen constraint (no distinction is made between implicit and explicit H in Relibase+).
    • Charge constraints (no charges are stored in Relibase+).
    • R<n> where n>=1 (no smallest set of smallest rings implementation).
    • #<n>: atomic number (the element symbol should be used).
    • <n>: atomic mass.
• Stereochemical descriptors.
• Constraints of different types combined with OR operator, e.g. [X1, D2].
• High precedence AND in OR subexpression, e.g. [C, N&H1] (constraints can only be applied to all element types in an atom).

Unsupported features (bond properties):
• Stereochemical descriptors for double bonds: these are treated as single bonds with unspecified stereochemistry.
• High-precedence AND in OR subexpression, e.g. =&@,- (cyclic double or single and unspecified cyclicity).
• The following constructs are not supported:
    • NOT any bond, e.g. !~.
    • Different bond types combined with AND operator, e.g. -&= (single and double).
    • Different NOT bond types combined with OR operator, e.g. !-,!= (not single or not double, equivalent to any bond).
6.5.1 Creating a SMILES or SMARTS Search Object; Initialization Parameters

An operation object for performing a SMILES search can be created with a command of the form:

    smiles_object = reliscript.smiles_search(parameters)

A similar operation object can be created for SMARTS:

    smarts_object = reliscript.smarts_search(parameters)

The first parameter must be the SMILES or SMARTS definition of the substructure to be searched for (see Section 6.5.3, page 69). Other, optional, parameters are:

attribute_name
• By default, an attribute called smiles_hit_data/smarts_hit_data (for SMILES and SMARTS searches respectively) is added to each data object found in a substructure search. It contains information about which atoms in the hit object matched the atoms of the SMILES string (see Section 6.5.2, page 69). A different name can be specified for this attribute, e.g. by using attribute_name = my_matched_atoms as an option when creating the SMILES search object. This would be useful if two or more SMILES searches were performed on the same set of data objects. By using different attribute names for each search, it would be possible to distinguish the results of the searches later.

store_match
• The default value for this parameter is 1, which means that the smiles_hit_data attribute will be created for each hit. Set store_match = 0 if you do not want to create this attribute (i.e. you do not need the atom-matching information).

exact_match
• This is an additional optional parameter for SMILES search objects only. By default, this parameter is set to 1, meaning that the SMILES string 'c1ccccc1' would match only benzene, not ligands containing a benzene substructure. Note that by setting this parameter to 0, the SMILES search object will perform exactly the same search as the equivalent SMARTS search object.

all_models
• If used, this parameter should be added after the above arguments. It controls whether all ligand models are included for NMR structures which have multiple structural models. The default all_models=0 includes ligands only from the first NMR model. Setting all_models=1 causes all ligand models to be stored.

See smiles_smarts_search_example.py for a sample script illustrating ligand searches using SMILES and SMARTS.

6.5.2 Attributes Created by SMILES/SMARTS Search Objects

The following attribute will be added to each data object passing a SMILES/SMARTS search:

smiles_hit_data/smarts_hit_data
• List of lists containing the atom objects that match the SMILES/SMARTS string. The outer list will contain more than one item if the substructure specified by the SMILES/SMARTS string occurs more than once in the hit object. The inner list contains the actual atoms in a matching fragment.
• For example, suppose a search for the SMILES string 'C(=O)N' (i.e. a peptide group) has been performed on the ligand set lig_set. After the search, lig_set[0].smiles_hit_data[0][0] contains the carbon atom of the first peptide group in lig_set[0]. Similarly, lig_set[0].smiles_hit_data[0][1] and lig_set[0].smiles_hit_data[0][2] contain the oxygen and nitrogen atoms, respectively. If lig_set[0] contains more than one peptide group, then lig_set[0].smiles_hit_data[1][0] to lig_set[0].smiles_hit_data[1][2] will contain the C, O and N atoms of the second peptide group; and so on.
• The default name of this attribute, smiles_hit_data/smarts_hit_data, can be overridden (see Section 6.5.1, page 68).
• If desired, you can request that this attribute is not created (e.g. if you do not need the matching information and want to save memory).

6.5.3 Example SMILES Search

Please refer to example_smiles_search.py to view the example script.
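The list-of-lists layout described in Section 6.5.2 can be explored without Relibase+ by using plain strings in place of atom objects. The sketch below is a stand-in under that assumption: the atom labels are invented, and a real smiles_hit_data attribute would hold Atom objects rather than strings.

```python
# Hypothetical stand-in for the hit data of one ligand after a search
# for 'C(=O)N': one inner list per matching fragment, with the atoms in
# the same order as the atoms of the query string.
smiles_hit_data = [
    ['C7', 'O8', 'N9'],       # first matching peptide group
    ['C15', 'O16', 'N17'],    # second matching peptide group
]

# Atoms of the query, in the order they appear in 'C(=O)N'
query_atoms = ['C', 'O', 'N']

# Pair every query atom with the atom it matched in each fragment
summary = []
for match_number, fragment in enumerate(smiles_hit_data):
    for query_atom, hit_atom in zip(query_atoms, fragment):
        summary.append((match_number, query_atom, hit_atom))

# Direct indexing: [fragment number][position within the query]
first_carbon = smiles_hit_data[0][0]
second_nitrogen = smiles_hit_data[1][2]
number_of_matches = len(smiles_hit_data)
```

The first index selects the matching fragment and the second selects the position within the query, so the length of the outer list gives the number of times the substructure occurs in the hit.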
6.6 Similar Ligand Searching

The similar ligand search class can be used to filter a set of Ligand objects so that only those objects are kept whose structural similarity to a user-specified ligand falls within a given similarity-coefficient range. Similarity is judged using a Tanimoto similarity coefficient calculated from 2D fingerprints. The search object also reorders the set so that the most similar ligands are at the front. Only the 1000 most similar results to the query ligand are returned.

6.6.1 Creating a Similar Ligand Search Object; Initialization Parameters

An operation object for performing a similar ligand search can be created with a command of the form:

    sim_lig_search = reliscript.similar_ligand_search(parameters)

The first parameter must specify the Ligand object which is to be used as the basis for the similarity calculations. Other, optional, parameters are:

mintani and maxtani
• These should be floating point numbers in the range 0.0 to 1.0, with mintani less than or equal to maxtani. The search will calculate a Tanimoto similarity coefficient for each ligand compared with the search ligand. Only those ligands whose Tanimoto coefficient is greater than or equal to mintani and less than or equal to maxtani will be accepted as hits. The default values are mintani = 0.4 and maxtani = 1.0.

attribute_name
• By default, an attribute called ligand_similarity, containing similarity-coefficient information, will be added to each Ligand object found in a similar ligand search (see Section 6.6.2, page 70). A different name can be specified for this attribute, e.g. by using attribute_name = ligsim as an option when creating the similar ligand search object. This would be useful if two or more similar ligand searches were performed on the same set of data objects. By using different attribute names for each search, it would be possible to distinguish the results of the searches later.
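For readers unfamiliar with the Tanimoto coefficient, the following sketch shows the standard definition (bits in common divided by bits set in either fingerprint) together with a mintani/maxtani filter and the ordering by similarity. The fingerprints are made-up sets of bit positions and the function names are hypothetical; how Relibase+ actually generates its 2D fingerprints is not covered by this manual.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0
    return float(len(bits_a & bits_b)) / union

def similar_ligands(query_bits, library, mintani=0.4, maxtani=1.0, max_hits=1000):
    """Return (name, coefficient) pairs within range, most similar first."""
    hits = []
    for name, bits in library.items():
        value = tanimoto(query_bits, bits)
        if mintani <= value <= maxtani:
            hits.append((name, value))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    # Mirror the documented cap on the number of results returned
    return hits[:max_hits]

# Made-up fingerprints: each set lists the positions of the bits that are on
query = set([1, 4, 7, 9])
library = {
    'lig_a': set([1, 4, 7, 9]),     # identical fingerprint
    'lig_b': set([1, 4, 7, 12]),    # 3 bits shared, 5 bits in the union
    'lig_c': set([2, 3, 9]),        # 1 bit shared, 6 bits in the union
}

hits = similar_ligands(query, library)
```

With the default range of 0.4 to 1.0, lig_c (coefficient 1/6) is rejected and the two remaining ligands are returned in order of decreasing similarity.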
6.6.2 Attributes Created by Similar Ligand Search Objects

The following attribute will be added to each Ligand object passing a similar ligand search:

ligand_similarity
• Dictionary object containing one item whose key is value. This is the calculated similarity coefficient of the Ligand object with the search ligand.
• The default name of this attribute, ligand_similarity, can be over-ridden (see Section 6.6.1, page 70).
• An example showing how the similarity value can be accessed is:

    # Print similarity value for second hit
    print lig_set[1].ligand_similarity['value']
    0.95

6.6.3 Example Similar Ligand Search

Please go to example_similar_ligand_search.py to view the example code.

6.7 Superimposing Chains and Similar Binding Sites

The chain superimposition class can be used to superimpose each member of a set of Chain objects onto a reference chain. The normal mode of use will involve performing a sequence similarity search first (see Section 6.3, page 63), to sequence-align each member of the chain set with the reference chain. Once this is done, the chain superimposition object can be used to perform the superimpositions by least-squares overlaying the alpha-carbon atoms of some of the matched residues. A particular use of chain superimposition is to overlay similar binding sites.

6.7.1 Creating a Chain Superimposition Object; Initialization Parameters

An operation object for performing a chain superimposition can be created with a command of the form:

    chain_superpose = reliscript.superimpose_chain(parameters)

The first parameter must specify the Chain object on which the other chains will be superimposed. Other, optional, parameters are:

ligand
• The atoms used for superimposition will always be restricted to alpha-carbons in matched residues (i.e. residues that have been successfully matched in a sequence alignment of the reference chain and the chain to be superimposed). The least-squares superposition can be further restricted to alpha-carbon atoms in matched residues in the binding site of a ligand bound to the reference chain. To do this, include a parameter such as ligand = ref_chain_lig (see Section 6.7.2, page 72).

6.7.2 Example Chain Superimposition

This example assumes that a sequence similarity search has already been done (see Section 6.3, page 63). Please refer to example_chain_superimposition.py for the chain superimposition example script.

6.7.3 Example Similar Binding Site Search

A similar binding site search involves:
• Specifying the ligand of interest and getting the chain to which it is bound (the reference chain).
• Using a sequence similarity search (see Section 6.3, page 63) to find all chains similar in sequence to the reference chain.
• Using a chain superimposition object (see Section 6.7, page 71) to overlay the similar chains onto the reference chain.
• Applying a distance test to find all ligands bound to the various superimposed chains that are close to the original ligand of interest.
• Writing this information out.

Please refer to example_similar_binding_site_search.py to view the example script.

7 Global Utility Functions

In addition to the reliscript.create command, used for creating data objects (see Section 4, page 23), the reliscript.set command for creating sets (see Section 5.2.2, page 55), and commands such as reliscript.text_search, reliscript.sequence_search, etc., for constructing operation objects (see Section 6, page 61), Reliscript provides a selection of global utility functions, as follows:

7.1 nice level

The nice level of Reliscript scripts is set to 5 by default so that they do not run with a higher priority than the Relibase server. Note that the nice level of your scripts can easily be modified using the built-in os module. For scripts that are anticipated to run for a long time it is recommended that the nice level is set to 10.
This is achieved by inserting the code below at the beginning of the script:

    import os
    os.nice(10)

7.2 distance and max_distance

Each of these functions takes a pair of objects as arguments, e.g.

    reliscript.distance(object1, object2)

Each object can be any of the following:
• An Atom object or any other object that has a coords attribute.
• Any data object that will provide a list of Atom objects (i.e. all Reliscript data objects).

distance calculates and returns the distance (in Å) between the two closest atoms in the two objects (one from object1, one from object2). Conversely, max_distance calculates and returns the maximum distance between two atoms, one from each of the two objects.

7.3 hitlists

A command such as:

    reliscript.hitlists('a_user_name')

will return a list of dictionaries containing information on all the Relibase+ hitlists that have been saved for the given username. If the username is omitted, the current username will be used. Each hitlist dictionary will look like, e.g.

    {'name': 'test', 'user': 'a_user_name', 'time': u'2008-12-11 11:39:11.58', 'type': 'pdb', 'size': 10}

7.4 use_workspace, set_workspace and set_username

These functions set the workspace to be used for saving and reading of hitlists (see Section 3.7.3, page 18), e.g.

    reliscript.use_workspace('my_name')

By default, Reliscript will use the login username of the user for the workspace identifier. The global utility functions set_workspace and set_username do exactly the same thing as use_workspace.

8 Extending the Functionality of Reliscript

It is possible to extend the functionality of Reliscript by building a library of your own Python and Reliscript functions, and by using the many Python modules available on the Internet (see http://www.python.org). In addition, you can write customised operation classes for performing searches and other calculations not provided by default.
To do this, it is necessary for the customised operation class to inherit from a base operation class. For more details, see example_customised_operation_class.py.

8.1 Base Operation Class

The base operation class, reliscript.base_operation_class, provides much of the code required for user-defined operation classes. The user's operation class must inherit from base_operation_class and provide working versions of a small number of class functions (see Section 8.2, page 74).

8.2 Functions Required in a Customised Operation Class

Some or all of the following functions of the base operation class (see Section 8.1, page 74) will need to be over-ridden to produce a useful, customised operation class. Default implementations for all functions are provided in the base class, and a particular function need not be over-ridden if the default action is acceptable. The function declarations below are given as they must appear in the class definition and, as such, self must be the first parameter:

filter(self, object)
• Performs a test on object (e.g. for the presence of a text string). Returns 1 if the test was successful, otherwise returns zero; default return value is 1. This function will normally be overridden to create a customised operation class that applies a useful filter.

filter_object_type(self)
• Returns the type of object that the filter function will require. Must return one of 'pdb', 'chain', 'ligand' or 'solvent'.

use_filter(self)
• Returns 1 if the filter function is to be used, otherwise returns zero. By default, this function returns 1.

manipulate(self, object)
• Performs some sort of manipulation on the object passed in; this could be, for example, the addition of a new attribute to the object. By default, this function leaves the object unchanged. Unlike the filter function, the type of the set being processed, e.g. 'pdb', 'ligand', etc., must match the object type that the manipulate function expects.
manipulate_object_type(self)
• Returns the type of object that the manipulate function will require. Must return one of 'pdb', 'chain', 'ligand' or 'solvent'.

use_manipulate(self)
• Returns 1 if the manipulate function is to be called, otherwise returns zero. By default, this function returns 1.

8.3 Example of a Customised Operation Class

The following is an example script showing how the base operation class could be extended (see Section 8.2, page 74) to create a customised operation object that will:
• Filter a set by performing a search on all ligands in a PDB object for a given text string in the ligand compound name.
• For each PDB object that passes this test, store the total number of residues in the protein chains in the PDB object as an object attribute.

Please refer to example_customised_operation_class.py for the example script.

9 Example Scripts

Some simple examples are included in previous sections of this manual:
6.1.2 Example Text Search (see page 62)
6.1.3 Example Regular Expression Text Search (see page 62)
6.2.2 Example Numeric Search (see page 63)
6.3.3 Example Sequence Search (see page 64)
6.4.3 Example Consensus Motif Search (see page 66)
6.4.4 Example Consensus Motif Search Using a Regular Expression (see page 66)
6.5.3 Example SMILES Search (see page 69)
6.6.3 Example Similar Ligand Search (see page 71)
6.7.2 Example Chain Superimposition (see page 72)
6.7.3 Example Similar Binding Site Search (see page 72)
8.3 Example of a Customised Operation Class (see page 75)

In addition, more extensive and scientifically interesting examples are:
9.1 Finding and Classifying Contacts to Ligand Carboxylates (see page 76)
9.2 Analysing Ligand Contacts to Atoms in the Crystal-Field Environment (see page 76)

These scripts are available as separate .py files so that you can try them out.
9.1 Finding and Classifying Contacts to Ligand Carboxylates

Please refer to example_contacts_to_carboxylates.py to view the example script.

9.2 Analysing Ligand Contacts to Atoms in the Crystal-Field Environment

Please refer to example_packing_environment_contacts.py to view the example script.

10 Acknowledgements

Reliscript was conceived by Ingo Dramburg (Institute of Pharmaceutical Chemistry, Philipps University of Marburg, Germany) who also contributed significantly to its design and coding.

The Java Programming Language is provided by Sun Microsystems, Inc. under the Binary Code License Agreement. The Python Programming Language is provided by the Python Software Foundation under the Python Licence, Version 2.5. The Java Python integration is provided by JPype under the Apache Licence V2.0.

11 Appendix A: Glossary

This is mainly (but not exclusively) a guide to Python terms.
attributes (see page 77)
comment (see page 77)
dictionaries (see page 78)
exception (see page 79)
flow control (see page 79)
for (see page 79)
functions (see page 80)
global functions (see page 80)
if (see page 80)
indentation (see page 80)
lists (see page 80)
modules (see page 82)
representation (see page 82)
Sybyl atom type (see page 83)
tuples (see page 83)
types (see page 84)
while (see page 84)

attributes

The attributes of an object are those items that can be retrieved by simply placing the attribute name after the name of the object. For example, a command such as:

    res = pdb_object.resolution

would result in res being a floating point number containing the resolution in Å of the PDB structure contained in pdb_object.

comment

In Python, anything on a line after a hash mark (#) is a comment. Additionally, comments can be attached to functions by using a triple-quoted string block below the function definition.
These are useful as they can be picked up by some automatic documentation systems (such as Pydoc, which comes with Python) and some Python shell applications (such as PyCrust: http://sourceforge.net/projects/pycrust/), where the comment will appear as a help pop-up.

dictionaries

Python dictionaries (or associative arrays) are container (i.e. storage) objects, i.e. a dictionary object contains within it a collection of other objects. The objects stored in a dictionary are referenced by a key. Python dictionaries can be initialised by specifying the key, value pairs enclosed in {} brackets. For example:

    # Dictionary: provides lookup
    fruit_colours = {'apple':'red','banana':'yellow'}
    # The next line returns 'yellow'
    fruit_colours['banana']

A large number of standard operations can be done on dictionaries:

DICT[KEY]                        Returns item associated with KEY or raises an exception.
DICT[KEY] = ITEM                 Stores ITEM in dictionary with key KEY.
del DICT[KEY]                    Removes item referenced by KEY from dictionary.
len(DICT)                        Number of items stored in the dictionary.
DICT.has_key(KEY)                Returns true if DICT has an item indexed with KEY.
DICT.keys()                      Returns a list of all dictionary keys.
DICT.values()                    Returns a list of all items stored in dictionary.
DICT.items()                     Returns a list of tuples, each a key, item pair.
DICT.clear()                     Removes all items from DICT.
DICT.copy()                      Returns a copy of DICT. This is a shallow copy: see main Python documentation for more details.
DICT.update(DICT2)               Merges DICT2 into DICT. If identical keys, DICT2 takes precedence.
DICT.get(KEY[, default])         Like DICT[KEY] but will return default if KEY not present.
DICT.setdefault(KEY[, default])  Like .get, but sets default for later requests.
DICT.popitem()                   Removes and returns an arbitrary (key, item) pair.

exception

Python can handle unexpected events, i.e. exceptions, which occur during the runtime of a Python program.
Example: a = 1 b = 0 print a/b would raise a ’ZeroDivisionError’ and stop the program. The script zero_division_example.py illustrates how this exception could be handled, thus preventing the script from terminating. flow control Python uses the fairly standard flow-control options of: • for (see page 79); • while (see page 84); • if (see page 80). for This command loops through the contents of a list, tuple or any other object that supports index functionality and applies the code below it. For an example see the script for_loop_example.py. Note that the beginning and end of a for loop is determined by the indentation. There are two special commands used in for loops, viz. continue, which finishes the current loop and starts the next, and break, which exits the for loop completely. Reliscript User Guide 79 functions Calling a function of an object: object.myfunc(arguments) executes some code associated with that object. For example, a command such as: atom_list = pdb_object.pdb_atoms(include_pack=0) would execute code that sets up a list of string objects, atom_list, containing the ATOM records of the PDB entry stored in pdb_object. The argument in this example instructs the function to exclude pack atoms, i.e. atoms generated by crystallographic symmetry. global functions A global function is one that is not associated with a particular object, e.g. d = reliscript.distance(atom1, atom2) if Like the similar command while (see page 84), the if command executes a test statement but the associated command will only be run once. In addition there is the elif command, which stands for else if, and the else command. An example is the script if_example.py. indentation Indentation of commands is (and must be) used to indicate code that lies within loops and conditional statements. It is good practice to never use tabs in python scripts. Furthermore it is recommended that each indentation is 4 white spaces long. For some sample code see the script indentation_example.py. 
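The glossary entries above on exceptions, for, if and indentation can be tried out together in one short script. The following is a minimal sketch only; it is not the distributed zero_division_example.py, for_loop_example.py or if_example.py, and the function names safe_ratio and describe are invented for illustration. It calls print with brackets so that it should also run under Python 3.

```python
def safe_ratio(numerator, denominator):
    """Return numerator/denominator, or None if the division is undefined."""
    try:
        return float(numerator) / denominator
    except ZeroDivisionError:
        # Handle the exception instead of letting the script terminate.
        return None


def describe(values):
    """Label each value; shows for, if/elif/else, continue and break."""
    labels = []
    for value in values:
        # Everything indented by 4 spaces below belongs to the for loop.
        if value is None:
            continue      # finish this iteration and start the next
        elif value < 0:
            break         # exit the for loop completely
        elif value == 0:
            labels.append('zero')
        else:
            labels.append('positive')
    return labels


ratios = [safe_ratio(1, 0), safe_ratio(0, 2), safe_ratio(3, 2),
          safe_ratio(-1, 2), safe_ratio(5, 1)]
print(describe(ratios))
```

The first ratio is skipped by continue because the division by zero was caught, and the negative fourth ratio triggers break, so the last value is never examined.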
lists

Python lists are container (i.e. storage) objects: a list object contains within it a collection of other objects. Lists can be initialised by specifying the required items enclosed in [] brackets, e.g. [0,1,2,3] initialises a list of four integers. The contents of a list can be of any type, including other lists, e.g. [1,'two',[3.0,'four']] (this being a list containing an integer, a string, and another list). Accessing lists is done by treating them as arrays; for example, if the previous list was called mylist, then mylist[1] would return 'two' and mylist[2][0] would return 3.0. The first item in a list has index number 0, not 1.

Unlike tuples (see page 83), lists can be changed, e.g.

    # List: can be changed
    colours = ['red','yellow']
    # The next line adds 'blue' to the colours list
    colours.append('blue')

A large number of standard operations can be done on lists:

ITEM in LIST    Logical operation that returns true if ITEM is in LIST
ITEM not in LIST    Logical operation that returns true if ITEM is not in LIST
for ITEM in LIST:    Loops round LIST using each ITEM in turn
LIST1 + LIST2    Lists can be added, e.g. [1,2] + [3,4] = [1,2,3,4]
NUMBER * LIST    Repetition, e.g. [1,2] * 2 = [1,2,1,2]. LIST * NUMBER has the same effect.
LIST[INDEX]    Returns the INDEX item in list. Raises an exception if out of range. Also, INDEX can be negative, in which case it starts from the end, e.g. if L = [1,2,3], then L[-1] = 3
LIST[START:END]    Returns a slice of list, e.g. if L = [1,2,3,4,5] then L[1:-1] = [2,3,4] (note: goes up to, but does not include, L[-1])
len(LIST)    Length of list (i.e. number of objects it contains), e.g. len([0,1,2,3]) = 4
min(LIST)    Returns minimum value in list; mainly useful when all items are numeric or text.
max(LIST)    Returns maximum value in list.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
LIST[INDEX] = ITEM    Removes the current contents of LIST[INDEX] and replaces it with the object ITEM
LIST[START:END] = LIST2    Removes list items between START and END (including START but not END) and replaces them with LIST2
del LIST[INDEX]    Removes item at index INDEX; resulting list will be shorter by one
del LIST[START:END]    Deletes section of list (including START but not END), e.g. if L = [1,2,3,4,5] and del L[1:-1] is executed, then L = [1,5]
LIST.append(ITEM)    Adds ITEM to LIST
LIST.sort([FUNCTION])    Sorts list; FUNCTION is optional and can be used to specify a sort function other than, e.g., normal numerical order
LIST.reverse()    Reverses order of list, e.g. if L = [1,2,3] and L.reverse() is executed, then L = [3,2,1]
LIST.index(ITEM)    Returns index of first instance of ITEM in LIST; raises an exception if not found.
LIST.count(ITEM)    Returns number of times ITEM appears in the list
LIST.insert(INDEX, ITEM)    Inserts ITEM at point INDEX, e.g. if L = [1,2,3] and we execute L.insert(2,4), then L = [1,2,4,3]
LIST.remove(ITEM)    Removes first instance of ITEM in LIST; raises an exception if not present
LIST.pop()    Returns and removes last item in LIST
LIST.extend(LIST2)    Adds LIST2 to end of LIST

modules

Modules are the larger-scale building blocks of Python. Reliscript is a Python module which you load into a Python session using the command import reliscript. Python itself provides a wealth of modules and a large number are available elsewhere for specific tasks, e.g. mathematics, statistics, plotting, and many more (see http://www.vex.net/parnassus/). In all cases, the name used to import a module defines the "namespace" for the components of that module. For example, when Reliscript is loaded using import reliscript, then expressions such as reliscript.set('pdb') or reliscript.distance(at1, at2) are used to access Reliscript functions.
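The way an import name defines a namespace can be seen with any module. The sketch below uses the standard-library modules math and os.path purely as stand-ins, on the assumption that reliscript itself is only importable inside a configured Relibase+ installation; the file and directory names are made up for the example.

```python
import math              # components are reached as math.<name>
import os.path as osp    # "as" gives the namespace a shorter alias

# The module name is the prefix for everything the module provides.
root = math.sqrt(16.0)

# With an alias, the alias becomes the prefix instead.
script_path = osp.join('scripts', 'tutorial1a.py')
script_name = osp.basename(script_path)

print(root)
print(script_name)
```

The same pattern applies to import reliscript as rs, after which functions are reached as rs.set(...) and rs.distance(...) rather than reliscript.set(...) and reliscript.distance(...).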
representation

The representation of an object refers to how it is represented when a command such as:

    print my_object

is executed.

Sybyl atom type

Reliscript will return the Sybyl atom type of an atom, as defined in the Sybyl program of Tripos Inc., St Louis, USA (http://www.tripos.com/). The most important of these types are:

C.3    sp3 carbon
C.2    sp2 carbon
C.1    sp carbon
C.ar    aromatic carbon
C.cat    carbocation (e.g. in guanidinium)
N.3    sp3 nitrogen
N.2    sp2 nitrogen
N.1    sp nitrogen
N.ar    aromatic nitrogen
N.am    amide/peptide nitrogen
N.pl3    trigonal planar nitrogen
N.4    sp3 cationic nitrogen
O.3    sp3 oxygen
O.2    sp2 oxygen
O.co2    carboxylate/phosphate oxygen
S.3    sp3 sulphur
S.2    sp2 sulphur
S.o    sulphoxide sulphur
S.o2    sulphone sulphur
P.3    phosphorus, e.g. in phosphate

Sybyl atom types in Reliscript are not always set reliably, especially for atoms whose protonation state is uncertain.

tuples

Python tuples are container (i.e. storage) objects: a tuple object contains within it a collection of other objects. Tuples can be created by specifying the required objects enclosed in () brackets, e.g. (1,2,3,4) creates a tuple containing four integers. Tuples are like lists (see page 80) as far as data access goes, e.g.

    # Create tuple
    days_of_week = ('mon','tue','wed','thu','fri','sat','sun')
    # The next line returns 'mon'
    days_of_week[0]

Tuples differ from lists in that they cannot be modified. Therefore, tuples have all the same functions as lists for data access - all those above the dotted line in the section on lists (see page 80) - but no functions that modify the tuple contents.

types

Python supports basic types including integers, floats and strings. Strings can be delimited by single ('), double (") or triple (""") quotes. Triple quotes are useful in that they can cover more than one line.
For example:

    Nursery_rhyme = """Mary had a little lamb
    Its fleece was white as snow
    And everywhere that Mary went
    Her lamb was sure to go"""

while

The while command continues to execute a statement until a condition is no longer satisfied; for an example, see the script while_example.py.

12 Appendix B: List of Commands, Attributes, Functions, Parameters and Operators

+    operator applicable to sets (see Section 5.2.6, page 57)
-    operator applicable to sets (see Section 5.2.6, page 57)
&    operator applicable to sets (see Section 5.2.6, page 57)
|    operator applicable to sets (see Section 5.2.6, page 57)
^    operator applicable to sets (see Section 5.2.6, page 57)
a    PDB attribute (see Section 4.1.3, page 24)
adjacent_chains    Ligand attribute (see Section 4.4.3, page 34)
adjacent_ligands    Chain attribute (see Section 4.2.3, page 28)
adjacent_ligands    NucleicAcid attribute (see Section 4.3.3, page 31)
adjacent_nucleic_acids    Ligand attribute (see Section 4.4.3, page 34)
align_identity    sequence_search parameter (see Section 6.3.1, page 63)
all_models    smiles_search or smarts_search parameter (see Section 6.5.2, page 69)
alpha    PDB attribute (see Section 4.1.3, page 24)
append    Set function (see Section 5.2.9, page 60)
atoms    BindingSite attribute (see Section 4.9.3, page 49)
atoms    Bond attribute (see Section 4.8.3, page 47)
atoms    Chain attribute (see Section 4.2.3, page 28)
atoms    Ligand attribute (see Section 4.4.3, page 34)
atoms    NucleicAcid attribute (see Section 4.3.3, page 31)
atoms    PackBindingSite attribute (see Section 4.10.3, page 53)
atoms    PDB attribute (see Section 4.1.3, page 24)
atoms    Residue attribute (see Section 4.6.3, page 41)
atoms    Solvent attribute (see Section 4.5.3, page 38)
attribute    text_search parameter (see Section 6.1.1, page 61)
attribute_name    consensus_search parameter (see Section 6.4.1, page 65)
attribute_name    sequence_search parameter (see Section 6.3.1, page 63)
attribute_name    smiles_search or smarts_search parameter (see Section 6.5.2, page 69)
attribute_name    similar_ligand_search parameter (see Section 6.6.1, page 70)
attributes    text_search parameter (see Section 6.1.1, page 61)
author    PDB attribute (see Section 4.1.3, page 24)
authors    PDB attribute (see Section 4.1.3, page 24)
b    PDB attribute (see Section 4.1.3, page 24)
b_factor    Atom attribute (see Section 4.7.3, page 45)
base_operation_class    Base class in Reliscript (see Section 8.1, page 74)
beta    PDB attribute (see Section 4.1.3, page 24)
binding_site    Ligand attribute (see Section 4.4.3, page 34)
binding_site    PackBindingSite attribute (see Section 4.10.3, page 53)
binding_sites    PDB attribute (see Section 4.1.3, page 24)
bonds    Atom attribute (see Section 4.7.3, page 45)
bonds    BindingSite attribute (see Section 4.9.3, page 49)
bonds    Chain attribute (see Section 4.2.3, page 28)
bonds    Ligand attribute (see Section 4.4.3, page 34)
bonds    NucleicAcid attribute (see Section 4.3.3, page 31)
bonds    PackBindingSite attribute (see Section 4.10.3, page 53)
bonds    PDB attribute (see Section 4.1.3, page 24)
bonds    Residue attribute (see Section 4.6.3, page 41)
bonds    Solvent attribute (see Section 4.5.3, page 38)
bond_type    Bond attribute (see Section 4.8.3, page 47)
bound_ligand    BindingSite attribute (see Section 4.9.3, page 49)
bound_ligand    PackBindingSite attribute (see Section 4.10.3, page 53)
c    PDB attribute (see Section 4.1.3, page 24)
case    text_search parameter (see Section 6.1.1, page 61)
chain    Set type (see Section 5.2.1, page 55)
chain_id    Chain attribute (see Section 4.2.3, page 28)
chain_id    NucleicAcid attribute (see Section 4.3.3, page 31)
chain_id    Residue attribute (see Section 4.6.3, page 41)
chains    BindingSite attribute (see Section 4.9.3, page 49)
chains    PackBindingSite attribute (see Section 4.10.3, page 53)
chains    PDB attribute (see Section 4.1.3, page 24)
clear_transform    BindingSite function (see Section 4.9.4, page 50)
clear_transform    Chain function (see Section 4.2.4, page 29)
clear_transform    Ligand function (see Section 4.4.4, page 36)
clear_transform    NucleicAcid function (see Section 4.3.4, page 32)
clear_transform    PackBindingSite function (see Section 4.10.4, page 53)
clear_transform    PDB function (see Section 4.1.4, page 26)
clear_transform    Residue function (see Section 4.6.4, page 42)
clear_transform    Solvent function (see Section 4.5.4, page 39)
cofactor    Ligand attribute (see Section 4.4.3, page 34)
component    numeric_search parameter (see Section 6.2.1, page 62)
component    text_search parameter (see Section 6.1.1, page 61)
components    text_search parameter (see Section 6.1.1, page 61)
compound    PDB attribute (see Section 4.1.3, page 24)
compound_name    Ligand attribute (see Section 4.4.3, page 34)
consensus_search    Attribute created by consensus_search (see Section 6.4.2, page 65)
consensus_search    Reliscript operation object (see Section 6.4.1, page 65)
coords    Atom attribute (see Section 4.7.3, page 45)
copy    Set function (see Section 5.2.3, page 56)
covalently_bound    Ligand attribute (see Section 4.4.3, page 34)
create    Reliscript command (see Section 4.1.1, page 23)
crystal    PDB attribute (see Section 4.1.3, page 24)
date    PDB attribute (see Section 4.1.3, page 24)
del    Internal function of Set (see Section 5.2.7, page 58)
distance    Reliscript command (see Section 7.2, page 72)
element_no    Atom attribute (see Section 4.7.3, page 45)
elif    Python command (see if, page 80)
else    Python command (see if, page 80)
exptl_method    PDB attribute (see Section 4.1.3, page 24)
extend    Set function (see Section 5.2.9, page 60)
field    text_search parameter (see Section 6.1.1, page 61)
fields    text_search parameter (see Section 6.1.1, page 61)
filter    Customised operation class function (see Section 8.2, page 74)
filter_object_type    Customised operation class function (see Section 8.2, page 74)
for    Python command (see for, page 79)
full_name    Ligand attribute (see Section 4.4.3, page 34)
gamma    PDB attribute (see Section 4.1.3, page 24)
header    PDB attribute (see Section 4.1.3, page 24)
hitlists    Reliscript command (see Section 7.3, page 73)
homology    sequence_similarity attribute key (see Section 6.3.2, page 64)
if    Python command (see if, page 80)
import    Python command (see Section 3.1.4, page 8)
index_no    Atom attribute (see Section 4.7.3, page 45)
index_no    Residue attribute (see Section 4.6.3, page 41)
len    Internal function of Chain (see Section 4.2.5, page 30)
len    Internal function of Set (see Section 5.2.7, page 58)
ligand    superimpose_chain parameter (see Section 6.7.1, page 71)
ligand    Set type (see Section 5.2.1, page 55)
ligand_similarity    Attribute created by similar_ligand_search (see Section 6.6.2, page 70)
ligands    BindingSite attribute (see Section 4.9.3, page 49)
ligands    PackBindingSite attribute (see Section 4.10.3, page 53)
ligands    PDB attribute (see Section 4.1.3, page 24)
locations    consensus_search attribute key (see Section 6.4.2, page 65)
manipulate    Customised operation class function (see Section 8.2, page 74)
manipulate_object_type    Customised operation class function (see Section 8.2, page 74)
max    numeric_search parameter (see Section 6.2.1, page 62)
max_distance    Reliscript command (see Section 7.2, page 72)
maxidentity    sequence_search parameter (see Section 6.3.1, page 63)
maxtani    similar_ligand_search parameter (see Section 6.6.1, page 70)
min    numeric_search parameter (see Section 6.2.1, page 62)
minidentity    sequence_search parameter (see Section 6.3.1, page 63)
mintani    similar_ligand_search parameter (see Section 6.6.1, page 70)
mol_wt    Ligand attribute (see Section 4.4.3, page 34)
n_atom    Chain attribute (see Section 4.2.3, page 28)
n_atom    Ligand attribute (see Section 4.4.3, page 34)
n_atom    NucleicAcid attribute (see Section 4.3.3, page 31)
n_atom    Residue attribute (see Section 4.6.3, page 41)
n_atom    Solvent attribute (see Section 4.5.3, page 38)
n_atom_ideal    Residue attribute (see Section 4.6.3, page 41)
n_unit    Chain attribute (see Section 4.2.3, page 28)
n_unit    Ligand attribute (see Section 4.4.3, page 34)
n_unit    NucleicAcid attribute (see Section 4.3.3, page 31)
n_unit    Solvent attribute (see Section 4.5.3, page 38)
name    Atom attribute (see Section 4.7.3, page 45)
name    Residue attribute (see Section 4.6.3, page 41)
nucleic_acid    Set type (see Section 5.2.1, page 55)
nucleic_acids    BindingSite attribute (see Section 4.9.3, page 49)
nucleic_acids    PackBindingSite attribute (see Section 4.10.3, page 53)
nucleic_acids    PDB attribute (see Section 4.1.3, page 24)
number    Atom attribute (see Section 4.7.3, page 45)
numeric_search    Reliscript operation object (see Section 6.2.1, page 62)
occupancy    Atom attribute (see Section 4.7.3, page 45)
one_letter_code    Residue attribute (see Section 4.6.3, page 41)
other_atom    Bond function (see Section 4.8.4, page 47)
pack_binding_site    BindingSite attribute (see Section 4.9.3, page 49)
pack_binding_site    Ligand attribute (see Section 4.4.3, page 34)
pack_binding_sites    PDB attribute (see Section 4.1.3, page 24)
pdb    Set type (see Section 5.2.1, page 55)
pdb    Atom attribute (see Section 4.7.3, page 45)
pdb    BindingSite attribute (see Section 4.9.3, page 49)
pdb    Chain attribute (see Section 4.2.3, page 28)
pdb    Ligand attribute (see Section 4.4.3, page 34)
pdb    NucleicAcid attribute (see Section 4.3.3, page 31)
pdb    PackBindingSite attribute (see Section 4.10.3, page 53)
pdb    Residue attribute (see Section 4.6.3, page 41)
pdb    Solvent attribute (see Section 4.5.3, page 38)
pdb_atoms    BindingSite function (see Section 4.9.4, page 50)
pdb_atoms    Chain function (see Section 4.2.4, page 29)
pdb_atoms    Ligand function (see Section 4.4.4, page 36)
pdb_atoms    NucleicAcid function (see Section 4.3.4, page 32)
pdb_atoms    PackBindingSite function (see Section 4.10.4, page 53)
pdb_atoms    PDB function (see Section 4.1.4, page 26)
pdb_atoms    Residue function (see Section 4.6.4, page 42)
pdb_atoms    Solvent function (see Section 4.5.4, page 39)
pdb_line    Atom function (see Section 4.7.4, page 46)
peptide    Ligand attribute (see Section 4.4.3, page 34)
ph    PDB attribute (see Section 4.1.3, page 24)
pure_peptide    Ligand attribute (see Section 4.4.3, page 34)
r_value    PDB attribute (see Section 4.1.3, page 24)
re    Python module (see Section 3.1.4, page 8)
residue    Atom attribute (see Section 4.7.3, page 45)
residues    Chain attribute (see Section 4.2.3, page 28)
residues    Ligand attribute (see Section 4.4.3, page 34)
residues    NucleicAcid attribute (see Section 4.3.3, page 31)
residues    Solvent attribute (see Section 4.5.3, page 38)
resolution    PDB attribute (see Section 4.1.3, page 24)
reverse    Set function (see Section 5.2.8, page 58)
save    Set function (see Section 5.2.4, page 56)
save_to_hitlist    Set function (see Section 5.2.4, page 56)
save_mol2    BindingSite function (see Section 4.9.4, page 50)
save_mol2    Ligand function (see Section 4.4.4, page 36)
save_pdb    BindingSite function (see Section 4.9.4, page 50)
save_pdb    Chain function (see Section 4.2.4, page 29)
save_pdb    Ligand function (see Section 4.4.4, page 36)
save_pdb    NucleicAcid function (see Section 4.3.4, page 32)
save_pdb    PackBindingSite function (see Section 4.10.4, page 53)
save_pdb    PDB function (see Section 4.1.4, page 26)
save_pdb    Residue function (see Section 4.6.4, page 42)
save_pdb    Solvent function (see Section 4.5.4, page 39)
score    sequence_similarity attribute key (see Section 6.3.2, page 64)
sequence    Chain attribute (see Section 4.2.3, page 28)
sequence_3d    Chain attribute (see Section 4.2.3, page 28)
sequence_3d    NucleicAcid attribute (see Section 4.3.3, page 31)
sequence_no    Residue attribute (see Section 4.6.3, page 41)
sequence_search    Reliscript operation object (see Section 6.3.1, page 63)
sequence_similarity    Attribute created by sequence_search (see Section 6.3.2, page 64)
set    Reliscript command (see Section 5.2.2, page 55)
similar_ligand_search    Reliscript operation object (see Section 6.6.1, page 70)
smarts_hit_data    Attribute created by smarts_search (see Section 6.5.1, page 68)
smarts_search    Reliscript operation object (see Section 6.5.1, page 68)
smiles_hit_data    Attribute created by smiles_search (see Section 6.5.1, page 68)
smiles_search    Reliscript operation object (see Section 6.5.1, page 68)
solvent    Set type (see Section 5.2.1, page 55)
solvent    BindingSite attribute (see Section 4.9.3, page 49)
solvent    PackBindingSite attribute (see Section 4.10.3, page 53)
solvent    PDB attribute (see Section 4.1.3, page 24)
sort    Set function (see Section 5.2.8, page 58)
source    PDB attribute (see Section 4.1.3, page 24)
space_group    PDB attribute (see Section 4.1.3, page 24)
store_match    smiles_search or smarts_search parameter (see Section 6.5.2, page 69)
sugar    Ligand attribute (see Section 4.4.3, page 34)
superimpose_chain    Reliscript operation object (see Section 6.7.1, page 71)
sybyl_type    Atom attribute (see Section 4.7.3, page 45)
symbol    Atom attribute (see Section 4.7.3, page 45)
temp    PDB attribute (see Section 4.1.3, page 24)
text_search    Reliscript operation object (see Section 6.1.1, page 61)
title    PDB attribute (see Section 4.1.3, page 24)
transform    BindingSite function (see Section 4.9.4, page 50)
transform    Chain function (see Section 4.2.4, page 29)
transform    Ligand function (see Section 4.4.4, page 36)
transform    NucleicAcid function (see Section 4.3.4, page 32)
transform    PackBindingSite function (see Section 4.10.4, page 53)
transform    PDB function (see Section 4.1.4, page 26)
transform    Residue function (see Section 4.6.4, page 42)
transform    Solvent function (see Section 4.5.4, page 39)
type    Chain attribute (see Section 4.2.3, page 28)
type    Ligand attribute (see Section 4.4.3, page 34)
type    NucleicAcid attribute (see Section 4.3.3, page 31)
type    Residue attribute (see Section 4.6.3, page 41)
type    Solvent attribute (see Section 4.5.3, page 38)
type    text_search parameter (see Section 6.1.1, page 61)
use_filter    Customised operation class function (see Section 8.2, page 74)
use_manipulate    Customised operation class function (see Section 8.2, page 74)
use_workspace    Reliscript command (see Section 7.4, page 73)
value    ligand_similarity attribute key (see Section 6.6.2, page 70)
while    Python command (see while, page 84)
x    Atom attribute (see Section 4.7.3, page 45)
y    Atom attribute (see Section 4.7.3, page 45)
year    PDB attribute (see Section 4.1.3, page 24)
z    Atom attribute (see Section 4.7.3, page 45)
z_value    PDB attribute (see Section 4.1.3, page 24)

13 Appendix C: Reliscript Tutorials

13.1 Tutorial 1: Finding and Classifying Contacts to Ligand Carboxylate Groups

13.1.1 Objectives

• To illustrate basic use of the Python interpreter.
• To show how a script can be written to identify protein-bound ligands containing carboxylate groups, and then extended to identify carboxylate groups that form unusual patterns of nonbonded contacts.

13.1.2 The Example Problem

CCDC distributes and develops the protein-ligand docking program GOLD. Like most good docking programs, GOLD usually makes reliable predictions but sometimes produces a questionable result. In testing the program, we noticed a case where it had docked a carboxylate-containing ligand in such a way that the oxygen atoms were in a hydrophobic environment and formed close contacts to a backbone carbonyl oxygen (Figs. 1-3).

GOLD is deliberately parameterised to allow interatomic contacts that are slightly too short (this compensates for the fact that the protein is not allowed to flex). We were therefore not concerned to see contact distances in the region of 2.6 Å. However, the nature of these contacts - viz. to hydrophobic carbons and the electronegative carbonyl oxygen - was a concern. We would obviously expect a carboxylate group to form contacts to H-bond donors and/or be solvent exposed.
GOLD did, in fact, produce an alternative solution in which the carboxylate group was solvent exposed. We wondered whether the solution shown in the figures is sufficiently unlikely that it should be rejected automatically. Is there a precedent in the PDB for such a carboxylate-group environment? Gohlke et al. (J. Mol. Biol., 295, 337-356, 2000) mention that PDB entry 1ICN has a buried ligand carboxylate, but comment that the ligand is disordered and the electron density is somewhat ambiguous. We are unaware of any systematic survey of the environments of ligand carboxylates.

In this tutorial, we analyse the contacts made by ligand carboxylate oxygen atoms in order to assess whether it is so unlikely for a carboxylate group to bind in a largely non-polar environment that docking solutions containing such a feature should be filtered out.

Fig. 1. Docked ligand (carbon atoms in green), showing carboxylate group (top right of ligand) forming apparently unfavourable contacts.

Fig. 2. As above, in space filling style.

Fig. 3. Close-up of docked carboxylate showing close contacts.

13.1.3 Is Relibase+ or Reliscript the Most Suitable Tool?

By exploiting the 3D search capabilities of Relibase+, we can easily find carboxylates forming a particular pattern of contacts to hydrophobic atoms. For example, we could find all ligand carboxylates forming two or more contacts to protein carbon atoms of less than, say, 3.2 Å. The disadvantage is that we have to specify in advance what pattern of contacts we are looking for. It would be better if we could analyse all carboxylate-group environments in order to determine the percentage of environments that are non-polar and/or involve close contacts to H-bond acceptor atoms. Reliscript is well suited to this task and offers us great flexibility in how we analyse the results.
13.1.4 Assumed Starting Point

It is assumed that:
• Relibase+, Python and Reliscript are installed
• The Reliscript environment has been set up (see Section 3.1.1, page 6)

13.1.5 Creating a Hitlist for Debugging Purposes

Because searches can take some time when performed on the whole database, the first step is to set up a hitlist representing a subset of the protein-ligand complexes in Relibase+.

1. Read tutorial1_hitlist.py.
• Read through the tutorial script, tutorial1_hitlist.py
• This script selects ligands with a molecular weight in the range of 300 to 500 and saves them to a hitlist named tutorial1

2. Run the script tutorial1_hitlist.py.
• Type in the following on the command line:

    % python tutorial1_hitlist.py

• Note that if you try to run this script twice it will produce an error, as the hitlist already exists

13.1.6 Part a: SMILES Searching and Basic Use of the Python Interpreter

1. Open the Python interpreter.
• Type python in the terminal. This should result in something like:

    bash-3.1.17$ python
    Python 2.5.2 (r252:60911, Jul 23 2008, 17:11:49)
    [GCC 3.2.3 20030502 (Red Hat Linux 3.2.3-59)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>>

2. Import Reliscript.
• Type the following command at the Python >>> prompt to import the reliscript module and alias it to rs:

    >>> import reliscript as rs

• This should produce output looking something like:

    Starting the JVM -Xms128m -Xmx512m -Xmn64m
    Imported psyco for python speed optimization
    >>>

3. Perform a SMILES search for carboxylate groups.
• Create a set containing all ligands in the tutorial1 hitlist; the relevant command is:

    >>> ligset = rs.set('ligand', 'tutorial1')

• Create a smarts_search object that will find carboxylate groups:

    >>> co2_search = rs.smarts_search('C(=O)[OH]')

• The use of [OH] in the SMILES string ensures that the search will find carboxylates but not esters. Hydrogen atoms, of course, are not usually present in PDB structures. In Relibase+ and Reliscript, an H-count in a SMILES string is used to specify valencies that must remain unfilled. In the present case, C(=O)[OH] will find C(=O)O- but will not find an ester group such as C(=O)OCH3.
• Apply the search object to the ligand set by typing ligset(co2_search) at the Python prompt. This will cause the SMILES search to be performed.

    >>> ligset(co2_search)
    reading data.....ok.
    >>>

• ligset now contains only those ligands that have carboxylate groups. Type len(ligset) to find out how many of these ligands there are:

    >>> len(ligset)
    1432
    >>>

• Print the name of the first ligand containing a carboxylate (this will be ligset[0] because Python indexing begins at zero, not one):

    >>> print ligset[0]
    Ligand<pdb:1sln:INH_256>
    >>>

4. Exit Python and then re-run the SMILES search using the prepared script tutorial1a.py.
• Just to illustrate another Python feature, exit the current Python session by typing Ctrl-D. This will return you to the Unix prompt.
• Now look at the contents of the first tutorial script tutorial1a.py. This file contains the Python commands that you have just run interactively.
• Python can be opened, and this file of commands run automatically, by using the -i command-line option, i.e. by typing python -i tutorial1a.py at the terminal:

    % python -i tutorial1a.py
    Starting the JVM -Xms128m -Xmx512m -Xmn64m
    Imported psyco for python speed optimization
    Loading catalog .||.|.|.|.. complete
    Elapsed time for 'com_sub2d_impl' = 1.56 seconds.
    IPC_OUT::/tmp/reli14334
    NUMBER_OF_HITS::26434
    :IPC_END:
    >>>

• You are left in the Python interpreter, so you can continue as if you had just typed the commands in manually. For example, if you type len(ligset) you will get:

    >>> len(ligset)
    1432
    >>>

• Read through tutorial1a.py. Note that for the sake of speed we are only using the ligand entries in the tutorial1 hitlist.

5. Use Python in background mode.
• Using Python interactively is excellent when you are writing and debugging scripts. Once a debugged script is produced, it is usually easier to run jobs in the background and re-direct output to a file. This is what we will do in the remainder of the tutorial. • For example, we can run tutorial1a.py as a background job, redirecting the output to a new file called example.out, by typing the following at the Unix prompt: python tutorial1a.py >example.out & 106 Reliscript User Guide 13.1.7 Part b: Find and Print Out Binding-Site Atoms in Contact with Carboxylate Oxygens 1. Read tutorial1b.py. • Read through the next version of the tutorial script, tutorial1b.py, with the help of the notes that follow. 2. Understand how the script accesses carboxylate-group atoms. • The first few lines of tutorial1b.py simply repeat what we did in Part a, i.e. set up and run a search for ligands containing carboxylate groups. The resulting ligand set, ligset, contains the ligands found by the SMARTS search. Each ligand in ligset must contain at least one carboxylate group but may contain more than one. • We need to know which atoms in the ligand correspond to the carboxylate-group atoms. When the SMARTS search ran, it created a new attribute called smarts_hit_data for every ligand that satisfied the search. Suppose that lig is a ligand object in ligset. lig.smarts_hit_data[0][0] contains the carbon atom of the first carboxylate group found in lig. The oxygen atoms will be in lig.smarts_hit_data[0][1] and lig.smarts_hit_data[0][2]. The order of atoms in smarts_hit_data is the same as the order of the atoms in the SMARTS string we specified (viz. ‘C(=O)[OH]’). If lig contains >1 carboxylate group, the atoms of the second carboxylate will be in lig.smarts_hit_data[1][j], where j = 0, 1 and 2; and so on. • The following code in tutorial1b.py therefore loops through all the carboxylate atoms and prints out their names and index numbers: 3. 
Understand how the script finds contacts to atoms in the binding site. • In the above section of code, we retrieved the binding site of each ligand as bs. The final few lines of code in tutorial1b.py loop through all the atoms in bs. Each is tested to see whether it is within 3.2Å of the carboxylate oxygen and is not a hydrogen. If so, its details are printed out: 4. Run tutorial1b.py. • At the Unix prompt, run tutorial1b.py as a background job, redirecting the output to part1b.out. This should look something like: % python tutorial1b.py >part1b.out & [1] 251339 • The job should take a few minutes to run. When it is done, type more part1b.out to see the first few lines that the script has produced. It should look something like: Reliscript User Guide 107 Starting the JVM -Xms128m -Xmx512m -Xmn64m Importing psyco for python speed optimization Fast interchange file = /local/shields/relibase/python/reliscript/ reliscript_fast_lookup_54998.py .... complete Catalog successfully loaded from fast lookup (54998 entries) IPC_OUT::/tmp/reli37035 ligand: Ligand<pdb:1sln:INH_256> contacts to atom: O5 index no. = 4 CD2 HIS NE2 HIS NE2 HIS NE2 HIS ZN ZN contacts to atom: O4 index no. = 3 CE1 HIS ZN ZN O HOH ligand: Ligand<pdb:2vmy:FFO_505-A> contacts to atom: O1 index no. = 30 None contacts to atom: O2 index no. = 31 NZ LYS ligand: Ligand<pdb:2vmy:FFO_505-A> contacts to atom: OE2 index no. = 28 None contacts to atom: OE1 index no. = 27 OG SER ligand: Ligand<pdb:2vmy:FFO_505-B> contacts to atom: O1 index no. = 30 None contacts to atom: O2 index no. = 31 OH TYR 108 Reliscript User Guide • The output lists all the binding-site atoms in contact with the oxygen atoms of every ligand carboxylate group in the test database. In principle, this is what we need to answer the scientific problem at hand. Browsing through the output shows that most of the contact atoms are H-bond donors such as water O, lysine NZ, arginine NE, NH1 and NH2, etc., as we would expect. 
However, it is clear that manual analysis of the output would be very tedious, so we need to enhance the script.

13.1.8 Part c: Classify Atoms that Form Contacts to Carboxylate Oxygens

1. Read tutorial1c.py.
• Read through the next version of the tutorial script, tutorial1c.py, with the help of the notes that follow.

2. Understand the strategy of tutorial1c.py.
• The idea behind the enhanced script is to determine the nature of every binding-site atom in contact with a carboxylate oxygen. Each contact atom is classified as a hydrogen-bond donor, a metal ion, a hydrophobic atom, or a hydrogen-bond acceptor that is not also a hydrogen-bond donor. If an atom cannot be assigned to one of these categories, it is classified as unknown. It is then possible to count and print out the number of good and bad contacts each pair of carboxylate oxygen atoms makes. Contacts to H-bond donors and metal ions are likely to be energetically favourable, so are classified as good. Contacts to hydrophobes or H-bond acceptors which are not also H-bond donors are bad.

3. Understand how the script classifies contact atoms.
• The script begins with four functions for classifying contact atoms. All four functions take the contact-atom object as their input argument. They return true or false, depending on whether or not the atom is of a particular type.
• The first function in the code, is_hydrophobe, determines whether the atom is hydrophobic by testing if its element symbol is C or S.
• The second function, is_metal, works in exactly the same way, i.e. it tests whether the element symbol of the input atom corresponds to a metal.
• The third function, is_donor, tests the atom name and the name of the parent residue to see if the atom is a recognised protein H-bond donor, e.g. NZ of lysine. Water donors are also detected but donors in other non-peptidic entities (e.g. cofactors) will not be recognised.
• The final function, is_acceptor, again uses atom and residue name information to determine whether the atom is an H-bond acceptor that is not also an H-bond donor.

4. Understand how the script counts good and bad contacts.
• In the main program, the classification functions are called for each atom found to be in contact with a carboxylate oxygen. If a contact atom is found to be an H-bond donor or metal ion, the count n_good is incremented. If it is found to be a hydrophobe or an H-bond acceptor, n_bad is incremented. The counts of good and bad contacts (and unrecognised contacts) are printed out for each carboxylate group, see the code extract from tutorial1c.py.

5. Run tutorial1c.py.
• At the Unix prompt, run tutorial1c.py as a background job, redirecting the output to part1c.out. This should look something like:

    % python tutorial1c.py >part1c.out &
    [1] 251305

• The job should take a few minutes to run. When it is done, type more part1c.out to see the first few lines that the script has produced. It should look something like:

    good = 2 bad = 2 unknown = 4
    good = 1 bad = 0 unknown = 0
    good = 3 bad = 0 unknown = 0
    good = 1 bad = 1 unknown = 0
    good = 1 bad = 0 unknown = 0
    good = 1 bad = 0 unknown = 1
    good = 1 bad = 0 unknown = 0

• The output is an improvement on the previous version of the script since it shows at a glance how many good and bad contacts are formed by each carboxylate group. However, it would still be tedious to analyse it manually, so further code is developed in the next part of the tutorial.

13.1.9 Part d: Find Frequencies of Occurrence of Carboxylate-Group Environments

1. Read tutorial1d.py.
• Read through the next version of the tutorial script, tutorial1d.py, with the help of the notes that follow.

2. Understand the strategy of tutorial1d.py.
• This version of the script is the same as the previous version except that it keeps track of how many different combinations of n_good, n_bad and n_unknown occur.
For example, suppose that there were only five carboxylate groups in the set and they had the following contact counts:

    good = 3 bad = 0 unknown = 0
    good = 5 bad = 1 unknown = 0
    good = 2 bad = 2 unknown = 0
    good = 2 bad = 2 unknown = 0
    good = 5 bad = 1 unknown = 0

• In this simple example, there are two occurrences of the combination n_good = 5, n_bad = 1, n_unknown = 0; two occurrences of n_good = 2, n_bad = 2, n_unknown = 0; and one of n_good = 3, n_bad = 0, n_unknown = 0.
• The various combinations are sorted so that those containing the most good contacts occur first and are then printed out.

3. Understand how the script uses a dictionary to store the unique combinations of contact-atom counts.
• The script uses a Python dictionary to do the book-keeping. This is initialised with the statement:

    # Initialise dictionary that will be used to store
    # the results
    count_combo = {}

• What is called a dictionary in Python is called an associative array in some other languages. It is a collection of items, each of which is associated with a key and can be accessed via that key.
• Every time the values of n_good, n_bad and n_unknown are evaluated for a carboxylate group in tutorial1d.py, they are used to generate a key. The line of code is:

    key = 10000*n_good + 100*n_bad + n_unknown

• The dictionary count_combo is checked to see whether it already contains that key. If so, this particular combination of n_good, n_bad, n_unknown has already been found in a previous carboxylate group and all we need do is increment its occurrence-count by one. If not, this is the first time this particular combination has been seen, so a new item is added to the dictionary, initialised with an occurrence-count of 1, see the code extract from tutorial1d.py.

4. Understand how the script sorts and prints out the results.
• All that remains to be done at the end is to convert the dictionary to a list, sort the list and then reverse the order. This will place all the unique combinations of contact-atom counts in descending order of their key values, which will effectively mean that they are sorted first on n_good, then on n_bad and then on n_unknown.

    results = count_combo.items()
    results.sort()
    results.reverse()

• Then the results are printed out, see the code extract from tutorial1d.py.

5. Run tutorial1d.py.
• At the Unix prompt, run tutorial1d.py as a background job, redirecting the output to part1d.out. This should look something like:

    % python tutorial1d.py >part1d.out &
    [1] 251435

• The job should take a few minutes to run. Once it is done, cat part1d.out to see what the script has produced. The first few lines of the file should look something like:

    good = 9 bad = 2 unknown = 0 occurrences = 1 percentage = 0.0443852640923
    good = 8 bad = 4 unknown = 0 occurrences = 1 percentage = 0.0443852640923
    good = 8 bad = 1 unknown = 0 occurrences = 1 percentage = 0.0443852640923
    good = 7 bad = 1 unknown = 0 occurrences = 2 percentage = 0.0887705281846
    good = 7 bad = 0 unknown = 0 occurrences = 7 percentage = 0.310696848646
    good = 6 bad = 3 unknown = 0 occurrences = 4 percentage = 0.177541056369

• This is the first version of the script that provides a direct answer to the scientific problem under investigation. In particular, we see a few contact counts that look distinctly unfavourable (e.g. 1 good, 3 bad), although they occur with low frequency. An obvious need now is to get details of those carboxylate groups that are in unfavourable environments, so that we can inspect them manually in Relibase+. It would also be nice to get an overall percentage of carboxylate groups that are in unfavourable environments. The next and final version of the script addresses these requirements.
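The book-keeping described in Part d can be tried out without Relibase+ at all. The following is a minimal sketch, not the code from tutorial1d.py: the function names (combo_key, count_combinations, unpack_key, report) are hypothetical, the percentage is assumed to be 100 times the occurrences divided by the total number of groups, and the packed key only works while each count stays below 100. It uses sorted() rather than the in-place sort() and reverse() shown above, so that it also runs on Python versions where items() does not return a list.

```python
def combo_key(n_good, n_bad, n_unknown):
    """Pack the three counts into a single integer key (each count must be < 100)."""
    return 10000 * n_good + 100 * n_bad + n_unknown


def unpack_key(key):
    """Recover (n_good, n_bad, n_unknown) from a packed key."""
    return key // 10000, (key // 100) % 100, key % 100


def count_combinations(counts):
    """Count how often each (n_good, n_bad, n_unknown) combination occurs."""
    count_combo = {}
    for n_good, n_bad, n_unknown in counts:
        key = combo_key(n_good, n_bad, n_unknown)
        if key in count_combo:
            # Combination already seen: increment its occurrence-count
            count_combo[key] += 1
        else:
            # First time this combination has been seen
            count_combo[key] = 1
    return count_combo


def report(count_combo):
    """Return one output line per combination, largest key first."""
    total = sum(count_combo.values())
    lines = []
    for key, n_occ in sorted(count_combo.items(), reverse=True):
        n_good, n_bad, n_unknown = unpack_key(key)
        # Assumption: percentage is relative to the total number of groups
        percentage = 100.0 * n_occ / total
        lines.append('good = %d bad = %d unknown = %d occurrences = %d percentage = %s'
                     % (n_good, n_bad, n_unknown, n_occ, percentage))
    return lines


# The five-group example from step 2 of Part d
example = [(3, 0, 0), (5, 1, 0), (2, 2, 0), (2, 2, 0), (5, 1, 0)]
for line in report(count_combinations(example)):
    print(line)
```

Because n_good carries the largest multiplier, sorting on the packed key is equivalent to sorting on n_good first, then n_bad, then n_unknown. A tuple (n_good, n_bad, n_unknown) used directly as the dictionary key would give the same ordering without the limit on the size of each count.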
13.1.10 Part e: Find and Print Details of Carboxylate Groups in Unusual Environments

1. Read tutorial1e.py.
• Read through the final version of the tutorial script, tutorial1e.py, with the help of the notes that follow.

2. Understand the strategy of tutorial1e.py.
• This final version of the script does everything that the previous version did, but in addition keeps track of how many carboxylate groups occur in unfavourable environments.
• We have to define what unfavourable means. One complication is that any carboxylate group forming fewer than 4 contacts in total is probably at least partly exposed to bulk solvent. If the total number of contacts is n, we crudely allow for this by assuming that 4-n contacts (= n_assumed) are to bulk water (and therefore inherently favourable). We then define as unfavourable any carboxylate group environment for which the following two conditions are true: n_good < n_bad (i.e. the group is observed to form more contacts to hydrophobes and H-bond acceptors than to H-bond donors and metal ions); and n_good + n_assumed < 4 (i.e. there are fewer than four favourable contacts, either explicitly observed or assumed contacts to bulk water).
• In addition, the script prints details of the fifty carboxylate groups occurring in the worst environments. This is done by calculating and sorting on the quantity n_bad - n_good - n_assumed (roughly, number of unfavourable contacts minus number of favourable contacts).

3. Understand how the script identifies and counts carboxylate groups in unexpectedly unfavourable environments.
• The count of groups in unfavourable environments is initialised:

    n_unexpected = 0

• When each carboxylate group in an unfavourable environment is identified, the count is incremented, see the code extract from tutorial1e.py.

4. Understand how the script identifies the fifty carboxylate groups in the worst environments.
• A list is initialised which will end up storing the details of the worst fifty carboxylates:

    # Initialise list that will contain details of the
    # carboxylates in the most unfavourable environments
    worst = []

• As each group is processed, a parameter called how_bad is calculated. This is a crude index of how unfavourable the group’s environment is. If the index is one of the largest fifty values so far encountered, details of the group are put into worst, see the code extract from tutorial1e.py.
• Finally, details of the groups are printed out at the end, see the code extract from tutorial1e.py.

5. Run tutorial1e.py.
• At the Unix prompt, run tutorial1e.py as a background job, redirecting the output to part1e.out. The job should take a few minutes to run. Once it is done, cat part1e.out to see what the script has produced. The file should contain a line something like:

    Number of groups in unexpected environments = 44 percentage = 6.11961057024

• It should also contain details of the groups in the most unfavourable environments, e.g.

    ligand = Ligand<pdb:3eo7:ACT_708-A>
    carboxylate oxygens = OXT 2 and O 1
    good = 1 bad = 3 unknown = 0

    ligand = Ligand<pdb:1tvw:CB3_318>
    carboxylate oxygens = OE1 27 and OE2 28
    good = 0 bad = 3 unknown = 0

    ligand = Ligand<pdb:3dl6:DHF_613-C>
    carboxylate oxygens = O1 30 and O2 31
    good = 1 bad = 3 unknown = 2

    ligand = Ligand<pdb:2nph:AETF_1-S>
    carboxylate oxygens = O 24 and OXT 32
    good = 2 bad = 4 unknown = 0

• Only about 8% of carboxylate groups occur in unfavourable environments. We can look at some of these groups in Relibase+. The last in the list above, 1GHB, shows a remarkable interaction in which a ligand carboxylate appears to point directly at the face of a tyrosine ring:

Fig. 4. Unusual carboxylate-group environment in 1GHB.

Fig. 5. As above, interaction shown in space-filling style.
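The definition of an unfavourable environment and the how_bad index from Part e can be written down compactly. The sketch below is an illustration under stated assumptions rather than the code in tutorial1e.py: the function names assess and keep_worst are hypothetical, the total number of contacts n is assumed to include the unknown contacts, and a bounded heap from the standard heapq module is used as one possible way of keeping only the fifty largest how_bad values seen so far.

```python
import heapq


def assess(n_good, n_bad, n_unknown):
    """Return (n_assumed, is_unfavourable, how_bad) for one carboxylate group."""
    # Assumption: the total contact count n includes the unknown contacts
    n_total = n_good + n_bad + n_unknown
    # Groups with fewer than 4 contacts are assumed to make up the
    # difference with (favourable) contacts to bulk water
    n_assumed = max(0, 4 - n_total)
    is_unfavourable = (n_good < n_bad) and (n_good + n_assumed < 4)
    how_bad = n_bad - n_good - n_assumed
    return n_assumed, is_unfavourable, how_bad


def keep_worst(worst, how_bad, label, limit=50):
    """Keep at most `limit` (how_bad, label) entries with the largest how_bad."""
    entry = (how_bad, label)
    if len(worst) < limit:
        heapq.heappush(worst, entry)
    elif entry > worst[0]:
        # worst[0] is the smallest entry currently kept; replace it
        heapq.heapreplace(worst, entry)


# Contact counts taken from the sample output above, plus one favourable case
groups = [
    ('3eo7:ACT_708-A', 1, 3, 0),
    ('1tvw:CB3_318', 0, 3, 0),
    ('3dl6:DHF_613-C', 1, 3, 2),
    ('2nph:AETF_1-S', 2, 4, 0),
    ('favourable example', 3, 0, 0),
]

n_unexpected = 0
worst = []
for label, n_good, n_bad, n_unknown in groups:
    n_assumed, is_unfavourable, how_bad = assess(n_good, n_bad, n_unknown)
    if is_unfavourable:
        n_unexpected += 1
    keep_worst(worst, how_bad, label, limit=50)

print('Number of groups in unexpected environments = %d' % n_unexpected)
for how_bad, label in sorted(worst, reverse=True):
    print('%s how_bad = %d' % (label, how_bad))
```

With a heap, each new group costs a comparison against the least-bad entry kept so far, so the full list of groups never needs to be stored or sorted. For a small test set, simply collecting every (how_bad, label) pair and sorting at the end would work equally well.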
6. Generate results from the full database.
• So far we have used the small test database. This is ideal when developing and testing scripts since it is large enough to be a meaningful test set but small enough that jobs typically run in a few minutes. Now, however, we may wish to generate final results from the full database, reli. This will take much longer: typically, several hours.
• The tutorial script can be run on the full database by removing the hitlist from the script, in other words change the line:

    ligset = rs.set('ligand', 'tutorial1')

to:

    ligset = rs.set('ligand')

13.1.11 Scientific Conclusions

The questionable GOLD docking that prompted this study clearly falls well within the definition used here of an unfavourable carboxylate-group environment. Only a small percentage of carboxylate groups in the PDB are observed to occur in such environments (and some of these are probably due to experimental errors in measuring or fitting electron density). We therefore conclude that the GOLD solution is sufficiently unlikely that it, and others like it, could reasonably be rejected automatically as false predictions.

13.1.12 Ways of Improving the Tutorial Script

The tutorial script could be improved in many ways. For example, we could:
• Restrict the study to structures with a resolution better than 2.5Å.
• Eliminate disordered ligands by testing atom site occupation factors.
• Impose a minimum size (i.e. number of atoms) on the ligands, in case very small ligands are atypical.
• Eliminate common cofactors in case they bias the results.
• Replace the fixed distance criterion used to define a short contact (3.2Å in the script) by a criterion that varies according to the van der Waals radii of the contact atoms.
• Check for contacts to atoms from neighbouring chains in the crystal packing, using PackBindingSite objects.
• Extend the functions is_donor and is_acceptor so that they reliably identify H-bond donor and acceptor atoms in cofactors, etc.
• Create a Relibase+ hitlist of all the ligands containing carboxylate groups in unfavourable environments, so that they may be inspected more easily in Relibase+.
• Perform tests on the directions of the short contacts to carboxylate oxygens, e.g. to identify contacts to H-bond donor atoms that are not, in fact, hydrogen bonds because of poor directionality.

All of these enhancements could be made using existing Reliscript functionality.

13.2 Tutorial 2: Creating a Binding Site Quality Checker

13.2.1 Objectives

To create a script that can be run from the command line, takes a PDB code as an argument and checks the binding site for:
• Clashes between protein side-chains
• Clashes between the protein and the ligand
• Missing atoms
• Influences of symmetry-related protein residues
• Atoms with high B-factors
• Atoms with low occupancy

13.2.2 Steps Required

• Create a module, my_rs_tools, containing functions for testing the quality of a binding site
• Create a main script using the functions defined in the module my_rs_tools
• Use Python’s built-in optparse module for reading arguments and options from the command line
• Use Python’s raw_input function to allow the user to interact with the script

13.2.3 The Example

When performing docking experiments it is important that the quality of the binding site is optimal. If, for example, there are atoms missing in the binding site, docking programs will not know about them and the results obtained will be flawed. Because this is a well-recognised problem in docking, several studies have been aimed at creating high-quality data sets for validating docking programs and scoring functions (Nissink et al., Proteins, 49, 457-471, 2002; Hartshorn et al., J. Med. Chem., 50, 726-741, 2007; Verdonk et al., J. Chem. Inf. Model., in press 2008).
These test sets have all been tested for involvement of symmetry-related protein side-chains in ligand binding, bad clashes between the protein side-chains and the ligand, unlikely ligand conformations and inconsistencies in the placement of the ligand in the electron density. Tests for bad clashes and involvement of symmetry-related protein atoms in ligand binding are easily implemented in Reliscript. Unlikely ligand conformations could easily be tested using the CSD software Mogul, but will not be treated further here. Testing for inconsistencies in the placement of the ligand in the electron density is not currently possible with Reliscript. However, using Reliscript it is easy to implement some other simple tests which give indications of the quality of the binding site, such as highlighting atoms with unusually high B-factors and/or unusually low occupancies. We can also extend the notion of bad clashes from clashes between the protein and the ligand to clashes between different protein side-chains.

13.2.4 Creating a Python Module

This task is a lot less daunting than it sounds. As a simple illustration, copy the files my_module.py and my_script.py onto your computer, making sure that they are in the same directory. Upon inspection of the my_script.py file you will notice the line:

    import my_module

This means that any classes and functions defined in my_module.py can be used in my_script.py using the prefix my_module:

    my_module.test()

The code in my_rstools1.py outlines the functions required for this tutorial (b_factor, occupancy, missing_atoms, symmetry, clash) and illustrates several points about writing code in Python:
• Documentation of the module and the functions is done via docstrings. The module docstring is the text within the triple quotes at the top of the module and the function docstrings are the text within the triple quotes at the beginning of each function.
These help provide documentation for your program. Try running the command:

    pydoc my_rstools

in the directory where you downloaded the my_rstools.py module.
• At the end of the module there is a script testing the functionality of the module. If the module is run as a stand-alone program, the conditional statement:

    if __name__ == '__main__':

evaluates as true and the code below it is executed. Try running the command:

    python my_rstools1.py

in the directory where you downloaded the my_rstools1.py module. It should give the following output:

    None
    None
    None
    None
    None
    None
    None

• Note that although none of the functions do anything useful, this is still a functioning script. The next step will be to add the programming logic to these functions.

The code in my_rstools2.py contains the programming logic for the functions b_factor, occupancy, missing_atoms and symmetry. These represent the tests that are easily implemented using the functionality inherent in Reliscript. Note that strings can easily be formatted using a convention similar to that of C’s printf function. For example, the statement:

    s = '%s is equal to %.2f' % ('x', 1.23456)

will result in the variable s representing the string 'x is equal to 1.23'. Try running the command:

    python my_rstools2.py

in the directory where you downloaded the my_rstools2.py module. It should give the following output:

    False
    Atom(N)<pdb:1mup:CHN-A:'5':1> has large b_factor 60.00 (> 40.00)
    False
    Atom(N)<pdb:1mup:CHN-A:'5':1> has occupancy 1.00 (< 2.00)
    Residue<pdb:1mup:CHN-A:'5'> missing 4 atoms
    There are 9 symmetry packed atoms in binding site
    None

Finally, the code in my_rstools3.py implements the clash function. In order to check for bad clashes we use the data from Table II in Nissink et al., which reports the minimum distances for selected atom-atom contacts.
The data in the table is stored in a dictionary containing dictionaries (_MINIMUM_DISTANCES), so that minimum distances can be queried using the following syntax:

    cutoff = _MINIMUM_DISTANCES[atom1_atom_type][atom2_atom_type]

The problem is that the atom types reported in the paper are in an E(n) format, where E is the element and n is the number of atoms bonded to E, whereas Relibase+ uses the Sybyl atom types. A dictionary converting mol2 Sybyl atom types to E(n) notation (_mo2_to_En_notation) is therefore implemented. Finally, a new function called bad_clash is implemented to automate the conversion of Sybyl atom types to E(n) notation, to look up the minimum distance and to check whether or not the two atoms clash. Have a look at the final version of the module, my_rstools3.py, and try running it using the command:

    python my_rstools3.py

in the directory where you downloaded the my_rstools3.py module. It should give the following output:

    False
    Atom(N)<pdb:1mup:CHN-A:'5':1> has large b_factor 60.00 (> 40.00)
    False
    Atom(N)<pdb:1mup:CHN-A:'5':1> has occupancy 1.00 (< 2.00)
    Residue<pdb:1mup:CHN-A:'5'> missing 4 atoms
    There are 9 symmetry packed atoms in binding site
    False

13.2.5 Creating the Main Script

We can now create the main script binding_site_quality1.py. The script has the following basic outline:
1. The module my_rstools3 is imported
2. Some variables (PDB code, binding site index, various cutoffs) are set
3. The quality checks from my_rstools3 are called

When reading through the script, notice the use of the try/except idiom for catching invalid PDB codes and binding site indexes. Try running the script using the command:

    python binding_site_quality1.py

in the directory where you downloaded both binding_site_quality1.py and my_rstools3.py.
It should give you the following output:

    ---------------------------------------------------------------------------
    PDB code : 1mup
    Binding site: BindingSite<1mup:CD_201>
    Ligand : Ligand<pdb:1mup:CD_201>
    Resolution : 2.40
    R-value : 0.191
    Checking b-factors...
    Atom(C)<pdb:1mup:CHN--:'119':912> has large b_factor 41.60 (> 40.00)
    Atom(O)<pdb:1mup:CHN--:'119':913> has large b_factor 47.07 (> 40.00)
    Atom(N)<pdb:1mup:CHN--:'119':914> has large b_factor 60.00 (> 40.00)
    Atom(C)<pdb:1mup:CHN--:'141':1091> has large b_factor 48.16 (> 40.00)
    Atom(O)<pdb:1mup:CHN--:'144':1110> has large b_factor 58.05 (> 40.00)
    Atom(C)<pdb:1mup:CHN--:'144':1113> has large b_factor 59.23 (> 40.00)
    Atom(O)<pdb:1mup:CHN--:'144':1114> has large b_factor 60.00 (> 40.00)
    Atom(O)<pdb:1mup:CHN--:'144':1115> has large b_factor 60.00 (> 40.00)
    Atom(O)<pdb:1mup:CHN--:'145':1119> has large b_factor 40.14 (> 40.00)
    Atom(O)<pdb:1mup:CHN--:'146':1129> has large b_factor 50.15 (> 40.00)
    Checking occupancy...
    ok
    Checking for missing atoms...
    Residue<pdb:1mup:CHN--:'146'> missing 1 atoms
    Checking symmetry...
    There are 9 symmetry packed atoms in binding site
    Checking protein side-chain bad clashes...
    Atom(C)<pdb:1mup:CHN--:'108':822> bad contact with Atom(N)<pdb:1mup:CHN--:'119':914>: 3.07 < 3.40
    Atom(C)<pdb:1mup:CHN--:'108':823> bad contact with Atom(N)<pdb:1mup:CHN--:'119':914>: 3.23 < 3.40
    Atom(N)<pdb:1mup:CHN--:'108':824> bad contact with Atom(N)<pdb:1mup:CHN--:'119':914>: 2.32 < 3.20
    Atom(S)<pdb:1mup:CHN--:'121':929> bad contact with Atom(C)<pdb:1mup:CHN--:'145':1121>: 3.67 < 3.70
    Checking protein ligand bad clashes...
    ok
    ---------------------------------------------------------------------------

13.2.6 Adding optparse and raw_input Functionality

Obviously, one could edit the main script every time one wanted to determine the quality of a different PDB structure. However, it would be handy to be able to read the PDB code of interest from the command line.
Furthermore, it would be useful if the user were provided with a list of binding sites and could interactively select the binding site of interest. To read arguments and options from the command line, we will make use of the built-in module optparse. For the selection of binding sites, we will make use of the built-in function raw_input. Have a look at the code in binding_site_quality2.py. The optparse module automatically sorts out command-line help. To illustrate this, try running the command:

    python binding_site_quality2.py -h

Now try running the script using the command:

    python binding_site_quality2.py 1mup

You will be prompted to select a binding site:

    Make binding site selection...
    [0] BindingSite<1mup:CD_201>
    [1] BindingSite<1mup:CD_202>
    [2] BindingSite<1mup:CD_203>
    [3] BindingSite<1mup:CD_204>
    [4] BindingSite<1mup:TZL_167>

Type 4 and press Enter. You should get the following output:

    ---------------------------------------------------------------------------
    PDB code : 1mup
    Binding site: BindingSite<1mup:TZL_167>
    Ligand : Ligand<pdb:1mup:TZL_167>
    Resolution : 2.40
    R-value : 0.191
    Checking b-factors...
    Atom(O)<pdb:1mup:CHN--:'46':333> has large b_factor 56.67 (> 40.00)
    Atom(C)<pdb:1mup:CHN--:'60':447> has large b_factor 46.58 (> 40.00)
    Checking occupancy...
    ok
    Checking for missing atoms...
    ok
    Checking symmetry...
    ok
    Checking protein side-chain bad clashes...
    Atom(C)<pdb:1mup:CHN--:'60':447> bad contact with Atom(C)<pdb:1mup:CHN--:'73':545>: 3.26 < 3.40
    Atom(C)<pdb:1mup:CHN--:'60':447> bad contact with Atom(S)<pdb:1mup:CHN--:'73':546>: 3.42 < 3.70
    Atom(C)<pdb:1mup:CHN--:'73':547> bad contact with Atom(C)<pdb:1mup:CHN--:'88':653>: 3.34 < 3.40
    Atom(C)<pdb:1mup:CHN--:'73':547> bad contact with Atom(C)<pdb:1mup:CHN--:'88':654>: 2.85 < 3.40
    Atom(C)<pdb:1mup:CHN--:'73':547> bad contact with Atom(C)<pdb:1mup:CHN--:'88':656>: 3.16 < 3.40
    Atom(C)<pdb:1mup:CHN--:'88':652> bad contact with Atom(N)<pdb:1mup:CHN--:'92':690>: 3.17 < 3.40
    Checking protein ligand bad clashes...
    ok
    ---------------------------------------------------------------------------

13.2.7 Altering the Binding Site Selection Process

Reliscript offers a simple and flexible way of investigating protein-ligand interactions. The current script, binding_site_quality2.py, requires the user to provide the PDB code of interest on the command line and to select the binding site of interest interactively. However, because of the modular design of the code in this tutorial, one could now easily create a separate script for designing a large validation set such as those in the studies referenced in the introduction (Nissink et al., Proteins, 49, 457-471, 2002; Hartshorn et al., J. Med. Chem., 50, 726-741, 2007; Verdonk et al., J. Chem. Inf. Model., in press 2008). The first step in such a process would probably be to exclude protein-ligand interactions where the ligand is a metal, an ion or a cofactor, which could easily be achieved by creating a ligand set and filtering it.

    >>> import reliscript as rs
    >>> ligand_set = rs.set('ligand')
    >>> filter_by_mr = rs.numeric_search('mol_wt', min=300, component='ligands')
    >>> filter_by_mr(ligand_set)

Once happy with the ligand set, one could investigate the quality of the binding sites associated with each of the ligands in the set.
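The command-line handling and interactive selection introduced in 13.2.6 are not listed in this guide, so the following is only a sketch of how such a front end might look, extended to accept several PDB codes as a batch-oriented script would need. It uses the standard optparse module; the function names parse_args and choose_index, the option name --max-b-factor and its -b short form are hypothetical, and the default of 40.0 is simply the cutoff that appears in the sample output above. No Reliscript calls are made. The prompt function is passed in as an argument so that raw_input can be supplied under Python 2 and input under Python 3.

```python
from optparse import OptionParser


def parse_args(argv):
    """Parse options and one or more PDB codes from a list of arguments."""
    parser = OptionParser(usage='usage: %prog [options] PDB_CODE [PDB_CODE ...]')
    # Hypothetical option; 40.0 mirrors the cutoff seen in the sample output
    parser.add_option('-b', '--max-b-factor', dest='max_b_factor',
                      type='float', default=40.0,
                      help='report atoms with a B-factor above this value')
    options, pdb_codes = parser.parse_args(argv)
    if not pdb_codes:
        # Prints the usage message and exits with an error status
        parser.error('at least one PDB code is required')
    return options, pdb_codes


def choose_index(labels, ask):
    """List the labels and keep asking until a valid index is entered."""
    for index, label in enumerate(labels):
        print('[%d] %s' % (index, label))
    while True:
        reply = ask('Make binding site selection... ')
        try:
            choice = int(reply)
        except ValueError:
            continue  # not a number, ask again
        if 0 <= choice < len(labels):
            return choice


if __name__ == '__main__':
    import sys
    options, pdb_codes = parse_args(sys.argv[1:])
    for code in pdb_codes:
        print('%s: B-factor cutoff %.2f' % (code, options.max_b_factor))
```

In a real script the loop at the bottom would open each PDB entry and call the quality checks from the tutorial module; choose_index would be called with the binding-site labels of an entry, for example choose_index(labels, raw_input).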