Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Protein Data Representation and Query Using Optimized Data Decomposition Ilya N. Shindyalov San Diego Supercomputer Center P.O. Box 85608, San Diego, CA 92186-9784 voice: (619) 534-8322, fax: (619) 534-8362 [email protected] Philip E. Bourne* San Diego Supercomputer Center P.O. Box 85608, San Diego, CA 92186-9784 and Department of Pharmacology University of California, San Diego San Diego, CA 92093-0365 voice: (619) 534-8301, fax: (619) 534-5113 [email protected] Running Head: Protein Data Representation and Query. Keywords: protein structure, property pattern, data representation, optimized data decomposition, protein query tools. *To whom correspondence should be addressed. 08/08/98 7:42 AM 2 Abstract Motivation: To provide data management tools to efficiently maintain and query experimental and derived protein data with the goal of providing new insights into structure-function relationships. The tools should be portable, extensible, and accessible locally, or via the World Wide Web, providing data that would not otherwise be available. Results: The initial phase of the work, the data representation and query of all available macromolecular structure data, including real-time access to complex property patterns based on the amino acid sequence is reported. Protein structure data taken from the Protein Data Bank (PDB) are decomposed into native and derived elementary properties and represented as compact indexed objects minimizing storage requirements and query time for select types of query. In addition, collections of indices representing a particular property are maintained and can be queried for specific property patterns found across the whole database. The approach is proving applicable to a wide variety of data available on specific protein families. Availability: Three resources are available using this approach. (I) The query of basic structural components and property patterns of the complete PDB is available via the World Wide Web at the URL http://www.sdsc.edu/moose. (II) WPDB, a PC-based compressed macromolecular structure database and loader with a Microsoft Windows interface, is available from ftp://rosebud.sdsc.edu/pub/sdsc/biology/WPDB/. (III) A database supporting real-time 3-dimensional substructure searching will be reported elsewhere. Binaries of (I) and (III) are available for local use by contacting the authors. Contact: {shindyal,bourne}@sdsc.edu 3 Introduction There are a number of data resources devoted to proteins which provide original experimental data, for example, the PDB (Bernstein et al., 1977) provides structure data and SWISS-PROT (Bairoch and Boeckmann, 1993) provides sequence data. Similarly, there are data resources supporting derived data, for example, PROSITE (Bairoch, 1993) groups proteins according to assumed functional similarity based on sequence homology, DSSP (Kabsch and Sander, 1983) provides secondary structure assignments, and HSSP (Sander and Schneider, 1991) provides fold classifications based on structure alignments. The majority of these data resources do not use an underlying data management system, beyond that offered by the operating system file system. Other resources provide an integrated view of a variety of protein data, for example, Entrez (Hogue, Ohkawa and Bryant, 1996) and SRS (Etzold and Argos, 1993). Both of these resources assign additional indices to items of data. These indices facilitate the cross-referencing of data as part of a complex query. The system described here also uses indices to associate related items of data. However, it takes the approach further by defining collections of indices which all share the same property and by supporting complex query and visualization tools. To put this approach to data management in perspective, consider our earlier efforts (summarized in Shindyalov et al., 1995) which like the work describe here concentrated on protein structure data derived from the Protein Data Bank (PDB) and which resulted in a variety of tools and databases. Relevant here were a C++ class library providing an in-memory representation of protein structure (PDBlib, Chang et al., 1994); a persistent version of PDBlib implemented using ObjectStore™ (OOPDB, Shindyalov et al., 1995); and a powerful query language directed at macromolecular query (MMQL, Shindyalov et al., 1994). This software sought to retain the relationships inherent in macromolecular structure through features supported in the C++ and Java object oriented programming languages, for example, data abstraction and class inheritance. While this approach satisfies many of our own research needs, we sought to provide a portable, non-proprietary data management and query system which could be downloaded and used locally in both research and education settings. Sush a system should be extensible to include new types of data and query methodologies. This precluded the use of a commercial proprietary database management system like ObjectStore, or the popular relational database management systems like Sybase or Oracle on the basis of cost. We could have chosen a public domain database management system (e.g., POSTGRES 4.2) but opted for a lightweight form of indexbased data representation and query which, from a user’s perspective is read-only. Thus, the data representation and query capabilities described here have, by design, few of the characteristics of a database management system (DBMS). For example, there is no support for transaction scheduling, concurrency control, and levels of security beyond that offered by the operating system. While falling short of a DBMS, this lightweight indexbased approach described here has produced free, portable databases in wide use by the community. For example, somewhere between 500 and 1,000 copies of WPDB (described subsequently) are currently in use. 4 Current efforts to support a larger variety of data on proteins indicate the extensibility of the approach, hence, details are reported here. System and methods All software has been written in the C++ language. It has been successfully compiled on Digital Equipment Corporation (DEC) Alpha, and Sun Microsystems Inc. Sparc platforms using gcc version 2.7.2. Given the wide availability of the gcc compiler on UNIX platforms, we do not anticipate porting problems to other common hardware platforms supporting UNIX. It has also been compiled on Intel PC platforms using the Borland C++ compiler, version 4.5 under Windows95. The databases and query tools built using this system are freely available via the Internet. Source code is available by contacting one of the authors. The system can be easily extended by C++ or Java programmers. Algorithm Approach A summary of the terminology used throughout is given in Table I. Note that the terms Entity1 and property are not taken from database terminology, but are used here to refer to a unique component of the macromolecular structure and a physical property of a macromolecule, respectively. The approach is to decompose 3-D macromolecular structure information into its elementary properties. which are represented using Property2 objects and associated indices. For example, the atomic coordinates of every Entity in the PDB are represented as a single Property object containing two attributes, an index for the Entity and for each index, all the x, y, and z coordinates for that Entity. All the x, y, and z coordinates associated with that one Entity index are collectively referred to as a data item. The same Entity index is used for other properties of the same Entity, each contained in separate Property objects. As will be shown subsequently, a variety of other indices are used to capture features of macromolecular structure, both contained in a PDB file and derived from PDB data. The parser is a key feature in the decomposition process, since parsing the complete PDB is not straightforward. The parser was introduced previously (Chang et al., 1994). Figure 1. Relational Verses Indexed-based Representation of Monomers. 1 2 Terms defined in Table I are given in italics. Objects, classes, and their attributes are given in bold. Table “Monomers” code3 n_atom code1 Property Objects using Index “mon” Property Object code3.mon Property Object n_atom.mon 5 Property Object code1.mon ALA 5 A 1 ALA 1 5 1 A VAL 7 V 2 VAL 2 7 2 V LEU 8 L 3 LEU 3 8 3 L ILE 8 I 4 ILE 4 8 4 I CYS 6 C 5 CYS 5 6 5 C Figure 1. Relational Verses Indexed-based Representation of Monomers. A property, as implemented here, corresponds to a single column in a relational table, while an index, given as a sequential integer value, corresponds to an identifier for a row in that relational table (Figure 1). Persistent Property objects are each stored as individual binary files and their contents subject to compression. Queries relating to a given property, or properties, are efficient since only the appropriate files need be accessed. Moreover, in some instances data are compressed achieving efficient storage. This was discussed in Shindyalov and Bourne (1995), where up to a 20-fold compression over PDB files is possible when compressing atomic coordinates. Within a Property object data items can be singular, tat is, one value per index (e.g., compound name), or multiple, that is, an array of values of fixed or variable length (e.g., all the atomic coordinates associated with an Entity). The choices of data types for a data item in a Property object are character string, 1-byte unsigned integer, 2- or 4-byte integer, or 4byte floating point. Queries which require only a single property, for example, a search of textual records or protein sequences, are performed efficiently using a single Property object. Collection objects contain collections of indices exhibiting the same property and hence permit fast property pattern matching. For example, the exposure to solvent (Lee and Richards, 1971) of all pentamers in the database can be expressed as strings of 5 numbers, 00000 through 55555, where 00000 indicates a buried pentamer and 55555 a fully exposed pentamer. The 55 possible combinations for describing the exposure values of pentamers are the indices for the collection, and the Entity indices representing the Entities that contain the pattern, the data item. It will be shown subsequently that this scheme can be extended for property patterns covering larger ranges of properties than simple pentamers. Persistent Collection objects are also maintained as separate binary files. Property and Collection objects are the primary level of data representation in our system (Figure 2). 6 Figure 2. System Overview. Query Type Find Text Find Sequence Find Exposure Pattern Quarternary (In-memory) Display Contact Map Cmap(“4HHB:A” ,“4HHB:B”) IEntity(...) ISubentity(…) IAtom(…) Tertiary (In-memory) IMonomer(…) Secondary (In-memory) Primary (Persistent) Method Entities(…) Property(“compnd.com”) find(“hemoglobin”,0) IBond(…) Monomers(…) Property(“seq.enp”) Collection(“exp.ip6”) Property(“name.enc”) Property(“bond.mon”) Property(‘xyz.enc”) Property(“atom.mon”) find(“vlspadktnvkaa”,0) find(“01110”) Composite objects are in-memory objects characterized as Monomers and Entities to indicate content relating to Monomers and Entities, respectively. Composite objects are derived from Property and Collection objects at run time. Iterators IEntity, ISubentity, IAtom, IMonomer and IBond navigate macromolecular features (as defined by their name) contained in Monomers and Entities objects. Composite and Iterator objects are the secondary and tertiary levels of data representation in our system, respectively (Figure 2). Finally at the quaternary level, Specialized objects are provided to support more complex queries. For example, the Cmap (contact map) object (Figure 2) calculates and maintains distance contact matrices. Persistent Specialized objects are currently being written to capture important relationships between protein features necessary to describe families of proteins (e.g., multiple sequence alignments). Query is via methods contained within the various objects described above. Queries are fast since the query method supports the specific data organization within the data item. However, it means specific query methods must be written for each new type of query. Sufficient methods have been coded to provide basic functionality. For example, the find()3 method encapsulated in both Property and Collection objects provides a variety of search capabilities — searching for a particular substring in text fields, searching for protein sequence similarity, searching for sites exposed to solvent (Figure 2), and so on. Specific examples of searching will be given subsequently. Indices 3 Methods are given in both bold and italics. 7 The Entity index introduced above is currently one of six indices used to retrieve protein features (Table II). Only one level of indexing is allowed, that is, every data item is retrieved based on a single index. There are no indices representing the most basic protein features, for example, Subentities and Atoms. These basic features are represented as part of an Entity and retrieved using the Entity index. As stated, each persistent Property object is contained in a separate file. Names of properties serve as names of files, which are of the form, pppppppp.iii, where ppppppppp is a property name (8 characters or less to satisfy MS-DOS and Windows 3.x platforms,) and iii, is the index identifier (Table II). How PDB native and derived data are currently decomposed into elementary properties for efficient query is given in Table III. Classes, their Attributes and Methods The DB class The major classes comprising the system (many of whose object instances are described above), their purpose, and representative methods from each class are given in Table IV. DB is the foundation class containing the access methods to each of the files containing Property and Collection objects. The Persistent Property Class As stated, the Property class represents persistent protein features. These features are obtained directly from PDB files or derived as the database is built or incremented. The Property class includes two attributes, objectIndexArray and propertyValuesArray (Figure 3). The use of an array-based data representation rather than a linked list is a departure from our earlier work with PDBlib (Chang et al., 1994). Arrays provide improved performance in situations where iteration is over a large number of property values associated with a single index, for example, all atomic coordinates associated with an Entity index. The attribute objectIndexArray links indices representing specific objects (Table II) with their boundary values. For example, consider the atomic coordinates for all Entities contained in the file xyz.enc. This could be all Entities in the PDB, or all Entities in a database built by the user. If the first entry was deoxyhemoglobin (Fermi et al., 1984; PDB code 4HHB) each Entity in this structure would be assigned an index. There would be a total of 231 indices (4 polypeptide chains i.e., Polymer Entities, 4 heme groups, 2 phosphates and 221 water molecules) each represented by a separate Entity. Reference 1 for index 1 (Figure. 3) would have a value of 1 and point to the first atom of the first Entity (designated A in the PDB file). Reference 2 for index 2 would have a value of 1071 pointing to the first atom of the second Entity (designated B in the PDB file), and so on. 8 Figure 3. In-memory Representation (a) and Storage Model (b) for the Class Property. N is the number of indices in the Property object, each with an associated data item; M is the total number of data values in the Property object. (a) In memory representation objectIndexArray Index 1 Reference 1 Index 2 Index N Reference 2 ... Reference N ... propertyValuesArray Value 1 Value 2 Value 3 Value 4 ... Value M (b) Storage model Record 1 Record 2 Record 3 Record 4 Record 5 Record 6 Record 7 Type of Value Size of Value Reference to object Index Array Size of object Index Array Size of property Values Array property Values Array object Index Array The attribute propertyValuesArray (Figure. 3) represents actual property values. In our example database all atomic coordinates would be stored sequentially with the 1071th entry corresponding to the first atom of the hemoglobin B polypeptide chain. Additional attributes of the Property class, valueType and valueSize, define the data type for Property objects. Possible value types are: character string, unsigned 1-byte integer, 2-byte integer, 4-byte integer, and 4-byte float. valueSize can be either fixed or variable; if fixed, objects defined by each index are assumed to be of the same size (e.g., atomic coordinates), otherwise the size of the object is arbitrary. An attribute loadInMemory defines the loading mode i.e., whether the property should be accessed from disk on a data item by data item basis, or loaded into memory en mass. Loading mode can be overridden when opening the Property object. The choice of loading mode depends on how many data items are going to be involved in a particular query. Consider common methods associated with the Property class (Table IV). The constructor Property() creates an interface to manipulate properties. This method also establishes communication between the Property and the DB class, necessary for the Property object to perform input and output. Method create() creates a named property in memory and defines its value type and value size. Method add() is used to add property values to the next object, assuming sequential ordering of objects according to the index. The methods item() and items() are used to access single and multiple value properties, respectively. As stated previously, method find() searches for various property patterns in Property objects. For example, consider a search for a protein sequence. Since protein 9 sequences are stored as a single Property object, that object is loaded into memory and searched using the appropriate find() method. Using this approach, finding a sequence pattern takes a few seconds of computing time on a low-end workstation, assuming the current PDB database of approximately 5,500 structures. The Persistent Collection Class The persistent Collection class encapsulates the Collection objects described above, and thus supports fast searching of property patterns. The Collection class consists of collections of indices for a particular property grouped according to ranges of similar values. This class is designed for properties which can be represented as sets of object indices, where each set is associated with yet another index, the SetIndex. SetIndex is fixed but the number of object indices represented by each SetIndex is not fixed and can be incremented with database growth. Figure 4. In-memory Representation (a) and Storage Model (b) for the Class Collection. (a) In memory representation objectSetIndexArray SetIndex 1 SetIndex 2 Reference 1 SetIndex N Reference 2 objectIndex 1.1 ... Reference N objectIndex 2.1 objectIndex N.1 ... objectIndex 1.2 objectSets ... objectIndex 2.2 ... objectIndex N.2 ... (b) Storage model Record 1 Record 2 Record 3 Record 4 Type of object Index Number of objectSets objectSets Array Size objectSets (1…N) The attribute objectSetIndexArray defines the range of property patterns in the Collection class (Figure 4). In a previous example using exposure to solvent, a SetIndex of 0 would correspond to the exposure pattern 00000 and a SetIndex of 55 the exposure pattern 55555. The objectSets attribute would then contain the indices of the Entities containing the appropriate exposure pattern. 10 Container and Iterator Classes The in-memory Entities and Monomers classes are containers supporting queries pertaining to a few Entities or Compounds, but involving several properties simultaneously. Iterator classes, IEntity, ISubentity, IAtom, and IBond iterate through basic protein components (as suggested by their names) when loaded into the container class, Entities. Hence, various macromolecular properties can be accessed through methods associated with iterator classes (Table IV). For example, the number of Subentities in an Entity can be obtained from the nSubentities() method of the IEntity object, and atomic coordinates can be returned using the x(), y(), z() methods of the IAtom iterator object. Similarly, instances of the IMonomer class loaded with the Monomers class iterate and return, for example, one and three letter codes for Monomers. Specialized Classes The Cmap (contact map) class illustrated in Figure 2 is one example of a specialized class that use classes lower in the hierarchy. Cmap will be described in more detail subsequently to illustrate the approach. Other examples of specialized classes, not described, are Profile (view property profiles), View3D (3-D rendering tool) Align (sequence alignment), and Idf (basic property and collection query tool described below). Query There is no general purpose query language associated with this system. Rather, queries have been written as specialized classes to address specific questions of interest. Nevertheless, enough questions can be asked to make the system generally useful as indicated by the applications discussed subsequently. Further, additional queries can be added through the reuse of existing methods, and writing of more specialized methods. Figure 2 illustrates the basic approach, with queries of increasing complexity shown from left to right. Either a general or specialized method with appropriate arguments is used. Each query in Figure 2 is discussed to illustrate how the system functions. Find Text The file compnd.com contains text from PDB COMPND records stored as a Compound Property object. A simple text searching algorithm present in the find() method is used to search for the text string “hemoglobin” - no mismatches are allowed. This search on a PDB database of 5,500 structures takes from 1 second on a typical UNIX workstation to a few seconds on an Intel 486 processor. Find a Subsequence 11 Using the identical find() method, again with no mismatches allowed, a search is performed for the sequence string “vlspadktnvkaa.” The search is on the file seq.enp which contains all the protein sequences taken from PDB SEQRES records and stored as a Polymer Entity Property object. This search takes approximately the same amount of time as the text search. Find an Exposure to Solvent Pattern Again the find() method is used, but this time on the file exp.ip6 which contains a Collection object describing patterns of exposure. The query will return a list of Entity indices corresponding to Entities containing the pattern. Display a Contact Map This uses the specialized Cmap object which accepts two arguments, the two Entities between which a contact map is to be calculated, in this example the A and B chains (as designated in the PDB file) of hemoglobin (PDB code 4HHB). A contact is found if the distance between any two atoms from the different Entities are less than a threshold value (6Å by default). Variations of this method allow for Cα or Cβ contacts to be returned. The steps are as follows: (i) The appropriate Entities are located using the IEntity iterator object which first locates the indices associated with the Entity names from the file name.enc. (ii) Using the Entity indices the atomic coordinates are retrieved for the required Entities from the Property object contained in the file xyz.enc and an Entities object created. (iii) An in-memory Monomers object is created which describes the connectivity of each Monomer in the database. This is achieved by the Monomers object accessing the Property objects which contain the atom names and specific bonds in each Monomer. These Monomer Property objects contain descriptions of all canonical and non-canonical Monomers found when the PDB or other structural data were loaded. (iv) Using this connectivity information Cmap iterates through each atom (by way of IEntity, ISubentity, and Iatom iterators) of each Monomer for one Entity calculating the distance from that atom to all atoms in each of the Monomers of the other Entity. If a contact distance is found, the contact distance, associated atom, Subentity and Entity are retained and Cmap skips to the next Monomer. Distances are returned and displayed in a variety of ways depending on the application. Calculation of a contact map takes from 1 second on a typical UNIX workstation to 3-4 seconds on an Intel 486 processor. Implementation 12 Several tools have been implemented and distributed using this idea of optimized data decomposition and are available either for World Wide Web access, or for local access once downloaded via the Internet. MOOSE - Macromolecular Object Oriented Search Engine (http://www.sdsc.edu/moose) - is a read-only Web-based browsing and query tool for finding and analyzing macromolecular structures found in the PDB. The MOOSE database is initially built from the complete PDB (Bernstein et al., 1977). and updated once every 24 hours by obtaining new structures from the PDB ftp archives using a mirror program. Once built, existing Property and Collection objects are incremented each day to include new structure data. Thus, MOOSE is never more that 24 hours out of date with the primary source of data. Structures are also removed from the PDB. Since the PDB was founded over 300 structures have become obsolete. A method is currently being implemented that masks access to these entries as they become obsolete so that synchronization with the primary data source is maintained. Obsolete entries are still accessible from the Archive of Obsolete Entries (Weissig and Bourne, 1997). At present obsolete entries are removed from MOOSE only when a complete rebuild of the database is performed. During database build and increment a variety of Property and Collection objects are derived. These include secondary structure, a variety of patterns involving environmental properties (solvent exposure, polarity, secondary structure etc.) and static amino acid properties (hydrophobicity, volume etc.). A database build of 5,500 structures takes 120 hours on a 275 Mhz DEC Alpha workstation and produces a database approximately 3.5 gigabytes in size. Property pattern searching using Collection objects is considered a strength of MOOSE. One result of this type of search is illustrated in Table V which summarizes results from a solvent exposure search (Lee and Richards, 1971) performed across the complete PDB. The report indicates the distribution of buried residues with respect to various lengths of primary sequence. This query is run every 24 hours when new structures are loaded from the ftp archives and available via the Web as one of a number of reports on the current content of the database. Individual exposure values are added to Property objects and the distribution of exposure values for the whole database stored as Collection objects and then queried to produce the reports. The reports indicate the number of likely hits when searching for a specific property pattern and serve to help the user refine a property pattern query. Similar reports are maintained for isotropic temperature factors, polarity, hydrophobicity, isoelectric point, and volume. Specific queries associated with amino acid property patterns may be formulated as follows: • • • • Threshold — Find structures with a specified length of sequence where all values of a specific property are above or below some threshold. Index — Find structures where the average value of a specific property over a range of residues is used as a search pattern. Profile — As Index, but allow approximate matching. Sequence As Index or Profile, but rather than using absolute values for a property use the static value specific to a given amino acid. 13 • Structure — As Sequence, but rather than use static values for the property, use property values found in a segment of an existing structure in the database. Currently MOOSE maintains Collection objects spanning 5, 10, 15, 20, 25, and 30 Monomers. All are represented by 55 indices such that, for example, each digit of the index would be the value averaged over 6 Monomers when searching for a property pattern spanning 30 Monomers. While MOOSE provides a Web interface to the underlying database, actual queries are made using a CGI script that invokes a query program, Idf (indexed decomposed features) which, if a database is built and stored locally, can be invoked from the command line. A few representative examples of Idf query syntax are given in Table VI. The Idf program is not meant to be a general purpose query language, but is introduced to indicate the types of queries that can be made of a MOOSE database. Readers interested in building their own MOOSE databases should contact one of the authors. WPDB (http://www.sdsc.edu/CCMS/Packages/wpdb.html) is a Microsoft Windows based macromolecular structure interrogation tool. WPDB v2.2 supports only Property objects, but Collection objects supporting property-pattern based query will be available with v3.0. The overall database is from 10-20 fold smaller (depending on amount of derived data) than the PDB distribution, and can be queried from local disk or a single CD-ROM. The disk space saving comes from a simple algorithm applied to PDB ATOM records. Besides query capabilities WPDB provides visual components such as a 3Dviewer, profile viewer, contact map viewer, alignment editor and structure validation tool. All visual components communicate with each other providing intuitive interaction. WPDB was described in Shindyalov and Bourne (1995). WPDBL the loader program to build databases from PDB files is available from ftp://ftp.sdsc.edu/pub/biology/WPDB/WPDBL. QuickPDB (http://xtal1.sdsc.edu/misha/QuickPDB.html) is an example of a specialized object encapsulated in a lightweight Java applet which perform text and sequence searching using the MOOSE database. The applet implements a small subset of Idf commands which are passed to the MOOSE database which returns a list of Entities which are displayed by the applet. Once an Entity is selected the complete Compound is displayed and rendered in a variety of ways including side-by-side stereo, sequence/structure highlighting, secondary structure highlighting, and B factor highlighting. Conformational Likeness (http://xtal1.sdsc.edu/misha/misha.html) is a tool for real-time 3-D substructure similarity searching using a variety of geometric and physicochemical properties. Currently 495 individual properties are maintained as Property objects. Details of the system and the algorithms for substructure matching will be described elsewhere. Discussion 14 Data management systems that support a large variety of data needed by the molecular biologist, are comprehensive in their data coverage and query capability, simple to use, portable, efficient, and in the public domain, simply do not exist. Each subdiscipline has developed a variety of data management systems of which few, if any, which meet all the above mentioned criteria. Rather they focus on meeting a specific subset of criteria. The index-based system described here focuses on efficiency for a particular set of queries at the price of generality and formal query language. The data management system is portable, and the idea of using Property and Collection objects general, but requires some programming skill to implement for supporting new types of data. Nevertheless, we find the approach provides a high level of functionality to the user and welcome others to try the approach. This data management system is currently being expanded to encompass specific protein families for which additional data on sequence alignments, structure alignments, kinetics, and disease-related information are available. Our test case uses data on the protein kinase family of enzymes (Bourne et al., 1997). The system works since relationships between data items are kept at their most basic level - a small set of indices relating Entities, Monomers, etc. This gets around the problem of maintaining a fast evolving schema, but at the price of a greater coding effort for new types of query. Acknowledgments This work was supported by the National Science Foundation grant number BIR 9507625. We are grateful to Calton Pu for his careful review of this manuscript. References Bairoch,A. (1993) The PROSITE dictionary of sites and patterns in proteins, its current status. Nucl. Acid Res., 21, 3097-3103. Bairoch,A. and Boeckmann,B. (1993) The SWISS-PROT protein sequence databank, recent developments. Nucl. Acid Res., 21, 3093-3096. Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer Jr.,E.F., Brice,M.D., Rogers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) Protein data bank: a computer based archival file for macromolecular structures. J. Mol. Biol., 112, 535-542. Bourne,P.E., Smith,C., Gribskov,M., Berman,H.M., and Taylor,S. (1997). The protein kinase resource (http://www.sdsc.edu/kinases/). Chang,W., Shindyalov,I.N., Pu,C. and Bourne,P.E. (1994) Design and application of PDBlib, a C++ macromolecular class library. CABIOS, 10, 575-586. Etzold,T., and Argos,P. (1993) SRS - an indexing and retrieval tool for flat file data libraries. CABIOS, 9, 49-57. 15 Fermi,G., Perutz,M.F., Shaanan,B. and Fourme,R. (1984) The crystal structure of human deoxyhaemoglobin at 1.74Å resolution. J. Mol. Biol., 175, 159-174. Hogue,C.W.V, Ohkawa,H. and Bryant,S.H. (1996) A dynamic look at structures: WWWEntrez and the molecular modeling database. TIBS, 21, 226-229. Kabsch,W. and Sander,C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577-2637. Lee,B. and Richards,F.M. (1971) The interpretation of protein structures: estimation of static accessibility. J. Mol. Biol., 55, 379-400. Sander,C., and Schneider,R. (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56-68. Shindyalov,I.N., Chang,W., Pu,C. and Bourne,P.E. (1994) Macromolecular query language (MMQL): a prototype data model and implementation. Prot. Eng., 7, 13111322. Shindyalov,I.N. and Bourne,P.E. (1995) WPDB - PC Windows based interrogation of macromolecular structure. J. App. Cryst., 28, 847-852. Shindyalov,I.N., Chang,W., Cooper,J.A., and Bourne,P.E. (1995) Design and use of a software framework to obtain information derived from macromolecular structure data. Proceeding of the 28th annual Hawaii international conference on system science. Vol. V. Biotechnology Computing, IEEE Computer Society Press, pp.207-216. Weissig,H and Bourne,P.E. (1997) Archive of obsolete PDB entries. http://db2.sdsc.edu/PDBObs/PDBObs.cgi. 16 Table I. Terminology Used to Describe Features of the Data Management System. Term Atom Collection Compound data item Entity Entities index Iterators Monomer Monomers Polymer Entity Property Subentity Definition Attribute of Subentity describing the properties of a single atom. Persistent object describing patterns of macromolecular properties. Set of Entities comprising a single experiment, e.g., all members of an NMR ensemble in a PDB file. All values of a property associated with a given index. Single polypeptide chain or DNA strand (Polymer Entity) or unique non-polymer component of Compound. In-memory object containing properties of Entities. Protein property reference; consecutive integers from 0 to N are assigned to each instance of index. Type of object used to navigate macromolecular features. Amino acid residue or nucleotide base from which instances of Polymer Entities are formed. In-memory object containing properties of Monomers. Collection of Subentities in sequential order. Persistent object describing a single macromolecular physical property, where each instance of that property is represented by an index. Specific instance of Monomer forming a Polymer Entity. 17 Table II. Indices Representing Protein Features. Index Description ID Monomer index used to retrieve Monomers. The names of Monomers, the names mon of atoms comprising Monomers, and the list of chemical bonds between atoms in a Monomer are examples. Compound index used to retrieve features associated with the whole Compound, com for example, information found on PDB SOURCE and HEADER records. Entity index used to retrieve properties pertinent to both polymer and nonenc polymer Entities. Atomic coordinates and isotropic temperature factors are examples. Polymer Entity index used to retrieve properties of Polymer Entities. Amino acid enp sequence and secondary structure assignments are examples. Single interval property pattern index is used to retrieve property patterns from imm Collection objects, based on some threshold. Interval property pattern index is used to retrieve property patterns from ip6 Collection objects, based on interval values. 18 Table III. Property Objects Representing Protein Features. Property Object code3.mon code1.mon n_bond.mon bond.mon n_atom.mon atom.mon prev.mon next.mon type.mon Protein Feature Three letter Monomer code. One letter Monomer code. Number of chemical bonds in a Monomer. Chemical bonds in a Monomer. Number of atoms in a Monomer. Atoms in a Monomer. Atom connected to the previous Monomer in the chain. Atom connected to the next Monomer in the chain. Monomer type (canonical versus non-canonical). id.com compnd.com source.com date_tex.com date_int.com header.com auth.com jrnl.com expdta.com ec.com res.com n_enp.com i_enp.com n_enc.com i_enc.com PDB compound ID (if available). Compound name. Source of Compound. Deposition date in PDB (text). Deposition date in PDB (integer). Functional class of Compound. Authors associated with Compound. Primary citation for Compound. Experimental method used in determining Compound. Enzyme classification code of Compound (if applicable). Resolution of Compound (if applicable). Number of Polymer Entities in Compound. Indices of first Polymer Entity the Compound. Number of Entities in Cmpound. Index of first Entity in Compound. name.enc i_com.enc i_enp.enc n_se.enc se.enc sen_pdb.enc n_xyz.enc xyz.enc bfac.enc se_xyz.enc xyz_se.enc Entity name. Index of Compound for Entity. Index of Polymer Entity for Entity. Number of Subentities in Entity. Individual Subentities of Entity. Subentity numbers for Entity (PDB assignment). Number of atoms with coordinates. Atom coordinates. B factors for atoms. Subentity to atoms reference array (i.e., se.enc to xyz.enc). Atoms to Subentity reference array (i.e., xyz.enc to se.enc). name.enp i_com.enp i_enc.enp Polymer Entity name. Index of compound for Polymer Entity. Index of Entity for Polymer Entity. 19 Property Object type.enp mw.enp n_se.enp alpha_c.enp beta_c.enp seq.enp k_s.enp exp.enp pol.enp Protein Feature Polymer Entity type (Protein, DNA, RNA). Polymer Entity molecular weight. Number of Subentities in Polymer Entity. Alpha helical content of Polymer Entity. Beta structural content of Polymer Entity. Sequence of Polymer Entity. Secondary structure assignment (Kabsch and Sander, 1983) within Polymer Entity. Solvent exposure (Lee and Richards, 1971) within Polymer Entity. Local polarity within Polymer Entity. 20 Table IV. Classes Representing Protein Features and Typical Query Methods. Class DB Purpose Container class containing low-level access methods, database path, and access status. Property IEntity Class supporting the representation and storage of property values for a set of objects representing features of Monomers, Compounds, and Entities. Class supporting the representation and storage of integer values (e.g., property indices) according to a SetIndex of fixed size. Represents collections of Compounds, Entities, and other indexed objects classified by property pattern. Container class for temporary in-memory storage of major Monomer properties. Monomer iterator. Container class for temporary in-memory storage of major Entity properties for a subset of Entities. Entity iterator. ISubentity Subentity iterator. IAtom IBond Atom iterator. Bond iterator. Collection Monomers IMonomer Entities Typical Methods setPath(), openFile(), read(), write(), closeFile(). Property(), create(), clear(), open(), close(), add(), item(), items() Collection(), create(), clear(), open(), close(), add(), item(), item2(), item4(). code1(), code3(), bonds(), atoms(). ++, code1(), code3(). AddCom(), addEnc(), addEnp(). ++, name(), nSubentity(). ++, findAtom(), nAtom(). ++, x(), y(), z(). ++, atom1(), atom2(). 21 Table V. Distribution of Solvent Exposure Sites by Entity for All Proteins in the PDB. Solvent exposure is calculated according to the method of Lee and Richards (1971) and characterized for different fragment sizes. Column represent percent exposed and rows the fragment size in number of consecutive amino acids. For example, there are 12 Entities in the PDB which contain 30 or more consecutive residues which are less that 2% exposed. Fragment Size 2 5 3869 10 1629 15 562 20 127 25 25 30 12 Percent Exposed 4 4801 2492 1214 481 181 55 6 5332 3667 2000 1023 458 193 8 5542 4620 3188 2028 1002 550 10 5790 5293 4464 3316 2202 1337 12 6021 5766 5335 4839 4142 3256 14 6108 5948 5754 5513 5256 4947 16 6182 6045 5972 5819 5676 5544 18 6213 6100 6056 5986 5875 5741 20 6249 6111 6069 6014 5929 5791 22 Table VI. The General Form and Examples of Idf Query Syntax. Idf Query General form: idf {input} [and_or] query [[and_or] query] ... [[and_or] [query] Description where: input is one of the following: (i) Entity or Compound identifier e.g., 4HHB. (ii) Complete database. (iii) File with a list of Entity or Compound identifiers e.g., @filename. (iv) Compound and Entity for contact maps e.g., 4HHB:A,4HHB:A and query returns a variety of single values, patterns, and reports. Specific examples: idf 2CCX comp Return the Compound name for the structure with the PDB code 2CCX. idf 2CCX date Return the deposition date for the Compound with the PDB code 2CCX. idf - comp toxin and date 88 90 Returns Compounds with 'toxin' in the name solved between 1988 and 1990. The “-“ signifies the whole database. idf - patp exp:0000077777,iso:7777777777 20 Return 20 Entities with the best property pattern match for the specified combination pattern of solvent exposure and isoelectric point.