Protein Data Representation and
Query Using Optimized Data Decomposition
Ilya N. Shindyalov
San Diego Supercomputer Center
P.O. Box 85608, San Diego, CA 92186-9784
voice: (619) 534-8322, fax: (619) 534-8362
[email protected]
Philip E. Bourne*
San Diego Supercomputer Center
P.O. Box 85608, San Diego, CA 92186-9784
and
Department of Pharmacology
University of California, San Diego
San Diego, CA 92093-0365
voice: (619) 534-8301, fax: (619) 534-5113
[email protected]
Running Head: Protein Data Representation and Query.
Keywords: protein structure, property pattern, data representation, optimized data
decomposition, protein query tools.
*To whom correspondence should be addressed.
08/08/98 7:42 AM
Abstract
Motivation: To provide data management tools to efficiently maintain and query
experimental and derived protein data with the goal of providing new insights into
structure-function relationships. The tools should be portable, extensible, and accessible
locally, or via the World Wide Web, providing data that would not otherwise be available.
Results: The initial phase of the work is reported: the data representation and query of all
available macromolecular structure data, including real-time access to complex property
patterns based on the amino acid sequence. Protein structure data taken from
the Protein Data Bank (PDB) are decomposed into native and derived elementary
properties and represented as compact indexed objects minimizing storage requirements
and query time for select types of query. In addition, collections of indices representing a
particular property are maintained and can be queried for specific property patterns found
across the whole database. The approach is proving applicable to a wide variety of data
available on specific protein families.
Availability: Three resources are available using this approach. (I) The query of
basic structural components and property patterns of the complete PDB is available via the
World Wide Web at the URL http://www.sdsc.edu/moose. (II) WPDB, a PC-based
compressed macromolecular structure database and loader with a Microsoft Windows
interface, is available from ftp://rosebud.sdsc.edu/pub/sdsc/biology/WPDB/. (III) A
database supporting real-time 3-dimensional substructure searching will be reported
elsewhere. Binaries of (I) and (III) are available for local use by contacting the authors.
Contact: {shindyal,bourne}@sdsc.edu
Introduction
There are a number of data resources devoted to proteins which provide original
experimental data, for example, the PDB (Bernstein et al., 1977) provides structure data
and SWISS-PROT (Bairoch and Boeckmann, 1993) provides sequence data. Similarly,
there are data resources supporting derived data, for example, PROSITE (Bairoch, 1993)
groups proteins according to assumed functional similarity based on sequence homology,
DSSP (Kabsch and Sander, 1983) provides secondary structure assignments, and HSSP
(Sander and Schneider, 1991) provides fold classifications based on structure alignments.
The majority of these data resources do not use an underlying data management system,
beyond that offered by the operating system file system. Other resources provide an
integrated view of a variety of protein data, for example, Entrez (Hogue, Ohkawa and
Bryant, 1996) and SRS (Etzold and Argos, 1993). Both of these resources assign
additional indices to items of data. These indices facilitate the cross-referencing of data as
part of a complex query. The system described here also uses indices to associate related
items of data. However, it takes the approach further by defining collections of indices
which all share the same property and by supporting complex query and visualization
tools.
To put this approach to data management in perspective, consider our earlier
efforts (summarized in Shindyalov et al., 1995) which, like the work described here,
concentrated on protein structure data derived from the Protein Data Bank (PDB) and
which resulted in a variety of tools and databases. Relevant here were a C++ class library
providing an in-memory representation of protein structure (PDBlib, Chang et al., 1994);
a persistent version of PDBlib implemented using ObjectStore™ (OOPDB, Shindyalov et
al., 1995); and a powerful query language directed at macromolecular query (MMQL,
Shindyalov et al., 1994). This software sought to retain the relationships inherent in
macromolecular structure through features supported in the C++ and Java object oriented
programming languages, for example, data abstraction and class inheritance.
While this approach satisfies many of our own research needs, we sought to
provide a portable, non-proprietary data management and query system which could be
downloaded and used locally in both research and education settings. Such a system
should be extensible to include new types of data and query methodologies. This
precluded the use of a commercial proprietary database management system like
ObjectStore, or the popular relational database management systems like Sybase or
Oracle on the basis of cost. We could have chosen a public domain database
management system (e.g., POSTGRES 4.2) but opted for a lightweight form of index-based data representation and query which, from a user’s perspective, is read-only. Thus,
the data representation and query capabilities described here have, by design, few of the
characteristics of a database management system (DBMS). For example, there is no
support for transaction scheduling, concurrency control, and levels of security beyond that
offered by the operating system. While falling short of a DBMS, the lightweight index-based approach described here has produced free, portable databases in wide use by the
community. For example, somewhere between 500 and 1,000 copies of WPDB (described
subsequently) are currently in use.
Current efforts to support a larger variety of data on proteins indicate the
extensibility of the approach; hence, details are reported here.
System and methods
All software has been written in the C++ language. It has been successfully
compiled on Digital Equipment Corporation (DEC) Alpha, and Sun Microsystems Inc.
Sparc platforms using gcc version 2.7.2. Given the wide availability of the gcc compiler on
UNIX platforms, we do not anticipate porting problems to other common hardware
platforms supporting UNIX. It has also been compiled on Intel PC platforms using the
Borland C++ compiler, version 4.5, under Windows 95. The databases and query tools
built using this system are freely available via the Internet. Source code is available by
contacting one of the authors. The system can be easily extended by C++ or Java
programmers.
Algorithm
Approach
A summary of the terminology used throughout is given in Table I. Note that the
terms Entity[1] and property are not taken from database terminology, but are used here to
refer to a unique component of the macromolecular structure and a physical property of a
macromolecule, respectively. The approach is to decompose 3-D macromolecular
structure information into its elementary properties, which are represented using
Property[2] objects and associated indices. For example, the atomic coordinates of every
Entity in the PDB are represented as a single Property object containing two attributes,
an index for the Entity and for each index, all the x, y, and z coordinates for that Entity.
All the x, y, and z coordinates associated with that one Entity index are collectively
referred to as a data item. The same Entity index is used for other properties of the same
Entity, each contained in separate Property objects. As will be shown subsequently, a
variety of other indices are used to capture features of macromolecular structure, both
contained in a PDB file and derived from PDB data. The parser is a key feature in the
decomposition process, since parsing the complete PDB is not straightforward. The parser
was introduced previously (Chang et al., 1994).
Figure 1. Relational Versus Index-based Representation of Monomers.

Table “Monomers”

code3   n_atom   code1
ALA     5        A
VAL     7        V
LEU     8        L
ILE     8        I
CYS     6        C

Property Objects using Index “mon”

Property Object code3.mon    Property Object n_atom.mon    Property Object code1.mon
1   ALA                      1   5                         1   A
2   VAL                      2   7                         2   V
3   LEU                      3   8                         3   L
4   ILE                      4   8                         4   I
5   CYS                      5   6                         5   C

[1] Terms defined in Table I are given in italics.
[2] Objects, classes, and their attributes are given in bold.
A property, as implemented here, corresponds to a single column in a relational
table, while an index, given as a sequential integer value, corresponds to an identifier for a
row in that relational table (Figure 1). Persistent Property objects are each stored as
individual binary files and their contents subject to compression. Queries relating to a
given property, or properties, are efficient since only the appropriate files need be
accessed. Moreover, in some instances data are compressed achieving efficient storage.
This was discussed in Shindyalov and Bourne (1995), where up to a 20-fold compression
over PDB files is possible when compressing atomic coordinates. Within a Property
object data items can be singular, that is, one value per index (e.g., compound name), or
multiple, that is, an array of values of fixed or variable length (e.g., all the atomic
coordinates associated with an Entity). The choices of data types for a data item in a
Property object are character string, 1-byte unsigned integer, 2- or 4-byte integer, or 4-byte floating point. Queries which require only a single property, for example, a search of
textual records or protein sequences, are performed efficiently using a single Property
object.
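The column-wise decomposition of Figure 1 can be illustrated with a minimal C++ sketch. It is not the authors' implementation: the names PropertyColumn, MonomerColumns and monomersWithAtomCount are hypothetical, compression and file storage are omitted, and only the values shown in Figure 1 are used.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one Property object: a single column of values
// addressed by a sequential, 1-based index (here the Monomer index "mon").
template <typename T>
struct PropertyColumn {
    std::string fileName;   // property name plus index identifier
    std::vector<T> values;  // values[i - 1] belongs to index i

    const T& item(std::size_t index) const {
        assert(index >= 1 && index <= values.size());
        return values[index - 1];
    }
};

// The three columns of the "Monomers" table in Figure 1, held separately.
struct MonomerColumns {
    PropertyColumn<std::string> code3{"code3.mon", {"ALA", "VAL", "LEU", "ILE", "CYS"}};
    PropertyColumn<int> nAtom{"n_atom.mon", {5, 7, 8, 8, 6}};
    PropertyColumn<char> code1{"code1.mon", {'A', 'V', 'L', 'I', 'C'}};
};

// A query on one property reads only that column and returns matching indices.
inline std::vector<std::size_t> monomersWithAtomCount(const PropertyColumn<int>& nAtom, int n) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 1; i <= nAtom.values.size(); ++i)
        if (nAtom.item(i) == n) hits.push_back(i);
    return hits;
}
```

In this sketch a search for Monomers with 8 atoms touches only the n_atom column; the returned indices (3 and 4) can then be used to look up the three-letter codes in the code3 column.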
Collection objects contain collections of indices exhibiting the same property and
hence permit fast property pattern matching. For example, the exposure to solvent (Lee
and Richards, 1971) of all pentamers in the database can be expressed as strings of 5
numbers, 00000 through 55555, where 00000 indicates a buried pentamer and 55555 a
fully exposed pentamer. The 5^5 possible combinations for describing the exposure values
of pentamers are the indices for the collection, and the data item for each is the set of Entity
indices representing the Entities that contain the pattern. It will be shown subsequently that this
scheme can be extended for property patterns covering larger ranges of properties than
simple pentamers. Persistent Collection objects are also maintained as separate binary
files. Property and Collection objects are the primary level of data representation in our
system (Figure 2).
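The idea behind a Collection object can be sketched as an inverted index from a pentamer exposure pattern to the Entities that contain it. The class below is a hypothetical illustration only: the name ExposureCollection, the use of std::map, and the representation of exposure as one digit character per residue are assumptions, not details taken from the system described here.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of a Collection: for each pentamer exposure pattern
// (a string of 5 digits), the ascending list of Entity indices containing it.
class ExposureCollection {
public:
    // exposure holds one digit character per residue of the Entity.
    // Entities must be added in increasing index order.
    void addEntity(int entityIndex, const std::string& exposure) {
        assert(entityIndex > lastIndex_);
        lastIndex_ = entityIndex;
        for (std::size_t i = 0; i + kWindow <= exposure.size(); ++i) {
            std::vector<int>& set = sets_[exposure.substr(i, kWindow)];
            // Record each Entity once per pattern, however often it occurs.
            if (set.empty() || set.back() != entityIndex) set.push_back(entityIndex);
        }
    }

    // Return the Entity indices for a pattern; empty if it never occurs.
    std::vector<int> find(const std::string& pattern) const {
        auto it = sets_.find(pattern);
        return it == sets_.end() ? std::vector<int>{} : it->second;
    }

private:
    static constexpr std::size_t kWindow = 5;  // pentamer
    int lastIndex_ = 0;
    std::map<std::string, std::vector<int>> sets_;
};
```

Because the work of locating every pattern is done once, when an Entity is added, a later pattern query reduces to a single lookup rather than a scan over all structures.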
Figure 2. System Overview.

Query Type              Level                      Objects and Methods
Find Text               Primary (Persistent)       Property("compnd.com"), find("hemoglobin",0)
Find Sequence           Primary (Persistent)       Property("seq.enp"), find("vlspadktnvkaa",0)
Find Exposure Pattern   Primary (Persistent)       Collection("exp.ip6"), find("01110")
Display Contact Map     Quaternary (In-memory)     Cmap("4HHB:A","4HHB:B")
                        Tertiary (In-memory)       IEntity(...), ISubentity(...), IAtom(...), IMonomer(...), IBond(...)
                        Secondary (In-memory)      Entities(...), Monomers(...)
                        Primary (Persistent)       Property("name.enc"), Property("xyz.enc"), Property("atom.mon"), Property("bond.mon")
Composite objects are in-memory objects characterized as Monomers and
Entities to indicate content relating to Monomers and Entities, respectively. Composite
objects are derived from Property and Collection objects at run time. Iterators IEntity,
ISubentity, IAtom, IMonomer and IBond navigate macromolecular features (as defined
by their name) contained in Monomers and Entities objects. Composite and Iterator
objects are the secondary and tertiary levels of data representation in our system,
respectively (Figure 2).
Finally at the quaternary level, Specialized objects are provided to support more
complex queries. For example, the Cmap (contact map) object (Figure 2) calculates and
maintains distance contact matrices. Persistent Specialized objects are currently being
written to capture important relationships between protein features necessary to describe
families of proteins (e.g., multiple sequence alignments).
Query is via methods contained within the various objects described above.
Queries are fast since the query method supports the specific data organization within the
data item. However, it means specific query methods must be written for each new type of
query. Sufficient methods have been coded to provide basic functionality. For example,
the find()[3] method encapsulated in both Property and Collection objects provides a
variety of search capabilities — searching for a particular substring in text fields, searching
for protein sequence similarity, searching for sites exposed to solvent (Figure 2), and so
on. Specific examples of searching will be given subsequently.
[3] Methods are given in both bold and italics.
Indices
The Entity index introduced above is currently one of six indices used to retrieve
protein features (Table II). Only one level of indexing is allowed, that is, every data item is
retrieved based on a single index. There are no indices representing the most basic protein
features, for example, Subentities and Atoms. These basic features are represented as part
of an Entity and retrieved using the Entity index.
As stated, each persistent Property object is contained in a separate file. Names of
properties serve as names of files, which are of the form pppppppp.iii, where pppppppp
is a property name (8 characters or less to satisfy MS-DOS and Windows 3.x platforms)
and iii is the index identifier (Table II). How PDB native and derived data are currently
decomposed into elementary properties for efficient query is given in Table III.
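The file naming convention can be expressed as a short sketch. The function names propertyFileName and parsePropertyFileName are hypothetical, and the checks assume exactly what is stated above: a property name of 1 to 8 characters and a 3-character index identifier.

```cpp
#include <cassert>
#include <string>

// Compose a Property file name of the form pppppppp.iii.
inline std::string propertyFileName(const std::string& property, const std::string& indexId) {
    assert(!property.empty() && property.size() <= 8);  // 8.3 file name limit
    assert(indexId.size() == 3);                        // e.g. "enc", "mon"
    return property + "." + indexId;
}

// Split a file name back into property name and index identifier.
// Returns false if the name does not follow the pppppppp.iii form.
inline bool parsePropertyFileName(const std::string& file,
                                  std::string& property, std::string& indexId) {
    const std::string::size_type dot = file.rfind('.');
    if (dot == std::string::npos || dot == 0 || dot > 8) return false;
    if (file.size() - dot - 1 != 3) return false;
    property = file.substr(0, dot);
    indexId = file.substr(dot + 1);
    return true;
}
```

With this convention the index needed to interpret a file can be read from its extension alone, so, for example, xyz.enc and name.enc are known to share the same index without opening either file.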
Classes, their Attributes and Methods
The DB class
The major classes comprising the system (many of whose object instances are
described above), their purpose, and representative methods from each class are given in
Table IV. DB is the foundation class containing the access methods to each of the files
containing Property and Collection objects.
The Persistent Property Class
As stated, the Property class represents persistent protein features. These features
are obtained directly from PDB files or derived as the database is built or incremented.
The Property class includes two attributes, objectIndexArray and
propertyValuesArray (Figure 3). The use of an array-based data representation rather
than a linked list is a departure from our earlier work with PDBlib (Chang et al., 1994).
Arrays provide improved performance in situations where iteration is over a large number
of property values associated with a single index, for example, all atomic coordinates
associated with an Entity index. The attribute objectIndexArray links indices
representing specific objects (Table II) with their boundary values. For example, consider
the atomic coordinates for all Entities contained in the file xyz.enc. This could be all
Entities in the PDB, or all Entities in a database built by the user. If the first entry was
deoxyhemoglobin (Fermi et al., 1984; PDB code 4HHB) each Entity in this structure
would be assigned an index. There would be a total of 231 indices (4 polypeptide chains
i.e., Polymer Entities, 4 heme groups, 2 phosphates and 221 water molecules) each
represented by a separate Entity. Reference 1 for index 1 (Figure 3) would have a value
of 1 and point to the first atom of the first Entity (designated A in the PDB file). Reference
2 for index 2 would have a value of 1071 pointing to the first atom of the second Entity
(designated B in the PDB file), and so on.
Figure 3. In-memory Representation (a) and Storage Model (b) for the Class Property. N
is the number of indices in the Property object, each with an associated data item; M is
the total number of data values in the Property object.

(a) In-memory representation

objectIndexArray:      Index 1 -> Reference 1, Index 2 -> Reference 2, ..., Index N -> Reference N
propertyValuesArray:   Value 1, Value 2, Value 3, Value 4, ..., Value M

(b) Storage model

Record 1   Type of Value
Record 2   Size of Value
Record 3   Reference to objectIndexArray
Record 4   Size of objectIndexArray
Record 5   Size of propertyValuesArray
Record 6   propertyValuesArray
Record 7   objectIndexArray
The attribute propertyValuesArray (Figure 3) represents actual property values.
In our example database all atomic coordinates would be stored sequentially with the
1071st entry corresponding to the first atom of the hemoglobin B polypeptide chain.
Additional attributes of the Property class, valueType and valueSize, define the data
type for Property objects. Possible value types are: character string, unsigned 1-byte
integer, 2-byte integer, 4-byte integer, and 4-byte float. valueSize can be either fixed or
variable; if fixed, objects defined by each index are assumed to be of the same size (e.g.,
atomic coordinates), otherwise the size of the object is arbitrary. An attribute
loadInMemory defines the loading mode, i.e., whether the property should be accessed
from disk on a data item by data item basis, or loaded into memory en masse. Loading
mode can be overridden when opening the Property object. The choice of loading mode
depends on how many data items are going to be involved in a particular query.
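The relationship between objectIndexArray and propertyValuesArray can be sketched as follows. This is a simplified, hypothetical rendering of Figure 3(a): the class name PropertySketch and the method reference() are assumptions, storage, compression and the valueType, valueSize and loadInMemory attributes are omitted, and one array entry is taken to be one atom so that the references match the 4HHB example (Reference 2 = 1071 implies that the first Entity occupies entries 1 to 1070).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Xyz { float x, y, z; };  // one atom position

// Hypothetical in-memory Property with variable-length data items.
template <typename T>
class PropertySketch {
public:
    // Append the data item for the next index (indices run 1, 2, 3, ...).
    void add(const std::vector<T>& item) {
        objectIndexArray_.push_back(propertyValuesArray_.size() + 1);  // 1-based reference
        propertyValuesArray_.insert(propertyValuesArray_.end(), item.begin(), item.end());
    }

    std::size_t nIndices() const { return objectIndexArray_.size(); }

    // 1-based position in propertyValuesArray of the first value for index.
    std::size_t reference(std::size_t index) const {
        assert(index >= 1 && index <= nIndices());
        return objectIndexArray_[index - 1];
    }

    // Half-open pointer range [first, last) over the values of one data item.
    std::pair<const T*, const T*> items(std::size_t index) const {
        const std::size_t first = reference(index) - 1;
        const std::size_t last = (index < nIndices()) ? reference(index + 1) - 1
                                                      : propertyValuesArray_.size();
        const T* base = propertyValuesArray_.data();
        return {base + first, base + last};
    }

private:
    std::vector<std::size_t> objectIndexArray_;  // Reference 1 .. Reference N
    std::vector<T> propertyValuesArray_;         // Value 1 .. Value M
};
```

Since all values for one index are contiguous, iterating over the coordinates of an Entity is a linear walk over an array, which is the performance argument made above for arrays over linked lists.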
Consider common methods associated with the Property class (Table IV). The
constructor Property() creates an interface to manipulate properties. This method also
establishes communication between the Property and the DB class, necessary for the
Property object to perform input and output. Method create() creates a named property
in memory and defines its value type and value size. Method add() is used to add property
values to the next object, assuming sequential ordering of objects according to the index.
The methods item() and items() are used to access single and multiple value properties,
respectively. As stated previously, method find() searches for various property patterns in
Property objects. For example, consider a search for a protein sequence. Since protein
sequences are stored as a single Property object, that object is loaded into memory and
searched using the appropriate find() method. Using this approach, finding a sequence
pattern takes a few seconds of computing time on a low-end workstation, assuming the
current PDB database of approximately 5,500 structures.
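A search of this kind can be sketched as a linear scan over the data items of a string-valued Property held in memory. The function findPattern below is hypothetical and is not the system's find() method; in particular, reading the second argument of find() in Figure 2 as the number of mismatches allowed is an assumption, as is treating a mismatch as a substituted character only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// items[i - 1] is the data item (e.g. a sequence) for index i.
// Returns the 1-based indices whose item contains the pattern with at most
// maxMismatches substituted characters (no insertions or deletions).
inline std::vector<std::size_t> findPattern(const std::vector<std::string>& items,
                                            const std::string& pattern,
                                            std::size_t maxMismatches) {
    assert(!pattern.empty());
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < items.size(); ++i) {
        const std::string& text = items[i];
        bool found = false;
        for (std::size_t start = 0; !found && start + pattern.size() <= text.size(); ++start) {
            std::size_t mismatches = 0;
            // Stop comparing this window as soon as the budget is exceeded.
            for (std::size_t k = 0; k < pattern.size() && mismatches <= maxMismatches; ++k)
                if (text[start + k] != pattern[k]) ++mismatches;
            found = (mismatches <= maxMismatches);
        }
        if (found) hits.push_back(i + 1);
    }
    return hits;
}
```

The returned indices can then be used to retrieve other properties of the matching Entities, such as their names, from the corresponding Property objects.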
The Persistent Collection Class
The persistent Collection class encapsulates the Collection objects described
above, and thus supports fast searching of property patterns. The Collection class consists
of collections of indices for a particular property grouped according to ranges of similar
values. This class is designed for properties which can be represented as sets of object
indices, where each set is associated with yet another index, the SetIndex. SetIndex is
fixed but the number of object indices represented by each SetIndex is not fixed and can
be incremented with database growth.
Figure 4. In-memory Representation (a) and Storage Model (b) for the Class Collection.

(a) In-memory representation

objectSetIndexArray:   SetIndex 1 -> Reference 1, SetIndex 2 -> Reference 2, ..., SetIndex N -> Reference N
objectSets:            Reference 1 -> objectIndex 1.1, objectIndex 1.2, ...
                       Reference 2 -> objectIndex 2.1, objectIndex 2.2, ...
                       ...
                       Reference N -> objectIndex N.1, objectIndex N.2, ...

(b) Storage model

Record 1   Type of objectIndex
Record 2   Number of objectSets
Record 3   objectSets Array Size
Record 4   objectSets (1…N)
The attribute objectSetIndexArray defines the range of property patterns in the
Collection class (Figure 4). In a previous example using exposure to solvent, a SetIndex
of 0 would correspond to the exposure pattern 00000 and a SetIndex of 5^5 the exposure
pattern 55555. The objectSets attribute would then contain the indices of the Entities
containing the appropriate exposure pattern.
Container and Iterator Classes
The in-memory Entities and Monomers classes are containers supporting queries
pertaining to a few Entities or Compounds, but involving several properties
simultaneously.
Iterator classes, IEntity, ISubentity, IAtom, and IBond iterate through basic
protein components (as suggested by their names) when loaded into the container class,
Entities. Hence, various macromolecular properties can be accessed through methods
associated with iterator classes (Table IV). For example, the number of Subentities in an
Entity can be obtained from the nSubentities() method of the IEntity object, and atomic
coordinates can be returned using the x(), y(), z() methods of the IAtom iterator object.
Similarly, instances of the IMonomer class loaded with the Monomers class iterate and
return, for example, one and three letter codes for Monomers.
Specialized Classes
The Cmap (contact map) class illustrated in Figure 2 is one example of a
specialized class that uses classes lower in the hierarchy. Cmap will be described in more
detail subsequently to illustrate the approach. Other examples of specialized classes, not
described, are Profile (view property profiles), View3D (3-D rendering tool), Align
(sequence alignment), and Idf (basic property and collection query tool described below).
Query
There is no general purpose query language associated with this system. Rather,
queries have been written as specialized classes to address specific questions of interest.
Nevertheless, enough questions can be asked to make the system generally useful as
indicated by the applications discussed subsequently. Further, additional queries can be
added through the reuse of existing methods, and writing of more specialized methods.
Figure 2 illustrates the basic approach, with queries of increasing complexity
shown from left to right. Either a general or specialized method with appropriate
arguments is used. Each query in Figure 2 is discussed to illustrate how the system
functions.
Find Text
The file compnd.com contains text from PDB COMPND records stored as a Compound
Property object. A simple text searching algorithm present in the find() method is used to
search for the text string “hemoglobin” - no mismatches are allowed. This search on a
PDB database of 5,500 structures takes from 1 second on a typical UNIX workstation to
a few seconds on an Intel 486 processor.
Find a Subsequence
Using the identical find() method, again with no mismatches allowed, a search is
performed for the sequence string “vlspadktnvkaa.” The search is on the file seq.enp
which contains all the protein sequences taken from PDB SEQRES records and stored as
a Polymer Entity Property object. This search takes approximately the same amount of
time as the text search.
Find an Exposure to Solvent Pattern
Again the find() method is used, but this time on the file exp.ip6 which contains a
Collection object describing patterns of exposure. The query will return a list of Entity
indices corresponding to Entities containing the pattern.
Display a Contact Map
This uses the specialized Cmap object which accepts two arguments, the two
Entities between which a contact map is to be calculated, in this example the A and B
chains (as designated in the PDB file) of hemoglobin (PDB code 4HHB). A contact is
found if the distance between any two atoms from the different Entities is less than a
threshold value (6Å by default). Variations of this method allow for Cα or Cβ contacts to
be returned. The steps are as follows:
(i) The appropriate Entities are located using the IEntity iterator object which
first locates the indices associated with the Entity names from the file
name.enc.
(ii) Using the Entity indices the atomic coordinates are retrieved for the required
Entities from the Property object contained in the file xyz.enc and an Entities
object created.
(iii) An in-memory Monomers object is created which describes the connectivity
of each Monomer in the database. This is achieved by the Monomers object
accessing the Property objects which contain the atom names and specific
bonds in each Monomer. These Monomer Property objects contain
descriptions of all canonical and non-canonical Monomers found when the
PDB or other structural data were loaded.
(iv) Using this connectivity information Cmap iterates through each atom (by way
of IEntity, ISubentity, and IAtom iterators) of each Monomer for one Entity
calculating the distance from that atom to all atoms in each of the Monomers
of the other Entity. If a contact distance is found, the contact distance,
associated atom, Subentity and Entity are retained and Cmap skips to the next
Monomer. Distances are returned and displayed in a variety of ways depending
on the application. Calculation of a contact map takes from 1 second on a
typical UNIX workstation to 3-4 seconds on an Intel 486 processor.
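Step (iv) above can be sketched as a nested loop over atoms. The code is an illustration under stated assumptions, not the Cmap class itself: Atom, Monomer, Chain, Contact and contactMap are hypothetical names, plain vectors stand in for the Entities container and its iterators, and only the all-atom variant with the 6 Å default threshold is shown.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Atom { double x, y, z; };
using Monomer = std::vector<Atom>;   // atoms of one Monomer
using Chain = std::vector<Monomer>;  // Monomers of one Entity

struct Contact {
    std::size_t monomerA;  // 0-based Monomer position in the first Entity
    std::size_t monomerB;  // 0-based Monomer position in the second Entity
    double distance;       // distance of the first atom pair under the threshold
};

// For each Monomer of a, record the first atom pair closer than the threshold
// to any atom of b, then skip to the next Monomer of a.
inline std::vector<Contact> contactMap(const Chain& a, const Chain& b, double threshold = 6.0) {
    assert(threshold > 0.0);
    const double limit2 = threshold * threshold;  // compare squared distances
    std::vector<Contact> contacts;
    for (std::size_t i = 0; i < a.size(); ++i) {
        bool found = false;
        for (std::size_t p = 0; !found && p < a[i].size(); ++p)
            for (std::size_t j = 0; !found && j < b.size(); ++j)
                for (std::size_t q = 0; !found && q < b[j].size(); ++q) {
                    const double dx = a[i][p].x - b[j][q].x;
                    const double dy = a[i][p].y - b[j][q].y;
                    const double dz = a[i][p].z - b[j][q].z;
                    const double d2 = dx * dx + dy * dy + dz * dz;
                    if (d2 < limit2) {
                        contacts.push_back({i, j, std::sqrt(d2)});
                        found = true;  // ends all inner loops for this Monomer
                    }
                }
    }
    return contacts;
}
```

Skipping to the next Monomer once a contact is found keeps the cost well below a full all-against-all distance calculation when the two Entities share an interface.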
Implementation
Several tools have been implemented and distributed using this idea of optimized
data decomposition and are available either for World Wide Web access, or for local
access once downloaded via the Internet.
MOOSE - Macromolecular Object Oriented Search Engine (http://www.sdsc.edu/moose) - is a read-only Web-based browsing and query tool for finding and analyzing
macromolecular structures found in the PDB. The MOOSE database is initially built from
the complete PDB (Bernstein et al., 1977) and updated once every 24 hours by obtaining
new structures from the PDB ftp archives using a mirror program. Once built, existing
Property and Collection objects are incremented each day to include new structure data.
Thus, MOOSE is never more than 24 hours out of date with the primary source of data.
Structures are also removed from the PDB. Since the PDB was founded, over 300
structures have become obsolete. A method is currently being implemented that masks
access to these entries as they become obsolete so that synchronization with the primary
data source is maintained. Obsolete entries are still accessible from the Archive of
Obsolete Entries (Weissig and Bourne, 1997). At present obsolete entries are removed
from MOOSE only when a complete rebuild of the database is performed.
During database build and increment a variety of Property and Collection objects
are derived. These include secondary structure, a variety of patterns involving
environmental properties (solvent exposure, polarity, secondary structure etc.) and static
amino acid properties (hydrophobicity, volume etc.). A database build of 5,500 structures
takes 120 hours on a 275 MHz DEC Alpha workstation and produces a database
approximately 3.5 gigabytes in size.
Property pattern searching using Collection objects is considered a strength of
MOOSE. One result of this type of search is illustrated in Table V which summarizes
results from a solvent exposure search (Lee and Richards, 1971) performed across the
complete PDB. The report indicates the distribution of buried residues with respect to
various lengths of primary sequence. This query is run every 24 hours when new
structures are loaded from the ftp archives and available via the Web as one of a number
of reports on the current content of the database. Individual exposure values are added to
Property objects and the distribution of exposure values for the whole database stored as
Collection objects and then queried to produce the reports. The reports indicate the
number of likely hits when searching for a specific property pattern and serve to help the
user refine a property pattern query. Similar reports are maintained for isotropic
temperature factors, polarity, hydrophobicity, isoelectric point, and volume. Specific
queries associated with amino acid property patterns may be formulated as follows:
• Threshold: Find structures with a specified length of sequence where all values of a
specific property are above or below some threshold.
• Index: Find structures where the average value of a specific property over a range of
residues is used as a search pattern.
• Profile: As Index, but allow approximate matching.
• Sequence: As Index or Profile, but rather than using absolute values for a property, use
the static value specific to a given amino acid.
• Structure: As Sequence, but rather than use static values for the property, use
property values found in a segment of an existing structure in the database.
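As an illustration of the simplest of these formulations, the Threshold query, the sketch below scans the per-residue values of one property for stretches of a given length lying entirely below a threshold. The function name windowsBelow is hypothetical, and the sketch works directly on raw values for a single structure rather than on the Collection objects used by MOOSE.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Return the 0-based start positions of every window of the given length in
// which all property values are strictly below the threshold.
inline std::vector<std::size_t> windowsBelow(const std::vector<double>& values,
                                             std::size_t length, double threshold) {
    assert(length > 0);
    std::vector<std::size_t> starts;
    std::size_t run = 0;  // length of the current run of values below threshold
    for (std::size_t i = 0; i < values.size(); ++i) {
        run = (values[i] < threshold) ? run + 1 : 0;
        if (run >= length) starts.push_back(i + 1 - length);
    }
    return starts;
}
```

Tracking the length of the current run makes the scan a single pass over the values, independent of the window length. The "above threshold" case follows by reversing the comparison.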
Currently MOOSE maintains Collection objects spanning 5, 10, 15, 20, 25, and 30
Monomers. All are represented by 5^5 indices such that, for example, each digit of the index
would be the value averaged over 6 Monomers when searching for a property pattern
spanning 30 Monomers.
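The averaging described here can be sketched as a reduction of a window of 5k per-Monomer values to a 5-digit pattern, with each digit the mean over k consecutive Monomers (k = 6 for a 30-Monomer window). The function name blockPattern is hypothetical, and two details are assumptions not stated in the text: values are taken to be already scaled to the range 0 to 5, and each mean is rounded to the nearest integer digit.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Reduce a window of 5*k values to a 5-digit pattern by averaging each block
// of k consecutive values and rounding the mean to the nearest digit.
inline std::string blockPattern(const std::vector<double>& window) {
    const std::size_t digits = 5;
    assert(!window.empty() && window.size() % digits == 0);
    const std::size_t k = window.size() / digits;  // e.g. 6 for 30 Monomers
    std::string pattern;
    for (std::size_t d = 0; d < digits; ++d) {
        double sum = 0.0;
        for (std::size_t j = 0; j < k; ++j) sum += window[d * k + j];
        const long digit = std::lround(sum / static_cast<double>(k));
        assert(digit >= 0 && digit <= 5);  // assumed value range
        pattern.push_back(static_cast<char>('0' + digit));
    }
    return pattern;
}
```

Under these assumptions the same 5-digit index space serves every window length, so a pattern spanning 30 Monomers is looked up in exactly the same way as a pentamer pattern.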
While MOOSE provides a Web interface to the underlying database, actual queries
are made using a CGI script that invokes a query program, Idf (indexed decomposed
features) which, if a database is built and stored locally, can be invoked from the command
line. A few representative examples of Idf query syntax are given in Table VI. The Idf
program is not meant to be a general purpose query language, but is introduced to indicate
the types of queries that can be made of a MOOSE database. Readers interested in
building their own MOOSE databases should contact one of the authors.
WPDB (http://www.sdsc.edu/CCMS/Packages/wpdb.html) is a Microsoft
Windows based macromolecular structure interrogation tool. WPDB v2.2 supports only
Property objects, but Collection objects supporting property-pattern based query will be
available with v3.0. The overall database is from 10-20 fold smaller (depending on amount
of derived data) than the PDB distribution, and can be queried from local disk or a single
CD-ROM. The disk space saving comes from a simple algorithm applied to PDB ATOM
records. Besides query capabilities, WPDB provides visual components such as a 3-D viewer, profile viewer, contact map viewer, alignment editor and structure validation tool.
All visual components communicate with each other providing intuitive interaction.
WPDB was described in Shindyalov and Bourne (1995). WPDBL, the loader program to
build databases from PDB files, is available from ftp://ftp.sdsc.edu/pub/biology/WPDB/WPDBL.
QuickPDB (http://xtal1.sdsc.edu/misha/QuickPDB.html) is an example of a
specialized object encapsulated in a lightweight Java applet which performs text and
sequence searching using the MOOSE database. The applet implements a small subset of
Idf commands; these are passed to the MOOSE database, which returns a list of Entities
for the applet to display. Once an Entity is selected the complete Compound is
displayed and rendered in a variety of ways including side-by-side stereo,
sequence/structure highlighting, secondary structure highlighting, and B factor
highlighting.
Conformational Likeness (http://xtal1.sdsc.edu/misha/misha.html) is a tool for
real-time 3-D substructure similarity searching using a variety of geometric and physicochemical properties. Currently 495 individual properties are maintained as Property
objects. Details of the system and the algorithms for substructure matching will be
described elsewhere.
Discussion
Data management systems that support the large variety of data needed by the
molecular biologist, are comprehensive in their data coverage and query capability, simple
to use, portable, efficient, and in the public domain, simply do not exist. Each subdiscipline
has developed a variety of data management systems, of which few, if any, meet all
the above-mentioned criteria. Rather, they focus on meeting a specific subset of criteria.
The index-based system described here focuses on efficiency for a particular set of
queries at the price of generality and a formal query language. The data management system
is portable, and the idea of using Property and Collection objects is general, but some
programming skill is required to support new types of data. Nevertheless, we
find the approach provides a high level of functionality to the user and welcome others to
try the approach.
This data management system is currently being expanded to encompass specific
protein families for which additional data on sequence alignments, structure alignments,
kinetics, and disease-related information are available. Our test case uses data on the
protein kinase family of enzymes (Bourne et al., 1997). The system works because
relationships between data items are kept at their most basic level: a small set of indices
relating Entities, Monomers, etc. This avoids the problem of maintaining a fast-evolving
schema, but at the price of a greater coding effort for new types of query.
Acknowledgments
This work was supported by the National Science Foundation grant number BIR
9507625. We are grateful to Calton Pu for his careful review of this manuscript.
References
Bairoch,A. (1993) The PROSITE dictionary of sites and patterns in proteins, its current
status. Nucl. Acid Res., 21, 3097-3103.
Bairoch,A. and Boeckmann,B. (1993) The SWISS-PROT protein sequence databank,
recent developments. Nucl. Acid Res., 21, 3093-3096.
Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer Jr.,E.F., Brice,M.D., Rogers,J.R.,
Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) Protein data bank: a computer based
archival file for macromolecular structures. J. Mol. Biol., 112, 535-542.
Bourne,P.E., Smith,C., Gribskov,M., Berman,H.M. and Taylor,S. (1997) The protein
kinase resource (http://www.sdsc.edu/kinases/).
Chang,W., Shindyalov,I.N., Pu,C. and Bourne,P.E. (1994) Design and application of
PDBlib, a C++ macromolecular class library. CABIOS, 10, 575-586.
Etzold,T., and Argos,P. (1993) SRS - an indexing and retrieval tool for flat file data
libraries. CABIOS, 9, 49-57.
Fermi,G., Perutz,M.F., Shaanan,B. and Fourme,R. (1984) The crystal structure of human
deoxyhaemoglobin at 1.74Å resolution. J. Mol. Biol., 175, 159-174.
Hogue,C.W.V., Ohkawa,H. and Bryant,S.H. (1996) A dynamic look at structures: WWW-Entrez and the molecular modeling database. TIBS, 21, 226-229.
Kabsch,W. and Sander,C. (1983) Dictionary of protein secondary structure: pattern
recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577-2637.
Lee,B. and Richards,F.M. (1971) The interpretation of protein structures: estimation of
static accessibility. J. Mol. Biol., 55, 379-400.
Sander,C., and Schneider,R. (1991) Database of homology-derived protein structures and
the structural meaning of sequence alignment. Proteins, 9, 56-68.
Shindyalov,I.N., Chang,W., Pu,C. and Bourne,P.E. (1994) Macromolecular query
language (MMQL): a prototype data model and implementation. Prot. Eng., 7, 1311-1322.
Shindyalov,I.N. and Bourne,P.E. (1995) WPDB - PC Windows based interrogation of
macromolecular structure. J. App. Cryst., 28, 847-852.
Shindyalov,I.N., Chang,W., Cooper,J.A., and Bourne,P.E. (1995) Design and use of a
software framework to obtain information derived from macromolecular structure data.
Proceedings of the 28th Annual Hawaii International Conference on System Sciences. Vol.
V. Biotechnology Computing, IEEE Computer Society Press, pp.207-216.
Weissig,H. and Bourne,P.E. (1997) Archive of obsolete PDB entries.
http://db2.sdsc.edu/PDBObs/PDBObs.cgi.
Table I. Terminology Used to Describe Features of the Data Management System.
Term            Definition
Atom            Attribute of Subentity describing the properties of a single atom.
Collection      Persistent object describing patterns of macromolecular properties.
Compound        Set of Entities comprising a single experiment, e.g., all members of an NMR ensemble in a PDB file.
data item       All values of a property associated with a given index.
Entity          Single polypeptide chain or DNA strand (Polymer Entity) or unique non-polymer component of Compound.
Entities        In-memory object containing properties of Entities.
index           Protein property reference; consecutive integers from 0 to N are assigned to each instance of index.
Iterators       Type of object used to navigate macromolecular features.
Monomer         Amino acid residue or nucleotide base from which instances of Polymer Entities are formed.
Monomers        In-memory object containing properties of Monomers.
Polymer Entity  Collection of Subentities in sequential order.
Property        Persistent object describing a single macromolecular physical property, where each instance of that property is represented by an index.
Subentity       Specific instance of Monomer forming a Polymer Entity.
Table II. Indices Representing Protein Features.
Index ID  Description
mon       Monomer index used to retrieve Monomers. The names of Monomers, the names of atoms comprising Monomers, and the list of chemical bonds between atoms in a Monomer are examples.
com       Compound index used to retrieve features associated with the whole Compound, for example, information found on PDB SOURCE and HEADER records.
enc       Entity index used to retrieve properties pertinent to both polymer and non-polymer Entities. Atomic coordinates and isotropic temperature factors are examples.
enp       Polymer Entity index used to retrieve properties of Polymer Entities. Amino acid sequence and secondary structure assignments are examples.
imm       Single interval property pattern index is used to retrieve property patterns from Collection objects, based on some threshold.
ip6       Interval property pattern index is used to retrieve property patterns from Collection objects, based on interval values.
Table III. Property Objects Representing Protein Features.
Property Object  Protein Feature

code3.mon        Three letter Monomer code.
code1.mon        One letter Monomer code.
n_bond.mon       Number of chemical bonds in a Monomer.
bond.mon         Chemical bonds in a Monomer.
n_atom.mon       Number of atoms in a Monomer.
atom.mon         Atoms in a Monomer.
prev.mon         Atom connected to the previous Monomer in the chain.
next.mon         Atom connected to the next Monomer in the chain.
type.mon         Monomer type (canonical versus non-canonical).

id.com           PDB compound ID (if available).
compnd.com       Compound name.
source.com       Source of Compound.
date_tex.com     Deposition date in PDB (text).
date_int.com     Deposition date in PDB (integer).
header.com       Functional class of Compound.
auth.com         Authors associated with Compound.
jrnl.com         Primary citation for Compound.
expdta.com       Experimental method used in determining Compound.
ec.com           Enzyme classification code of Compound (if applicable).
res.com          Resolution of Compound (if applicable).
n_enp.com        Number of Polymer Entities in Compound.
i_enp.com        Index of first Polymer Entity in Compound.
n_enc.com        Number of Entities in Compound.
i_enc.com        Index of first Entity in Compound.

name.enc         Entity name.
i_com.enc        Index of Compound for Entity.
i_enp.enc        Index of Polymer Entity for Entity.
n_se.enc         Number of Subentities in Entity.
se.enc           Individual Subentities of Entity.
sen_pdb.enc      Subentity numbers for Entity (PDB assignment).
n_xyz.enc        Number of atoms with coordinates.
xyz.enc          Atom coordinates.
bfac.enc         B factors for atoms.
se_xyz.enc       Subentity to atoms reference array (i.e., se.enc to xyz.enc).
xyz_se.enc       Atoms to Subentity reference array (i.e., xyz.enc to se.enc).

name.enp         Polymer Entity name.
i_com.enp        Index of Compound for Polymer Entity.
i_enc.enp        Index of Entity for Polymer Entity.
type.enp         Polymer Entity type (Protein, DNA, RNA).
mw.enp           Polymer Entity molecular weight.
n_se.enp         Number of Subentities in Polymer Entity.
alpha_c.enp      Alpha helical content of Polymer Entity.
beta_c.enp       Beta structural content of Polymer Entity.
seq.enp          Sequence of Polymer Entity.
k_s.enp          Secondary structure assignment (Kabsch and Sander, 1983) within Polymer Entity.
exp.enp          Solvent exposure (Lee and Richards, 1971) within Polymer Entity.
pol.enp          Local polarity within Polymer Entity.
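
The cross-referencing Property objects in Table III (for example i_enc.com, n_enc.com, and i_com.enc) suggest how navigation between Compounds and Entities can work with nothing more than integer indices. The following Python fragment is a minimal sketch of that idea and is not the authors' implementation: the plain lists standing in for persistent Property objects, the function names, and the toy values are all assumptions made for illustration. It also assumes that the Entities of one Compound occupy consecutive Entity indices, which the "index of first" and "number of" pair implies but the text does not state outright.

```python
# Hypothetical in-memory stand-ins for a few Property objects from Table III.
# Each list is addressed by an integer index (Compound index for *.com,
# Entity index for *.enc). The values are toy data, not taken from the PDB.
properties = {
    "id.com":    ["cmpA", "cmpB"],           # placeholder Compound identifiers
    "i_enc.com": [0, 4],                     # index of first Entity in each Compound
    "n_enc.com": [4, 1],                     # number of Entities in each Compound
    "name.enc":  ["A", "B", "C", "D", "A"],  # Entity names
    "i_com.enc": [0, 0, 0, 0, 1],            # Compound index for each Entity
}


def entities_of_compound(props, com_index):
    """Return the Entity indices that belong to one Compound.

    Assumes the Entities of a Compound are stored contiguously, so the
    first index plus the count fully describes the range.
    """
    first = props["i_enc.com"][com_index]
    count = props["n_enc.com"][com_index]
    return list(range(first, first + count))


def compound_of_entity(props, enc_index):
    """Return the Compound index that owns a given Entity."""
    return props["i_com.enc"][enc_index]


if __name__ == "__main__":
    for com_index, com_id in enumerate(properties["id.com"]):
        names = [properties["name.enc"][i]
                 for i in entities_of_compound(properties, com_index)]
        print(com_id, names)
```

Because every relationship is an integer lookup in an array, adding a new kind of data only requires adding another array over an existing index, at the cost of writing the navigation code by hand.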
Table IV. Classes Representing Protein Features and Typical Query Methods.
Class        Purpose and Typical Methods

DB           Purpose: Container class containing low-level access methods, database path, and access status.
             Typical Methods: setPath(), openFile(), read(), write(), closeFile().

Property     Purpose: Class supporting the representation and storage of property values for a set of objects representing features of Monomers, Compounds, and Entities.
             Typical Methods: Property(), create(), clear(), open(), close(), add(), item(), items().

Collection   Purpose: Class supporting the representation and storage of integer values (e.g., property indices) according to a SetIndex of fixed size. Represents collections of Compounds, Entities, and other indexed objects classified by property pattern.
             Typical Methods: Collection(), create(), clear(), open(), close(), add(), item(), item2(), item4().

Monomers     Purpose: Container class for temporary in-memory storage of major Monomer properties.
             Typical Methods: code1(), code3(), bonds(), atoms().

IMonomer     Purpose: Monomer iterator.
             Typical Methods: ++, code1(), code3().

Entities     Purpose: Container class for temporary in-memory storage of major Entity properties for a subset of Entities.
             Typical Methods: AddCom(), addEnc(), addEnp().

IEntity      Purpose: Entity iterator.
             Typical Methods: ++, name(), nSubentity().

ISubentity   Purpose: Subentity iterator.
             Typical Methods: ++, findAtom(), nAtom().

IAtom        Purpose: Atom iterator.
             Typical Methods: ++, x(), y(), z().

IBond        Purpose: Bond iterator.
             Typical Methods: ++, atom1(), atom2().
Table V. Distribution of Solvent Exposure Sites by Entity for All Proteins in the
PDB.
Solvent exposure is calculated according to the method of Lee and Richards (1971) and
characterized for different fragment sizes. Columns represent percent exposed and rows
the fragment size in number of consecutive amino acids. For example, there are 12
Entities in the PDB that contain 30 or more consecutive residues that are less than
2% exposed.
                                  Percent Exposed
Fragment Size     2     4     6     8    10    12    14    16    18    20
            5  3869  4801  5332  5542  5790  6021  6108  6182  6213  6249
           10  1629  2492  3667  4620  5293  5766  5948  6045  6100  6111
           15   562  1214  2000  3188  4464  5335  5754  5972  6056  6069
           20   127   481  1023  2028  3316  4839  5513  5819  5986  6014
           25    25   181   458  1002  2202  4142  5256  5676  5875  5929
           30    12    55   193   550  1337  3256  4947  5544  5741  5791
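
Each cell of Table V counts the Entities that contain at least one run of consecutive residues, of the given fragment size or longer, whose exposure stays below the given percentage. The following Python fragment is a minimal sketch of that counting rule, not the authors' code: the function names and the toy data are hypothetical, and it assumes that each Entity's solvent exposure is available as one percentage value per residue.

```python
def has_buried_fragment(exposure, size, max_percent):
    """Return True if `exposure` holds at least `size` consecutive values
    that are strictly below `max_percent`.

    `exposure` is assumed to be a per-residue list of percent exposed.
    """
    run = 0
    for value in exposure:
        # Extend the current run of buried residues, or reset it.
        run = run + 1 if value < max_percent else 0
        if run >= size:
            return True
    return False


def count_entities(exposures, size, max_percent):
    """Count the Entities that contain a qualifying buried fragment."""
    return sum(has_buried_fragment(e, size, max_percent) for e in exposures)


if __name__ == "__main__":
    # Toy per-residue exposure profiles for three hypothetical Entities.
    toy_exposures = [
        [1, 1, 1, 1, 1, 50],
        [1, 1, 50, 1, 1, 1],
        [50, 50, 50, 50, 50, 50],
    ]
    for size in (3, 5):
        print(size, count_entities(toy_exposures, size, max_percent=2))
```

Each Entity is counted at most once per cell, however many qualifying fragments it contains, which is consistent with the counts in Table V growing along a row as the exposure threshold is relaxed and shrinking down a column as the fragment size grows.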
Table VI. The General Form and Examples of Idf Query Syntax.
Idf Query and Description

General form:
    idf {input} [and_or] query [[and_or] query] ... [[and_or] query]
        where input is one of the following:
        (i) Entity or Compound identifier, e.g., 4HHB.
        (ii) Complete database.
        (iii) File with a list of Entity or Compound identifiers, e.g., @filename.
        (iv) Compound and Entity for contact maps, e.g., 4HHB:A,4HHB:A
        and query returns a variety of single values, patterns, and reports.

Specific examples:
    idf 2CCX comp
        Return the Compound name for the structure with the PDB code 2CCX.
    idf 2CCX date
        Return the deposition date for the Compound with the PDB code 2CCX.
    idf - comp toxin and date 88 90
        Return Compounds with 'toxin' in the name solved between 1988 and 1990. The "-" signifies the whole database.
    idf - patp exp:0000077777,iso:7777777777 20
        Return 20 Entities with the best property pattern match for the specified combination pattern of solvent exposure and isoelectric point.
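
The general form in Table VI can be read as an input specifier followed by one or more query clauses joined by and/or. The following Python fragment is a rough sketch of splitting a command line along those lines. It is not the Idf parser: the function name parse_idf and the returned tuple layout are hypothetical, and the sketch assumes that tokens are separated by whitespace and that the words "and" and "or" never occur as query arguments. It does not interpret the query keywords or the property pattern strings.

```python
def parse_idf(command):
    """Split an Idf-style command into (input, clauses).

    Each clause is a tuple (connector, keyword, arguments), where connector
    is None, "and", or "or". This is only a tokenizing sketch.
    """
    tokens = command.split()
    if len(tokens) < 3 or tokens[0] != "idf":
        raise ValueError("expected: idf {input} query ...")

    source, rest = tokens[1], tokens[2:]
    clauses = []
    connector = None   # connector that precedes the clause being collected
    current = []       # tokens of the clause being collected

    for token in rest:
        if token in ("and", "or"):
            if current:
                clauses.append((connector, current[0], current[1:]))
                current = []
            connector = token
        else:
            current.append(token)

    if current:
        clauses.append((connector, current[0], current[1:]))
    return source, clauses


if __name__ == "__main__":
    print(parse_idf("idf - comp toxin and date 88 90"))
    # ('-', [(None, 'comp', ['toxin']), ('and', 'date', ['88', '90'])])
```

Keeping the connector with the clause that follows it leaves the evaluation order open to whichever component later combines the results of the individual queries.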