WHAT DOES DATABASE FEDERATION MEAN TO CRYSTALLOGRAPHY?
Philip E. Bourne (1,2,3), Ilya N. Shindyalov (1), Christopher M. Smith (1) and Helge Weissig (1,2)
(1) San Diego Supercomputer Center, P.O. Box 85608, San Diego, CA 92186, USA,
(2) Department of Pharmacology, University of California, San Diego, 9500 Gilman Dr.,
La Jolla, CA 92093, USA and
(3) The Burnham Institute, 10901 North Torrey Pines Road,
La Jolla, CA 92037, USA
More than ever, crystallographers are faced with using a variety of databases, each with its
own content, organization and query interface. Database federation offers the promise of a unified
view of these disparate data and detailed query through a single easy-to-use interface available via
the World Wide Web. This paper discusses current progress towards database federation in
molecular biology and uses the Protein Kinase Resource (PKR) as an example. The paper concludes
that while we are a long way from achieving the computer scientist’s view of database federation,
there are useful resources available today. These resources and what we can expect in the future are
introduced. Throughout, emphasis is placed on the importance of better computer-readable
annotation for achieving true database federation.
1. INTRODUCTION
We begin with an explanation of what is meant in this paper by both “database” and
“federation.” The definitions we use are not the formal definitions from computer science, but
approximations that suit our final goals. Those goals are: (i) to provide insight into what federated
databases are available via the Internet; (ii) to describe the technology behind these resources in
regard to their strengths and weaknesses; and (iii) to provide insight into what we expect from the
next generation of integrated data resources important to crystallographers. As we shall see, without
these approximate definitions there would be little to discuss, since no formal database federations
exist in molecular biology today.
For our purposes a database can be considered as a set of data of some defined atomicity
(level of detail) and scope, for example, a set of protein structures, a set of aligned sequences, and so
on, in which each item can be referenced and is organized in a way that provides efficient answers to
the queries being asked. By this definition the Protein Data Bank (PDB) [5] is a database since it has
scope - all available X-ray crystal structures and NMR structures of proteins and DNA, each with a
unique identifier - and is efficient as distributed if all you wish to do is locate a single structure using
the PDB code found in the file name. The computer operating system can perform this query with a
simple directory lookup*. It is not an efficient organization, however, if you are trying to find all
serine/threonine protein kinase structures solved to better than 2.5 Å resolution. To be answered
efficiently this type of query requires a specific level of data annotation and data organization, since
opening each of over 6000 files (one per structure as of July, 1997) and searching for the appropriate
terms is time consuming even on current workstations. An efficient answer to the protein kinase
question requires that the data be organized in a certain way. Whatever the underlying data model
used to represent the data (we will get to this) the principle is the same. Organize the data such that
like features are grouped together. Thus, all compound names might be grouped together and all
resolutions might be grouped together. Then to find the resolution of a structure the computer need
not wade through a lot of extraneous information which is time consuming. Each member of a
grouping has an associated reference so that the correct protein name can be associated with the
correct resolution. These references go by names like primary key, primary index, and object
identifier depending on whether the underlying data model is a relational database, an indexed
database, or an object oriented database. The Nucleic Acid Database (NDB) [4], the genome database
(GDB) [14] and the Biological Macromolecule Crystallization Database (BMCD) [17] are examples
of databases that use a relational model; the Sequence Retrieval System (SRS) [12,13], Entrez
[21,25] and the obsolete structures database [37] are examples that use the index system, and P/FDM
[18] and the earlier versions of MOOSE [7,30] are examples that use object oriented databases. In
the context of this paper, it is not important to understand these underlying data representations. It is
important, however, to know that these resources exist and contain structural information important
to crystallographers. Readers interested in database theory can refer to [9,10] for useful reading.
What is important here is that data are organized and can be referenced. The organization is
expressed by a schema which defines how items of data are grouped and related together.
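As an illustration only, the grouping of like features described above can be sketched in Python. The identifiers, compound names and resolutions below are invented placeholders rather than real PDB entries, and the two dictionaries stand in for whatever relational, indexed or object-oriented storage a real resource would use.

```python
# Hypothetical entries: identifiers and values are invented for illustration.
entries = [
    {"id": "AAA1", "compound": "serine/threonine protein kinase", "resolution": 2.0},
    {"id": "AAA2", "compound": "serine/threonine protein kinase", "resolution": 2.9},
    {"id": "AAA3", "compound": "lysozyme", "resolution": 1.7},
]

# Group like features together; the shared identifier plays the role of the
# primary key that ties a compound name to the correct resolution.
compound_by_id = {e["id"]: e["compound"] for e in entries}
resolution_by_id = {e["id"]: e["resolution"] for e in entries}


def kinases_better_than(cutoff):
    """Return identifiers of serine/threonine kinases with resolution below cutoff."""
    return sorted(
        entry_id
        for entry_id, resolution in resolution_by_id.items()
        if resolution < cutoff
        and compound_by_id[entry_id] == "serine/threonine protein kinase"
    )
```

With this organization the protein kinase question reduces to a scan over two compact groupings, for example `kinases_better_than(2.5)`, rather than opening one file per structure.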
Given the different types of databases what is a “database federation?” Again, a less than
formal definition, but one which serves our purpose is as follows. A database federation is a
collection of discrete databases to which a user can pose a single query and get an answer through
reference to information contained in all databases. The federation should be able to include simple
ASCII (flat) files, relational, indexed, and object oriented databases.
1.1 Is this Topic Relevant?
As a discipline crystallography, and macromolecular crystallography in particular, is
changing. Thanks to expression systems providing larger amounts of protein than could be obtained
from natural sources, more powerful synchrotrons and better detectors, improved phasing
techniques, semi-automated electron density map fitting, and easier to use and more robust software,
the time to complete a structure determination has been drastically reduced. Evidence for this
progress is found in the exponential growth in the number of macromolecular structures. These
events parallel those that took place in small molecule crystallography 20 years ago. The impact at
that time was that small molecule crystallographers became more diverse in their research interests.
This is now true for today’s macromolecular crystallographers. A macromolecular structure is the
beginning of a line of inquiry that may extend to a detailed study of the functional role of the
macromolecule and macromolecules like it. Such a diverse line of inquiry leads macromolecular
crystallographers to a variety of different databases and at the same time often results in a sense of
frustration since the information they are seeking is spread throughout these databases, each having
its own unique query procedures and descriptions for each item of data. If the promise of database
federation represents a simple means of searching multiple databases, then we argue that it is
relevant to today’s crystallographers.
* Not strictly true since some operating systems limit the number of files that can be accessed in a single directory.
2. EXISTING DATABASE FEDERATIONS
Given our informal definition of a database federation, hyperlinks provide the simplest
database federation. That is, a query of one resource, say for a particular structure, will provide
specific links to information relating to that structure in say the PROSITE database of protein
sequences [2,3]. While the query itself does not return information from the other database - that is
left to the user - it does provide a way to navigate to that information. We refer to this capability as a
loose federation.
2.1 Loose Federations
How is a loose federation implemented? Frequently the curators of a particular database
make available a specification which describes how others may reference data in their database via a
World Wide Web hyperlink. For example, the National Center for Biotechnology Information
(NCBI) provides access to several of their databases in this way, namely:
• The PubMed database, a compendium of medical abstracts that includes all of the National
Library of Medicine's MEDLINE database;
• The protein database (composite of native or translated sequences from GenBank, EMBL,
DDBJ, PIR, SWISS-PROT, PRF, and PDB);
• The nucleotide database (composite from GenBank, EMBL, DDBJ);
• The MMDB 3-D protein structures database (from PDB data);
• The Genomes database (from GDB and elsewhere).
The atomicity provided by the hyperlink is at the level of a discrete entry in one of these
databases - a specific citation, a specific structure, or a specific sequence. Each of these discrete
entities is identified by a unique code, for example, a GenBank accession number which is never
reused. Providing an immutable hyperlink to an item of data in a public database is a valuable
service and many Internet-accessible data resources, including our own Protein Kinase Resource
(PKR) [34], use this facility to link sequences, structures, and citations found in our database to the
primary source of data. So, for example, we make available sequence alignments for casein dependent
kinases, where each sequence in the alignment is linked to its GenBank source, which includes
additional information like the feature table and yet further links to citations and so on.
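A minimal sketch of how such a link specification might be used is given here in Python. The URL templates and the `example.org` host are placeholders, not the actual NCBI linking syntax, and the identifiers are invented; the only point is that a stable, never-reused identifier plus a published template is enough to build the hyperlink.

```python
from urllib.parse import quote

# Hypothetical link templates; a real curator would publish the true syntax.
LINK_TEMPLATES = {
    "sequence": "https://example.org/sequence?id={id}",
    "structure": "https://example.org/structure?id={id}",
    "citation": "https://example.org/citation?id={id}",
}


def make_link(database, identifier):
    """Build a hyperlink to one discrete entry in a remote database."""
    template = LINK_TEMPLATES[database]
    # Percent-encode the identifier so it is safe inside a query string.
    return template.format(id=quote(identifier, safe=""))
```

For example, `make_link("structure", "AAA1")` yields a link that a resource could attach to each sequence in an alignment. Nothing flows back from the target database, which is what makes the federation loose.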
However, in answering specific scientific questions this type of loose federation is limited.
First, the link is uni-directional - there is not necessarily a link from an NCBI database to an
external database like the PKR (the PDB does make these links to the PKR). Second, there is no
guarantee that, once you have made the jump to the database of interest, you will acquire sufficient related
information to provide useful answers to your query. For example, jumping to the PDB based on a
specific sequence of a tyrosine kinase which has a known structure and getting access to all tyrosine
kinase structures is not possible. Third, this form of navigation permits you to acquire information;
however, it does not perform any calculation or information filtering specific to a query. Finally,
the desired link between two databases necessary to answer the query may simply not exist.
The navigational shortcomings introduced above are addressed in part by resources such as
Entrez, SRS, and DBGET [24], which take existing databases and create their own bi-directional
loose federations with improved nomenclature and a better level of atomicity, which are then
accessible for query through the Web. The basic approach used by each of these resources is similar:
parse the flat files that are used to build each individual database and create an index of terms such
that each term can be found in a number of associated databases built from the flat files. The index
then relates identical items of information in multiple databases and allows you to access them
quickly. While this sounds straightforward, it is not. The difficulty is not in building the indices, but
parsing the files and interpreting the contents. Each contributing database has its own flat file format
which is convenient to read, but difficult to parse consistently since there is a temporal inconsistency
in each of these databases. That is, the format has evolved over time and the database may contain
entries in different formats that the parser must deal with. The current PDB is an example with files
in v1.0 and v2.0 of the PDB File Format. This problem is relatively trivial since those formats are
documented and can be interpreted. What is more difficult to resolve are the undocumented
interpretations of what a specific item of data meant to a curator 10 years ago compared to how it is
interpreted today. For example, what constitutes a polypeptide chain in the PDB, and hence is given
a chain identifier, has changed over time [37]. We will come back to this issue of nomenclature.
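The parse-then-index approach can be sketched as follows. This is a toy under stated assumptions: the two one-line "flat files", their record tags and their identifiers are invented, and real parsers must also cope with the format changes and shifting interpretations discussed above. It is not the code used by Entrez, SRS or DBGET.

```python
from collections import defaultdict

# Hypothetical keyword records from two databases with different conventions.
FLAT_FILES = {
    "structures": {"AAA1": "KEYWDS  protein kinase, transferase"},
    "sequences": {"S0001": "KW   protein kinase; ATP-binding."},
}


def terms(line):
    """Drop the leading record tag and split the rest into normalized terms."""
    body = line.split(None, 1)[1]
    pieces = body.replace(";", ",").split(",")
    return [p.strip(" .").lower() for p in pieces if p.strip(" .")]


def build_index(flat_files):
    """Map each term to the (database, entry) pairs in which it occurs."""
    index = defaultdict(set)
    for database, records in flat_files.items():
        for entry_id, line in records.items():
            for term in terms(line):
                index[term].add((database, entry_id))
    return index


index = build_index(FLAT_FILES)
```

Looking up `index["protein kinase"]` then returns entries from both databases at once, and because the index can be followed from either side the linkage is bi-directional.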
Figure 1. Interconnected databases available through SRS.
Parsing is handled differently by Entrez and SRS. Entrez works with a low-level form of
data notation referred to as Abstract Syntax Notation One (ASN.1). Thus data are either collected
directly in ASN.1, through, for example, direct deposition of sequences to the NCBI, or converted
from other formats with programs specific to each database. (These programs are, however, built
from the same library of tools.) SRS, on the other hand, uses an Object Data Definition (ODD)
which is a generic data definition language used to describe the representation found for example in
a PDB entry or GenBank sequence entry, or the flat file representations of approximately 80 other
biological databases for that matter [13]. Once the ODD for a specific type of data is defined, the
same code is used to perform the parsing and indexing by first reading the ODD definition for a
particular database. Figure 1 gives a view of some of the databases accessible through SRS
characterized by the type of information available.
Entrez takes this indexing idea a step further with the concept of neighbors. Neighbors
provide hypertext links between related items of information and are determined by algorithms
which define the likelihood that the items of information are related. How this is computed depends
on the type of information being related. For sequences it depends on a similarity score computed
with BLAST [1] - 100 nearest neighbors are reported; for structures it is computed with VAST [16], a
structure-matching algorithm; and for text it is determined by the frequency with which important terms appear
in a document. It is beyond the scope of this paper to consider the details of how neighbors are
determined in Entrez; see [28] for further details.
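The general idea of precomputed neighbors can be illustrated with a short Python sketch. The pairwise scores here are invented numbers standing in for whatever similarity measure applies (a sequence, structure or text score), and the function is an assumption for illustration, not a description of how Entrez computes or stores its neighbors.

```python
import heapq


def nearest_neighbors(scores, n):
    """Return, for every item, its n highest-scoring neighbors.

    scores maps an unordered pair (a, b) to a symmetric similarity score.
    """
    by_item = {}
    for (a, b), score in scores.items():
        by_item.setdefault(a, []).append((score, b))
        by_item.setdefault(b, []).append((score, a))
    return {
        item: [name for _, name in heapq.nlargest(n, pairs)]
        for item, pairs in by_item.items()
    }


# Hypothetical similarity scores between three entries.
scores = {("A", "B"): 90, ("A", "C"): 40, ("B", "C"): 70}
neighbors = nearest_neighbors(scores, 1)
```

Because the neighbor lists are computed when the resource is built, following a neighbor link at query time is a simple lookup.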
2.2 Tight Federations
While tight federations are an active area of computer science research, we consider only their
practical implementation. Refer to [10] for additional reading on research into tight
federations in molecular biology. A tight federation is defined here as one in which a single query can be made of
multiple data items in any of the databases that comprise the federation. The query language should
have considerable expressive power and be independent of the underlying database structure. The
results of a query should be easy to interpret and form the basis for further study as needed.
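To make the definition concrete, the following Python sketch poses one query to two differently organized sources, a relational table (via the standard `sqlite3` module) and a flat text file, and merges the answers. All class names, identifiers and values are hypothetical, and the sketch ignores the hard parts of a real tight federation, such as schema mapping and an expressive query language.

```python
import sqlite3


class RelationalSource:
    """A relational database holding resolutions (invented values)."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE structure (id TEXT PRIMARY KEY, resolution REAL)"
        )
        self.conn.executemany(
            "INSERT INTO structure VALUES (?, ?)", [("AAA1", 2.0), ("AAA2", 3.1)]
        )

    def lookup(self, entry_id):
        row = self.conn.execute(
            "SELECT resolution FROM structure WHERE id = ?", (entry_id,)
        ).fetchone()
        return {} if row is None else {"resolution": row[0]}


class FlatFileSource:
    """A flat ASCII file holding compound names (invented values)."""

    TEXT = "AAA1 protein kinase\nAAA2 lysozyme\n"

    def lookup(self, entry_id):
        for line in self.TEXT.splitlines():
            key, _, name = line.partition(" ")
            if key == entry_id:
                return {"compound": name}
        return {}


def federated_lookup(entry_id, sources):
    """Pose one query to every source and merge the answers."""
    answer = {"id": entry_id}
    for source in sources:
        answer.update(source.lookup(entry_id))
    return answer
```

The caller of `federated_lookup` never sees which storage model answered which part of the query, which is the independence from underlying database structure asked for above.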
It must be said that while technology exists to create tight federations, it is people and
policies that prevent the widespread availability of database federations today. Databases, or more
specifically the data they contain, are coveted and each curator has their own ideas of how a database
should be queried and what is the exact definition of each data item within the database. The
government funding of large and small biological databases, in the USA at least, has not
traditionally fostered database federation - the federal agencies themselves covet the databases they
fund. Efforts in Europe show promise. The European Union (EU) Bridge Database Project
Consortium funded by the European Community produced a combined schema [19] for a
macromolecular structure database based on the consortium’s experience in developing individual
structure databases, notably SESAM [22], IDITIS [36] and P/FDM [18]. While not a database
federation, it does represent a collective effort. This schema has yet to be implemented but is likely to
be prominent in the European Bioinformatics Institute’s (EBI) plans for a structure database.
3. CURRENT APPROACHES - THE PROTEIN KINASE RESOURCE
The approach of the EU Bridge project in taking the best from existing schema and
implementing those in a new schema is today’s norm. To better understand this process we take as
an example our own on-going work in developing the Protein Kinase Resource (PKR) [34] which
integrates sequence data from GenBank, structure data from PDB, genetic information from OMIM
[27], local laboratory data, enzymatic data extracted from the literature, and other miscellaneous data
of use to researchers interested in protein kinases. There are various methodologies for developing
integrated resources like the PKR. From a user’s perspective the important issues are the provision of
accurate and comprehensive data and making it meaningful to access. It is these issues which are the
focus here, not the details of the database implementation which are available elsewhere [31].
The PKR is one of a new generation of data resources. We consider existing resources as
broad but shallow, that is, they cover all known instances, for example, all protein structures or all
protein sequences, but, by necessity, do so in limited detail. New resources like the PKR are narrow
but deep. That is, they cover a particular sub-topic, in this case one protein family, but do so in
greater detail, bringing together data from multiple broad data resources and elsewhere. Consider the
basic approach we are adopting in developing the PKR (Figure 2) on a step-by-step basis.
[Figure 2 diagram. Recoverable labels: Raw Data (Sequence, Structure, Literature, Genetics, Lab. Data); Filters; Dictionaries; Load; Property Object Database; Storage & Memory Model; Query Library; Viz. Tools; Query; Display.]
Figure 2. The topology of the Protein Kinase Resource (PKR).
Step 1:
Data from existing flat file formats (e.g., PDB, GenBank) are parsed, analogous to the
parsing performed for SRS and Entrez. This is currently performed with special programs rather
than an ODD approach.
Step 2:
Each parsed data item is then compared against an appropriate dictionary to confirm its
existence and to validate that its data type and value are appropriate. For dictionary checking we use
the Self-defining Text Archival and Retrieval (STAR) data representation [8,15] defined for use in
crystallography, but of general applicability. This approach lets us leverage the extensive efforts
that have already been put into the macromolecular Crystallographic Information File (mmCIF)
dictionary which contains approximately 3,200 terms [6]. Additional dictionaries were written to
cover primary sequence and sequence feature tables and enzymology [35]. Computer-readable
dictionaries are critical for sustaining a consistent nomenclature in an evolving scientific discipline
and their importance cannot be overemphasized. Not only do the dictionaries specify the state of
knowledge of a given field at a particular point in time, they provide an explicit record of what exists
in that database at a given point in time. Until now, too much of that information was in the curator's
head and not documented, leading to inconsistencies in representation and misuse by programmers
writing code to use these data. Each programmer, without the benefit of a definitive guide, is left to
make their own interpretation of items of data. The lack of consistency that results has become
apparent in our temporal study of the PDB [37]. Two examples are given here to highlight the type
of problem which we anticipate exists in all major long-lived data resources and is not peculiar to the
PDB.
Figure 3a shows a query for hemoglobin in the obsolete structures database and indicates
that this macromolecule has existed in the PDB since 1975. However, in 1984 1HHB was replaced
simultaneously by 3 entries (2HHB, 3HHB, 4HHB), all derived from the same data set (Figure 3b).
Closer inspection reveals that while these entries are all correct within the limits of the experimental
data they show significant differences in their coordinate sets. This difference becomes apparent
when considering how each structure compares to ideal geometric values taken from small molecule
data [11]. Figure 3c indicates that 2HHB is a highly constrained model, whereas 4HHB is less
constrained, and 3HHB is a dimer to which a non-crystallographic transformation has been applied
to restore the tetramer. While the latter situation can be automatically ascertained when building a
database through the distinct MTRIXn and SCALEn PDB records, the details of the differently
constrained models 2HHB and 4HHB are only partially described in REMARK records and in any
case such free text is indecipherable by a parser and hence the information is lost when building a
database, unless entered manually - a time consuming and expensive process.
The second example (Figure 4) shows the distribution of all structures in the obsolete
structures database and their corresponding entries in the current PDB distribution based upon
changes in the total number of atoms from one release to the next. Surprisingly, 15% of entries have
fewer atoms than their corresponding earlier version. Close inspection reveals that this is caused
predominantly by an overdetermination of water being corrected in a later version of the structure.
Neither this change in water content, nor the criteria used to define water cut-off, is provided in
these PDB entries. If it is reported at all, it is reported via free text REMARK records and again cannot be
consistently parsed and included in a database, and so vital information is lost.
Dictionaries provide strict definitions for all items of data to be included in the database and
hence provide a high level of machine usable and consistent annotation. The two examples of
temporal inconsistency, each an inability to automatically characterize change from one version of a
structure to the next, could have been avoided if terms describing these changes had been included in
a computer-readable dictionary.
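The kind of check performed in Step 2 can be sketched in Python as follows. The three item names and their constraints are invented for illustration and are not taken from the mmCIF dictionary or from the STAR representation; a real dictionary carries far richer definitions.

```python
# Hypothetical computer-readable dictionary: item name -> type and limits.
DICTIONARY = {
    "resolution": {"type": float, "min": 0.0},
    "chain_id": {"type": str},
    "atom_count": {"type": int, "min": 1},
}


def validate(item, value):
    """Return None if the item is acceptable, else a short error message."""
    definition = DICTIONARY.get(item)
    if definition is None:
        return f"{item}: not defined in dictionary"
    if not isinstance(value, definition["type"]):
        return f"{item}: expected {definition['type'].__name__}"
    if "min" in definition and value < definition["min"]:
        return f"{item}: below minimum {definition['min']}"
    return None
```

Under this sketch a parsed item such as a water cut-off criterion is rejected with "not defined in dictionary" until a term describing it is added, which makes the gap explicit instead of leaving it buried in free text.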
Step 3:
A characteristic of the STAR dictionaries when written using the Dictionary Definition
Language (DDL) developed for macromolecular crystallography [38] is the notion of categories
which group important data items together based on their structural relevance. We use these
categories and new categories defined in additional dictionaries to represent indices in a property
object database as described elsewhere [31]. In summary, the indices correspond to protein features,
polypeptide chains, monomers, compounds etc. Each index then has multiple properties associated
with it, for example, solvent exposure and secondary structure assignments for every residue in a
polypeptide chain. Properties are maintained one per file. For example, solvent exposure for every
residue found in all available protein kinase structures is in a single file. The index to each entry
corresponds to the polypeptide chain and this is followed by the exposure values for each amino acid
in that polypeptide chain. Properties can be retrieved very quickly using search methods that return
the indices of polypeptide chains that include the search pattern. Collection objects group indices all
having the same property. For example, all polypeptide chains containing a sequence of highly
buried residues constitute a collection object. In this way the time consuming step becomes the
building of the database rather than the query to find structures with buried residues, which can be
performed in real time.
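A rough Python sketch of this organization follows. The chain identifiers, the one-letter encodings ('b' buried, 'e' exposed; 'H' helix, 'E' strand, 'C' coil) and the use of regular expressions as the search method are all assumptions made for illustration; the implementation details of the property object database are given in [31].

```python
import re

# One table per property, keyed by polypeptide chain index (invented data).
EXPOSURE = {
    "AAA1:A": "eebbbbbee",
    "AAA2:A": "ebebebebe",
    "AAA3:B": "bbbbbeeee",
}
SECONDARY = {
    "AAA1:A": "CCHHHHHCC",
    "AAA2:A": "CCCCCCCCC",
    "AAA3:B": "EEEEECCCC",
}


def search(property_table, pattern):
    """Return the set of chain indices whose property string matches pattern."""
    regex = re.compile(pattern)
    return {chain for chain, values in property_table.items() if regex.search(values)}


# Each result is a collection: indices that share a property.
buried_runs = search(EXPOSURE, "b{5,}")
helical = search(SECONDARY, "H{4,}")
buried_and_helical = buried_runs & helical
```

Since collections are plain sets of indices, results from different property files combine with ordinary set operations, and the per-residue values only need to be read once per property.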
Figure 5. The Compare3D Java applet.
Step 4:
Query methods are hidden from the user by invoking queries through a Web interface, for
example a simple Web form or a more sophisticated Java applet. Figure 5 illustrates an applet we
have written [32] for the comparative analysis of proteins contained in the PKR (or for any other
group of structurally related proteins). Protein sequences are selected from the property object
database and a Smith-Waterman alignment [33] is applied. The C alpha coordinates from the
corresponding residues in the sequence alignment can then be superimposed in a least squares
minimization according to the method of Hendrickson [20]. Structure rendering (translation,
rotation, zooming, color coding, atom picking) can be performed and contact distance difference
matrices examined. While useful, Java applets like this one nevertheless fall short of the rendering
capability available with most molecular graphics programs. We anticipate that this situation will
change with the advent of software libraries like Java-3D which provide capabilities such as
predefined graphics primitives, shadowing, and clipping currently found in popular Fortran and C
libraries such as OpenGL.
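For readers unfamiliar with the alignment step, a compact Python sketch of Smith-Waterman local alignment is given here. The scoring values (match +2, mismatch -1, linear gap -1) are simplifying assumptions, not the substitution matrix or gap model used by the applet, and the superposition step is not shown. The returned index pairs are the corresponding residues whose C alpha coordinates would be passed on to a least squares fit.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best local score, list of aligned (i, j) index pairs)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the dynamic programming matrix, clamping scores at zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


score, pairs = smith_waterman("AAKVLGG", "TTKVLPP")
```

In this toy case the shared segment KVL is found and `pairs` lists the zero-based positions of the matched residues in each sequence.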
4. THE FUTURE
The query language and visualization tools found in resources like PKR extend the indexing
principles used in SRS and Entrez. Other resources are being developed which take native flat file
data and load those data into specialized databases that support complex queries and modeling of
more complete systems. These types of resources are in their infancy, but appear as the next logical
step in federated database development. Examples of this type of resource are the genetic and
metabolic description of E. coli found in EcoCyc [23] and the metabolic maps found in PUMA [29]
and used in describing a variety of organisms.
5. CONCLUSION
A true database federation in the computer science sense does not currently exist in
molecular biology. That is, queries that truly exploit the schema of multiple databases and that can
evolve efficiently as the schema of the underlying databases continues to evolve do not exist. Rather,
efforts have tended towards reorganizing data from multiple sources into a single index-based
system that lets the user query that index. This methodology has proven to be efficient and can be
expected to remain so in the face of exponentially increasing data growth. Nevertheless, this
approach limits the types of queries that can be asked and more formal data models and query
languages are needed that better support iterative query and can model complete systems. It is a
natural tendency to model these data resources after the subcellular, cellular, and higher-order
physiology that as biologists we understand. Unfortunately, these model systems are as complicated
as the biological systems they represent and we are only just beginning to understand how
technically this might be done [26]. Close collaboration between computer scientists and structure
and molecular biologists is needed to advance this cause. These collaborations are beginning, cross-disciplinary students are being trained, and the resultant discipline of bioinformatics is beginning to
emerge [39]. We believe that the end result will be of great benefit to crystallographers.
ACKNOWLEDGEMENTS
Our own work on temporal databases, property object models, and the protein kinase
resource is supported by NSF grants BIR 9630339 and ASC 8902825. Our own work on mmCIF is
supported by NSF grant BIR 9310154 and the DOE.
REFERENCES
[1] S.F. Altschul, W. Gish, W. Miller, E.W. Myers and D.J. Lipman, J. Mol. Biol. 215 (1990), 403.
[2] A. Bairoch and B. Boeckmann, Nucl. Acids Res. 21 (1993), 3093.
[3] A. Bairoch, Nucl. Acids Res. 21 (1993), 3097.
[4] H.M. Berman, W.K. Olson, D.L. Beveridge, J. Westbrook, A. Gelbin, T. Demeny, S.H. Hsieh,
A. R. Srinivasan and B. Schneider, Biophys. J. 63 (1992), 751.
[5] F. Bernstein, T.F. Koetzle, G.J.B. Williams, E.F. Meyer Jr., M.D. Brice, J.R. Rogers, O.
Kennard, T. Shimanouchi and M. Tasumi, J. Mol. Biol. 112 (1977), 535.
[6] P.E. Bourne, H.M. Berman, B. McMahon, K. Watenpaugh, J. Westbrook and P.M.D.
Fitzgerald, Methods in Enzymology 277 (1997), 571.
[7] W. Chang, I.N. Shindyalov, C. Pu and P.E. Bourne, CABIOS 10 (1994), 575.
[8] A. Cook and S.R. Hall, J. Chem Inf. Compt. Sci. 31 (1992), 326.
[9] C.J. Date, An Introduction to Database Systems, Sixth Edition (Addison-Wesley, Reading,
1994).
[10] S.B. Davidson, C. Overton and P. Buneman, J. Comp. Biol. 2 (1995), 557.
[11] R.A. Engh and R. Huber, Acta Cryst. A47 (1991), 392.
[12] T. Etzold and P. Argos, CABIOS 9 (1993), 49.
[13] T. Etzold, A. Ulyanov and P. Argos, Methods in Enzymology 266 (1996), 114.
[14] K.H. Fasman, S.I. Letovsky, P. Li, R.W. Cottingham and D.T. Kingsbury, Nucl. Acids Res. 25
(1997), 72.
[15] S.R. Hall and N. Spadaccini, J. Chem Inf. Compt. Sci. 34 (1994), 505.
[16] J.F. Gibrat, T. Madej and S.H. Bryant, Curr. Opin. Struct. Biol. 6 (1996), 377.
[17] G.L. Gilliland, M. Tung, D.M. Blakeslee and J. Ladner, Acta Cryst. D50 (1994), 408.
[18] P.M. Gray, N.W. Paton, G.J. Kemp and J.E. Fothergill, Protein Eng. 3 (1990), 235.
[19] P.M. Gray, G.J.L. Kemp, C.J. Rawlings, N.P. Brown, C. Sander, J.M. Thornton, C.M. Orengo,
S.J. Wodak and J. Richelle, TIBS 21 (1996), 251.
[20] W.A. Hendrickson, Acta Cryst. A35 (1979), 158.
[21] C. Hogue, H. Ohkawa and S.H. Bryant, TIBS 21 (1996), 226.
[22] M. Huysmans, J. Richelle and S.J. Wodak, Proteins 11 (1991), 59.
[23] P. Karp and S. Paley, J. Comp. Biol. 3 (1996), 191.
[24] H. Migimatsu and W. Fujibuchi, The DBGET Resource (1997),
http://www.genome.ad.jp/dbget/dbget.html.
[25] National Center for Biotechnology Information, Linking to Entrez Databases (1996),
http://www3.ncbi.nlm.nih.gov/Entrez/linking.html.
[26] B. Palsson, Nature Biotech. 15 (1997), 3.
[27] P. Pearson, C. Francomano, P. Foster, C. Bocchini, P. Li and V. McKusick, Nucl. Acids Res.
22 (1994), 3470.
[28] G.D. Schuler, J.A. Epstein, H. Ohkawa and J.A. Kans, Methods in Enzymology 266 (1996),
141.
[29] E. Selkov, S. Basmanova, T. Gaasterland, I. Goryanin, Y. Gretchkin, N. Maltsev, V.
Nenashev, R. Overbeek, E. Panyushkina, L. Pronevitch, E. Selkov, Jr. and I. Yunus, Nucl. Acids
Res. 24 (1996), 26.
[30] I.N. Shindyalov, W. Chang, J.A. Cooper and P.E. Bourne, Proceedings of the 28th Annual
Hawaii International Conference on System Sciences (Vol. V. Biotechnology Computing, 1995)
IEEE Computer Society Press, p. 208. http://www.sdsc.edu/moose.
[31] I.N. Shindyalov and P.E. Bourne, CABIOS 13 (1997), In Press.
[32] I.N. Shindyalov and P.E. Bourne, The Compare3D Java Applet (1997)
http://xtal1.sdsc.edu/misha/compare_3d.html.
[33] T.F. Smith and M.S. Waterman, J. Mol. Biol. 147 (1981), 195.
[34] C. Smith, M. Gribskov, I.N. Shindyalov, S.S Taylor, L. Ten Eyck, S. Veretnik and P.E. Bourne,
TIBS (1997), Submitted.
[35] C. Smith and P.E. Bourne, Enzymology Protein Information File Dictionary (1997),
http://www.sdsc.edu/Kinases/development/PIF/SFBrowser.html.
[36] M.J. Sternberg and S.A. Islam, Biochem. Soc. Trans. 17 (1989), 845.
[37] H. Weissig and P.E. Bourne, Archive of obsolete PDB entries (1997),
http://db2.sdsc.edu/PDBObs/PDBObs.cgi.
[38] J. D. Westbrook and S. R. Hall. A Dictionary Description Language for Macromolecular
Structure, Report NDB-110 (Rutgers University, New Brunswick, NJ, 1995).
[39] N. Williams, Science 275 (1997), 301.
LIST OF FIGURES
Figure 3. Queries from the obsolete structures database; a) the chronology of hemoglobin; b) details of different hemoglobin
replacements; c) color-coded deviation from ideal geometry (bond lengths, bond angles, and dihedral angles) averaged for each
residue, green closest, red farthest from ideality.
Figure 4. The distribution of PDB replaced structures showing changes in the number of atoms.