Current Proteomics, 2004, 1, 49-57
The Protein Data Bank: A Case Study in Management of Community Data
Helen M. Berman1,*, Philip E. Bourne2 and John Westbrook1
Research Collaboratory for Structural Bioinformatics (RCSB) 1Department of Chemistry and
Chemical Biology; Rutgers, The State University of New Jersey, 610 Taylor Road, Piscataway,
NJ 08854-8087, USA and 2Department of Pharmacology and San Diego Supercomputer Center,
University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0537, USA
Abstract: As the sole repository for three-dimensional structure data of biological macromolecules, the Protein Data Bank (PDB) is an important resource for research in the academic,
pharmaceutical, and biotechnology sectors. Over the years, the methods and speed of structure
determination have changed as technology has improved. At the same time the methods for data
collection, archiving, and distribution of the structural data in the PDB have also evolved.
Concurrently, the community of data depositors and users has expanded. As of October 2003, the PDB archive contains
approximately 23,000 released structures and the website receives over 160,000 hits per day. The lessons learned from the
development of the PDB may be applicable to the ongoing development of new data and knowledge resources in
proteomics.
Key Words: Structural biology, bioinformatics, databases, Protein Data Bank.
INTRODUCTION
We are witnessing a data explosion in biology. The
recently completed Human Genome and model organism
projects have produced approximately 250,000 sequences from
over 1,000 genomes (http://www.ncbi.nlm.nih.gov/entrez/
query.fcgi?db=Genome). These sequences contain the
instructions for building molecules and controlling their
functions. The availability of all of these sequences has
given rise to a myriad of research projects designed to help
us understand relationships among sequence, structure, and
function. An important sequel to these sequencing projects
has been the structural genomics initiative whose goal is to
determine the structures of all proteins on a genomic scale in
a high throughput manner (Nature Structural Biology, 2000).
Now, proteomics projects are emerging whose goals are to
determine the functions of the expressed proteins in a variety
of systems using a wide spectrum of techniques.
These projects create formidable data management
challenges at every level. High throughput pipelines require
laboratory information management systems (LIMS) that
keep track of all the experiments in the pipeline. From these
LIMS it should be possible not only to cull the final results,
but also to analyze the systems themselves so as to improve the
experimental processes. Archival databases then capture the final results of
sequence and structure analyses to make the data available to
the very wide community of users. Analyses of the contents
of archival databases yield new results that are used to create
specialized databases of information often called knowledge
bases. Information among all of these databases needs to be
easily exchanged, and in the long term, made interoperable.
*Address correspondence to this author at the Research Collaboratory for
Structural Bioinformatics (RCSB), Department of Chemistry and Chemical
Biology; Rutgers, The State University of New Jersey, 610 Taylor Road,
Piscataway, NJ 08854-8087, USA; Tel: 732-445-4667; Fax: 732-445-4320;
E-mail: [email protected]
The Protein Data Bank (PDB) (Berman et al., 2000) was
created more than thirty years ago as a repository for the
results of crystal structure analyses. It has evolved into a
sophisticated resource that contains not only the results of
structure analyses but information about experimental
procedures, derived information about structures and links to
related data sources. The underlying architecture of the
current instantiation of the PDB makes it scalable and
generalizable. The lessons learned from the development of
the PDB may be applicable to the ongoing development of
new data and knowledge resources in proteomics.
HISTORY
The PDB began during the late 1960’s and early 1970’s
with community discussions about the need for such a
resource. Protein crystallography was still in its infancy, but
it was apparent to the producers of these structures as well as
the potential users that every structure contained valuable
information that needed to be archived and maintained for
posterity. In June 1971, the two communities attended the
Cold Spring Harbor Symposium on Quantitative Biology
(Cold Spring Harbor Laboratory Press, 1972) and agreed that the
time was right to create the PDB. In October 1971, an article
appeared in Nature New Biology (1971) and other journals
announcing the formation of the PDB.
In the early days of the PDB, letters were sent to each
author of a structure paper asking for the coordinates to be
deposited. Some complied; some did not. The archive grew
slowly. As time went on there were profound changes in
technology as well as attitudes towards data sharing.
The process of X-ray crystallography involves a series of
steps (Fig. 1). Initially, a target is selected for structure
determination. Subsequently, the material is purified and
then crystallized. One or more crystals are put in front of an
X-ray beam and the diffracted intensities are collected on a
detector. These data are then analyzed using a variety of
computational methods that ultimately lead to the
determination of the molecular structure.

Fig. (1). The steps involved in X-ray crystal structure determination, from target selection to publication.
Over time the technology has changed for every step in
the process. The effects of changes in technology began to be
evident in the 1980’s. The revolution in molecular biology
meant that rather than extracting proteins from the original
organisms, it was possible to clone and express large quantities of pure proteins using recombinant DNA technology.
Rather than attempting to crystallize materials using batch
methods, methods for multiple crystallization trials began to
evolve. Crystallization kits became the forerunners of robots.
The availability of synchrotron sources allowed for much
more intense radiation sources. Multi-wire detectors and then
image plates sped up the collection of the diffracted
intensities. The development of multiple anomalous
diffraction (MAD) methods (Hendrickson, 1991) allowed
crystallographers to capitalize on the tunable wavelengths of
synchrotron sources. Faster computers with more storage furthered
all aspects of the analysis. By the end of the 1980’s, crystal
structure analysis was more straightforward, but not yet
automated.
Attitudes about data sharing also changed over time. In
small molecule crystallography, all coordinates were printed
in journal articles, and Acta Crystallographica published the
structure factors as well. The coordinate data were extracted
by the Cambridge Structural Database (Allen et al., 1991)
and archived. The size of the data sets made a similar
procedure impractical for the protein crystallography
community. Protein structures took years to determine, and
many felt that the data needed full analysis by the author
before others could benefit from them.
In addition, it was felt that only experts would be able
to correctly take into account the errors inherent in these
early structures. So although many people submitted data to
the PDB to let the public examine them, some did not. The
recognition of the importance of three-dimensional structure
to the public health had a profound effect on attitudes about
depositing structural data. Some representatives of the
funding agencies felt strongly that if they supported the
determination of biological structures, the data had to be
public. Individual scientists, most notably structural
biologists involved in some of the earliest structure
determinations (Frederic Richards and Richard Dickerson),
were very vocal in their attitudes that these data must be
shared. Committees were formed, petitions were written, and
in 1989, a formal set of guidelines established the rules by
which data would be publicly available (International Union
of Crystallography, 1989). These guidelines required
deposition of all coordinates, but did allow for a “hold” on
the release of the data for a fixed period beyond publication.
The requirements were adopted by journals and by funding
agencies. Over time the guidelines have changed so that now
virtually all structures are deposited in the PDB and released
upon publication. Currently, only 8% of PDB depositions are
held past publication. Experimental data are now deposited
with approximately 67% of structures. In addition, the
sequences of more than half the structures that are deposited
are released prior to publication.
The effects of the evolution in technology and changes in
community attitudes toward data sharing are evident in the
growth curve of the PDB (Fig. 2).
EVOLUTION OF THE PROTEIN DATA BANK
The technology used by the Protein Data Bank to archive
and manage the data in the repository has changed. The first
data sets were entered on punched cards. Later, structures
were submitted on magnetic tape. Correspondence was done
by postal mail. As time went on, submissions were made via
FTP, then electronic mail, and by the late 1990’s via web
tools. Distribution of the data followed a similar pattern with
magnetic tapes being the first method, followed by FTP, and
now the web. The archive is also distributed via CD-ROM.
The PDB file format was established early on (Bernstein
et al., 1977) to contain the coordinates and related
information, and still endures. In the late 1990’s a new
format evolved called the macromolecular Crystallographic
Information File (mmCIF) (Bourne et al., 1997) that
conforms to well-documented standards and facilitates
automated data management. The organization of the data
itself was in the form of flat files that are still the main form
of data exchange. Since 1998, relational databases based on
mmCIF have allowed for more efficient query of the data. In
the next sections we review the systems that are currently
(Berman et al., 2000) in place in the PDB pipeline for data
collection, processing, archiving, and query (Fig. 3).
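The fixed-column PDB format mentioned above can be illustrated with a short parsing sketch. The column positions below follow the published PDB format, but the parser itself is a hypothetical example written for this discussion, not PDB software:

```python
# Sketch: reading one fixed-column ATOM record from a legacy PDB file.
# Column positions (1-based, shown in comments) follow the PDB format
# specification; this is an illustrative parser, not PDB software.

def parse_atom_record(line: str) -> dict:
    """Extract the main fields of a PDB ATOM/HETATM record."""
    return {
        "serial":  int(line[6:11]),     # columns 7-11: atom serial number
        "name":    line[12:16].strip(), # columns 13-16: atom name
        "res":     line[17:20].strip(), # columns 18-20: residue name
        "chain":   line[21],            # column 22: chain identifier
        "res_seq": int(line[22:26]),    # columns 23-26: residue sequence number
        "x":       float(line[30:38]),  # columns 31-38: x coordinate (angstroms)
        "y":       float(line[38:46]),  # columns 39-46: y coordinate
        "z":       float(line[46:54]),  # columns 47-54: z coordinate
    }

record = "ATOM      1  N   MET A   1      11.104   6.134  -6.504  1.00 20.00           N"
atom = parse_atom_record(record)
print(atom["name"], atom["res"], atom["x"])  # N MET 11.104
```

The rigidity of these fixed columns is part of what motivated the move to the self-describing mmCIF format, where each value is labeled by a dictionary-defined item name rather than a column range.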
DATA COLLECTION AND PROCESSING
The challenge in this part of the PDB pipeline is to
collect accurate information quickly and in the most efficient
way possible for the depositor. The information required
includes the coordinates of all the atoms in the macromolecule, the chemical description of the various molecules in the
crystal, experimental information about the structure determination, and a structural description of the biological molecule.
Fig. (2). Growth chart of the PDB showing the total number of depositions per year (in gray) and the total number of structures available per
year (in black).
Fig. (3). Data flow diagram for data processing. In the first step, the depositor interacts with the PDB web-based deposition system. During
the deposition session, the depositor uploads coordinate and experimental data files, and adds additional data items. He/she may also run
validation checks on the data. At the end of the deposition session, a PDB ID code is immediately generated for the deposition. A PDB
annotator reviews each deposited entry. At the end of this review (step 2), the processed entry is returned to the depositor along with a
validation summary and any questions raised during the annotation process. In corresponding with depositors, new or unusual scientific
issues that arise during annotation or issues of policy are reviewed by a senior annotator. Any corrections provided by the depositor are
integrated into the entry (step 3). Entries are then loaded into a relational database and all data processing materials, including depositor
correspondence, are saved in a data processing archive. The finished entry is returned to the depositor for approval (step 4). According to
their release status, data files and database updates are sent to the SDSC distribution site on a weekly basis. After an entry is released, the
depositor may provide further corrections or additional information, such as annotation about the molecule’s biological function (step 5).
This information is incorporated into the entry and the revised entry is re-released. This is a common situation for structural genomics entries
in which the function is not known at the time a structure is solved, but then becomes known after further analysis of the structure.
The mmCIF dictionary is the key element in enabling this
part of the process. This dictionary contains the
approximately 2,500 definitions for terms used to describe
the crystallographic experiment. The dictionary definition
language (DDL) provides for explicit specification of type,
range, and relationships of all the terms. It is structured in
such a way that data files that conform to this syntax can be
readily loaded into a database.
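The idea of dictionary-driven checking can be sketched as follows. The item definitions below are invented, simplified stand-ins for entries in the real mmCIF dictionary, which declares type, range, and relationships for each of its roughly 2,500 terms:

```python
# Sketch of dictionary-driven validation in the spirit of the mmCIF DDL:
# each data item carries a declared type and permitted values, so one
# generic checker can validate any conforming file. The definitions here
# are illustrative stand-ins, not real mmCIF dictionary entries.

DICTIONARY = {
    "cell.length_a":        {"type": float, "range": (1.0, 1000.0)},
    "exptl.method":         {"type": str,   "enum": {"X-RAY DIFFRACTION", "SOLUTION NMR"}},
    "refine.ls_d_res_high": {"type": float, "range": (0.5, 10.0)},
}

def validate(item: str, value) -> list[str]:
    """Return a list of violations of the dictionary definition for one item."""
    errors = []
    defn = DICTIONARY.get(item)
    if defn is None:
        return [f"{item}: not defined in dictionary"]
    if not isinstance(value, defn["type"]):
        errors.append(f"{item}: expected {defn['type'].__name__}")
    if "range" in defn and isinstance(value, (int, float)):
        lo, hi = defn["range"]
        if not lo <= value <= hi:
            errors.append(f"{item}: {value} outside [{lo}, {hi}]")
    if "enum" in defn and value not in defn["enum"]:
        errors.append(f"{item}: {value!r} not an allowed value")
    return errors

print(validate("refine.ls_d_res_high", 1.9))  # []
print(validate("cell.length_a", -5.0))        # flags out-of-range value
```

Because the checks are driven entirely by the dictionary, extending the archive to new kinds of data is a matter of adding definitions rather than rewriting the checker.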
An editor called the AutoDep Input Tool (ADIT) is built
on top of this dictionary (Fig. 4).

Fig. (4). The data processing architecture has five key components: data dictionaries, MAXIT, validation, data views, and a database loader. The data dictionaries contain the precise definitions for all the data items archived by the PDB. MAXIT is an application that performs format interchange, integrity checks, and nomenclature checks. The validation application checks geometrical features and compares the coordinates against the experimental data. The data views determine which data items are visible in the ADIT interface. The database loader loads the final data files into a relational database.

ADIT accepts data in either
PDB or mmCIF format. Data that are entered into ADIT are
immediately translated into mmCIF files. Once loaded, the
data are subjected to a series of computerized validation
procedures: all the nomenclatures of the atoms and residues
are standardized; the sequence records are checked for
consistency and are checked against the sequence databases;
the geometry of the macromolecule is checked against
known standards for distances, angles, and chirality;
stereochemical clashes and the chemistry of the small
molecule ligands are checked; and the coordinate data are
checked against the experimental data. The results of these
checks are reviewed by the annotation staff and then sent to
the author for review. Once approved by the author, the data
are ready for release.
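One of the geometry checks described above can be sketched as a comparison of observed bond lengths against standard target values. The target values and tolerance below are illustrative, not the PDB's actual validation parameters:

```python
# Sketch of one geometry check of the kind the annotation pipeline applies:
# compare observed bond lengths against standard target values and flag
# large deviations. Targets and tolerance are illustrative only.
import math

STANDARD_BONDS = {("N", "CA"): 1.458, ("CA", "C"): 1.525, ("C", "O"): 1.231}  # angstroms
TOLERANCE = 0.05  # angstroms; illustrative cutoff

def check_bonds(atoms: dict, bonds: list) -> list:
    """atoms: name -> (x, y, z); bonds: list of (name, name) pairs.
    Returns (a, b, observed, target) tuples for out-of-tolerance bonds."""
    flagged = []
    for a, b in bonds:
        target = STANDARD_BONDS.get((a, b)) or STANDARD_BONDS.get((b, a))
        if target is None:
            continue  # no standard value known for this bond type
        observed = math.dist(atoms[a], atoms[b])
        if abs(observed - target) > TOLERANCE:
            flagged.append((a, b, round(observed, 3), target))
    return flagged

atoms = {"N": (0.0, 0.0, 0.0), "CA": (1.458, 0.0, 0.0), "C": (1.458, 1.8, 0.0)}
print(check_bonds(atoms, [("N", "CA"), ("CA", "C")]))  # flags the stretched CA-C bond
```

In the real pipeline, analogous checks cover bond angles, chirality, stereochemical clashes, and ligand chemistry, and the flagged items are what the annotators and authors review.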
Processing time, including all author correspondence,
averages about two weeks for any single structure. Several
improvements in the process have been instituted since
ADIT was first introduced in 1998. Standalone versions of
data collection and validation tools are now available
(http://deposit.pdb.org/software/). This enables the depositor
to precheck all aspects of the structure before submission to
reduce the number of errors in the file and the subsequent
time needed by the annotator to review the structure.
Currently, in addition to the RCSB site, data are processed
with ADIT by PDBj in Osaka (http://www.pdbj.org/) and by
one member of the annotation staff in Prague. Data are also
deposited via AutoDep and processed at the European
Bioinformatics Institute (http://www.ebi.ac.uk/msd/).
Another important innovation is the creation of a data
harvesting tool that extracts data items required for
submission from each step in the structure determination
(Fig. 5). Reducing manual intervention required in the
deposition process will further reduce errors in the data.
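The harvesting idea can be sketched as a merge of per-step data items into a single deposition record. The step outputs and item names below are illustrative, modeled loosely on mmCIF-style identifiers:

```python
# Sketch of data harvesting: each pipeline step emits the data items it
# knows about, and the harvester merges them into one deposition record.
# Step outputs and item names here are invented for illustration.

def harvest(step_outputs: list) -> dict:
    """Merge per-step data items; later steps may refine earlier values."""
    deposition = {}
    for items in step_outputs:
        deposition.update(items)
    return deposition

steps = [
    # hypothetical output of cloning/expression step
    {"entity.src_method": "man", "entity.pdbx_description": "kinase domain"},
    # hypothetical output of crystallization step
    {"exptl_crystal_grow.method": "vapor diffusion", "exptl_crystal_grow.pH": 7.5},
    # hypothetical output of refinement step
    {"refine.ls_d_res_high": 2.1},
]
record = harvest(steps)
print(len(record))  # 5
```

Because each step contributes its items as they are produced, the final deposition file can be assembled with no manual re-entry, which is where the error reduction comes from.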
Fig. (5). The steps involved in a simplified structure determination data pipeline. At each step along the pipeline, essential details about that
step are captured and assembled to make a data file for PDB deposition. The PDB data processing system has been developed in anticipation
of a structure determination data pipeline in which automated deposition would be a final step.
The current rate of data deposition is about 5,000
structures per year, which is more than double the rate of 5
years ago. The structures themselves are more complex, as
measured by molecular weight and number of protein chains
per molecule. With the advent of structural genomics and the
use of cryo-electron microscopy for structure determination,
the size of the PDB archive promises to continue to grow in
the years to come.
There is a perceptible sea change in the requirements of
depositors. In the early days of the PDB, relatively little
metadata was deposited and depositions occurred late in the
publication process; currently, more metadata is being
submitted (Fig. 6), and for some authors the deposition date
determines the priority of their structure over competing
ones.
DATA DISTRIBUTION AND QUERY
The PDB is committed to making all data files publicly
available as soon as they are authorized for release. The
author determines the release status at the moment of
deposition. In the majority of cases (92%), structures are
released prior to or at the time of publication in a journal.
The PDB staff has tools that cull the literature to determine if
publication has occurred. Once confirmed by the author, the
data are ready for release. Once a week the data files are
packaged and loaded into the various databases maintained
by the PDB. The updated web and FTP sites are mirrored to
eight international sites.
There is heavy activity on the primary PDB website, with
more than 160,000 web hits per day. There are many
databases that depend on the PDB (Weissig and Bourne,
2003) and these databases regularly download the entire
archive and curate the data to present them in their local data
resources. Among these are two databases of folds, SCOP
(Murzin et al., 1995) and CATH (Orengo et al., 1997), as
well as the NCBI Entrez (Hogue et al., 1996).
In addition to flat files in PDB and mmCIF format, the
data are organized in a core relational database and several
other databases of derived information (Fig. 7). A web-based
query engine allows the user to ask a wide variety of
questions about one structure or groups of structures. The
results are presented in summary form on the Structure
Explorer page and in various tabular reports. The current
query capabilities are shown in (Table 1). In addition, the
PDB dynamically links to the major data resources so that
the PDB is a portal to information about three-dimensional
structures of biological macromolecules.
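The kind of query such a relational organization enables can be sketched with an in-memory SQLite table. The schema and entries below are invented for illustration; the real PDB schema is derived from the mmCIF dictionary:

```python
# Sketch of a relational query over structure summaries, using an
# in-memory SQLite table. The schema and the entries (including the
# PDB-ID-like strings) are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE structure (
    pdb_id TEXT PRIMARY KEY, title TEXT,
    exptl_method TEXT, resolution REAL)""")
conn.executemany(
    "INSERT INTO structure VALUES (?, ?, ?, ?)",
    [("1ABC", "Lysozyme mutant", "X-RAY DIFFRACTION", 1.8),
     ("2XYZ", "Membrane transporter", "X-RAY DIFFRACTION", 3.2),
     ("3NMR", "Small domain", "SOLUTION NMR", None)],  # hypothetical entries
)

# A typical question: all X-ray structures better than 2.0 A resolution.
rows = conn.execute(
    """SELECT pdb_id, title FROM structure
       WHERE exptl_method = 'X-RAY DIFFRACTION' AND resolution < 2.0"""
).fetchall()
print(rows)  # [('1ABC', 'Lysozyme mutant')]
```

Flat files cannot answer such cross-structure questions without scanning every entry, which is why the relational layer sits alongside the archival flat files rather than replacing them.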
The query capability is only as good as the primary data.
Over the 30+ years that the PDB has been in existence there
have been subtle changes in content and format of the PDB
files. Starting in 1998, the PDB began an effort to make all
of the files as uniform as possible (Bhat et al., 2001;
Westbrook et al., 2002). The entire archive was revalidated,
and data were remediated with respect to nomenclature,
ligand chemistry, consistency between sequence and atomic
coordinates, and other items. These files were used to create
new mmCIF files as well as files in XML format.
A new three-tier architecture (Fig. 8) has been put in
place in an effort to reengineer the entire query system,
including the database, applications logic, and web site
presentation layer. The new site is designed to take
advantage of the newest tools available for navigation as
well as the newly remediated data. The database schema will
allow for query of the full structural hierarchy, from
biological assembly to the atomic level, and should be much
more user-friendly than the database that has been in use
since 1998. In addition, an application programming
interface (API) that implements CORBA technology and is
based on mmCIF (Greer et al., 2001) was developed and
standardized by the Object Management Group (OMG). The
availability of the API will allow direct programmatic
interaction with the database. An API based on web services
is under development.
Fig. (6). Changes in content over the years. In addition to the coordinates, there are several hundred data items that can be archived in the
PDB.
Table 1. Current Query Capability
Query – Single or Iterative
PDB ID – direct lookup of structures by PDB ID
Text Search (“SearchLite”) – search for any word or partial word match
Text Search by Attribute – same as text search, but restricted to one of 13 attributes (e.g., author, compound, source, etc.)
Advanced Search (“SearchFields”) – search any combination of fields from about 20 categories, e.g., citations, deposition and release dates, source,
experimental technique, sequence, secondary structure, resolution, cell dimensions, ligands, etc.
Status Search – search structures on hold by ID, author, title, holding status, sequence availability, release date, deposition date
Results Analysis – Single Structure
Summary Information – title, compound name, authors, experimental method, classification, enzyme classification (name and EC number), source, citation,
deposition and release dates, resolution, R-value, space group, unit cell dimensions, chains, residues, atoms, het groups
Single-click Queries (“query by example”) – citation authors, EC numbers, and ligand IDs are hyperlinks to new queries
View Structure (Render) – VRML, Rasmol, Swiss-PdbViewer, MICE, FirstGlance (Chime), Protein Explorer (Chime), Sting, QuickPDB Java applet, still
images (JPEG, TIFF, fixed and custom size), fixed size still images of biological molecule
Download/Display File – PDB and mmCIF format (curated files), uncompressed, Unix compressed (.Z), gzipped (.gz), ZIPed (.zip); XML format
(gzipped) in beta testing
Structural Neighbors – links to CATH, CE, FSSP, SCOP, and VAST classifications and alignments
Geometry – bond lengths, bond angles, dihedrals, Ramachandran plot, fold deviation score, links to external resources (e.g., Procheck, What Check,
Promotif, castP, etc.)
Links (“Other Sources”) – dynamically tracks links to about 75 external resources through MIA (Molecular Information Agent)
Sequence Details – graphical and tabular displays of sequence and secondary structure related data
Crystallization Info – crystallization data from BMCD
Previous versions – Links to previous (“obsoleted”) versions of the structure that were replaced by the current version (if applicable)
NDB Atlas Entry – link to NDB Atlas entry for nucleic acid containing structures
Quick Report – tabular reports with geometric data for nucleic acid containing structures
Structure Factors – download in compressed form where available for X-ray structures
NMR Restraints – download in compressed form where available for NMR structures
Results Analysis – Multiple Structure
Query Result Browser – summary information: PDB ID, deposition date, experimental method, title, classification, compound information
Query Refinement – Iterative query over result set using OR, AND or NOT Boolean logic
Tabular Reports – 8 tabular reports in html or txt format: structure summary, sequence, crystallization description, unit cell dimensions, data collection
details, refinement details, refinement parameters, citation
Custom Tabular Reports – choose any combination of about 30 columns from the 8 canned tabular reports
Remove Sequence Homologs – remove sequence homologs at 90%, 70% or 50% sequence homology
Download Structures or Sequences – download structures files (PDB or mmCIF) in one of three compressed/tarred formats, download sequences as
FASTA file
EXTENSIBILITY OF THE PDB
An enormous amount of effort has been expended in
developing a stable infrastructure based on the mmCIF
dictionary that would be scalable and extensible. This
dictionary-based approach has allowed the PDB to modify
ADIT in order to collect structures derived by cryo-electron
microscopy, interact and interoperate with the
BioMagResBank (BMRB, the database for NMR experimental data;
Ulrich et al., 1989), and expand the content needed for the
structural genomics initiative. The syntax of mmCIF makes
it possible to readily create relational databases with
straightforward query capabilities.

Fig. (7). Current data query interface to the PDB. The current system contains several databases and a CGI layer.

The semantics provide the
basis for building APIs that allow programmatic access to
the PDB.
More recently, a new use is being made of the PDB
database architecture. The structural genomics projects have
developed a series of protocols for all aspects of the protein
production pipeline. Although only a fraction of these targets
will result in a structure determination, the availability of
these protocols will help enable new research in biology. The
PDB is now collaborating with the nine US structural genomics
centers to create a database of these protocols whose
architecture is similar to that of the full archival PDB.
LESSONS LEARNED
Creating and maintaining a public community database is
a complex challenge in technology and sociology. Here is
what we have learned from managing the PDB.
The technology must take advantage of the most current
innovations in hardware and software. It is often useful to
look at implementations in other domains. It is equally
important that the data be uniform and well-curated. It is an
absolute necessity to have a carefully designed data
dictionary/ontology, use controlled vocabularies, provide
validation tools, and constantly reassess the state of the data
to maintain the archive.
These technological developments, however, must be
introduced so as to enable and not disrupt the users of the
resource. Thus, it is critical to maintain an interactive dialog
with the user community about desired new functionalities
and the feasibility of their implementation. This type of
outreach can be achieved electronically, but it must also be
done by attending meetings, having workshops, and in
general, keeping in good face-to-face contact with the users.
This feedback forms the basis of development, which must
always be ongoing. Stability is an important goal of a
community data resource; when changes must be made, they
should be announced well in advance and thoroughly tested
prior to implementation.
The effort required to develop and maintain these resources
should never be underestimated. Technical, sociological, and cultural
issues associated with the data must always be taken into
account. Even the best efforts may not achieve the desired
goal, but if the vision is broad and the motivation high, the
many challenges can be met.
ACKNOWLEDGEMENTS
The Protein Data Bank (PDB) is managed by three
members of the Research Collaboratory for Structural
Bioinformatics (Rutgers University, SDSC/UCSD, and
CARB/NIST) and is funded by the National Science
Foundation, the Department of Energy, and the National
Institutes of Health.
The dedication of all the members of the PDB staff since
its inception has been key to its endurance and success.
The help of Christine Zardecki in preparing this manuscript is very much appreciated.
Fig. (8). The newly engineered PDB under development now has a three-tier architecture consisting of a presentation layer, an application
layer and a persistence layer. This modular design will allow for easier maintenance and greater flexibility, and for evolution of the site as
new technologies are developed.
ABBREVIATIONS
SDSC = San Diego Supercomputer Center
PDB = Protein Data Bank
UCSD = University of California San Diego
LIMS = Laboratory information management system
CARB = Center for Advanced Research in Biotechnology
MAD = Multiple anomalous diffraction
NIST = National Institute of Standards and Technology
mmCIF = Macromolecular Crystallographic Information File
CGI = Common gateway interface
FTP = File transfer protocol
ADIT = AutoDep Input Tool
API = Application programming interface
SCOP = Structural Classification of Proteins
CATH = Class, Architecture, Topology and Homologous superfamily
NCBI = National Center for Biotechnology Information
CORBA = Common Object Request Broker Architecture
OMG = Object Management Group
BMRB = BioMagResBank
RCSB = Research Collaboratory for Structural Bioinformatics
REFERENCES
Allen, F. H., Davies, J. E., Galloy, J. J., Johnson, O., Kennard, O., Macrae,
C. F., Mitchell, E. M., Mitchell, G. F., et al. (1991). The development
of versions 3 and 4 of the Cambridge Structural Database System. J.
Chem. Inf. Comp. Sci. 31: 187-204.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig,
H., Shindyalov, I. N. and Bourne, P. E. (2000). The Protein Data Bank.
Nucleic Acids Res. 28: 235-242.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer Jr. E. F., Brice,
M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T., et al. (1977).
Protein Data Bank: a computer-based archival file for macromolecular
structures. J. Mol. Biol. 112: 535-542.
Bhat, T. N., Bourne, P., Feng, Z., Gilliland, G., Jain, S., Ravichandran, V.,
Schneider, B., Schneider, K., et al. (2001). The PDB data uniformity
project. Nucleic Acids Res. 29: 214-218.
Bourne, P. E., Berman, H. M., Watenpaugh, K., Westbrook, J. D. and
Fitzgerald, P. M. D. (1997). The macromolecular Crystallographic
Information File (mmCIF). Meth. Enzymol. 277: 571-590.
Cold Spring Harbor Laboratory Press (1972). Cold Spring Harbor Symposia on
Quantitative Biology, vol. 36.
Greer, D., Westbrook, J. and Bourne, P. E. (2001). In Objects in Bio- and
Chem-informatics (OIBC), Boston, MA.
Hendrickson, W. A. (1991). Determination of macromolecular structures
from anomalous diffraction of synchrotron radiation. Science 254: 51-58.
Hogue, C., Ohkawa, H. and Bryant, S. (1996). A dynamic look at structures:
WWW-Entrez and the Molecular Modeling Database. Trends Biochem.
Sci. 21: 226-229.
International Union of Crystallography (1989). Policy on publication and
the deposition of data from crystallographic studies of biological
macromolecules. Acta Cryst. A45: 658.
Murzin, A. G., Brenner, S. E., Hubbard, T. and Chothia, C. (1995). SCOP: a
structural classification of proteins database for the investigation of
sequences and structures. J. Mol. Biol. 247: 536-540.
Nature New Biology (1971). Protein Data Bank. Nature New Biol. 233: 223.
Nature Structural Biology (2000). Structural Genomics Supplement Issue.
Nature Structural Biology 7: http://structbio.nature.com/.
Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. and
Thornton, J. M. (1997). CATH–a hierarchic classification of protein
domain structures. Structure 5: 1093-1108.
Ulrich, E. L., Markley, J. L. and Kyogoku, Y. (1989). Creation of a Nuclear
Magnetic Resonance Data Repository and Literature Database. Protein
Seq. Data Anal. 2: 23-37.
Weissig, H. and Bourne, P. E. (2003). In Structural Bioinformatics (Eds.
Bourne, P. E. and Weissig, H.), John Wiley & Sons, Inc., Hoboken, NJ,
pp. 217-236.
Westbrook, J., Feng, Z., Jain, S., Bhat, T. N., Thanki, N., Ravichandran, V.,
Gilliland, G. L., Bluhm, W., et al. (2002). The Protein Data Bank:
Unifying the archive. Nucleic Acids Res. 30: 245-248.