Current Proteomics, 2004, 1, 49-57

The Protein Data Bank: A Case Study in Management of Community Data

Helen M. Berman1,*, Philip E. Bourne2 and John Westbrook1

Research Collaboratory for Structural Bioinformatics (RCSB); 1Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, 610 Taylor Road, Piscataway, NJ 08854-8087, USA and 2Department of Pharmacology and San Diego Supercomputer Center, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0537, USA

Abstract: As the sole repository for three-dimensional structure data of biological macromolecules, the Protein Data Bank (PDB) is an important resource for research in the academic, pharmaceutical, and biotechnology sectors. Over the years, the methods and speed of structure determination have changed as technology has improved. At the same time, the methods for data collection, archiving, and distribution of the structural data in the PDB have also evolved. Concurrently, the community of data depositors and users has expanded. As of October 2003, the PDB archive contains approximately 23,000 released structures and the website receives over 160,000 hits per day. The lessons learned from the development of the PDB may be applicable to the ongoing development of new data and knowledge resources in proteomics.

Key Words: Structural biology, bioinformatics, databases, Protein Data Bank.

INTRODUCTION

We are witnessing a data explosion in biology. The recently completed Human Genome and model organism projects have produced approximately 250,000 sequences from over 1,000 genomes (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome). These sequences contain the instructions for building molecules and controlling their functions. The availability of all of these sequences has given rise to a myriad of research projects designed to help us understand the relationships among sequence, structure, and function.
An important outgrowth of these sequencing projects has been the structural genomics initiative, whose goal is to determine the structures of all proteins on a genomic scale in a high-throughput manner (Nature Structural Biology, 2000). Now, proteomics projects are emerging whose goals are to determine the functions of the expressed proteins in a variety of systems using a wide spectrum of techniques. These projects create formidable data management challenges at every level. High-throughput pipelines require laboratory information management systems (LIMS) that keep track of all the experiments in the pipeline. From these LIMS it should be possible not only to cull the final results, but also to analyze the systems themselves so as to improve the experimental processes. Archival databases then capture the final results of sequence and structure analyses to make the data available to a very wide community of users. Analyses of the contents of archival databases yield new results that are used to create specialized databases of information, often called knowledge bases. Information among all of these databases needs to be easily exchanged and, in the long term, made interoperable.

*Address correspondence to this author at the Research Collaboratory for Structural Bioinformatics (RCSB), Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, 610 Taylor Road, Piscataway, NJ 08854-8087, USA; Tel: 732-445-4667; Fax: 732-445-4320; E-mail: [email protected]

The Protein Data Bank (PDB) (Berman et al., 2000) was created more than thirty years ago as a repository for the results of crystal structure analyses. It has evolved into a sophisticated resource that contains not only the results of structure analyses but also information about experimental procedures, derived information about structures, and links to related data sources. The underlying architecture of the current instantiation of the PDB makes it scalable and generalizable.
The lessons learned from the development of the PDB may be applicable to the ongoing development of new data and knowledge resources in proteomics.

HISTORY

The PDB began during the late 1960s and early 1970s with community discussions about the need for such a resource. Protein crystallography was still in its infancy, but it was apparent to the producers of these structures, as well as to the potential users, that every structure contained valuable information that needed to be archived and maintained for posterity. In June 1971, the two communities attended the Cold Spring Harbor Symposium on Quantitative Biology (Cold Spring Harbor Laboratory Press, 1972) and agreed that the time was right to create the PDB. In October 1971, an article appeared in Nature New Biology (1971) and other journals announcing the formation of the PDB.

In the beginning of the PDB, letters were sent to each author of a structure paper asking for the coordinates to be deposited. Some complied; some did not. The archive grew slowly. As time went on, there were profound changes in technology as well as in attitudes towards data sharing.

The process of X-ray crystallography involves a series of steps (Fig. 1). Initially, a target is selected for structure determination. Subsequently, the material is purified and then crystallized. One or more crystals are put in front of an X-ray beam and the diffracted intensities are collected on a detector. These data are then analyzed using a variety of computational methods that ultimately lead to the determination of the molecular structure. Over time the technology has changed for every step in the process. The effects of these changes began to be evident in the 1980s.

Fig. (1). The steps involved in X-ray crystal structure determination, from target selection to publication.
The revolution in molecular biology meant that rather than extracting proteins from the original organisms, it was possible to clone and express large quantities of pure proteins using recombinant DNA technology. Rather than attempting to crystallize materials using batch methods, methods for multiple crystallization trials began to evolve; crystallization kits became the forerunners of robots. The availability of synchrotron sources provided much more intense radiation. Multi-wire detectors and then image plates sped up the collection of the diffracted intensities. The development of multiwavelength anomalous diffraction (MAD) methods (Hendrickson, 1991) allowed one to capitalize on the tunable wavelengths of the synchrotron sources. Faster computers with more storage furthered all aspects of the analysis. By the end of the 1980s, crystal structure analysis was more straightforward, but not yet automated.

Attitudes about data sharing also changed over time. In small molecule crystallography, all coordinates were printed in journal articles, and Acta Crystallographica published the structure factors as well. The coordinate data were extracted by the Cambridge Structural Database (Allen et al., 1991) and archived. The size of the data sets made a similar procedure impractical for the protein crystallography community. Protein structures took years to determine, and there were many who felt that the data needed full analysis by the author before others could benefit from them. In addition, it was felt that only experts would be able to correctly take into account the errors inherent in these early structures. So although many people submitted data to the PDB to let the public examine them, some did not.

The recognition of the importance of three-dimensional structure to public health had a profound effect on attitudes about depositing structural data.
Some representatives of the funding agencies felt strongly that if they supported the determination of biological structures, the data had to be public. Individual scientists, most notably structural biologists involved in some of the earliest structure determinations (Frederic Richards and Richard Dickerson), were very vocal in their view that these data must be shared. Committees were formed, petitions were written, and in 1989 a formal set of guidelines established the rules by which data would be made publicly available (International Union of Crystallography, 1989). These guidelines required deposition of all coordinates, but did allow for a "hold" on the release of the data for a fixed period beyond publication. The requirements were adopted by journals and by funding agencies. Over time the guidelines have changed, so that now virtually all structures are deposited in the PDB and released upon publication. Currently, only 8% of PDB depositions are held past publication. Experimental data are now deposited with approximately 67% of structures. In addition, the sequences of more than half the structures that are deposited are released prior to publication. The effects of the evolution in technology and the changes in community attitudes toward data sharing are evident in the growth curve of the PDB (Fig. 2).

EVOLUTION OF THE PROTEIN DATA BANK

The technology used by the Protein Data Bank to archive and manage the data in the repository has changed over time. The first data sets were entered on punched cards. Later, structures were submitted on magnetic tape, and correspondence was done by postal mail. As time went on, submissions were made via FTP, then electronic mail, and by the late 1990s via web tools. Distribution of the data followed a similar pattern, with magnetic tapes being the first method, followed by FTP, and now the web. The archive is also distributed via CD-ROM.
The PDB file format was established early on (Bernstein et al., 1977) to contain the coordinates and related information, and it still endures. In the late 1990s a new format evolved, the macromolecular Crystallographic Information File (mmCIF) (Bourne et al., 1997), which conforms to well-documented standards and facilitates automated data management. The data themselves were organized as flat files, which are still the main form of data exchange. Since 1998, relational databases based on mmCIF have allowed for more efficient query of the data. In the next sections we review the systems currently in place (Berman et al., 2000) in the PDB pipeline for data collection, processing, archiving, and query (Fig. 3).

DATA COLLECTION AND PROCESSING

The challenge in this part of the PDB pipeline is to collect accurate information quickly and in the way that is most efficient for the depositor. The information required includes the coordinates of all the atoms in the macromolecule, the chemical description of the various molecules in the crystal, experimental information about the structure determination, and a structural description of the biological molecule.

Fig. (2). Growth chart of the PDB showing the total number of depositions per year (in gray) and the total number of structures available per year (in black).

Fig. (3). Data flow diagram for data processing.

In the first step, the depositor interacts with the PDB web-based deposition system. During the deposition session, the depositor uploads coordinate and experimental data files and adds additional data items. He/she may also run validation checks on the data. At the end of the deposition session, a PDB ID code is immediately generated for the deposition. A PDB annotator reviews each deposited entry.
At the end of this review (step 2), the processed entry is returned to the depositor along with a validation summary and any questions raised during the annotation process. In corresponding with depositors, new or unusual scientific issues that arise during annotation, as well as issues of policy, are reviewed by a senior annotator. Any corrections provided by the depositor are integrated into the entry (step 3). Entries are then loaded into a relational database, and all data processing materials, including depositor correspondence, are saved in a data processing archive. The finished entry is returned to the depositor for approval (step 4). According to their release status, data files and database updates are sent to the SDSC distribution site on a weekly basis. After an entry is released, the depositor may provide further corrections or additional information, such as annotation about the molecule's biological function (step 5). This information is incorporated into the entry and the revised entry is re-released. This is a common situation for structural genomics entries, in which the function is not known at the time a structure is solved but becomes known after further analysis.

The mmCIF dictionary is the key element in enabling this part of the process. This dictionary contains approximately 2,500 definitions for terms used to describe the crystallographic experiment. The dictionary definition language (DDL) provides for explicit specification of the type, range, and relationships of all the terms. It is structured in such a way that data files conforming to this syntax can be readily loaded into a database.

An editor called the AutoDep Input Tool (ADIT) is built on top of this dictionary (Fig. 4). ADIT accepts data in either PDB or mmCIF format. Data that are entered into ADIT are immediately translated into mmCIF files. Once loaded, the data are subjected to a series of computerized validation procedures: all the nomenclatures of the atoms and residues are standardized; the sequence records are checked for internal consistency and against the sequence databases; the geometry of the macromolecule is checked against known standards for distances, angles, and chirality; stereochemical clashes and the chemistry of the small molecule ligands are checked; and the coordinate data are checked against the experimental data. The results of these checks are reviewed by the annotation staff and then sent to the author for review. Once approved by the author, the data are ready for release. Processing time, including all author correspondence, averages about two weeks for any single structure.

Fig. (4). The data processing architecture has five key components: data dictionaries, MAXIT, validation, data views, and a database loader. The data dictionaries contain the precise definitions for all the data items archived by the PDB. MAXIT is an application that performs format interchange, integrity checks, and nomenclature checks. The validation application checks geometrical features and compares the coordinates against the experimental data. The data views determine which data items are visible on the ADIT interface. The database loader allows the final data files to be loaded into a relational database.

Several improvements in the process have been instituted since ADIT was first introduced in 1998. Standalone versions of the data collection and validation tools are now available (http://deposit.pdb.org/software/). These enable the depositor to precheck all aspects of the structure before submission, reducing the number of errors in the file and the time subsequently needed by the annotator to review the structure. Currently, in addition to the RCSB site, data are processed with ADIT by PDBj in Osaka (http://www.pdbj.org/) and by one member of the annotation staff in Prague.
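As a rough illustration of the mmCIF translation and geometry-validation steps just described, the sketch below parses a few mmCIF-style key-value pairs and applies one simplified bond-length check. The fragment, the field names, the target value, and the tolerance are illustrative assumptions only; they are not the PDB's actual dictionary contents or validation code, which handle loops, per-residue standards, and many more checks.

```python
import math

# Illustrative mmCIF-style fragment (simple key-value pairs only;
# real mmCIF files also contain loop_ constructs and multi-line values).
MMCIF_FRAGMENT = """\
_entry.id               1ABC
_exptl.method           'X-RAY DIFFRACTION'
_refine.ls_d_res_high   1.80
"""

def parse_mmcif_pairs(text):
    """Parse simple (non-loop) mmCIF key-value pairs into a dict."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("_"):
            key, _, value = line.partition(" ")
            items[key] = value.strip().strip("'")
    return items

# Hypothetical standard peptide C-N bond length (Angstroms) and a loose
# tolerance; real dictionaries tabulate many target values per residue type.
STD_C_N = 1.33
TOLERANCE = 0.10

def check_bond(a, b, target=STD_C_N, tol=TOLERANCE):
    """Return (within_tolerance, observed_length) for one bond."""
    d = math.dist(a, b)
    return abs(d - target) <= tol, d

items = parse_mmcif_pairs(MMCIF_FRAGMENT)
print(items["_exptl.method"])    # X-RAY DIFFRACTION

ok, d = check_bond((0.0, 0.0, 0.0), (1.33, 0.0, 0.0))
print(ok)                        # True: within tolerance of the target
```

In the real pipeline each such check is driven by the dictionary definitions, so that adding a new data item or standard value does not require changing the validation code itself.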
Data are also deposited via AutoDep and processed at the European Bioinformatics Institute (http://www.ebi.ac.uk/msd/).

Another important innovation is the creation of a data harvesting tool that extracts the data items required for submission from each step in the structure determination (Fig. 5). Reducing the manual intervention required in the deposition process will further reduce errors in the data.

Fig. (5). The steps involved in a simplified structure determination data pipeline. At each step along the pipeline, essential details about that step are captured and assembled to make a data file for PDB deposition.

The PDB data processing system has been developed in anticipation of a structure determination data pipeline in which automated deposition would be the final step. The current rate of data deposition is about 5,000 structures per year, more than double the rate of 5 years ago. The structures themselves are more complex, as measured by molecular weight and the number of protein chains per molecule. With the advent of structural genomics and the use of cryo-electron microscopy for structure determination, the size of the PDB archive promises to continue to grow in the years to come. There has also been a perceptible sea change in the requirements of depositors. In the early days of the PDB, relatively little metadata was deposited and depositions occurred late in the publication process; currently, more metadata are being submitted (Fig. 6), and for some authors the deposition date determines the priority of their structure over competing ones.

DATA DISTRIBUTION AND QUERY

The PDB is committed to making all data files publicly available as soon as they are authorized for release. The author determines the release status at the moment of deposition. In the majority of cases (92%), structures are released prior to or at the time of publication in a journal.
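The release-status bookkeeping just described can be sketched as a simple filter over deposited entries. The status codes, field names, and entries below are illustrative assumptions; the PDB's actual deposition and release system is not public code, and its real status vocabulary and rules are richer than this.

```python
from datetime import date

def releasable(entry, today):
    """Decide whether an entry may go out in the next weekly update.

    Hypothetical statuses: "REL" = released, "HPUB" = hold until
    publication, "HOLD" = hold until a fixed date.
    """
    if entry["status"] == "REL":
        return True
    if entry["status"] == "HPUB":
        return entry.get("published", False)
    if entry["status"] == "HOLD":
        return today >= entry["hold_until"]
    return False

# Illustrative entries, not real depositions.
entries = [
    {"id": "1ABC", "status": "REL"},
    {"id": "1DEF", "status": "HPUB", "published": True},
    {"id": "1GHI", "status": "HOLD", "hold_until": date(2004, 6, 1)},
]

today = date(2004, 1, 15)
weekly_batch = [e["id"] for e in entries if releasable(e, today)]
print(weekly_batch)   # ['1ABC', '1DEF']
```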
The PDB staff has tools that cull the literature to determine whether publication has occurred. Once publication is confirmed by the author, the data are ready for release. Once a week the data files are packaged and loaded into the various databases maintained by the PDB. The updated web and FTP sites are mirrored to eight international sites. There is heavy activity on the primary PDB website, with more than 160,000 web hits per day. Many databases depend on the PDB (Weissig and Bourne, 2003); these regularly download the entire archive and curate the data to present them in their local data resources. Among these are two databases of folds, SCOP (Murzin et al., 1995) and CATH (Orengo et al., 1997), as well as NCBI Entrez (Hogue et al., 1996).

In addition to flat files in PDB and mmCIF format, the data are organized in a core relational database and several other databases of derived information (Fig. 7). A web-based query engine allows the user to ask a wide variety of questions about one structure or groups of structures. The results are presented in summary form on the Structure Explorer page and in various tabular reports. The current query capabilities are shown in Table 1. In addition, the PDB dynamically links to the major data resources, so that the PDB serves as a portal to information about three-dimensional structures of biological macromolecules.

The query capability is only as good as the primary data. Over the 30+ years that the PDB has been in existence there have been subtle changes in the content and format of the PDB files. Starting in 1998, the PDB began an effort to make all of the files as uniform as possible (Bhat et al., 2001; Westbrook et al., 2002). The entire archive was revalidated, and the data were remediated with respect to nomenclature, ligand chemistry, consistency between sequence and atomic coordinates, and other items.
These files were used to create new mmCIF files as well as files in XML format.

A new three-tier architecture (Fig. 8) has been put in place in an effort to reengineer the entire query system, including the database, the applications logic, and the web site presentation layer. The new site is designed to take advantage of the newest tools available for navigation as well as the newly remediated data. The database schema will allow for query of the full structural hierarchy, from the biological assembly down to the atomic level, and should be much more user-friendly than the database that has been in use since 1998. In addition, an applications programming interface (API) that implements CORBA technology and is based on mmCIF (Greer et al., 2001) was developed and standardized by the Object Management Group (OMG). The availability of the API will allow direct programmatic interaction with the database. An API based on web services is under development.

Fig. (6). Changes in content over the years. In addition to the coordinates, there are several hundred data items that can be archived in the PDB.

Table 1. Current Query Capability

Query – Single or Iterative
PDB ID – direct lookup of structures by PDB ID
Text Search ("SearchLite") – search for any word or partial word match
Text Search by Attribute – same as text search, but restricted to one of 13 attributes (e.g., author, compound, source, etc.)
Advanced Search ("SearchFields") – search any combination of fields from about 20 categories, e.g., citations, deposition and release dates, source, experimental technique, sequence, secondary structure, resolution, cell dimensions, ligands, etc.
Status Search – search structures on hold by ID, author, title, holding status, sequence availability, release date, deposition date

Results Analysis – Single Structure
Summary Information – title, compound name, authors, experimental method, classification, enzyme classification (name and EC number), source, citation, deposition and release dates, resolution, R-value, space group, unit cell dimensions, chains, residues, atoms, het groups
Single-click Queries ("query by example") – citation authors, EC numbers, and ligand IDs are hyperlinks to new queries
View Structure (Render) – VRML, Rasmol, Swiss-PdbViewer, MICE, FirstGlance (Chime), Protein Explorer (Chime), Sting, QuickPDB Java applet, still images (JPEG, TIFF, fixed and custom size), fixed size still images of the biological molecule
Download/Display File – PDB and mmCIF format (curated files), uncompressed, Unix compressed (.Z), gzipped (.gz), zipped (.zip); XML format (gzipped) in beta testing
Structural Neighbors – links to CATH, CE, FSSP, SCOP, and VAST classifications and alignments
Geometry – bond lengths, bond angles, dihedrals, Ramachandran plot, fold deviation score, links to external resources (e.g., Procheck, What Check, Promotif, castP, etc.)
Links ("Other Sources") – dynamically tracks links to about 75 external resources through MIA (Molecular Information Agent)
Sequence Details – graphical and tabular displays of sequence and secondary structure related data
Crystallization Info – crystallization data from the BMCD
Previous Versions – links to previous ("obsoleted") versions of the structure that were replaced by the current version (if applicable)
NDB Atlas Entry – link to the NDB Atlas entry for nucleic acid containing structures
Quick Report – tabular reports with geometric data for nucleic acid containing structures
Structure Factors – download in compressed form, where available, for X-ray structures
NMR Restraints – download in compressed form, where available, for NMR structures

Results Analysis – Multiple Structures
Query Result Browser – summary information: PDB ID, deposition date, experimental method, title, classification, compound information
Query Refinement – iterative query over the result set using OR, AND, or NOT Boolean logic
Tabular Reports – 8 tabular reports in html or txt format: structure summary, sequence, crystallization description, unit cell dimensions, data collection details, refinement details, refinement parameters, citation
Custom Tabular Reports – choose any combination of about 30 columns from the 8 canned tabular reports
Remove Sequence Homologs – remove sequence homologs at 90%, 70%, or 50% sequence homology
Download Structures or Sequences – download structure files (PDB or mmCIF) in one of three compressed/tarred formats, or download sequences as a FASTA file

EXTENSIBILITY OF THE PDB

An enormous amount of effort has been expended in developing a stable infrastructure, based on the mmCIF dictionary, that would be scalable and extensible.
This dictionary-based approach has allowed the PDB to modify ADIT in order to collect structures derived by cryo-electron microscopy, to interact and interoperate with the BioMagResBank (BMRB, the database for NMR experimental data; Ulrich et al., 1989), and to expand the content needed for the structural genomics initiative. The syntax of mmCIF makes it possible to readily create relational databases with straightforward query capabilities. The semantics provide the basis for building APIs that allow programmatic access to the PDB.

Fig. (7). Current data query interface to the PDB. The current system contains several databases and a CGI layer.

More recently, a new use is being made of the PDB database architecture. The structural genomics projects have developed a series of protocols for all aspects of the protein production pipeline. Although only a fraction of these targets will result in a structure determination, the availability of these protocols will help enable new research in biology. The PDB is now collaborating with the nine US structural genomics centers to create such a database, whose architecture is similar to that of the full archival PDB.

LESSONS LEARNED

Creating and maintaining a public community database is a complex challenge in technology and sociology. Here is what we have learned from managing the PDB.

The technology must take advantage of the most current innovations in hardware and software, and it is often useful to look at implementations in other domains. It is equally important that the data be uniform and well-curated. It is an absolute necessity to have a carefully designed data dictionary/ontology, to use controlled vocabularies, to provide validation tools, and to constantly reassess the state of the data in order to maintain the archive. These technological developments, however, must be introduced so as to enable, and not disrupt, the users of the resource.
Thus, it is critical to maintain an interactive dialog with the user community about desired new functionalities and the feasibility of their implementation. This type of outreach can be achieved electronically, but it must also be done by attending meetings, holding workshops, and, in general, keeping in good face-to-face contact with the users. This feedback forms the basis of development that must always be ongoing. Stability is an important goal of a community data resource; when changes must be made, they should be announced well in advance and thoroughly tested prior to implementation.

The effort required to develop and maintain these resources should never be underestimated. The technical, sociological, and cultural issues associated with the data must always be taken into account. Even the best efforts may not achieve the desired goal, but if the vision is broad and the motivation high, the many challenges can be met.

ACKNOWLEDGEMENTS

The Protein Data Bank (PDB) is managed by three members of the Research Collaboratory for Structural Bioinformatics (Rutgers University, SDSC/UCSD, and CARB/NIST) and is funded by the National Science Foundation, the Department of Energy, and the National Institutes of Health. The dedication of all the members of the PDB staff since its inception has been key to its endurance and success. The help of Christine Zardecki in preparing this manuscript is very much appreciated.

Fig. (8). The newly engineered PDB under development has a three-tier architecture consisting of a presentation layer, an application layer, and a persistence layer. This modular design will allow for easier maintenance and greater flexibility, and for evolution of the site as new technologies are developed.
ABBREVIATIONS

SDSC = San Diego Supercomputer Center
PDB = Protein Data Bank
UCSD = University of California San Diego
LIMS = Laboratory information management system
CARB = Center for Advanced Research in Biotechnology
MAD = Multiwavelength anomalous diffraction
NIST = National Institute of Standards and Technology
mmCIF = Macromolecular Crystallographic Information File
CGI = Common gateway interface
FTP = File transfer protocol
ADIT = AutoDep Input Tool
API = Applications programming interface
SCOP = Structural Classification of Proteins
CATH = Class, Architecture, Topology and Homologous superfamily
NCBI = National Center for Biotechnology Information
CORBA = Common Object Request Broker Architecture
OMG = Object Management Group
BMRB = BioMagResBank
RCSB = Research Collaboratory for Structural Bioinformatics

REFERENCES

Allen, F. H., Davies, J. E., Galloy, J. J., Johnson, O., Kennard, O., Macrae, C. F., Mitchell, E. M., Mitchell, G. F., et al. (1991). The development of versions 3 and 4 of the Cambridge Structural Database System. J. Chem. Inf. Comp. Sci. 31: 187-204.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. and Bourne, P. E. (2000). The Protein Data Bank. Nucleic Acids Res. 28: 235-242.

Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer Jr., E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T., et al. (1977). Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112: 535-542.

Bhat, T. N., Bourne, P., Feng, Z., Gilliland, G., Jain, S., Ravichandran, V., Schneider, B., Schneider, K., et al. (2001). The PDB data uniformity project. Nucleic Acids Res. 29: 214-218.

Bourne, P. E., Berman, H. M., Watenpaugh, K., Westbrook, J. D. and Fitzgerald, P. M. D. (1997). The macromolecular Crystallographic Information File (mmCIF). Meth. Enzymol. 277: 571-590.

Cold Spring Harbor Laboratory Press (1972).
Cold Spring Harbor Symposia on Quantitative Biology, vol. 36.

Greer, D., Westbrook, J. and Bourne, P. E. (2001). In Objects in Bio- and Chem-informatics (OIBC), Boston, MA.

Hendrickson, W. A. (1991). Determination of macromolecular structures from anomalous diffraction of synchrotron radiation. Science 254: 51-58.

Hogue, C., Ohkawa, H. and Bryant, S. (1996). A dynamic look at structures: WWW-Entrez and the Molecular Modeling Database. Trends Biochem. Sci. 21: 226-229.

International Union of Crystallography (1989). Policy on publication and the deposition of data from crystallographic studies of biological macromolecules. Acta Cryst. A45: 658.

Murzin, A. G., Brenner, S. E., Hubbard, T. and Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536-540.

Nature New Biology (1971). Protein Data Bank. Nature New Biol. 233: 223.

Nature Structural Biology (2000). Structural Genomics Supplement Issue. Nature Structural Biology 7: http://structbio.nature.com/.

Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. and Thornton, J. M. (1997). CATH – a hierarchic classification of protein domain structures. Structure 5: 1093-1108.

Ulrich, E. L., Markley, J. L. and Kyogoku, Y. (1989). Creation of a Nuclear Magnetic Resonance Data Repository and Literature Database. Protein Seq. Data Anal. 2: 23-37.

Weissig, H. and Bourne, P. E. (2003). In Structural Bioinformatics (Eds. Bourne, P. E. and Weissig, H.), John Wiley & Sons, Inc., Hoboken, NJ, pp. 217-236.

Westbrook, J., Feng, Z., Jain, S., Bhat, T. N., Thanki, N., Ravichandran, V., Gilliland, G. L., Bluhm, W., et al. (2002). The Protein Data Bank: Unifying the archive. Nucleic Acids Res. 30: 245-248.