BIOINFORMATICS APPLICATIONS NOTE
Vol. 23 no. 23 2007, pages 3247–3248
doi:10.1093/bioinformatics/btm519
Structural bioinformatics
NPIDB: a Database of Nucleic Acids–Protein Interactions
Sergei Spirin1,*, Mikhail Titov2, Anna Karyagina2,3 and Andrei Alexeevski1
1A.N. Belozersky Institute of Physical and Chemical Biology of Moscow State University, Moscow, Russia, 2Institute
of Agricultural Biotechnology and 3Gamaleya Institute of Epidemiology and Microbiology, Moscow, Russia
Received on June 2, 2006; revised on August 28, 2007; accepted on October 12, 2007
Advance Access publication October 31, 2007
Associate Editor: Alex Bateman
ABSTRACT
Summary: The database NPIDB (Nucleic Acids—Protein Interaction
DataBase) contains information derived from structures of
DNA–protein and RNA–protein complexes extracted from PDB
(1834 complexes in July 2007). It is organized as a collection of
files in PDB format and is equipped with a web-interface and a set
of tools for extracting biologically meaningful characteristics of
complexes. The database content is updated weekly.
Availability: http://monkey.belozersky.msu.ru/NPIDB/
Contact: [email protected]
1 INTRODUCTION
More than 1000 structures of DNA–protein and RNA–protein
complexes are currently available in Protein Data Bank (PDB).
However, it remains difficult to obtain information on
intermolecular contacts and modes of interaction, or to compare
related complexes. We created a web-based database that provides
access to systematically organized information about all available
structures of these complexes. We also developed several algorithms
for analyzing the complexes; implementations of these algorithms
are integrated into the database.
Web databases such as AANT (Hoffman et al., 2004)
and Protein–DNA Recognition Database
(http://gibk26.bse.kyutech.ac.jp/jouhou/3dinsight/recognition.html)
are devoted to facilitating data mining on nucleic acid–protein
interactions. These resources provide a number of useful tools for
the analysis of protein–DNA interactions. However, several
requirements are not met by the existing tools. For example,
it would be useful to have a classification of DNA-binding domains
and motifs incorporated into the database. Information should also
be available on all kinds of DNA–protein contacts, including
hydrophobic ones. Our goal is to develop a new Internet resource
that broadens the spectrum of available tools in the area of
nucleic acid–protein interaction.
The resource NPIDB (Nucleic Acids–Protein Interaction
DataBase) includes a collection of files in PDB format
containing structural information on DNA–protein and
RNA–protein complexes extracted from PDB, and a number
of online tools for analysis of the complexes. These tools are:
an original program CluD for analysis of hydrophobic clusters
on interfaces (Alexeevski et al., 2004; Karyagina et al., 2006),
a program for detecting potential hydrogen bonds, and
visualization of structures with Jmol and Chime.
*To whom correspondence should be addressed.
2 CONTENT
Structures of protein—nucleic acid complexes are extracted
from PDB as files in PDB format representing both PDB
entries (asymmetric units) and biological units.
Structures are revised by our experts in order to correct
possible mistakes (such as duplication of atoms) and
inconveniences (such as two or more variants of a structure placed in
one coordinate space; see, e.g. PDB entry 1QPI, where two
variants of each DNA chain are superimposed). The manually
corrected entries are also included in the database as ‘revised
biological units’. Thus, in some cases, the NPIDB content
differs from the original PDB content; in particular, for some
complexes (1FJL, 1QPI, etc.) NPIDB contains additional
biological units compared with the PDB.
All structural files of NPIDB are downloadable.
The content is updated weekly by a dedicated program
module.
3 SOFTWARE
We developed an original program module for extracting
structures of protein—nucleic acid complexes from PDB. Our
module identifies chains of DNA, RNA and protein directly in
the coordinate (ATOM and HETATM) section of a PDB file.
If a PDB file contains at least one protein chain and at least one
nucleic acid chain, it is selected for incorporation into NPIDB.
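The selection rule above can be illustrated with a minimal Python sketch. It is not the authors' module: the residue-name sets, the per-chain majority vote and the function names classify_chains and is_complex are assumptions made for illustration. Only the fixed-column layout of ATOM and HETATM records (residue name in columns 18–20, chain identifier in column 22) comes from the PDB file format.

```python
from collections import defaultdict

# Assumed residue vocabularies; modified residues are simply ignored here.
AMINO_ACIDS = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}
DNA_RESIDUES = {"DA", "DC", "DG", "DT"}
RNA_RESIDUES = {"A", "C", "G", "U"}


def classify_chains(pdb_lines):
    """Map each chain ID to 'protein', 'DNA' or 'RNA' by majority of atom records."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 22:
            continue
        res_name = line[17:20].strip()   # columns 18-20
        chain_id = line[21]              # column 22
        if res_name in AMINO_ACIDS:
            kind = "protein"
        elif res_name in DNA_RESIDUES:
            kind = "DNA"
        elif res_name in RNA_RESIDUES:
            kind = "RNA"
        else:
            continue                     # water, ligands, unknown residues
        counts[chain_id][kind] += 1
    return {chain: max(kinds, key=kinds.get) for chain, kinds in counts.items()}


def is_complex(pdb_lines):
    """True if the file has at least one protein chain and one nucleic acid chain."""
    kinds = set(classify_chains(pdb_lines).values())
    return "protein" in kinds and bool(kinds & {"DNA", "RNA"})
```

Older PDB files name DNA and RNA residues differently, so a production tool would need a more careful classification than the name sets assumed here.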
Every week, the updating module (i) examines the new entries
that have appeared in the local PDB mirror at A.N. Belozersky Institute
using the selection tool described above, (ii) selects protein–nucleic
acid complexes and (iii) calls the database-filling module.
We developed a set of Perl scripts for operating with the
available complexes. The first script forms the
HTML table with the list of available complexes, including
information on PDB codes, PDB headers, protein, DNA and
RNA chains, PDB entry creation date, and nucleic acid type
(DNA, RNA or both); the second sorts this list with respect to
each feature and the third generates web pages with general
information about individual complexes.
We also developed a program that detects domains in protein
chains using information from the Pfam and SCOP domain
databases. The information on the domain families and their
representatives in the database is organized as a set of
dynamic web pages generated by dedicated scripts.
© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
A tool for detecting potential hydrogen bonds between a
nucleic acid and a protein is integrated into NPIDB. It uses the
simplest criterion of hydrogen bond formation: a potential
hydrogen bond is detected if the distance between oxygen or
nitrogen atoms of different molecules is less than 3.7 Å.
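A distance-only criterion of this kind can be sketched in a few lines of Python. The sketch below assumes a cutoff of 3.7 Å, identifies oxygen and nitrogen atoms by the first letter of the atom name in standard residues, and uses hypothetical names (parse_polar_atoms, potential_hbonds). It illustrates the criterion and is not the NPIDB tool itself.

```python
import math

# Assumed residue vocabularies (standard residues only).
AMINO_ACIDS = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}
NUCLEOTIDES = {"DA", "DC", "DG", "DT", "A", "C", "G", "U"}


def parse_polar_atoms(pdb_lines):
    """Collect N and O atoms of standard protein and nucleic acid residues."""
    atoms = []
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 54:
            continue
        res_name = line[17:20].strip()
        if res_name in AMINO_ACIDS:
            kind = "protein"
        elif res_name in NUCLEOTIDES:
            kind = "nucleic"
        else:
            continue
        atom_name = line[12:16].strip()
        if atom_name[:1] not in ("N", "O"):   # assumption: name starts with element
            continue
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        atoms.append((kind, line[21], res_name, line[22:26].strip(), atom_name, xyz))
    return atoms


def potential_hbonds(pdb_lines, cutoff=3.7):
    """List protein/nucleic acid N-O atom pairs closer than `cutoff` angstroms."""
    atoms = parse_polar_atoms(pdb_lines)
    protein = [a for a in atoms if a[0] == "protein"]
    nucleic = [a for a in atoms if a[0] == "nucleic"]
    pairs = []
    for p in protein:
        for n in nucleic:
            distance = math.dist(p[5], n[5])
            if distance < cutoff:
                # Each atom is reported as (chain, residue, number, atom name).
                pairs.append((p[1:5], n[1:5], round(distance, 2)))
    return pairs
```

The all-against-all loop is quadratic in the number of polar atoms. That is acceptable for a single complex, but a spatial grid or k-d tree would be the usual choice for bulk processing.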
The program CluD (Alexeevski et al., 2004; Karyagina et al.,
2006) for detecting hydrophobic clusters in macromolecular
structures is adapted to work with DNA–protein and
RNA–protein interfaces and integrated into NPIDB.
The set of hydrophobic clusters determined in each structure
depends on the chosen threshold distance of hydrophobic
interaction (the recommended threshold range is 4.5–5.4 Å; the
default value is 5.4 Å). The program Profile determines how
this set depends on the threshold. The result of
the program is a tree graph showing how clusters
appear and join as the threshold increases.
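The dependence of a cluster set on a distance threshold can be illustrated with a simple single-linkage sketch in Python. It assumes that a cluster is a connected component of the graph linking atoms no farther apart than the threshold. This is a simplification chosen for illustration, and the actual CluD and Profile algorithms may define clusters differently. The function names merge_events and clusters_at are hypothetical.

```python
import math
from itertools import combinations


def merge_events(coords, max_threshold=5.4):
    """Return (distance, members_a, members_b) for each merge up to max_threshold.

    The ordered list of merges is the tree: each event joins two groups
    that first become connected at the given distance.
    """
    parent = list(range(len(coords)))
    members = {i: [i] for i in range(len(coords))}

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = sorted(
        (math.dist(coords[i], coords[j]), i, j)
        for i, j in combinations(range(len(coords)), 2)
    )
    events = []
    for distance, i, j in pairs:
        if distance > max_threshold:
            break
        root_i, root_j = find(i), find(j)
        if root_i == root_j:
            continue
        events.append((round(distance, 2), sorted(members[root_i]), sorted(members[root_j])))
        parent[root_j] = root_i
        members[root_i] = members[root_i] + members.pop(root_j)
    return events


def clusters_at(coords, threshold):
    """Groups of atom indices (singletons included) at a given threshold."""
    clusters = {frozenset([i]) for i in range(len(coords))}
    for _, a, b in merge_events(coords, threshold):
        clusters -= {frozenset(a), frozenset(b)}
        clusters.add(frozenset(a) | frozenset(b))
    return sorted(sorted(c) for c in clusters)
```

For four atoms placed on a line at 0, 4, 9 and 20 Å, a threshold of 4.5 Å groups only the first two, while the default 5.4 Å also joins the third. The list returned by merge_events records those joins in order.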
For every Pfam or SCOP domain present in the database, the
software determines whether the domain is in contact with nucleic acid.
Clicking the ‘HBonds’ button on the individual page of a
complex produces a text table with information about all
atom pairs potentially forming hydrogen bonds between a
nucleic acid and a protein.
Each biological unit of each structure can be visualized with
Jmol (http://jmol.sourceforge.net/).
A list of Pfam families containing domains from NPIDB
structures is available (http://monkey.belozersky.msu.ru/cgi-bin/pfam.pl).
All structures with domains of a specific
Pfam family can be selected. An analogous list of SCOP
families is also available (http://monkey.belozersky.msu.ru/cgi-bin/scop.pl).
For every Pfam family containing a domain
whose structure in complex with nucleic acid is present in the
database, a representative with the best resolution is selected,
and a file in PDB format is generated with the structure of the
domain together with the part of the nucleic acid in contact with
the domain. A collection of the representatives of Pfam families
of nucleic acid-binding domains is available
(http://monkey.belozersky.msu.ru/npidb/cgi-bin/sample.pl).
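Choosing one representative per family by best resolution amounts to a grouped minimum, as the following Python sketch shows. The record layout (family, PDB code, resolution in Å) and the function name best_representatives are assumptions for illustration. How NPIDB treats ties or structures without a resolution value (e.g. NMR models) is not stated in the text, so the sketch simply skips records with no resolution and keeps the first of any tied entries.

```python
def best_representatives(domains):
    """Pick, for each family, the entry with the lowest (best) resolution.

    `domains` is an iterable of (family, pdb_code, resolution) tuples,
    where resolution is in angstroms or None if unavailable.
    """
    best = {}
    for family, pdb_code, resolution in domains:
        if resolution is None:
            continue  # assumption: entries without a resolution are not eligible
        if family not in best or resolution < best[family][1]:
            best[family] = (pdb_code, resolution)
    return best
```

With placeholder records such as ("FAM_A", "0AAA", 2.5) and ("FAM_A", "0AAB", 1.8), the function returns the second entry for FAM_A, since a smaller value in Å means a better-resolved structure.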
4 DESIGN
The set of available complexes is presented in the form of a list
(http://monkey.belozersky.msu.ru/cgi-bin/nuc_prot.pl) that
can be sorted by the PDB code, the date of appearance in
PDB, the PDB header (classification) or the type of
nucleic acid (DNA, RNA or both). The list is equipped with a
search over keywords appearing in headers and titles of PDB
entries. Each complex has its own web page containing general
information about the complex, the table of biological units,
and shortcuts to all available tools for its analysis. The page is
accessible by clicking the PDB code in the list of
complexes. The available tools can be applied to the initial
PDB entry or to any biological unit; they include an option
to download the structure in PDB format, two variants of the CluD
service, the Profile service, hydrogen bond detection, Jmol
visualization, and links to other resources (PDB, AANT, etc.).
The table of biological units for each PDB entry includes
information on DNA, RNA and protein chains as well as the
number of models in each unit (see, e.g.
http://monkey.belozersky.msu.ru/cgi-bin/browse.pl?id=1CVJ).
The table of Pfam domains includes the domain ID, its short
description, and its location (chain, residue numbers), for each
domain present in the given PDB entry.
The table of SCOP domains includes the domain hierarchy,
its short description, and its location (chain, residue numbers),
for each domain present in the given PDB entry.
The page of the CluD service for identifying hydrophobic
clusters provides two options: first, to find all clusters in
the entire structure, and second, to find clusters at the
nucleic acid–protein interface. Each individual web page of a complex
has two corresponding hyperlinks to the CluD page. When a
user visits the CluD page via one of these hyperlinks, the service
form appears pre-filled: it contains the code of the complex, and the
default options correspond to the hyperlink used.
The result of CluD is available in three forms: a text file with
a table of hydrophobic clusters, a RasMol script containing
definitions of the sets of atoms forming the hydrophobic
clusters, and online, as a Chime (see
http://www.mdl.com/products/framework/chime/) page visualizing the clusters.
5 FUTURE WORK
Each family of DNA-recognizing proteins has its own mode of
DNA recognition (e.g. recognition via the DNA major groove
by an alpha-helix, or recognition via the DNA minor groove by
beta-strands). We plan to classify all families of DNA-recognizing
domains according to their recognition mode and
to make it possible to select complexes of a certain class.
We plan to improve our tool for predicting hydrogen bonds,
taking into account not only atom-to-atom distances but also
bond angles and subtypes of the atoms involved. We
also plan to add a tool for calculating water bridges (Karyagina
et al., 2005) at the nucleic acid–protein interface.
Statistics of residue–base contacts (for every family and
overall) will be available on the site.
For every family of domains, a sequence alignment of its
members will be available together with the information on
secondary structure, contacts with the nucleic acid and other
features. Superimposed structures of the family members will
be downloadable.
ACKNOWLEDGEMENTS
The work is supported by the Russian Foundation of Basic
Research, grants 06–07–89143 and 06–04–49558, and INTAS,
grant 05–1000008–8028.
Conflict of Interest: none declared.
REFERENCES
Alexeevski,A.V. et al. (2004) CluD, a Program for the Determination of
Hydrophobic Clusters in 3D Structures of Protein and Protein-Nucleic
Acids Complexes. Biofizika, 48 (Suppl. 1), 146–156.
Hoffman,M.M. et al. (2004) AANT: the Amino Acid-Nucleotide Interaction
Database. Nucleic Acids Res., 32 (Database issue), D174–D181.
Karyagina,A. et al. (2005) The role of water in homeodomain-DNA interaction.
In Kolchanov,N. and Hofestaedt,R. (eds.) Bioinformatics of Genome
Regulation and Structure II. Springer Science+Business Media, Inc.,
New York, pp. 247–257.
Karyagina,A. et al. (2006) Analysis of conserved hydrophobic cores in proteins
and supramolecular complexes. J. Bioinform. Comput. Biol., 4, 357–372.