Protein-DNA interaction dataset
Understanding the molecular details of protein-DNA interactions is critical for deciphering the mechanisms of gene
regulation.
This dataset contains 56 proteins bound to double-stranded DNA (dsDNA). A total of 427 protein-DNA complexes with
resolution better than 3.0 Å were extracted from the Nucleic Acid Database (NDB). Proteins in the complexes were
clustered into homologous families, and a non-redundant representative set of 56 proteins was obtained.
Each input in the dataset contains the identities of 2n+1 residues in the form
$X = (x_{t-n}, x_{t-n+1}, \ldots, x_{t-1}, x_t, x_{t+1}, \ldots, x_{t+n-1}, x_{t+n})$, where $x_t$ is the identity of the target residue, and
$x_{t-n}, x_{t-n+1}, \ldots, x_{t-1}$ and $x_{t+1}, \ldots, x_{t+n-1}, x_{t+n}$ are the identities of the n residues on either side of the
target residue. (In this case, n = 4 was found to yield optimal performance.)
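The window encoding described above can be sketched in a few lines of Python. This is only an illustration: the function name `residue_windows` and the padding symbol `-` are assumptions, since the text does not say how residues near the chain termini (which have fewer than n neighbors on one side) were handled.

```python
from typing import List

PAD = "-"  # hypothetical placeholder for positions beyond the chain ends


def residue_windows(sequence: str, n: int = 4, pad: str = PAD) -> List[str]:
    """Return one window of 2n+1 residue identities per target residue.

    Padding at the termini is an assumption made for this sketch; the
    dataset description does not specify how chain ends were treated.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    padded = pad * n + sequence + pad * n
    # For target index t in the original sequence, padded[t : t + 2n + 1]
    # holds x_{t-n}, ..., x_t, ..., x_{t+n}.
    return [padded[t:t + 2 * n + 1] for t in range(len(sequence))]


if __name__ == "__main__":
    for window in residue_windows("ACDEFGHIK", n=4):
        print(window)
```

With n = 4, each window holds 9 residue identities, and the target residue sits at the center of the window.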
The relative accessible surface area (rASA) of each residue (in the absence of DNA) was computed using the NACCESS
program. A training example with the target residue rASA added is in the form
$X = (x_{t-n}, x_{t-n+1}, \ldots, x_{t-1}, x_t, x_{t+1}, \ldots, x_{t+n-1}, x_{t+n}, r_t, c)$, where $x_i$ and $c$ are defined as above, and $r_t$ is the
rASA of the target residue. A test example is a vector of the form
$X = (x_{t-n}, x_{t-n+1}, \ldots, x_{t-1}, x_t, x_{t+1}, \ldots, x_{t+n-1}, x_{t+n}, r_t)$. Input is encoded in a similar
way when other features are added. The entropy of each sequence position was extracted from the HSSP database
(http://www.cmbi.kun.nl/gv/hssp/). The entropy is normalized to the range 0-100. The lower the entropy, the
more conserved the corresponding sequence position.
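As a rough sketch of how such examples might be assembled, the Python below appends the rASA (and optionally an entropy value) to a residue window and shows one plausible way of scaling entropy to 0-100. Several points are assumptions rather than facts from the description: the function names are hypothetical, the entropy is taken to be the Shannon entropy of an alignment column divided by its maximum for a 20-letter alphabet, the entropy feature is placed after the rASA, and the label c is treated as an opaque value because its definition is not part of this excerpt. In practice the entropy values would simply be read from HSSP rather than recomputed.

```python
import math
from collections import Counter
from typing import Any, Optional, Sequence, Tuple


def normalized_entropy(column: Sequence[str], alphabet_size: int = 20) -> float:
    """Shannon entropy of one alignment column, scaled to the range 0-100.

    Assumed normalization: entropy / ln(alphabet_size) * 100.
    """
    counts = Counter(column)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column is empty")
    entropy = sum(-(k / total) * math.log(k / total) for k in counts.values())
    return 100.0 * entropy / math.log(alphabet_size)


def make_example(
    window: str,
    rasa: float,
    label: Optional[Any] = None,
    entropy: Optional[float] = None,
) -> Tuple[Any, ...]:
    """Build (x_{t-n}, ..., x_{t+n}, r_t[, entropy][, c]).

    A training example carries the label c; a test example omits it.
    The position of the entropy feature is an assumption of this sketch.
    """
    features = list(window) + [rasa]
    if entropy is not None:
        features.append(entropy)
    if label is not None:
        features.append(label)
    return tuple(features)


if __name__ == "__main__":
    train = make_example("ACDEFGHIK", rasa=35.2, label=1)
    test = make_example("ACDEFGHIK", rasa=35.2)
    print(len(train), len(test))
```

A fully conserved column gives a normalized entropy of 0, and a column with all 20 amino acids equally represented gives 100, which matches the reading that lower entropy means higher conservation.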