Protein-DNA interaction dataset

Understanding the molecular details of protein-DNA interactions is critical for deciphering the mechanisms of gene regulation. This dataset contains 56 proteins bound to double-stranded DNA (dsDNA). A total of 427 protein-DNA complexes with resolution better than 3.0 Å were extracted from the Nucleic Acid Database (NDB). The proteins in these complexes were clustered into homologous families, and a non-redundant representative set of 56 proteins was obtained.

Each input in the dataset contains the identities of 2n+1 residues in the form

$$X = (x_{t-n}, x_{t-n+1}, \ldots, x_{t-1}, x_t, x_{t+1}, \ldots, x_{t+n-1}, x_{t+n})$$

where $x_t$ is the identity of the target residue, and $x_{t-n}, x_{t-n+1}, \ldots, x_{t-1}$ and $x_{t+1}, \ldots, x_{t+n-1}, x_{t+n}$ are the identities of the n residues on either side of the target residue (in this case, n=4 was found to yield optimal performance).

The relative accessible surface area (rASA) of each residue (in the absence of DNA) was computed using the NACCESS program. A training example with the target residue's rASA added is in the form

$$X = (x_{t-n}, x_{t-n+1}, \ldots, x_{t-1}, x_t, x_{t+1}, \ldots, x_{t+n-1}, x_{t+n}, r_t, c)$$

where $x_i$ and $c$ are defined as above, and $r_t$ is the rASA of the target residue. A test example is a vector of the form

$$X = (x_{t-n}, x_{t-n+1}, \ldots, x_{t-1}, x_t, x_{t+1}, \ldots, x_{t+n-1}, x_{t+n}, r_t)$$

Input is encoded in a similar way when other features are added.

The entropy of each sequence position was extracted from the HSSP database (http://www.cmbi.kun.nl/gv/hssp/). The entropy is normalized to the range 0-100. The lower the entropy, the more conserved the corresponding sequence position.
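
The window encoding described above can be illustrated with a minimal Python sketch. It is not the dataset authors' code: the function name `window_examples`, the padding symbol `PAD` used for positions beyond the chain termini, and the reading of c as a per-residue label are all assumptions made for illustration, since the description does not say how termini are handled and does not define c in this passage.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical padding symbol for window positions that fall outside the
# sequence; the dataset description does not specify how termini are handled.
PAD = "-"


def window_examples(
    sequence: str,
    rasa: Sequence[float],
    labels: Optional[Sequence[int]] = None,
    n: int = 4,
) -> List[Tuple]:
    """Build one example per residue: 2n+1 residue identities, then r_t.

    If `labels` is given, the label c (assumed here to be a per-residue
    value) is appended, giving the training form; otherwise the shorter
    test form is returned.
    """
    if len(rasa) != len(sequence):
        raise ValueError("rasa must have one value per residue")
    if labels is not None and len(labels) != len(sequence):
        raise ValueError("labels must have one value per residue")

    # Pad both ends so every target residue has n neighbours on each side.
    padded = PAD * n + sequence + PAD * n
    examples = []
    for t in range(len(sequence)):
        # padded[t + n] is the target residue x_t; the slice covers
        # x_{t-n} .. x_{t+n}.
        window = tuple(padded[t : t + 2 * n + 1])
        row = window + (rasa[t],)
        if labels is not None:
            row = row + (labels[t],)
        examples.append(row)
    return examples


if __name__ == "__main__":
    # Toy values only; these are not taken from the dataset.
    seq = "MKRAL"
    rasa = [10.0, 20.0, 30.0, 40.0, 50.0]
    labels = [0, 1, 1, 0, 0]
    for row in window_examples(seq, rasa, labels, n=1):
        print(row)
```

Additional per-residue features, such as the normalized entropy, could be appended to each row in the same way as the rASA value.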