
ProtoNet: Automatic Hierarchical Classification of Proteins
Brief Database Overview
December 2003

Present Configuration

• 2 Unix servers: Red Hat Linux 7.1 and FreeBSD 4.7. The first serves as the web server, the second as the database server
• Each server is a dual PIII 1000 MHz with 2 GB RAM
• Database engine: PostgreSQL 7.2
• Disk situation: 130 GB (with mirror) and 265 GB (RAID 5), both at critical utilization levels (93%)

Database Figures

• Around 150 tables, consuming ~160 GB
• 2 major protein (sequence) sources
• 3 protein motif sources
• 8 keyword sources
• Multiple versions of each source are kept and handled under release management (3 releases so far; discussed later)
• 7 clusterings (cluster-tree setups)

Database setup process

[Diagram of the pipeline stages: Proteins, Motif Systems, Annotation Systems, AVA Blast, Keywords, Clustering, Clusters Tree, Annotated Clustering, Web Interface, Satellite systems, Research]

Building the PROTEINS table

Protein sources:
• Swiss-Prot (~130K sequences)
• TrEMBL (~950K sequences)

Each protein record comes with many attributes:
• Swiss-Prot ID + accession numbers
• Description
• Amino-acid sequence + molecular weight + length
• Keywords + motifs (features)
• Linkage information to external databases (PDB, Taxonomy, publication journals, etc.)
• and more…

We also calculate additional attributes, such as the theoretical isoelectric point (based on the amino-acid sequence).
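The slides do not say how the theoretical isoelectric point is computed. As a rough illustration only, the sketch below uses the generic textbook approach: sum the Henderson-Hasselbalch charges of the ionizable groups and bisect for the pH at which the net charge is zero. The pKa values and the function names `net_charge` and `isoelectric_point` are assumptions made for this example, not ProtoNet's actual parameters or code.

```python
# Illustrative pKa values (assumed; published pKa sets differ from one another).
POSITIVE_PKA = {"K": 10.5, "R": 12.5, "H": 6.0}              # side chains that carry + charge
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}     # side chains that carry - charge
N_TERM_PKA = 9.0   # free amino terminus
C_TERM_PKA = 2.3   # free carboxyl terminus


def net_charge(seq, ph):
    """Net charge of a peptide at the given pH (Henderson-Hasselbalch)."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))
    for aa in seq:
        if aa in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return charge


def isoelectric_point(seq, tol=1e-4):
    """Bisect on pH in [0, 14]; net charge decreases monotonically with pH."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid   # still positively charged: the pI lies at a higher pH
        else:
            hi = mid
    return (lo + hi) / 2
```

A lysine-rich peptide should come out with a higher value than an aspartate-rich one, for example `isoelectric_point("KKKK") > isoelectric_point("DDDD")`.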
Building the PROTEINPROTEIN table

[Diagram: Proteins, AVA Blast]

Protein sources:
• Swiss-Prot (~130K sequences)
• TrEMBL (~950K sequences)

• Each pair of Swiss-Prot proteins is checked for its BLAST E-score, which serves as a distance measure
• The set of distance measures between all proteins serves as the base metric for the clustering process (discussed later)
• TrEMBL sequences are ignored at this stage, both because of space and computation limitations and because of the noise they would probably introduce into the clustering process (we have not had the chance to actually test this hypothesis)

Building the CLUSTERS table

[Diagram: Proteins, AVA Blast, Clustering, Clusters Tree]

The clustering process:
• All initial proteins are defined as singleton clusters
• The closest pair of clusters is merged into a new cluster (the distance between clusters is the average distance over all of their proteins)
• The distance between the merged clusters induces a time measure, representing their "death-time", which is equal to the newborn cluster's "birth-time"
• Each cluster gets additional attributes, such as size, depth, height, average length of proteins, and more…
• The process repeats until a distance cutoff is met
• The overall result is the clusters tree

Columns of the CLUSTERS table: clustid, fatherid, size, height, depth, birthtime, deathtime, …

Building the CLUSTERS table (continued)

At this point TrEMBL comes into the picture. Each TrEMBL sequence is "hung" a posteriori on the tree, using our Classify-Your-Protein algorithm.
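The merging procedure from the clustering slide can be sketched as a naive average-linkage loop. This is a minimal illustration under stated assumptions, not ProtoNet's implementation: it assumes a complete distance table (every protein pair has a value), uses the arithmetic mean as the "average", and gives singletons a birth-time of 0. The function name `cluster_average_linkage` and the `dist` layout are hypothetical; the row keys mirror the column names listed on the slide.

```python
from itertools import combinations


def cluster_average_linkage(n, dist, cutoff):
    """Merge the closest pair of clusters until the cutoff is exceeded.

    n      -- number of proteins, identified as 0..n-1
    dist   -- dict: frozenset({p, q}) -> distance (assumed complete)
    cutoff -- stop once the closest pair is farther apart than this
    Returns a dict of rows keyed by clustid.
    """
    members = {i: [i] for i in range(n)}  # live clusters -> their proteins
    rows = {i: {"clustid": i, "fatherid": None, "size": 1,
                "birthtime": 0.0, "deathtime": None} for i in range(n)}
    next_id = n
    while len(members) > 1:
        best = None
        for a, b in combinations(members, 2):
            total = sum(dist[frozenset((p, q))]
                        for p in members[a] for q in members[b])
            avg = total / (len(members[a]) * len(members[b]))
            if best is None or avg < best[0]:
                best = (avg, a, b)
        avg, a, b = best
        if avg > cutoff:
            break
        merged = members.pop(a) + members.pop(b)
        members[next_id] = merged
        # The merge distance is the children's death-time and the parent's birth-time.
        rows[next_id] = {"clustid": next_id, "fatherid": None,
                         "size": len(merged), "birthtime": avg,
                         "deathtime": None}
        for child in (a, b):
            rows[child]["fatherid"] = next_id
            rows[child]["deathtime"] = avg
        next_id += 1
    return rows
```

Recomputing every pairwise average on each pass is far too slow for ~130K sequences; the loop is written for readability only.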
The Classify-Your-Protein algorithm either finds a suitable cluster or declares the TrEMBL sequence a singleton.

Note: leaves are no longer the singletons!

Several DFS-based orderings of the clusters in the tree are kept, and allow an efficient (and elegant!) way to query the cluster hierarchy.

We also calculate and attach additional attributes to each cluster, such as:
amountofclusters_atbirth, amountofsingletons_atbirth
averlength, varlength
averdepth, vardepth
dfsnum, dfsnum_ofnextnondesc
subtreesize, lifetime, …

Extracting annotation from external sources

[Diagram: Motif Systems, Annotation Systems, Keywords]

Notes:
• External databases are downloaded from the Internet in varying file formats (all sorts of strange ASCII formats, XML, …)
• Generally speaking, the more "biological" the database is, the more peculiar its format
• Most biological databases are indexed uniquely, but not all with an integer
• Being dynamic and developing, the external databases tend to change their format every now and then!

Perl parsing scripts were written for each external source…

Building the MOTIFS + PROTEINMOTIF tables

Motif sources:
• InterPro (PRINTS, Pfam, PROSITE, ProDom, Smart, TIGRFAMs, PIR SuperFamily, SuperFamily)
• Swiss-Prot (FEATUREs)
• TrEMBL (FEATUREs)

Each motif record comes with the following attributes:
• ID in external system
• Description
• Starting AA position
• Ending AA position

Building the KEYWORDS + PROTEINKEYWORD tables

Keyword sources: Swiss-Prot, TrEMBL, Enzyme, PDB, SCOP, GO, NCBI Taxonomy, InterPro

• Biological protein properties are extracted from each database, and are stored in suitable tables (proteinec, proteinpdb, pdbscop, proteinspecies, ...)
• Keywords are generated for each protein from its associations to these biological properties
• These keywords are also grouped into keyword-types
• Statistics (quantity/frequency) are calculated for each keyword and keyword-type, according to the proteinkeyword table

Combining annotations with the clustering

[Diagram: Proteins, Motif Systems, Annotation Systems, AVA Blast, Keywords, Clustering, Clusters Tree, Annotated Clustering]

• In the process of combining the annotations with the clustering, we "forget" where each keyword came from.
• Statistics are calculated for each keyword in the clustering, and for selected cluster-keyword pairs (TP, FN, specificity, sensitivity, significance measures).
• Several tables are maintained for this stage, some of which are huge: clusterings, clusteringprotein, clusteringkeyword, clusteringkeytype, clusters, clusterkeyword

Space/time limitations force us to be selective with regard to:
1. Which keyword-types enter this phase in the first place
2. Which cluster-keyword pairs are analyzed

Release management

• A RELEASE is a set of clustering trees and protein / motif / keyword sources, each of a certain version.
• Some of the releases were built explicitly for the web users. A default clustering of the most up-to-date release is exposed to the standard user. Advanced users can switch between the various releases and clustering trees.
• In order to allow multiple releases, we keep more than one version of each of the external (imported) databases + clustering trees.
• This is also important for research work (which can also be comparative).
• In order to efficiently manage multiple versions (of external databases), each record contains a special number field which represents a bits-vector.
• The bit in position i is turned ON if (and only if) the record is part of version i of the corresponding source.
• This serves as a CONDENSATION for the data in the tables, and was introduced due to our space problems.
• For example, if a protein-sequence record appears (exactly the same) in versions 2 and 3 of Swiss-Prot, the bits-vector integer is equal to 6 (00000110 in binary)
• This bits-vector field is unnecessary for the clustering tables, since each clustering is associated with a release (+ versions) and the tuples are (generally) non-repetitive.

This implies having NO CONDENSATION in the cluster-related tables, and hence they consume A LOT of disk space (especially the cluster-keyword table).

Problems…

ProtoNet's database is a HUGE setup, which seems to be beyond our technical capabilities. We are limited in:
1. Manpower: we badly need a steady, experienced team of database and system developers, dedicated solely to this project (rather than preoccupied research students). This is the only way this project can fulfill its objective: to be an up-to-date homepage for Proteomics!
2. Disk space
3. CPU power: both for serving outside users and for quicker release upgrades.
4. A robust, trustworthy and efficient database engine (PostgreSQL seems to fail on all three counts!)

Thanks!

Made by Uri Inbar and Hillel Fleischer
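As an illustration of the bits-vector field from the release management slide, the sketch below encodes a set of source versions as an integer and tests membership. It assumes positions are counted from 1 starting at the least significant bit, which is what makes versions 2 and 3 come out as 6, matching the slide's example. The function names are hypothetical and are not taken from ProtoNet's schema or code.

```python
def versions_to_mask(versions):
    """Encode version numbers (1-based) as a bits-vector integer."""
    mask = 0
    for v in versions:
        mask |= 1 << (v - 1)  # version i sets the i-th bit from the right
    return mask


def in_version(mask, version):
    """True if the record belongs to the given version of its source."""
    return bool(mask & (1 << (version - 1)))


def mask_to_versions(mask):
    """Decode a bits-vector integer back into a sorted list of versions."""
    return [i + 1 for i in range(mask.bit_length()) if (mask >> i) & 1]


# The slide's example: a record present in versions 2 and 3 of Swiss-Prot.
example = versions_to_mask([2, 3])
```

Here `example` equals 6, `in_version(example, 2)` is true, and `in_version(example, 1)` is false. A record unchanged across versions is stored once with a wider mask, which is the condensation the slide describes.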
