ProtoNet
Automatic Hierarchical Classification of Proteins
Brief Database overview
December 2003
ProtoNet Database overview
Present Configuration
• 2 Unix servers: Red Hat Linux 7.1, FreeBSD 4.7
   (the first serves as a web server, the second as a database server)
• Each server is a dual PIII 1000 MHz with 2 GB RAM
• Database engine: PostgreSQL 7.2
Disk situation:
• 130 GB (mirrored)
• 265 GB (RAID 5)
Both are at critical utilization levels (93%)
Database Figures:
• Around 150 tables, consuming ~160GB
• 2 major protein (sequence) sources
• 3 protein motif sources
• 8 keyword sources
• Multiple versions of each source are kept and handled under release management (3 releases so far; described later)
• 7 clusterings (cluster-tree setups)
Database setup process:
[Diagram of the setup pipeline. Stages shown: Proteins, AVA Blast, Clustering, Clusters Tree; Motif Systems, Annotation Systems, Keywords; Annotated Clustering; Web Interface, Satellite systems, Research]
Building the PROTEINS table:
Protein sources:
• Swiss-Prot (~130K sequences)
• TrEMBL (~950K sequences)
Each protein record comes with many attributes:
• Swiss-Prot ID + accession numbers
• Description
• Amino-acid sequence + molecular weight + length
• Keywords + motifs (features)
• Linkage information to external databases (PDB, Taxonomy, publication journals, etc.)
• and more…
We also calculate additional attributes, such as the theoretical isoelectric point (based on the amino-acid sequence).
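A theoretical isoelectric point can be estimated from the sequence alone by finding the pH at which the computed net charge is zero. The sketch below is only an illustration: the pKa table is one commonly quoted set of values and is an assumption, since the slides do not say which values or method ProtoNet used. The function names are hypothetical.

```python
# Assumed pKa values (not taken from the slides); other published sets differ slightly.
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM = 8.6
PKA_C_TERM = 3.6


def net_charge(sequence: str, ph: float) -> float:
    """Net charge of the chain at a given pH (Henderson-Hasselbalch)."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))    # N-terminus
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))   # C-terminus
    for residue in sequence.upper():
        if residue in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return charge


def isoelectric_point(sequence: str, tolerance: float = 1e-4) -> float:
    """Bisect on pH in [0, 14] for the point where the net charge is zero."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid    # still positively charged: the pI lies at a higher pH
        else:
            high = mid
    return (low + high) / 2.0
```

Bisection works here because the net charge only decreases as the pH rises, so there is a single zero crossing.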
Building the PROTEINPROTEIN table (AVA Blast):
• Each pair of Swiss-Prot proteins is checked for its BLAST e-score, which serves as a distance measure
• The set of distance measures between all proteins serves as the base metric for the clustering process (described later)
• TrEMBL sequences are ignored at this stage, both because of space and computation limitations and because of the noise they would probably introduce into the clustering process (we have not yet had the chance to test this hypothesis)
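As an illustration of a pairwise distance table, the sketch below collapses directed BLAST hits into one value per unordered protein pair. The slides do not say how the two directions of a pair are reconciled; keeping the smaller e-value is an assumption made here, and the names `Hit` and `build_distance_table` are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Hit = Tuple[str, str, float]  # (query id, subject id, e-value)


def build_distance_table(hits: Iterable[Hit]) -> Dict[FrozenSet[str], float]:
    """Collapse directed BLAST hits into one distance per unordered pair."""
    distances: Dict[FrozenSet[str], float] = {}
    for query, subject, evalue in hits:
        if query == subject:
            continue  # self-hits say nothing about distances between proteins
        pair = frozenset((query, subject))
        # Assumption: keep the smaller (better) e-value of the two directions.
        if pair not in distances or evalue < distances[pair]:
            distances[pair] = evalue
    return distances
```

Keying on a frozenset makes the lookup symmetric, so (A, B) and (B, A) share one entry.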
Building the CLUSTERS table:
The clustering process:
• All initial proteins are defined as singleton clusters
• The closest pair of clusters is merged into a new cluster (the distance between clusters is the average distance over all of their proteins)
• The distance between the merged clusters induces a time measure, representing their “death-time”, which is equal to the new-born cluster’s “birth-time”
• Each cluster gets additional attributes, such as: size, depth, height, average length of proteins, and more…
• The process repeats until a distance cutoff is met
• The overall result is the clusters tree
CLUSTERS table columns: clustid, fatherid, size, height, depth, birthtime, deathtime, …
At this point TrEMBL comes into the picture:
• Each TrEMBL sequence is “hung” a posteriori on the tree, using our Classify-Your-Protein algorithm.
• This algorithm either finds a suitable cluster or declares the TrEMBL sequence a singleton.
• Note: Leaves are no longer singletons!
• Several DFS-based orderings of the clusters in the tree are kept, and allow an efficient (and elegant!) way to query the clusters hierarchy.
• We also calculate and attach additional attributes to each cluster, such as:
   amountofclusters_atbirth, amountofsingletons_atbirth
   averlength, varlength
   averdepth, vardepth
   dfsnum, dfsnum_ofnextnondesc
   subtreesize, lifetime, …
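One way a DFS ordering can support hierarchy queries is sketched below. It assumes that dfsnum is a pre-order number and that dfsnum_ofnextnondesc is the pre-order number of the first cluster visited after a cluster's whole subtree; that reading is inferred from the column names only, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def dfs_number(children: Dict[int, List[int]],
               root: int) -> Tuple[Dict[int, int], Dict[int, int]]:
    """Assign pre-order numbers and, per node, the number of the next non-descendant."""
    dfsnum: Dict[int, int] = {}
    next_nondesc: Dict[int, int] = {}
    counter = 0
    stack = [(root, False)]
    while stack:
        node, subtree_done = stack.pop()
        if subtree_done:
            # First number that will be handed out after this node's subtree.
            next_nondesc[node] = counter
            continue
        dfsnum[node] = counter
        counter += 1
        stack.append((node, True))
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return dfsnum, next_nondesc


def is_descendant(node: int, ancestor: int,
                  dfsnum: Dict[int, int], next_nondesc: Dict[int, int]) -> bool:
    """A node lies strictly below an ancestor iff its number falls in the ancestor's range."""
    return dfsnum[ancestor] < dfsnum[node] < next_nondesc[ancestor]
```

With these two numbers stored per cluster, fetching a whole subtree becomes a single range condition on dfsnum, and the subtree size is the difference between the two numbers.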
Extracting annotation from external sources:
Notes:
• External databases are downloaded from the Internet in varying file formats (all sorts of strange ASCII formats, XML, …)
• Generally speaking, the more “biological” the database is, the more peculiar its format
• Most biological databases are indexed uniquely, but not all with an integer
• Being dynamic and developing, the external databases tend to change their format every now and then!
→ Perl parsing scripts were written for each external source…
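To give a feel for what such a parsing script does, here is a minimal sketch in Python (the original scripts were written in Perl and are not shown in the slides). It assumes a Swiss-Prot-style flat file, where each line starts with a two-letter line code and each entry ends with a `//` line; the function name is hypothetical and sequence lines are skipped for brevity.

```python
from typing import Dict, Iterable, Iterator, List


def parse_flat_file(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per entry, mapping each two-letter line code to its lines."""
    entry: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):  # entry terminator
            if entry:
                yield entry
            entry = {}
            continue
        if len(line) < 2 or not line[:2].strip():
            continue  # blank lines and sequence data (blank line code) are skipped here
        code, value = line[:2], line[5:].strip()
        entry.setdefault(code, []).append(value)
    if entry:  # tolerate a file that lacks the final terminator
        yield entry
```

Because the parser is a generator over lines, it can be fed an open file object directly and never holds more than one entry in memory.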
Building the MOTIFS + PROTEINMOTIF tables:
Motif sources:
• InterPro (PRINTS, Pfam, PROSITE, ProDom, SMART, TIGRFAMs, PIR SuperFamily, SUPERFAMILY)
• Swiss-Prot (FEATUREs)
• TrEMBL (FEATUREs)
Each motif record comes with the following attributes:
• ID in external system
• Description
• Starting AA position
• Ending AA position
Building the KEYWORDS + PROTEINKEYWORD tables:
Keyword sources: Swiss-Prot, TrEMBL, Enzyme, PDB, SCOP, GO, NCBI Taxonomy, InterPro
• Biological protein properties are extracted from each database,
and are stored in suitable tables (proteinec, proteinpdb, pdbscop,
proteinspecies, ...)
• Keywords are generated for each protein from its associations to
these biological properties
• These keywords are also grouped into keyword-types
• Statistics (quantity/frequency) are calculated for each keyword and
keyword-type, according to the proteinkeyword table
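The quantity/frequency statistics can be illustrated with a small sketch over (protein, keyword) rows. It assumes that "quantity" means the number of distinct proteins carrying a keyword (or any keyword of a type) and that "frequency" is that count divided by the number of annotated proteins; the slides do not define either term, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def keyword_statistics(protein_keyword: Iterable[Tuple[str, str]],
                       keyword_type: Dict[str, str]):
    """Return (keyword quantity, keyword frequency, type quantity, type frequency)."""
    pairs = set(protein_keyword)                      # ignore duplicate rows
    total = len({protein for protein, _ in pairs})    # annotated proteins
    kw_quantity = Counter(keyword for _, keyword in pairs)
    # A protein counts once per type, however many keywords of that type it has.
    type_pairs = {(protein, keyword_type[keyword]) for protein, keyword in pairs}
    type_quantity = Counter(kw_type for _, kw_type in type_pairs)
    kw_frequency = {k: n / total for k, n in kw_quantity.items()}
    type_frequency = {t: n / total for t, n in type_quantity.items()}
    return kw_quantity, kw_frequency, type_quantity, type_frequency
```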
Combining annotations with the clustering
• In the process of combining the annotations with the clustering, we “forget” where each keyword came from.
• Statistics are calculated for each keyword in the clustering, and for selected cluster-keyword pairs (TP, FN, specificity, sensitivity, significance measures).
• Several tables are maintained for this stage, some of which are huge: clusterings, clusteringprotein, clusteringkeyword, clusteringkeytype, clusters, clusterkeyword
Space/time limitations force us to be selective with regard to:
1. Which keyword-types enter this phase in the first place
2. Which cluster-keyword pairs are analyzed
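For a single cluster-keyword pair, the counts and rates can be sketched as follows. The sketch uses the textbook definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)), which is an assumption: the slides do not spell out their formulas, and the significance measures are left out. The function name is hypothetical.

```python
from typing import Dict, Set


def cluster_keyword_stats(cluster: Set[str], with_keyword: Set[str],
                          all_proteins: Set[str]) -> Dict[str, float]:
    """Confusion counts for 'being in the cluster' as a predictor of the keyword.

    Both `cluster` and `with_keyword` are assumed to be subsets of `all_proteins`.
    """
    tp = len(cluster & with_keyword)    # in the cluster, has the keyword
    fp = len(cluster - with_keyword)    # in the cluster, lacks the keyword
    fn = len(with_keyword - cluster)    # has the keyword, outside the cluster
    tn = len(all_proteins) - tp - fp - fn
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "sensitivity": sensitivity, "specificity": specificity}
```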
Release management
• A RELEASE is a set of clustering trees and protein / motif / keyword sources, each of a certain version.
• Some of the releases were built explicitly for the web users. A default clustering of the most up-to-date release is exposed to the standard user. Advanced users can switch between the various releases and clustering trees.
• In order to allow multiple releases, we keep more than one version of each of the external (imported) databases + clustering trees.
• This is also important for research work (which can also be comparative).
• In order to efficiently manage multiple versions (of external databases), each record contains a special number field which represents a bits-vector.
• The bit in position i is turned ON if (and only if) the record is part of version i of the corresponding source.
• This serves as a CONDENSATION of the data in the tables, and was introduced due to our space problems.
• For example, if a protein-sequence record appears (exactly the same) in versions 2 and 3 of Swiss-Prot, the bits-vector integer is equal to 6 (00000110 in binary).
• This bits-vector field is unnecessary for the clustering tables, since each clustering is associated with a release (+versions) and the tuples are (generally) non-repetitive.
• This implies having NO CONDENSATION in the cluster-related tables, and hence they consume A LOT of disk space (especially the cluster-keyword table).
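The bits-vector encoding can be illustrated in a few lines. The sketch assumes versions are numbered from 1 and that version i maps to the i-th bit counted from the least significant end, which is the reading consistent with the slide's example (versions 2 and 3 give 6). The function names are hypothetical.

```python
from typing import Iterable, List


def versions_to_bits(versions: Iterable[int]) -> int:
    """Pack a set of 1-based version numbers into one integer."""
    bits = 0
    for version in versions:
        bits |= 1 << (version - 1)
    return bits


def bits_to_versions(bits: int) -> List[int]:
    """List the 1-based version numbers whose bit is ON."""
    return [i + 1 for i in range(bits.bit_length()) if (bits >> i) & 1]


def in_version(bits: int, version: int) -> bool:
    """True if the record belongs to the given version of its source."""
    return bool((bits >> (version - 1)) & 1)
```

For instance, `versions_to_bits([2, 3])` gives 6, and `format(6, "08b")` renders it as 00000110, the value quoted on the slide. One record with a single integer field thus replaces one copy of the record per version.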
Problems…
ProtoNet’s database is a HUGE setup, which seems to be beyond our technical capabilities.
We are limited in:
1. Manpower: We badly need a steady, experienced team of database and system developers, dedicated solely to this project (rather than preoccupied research students). This is the only way this project can fulfill its objective: to be an up-to-date homepage for Proteomics!
2. Disk space
3. CPU power: both for serving outside users and for quicker release upgrades.
4. A robust, trustworthy and efficient database engine (PostgreSQL seems to fail on all three categories!)
Thanks!
Made by Uri Inbar and Hillel Fleischer