* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Overview of Rule Curation
Protein design wikipedia , lookup
Structural alignment wikipedia , lookup
Bimolecular fluorescence complementation wikipedia , lookup
Protein folding wikipedia , lookup
Circular dichroism wikipedia , lookup
Protein moonlighting wikipedia , lookup
Protein purification wikipedia , lookup
Homology modeling wikipedia , lookup
Protein domain wikipedia , lookup
Western blot wikipedia , lookup
Protein structure prediction wikipedia , lookup
Protein mass spectrometry wikipedia , lookup
List of types of proteins wikipedia , lookup
Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup
1 PIR UniProtKB Curation: Following the Rules Washington, DC March 2013 Darren A. Natale, Ph.D. Protein Science Team Lead, PIR Research Assistant Professor, GUMC 2 Ultimate Objective: Provide as many proteins as possible with consistent annotation 1) Recognize sequence similarities among a set of proteins (BLAST) 2) Annotate individual proteins based on literature 3) Make annotation as consistent as possible 4) Recruit additional members 5) Provide annotation to new members All curation activities are in pursuit of one goal – to create annotation rules 3 Our Approach: Comprehensive & Efficient Comprehensive Each curation activity is integrated into a pipeline whose goal is rule production Efficient Minimize waste: use “filters” based on known constraints to prevent dead-end or duplicate curation Minimize effort: automate where possible and decrease work when manual effort is required Maximize impact: gear effort toward more “bang for the buck” (product for the pound?) 4 Protein Curation Systems PIRSF – classification based on evolutionary relationships of homeomorphic proteins Closest SIB equivalent: HAMAP families PIRNR – family-based “Name Rules” that define the conditions under which specific DE, EC and and other annotation can be propagated to each protein Closest SIB equivalent: HAMAP Rules PIRSR – family-based “Site Rules” that define conditions under which site-specific features can be propagated to each protein Closest SIB equivalent: ProRules 5 PIRSF Classification System PIRSF: A hierarchical structure from superfamilies to subfamilies Reflects evolutionary relationships of full-length proteins Definitions: Homeomorphic Family: Basic Unit Homologous: Common ancestry, inferred by sequence similarity Homeomorphic: Full-length similarity & common domain architecture Hierarchical Structure: Flexible number of levels with varying degrees of sequence conservation (kinase -> sugar kinase -> hexokinase) Advantages: Annotate both general biochemical and specific biological functions Accurate propagation of annotation and development of standardized protein nomenclature and ontology 6 PIRSF Purpose & Pipeline Provide a basis for rule construction and application Start with computer-generated families Curate cluster Curate annotation Curate membership (principle tools: BLAST results, iterative BlastClust, on-the-fly HMM) Curate domain architecture Select seeds Curate name and some references Optional: write abstract indicating function, structure, etc. Create HMM & verify performance FATE: Send all information (HMM, membership, annotation) to EBI for integration into InterPro. 7 PIRNR Purpose & Pipeline Apply annotation to uncharacterized proteins; can account for functional variations Start with fully-curated family Determine which annotation is appropriate to propagate to every member that matches rule criteria Define other match criteria necessary to apply rule to specific members, such as: Presence of residues necessary for enzymatic activity Presence of domains necessary for function Certain activities relevant only to one part of the taxonomic tree Curate recommended protein name (adhering to UniProt guidelines), synonyms, EC numbers, comment lines Monitor such variables to ensure accurate propagation FATE: After review, the rule (match/exclusion conditions and propagation fields) is sent to EBI for inclusion into the UniRule tool. 8 PIRSR Purpose & Pipeline Mechanism for accurate propagation of feature information (active sites, binding sites, or modified amino acids) based on 3-D structure Curate membership/seeds for family with a solved-structure member and known functional sites. Edit seed-to-structure MSA to define and retain conserved regions covering pertinent residues Build Site HMM from concatenated conserved regions Define features for annotation using controlled vocabulary with evidence attribution Match Rule Conditions Membership Check (PIRSF HMM threshold) Conserved Region Check (site HMM threshold) Site Residue Check (all position-specific residues in HMMAlign) Propagation Feature annotation propagated only to members of the selected family that pass the three conditions *exactly* FATE: Apply rule to PIRSF members, send annotation (FT, KW, CC) to EBI/SIB for inclusion in entry. 9 Ultimate Objective: Provide as many proteins as possible with consistent annotation Families 1) Recognize sequence similarities among a set of proteins (BLAST) Swiss-Prot submission 2) Annotate individual proteins based on literature 3) Make annotation as consistent as possible Subfamilies SP revision -------------------------- InterPro submission 4) Recruit additional members UniRules 5) Provide annotation to new members All curation activities are in pursuit of one goal – to create annotation rules 10 Build Filters Constraints: Annotation within Swiss-Prot entries used as basis for rule information must be consistent Annotation within entries used as basis for rule information must be literature-based (name rules) or structure-based with ligand bound (site rules) Families must be homeomorphic Desired Optimizations: Rules must have little-to-no overlap with existing rules Maximize impact on entries (both numbers and content) 11 Optimization points Minimization of Effort Filter Maximization of Impact Filter For each pre-computed Only make families using Representative Proteomes (applied pre-cluster) candidate (CC), only that will “touch” a reasonable number of entries Only considercluster candidate clusters (applied pre-cluster the RP entries) are given to the For curator* each Filters CC, calculate Dead-End potential members from all from query+hits when BLAST hits are not touched by Only consider candidate clusters of UniProtKB remove any rule system (and applied pre-cluster) those that are too small* Only consider clusters that are homeomorphic (applied pre-cluster) Remove CCcandidate if any potential Name Rule:CC Only consider candidate clusters where at least one potential member has members are Remove ifalready potential KB literature-based functional annotation (applied pre-cluster & pre-rule) annotated bythe another members or RP rule Site Rule: Only consider clusters Remove CC if potential RP where at least one potential member has a ligandmembers show too much bound structure (applied members do not mappre-cluster to a ) length deviation* Only make rules appears Remove ifwhen potential RP there can be consistent and meaningful annotation PMID thatCC does notitalso (applied for Swiss-Prot members domembers not map to a pre-rule; future will be pre-cluster also) map to many other entries* PDB with bound ligand*