Download Poster - Protein Information Resource
PIRSF protein family classification system

Anastasia Nikolskaya, Sehee Chung, Hongzhan Huang, Raja Mazumder, Darren Natale, Lai-Su Yeh, Cathy Wu
Protein Information Resource, Georgetown University Medical Center
http://pir.georgetown.edu/

Family-driven Protein Annotation

PIRSF Classification System

PIRSF:
• A network structure from superfamilies to subfamilies
• Reflects evolutionary relationships of full-length proteins
• Basic unit = Homeomorphic Family

Definitions:
• Homologous (common ancestry): inferred by sequence similarity
• Homeomorphic: full-length sequence similarity and common domain architecture
• Hierarchical structure: flexible number of levels with varying degrees of sequence conservation
• Network structure: interconnection between different hierarchies

Advantages:
• Annotation of both generic biochemical and specific biological functions
• Accurate propagation of annotation and development of standardized protein nomenclature and ontology

Level properties (figure labels):
• Exactly one level
• Full-length sequence similarity and common domain architecture
• One or more common domains
• 0 or more levels
• Functional specialization

Example hierarchies (figure labels, grouped by Pfam domain):

PF02735: Ku70/Ku80 beta-barrel domain
• PIRSF800001: Ku70/80 autoantigen
• PIRSF003033: Ku70 autoantigen
• PIRSF016570: Ku80 autoantigen
• PIRSF006493: Ku, prokaryotic type

PF00219: Insulin-like growth factor binding protein (IGFBP)
• PIRSF001969: IGFBP
• IGFBP subfamilies: PIRSF500001: IGFBP-1 … PIRSF500006: IGFBP-6
• PIRSF018239: IGFBP-related protein, MAC25 type

PF01817: Chorismate mutase (CM)
• PIRSF017318: CM of AroQ class, eukaryotic
• PIRSF001501: CM of AroQ class, prokaryotic
• PIRSF001500: Bifunctional CM/PDT (P-protein)
• PIRSF001499: Bifunctional CM/PDH (T-protein)

Automatic clustering
Curation workflow (flowchart labels):
• Preliminary Curation (4,500 PIRSFs): Membership, Signature Domains
• Full Curation (2,300 PIRSFs): Family Name, Description, Bibliography, PIRSF Name Rules
• Name, refs, abstract, domain arch.
• Map domains on Families
• Computer-assisted Manual Curation
• Add/remove members
• Final Homeomorphic Families
• Protein name rule/site rule
• Build and test HMMs

Hierarchy
• Subfamilies increase specificity (kinase -> sugar kinase -> hexokinase)
• Propagate other properties that describe function: EC, GO terms, misnomer info, pathway
• Example (Pfam domain PF02153): PIRSF006786: PDH, feedback inhibition-insensitive; PIRSF005547: PDH, feedback inhibition-sensitive

Functional variations to monitor to ensure accurate propagation:
• Lack of active site residues necessary for enzymatic activity
• Certain activities relevant only to one part of the taxonomic tree
• Evolutionarily-related proteins whose biochemical activities are known to differ

Name Rules
• Reflect the function when possible
• Indicate the maximum specificity that still describes the entire group
• Standardized format
• Name tags: validated, tentative, predicted, functionally heterogeneous
• Names adhere to Swiss-Prot conventions (though we may make suggestions for improvement)
• Define conditions under which names propagate to individual proteins
• Enable further specificity based on taxonomy or motifs

Name Rule types:
• "Zero" Rule: default rule (the only condition is membership in the appropriate family); information is suitable for every member
• "Higher-Order" Rule: has requirements in addition to membership; a family can have multiple rules that may or may not have mutually exclusive conditions

Example Name Rules

Rule ID: PIRNR000881-1
Rule Conditions: PIRSF000881 member and vertebrates
Propagated Information: Name: S-acyl fatty acid synthase thioesterase; EC: oleoyl-[acyl-carrier-protein] hydrolase (EC 3.1.2.14)

Rule ID: PIRNR000881-2
Rule Conditions: PIRSF000881 member and not vertebrates
Propagated Information: Name: Type II thioesterase; EC: thiolester hydrolases (EC 3.1.2.-)

Rule ID: PIRNR025624-0
Rule Conditions: PIRSF025624 member
Propagated Information: Name: ACT domain protein; Misnomer: chorismate mutase

Note the lack of a zero rule for PIRSF000881.

Name Rule in Action at UniProt
• Automatic annotations (AA) are in a separate field
• AA only visible from www.ebi.uniprot.org
• Future: automatic name annotations will become the DE line if the DE line will improve as a result
• Future: AA will be visible from all consortium-hosted web sites

Family report callouts (figure labels):
• Curated family name
• Description of family
• Sequence analysis tools
• Taxonomic distribution of PIRSF can be used to infer evolutionary history of the proteins in the PIRSF
• Phylogenetic tree and alignment view allows further sequence analysis
• PIRSF in DAG View
• Mapping to other protein classification databases
• Integrated value-added information from other databases
• Defined rules for annotation
• Name rules and site rules allow precise annotation of UniProt proteins within the PIRSF

Site Rules
• Define conditions under which features propagate to individual proteins

PIR Site Rules
• Position-specific site features: active sites, binding sites, modified amino acids
• Current requirements: at least one PDB structure; experimental data on functional sites: CATRES database (Thornton)

Name Rule Propagation Pipeline
• Affiliation of sequence: Homeomorphic Family or Subfamily (whichever PIRSF is the lowest possible node)
• Name rule exists? No: nothing to propagate. Yes: continue.
• Protein fits criteria for any higher-order rule? Yes: assign name from Name Rule 1 (or 2 etc). No: continue.
• PIRSF has zero rule? Yes: assign name from Name Rule 0. No: nothing to propagate.
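The Name Rule Propagation Pipeline can be sketched in a few lines of Python. This is only an illustrative sketch, not the PIR implementation: the `NameRule` and `Family` classes, the `propagate_name` function, and the `is_vertebrate` protein field are all hypothetical names introduced here. It also assumes that the first matching higher-order rule wins, which the poster does not state (it notes that rule conditions need not be mutually exclusive).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class NameRule:
    """Hypothetical name rule; condition=None marks a zero (default) rule."""
    rule_id: str
    name: str
    condition: Optional[Callable[[dict], bool]] = None


@dataclass
class Family:
    """Hypothetical container for the lowest PIRSF node a protein belongs to."""
    pirsf_id: str
    rules: List[NameRule] = field(default_factory=list)


def propagate_name(protein: dict, family: Family) -> Optional[str]:
    """Return the propagated name, or None when there is nothing to propagate."""
    # Step 1: does any name rule exist for this family?
    if not family.rules:
        return None
    # Step 2: try higher-order rules (assumption: first match wins).
    for rule in family.rules:
        if rule.condition is not None and rule.condition(protein):
            return rule.name
    # Step 3: fall back to the zero rule, if the family has one.
    for rule in family.rules:
        if rule.condition is None:
            return rule.name
    return None


# Rules taken from the poster's examples; the protein fields are made up.
pirsf000881 = Family("PIRSF000881", [
    NameRule("PIRNR000881-1", "S-acyl fatty acid synthase thioesterase",
             lambda p: p.get("is_vertebrate") is True),
    NameRule("PIRNR000881-2", "Type II thioesterase",
             lambda p: p.get("is_vertebrate") is False),
])
pirsf025624 = Family("PIRSF025624", [
    NameRule("PIRNR025624-0", "ACT domain protein"),
])
```

With these definitions, a vertebrate member of PIRSF000881 receives the name from rule 1, a non-vertebrate member receives the name from rule 2, and a member whose taxonomy is unknown receives nothing, because PIRSF000881 has no zero rule.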
PIRSF Protein Classification System
A network structure from superfamilies to subfamilies to reflect evolutionary relationships of full-length proteins.
Objective: Optimize for protein annotation

Definitions:
• PIRSF Homeomorphic Family
• PIRSF Homeomorphic Subfamily

Hierarchy (figure labels):
• PIRSF001499: Bifunctional CM/PDH (T-protein)
• Prephenate dehydrogenase (PDH)
• PIRSF Classification Name

Curation flowchart labels:
• UniProtKB proteins
• New proteins
• Automatic Procedure
• Automatic placement
• Unassigned proteins
• Orphans
• Computer Generated (Uncurated) Clusters (35,000 PIRSFs)
• Preliminary Homeomorphic Families
• Merge/split clusters
• Curated Homeomorphic Families
• Create hierarchies (superfamilies/subfamilies)
• Account for functional variations within one PIRSF

Other panel titles: Family-Driven Protein Annotation; PIR Name Rules; PIRSF Report

Abstract

The PIRSF protein classification system reflects evolutionary relationships of full-length proteins and domains. PIRSF families are extensively curated using a bioinformatics infrastructure implemented in a J2EE framework. Expert manual curation includes membership, annotation of specific biological functions, biochemical activities, and sequence features. Novel functional predictions for uncharacterized "hypothetical" proteins and protein families are routinely made in the annotation process. Fully curated families and their protein members provide a basis for rich and accurate functional annotation of protein sequences in the UniProt Knowledgebase.
The PIRSF database is accessible at http://pir.georgetown.edu/pirsf/

PIRSF Superfamily
• 0 or more levels

Site rule definition steps:
• Select template structure
• Align PIRSF seed members with structural template
• Edit MSA to retain conserved regions covering all site residues
• Build Site HMM from concatenated conserved regions

System Implementation

PCS Architecture (diagram labels):
• Clients: Web Browser; Applications (JavaWebStart)
• Middle Tier: HTTPD; Servlet [Controller]; JSP, HTML, XML (XSLT) [Presentation]; Domain Objects [Model]; DAO Manager; SQL DAO; FLAT DAO; XML DAO; JDBC; FlatFile Adapter; XML Adapter
• Data Source: DB2; MySql; Oracle; Legacy Databases; XML Repositories

Other panel titles: Graphical Analysis Tool Integration; PCS Web Interface: Shopping Cart View; PIRSF DAG Editor/Viewer; Collaborative Curation Platform

Acknowledgements

UniProt is supported by the National Institutes of Health, grant # 1 U01 HG02712-01

Creation and Curation of PIRSFs
• Curator-guided clustering
• Single-linkage clustering using BLAST
• Retrieve all proteins sharing a common domain
• Iterative BlastClust (fixed length coverage)
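To illustrate what "single-linkage clustering using BLAST" means in practice, here is a minimal Python sketch built on a union-find structure. It is not the PIR pipeline: the function name, the `(query, subject, e-value)` hit format, and the `1e-5` cutoff are assumptions chosen for the example, and it assumes every protein named in a hit also appears in the input list. Under single linkage, two proteins end up in the same cluster whenever a chain of sufficiently significant hits connects them, even if they do not hit each other directly.

```python
from typing import Dict, Iterable, List, Tuple


def single_linkage_clusters(
    proteins: Iterable[str],
    hits: Iterable[Tuple[str, str, float]],
    max_evalue: float = 1e-5,  # hypothetical cutoff, not taken from the poster
) -> List[List[str]]:
    """Group proteins connected by any chain of hits at or below max_evalue."""
    proteins = list(proteins)
    parent: Dict[str, str] = {p: p for p in proteins}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        if evalue <= max_evalue:
            root_q, root_s = find(query), find(subject)
            if root_q != root_s:
                parent[root_s] = root_q  # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)
    # Sort members and clusters so the output is deterministic.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Made-up identifiers and e-values, for illustration only.
example = single_linkage_clusters(
    ["A", "B", "C", "D", "E"],
    [("A", "B", 1e-30), ("B", "C", 1e-20), ("D", "E", 0.5)],
)
```

In this example A and C are clustered together through B, while D and E stay separate because their only hit is weaker than the cutoff. This chaining behavior is why single-linkage output is usually followed by curator review, such as the merge/split step named on the poster.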