Download Overview of Rule Curation

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Protein design wikipedia , lookup

Structural alignment wikipedia , lookup

Bimolecular fluorescence complementation wikipedia , lookup

Protein folding wikipedia , lookup

Circular dichroism wikipedia , lookup

Protein wikipedia , lookup

Protein moonlighting wikipedia , lookup

Protein purification wikipedia , lookup

Homology modeling wikipedia , lookup

Protein domain wikipedia , lookup

Proteomics wikipedia , lookup

Western blot wikipedia , lookup

Protein structure prediction wikipedia , lookup

Protein mass spectrometry wikipedia , lookup

List of types of proteins wikipedia , lookup

Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup

Cyclol wikipedia , lookup

Protein–protein interaction wikipedia , lookup

Intrinsically disordered proteins wikipedia , lookup

Transcript
1
PIR UniProtKB Curation:
Following the Rules
Washington, DC
March 2013
Darren A. Natale, Ph.D.
Protein Science Team Lead, PIR
Research Assistant Professor, GUMC
2
Ultimate Objective:
Provide as many proteins as possible with consistent annotation
1) Recognize sequence similarities
among a set of proteins (BLAST)
2) Annotate individual proteins based on
literature
3) Make annotation as consistent as
possible
4) Recruit additional members
5) Provide annotation to new members
All curation activities are in pursuit of one goal – to create annotation rules
3
Our Approach: Comprehensive & Efficient

Comprehensive


Each curation activity is integrated into a pipeline whose goal is
rule production
Efficient



Minimize waste: use “filters” based on known constraints to
prevent dead-end or duplicate curation
Minimize effort: automate where possible and decrease work
when manual effort is required
Maximize impact: gear effort toward more “bang for the buck”
(product for the pound?)
4
Protein Curation Systems

PIRSF – classification based on evolutionary relationships of
homeomorphic proteins
Closest SIB equivalent: HAMAP families

PIRNR – family-based “Name Rules” that define the conditions
under which specific DE, EC and and other annotation can be
propagated to each protein
Closest SIB equivalent: HAMAP Rules

PIRSR – family-based “Site Rules” that define conditions under
which site-specific features can be propagated to each protein
Closest SIB equivalent: ProRules
5
PIRSF Classification System



PIRSF:

A hierarchical structure from superfamilies to subfamilies

Reflects evolutionary relationships of full-length proteins
Definitions:

Homeomorphic Family: Basic Unit

Homologous: Common ancestry, inferred by sequence similarity

Homeomorphic: Full-length similarity & common domain architecture

Hierarchical Structure: Flexible number of levels with varying degrees
of sequence conservation (kinase -> sugar kinase -> hexokinase)
Advantages:

Annotate both general biochemical and specific biological functions

Accurate propagation of annotation and development of standardized
protein nomenclature and ontology
6
PIRSF Purpose & Pipeline
Provide a basis for rule construction and application


Start with computer-generated families
Curate cluster




Curate annotation



Curate membership (principle tools: BLAST results, iterative BlastClust, on-the-fly HMM)
Curate domain architecture
Select seeds
Curate name and some references
Optional: write abstract indicating function, structure, etc.
Create HMM & verify performance
FATE: Send all information (HMM, membership, annotation) to EBI for integration
into InterPro.
7
PIRNR Purpose & Pipeline
Apply annotation to uncharacterized proteins; can account for functional variations



Start with fully-curated family
Determine which annotation is appropriate to propagate to every
member that matches rule criteria
Define other match criteria necessary to apply rule to specific members,
such as:




Presence of residues necessary for enzymatic activity
Presence of domains necessary for function
Certain activities relevant only to one part of the taxonomic tree
Curate recommended protein name (adhering to UniProt guidelines), synonyms,
EC numbers, comment lines
Monitor such variables to ensure accurate propagation
FATE: After review, the rule (match/exclusion conditions and propagation fields) is
sent to EBI for inclusion into the UniRule tool.
8
PIRSR Purpose & Pipeline
Mechanism for accurate propagation of feature information (active sites, binding
sites, or modified amino acids) based on 3-D structure





Curate membership/seeds for family with a solved-structure member and known functional
sites.
Edit seed-to-structure MSA to define and retain conserved regions covering pertinent
residues
Build Site HMM from concatenated conserved regions
Define features for annotation using controlled vocabulary with evidence attribution
Match Rule Conditions




Membership Check (PIRSF HMM threshold)
Conserved Region Check (site HMM threshold)
Site Residue Check (all position-specific residues in HMMAlign)
Propagation

Feature annotation propagated only to members of the selected family that pass the three
conditions *exactly*
FATE: Apply rule to PIRSF members, send annotation (FT, KW, CC) to EBI/SIB
for inclusion in entry.
9
Ultimate Objective:
Provide as many proteins as possible with consistent annotation
Families
1) Recognize sequence similarities
among a set of proteins (BLAST)
Swiss-Prot
submission
2) Annotate individual proteins based on
literature
3) Make annotation as consistent as
possible
Subfamilies
SP revision
--------------------------
InterPro
submission
4) Recruit additional members
UniRules
5) Provide annotation to new members
All curation activities are in pursuit of one goal – to create annotation rules
10
Build Filters

Constraints:




Annotation within Swiss-Prot entries used as basis for rule
information must be consistent
Annotation within entries used as basis for rule information must
be literature-based (name rules) or structure-based with ligand
bound (site rules)
Families must be homeomorphic
Desired Optimizations:


Rules must have little-to-no overlap with existing rules
Maximize impact on entries (both numbers and content)
11
Optimization points

Minimization of Effort Filter


Maximization
of Impact Filter
For each pre-computed


Only make families using Representative Proteomes (applied pre-cluster)
candidate
(CC),
only that will “touch” a reasonable number of entries
Only
considercluster
candidate
clusters
(applied
pre-cluster
the RP
entries) are given to
the
For curator*
each Filters
CC, calculate
Dead-End





potential
members
from
all from query+hits when BLAST hits are not touched by
Only
consider
candidate
clusters
of UniProtKB
remove
any
rule system (and
applied
pre-cluster)
those
that are
too small*
Only
consider
clusters that are homeomorphic (applied pre-cluster)
Remove
CCcandidate
if any potential
Name
Rule:CC
Only
consider candidate
clusters where at least one potential member has
members
are
Remove
ifalready
potential
KB
literature-based
functional
annotation (applied pre-cluster & pre-rule)
annotated
bythe
another
members or
RP rule
Site
Rule: Only
consider
clusters
Remove
CC
if potential
RP where at least one potential member has a ligandmembers
show
too much
bound
structure
(applied
members
do not
mappre-cluster
to a )
length deviation*
Only
make
rules
appears
Remove
ifwhen
potential
RP there can be consistent and meaningful annotation
PMID
thatCC
does
notitalso
(applied
for
Swiss-Prot
members
domembers
not
map
to a pre-rule; future will be pre-cluster also)
map
to many
other
entries*
PDB with bound ligand*