InterPro
Sandra Orchard
EBI is an Outstation of the European Molecular Biology Laboratory.
Why do we need predictive annotation tools?
[Figure: number of sequences in UniProtKB and UniProtKB/Swiss-Prot by date, 5-Jan-04 to 5-Jan-10; y-axis from 0 to 14,000,000]
• Given a set of uncharacterised sequences, we usually want to know:
– what are these proteins; to what family do they belong?
– what is their function; how can we explain this in structural terms?
1. Pairwise alignment approaches (e.g. BLAST)
• Good at recognising similarity between closely related sequences
• Perform less well at detecting divergent homologues
2. The protein signature approach
• Alternatively, we can model the conservation of amino acids at specific positions within a multiple sequence alignment, seeking ‘patterns’ across closely related proteins
• We can then use these models to infer relationships with previously characterised sequences
• This is the approach taken by protein signature databases
What are protein signatures?
[Diagram labels: protein family/domain, multiple sequence alignment, build model, search UniProt, significant match, mature model, protein analysis]
Example aligned sequences:
ITWKGPVCGLDGKTYRNECALL
AVPRSPVCGSDDVTYANECELK
Diagnostic approaches (sequence-based)
• Single motif methods: regex patterns (PROSITE)
• Multiple motif methods: identity matrices (PRINTS)
• Full domain alignment methods: profiles (Profile Library), HMMs (Pfam)
Patterns
[Diagram: sequence alignment → define pattern (motif) → extract pattern sequences → build regular expression → pattern signature (PS00000)]
Example regular expression: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
Patterns
Advantages
• Some amino acids can be forbidden at specific positions, which can help to distinguish closely related subfamilies
• Short motif handling: a pattern with very little variability and with forbidden positions can produce significant matches, e.g. conotoxins, very short toxins with few conserved cysteines:
C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C
Drawbacks
• High false positive/false negative rate
Patterns are mostly directed against functional residues: active sites, PTM sites, disulfide bridges, binding sites
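The PROSITE-style patterns quoted above map quite directly onto ordinary regular expressions. The sketch below is one possible translation into Python's `re` syntax, not PROSITE's own scanning code. The function name `prosite_to_regex` and the test sequence are made up for illustration, and only the syntax elements that appear in these slides (plus the `<` and `>` anchors) are handled.

```python
import re

# One pattern element: optional '<' anchor, a residue / x / [allowed] / {forbidden}
# group, an optional (n) or (n,m) repeat, and an optional '>' anchor.
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?(>?)"
)

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    # Elements are separated by '-'; stray spaces and a trailing '.' are ignored.
    for element in pattern.replace(" ", "").rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, repeat, end = m.groups()
        if core == "x":                      # x = any amino acid
            core = "."
        elif core.startswith("{"):           # {P} = any amino acid except P
            core = "[^" + core[1:-1] + "]"
        piece = core
        if repeat:                           # x(2) -> .{2}, x(1,3) -> .{1,3}
            piece += "{" + repeat + "}"
        parts.append(("^" if start else "") + piece + ("$" if end else ""))
    return "".join(parts)

regex = prosite_to_regex("C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C")
print(regex)  # CC[^P].{2}C[STDNEKPI].{3}[LIVMFS].{3}C

# Hypothetical sequences: the second has the forbidden P at the third position.
print(bool(re.search(regex, "CCAAACSAAALAAAC")))  # True
print(bool(re.search(regex, "CCPAACSAAALAAAC")))  # False
```

The forbidden position `{P}` is what rejects the second sequence, which is the mechanism the slide credits with separating closely related subfamilies.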
Fingerprints
[Diagram: sequence alignment → define motifs (Motif 1, Motif 2, Motif 3) → extract motif sequences → weight matrices → fingerprint signature (PR00000); the motifs must occur in the correct order (1, 2, 3) and with the correct spacing]
The significance of motif context
• Identify small conserved regions in proteins
• Several motifs characterise a family
• Offer improved diagnostic reliability over single motifs by virtue of the biological context provided by motif neighbours (order and interval)
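The idea of motif context can be sketched in a few lines: score each motif with a simple residue-count matrix, then accept a sequence only if the best hits occur in the expected order and within an allowed interval. This is a simplified illustration, not the PRINTS scoring scheme. The motif sequences, the score threshold and the gap limit are assumptions chosen for the example (the motifs are cut from the two example sequences shown earlier in these slides).

```python
from collections import Counter

def frequency_matrix(motif_seqs):
    """One Counter of residue counts per motif column (ungapped motif)."""
    length = len(motif_seqs[0])
    if any(len(s) != length for s in motif_seqs):
        raise ValueError("motif sequences must have equal length")
    return [Counter(s[i] for s in motif_seqs) for i in range(length)]

def best_hit(matrix, sequence):
    """Return (score, start) of the best-scoring window in the sequence."""
    width = len(matrix)
    best = (-1, -1)
    for start in range(len(sequence) - width + 1):
        score = sum(col[sequence[start + i]] for i, col in enumerate(matrix))
        if score > best[0]:
            best = (score, start)
    return best

def matches_fingerprint(matrices, max_gaps, sequence, min_score):
    """True if every motif scores well, in order, with acceptable spacing."""
    hits = [best_hit(m, sequence) for m in matrices]
    if any(score < min_score for score, _ in hits):
        return False
    for (_, a), (_, b), m, gap in zip(hits, hits[1:], matrices, max_gaps):
        interval = b - (a + len(m))      # residues between neighbouring motifs
        if interval < 0 or interval > gap:
            return False                 # wrong order, overlap, or too far apart
    return True

# Illustrative two-motif fingerprint (assumed motifs, threshold and gap limit).
fingerprint = [
    frequency_matrix(["PVCGLD", "PVCGSD"]),
    frequency_matrix(["TYRNEC", "TYANEC"]),
]
print(matches_fingerprint(fingerprint, [5], "ITWKGPVCGLDGKTYRNECALL", 9))
print(matches_fingerprint(fingerprint, [5], "TYRNECAAPVCGLD", 9))
```

The second call contains both motifs but in the wrong order, so it is rejected even though each motif scores well on its own.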
Profiles & HMMs
[Diagram: sequence alignment → define coverage (whole protein or entire domain) → use entire alignment for domain or protein → build model → profile or HMM signature; the model accounts for insertions and deletions]
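To make "build model" concrete, the sketch below derives a simple position-specific scoring matrix from an ungapped alignment and scores a sequence against it. It is a deliberately reduced stand-in for a profile. Unlike a real profile or HMM it does not model insertions and deletions, and the uniform background frequency and the pseudocount of 1 are assumptions made for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)          # assumed uniform background
    profile = []
    for i in range(length):
        column = [seq[i] for seq in alignment]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score(profile, sequence):
    """Sum the per-column log-odds scores for an ungapped sequence."""
    return sum(column[aa] for column, aa in zip(profile, sequence))

# The two example sequences from the signature diagram earlier in the slides.
alignment = ["ITWKGPVCGLDGKTYRNECALL", "AVPRSPVCGSDDVTYANECELK"]
profile = build_profile(alignment)

print(round(score(profile, alignment[0]), 2))   # a member of the alignment
print(round(score(profile, "A" * 22), 2))       # an unrelated sequence
```

Columns where both sequences agree (for example the conserved cysteines) contribute the highest scores, which is how conservation at specific positions ends up encoded in the model.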
HMM databases
Sequence-based
• PIR SUPERFAMILY: families/subfamilies reflect the evolutionary relationship
• PANTHER: families/subfamilies model the divergence of specific functions
• TIGRFAM: microbial functional family classification
• Pfam: families & domains based on conserved sequence
• SMART: functional domain annotation
Structure-based
• SUPERFAMILY: models correspond to SCOP domains
• GENE3D: models correspond to CATH domains
Why we created InterPro
By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful diagnostic tool & integrated database
– to simplify & rationalise protein analysis
– to facilitate automatic functional annotation of uncharacterised proteins
– to provide concise information about the signatures and the proteins they match, including consistent names, abstracts (with links to original publications), GO terms and cross-references to other databases
InterPro Entry
• Groups similar signatures together
• Adds extensive annotation
• Links to other databases
• Structural information and viewers
• Hierarchical classification
InterPro hierarchies: Families
FAMILIES can have parent/child relationships with other families
Parent/child relationships are based on:
• Comparison of protein hits
– child should be a subset of parent
– siblings should not have matches in common
• Existing hierarchies in member databases
• Biological knowledge of curators
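The two protein-hit rules above are plain set relations, so they can be checked mechanically. The following is a minimal sketch with hypothetical function names and made-up protein identifiers. It illustrates the rules only and is not InterPro's curation tooling.

```python
def child_is_subset(parent_hits, child_hits):
    """Rule 1: every protein matched by the child is also matched by the parent."""
    return set(child_hits) <= set(parent_hits)

def siblings_are_disjoint(*sibling_hits):
    """Rule 2: no protein is matched by more than one sibling."""
    seen = set()
    for hits in sibling_hits:
        hits = set(hits)
        if seen & hits:
            return False
        seen |= hits
    return True

# Made-up protein identifiers, for illustration only.
parent = {"protA", "protB", "protC", "protD"}
child_1 = {"protA", "protB"}
child_2 = {"protC"}
child_3 = {"protB", "protE"}   # overlaps child_1 and has a hit outside the parent

print(child_is_subset(parent, child_1))         # True
print(child_is_subset(parent, child_3))         # False
print(siblings_are_disjoint(child_1, child_2))  # True
print(siblings_are_disjoint(child_1, child_3))  # False
```

A check like this can only flag candidates. As the slide notes, existing member-database hierarchies and curator knowledge also feed into the final relationships.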
InterPro hierarchies: Domains
DOMAINS can have parent/child relationships with other domains
Domains and Families may be linked through Domain Organisation Hierarchy
InterPro Entry: adds extensive annotation
The Gene Ontology project provides a
controlled vocabulary of terms for
describing gene product characteristics
InterPro Entry: links to other databases
UniProt
KEGG ... Reactome ... IntAct ...
UniProt taxonomy
PANDIT ... MEROPS ... Pfam clans ...
Pubmed
InterPro Entry: structural information and viewers
PDB 3-D Structures
SCOP Structural
domains
CATH Structural
domain classification
Searching InterPro
[Screenshot labels: protein family membership; domain organisation; domains, repeats & sites; GO terms]
InterProScan access
Interactive:
http://www.ebi.ac.uk/Tools/pfa/iprscan/
Webservice (SOAP and REST):
http://www.ebi.ac.uk/Tools/webservices/services/pfa/iprscan_rest
http://www.ebi.ac.uk/Tools/webservices/services/pfa/iprscan_soap
Download:
ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/