Bioinformatics in Brief
This week:
Bioinformatics: what is it, and what is it for?
DB for structures
Structure Classification
Structure-Function link
Proteins, Class 1 & 2, M. Linial, '02-'03
Swiss-Prot
• Established in 1986 and maintained collaboratively by SIB (Swiss
Institute of Bioinformatics) and EBI/EMBL
• Provides high-level annotations, including description of protein
function, structure of protein domains, post-translational
modifications, variants, etc
• Aims to be minimally redundant
• Linked to many other resources; considered the best
November 2002
SWISS-PROT: 116,776 entries
TrEMBL: 680,075 entries
Best annotated DB
Mistakes remain, and algorithms may become a source of incorrect annotations.
In the protein world:
1. Wrong gene finding (exon-intron boundaries)
2. Premature cleavage, giving wrong tails (nucleotide sequencing mistakes)
3. ESTs may be misleading
4. Automatic assignment of features
5. No replacement for manual curators
Making sense of large DBs: how?
• Which database to search? There are many.
• How good are my results? (relevance and reliability)
• I didn't get any results; does that mean there aren't any?
Linking Functional Databases
Essential addition: pathways, ligands
Essential addition: putative TF binding sites (TF-BS)
Grand Plan
Find all the genes
Translate genes to proteins
“Compute” function
“Compute” structure
Goal of structure prediction
• Epstein & Anfinsen, 1961:
sequence uniquely determines structure
• INPUT: sequence
• OUTPUT: 3D structure and function
Prediction of Function
What is function?
This is not a simple term
Function may be:
• a biological process (e.g. serine protease activity)
• a molecular event (e.g. proteolysis of a specific substrate)
• a cellular structure (e.g. membrane; chromatin, etc.)
• relevance to a whole process (e.g. cell cycle)
• relevance to the whole organism (e.g. ovulation)
Pattern Recognition
• Looks for motifs that may have functional relevance (family signatures):
  * Membrane anchoring
  * Catalytic site
  * Nucleotide binding
  * Nuclear localization signal
  * Hormone response element
  * Calcium binding, etc.
• Protein family resources (2nd week)
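A motif search of this kind can be sketched in a few lines of Python. The example below is only an illustration, not something taken from the slides: it translates a simplified PROSITE-style pattern into a regular expression and scans a sequence with it. The pattern used is the ATP/GTP-binding "P-loop" signature `[AG]-x(4)-G-K-[ST]`; the helper names and the toy sequence are made up for the example, and only a small subset of the PROSITE syntax is handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE-style pattern into a Python regex.

    Supported elements: a residue letter, x (any residue), [..] (allowed
    residues), {..} (forbidden residues), optionally followed by (n) or (n,m).
    """
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(
            r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?", element
        )
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # forbidden residues
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

def find_motif(sequence: str, pattern: str):
    """Return (1-based start position, matched text) for each hit."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]

P_LOOP = "[AG]-x(4)-G-K-[ST]"           # nucleotide-binding signature
toy_sequence = "MSTLLAGPPGSGKSTLAKHI"    # hypothetical sequence, illustration only

print(prosite_to_regex(P_LOOP))
print(find_motif(toy_sequence, P_LOOP))
```

A hit only says that the signature is present; whether the site is functional still has to be judged from the rest of the protein.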
Homology
• What is "homology"?
Definition: Two proteins are homologous if they are related by divergence from a common ancestor.
[Diagram: divergent evolution from an ancestor A to the homologous proteins B, C and D]
Analogy
• What is "analogy"?
Definition: Two proteins are "analogous" if they acquired common structural and functional features via convergent evolution from unrelated ancestors.
[Diagram: convergent evolution from the unrelated ancestors A and C to the analogous proteins B and D (similar structure and/or function)]
Serine Proteases (Convergent Evolution)
Trypsin-like and subtilisin-like proteases are analogous proteins; each group has many homologous members.
Trypsin and subtilisin share groups of catalytic residues with almost identical spatial geometries, but they have no other sequence or structural similarities.
Aspartic acid (D) - Histidine (H) - Serine (S)
Human Kallikrein Gene Family
(Divergent Evolution)
15 homologous genes on human
chromosome 19q13.4
Divergence in tissue expression and
substrate specificity
(trypsin-like, of the S1 family; substrates Met|Lys and Arg|Ser in small molecules)
Activate bradykinin
Orthologs
Proteins that usually perform the same
function in different species (e.g. DNA
polymerase; glucose 6-phosphate
dehydrogenase; retinoblastoma gene;
p53, etc.).
Paralogs
Proteins that perform different but related
functions within one organism [usually
formed by gene duplication and divergent
evolution] (e.g. the 15 kallikrein genes).
Evolutionary time
New term: Old Paralog
[Diagram: tree along evolutionary time from a common ancestor, labeled Paralog-1, Paralog-2 and Ortholog, with fish, Xenopus and human as the species]
Why structure?
• Protein structure is more conserved than protein
sequence, and more closely related to function.
Structural information
• Protein Data Bank: maintained by the Research Collaboratory for Structural Bioinformatics (RCSB)
– > 16,500 structures of proteins
– Also contains structures of protein/nucleic acid complexes, nucleic acids, carbohydrates
– 19,200 together (Nov 2002)
• Most structures are determined by X-ray crystallography.
• NMR (15%) and electron microscopy (few).
• Some structures are also theoretically predicted.
Structural information
• From solved structure to classification: what for?
• The structural space - rules, organization??
• Infer structure (modeling)
• Infer function - not trivial
• Protein engineering, drug design…
PDB is growing
Structure Alignment
• Why Structure Alignment?
• Algorithms for Structure Alignment
Why Structure Alignment?
• For homologous proteins (similar ancestry), this provides the "gold standard" for sequence alignment: it elucidates the common ancestry of the proteins.
• For non-homologous proteins, allows us to identify common substructures of interest.
• Allows us to classify proteins into clusters, based on structural similarity.
How do we recognize structural
similarities?
• By eye (Alexei Murzin)
SCOP: gold standard for structure classification
• Algorithmically
Growth of PDB demands automated techniques
for classification and fold detection.
Algorithms for Structure Alignment
• Distance based methods
– DALI (Holm and Sander): Aligning scalar distance plots
– STRUCTAL (Gerstein and Levitt): Dynamic programming using pairwise inter-molecular distances
– SSAP (Orengo and Taylor): Dynamic programming using intra-molecular vector distance
– Others (PRISM, CE…)
• Vector based methods
– VAST (Bryant): Graph theory based secondary structure alignment
– 3dSearch (Singh and Brutlag): Fast secondary structure index lookup
• Both vector and distance based
– LOCK (Singh and Brutlag): Hierarchically uses both secondary structure vectors and atomic distances
Databases of structural
classification
• Considered the gold standard: expert view
• SCOP
– Murzin AG et al. 1995
– Structural classification of protein structures
– Manual assembly by inspection
– All nodes are annotated (e.g. all-α, α/β)
– Structural similarity search using 3dSearch (Singh and Brutlag)
From classification to
distance maps
Dynamic view - too slow
[Bar chart: ratio of SCOP 1.59 to SCOP 1.37 (scale 0 to 3.5) for domains, PDB entries, families (Fam), superfamilies (SF) and folds]
Dynamic view - too slow
[Line chart: number of SCOP entities (families, superfamilies, folds; scale 0 to 2000) against months from the SCOP 1.37 release (0 to 40)]
• CATH
– Orengo et al. 1997
– Class-Architecture-Topology-Homologous superfamily
– Manual classification at Architecture level
– Automated topology classification using the SSAP algorithm
– No structural similarity search
Protein Classifications
CATH
Christine Orengo
(Structure, 1997, 5, 1093-1108)
© Christine Orengo
Protein Classifications
CATH
Class is determined according to the secondary structure
composition
It can be assigned automatically for over 90% of the known
structures
For the remainder, manual inspection is used
Protein Classifications
CATH
Architecture, A
Determined by the orientations of the secondary structures but ignores the
connectivity between the secondary structures.
It is currently assigned manually using a simple description of arrangements
e.g. barrel or 3-layer sandwich.
Procedures are being developed for automating this step.
Protein Classifications
CATH
Topology(=fold)
Structures are grouped into fold families depending on both the overall shape and connectivity of the secondary structures (SSAP algorithm).
Structures which have a SSAP score of 70 and where at least 60% of the larger protein matches the smaller protein are assigned to the same T level or fold family.
Some fold families are very highly populated; these are currently subdivided using a higher cutoff on the SSAP score.
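The T-level rule above can be written as a small predicate. This is a minimal sketch under stated assumptions, not CATH's own code: the function and parameter names are invented, the score of 70 is read as "at least 70", and the 60% condition is read as the number of structurally equivalent residues divided by the length of the larger protein.

```python
def same_fold_family(ssap_score: float, n_equivalent: int,
                     len_a: int, len_b: int,
                     min_score: float = 70.0,
                     min_overlap: float = 0.60) -> bool:
    """Toy version of the T-level test quoted on the slide.

    ssap_score   -- SSAP score of the pairwise comparison
    n_equivalent -- residues matched between the two structures (assumed input)
    len_a, len_b -- lengths of the two proteins, in residues
    """
    larger = max(len_a, len_b)
    overlap = n_equivalent / larger      # fraction of the larger protein matched
    return ssap_score >= min_score and overlap >= min_overlap

# Hypothetical comparisons, for illustration only.
print(same_fold_family(82.5, 130, 200, 150))   # 65% of the larger protein matched
print(same_fold_family(82.5, 100, 200, 150))   # only 50% matched
print(same_fold_family(65.0, 190, 200, 195))   # score below the cutoff
```

Raising `min_score` is one way to mimic the subdivision of very populated fold families mentioned on the slide.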
Protein Classifications
CATH
'Score'
as a classification criterion?
Legitimate:
• Comparative study to SCOP etc. - good
• Separating power of the score - tested
• Using it in 'real world' competitions
• Applying to structural genomics
Protein Classifications
CATH
Homologous Superfamily, H
This level groups together protein domains which are thought to share a common ancestor and can therefore be described as homologous.
Similarities are identified first by sequence comparisons and subsequently by structure comparison using SSAP.
The criteria:
• Sequence identity >= 35%, 60% of larger structure equivalent to smaller
• SSAP score >= 80.0 and sequence identity >= 20%, 60% of larger structure equivalent to smaller
• SSAP score >= 80.0, 60% of larger structure equivalent to smaller, and domains which have related functions
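The H-level criteria can be sketched as a predicate over one pairwise domain comparison. This is an illustration only: the slide does not say how the three criteria combine, so the code assumes that satisfying any one of them is enough, and the class and field names are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class DomainComparison:
    seq_identity: float      # percent sequence identity
    ssap_score: float        # SSAP structure-comparison score
    overlap: float           # fraction of the larger structure equivalent to the smaller
    related_function: bool   # whether the two domains have related functions

def same_homologous_superfamily(c: DomainComparison) -> bool:
    """Assumed reading: all three criteria need 60% overlap; any one suffices."""
    if c.overlap < 0.60:
        return False
    by_sequence = c.seq_identity >= 35.0
    by_structure_and_sequence = c.ssap_score >= 80.0 and c.seq_identity >= 20.0
    by_structure_and_function = c.ssap_score >= 80.0 and c.related_function
    return by_sequence or by_structure_and_sequence or by_structure_and_function

# Hypothetical comparisons, for illustration only.
print(same_homologous_superfamily(DomainComparison(40.0, 60.0, 0.70, False)))
print(same_homologous_superfamily(DomainComparison(12.0, 85.0, 0.80, True)))
print(same_homologous_superfamily(DomainComparison(12.0, 85.0, 0.80, False)))
```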
Protein Classifications
CATH
S level
Sequence families, S
Clustered on sequence identity.
Domains clustered in the same sequence families have sequence
identities >35% (with at least 60% of the larger domain equivalent to
the smaller).
Protein Classifications
CATH
Protein Classifications
Protein Classifications
Protein Classifications
Example
Protein Classifications
3 domains in histone acetyltransferase
SCOP vs CATH
SCOP vs CATH
Coping with the data
CATH addition:
Temporary assignment
Moving from 'classification'
to a distance map: why?
Biological examples
Weak connections
"Hopping" in the 'map' to find relatedness (via intermediates)
Databases of structural
classification
• FSSP
– L.L. Holm and C. Sander
– Fully automated using the DALI algorithm (Holm and Sander)
– No internal node annotations
– Structural similarity search using DALI (considered best..)
• Pclass
– A. Singh, X. Liu, J. Chang, D. Brutlag
– Fully automated using the LOCK and 3dSearch algorithms
– All internal nodes automatically annotated with common terms
– Structural similarity search using 3dSearch
DALI
• Based on aligning 2-D intra-molecular distance matrices
• Computes the best subset of corresponding residues from
the two proteins such that similarity between the 2-D
distance matrices is maximized.
• Searches through all possible alignments of residues
(Monte-Carlo algorithms).
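The idea of comparing intra-molecular distance matrices can be illustrated with a toy example. The sketch below is not DALI's scoring function or its search procedure: it builds a distance matrix per structure and, for a given residue alignment, counts how many residue-pair distances agree within a tolerance. All names, coordinates and the tolerance are assumptions made for the illustration.

```python
import math

def distance_matrix(coords):
    """2-D matrix of all pairwise intra-molecular distances (e.g. between C-alpha atoms)."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def agreement_score(dm_a, dm_b, alignment, tolerance=1.0):
    """Toy similarity: number of aligned residue pairs whose distances agree.

    alignment is a list of (index_in_a, index_in_b) residue equivalences.
    """
    score = 0
    for x in range(len(alignment)):
        for y in range(x + 1, len(alignment)):
            (ia, ib), (ja, jb) = alignment[x], alignment[y]
            if abs(dm_a[ia][ja] - dm_b[ib][jb]) <= tolerance:
                score += 1
    return score

# Hypothetical coordinates: structure B is structure A translated, plus one extra residue.
coords_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
coords_b = [(50.0, 50.0, 50.0)] + [(x + 10.0, y, z) for x, y, z in coords_a]

dm_a, dm_b = distance_matrix(coords_a), distance_matrix(coords_b)
good = agreement_score(dm_a, dm_b, [(0, 1), (1, 2), (2, 3), (3, 4)])
shifted = agreement_score(dm_a, dm_b, [(0, 0), (1, 1), (2, 2), (3, 3)])
print(good, shifted)
```

Because the matrices hold only internal distances, the score does not change when one structure is rotated or translated, so no superposition is needed. The correct equivalence scores 6 here and the shifted one only 3. A real search has to find the best set of equivalences rather than being handed one.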
DALI
DALI
• DALI has been used to do an ALL vs. ALL
comparison of proteins in the PDB, and to create a
hierarchical clustering of families.
• FSSP = Fold classification based on Structure-Structure alignment of Proteins
From classification to
distance maps
Structural distances - FSSP
The FSSP database includes all protein chains from the PDB
which are longer than 30 residues.
The chains are divided into a representative set and its sequence homologs (>25% identity).
The representative set contains no pair of such sequence homologs (every pair shares <25% identity).
An all-against-all structure comparison is performed on the
representative set.
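One simple way to build such a representative set is a greedy pass over the chains. The sketch below is an assumption about how this could be done, not the FSSP procedure itself: the chain labels and identity values are invented, and the `identity` callable stands in for a real pairwise sequence comparison.

```python
def pick_representatives(chains, identity, max_identity=25.0, min_length=30):
    """Greedy selection of a representative set.

    chains   -- list of (chain_id, length_in_residues)
    identity -- callable giving percent sequence identity between two chain ids
    A chain is kept only if it is longer than min_length and shares less than
    max_identity percent identity with every representative chosen so far.
    """
    representatives = []
    for chain_id, length in chains:
        if length <= min_length:
            continue                     # the slide keeps chains longer than 30 residues
        if all(identity(chain_id, rep) < max_identity for rep in representatives):
            representatives.append(chain_id)
    return representatives

# Hypothetical chains and pairwise identities, for illustration only.
toy_chains = [("A", 120), ("B", 118), ("C", 25), ("D", 200)]
toy_identities = {frozenset(("A", "B")): 90.0,
                  frozenset(("A", "D")): 12.0,
                  frozenset(("B", "D")): 15.0}

def toy_identity(x, y):
    return toy_identities.get(frozenset((x, y)), 0.0)

print(pick_representatives(toy_chains, toy_identity))
```

The result of a greedy pass depends on the order of the input; sorting the chains by some quality measure first would be a natural refinement.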
Structural distances - FSSP
A hierarchical clustering method is used to
construct a tree based on the structural similarities
Family indices are constructed by cutting the tree
at levels of 2, 4, 8, 16, 32 and 64 standard
deviations above database average.
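Cutting a similarity tree at several levels can be sketched as follows. The slide does not name the clustering method, so this example assumes single linkage: two proteins fall into the same group at a given cutoff if a chain of pairwise scores at or above that cutoff connects them. Scores are taken to be in units of standard deviations above the database average; the protein names and values are invented.

```python
def clusters_at_cutoff(names, z_scores, cutoff):
    """Single-linkage groups: link every pair whose score is >= cutoff."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path compression
            n = parent[n]
        return n

    for (a, b), z in z_scores.items():
        if z >= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())

def family_indices(names, z_scores, cutoffs=(2, 4, 8, 16, 32, 64)):
    """For each protein, the group number it falls into at each cutoff."""
    index = {n: [] for n in names}
    for cutoff in cutoffs:
        groups = clusters_at_cutoff(names, z_scores, cutoff)
        for number, members in enumerate(groups, start=1):
            for n in members:
                index[n].append(number)
    return index

# Hypothetical proteins and pairwise scores, for illustration only.
toy_names = ["p1", "p2", "p3", "p4"]
toy_scores = {("p1", "p2"): 40.0, ("p2", "p3"): 9.0,
              ("p3", "p4"): 3.0, ("p1", "p4"): 1.0}

print(clusters_at_cutoff(toy_names, toy_scores, 16))
print(family_indices(toy_names, toy_scores))
```

Each protein ends up with six numbers, one per cutoff, which together play the role of a hierarchical family index.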
DALI finds surprising
homologues
Many unexpected links:
• Histone and heat-shock protein
• Adenosine deaminase & phosphodiesterase (13% identity)
• Chymotrypsin & epidermolytic toxin A (S. aureus)
Structural distances - FSSP
About 700 entries (as in the ENZYME DB)
Structural distances - FSSP
VAST: Vector Alignment Search Tool
• Aligns only secondary structure elements (SSEs)
• Represents each SSE as a vector
• Finds all possible pairs of vectors from the two structures that are similar
• Uses a graph-theory algorithm to find the maximal subset of similar vectors.
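The steps listed for VAST can be mimicked on a toy scale. The sketch below is not the VAST implementation: SSEs are plain (type, start, end) tuples, the similarity thresholds are invented, and the graph-theory step is replaced by a brute-force search for the largest set of mutually compatible SSE pairs, which is only workable for very small inputs.

```python
import math
from itertools import combinations

# Each SSE is (kind, start_xyz, end_xyz); kind is "H" (helix) or "E" (strand).
def vector(sse):
    _, start, end = sse
    return tuple(e - s for s, e in zip(start, end))

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def angle(u, v):
    cosine = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

def midpoint(sse):
    _, start, end = sse
    return tuple((s + e) / 2 for s, e in zip(start, end))

def candidate_pairs(a, b, max_len_diff=3.0):
    """SSE pairs (i in a, j in b) of the same type and similar length."""
    return [(i, j) for i, x in enumerate(a) for j, y in enumerate(b)
            if x[0] == y[0] and abs(norm(vector(x)) - norm(vector(y))) <= max_len_diff]

def compatible(p, q, a, b, max_angle_diff=20.0, max_dist_diff=3.0):
    """Two pairs are compatible if the SSEs keep their relative geometry."""
    (i, j), (k, l) = p, q
    if i == k or j == l:
        return False
    angle_diff = abs(angle(vector(a[i]), vector(a[k])) - angle(vector(b[j]), vector(b[l])))
    dist_diff = abs(math.dist(midpoint(a[i]), midpoint(a[k]))
                    - math.dist(midpoint(b[j]), midpoint(b[l])))
    return angle_diff <= max_angle_diff and dist_diff <= max_dist_diff

def best_match(a, b):
    """Largest set of mutually compatible pairs (brute force, toy sizes only)."""
    pairs = candidate_pairs(a, b)
    for size in range(len(pairs), 0, -1):
        for subset in combinations(pairs, size):
            if all(compatible(p, q, a, b) for p, q in combinations(subset, 2)):
                return list(subset)
    return []

def shift(point, offset=20):
    return tuple(c + offset for c in point)

# Hypothetical structures: B is A translated, with its SSEs listed in another order.
structure_a = [("H", (0, 0, 0), (10, 0, 0)),
               ("E", (0, 5, 0), (0, 5, 8)),
               ("H", (5, 10, 0), (15, 10, 0))]
structure_b = [(kind, shift(s), shift(e)) for kind, s, e in
               (structure_a[1], structure_a[0], structure_a[2])]

print(best_match(structure_a, structure_b))
```

Working on a handful of SSE vectors instead of hundreds of residues is what makes this kind of comparison fast enough for all-against-all searches.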
VAST
• VAST has been used to do an ALL vs. ALL
comparison of proteins in the MMDB (NCBI’s
structure database), and to find structure neighbors
for each structure.
• MMDB provides a service for searching structure
neighbors using VAST.
Proteins and major challenges
Predicting protein
function on a
genomic scale
Understand Proteins,
through analyzing large amounts
Structures
Functions
Evolution
(motions, packing, folds)
A new concept called PROTEOME
(PROTEin complement to a genOME)
Proteomics can be defined as the qualitative and
quantitative comparison of proteomes under different
conditions to further unravel biological processes.
How to predict function
for 1000s of proteins?
• 250 of ~650 known on chr. 22 [Dunham et al.]
• >>30K+ genes in the entire human genome (alt. splicing)
How to predict functions
for 1000s of proteins?
1) "Traditional" sequence patterns
2) Via fold similarity (structural genomics)
3) Clustering a microarray experiment
4) Data integration
5) Advanced classification methods using multilayer information
Function is a fuzzy term..
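[Table: functional categories marked with one to five "+" signs; the assignment of marks to categories is not recoverable]
• Structural: cytoskeletal
• Transporters, channels
• Signaling: switch, adaptor
• Transcription, NA binding
• Recognition: receptors, immune
• Enzymes
• ….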
How to predict functions
for 1000s of proteins?
1) "Traditional" sequence patterns
Compare uncharacterized genome sequences
against known sequences in DBs, transferring
function annotation for similar sequences
Issue: Threshold is major parameter & limitation
Also, look for motifs & sites [Sternberg, Thornton, Rose, Koonin]
Wait for next class
1000s of structurally based alignments
of structurally and functionally characterized
sequences
Sequence
Similarity scores
Domains
Function
One feature ENZYMES
Motifs
Signatures
Organized DB
Families
….
Relatively easy function
ENZYMES
Functionally characterized Enzymes
By Cofactors
6-hydroxyDOPA
Ammonia
Ascorbate
ATP
Bicarbonate
Bile salts
Biotin
Cadmium
Calcium
Cobalamin
Cobalt
Coenzyme F430
Coenzyme-A
Copper
Dipyrromethane
Dithiothreitol
Divalent cation
F420
FAD
Fe(II)
Flavin
Flavoprotein
FMN
Glutathione
Heme
Heme-thiolate
Iron
Iron(II)
Iron-molybdenum
Iron-sulfur
Lipoyl group
Magnesium
Manganese
Molybdenum
Molybdopterin
Monovalent cation
NAD
NAD(P)H
Nickel
Potassium
PQQ
Protoheme IX
Pterin
Pyridoxal phosphate
Pyridoxal-phosphate
Pyruvate
Reduced flavin
Selenium
Siroheme
Sodium
Tetrahydropteridine
Thiamine pyrophosphate
Thiol-dependent
Tryptophan…………..
Functionally characterized Enzymes
Catalysis
1.-.-.- Oxidoreductases.
  1.1.-.- Acting on the CH-OH group of donors.
  1.2.-.- Acting on the aldehyde or oxo group of donors.
  1.3.-.- Acting on the CH-CH group of donors.
  1.4.-.- Acting on the CH-NH(2) group of donors.
  1.5.-.- Acting on the CH-NH group of donors.
  1.6.-.- Acting on NADH or NADPH.
5.-.-.- Isomerases.
  5.1.-.- Racemases and epimerases.
  5.2.-.- Cis-trans-isomerases.
  5.3.-.- Intramolecular oxidoreductases.
  5.4.-.- Intramolecular transferases (mutases).
  5.5.-.- Intramolecular lyases.
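Because EC numbers are hierarchical, the functional similarity of two enzymes can be expressed as the number of leading levels they share. The sketch below shows one way to compute this; the function names are invented for the example, and a "-" is treated as an unspecified level that never counts as a match.

```python
def parse_ec(ec: str):
    """Split an EC number such as '5.3.1.1' or '1.1.-.-' into four levels."""
    parts = [p.strip() for p in ec.split(".")]
    if len(parts) != 4:
        raise ValueError(f"expected four levels in EC number: {ec!r}")
    return tuple(None if p == "-" else int(p) for p in parts)

def shared_levels(ec_a: str, ec_b: str) -> int:
    """Number of leading EC levels on which two enzymes agree (0 to 4)."""
    count = 0
    for x, y in zip(parse_ec(ec_a), parse_ec(ec_b)):
        if x is None or y is None or x != y:
            break
        count += 1
    return count

print(shared_levels("5.3.1.1", "5.3.1.1"))    # same exact function
print(shared_levels("5.3.1.1", "5.3.1.24"))   # same sub-subclass, different enzyme
print(shared_levels("5.3.1.1", "4.2.1.11"))   # different classes
print(shared_levels("1.1.-.-", "1.2.-.-"))    # same class only
```

Four shared levels correspond to the same exact function, one shared level only to the same broad class (for example, both isomerases).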
1000s of structurally based alignments of structurally and functionally characterized sequences
[Figure: sequence similarity against function for pairs of enzymes. Enzymes shown: 5.3.1.1 (TP isomerase, three sequences), 5.3.1.24 (PRA isomerase), 5.3.1.15 (xylose isomerase), 4.1.3.3 (aldolase), 4.2.1.11 (enolase). Species labels: human, chick, E. coli, B. ster., yeast. Sequence identity labels: 90%, 45%, 20%. Function labels: same exact function; both class 5 (isomerases); different classes.]
[Plot: percentage of pairs that have the same precise function, as defined by the Enzyme & FlyBase functional classifications (0 to 100), against sequence similarity of pairs of proteins (%ID, 70 down to 0)]
Relationship of Similarity in Sequence to that in Function
[Plot: % same function (0 to 100) against %ID (70 down to 0)]
Relationship of Similarity in Sequence to that in Function
[Plot: % same function (0 to 100) against %ID (70 down to 0). Region label: can transfer both fold & functional annotation]
Relationship of Similarity in Sequence to that in Function
[Plot: % same function (0 to 100) against %ID (70 down to 0). Region labels: can transfer both fold & functional annotation; can transfer annotation related to fold but not function; cannot transfer fold or functional annotation ("Twilight Zone")]
Relationship of Similarity in Sequence to that in Function
[Plot: % same function (0 to 100) against %ID (70 down to 0), comparing broad vs narrow similarity. Region labels: can transfer both fold & functional annotation; can transfer annotation related to fold but not function; cannot transfer fold or functional annotation ("Twilight Zone")]
Relationship of Similarity in Sequence to that in Function
Caveats: sequence divergence of multidomain proteins implies a high threshold (>40-50%)
[Figure: single-domain sequences vs multidomain sequences; species labels: human, chick, E. coli, B. ster., yeast, rat]
How to predict functions
for 1000s of proteins?
1) "Traditional" sequence patterns
2) Via fold similarity (structural
perspective)
Structures of ORFs with unknown function,
Use Fold & Site Similarity to Determine Function
Rationale for Structure Prediction
Issue:
To what degree does fold determine function?
[Kim, Edwards & Arrowsmith, Montelione, Burley, Eisenberg]
Fold-Function
Combinations
Many Functions on
Same Fold (TIM-barrel)
Different Folds with
Same Function
(Carbonic
Anhydrases, 4.2.1.1)
Fold-Function
Combinations
Same function, different fold
Same class
Carbonic
Anhydrase
Global View of Fold-Function Combinations
[Figure: matrix of 229 folds against 91 enzymatic functions plus non-enzymes (Non-Enz)]
Correlation with Structural Features
[Figure: matrix of 229 folds against 91 enzymatic functions plus Non-Enz, with the folds grouped by architectural class (all-α, all-β, ..., small)]
Correlation with Structural Features
[Figure: matrix of 229 folds, grouped by architectural class (all-α, all-β, ..., small), against 91 enzymatic functions grouped by enzyme class, plus Non-Enz; one region is marked "Slight Overpopulation"]
Global View of Fold-Function Combinations
[Figure: matrix of 229 folds against 91 enzymatic functions plus Non-Enz, after sorting]
To what degree is fold associated with function? Folds with multiple functions
[Histogram: frequency in database of 229 folds against the number of functions associated with a fold]
[Similar results
by Thornton]
Most Versatile Folds – Relation to Interactions
The number of
interactions for
each fold =
the number of
other folds it is
found to contact in
the PDB
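The interaction count defined above is straightforward to compute once fold-fold contacts have been extracted. The sketch below assumes the contacts are already available as pairs of fold labels; the labels are placeholders, and how contacts are detected in PDB entries is outside the example.

```python
from collections import defaultdict

def interactions_per_fold(contacts):
    """Number of *other* folds each fold is found to contact.

    contacts -- iterable of (fold_a, fold_b) pairs observed in contact.
    Repeated observations and self-contacts do not add to the count, so a
    fold seen only in contact with itself does not appear in the result.
    """
    partners = defaultdict(set)
    for fold_a, fold_b in contacts:
        if fold_a != fold_b:
            partners[fold_a].add(fold_b)
            partners[fold_b].add(fold_a)
    return {fold: len(others) for fold, others in partners.items()}

# Hypothetical contact list, for illustration only.
toy_contacts = [("fold_1", "fold_2"), ("fold_1", "fold_3"),
                ("fold_2", "fold_1"), ("fold_1", "fold_1"),
                ("fold_3", "fold_4")]

counts = interactions_per_fold(toy_contacts)
print(sorted(counts.items(), key=lambda item: -item[1]))
```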
Some common folds
in phylogenetic
groups
Not all folds shared between phylogenetic groups
Evolution of new folds
[Venn diagram: fold counts across plants, eubacteria, eukaryotes and animals; region counts shown: 46, 156, 73, 20, 104, 90]
Summary
Structural classification
Fold-function, not a simple relation
Structure is very informative