Bologna Winter School 2007
Protein Function
Basic questions:
How do proteins evolve changed or
novel functions?
Given the amino acid sequences of
proteins inferred from genomic
sequences, how can we assign
functions to them?
Genomics gives us many new protein
sequences


Often there is little experimental information
about the proteins themselves
What can we deduce about proteins from their
amino acid sequences?
… from the amino acid sequence of one protein
alone?
… from comparisons of amino acid sequences of
related proteins from different species?
What properties of proteins do we want to
learn about and how do we measure and
analyse them?

amino acid sequence

three-dimensional structure

FUNCTION

expression pattern

regulation
Can we learn these properties by
studying purified proteins in isolation?

amino acid sequence – yes, in principle

three-dimensional structure -- certainly

FUNCTION -- ??????

expression pattern – yes if we had to

regulation – probably not
How do we learn these?


amino acid sequence – genomic sequences
three-dimensional structure –
X-ray, NMR, ... modelling

FUNCTION – experiment? inference?

expression pattern -- microarrays

regulation – ChIP-chip experiments
Does knowledge about related proteins help?

amino acid sequence – possibly

three-dimensional structure – MR (molecular replacement), modelling

FUNCTION – YES!

expression pattern – maybe

regulation -- maybe
BUT, HOW??
Function is difficult



Sequence determines structure determines
function
From knowing sequence and structure of one
protein alone, can we deduce its function?

Identify binding site?

Identify catalytic residues?

Identify ligand?
Analogy to drug-design problem.
Given a protein structure can we
predict function directly?

Sometimes… To some extent …

What are reasonable goals?


Sometimes structure gives general idea,
guiding laboratory work to pin it down
Some examples from H. influenzae structural
genomics project
HI1679



α/β-hydrolase fold, putative remote homology
to L-2-haloacid dehalogenases
Several substrates tried.
HI1679 cleaved 6-phosphogluconate,
phosphotyrosine
HI1434



related to a region in tRNA synthetases.
contains putative binding site, likely to bind
nucleotide
no specific ligand has yet been identified
Nuclear Transport Factor-2
• Protein known to be
involved in trafficking
across nuclear
membrane
• Crystal structure
determined
• Mechanism of function
not obvious
• ???
NTF-2 homologous to scytalone
dehydratase
• Alexei Murzin spotted
a similarity of fold
between NTF-2 and
scytalone dehydratase
• This structure shows
scytalone dehydratase
binding an inhibitor
Scytalone dehydratase
Scytalone dehydratase is an enzyme in
the pathway for melanin synthesis
NTF-2
Superposition
Search for ligands



On the basis of the structural similarity, many
ligands were designed and tested
So far, none has shown any binding or
catalyzed reactivity
Conclusion: structural similarity is useful guide
to hypotheses about function, but doesn’t
always work …
But many similar proteins have
similar functions, don't they?

In many cases closely-related proteins have
closely-related functions.

Example: human and horse haemoglobin

43 residue differences out of 287 (α+β chains)

85% residue identity

SAME FUNCTION
Function assignment from homology?


OK, if the sequences differ greatly then the
function may differ
But if the sequences are similar, the functions
will be the same – WON'T THEY?

Well, sometimes ...
'Homology modelling' of function?



Sequence determines structure determines
function
Small changes in sequence produce small
changes in structure
Similar sequences produce similar structures
BUT:
dependence of function on sequence (and even
on structure) doesn't have simple ‘topology’
Recruitment




In many cases, similar proteins retain similar
functions (example: mammalian globins)
Distantly-related proteins can retain function or
diverge in function
But closely-related proteins can have very
different functions
Even identical proteins can carry out different
functions
Avian eye-lens proteins



In the duck, crystallins have identical
sequences to liver enolase and lactate
dehydrogenase
They never see the substrates in the eye
In other birds, sequences have changed
enough to lose catalytic activity. This proves
that enzymatic activity not necessary in eye
Proteinase do = DegP

Chaperone at low temperatures

Proteinase at high temperatures

Logic: moderate stress – try to rescue proteins

more extreme stress – give up and recycle
Function annotation in databases


Proteins appear in databases when their
sequences are known
Annotation of function?

Experimental evidence for function

Transfer of function from homologue

How well does this work?

How can we tell?

Requires measure of distance between functions
Two goals of this kind of work
1. To study how protein function diverges as
amino acid sequence diverges
2. To evaluate the accuracy of transfer of
annotation among homologous proteins
Problems associated with goal 2 make goal 1
harder
How do proteins change function as
their sequences diverge

Divergence v. recruitment

Divergence:

Change in specificity (chymotrypsin, trypsin)

Change in regulation (myoglobin, haemoglobin)

Related functions with similar mechanisms
(adaptation of catalytic site) (Gerlt & Babbitt)
Gene duplication and divergence


General way to develop new functions
Very old theory about how metabolic pathways
developed – new protein developed to provide
substrate for current initial step:





Now growing on B
(B → C → D → … → ATP)
Medium runs out of B.
B→C enzyme duplicates, diverges to catalyze A→B
Now you can grow on A
(A → B → C → D → … → ATP)
Attractive because:


B→C enzyme has binding site for B
explains gene organization in operon
WRONG: mechanism of A→B in general different from
B→C, needs different structure, catalytic residues
Derivation of function from coordinates
analysis of sequence and structure


Homologous proteins may have diverged in
sequence and function (leave aside recruitment)
Assume no strong sequence similarity to protein
of known function

Align sequences

Use structure to get better alignments

Check for conservation of binding site, catalytic
residues
Structure-based function assignment

Extract functional residues from structures of
known function

Residues contributing to function of entire
homologous family conserved in whole family

Residues contributing to specific function of
subfamily conserved only in subfamily
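
The conservation idea above can be sketched in a few lines. This is a toy illustration, not the method of any of the groups cited here: the function name `classify_columns`, the sequence names and the toy alignment are all hypothetical, and real methods use weighted conservation scores rather than strict identity.

```python
from collections import defaultdict

def classify_columns(alignment, subfamily):
    """Label alignment columns by their conservation pattern.

    alignment: dict name -> aligned sequence (equal lengths, '-' for gaps)
    subfamily: dict name -> subfamily label (assumed known in advance)
    Returns dict column index -> 'family' or 'subfamily'.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    groups = defaultdict(list)
    for name in names:
        groups[subfamily[name]].append(name)

    result = {}
    for col in range(length):
        residues = {alignment[n][col] for n in names}
        if "-" in residues:
            continue  # skip gapped columns in this simple sketch
        if len(residues) == 1:
            result[col] = "family"  # identical in every sequence
            continue
        # Identical within each subfamily, but not across the whole family.
        per_group = [{alignment[n][col] for n in members}
                     for members in groups.values()]
        if all(len(s) == 1 for s in per_group):
            result[col] = "subfamily"
    return result

# Hypothetical toy alignment: two subfamilies, A and B.
toy_alignment = {"a1": "GDSGK", "a2": "GDSGR", "b1": "GASGT", "b2": "GASGE"}
toy_subfamily = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
print(classify_columns(toy_alignment, toy_subfamily))
```

Note that a subfamily with a single member is trivially "conserved", so the sketch is only meaningful when each subfamily has several sequences.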
Several groups have applied these ideas

Cohen & Lichtarge, ‘Evolutionary Trace Method’
(J. Mol. Biol. 1996)

Irving, Whisstock, Lesk (Proteins 2001)

Hannenhalli & Russell (J. Mol. Biol. 2000)


Sternberg and coworkers (PNAS 2004, Phil.
Trans. Roy. Soc. 2006)
See also: Automated Function Prediction, ISMB
Special Interest Group Meeting, 2005
How could we test predictions of
function?
How to measure distance between functions?

For sequences and structures, there are natural
measures of divergence

Sequence: count identical residues

Structures: r.m.s.d. of well-fitting parts
(Specialists may argue about details, or
propose alternatives, but basically the answers
aren't too different.)

Function: no natural measure of difference
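
As a minimal sketch of the two "natural" measures just mentioned, assuming the sequences are already aligned and the coordinates already superposed (neither step is shown), one might write the following. The function names are illustrative only.

```python
import math

def percent_identity(seq_a, seq_b):
    """Percentage of identical residues over aligned, non-gap positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    same = sum(1 for a, b in pairs if a == b)
    return 100.0 * same / len(pairs)

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation of matched atoms.

    Assumes the two coordinate lists are already optimally superposed
    and restricted to the well-fitting parts.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

print(percent_identity("ACDE", "ACDF"))
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 2)]))
```

No comparably simple function exists for "distance between functions", which is the point of the slide.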
Enzyme Commission / EC numbers





(EC numbers NOT European Commission)
Authorized by International Union of
Biochemistry and Commission on Enzyme
Nomenclature
EC set up by International Union of
Biochemistry in 1955.
Report in 1961, modified 1964, several
supplements since then.
Published as book, now available on web
What does EC classify

Enzyme nomenclature

Classification of reactions catalysed by
enzymes

NOT a set of assignment of function to proteins
– That is a different task

(Note that Gene Ontology – another
classification scheme – also does not assign
functions to proteins)
Enzyme Commission numbers


Four-level hierarchy
Example: isopentenyl-diphosphate ∆-isomerase
EC number 5.3.3.2:

5 = general category (of isomerases)
 5.3 = intramolecular isomerases
 5.3.3 = enzymes that transpose C=C bonds
 5.3.3.2 = specific reaction

EC classifies reactions, names enzymes that
catalyse reactions, does not name proteins.
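
Because the hierarchy has four levels, one simple way to compare two EC numbers is to count how many leading levels they share. The helper below is a sketch with a hypothetical name, `shared_ec_levels`; it treats an unspecified level written as "-" as a mismatch.

```python
def shared_ec_levels(ec_a, ec_b):
    """Number of leading EC levels (0 to 4) that two EC numbers share."""
    levels = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a != b or a == "-":
            break
        levels += 1
    return levels

print(shared_ec_levels("5.3.3.2", "5.3.3.1"))  # same sub-subclass
print(shared_ec_levels("5.3.3.2", "5.3.1.1"))  # same subclass only
print(shared_ec_levels("5.3.3.2", "1.1.1.1"))  # different class
```

A larger count means more closely related reactions, but the count is ordinal at best: it is not a true metric on function.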
Gene Ontology




EC limited to enzymes
Gene Ontology consortium produced new, more
general classification of protein function
Three independent categories:

Molecular function (overlaps EC)

Biological process

Subcellular location
GO: not tree structure, directed acyclic graph
Gene Ontology project

Initiated by Michael Ashburner (late 1990s).

Has since grown, become de facto standard

References:

Lewis, S.E. (2004). Gene Ontology: looking
backwards and forwards.
Genome Biology 6:103.

Ashburner, M. (2006). Won for All / How the
Drosophila Genome was Sequenced. Cold Spring
Harbor Laboratory Press.
What is an ontology?

Specification of how to describe a body of
knowledge

Nomenclature (fixed vocabulary)

Rules of syntax of terms

Types of relationships among entities:

‘Is a’: for instance: ‘A cat is a mammal.’

‘Part of’: for instance: ‘A tail is part of a cat.’

Note that ‘A cat is a mammal. A mammal is an
animal’ implies that ‘A cat is an animal’

But ‘A tail is part of a cat. A cat is a mammal.’ does
NOT imply that a tail is a mammal.
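
The difference between the two relationship types can be made concrete with a small sketch. The dictionaries and function names below are hypothetical; the only point is that ‘is a’ links are followed transitively, while a ‘part of’ link is never treated as an ‘is a’ link.

```python
# Toy ontology from the slide: each key maps to its direct parents.
IS_A = {"cat": {"mammal"}, "mammal": {"animal"}}
PART_OF = {"tail": {"cat"}}

def is_a_ancestors(term, is_a_links=IS_A):
    """All terms reachable from `term` by following 'is a' links only."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a_links.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def holds_is_a(child, ancestor):
    """True if 'child is a(n) ancestor' follows by transitivity."""
    return ancestor in is_a_ancestors(child)

print(holds_is_a("cat", "animal"))   # follows from two 'is a' links
print(holds_is_a("tail", "mammal"))  # 'part of' does not give 'is a'
```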
GO classification of
isopentenyl-diphosphate ∆-isomerase
Several groups have measured relationship
between sequence divergence and
functional divergence using EC classification




Example: Todd, Orengo & Thornton, JMB 2001
For enzymes, sequence identity > 40%, all four
EC numbers conserved
sequence identity > 30% three levels of EC
numbers conserved for 70% of pairs
How can this work be extended to GO
classification?

How to define metric on functions?

Distal GO-IDs

How to measure distance between SETS of
GO-IDs
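
One possible answer to these questions is sketched below. It is an assumption for illustration, not the metric used in the lecture: the toy graph, the term names, the reading of "distal" as "most specific term in a set", the Jaccard distance on ancestor sets, and the best-match average over sets are all choices made here.

```python
# Hypothetical toy DAG: each term maps to its direct parents.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "ion_binding": {"binding"},
    "calcium_binding": {"ion_binding"},
    "kinase": {"catalysis"},
}

def closure(term, parents=PARENTS):
    """The term itself plus all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def distal_terms(terms):
    """Drop any term that is an ancestor of another term in the set."""
    terms = set(terms)
    return {t for t in terms
            if not any(t != other and t in closure(other) for other in terms)}

def term_distance(a, b):
    """Jaccard distance between ancestor closures (0 = same term)."""
    ca, cb = closure(a), closure(b)
    return 1.0 - len(ca & cb) / len(ca | cb)

def set_distance(set_a, set_b):
    """Average best-match distance between two non-empty sets of terms."""
    if not set_a or not set_b:
        raise ValueError("both sets must be non-empty")
    a_to_b = [min(term_distance(a, b) for b in set_b) for a in set_a]
    b_to_a = [min(term_distance(a, b) for a in set_a) for b in set_b]
    return (sum(a_to_b) + sum(b_to_a)) / (len(a_to_b) + len(b_to_a))

print(distal_terms({"binding", "calcium_binding", "kinase"}))
print(set_distance({"calcium_binding"}, {"ion_binding"}))
```

Other choices (shortest path through a common ancestor, information-content measures) give different numbers; the sketch only shows that a distance between sets can be built from a distance between terms.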
Dependence of function divergence on
sequence divergence: the EF-hand family
(Figure; axes: GO distance, fraction of pairs)
GO: Sources of annotation

GO categories of sources of annotation:
IDA: Inferred from direct assay
TAS: Traceable author statement
IMP: Inferred from mutant phenotype
IGI: Inferred from genetic interaction
IPI: Inferred from physical interaction
ISS: Inferred from sequence similarity
IEA: Inferred from electronic annotation
NAS: Non-traceable author statement
Sources of Annotation: Experiment / Inferred
From: Thomas, P.D., Mi, H. & Lewis, S. (2007). Curr. Opin. Chem. Biol. 11, 4-11.
To study accuracy of annotation transfer, use
experimental annotation only?

Obviously.

But there are problems.

Many fewer data

Inconsistencies

Sometimes annotation correct, but source of
annotation incorrect
Conclusions




It is possible to define statistical distribution
describing relationship between divergence of
sequence and divergence of function
General rule: as sequences diverge, function
diverges. But: exceptions exist
Threshold at about 50% sequence identity, below
which function starts to diverge more radically
Database annotations contain many errors and
omissions; curation is still a human, labour-intensive
activity
Errors in databases
1. Keep them out – But how?
2. natural language processing by computer?
(Automatic: literature → database)???
3. If you find them correct them (you = WHO?)
4. Correct them where?

Master copy of database?

What about copies? Errors propagate?

How to propagate corrections?
Correction of Errors in Databases?

Eternal vigilance at each installation?????

Community involvement – curation by experts?

Open source idea – bulletin board?

‘Knowbots’ running around web? Security?

Distribute programs for ‘health checks’?
Inconsistencies

Different databases use different versions of
GO

Different versions of different databases

Downloaded versions of different databases
may not be updated to reflect changes in parent
databases

What can be done?
Distributed updating of databases
Park, Park & Kim (2004). Bioinformatics Appl. Note.


Gene Ontology classification provides basis for
database annotations
Updates to GO include:







new terms
new obsoletions
term name changes
new definitions
new term merges
term movements
Require updating of annotations
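
To make the updating problem concrete, here is a minimal sketch of a local check that flags annotations whose term IDs have been merged, made obsolete, or are unknown in the current ontology release. It is not GOChase's implementation; the function name, the table layout and the `TERM:` identifiers are hypothetical.

```python
def health_check(annotations, current_terms, merged_into, obsolete):
    """Flag out-of-date term IDs in a local annotation table.

    annotations:   dict protein -> list of term IDs
    current_terms: set of IDs valid in the current ontology release
    merged_into:   dict old ID -> ID it was merged into
    obsolete:      set of IDs that were made obsolete
    Returns a list of (protein, term, problem, suggested replacement).
    """
    report = []
    for protein, terms in sorted(annotations.items()):
        for term in terms:
            if term in merged_into:
                report.append((protein, term, "merged", merged_into[term]))
            elif term in obsolete:
                report.append((protein, term, "obsolete", None))
            elif term not in current_terms:
                report.append((protein, term, "unknown", None))
    return report

# Hypothetical data for illustration only.
annotations = {"protA": ["TERM:0001", "TERM:0002"],
               "protB": ["TERM:0003", "TERM:0009"]}
current_terms = {"TERM:0001", "TERM:0004"}
merged_into = {"TERM:0002": "TERM:0004"}
obsolete = {"TERM:0003"}

for row in health_check(annotations, current_terms, merged_into, obsolete):
    print(row)
```

The check only recommends changes; applying them is left to whoever maintains the local copy.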
GOChase (Park, Park & Kim)


Recommend updates (security considerations
require local file changes)
Web-based interfaces:

GOChase-History: evolution of GO ID
 GOChase-Correct: suggests change
 Health check of your database: flag problems
 Submit GO ID: report its use in annotation in a list of
common databases
http://www.strubi.org/software/GOChase/
What other relationships among
properties of organisms are useful in
assigning function?
What are we looking for?
We might try to identify proteins that have similar

functions in same or different species

Human and Horse haemoglobin

We may be able to find these if they are homologues
We might try to identify proteins that have

coordinated functions in same or different species

Two or more proteins in same metabolic pathway, or part
of same macromolecular complex

These may in general NOT be homologues
Various clues that proteins have
coordinated activities



Linked on genome? (Best for bacteria, not for
archaea; occasionally for eukaryotes)
Appear as separate (monomeric) proteins in
one species, and as single multidomain protein
in other species
Often separate proteins in prokaryotes are
fused in eukaryotes (but some examples of
opposite are known)
Function assignment by
reconstruction of metabolic
pathways
Shikimate kinase in Methanococcus jannaschii



In E. coli, shikimate kinase is an enzyme in the
pathway of synthesis of chorismate from
erythrose-4-phosphate
chorismate is a branch compound for the
synthesis of aromatic amino acids
tryptophan synthetase pathway one of the best
worked-out in E. coli, in terms of enzymology
and regulation
Pathway of synthesis of shikimate
from erythrose-4-P in E. coli
From: Daugherty et al., J Bacteriol. 2001 January; 183(1): 292–300.
Cross-table of metabolic steps and genes


Match up known genes and known metabolic
steps
No recognized protein for metabolic step?



Maybe metabolic step is missing from that organism
No recognized function for some gene?
Maybe can match up missing function with gene
missing function assignment
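
The cross-table can be sketched as a small data structure. The gene identifiers below are hypothetical and the pathway is abbreviated to four steps; the point is only that steps without a gene and genes without a function fall out of the same table.

```python
def cross_table(pathway_steps, gene_functions):
    """Match known genes to known metabolic steps.

    pathway_steps:  ordered list of step (enzyme) names
    gene_functions: dict gene -> step name, or None if unassigned
    Returns (table, missing_steps, orphan_genes).
    """
    table = {step: [] for step in pathway_steps}
    orphan_genes = []
    for gene, function in sorted(gene_functions.items()):
        if function is None:
            orphan_genes.append(gene)       # gene lacking an assignment
        elif function in table:
            table[function].append(gene)
    missing_steps = [s for s in pathway_steps if not table[s]]
    return table, missing_steps, orphan_genes

# Hypothetical annotation state for one genome.
steps = ["dehydroquinate synthase", "shikimate dehydrogenase",
         "shikimate kinase", "chorismate synthase"]
genes = {"gene_001": "dehydroquinate synthase",
         "gene_002": "shikimate dehydrogenase",
         "gene_003": None,
         "gene_004": "chorismate synthase",
         "gene_005": None}

table, missing_steps, orphan_genes = cross_table(steps, genes)
print("steps with no gene:", missing_steps)
print("genes with no function:", orphan_genes)
```

Each orphan gene is then a candidate for each missing step; further evidence is needed to choose among them.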
Matching gene with function


Check for homologues

Maybe find several

Maybe find none
Look in genome for operons containing
succession of genes for steps in pathway

Usually works in bacteria

Less common in archaea
Aromatic amino acid biosynthesis
R. Boyer
E. coli trp operon
Note collinearity of genes with order of reactions in pathway
From: Garret, R.H. & Grisham, C.M. (1999) Biochemistry. 2nd ed. (Thomson Higher Education, Belmont, CA)
Shikimate kinase in Methanococcus jannaschii




In M. jannaschii, the shikimate kinase pathway
is NOT catalysed by enzymes consecutive in
the genome in an operon
Sequence similarity identified most enzymes
but not shikimate kinase
In another archaeon, A. pernix, the genes in this
pathway ARE collinear.
From this it was possible to identify the A.
pernix shikimate kinase, and from that the M.
jannaschii homologue.
Reference: Daugherty et al., J. Bacteriology (2001). 183, 292–300.
Mapping of genes for shikimate synthesis
in several prokaryotic genomes
From: Daugherty et al., J Bacteriol. 2001 January; 183(1): 292–300.
Why didn’t homology search work?

Archaeal shikimate kinase is NOT related to
bacterial or eukaryotic shikimate kinases.

It is distantly related to homoserine kinases of
the GHMP kinase superfamily.

M. jannaschii homoserine kinase IS identifiable
by homology

The two enzymes are substrate-specific
Phylogenetic profiles




Clues to function from genes shared among
different organisms
Different groups of organisms need different
sets of genes
For instance, some bacteria have flagella
Genes found in bacteria that have flagella
but not in other bacteria or other groups of
organisms: involved in flagellar function
Phylogenetic Profiles

Developed by Marcotte, Eisenberg et al. (PNAS
96, 4285-4288, 1999 and elsewhere)

Tabulate homologues of E. coli proteins in 16
other genomes

(Note: assume homologues share function –
this is input to method, not result)

Table: column = organism, row = gene

Put a mark if organism has gene
From: Pellegrini et al. (1999). Proc. Natl. Acad. Sci. U.S.A. 96, 4285-4288
Phylogenetic profile




Pattern of row = barcode of which organisms a
gene occurs in
Result: Genes that share patterns are
‘functionally linked’
Functionally linked = participate in some
coordinated way in some structure or process
Note: proteins can be functionally linked even if
they are not homologous
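
A phylogenetic profile is easy to sketch as a tuple of 0s and 1s. The code below is a toy illustration with hypothetical gene and organism names; it is not the published implementation, and the use of Hamming distance with a mismatch threshold is an assumption made here.

```python
def build_profiles(presence, organisms):
    """One row per gene: 1 if the organism has a homologue, else 0."""
    return {gene: tuple(1 if org in orgs else 0 for org in organisms)
            for gene, orgs in presence.items()}

def hamming(p, q):
    """Number of organisms in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))

def linked_genes(profiles, query, max_mismatches=0):
    """Genes whose profile matches the query's within max_mismatches."""
    target = profiles[query]
    return sorted(g for g, p in profiles.items()
                  if g != query and hamming(p, target) <= max_mismatches)

# Hypothetical presence/absence data.
organisms = ["org1", "org2", "org3", "org4"]
presence = {
    "geneA": {"org1", "org3"},
    "geneB": {"org1", "org3"},
    "geneC": {"org1", "org2", "org3", "org4"},
    "geneD": {"org1", "org3", "org4"},
}
profiles = build_profiles(presence, organisms)
print(profiles["geneA"])
print(linked_genes(profiles, "geneA"))
print(linked_genes(profiles, "geneA", max_mismatches=1))
```

Genes present in every genome (like `geneC` here) carry little information, since their profile matches that of every other ubiquitous gene.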
Example: ribosomal proteins




Homologues of E. coli protein RL7 are found in 10
bacterial genomes and yeast, not in archaea
Those that match phylogenetic profile have
functions associated with ribosome
Have pulled out sets of ribosomal proteins on
basis of phylogenetic profile
Linked proteins need not be homologues nor be
localized in genome
Combine phylogenetic profiling with
matching ‘orphans’

Create metabolic network for an organism

Assign functions by homology when possible

Missing enzymes in pathway?

Genes that lack assignment?

Try to match these up (recall archaeal shikimate
kinase)

Phylogenetic profiles can assist in this
From: Chen & Vitkup (2006). Genome Biol. 7, R17
Phylogenetic profiles / orphan assignment Chen &
Vitkup (2006). Genome Biol. 7, R17



Phylogenetic profiles can link proteins in a
metabolic pathway
Even more, better fit of profile implies closer in
metabolic network
Test, using yeast:



remove gene from network
try to recover it from pool of ~6000 genes
results: 22.8% top prediction correct
(37.3% correct answer in top 10)
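
The two figures quoted are top-1 and top-10 accuracies of a leave-one-out test. How such numbers are computed can be sketched as follows, with hypothetical gene names and rankings; the scoring that produces each ranking is not shown and is not reconstructed here.

```python
def top_k_accuracy(rankings, truths, k):
    """Fraction of test cases whose true gene is among the top k candidates.

    rankings: list of candidate lists, best candidate first
    truths:   list of the genes that were actually removed
    """
    if len(rankings) != len(truths) or not truths:
        raise ValueError("need one ranking per removed gene")
    hits = sum(1 for ranked, truth in zip(rankings, truths)
               if truth in ranked[:k])
    return hits / len(truths)

# Hypothetical results for three removed genes.
rankings = [["g1", "g2", "g3"], ["g5", "g4", "g6"], ["g9", "g8", "g7"]]
truths = ["g1", "g4", "g0"]
print(top_k_accuracy(rankings, truths, k=1))
print(top_k_accuracy(rankings, truths, k=2))
```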
Conclusion


Inferring protein function from knowledge of
function of a close relative is like solving the clue
of an American crossword puzzle: finding the
precise word is difficult, but the task is in principle
straightforward
Inferring function a priori from structure is like a
British crossword puzzle: which clues are real?
Which clues are misleading?
State of the art in function assignment

We have a ‘bag of tricks’ – that is, many
methods, all of which work sometimes and fail
sometimes.

In some cases, no method works except going
back to the lab and working it out.

We do not have a unified framework or a
systematic approach to function assignment