Predicting interactions between
genes based on genome sequence
comparisons
The “genomic context” component of STRING
Bioinformatics seminar series
15-11-2005
Berend Snel
Today
• Announcement: the seminar of Jakob de Vlieg on
22 November is canceled. Please consult the
website of the seminar series
(www.cmbi.ru.nl/edu/seminars) for the new date.
• Seminar (today); please ask questions !!!
• Handing out article and questions : “Identification
of a bacterial regulatory system for
ribonucleotide reductases by phylogenetic
profiling.” Read the article and hand in the
answers to the questions by Monday November
28th.
Contents
• Predicting functional interactions between
proteins; what & why
• Genomic context methods
– General
– Gene fusion
– Gene order
– Presence / absence of genes across
genomes
• Integration and benchmarking of predictions
• Biochemistry by other means BolA
• In addition to genomic context: functional
genomics data
Complete genomes, now what?
• Post-genomic era = we have the parts list
(complete genomes)
• to understand the cell we need to know the
functions of the genes
A bacterial genome
gene
408..1748
/gene="dnaA"
/locus_tag="BCE33L0001"
/old_locus_tag="BCZK0001"
CDS
408..1748
/gene="dnaA"
/locus_tag="BCE33L0001"
/old_locus_tag="BCZK0001"
/inference="non-experimental evidence, no additional
details recorded"
/codon_start=1
/transl_table=11
/product="chromosomal replication initiator protein"
/protein_id="AAU20227.1"
/db_xref="GI:51978677"
/translation="MENISDLWNSALKELEKKVSKPSYETWLKSTTAHNLKKDVLTIT
APNEFARDWLESHYSELISETLYDLTGAKLAIRFIIPQSQAEEEIDLPPAKPNAAQDD
SNHLPQSMLNPKYTFDTFVIGSGNRFAHAASLAVAEAPAKAYNPLFIYGGVGLGKTHL
MHAIGHYVIEHNPNAKVVYLSSEKFTNEFINSIRDNKAVDFRNKYRNVDVLLIDDIQF
LAGKEQTQEEFFHTFNALHEESKQIVISSDRPPKEIPTLEDRLRSRFEWGLITDITPP
DLETRIAILRKKAKAEGLDIPNEVMLYIANQIDSNIRELEGALIRVVAYSSLINKDIN
ADLAAEALKDIIPNSKPKIISIYDIQKAVGDVYQVKLEDFKAKKRTKSVAFPRQIAMY
LSRELTDSSLPKIGEEFGGRDHTTVIHAHEKISKLLKTDTQLQKQVEEINDILK"
gene
1927..3066
/gene="dnaN"
/locus_tag="BCE33L0002"
/old_locus_tag="BCZK0002"
CDS
1927..3066
/gene="dnaN"
/locus_tag="BCE33L0002"
/old_locus_tag="BCZK0002"
/EC_number="2.7.7.7"
/inference="non-experimental evidence, no additional
details recorded"
/codon_start=1
/transl_table=11
/product="DNA polymerase III, beta subunit"
/protein_id="AAU20226.1"
/db_xref="GI:51978676"
/translation="MRFTIQKDYLVRSVQDVMKAVSSRTTIPILTGIKVVATEEGVTL
TGSDADISIESFIPVEEDGKEIVEVKQSGSIVLQAKYFSEIVKKLPKETVEISVENHL
MTKITSGKSEFNLNGLDSAEYPLLPQIEEHHVFKIPTDLLKHMIRQTVFAVSTSETRP
ILTGVNWKVYNSELTCIATDSHRLALRKAKIEGIADEFQANVVIPGKSLNELSKILDE
SEEMVDIVITEYQVLFRTKHLLFFSRLLEGNYPDTTRLIPAESKTDIFVNTKEFLQAI
DRASLLARDGRNNVVKLSTLEQAMLEISSNSPEIGKVVEEVQCEKVDGEELKISFSAK
YMMDALKALDSTEIKISFTGAMRPFLIRTVNDESIIQLILPVRTY"
For most genes in any genome we need function
prediction
- E. coli, the most intensively studied organism:
only 1924 genes (~43%) have been (partially)
experimentally characterized.
Predicting protein function
What is function ?
Various levels of description:
Sequence
similarity/homology has the
largest relevance for
“Molecular Function”.
This aspect of protein
function is best conserved.
Molecular function can often
be predicted from similarities
between protein sequences
(BLAST), or structures.
Homology: BLAST and / or SMART/PFAM/CDD
gi|22209068|Mayven [Homo sapiens]
gi|21410410|Klhl2 protein [Mus musculus]
. . .
. . .
gi|55725960|hypothetical protein [Pongo pygmaeus]
gi|6644176|Klhl3 [Homo sapiens]
gi|19354513|Klhl3 protein [Mus musculus]
gi|12644384|Ring canal kelch protein [Drosophila melanogaster]
Scores (column from the BLAST output): 1159, 1145, 887, 885, 765, 676
“Beyond” homology and molecular function
Homology based function prediction works very
well, yet:
• a large fraction of genes are poorly described
(no homologs, uncharacterized homologs; this
holds for ~60% of the human genes)
• There are other aspects of function:
functional associations, e.g. the target of a
protein kinase or a transcriptional regulator,
i.e. to understand the cell we need to know
the interactions of the genes
Thus: predicting associations
There are many types of functional associations (AKA
functional interactions, interactions, functional links,
functional relations) in molecular biology
• Transcription regulation
• Signalling pathways
• Protein complexes
• Cellular process
• Metabolic pathways
Types of functional associations
• Metabolic pathways: filling gaps
• Transcription regulation
• Signalling pathways
• Cellular process (“DNA repair”, “Apoptosis”)
• Protein complexes
So how can knowledge of the functional associations
help?
• If we did not know anything about the function
of the protein we can now say in which process
it is involved
• If we already knew something about the
function, we might now know much more about
the function (i.e. if we knew it was a hydrolase
we might now know in which metabolic pathway
it is active)
• If the gene was already well characterized, we
might understand its role better (i.e. new
targets for a kinase)
Contents
• Predicting functional interactions between
proteins
• Genomic context methods
– General (how do we predict functional
interactions)
– Gene fusion
– Gene order
– Presence / absence of genes across
genomes
• Integration and benchmarking of predictions
• Biochemistry by other means BolA
• In addition to genomic context: functional
genomics data
How can we now predict / detect functional
associations?
• Functional genomics / high throughput
experiments
• GENOMIC CONTEXT
functionally associated proteins leave evolutionary
traces of their relation in genomes
We can thus detect
evolutionary traces of a
functional association by
comparing genomes
Genomic context is a tool to predict functional
associations between genes
• Use the genome sequences themselves (through
comparative genome analysis) for interaction
prediction: genomic context methods
• Genomic context methods have been shown to be reliable
indicators for functional interaction
• Genomic context is also known as in silico interaction
prediction, or genomic associations
[Figure: curves for Fusion, Gene Order and Co-occurrence plotted against Score (x-axis, 0 to 1); y-axis from 0 to 1]
http://string.embl.de
Three different genomic context methods in
STRING
• Gene fusion, Rosetta stone method
• Conserved gene order between divergent
genomes
• Co-occurrence of genes across genomes,
phylogenetic profiles
Contents
• Predicting functional interactions between
proteins
• Genomic context methods
– General
– Gene fusion
– Gene order
– Presence / absence of genes across
genomes
• Integration and benchmarking of predictions
• Biochemistry by other means BolA
• In addition to genomic context: functional
genomics data
Gene fusion
• i.e. the orthologs of two genes in another organism are
fused into one polypeptide
• A very reliable indicator for functional interaction, partly
because it is a relatively infrequent evolutionary event:
3470 distinct fusions when surveying 179 genomes
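The fusion criterion above can be sketched in a few lines. This is a minimal illustration, not the STRING implementation: it assumes alignment hits are already available as (gene, composite protein, start, end) tuples, and the function name `find_fusion_links`, the `max_overlap` parameter and the example data are all hypothetical. A real analysis would also have to establish orthology, which this sketch does not attempt.

```python
from collections import defaultdict
from itertools import combinations


def find_fusion_links(hits, max_overlap=0):
    """Return (gene_1, gene_2, composite) triples suggested by a fusion.

    hits: iterable of (gene, composite, start, end), meaning that `gene`
    aligns to residues start..end (1-based, inclusive) of `composite`,
    a longer protein from another genome.
    """
    regions_by_composite = defaultdict(list)
    for gene, composite, start, end in hits:
        regions_by_composite[composite].append((gene, start, end))

    links = set()
    for composite, regions in regions_by_composite.items():
        for (g1, s1, e1), (g2, s2, e2) in combinations(regions, 2):
            if g1 == g2:
                continue
            # Residues shared by the two aligned regions
            # (zero or negative when they do not overlap).
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                links.add(tuple(sorted((g1, g2))) + (composite,))
    return links


# Hypothetical alignment hits, for illustration only.
example_hits = [
    ("geneA", "fusedAB", 1, 200),
    ("geneB", "fusedAB", 210, 450),
    ("geneC", "fusedAB", 150, 300),
]
fusion_links = find_fusion_links(example_hits)
print(fusion_links)
```

Here only geneA and geneB are linked, because they cover separate parts of the composite protein; geneC overlaps both and is therefore not treated as a distinct fusion partner.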
Gene fusion: an example
Contents
• Predicting functional interactions between
proteins
• Genomic context methods
– General
– Fusion
– Gene order
– Presence / absence of genes across
genomes
• Integration and benchmarking of predictions
• Biochemistry by other means BolA
• In addition to genomic context: functional
genomics data
Gene order evolves rapidly
But …
Differential retention
of divergent /
convergent gene
pairs suggests that
conservation implies
a functional
association
“Operons”
Comparison to pathways: conservation implies a functional
association
[Figure: average metabolic distance (scale 0 to 6) and number of COGs (log scale, 1 to 10000) plotted against co-occurrences in operons (0 to 30)]
Conserved gene order
• i.e. genes that are present over ‘sufficiently large’
evolutionary distances in the same gene cluster
• Contributes by far the most predictions
Conserved gene order
NB1 predicting operons is not trivial; in fact
conserved gene order or functional
association is a major clue
NB2 using ‘only’ operons without requiring
conservation results in much less reliable
function prediction
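The idea of requiring conservation, rather than adjacency in a single genome, can be sketched as follows. This is a simplified illustration under stated assumptions: each genome is given as an ordered list of (orthologous group, strand) pairs, "neighbors" means directly adjacent genes on the same strand, and conservation is approximated by counting genomes rather than by measuring evolutionary distance. The function name `conserved_neighbors` and the example genomes are hypothetical.

```python
from collections import Counter


def conserved_neighbors(genomes, min_genomes=2):
    """Count in how many genomes two orthologous groups are adjacent.

    genomes: dict mapping a genome name to a list of
    (orthologous_group, strand) tuples in chromosomal order.
    Returns {(group_1, group_2): number_of_genomes} for pairs that are
    same-strand neighbors in at least `min_genomes` genomes.
    """
    counts = Counter()
    for gene_order in genomes.values():
        pairs_in_genome = set()
        for (g1, s1), (g2, s2) in zip(gene_order, gene_order[1:]):
            if g1 != g2 and s1 == s2:
                pairs_in_genome.add(frozenset((g1, g2)))
        # Each genome contributes at most once per pair.
        counts.update(pairs_in_genome)
    return {
        tuple(sorted(pair)): n
        for pair, n in counts.items()
        if n >= min_genomes
    }


# Hypothetical gene orders, for illustration only.
example_genomes = {
    "genome_1": [("COG_A", "+"), ("COG_B", "+"), ("COG_C", "-")],
    "genome_2": [("COG_C", "+"), ("COG_A", "-"), ("COG_B", "-")],
    "genome_3": [("COG_B", "+"), ("COG_C", "+"), ("COG_A", "+")],
}
conserved = conserved_neighbors(example_genomes)
print(conserved)
```

Only the COG_A / COG_B pair is reported, since it is the only pair found side by side on the same strand in more than one genome. Raising `min_genomes` is a crude stand-in for the requirement of "sufficiently large" evolutionary distances.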
Conserved gene order: an example from metabolism
of propionyl-CoA
“target”
“query”
Conserved gene order: an example from metabolism
of propionyl-CoA
Biochemical assays confirm the function of members of
COG0346 as a DL-methylmalonyl-CoA racemase
Contents
• Predicting functional interactions between
proteins
• Genomic context methods
– General
– Gene fusion
– Gene order
– Presence / absence of genes across
genomes
• Integration and benchmarking of predictions
• Biochemistry by other means BolA
• In addition to genomic context: functional
genomics data
Presence / absence of genes
Gene content  co-evolution. (The easy case, few genomes. )
Differences between gene content reflect differences in
phenotypic potentialities
Genomes share genes for phenotypes they have in common
Presence / absence of genes
L. innocua (non-pathogen) L. monocytogenes (pathogen)
Presence / absence of genes
Genes involved in pathogenicity
L. innocua (non-pathogenic)
L. monocytogenes (pathogenic)
Generalization: phylogenetic profiles / co-occurrence
[Figure: presence/absence matrix with rows Gene 1, Gene 2, Gene 3, ... and columns species 1 to species 5, ...; each entry is 1 (gene present) or 0 (gene absent)]
Co-occurrence of genes across genomes
• i.e. two genes have the same presence / absence pattern
over multiple genomes: they have ‘coevolved’
• AKA phylogenetic profiles
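A phylogenetic profile is simply a vector of presence (1) and absence (0) values, one per genome, and two genes co-occur when their vectors match. The sketch below is illustrative only: the gene contents are invented, the helper names `profile` and `hamming` are hypothetical, and the Hamming distance is one simple choice of comparison rather than the scoring scheme used by STRING.

```python
def profile(gene, gene_content):
    """Presence/absence vector of `gene` over all genomes, in dict order."""
    return tuple(1 if gene in genes else 0 for genes in gene_content.values())


def hamming(p, q):
    """Number of genomes in which the two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))


# Hypothetical gene content per species, for illustration only.
example_content = {
    "species_1": {"gene_1", "gene_2"},
    "species_2": {"gene_1", "gene_2", "gene_3"},
    "species_3": {"gene_3"},
    "species_4": {"gene_1", "gene_2"},
    "species_5": set(),
}

p1 = profile("gene_1", example_content)
p2 = profile("gene_2", example_content)
p3 = profile("gene_3", example_content)

print(p1, p2, hamming(p1, p2))  # identical profiles: distance 0
print(p1, p3, hamming(p1, p3))  # differing profiles: larger distance
```

In this toy data gene_1 and gene_2 have identical profiles and would be predicted to be functionally associated, while gene_3 follows a different pattern.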
Predicting function of a disease gene protein with
unknown function, frataxin, using co-occurrence of
genes across genomes
• Friedreich’s ataxia
• No (homolog with) known function
Frataxin has co-evolved with hscA and hscB
indicating that it plays a role in iron-sulfur cluster
assembly
[Figure: presence/absence of genes across 28 genomes.
Genomes: H.sapiens, D.melan., C.elegans, S.cerevisiae, C.albicans, S.pombe, A.thaliana, M.jannaschii, A.pernix, E.coli, P.multocida, H.influenzae, V.cholerae, Buchnera, P.aeruginosa, X.fastidiosa, N.meningitidis, M.loti, C.crescentus, R.prowazekii, C.jejuni, H.pylori, D.radiodurans, M.tuberculosis, M.genitalium, B.subtilis, Synechocystis, A.aeolicus.
Gene labels: cyaY Yfh1; hscB Jac1; hscA; ssq1; iscS Nfs1; iscU Isu1-2; iscA Isa1-2; fdx Yah1; RnaM; IscR; Hyp; Atm1; Nfu1; Arh1]
Iron-Sulfur (2Fe-2S) cluster in the Rieske protein
Prediction:
Confirmation:
The opposite of co-occurrence:
anti-correlation / complementary patterns: predicting
analogous enzymes
Genes with complementary phylogenetic profiles tend to have
a similar biochemical function.
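Complementary profiles can be detected with the same presence/absence vectors used for co-occurrence, by asking in how many genomes exactly one of the two genes is present. The sketch below is a minimal illustration; the function name `complementarity` and the example profiles are hypothetical and not taken from the thiamin study.

```python
def complementarity(p, q):
    """Fraction of genomes in which exactly one of the two genes is present."""
    if len(p) != len(q) or not p:
        raise ValueError("profiles must be non-empty and of equal length")
    return sum(a != b for a, b in zip(p, q)) / len(p)


# Hypothetical profiles over six genomes, for illustration only.
gene_a = (1, 0, 1, 0, 1, 0)
gene_b = (0, 1, 0, 1, 0, 1)  # present exactly where gene_a is absent
gene_c = (1, 1, 1, 0, 1, 0)  # mostly co-occurs with gene_a

print(complementarity(gene_a, gene_b))
print(complementarity(gene_a, gene_c))
```

A value close to 1 marks a candidate pair of analogous enzymes, whereas a value close to 0 is the ordinary co-occurrence signal. On its own this is only a hint, because unrelated genes with patchy distributions can also look complementary by chance.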
[Figure: complementary presence/absence patterns of genes A and B]
Complementary patterns in thiamin biosynthesis
predict analogous enzymes
Morett E, Korbel JO, Rajan E, Saab-Rincon
G, Olvera L, Olvera M, Schmidt S, Snel B,
Prediction of analogous enzymes is confirmed
Contents
• Predicting functional interactions between
proteins
• Genomic context methods
– General
– Gene fusion
– Gene order
– Presence / absence of genes across
genomes
• Integration and benchmarking of predictions
• Biochemistry by other means BolA
• In addition to genomic context: functional
genomics data
Benchmark and integration: KEGG maps
Integrating genomic context scores into one
single score
• Compare each individual method against an independent benchmark
(KEGG), and find “equivalency”
• Multiply the probabilities, one per method, that two proteins are not interacting,
and subtract the product from 1 (naive Bayesian, i.e. assuming independence)
$$ S = 1 - \prod_i (1 - S_i) $$
[Figure: curves for Fusion, Gene Order and Co-occurrence plotted against Score (x-axis, 0 to 1); y-axis from 0 to 1]
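The combination rule can be written directly in code. This is a minimal sketch that assumes each per-method score has already been calibrated against the benchmark so that it can be read as a probability between 0 and 1; the function name `combine_scores` and the example values are hypothetical.

```python
from math import prod


def combine_scores(scores):
    """Combine per-method scores as 1 minus the product of (1 - score).

    Assumes the scores are independent probabilities in [0, 1].
    """
    scores = list(scores)
    if any(s < 0.0 or s > 1.0 for s in scores):
        raise ValueError("scores must lie between 0 and 1")
    return 1.0 - prod(1.0 - s for s in scores)


# Hypothetical calibrated scores for one protein pair, e.g. from
# gene order and co-occurrence.
combined = combine_scores([0.6, 0.5])
print(round(combined, 3))
```

With scores of 0.6 and 0.5 the chance that neither method is right is 0.4 × 0.5 = 0.2, so the combined score is 0.8. The combined score is never lower than the best single score, which is reasonable only as long as the independence assumption holds.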
Benchmark
[Figure: log-scale y-axis (10 to 100000) plotted against Accuracy (fraction of confirmed predictions, i.e. same KEGG map; 0.5 to 1.0) for Integrated, Gene Order (norm.), Gene Order (abs.), Cooccurrence, Fusion (norm.) and Fusion (abs.)]
Performance of genomic context compared to
high-throughput interaction data
[Figure: Coverage (fraction of reference set covered by data) versus Accuracy (fraction of data confirmed by reference set) for purified complexes (HMS-PCI), purified complexes (TAP), genomic context, mRNA co-expression, synthetic lethality and yeast two-hybrid; markers for raw data, filtered data and parameter choices; combined evidence shown for two methods and three methods]
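The two axes of this comparison follow directly from their definitions: accuracy is the fraction of the data confirmed by the reference set, and coverage is the fraction of the reference set covered by the data. The sketch below treats both as sets of unordered protein pairs. The function name `accuracy_and_coverage` and the example pairs are hypothetical, and counting every unconfirmed prediction as wrong is a simplifying assumption, since a reference set such as KEGG is itself incomplete.

```python
def accuracy_and_coverage(predicted, reference):
    """Return (accuracy, coverage) for predicted pairs against a reference set.

    Both arguments are iterables of 2-tuples; pair order is ignored.
    """
    predicted = {frozenset(pair) for pair in predicted}
    reference = {frozenset(pair) for pair in reference}
    confirmed = predicted & reference
    accuracy = len(confirmed) / len(predicted) if predicted else 0.0
    coverage = len(confirmed) / len(reference) if reference else 0.0
    return accuracy, coverage


# Hypothetical predictions and reference pairs, for illustration only.
example_predicted = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
example_reference = [("b", "a"), ("c", "d"), ("e", "f"), ("f", "g"), ("g", "h")]

accuracy, coverage = accuracy_and_coverage(example_predicted, example_reference)
print(accuracy, coverage)
```

Two of the four predictions are in the reference set (accuracy 0.5) and they recover two of its five pairs (coverage 0.4). Filtering the data or combining methods typically trades one quantity against the other.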
Contents
• Predicting functional interactions between
proteins
• Genomic context methods
– General
– Gene fusion
– Gene order
– Presence / absence of genes across
genomes
• Integration and benchmarking of predictions
• Biochemistry by other means BolA
• In addition to genomic context: functional
genomics data
Data-mining proteins for protein function prediction: BolA
An interaction of BolA with a mono-thiol glutaredoxin (STRING)
BolA and Grx occur as neighbors in a number of genomes
[Figure: BolA and Grx as neighboring genes]
BolA and Grx have an (almost) identical phylogenetic distribution
BolA and Grx have been shown to interact in Y2H in S.cerevisiae
and D.melanogaster, and in Flag tag in S.cerevisiae
BolA phylogeny
Cell division / Cell wall
(oxidative) stress
BolA does have (predicted) interactions with cell-division / cell-wall proteins.
Those appear secondary to the link with Grx.
STRING has thus obtained a higher resolution in function prediction than
phenotypic analyses
BolA is homologous to the peroxide reductase OsmC,
suggesting a similar function
OsmC uses thiol groups of two evolutionarily conserved
cysteines to reduce substrates
Problem: The BolA family does not have conserved
cysteines.
…It would have to obtain its reducing equivalents from
elsewhere…
BolA family alignment
Prediction of interaction partner and molecular function complement each other
BolA interacts with Grx
BolA is (homologous to) a reductase
Grx provides BolA with reducing equivalents!? (or “scaffolding”?)
Genomic context: biochemistry by other means
Despite the high performance of genomic context
methods, using them for function prediction is not a
push-button method.
It is more like biochemistry by other means.
Often quite a lot of manual input and expert
knowledge from the researcher is needed to distill
associations into a concrete function prediction.
Small-scale bioinformatics?
Contents
• Predicting functional interactions between
proteins
• Genomic context methods
– General
– Fusion
– Gene order
– Co-occurrence across genomes
• Integration and benchmarking of predictions
• Interaction networks
• In addition to genomic context: functional
genomics data
In addition, STRING currently includes:
• Functional association data from large-scale / high-throughput
biochemical experiments (functional genomics data)
• protein complex purification
• yeast-2-hybrid
• ChIP-on-chip
• micro-array gene expression
• “known” functional relations, so called “legacy data”,
as present in PubMed abstracts and databases like
MIPS or KEGG.
• Handing out article and questions :
“Identification of a bacterial regulatory
system for ribonucleotide reductases by
phylogenetic profiling.” Read the article
and hand in the answers to the questions
by Monday November 28th.