Download ppt file

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Histone acetylation and deacetylation wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Endomembrane system wikipedia , lookup

Ubiquitin wikipedia , lookup

Thylakoid wikipedia , lookup

LSm wikipedia , lookup

Silencer (genetics) wikipedia , lookup

Proteasome wikipedia , lookup

Gene regulatory network wikipedia , lookup

Ancestral sequence reconstruction wikipedia , lookup

Protein (nutrient) wikipedia , lookup

G protein–coupled receptor wikipedia , lookup

Gene expression wikipedia , lookup

Circular dichroism wikipedia , lookup

Magnesium transporter wikipedia , lookup

Protein domain wikipedia , lookup

Protein folding wikipedia , lookup

SR protein wikipedia , lookup

List of types of proteins wikipedia , lookup

Protein structure prediction wikipedia , lookup

Expression vector wikipedia , lookup

Protein wikipedia , lookup

QPNC-PAGE wikipedia , lookup

Cyclol wikipedia , lookup

Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup

Protein moonlighting wikipedia , lookup

Intrinsically disordered proteins wikipedia , lookup

Protein mass spectrometry wikipedia , lookup

Western blot wikipedia , lookup

Protein adsorption wikipedia , lookup

Protein–protein interaction wikipedia , lookup

Transcript
Predicting protein functions from
redundancies in large-scale
protein interaction networks
Speaker: Chun-hui CAI
2006-08-30
Objective
 To derive functions of unannotated proteins
from large-scale high-throughput proteinprotein interaction data
– Importance of the work: Overcome the
widespread presence of random false positives.
Key Issue
 If two proteins have significant large numbers of common
interaction partners in the measured data set than what is
expected from a random network, it would suggest close
functional links between them.
To validate this hypothesis:
– Rank all possible protein pairs in the order of their probabilities for
having the experimentally measured number of common
interaction partners by using the probability expression (derived
below).
– If the computed probability is extremely small, it signifies that the
chosen protein pair has an unusually large number of common
partners. Such pairs are considered for further analysis.
Mathematical Expression for
Probability – 1
 Divide the whole set of
partners of the two proteins
into three non-overlapping
groups.
– m common protein partners
that interact with both protein 1
and 2;
– (n1 – m) partners that interact
only with protein 1;
– (n2 – m) partners that interact
only with protein 2.
Protein 1 Protein 2
(n1 – m)
partners
m common
partners
(n2 – m)
partners
Mathematical Expression for
Probability – 2
 Count the total number of distinct ways of assigning
these three groups to N proteins.
 N  N  m  N  n1 
 


m
n
1m
n
2 m
 


 Therefore, the P value is
 N  N  m  N  n1 
 


m
n

m
n

m


P( N , n , n , m)   
 N  N 
  
 n  n 
1
1
2
2
1

2
( N  n1)!( N  n 2)!n1!n 2 !
N !m !(n1  m)!(n 2  m)!( N  n1  n 2  m)!
For this calculation, the results remain
approximately the same, whether we
compute the probabilities for pairs with
exactly m common partners or we
compute for m or more partners. It can
be checked from the expression of
probability in the equation that
probability terms for increasing m fall
inversely with N. Since N for this case
is around 5,000, the additional terms in
the probability expression are negligible.
Mathematical Expression for
Probability – 3
 Clustering Techniques
– Compute P values for all
possible protein pairs and store
them in a matrix;
– Pick the protein pair with the
smallest P value and choose it as
the first group in the cluster;
– The rows and columns for these
tow proteins are merged into
one row and one column;
– Probabilities for this new group
are geometric means of the
individual probabilities (or
arithmetic means of the log(P)
values).
– The process is continued
repeatedly, thus adding more
and more clusters or making the
existing ones bigger, until a
chosen threshold is reached.
Results – 1
 The experimental data from yeast is collected
in the DIP (Database of Interacting Proteins).
The September 1, 2002, update of the DIP
data set containing 14,871 interactions for
4,692 proteins is used.
 By comparing the probabilities of
associations for all possible protein pairs in
the measured protein interaction network
with those in the other two randomly
constructed networks, the author observed
from the plot that the probabilities of some of
the associations in the measured network are
up to 40 orders of magnitude lower than the
other two random networks.
 The author concluded that those associations
are not artifacts caused by experimental noise,
but contain biologically meaningful
information and it is also clear that such low
probability associations did not arise from
scale-free nature of the network.
Results – 2
 The author inspect all pairs with
probabilities below a cutoff value
of 10-8. The group consists of
2,833 protein pairs involving 852
proteins.
 A strong functional link is
observed among proteins in these
pairs which is illustrated in the
Table. 10 pairs with lowest
probabilities are listed in the table.
Both proteins are either belong to
the same complex or are parts of
the same functional pathway. The
same trend is generally true for
the larger data set. By manually
inspecting the top 100 pairs, the
author found that in >95% of
them both proteins have similar
functions.
Results – 3
 To predict the functions of the
unannotated proteins.
– About 29% of the 2,833 chosen pairs
contain at least one unannotated
proteins;
– To assign a function to any one of
them, the author determine the other
proteins with which it forms
associations, e.g,
• The unannotated protein YKL059C
shares partners with many proteins
involved in transcription. Moreover
from the low probabilities of
associations with CFT2 and CFT1,
YKL059C is involved in pre-mRNA3’
end processing.
– The notion is further confirmed by the
clustering method presented below.
Results – 4
 The author derived 202 modules from the associations and
then compare the annotations of constituent proteins.
 A total of 163 of the derived modules have all proteins
annotated in the SGD, and 149 of them (92%) to have all
members of the module from the same functional complex
or pathway.
 If a unannotated protein belongs to the same module with
other proteins of known functions, its function is predicted
to be the same as the other ones with high confidence.
 By analyzing the derived modules, the author predict
functions for 81 unannotated proteins.
Discussion – 1.1 (threshold)
 The chosen cut-off value (10-8)
is not a sharp threshold. The
module’s numbers and sizes
increase gradually with
increasing cut-off.
– As a example, for the wellstudied mediator complex
shown in the Fig.3a, as we
increase the cut-off value, more
proteins known to be part of the
complex come together. Even
with a cut-off as high as 2 x 10-4,
the proteins included in the
mediator module are genuinely
related to the complex.
Discussion – 1.2 (threshold)
 Among the additional
modules derived with
higher threshold, there are
two that contain mostly
unannotated proteins and
therefore are possibly large
complexes not yet well
studied by experimentalists.
– One of them is suspected
to be involved in actin
cytoskeleton organization
and protein vacuolar
targeting and the other
one in splicing, rRNA
processing, and small
nucleolar RNA
processing.
Discussion – 2 (Method)
 The method here has several advantages.
– It is not sensitive to random false positive
• The author added connections randomly without changing the powerlaw nature of the network. Even after increasing the average number
of interactions by 50%, the author was able to recover 89% of the top
2,833 associations.
– The method is not biased by the number of partners a protein has.
• As a example, JSN1, a nuclear pore protein, has the largest number of
interactions in the measured data set, but none of the 2,833
associations derived by our method contains JSN1.
 Among the drawbacks, the method may not uncover all of
the functions for the proteins, including some multidomain proteins conducting many different functions in the
cell.
Discussions – 3
 The author also checked hot
many of the associations
derived are also ancient
paralogs among the list of
top 2,833 pairs. 22 such
ancient paralogs are among
the list of 2,833 pairs (0.7%).
Therefore, these are the
ancient paralogs that
maintained their functions
over time.
 Comprehensive Analysis of Combinatorial
Regulation using the Transcriptional
Regulatory Network of Yeast. (2006)
– S. Balaji and M. Madan Badu
Introductions