Download ppt - Chair of Computational Biology

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Gene expression programming wikipedia , lookup

Point mutation wikipedia , lookup

Epigenetics of neurodegenerative diseases wikipedia , lookup

Gene wikipedia , lookup

Metabolic network modelling wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Minimal genome wikipedia , lookup

Vectors in gene therapy wikipedia , lookup

Epigenetics of human development wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Gene expression profiling wikipedia , lookup

Polycomb Group Proteins and Cancer wikipedia , lookup

Protein moonlighting wikipedia , lookup

NEDD9 wikipedia , lookup

Transcript
V5: Functional inference from interaction networks
5. Lecture SS 20005
Cell Simulations
1
Aim: associated protein function at the proteomic scale
Even the best-studied model organisms contain a large number of proteins whose
functions are currently unknown. E.g. about one-third of the proteins in
Saccharomyces cerevisiae remain uncharacterized.
Traditionally, computational methods to assign protein function have relied largely
on sequence homology. The recent emergence of high-throughput experimental
datasets have led to a number of alternative, non-homology based methods for
functional annotation. These methods have generally exploited the concept of
guilt by association, where proteins are functionally linked through either
experimental or computational means.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
2
Associating protein function from proteomic data
Large-scale experiments have linked proteins that
- physically interact (Ito et al., 2001; Uetz et al., 2000; Gavin et al., 2002; Ho et
al., 2002; Rain et al., 2001; Giot et al., 2003; Li et al., 2004),
- are synthetic lethals (Tong et al., 2001, 2004),
- are coexpressed (Edgar et al., 2002) or
- are coregulated (Lee et al., 2002; Harbison et al., 2004).
In addition, computational techniques linking pairs of proteins include
- phylogenetic profiles (Gaasterland and Ragan, 1998; Pellegrini et al., 1999),
- gene clusters (Overbeek et al., 1999),
- conserved gene neighbors (Dandekar et al., 1998) and
- gene fusion analysis (Enright et al., 1999; Marcotte et al., 1999a).
Perhaps, though not surprisingly, integrating the information from several sources
provides the best method for linking proteins functionally.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
3
Features of integrated networks
Several groups have attempted to partition interaction networks into functional
modules that correspond to sets of proteins that are part of the same cellular
function or take part in the same protein complex.
These functional modules, or clusters, are useful for annotating uncharacterized
proteins, as the most common functional annotation within a cluster can be
transferred to uncharacterized proteins.
Proteins in experimentally and computationally determined interaction graphs
have been grouped together based on
- shared interactions,
- the similarity between shortest path vectors to all other proteins in the network
- and shared membership within highly connected components or cliques.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
4
Features of integrated networks
The strategy described here is more closely related
to recent attempts to classify proteins according to
the functional annotations of their network
neighbors; these methods do not explicitly cluster
proteins.
Schwikowski et al. (2000) use physical interaction
data for baker’s yeast, and predict the biological
process for each protein by considering its
neighboring interactions and taking the three most
frequent annotations.
Although such a simple majority vote approach, named Majority, has clear predictive value, it takes
only limited advantage of the underlying graph structure of the network. For example, in the interaction
network given in Figure 1, ‘Majority’ would assign functions to proteins d and f , but not to protein e, even
though our intuition might indicate that protein e has the same function as proteins d and f ; there are
several examples in the yeast proteome similar to this one (Schwikowski et al., 2000). Naturally, one
wishes to generalize this principle to consider functional linkages beyond the immediate neighbors in the
interaction graph, both to provide a systematic framework for analyzing the entirety of physical interaction
data for a given proteome and to make predictions for proteins with no annotated interaction partners.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
5
Features of integrated networks
Hishigaki et al. (2001) extended
Majority algorithm by predicting
a protein’s function by looking at all
proteins within a particular radius
and finding over-represented
functional annotations.
However, this approach, named Neighborhood, does not consider any aspect of
network topology within the local neighborhood.
E.g., Figure 2 shows two interaction networks that are treated equivalently when
considering a radius of 2 and annotating protein a; however, in the first case, there
is a single link that connects protein a to the annotated proteins, and in the
second case, there are several independent paths between a and the annotated
proteins, and moreover, two of these proteins are directly adjacent to a.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
6
Method: physical interaction network
Construct the protein–protein physical interaction network using the protein
interaction dataset compiled by GRID (Breitkreutz et al., 2003).
The resulting network is a simple undirected graph G = (V , E), where there is a
vertex or node v  V for each protein, and an edge between nodes u and
v if the corresponding proteins are known to interact physically (as determined by
one or more experiments).
Initially, we consider a graph with unit-weighted edges, and then consider
weighting the edges by our ‘confidence’ in the edge.
The weight of the edge between u and v is denoted by wu,v.
For all reported results, consider only the proteins making up the largest
connected component of the physical interaction map (4495 proteins and 12 531
physical interaction links).
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
7
Method: functional annotations
Several controlled vocabulary systems exist for describing biological function,
including Munich Information Center for Protein Sequences (MIPS) and the Gene
Ontology (GO) project.
We use the MIPS functional hierarchy, and consider the 72 MIPS biological
processes that comprise the second level of hierarchy.
Of the 4495 proteins in the largest connected component of the yeast physical
interaction map, 2946 have MIPS biological process annotations. We also
experimented with GO annotations; the overall conclusions made in this paper are
not affected.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
8
Method: weighted functional linkages
The reliabilities of different data sources vary, even if they are based on the same
underlying technology.
In the context of network-based algorithms, it is possible to weight edges so as to model the
reliability of each interaction. For physical interactions, this reliability is in turn based on the
experimental sources that contribute to our knowledge
about the existence of the interaction.
To determine these values, separate all experimental sources of physical interaction data into
several groups, placing each high-throughput dataset into a separate group, and allocating
one group for the family of all specific experiments.
For each group of experiments, compute what fraction of its interactions connect proteins with
a known shared function. Assume that the reliabilities of different sources are independent,
and thus conclude by estimating the reliability of an interaction to be the noise (or unreliability)
of the underlying data sources.
If ri is the reliability of experimental group i, compute the reliability of the edge by 1 - i (1-ri ),
where the product runs over all experiments i where this interaction is found. This treats each
ri as a probability and assumes independence.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
9
Algorithms: Majority
as in Schwikowsky et al. (2000): consider all neighboring proteins and sum up the
number of times each annotation occurs for each protein.
„The annotated functions of all neighbors of P are ordered in a list, from
the most frequent to the least frequent. Functions that occur the same number
of times are ordered arbitrarily. Everything after the third entry in the list
is discarded, and the remaining three or fewer functions are declared as
predictions for the function of P.“
In the case of weighted interaction graphs take a weighted sum.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
10
Algorithms: Neighborhood algorithm
Adapted from Hishigaki et al. (2001) Protein interaction map (Figure B): each node
represents a protein and each edge represents the interaction between two proteins.
Predict function of each protein in the map (black circle in Figure C), based on the
functions of ‘n-neighbouring proteins’, which are defined as a set of proteins reached
via n physical interactions at most (n is an integer parameter).
E.g. all proteins enclosed by the inside dashed circle are ‘1-neighbouring proteins’,
and those enclosed by the outside circle are ‘2-neighbouring proteins’. The protein of
interest is assigned the function with the highest 2 value among functions of all nneighbouring proteins. For each member of the function category, the 2 value is
calculated using the following formula:
where i denotes a protein function, e.g. ‘Golgi’, ‘DNA repair’ and ‘transcription factor’,
ei denotes an expectation number of i in n-neighbouring proteins expected from the
distribution on the total map, and ni denotes an observed number of i in n-neighbouring
proteins.
Then, the function of a query protein is predicted to be the function i with the maximum
2 value. When there are multiple functions with the largest 2 value, both functions
are assigned. The optimal n value is determined by a so-called self-consistency test,
where the predicted functions of all proteins in the map are compared with their
annotated functions for each n.
Nabieva et al., Bioinformatics 21, i1 (2005)
Here: consider neighborhoods of radius 1, 2 and 3.
This method does not extend naturally to the case of weighted interaction graphs.
5. Lecture SS 20005
Cell Simulations
11
Algorithms: GenMultiCut
It was suggested that functional annotations on interaction networks should be
made in order to minimize the number of times different annotations are
associated with neighboring proteins.
Some authors used simulated annealing in an attempt to minimize this objective
function and aggregate results from multiple runs, whereas others used a
deterministic approximation, and consider the case where edges are weighted
using gene expression information.
The formulation in these two studies is similar to the minimum multiway k-cut
problem. In multiway k-cut, the task is to partition a graph in such a way that each
of k terminal nodes belongs to a different subset of the partition and so that the
(weighted) number of edges that are ‘cut’ in the process is minimized. In the more
general version of the multiway k-cut problem considered here, the goal is to
assign a unique function to all the unannotated nodes so as to minimize the sum
of the costs of the edges joining nodes with no function in common.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
12
Algorithms: GenMultiCut
Although minimum multiway k-cut is NP-hard (Dahlhaus et al., 1994), it was
found that the particular instances of minimum multiway cut arising here can, in
practice, be solved exactly when stated as an ILP.
Introduce a node variable xu,a for each protein u and function a.
Set xu,a = 1 if protein u is predicted to have function a.
If a protein u has known functional annotations, variable xu,a is fixed as 1 for its
known annotations a and as 0 for all other annotations.
We also introduce an edge variable xu,v,a for each function a and each pair of
adjacent proteins u and v. This variable is set to 1 if both proteins u and v are
annotated with function a.
Minimizing the weighted number of neighboring proteins with different annotations
is the same as maximizing the number with the same annotation, and so we have
the following ILP:
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
13
Algorithms: GenMultiCut
annot(u) : set of known annotations for protein u,
FUNC = Uu annot(u) : set of all functional annotations.
The first constraint specifies that exactly one functional annotation is made for any
protein.
The second and third constraints ensure that if protein u is annotated with function
a, xu,a is set as a constant to 1, and if protein u is annotated but not with function
a, xu,a is set as a constant to 0.
The third and fourth constraints ensure that a particular function is picked for an
edge only if it is also chosen for the corresponding proteins.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
14
considering multiple GenMultiCut optimal solutions
An important consideration in this framework is the existence of multiple optimal
solutions. E.g. the network in Figure 3 has seven minimum cuts of value 1, and
while the GenMultiCut criterion does not favor any one cut over the
other, if we find all optimal cuts for this graph, we observe that x2 is in fact
annotated with F1 more often than with F2 in the assignments made by these cuts.
Thus, a sense of distance to annotated nodes is in fact present in the set of all
optimal solutions.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
15
Algorithms: GenMultiCut
attempt to sample from the space of optimal solutions:
(1) add constraints to the ILP which require that each consecutive solution is
different from any previous solution in the value it assigns to at least 5% of the
node variables xu,a.
(2) introduce uniform self-weights wu,a for each protein u and function a.
These self-weights are then perturbed randomly by adding a very small offset to
each. Then maximize
a number of times
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
16
Algorithms: FunctionalFlow
The functional flow algorithm generalizes the principle of ‘guilt by association’ to groups of
proteins that may or may not interact with each other physically.
Treating each protein of known functional annotation as a ‘source’ of ‘functional flow’ for that
function. After simulating the spread over time of this functional flow through the
neighborhoods surrounding the sources, the ‘functional score’ is obtained for each protein in
the neighborhood; this score corresponds to the amount of ‘flow’ that the protein has
received for that function, over the course of the simulation.
The functional flow-based model allows us to incorporate a distance effect, i.e. the effect of
each annotated protein on any other protein depends on the distance separating these two
proteins.
Running this process for each biological function in turn, we obtain, for each protein, the
score for each function (the score may be 0 if the ‘flow’ for a function did not reach that
protein during the simulation). Thereupon, for any protein, we take the functions for which
the highest score was obtained as its predicted functions.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
17
Algorithms: FunctionalFlow
More specifically, for each function in turn, we simulate the spread of functional flow by an
iterative algorithm using discrete time steps. We associate with each node (protein) a
‘reservoir’ which represents the amount of flow that the node can pass on to its neighbors at
the next iteration, and with each edge, a capacity constraint that dictates the amount of flow
that can pass through the edge during one iteration.
The capacity of an edge is taken to be its weight. Each iteration of the algorithm updates
the reservoirs using simple local rules: a node pushes the flow residing in its reservoir to its
neighbors proportionally to the capacities of the respective edges and subject to further
constraints that the amount of flow pushed through an edge during an iteration does not
exceed the capacity of the edge, and that flow only spreads ‘downhill’ (i.e. from proteins with
more filled reservoirs to nodes with less filled reservoirs).
Finally, at each iteration, an ‘infinite’ amount of flow is pumped into the source protein nodes;
thus, the sources always have enough flow in their reservoir to fill the capacity of their
outgoing edges.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
18
Algorithms: FunctionalFlow
The functional score is the amount of flow that has entered a protein’s reservoir in
the course of all iterations. Since the flow is ‘pumped’ into the sources at each
step, the amount of flow a node receives from each source is greater for nodes
that are closer to that source than for nodes that are farther away from it.
Thus, a source’s immediate neighbor in the graph receives d iterationsworth of
flowfrom the source, whereas a node that is two links away from the source
receives d - 1 iterations worth of flow.
Similarly, the number of iterations for which the algorithm is run determines the
maximum shortest-path distance that can separate a recipient node from a source
in order for the flowto propagate from the source to the recipient. In the context of
protein interaction, a relatively small number of iterations is sufficient. We choose
d = 6, which is half the diameter of the yeast physical interaction network.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
19
Algorithms: FunctionalFlow
More formally, for each protein u in the interaction network, we define a variable
Rta(u) that corresponds to the amount in the reservoir for function a that node u
has at time t .
For each edge (u, v) in the interaction network, we define variables gta(u, v) and
gta (v, u) that represent the flow of function a at time t from protein u to protein v,
and from protein v to protein u. We will run the algorithm for d time steps or
iterations. At time 0, we only have reservoirs of function a at annotated
nodes:
At each subsequent time step, we recompute the reservoir of each protein by
considering the amount of flow that has entered the node and the amount that has
left:
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
20
Algorithms: FunctionalFlow
Initially, at time 0, there is no flow on the edges, and ga0 (u, v) = 0.
At each subsequent time step, we have the flow proceeding downhill and
satisfying the capacity constraints:
Finally, the functional score for node u and function a over d iterations is
calculated as the total amount of flow that has entered the node:
Why don‘t we simply take the current value of the reservoir?
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
21
Comparison of the four methods
Comparison of four basic methods on the
unweighted physical interaction map
We compare the performance of Majority,
Neighborhood, GenMultiCut and FunctionalFlow
on the unweighted yeast physical interaction
map, using a 2-fold cross-validation. Figure 4
plots as a function of FP the number of TPs each
method predicts (i.e. these graphs are obtained
by varying the scoring threshold for each of the
methods). The Functional-Flow algorithm
identifies more TPs over the entire range of FPs
than either GenMultiCut or Neighborhood
using radius 1, 2 or 3. FunctionalFlow
performs better than Majority when proteins
are not directly interacting with at least three
proteins of the same function; this is evident from
Figure 4 since the score for Majority counts up
the the most frequent neighboring annotation
(e.g. the rightmost point for Majority corresponds
to proteins whose highest functional scores are
one). Thus, FunctionalFlow is the method of
choice when considering proteins that do not
interact with many annotated proteins. Even in
well-characterized proteomes, such as baker’s
yeast, there are ca. 1200 proteins that have fewer
than three annotated neighbors.
The Neighborhood algorithm performs similarly with either
radius 1 or 2 in the high-confidence region (i.e. corresponding
to a low FP rate, given in left-most portion of the ROC curve).
However, radius 1 (i.e. considering just direct interactions) has
better overall performance than radius 2 or 3, demonstrating
that Neighborhood’s strategy of ignoring topology is not
optimal. Moreover, comparing Majority with Neighborhood
using radius 1 demonstrates that the ÷2-test is not as effective
in scoring as just summing up the number of times a particular
annotation occurs in the neighboring proteins.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
22
Reliability and data integration
Figure 5 shows that
rewarding multiple
experimental evidence is
beneficial, but that the main
advantage comes from
taking into account the
actual reliability values for
the different experiments.
To evaluate the approach for modeling physical interaction reliability as edge weights, we test the
performance of FunctionalFlow using three ways of assigning physical interaction weights.
(1) assign each edge a unit weight; this corresponds to the unweighted physical interaction map.
(2) assign each experimental source a reliability score of 0.5; this rewards interactions that are
found by more than one experiment.
(3) assign each experimental source the predictive value (estimated in crossvalidation); here,
edges obtained from multiple, more reliable experiments are given higher weights.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
23
All methods perform better on weighted map
Figure 6 shows how Majority,
GenMultiCut and FunctionalFlow
perform on the yeast physical interaction
map, where edges are weighted by
individual experimental reliability.
The baseline performance of Majority
on the unweighted physical interaction
graph is also shown. There is substantial
improvement in predictions using all
three methods when incorporating
edges weighted by reliability.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
24
Is it useful to provide more information?
We further explored whether the
network analysis algorithms would
perform well when other types of
experimental information are added.
As a proof of principle, we explore the
effect of adding genetic linkages to
the graph (synthetic lethals).
As is evident from the figure, adding
genetic interaction data significantly
improves prediction quality.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
25
Conclusions
Network analysis algorithm FunctionalFlow provides an effective means for
predicting protein function from protein interaction maps.
The algorithm utilizes indirect network interactions, network topology, network
distances and edges weighted by reliability estimated from multiple data sources.
The simplest methods, such as Majority, perform well if there are enough direct
neighbors with known function.
Simple independence assumptions were made for estimating the reliability of
interactions. Although these work reasonably well, it may be even more beneficial
to use a more sophisticated approach for weight assignment and perform more
complete data integration.
Finally, although the method was applied to baker’s yeast, FunctionalFlow is
likely to be especially useful when analyzing largely uncharacterized proteomes
where computational methods are used to infer protein interaction maps.
Nabieva et al., Bioinformatics 21, i1 (2005)
5. Lecture SS 20005
Cell Simulations
26
Alternative scheme to combine information from
various source: Bayesian network
Knowing the correct overall structures of gene networks will be invaluable for characterizing
the complex roles of individual genes and the interplay between the many systems in a cell.
Deriving gene networks from heterogeneous functional genomics data is often difficult,
because experiments such as microarray analyses of gene expression or systematic protein
interaction mapping measure different aspects of gene or protein associations. E.g.
Affinity purification of proteins analyzed by MS: measures the tendency for proteins to be
components of the same physical complex, although not necessarily to contact each other
directly.
Yeast two-hybrid assays: often indicate direct physical interactions (stable or transient)
between proteins.
Synthetic lethal screens: measure the tendency for genes to compensate for the loss of
other genes.
These analyses range considerably in accuracy, and it is not clear a priori which
measurements are correct.
In spite of these differences, these data sets can, in principle, be computationally integrated,
primarily by the reconstruction of network models of the relations between genes.
Lee, ..., Marcotte, Science 306, 1555 (2004)
5. Lecture SS 20005
Cell Simulations
27
Aim: construct complete network of gene association
Network reconstructions have largely focused on physical protein interaction and so
represent only a subset of biologically important relations.
Aim: construct a more accurate and extensive gene network by considering functional,
rather than physical, associations, realizing that each experiment, whether genetic,
biochemical, or computational, adds evidence linking pairs of genes, with associated error
rates and degree of coverage.
In this framework, gene-gene linkages are probabilistic summaries representing functional
coupling between genes. Only some of the links represent direct protein-protein interactions;
the rest are associations not mediated by physical contact, such as regulatory, genetic, or
metabolic coupling, that, nonetheless, represent functional constraints satisfied by the cell
during the course of the experiments.
Working with probabilistic functional linkages allows many diverse classes of
experiments to be integrated into a single coherent network which enables the linkages
themselves to be more reliably
Lee, ..., Marcotte, Science 306, 1555 (2004)
5. Lecture SS 20005
Cell Simulations
28
Method for integrating functional genomics data
Functional genomics data sets are first
benchmarked for their relative accuracies; these
are used as weights in a probabilistic integration of
the data. Several raw data sets already have
intrinsic scoring schemes, indicated in
parentheses (e.g., CC, correlation coefficients; P,
probabilities, and MI, mutual information scores).
These data are rescored with LLS, then integrated
into an initial network (IntNet). Additional linkages
from the genes’ network contexts (ContextNet) are
then integrated to create the final network
(FinalNet) , with È34,000 linkages between 4681
genes (ConfidentNet) scoring higher than the gold
standard (small-scale assays of protein
interactions). Hierarchical clustering of
ConfidentNet defined 627 modules of functionally
linked genes spanning 3285 genes
(‘‘ModularNet’’), approximating the set of cellular
systems in yeast.
5. Lecture SS 20005
Cell Simulations
Lee, ..., Marcotte, Science 306, 1555 (2004)
29
Scoring scheme for linkages
Unified scoring scheme for linkages is based on a Bayesian statistics approach.
Each experiment is evaluated for its ability to reconstruct known gene pathways
and systems by measuring the likelihood that pairs of genes are functionally
linked conditioned on the evidence, calculated as a log likelihood score:
P(L|E) and P(L|E) : frequencies of linkages (L) observed in the given experiment
(E) between annotated genes operating in the same pathway and in different
pathways
P(L) and P(L): the prior expectations (i.e., the total frequency of linkages
between all annotated yeast genes operating in the same pathway and operating
in different pathways).
Scores > 0 indicate that the experiment tends to link genes in the same pathway,
with higher scores indicating more confident linkages.
Lee, ..., Marcotte, Science 306, 1555 (2004)
5. Lecture SS 20005
Cell Simulations
30
Benchmarks
As scoring benchmarks, the method was tested against two primary annotation
references: the Kyoto-based KEGG pathway database and the experimentally
observed yeast protein subcellular locations determined by genome-wide green
fluorescent protein (GFP)–tagging and microscopy.
KEGG scores were used for integrating linkages, with the other benchmark
withheld as an independent test of linkage accuracy.
Cross-validated benchmarks and benchmarks based on the Gene Ontology (GO)
and KOG gene annotations provided comparable results.
Lee, ..., Marcotte, Science 306, 1555 (2004)
5. Lecture SS 20005
Cell Simulations
31
Functional inference from interaction networks
Benchmarked accuracy and extent of functional genomics data sets and the integrated networks. A critical
point is the comparable performance of the networks on distinct benchmarks, which assess the
tendencies for linked genes to share (A) KEGG pathway annotations or (B) protein subcellular locations.
Each x axis indicates the percentage of protein-encoding yeast genes provided with linkages by the
plotted data; each y axis indicates relative accuracy, measured as the of the linked genes’ annotations on
that benchmark. The gold standards of accuracy (red star) for calibrating the benchmarks are smallscale
protein-protein interaction data from DIP. Colored markers indicate experimental linkages; gray markers,
computational. The initial integrated network (lower black line), trained using only the KEGG benchmark,
has measurably higher accuracy than any individual data set on the subcellular localization benchmark;
adding context-inferred linkages in the final network (upper black line) further improves the size and
accuracy of the network.
Lee, ..., Marcotte, Science 306, 1555 (2004)
5. Lecture SS 20005
Cell Simulations
32
Features of integrated networks
Functional linkages were first inferred on the basis of genes‘ mRNA coexpression across each of 12 sets
of DNA microarray experiments (497 microarray experiments in total), then integrated via a rank-weighted
sum of log likelihood scores to create the combined set of coexpression-derived linkages. To construct the
initial integrated network (BIntNet,[ Fig. 1), we combined 8 categories of data, including the physical and
genetic interaction data sets, mRNA coexpression linkages, functional linkages from literature mining, and
computational linkages from two comparative genomics methods, Rosetta stone (gene-fusion) linkages
and phylogenetic profiles.
Integrating functional genomics data also allowed
discovery of additional relations between genes linked, in
turn, to a common set of genes [BContextNet]; these
linkages were scored and integrated as above to
construct the final gene network (BFinalNet, see figure).
The final network has 34,000 linkages at an accuracy
comparable to the gold standard small-scale interaction
assays, which provides linkages (BConfidentNet) for
more than 4681 yeast genes (81% of the yeast
proteome). The network is reasonably distinct from
networks of physical interacting proteins Ee.g., sharing
only È16% of linkages with (Jansen et al.).
The final network shows extensive clustering of genes
into modules, evident in the ‘‘clumping’’ (A).
Lee, ..., Marcotte, Science 306, 1555 (2004)
5. Lecture SS 20005
Cell Simulations
33
Features of integrated networks
At an intermediate degree of clustering that maximizes cluster size and functional coherence, 564 (of 627)
modules are shown connected by the 950 strongest intermodule linkages.
Module colors and shapes indicate associated functions, as defined by Munich Information Center for
Protein Sequencing (MIPS), with sizes proportional to the number of genes, and connections inversely
proportional to the fraction of genes linking the clusters.
Lee, ..., Marcotte, Science 306, 1555 (2004)
5. Lecture SS 20005
Cell Simulations
34
Features of integrated networks
Adding context-inferred linkages increased clustering of genes, which produced a
highly modular gene network with well-defined subnetworks.
We expected these gene clusters to reflect gene systems and modules. We could
therefore generate a simplified view of the major trends in the network (Fig. 3B) by
clustering genes of ConfidentNet according to their connectivities. Of the 4681
genes, 3285 (70.2%) were grouped into 627 clusters, reflecting the high degree of
modularity.
Genes‘ functions within each cluster are highly coherent, and with 2 to 154 genes
per cluster (ca. 5 genes per cluster on average), the clusters effectively capture
typical gene pathways and/or systems.
Lee, ..., Marcotte, Science 306, 1555 (2004)
5. Lecture SS 20005
Cell Simulations
35
Features of integrated networks
Portions of the final, confident gene network are shown for (C) DNA damage
response and/or repair, where modularity gives rise to gene clusters, indicated by
similar colors, and (D) chromatin remodeling, with several uncharacterized
genes (red labels).
Networks are visualized with Large Graph Layout (LGL).
Lee, ..., Marcotte, Science 306, 1555 (2004)
5. Lecture SS 20005
Cell Simulations
36
Features of integrated networks
A region of the modular network centered on the DNA damage response and
repair systems is shown in Fig. 3C. The network is clearly hierarchical: Individual clusters
represent distinct systems related to DNA damage response and/or repair; these clusters
are in turn connected to modules of cell cycle regulatory genes and chromatin silencing,
functionally linked to the DNA damage response and/or repair system.
One can infer individual genes‘ functions on the basis of linked neighbors.
(Comment: this simple scheme can be improved by combining approach with Nabieva et al.)
For example, seven uncharacterized genes are implicated in chromatin remodeling (Fig.
3D). All but 1 of the 18 linkages made by these genes arise from the comparative genomics
analysis or from the network context methods, which represent examples of the insights that
arise only after data integration. Three of the uncharacterized proteins are predicted by
sequence homology to have helicase activity, which is reasonable for a relation to chromatin
remodeling; four of these proteins localize to the nucleus, further supporting their
association. After this network‘s construction, one gene, VID21, was implicated in chromatin
modification as a component of the NuA4 histone acetyl transferase.
Lee, ..., Marcotte, Science 306, 1555 (2004)
5. Lecture SS 20005
Cell Simulations
37
Experimental verification of one prediction
The function of the RNA helicase PRP43, previously thought to be involved only in
pre-mRNA splicing and implicated in lariatintron release from the spliceosome, is
also clarified in the network.
PRP43 is linked most strongly to genes of ribosome biogenesis and rRNA
processing. The tightest links are to ERB1, RRB1, NUG1, LHP1, and PWP1, the
first three of which are confirmed ribosome biogenesis factors. These links derive
only from the coexpression and context methods; data integration is therefore
critical.
The association of PRP43 with ribosome biogenesis has now been experimentally
validated: the growth defect conferred by a PRP43 conditional lethal mutation
corresponds to a rapid and major defect in rRNA processing.
These data indicate that rRNA processing is the essential function of PRP43, and
it joins a growing group of RNA helicases with two or more distinct functions.
Lee, ..., Marcotte, Science 306, 1555 (2004)
5. Lecture SS 20005
Cell Simulations
38
Summary
The probabilistic gene network integrates evidence from diverse sources to reconstruct an
accurate network, by estimating the functional coupling among yeast genes, and provides a
view of the relations between yeast proteins distinct from their physical interactions.
The application of this strategy to other organisms, such as to the human genome, is
conceptually straightforward:
(i) assemble benchmarks for measuring the accuracy of linkages between human genes
based on properties shared among genes in the same systems,
(ii) assemble gold standard sets of highly accurate interactions for calibrating the
benchmarks, and
(iii) benchmark functional genomics data for their ability to correctly link human genes, then
integrate the data as described.
New data can be incorporated in a simple manner serving to reinforce the correct linkages.
Thus, the gene network will ultimately converge by successive approximation to the
correct structure simply by continued addition of functional genomics data in this
framework.
Lee, ..., Marcotte, Science 306, 1555 (2004)
5. Lecture SS 20005
Cell Simulations
39