Download Scale-free networks - Chair of Computational Biology

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Artificial gene synthesis wikipedia , lookup

Gene expression programming wikipedia , lookup

Protein moonlighting wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

NEDD9 wikipedia , lookup

Median graph wikipedia , lookup

Transcript
1 Scale-free networks: mathematical properties
Random graphs: classical field in graph theory. Well studied analytically and
numerically.
Scale-free networks: quite new. Properties were mostly studied numerically and
heuristically (sofar).
Nice review (suggested reading of today):
Graph Theory Approaches to Protein Interaction Data Analysis,
Nataša Pržulj.
2. Lecture WS 2004/05
Bioinformatics III
1
Erdös-Renyi model
n nodes (vertices) joined by edges that have been chosen and placed between
pairs of nodes uniformly at random.
Gn,p : each possible edge in the graph on n nodes is present with probability p
and absent with probability 1 – p.
Average number of edges in Gn,p :
nn  1
p
2
Each edge connects two vertices
 average degree of a vertex:
2. Lecture WS 2004/05
nn  1 p
 n  1 p  np
n
n 
Bioinformatics III
2
Erdös-Renyi model: components
Erdös and Renyi studied how the expected topology of a random graph with n
nodes changes as a function of the number of edges m.
When m is small, the graph is likely fragmented into many small connected
components having vertex sets of size at most O(log n).
As m increases the components grow at first by linking to isolated nodes, and
later by fusing with other components.
A transition happens at m = n/2, when many clusters cross-link spontaneously
to form a unique largest component called the giant component.
Its vertex set size is much larger than the vertex set sizes of any other
components.
It contains O(n) nodes, while the second largest component contains O(log n)
nodes.
In statistical physics, this phenomenon is called percolation.
2. Lecture WS 2004/05
Bioinformatics III
3
Erdös-Renyi model: shortest path length
The shortest path length between any pairs of nodes in the giant component
grows like log n.
Therefore, these graphs are called „small worlds“.
The properties of random graphs have been studied very extensively.
Literature: B. Bollobas. Random Graphs. Academic, London, 1985.
However, random graphs are no adequate models for real-world networks
because
(1) real networks appear to have a power-law degree distribution,
(while random graphs have Poisson distribution) and
(2) real networks have strong clustering while the clustering coefficient of a
random graph is C = p, independent of whether two vertices have a
common neighbor.
2. Lecture WS 2004/05
Bioinformatics III
4
Generalized Random Graphs
Aim: allow a power-law degree distribution in a graph while leaving all other
aspects as in the random graph model.
 Given a degree sequence (e.g. power-law distribution) one can generate a
random graph by assigning to a vertex i a degree ki from the given degree
sequence. Then choose pairs of vertices uniformly at random to make edges so
that the assigned degrees remain preserved.
When all degrees have been used up to make edges, the resulting graph is a
random member of the set of graphs with the desired degree distribution.
Problem: method does not allow to specify clustering coefficient.
On the other hand, this property makes it possible to exactly determine many
properties of these graphs in the limit of large n.
E.g. almost all random graphs with a fixed degree distribution and no nodes of
degree smaller than 2 have a unique giant component.
2. Lecture WS 2004/05
Bioinformatics III
5
Barabasi scale-free model
Input (n0, m, t) where n0 is the initial number of vertices, m (m n0) is the
number of added edges every time one new vertex is added to the graph, and t
is the number of iterations.
Algorithm
a) Start with n0 isolated nodes.
b) Every time we add one new node v, m edges will be linked to the existing
nodes from v with a preferential attachment probability
 i ki ) 
ki
N 1
k
i 1
i
where ki is the number of links at the i-th node.
Eventually, the graph has (n0 + t) nodes and (mt) edges.
Problem of „pure“ mathematicians with this algorithm: how to start from n0 = 0?
2. Lecture WS 2004/05
Bioinformatics III
6
Properties of Barabasi-Albert scale-free model
P(k)  k- with  = 3.
Real networks often show   2.1 – 2.4
Observation: if either growth or preferential attachment is eliminated, the
resulting network does not exhibit scale-free properties.
The average path length in the BA-model is proportional to ln n/ln ln n
which is shorter than in random graphs  scale-free networks are ultrasmall
worlds.
Observation: non-trivial correlations = clustering between the degrees of
connected nodes.
Numerical result for AB-model C  n-0.75. No analytical predictions of C sofar.
2. Lecture WS 2004/05
Bioinformatics III
7
Properties of scale-free models
Scale-free networks are resistant to random failures („robustness“) because a
few high-degree hubs dominate their topology;
a deliberate node that fails probably has a small degree, and thus not severly
affects the rest of the network.
However, scale-free networks are quite vulnerable to attacks on the hubs.
See example of last lecture about
lethality of gene deletions in yeast.
These properties have been confirmed
numerically and analytically by studying
the average path length and the size
of the giant component.
2. Lecture WS 2004/05
Bioinformatics III
8
Properties of Barabasi-Albert scale-free model
BA-model is a minimal model that captures the mechanisms responsible for the
power-law degree distribution observed in real networks.
A discrepany is the fixed exponent of the predicted power-law distribution ( = 3).
 Does the BA-model describe the „true“ biological evolution of networks?
Recent efforts:
- study variants with cleaner mathematical properties (Bollobas, LCD-model)
- include effects of adding or re-wiring edges,
allow nodes to age so that they can no longer accept new edges
or vary forms of preferential attachment.
These models also predict exponential and truncated power-law degree
distribution in some parameter regimes.
2. Lecture WS 2004/05
Bioinformatics III
9
2 Scale-free behavior in protein domain networks
‚Domains‘ are fundamental units of protein structure.
Most proteins only contain one single domain.
Some sequences appear as multidomain proteins. On average, they have 2-3
domains, but can have up to 130 domains!
Most new sequences show homologies to parts of known protein sequences
 most proteins may have descended from relatively few ancestral types.
Sequence of large proteins often seem to have evolved by joining preexisting
domains in new combinations, „domain shuffling“:
domain duplication or domain insertion.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
2. Lecture WS 2004/05
Bioinformatics III
10
Protein domain database SMART
http://smart.embl-heidelberg.de/
contains 153 signalling domains
176 nuclear domains, e.g. HLH domains
225 extracellular domains
115 „other“ domains
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
2. Lecture WS 2004/05
Bioinformatics III
11
Protein Domain databases
Prosite (http://expasy.proteome.org.au/prosite/) contains 1400 biologically
significant motifs and profiles.
Pfam (http://www.sanger.ac.uk/Software/Pfam/index.shtml) : collection of multiplesequence alignments of protein families and profile HMMs. Curated documentation
on 2500 families.
ProDom (http://www.toulouse.inra.fr/prodom.html) : contains all 160.000 protein
domain families that can be automatically generated from SwissProt and TrEMBL
databases.
Here, only consider families with 10 members  6000 ProDom families.
InterPro Proteome Analysis of 41 nonredundant proteomes of genomes of
archaea, bacteria, and eukaryotes (http://www.ebi.ac.uk/proteome) yields domains
which appear along with other domains in a protein sequence  vertices + links.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
2. Lecture WS 2004/05
Bioinformatics III
12
Protein Domain databases
P(number of links to other domains)
Prosite
(http://expasy.proteome.org.au/prosit
e/) contains 1400 biologically
significant motifs and profiles.
number of links to other domains
2. Lecture WS 2004/05
Bioinformatics III
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
13
Which are highly connected domains?
The majority of highly connected InterPro domains appear in signalling pathways.
List of the 10 best linked domains in various species.
Number of links increases. Number of signalling domains (PH, SH3), their ligands
(proline-rich extensions), and receptors (GPCR/RHODOPSIN) increases.
 evolutionary trend toward compartementalization of the cell and multicellularity
demands a higher degree of organization.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
2. Lecture WS 2004/05
Bioinformatics III
14
Evolutionary Aspects
BA-model of scale-free networks is constructed by preferential attachment of newly
added vertices to already well connected ones.
 Fell and Wagner (2000) argued that vertices with many connections in metabolic
network were metabolites originating very early in the course of evolution where
they shaped a core metabolism.
 Analogously, highly connected domains could have also originated very early.
Is this true?
No. Majority of highly connected domains in Methanococcus and in E.coli are
concerned with maintanced of metabolism.
None of the highly connected domains of higher organisms is found here.
On the other hand, helicase C has roughly similar degrees of connection in all
organisms.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
2. Lecture WS 2004/05
Bioinformatics III
15
Conclusions
Expansion of protein families in multcellular vertebrates coincides with higher
connectivity of the respective domains.
Extensive shuffling of domains to increase combinatorial diversity might provide
protein sets which are sufficient to preserve cellular procedures without dramatically
expanding the absolute size of the protein complement.
 greater proteome complexity of higher eukaryotes is not simply a consequence
of the genome size, but must also be a consequence of innovations in domain
arrangements.
 highly linked domains represent functional centers in various different cellular
aspects.
They could be treated as „evolutionary hubs“ which help to organize the domain
space.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
2. Lecture WS 2004/05
Bioinformatics III
16
3 How did complexity evolve II?
At the molecular level, biological complexity involves networks of ligand-protein,
protein-protein, and protein-nucleic acid interactions in metabolism, signal
transduction, gene regulation, protein synthesis etc.
As organismal complexity increases, more control is required for the positive and
negative regulation of genes.
 Complexity correlates with an increase in both the ratio and absolute number of
transcription factors.
Amoutzias et al. EMBO reports 5, 1 (2004)
2. Lecture WS 2004/05
Bioinformatics III
17
Evolution of complex genetic networks
Duplication of genes is predominant factor for the generation of new members of a
protein family.
Duplicated gene creates redundancy if multiple proteins have the same or
overlapping function.
Alternatively, due to reduced selective pressure, one of the gene copies can
become nonfunctional or acquire new function.
What is more important?
Duplication of single genes
or
duplication of large gene clusters = „building blocks“ ?
Here: Empirical evidence for scale-free protein networks emerging through
single-gene duplication.
Amoutzias et al. EMBO reports 5, 1 (2004)
2. Lecture WS 2004/05
Bioinformatics III
18
bHLH protein family
bHLH protein family:
ancient class of eukaryotic transcription factors
found in fungi, plants, and animals.
bHLHs may form homo- and heterodimers.
They form complex protein-protein interaction network.
Very conserved 60-residue ‚basic region‘ – helix – loop –
helix motif
bHLHs dimerize into 4-helix bundle and
recognize DNA with basic regions.
Additional regions responsible for activation or
repression of target gene activity.
http://www.biochem.ucl.ac.uk/bsm/pdbsum/
2. Lecture WS 2004/05
Bioinformatics III
19
Two possible patterns of network evolution
A Evolution of a heterodimerization network by single-gene duplication.
B Evolution of a heterodimerization network by large-scale gene duplication.
Amoutzias et al. EMBO reports 5, 1 (2004)
2. Lecture WS 2004/05
Bioinformatics III
20
bHLH heterodimerization network
A phylogenetic analysis of human bHLH proteins
B domain architecture
Amoutzias et al. EMBO reports 5, 1 (2004)
2. Lecture WS 2004/05
Bioinformatics III
21
Topology of bHLH heterodimerization network
Topology based on protein-protein interaction data (from literature).
Networks with
hubs: E2A, Arnt,
Max.
E2A and Arnt subnetworks are
connected.
Max is distinct.
Hubs are shown as circles.
2. Lecture WS 2004/05
Amoutzias et al. EMBO reports 5, 1 (2004)
Bioinformatics III
22
Topology of bHLH heterodimerization network
P(k) follows a scale-free behavior!
Relative connectivity of hubs is
higher ( ~ 1) than reported for most
other networks.
This high connectivity appears to result
from gene duplication generating new,
peripheral proteins that interact preferentially with the hub.
Amoutzias et al. EMBO reports 5, 1 (2004)
2. Lecture WS 2004/05
Bioinformatics III
23
Topology of bHLH heterodimerization network
The hub proteins are usually widely
expressed in different tissues and
organs.
They heterodimerize with peripheral
proteins with more limited expression
pattern  specific effects.
Topology of Max network parallels that of
E2A/arnt superfamily:
2 hubs connected by Mad family (repressor)
vs. 2 families connected by HES family
(repressors).
Hubs are shown as circles.
2. Lecture WS 2004/05
Amoutzias et al. EMBO reports 5, 1 (2004)
Bioinformatics III
24
Phylogenetic relationships
Parallelism between E2A/arnt and
Max families also reflected in
phylogetic relationship:
- in ‚Max‘ (and E2A/arnt) network, the
2 hubs (families) are not clustered
together
- the ‚bridge‘ linking the 2 hubs is a
family of repressor proteins that are
phylogenetically quite distant from
the hubs.
 2 bHLH networks show
evolutionary convergence.
Amoutzias et al. EMBO reports 5, 1 (2004)
2. Lecture WS 2004/05
Bioinformatics III
25
Conclusions
• For the evolution of networks based on one kind of binding domain,
a model of single-gene duplication, followed by domain rearrangements, point
mutations and ongoing gene duplication is sufficient to generate quite complex
interaction patterns, which mediate activation and repression.
Seems first example where hub-based network with scale-free properties is based
on real-data phylogenies.
(2) Compelling symmetry between the 2 networks E2A/arnt and Max.
Amoutzias et al. EMBO reports 5, 1 (2004)
2. Lecture WS 2004/05
Bioinformatics III
26
Watts-Strogartz model
Input: (n, k, p) where n is the number of vertices, k is the distance in which each
vertex is connected initially to its neighbors by undirected edges, and p(0 p  1)
is the probability of rewiring each edge.
Algorithm
a) start with a ring lattice with n noces, each has the kth nearest neighbors;
Thus, the degree of each vertex is 2k and the ring has (nk) edges.
b) Replace original edges by random ones based on the probability p.
2. Lecture WS 2004/05
Bioinformatics III
27