Why bacteria run Linux
while eukaryotes run Windows?
Sergei Maslov
Brookhaven National Laboratory
New York
Physical vs. Biological Laws
• Physical laws are often discovered by finding a simple common explanation for very different phenomena
• Newton’s Law:
    • Apples fall to the ground
    • Planets revolve around the Sun
• Discovery of biological laws is slowed down by our having a cookie-cutter explanation in terms of natural selection:
Drawing from Facebook group: Trust me, I'm a "Biologist"

Genes encoded in bacterial genomes ~ Packages installed on Linux computers

• Complex systems have many components
    • Genes (Bacteria)
    • Software packages (Linux OS)
• Components do not work alone: they need to be assembled to work
• In individual systems only a subset of components is installed
    • Genome (Bacteria) – collection of genes
    • Computer (Linux OS) – collection of software packages
• Components have vastly different frequencies of installation

IKEA kits have many components
Justin Pollard, http://www.designboom.com

They need to be assembled to work
Justin Pollard, http://www.designboom.com

Different frequencies of use: common vs. rare

What determines the frequency of installation/use of a gene/package?
• Popularity: AKA preferential attachment
    • Frequency ~ self-amplifying popularity
    • Relevant for social systems: WWW links, Facebook friendships, scientific citations
• Functional role:
    • Frequency ~ breadth or importance of the functional role
    • Relevant for biological and technological systems where selection adjusts undeserved popularity

Empirical data on component frequencies
• Bacterial genomes (eggnog.embl.de):
    • 500 sequenced prokaryotic genomes
    • 44,000 Orthologous Gene families
• Linux packages (popcon.ubuntu.com):
    • 200,000 Linux packages installed on 2,000,000 individual computers
• Binary tables: component is either present or not in a given system

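A minimal sketch of how a component frequency could be read off such a binary presence/absence table. The function name `component_frequencies` and the toy genomes are invented for illustration; they are not the eggNOG or popcon data.

```python
from collections import Counter

def component_frequencies(systems):
    """Return {component: fraction of systems that contain it}.

    `systems` maps a system name (genome or computer) to the set of
    components (gene families or packages) present in it, i.e. one row
    of a binary presence/absence table per system.
    """
    counts = Counter()
    for components in systems.values():
        counts.update(set(components))  # count each component once per system
    n_systems = len(systems)
    return {c: k / n_systems for c, k in counts.items()}

# Toy data, invented for illustration only.
toy = {
    "genome_A": {"ribosomal_protein", "lacZ", "toxin_X"},
    "genome_B": {"ribosomal_protein", "lacZ"},
    "genome_C": {"ribosomal_protein"},
    "genome_D": {"ribosomal_protein", "orfan_1"},
}
freq = component_frequencies(toy)
for name, f in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(f"{name}: f = {f:.2f}")
```

In this toy table the component present in every system has f = 1, while a component found in a single system has f = 1/4.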
Frequency distributions
[Figure: component frequency distributions, with regions labeled Cloud, Shell, Core, and ORFans]
$P(f) \sim f^{-1.5}$, except the top $\sqrt{N}$ “universal” components with $f \sim 1$
TY Pang, S. Maslov, PNAS (2013)

How to quantify functional importance?
• We want to check Frequency ~ Importance
• Usefulness = Importance ~ component is needed for proper functioning of other components
• Dependency network
    • A → B means A depends on B for its function
    • Formalized for Linux software packages
    • For metabolic enzymes, given by upstream-downstream positions in pathways
• Frequency ~ dependency degree, Kdep
    • Kdep = the total number of components that directly or indirectly depend on the selected one

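The definition of Kdep can be sketched as a reachability count on the reversed dependency graph. This is an illustrative implementation, not the one used in the cited paper; the package names are made up, and the component itself is not counted among its dependents (a convention chosen here).

```python
from collections import defaultdict

def dependency_degrees(depends_on):
    """K_dep for every component: how many other components depend on it,
    directly or indirectly. `depends_on[a]` lists what `a` needs."""
    # Reverse the edges: dependents[b] = components that directly need b.
    dependents = defaultdict(set)
    nodes = set(depends_on)
    for a, needs in depends_on.items():
        for b in needs:
            dependents[b].add(a)
            nodes.add(b)
    kdep = {}
    for start in nodes:
        seen, stack = set(), [start]
        while stack:  # depth-first walk over the reversed edges
            for nxt in dependents[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        seen.discard(start)  # do not count the component itself
        kdep[start] = len(seen)
    return kdep

# Hypothetical toy network: "editor -> gui -> libc", "shell -> libc".
toy_deps = {"editor": ["gui", "libc"], "gui": ["libc"], "shell": ["libc"], "libc": []}
kdep = dependency_degrees(toy_deps)
print(kdep)
```

Here the library at the bottom of the toy network gets the largest Kdep, because everything else needs it directly or through another package.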
TY Pang, S. Maslov, PNAS (2013)
Frequency is positively correlated
with functional importance
Correlation coefficient ~0.4 for both Linux and genes
Could be improved by using weighted dependency degree
TY Pang, S. Maslov, PNAS (2013)
Warm-up: tree-like metabolic network
TCA cycle
Kdep=15
Kdep=5
TY Pang, S. Maslov, PNAS (2013)
Dependency degree distribution on a critical branching tree
• $P(K) \sim K^{-1.5}$ for a critical branching tree
• Paradox: $K_{max}^{-0.5} \sim 1/N \Rightarrow K_{max} = N^2 > N$
• Answer: parent tree size imposes a cutoff: there will be $\sqrt{N}$ “core” nodes with $K_{max} = N$
    • present in almost all systems (ribosomal genes or core metabolic enzymes)
• Need a new model: in a tree D = 1, while in real systems D ~ 2 > 1

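The paradox on this slide is a one-line extreme-value estimate. The sketch below spells it out, assuming the N dependency degrees behave like independent draws from the power-law tail (an assumption made here for illustration).

```latex
\begin{align}
P(K \ge k) &\sim \int_k^{\infty} K^{-3/2}\, dK \sim k^{-1/2} \\
N \cdot P(K \ge K_{max}) \sim 1
  &\;\Rightarrow\; K_{max}^{-1/2} \sim 1/N
  \;\Rightarrow\; K_{max} \sim N^{2}
\end{align}
```

Since no component can have more than N dependents, the tail has to be cut off at K = N, which is the cutoff the slide attributes to the parent tree size.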
Bottom-down model of dependency network evolution
• Components added gradually over evolutionary time
• New component directly depends on D previously existing components selected randomly
• Versions:
    • D is drawn from some distribution: same as above
    • Recent components are preferentially selected: citations
    • There is a fixed probability to connect to any previously existing component: food webs

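The basic version of the model can be sketched in a few lines. This is a hedged illustration rather than the authors' code: the function names, the seed, and the choice to pick D distinct targets (or all earlier components when fewer than D exist) are assumptions.

```python
import random

def grow_dependency_network(n, d, seed=0):
    """deps[i] lists the components that component i directly depends on."""
    rng = random.Random(seed)
    deps = [[]]  # the first component has nothing to depend on
    for new in range(1, n):
        # Assumption: D distinct targets, or all earlier ones if fewer exist.
        deps.append(rng.sample(range(new), min(d, new)))
    return deps

def dependency_degree(deps):
    """K_dep[t] = number of later components that depend on t, directly or not."""
    n = len(deps)
    upstream = [0] * n  # bitmask of everything component i depends on
    for i, targets in enumerate(deps):
        for j in targets:
            # targets are older than i, so upstream[j] is already complete
            upstream[i] |= upstream[j] | (1 << j)
    return [sum((upstream[i] >> t) & 1 for i in range(t + 1, n)) for t in range(n)]

N, D = 500, 2
model_deps = grow_dependency_network(N, D)
kdep_model = dependency_degree(model_deps)
print("oldest component:", kdep_model[0], "newest component:", kdep_model[-1])
```

With these rules every later component reaches the very first one through some chain of dependencies, so the oldest component ends up with Kdep = N - 1, while the newest has Kdep = 0.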
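• $p(t,T)$: probability that a component added at time $T$ directly or indirectly depends on one added at time $t$

$$1 - p(t,T) = \prod_{\tau=t+1}^{T} \left(1 - p(\tau,T)\,\frac{D}{\tau}\right)$$

$$\log\bigl(1 - p(t,T)\bigr) = \int_t^T \log\left(1 - p(\tau,T)\,\frac{D}{\tau}\right) d\tau$$

$$\frac{dp(t,T)/dt}{1 - p(t,T)} = \log\left(1 - p(t,T)\,\frac{D}{t}\right)$$

$$p(t,T) = \frac{1}{1 + \frac{T}{D}\left(\frac{t}{T}\right)^{D}}$$
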
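As a rough numerical check, the closed form p(t,T) = 1 / (1 + (T/D)(t/T)^D) can be compared with a product recursion iterated from t = T downwards. The sketch assumes the recursion reads 1 - p(t,T) = prod over tau = t+1..T of (1 - p(tau,T) D/tau) with p(T,T) = 1; the function names and parameter values are illustrative.

```python
def p_closed(t, T, D):
    """Closed-form p(t, T) = 1 / (1 + (T/D) * (t/T)**D)."""
    return 1.0 / (1.0 + (T / D) * (t / T) ** D)

def p_recursive(T, D):
    """p[t] for t = 1..T from the product recursion (assumed form, see text)."""
    p = [0.0] * (T + 1)  # p[t] approximates p(t, T); index 0 is unused
    p[T] = 1.0           # a component trivially "depends" on itself
    q = 1.0              # running product, equal to 1 - p(t, T)
    for tau in range(T, 1, -1):
        q *= 1.0 - p[tau] * D / tau
        p[tau - 1] = 1.0 - q
    return p

T, D = 10_000, 2
p_num = p_recursive(T, D)
max_gap = max(abs(p_num[t] - p_closed(t, T, D)) for t in range(1, T))
print(f"largest |recursion - closed form| for t < T: {max_gap:.4f}")
```

The two agree closely because the closed form solves the continuous-time, small-D/t limit of the recursion; the remaining gap comes from the discrete time steps.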
$$K_{dep}(t) = 1 + \int_t^N p(t,T)\,dT$$

Saturation: $K_{dep} \sim N$ when $t \sim T_*^{(D-1)/D}$

$$K_{dep}(t) = \left(\frac{t}{N}\right)^{-D}$$

$$K_{out}(t) = 1 + \int_1^t p(\tau,t)\,d\tau$$

$$K_{out}(t) = C\,t^{(D-1)/D}$$

Kdep and Kout degree distributions

$$P\bigl(K_{dep}(t) \ge K_{dep}\bigr) = P\bigl((t/N)^{-D} \ge K_{dep}\bigr) = P\bigl(t/N \le K_{dep}^{-1/D}\bigr) = K_{dep}^{-1/D}$$

$$P(K_{dep}) \sim K_{dep}^{-(1+1/D)}$$

$$P\bigl(K_{out}(t) \le K_{out}\bigr) = P\bigl(t^{(D-1)/D} \le K_{out}\bigr) = P\bigl(t \le K_{out}^{D/(D-1)}\bigr) = K_{out}^{D/(D-1)}/N$$

$$P(K_{out}) \sim K_{out}^{1/(D-1)}$$

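The cumulative form of the Kdep distribution can be checked by direct counting. The sketch below takes K(t) = (t/N)^(-D) for t = 1..N (treating arrival times as uniformly spread, which is the assumption behind the slide's argument) and measures the fraction of components with K at least k; names and parameter values are illustrative.

```python
import math

def kdep_ccdf(n, d, k):
    """Fraction of components t = 1..n whose (t/n)**(-d) is at least k."""
    return sum(1 for t in range(1, n + 1) if (t / n) ** (-d) >= k) / n

N, D = 100_000, 2
ccdf_100 = kdep_ccdf(N, D, 100.0)
ccdf_10000 = kdep_ccdf(N, D, 10_000.0)
# Log-log slope of the cumulative distribution between the two thresholds.
slope = math.log(ccdf_10000 / ccdf_100) / math.log(10_000.0 / 100.0)
print(f"P(K >= 100) = {ccdf_100:.4f}, P(K >= 10000) = {ccdf_10000:.4f}, slope = {slope:.3f}")
```

For D = 2 the cumulative slope comes out close to -1/D = -0.5, which corresponds to a density exponent of -(1 + 1/D) = -1.5.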
Kdep decreases with layer number
[Figure panels: Linux; Model with D=2]
TY Pang, S. Maslov, PNAS (2013)

Zipf plot for Kdep distributions
[Figure panels: Metabolic enzymes vs. Model; Linux vs. Model]
TY Pang, S. Maslov, PNAS (2013)

Frequency distributions
[Figure: component frequency distributions, with regions labeled Core, Shell, Cloud, and ORFans]
$P(f) \sim f^{-1.5}$, except the top $\sqrt{N}$ “universal” components with $f \sim 1$
TY Pang, S. Maslov, PNAS (2013)

What experiments does P(f) help to interpret?

Pan-genome of E. coli strains
M Touchon et al. PLoS Genetics (2009)
Metagenomes
The Human Microbiome Project Consortium, Nature (2012)
Pan-genome scaling

$$\frac{\Delta P}{\Delta N} = \int f \cdot f^{-\gamma} \cdot \exp(-Nf)\,df = N^{-(2-\gamma)}$$

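The scaling of this integral with N can be verified numerically. The sketch below integrates f * f^(-gamma) * exp(-N f) over 0 < f <= 1 with a midpoint rule; the integration limits, the substitution f = u^2 (used to tame the integrable singularity at f = 0), and the function name are choices made here, not taken from the slide.

```python
import math

def new_genes_rate(n_genomes, gamma=1.5, steps=200_000):
    """Midpoint-rule estimate of the integral of f * f**(-gamma) * exp(-N*f)
    over 0 < f <= 1, computed in the variable u where f = u**2."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        f = u * u
        # df = 2*u*du under the substitution f = u**2
        total += f ** (1.0 - gamma) * math.exp(-n_genomes * f) * 2.0 * u * h
    return total

rate_100 = new_genes_rate(100)
rate_400 = new_genes_rate(400)
slope_pan = math.log(rate_400 / rate_100) / math.log(400 / 100)
print(f"estimated exponent: {slope_pan:.3f} (expected -(2 - gamma) = -0.5)")
```

For gamma = 1.5 the estimated exponent is -0.5, so under this estimate the number of new genes contributed by each additional genome falls off as the inverse square root of the number of genomes already sequenced.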
Pan-genome of all bacteria
• (# of genes in pan-genome) ~ (# of sequenced genomes)^0.5
• (# of new genes added to pan-genome) ~ (# of sequenced genomes)^-0.5
P. Lapierre, JP Gogarten, TIG 2009
[Figure annotation: slope = -0.4; prediction of the toolbox model: -0.5]

Bacterial genome evolution happens in cooperation with phages

Comparative genomics of E. coli implicates phages for BitTorrent
1kb: gene length
K-12 to B comparison
Phage capacity: 20kb
Other strains up to 40kb

WWW from AT&T website circa 1996, visualized by Mark Newman

Phage-Bacteria Infection Network
Data from Flores et al. 2011; experiments by Moebus, Nattkemper, 1981

Why eukaryotes run Windows?
• Dependency network = reuse of components
    • Bacteria do not keep redundant genes after HGT
    • Linux developers rely on previous efforts
    • Pros: smaller genomes, open source, economies of scale
    • Cons: less specialized, potentially unstable, “dependency hell”
• Eukaryotes are like Windows or Mac OS X
    • Keep redundant components
    • Proprietary software

[Figure: # of pathways (or their regulators) vs. # of genes. Figure adapted from S. Maslov, TY Pang, K. Sneppen, S. Krishna, PNAS (2009)]

Software packages for Linux
$N_{\text{selected packages}} \sim N_{\text{installed packages}}^{1.7}$
[Figure: # of selected packages vs. # of installed packages on log-log axes; Linux data with slope 1.7]

Collaborators: Tin Yau Pang, Stony Brook University
Support: Office of Biological and Environmental Research
Thank you!