Computational Methodology for Microbial and
Metagenomic Characterization using Large Scale
Functional Genomic Data Integration
Curtis Huttenhower
Harvard School of Public Health
Department of Biostatistics
03-08-10
Outline
1. Network models of functional data
2. Network models of microbes
3. Network models of microbiomes
2
Meta-analysis for unsupervised
functional data integration
Huttenhower 2006
Hibbs 2007
1 1  

 '  log 
2  1   
z
+
 '   '
 '
=
Following up with round-robin
and semi-supervised evaluations
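A minimal sketch of the normalization on this slide, assuming each dataset is a dictionary of pairwise correlations: every correlation is Fisher z-transformed, then standardized within its dataset. Averaging the standardized scores across datasets is one simple unsupervised way to combine them and is an assumption here, as are the function names `fisher_z`, `normalize`, and `integrate`.

```python
import math
from statistics import mean, pstdev

def fisher_z(rho):
    """Fisher z-transform of a correlation coefficient."""
    # Clamp so that rho = +/-1 does not produce an infinite value.
    rho = max(min(rho, 0.999999), -0.999999)
    return 0.5 * math.log((1 + rho) / (1 - rho))

def normalize(correlations):
    """Transform one dataset's correlations, then z-score them.

    correlations maps a gene pair to its correlation. The dataset must
    contain at least two distinct values, otherwise sigma is zero.
    """
    transformed = {pair: fisher_z(r) for pair, r in correlations.items()}
    mu = mean(transformed.values())
    sigma = pstdev(transformed.values())
    return {pair: (v - mu) / sigma for pair, v in transformed.items()}

def integrate(datasets):
    """Average normalized scores over the datasets that cover each pair."""
    normalized = [normalize(d) for d in datasets]
    pairs = set().union(*normalized)
    return {p: mean(n[p] for n in normalized if p in n) for p in pairs}
```

Because every dataset is put on the same standardized scale before averaging, no dataset dominates simply because its raw correlations are larger.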
3
Functional network prediction from
diverse microbial data
486 bacterial expression experiments
876 raw datasets
310 postprocessed datasets
304 normalized coexpression networks in 27 species
307 bacterial interaction experiments
154796 raw interactions
114786 postprocessed interactions
Integrated functional interaction networks in 15 species
4
Functional maps for cross-species
knowledge transfer
Following up with unsupervised and
partially anchored network alignment
Huttenhower 2008
Huttenhower 2009
5
Functional maps for functional metagenomics
GOS 4441599.3
Hypersaline Lagoon, Ecuador
+
Integrated functional interaction networks in 27 species
=
Mapping genes into pathways
Mapping pathways into organisms
Mapping organisms into phyla
6
Functional maps for functional metagenomics
Summarizes information from ~10M metagenomic reads and ~500 genome-scale microbial experiments.
Edges: process association in obesity (less coregulated / baseline, no change / more coregulated)
Nodes: process cohesiveness in obesity (very downregulated / baseline, no change / very upregulated)
7
Efficient Computation For Biological Discovery
Massive datasets and genomes require
efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
  • Microarray data, interaction data, genes and gene sets, functional catalogs, etc.
  • Network communication, parallelization
• Efficient machine learning algorithms
  • Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!
It’s also speedy: microbial data integration computation takes <3hrs.
8
Thanks!
Olga Troyanskaya
Matt Hibbs
Chad Myers
David Hess
Chris Park
Ana Pop
Aaron Wong
Jacques Izard
Hilary Coller
Erin Haley
Sarah Fortune
Tracy Rosebrock
Wendy Garrett
http://huttenhower.sph.harvard.edu/sleipnir
http://function.princeton.edu/hefalmp
9
Functional mapping:
Functional associations between processes
Edges: associations between processes (moderately strong to very strong)
Nodes: cohesiveness of processes (below baseline / baseline, genomic background / very cohesive)
Borders: data coverage of processes (sparsely covered to well covered)
Information mapped from ~100 E. coli experiments
11
Meta-analysis for unsupervised
functional data integration
Huttenhower 2006
Hibbs 2007
Evangelou 2007
1 1  

 '  log 
2  1   
z
+
 '   '
 '
=
Following up with round-robin
and semi-supervised evaluations
12
Functional mapping: mining integrated networks
Predicted relationships between genes (low confidence to high confidence)
The strength of these relationships indicates how cohesive a process is.
Chemotaxis
13
Functional mapping: mining integrated networks
Predicted relationships between genes (low confidence to high confidence)
The strength of these relationships indicates how associated two processes are.
Chemotaxis
Flagellar assembly
15
Functional maps for cross-species
knowledge transfer
ECG1, ECG2
BSG1
ECG3, BSG2
…
O1: G1, G2, G3
O2: G4
O3: G6
…
[Figure: network diagram with nodes G1-G17 and O2-O9]
16
Functional network prediction from
diverse microbial data
486 bacterial expression experiments
876 raw datasets
310 postprocessed datasets
304 normalized coexpression networks in 27 species
307 bacterial interaction experiments
154796 raw interactions
114786 postprocessed interactions
Integrated functional interaction networks in 15 species
E. Coli Integration
← Precision ↑, Recall ↓
17
Functional maps for functional metagenomics
GOS 4441599.3
Hypersaline Lagoon, Ecuador
+
Integrated functional interaction networks in 27 species
=
Mapping genes into pathways
Mapping pathways into organisms
Mapping organisms into phyla
[Figure labels: KEGG Pathways, Organisms, Env., Pathogens]
18
Functional maps for cross-species
knowledge transfer
Following up with unsupervised and
partially anchored network alignment
← Precision ↑, Recall ↓
19
Functional Maps:
Focused Data Summarization
ACGGTGAACGTACA
GTACAGATTACTAG
GACATTAGGCCGTA
TCCGATACCCGATA
Data integration summarizes an
impossibly huge amount of
experimental data into an
impossibly huge number of
predictions; what next?
21
Functional Maps:
Focused Data Summarization
ACGGTGAACGTACA
GTACAGATTACTAG
GACATTAGGCCGTA
TCCGATACCCGATA
How can a biologist take
advantage of all this data to
study his/her favorite
gene/pathway/disease without
losing information?
Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data
22
Functional Mapping:
Scoring Functional Associations
How can we formalize
these relationships?
Any sets of genes G1 and G2
in a network can be compared
using four measures:
• Edges between their genes
• Edges within each set
• The background edges
incident to each set
• The baseline of all edges
in the network
Stronger connections between
the sets increase association.
$FA_{G_1,G_2} = \frac{\mathrm{between}(G_1, G_2)}{\mathrm{background}(G_1, G_2)} \cdot \frac{\mathrm{baseline}}{\mathrm{within}(G_1, G_2)}$
Stronger within self-connections or nonspecific
background connections decrease association.
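A minimal sketch of the score described above, under stated assumptions: the network is a dictionary mapping each gene pair (a two-element `frozenset`) to an edge weight, the two gene sets are disjoint, and each of the four measures is taken as a mean edge weight. The slide does not say how the measures are aggregated, so the use of means and the function name `functional_association` are illustrative choices. The score is read here as between/background times baseline/within.

```python
def mean(values):
    """Arithmetic mean; raises ZeroDivisionError for an empty input."""
    values = list(values)
    return sum(values) / len(values)

def functional_association(weights, g1, g2):
    """FA score for two disjoint gene sets in a weighted network.

    weights maps frozenset({gene_a, gene_b}) to an edge weight.
    At least one set needs two or more genes so that 'within' is defined.
    """
    g1, g2 = set(g1), set(g2)
    union = g1 | g2
    # Edges with one endpoint in each set.
    between = mean(w for pair, w in weights.items()
                   if pair & g1 and pair & g2)
    # Edges with both endpoints inside the same set.
    within = mean(w for pair, w in weights.items()
                  if pair <= g1 or pair <= g2)
    # Edges touching either set at all.
    background = mean(w for pair, w in weights.items() if pair & union)
    # Every edge in the network.
    baseline = mean(weights.values())
    return (between / background) * (baseline / within)

# Toy network: strong edges between the sets, weak edges inside them.
toy = {frozenset(p): w for p, w in [
    (("a", "b"), 0.1), (("c", "d"), 0.1),
    (("a", "c"), 0.9), (("a", "d"), 0.9),
    (("b", "c"), 0.9), (("b", "d"), 0.9),
]}
score = functional_association(toy, {"a", "b"}, {"c", "d"})
```

In the toy network the score is well above 1 because the sets are linked to each other far more strongly than they are to themselves; a network with uniform edge weights gives a score of exactly 1.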
23
Functional Mapping:
Bootstrap p-values
• Scoring functional associations is great…
…how do you interpret an association score?
– For gene sets of arbitrary sizes?
– In arbitrary graphs?
– Each with its own bizarre distribution of edges?
For any graph, compute FA scores for many randomly chosen gene sets of different sizes.
[Figure: histograms of FAs for random sets; # genes: 1, 5, 10, 50]
Null distribution is approximately normal with mean 1.
$\hat{\mu}_{FA}(G_i, G_j) = 1$
Standard deviation is asymptotic in the sizes of both gene sets.
$\hat{\sigma}_{FA}(G_i, G_j)$: expressed in terms of $|G_i|$, $|G_j|$, and the terms $A$, $B$, $C$
[Figure: null distribution σs for one graph, plotted against |G1| and |G2|]
$P(FA_{G_1,G_2} \geq x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\,\hat{\sigma}(G_1,G_2)}(x)$
Maps FA scores to p-values for any gene sets and underlying graph.
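A minimal sketch of the bootstrap idea on this slide, assuming a scoring function `score_fn(set1, set2)` is already available (hypothetical; any FA implementation would do). It draws random disjoint gene sets of the requested sizes, fits a normal distribution to the resulting null scores, and converts an observed score to an upper-tail p-value. The slide fits the mean and standard deviation as functions of set size; this sketch estimates them directly from the sample for one pair of sizes instead.

```python
import random
from statistics import NormalDist, mean, stdev

def null_scores(score_fn, genes, size1, size2, n_samples=1000, seed=0):
    """Scores for randomly chosen, disjoint gene sets of fixed sizes."""
    rng = random.Random(seed)
    genes = list(genes)
    scores = []
    for _ in range(n_samples):
        picked = rng.sample(genes, size1 + size2)
        scores.append(score_fn(picked[:size1], picked[size1:]))
    return scores

def fa_p_value(x, scores):
    """Upper-tail p-value of x under a normal fit to the null scores."""
    null = NormalDist(mean(scores), stdev(scores))
    return 1.0 - null.cdf(x)

# Usage with a stand-in null sample centred on 1.
example_null = [0.8, 0.9, 1.0, 1.1, 1.2]
p_at_mean = fa_p_value(1.0, example_null)
p_high = fa_p_value(1.5, example_null)
```

An observed score equal to the null mean maps to a p-value of about 0.5, and larger scores map to smaller p-values.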
24
Microbial Communities and
Functional Metagenomics
With Jacques Izard, Wendy Garrett
• Metagenomics: data analysis from environmental samples
– Microflora: environment includes us!
• Pathogen collections of “single” organisms form similar communities
• Another data integration problem
– Must include datasets from multiple organisms
• What questions can we answer?
– What pathways/processes are present, over-enriched, or under-enriched in a newly sequenced microbe/community?
– What’s shared within community X? What’s different? What’s unique?
– How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
– Current functional methods annotate ~50% of synthetic data, <5% of environmental data
25
Data Integration for Microbial Communities
~350 available expression datasets, ~25 species
• Data integration works just as well in microbes as it does in yeast and humans
• We know an awful lot about some microorganisms and almost nothing about others
• Sequence-based and network-based tools for function transfer both work in isolation
• We can use data integration to leverage both and mine out additional biology
Weskamp et al 2004
Flannick et al 2006
Kanehisa et al 2008
Tatusov et al 1997
26
Functional Maps for
Functional Metagenomics
27
Validating Orthology-Based
Functional Mapping
Does unweighted data integration predict functional relationships?
What is the effect of “projecting” through an orthologous space?
[Figure: GO and KEGG panels for each question; x-axis: Recall; y-axis: log(Precision/Random); curves labeled “Individual datasets” and “Unsupervised integration”]
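A minimal sketch of how curves on these axes could be computed, assuming a list of scored gene pairs and a gold-standard set of known related pairs (for example, pairs sharing a GO term or KEGG pathway). "Random" is taken to be the precision of a random ranking, that is, the fraction of all scored pairs that are positives, and the logarithm is base 2; both choices are assumptions, as is the function name.

```python
import math

def precision_recall_curve(scored_pairs, positives):
    """Return (recall, log2(precision / random)) at each rank cutoff.

    scored_pairs: list of (pair, score); positives: set of known pairs.
    At least one scored pair must be a positive.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    total_pos = sum(1 for pair, _ in ranked if pair in positives)
    random_precision = total_pos / len(ranked)
    curve, hits = [], 0
    for rank, (pair, _) in enumerate(ranked, start=1):
        if pair in positives:
            hits += 1
        precision = hits / rank
        recall = hits / total_pos
        if precision > 0:  # log is undefined before the first hit
            curve.append((recall, math.log2(precision / random_precision)))
    return curve

scored = [("ab", 0.9), ("ac", 0.8), ("bc", 0.1), ("bd", 0.05)]
curve = precision_recall_curve(scored, {"ab", "ac"})
```

A value of 0 on the vertical axis means the predictions are no better than random at that recall, so curves that stay above 0 indicate useful signal.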
28
Validating Orthology-Based
Functional Mapping
Holdout set, uncharacterized “genome”
Random subsets, characterized “genomes”
[Figure: yeast genes YG1-YG17 partitioned into subsets]
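A minimal sketch of the holdout setup on this slide, assuming the genes are split at random into equal-sized subsets, one of which is treated as the uncharacterized "genome". The function name and the number of subsets are illustrative, not taken from the slides.

```python
import random

def split_holdout(genes, n_subsets, seed=0):
    """Shuffle genes and split them into one holdout and several 'genomes'."""
    rng = random.Random(seed)
    shuffled = list(genes)
    rng.shuffle(shuffled)
    # Deal genes round-robin so subset sizes differ by at most one.
    subsets = [shuffled[i::n_subsets] for i in range(n_subsets)]
    return subsets[0], subsets[1:]

genes = [f"YG{i}" for i in range(1, 18)]
holdout, characterized = split_holdout(genes, n_subsets=4)
```

Predictions built only from the characterized subsets can then be compared against what is known about the holdout genes.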
29
Validating Orthology-Based
Functional Mapping
30
Validating Orthology-Based
Functional Mapping
Can subsets of the yeast genome predict a heldout subset’s functional maps?
Can subsets of the yeast genome predict a heldout subset’s interactome?
[Figure: GO and KEGG panels for each question; values shown: 0.68, 0.48, 0.30, 0.37, 0.40, 0.39, 0.25, 0.27, 0.43, 0.39]
What have we learned?
• Yeast is incredibly well-curated
• KEGG tends to be more specific than GO
• Predicting interactomes by projecting through functional maps works decently in the absolute best case
31