Scalable data mining for functional
genomics and metagenomics
Curtis Huttenhower
Harvard School of Public Health
Department of Biostatistics
09-16-10
Greatest discoveries in biology?
Our job is to create
computational microscopes:
To ask and answer specific
biological questions using
millions of experimental results
2
Outline
1. Data mining: Integrating very large genomic data compendia
2. Metagenomics: Network models of microbial communities
3
A computational definition of
functional genomics
[Diagram: genomic data and prior knowledge, with the mappings Gene → Function, Gene → Gene, Data → Function, and Function → Function.]
4
A framework for functional genomics
[Figure: for each dataset, scores running from low to high correlation (or dissimilar to similar) are tabulated for known related (+) and unrelated (−) gene pairs, e.g. G1-G4 and G2-G9 (+), G3-G7 and G6-G8 (−), giving a frequency distribution per dataset; categorical data ("Let." vs. "Not let.") are handled the same way. Combining ~1Ks of datasets over ~100Ms of gene pairs yields a probability for each unlabeled pair, e.g. P(G2-G5|Data) = 0.85.]
5
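The framework above can be sketched as a naive Bayes classifier over discretized datasets. This is a minimal illustration, not the implementation behind the slides: the function names, the two-bin cutoff at 0.5, the add-one smoothing, and the 0.5 prior are all assumptions, and the pairing of correlation values with gene pairs is only one reading of the figure.

```python
from collections import Counter

def learn_tables(dataset, gold, n_bins):
    """Estimate P(bin | related) and P(bin | unrelated) for one dataset.

    dataset: {pair: bin index}; gold: {pair: True/False}.
    Add-one smoothing (an assumption) keeps unseen bins from getting zero probability.
    """
    counts = {True: Counter(), False: Counter()}
    for pair, label in gold.items():
        if pair in dataset:
            counts[label][dataset[pair]] += 1
    tables = {}
    for label, c in counts.items():
        total = sum(c.values()) + n_bins
        tables[label] = [(c[b] + 1) / total for b in range(n_bins)]
    return tables

def posterior(pair, datasets, tables, prior):
    """P(related | data) for one gene pair, treating datasets as independent."""
    p_rel, p_unrel = prior, 1.0 - prior
    for data, table in zip(datasets, tables):
        if pair in data:  # datasets missing this pair contribute nothing
            p_rel *= table[True][data[pair]]
            p_unrel *= table[False][data[pair]]
    return p_rel / (p_rel + p_unrel)

# Illustrative values taken from the figure; the pairing is an assumption.
gold = {("G1", "G4"): True, ("G2", "G9"): True,
        ("G3", "G7"): False, ("G6", "G8"): False}
correlation = {("G1", "G4"): 0.9, ("G2", "G9"): 0.7, ("G3", "G7"): 0.1,
               ("G6", "G8"): 0.2, ("G2", "G5"): 0.8}
binned = {pair: int(value >= 0.5) for pair, value in correlation.items()}

datasets = [binned]
tables = [learn_tables(d, gold, n_bins=2) for d in datasets]
p_g2_g5 = posterior(("G2", "G5"), datasets, tables, prior=0.5)
```

With more datasets, each one adds another factor to the product, which is what lets the approach scale to thousands of datasets.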
Functional network
prediction and analysis
Global interaction network
HEFalMp
Currently includes data from
30,000 human experimental results,
15,000 expression conditions +
15,000 diverse others, analyzed for
200 biological functions and
150 diseases
Carbon metabolism network
Extracellular signaling network
Gut community network
6
Functional network prediction from
diverse microbial data
486 bacterial expression experiments → 876 raw datasets → 310 postprocessed datasets → 304 normalized coexpression networks in 27 species
307 bacterial interaction experiments → 154,796 raw interactions → 114,786 postprocessed interactions
→ Integrated functional interaction networks in 15 species
[Plot: E. coli integration, precision vs. recall.]
7
Meta-analysis for unsupervised
functional data integration
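Huttenhower 2006
Hibbs 2007
Evangelou 2007
Fisher transform of each correlation $\rho$, then standardization within a dataset:
$\rho' = \frac{1}{2} \log \frac{1 + \rho}{1 - \rho}$
$z = \frac{\rho' - \mu_{\rho'}}{\sigma_{\rho'}}$
Simple regression: All datasets are equally accurate
$y_{e,i} = \mu_e + \varepsilon_{e,i}$
Random effects: Variation within and among datasets and interactions
$y_{e,i} = \mu_e + \delta_{e,i} + \varepsilon_{e,i}$
$\hat{\mu}_e = \sum_i w^*_{e,i} \, y_{e,i}$
$w^*_{e,i} = \frac{1}{s^2_{e,i} + \hat{\tau}^2_e}$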
8
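A minimal sketch of the random-effects combination, assuming the DerSimonian-Laird estimator for the among-dataset variance and weights normalized to sum to one. Neither choice is stated on the slide, and the function names and input values are hypothetical.

```python
import math

def fisher_z(rho):
    """Fisher transform: rho' = 0.5 * log((1 + rho) / (1 - rho))."""
    return 0.5 * math.log((1.0 + rho) / (1.0 - rho))

def random_effects(y, s2):
    """Combine per-dataset effects y[i] with within-dataset variances s2[i].

    Returns (mu_hat, tau2_hat); tau2_hat is the DerSimonian-Laird estimate
    of the among-dataset variance (an assumed choice of estimator).
    """
    w = [1.0 / v for v in s2]                        # fixed-effect weights
    mu_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)          # truncated at zero
    w_star = [1.0 / (v + tau2) for v in s2]          # w* = 1 / (s^2 + tau^2)
    mu_hat = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    return mu_hat, tau2

# One gene pair observed in three datasets (made-up correlations and variances).
z_scores = [fisher_z(r) for r in (0.9, 0.7, 0.8)]
mu_hat, tau2 = random_effects(z_scores, [0.05, 0.05, 0.05])
```

When all within-dataset variances are equal, as here, the weights are equal and the estimate reduces to the simple average, which matches the "all datasets are equally accurate" case.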
Unsupervised data integration:
TB virulence and ESX-1 secretion
With Sarah Fortune
Graphle
http://huttenhower.sph.harvard.edu/graphle/
10
Predicting gene function
[Figure: network of predicted relationships between genes, with edges shaded from low to high confidence, around a set of cell cycle genes.]
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
14
Comprehensive validation of
computational predictions
With David Hess, Amy Caudy
Genomic data + prior knowledge → computational predictions of gene function: SPELL (Hibbs et al 2007), bioPIXIE (Myers et al 2005), MEFIT
→ Genes predicted to function in mitochondrion organization and biogenesis
→ Laboratory experiments: growth curves, petite frequency, confocal microscopy
→ New known functions for correctly predicted genes
→ Retraining
15
Evaluating the performance of computational predictions
Genes involved in mitochondrion organization and biogenesis:
Original GO annotations: 106
Under-annotations: 135
Novel confirmations, first iteration: 82
Novel confirmations, second iteration: 17
340 total: >3x previously known genes in ~5 person-months
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.
17
Functional mapping: mining integrated networks
[Figure: predicted relationships between genes, with edges shaded from low to high confidence, for chemotaxis and flagellar assembly genes.]
Within one process (chemotaxis): the strength of these relationships indicates how cohesive a process is.
Between two processes (chemotaxis and flagellar assembly): the strength of these relationships indicates how associated two processes are.
20
Functional mapping: Associations among processes
[Figure: network of biological processes: Hydrogen Transport, Electron Transport, Cellular Respiration, Aldehyde Metabolism, Cell Redox Homeostasis, Peptide Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Protein Processing, Negative Regulation of Protein Metabolism, Protein Depolymerization, Organelle Fusion, Organelle Inheritance.]
Edges: associations between processes (moderately strong to very strong)
Nodes: cohesiveness of processes (below baseline, baseline (genomic background), very cohesive)
Borders: data coverage of processes (sparsely covered to well covered)
24
Cross-species knowledge transfer
using functional data
$P(FR_s \mid D_s) \propto P(D_s \mid FR_s)\, P(FR_s)$
$P(FR_s, FR_t)$
$P(FR_s \mid D) = P(FR_s \mid \{FR_{t \neq s}\}, D_s)$
$\propto P(\{FR_{t \neq s}\}, D_s \mid FR_s)\, P(FR_s)$
$= P(FR_s)\, P(D_s \mid FR_s) \prod_{t \neq s} P(FR_t \mid FR_s)$
Pinaki Sarder
TaFTan
25
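The factorization above can be evaluated directly once the individual probabilities are known. The sketch below assumes a binary FR_s and made-up probability values; the function and argument names are hypothetical and do not come from TaFTan.

```python
def cross_species_posterior(prior, lik_data, lik_other):
    """P(FR_s = 1 | D) from the product form of the posterior.

    prior:     P(FR_s = 1)
    lik_data:  (P(D_s | FR_s = 1), P(D_s | FR_s = 0))
    lik_other: one (P(FR_t | FR_s = 1), P(FR_t | FR_s = 0)) per other species t
    """
    yes = prior * lik_data[0]
    no = (1.0 - prior) * lik_data[1]
    for p_given_yes, p_given_no in lik_other:
        yes *= p_given_yes
        no *= p_given_no
    return yes / (yes + no)   # normalize over the two states of FR_s

# Species-specific data only, then with one other species added (made-up numbers).
own_only = cross_species_posterior(0.5, (0.6, 0.4), [])
with_other = cross_species_posterior(0.5, (0.6, 0.4), [(0.8, 0.2)])
```

Each additional species contributes one more factor, so evidence from other organisms can raise or lower the posterior obtained from the species' own data.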
TaFTan: Cross-species knowledge
transfer using functional data
[Plots: log(precision/random) vs. log(recall) for E. coli, P. aeruginosa, B. subtilis, and M. tuberculosis, comparing species-specific data, species’ data excluded, and all species’ data.]
• Important to take advantage of all available data for any one organism
• Important to take advantage of all available data for every organism
• Scalable to dozens of organisms with hundreds of functional datasets
• Currently working on making this more context-specific
26
Outline
1. Data mining: Integrating very large genomic data compendia
2. Metagenomics: Network models of microbial communities
27
So what does all of this have to do with microbial communities?
~2000, gene expression: AML/ALL, survival, mutation; batch effects; functional modules
~2005, SNP genotypes: healthy/diabetes, BMI, M/F; population structure; LD
2010, taxa & orthologs: intervention/perturbation, healthy/IBD, temperature, location, ???; niches & phylogeny
Analysis workflow (p >> n): feature selection, test for correlates, multiple hypothesis correction, confounds/stratification/environment, cross-validate, independent sample, biological story?
30
What’s metagenomics?
Total collection of microorganisms
within a community
Also microbial community or microbiota
Total genomic potential of
a microbial community
Study of uncultured microorganisms
from the environment, which can include
humans or other living hosts
Total biomolecular repertoire
of a microbial community
31
The Human Microbiome Project
• 300 “normal” adults, 18-40
• 16S rDNA + WGS
• 5 sites/18 samples + blood
  • Oral cavity: saliva, tongue, palate, buccal mucosa, gingiva, tonsils, throat, teeth
  • Skin: ears, inner elbows
  • Nasal cavity
  • Gut: stool
  • Vagina: introitus, mid, fornix
• Reference genomes (~200-800)
Hamady, 2009
2006 - ongoing
All healthy subjects;
followup projects in
psoriasis, Crohn’s,
colitis, obesity, acne,
cancer, resistant
infection…
32
What features to test?
Microbiome data
Genomic data
(Reference genomes)
16S reads
Taxa
WGS
reads
Orthologous
clusters
Functional
roles
Pathways/
modules
Pathway
activity
Functional data
(Experimental models)
Binning
Clustering
33
HMP: Data → features
Taxa
16S reads
Genes
(KOs)
Pathways
(KEGGs)
Orthologous
clusters
Pathways/
modules
34
HMP: Body sites
Vanilla linear SVM
Taxa
KOs
KEGGs
35
HMP: Subjects
Taxa
We can tell who you
are by the bugs in
your mouth!
KEGGs
36
HMP: Metabolic reconstruction
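300 subjects, 1-3 visits/subject, 15-18 body sites/visit, 10-20M reads/sample, 100bp reads
WGS reads → BLAST against functional sequence databases (KEGG + MetaCyc; CAZy, TCDB, VFDB, MEROPS…) → Genes (KOs) → Pathways (KEGGs) → Pathways/modules
BLAST → Genes: each read $r$ is divided among its alignments $a(r)$ in proportion to $1 - p_a$
$c(g) = \sum_r \frac{\sum_{a \in a(r)} (1 - p_a)\,[a \to g]}{\sum_{a \in a(r)} (1 - p_a)}$
Genes → Pathways: MinPath (Ye 2009)
Smoothing: Witten-Bell, with $N$ the total count, $T$ the number of genes observed, and $V$ the number of possible genes
$c'(g) = \begin{cases} \dfrac{T N}{(V - T)(N + T)} & c(g) = 0 \\ \dfrac{c(g)\, N}{N + T} & \text{otherwise} \end{cases}$
Gap filling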
37
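The Witten-Bell step can be sketched as follows. This is a minimal illustration assuming every counted gene belongs to the given vocabulary; the function name and the KO labels are hypothetical.

```python
def witten_bell(counts, vocabulary):
    """Smooth gene counts so that unobserved genes get a small nonzero count.

    counts: {gene: observed count}, assumed to contain only genes in vocabulary.
    N = total count, T = number of observed genes, V = vocabulary size.
    """
    n = sum(counts.values())
    t = sum(1 for g in vocabulary if counts.get(g, 0) > 0)
    v = len(vocabulary)
    if t == 0 or t == v:
        # Nothing observed, or nothing unobserved: no mass to redistribute.
        return {g: float(counts.get(g, 0)) for g in vocabulary}
    smoothed = {}
    for g in vocabulary:
        c = counts.get(g, 0)
        if c == 0:
            smoothed[g] = t * n / ((v - t) * (n + t))   # share of reserved mass
        else:
            smoothed[g] = c * n / (n + t)               # discounted observed count
    return smoothed

ko_counts = {"K1": 6, "K2": 2}
smoothed = witten_bell(ko_counts, ["K1", "K2", "K3", "K4"])
```

The smoothed counts still sum to N: observed genes give up a fraction T/(N+T) of their mass, which is shared equally among the V − T unobserved genes.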
HMP: Metabolic reconstruction
Pathway coverage
Pathway abundance
38
HMP: Metabolic reconstruction
[Heatmaps: pathway abundance and pathway coverage, samples by pathways, with pathway groups marked for all body sites (“core”), aerobic body sites, and gastrointestinal body sites.]
39
MetaHIT: Data → features
85 healthy, 15 IBD + 12 healthy, 12 IBD
WGS reads → Taxa (Phymm, Brady 2009)
WGS reads → Genes (KOs) → Pathways (KEGGs) → Pathways/modules
ReBLASTed against KEGG since published data obfuscates read counts
10x bootstrap within training cohort, test on 12+12 as validation
40
MetaHIT: Taxonomic CD biomarkers
Bacteroidetes
Methanomicrobia
Enterobacteriaceae
Firmicutes
Chromatiales
Desulfobacterales
Bradyrhizobiaceae
iTOL
Letunic 2007
Rhodobacteraceae
Oxalobacteraceae
41
MetaHIT: Taxonomic CD biomarkers
Down in CD
Up in CD
42
MetaHIT: Functional CD biomarkers
Down in CD
Up in CD
Growth/replication
Motility
Transporters
Sugar metabolism
43
MetaHIT: KO IBD biomarkers
[Figure: KOs down in IBD vs. up in IBD, grouped as growth/replication, motility, transporters, and sugar metabolism.]
LEfSe: Nicola Segata
44
Metagenomic differential analysis: LEfSe
1. Is there a statistically significant difference?
t-tests, ANOVA, MANOVA, Friedman, Kruskal–Wallis…
2. Is the difference biologically significant?
expert supervision, specific post-hoc tests…
3. How large is the difference?
PCA, LDA, mean difference, class or cluster distance…
LEfSe:
p(ANOVA) < 0.05
pairwise post-hoc Wilcoxon OK
Log(Score(LDA)) = 3.68
45
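The three questions above can be sketched for a single feature as a two-stage filter followed by an effect size. This is a simplified stand-in, not LEfSe itself: it handles two classes only, uses a Wilcoxon rank-sum test (normal approximation, no tie correction) at both stages in place of ANOVA, and reports the absolute mean difference instead of an LDA score. The cohort names and values are made up.

```python
import math
from itertools import product
from statistics import mean

def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation."""
    pooled = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):               # tied values share the mean rank
            ranks[pooled[k][1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    n1, n2 = len(x), len(y)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    z = (u - n1 * n2 / 2.0) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return math.erfc(abs(z) / math.sqrt(2.0))

def lefse_like(class_a, class_b, alpha=0.05):
    """class_a, class_b: {subclass name: [feature values]}.

    Returns the absolute mean difference if the feature passes both tests,
    otherwise None.
    """
    all_a = [v for vals in class_a.values() for v in vals]
    all_b = [v for vals in class_b.values() for v in vals]
    if rank_sum_p(all_a, all_b) >= alpha:        # 1. class-level difference
        return None
    a_higher = mean(all_a) > mean(all_b)
    for sub_a, sub_b in product(class_a.values(), class_b.values()):
        # 2. every subclass pair must agree in direction and be significant
        if (mean(sub_a) > mean(sub_b)) != a_higher:
            return None
        if rank_sum_p(sub_a, sub_b) >= alpha:
            return None
    return abs(mean(all_a) - mean(all_b))        # 3. stand-in effect size

healthy = {"cohort1": [1, 2, 3, 4], "cohort2": [2.5, 3.5, 4.5, 5.5]}
ibd = {"cohort1": [10, 11, 12, 13], "cohort2": [11.5, 12.5, 13.5, 14.5]}
mixed = {"cohort1": [10, 11, 12, 13], "cohort2": [1.5, 2.2, 3.3, 4.4]}

consistent_effect = lefse_like(healthy, ibd)
inconsistent_effect = lefse_like(healthy, mixed)
```

The second comparison is rejected because one subclass does not differ from the healthy samples, which is the kind of inconsistency the pairwise post-hoc step is meant to catch.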
LEfSe: A non-human example
Viromes vs. bacterial metagenomes (Dinsdale 2008)
Metastats (White 2009): p < 0.001
ANOVA: p < 0.05
LEfSe: NO DIFF! / DIFF!
[Panels: high-level functional categories Nucleosides and Nucleotides, Carbohydrates, and Transporters, each comparing microbial vs. viral samples.]
46
Sleipnir: Software for
scalable functional genomics
Massive datasets require efficient
algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
  • Microarray data, interaction data, genes and gene sets, functional catalogs, etc. etc.
• Network communication, parallelization
• Efficient machine learning algorithms
  • Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!
It’s also speedy: microbial data integration computation takes <3hrs.
47
Outline
1. Data mining: Integrating very large genomic data compendia
• Network framework for scalable data integration
• HEFalMp: human data integration
• TaFTan: cross-species knowledge transfer from functional data
• Sleipnir: software for scalable genomic data mining
2. Metagenomics: Network models of microbial communities
• 16S and WGS community metabolic reconstruction
• LEfSe: biologically relevant community differences
48
Thanks!
Pinaki Sarder
Nicola Segata
Olga Troyanskaya
Chris Park
David Hess
Matt Hibbs
Chad Myers
Ana Pop
Aaron Wong
Sarah Fortune
Hilary Coller
Erin Haley
Jacques Izard
Wendy Garrett
Levi Waldron
Larisa Miropolsky
Willythssa Pierre-Louis
Interested? We’re looking
for postdocs!
http://huttenhower.sph.harvard.edu
http://huttenhower.sph.harvard.edu/sleipnir
49
HEFalMp: Predicting human gene function
HEFalMp
51
HEFalMp: Predicting human
genetic interactions
HEFalMp
52
HEFalMp: Analyzing human genomic data
HEFalMp
53
HEFalMp: Understanding human disease
HEFalMp
54
Validating Human Predictions
With Erin Haley, Hilary Coller
Autophagy
5½ of 7 predictions
currently confirmed
Luciferase
ATG5
(Negative control)
(Positive control)
Predicted novel autophagy proteins
LAMP2
RAB11A
Not
Starved
Starved
(Autophagic)
55
Functional Mapping:
Scoring Functional Associations
How can we formalize these relationships?
Any sets of genes G1 and G2 in a network can be compared using four measures:
• Edges between their genes
• Edges within each set
• The background edges incident to each set
• The baseline of all edges in the network
$FA_{G_1,G_2} = \frac{\text{between}(G_1, G_2)}{\text{background}(G_1, G_2)} \cdot \frac{\text{baseline}}{\text{within}(G_1, G_2)}$
Stronger connections between the sets increase association.
Stronger within self-connections or nonspecific background connections decrease association.
56
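A minimal sketch of the FA score on a small, fully weighted network. It assumes each of the four measures is a mean edge weight, that the two gene sets are disjoint, and that a set with a single gene falls back to the baseline for its within term. These are illustrative choices, not definitions taken from the slide.

```python
from itertools import combinations

def mean_weight(pairs, weights):
    """Average edge weight over an iterable of gene pairs."""
    keys = {frozenset(p) for p in pairs}        # de-duplicate unordered pairs
    return sum(weights[k] for k in keys) / len(keys)

def functional_association(g1, g2, genes, weights):
    """FA = (between * baseline) / (background * within) for gene sets g1, g2.

    weights: {frozenset({a, b}): edge weight} for every pair of genes.
    """
    g1, g2 = set(g1), set(g2)
    both = g1 | g2
    baseline = mean_weight(combinations(genes, 2), weights)
    between = mean_weight(((a, b) for a in g1 for b in g2 if a != b), weights)
    inside = list(combinations(sorted(g1), 2)) + list(combinations(sorted(g2), 2))
    within = mean_weight(inside, weights) if inside else baseline
    background = mean_weight(
        ((a, x) for a in both for x in genes if x != a), weights)
    return between * baseline / (background * within)

genes = list("ABCDEF")
uniform = {frozenset(p): 0.5 for p in combinations(genes, 2)}
fa_uniform = functional_association({"A", "B"}, {"C", "D"}, genes, uniform)

# Strengthen only the edges between the two sets.
strong = dict(uniform)
for a in "AB":
    for b in "CD":
        strong[frozenset((a, b))] = 0.9
fa_strong = functional_association({"A", "B"}, {"C", "D"}, genes, strong)
```

With uniform weights every term is the same and the score is 1; raising only the between-set edges pushes the score above 1.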
Functional Mapping:
Bootstrap p-values
• Scoring functional associations is great… how do you interpret an association score?
  – For gene sets of arbitrary sizes?
  – In arbitrary graphs?
  – Each with its own bizarre distribution of edges?
For any graph, compute FA scores for many randomly chosen gene sets of different sizes.
[Plot: histograms of FAs for random sets of 1, 5, 10, and 50 genes.]
Null distribution is approximately normal with mean 1:
$\hat{\mu}_{FA}(G_i, G_j) = 1$
Standard deviation is asymptotic in the sizes of both gene sets; $\hat{\sigma}_{FA}(G_i, G_j)$ is fit from $|G_i|$ and $|G_j|$ with terms $A(|G_i|)$, $B$, and $C(|G_j|)$.
[Plot: null distribution σs for one graph, as a function of |G1| and |G2|.]
$P(FA_{G_1,G_2} \geq x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\, \hat{\sigma}(G_1,G_2)}(x)$
Maps FA scores to p-values for any gene sets and underlying graph.
57
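The mapping from an FA score to a p-value can be sketched with the normal upper tail. In this minimal example the null mean and standard deviation are estimated directly from a handful of made-up null scores rather than from the fitted size-dependent curve, and the function names are hypothetical.

```python
import math
import statistics

def null_parameters(null_scores):
    """Mean and standard deviation of FA scores from random gene sets."""
    return statistics.mean(null_scores), statistics.stdev(null_scores)

def fa_p_value(x, mu, sigma):
    """P(FA >= x) = 1 - Phi_{mu, sigma}(x) under a normal null."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

# Made-up FA scores for random gene sets of one fixed pair of sizes.
null_scores = [0.8, 0.9, 1.0, 1.1, 1.2]
mu, sigma = null_parameters(null_scores)

p_at_mean = fa_p_value(mu, mu, sigma)   # a score at the null mean
p_strong = fa_p_value(2.0, mu, sigma)   # a score far above the null
```

In practice the null scores would come from many random gene sets drawn from the same graph, with one set of parameters per pair of set sizes.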
Functional maps for cross-species
knowledge transfer
[Diagram: genes from different species (ECG1, ECG2; BSG1; ECG3, BSG2; …) are grouped into orthologous clusters (O1: G1, G2, G3; O2: G4; O3: G6; …), and a network over genes G1 to G17 is mapped onto clusters O1 to O9.]
58
Functional maps for functional metagenomics
GOS 4441599.3
Hypersaline Lagoon, Ecuador
KEGG Pathways
Organisms
Mapping organisms
into phyla
Env.
+
Integrated functional
interaction networks
in 27 species
Pathogens
=
Mapping genes
into pathways
Mapping pathways
into organisms
59
Functional Maps:
Focused Data Summarization
ACGGTGAACGTACA
GTACAGATTACTAG
GACATTAGGCCGTA
TCCGATACCCGATA
Data integration summarizes an
impossibly huge amount of
experimental data into an
impossibly huge number of
predictions; what next?
60
Functional Maps:
Focused Data Summarization
ACGGTGAACGTACA
GTACAGATTACTAG
GACATTAGGCCGTA
TCCGATACCCGATA
How can a biologist take
advantage of all this data to
study his/her favorite
gene/pathway/disease without
losing information?
Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data
61
Functional maps for cross-species
knowledge transfer
Following up with unsupervised and
partially anchored network alignment
← Precision ↑, Recall ↓
62
LEfSe: A non-human example
Viromes vs. bacterial metagenomes
Metastats (White 2009): p < 0.001
ANOVA: p < 0.05
LEfSe: NO DIFF! / DIFF!
[Panels: high-level functional categories Nitrogen Metabolism, Carbohydrates, Nucleosides and Nucleotides, and Membrane Transport, each comparing microbial vs. viral samples.]
63