Large scale functional data mining:
What can we find in the data we have?
Curtis Huttenhower
Harvard School of Public Health
Department of Biostatistics
08-12-10
Greatest biological discoveries?
Our job is to create
computational microscopes:
To ask and answer specific
biological questions using
millions of experimental results
2
A computational definition of
functional genomics
Genomic data
Gene
↓
Function
Gene
↓
Gene
Prior knowledge
Data
↓
Function
Function
↓
Function
3
A framework for functional genomics
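[Figure: each dataset assigns a score to every gene pair. Known related pairs (+), e.g. G1-G4 and G2-G9, have correlations 0.9 and 0.7; known unrelated pairs (-), e.g. G3-G7 and G6-G8, have correlations 0.1 and 0.2; the unknown pair G2-G5 has correlation 0.8. A second dataset, labeled Let. / Not let., gives similarity scores (0.8, 0.5, 0.05, 0.1, 0.6). For each dataset, frequency histograms over low-to-high correlation (or dissimilar-to-similar) are shown for + and - pairs. Axes: ← 1Ks datasets, 100Ms gene pairs →.]
P(G2-G5|Data) = 0.85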
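The step from per-dataset evidence to a probability such as P(G2-G5|Data) can be illustrated with a naive Bayes combination. This is only a minimal sketch under assumptions: the slide does not spell out the model, and the function name, the 0.5 prior, and the likelihood values below are hypothetical.

```python
def naive_bayes_posterior(prior, likelihoods):
    """Posterior probability that a gene pair is functionally related.

    prior       -- assumed P(related) before seeing any data
    likelihoods -- one (P(d | related), P(d | unrelated)) tuple per dataset,
                   e.g. read off the +/- frequency histograms for the bin
                   that the pair's score falls into
    """
    related, unrelated = prior, 1.0 - prior
    for p_given_related, p_given_unrelated in likelihoods:
        # Naive Bayes: datasets are treated as conditionally independent.
        related *= p_given_related
        unrelated *= p_given_unrelated
    return related / (related + unrelated)


# Hypothetical numbers for one gene pair observed in two datasets.
posterior = naive_bayes_posterior(0.5, [(0.8, 0.2), (0.6, 0.4)])
print(round(posterior, 3))
```

Each additional dataset only multiplies in one more pair of likelihoods, which is why this kind of scheme can scale to thousands of datasets.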
4
Functional network
prediction and analysis
Global interaction network
HEFalMp
Currently includes data from
30,000 human experimental results,
15,000 expression conditions +
15,000 diverse others, analyzed for
200 biological functions and
150 diseases
Carbon metabolism network
Extracellular signaling network
Gut community network
5
Functional network prediction from
diverse microbial data
486 bacterial expression experiments → 876 raw datasets → 310 postprocessed datasets → 304 normalized coexpression networks in 27 species
307 bacterial interaction experiments → 154796 raw interactions → 114786 postprocessed interactions
Integrated functional interaction networks in 15 species
E. coli Integration
← Precision ↑, Recall ↓
6
Cross-species knowledge transfer
using functional data
$P(FR_s \mid D_s) \propto P(D_s \mid FR_s)\,P(FR_s)$
$P(FR_s, FR_t)$
$P(FR_s \mid D) \approx P(FR_s \mid \{FR_{t \neq s}\}, D_s)$
$\propto P(\{FR_{t \neq s}\}, D_s \mid FR_s)\,P(FR_s)$
$\approx P(FR_s)\,P(D_s \mid FR_s) \prod_{t \neq s} P(FR_t \mid FR_s)$
Pinaki Sarder
TaFTan
7
TaFTan: Cross-species knowledge
transfer using functional data
[Figure: log(precision/random) vs. log(recall) for E. coli, P. aeruginosa, B. subtilis, and M. tuberculosis; curves for species-specific data, species' data excluded, and all species' data]
• Important to take advantage of all available data for any one organism
• Important to take advantage of all available data for every organism
• Scalable to dozens of organisms with hundreds of functional datasets
• Currently working on making this more context-specific
8
Meta-analysis for unsupervised
functional data integration
Huttenhower 2006
Hibbs 2007
Evangelou 2007
1 1  

 '  log 
2  1   
z
y e ,i   e  
ye ,i   e   e   e ,i
̂ e   we*,i ye,i
i
we*,i 
 '   '
 '
Simple regression:
All datasets are
equally accurate
Random effects:
Variation within and
among datasets
and interactions
1
se2,i  ˆ 2e
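The transform and the random-effects weighting can be sketched in a few lines. This is a hedged illustration rather than the method behind the slide: the slide does not say how the between-dataset variance is estimated, so the DerSimonian-Laird moment estimator is used here as an assumption, the weights are normalized to sum to one, and all function names are hypothetical.

```python
import math


def fisher_z(rho):
    """Fisher's z-transform of a correlation (equivalent to atanh)."""
    return 0.5 * math.log((1.0 + rho) / (1.0 - rho))


def random_effects_mean(y, s2):
    """Combine one interaction's scores y[i] from several datasets.

    y  -- per-dataset scores for a single gene pair
    s2 -- per-dataset within-dataset variances (all > 0)
    Returns (combined estimate, estimated between-dataset variance).
    """
    k = len(y)
    w = [1.0 / v for v in s2]                      # fixed-effect weights
    sw = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Cochran's Q measures disagreement among datasets.
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, y))
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Assumption: DerSimonian-Laird estimate, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in s2]        # random-effects weights
    mu = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    return mu, tau2


mu, tau2 = random_effects_mean([fisher_z(0.9), fisher_z(0.7)], [0.1, 0.1])
```

When the datasets agree, the estimated between-dataset variance drops to zero and the result reduces to the simple inverse-variance average.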
9
Meta-analysis for unsupervised
functional data integration
Huttenhower 2006
Hibbs 2007
Evangelou 2007
1 1  

 '  log 
2  1   
z
+
 '   '
 '
=
10
So what does all of this
~2000
have to do with
microbial communities ?
AML/ALL
Temperature
DNA damage
Batch
effects
Gene
expression
Functional
modules
11
2010
Intervention/
perturbation
Healthy/IBD
Temperature
Location
???
Biological
story?
Crossvalidate
Taxa &
Orthologs
Niches &
Phylogeny
Independent
sample
Confounds/
stratification/
environment
Test for
correlates
Multiple
hypothesis
correction
Feature
selection
p >> n
12
What features to test?
Microbiome data
Genomic data
(Reference genomes)
16S reads
Taxa
WGS
reads
Orthologous
clusters
Functional
roles
Pathways/
modules
Pathway
activity
Functional data
(Experimental models)
Binning
Clustering
13
MetaHIT: Data → features
85 healthy, 15 IBD +
12 healthy, 12 IBD
Taxa
KO clusters
Phymm
Brady 2009
ReBLASTed against
KEGG since published
data obfuscates read
counts
10x bootstrap within
training cohort, test on
12+12 as validation
WGS
reads
Pathways/
modules
KEGG pathways
14
MetaHIT: Taxonomic CD biomarkers
Bacteroidetes
Methanomicrobia
Enterobacteriaceae
Firmicutes
Chromatiales
Desulfobacterales
Bradyrhizobiaceae
iTOL
Letunic 2007
Rhodobacteraceae
Oxalobacteraceae
15
MetaHIT: Taxonomic CD biomarkers
Down in CD
Up in CD
16
MetaHIT: Functional CD biomarkers
Down in CD
Up in CD
Growth/replication
Motility
Transporters
Sugar metabolism
17
MetaHIT: KO IBD biomarkers
Down in IBD
Up in IBD
Growth/replication
Motility
Transporters
Sugar metabolism
LEfSe
Nicola Segata
18
Metagenomic differential analysis: LEfSe
1. Is there a statistically significant difference?
t-tests, ANOVA, MANOVA, Friedman, Kruskal–Wallis…
2. Is the difference biologically significant?
expert supervision, specific post-hoc tests…
3. How large is the difference?
PCA, LDA, mean difference, class or cluster distance…
LEfSe:
p(ANOVA) < 0.05
pairwise post-hoc Wilcoxon OK
Log(Score(LDA)) = 3.68
19
LEfSe: A non-human example
Viromes vs. bacterial metagenomes
Dinsdale 2008
Metastats (White 2009): p < 0.001
ANOVA:
p < 0.05
LEfSe: NO
DIFF!
DIFF!
Hi-level functional category: Nucleosides and Nucleotides
Hi-level functional category: Carbohydrates
Hi-level functional category: Transporters
Microbial
Viral
20
Sleipnir: Software for
scalable functional genomics
Massive datasets require efficient
algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
  • Microarray data, interaction data, genes and gene sets, functional catalogs, etc.
  • Network communication, parallelization
• Efficient machine learning algorithms
  • Generative (Bayesian) and discriminative (SVM)
• And it's fully documented!
It's also speedy: microbial data integration computation takes <3hrs.
21
Recap
• Network framework for
scalable data integration
• Cross-species knowledge
transfer from functional data
• Unsupervised system for
data mining without curated
prior knowledge
TaFTan
Meta-analytic
integration
• Comparative microbiome
analysis by taxa, orthologs,
and pathways
• Sleipnir software for
scalable functional genomics
LEfSe
22
Thanks!
Pinaki Sarder
Nicola Segata
Jacques Izard
Sarah Fortune
Levi Waldron
Larisa Miropolsky
Willythssa Pierre-Louis
Wendy Garrett
http://huttenhower.sph.harvard.edu/sleipnir
23
Predicting Gene Function
Predicted relationships
between genes
Low
Confidence
High
Confidence
Cell cycle genes
25
Predicting Gene Function
Predicted relationships
between genes
Low
Confidence
High
Confidence
These edges provide
a measure of how
likely a gene is to
specifically participate
in the process of
interest.
Cell cycle genes
27
Comprehensive Validation of
Computational Predictions
With David Hess, Amy Caudy
Genomic data Prior knowledge
Computational Predictions of Gene Function
SPELL
bioPIXIE
MEFIT
Hibbs et al 2007
Myers et al 2005
Retraining
New known functions for
correctly predicted genes
Genes predicted to function in
mitochondrion organization
and biogenesis
Laboratory Experiments
Growth
curves
Petite
frequency
Confocal
microscopy
28
Evaluating the Performance of
Computational Predictions
Genes involved in mitochondrion organization and biogenesis
Original GO Annotations: 106
Under-annotations: 135
Novel Confirmations, First Iteration: 82
Novel Confirmations, Second Iteration: 17
340 total: >3x previously known genes in ~5 person-months
29
Evaluating the Performance of
Computational Predictions
Genes involved in mitochondrion organization and biogenesis
[Figure: bar chart with categories Original GO Annotations, Under-annotations, Confirmed Under-annotations, Novel Confirmations (First Iteration), and Novel Confirmations (Second Iteration); value labels 95, 40, 80, 17, 106]
340 total: >3x previously known genes in ~5 person-months
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.
30
Validating Human Predictions
With Erin Haley, Hilary Coller
Autophagy
5½ of 7 predictions
currently confirmed
Luciferase
ATG5
(Negative control)
(Positive control)
Predicted novel autophagy proteins
LAMP2
RAB11A
Not
Starved
Starved
(Autophagic)
31
Functional mapping: mining integrated networks
Predicted relationships
between genes
Low
Confidence
High
Confidence
The strength of these
relationships indicates how
cohesive a process is.
Chemotaxis
32
Functional mapping: mining integrated networks
Predicted relationships
between genes
Low
Confidence
High
Confidence
Chemotaxis
33
Functional mapping: mining integrated networks
Predicted relationships
between genes
Low
Confidence
High
Confidence
The strength of these
relationships indicates how
associated two processes are.
Chemotaxis
Flagellar assembly
34
Functional Mapping:
Scoring Functional Associations
How can we formalize
these relationships?
Any sets of genes G1 and G2
in a network can be compared
using four measures:
• Edges between their genes
• Edges within each set
• The background edges
incident to each set
• The baseline of all edges
in the network
Stronger connections between
the sets increase association.
$FA_{G_1,G_2} = \frac{\mathrm{between}(G_1, G_2)}{\mathrm{background}(G_1, G_2)} \cdot \frac{\mathrm{baseline}}{\mathrm{within}(G_1, G_2)}$
Stronger within self-connections or nonspecific
background connections decrease association.
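The four measures can be combined in a short sketch. It rests on assumptions the slide does not state: each measure is taken as the mean edge weight over the relevant edges, the network is a complete weighted graph stored as a dictionary, and the names `fa_score` and `weights` are hypothetical.

```python
from itertools import combinations


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def fa_score(weights, g1, g2):
    """Functional association between gene sets g1 and g2.

    weights -- dict mapping frozenset({a, b}) to an edge weight for every
               gene pair in the network (a complete weighted graph)
    At least one of the two sets must contain two or more genes,
    otherwise there are no within-set edges to average.
    """
    g1, g2 = set(g1), set(g2)
    union = g1 | g2
    # Edges running from a gene in g1 to a gene in g2.
    between = mean(weights[frozenset((a, b))] for a in g1 for b in g2 if a != b)
    # Edges inside g1 plus edges inside g2.
    within = mean(
        weights[frozenset(pair)]
        for group in (g1, g2)
        for pair in combinations(sorted(group), 2)
    )
    # Every edge touching at least one gene of either set.
    background = mean(w for pair, w in weights.items() if pair & union)
    # Every edge in the network.
    baseline = mean(weights.values())
    return (between / background) * (baseline / within)


genes = "ABCDEF"
weights = {frozenset(p): 0.5 for p in combinations(genes, 2)}
print(fa_score(weights, "AB", "CD"))
```

With uniform edge weights every ratio is one, so the score sits at its null value of 1; raising only the weights between the two sets pushes it above 1.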
35
Functional Mapping:
Bootstrap p-values
For any graph, compute FA scores for many randomly chosen gene sets of different sizes.
• Scoring functional associations is great…
…how do you interpret an association score?
– For gene sets of arbitrary sizes?
– In arbitrary graphs?
– Each with its own bizarre distribution of edges?
Null distribution is approximately normal with mean 1.
$\hat{\mu}_{FA}(G_i, G_j) = 1$
Standard deviation is asymptotic in the sizes of both gene sets.
$\hat{\sigma}_{FA}(G_i, G_j) = \frac{A(|G_i|)\,|G_j| + B}{|G_i|\,C(|G_j|)}$
$P(FA_{G_1,G_2} \geq x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\,\hat{\sigma}(G_1,G_2)}(x)$
Maps FA scores to p-values for any gene sets and underlying graph.
[Figure: histograms of FAs for random sets, by # genes (1, 5, 10, 50)]
[Figure: null distribution σs for one graph, plotted against |G1| and |G2|]
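The mapping from an FA score to a p-value under a normal null with mean 1 can be sketched as follows. One simplification is assumed here: instead of the fitted function of the set sizes shown on the slide, the standard deviation is estimated directly from FA scores of randomly drawn gene sets of one fixed pair of sizes. The function names are hypothetical.

```python
import math


def null_sigma(null_scores, mu=1.0):
    """Spread of FA scores from random gene sets around the null mean."""
    return math.sqrt(sum((s - mu) ** 2 for s in null_scores) / len(null_scores))


def fa_p_value(x, sigma, mu=1.0):
    """Upper-tail p-value P(FA >= x) = 1 - Phi(x) under a normal null."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))


# Hypothetical null scores for random sets of the same sizes as G1 and G2.
sigma = null_sigma([0.8, 1.2, 0.9, 1.1])
p = fa_p_value(1.5, sigma)
```

Because the null is summarized by a mean and a standard deviation, any observed score can be converted without rerunning the random sampling.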
36
Functional Mapping:
Functional Associations Between Processes
Hydrogen
Transport
Electron
Transport
Edges
Associations between processes
Cellular
Respiration
Aldehyde
Metabolism
Very
Strong
Cell Redox
Homeostasis
Peptide
Metabolism
Energy
Reserve
Metabolism
Moderately
Strong
Vacuolar
Protein
Catabolism
Protein
Processing
Negative Regulation
of Protein Metabolism
Nodes
Cohesiveness of processes
Below
Baseline
Baseline
(genomic
background)
Very
Cohesive
Borders
Protein
Depolymerization
Data coverage of processes
Organelle
Fusion
Sparsely
Covered
Well
Covered
Organelle
Inheritance
37
Functional Mapping:
Functional Associations Between Processes
Edges
Associations between processes
Moderately
Strong
Very
Strong
Nodes
Cohesiveness of processes
Below
Baseline
Baseline
(genomic
background)
Very
Cohesive
Borders
Data coverage of processes
Sparsely
Covered
Well
Covered
38
Functional maps for cross-species
knowledge transfer
ECG1, ECG2
BSG1
ECG3, BSG2
…
O1: G1, G2, G3
O2: G4
O3: G6
…
G2
G3
G4
G1
O2
G5
G6
G7
O3
G8
O5
O4
G9
G10
G12
O8
G11
O6
G13
G15
G16
O7
O9
G14
G17
39
Functional maps for functional metagenomics
GOS 4441599.3
Hypersaline Lagoon, Ecuador
KEGG Pathways
Organisms
Mapping organisms
into phyla
Env.
+
Integrated functional
interaction networks
in 27 species
Pathogens
=
Mapping genes
into pathways
Mapping pathways
into organisms
40
Functional Maps:
Focused Data Summarization
ACGGTGAACGTACA
GTACAGATTACTAG
GACATTAGGCCGTA
TCCGATACCCGATA
Data integration summarizes an
impossibly huge amount of
experimental data into an
impossibly huge number of
predictions; what next?
41
Functional Maps:
Focused Data Summarization
ACGGTGAACGTACA
GTACAGATTACTAG
GACATTAGGCCGTA
TCCGATACCCGATA
How can a biologist take
advantage of all this data to
study his/her favorite
gene/pathway/disease without
losing information?
Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data
42
Functional maps for cross-species
knowledge transfer
Following up with unsupervised and
partially anchored network alignment
← Precision ↑, Recall ↓
43
LEfSe: A non-human example
Viromes vs. bacterial metagenomes
Metastats (White 2009): p < 0.001
ANOVA:
p < 0.05
LEfSe: NO
DIFF!
DIFF!
Hi-level functional category: Nitrogen Metabolism
Hi-level functional category: Carbohydrates
Hi-level functional category: Nucleosides and Nucleotides
Membrane Transport
Microbial
Viral
44