Large scale genomic data integration for functional genomics and metagenomics
Curtis Huttenhower
Harvard School of Public Health
Department of Biostatistics
05-21-10
Greatest Biological Discoveries?
2
Are We There Yet?
• How much biology is out there?
• How much have we found?
• How fast are we finding it?
[Figure: Species Diversity of Environmental Samples]
[Figure: Human Proteins with Annotated Biological Roles (axis: # Distinct Roles)]
[Figure: Age-Adjusted Citation Rates for Major Sequencing Projects]
Credits: Fierer 2008; Matt Hibbs
3
Are We There Yet?
• How much biology is out there? Lots!
• How much have we found? Not nearly all
• How fast are we finding it? Not fast enough
Our job is to create computational microscopes: to ask and answer specific biomedical questions using millions of experimental results
[Figure: Species Diversity of Environmental Samples]
[Figure: Human Proteins with Annotated Biological Roles (axis: # Distinct Roles)]
[Figure: Age-Adjusted Cost per Citation for Major Sequencing Projects]
Credits: Fierer 2008; Matt Hibbs
4
Outline
1. Data mining: Algorithms for integrating very large data compendia
2. Metagenomics: Network models of microbial communities
5
A framework for functional genomics
[Figure: data integration schematic. In each dataset, labeled gene pairs are binned by score (low vs. high correlation, not colocalized vs. colocalized, dissimilar vs. similar) and the frequency of related (+) and unrelated (-) pairs is recorded per bin. Example pairs: G1-G4 (+, 0.9), G2-G9 (+, 0.7), G3-G7 (-, 0.1), G6-G8 (-, 0.2), G2-G5 (?, 0.8). Combining the datasets scores the unlabeled pair: P(G2-G5|Data) = 0.85. Scale: 1Ks of datasets by 100Ms of gene pairs.]
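One way to read the schematic above: each dataset contributes evidence for or against a relationship, and the evidence is combined into a posterior such as P(G2-G5|Data). The following is a minimal sketch, assuming a naive Bayes model in which datasets are conditionally independent given the relationship. The function name, the prior, and the likelihood values are hypothetical and are not taken from the talk.

```python
from math import prod


def naive_bayes_posterior(prior, likelihoods):
    """Posterior P(related | data) for a single gene pair.

    prior: P(related) before looking at any dataset.
    likelihoods: one (P(obs | related), P(obs | unrelated)) tuple per dataset,
        e.g. the bin frequencies learned from labeled (+/-) gene pairs.
    Assumes datasets are conditionally independent (naive Bayes).
    """
    p_related = prior * prod(rel for rel, _ in likelihoods)
    p_unrelated = (1.0 - prior) * prod(unrel for _, unrel in likelihoods)
    return p_related / (p_related + p_unrelated)


# Hypothetical evidence for one gene pair in three datasets
# (e.g. correlation, colocalization, sequence similarity).
evidence = [(0.8, 0.2), (0.6, 0.5), (0.7, 0.1)]
posterior = naive_bayes_posterior(prior=0.05, likelihoods=evidence)
print(f"P(related | data) = {posterior:.3f}")
```

With no datasets the function returns the prior unchanged, and every dataset whose observation is more likely under "related" than "unrelated" pushes the posterior upward.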
6
Functional network prediction and analysis
HEFalMp
Currently includes data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases
[Figure panels: Global interaction network, Metabolism network, Signaling network, Gut community network]
7
HEFalMp: Predicting human gene function
8
HEFalMp: Predicting human genetic interactions
9
HEFalMp: Analyzing human genomic data
10
HEFalMp: Understanding human disease
11
Meta-analysis for unsupervised
functional data integration
Huttenhower 2006
Hibbs 2007
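Evangelou 2007
Fisher's z-transform of each correlation, followed by z-scoring:
$\rho' = \frac{1}{2} \log \frac{1 + \rho}{1 - \rho}$
$z = \frac{\rho' - \mu'}{\sigma'}$
Simple regression: all datasets are equally accurate
$y_{e,i} = \mu_e + \epsilon_{e,i}$
Random effects: variation within and among datasets and interactions
$y_{e,i} = \mu_e + \delta_e + \epsilon_{e,i}$
$\hat{\mu}_e = \sum_i w^*_{e,i} y_{e,i}$
$w^*_{e,i} = \frac{1}{s^2_{e,i} + \hat{\tau}^2_e}$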
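The two steps on this slide (transforming correlations, then combining per-dataset values with variance-based weights) can be sketched as follows. This is a minimal sketch, assuming the standard DerSimonian-Laird estimator for the between-dataset variance and weights normalized to sum to one; the slide does not say which estimator was used, and the function names are hypothetical.

```python
import math


def fisher_z(rho):
    """Fisher's z-transform of a correlation coefficient in (-1, 1)."""
    return 0.5 * math.log((1.0 + rho) / (1.0 - rho))


def random_effects_mean(y, s2):
    """Random-effects combination of one gene pair's scores across datasets.

    y:  per-dataset effect sizes for the pair (e.g. z-scored correlations).
    s2: per-dataset within-dataset variances.
    Returns (combined estimate, estimated between-dataset variance).
    The between-dataset variance uses the DerSimonian-Laird moment
    estimator, which is an assumption of this sketch.
    """
    w = [1.0 / v for v in s2]                      # fixed-effects weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in s2]        # random-effects weights
    mu = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    return mu, tau2


# Hypothetical pair observed in three datasets.
scores = [fisher_z(r) for r in (0.2, 0.6, 0.9)]
mu_hat, tau2_hat = random_effects_mean(scores, s2=[0.05, 0.05, 0.05])
```

When the datasets agree, the estimated between-dataset variance is zero and the result reduces to the fixed-effects (simple regression) estimate. When they disagree, the extra variance term flattens the weights, so no single dataset dominates.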
12
Meta-analysis for unsupervised
functional data integration
Huttenhower 2006
Hibbs 2007
Evangelou 2007
$\rho' = \frac{1}{2} \log \frac{1 + \rho}{1 - \rho}$
$z = \frac{\rho' - \mu'}{\sigma'}$
Following up with semisupervised approach
13
Functional mapping: mining integrated networks
[Figure: predicted relationships between genes, on a scale from low to high confidence; process shown: Chemotaxis]
The strength of these relationships indicates how cohesive a process is.
14
Functional mapping: mining integrated networks
[Figure: predicted relationships between genes, on a scale from low to high confidence; process shown: Chemotaxis]
15
Functional mapping: mining integrated networks
[Figure: predicted relationships between genes, on a scale from low to high confidence; processes shown: Chemotaxis, Flagellar assembly]
The strength of these relationships indicates how associated two processes are.
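The two quantities described on these slides (how cohesive one process is, and how associated two processes are) can be illustrated with simple averages over edge weights. This is a minimal sketch, assuming both scores are ratios of a mean edge weight to the genome-wide background mean; the talk does not give the exact formulas, and the gene names, weights, and function names below are hypothetical.

```python
from itertools import combinations


def mean_weight(network, pairs):
    """Mean predicted edge weight over gene pairs (missing edges count as 0)."""
    weights = [network.get(frozenset(p), 0.0) for p in pairs]
    return sum(weights) / len(weights) if weights else 0.0


def cohesiveness(network, genes, process):
    """Mean weight within a process, relative to the genomic background."""
    within = mean_weight(network, combinations(sorted(process), 2))
    background = mean_weight(network, combinations(sorted(genes), 2))
    return within / background


def association(network, genes, proc_a, proc_b):
    """Mean weight between two processes, relative to the genomic background."""
    between = mean_weight(
        network, [(a, b) for a in sorted(proc_a) for b in sorted(proc_b) if a != b]
    )
    background = mean_weight(network, combinations(sorted(genes), 2))
    return between / background


# Toy network: predicted relationship confidence per unordered gene pair.
genes = {"g1", "g2", "g3", "g4"}
network = {
    frozenset(("g1", "g2")): 0.9,
    frozenset(("g3", "g4")): 0.8,
    frozenset(("g1", "g3")): 0.2,
}
process_a = {"g1", "g2"}
process_b = {"g3", "g4"}
coh_a = cohesiveness(network, genes, process_a)
assoc_ab = association(network, genes, process_a, process_b)
```

A ratio above 1 means the process (or process pair) is more tightly connected than a random set of genes; a ratio below 1 means it falls below the genomic baseline.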
16
Functional Mapping: Functional Associations Between Processes
[Figure: network of biological processes: Hydrogen Transport, Electron Transport, Cellular Respiration, Aldehyde Metabolism, Cell Redox Homeostasis, Peptide Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Protein Processing, Negative Regulation of Protein Metabolism, Protein Depolymerization, Organelle Fusion, Organelle Inheritance]
Edges: associations between processes (moderately strong to very strong)
17
Functional Mapping: Functional Associations Between Processes
[Figure: the same network of biological processes, from Hydrogen Transport to Organelle Inheritance]
Edges: associations between processes (moderately strong to very strong)
Borders: data coverage of processes (sparsely covered to well covered)
18
Functional Mapping: Functional Associations Between Processes
[Figure: the same network of biological processes, from Hydrogen Transport to Organelle Inheritance]
Edges: associations between processes (moderately strong to very strong)
Nodes: cohesiveness of processes (below baseline; baseline, i.e. genomic background; very cohesive)
Borders: data coverage of processes (sparsely covered to well covered)
19
Functional Mapping: Functional Associations Between Processes
Edges: associations between processes (moderately strong to very strong)
Nodes: cohesiveness of processes (below baseline; baseline, i.e. genomic background; very cohesive)
Borders: data coverage of processes (sparsely covered to well covered)
20
Functional Maps: Focused Data Summarization
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?
21
Functional Maps: Focused Data Summarization
How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data
22
Outline
1. Data mining: Algorithms for integrating very large data compendia
2. Metagenomics: Network models of microbial communities
23
Microbial Communities and Functional Metagenomics
With Jacques Izard, Wendy Garrett
• Metagenomics: data analysis from environmental samples
– Microflora: environment includes us!
• Pathogen collections of “single” organisms form similar communities
• Another data integration problem
– Must include datasets from multiple organisms
• What questions can we answer?
– What pathways/processes are present/over/underenriched in a newly sequenced microbe/community?
– What’s shared within community X? What’s different? What’s unique?
– How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
– Current functional methods annotate ~50% of synthetic data, <5% of environmental data
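The first question above (is a pathway overenriched in a newly sequenced microbe or community?) is commonly posed as a hypergeometric test. This is a minimal sketch of that standard test, not necessarily the method used in the talk; the function name and the counts in the example are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_background, n_pathway, n_sample, n_overlap):
    """One-sided hypergeometric p-value for pathway overenrichment.

    n_background: genes in the reference set (e.g. all annotated genes).
    n_pathway:    reference genes annotated to the pathway.
    n_sample:     genes observed in the new microbe or community.
    n_overlap:    observed genes that belong to the pathway.
    Returns P(overlap >= n_overlap) when n_sample genes are drawn at random.
    """
    total = comb(n_background, n_sample)
    upper = min(n_pathway, n_sample)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_sample - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 40 of 200 sampled genes hit a 100-gene pathway
# in a 4000-gene reference set (5 hits would be expected by chance).
p_value = enrichment_pvalue(4000, 100, 200, 40)
```

Underenrichment can be tested the same way by summing the lower tail instead of the upper one.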
24
Data Integration for Microbial Communities
[Figure: ~300 available expression datasets, ~30 species]
• Data integration works just as well in microbes as it does in yeast and humans
• We know an awful lot about some microorganisms and almost nothing about others
• Sequence-based and network-based tools for function transfer both work in isolation
• We can use data integration to leverage both and mine out additional biology
Weskamp et al 2004
Flannick et al 2006
Kanehisa et al 2008
Tatusov et al 1997
25
Functional network prediction from diverse microbial data
• 486 bacterial expression experiments → 876 raw datasets → 310 postprocessed datasets → 304 normalized coexpression networks in 27 species
• 307 bacterial interaction experiments → 154796 raw interactions → 114786 postprocessed interactions
• Integrated functional interaction networks in 15 species
[Figure: E. coli integration, precision vs. recall]
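The precision and recall axes on this slide describe how predicted interactions compare with known ones as the confidence cutoff moves. This is a minimal sketch of that evaluation, assuming a gold standard of known related gene pairs; the function name, gene pairs, and scores are hypothetical.

```python
def precision_recall(scores, positives, threshold):
    """Precision and recall of predicted gene pairs at a confidence cutoff.

    scores:    {gene_pair: predicted confidence}.
    positives: set of gene pairs known to be functionally related.
    threshold: pairs scoring at or above this value count as predictions.
    """
    predicted = {pair for pair, score in scores.items() if score >= threshold}
    true_positives = len(predicted & positives)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(positives) if positives else 0.0
    return precision, recall


# Toy predictions and gold standard.
scores = {("a", "b"): 0.9, ("a", "c"): 0.6, ("b", "c"): 0.3, ("c", "d"): 0.1}
gold = {("a", "b"), ("b", "c")}
for cutoff in (0.8, 0.5, 0.2):
    print(cutoff, precision_recall(scores, gold, cutoff))
```

Lowering the cutoff admits more predictions, which in this toy case raises recall and lowers precision; this is the trade-off the axis arrows indicate.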
26
Functional maps for cross-species knowledge transfer
[Figure: genes from different species are grouped into clusters (ECG1, ECG2; BSG1; ECG3, BSG2; …), the clusters are assigned to groups (O1: G1, G2, G3; O2: G4; O3: G6; …), and a network is drawn over genes G1 to G17 and groups O2 to O9]
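One simple form of cross-species knowledge transfer is to carry predicted relationships from a well-studied species to another through shared gene groups. This is a minimal sketch, assuming a precomputed ortholog mapping and keeping the strongest supporting edge; it is not the network alignment method mentioned in the talk, and all names and weights are hypothetical.

```python
def transfer_edges(source_edges, orthologs):
    """Project a source-species network onto a target species.

    source_edges: {(gene_a, gene_b): weight} in the source species.
    orthologs:    {source_gene: [target-species genes in the same group]}.
    Returns {frozenset((target_a, target_b)): weight}, keeping the
    strongest source edge that supports each target pair.
    """
    transferred = {}
    for (gene_a, gene_b), weight in source_edges.items():
        for target_a in orthologs.get(gene_a, []):
            for target_b in orthologs.get(gene_b, []):
                if target_a == target_b:
                    continue  # both source genes map to the same target gene
                key = frozenset((target_a, target_b))
                transferred[key] = max(weight, transferred.get(key, 0.0))
    return transferred


# Toy example: a1 and a2 share an ortholog (b1); a3 maps to b2.
source = {("a1", "a3"): 0.8, ("a2", "a3"): 0.4, ("a1", "a2"): 0.9}
mapping = {"a1": ["b1"], "a2": ["b1"], "a3": ["b2"]}
target = transfer_edges(source, mapping)
```

Taking the maximum is one design choice among several; averaging the supporting edges would be more conservative when paralogs disagree.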
27
Functional maps for cross-species knowledge transfer
Following up with unsupervised and partially anchored network alignment
[Figure: precision vs. recall]
28
Functional maps for functional metagenomics
[Figure: GOS 4441599.3, Hypersaline Lagoon, Ecuador; panels: KEGG Pathways, Organisms]
Environmental (Env.) and pathogen data + integrated functional interaction networks in 27 species = mapping genes into pathways, mapping pathways into organisms, mapping organisms into phyla
29
Functional maps for functional metagenomics
Edges: process association in obesity (less coregulated; baseline, i.e. no change; more coregulated)
Nodes: process cohesiveness in obesity (very downregulated; baseline, i.e. no change; very upregulated)
30
Efficient Computation For Biological Discovery
Massive datasets and genomes require efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
– Microarray data, interaction data, genes and gene sets, functional catalogs, etc.
– Network communication, parallelization
• Efficient machine learning algorithms
– Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!
It’s also speedy: microbial data integration computation takes <3hrs.
31
Outline
1. Data mining: Algorithms for integrating very large data compendia
• Bayesian and unsupervised methods for data integration
• HEFalMp system for human data analysis and integration
• Functional mapping to statistically summarize large data collections
2. Metagenomics: Network models of microbial communities
• Integration for microbial communities and metagenomics
• Accurate cross-species interactome transfer
• Sleipnir software for efficient large scale data mining
32
Thanks!
Olga Troyanskaya
Chris Park
David Hess
Matt Hibbs
Chad Myers
Ana Pop
Aaron Wong
Jacques Izard
Hilary Coller
Erin Haley
Sarah Fortune
Tracy Rosebrock
Wendy Garrett
http://huttenhower.sph.harvard.edu/sleipnir
http://function.princeton.edu/hefalmp
33