BU Bioinformatics seminar

Supervised and unsupervised methods for
large scale genomic data integration
Curtis Huttenhower
Harvard School of Public Health
Department of Biostatistics
03-25-10
Greatest Biological Discoveries?
2
Are We There Yet?
• How much biology is out there?
• How much have we found?
• How fast are we finding it?
Figure panels:
Species Diversity of Environmental Samples (Fierer 2008)
Human Proteins with Annotated Biological Roles (# Distinct Roles; Matt Hibbs)
Age-Adjusted Citation Rates for Major Sequencing Projects
3
Are We There Yet?
• How much biology is out there? Lots!
• How much have we found? Not nearly all
• How fast are we finding it? Not fast enough
Our job is to create computational microscopes: To ask and answer specific biomedical questions using millions of experimental results
Figure panels:
Species Diversity of Environmental Samples (Fierer 2008)
Human Proteins with Annotated Biological Roles (# Distinct Roles; Matt Hibbs)
Age-Adjusted Cost per Citation for Major Sequencing Projects
4
Outline
1. Big picture: Algorithms for mining genome-scale datasets
2. Details: Recovering mechanistic detail from high-throughput data
3. Applications: Microbial communities and functional metagenomics
5
A framework for functional genomics
Figure: gene pairs with known labels (+/-), e.g. G1-G4, G2-G9 (+) and G3-G7, G6-G8 (-), are scored in each dataset (← 1Ks datasets, 100Ms gene pairs →).
Per-dataset panels show the frequency of Low vs. High Correlation, Low vs. High Similarity (Dissim. vs. Similar), and Not coloc. vs. Coloc. for each label.
The combined evidence gives a probability for an unlabeled pair such as G2-G5: P(G2-G5|Data) = 0.85
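One way to read this framework is as a naive Bayes classifier: each dataset's scores are binned, bin frequencies are counted separately for labeled (+) and (-) gene pairs, and the per-dataset likelihoods are multiplied to give a posterior such as P(G2-G5|Data). The Python sketch below follows that reading only. The bin edge, pseudocount, prior, function names, and toy values are illustrative assumptions, not details from the talk.

```python
from bisect import bisect_right

def bin_index(value, edges):
    """Return the index of the bin that `value` falls into."""
    return bisect_right(edges, value)

def learn_likelihoods(scores, labels, edges, pseudocount=1.0):
    """Estimate P(bin | related) and P(bin | unrelated) for one dataset."""
    n_bins = len(edges) + 1
    counts = {True: [pseudocount] * n_bins, False: [pseudocount] * n_bins}
    for score, related in zip(scores, labels):
        counts[related][bin_index(score, edges)] += 1
    return {label: [c / sum(row) for c in row] for label, row in counts.items()}

def posterior(pair_scores, models, edges, prior=0.5):
    """P(related | scores), assuming datasets are conditionally independent."""
    p_pos, p_neg = prior, 1.0 - prior
    for score, model in zip(pair_scores, models):
        b = bin_index(score, edges)
        p_pos *= model[True][b]
        p_neg *= model[False][b]
    return p_pos / (p_pos + p_neg)

# Toy gold standard (hypothetical): two datasets, six labeled gene pairs.
edges = [0.5]  # one cut point: "low" vs. "high" score
labels = [True, True, True, False, False, False]
dataset1 = [0.9, 0.7, 0.8, 0.1, 0.2, 0.6]
dataset2 = [0.8, 0.6, 0.3, 0.05, 0.1, 0.2]
models = [learn_likelihoods(d, labels, edges) for d in (dataset1, dataset2)]

p_high = posterior([0.8, 0.9], models, edges)  # high score in both datasets
p_low = posterior([0.1, 0.1], models, edges)   # low score in both datasets
print(round(p_high, 3), round(p_low, 3))
```

The conditional-independence assumption is what lets thousands of datasets be combined with one pass per dataset; it is a simplification rather than a claim about the data.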
6
Functional network
prediction and analysis
Global interaction network
HEFalMp
Currently includes data from
30,000 human experimental results,
15,000 expression conditions +
15,000 diverse others, analyzed for
200 biological functions and
150 diseases
Metabolism network
Signaling network
Gut community network
7
HEFalMp: Predicting human gene function
HEFalMp
8
HEFalMp: Predicting human
genetic interactions
HEFalMp
9
HEFalMp: Analyzing human genomic data
HEFalMp
10
HEFalMp: Understanding human disease
HEFalMp
11
Meta-analysis for unsupervised
functional data integration
Huttenhower 2006
Hibbs 2007
Evangelou 2007
1 1  

 '  log 
2  1   
z
ye ,i   e  
y e ,i   e   e   e ,i
̂ e   we*,i ye,i
i
we*,i 
 '   '
 '
Simple regression:
All datasets are
equally accurate
Random effects:
Variation within and
among datasets
and interactions
1
se2,i  ˆ 2e
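The two steps on this slide, a Fisher z-transform of each correlation followed by a random-effects combination across datasets, can be sketched in a few lines of Python. This is a sketch under stated assumptions: the between-dataset variance is estimated with the DerSimonian-Laird formula, the weights are normalized to sum to one, and the within-dataset variance of a Fisher z is approximated as 1/(n - 3). None of these choices is confirmed by the slide, and the input numbers are made up.

```python
import math

def fisher_z(rho):
    """Fisher z-transform of a correlation coefficient."""
    return 0.5 * math.log((1.0 + rho) / (1.0 - rho))

def random_effects(y, s2):
    """Combine per-dataset effects y with within-dataset variances s2.

    Returns (pooled estimate, between-dataset variance tau2). tau2 uses the
    DerSimonian-Laird estimator, which is an assumption of this sketch.
    """
    k = len(y)
    w = [1.0 / v for v in s2]                        # fixed-effect weights
    mu_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)               # between-dataset variance
    w_star = [1.0 / (v + tau2) for v in s2]          # random-effects weights
    mu = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    return mu, tau2

# One gene pair measured in four hypothetical datasets.
correlations = [0.62, 0.55, 0.70, 0.10]
sizes = [20, 35, 50, 15]                      # conditions per dataset
y = [fisher_z(r) for r in correlations]
s2 = [1.0 / (n - 3) for n in sizes]           # approximate variance of Fisher z
mu, tau2 = random_effects(y, s2)
pooled_rho = math.tanh(mu)                    # back-transform to a correlation
print(round(pooled_rho, 3), round(tau2, 3))
```

When the datasets agree, tau2 shrinks to zero and the estimate reduces to the fixed-effect (inverse-variance) average; when they disagree, the weights flatten toward an unweighted mean.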
12
Meta-analysis for unsupervised
functional data integration
Huttenhower 2006
Hibbs 2007
Evangelou 2007
1 1  

 '  log 
2  1   
z
+
 '   '
 '
=
Following up with semisupervised approach
13
Functional mapping: mining integrated networks
Predicted relationships
between genes
Low
Confidence
High
Confidence
The strength of these
relationships indicates how
cohesive a process is.
Chemotaxis
14
Functional mapping: mining integrated networks
Predicted relationships
between genes
Low
Confidence
High
Confidence
Chemotaxis
15
Functional mapping: mining integrated networks
Predicted relationships
between genes
Low
Confidence
High
Confidence
The strength of these
relationships indicates how
associated two processes are.
Chemotaxis
Flagellar assembly
16
Functional Mapping:
Scoring Functional Associations
How can we formalize
these relationships?
Any sets of genes G1 and G2
in a network can be compared
using four measures:
• Edges between their genes
• Edges within each set
• The background edges
incident to each set
• The baseline of all edges
in the network
Stronger connections between
the sets increase association.
$FA_{G_1,G_2} = \frac{\mathrm{between}(G_1, G_2)}{\mathrm{background}(G_1, G_2)} \cdot \frac{\mathrm{baseline}}{\mathrm{within}(G_1, G_2)}$
Stronger within self-connections or nonspecific
background connections decrease association.
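A minimal Python sketch of the functional association score follows. It assumes each of the four measures is a mean edge weight over the relevant gene pairs in a fully connected weighted network; the slide names the measures but does not define them at this level, so that definition, the helper names, and the toy network are all assumptions.

```python
from itertools import combinations

def pairs(set_a, set_b):
    """All unordered pairs with one gene from each set (no self-pairs)."""
    return {frozenset((a, b)) for a in set_a for b in set_b if a != b}

def mean_weight(edge_set, weights):
    """Mean edge weight over a set of pairs; missing edges count as 0."""
    return sum(weights.get(e, 0.0) for e in edge_set) / len(edge_set)

def functional_association(g1, g2, weights, genes):
    """FA = (between / background) * (baseline / within).

    Needs at least one within-set pair, so two single-gene sets
    are not handled by this sketch.
    """
    g1, g2, genes = set(g1), set(g2), set(genes)
    between = mean_weight(pairs(g1, g2), weights)
    within = mean_weight(pairs(g1, g1) | pairs(g2, g2), weights)
    background = mean_weight(pairs(g1 | g2, genes), weights)
    baseline = mean_weight(pairs(genes, genes), weights)
    return (between / background) * (baseline / within)

# Toy network (hypothetical weights): A-B and C-D are cohesive sets that
# are also strongly linked to each other; E and F are background genes.
genes = ["A", "B", "C", "D", "E", "F"]
weights = {frozenset(p): 0.1 for p in combinations(genes, 2)}
weights[frozenset(("A", "B"))] = 0.9
weights[frozenset(("C", "D"))] = 0.9
for a in "AB":
    for b in "CD":
        weights[frozenset((a, b))] = 0.8

fa_related = functional_association({"A", "B"}, {"C", "D"}, weights, genes)
fa_unrelated = functional_association({"A", "B"}, {"E", "F"}, weights, genes)
print(round(fa_related, 3), round(fa_unrelated, 3))
```

Dividing by the within and background terms is what keeps large, promiscuous gene sets from scoring as associated with everything.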
17
Functional Mapping:
Bootstrap p-values
• Scoring functional associations is great… …how do you interpret an association score?
– For gene sets of arbitrary sizes?
– In arbitrary graphs?
– Each with its own bizarre distribution of edges?
For any graph, compute FA scores for many randomly chosen gene sets of different sizes.
Null distribution is approximately normal with mean 1:
$\hat{\mu}_{FA}(G_i, G_j) = 1$
Standard deviation is asymptotic in the sizes of both gene sets:
$\hat{\sigma}_{FA}(G_i, G_j) = \frac{A(|G_i|)\,|G_j| + B}{|G_i|\,C(|G_j|)}$
$P(FA_{G_1,G_2} \ge x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\,\hat{\sigma}(G_1,G_2)}(x)$
Maps FA scores to p-values for any gene sets and underlying graph.
Figure panels: histograms of FAs for random sets (# genes: 1, 5, 10, 50); null distribution σs for one graph, plotted against |G1| and |G2|
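The bootstrap idea can be sketched as follows: draw many random gene-set pairs of the sizes in question, score them, fit a normal distribution to the scores, and read the p-value from its upper tail. In this Python sketch the scoring function is a simplified stand-in (mean between-set weight over the network baseline), not the full FA score, and the network is random toy data. The function names, sample count, and seeds are assumptions.

```python
import math
import random
import statistics

def normal_upper_tail(x, mu, sigma):
    """P(X >= x) for X ~ Normal(mu, sigma), i.e. 1 - Phi_{mu,sigma}(x)."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

def null_parameters(score, genes, size1, size2, n_samples=200, seed=0):
    """Estimate the null mean and s.d. of `score` for random gene sets."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        drawn = rng.sample(genes, size1 + size2)   # two disjoint random sets
        samples.append(score(drawn[:size1], drawn[size1:]))
    return statistics.mean(samples), statistics.stdev(samples)

def bootstrap_p_value(observed, score, genes, size1, size2, **kwargs):
    """Upper-tail p-value of an observed score under the fitted normal null."""
    mu, sigma = null_parameters(score, genes, size1, size2, **kwargs)
    return normal_upper_tail(observed, mu, sigma)

# Toy network (hypothetical): 30 genes with uniform random edge weights.
rng = random.Random(1)
genes = [f"g{i}" for i in range(30)]
weights = {
    frozenset((a, b)): rng.random()
    for i, a in enumerate(genes) for b in genes[i + 1:]
}
baseline = statistics.mean(weights.values())

def toy_score(set1, set2):
    """Stand-in for the FA score: mean between-set weight over baseline."""
    between = [weights[frozenset((a, b))] for a in set1 for b in set2]
    return statistics.mean(between) / baseline

mu, sigma = null_parameters(toy_score, genes, 5, 5)
p = bootstrap_p_value(1.5, toy_score, genes, 5, 5)
print(round(mu, 2), round(sigma, 2), p)
```

Fitting σ as a function of the two set sizes, as the slide describes, would avoid re-sampling for every query; this sketch re-samples each time for simplicity.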
18
Functional Mapping:
Functional Associations Between Processes
Edges: Associations between processes (Moderately Strong to Very Strong)
Nodes: Cohesiveness of processes (Below Baseline; Baseline (genomic background); Very Cohesive)
Borders: Data coverage of processes (Sparsely Covered to Well Covered)
Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Aldehyde Metabolism, Cell Redox Homeostasis, Peptide Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Protein Processing, Negative Regulation of Protein Metabolism, Protein Depolymerization, Organelle Fusion, Organelle Inheritance
19
Functional Mapping:
Functional Associations Between Processes
Edges: Associations between processes (Moderately Strong to Very Strong)
Nodes: Cohesiveness of processes (Below Baseline; Baseline (genomic background); Very Cohesive)
Borders: Data coverage of processes (Sparsely Covered to Well Covered)
20
Functional Maps:
Focused Data Summarization
Data integration summarizes an
impossibly huge amount of
experimental data into an
impossibly huge number of
predictions; what next?
21
Functional Maps:
Focused Data Summarization
How can a biologist take
advantage of all this data to
study his/her favorite
gene/pathway/disease without
losing information?
Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data
22
Outline
1. Big picture: Algorithms for mining genome-scale datasets
2. Details: Recovering mechanistic detail from high-throughput data
3. Applications: Microbial communities and functional metagenomics
23
How do functional interactions
become pathways?
• Gene expression
• Physical PPIs
• Genetic interactions
• Colocalization
• Sequence
• Protein domains
• Regulatory binding sites
• …
24
Simultaneous inference of physical, genetic,
regulatory, and functional networks
With Chris Park, Olga Troyanskaya
Functional
genomic data
Functional
interactions
Regulatory
interactions
Post-transcriptional
regulation
Phosphorylation
Metabolic
interactions
Protein
complexes
25
Learning a compendium of interaction networks
Train one SVM per
interaction type
Resolve consistency using
hierarchical Bayes net
26
Learning a compendium of interaction networks
Both presence/absence
and directionality of
interactions are
accurately inferred
AUC (scale from 0.5 to 1.0)
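For reference, AUC (area under the ROC curve) runs from 0.5 for random ranking to 1.0 for perfect ranking of true interactions above non-interactions. The Python sketch below computes it through the pairwise (Mann-Whitney) identity; the function name and toy scores are illustrative, and nothing here reflects how the talk's evaluation was actually run.

```python
def auc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    Equals the probability that a randomly chosen positive is scored
    above a randomly chosen negative, with ties counted as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted confidences for six candidate interactions.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
example_auc = auc(scores, labels)
print(example_auc)
```

The quadratic loop is fine for small examples; for genome-scale pair lists a sort-based rank-sum computes the same quantity in O(n log n).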
27
Using network compendia to predict
complete pathways
With David Hess
Additional 20 novel
synthetic lethality
predictions tested,
14 confirmed
(>100x better than random)
Confirmed
Unconfirmed
28
Interactive aligned network viewer –
coming soon!
Graphle
29
Outline
1. Big picture: Algorithms for mining genome-scale datasets
2. Details: Recovering mechanistic detail from high-throughput data
3. Applications: Microbial communities and functional metagenomics
30
Microbial Communities and
Functional Metagenomics
With Jacques Izard, Wendy Garrett
• Metagenomics: data analysis from environmental samples
– Microflora: environment includes us!
• Pathogen collections of “single” organisms form similar communities
• Another data integration problem
– Must include datasets from multiple organisms
• What questions can we answer?
– What pathways/processes are present/over/underenriched in a newly sequenced microbe/community?
– What’s shared within community X? What’s different? What’s unique?
– How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
– Current functional methods annotate ~50% of synthetic data, <5% of environmental data
31
Data Integration for Microbial Communities
~300 available expression datasets, ~30 species
• Data integration works just as well in microbes as it does in yeast and humans
• We know an awful lot about some microorganisms and almost nothing about others
• Sequence-based and network-based tools for function transfer both work in isolation
• We can use data integration to leverage both and mine out additional biology
Weskamp et al 2004
Flannick et al 2006
Kanehisa et al 2008
Tatusov et al 1997
32
Functional network prediction from
diverse microbial data
Expression data: 486 bacterial expression experiments → 876 raw datasets → 310 postprocessed datasets → 304 normalized coexpression networks in 27 species
Interaction data: 307 bacterial interaction experiments → 154796 raw interactions → 114786 postprocessed interactions
Result: Integrated functional interaction networks in 15 species
Figure: E. coli integration (← Precision ↑, Recall ↓)
33
Functional maps for cross-species
knowledge transfer
Following up with unsupervised and
partially anchored network alignment
← Precision ↑, Recall ↓
34
Efficient Computation For Biological Discovery
Massive datasets and genomes require
efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
  • Microarray data, interaction data, genes and gene sets, functional catalogs, etc. etc.
  • Network communication, parallelization
• Efficient machine learning algorithms
  • Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!
It’s also speedy: microbial data integration computation takes <3hrs.
35
Outline
1. Big picture: Algorithms for mining genome-scale datasets
• Bayesian and unsupervised methods for data integration
• HEFalMp system for human data analysis and integration
• Functional mapping to statistically summarize large data collections
2. Details: Recovering mechanistic detail from high-throughput data
• Simultaneous inference of an interaction network compendium
• Accurate prediction of interaction types and directionality
• Validated pathways and specific individual interactions in yeast
3. Applications: Microbial communities and functional metagenomics
• Integration for microbial communities and metagenomics
• Sleipnir software for efficient large scale data mining
36
Thanks!
Olga Troyanskaya
Chris Park
David Hess
Matt Hibbs
Chad Myers
Ana Pop
Aaron Wong
Jacques Izard
Hilary Coller
Erin Haley
Sarah Fortune
Tracy Rosebrock
Wendy Garrett
http://huttenhower.sph.harvard.edu/sleipnir
http://function.princeton.edu/hefalmp
37
Validating Human Predictions
With Erin Haley, Hilary Coller
Autophagy
5½ of 7 predictions
currently confirmed
Luciferase (Negative control)
ATG5 (Positive control)
Predicted novel autophagy proteins: LAMP2, RAB11A
Conditions: Not Starved, Starved (Autophagic)
39
Functional maps for cross-species
knowledge transfer
Figure: genes from several species (ECG1, ECG2; BSG1; ECG3, BSG2; …) are collected into groups (O1: G1, G2, G3; O2: G4; O3: G6; …) and drawn as a network over genes G1 to G17 and groups O2 to O9.
40
Functional maps for functional metagenomics
GOS 4441599.3
Hypersaline Lagoon, Ecuador
KEGG Pathways
Organisms
Mapping organisms
into phyla
Env.
+
Integrated functional
interaction networks
in 27 species
Pathogens
=
Mapping genes
into pathways
Mapping pathways
into organisms
41
Functional maps for functional metagenomics
Edges: Process association in obesity (Less Coregulated; Baseline (no change); More Coregulated)
Nodes: Process cohesiveness in obesity (Very Downregulated; Baseline (no change); Very Upregulated)
42
Current Work: Molecular Mechanisms
in a Colorectal Cancer Cohort
With Shuji Ogino, Charlie Fuchs
Cohorts: Nurses’ Health Study, Health Professionals Follow-Up Study
LINE-1 Methylation
• Repetitive element making up ~20% of mammalian genomes
• Very easy to assay methylation level (%)
• Good proxy for whole-genome methylation level
Data collected:
~3,100 gastrointestinal subjects
~2,100 cancer mutation tests
~1,200 LINE-1 methylation
~3,800 tissue samples
~1,450 colon cancer samples
~1,150 CpG island methylation
~700 TMA immunohistochemistry
~775 gene expression
DASL Gene Expression
• Gene expression analysis from
paraffin blocks
• Thanks to Todd Golub, Yujin Hoshida
43
Molecular Subtypes of Colorectal Cancer:
Stem Cell Programs and Proliferation
← Genes
Tumors →
C1
C2
C3
C4
Nonnegative matrix factorization
Cell cycle regulation
Chr. 19 rearrangement,
membrane receptors/channels
Angiogenesis, proliferation
HSC signature
Neural/ESC signature
BRCA interactors,
chrom. stability factors
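Nonnegative matrix factorization approximates a nonnegative genes-by-tumors matrix V as a product W·H of two smaller nonnegative matrices, so that each tumor is described by a few additive expression programs. The pure-Python sketch below uses the Lee-Seung multiplicative updates for squared error. The talk does not say which NMF variant or rank selection was used, so the update rule, the rank, and the toy matrix are assumptions for illustration.

```python
import random

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Sum of squared differences between V and W @ H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor nonnegative V (genes x tumors) as W @ H by multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h

# Toy expression matrix (hypothetical): 6 genes x 4 tumors, two programs.
v = [[5, 5, 0, 0], [4, 4, 0, 0], [6, 6, 0, 0],
     [0, 0, 3, 3], [0, 0, 2, 2], [0, 0, 4, 4]]
w0, h0 = nmf(v, 2, n_iter=0)          # random starting point
w, h = nmf(v, 2)                      # after 200 updates from the same start
error_before = frobenius_error(v, w0, h0)
error_after = frobenius_error(v, w, h)
# Assign each tumor to the program with the largest coefficient in H.
clusters = [max(range(2), key=lambda k: h[k][j]) for j in range(4)]
print(error_before, error_after, clusters)
```

Because the updates only multiply by nonnegative ratios, W and H stay nonnegative without any explicit constraint handling.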
44
Molecular Subtypes of Colorectal Cancer:
Stem Cell Programs and Proliferation
Subramanian et al, 2005
Figure labels: Hematopoietic Stem Cell Signature, Neural Stem Cell Signature, Embryonic Stem Cell Signature; CD133 + Bcl-X(L); CD44 + CD166; Chr. 19q; BAX
Counts shown: 166, 799, 945, 195, 678, 18, 146, 7, 8, 325
Note that these regulatory programs do not appear to correspond with demographics or common pathologic markers…
Testing now for correlation with outcome.
Hypotheses?
• Two main pathways to proliferation:
  • HSC program + BAX
  • ESC/NSC program
• Two main pathways to deregulation:
  • Angiogenesis + chrom. instability
  • Cell cycle disruption (MSI?)
45
Epigenetics of Colorectal Cancer:
LINE-1 methylation levels
Lower LINE-1 methylation associates
with poor colon cancer prognosis.
LINE-1 methylation varies
remarkably between individuals…
…but it is highly correlated
within individuals.
LINE-1 Methylation in Multiple Tumors from the Same Subject (Ogino et al, 2008)
Figure: scatter plot of Methylation %, Tumor #2 against Methylation %, Tumor #1 (both axes 30 to 80); ρ = 0.718, p < 0.01
What does it all mean??
What is the biological mechanism linking LINE-1 methylation to colon cancer?
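A within-subject correlation like the one reported here can be computed from paired methylation percentages. The slide does not say whether ρ is a Pearson or a Spearman coefficient; the Python sketch below assumes a Spearman rank correlation, and the paired values are invented for illustration rather than taken from Ogino et al.

```python
import math

def ranks(values):
    """Average ranks (1-based), sharing the mean rank among ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical LINE-1 methylation (%) in two tumors from each of six subjects.
tumor1 = [62, 55, 71, 48, 66, 58]
tumor2 = [60, 57, 69, 52, 50, 61]
rho = spearman(tumor1, tumor2)
print(round(rho, 3))
```

A rank-based coefficient is less sensitive to the few outlying subjects than Pearson's r, which matters when the outliers are themselves of interest.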
46
Epigenetics of Colorectal Cancer:
LINE-1 methylation levels
Lower LINE-1 methylation associates
with poor colon cancer prognosis.
LINE-1 methylation varies
remarkably between individuals…
…but it is highly correlated
within individuals.
LINE-1 Methylation in Multiple Tumors from the Same Subject (Ogino et al, 2008)
Figure: scatter plot of Methylation %, Tumor #2 against Methylation %, Tumor #1 (both axes 30 to 80); ρ = 0.718, p < 0.01
Is anything different about these outliers?
This suggests linkage to a cancer-related pathway.
This suggests a copy number variation.
This suggests a genetic effect.
What is the biological mechanism linking LINE-1 methylation to colon cancer?
47
Epigenetics of Colorectal Cancer:
LINE-1 methylation levels
Preliminary Data
• 10 genes differentially expressed even using simple methods
• 1/3 are from the same family with known GI tumor prognostic value
• 1/3 are X-chromosome testis/cancer-specific antigens
• 1/2 fall in same cytogenetic band, which is also a known CNV hotspot
• HEFalMp links to a cascade of antigens/membrane receptors/TFs
• Cell adhesion p-value ≈ 0, moderate correlation in many cancer arrays
• GSEA pulls out a wide range of proliferation up (E2F), immune response down; need to regress out prognosis correlates
Check back in a couple of months!
What is the biological
mechanism linking LINE-1
methylation to colon cancer?
48