Genomic Data Integration
Curtis Huttenhower
Harvard School of Public Health
Department of Biostatistics
07-10-10
A Definition of Integrative Data Mining
[Figure: genomic data and prior knowledge, with four mappings: Gene → Function, Gene → Gene, Data → Function, Function → Function]
2
Machine Learning for Data Integration
[Figure: labeled gene pairs with per-dataset scores. Positive examples (+): G1-G4 = 0.9, G2-G9 = 0.7; negative examples (-): G3-G7 = 0.1, G6-G8 = 0.2; unlabeled pair (?): G2-G5 = 0.8, giving P(G2-G5|Data) = 0.85. Histograms show the frequency of low vs. high correlation, dissimilar vs. similar, and not colocalized vs. colocalized values for positive and negative pairs (further example scores: 0.8, 0.5, 0.05, 0.1, 0.6). Scale: 1Ks of datasets by 100Ms of gene pairs.]
3
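The integration idea on this slide can be sketched as a naive Bayes classifier over discretized dataset scores. This is a minimal illustration, not the speaker's implementation: the function names, the three-bin discretization, and the pseudocount are assumptions, and the single "coexpression" dataset simply reuses the example scores from the slide.

```python
import bisect

def discretize(score, edges=(1 / 3, 2 / 3)):
    """Map a score in [0, 1] to a bin index (0 = low, 1 = medium, 2 = high)."""
    return bisect.bisect_right(edges, score)

def train(datasets, gold, n_bins=3, pseudocount=1.0):
    """Estimate the prior P(related) and, per dataset, P(bin | class).

    datasets: {dataset name: {gene pair: score}}   (hypothetical layout)
    gold:     {gene pair: True if functionally related, else False}
    """
    prior = sum(1 for related in gold.values() if related) / len(gold)
    tables = {}
    for name, scores in datasets.items():
        counts = {True: [pseudocount] * n_bins, False: [pseudocount] * n_bins}
        for pair, related in gold.items():
            if pair in scores:
                counts[related][discretize(scores[pair])] += 1
        tables[name] = {
            cls: [c / sum(row) for c in row] for cls, row in counts.items()
        }
    return prior, tables

def posterior(pair, datasets, prior, tables):
    """P(related | data), treating datasets as conditionally independent."""
    p_pos, p_neg = prior, 1.0 - prior
    for name, scores in datasets.items():
        if pair in scores:  # datasets lacking this pair are skipped
            b = discretize(scores[pair])
            p_pos *= tables[name][True][b]
            p_neg *= tables[name][False][b]
    return p_pos / (p_pos + p_neg)

# Example scores taken from the slide; one dataset only, for brevity.
gold = {("G1", "G4"): True, ("G2", "G9"): True,
        ("G3", "G7"): False, ("G6", "G8"): False}
datasets = {"coexpression": {("G1", "G4"): 0.9, ("G2", "G9"): 0.7,
                             ("G3", "G7"): 0.1, ("G6", "G8"): 0.2,
                             ("G2", "G5"): 0.8}}
prior, tables = train(datasets, gold)
p_g2_g5 = posterior(("G2", "G5"), datasets, prior, tables)
```

With a single toy dataset the posterior will not match the slide's 0.85; in practice each additional dataset contributes another likelihood term to the product.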
Machine Learning for Data Integration
Functional Relationship
Jansen 2003; Troyanskaya 2003
Golub 1999; Butte 2000; Whitfield 2002; Hansen 1998
4
Alternative Data Integration Frameworks
Lee 2004
Lanckriet 2004
Aerts 2006
5
Functional Networks
Global interaction network
string-db.org
function.princeton.edu/hefalmp
funcnet.eu
Metabolism network
Conserved network
homes.esat.kuleuven.be/~bioiuser/endeavour
Kidney network
6
Biological Networks:
Clusters, Hubs, Bottlenecks, and Flow
7
Biological Networks: Network Motifs
Motifs: Bi-fan; Feedback; Positive auto-regulation; Negative auto-regulation; Coherent feed-forward; Incoherent feed-forward
Figure labels: memory; delay; WGD and evolvability; speed + stability; filter; pulse
Software: www.weizmann.ac.il/mcb/UriAlon/groupNetworkMotifSW.html; mavisto.ipk-gatersleben.de; theinf1.informatik.uni-jena.de/~wernicke/motifs
Milo 2002; Alon 2007
8
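As a small illustration of the motifs named on this slide, the sketch below finds feed-forward loops and auto-regulation in a signed, directed regulatory network. It is a brute-force toy, not one of the motif-finding tools linked above; the edge representation (a dictionary from (regulator, target) to +1 or -1) and all gene names are hypothetical.

```python
def feed_forward_loops(sign):
    """Return every (x, y, z) with edges x->y, y->z and x->z, x, y, z distinct."""
    succ = {}
    for a, b in sign:
        succ.setdefault(a, set()).add(b)
    loops = []
    for x, targets in succ.items():
        for y in targets:
            if y == x:
                continue
            for z in succ.get(y, ()):
                if z != x and z != y and z in targets:
                    loops.append((x, y, z))
    return sorted(loops)

def classify(loop, sign):
    """Coherent if the indirect path x->y->z has the same net sign as x->z."""
    x, y, z = loop
    indirect = sign[(x, y)] * sign[(y, z)]
    return "coherent" if indirect == sign[(x, z)] else "incoherent"

def autoregulation(sign):
    """Map each self-regulating gene to 'positive' or 'negative'."""
    return {a: ("positive" if s > 0 else "negative")
            for (a, b), s in sign.items() if a == b}

# Hypothetical network: +1 = activation, -1 = repression.
sign = {
    ("X", "Y"): +1, ("Y", "Z"): +1, ("X", "Z"): +1,   # coherent feed-forward
    ("A", "B"): +1, ("B", "C"): -1, ("A", "C"): +1,   # incoherent feed-forward
    ("C", "C"): -1,                                    # negative auto-regulation
}
loops = feed_forward_loops(sign)
kinds = {loop: classify(loop, sign) for loop in loops}
self_loops = autoregulation(sign)
```

Real motif analyses also compare these counts against randomized networks to decide whether a pattern is over-represented; that step is omitted here.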
Predicting Gene Function
[Figure: predicted relationships between genes, from low confidence to high confidence, around the cell cycle genes]
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
Huttenhower 2009; Rodrigues 2007; Pena-Castillo 2008
11
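One way to turn "how likely a gene is to specifically participate in the process" into a number is to compare a gene's average edge confidence to the known process genes against its average confidence to every gene. The ratio below is an assumed scoring rule for illustration, not necessarily the one used in the cited work; gene labels and edge weights are invented.

```python
def weight(network, a, b, default=0.0):
    """Look up the confidence of the undirected edge a-b (missing edges = default)."""
    return network.get(frozenset((a, b)), default)

def process_specificity(gene, process, genes, network):
    """Mean confidence to process genes divided by mean confidence to all genes."""
    in_process = [weight(network, gene, g) for g in process if g != gene]
    background = [weight(network, gene, g) for g in genes if g != gene]
    mean_in = sum(in_process) / len(in_process)
    mean_bg = sum(background) / len(background)
    return mean_in / mean_bg if mean_bg > 0 else 0.0

# Hypothetical network: A, B, C are known process genes; X and Y are candidates.
genes = ["A", "B", "C", "X", "Y"]
process = {"A", "B", "C"}
network = {
    frozenset(("A", "B")): 0.9, frozenset(("A", "C")): 0.8,
    frozenset(("B", "C")): 0.7,
    frozenset(("X", "A")): 0.8, frozenset(("X", "B")): 0.6,
    frozenset(("X", "C")): 0.7, frozenset(("X", "Y")): 0.1,
    frozenset(("Y", "A")): 0.1, frozenset(("Y", "B")): 0.2,
    frozenset(("Y", "C")): 0.1,
}
score_x = process_specificity("X", process, genes, network)
score_y = process_specificity("Y", process, genes, network)
```

A ratio well above 1 means the gene is connected to the process more strongly than to the genome at large, so X would rank above Y as a candidate here.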
Comprehensive Validation of Computational Predictions
Hess, 2009; Hibbs, 2009
Genomic data and prior knowledge
Computational Predictions of Gene Function: SPELL (Hibbs et al 2007), bioPIXIE (Myers et al 2005), MEFIT
Genes predicted to function in mitochondrion organization and biogenesis
Laboratory Experiments: growth curves, petite frequency, confocal microscopy
New known functions for correctly predicted genes
Retraining
12
Evaluating the Performance of Computational Predictions
Huttenhower, 2009
Genes involved in mitochondrion organization and biogenesis
Original GO Annotations: 106
Under-annotations: 135
Novel Confirmations, First Iteration: 82
Novel Confirmations, Second Iteration: 17
340 total: >3x previously known genes in ~5 person-months
13
Evaluating the Performance of Computational Predictions
Huttenhower, 2009
Genes involved in mitochondrion organization and biogenesis
[Figure: categories Original GO Annotations, Under-annotations, Confirmed Under-annotations, Novel Confirmations (First Iteration), Novel Confirmations (Second Iteration); values as extracted: 95, 40, 80, 17, 106]
340 total: >3x previously known genes in ~5 person-months
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.
14
Functional Mapping: Mining Integrated Networks
[Figure: predicted relationships between genes, from low confidence to high confidence, within Chemotaxis]
The strength of these relationships indicates how cohesive a process is.
15
Functional Mapping: Mining Integrated Networks
[Figure: predicted relationships between genes, from low confidence to high confidence, between Chemotaxis and Flagellar assembly]
The strength of these relationships indicates how associated two processes are.
17
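The two quantities described on these slides, cohesiveness of one process and association between two processes, can be sketched as average edge confidences compared with a genome-wide baseline. Dividing by the baseline is an assumption made for illustration; the gene names are real E. coli genes but every weight is invented.

```python
from itertools import combinations, product

def mean_weight(pairs, network):
    """Average confidence over gene pairs; missing edges count as 0."""
    pairs = list(pairs)
    return sum(network.get(frozenset(p), 0.0) for p in pairs) / len(pairs)

def cohesiveness(process, genes, network):
    """Mean within-process confidence relative to the genomic background."""
    within = mean_weight(combinations(sorted(process), 2), network)
    baseline = mean_weight(combinations(sorted(genes), 2), network)
    return within / baseline

def association(proc_a, proc_b, genes, network):
    """Mean between-process confidence relative to the genomic background."""
    between = mean_weight(
        ((a, b) for a, b in product(sorted(proc_a), sorted(proc_b)) if a != b),
        network,
    )
    baseline = mean_weight(combinations(sorted(genes), 2), network)
    return between / baseline

# Illustrative gene sets; all edge weights below are made up.
chemotaxis = {"cheA", "cheW", "cheY"}
flagellar = {"fliC", "fliG", "flgE"}
other = {"lacZ", "recA"}
genes = chemotaxis | flagellar | other

network = {}
for a, b in combinations(sorted(genes), 2):
    network[frozenset((a, b))] = 0.1            # background confidence
for a, b in combinations(sorted(chemotaxis), 2):
    network[frozenset((a, b))] = 0.9            # within chemotaxis
for a, b in combinations(sorted(flagellar), 2):
    network[frozenset((a, b))] = 0.8            # within flagellar assembly
for a, b in product(chemotaxis, flagellar):
    network[frozenset((a, b))] = 0.6            # between the two processes

coh = cohesiveness(chemotaxis, genes, network)
assoc = association(chemotaxis, flagellar, genes, network)
assoc_bg = association(chemotaxis, other, genes, network)
```

Values above 1 indicate stronger-than-background relationships, so in this toy network chemotaxis is cohesive and associated with flagellar assembly but not with the unrelated genes.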
Functional Mapping: Associations Between Gene Sets
Processes shown: Hydrogen Transport; Electron Transport; Cellular Respiration; Aldehyde Metabolism; Cell Redox Homeostasis; Peptide Metabolism; Protein Processing; Energy Reserve Metabolism; Vacuolar Protein Catabolism; Negative Regulation of Protein Metabolism; Protein Depolymerization; Organelle Fusion; Organelle Inheritance
Edges: associations between processes (Moderately Strong to Very Strong)
Nodes: cohesiveness of processes (Below Baseline; Baseline (genomic background); Very Cohesive)
Borders: data coverage of processes (Sparsely Covered to Well Covered)
20
Functional Maps: Focused Data Summarization
[Figure: raw sequence data, e.g. ACGGTGAACGTACA GTACAGATTACTAG GACATTAGGCCGTA TCCGATACCCGATA]
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?
21
Functional Maps: Focused Data Summarization
How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data
22
Efficient Computation For Biological Discovery
Massive datasets and genomes require efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics (huttenhower.sph.harvard.edu/sleipnir)
• Data types for biological entities
  • Microarray data, interaction data, genes and gene sets, functional catalogs, etc.
  • Network communication, parallelization
• Efficient machine learning algorithms
  • Generative (Bayesian) and discriminative (SVM)
• And it's fully documented!
It's also speedy: microbial data integration computation takes <3hrs.
23
Thanks!
Curtis Huttenhower
Harvard School of Public Health
Department of Biostatistics
http://huttenhower.sph.harvard.edu
Meta-Analysis for Data Integration
Evangelou 2007
$\rho' = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$, $z = \frac{\rho' - \mu'}{\sigma'}$
Simple regression: All datasets are equally accurate
$y_{e,i} = \mu_e + \epsilon_{e,i}$
Random effects: Variation within and among datasets and interactions
$y_{e,i} = \mu_e + \delta_{e,i} + \epsilon_{e,i}$
$\hat{\mu}_e = \sum_i w^*_{e,i}\, y_{e,i}$, $w^*_{e,i} \propto \frac{1}{s^2_{e,i} + \hat{\tau}^2_e}$
27
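The random-effects combination on this slide can be sketched for a single gene pair e measured in several datasets i. The slide does not say how the between-dataset variance is estimated, so the DerSimonian-Laird estimator used here is an assumption, as are the function names, the example correlations, and the use of 1/(n - 3) as the sampling variance of a Fisher-transformed correlation.

```python
import math

def fisher_z(r):
    """Fisher transform of a correlation coefficient r in (-1, 1)."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

def simple_mean(y):
    """Simple model: every dataset is treated as equally accurate."""
    return sum(y) / len(y)

def dersimonian_laird_tau2(y, s2):
    """Between-dataset variance estimate (assumed estimator, see lead-in)."""
    if len(y) < 2:
        return 0.0
    w = [1.0 / v for v in s2]
    mu_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - (len(y) - 1)) / c)

def random_effects(y, s2):
    """Weighted mean with weights proportional to 1 / (s2_i + tau2)."""
    tau2 = dersimonian_laird_tau2(y, s2)
    raw = [1.0 / (v + tau2) for v in s2]
    total = sum(raw)
    w_star = [wi / total for wi in raw]   # normalized so the weights sum to 1
    return sum(wi * yi for wi, yi in zip(w_star, y)), tau2

# Hypothetical correlations for one gene pair in three datasets,
# with the number of conditions in each dataset.
r = [0.9, 0.7, 0.2]
n = [50, 20, 100]
y = [fisher_z(ri) for ri in r]
s2 = [1.0 / (ni - 3) for ni in n]

mu_simple = simple_mean(y)
mu_random, tau2 = random_effects(y, s2)
```

When the datasets agree, the estimated between-dataset variance shrinks to zero and the weights reduce to plain inverse-variance weights; when they disagree, the weights move toward equal weighting.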