Using Bayesian Networks to Analyze Expression Data
N. Friedman, M. Linial, I. Nachman, D. Pe’er
Hebrew University, Jerusalem
Central Dogma
Gene → (transcription) → mRNA → (translation) → Protein
• Cells express different subsets of the genes in different tissues and under different conditions
Microarrays (aka “DNA chips”)
• New technological breakthrough:
• Measure RNA expression levels of thousands of genes in one experiment
• Measure expression on a genomic scale
• Opens up new experimental designs
• Many major labs are using, or will use, this technology in the near future
The Problem
[Figure: data matrix A with axes labeled Experiments and Genes]
Aij: the mRNA level of gene j in experiment i
Goal:
• Learn regulatory/metabolic networks
• Identify causal sources of the biological phenomena of interest
Our Approach
• Characterize statistical relationships between expression patterns of different genes
• Beyond pair-wise interactions
  ◦ Many interactions are explained by intermediate factors
  ◦ Regulation involves combined effects of several gene products
We build on the language of Bayesian networks
Network: Example
Noisy stochastic process:
Example: Pedigree
[Figure: pedigree network with nodes Homer, Marge, Bart, Lisa, Maggie]
• A node represents an individual’s genotype
• Modeling assumptions:
  ◦ Ancestors can affect descendants' genotype only by passing genetic material through intermediate generations
Network Structure
Generalizing to DAGs:
• A child is conditionally independent of its non-descendants, given the value of its parents
• Often a natural assumption for causal processes
  ◦ if we believe that we capture the relevant state of each intermediate stage
[Figure: DAG with nodes Y1, Y2, X, annotated Ancestor, Parent, Non-descendant, Descendant]
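The independence statement attached to each node can be read directly off the graph. The sketch below is only an illustration: the DAG and its node names are hypothetical, not the one drawn on the slide. For a given node it lists the non-descendants that become independent of it once its parents are known.

```python
# Hypothetical DAG, stored as a map from each node to its children.
CHILDREN = {
    "Ancestor": ["Parent"],
    "Parent": ["X", "Sibling"],
    "X": ["Descendant"],
    "Sibling": [],
    "Descendant": [],
}


def descendants(children, node):
    """All nodes reachable from `node` by following child links."""
    seen = set()
    stack = list(children[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children[current])
    return seen


def parents(children, node):
    """All nodes that list `node` as a child."""
    return {p for p, kids in children.items() if node in kids}


def markov_statement(children, node):
    """Return (S, Pa) such that `node` is independent of S given its parents Pa."""
    pa = parents(children, node)
    others = set(children) - descendants(children, node) - {node} - pa
    return others, pa


independent_of, given = markov_statement(CHILDREN, "X")
```

In this toy graph, X is independent of Ancestor and Sibling once Parent is known.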
Local Probabilities
• Associated with each variable Xi is a conditional probability distribution P(Xi | Pai)
• Discrete variables: multinomial distribution
    X    P(Y | X)
    x    0.9  0.1
    x̄    0.3  0.7
• Continuous variables:
  ◦ Choice: for example, linear Gaussian
[Figure: X → Y, with a plot of P(Y | X)]
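A minimal sketch of the two kinds of local model, assuming a single parent X and a single child Y. The table rows are the ones on the slide. The key names, function names and the linear Gaussian parameters are hypothetical and chosen only for illustration.

```python
import random

# Rows of P(Y | X) as read off the slide. The keys are hypothetical labels for
# the two values of X; each list holds the probabilities of Y's two values.
CPT_Y_GIVEN_X = {
    "x": [0.9, 0.1],
    "x_bar": [0.3, 0.7],
}


def sample_discrete_child(parent_value, rng):
    """Draw Y (index 0 or 1) from the multinomial row selected by the parent value."""
    return rng.choices([0, 1], weights=CPT_Y_GIVEN_X[parent_value], k=1)[0]


def sample_linear_gaussian_child(x, intercept, slope, sigma, rng):
    """Draw Y ~ Normal(intercept + slope * x, sigma^2): the mean is linear in the parent."""
    return rng.gauss(intercept + slope * x, sigma)


rng = random.Random(0)
draws = [sample_discrete_child("x", rng) for _ in range(10000)]
fraction_first_value = draws.count(0) / len(draws)  # should be close to 0.9
```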
Bayesian Network Semantics
[Figure: example network over the variables B, E, R, A, C]
Qualitative part (DAG specifies conditional independence statements) + Quantitative part (local probability models) = Unique joint distribution over domain
• Compact & efficient representation:
  ◦ k parents: O(2^k n) vs. O(2^n) params
  ◦ parameters pertain to local interactions
P(C,A,R,E,B) = P(B)*P(E|B)*P(R|E,B)*P(A|R,B,E)*P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B)*P(E) * P(R|E) * P(A|B,E) * P(C|A)
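The saving from the second factorization can be checked by counting parameters. The sketch below assumes all five variables are binary and takes the parent sets from the second factorization, P(B) P(E) P(R|E) P(A|B,E) P(C|A). The probability values in `CPTS` are invented for illustration only.

```python
from itertools import product

# Parent sets read off the factorization P(B) P(E) P(R|E) P(A|B,E) P(C|A).
PARENTS = {"B": [], "E": [], "R": ["E"], "A": ["B", "E"], "C": ["A"]}

# P(var = 1 | parent values), keyed by the tuple of parent values.
# These numbers are made up; they are not from the slides.
CPTS = {
    "B": {(): 0.01},
    "E": {(): 0.02},
    "R": {(0,): 0.001, (1,): 0.9},
    "A": {(0, 0): 0.001, (0, 1): 0.3, (1, 0): 0.9, (1, 1): 0.95},
    "C": {(0,): 0.05, (1,): 0.7},
}


def factored_parameters(parents, arity=2):
    """Free parameters of the network: (arity - 1) per parent configuration."""
    return sum((arity - 1) * arity ** len(pa) for pa in parents.values())


def full_joint_parameters(n, arity=2):
    """Free parameters of an unrestricted joint table over n variables."""
    return arity ** n - 1


def joint_probability(assignment):
    """Multiply the local conditional probabilities along the factorization."""
    p = 1.0
    for var, pa in PARENTS.items():
        p_one = CPTS[var][tuple(assignment[q] for q in pa)]
        p *= p_one if assignment[var] == 1 else 1.0 - p_one
    return p


total = sum(
    joint_probability(dict(zip(PARENTS, values)))
    for values in product([0, 1], repeat=len(PARENTS))
)
```

With binary variables the factored form needs 10 free parameters against 31 for the full joint table, and the products still sum to 1 over all 32 assignments.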
Why Bayesian Networks?
Bayesian Networks:
• Flexible representation of dependency structure of multivariate distributions
• Natural for modeling processes with local interactions
Learning of Bayesian Networks
• Can learn dependencies from observations
• Handles stochastic processes:
  ◦ “true” stochastic behavior
  ◦ noise in measurements
Modeling Regulatory Interactions
Variables of interest:
• Expression levels of genes
• Concentration levels of proteins (proteomics!)
• Exogenous variables: nutrient levels, metabolite levels, temperature
• Phenotype information
• …
Bayesian Network Structure:
• Capture dependencies among these variables
Examples
• Measured expression level of each gene → Random variables
• Gene interaction → Probabilistic dependencies
Interactions are represented by a graph:
• Each gene is represented by a node in the graph
• Edges between the nodes represent direct dependency
[Figure: example graphs over nodes A, B, and X]
More Complex Examples
• Dependencies can be mediated through other nodes
[Figure: two three-node graphs over A, B, C, labeled “Common cause” and “Intermediate gene”]
• Common effects can imply conditional dependence
[Figure: three-node graph over A, B, C]
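The common-effect case can be verified by brute-force enumeration. The sketch below assumes a toy network A → C ← B in which A and B are independent fair coins and C = A OR B. This specific model is an assumption made for illustration and is not taken from the slides.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)
STATES = list(product([0, 1], repeat=3))  # all (a, b, c) assignments


def joint(a, b, c):
    """P(A=a, B=b, C=c) for the toy network A -> C <- B with C = A OR B."""
    return HALF * HALF * (1 if c == (a | b) else 0)


def conditional(event, given):
    """P(event | given), computed by summing the joint over matching states."""
    numerator = sum(joint(*s) for s in STATES if event(*s) and given(*s))
    denominator = sum(joint(*s) for s in STATES if given(*s))
    return numerator / denominator


def a_is_on(a, b, c):
    return a == 1


p_a = conditional(a_is_on, lambda a, b, c: True)
p_a_given_b = conditional(a_is_on, lambda a, b, c: b == 1)
p_a_given_c = conditional(a_is_on, lambda a, b, c: c == 1)
p_a_given_b_c = conditional(a_is_on, lambda a, b, c: b == 1 and c == 1)
```

Here `p_a` equals `p_a_given_b`, so A and B are marginally independent, while `p_a_given_c` differs from `p_a_given_b_c`: once the common effect C is observed, learning B changes the belief about A.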
Outline of Our Approach
Expression data → Bayesian Network Learning Algorithm → learned network
[Figure: learned network over nodes B, E, R, A, C]
• Use learned network to make predictions about structure of the interactions between genes
Experiment
Data from Spellman et al. (Mol. Bio. of the Cell, 1998)
• Contains 76 samples of the whole yeast genome:
• Different methods for synchronizing cell-cycle in yeast
• Time series at intervals of a few minutes (5-20 min)
• Spellman et al. identified 800 cell-cycle regulated genes.
Methods
• Treat samples as IID (ignoring temporal order)
Experiment 1:
• Discretized into three levels of expression by log(ratio to control): “-” below -0.5, “0” between -0.5 and 0.5, “+” above 0.5
• Learn multinomial probabilities
Experiment 2:
• Learn linear interactions (w/ Gaussian noise)
No prior biological knowledge was used
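A minimal sketch of the two preprocessing and modeling choices, assuming the input values are already log ratios to control. How values exactly at ±0.5 are binned is an assumption, as are the function names. The least-squares fit is a standard way to estimate a single-parent linear Gaussian model and is not claimed to be the authors' exact procedure.

```python
def discretize(log_ratio, threshold=0.5):
    """Map a log expression ratio to -1 ("-"), 0 ("0") or +1 ("+")."""
    if log_ratio < -threshold:
        return -1
    if log_ratio > threshold:
        return 1
    return 0  # assumption: values exactly at the thresholds count as "0"


def fit_linear_gaussian(xs, ys):
    """Least-squares fit of y = intercept + slope * x, plus the residual variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_var = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys)) / n
    return intercept, slope, residual_var


levels = [discretize(v) for v in [-1.2, -0.5, 0.0, 0.4, 0.9]]
intercept, slope, residual_var = fit_linear_gaussian([0, 1, 2, 3], [1, 3, 5, 7])
```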
Network Learned
Challenge: Statistical Significance
Sparse Data
• Small number of samples
• “Flat posterior”: many networks fit the data
Solution
• Estimate confidence in network features
• Two types of features
  ◦ Markov neighbors: X directly interacts with Y
  ◦ Order relations: X is an ancestor of Y
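Confidence Estimates
Bootstrap approach [FGW, UAI99]
• Resample the data D to obtain datasets D1, D2, ..., Dm
• Learn a network Gi from each Di
[Figure: networks over B, E, R, A, C learned from D1, D2, ..., Dm]
Estimate:
$C(f) = \frac{1}{m} \sum_{i=1}^{m} 1\{f \in G_i\}$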
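The estimate C(f) is the fraction of bootstrap networks that contain feature f. The sketch below shows that resampling loop. Note that `toy_learner` is a hypothetical stand-in that links gene pairs with high absolute correlation; it is not a Bayesian network learner and is used here only so the example runs end to end. The gene names X, Y, Z and the synthetic data are likewise invented.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def toy_learner(samples, threshold=0.6):
    """Stand-in for structure learning: the set of strongly correlated gene pairs."""
    genes = sorted(samples[0])
    columns = {g: [s[g] for s in samples] for g in genes}
    return {
        (g, h)
        for g, h in combinations(genes, 2)
        if abs(pearson(columns[g], columns[h])) > threshold
    }


def bootstrap_confidence(samples, learner, m=100, seed=0):
    """C(f) = (1/m) * number of resampled datasets whose learned model contains f."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(m):
        resampled = [rng.choice(samples) for _ in samples]  # sample with replacement
        for feature in learner(resampled):
            counts[feature] = counts.get(feature, 0) + 1
    return {feature: count / m for feature, count in counts.items()}


# Synthetic data: Y tracks X closely, Z is unrelated noise.
data_rng = random.Random(1)
samples = []
for _ in range(50):
    x = data_rng.gauss(0.0, 1.0)
    samples.append({"X": x, "Y": x + data_rng.gauss(0.0, 0.1), "Z": data_rng.gauss(0.0, 1.0)})

confidence = bootstrap_confidence(samples, toy_learner)
```

Swapping `toy_learner` for a real structure-learning routine that returns a set of features (Markov pairs or order relations) leaves the bootstrap loop unchanged.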
Testing for Significance
• We run our procedure on randomized data where we reshuffled the order of values for each gene
• Histograms of number of Markov features at each confidence level
[Figure: histograms for Original Data and Randomized Data]
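The randomization step can be sketched as follows, assuming each sample is a dictionary from gene name to value. Each gene's values are permuted independently across experiments, which keeps every gene's own distribution but breaks the dependencies between genes. The helper names are hypothetical, and whether the threshold comparison is strict is an assumption.

```python
import random


def shuffle_each_gene(samples, seed=0):
    """Independently permute every gene's values across the experiments."""
    rng = random.Random(seed)
    genes = list(samples[0])
    columns = {}
    for gene in genes:
        column = [s[gene] for s in samples]
        rng.shuffle(column)
        columns[gene] = column
    return [{gene: columns[gene][i] for gene in genes} for i in range(len(samples))]


def count_features_above(confidences, t):
    """Number of features whose confidence is at least t."""
    return sum(1 for c in confidences.values() if c >= t)


toy = [{"g1": 1, "g2": 10}, {"g1": 2, "g2": 20}, {"g1": 3, "g2": 30}]
shuffled = shuffle_each_gene(toy)
```

Running the same learning and confidence procedure on `shuffled` and on the original data, then comparing `count_features_above` at each threshold t, gives the kind of real-versus-random comparison the slide describes.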
Testing for Significance
Markov w/ Gaussian Models
[Figure: number of features with confidence above t, plotted against t from 0 to 1, for Random and Real data; two panels]
Testing for Significance
Markov w/ Multinomial Models
[Figure: number of features with confidence above t, plotted against t from 0 to 1, for Random and Real data; two panels]
Local Map
Finding Key Genes
Key gene: a gene that precedes many other genes
• YLR183C
• MCD1: Mitotic Chromosome Determinant
• RAD27: DNA repair protein
• CLN2: role in cell cycle START
• SRO4: involved in cellular polarization during budding
• YOX1: Homeodomain protein that binds leu-tRNA gene
• POL30: required for DNA replication and repair
• YLR467W
• CDC5
• MSH6: Homolog of the human GTBP protein
• YML119W
• CLN1: role in cell cycle START
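One simple way to turn "precedes many other genes" into a ranking is to count, for each gene, how many order relations it heads with high confidence. This is only a sketch of that idea under stated assumptions: the scoring rule, the threshold, and the gene names G1 to G3 with their confidence values are all hypothetical, and the slides do not specify the authors' actual score.

```python
def rank_key_genes(order_confidence, t=0.8):
    """Rank genes by how many other genes they precede with confidence >= t.

    `order_confidence` maps (ancestor, descendant) pairs to a confidence in [0, 1].
    """
    scores = {}
    for (ancestor, _descendant), confidence in order_confidence.items():
        if confidence >= t:
            scores[ancestor] = scores.get(ancestor, 0) + 1
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up confidences for three placeholder genes.
order_confidence = {
    ("G1", "G2"): 0.90,
    ("G1", "G3"): 0.85,
    ("G2", "G3"): 0.95,
    ("G3", "G1"): 0.10,
}
ranking = rank_key_genes(order_confidence)
```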
Future Work
• Finding suitable local distribution models
• Correct handling of hidden variables
  ◦ Can we recognize hidden causes of coordinated regulation events?
• Incorporating prior knowledge
  ◦ Incorporate large mass of biological knowledge, and insight from sequence/structure databases
• Abstraction
  ◦ Combine with cluster analysis of higher confidence conclusions