Gene Expression Analysis and Modeling
Guillaume Bourque
Centre de Recherches Mathématiques
Université de Montréal
August 2003
DNA Microarrays
• Experiment design
• Noise reduction
• Normalization
• …
• Data analysis
http://www.sri.com/pharmdisc/cancer_biology/laderoute.html
Outline
• Microarray data analysis techniques
– Clustering: hierarchical and k-means
– SVD and PCA
– SVM – Support Vector Machines
• Gene network modeling
– Boolean networks
– Bayesian models
– Differential equations
Gene Expression Data
Gene Expression Matrix
Given an experiment with m genes and n assays, we produce a matrix X where:
• xij = expression level of the ith gene in the jth assay
• gi = transcriptional response of the ith gene (the ith row of X)
• aj = expression profile of the jth assay (the jth column of X)
Goals of Clustering
• Clustering genes:
– Classify genes by their transcriptional response and get an
idea of how groups of genes are regulated.
– Potentially infer the functions of unknown genes.
• Clustering assays:
– Classify diseased versus normal samples by their
expression profile.
– Track the expression levels at different stages in the cell.
– Study the impact of external stimuli.
Clustering Genes
[Figure: the m × n expression matrix X (m genes, n assays) is converted into an m × m similarity matrix, and the m genes are clustered based on similarity.]
Clustering Steps
• Choose a similarity metric to compare the transcriptional responses or the expression profiles:
– Pearson correlation
– Spearman correlation
– Euclidean distance
– …
• Choose a clustering algorithm:
– Hierarchical
– K-means
– …
Similarity Metric
• The choice of the best metric depends on the normalization procedure.
• One must be cautious of potential pitfalls.
• Correlations: correlation coefficients take values from -1 to 1, with 1 indicating similar behavior, -1 indicating opposite behavior and 0 indicating no direct relation.
• Euclidean distance: d(gi, gj) = sqrt( Σk (xik - xjk)² )
Both metrics are illustrated in the sketch below.
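A quick NumPy illustration of the two metrics above; the gene vectors are made-up stand-ins, not data from the talk.

```python
import numpy as np

# Two hypothetical transcriptional responses across 6 assays.
g1 = np.array([2.1, 1.8, 0.4, -0.3, -1.2, -1.9])
g2 = np.array([1.9, 1.5, 0.6, -0.1, -1.0, -2.2])

# Pearson correlation: 1 = similar behavior, -1 = opposite, 0 = no direct relation.
pearson = np.corrcoef(g1, g2)[0, 1]

# Euclidean distance: sensitive to the magnitude of the profiles, not just their shape.
euclidean = np.linalg.norm(g1 - g2)

print(f"Pearson r = {pearson:.2f}, Euclidean d = {euclidean:.2f}")
```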
Hierarchical Clustering

      g1      g2      g3      g4
g2    0.23
g3    0.00    0.91
g4    0.95    0.56    0.32
g5   -0.63    0.56    0.77   -0.36

• Find the largest value in the similarity matrix.
• Join the clusters together.
• Recompute the matrix and iterate.
[Dendrogram: g1 and g4 are joined first, since sim(g1, g4) = 0.95 is the largest entry.]
Hierarchical Clustering

        g1,g4    g2      g3
g2      0.37
g3      0.16    0.91
g5     -0.52    0.56    0.77

• Find the largest value in the similarity matrix.
• Join the clusters together.
• Recompute the matrix and iterate.
[Dendrogram: g2 and g3 are joined next, since sim(g2, g3) = 0.91 is now the largest entry.]
Hierarchical Clustering

         g1,g4    g2,g3
g2,g3    0.27
g5      -0.52     0.68

• Find the largest value in the similarity matrix.
• Join the clusters together.
• Recompute the similarity matrix and iterate.
[Dendrogram: g5 is joined to (g2, g3) next, since their similarity of 0.68 is now the largest entry.]
Cluster Joining
One of the issues with hierarchical clustering is how to recompute the similarity matrix after joining clusters. Here are 3 common solutions that define different types of hierarchical clustering (see the sketch below):
• Single-link: minimum distance between any member of one cluster and any member of the other cluster.
• Complete-link: maximum distance between any member of one cluster and any member of the other cluster.
• Average-link: average distance over all pairs of members from the two clusters.
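A minimal SciPy sketch of the procedure described above, assuming a correlation-based dissimilarity (1 - Pearson r) and average linkage; the expression matrix is random stand-in data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 genes x 8 assays (stand-in data)

# Pairwise dissimilarity: 1 - Pearson correlation between transcriptional responses.
d = pdist(X, metric="correlation")

# method can be "single", "complete" or "average", matching the
# single-, complete- and average-link rules above.
Z = linkage(d, method="average")

# Cut the dendrogram to obtain, e.g., 2 clusters.
print(fcluster(Z, t=2, criterion="maxclust"))
```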
Interpreting the Results
[Dendrogram over g1, g4, g5, g2, g3: depending on where the tree is cut, we obtain 2 clusters or 3 clusters.]
Clustering Example
[Figure: clustered expression data from Eisen et al. (1998), PNAS, 95(25): 14863-14868.]
K-means Clustering
• Expression profiles are displayed in n-dimensional space (figure: k = 3).
• The first cluster center is picked at random among all the data points.
• The other cluster centers are picked as far as possible from the previous cluster centers.
K-means Clustering
• Associate each data point with the closest cluster center.
• Recompute the cluster centers based on the new clusters.
• Iterate until the clusters remain unchanged (see the sketch below).
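A compact NumPy sketch of the algorithm as just outlined: farthest-point initialization followed by alternating assign/recompute steps until the clusters stop changing. The data matrix is a random stand-in.

```python
import numpy as np

def kmeans(X, k, rng):
    """K-means on the rows of X (e.g., expression profiles)."""
    # First cluster center picked at random among all the data points.
    centers = [X[rng.integers(len(X))]]
    # Other centers picked as far as possible from the previous centers.
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dists)])
    centers = np.array(centers)

    labels = None
    while True:
        # Associate each data point with the closest cluster center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = np.argmin(d, axis=1)
        if np.array_equal(new_labels, labels):  # clusters unchanged: done
            return labels, centers
        labels = new_labels
        # Recompute each cluster center as the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
labels, centers = kmeans(X, k=3, rng=rng)
```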
Singular Value Decomposition
X (m × n) = U (m × n) · S (n × n) · V^T (n × n), with n ≤ m
where X is the gene expression matrix and:
• sk = kth singular value
• vk = kth eigengene (kth row of V^T)
• uk = kth eigenassay (kth column of U)
Singular Value Matrix
S (n × n) = diag(s1, s2, …, sn)
The singular values are organized from largest to smallest:
s1 ≥ s2 ≥ … ≥ sk ≥ … ≥ sn
Why SVD?
• SVD extracts from the gene expression matrix:
– n eigenassays
– n eigengenes
– n singular values
• We can represent the transcriptional response of each gene as a linear combination of the eigengenes.
• We can represent the expression profile of each assay as a linear combination of the eigenassays.
• This allows for dimensionality reduction and for the identification of important components.
SVD and PCA
• There is a direct correspondence between SVD and PCA (Principal Component Analysis) when PCA is calculated on covariance matrices.
• If we normalize X so that its columns have zero mean, the eigengenes are the principal components of the transcriptional responses.
• If we normalize X so that its rows have zero mean, the eigenassays are the principal components of the expression profiles.
• In both cases, the squares of the singular values are proportional to the variances of the principal components (see the check below).
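A short NumPy check of this correspondence on stand-in data: after column-centering, the eigengenes from SVD coincide (up to sign) with the eigenvectors of the covariance matrix, and the squared singular values match the component variances.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))        # 50 genes x 6 assays (stand-in data)
Xc = X - X.mean(axis=0)             # normalize: zero-mean columns

# SVD route: the rows of Vt are the eigengenes.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# PCA route: eigendecomposition of the covariance matrix.
cov = Xc.T @ Xc / (len(Xc) - 1)
evals, evecs = np.linalg.eigh(cov)  # returned in ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]

# Squared singular values are proportional to the principal component variances...
assert np.allclose(s**2 / (len(Xc) - 1), evals)
# ...and the eigengenes are the principal axes, up to sign.
assert np.allclose(np.abs(Vt), np.abs(evecs.T))
```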
SVD Special Property
X = U S V^T
X(r) = U S(r) V^T
where S(r) is S with all but the r largest singular values set to zero (r is the number of non-null rows of S(r)). X(r) is the closest rank-r approximation of X (see the sketch below).
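A NumPy sketch of this property on stand-in data: zeroing all but the r largest singular values yields the best rank-r approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))        # 50 genes x 6 assays (stand-in data)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

r = 2
s_r = np.where(np.arange(len(s)) < r, s, 0.0)  # keep only the r largest values
X_r = U @ np.diag(s_r) @ Vt                    # closest rank-r approximation of X

# Fraction of the total variance captured by the first r components.
print("captured:", (s[:r] ** 2).sum() / (s ** 2).sum())
print("rank:", np.linalg.matrix_rank(X_r))     # -> r
```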
Applications of SVD
• Detects redundancies and allows for the representation of the data with a minimal set of essential features (components). These features can themselves represent signals (e.g. the cell cycle).
• Data visualization. SVD can identify subspaces that capture most of the variance in the data, which allows for the visualization of high-dimensional data in a 1-, 2- or 3-dimensional subspace.
• Signal extraction from noisy data.
Essential Features
[Figure: from Alter et al. (2000), PNAS, 97(18): 10101-10106.]
Data Visualization
[Figure: from Yeung and Ruzzo (2001), Bioinformatics, 17(9): 763-774.]
Support Vector Machines (SVM)
• Instead of trying to identify clusters directly in the data, we assume the genes are already pre-clustered into different classes. The goal is to find a model that best predicts these classes.
• We need to find the hyperplane that best divides the data points.
• We must do so while minimizing the error rate of the predictions (see the sketch below).
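A minimal scikit-learn sketch of this idea on made-up data and labels: train a linear SVM on pre-classified gene profiles, then predict the class of a new gene.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in data: 40 genes x 8 assays, with two pre-assigned classes.
X = np.vstack([rng.normal(-1, 1, size=(20, 8)),
               rng.normal(+1, 1, size=(20, 8))])
y = np.array([0] * 20 + [1] * 20)

# Linear kernel: find the separating hyperplane with the maximum margin;
# C trades off margin width against misclassified training points.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.predict(rng.normal(+1, 1, size=(1, 8))))  # most likely class 1
```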
References
• Clustering
– DeRisi et al. (1997), Science, 278(5338): 680-686.
– Eisen et al. (1998), PNAS, 95(25): 14863-14868.
• SVD and PCA
– Alter et al. (2000), PNAS, 97(18): 10101-10106.
– Holter et al. (2000), PNAS, 97(15): 8409-8414.
– Yeung and Ruzzo (2001), Bioinformatics, 17(9): 763-774.
– Wall et al. (2003), A Practical Approach to Microarray Data Analysis, Chapter 5.
• SVM
– Brown et al. (2000), PNAS, 97(1): 262-267.
Outline
• Microarray data analysis techniques
– Clustering: hierarchical and k-means
– SVD and PCA
– SVM – Support Vector Machines
• Gene network modeling
– Boolean networks
– Bayesian models
– Differential equations
Problem
[Figure: time-series expression data on the left; on the right, the unknown gene network over x1, x2, x3, x4 with signed (+/-) interactions. Can the network be inferred from the time series?]
Boolean Networks
• Genes are assumed to be ON or OFF.
• At any given time, combining the gene states
gives a gene activity pattern (GAP).
• Given a GAP at time t, a deterministic
function (a set of logical rules) provides the
GAP at time t + 1.
• GAPs can be classified into attractor and
transient states.
Boolean Network Example
The state of each gene at time t + 1 is a logical function of the gene states at time t: here x1 is updated with an OR rule, x2 with a NOR rule and x3 with a NAND rule.

t    x1   x2   x3
0     1    1    1
1     1    0    0
2     0    0    1
3     1    0    1
4     1    0    0

The GAP at t = 0 is transient; the GAPs at t = 1, 2, 3 form an attractor cycle (the state at t = 4 repeats the state at t = 1). A simulation sketch of this network follows below.
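A small Python sketch of this network. The slide gives only the rule types (OR, NOR, NAND), so the exact wiring below (x1 ← OR(x2, x3), x2 ← NOR(x1, x3), x3 ← NAND(x1, x3)) is an assumption, chosen because it reproduces the trajectory in the table.

```python
def step(state):
    # Assumed wiring; only the OR/NOR/NAND rule types come from the slide.
    x1, x2, x3 = state
    return (
        x2 | x3,        # x1(t+1) = OR(x2, x3)
        1 - (x1 | x3),  # x2(t+1) = NOR(x1, x3)
        1 - (x1 & x3),  # x3(t+1) = NAND(x1, x3)
    )

state = (1, 1, 1)         # gene activity pattern (GAP) at t = 0
seen = {}
t = 0
while state not in seen:  # iterate until a GAP repeats
    seen[state] = t
    print(t, state)
    state = step(state)
    t += 1
# GAPs visited before the repeat point are transient; the rest form an attractor.
print(f"attractor cycle of length {t - seen[state]} entered at t = {seen[state]}")
```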
State Space
[Figure: state-transition diagram generated using the program DDLab.]
Wuensche, A. (1998), Proceedings of Complex Systems '98.
Boolean Network Example
[Figure: a Boolean network built from AND, NAND and NOT rules.]
Shmulevich et al. (2002), Bioinformatics, 18(2): 261-274.
Issues with Boolean Networks
• Gene trajectories are continuous, and modeling them as ON/OFF might be inadequate.
• A deterministic set of logical rules forces a very stringent model:
– It doesn't allow for external input.
– It is very susceptible to noise.
• Probabilistic Boolean Networks aim at fixing some of these issues by combining multiple sets of rules (they are related to Bayesian networks).
Threshold(s)
[Figure: a continuous gene trajectory is discretized into ON/OFF states by applying one or more thresholds.]
Bayesian Networks
• A gene regulatory network is represented by a directed acyclic graph:
– Vertices correspond to genes.
– Edges correspond to direct influence or interaction.
• For each gene xi, a conditional distribution p(xi | parents(xi)) is defined.
• The graph and the conditional distributions uniquely specify the joint probability distribution.
Bayesian Network Example
[Graph: x1 → x4, x2 → x3, x2 → x4, x4 → x5.]
Conditional distributions: p(x1), p(x2), p(x3 | x2), p(x4 | x1, x2), p(x5 | x4).
By the chain rule:
p(X) = p(x1) p(x2 | x1) p(x3 | x1, x2) p(x4 | x1, x2, x3) p(x5 | x1, x2, x3, x4)
With the independencies encoded by the graph, this simplifies to:
p(X) = p(x1) p(x2) p(x3 | x2) p(x4 | x1, x2) p(x5 | x4)
A small numerical sketch follows below.
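A tiny Python sketch of this factorization for binary genes; all the conditional probability values are made up for illustration.

```python
from itertools import product

# Made-up conditional probability tables: P(gene = 1 | parents).
p_x1 = 0.6
p_x2 = 0.3
p_x3 = {0: 0.2, 1: 0.8}                 # keyed by x2
p_x4 = {(0, 0): 0.1, (0, 1): 0.5,
        (1, 0): 0.6, (1, 1): 0.9}       # keyed by (x1, x2)
p_x5 = {0: 0.3, 1: 0.7}                 # keyed by x4

def bern(p, x):
    """P(X = x) for a binary variable with P(X = 1) = p."""
    return p if x == 1 else 1 - p

def joint(x1, x2, x3, x4, x5):
    # p(X) = p(x1) p(x2) p(x3|x2) p(x4|x1,x2) p(x5|x4)
    return (bern(p_x1, x1) * bern(p_x2, x2) * bern(p_x3[x2], x3)
            * bern(p_x4[(x1, x2)], x4) * bern(p_x5[x4], x5))

# Sanity check: the joint distribution sums to 1 over all 2^5 states.
print(sum(joint(*s) for s in product([0, 1], repeat=5)))
```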
Learning Bayesian Models
• Using gene expression data, the goal is to find the Bayesian network that best matches the data.
• Recovering the optimal conditional probability distributions when the graph is known is "easy".
• Recovering the structure of the graph is NP-hard.
• But good statistics are available:
– What is the likelihood of a specific assignment?
– What is the distribution of xi given xj?
– …
Issues with Bayesian Models
• Computationally intensive.
• Requires lots of data.
• Does not allow for feedback loops, which are known to play an important role in biological networks.
• Does not make use of the temporal aspect of the data.
• Dynamic Bayesian Networks aim at solving some of these issues, but they require even more data.
Differential Equations
• Typically uses linear differential equations to model the gene trajectories:
dxi(t)/dt = ai,0 + ai,1 x1(t) + ai,2 x2(t) + … + ai,n xn(t)
• Several reasons for that choice:
– a lower number of parameters implies that we are less likely to overfit the data
– still sufficient to model complex interactions between the genes
Small Network Example
[Figure: the network over x1, x2, x3, x4 with signed (+/-) edges.]
dx1(t)/dt = 0.491 - 0.248 x1(t)
dx2(t)/dt = -0.473 x3(t) + 0.374 x4(t)
dx3(t)/dt = -0.427 + 0.376 x1(t) - 0.241 x3(t)
dx4(t)/dt = 0.435 x1(t) - 0.315 x3(t) - 0.437 x4(t)
Each term such as 0.435 x1(t) carries one interaction coefficient; the terms 0.491 and -0.427 are constant coefficients. A simulation sketch follows below.
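A SciPy sketch that simulates this system to produce time-series data; the initial condition and time grid are arbitrary choices.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Constant coefficients ai,0 and interaction matrix A (row i gives dxi/dt).
a0 = np.array([0.491, 0.0, -0.427, 0.0])
A = np.array([
    [-0.248,  0.0,    0.0,    0.0  ],  # dx1/dt
    [ 0.0,    0.0,   -0.473,  0.374],  # dx2/dt
    [ 0.376,  0.0,   -0.241,  0.0  ],  # dx3/dt
    [ 0.435,  0.0,   -0.315, -0.437],  # dx4/dt
])

def rhs(t, x):
    return a0 + A @ x

t_eval = np.linspace(0, 20, 50)  # arbitrary time grid
sol = solve_ivp(rhs, (0, 20), y0=[1.0, 0.5, 0.0, 0.2], t_eval=t_eval)
print(sol.y.shape)               # (4 genes, 50 time points)
```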
Problem Revisited

       ai,0     ai,1     ai,2     ai,3     ai,4
x1     0.491   -0.248    0        0        0
x2     0        0        0       -0.473    0.374
x3    -0.427    0.376    0       -0.241    0
x4     0        0.435    0       -0.315   -0.437

Given the time-series data, can we find the interaction coefficients?
Issues with Differential Equations
• Even under the simplest linear model, there are m(m+1) unknown parameters to estimate:
– m(m-1) directional effects
– m self effects
– m constant effects
• The number of data points is mn, and we typically have n << m (few time points).
• To avoid overfitting, extra constraints must be incorporated into the model, such as:
– smoothness of the equations
– sparseness of the network (few non-null interaction coefficients)
Algorithm for Network Inference
• To recover the interaction coefficients, we use stepwise multiple linear regression.
• Why?
– The procedure only adds coefficients that significantly improve the fit of the regression. Hence it limits the number of non-zero coefficients (i.e., it finds sparse networks), a feature we are seeking.
– It is highly flexible and provides p-value scores which can be interpreted easily.
Partial F Test
• The procedure finds the interaction coefficients iteratively for each gene xi.
• A partial F test is constructed to compare the total square error of the predicted gene trajectory with a specific subset of coefficients added or removed.
• If the p-value obtained from the test falls below a certain cutoff, the subset of coefficients is deemed significant and is added or removed (see the sketch below).
• The procedure iterates until no more subsets of coefficients are either added or removed.
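A simplified Python sketch of this idea for a single gene: forward stepwise selection of predictors for the estimated derivative dxi/dt, with a partial F test at each step. The finite-difference derivative, the 0.05 cutoff, the forward-only search (no removal step) and the random stand-in data are all simplifying assumptions relative to the full procedure.

```python
import numpy as np
from scipy.stats import f as f_dist

def rss(y, Z):
    """Residual sum of squares of the least-squares fit of y on the columns of Z."""
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return r @ r

def stepwise_fit(y, candidates, alpha=0.05):
    """Forward selection among candidate predictors via partial F tests."""
    n = len(y)
    chosen = [np.ones((n, 1))]  # always include the constant term ai,0
    used = set()
    while True:
        best = None
        rss0 = rss(y, np.hstack(chosen))
        for j, c in enumerate(candidates):
            if j in used:
                continue
            rss1 = rss(y, np.hstack(chosen + [c[:, None]]))
            df2 = n - (len(chosen) + 1)
            F = (rss0 - rss1) / (rss1 / df2)  # partial F statistic, 1 extra coefficient
            p = f_dist.sf(F, 1, df2)
            if p < alpha and (best is None or p < best[1]):
                best = (j, p)
        if best is None:  # no subset significantly improves the fit anymore
            return used
        used.add(best[0])
        chosen.append(candidates[best[0]][:, None])

# Stand-in time series: 4 genes x 30 time points, unit time step.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 30)).cumsum(axis=1)
dX = np.diff(X, axis=1)                # finite-difference estimate of dxi/dt
preds = [X[j, :-1] for j in range(4)]  # candidate predictors xj(t)
print(stepwise_fit(dX[0], preds))      # indices j with non-zero a1,j
```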
Simulations
• It is difficult to find coefficients that will produce realistic gene trajectories.
• We select coefficients such that the resulting trajectories satisfy 3 conditions:
– They are bounded.
– The correlation of any pair is not too high.
– They are not too stable.
• We add Gaussian noise to model errors.
Noise
[Figure: simulated gene trajectories with added Gaussian noise.]
Network Inference
[Figure: the recovered coefficient table and signed network, identical to the true network of the Problem Revisited slide.]
The procedure perfectly recovers this network with 4 genes and 10 interaction coefficients.
10 Genes
The procedure also perfectly recovers a network with 10 genes and 22 interaction coefficients.
Multiple Networks
Multiple Network Problem
• Multiple networks related by a graph or a tree can arise in various situations:
– Different species
– Different developmental stages
– Different tissues
• The goal is now not only to maximize the fit (with as few interactions as possible) but also to minimize an evolutionary score on the graph relating the networks.
Multiple Network Inference
• The stepwise regression algorithm is modified to act directly on the edges of the graph and to take the evolutionary score into account.
• The inference is done concurrently in all the networks.
• Result: the comparative framework actually simplifies the inference process, especially when more genes or more noise are involved.
Simulation Tests
Simple?
References
• Boolean Networks
– Kauffman (1993), The Origins of Order.
– Liang et al. (1998), PSB, 3: 18-29.
• Bayesian Networks
– Friedman et al. (2000), RECOMB 2000.
– Hartemink et al. (2001), PSB, 6: 422-433.
• Differential Equations
– Chen et al. (1999), PSB, 4: 29-40.
– D'haeseleer et al. (1999), PSB, 4: 41-52.
– Yeung et al. (2002), PNAS, 99(9): 6163-6168.
• Literature Review
– de Jong (2002), JCB, 9(1): 67-103.