Journal of Bioinformatics and Computational Biology
Vol. 12, No. 2 (2014) 1441002 (16 pages)
© Imperial College Press
DOI: 10.1142/S0219720014410029
Statistical method for estimation of the predictive
power of a gene circuit model
Ekaterina Myasnikova*,†,‡ and Konstantin N. Kozlov†,§
*Department of Computational Biology
St. Petersburg State Polytechnical University
29 Polytekhnicheskaya, St. Petersburg
195251 Russia
†Department of Bioinformatics
Moscow Institute of Physics and Technology
Institutskiy per. 9, Dolgoprudny 141700
Moscow Region, Russia
‡[email protected]
§[email protected]
Received 7 October 2013
Revised 18 December 2013
Accepted 12 January 2014
Published 27 March 2014
In this paper, a specific aspect of the prediction problem is considered: high predictive power is understood as the ability to reproduce the correct behavior of model solutions at predefined values of a subset of parameters. The problem is discussed in the context of a specific mathematical model, the gene circuit model for the segmentation gap gene system in early Drosophila development. A shortcoming of the model is that it cannot be used for predicting the system behavior in mutants when fitted to wild type (WT) data. To answer the question of whether experimental data contain enough information for correct prediction, we introduce two measures of predictive power. The first measure reveals the biologically substantiated low sensitivity of the model to parameters that are responsible for correct reconstruction of expression patterns in mutants, while the second one takes into account their correlation with the other parameters. It is demonstrated that the model solution obtained by fitting to gene expression data in WT and Kr mutants simultaneously, which exhibits high predictive power, is characterized by much higher values of both measures than those fitted to WT data alone. This result leads us to the conclusion that the information contained in WT data is insufficient to reliably estimate the large number of model parameters and provide predictions of mutants.
Keywords: Predictive power; sensitivity analysis; identifiability analysis; gene circuits; overfitting.
1. Introduction
The correct prediction of system behavior is a necessary property of a mathematical model that is largely determined by the identifiability of model parameters. The number of parameters that are estimated by fitting to experimental data is typically large. For a comprehensive analysis of modeling results it is necessary to know how reliable the parameter estimates are, which constitutes the identifiability problem. In practice, insufficient or noisy data, as well as strong parameter correlations or even functional relations between parameters, may prevent the unambiguous determination of parameter values. In addition, some parameters that carry a definite biological meaning nevertheless have estimates that can vary by orders of magnitude without significantly influencing the quality of the fit. Such parameters are referred to as "sloppy" [1].
Analysis of parameter identifiability is closely connected with the study of the predictive properties of the model. The correct prediction of the model behavior at fixed predefined values of several parameters is only possible if the rest (the free parameters) are identifiable. If there exist strong correlations between fixed parameters and those estimated by fitting, the prediction may become infeasible, as in this case changes of parameter values cause simultaneous changes in the correlated parameters. Thus, the parameters compensate each other's effect on the cost functional, and hence if one of them is fixed the value of the other is unreliable. Besides, if the fixed parameters are those to which the cost functional is the most sensitive and the rest of the parameters do not essentially affect the quality of fit, the model can also exhibit poor predictive results. In this paper, we will focus on these two sources of parameter nonidentifiability that are responsible for poor predictive power.
A typical example of the predictive approach is a gene circuit model that dynamically reconstitutes the set of interactions within the genetic network. The model was successfully applied to reproduce the dynamics of pattern formation in the segmentation gap gene system in the wild type (WT) Drosophila embryo [2,3]. Theoretically, if the model is fitted to WT data, the gene expression in embryos mutant for one of the gap genes is predicted by setting the parameters related to the missing gene to zero. However, the only example of correct modeling of mutants is presented in Ref. 4, where the model is fitted to gap gene expression patterns in WT and in embryos with a homozygous null mutation in the Kr gene simultaneously. All attempts to predict the system behavior in mutants without fitting to data from two genotypes failed. The unsuccessful predictions could be explained, for example, by overfitting, which is a consequence of model overparameterization when the experimental data used for fitting are insufficient. This problem will be explored in our paper.
Basically, two approaches are used to handle nonidentifiability. The first one is referred to as a priori or structural identifiability analysis, as the model structure is examined for nonidentifiabilities before the simulating and fitting procedures. Within the second approach, a posteriori or practical identifiability study, nonidentifiabilities are detected by fitting to data and investigating the parameter estimates [5–8]. Besides, parameter identifiability can be addressed either locally near a given point or globally over the whole parameter space. In our current study we will focus on a local a posteriori analysis [6,7,9]. This approach is typically based on asymptotic confidence intervals [5,10,11] that characterize the model sensitivity to parameters.
To study the predictive properties of the model there is no need to consider confidence intervals for each individual parameter; it is sufficient to select those parameter combinations that have the maximum impact on the model solution. Such combinations are found for two sets of parameters: the full parameter set and a subset of parameters that define the predicted behavior of the system. Then a measure of predictive power is constructed as a relative sensitivity to these two types of parameter combinations. Although the methods of local sensitivity analysis forming the basis of our approach are well known, the predictive sensitivity measures are novel and are first introduced in this paper.
In this paper, we introduce two relative sensitivity measures that characterize the two sources of poor predictive power formulated above. While the method is general and can be applied to study the predictive power of any model, it will be discussed in the context of a specific biological system.
2. Methods
2.1. Problem statement
Let the dynamics of a biological system be described by a system of ordinary differential equations with an unknown m-dimensional vector of parameters $\theta \in \Theta$. The model solution is obtained by fitting the model to experimental data through minimization of

$$S(\theta) = \sum_{i=1}^{N} \bigl(y_i(t_i,\theta) - \tilde{y}_i\bigr)^2 = Y^T(\theta)\,Y(\theta) \tag{1}$$

with respect to the parameter vector. Here $\tilde{y}_i$ is an observed value, $y_i(t_i,\theta)$ is the corresponding model value and N is the number of observations. $Y(\theta) = Y(\theta, t)$ is the model solution defined by the vector of parameters $\theta$. If measurement errors are independent and normally distributed, the values of the parameter vector $\hat\theta$ that minimize Eq. (1) are the maximum likelihood estimates (MLE).
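The cost functional of Eq. (1) can be illustrated with a minimal sketch. The exponential-decay model, the function names and the synthetic data below are assumptions made purely for illustration (they are not the gene circuit model), and the sketch treats Y as the vector of residuals so that S equals Y^T Y.

```python
import numpy as np

def residuals(theta, t, y_obs, model):
    """Residual vector: model values minus observations."""
    return model(t, theta) - y_obs

def cost(theta, t, y_obs, model):
    """Sum of squared residuals S(theta) = Y^T Y, as in Eq. (1)."""
    r = residuals(theta, t, y_obs, model)
    return float(r @ r)

# Hypothetical toy model (not the gene circuit): exponential decay.
def decay(t, theta):
    return theta[0] * np.exp(-theta[1] * t)

t = np.linspace(0.0, 4.0, 9)            # observation times
theta_true = np.array([2.0, 0.5])
y_obs = decay(t, theta_true)            # noise-free synthetic "data"

s_true = cost(theta_true, t, y_obs, decay)              # cost at the generating parameters
s_off = cost(np.array([2.0, 0.8]), t, y_obs, decay)     # cost at a perturbed parameter vector
```

With noise-free data the cost vanishes at the generating parameters and is positive elsewhere; with noisy data the minimizer would instead be found numerically.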
2.2. Confidence intervals for parameter estimates
The most commonly used approach to local identifiability analysis of parameters is based on asymptotic confidence intervals [5,10]. The asymptotic $(1-\alpha)$-confidence region for an unknown parameter vector $\theta$ is determined from the inequality

$$(\theta - \hat\theta)^T J^T J\,(\theta - \hat\theta) \le \frac{m}{N-m}\, S(\hat\theta)\, F_{\alpha,m,N-m}, \tag{2}$$

where the Jacobian $J = J(\theta) = \partial Y(\theta)/\partial\theta$ is the so-called sensitivity matrix of size $N \times m$, and $F_{\alpha,m,N-m}$ is an $\alpha$-quantile of the F-distribution with $m$ and $N-m$ degrees of freedom. The inverse of matrix $J^T(\theta)J(\theta)$ multiplied by the variance of the observation error is the covariance matrix of the parameter estimates. It is convenient to represent matrix $J^T J$ using its singular value decomposition (SVD). SVD factorizes a symmetric matrix $M$ as $M = V \Lambda V^T$, where the columns of $V$ are the eigenvectors of matrix $M$ and matrix $\Lambda$ is diagonal with entries equal to the eigenvalues of $M$. With regard to the SVD of matrix $J^T J$, the confidence region given in Eq. (2) takes the form

$$(\theta - \hat\theta)^T V \Lambda V^T (\theta - \hat\theta) \le R^2, \tag{3}$$

where $R^2 = R^2(\alpha, m, N, \hat\theta) = \frac{m}{N-m}\, S(\hat\theta)\, F_{\alpha,m,N-m}$ is the right-hand side of Eq. (2).
Lengths of the parameter confidence intervals allow us to draw conclusions about the reliability of each parameter estimate. However, confidence intervals for individual parameters $\theta_i$ can be expressed exactly from Eq. (2) only in the case of parameter orthogonality, i.e. if the covariance matrix of the parameter vector is diagonal and hence the parameters are uncorrelated. For correlated parameters there may exist different interpretations of individual confidence intervals. We will consider two types of intervals introduced in Ref. 5. Dependent confidence intervals do not take into account parameter correlations and are computed for $\theta_i$ with the values of all the other parameters fixed at their MLE values:

$$|\theta_i - \hat\theta_i| \le \frac{R}{\sqrt{(V \Lambda V^T)_{ii}}}, \quad i = 1, \ldots, m. \tag{4}$$

Geometrically, this interval represents the intersection of the ellipsoid [Eq. (2)] with the line through its center parallel to the ith parameter axis (see Fig. 1).
Another type of confidence interval, referred to as independent, is defined as the whole range of the parameter variation as the other parameters take any possible values from the m-dimensional region given by Eq. (2):

$$|\theta_i - \hat\theta_i| \le R \sqrt{(V \Lambda^{-1} V^T)_{ii}}, \quad i = 1, \ldots, m. \tag{5}$$

This interval represents the projection of the confidence region onto the ith parameter axis. We denote the half-lengths of dependent and independent confidence intervals as $I_D(\theta_i)$ and $I_I(\theta_i)$, respectively.
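The two interval types of Eqs. (4) and (5) can be sketched numerically as follows. This is only an illustration under stated assumptions: the function name, the small sensitivity matrices and the value of R are hypothetical, R is taken as given rather than computed from an F-quantile, and the pseudo-inverse is used in the spirit of footnote a.

```python
import numpy as np

def confidence_half_lengths(J, R):
    """Half-lengths of dependent and independent intervals, Eqs. (4)-(5)."""
    A = J.T @ J                                   # information matrix J^T J
    dependent = R / np.sqrt(np.diag(A))           # other parameters fixed at MLE
    # Pseudo-inverse, as footnote a suggests for an ill-conditioned J^T J.
    independent = R * np.sqrt(np.diag(np.linalg.pinv(A)))
    return dependent, independent

# Hypothetical sensitivity matrices (N = 3 observations, m = 2 parameters).
J_orthogonal = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
J_correlated = np.array([[1.0, 1.0], [1.0, 1.1], [1.0, 0.9]])

dep_o, ind_o = confidence_half_lengths(J_orthogonal, R=1.0)
dep_c, ind_c = confidence_half_lengths(J_correlated, R=1.0)
```

For the orthogonal matrix the two kinds of intervals coincide, while for the nearly collinear columns the independent intervals come out much longer than the dependent ones.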
Dependent and independent confidence intervals coincide if all the parameters are orthogonal, while if some parameters are correlated the independent confidence intervals exceed the dependent ones. In the latter case, the confidence region is oblong and its principal axes are inclined with respect to the parameter axes. As a result its projection is much larger than any intersection of the ellipsoid with any line parallel to a parameter axis. This consideration is illustrated in Fig. 1 for the two-parameter case.^a

^a Strong correlations between parameters cause difficulties in the numerical calculation of confidence intervals, as in this case matrix $J^T J$ is ill-conditioned and its exact inversion is infeasible. Instead, a standard approximation of the inverse of an ill-conditioned matrix M, the Moore–Penrose pseudo-inverse, is used.

Fig. 1. Example of independent and dependent confidence intervals for a two-dimensional parameter space. (a) Well-identifiable parameters; (b) poorly-identifiable parameters.
2.3. Measures of predictive power
2.3.1. Sensitivity analysis
In this section, we introduce measures of relative sensitivity of the model to different linear combinations of parameters.
The full information about the model sensitivity to parameters can be obtained from the shape of the confidence region [Eq. (2)]. Without loss of generality we can consider a centered parameter vector $\theta$ such that $\hat\theta = 0$, i.e. the ellipsoid center coincides with the origin of coordinates. The rotation transform of the parameter space defined by matrix $V$, composed of eigenvectors of the information matrix $J^T J$, generates the canonical basis with respect to which the principal axes of the confidence ellipsoid coincide with the parameter axes. Such a change of basis transforms the centered parameter vector $\theta$ to the new vector $\tilde\theta = V^T \theta$ with components that are linear combinations of parameters $\theta_i$. The confidence ellipsoid is then defined by $\tilde\theta^T \Lambda \tilde\theta \le R^2$. New parameters constructed in such a way are referred to as principal components of the parameter space. Principal components are orthogonal, and their dependent and independent confidence intervals $I(\tilde\theta_i)$ coincide and are given by the right-hand side of

$$|\tilde\theta_i| \le \frac{R}{\sqrt{\lambda_i}}, \quad i = 1, \ldots, m, \tag{6}$$

where $\lambda_i$ is the ith eigenvalue of $J^T J$. If the elements of vector $\tilde\theta$ are numbered in descending order with respect to eigenvalues $\lambda_i$, the first several principal components $\tilde\theta_i$ contain almost all the information about the parameter estimates that can be extracted from the data. In other words, it is possible to reduce the dimensionality of the parameter space and only consider a few parameters, say L, for the comprehensive analysis of the model sensitivity.
If parameters are properly normalized the length of the confidence interval is the shortest for the first principal component and increases for each next component of $\tilde\theta$. Hence, the identifiability of principal components becomes worse with the increase of the component number [10].
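The construction of principal components and their confidence half-lengths, Eq. (6), may be sketched as below. The function names, the example sensitivity matrix and the centered parameter vector are hypothetical, and `numpy.linalg.eigh` is used here as one convenient way to obtain the eigendecomposition of the symmetric matrix J^T J.

```python
import numpy as np

def principal_components(J):
    """Eigenvalues (descending) and matching eigenvectors of J^T J."""
    lam, V = np.linalg.eigh(J.T @ J)   # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]      # reorder so the first PC is the best determined
    return lam[order], V[:, order]

def pc_half_lengths(lam, R):
    """Confidence half-lengths R / sqrt(lambda_i) of Eq. (6)."""
    return R / np.sqrt(lam)

J = np.array([[1.0, 1.0], [1.0, 1.1], [1.0, 0.9]])   # hypothetical sensitivity matrix
lam, V = principal_components(J)
half = pc_half_lengths(lam, R=1.0)

theta = np.array([0.2, -0.1])          # centered parameter vector (MLE at the origin)
theta_tilde = V.T @ theta              # its principal components
```

The quadratic form of the confidence ellipsoid is unchanged by the rotation, which is what makes the principal axes a convenient reference for the sensitivity measures.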
As a total measure of the sensitivity of the model solution defined by a parameter vector $\theta$ we accept the sum of squared reciprocals of the half-lengths of confidence intervals for the first L principal components $\tilde\theta_1, \ldots, \tilde\theta_L$; the shorter the confidence interval is, the higher the sensitivity will be. For the two types of confidence intervals two types of sensitivity measures can be defined as

$$M_I(\theta) = \sum_{i=1}^{L} I_I^{-2}(\theta_i) \quad \text{and} \quad M_D(\theta) = \sum_{i=1}^{L} I_D^{-2}(\theta_i).$$

These quantities will be referred to as independent and dependent measures of sensitivity, respectively. Since dependent intervals are never longer than independent ones, $M_I$ is always less than or equal to $M_D$. For orthogonal principal components

$$M(\tilde\theta) = M_D(\tilde\theta) = M_I(\tilde\theta) = \frac{1}{R^2} \sum_{i=1}^{L} \lambda_i.$$
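As a hedged illustration, the two measures can be computed from the matrix of the confidence ellipsoid written in the coordinates of interest. The function name and the example matrices are assumptions, and the intervals are obtained in the same way as in Eqs. (4) and (5).

```python
import numpy as np

def sensitivity_measures(A, R, L):
    """Return (M_I, M_D) for the first L coordinates of a parameter vector
    whose confidence ellipsoid is x^T A x <= R^2."""
    dep = R / np.sqrt(np.diag(A))                    # dependent half-lengths
    ind = R * np.sqrt(np.diag(np.linalg.pinv(A)))    # independent half-lengths
    M_D = float(np.sum(dep[:L] ** -2.0))
    M_I = float(np.sum(ind[:L] ** -2.0))
    return M_I, M_D

R = 2.0
A_diag = np.diag([9.0, 4.0, 1.0])      # uncorrelated coordinates (principal components)
A_corr = np.array([[4.0, 1.9, 0.0],
                   [1.9, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])   # first two coordinates strongly correlated

MI_diag, MD_diag = sensitivity_measures(A_diag, R, L=2)
MI_corr, MD_corr = sensitivity_measures(A_corr, R, L=2)
```

In the diagonal case both measures reduce to the sum of the first L eigenvalues divided by R^2; in the correlated case the independent measure drops well below the dependent one.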
We will estimate the sensitivity of the model to any parameter vector $\theta^{*}$ that is a result of rotation of $\tilde\theta$ by some orthogonal matrix, $\theta^{*} = V_{*}^{T}\tilde\theta$, i.e. the vector of linear combinations of principal components. It will be shown in the next section that a vector of the model parameters, with some of these being fitted and the rest being fixed, can be presented as such a rotation. The components of $\theta^{*}$ may be correlated, and the lengths of their dependent and independent confidence intervals, $I_D$ and $I_I$, may not coincide.^b
Principal components are those linear combinations of parameters that contribute the most to the cost functional. Our aim is to compare the sensitivity of the model to principal components $\tilde\theta$ and to parameter combinations $\theta^{*}$ generated by the space rotation $V_{*}$. In other words, we want to compare the confidence intervals defined by Eq. (6) for the principal components and those for parameters $\theta^{*}$ given by $I_D(\theta^{*})$ and $I_I(\theta^{*})$. As illustrated geometrically in Fig. 2, exact confidence intervals for the first principal components are shorter than both dependent and independent intervals with respect to the rotated basis, while for the highest components the opposite situation is observed.
We introduce two quantities to characterize the relative sensitivity of rotated parameter estimates. The relative dependent sensitivity of the model to the rotated vector $\theta^{*}$ is defined as the ratio

$$G_D(\theta^{*}) = M_D(\theta^{*})/M(\tilde\theta). \tag{7}$$

Geometrically, this is the measure of inclination of the confidence ellipsoid with respect to the canonical basis. In practice, the ratio characterizes to what extent the space rotation reduces the model sensitivity to the first L components of $\theta^{*}$ as compared to $\tilde\theta$. If the ratio is close to 1, the rotation does not essentially influence the model sensitivity. Note that correlations between parameters are not taken into account so far, and hence we analyze the sensitivity of the model to rotated combinations of parameters with no regard to the values of other parameters and their combinations. Therefore this consideration is also true for independent parameters and reflects the intrinsic properties of the model. If there exist correlations between parameters it is additionally important to consider the ratio of measures

$$G_I(\theta^{*}) = M_I(\theta^{*})/M_D(\theta^{*}). \tag{8}$$

This quantity, referred to as relative independent sensitivity, characterizes the loss of the model sensitivity to the rotated parameters due to parameter correlations. From a geometric point of view the latter ratio characterizes the degree of ellipsoid oblongness: the more the ellipsoid is oblong along the axes of parameters correlated with the first components of $\theta^{*}$, the lower the ratio is.
Both relative measures never exceed 1, where equality is only possible if the parameters are uncorrelated. A value of either of them much less than 1 indicates poor identifiability of parameters.

^b Obviously, the original parameter vector $\theta$ can also be represented as an orthogonal rotation of $\tilde\theta$ by the matrix $V$, i.e. we can just consider the original parameters as one of the possible rotations of the principal components, and not distinguish them among the other parameter vectors.

Fig. 2. Construction of relative sensitivity measures in the two-parameter case. Dependent and independent confidence intervals for parameters are presented by two-sided solid arrows for $\theta_1$ (a) and $\theta_2$ (b). Confidence intervals for principal components $\tilde\theta_1$ (a) and $\tilde\theta_2$ (b) are dotted two-sided arrows. $\tilde\theta_1$ is the first principal component as its confidence interval is the shortest. In this case $I(\tilde\theta_1) < I_D(\theta_1) < I_I(\theta_1)$ and $I_D(\theta_2) < I_I(\theta_2) < I(\tilde\theta_2)$. If $L = 1$, then $\tilde\theta^{L} = \{\tilde\theta_1\}$. Relative sensitivity measures are given by $G_D = I^2(\tilde\theta_1)/I_D^2(\theta_1)$ and $G_I = I_D^2(\theta_1)/I_I^2(\theta_1)$.
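A two-parameter numerical sketch of the ratios in Eqs. (7) and (8) is given below. The eigenvalues, the rotation angles and the names `relative_measures`, `rotation` and `W` are assumptions chosen for illustration; `W` plays the role of the orthogonal matrix that maps the principal components to the rotated vector, and the factor R^2 cancels in both ratios.

```python
import numpy as np

def relative_measures(lam, W, L):
    """G_D and G_I for the first L components of x = W^T theta_tilde.

    lam: eigenvalues of J^T J in descending order; W: orthogonal matrix.
    R^2 cancels in both ratios, so it is omitted.
    """
    A = W.T @ np.diag(lam) @ W                            # ellipsoid matrix in rotated coordinates
    M_tilde = np.sum(lam[:L])                             # measure for the principal components
    M_D = np.sum(np.diag(A)[:L])                          # dependent measure
    M_I = np.sum(1.0 / np.diag(np.linalg.pinv(A))[:L])    # independent measure
    return M_D / M_tilde, M_I / M_D

def rotation(angle):
    """Plane rotation matrix for the given angle in radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

lam = np.array([100.0, 1.0])          # hypothetical eigenvalues: an oblong ellipse
G_D_0, G_I_0 = relative_measures(lam, rotation(0.0), L=1)
G_D_45, G_I_45 = relative_measures(lam, rotation(np.pi / 4), L=1)
```

Without rotation both ratios equal 1. Rotating the oblong ellipse by 45 degrees roughly halves the dependent ratio and reduces the independent ratio to a few percent, which is the situation described as poor identifiability.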
2.3.2. Sensitivity measures characterize the model predictive power
Now we will show how the sensitivity measures formally introduced in the previous section can be associated with the predictive properties of the model. Application of the method will be illustrated using a gene circuit model. First of all, the structure of the parameter set that defines the model solution is explored. We aim to predict the behavior of a model solution at given predefined values of a subset of parameters $\theta_g$. For definiteness let the parameters be set to zero. Denote the complement of this subset in the full parameter set $\Theta$ as $\theta_{\bar g} = \Theta \setminus \theta_g$. With respect to the gene circuit model this means that for each gene g we consider a subset $\theta_g$ composed of parameters that describe the action of g either as target or regulator. To model the evolution of gene expression in null mutants for gene g the parameters from $\theta_g$ are zeroed.
The correct prediction is only possible if the estimates of $\theta_{\bar g}$, the subset composed of parameters not related to g, are well-identifiable. Nonidentifiability of these parameters can be caused either by a biologically substantiated low sensitivity of the model to the parameter changes or by their correlation with the other parameters. Low sensitivity to $\theta_{\bar g}$ means that fitting is for the most part implemented with respect to parameters including the g gene, while the rest do not essentially affect the value of the functional. If there exists strong correlation between the $\theta_g$ parameters and the other ones, a change of a parameter from $\theta_g$ causes simultaneous changes in the correlated parameters from the complementary subset $\theta_{\bar g}$. Thus, setting the $\theta_g$ parameters to zero leads to nonidentifiability of parameters that are not related to the g gene.
For the sake of clarity, the model with the full set of fitted parameters is referred to as the full model. The model properties are explored in the vicinity of $\theta$, a point in parameter space that defines a specific model solution. For an informative sensitivity analysis it is sufficient to reveal the parameter combinations that make the maximum contribution to the cost functional and study their role in the predictions. First, we find the linear combinations of all the parameters from the parameter set $\Theta$ to which the model is the most sensitive. For this purpose, we apply the SVD to the full matrix $J^T J = V \Lambda V^T$ and extract L principal components that compose the vector $\tilde\theta^{L} = (\tilde\theta_1, \ldots, \tilde\theta_L)$. The dependent and independent measures of sensitivity $M_I$ and $M_D$ coincide for $\tilde\theta^{L}$ as principal components are uncorrelated.
Next, we address the parameters from subset $\theta_{\bar g}$ only. The sensitivity matrix for this subset reduces to $J_{\bar g}$, which results from matrix $J$ when all its columns corresponding to parameters from $\theta_g$ are set to zero. SVD is applied to matrix $J_{\bar g}^T J_{\bar g} = V_{\bar g} \Lambda_{\bar g} V_{\bar g}^T$ to extract principal components $\tilde\theta_{\bar g} = V_{\bar g}^T \theta$ with respect to the basis generated by eigenvectors of $J_{\bar g}^T J_{\bar g}$, which are linear combinations not including parameters related to g. The first principal components $\tilde\theta_{\bar g}^{L} = (\tilde\theta_{\bar g 1}, \ldots, \tilde\theta_{\bar g L})$ are those parameter combinations that mainly define the behavior of the system that we aim to predict. Hence, the identifiability of these parameter estimates obtained by fitting the full model is necessary for good prediction.
Thus, we ask what the model sensitivity to parameters $\tilde\theta_{\bar g}^{L}$ will be. The parameter vector $\tilde\theta_{\bar g}$ can be represented as a result of the rotation $\tilde\theta_{\bar g} = V_{\bar g}^T V \tilde\theta$ with respect to the canonical basis generated by eigenvectors of the full information matrix $J^T J$. Therefore, the linear combinations $\tilde\theta_{\bar g}^{L}$ are not necessarily orthogonal with respect to this basis, and the sensitivity measures $M_I(\tilde\theta_{\bar g}^{L})$ and $M_D(\tilde\theta_{\bar g}^{L})$ may be unequal. For the sake of brevity the superscript "L" will be omitted in what follows.
To analyze the identifiability of parameters from $\theta_{\bar g}$ we consider the relative measures of sensitivity introduced in Eqs. (7) and (8). The ratio $G(\tilde\theta_{\bar g}) = M_I(\tilde\theta_{\bar g})/M_D(\tilde\theta)$ characterizes the model's relative sensitivity to the subset of parameters $\theta_{\bar g}$ as compared with the sensitivity to the full set $\Theta$. A low value of $G_D(\tilde\theta_{\bar g}) = M_D(\tilde\theta_{\bar g})/M_D(\tilde\theta)$ means that the model is relatively insensitive to parameters not related to g, and hence when the parameters from $\theta_g$ are zeroed the model behavior may be reproduced incorrectly. In particular, such a situation may happen if the $\theta_{\bar g}$ parameters are sloppy and do not significantly influence the quality of fit. The second measure $G_I(\tilde\theta_{\bar g}) = M_I(\tilde\theta_{\bar g})/M_D(\tilde\theta_{\bar g})$ reflects the degree of correlation between subsets $\theta_g$ and $\theta_{\bar g}$, so that its low value also indicates poor identifiability of the subsets, and hence low predictive power.
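The procedure of this section can be sketched numerically as follows. This is one possible reading of the text, not the authors' implementation: the function name, the choice to zero the columns of J that correspond to the parameters of gene g, and the example matrices are all assumptions, and R^2 is omitted because it cancels in both ratios.

```python
import numpy as np

def predictive_measures(J, g_columns, L):
    """G_D and G_I for the parameters not related to gene g (R^2 cancels)."""
    A = J.T @ J                                     # full information matrix
    lam = np.sort(np.linalg.eigvalsh(A))[::-1]      # full-model eigenvalues, descending

    J_rest = J.copy()
    J_rest[:, g_columns] = 0.0                      # drop sensitivities to the g-related parameters
    lam_rest, V_rest = np.linalg.eigh(J_rest.T @ J_rest)
    V_rest = V_rest[:, np.argsort(lam_rest)[::-1]]  # leading PCs of the reduced problem first

    A_rot = V_rest.T @ A @ V_rest                   # full information matrix in that basis
    M_D = np.sum(np.diag(A_rot)[:L])                # dependent measure
    M_I = np.sum(1.0 / np.diag(np.linalg.pinv(A_rot))[:L])   # independent measure
    return M_D / np.sum(lam[:L]), M_I / M_D

# Hypothetical sensitivity matrices; column 0 stands for the g-related parameter.
J_orth = np.diag([3.0, 2.0, 1.0])
J_corr = np.array([[1.0, 1.0, 0.0],
                   [1.0, 1.1, 0.0],
                   [1.0, 0.9, 0.0],
                   [0.0, 0.0, 1.0]])

G_D_orth, G_I_orth = predictive_measures(J_orth, [0], L=1)
G_D_corr, G_I_corr = predictive_measures(J_corr, [0], L=1)
```

In the orthogonal example the independent ratio equals 1 and only the dependent ratio is reduced, because the model is most sensitive to the zeroed parameter. In the correlated example the independent ratio collapses, mirroring the second source of poor predictive power.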
3. Results
We study the predictive power of the gene circuit model that is fitted to the data on expression of four target genes, hb, Kr, gt and kni, and thereby four parameter subsets $\theta_{hb}$, $\theta_{Kr}$, $\theta_{gt}$ and $\theta_{kni}$ are defined as described in Sec. 2.3.2, each composed of all the parameters including the corresponding gene as target or regulator. The sensitivity of the model to parameters is analyzed in the vicinity of the solutions defined by four parameter sets $C_1$, $C_2$, $C_1^{WT}$ and $C_2^{WT}$, which are introduced in Appendix A.
Confidence intervals for parameter estimates were constructed and analyzed in detail in Ref. 4. Dependent and independent confidence intervals for gene circuit $C_1$ are reproduced in Fig. 3. In the cited paper, we focused on the reliability of the estimates of regulatory weights, i.e. the reliability of our conclusions about the type of gene-to-gene interaction within the genetic network. The sensitivity of the model to parameters in the case of properly normalized parameter values is characterized by the size of their confidence intervals. For example, the shortest independent and dependent intervals are obtained for parameters describing the action of a strong activator Cad on all the target genes, which indicates the highest sensitivity of the model to these parameters. The longest dependent intervals correspond to the parameters from $\theta_{Kr}$ that were classified as unreliable [4]. The difference between dependent and independent confidence intervals is a result of existing correlations between the parameter estimates. This issue was explored in detail in Ref. 4 using the local collinearity analysis [6]. For more information see Appendix A.

Fig. 3. 95% dependent (thin bars) and independent (thick bars) confidence intervals for parameter set $C_1$. MLE of parameters are depicted as small black squares. The horizontal axis is labeled by notations of regulatory weights. For simplicity each parameter is denoted by two gene names, the first of which is a target and the second a regulator. For example, $E^{cad}_{hb}$ is denoted CadHb.
In sensitivity analysis, we first focus on parameter vectors $C_1$ and $C_2$ that are obtained as a result of fitting to two-genotype data. These parameter estimates provide a good fit both to WT and Kr mutant data. We compute the relative sensitivity measures $G_D$ and $G_I$ introduced in Eqs. (7) and (8). For this purpose principal components $\tilde\theta$ are constructed for the full set of parameters and $\tilde\theta_{\bar g}$ for parameter subsets $\theta_{\bar g}$ as described in Sec. 2.3.
Using the terminology adopted in factor analysis, we will call the absolute values of the coefficients at model parameters parameter loadings. A small number of parameters with large loadings almost fully characterizes the model sensitivity. Seven parameters from set $C_1$ extracted in such a way are presented in Table 1. The most informative parameters are those that have high loadings within the first principal components. Parameter loadings are shown for principal components constructed for the full parameter set $\Theta$ (first column) and subset $\theta_{\overline{Kr}}$ (second column). Examining the table one can see that the parameters from $\Theta$ to which the full model is the most sensitive are: $E^{cad}_{kni}$ (loading 0.88 in the first principal component $\tilde\theta_1$), $E^{cad}_{hb}$ (0.30 in $\tilde\theta_2$; 0.65 in $\tilde\theta_2$), $T^{kni}$ (0.34 in $\tilde\theta_1$), $E^{cad}_{gt}$ (0.62 in $\tilde\theta_2$). Obviously these are the parameters with the shortest confidence intervals (see Fig. 3). Note that the only parameter from $\theta_{Kr}$ appears in the fourth principal component, i.e. the model sensitivity to parameters from this subset is not too high. The second column presents the loadings for parameters not related to Kr. Comparing the two columns, we see that the maximal loadings take very similar values, which means that the parameter combinations that mainly define the model solutions both for WT and Kr-mutants are almost the same.

Table 1. Parameter loadings for subsets $C_1$ and $C_1^{WT}$.
Column groups: $C_1$ and $C_1^{WT}$, each with columns $\tilde\theta$ and $\tilde\theta_{\overline{Kr}}$.
Parameters (rows): $E^{cad}_{gt}$, $E^{cad}_{hb}$, $E^{Tll}_{hb}$, $E^{cad}_{Kr}$, $E^{cad}_{kni}$, $T^{hb}$, $T^{kni}$.
Loadings and component numbers:
$\tilde\theta$, $\tilde\theta_{\overline{Kr}}$: 0.62 0.64 0.30 0.65 0.53 2 0.85 4 0.88 0.30 1 0.34 1 3 1 2 3 3
$\tilde\theta$: 0.64 0.65 0.30 0.65 0.59 0.45 2 0.88 0.33 1 0.34 0.74 1
$\tilde\theta_{\overline{Kr}}$: 0.96 1 3 1 0.56 3 0.59 3 0.69 0.87 0.46 0.46 0.81 0.39 3 0.65 3 0.98 2 0.33 3 2 3 4 3 1 2 1 2 3 4
Note: Parameter loadings are shown for two model solutions, $C_1$ (columns 1 and 2) and $C_1^{WT}$ (columns 3 and 4). Columns 1 and 3 contain loadings in principal components (PCs) constructed for the full parameter set, while in columns 2 and 4 the loadings are given for PCs composed of parameters not related to Kr. The number of PCs, L, is defined from the inequality $\sum_{i=1}^{L} \lambda_i / \sum_{i=1}^{m} \lambda_i > 0.95$ [for notation see Eq. (6)]. The order number of a component is shown as a superscript. The loadings greater than 0.3 in absolute value are shown.
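The rule for choosing the number of principal components quoted in the note to Table 1 (cumulative share of eigenvalues above 0.95) can be written as a short helper. The function name and the example eigenvalues are hypothetical.

```python
def choose_num_components(eigenvalues, threshold=0.95):
    """Smallest L with sum(lambda_1..lambda_L) / sum(all lambda) > threshold."""
    lam = sorted(eigenvalues, reverse=True)   # descending order, as for Eq. (6)
    total = sum(lam)
    running = 0.0
    for count, value in enumerate(lam, start=1):
        running += value
        if running / total > threshold:
            return count
    return len(lam)                           # fall back to all components

L_sharp = choose_num_components([100.0, 3.0, 1.0, 1.0])   # one dominant eigenvalue
L_flat = choose_num_components([50.0, 40.0, 9.0, 1.0])    # information spread over several PCs
```

A spectrum dominated by a single eigenvalue yields a small L, whereas a flatter spectrum requires more components to pass the same threshold.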
The relative sensitivity measures computed for $C_1$ and $C_2$ are shown in the upper part of Table 2. The dependent measure $G_D$ for subset $\theta_{\overline{Kr}}$ takes a high value, equal to 0.91, which is in good agreement with the fact that the model correctly reproduces gene expression in Kr-mutants. On the other hand, the full model is highly sensitive to parameters related to kni: $E^{cad}_{kni}$ and $T^{kni}$, and, consequently, the value of measure $G_D$ is the lowest for subset $\theta_{\overline{kni}}$, being as low as 0.13. Thus, we can expect poor prediction in null mutants for kni, which was indeed demonstrated on the kni mutants published in Ref. 12.
Up to now, we did not pay attention to parameter correlations, the effect of which on predictive power is taken into account in the independent relative measures $G_I$. In our previous study [4], it was shown that parameters of the model fitted to two genotypes are less correlated and hence better identifiable than those of the model fitted to WT data only. The criterion introduced here reflects the same tendency: for subset $\theta_{\overline{Kr}}$ the measure $G_I$ is orders of magnitude higher than for the parameter subsets related to other genes.

Table 2. Relative sensitivity measures.

         C_1                             C_2
         hb      Kr      gt      kni     hb      Kr      gt      kni
G_D      0.56    0.91    0.73    0.13    0.26    0.97    0.73    0.13
G_I      0.005   0.26    0.01    0.004   0.001   0.33    0.001   0.005

         C_1^WT                          C_2^WT
         hb      Kr      gt      kni     hb      Kr      gt      kni
G_D      0.76    0.33    0.36    0.61    0.61    0.67    0.24    0.41
G_I      0.001   0.01    0.001   0.003   0.01    0.009   0.006   0.003

Note: Relative sensitivity measures computed for four parameter sets defining model solutions. High values of the measures characterize high predictive power of the model with the parameters from subset $\theta_g$ being zeroed. The values of $G_D$ and $G_I$ computed for subset $\theta_{\overline{Kr}}$ are shown in bold.
Now we address parameter vectors $C_1^{WT}$ and $C_2^{WT}$ fitted to WT data alone. Parameter loadings for vector $C_1^{WT}$ are given in the third (full parameter set $\Theta$) and fourth (subset $\theta_{\overline{Kr}}$) columns of Table 1. Here, we see a situation that is somewhat different from the one observed in the first two columns: the full model is the most sensitive to parameter $E^{cad}_{Kr}$ from $\theta_{Kr}$, while the highest loadings in the fourth column are those at parameters $E^{cad}_{gt}$ and $E^{cad}_{kni}$. In other words, the model fitted with respect to the subset of parameters is highly sensitive to parameters that are not reliably estimated in the full model. Naturally in this case the value of the dependent measure $G_D$ for subset $\theta_{\overline{Kr}}$ is not high. The highest value of $G_D$ is related to hb in both sets $C_1^{WT}$ and $C_2^{WT}$. The measure for gt is the lowest, and finally for parameters from Kr and kni the measure takes similar values. Nevertheless, these values are not high enough to reliably provide good prediction. All the values of the independent measures $G_I$ are low and of approximately the same order, which indicates high correlations between parameters.
It is shown that the model solution obtained by fitting to gene expression data in WT and null mutants for Kr demonstrates much better predictions for Kr mutants than those fitted to WT data. This result may explain why all the previous attempts to predict the dynamics of pattern formation in mutants turned out to be unsuccessful. For correct prediction of gene expression in mutant embryos it is necessary to add the mutant data to the dataset used for model fitting, as the information contained in WT experimental data is insufficient to reliably estimate the large number of model parameters.
4. Discussion
High predictive power is a necessary property of any mathematical model. In this
paper, a specific aspect of the problem is considered: Predictive power is understood
as the ability to predict correct behavior of model solutions at predefined values of a
subset of parameters. The problem is discussed in the context of a specific mathematical model, the gene circuit model for the segmentation gap gene system in early
Drosophila development. The model was successfully applied to correctly reproduce
the dynamics of pattern formation in the WT embryo.2,3 However, when fitted to WT
data, the model could not be used for prediction of system behavior in mutants. In
order to obtain model solutions describing gap gene expression in WT and mutants
in our recent work,4 the model was fitted to data from two genotypes simultaneously.
These results demonstrated the existence of parameter sets describing gap gene
expression in two genotypes simultaneously and thus the applicability of the gene
circuit formalism to model genotypes of gap mutants. As the number of model
parameters is very high, one may wonder whether overfitting was the reason
these parameter sets were not discovered during the fit to the WT data alone. In our
current study, we focus on this problem.
In the context of the overfitting problem, the following questions arise: Do the
experimental data contain enough information to correctly predict the system behavior at fixed values of a parameter subset, and is the model overparameterized?
We make an attempt to address these issues in this paper. The developed method is
based on the analysis of parameter identifiability and applied to explore the predictive properties of the gene circuit model.
Two types of relative measures of the predictive power are considered: The first
one reveals the biologically substantiated low sensitivity of the model to parameters
that are responsible for correct reconstruction of expression patterns in mutants,
while the second one takes into account their correlation with the other parameters.
It is shown that the model solution obtained by fitting to gene expression data in WT
and Kr mutants demonstrates much higher predictive power than those fitted to
WT data alone. This fact may be interpreted as a manifestation of the overfitting
problem: Information contained in WT experimental data is insufficient for correct
prediction of gene expression in mutant embryos. This conclusion does not exclude
other explanations of the incorrect predictions, but even assuming that the
model represents the underlying mechanism of the modelled process accurately and
correctly, the model fitted to WT data alone is not suitable for mutant predictions.
Another practical situation for which the proposed method may be appropriate
is the problem of selecting, among the optimal solutions, those that provide better
predictions. When applied for this purpose, the measures can be introduced
as additional regularization criteria into a global optimization problem of high
dimensionality. The model regularization narrows the search space and thereby
makes it possible to obtain solutions with the required properties. An example of an optimization
approach that presumes this kind of regularization is published in Ref. 13.
Acknowledgments
This work was supported by EC Collaborative project HEALTH-F5-2010-260429
and RFBR projects 13-01-00405 and 11-01-00573.
Appendix A. Estimation and Identifiability Analysis of Parameters
of the Gene Circuit Model
The gene circuit model2–4 describes the dynamics of segmentation gene expression in
the syncytial blastoderm of Drosophila melanogaster during cleavage cycle 14A. The
aim of modeling is to decipher the molecular mechanisms which control the process of
segment determination in Drosophila. Most segmentation genes encode transcription factors, which regulate the expression of the other genes in the segmentation gene network. The regulatory topology of the network is obtained by solving the
inverse problem of mathematical modeling. We consider the model first presented in
Ref. 4 that successfully reproduces the time evolution of protein concentrations of
gap genes hb, Kr, gt and kni in two genotypes: WT and embryos with a homozygous
null mutation in the Kr gene. A sample of gene expression dynamics is presented in
Fig. A.1.
Fig. A.1. Temporal dynamics of gt gap gene expression during the modeling period in WT embryos. The
model is fitted to the data extracted from the outlined strip along the A–P axis and presented as averaged
protein concentrations in a row of nuclei. Confocal images and quantitative data are available from the FlyEx
database (http://urchin.spbcas.ru/flyex).
The model considers a one-dimensional row of nuclei along the anteroposterior
(A–P) axis of the embryo. The modeled region covers the posterior half of an embryo
body. Concentration $v_i^a$ for each gap gene product a in each nucleus i over time t is
described by the following system of ordinary differential equations:
$$\frac{dv_i^a}{dt} = R_a\, g(u_i^a) + D^a(n)\left[(v_{i-1}^a - v_i^a) + (v_{i+1}^a - v_i^a)\right] - \lambda_a v_i^a. \tag{A.1}$$
The right-hand side of the equation represents protein synthesis, protein diffusion
and protein decay. $u_i^a = \sum_{b=1}^{4} T^{ab} v_i^b + \sum_{e=1}^{3} E^{ae} v_i^e + h^a$ is the total regulatory input
to gene a. Genes denoted as e (bcd, cad and tll) are external inputs, i.e. those genes
that are not regulated by gap genes, but regulate these genes. $T^{ab}$ and $E^{ae}$ are genetic
inter-connectivity matrices that characterize the action of regulator b or external
input e on gene a. $h^a$ is a threshold parameter of the sigmoid regulation-expression
function $g(u)$. $R_a$ is the maximum synthesis rate, $D^a$ the diffusion coefficient, and $\lambda_a$
the decay rate of the product of gene a. Thus, the expression level of each of the four
target genes is described by 10 parameters and the whole parameter set is composed of 40
parameters.
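The right-hand side of (A.1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the particular sigmoid form of g, the no-flux treatment of the boundary nuclei, and the use of a single diffusion value D[a] for the current cleavage cycle are all assumptions made here for the example.

```python
import math

def g(u):
    # Sigmoid regulation-expression function. This specific form is an
    # assumption; the text only states that g is sigmoid.
    return 0.5 * (u / math.sqrt(u * u + 1.0) + 1.0)

def circuit_rhs(v, ext, T, E, h, R, D, lam):
    """Evaluate dv/dt as in (A.1).

    v[a][i]   : concentration of gap gene product a in nucleus i
    ext[e][i] : concentration of external input e in nucleus i
    T[a][b], E[a][e], h[a], R[a], D[a], lam[a] : parameters of gene a
    """
    n_genes, n_nuc = len(v), len(v[0])
    dv = [[0.0] * n_nuc for _ in range(n_genes)]
    for a in range(n_genes):
        for i in range(n_nuc):
            # Total regulatory input to gene a in nucleus i
            u = h[a]
            u += sum(T[a][b] * v[b][i] for b in range(n_genes))
            u += sum(E[a][e] * ext[e][i] for e in range(len(ext)))
            # Assumed no-flux boundary: a missing neighbour is replaced
            # by the nucleus itself, so that term contributes zero
            left = v[a][i - 1] if i > 0 else v[a][i]
            right = v[a][i + 1] if i < n_nuc - 1 else v[a][i]
            diffusion = D[a] * ((left - v[a][i]) + (right - v[a][i]))
            # Synthesis + diffusion - decay
            dv[a][i] = R[a] * g(u) + diffusion - lam[a] * v[a][i]
    return dv
```

The returned nested list has the same shape as v and can be passed to any ODE integrator.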
The model parameters are estimated by fitting the model output to gene expression data through minimization of the cost functional (1). The minimization is
performed by means of the differential evolution entirely parallel (DEEP) method.4,14
In our previous work,4 11 vectors of parameter estimates (optimal circuits) were
obtained by fitting the model to two genotypes simultaneously: WT and Kr mutant embryos. We will use two of these vectors for our analysis: vector $C_1$ that defines the
consensus gap gene network and vector $C_2$, the one that differs most from
$C_1$. A consensus network is one in which the signs of the regulatory parameters
coincide with the predicted network topology inferred from all the fits. Additionally,
two parameter vectors are obtained by fitting the model to WT data only, using the
estimates from sets $C_1$ and $C_2$ as initial values for minimization. These vectors are
denoted $C_1^{WT}$ and $C_2^{WT}$, respectively.
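To illustrate the kind of population-based minimization involved, the following sketch implements a plain, serial DE/rand/1/bin differential evolution loop. It is not the DEEP method itself. The function names, the control settings (pop_size, f, cr, generations), and the toy cost function standing in for the cost functional are all hypothetical choices made for this example.

```python
import random

def differential_evolution(cost, bounds, pop_size=20, f=0.7, cr=0.9,
                           generations=200, seed=0):
    """Plain DE/rand/1/bin minimizer (illustrative, not DEEP)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    costs = [cost(x) for x in pop]
    for _ in range(generations):
        for k in range(pop_size):
            # Pick three distinct members other than the current one
            r1, r2, r3 = rng.sample([j for j in range(pop_size) if j != k], 3)
            j_forced = rng.randrange(dim)  # at least one coordinate mutates
            trial = list(pop[k])
            for j in range(dim):
                if j == j_forced or rng.random() < cr:
                    lo, hi = bounds[j]
                    value = pop[r1][j] + f * (pop[r2][j] - pop[r3][j])
                    trial[j] = min(max(value, lo), hi)  # clip to search box
            trial_cost = cost(trial)
            if trial_cost <= costs[k]:  # greedy selection
                pop[k], costs[k] = trial, trial_cost
    best = min(range(pop_size), key=costs.__getitem__)
    return pop[best], costs[best]

def toy_cost(theta):
    # Hypothetical stand-in for the cost functional: squared distance
    # between a two-parameter vector and a known target
    target = (1.5, -0.5)
    return sum((p - q) ** 2 for p, q in zip(theta, target))

best_theta, best_cost = differential_evolution(toy_cost, [(-5.0, 5.0)] * 2)
```

In a real fit the cost would compare the integrated model solution with the expression data, and the starting population could be seeded near a previously found circuit, as is done for the WT-only fits described above.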
The possible overfitting problem was treated by applying the local identifiability
analysis of the parameter estimates. Two approaches were used. First, the sensitivity
of the model to parameter changes and identifiability of parameters in the vicinity of
the model solutions were analyzed on the basis of confidence intervals. The analysis
allowed us to refine the predicted regulatory network topology. It was shown that
while most of the regulatory parameters $T^{ab}$ and $E^{ae}$ were well identified and their
estimates could be used to make conclusions about the type of gene interaction, some
of the parameters, most of which belonged to Kr, were poorly identifiable. As
parameter nonidentifiability could be a consequence of their strong correlation, the
collinearity analysis6 of the sensitivity matrix was applied to reveal subsets of correlated parameters. This method confirmed that poor identifiability of parameters
could be explained by their correlations with the rest of the parameters. Our analysis also
demonstrated that parameters of the model fitted to two genotypes were better
identifiable than those of the model fitted to WT data alone.
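As a small illustration of collinearity analysis, the sketch below computes a collinearity index for a pair of parameters from two columns of a sensitivity matrix. It uses the common definition of the index as the inverse square root of the smallest eigenvalue of the normalized matrix product, specialized here to two columns. The function name and the example columns are hypothetical, and the paper's own computation may differ in detail.

```python
import math

def collinearity_index_pair(s1, s2):
    """Collinearity index of two sensitivity-matrix columns.

    With both columns scaled to unit length and c their cosine, the
    normalized product matrix is [[1, c], [c, 1]], whose smallest
    eigenvalue is 1 - |c|. The index is 1 / sqrt(1 - |c|).
    """
    n1 = math.sqrt(sum(x * x for x in s1))
    n2 = math.sqrt(sum(x * x for x in s2))
    c = sum(x * y for x, y in zip(s1, s2)) / (n1 * n2)
    smallest = 1.0 - abs(c)
    # Exactly collinear columns give an unbounded index
    return math.inf if smallest <= 0.0 else 1.0 / math.sqrt(smallest)

# Orthogonal columns: changes in one parameter cannot be compensated
# by the other, so the index takes its minimum value
independent = collinearity_index_pair([1.0, 0.0], [0.0, 1.0])

# Nearly proportional columns: the two parameters are strongly
# correlated and the index becomes very large
correlated = collinearity_index_pair([1.0, 2.0], [2.0, 4.0])
```

A large index means that a change in one parameter can be almost fully compensated by a change in the other, which is the situation in which individual estimates become poorly identifiable.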
References
1. Gutenkunst R, Waterfall J, Casey F, Brown K, Myers C, Sethna J, Universally sloppy
parameter sensitivities in systems biology models, PLoS Comput Biol 3(10):1871–1878,
2007.
2. Jaeger J, Surkova S, Blagov M, Janssens H, Kosman D, Kozlov K, Manu, Myasnikova E,
Vanario-Alonso C, Samsonova M, Sharp D, Reinitz J, Dynamic control of positional
information in the early Drosophila embryo, Nature 430:368–371, 2004.
3. Jaeger J, Blagov M, Kosman D, Kozlov K, Manu, Myasnikova E, Surkova S, Samsonova
M, Sharp D, Reinitz J, Dynamical analysis of regulatory interactions in the gap gene
system of Drosophila melanogaster, Genetics 167:1721–1737, 2004.
4. Kozlov K, Surkova S, Myasnikova E, Reinitz J, Samsonova M, Modeling of gap gene
expression in Drosophila Kruppel mutants, PLoS Comput Biol 8(8):e1002635, 2012.
5. Ashyraliyev M, Jaeger J, Blom J, Parameter estimation and determinability analysis
applied to Drosophila gap gene circuits, BMC Syst Biol 2:83, 2008.
6. Brun R, Reichert P, Künsch H, Practical identifiability analysis of large environmental
simulation models, Water Resour Res 37:1015–1030, 2001.
7. Raue A, Kreutz C, Maiwald T, Bachmann J, Schilling M, Klingmüller U, Timmer J,
Structural and practical identifiability analysis of partially observed dynamical models by
exploiting the profile likelihood, Bioinformatics 25:1923–1929, 2009.
8. Hengl S, Kreutz C, Timmer J, Maiwald T, Data-based identifiability analysis of nonlinear dynamical models, Bioinformatics 23:2612–2618, 2007.
9. Jacquez JA, Perry T, Parameter estimation: Local identifiability of parameters, Am J
Physiol 258(4):E727–E736, 1990.
10. Bates D, Watts D, Nonlinear Regression Analysis and its Applications, J. Wiley, 1988.
11. Dresch J, Liu X, Arnosti D, Ay A, Thermodynamic modeling of transcription: Sensitivity
analysis differentiates biological mechanism from mathematical model-induced effects,
BMC Syst Biol 4:142, 2010.
12. Surkova S, Golubkova E, Manu, Panok L, Mamon L, Reinitz J, Samsonova M, Quantitative dynamics and increased variability of segmentation gene expression in the Drosophila Kr and kni mutants, Dev Biol 376:99–112, 2013.
13. Pisarev A, Samsonova M, A method for solving multiobjective reverse problems under
conditions of uncertainty (in Russian), Biofizika 58(2):221–232, 2013.
14. Kozlov K, Samsonov A, DEEP Differential evolution entirely parallel method for gene
regulatory networks, J Supercomput 57:172–178, 2010.
Ekaterina Myasnikova obtained her Ph.D. in mathematics
from St. Petersburg State Polytechnical University, Russia. Her
recent research interests lie mainly in the field of systems biology
and biostatistics. She is a leading researcher in the Department of
Computational Biology in the Center of Advanced Studies of
St. Petersburg Polytechnical University and Associate Professor
at the Department of Bioinformatics, Moscow Institute of Physics
and Technology, Russia.
Konstantin N. Kozlov graduated in 2000 from the State
Polytechnical University, Applied Math Department, St. Petersburg, Russia. He received his Ph.D. degree in Bioinformatics,
Computational biology and Modeling in 2013. He has worked at
the Department of Computational Biology, CAS, Polytechnical
University, St. Petersburg, since 2001. Dr. Kozlov is a specialist in
applied mathematical methods and information technologies, and
conducts scientific research in the field of mathematical modeling
of biological systems, optimization and biomedical image processing. He is a coauthor of more than 20 papers published in international peer-reviewed journals and
has made more than 30 oral and poster presentations at international conferences.