Journal of Bioinformatics and Computational Biology, Vol. 12, No. 2 (2014) 1441002 (16 pages)
© Imperial College Press
DOI: 10.1142/S0219720014410029

Statistical method for estimation of the predictive power of a gene circuit model

Ekaterina Myasnikova*,†,‡ and Konstantin N. Kozlov†,§

*Department of Computational Biology, St. Petersburg State Polytechnical University, 29 Polytekhnicheskaya, St. Petersburg 195251, Russia
†Department of Bioinformatics, Moscow Institute of Physics and Technology, Institutskiy per. 9, Dolgoprudny 141700, Moscow Region, Russia
‡[email protected]
§[email protected]

Received 7 October 2013
Revised 18 December 2013
Accepted 12 January 2014
Published 27 March 2014

In this paper, a specific aspect of the prediction problem is considered: high predictive power is understood as a possibility to reproduce correct behavior of model solutions at predefined values of a subset of parameters. The problem is discussed in the context of a specific mathematical model, the gene circuit model for segmentation gap gene system in early Drosophila development. A shortcoming of the model is that it cannot be used for predicting the system behavior in mutants when fitted to wild type (WT) data. In order to answer a question whether experimental data contain enough information for the correct prediction we introduce two measures of predictive power. The first measure reveals the biologically substantiated low sensitivity of the model to parameters that are responsible for correct reconstruction of expression patterns in mutants, while the second one takes into account their correlation with the other parameters. It is demonstrated that the model solution, obtained by fitting to gene expression data in WT and Kr mutants simultaneously, and exhibiting the high predictive power, is characterized by much higher values of both measures than those fitted to WT data alone.
This result leads us to the conclusion that the information contained in WT data is insufficient to reliably estimate the large number of model parameters and to provide predictions for mutants.

Keywords: Predictive power; sensitivity analysis; identifiability analysis; gene circuits; overfitting.

1. Introduction

The correct prediction of system behavior is a necessary property of a mathematical model, and it is largely determined by the identifiability of the model parameters. The number of parameters that are estimated by fitting to experimental data is typically large. For a comprehensive analysis of modeling results it is necessary to know how reliable the parameter estimates are; this constitutes the identifiability problem. In practice, insufficient or noisy data, as well as strong correlations or even functional relations between parameters, may prevent the unambiguous determination of parameter values. In addition, some parameters that carry a definite biological meaning nevertheless have estimates that can vary by orders of magnitude without significantly influencing the quality of the fit. Such parameters are referred to as "sloppy" (Ref. 1).

Analysis of parameter identifiability is closely connected with the study of the predictive properties of the model. The correct prediction of the model behavior at fixed predefined values of several parameters is only possible if the rest (the free parameters) are identifiable. If there exist strong correlations between the fixed parameters and those estimated by fitting, the prediction may become infeasible, since in this case a change in one parameter value causes simultaneous changes in the correlated parameters. Thus, the parameters compensate for each other's effect on the cost functional, and hence if one of them is fixed the value of the other is unreliable.
Besides, if the fixed parameters are those to which the cost functional is the most sensitive and the rest of the parameters do not essentially affect the quality of fit, the model can also exhibit poor predictive results. In this paper, we will focus on these two sources of parameter nonidentifiability that are responsible for poor predictive power.

A typical example of the predictive approach is a gene circuit model that dynamically reconstitutes the set of interactions within the genetic network. The model was successfully applied to correctly reproduce the dynamics of pattern formation in the context of the segmentation gap gene system in the wild type (WT) Drosophila embryo (Refs. 2, 3). Theoretically, if the model is fitted to WT data, the gene expression in embryos mutant for one of the gap genes is predicted by setting the parameters related to the missing gene to zero. However, the only example of correct modeling of mutants is presented in Ref. 4, where the model is fitted to gap gene expression patterns in WT and in embryos with a homozygous null mutation in the Kr gene simultaneously. All the attempts to predict the system behavior in mutants without fitting to data from two genotypes failed. The unsuccessful predictions could be explained, for example, by overfitting, which is a consequence of model overparameterization when the experimental data used for fitting are insufficient. This problem will be explored in our paper.

Basically, two approaches are used to handle nonidentifiability. The first one is referred to as a priori or structural identifiability analysis, as the model structure is examined for nonidentifiabilities before the simulating and fitting procedures. Within the second approach, a posteriori or practical identifiability study, nonidentifiabilities are detected by fitting to data and investigating the parameter estimates (Refs. 5–8). Besides, parameter identifiability can be addressed either locally near a given point or globally over the whole parameter space.
In our current study we will focus on a local a posteriori analysis (Refs. 6, 7, 9). This approach is typically based on asymptotic confidence intervals (Refs. 5, 10, 11) that characterize the model sensitivity to parameters. To study the model predictive properties there is no need to consider confidence intervals for each individual parameter; it is sufficient to select those parameter combinations that make a maximum impact on the model solution. Such combinations are found for two sets of parameters: the full parameter set and a subset of parameters that define the predicted behavior of the system. Then a measure of predictive power is constructed as a relative sensitivity to these two types of parameter combinations. Although the methods of local sensitivity analysis forming the basis of our approach are well known, the predictive sensitivity measures are novel and first introduced in this paper.

In this paper, we introduce two relative sensitivity measures that characterize the two sources of poor predictive power formulated above. While the method is general and can be applied to study the predictive power of any model, it will be discussed in the context of a specific biological system.

2. Methods

2.1. Problem statement

Let the dynamics of a biological system be described by a system of ordinary differential equations with an unknown m-dimensional vector of parameters $\theta \in \Theta$. The model solution is obtained by fitting the model to experimental data through minimization of

$$S(\theta) = \sum_{i=1}^{N} \left( y_i(t_i, \theta) - \tilde{y}_i \right)^2 = Y^T(\theta)\, Y(\theta), \qquad (1)$$

with respect to the parameter vector. Here $\tilde{y}_i$ is an observed value, $y_i(t_i, \theta)$ is the corresponding model value and $N$ is the number of observations. $Y(\theta) = Y(\theta, t)$ is the model solution defined by a vector of parameters $\theta$. If measurement errors are independent and normally distributed, the values of the parameter vector $\hat{\theta}$ that minimize Eq. (1) are the maximum likelihood estimates (MLE).

2.2. Confidence intervals for parameter estimates

The most commonly used approach to local identifiability analysis of parameters is based on asymptotic confidence intervals (Refs. 5, 10). The asymptotic $(1 - \alpha)$-confidence region for an unknown parameter vector $\theta$ is determined from the inequality

$$(\theta - \hat{\theta})^T J^T J\, (\theta - \hat{\theta}) \le \frac{m}{N - m}\, S(\hat{\theta})\, F_{\alpha, m, N-m}, \qquad (2)$$

where the Jacobian $J = J(\theta) = \partial Y(\theta) / \partial \theta$ is the so-called sensitivity matrix of size $N \times m$, and $F_{\alpha, m, N-m}$ is an $\alpha$-quantile of the F-distribution with $m$ and $N - m$ degrees of freedom. The inverse of matrix $J^T(\theta) J(\theta)$ multiplied by the variance of the observation error is the covariance matrix of the parameter estimates. It is convenient to represent matrix $J^T J$ using its singular value decomposition (SVD). SVD factorizes a symmetric matrix $M$ as $M = V \Lambda V^T$, where the columns of $V$ are the eigenvectors of matrix $M$ and matrix $\Lambda$ is diagonal with entries equal to the eigenvalues of $M$. With regard to the SVD of matrix $J^T J$, the confidence region given in Eq. (2) takes the form

$$(\theta - \hat{\theta})^T V \Lambda V^T (\theta - \hat{\theta}) \le R^2, \qquad (3)$$

where $R^2 = R^2(\alpha, m, N, \hat{\theta}) = \frac{m}{N - m}\, S(\hat{\theta})\, F_{\alpha, m, N-m}$ is the right-hand side of Eq. (2).

Lengths of the parameter confidence intervals allow us to make conclusions about the reliability of each parameter estimate. However, confidence intervals for individual parameters $\theta_i$ can be expressed exactly from Eq. (2) only in the case of parameter orthogonality, i.e. if the covariance matrix of the parameter vector is diagonal and hence the parameters are uncorrelated. For correlated parameters there may exist different interpretations of individual confidence intervals. We will consider two types of intervals introduced in Ref. 5. Dependent confidence intervals do not take into account parameter correlations and are computed for $\theta_i$ with the values of all the other parameters fixed at their MLE values:

$$|\theta_i - \hat{\theta}_i| \le \frac{R}{\sqrt{(V \Lambda V^T)_{ii}}}, \qquad i = 1, \ldots, m. \qquad (4)$$

Geometrically, this interval represents an intersection of the ellipsoid [Eq. (2)] by the line parallel to the ith parameter axis (see Fig. 1). Another type of confidence interval, referred to as the independent one, is defined as the whole range of variation of the parameter as the other parameters take any possible values from the m-dimensional region given by Eq. (2):

$$|\theta_i - \hat{\theta}_i| \le R \sqrt{(V \Lambda^{-1} V^T)_{ii}}, \qquad i = 1, \ldots, m. \qquad (5)$$

This interval represents the projection of the confidence region onto the ith parameter axis. We denote the half-lengths of the dependent and independent confidence intervals as $I_D(\theta_i)$ and $I_I(\theta_i)$, respectively.

Dependent and independent confidence intervals coincide if all the parameters are orthogonal, while if some parameters are correlated the independent confidence intervals exceed the dependent ones. In the latter case, the confidence region is oblong and its principal axes are inclined with respect to the parameter axes. As a result its projection is much larger than any intersection of the ellipsoid with any line parallel to a parameter axis. This consideration is illustrated in Fig. 1 for the two-parameter case (see footnote a).

Footnote a: Strong correlations between parameters cause difficulties in the numerical calculation of confidence intervals, as in this case matrix $J^T J$ is ill-conditioned and its exact inversion is infeasible. Instead, a standard approximation of the inverse of an ill-conditioned matrix $M$, the Moore–Penrose pseudo-inverse, is used.

Fig. 1. Example of independent and dependent confidence intervals for two-dimensional parameter space. (a) Well-identifiable parameters; (b) poorly-identifiable parameters.

2.3. Measures of predictive power

2.3.1. Sensitivity analysis

In this section, we introduce measures of the relative sensitivity of the model to different linear combinations of parameters. The full information about the model sensitivity to parameters can be obtained from the shape of the confidence region [Eq. (2)].
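The two interval types of Eqs. (4) and (5) can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the function name `confidence_half_lengths`, the example matrix `J` and the value `R = 1.0` are hypothetical, and the radius R is assumed to have been computed beforehand from Eq. (2). The pseudo-inverse is used in place of the exact inverse, as suggested in footnote a.

```python
import numpy as np

def confidence_half_lengths(J, R):
    """Half-lengths of dependent and independent intervals, Eqs. (4) and (5).

    J : (N, m) sensitivity matrix (hypothetical input)
    R : radius of the confidence region, i.e. sqrt of the right-hand side of Eq. (2)
    """
    H = J.T @ J                    # information matrix J^T J = V Lambda V^T
    H_pinv = np.linalg.pinv(H)     # Moore-Penrose pseudo-inverse, V Lambda^{-1} V^T
    dependent = R / np.sqrt(np.diag(H))          # Eq. (4): other parameters fixed
    independent = R * np.sqrt(np.diag(H_pinv))   # Eq. (5): projection of the ellipsoid
    return dependent, independent

# Hypothetical sensitivity matrix with two strongly correlated parameters
J = np.array([[1.0, 0.9],
              [0.9, 1.0],
              [1.0, 1.1]])
dep, ind = confidence_half_lengths(J, R=1.0)
print("dependent:  ", dep)
print("independent:", ind)
```

For correlated columns of `J` the independent half-lengths come out noticeably larger than the dependent ones, which is the situation of panel (b) in Fig. 1; for orthogonal columns the two coincide.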
Without loss of generality we can consider a centered parameter vector $\theta$ such that $\hat{\theta} = 0$, i.e. the ellipsoid center coincides with the origin of coordinates. The rotation transform of the parameter space defined by matrix $V$, composed of the eigenvectors of the information matrix $J^T J$, generates the canonical basis with respect to which the principal axes of the confidence ellipsoid coincide with the parameter axes. Such a change of basis transforms the centered parameter vector $\theta$ to the new vector $\tilde{\theta} = V^T \theta$ with components that are linear combinations of the parameters $\theta_i$. The confidence ellipsoid is then defined by $\tilde{\theta}^T \Lambda \tilde{\theta} \le R^2$. New parameters constructed in such a way are referred to as principal components of the parameter space. Principal components are orthogonal, so their dependent and independent confidence intervals $I(\tilde{\theta}_i)$ coincide and are given by the right-hand side of

$$|\tilde{\theta}_i| \le \frac{R}{\sqrt{\lambda_i}}, \qquad i = 1, \ldots, m, \qquad (6)$$

where $\lambda_i$ is the ith eigenvalue of $J^T J$. If the elements of vector $\tilde{\theta}$ are numbered in descending order of the eigenvalues $\lambda_i$, the first several principal components $\tilde{\theta}_i$ contain almost all the information about the parameter estimates that can be extracted from the data. In other words, it is possible to reduce the dimensionality of the parameter space and only consider a few parameters, say $L$, for the comprehensive analysis of the model sensitivity.

If parameters are properly normalized, the length of the confidence interval is the shortest for the first principal component and increases for each next component of $\tilde{\theta}$. Hence, the identifiability of principal components becomes worse as the component number increases (Ref. 10). As a total measure of the sensitivity of the model solution defined by a parameter vector $\theta$ we accept the sum of squared reciprocals of the half-lengths of the confidence intervals for the first $L$ principal components $\tilde{\theta}_1, \ldots, \tilde{\theta}_L$; the shorter the confidence interval is, the higher the sensitivity will be.
For the two types of confidence intervals two types of sensitivity measures can be defined as

$$M_I(\theta) = \sum_{i=1}^{L} I_I^{-2}(\theta_i) \quad \text{and} \quad M_D(\theta) = \sum_{i=1}^{L} I_D^{-2}(\theta_i).$$

These quantities will be referred to as the independent and dependent measures of sensitivity, respectively. Obviously, $M_I$ is always less than or equal to $M_D$, because an independent interval is never shorter than the corresponding dependent one. For orthogonal principal components $M(\tilde{\theta}) = M_D(\tilde{\theta}) = M_I(\tilde{\theta}) = \frac{1}{R^2} \sum_{i=1}^{L} \lambda_i$.

We will estimate the sensitivity of the model to any parameter vector $\theta_\Phi$ that is a result of rotation of $\tilde{\theta}$ by some orthogonal matrix, $\theta_\Phi = V_\Phi^T \tilde{\theta}$, i.e. the vector of linear combinations of principal components. It will be shown in the next section that a vector of the model parameters, with some of these being fitted and the rest being fixed, can be presented as such a rotation. The components of $\theta_\Phi$ may be correlated, and the lengths of their dependent and independent confidence intervals, $I_D$ and $I_I$, may not coincide (see footnote b).

Principal components are those linear combinations of parameters that contribute the most to the cost functional. Our aim is to compare the sensitivity of the model to the principal components $\tilde{\theta}$ and to the parameter combinations generated by the space rotation $V_\Phi$. In other words, we want to compare the confidence intervals defined by Eq. (6) for the principal components with those for the parameters $\theta_\Phi$ given by $I_D(\theta_\Phi)$ and $I_I(\theta_\Phi)$. As illustrated geometrically in Fig. 2, the exact confidence intervals for the first principal components are shorter than both the dependent and independent intervals with respect to the rotated basis, while for the highest components the opposite situation is observed.

We introduce two quantities to characterize the relative sensitivity of rotated parameter estimates. The relative dependent sensitivity of the model to the rotated vector $\theta_\Phi$ is defined as the ratio

$$G_D(\theta_\Phi) = M_D(\theta_\Phi) / M(\tilde{\theta}). \qquad (7)$$

Geometrically, this is the measure of inclination of the confidence ellipsoid with respect to the canonical basis.
In practice, the ratio value characterizes to what extent the space rotation reduces the model sensitivity to the first $L$ components of $\theta_\Phi$ as compared to $\tilde{\theta}$. If the ratio is close to 1, the rotation does not essentially influence the model sensitivity. Note that correlations between parameters are not taken into account so far, and hence we analyze the sensitivity of the model to rotated combinations of parameters with no regard to the values of other parameters and their combinations. Therefore this consideration is also true for independent parameters and reflects the intrinsic properties of the model.

Footnote b: Obviously, the original parameter vector $\theta$ can also be represented as an orthogonal rotation of $\tilde{\theta}$ by the matrix $V$, i.e. we can just consider the original parameters as one of the possible rotations of the principal components, and not distinguish them among the other parameter vectors.

Fig. 2. Construction of relative sensitivity measures in the two-parameter case. Dependent and independent confidence intervals for parameters are presented by two-sided solid arrows for $\theta_1$ (a) and $\theta_2$ (b). Confidence intervals for principal components $\tilde{\theta}_1$ (a) and $\tilde{\theta}_2$ (b) are dotted two-sided arrows. $\tilde{\theta}_1$ is the first principal component as its confidence interval is the shortest. In this case $I(\tilde{\theta}_1) < I_D(\theta_1) < I_I(\theta_1)$ and $I_D(\theta_2) < I_I(\theta_2) < I(\tilde{\theta}_2)$. If $L = 1$, then $\tilde{\theta}^L = \{\tilde{\theta}_1\}$. Relative sensitivity measures are given by $G_D = I^2(\tilde{\theta}_1) / I_D^2(\theta_1)$ and $G_I = I_D^2(\theta_1) / I_I^2(\theta_1)$.

If there exist correlations between parameters it is additionally important to consider the ratio of measures

$$G_I(\theta_\Phi) = M_I(\theta_\Phi) / M_D(\theta_\Phi). \qquad (8)$$

This quantity, referred to as the relative independent sensitivity, characterizes the loss of the model sensitivity to the rotated parameters due to parameter correlations.
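The ratios of Eqs. (7) and (8) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the function name `relative_sensitivities` is hypothetical, the information matrix `H` stands for $J^T J$, and the rotation is passed as an orthogonal matrix `W` whose columns are the new basis directions written in the original parameter coordinates. The radius R cancels in both ratios, so it does not appear.

```python
import numpy as np

def relative_sensitivities(H, W, L):
    """Relative dependent and independent sensitivities, Eqs. (7) and (8).

    H : (m, m) information matrix J^T J
    W : (m, m) orthogonal matrix; column k is the k-th rotated direction (assumption)
    L : number of leading components taken into account
    """
    lam = np.sort(np.linalg.eigvalsh(H))[::-1]   # eigenvalues, descending
    M_pc = lam[:L].sum()                         # M(theta~) up to the factor 1/R^2
    H_rot = W.T @ H @ W                          # quadratic form in the rotated basis
    M_dep = np.diag(H_rot)[:L].sum()             # sum of I_D^{-2} up to 1/R^2
    M_ind = (1.0 / np.diag(np.linalg.pinv(H_rot))[:L]).sum()  # sum of I_I^{-2}
    return M_dep / M_pc, M_ind / M_dep

# Hypothetical two-parameter example with correlated parameters
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])
G_D, G_I = relative_sensitivities(H, np.eye(2), L=1)
print(G_D, G_I)   # G_D = 2/3 and G_I = 3/4 for this matrix
```

Passing the eigenvectors of `H` (ordered by descending eigenvalue) as `W` returns both ratios equal to 1, which corresponds to the principal components themselves.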
From a geometric point of view the latter ratio characterizes the degree of ellipsoid oblongness: the more the ellipsoid is elongated along the axes of parameters correlated with the first components of $\theta_\Phi$, the lower the ratio is.

Both relative measures are always less than or equal to 1, where equality is only possible if the parameters are uncorrelated. A value of either of them much less than 1 is evidence of poor identifiability of the parameters.

2.3.2. Sensitivity measures characterize the model predictive power

Now we will show how the sensitivity measures formally introduced in the previous section can be associated with the predictive properties of the model. Application of the method will be illustrated using a gene circuit model.

First of all, the structure of the parameter set $\Theta$ that defines the model solution is explored. We aim to predict the behavior of a model solution at given predefined values of a subset of parameters $\Theta_g$. For definiteness let these parameters be set to zero. Denote the complement of this subset as $\bar{\Theta}_g = \Theta \setminus \Theta_g$. With respect to the gene circuit model this means that for each gene g we consider a subset $\Theta_g$ composed of the parameters that describe the action of g either as target or as regulator. To model the evolution of gene expression in null mutants for gene g the parameters from $\Theta_g$ are zeroed.

The correct prediction is only possible if the estimates of $\bar{\Theta}_g$, the subset composed of parameters not related to g, are well-identifiable. Nonidentifiability of these parameters can be caused either by a biologically substantiated low sensitivity of the model to the parameter changes or by their correlation with the other parameters. Low sensitivity to $\bar{\Theta}_g$ means that fitting is for the most part implemented with respect to parameters involving gene g, while the rest do not essentially affect the value of the functional.
If there exists a strong correlation between the $\Theta_g$ parameters and the other ones, a change of a parameter from $\Theta_g$ causes simultaneous changes in the correlated parameters from the complementary subset $\bar{\Theta}_g$. Thus, setting the $\Theta_g$ parameters to zero leads to nonidentifiability of the parameters that are not related to gene g.

For the sake of clarity, the model with the full set of fitted parameters is referred to as the full model. The model properties are explored in the vicinity of $\hat{\theta}$, a point in parameter space that defines a specific model solution. For an informative sensitivity analysis it is sufficient to reveal the parameter combinations that make the maximum contribution to the cost functional and to study their role in the predictions.

First, we find the linear combinations of all the parameters from the parameter set $\Theta$ to which the model is the most sensitive. For this purpose, we apply the SVD to the full matrix $J^T J = V \Lambda V^T$ and extract $L$ principal components that compose the vector $\tilde{\theta}^L = (\tilde{\theta}_1, \ldots, \tilde{\theta}_L)$. The dependent and independent measures of sensitivity $M_I$ and $M_D$ coincide for $\tilde{\theta}^L$ as principal components are uncorrelated.

Next, we address the parameters from subset $\bar{\Theta}_g$ only. The sensitivity matrix for this subset reduces to $J_g$, which results from matrix $J$ when all its columns corresponding to parameters from $\Theta_g$ are set to zero. SVD is applied to matrix $J_g^T J_g = V_g \Lambda_g V_g^T$ to extract the principal components $\tilde{\theta}_g = V_g^T \theta$ with respect to the basis generated by the eigenvectors of $J_g^T J_g$; these are linear combinations that do not include parameters related to g. The first principal components $\tilde{\theta}_g^L = (\tilde{\theta}_{g1}, \ldots, \tilde{\theta}_{gL})$ are those parameter combinations that mainly define the behavior of the system that we aim to predict. Hence, the identifiability of these parameter estimates obtained by fitting the full model is necessary for good prediction.

Thus, the question is what the model sensitivity to the parameters $\tilde{\theta}_g^L$ will be.
The parameter vector $\tilde{\theta}_g$ can be represented as a result of the rotation $\tilde{\theta}_g = V_g^T V \tilde{\theta}$ with respect to the canonical basis generated by the eigenvectors of the full information matrix $J^T J$. Therefore, the linear combinations $\tilde{\theta}_g^L$ are not necessarily orthogonal with respect to this basis, and the sensitivity measures $M_I(\tilde{\theta}_g^L)$ and $M_D(\tilde{\theta}_g^L)$ may be unequal. For the sake of brevity the superscript "L" will be omitted in what follows.

To analyze the identifiability of parameters from $\bar{\Theta}_g$ we consider the relative measures of sensitivity introduced in Eqs. (7) and (8). The ratio $G(\tilde{\theta}_g) = M_I(\tilde{\theta}_g) / M_D(\tilde{\theta})$ characterizes the model's relative sensitivity to the subset of parameters $\bar{\Theta}_g$ as compared with the sensitivity to the full set $\Theta$. A low value of $G_D(\tilde{\theta}_g) = M_D(\tilde{\theta}_g) / M_D(\tilde{\theta})$ means that the model is relatively insensitive to parameters not related to g, and hence when the parameters from $\Theta_g$ are zeroed the model behavior may be reproduced incorrectly. In particular, such a situation may happen if the $\bar{\Theta}_g$ parameters are sloppy and do not significantly influence the quality of fit. The second measure $G_I(\tilde{\theta}_g) = M_I(\tilde{\theta}_g) / M_D(\tilde{\theta}_g)$ reflects the degree of correlation between the subsets $\Theta_g$ and $\bar{\Theta}_g$, so that its low value is also evidence of poor identifiability of the subsets, and hence of low predictive power.

3. Results

We study the predictive power of the gene circuit model that is fitted to the data on expression of four target genes, hb, Kr, gt and kni. Thereby four parameter subsets, $\Theta_{hb}$, $\Theta_{Kr}$, $\Theta_{gt}$ and $\Theta_{kni}$, are defined as described in Sec. 2.3.2, each composed of all the parameters involving the corresponding gene as target or regulator. The sensitivity of the model to parameters is analyzed in the vicinity of the solutions defined by four parameter sets $C_1$, $C_2$, $C_1^{WT}$ and $C_2^{WT}$, which are introduced in Appendix A. Confidence intervals for parameter estimates were constructed and analyzed in detail in Ref. 4.
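Before turning to the individual parameter sets, the procedure of Sec. 2.3.2 for one gene subset can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the function name `predictive_measures` and the argument `zeroed` (column indices of the parameters related to gene g) are hypothetical, and the toy sensitivity matrix has orthogonal columns purely so that the result can be checked by hand.

```python
import numpy as np

def predictive_measures(J, zeroed, L):
    """G_D and G_I for the parameters not related to a gene g (Sec. 2.3.2).

    J      : (N, m) sensitivity matrix of the full model
    zeroed : column indices of the parameters related to g (hypothetical argument)
    L      : number of leading principal components
    """
    H = J.T @ J                                   # full information matrix
    lam = np.sort(np.linalg.eigvalsh(H))[::-1]
    M_full = lam[:L].sum()                        # measure for the full set, up to 1/R^2

    J_g = J.copy()
    J_g[:, zeroed] = 0.0                          # drop sensitivities to g-related parameters
    _, V_g = np.linalg.eigh(J_g.T @ J_g)          # eigenvalues in ascending order
    V_g = V_g[:, ::-1]                            # reorder to descending eigenvalues

    H_rot = V_g.T @ H @ V_g                       # full information matrix in the V_g basis
    M_dep = np.diag(H_rot)[:L].sum()
    M_ind = (1.0 / np.diag(np.linalg.pinv(H_rot))[:L]).sum()
    return M_dep / M_full, M_ind / M_dep

# Toy example: three uncorrelated parameters with sensitivities 3, 2, 1
J = np.diag([3.0, 2.0, 1.0])
G_D, G_I = predictive_measures(J, zeroed=[0], L=2)
print(G_D, G_I)
```

In this toy case the zeroed parameter is the one the model is most sensitive to, so the dependent measure drops to 5/13 while the independent one stays at 1 because nothing is correlated.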
Dependent and independent confidence intervals for gene circuit $C_1$ are reproduced in Fig. 3. In the cited paper, we focused on the reliability of the estimates of regulatory weights, i.e. the reliability of our conclusions about the type of gene-to-gene interaction within the genetic network. The sensitivity of the model to parameters, in the case of properly normalized parameter values, is characterized by the size of their confidence intervals. For example, the shortest independent and dependent intervals are obtained for the parameters describing the action of a strong activator Cad on all the target genes, which indicates the highest sensitivity of the model to these parameters. The longest dependent intervals correspond to the parameters from $\Theta_{Kr}$ that were classified as unreliable (Ref. 4). The difference between dependent and independent confidence intervals is a result of existing correlations between the parameter estimates. This issue was explored in detail in Ref. 4 using the local collinearity analysis (Ref. 6). For more information see Appendix A.

Fig. 3. 95% dependent (thin bars) and independent (thick bars) confidence intervals for parameter set $C_1$. MLE of parameters are depicted as small black squares. The horizontal axis is labeled by notations of regulatory weights. For simplicity each parameter is denoted by two gene names, the first of which is a target, the second is a regulator. For example, $E_{hb}^{cad}$ is denoted CadHb.

In the sensitivity analysis, we first focus on parameter vectors $C_1$ and $C_2$ that are obtained as a result of fitting to two-genotype data. These parameter estimates provide a good fit both to WT and to Kr mutant data. We compute the relative sensitivity measures $G_D$ and $G_I$ introduced in Eqs. (7) and (8). For this purpose principal components $\tilde{\theta}$ are constructed for the full set of parameters and $\tilde{\theta}_g$ for the parameter subsets $\bar{\Theta}_g$ as described in Sec. 2.3.
Using the terminology adopted in factor analysis we will call the absolute values of the coefficients at model parameters parameter loadings. A small number of parameters with large loadings almost fully characterizes the model sensitivity. Seven parameters from set $C_1$ extracted in such a way are presented in Table 1. The most informative parameters are those that have high loadings within the first principal components. Parameter loadings are shown for principal components constructed for the full parameter set $\Theta$ (first column) and subset $\bar{\Theta}_{Kr}$ (second column).

Table 1. Parameter loadings for subsets $C_1$ and $C_1^{WT}$.
C WT 1 C1 Parameter cad E gt cad E hb Tll E hb cad E Kr cad E kni T hb T kni ~ ~Kr 0.62 0.64 0.30 0.65 0.53 2 0.85 4 0.88 0.30 1 0.34 1 3 1 2 3 3 ~ 0.64 0.65 0.30 0.65 0.59 0.45 2 0.88 0.33 1 0.34 0.74 1 ~Kr 0.96 1 3 1 0.56 3 0.59 3 0.69 0.87 0.46 0.46 0.81 0.39 3 0.65 3 0.98 2 0.33 3 2 3 4 3 1 2 1 2 3 4
Note: Parameter loadings are shown for two model solutions, $C_1$ (columns 1 and 2) and $C_1^{WT}$ (columns 3 and 4). Columns 1 and 3 contain loadings in principal components (PCs) constructed for the full parameter set, while in columns 2 and 4 the loadings are given for PCs composed of parameters not related to Kr. The number of PCs, $L$, is defined from the inequality $\sum_{i=1}^{L} \lambda_i / \sum_{i=1}^{m} \lambda_i > 0.95$ [for notation see Eq. (6)]. The order number of a component is shown as a superscript. The loadings greater than 0.3 in absolute value are shown.

Examining the table one can see that the parameters from $\Theta$ to which the full model is the most sensitive are: $E_{kni}^{cad}$ (loading 0.88 in the first principal component $\tilde{\theta}_1$), $E_{hb}^{cad}$ (0.30 in $\tilde{\theta}_2$; 0.65 in $\tilde{\theta}_2$), $T^{kni}$ (0.34 in $\tilde{\theta}_1$), $E_{gt}^{cad}$ (0.62 in $\tilde{\theta}_2$). Obviously these are the parameters with the shortest confidence intervals (see Fig. 3). Note that the only parameter from $\Theta_{Kr}$ appears in the fourth principal component, i.e. the model sensitivity to parameters from this subset is not too high. The second column presents the loadings for parameters not related to Kr. Comparing the two columns, we see that the maximal loadings take very similar values, which means that the parameter combinations that mainly define the model solutions for WT and for Kr mutants are almost the same.

The relative sensitivity measures computed for $C_1$ and $C_2$ are shown in the upper part of Table 2. The dependent measure $G_D$ for the Kr subset takes a high value, equal to 0.91, which is in good agreement with the fact that the model correctly reproduces gene expression in Kr mutants. On the other hand, the full model is highly sensitive to parameters related to kni, $E_{kni}^{cad}$ and $T^{kni}$, and, consequently, the value of measure $G_D$ is the lowest for the kni subset, being as low as 0.13. Thus, we can expect a bad prediction in null mutants for kni, which was indeed demonstrated on kni mutants published in Ref. 12.

Up to now, we did not pay attention to parameter correlations, the effect of which on predictive power is taken into account in the independent relative measures $G_I$. In our previous study (Ref. 4), it was shown that parameters of the model fitted to two genotypes are less correlated and hence better identifiable than those of the model fitted to WT data only. The criterion introduced here also reflects the same tendency: for the Kr subset the measure $G_I$ is by orders of magnitude higher than for the parameter subsets related to other genes.

Table 2. Relative sensitivity measures.

                 C_1                            C_2
        hb     Kr     gt     kni       hb     Kr     gt     kni
G_D     0.56   0.91   0.73   0.13      0.26   0.97   0.73   0.13
G_I     0.005  0.26   0.01   0.004     0.001  0.33   0.001  0.005

                 C_1^WT                         C_2^WT
        hb     Kr     gt     kni       hb     Kr     gt     kni
G_D     0.76   0.33   0.36   0.61      0.61   0.67   0.24   0.41
G_I     0.001  0.01   0.001  0.003     0.01   0.009  0.006  0.003

Note: Relative sensitivity measures computed for four parameter sets defining model solutions. High values of the measures characterize high predictive power of the model with the parameters from subset $\Theta_g$ being zeroed. The values of $G_D$ and $G_I$ computed for the Kr subset are shown in bold.

Now we address parameter vectors $C_1^{WT}$ and $C_2^{WT}$ fitted to WT data alone. Parameter loadings for vector $C_1^{WT}$ are given in the third (full parameter set $\Theta$) and fourth (subset $\bar{\Theta}_{Kr}$) columns of Table 1. Here, we see a situation that is somewhat different from the one observed in the first two columns: the full model is the most sensitive to parameter $E_{Kr}^{cad}$ from $\Theta_{Kr}$, while the highest loadings in the fourth column are those at parameters $E_{gt}^{cad}$ and $E_{kni}^{cad}$. In other words, the model fitted with respect to the subset of parameters is highly sensitive to parameters that are not reliably estimated in the full model. Naturally, in this case the value of the dependent measure $G_D$ for the Kr subset is not high. The highest value of $G_D$ is related to hb in both sets $C_1^{WT}$ and $C_2^{WT}$. The measure for gt is the lowest, and finally for parameters from the Kr and kni subsets the measure takes similar values. Nevertheless, these values are not high enough to reliably provide good prediction. All the values of the independent measures $G_I$ are low and of approximately the same order, which is evidence of high correlations between parameters.

It is shown that the model solution obtained by fitting to gene expression data in WT and null mutants for Kr demonstrates much better predictions for Kr mutants than those fitted to WT data. This result may serve as an explanation of why all the previous attempts to predict the dynamics of pattern formation in mutants turned out to be unsuccessful. For correct prediction of gene expression in mutant embryos it is necessary to add the mutant data to the dataset used for model fitting, as the information contained in WT experimental data is insufficient to reliably estimate the large number of model parameters.

4. Discussion

High predictive power is a necessary property of any mathematical model.
In this paper, a specific aspect of the problem is considered: predictive power is understood as a possibility to predict the correct behavior of model solutions at predefined values of a subset of parameters. The problem is discussed in the context of a specific mathematical model, the gene circuit model for the segmentation gap gene system in early Drosophila development. The model was successfully applied to correctly reproduce the dynamics of pattern formation in the WT embryo (Refs. 2, 3). However, when fitted to WT data, the model could not be used for prediction of system behavior in mutants. In order to obtain model solutions describing gap gene expression in WT and mutants, in our recent work (Ref. 4) the model was fitted to data from two genotypes simultaneously. These results demonstrated the existence of parameter sets describing gap gene expression in two genotypes simultaneously and thus the applicability of the gene circuit formalism to model genotypes of gap mutants. As the number of model parameters is very high, one may wonder whether overfitting was the reason why these parameter sets were not discovered during the fit to the WT data alone. In our current study, we focus on this problem.

In the context of the overfitting problem, the following questions arise: do the experimental data contain enough information to correctly predict the system behavior at fixed values of a parameter subset, and is the model overparameterized? We make an attempt to address these issues in this paper. The developed method is based on the analysis of parameter identifiability and is applied to explore the predictive properties of the gene circuit model.
Two types of relative measures of predictive power are considered: the first one reveals the biologically substantiated low sensitivity of the model to parameters that are responsible for correct reconstruction of expression patterns in mutants, while the second one takes into account their correlation with the other parameters. It is shown that the model solution obtained by fitting to gene expression data in WT and Kr mutants demonstrates much higher predictive power than those fitted to WT data alone. This fact may be interpreted as a manifestation of the overfitting problem: the information contained in WT experimental data is insufficient for correct prediction of gene expression in mutant embryos. This conclusion does not exclude other explanations of the incorrect predictions, but even assuming that the model represents the underlying mechanism of the modeled process accurately and correctly, the model fitted to WT data alone is not suitable for mutant predictions.

Another practical situation for which the proposed method may be appropriate is the problem of selecting, among the optimal solutions, those that provide better predictions. When applied for this purpose, the measures can be introduced as additional regularization criteria into a high-dimensional global optimization problem. The regularization narrows the search space and thereby makes it possible to obtain solutions with the required properties. An example of an optimization approach that presumes this kind of regularization is published in Ref. 13.

Acknowledgments

This work was supported by EC Collaborative project HEALTH-F5-2010-260429 and RFBR projects 13-01-00405 and 11-01-00573.

Appendix A. Estimation and Identifiability Analysis of Parameters of the Gene Circuit Model

The gene circuit model (Refs. 2–4) describes the dynamics of segmentation gene expression in the syncytial blastoderm of Drosophila melanogaster during cleavage cycle 14A.
The aim of modeling is to decipher the molecular mechanisms that control the process of segment determination in Drosophila. Most segmentation genes encode transcription factors, which regulate the expression of other genes in the segmentation gene network. The regulatory topology of the network is obtained by solving the inverse problem of mathematical modeling. We consider the model first presented in Ref. 4, which successfully reproduces the time evolution of protein concentrations of the gap genes hb, Kr, gt and kni in two genotypes: WT and embryos with a homozygous null mutation in the Kr gene. A sample of gene expression dynamics is presented in Fig. A.1.

Fig. A.1. Temporal dynamics of gt gap gene expression during the modeling period in WT embryos. The model is fitted to the data extracted from the outlined strip along the A–P axis and presented as averaged protein concentrations in a row of nuclei. Confocal images and quantitative data are available from the FlyEx database (http://urchin.spbcas.ru/flyex).

The model considers a one-dimensional row of nuclei along the anteroposterior (A–P) axis of the embryo. The modeled region covers the posterior half of the embryo body. The concentration $v_i^a$ of each gap gene product $a$ in each nucleus $i$ over time $t$ is described by the following system of ordinary differential equations:

$$\frac{dv_i^a}{dt} = R_a\, g(u_i^a) + D^a(n)\left[(v_{i-1}^a - v_i^a) + (v_{i+1}^a - v_i^a)\right] - \lambda_a v_i^a. \tag{A.1}$$

The three terms on the right-hand side of the equation represent protein synthesis, protein diffusion and protein decay, respectively. $u_i^a = \sum_{b=1}^{4} T^{ab} v_i^b + \sum_{e=1}^{3} E^{ae} v_i^e + h^a$ is the total regulatory input to gene $a$. Genes denoted as $e$ (bcd, cad and tll) are external inputs, i.e. genes that are not regulated by gap genes but regulate them. $T^{ab}$ and $E^{ae}$ are genetic inter-connectivity matrices that characterize the action of regulator $b$ or external input $e$ on gene $a$. $h^a$ is a threshold parameter of the sigmoid regulation-expression function $g(u)$.
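As an illustration, the right-hand side of Eq. (A.1) can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the sigmoid is assumed to have the form $g(u) = \frac{1}{2}\left(u/\sqrt{u^2+1} + 1\right)$ (the text above only says that $g$ is sigmoid), the diffusion coefficient is treated as constant over the modeled period, and no-flux conditions are assumed at both ends of the row of nuclei.

```python
import math


def g(u):
    # Assumed sigmoid regulation-expression function (the exact form is not
    # given in the text): rises smoothly from 0 to 1, with g(0) = 0.5.
    return 0.5 * (u / math.sqrt(u * u + 1.0) + 1.0)


def gene_circuit_rhs(v, v_ext, T, E, h, R, D, lam):
    """Right-hand side of Eq. (A.1) for every nucleus i and gap gene a.

    v     : v[i][a], gap gene product concentrations (n_nuclei x 4)
    v_ext : v_ext[i][e], external inputs bcd, cad, tll (n_nuclei x 3)
    T, E  : regulatory matrices T[a][b] (4 x 4) and E[a][e] (4 x 3)
    h, R, D, lam : per-gene threshold, maximum synthesis rate,
                   diffusion coefficient and decay rate
    """
    n_nuclei, n_genes = len(v), len(v[0])
    dvdt = []
    for i in range(n_nuclei):
        # Assumed no-flux boundaries: a missing neighbour is replaced by
        # the nucleus itself, so its diffusion contribution is zero.
        left = v[max(i - 1, 0)]
        right = v[min(i + 1, n_nuclei - 1)]
        row = []
        for a in range(n_genes):
            # Total regulatory input u_i^a.
            u = (sum(T[a][b] * v[i][b] for b in range(n_genes))
                 + sum(E[a][e] * v_ext[i][e] for e in range(len(v_ext[i])))
                 + h[a])
            synthesis = R[a] * g(u)
            diffusion = D[a] * ((left[a] - v[i][a]) + (right[a] - v[i][a]))
            decay = lam[a] * v[i][a]
            row.append(synthesis + diffusion - decay)
        dvdt.append(row)
    return dvdt
```

The function returns the time derivatives in the same nuclei-by-genes layout as `v`, so it can be passed (after flattening) to any standard ODE integrator.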
In Eq. (A.1), $R_a$ is the maximum synthesis rate, $D^a$ the diffusion coefficient, and $\lambda_a$ the decay rate of the product of gene $a$. Thus, the expression level of each of the four target genes is described by 10 parameters, and the whole set is composed of 40 parameters. The model parameters are estimated by fitting the model output to gene expression data through minimization of the cost functional (1). The minimization is performed by means of the differential evolution entirely parallel (DEEP) method (Refs. 4, 14).

In our previous work (Ref. 4), 11 vectors of parameter estimates (optimal circuits) were obtained by fitting the model to two genotypes simultaneously: WT and Kr mutant embryos. We use two of these vectors for our analysis: vector $C_1$, which defines the consensus gap gene network, and vector $C_2$, the one that differs most from $C_1$. A consensus network is one in which the signs of the regulatory parameters coincide with the predicted network topology inferred from all the fits. Additionally, two parameter vectors are obtained by fitting the model to WT data only, using the estimates from $C_1$ and $C_2$ as initial values for the minimization. These vectors are denoted $C_1^{WT}$ and $C_2^{WT}$, respectively.

The possible overfitting problem was treated by applying local identifiability analysis to the parameter estimates. Two approaches were used. First, the sensitivity of the model to parameter changes and the identifiability of parameters in the vicinity of the model solutions were analyzed on the basis of confidence intervals. The analysis allowed us to refine the predicted regulatory network topology. It was shown that while most of the regulatory parameters $T^{ab}$ and $E^{ae}$ were well identified and their estimates could be used to draw conclusions about the type of gene interaction, some of the parameters, most of which belonged to the Kr subset, were poorly identifiable.
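The confidence-interval analysis described above can be illustrated with the standard linear-approximation intervals of nonlinear regression. The sketch below is not the authors' implementation and does not compute the measures GD and GI. The function names, the forward-difference sensitivity matrix and the fixed normal quantile 1.96 are assumptions made for illustration. "Dependent" half-widths account for correlations between parameters through the inverse of $J^T J$, whereas "independent" half-widths treat all other parameters as fixed.

```python
import math


def sensitivity_matrix(model, theta, rel_step=1e-6):
    """Forward-difference sensitivity matrix J[i][j] = d model_i / d theta_j."""
    base = model(theta)
    columns = []
    for j, t in enumerate(theta):
        step = rel_step * max(abs(t), 1.0)
        shifted = list(theta)
        shifted[j] = t + step
        out = model(shifted)
        columns.append([(o - b) / step for o, b in zip(out, base)])
    # Transpose so that rows are observations and columns are parameters.
    return [list(row) for row in zip(*columns)]


def invert(a):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices)."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-300:
            raise ValueError("singular matrix: parameters are not locally identifiable")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def interval_half_widths(J, residuals, quantile=1.96):
    """Return (dependent, independent) confidence-interval half-widths."""
    m, n = len(J), len(J[0])
    # Residual variance estimate with m - n degrees of freedom.
    sigma2 = sum(r * r for r in residuals) / (m - n)
    JtJ = [[sum(J[i][a] * J[i][b] for i in range(m)) for b in range(n)]
           for a in range(n)]
    cov = invert(JtJ)
    dependent = [quantile * math.sqrt(sigma2 * cov[j][j]) for j in range(n)]
    independent = [quantile * math.sqrt(sigma2 / JtJ[j][j]) for j in range(n)]
    return dependent, independent
```

A dependent half-width is never smaller than the corresponding independent one. A large ratio between the two for a given parameter indicates that the parameter is strongly correlated with others, which is the situation the collinearity analysis is designed to detect.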
Second, as parameter nonidentifiability could be a consequence of strong correlation between parameters, the collinearity analysis (Ref. 6) of the sensitivity matrix was applied to reveal subsets of correlated parameters. This method confirmed that the poor identifiability of parameters could be explained by their correlations with the rest of the parameters. Our analysis also demonstrated that parameters of the model fitted to two genotypes were better identifiable than those of the model fitted to WT data alone.

References

1. Gutenkunst R, Waterfall J, Casey F, Brown K, Myers C, Sethna J, Universally sloppy parameter sensitivities in systems biology models, PLoS Comput Biol 3(10):1871–1878, 2007.
2. Jaeger J, Surkova S, Blagov M, Janssens H, Kosman D, Kozlov K, Manu, Myasnikova E, Vanario-Alonso C, Samsonova M, Sharp D, Reinitz J, Dynamic control of positional information in the early Drosophila embryo, Nature 430:368–371, 2004.
3. Jaeger J, Blagov M, Kosman D, Kozlov K, Manu, Myasnikova E, Surkova S, Samsonova M, Sharp D, Reinitz J, Dynamical analysis of regulatory interactions in the gap gene system of Drosophila melanogaster, Genetics 167:1721–1737, 2004.
4. Kozlov K, Surkova S, Myasnikova E, Reinitz J, Samsonova M, Modeling of gap gene expression in Drosophila Kruppel mutants, PLoS Comput Biol 8(8):e1002635, 2012.
5. Ashyraliyev M, Jaeger J, Blom J, Parameter estimation and determinability analysis applied to Drosophila gap gene circuits, BMC Syst Biol 2:83, 2008.
6. Brun R, Reichert P, Kunsch H, Practical identifiability analysis of large environmental simulation models, Water Resour Res 37:1015–1030, 2001.
7. Raue A, Kreutz C, Maiwald T, Bachmann J, Schilling M, Klingmuller U, Timmer J, Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood, Bioinformatics 25:1923–1929, 2009.
8. Hengl S, Kreutz C, Timmer J, Maiwald T, Data-based identifiability analysis of nonlinear dynamical models, Bioinformatics 23:2612–2618, 2007.
9. Jacquez JA, Perry T, Parameter estimation: Local identifiability of parameters, Am J Physiol 258(4):E727–E736, 1990.
10. Bates D, Watts D, Nonlinear Regression Analysis and its Applications, J. Wiley, 1988.
11. Dresch J, Liu X, Arnosti D, Ay A, Thermodynamic modeling of transcription: Sensitivity analysis differentiates biological mechanism from mathematical model-induced effects, BMC Syst Biol 4:142, 2010.
12. Surkova S, Golubkova E, Manu, Panok L, Mamon L, Reinitz J, Samsonova M, Quantitative dynamics and increased variability of segmentation gene expression in the Drosophila Kr and kni mutants, Dev Biol 376:99–112, 2013.
13. Pisarev A, Samsonova M, A method for solving multiobjective reverse problems under conditions of uncertainty (in Russian), Biofizika 58(2):221–232, 2013.
14. Kozlov K, Samsonov A, DEEP Differential evolution entirely parallel method for gene regulatory networks, J Supercomput 57:172–178, 2010.

Ekaterina Myasnikova obtained her Ph.D. in mathematics from St. Petersburg State Polytechnical University, Russia. Her recent research interests lie mainly in the field of systems biology and biostatistics. She is a leading researcher in the Department of Computational Biology in the Center of Advanced Studies of St. Petersburg Polytechnical University and Associate Professor at the Department of Bioinformatics, Moscow Institute of Physics and Technology, Russia.

Konstantin N. Kozlov graduated in 2000 from the State Polytechnical University, Applied Math Department, St. Petersburg, Russia. He received his Ph.D. degree in Bioinformatics, Computational Biology and Modeling in 2013. He has worked at the Department of Computational Biology, CAS, Polytechnical University, St. Petersburg, since 2001. Dr. Kozlov is a specialist in applied mathematical methods and information technologies, and conducts scientific research in the field of mathematical modeling of biological systems, optimization and biomedical image processing.
He is a coauthor of more than 20 papers published in international peer-reviewed journals and has given more than 30 oral and poster presentations at international conferences.