Dimension Reduction of Chemical Process Simulation Data

Gábor Janiga(1) and Klaus Truemper(2)
(1) Laboratory of Fluid Dynamics and Technical Flows, University of Magdeburg "Otto von Guericke", 39106 Magdeburg, Germany
(2) Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, U.S.A.
Corresponding author: G. Janiga [email protected]

Abstract. In the analysis of combustion processes, simulation is a cost-efficient tool that complements experimental testing. The simulation models must be precise if subtle differences are to be detected. On the other hand, computational evaluation of precise models typically requires substantial effort. To escape the computational bottleneck, reduced chemical schemes, for example ILDM-based methods or the flamelet approach, have been developed that result in substantially reduced computational effort and memory requirements. This paper proposes an additional analysis tool based on the Machine Learning concepts of Subgroup Discovery and Lazy Learning. Its goal is a compact representation of chemical processes using few variables. Efficacy is demonstrated for simulation data of a laminar methane/air combustion process described by 29 chemical species, 3 thermodynamic properties (pressure, temperature, enthalpy), and 2 velocity components. From these data, the reduction method derives a reduced set of 3 variables from which the other 31 variables are estimated with good accuracy.

Key words: dimension reduction, subgroup discovery, lazy learner, modeling combustion

1 Introduction

Virtually all transportation systems and the majority of electric power plants rely directly or indirectly on the combustion of fossil fuels. Numerical simulation has become increasingly important for improving the efficiency of these combustion processes. Indeed, simulation constitutes a cost-efficient complement of experimental testing of design prototypes.

A key demand is that simulation be precise, since otherwise subtle differences of complex processes escape detection. For example, the analysis of quantitative problems like predictions of intermediate radicals, pollutant emissions, or ignition/extinction limits requires complete chemical models [6]. Until recently, most models have been 1- or 2-dimensional due to the huge computational effort required for 3-dimensional models; see [11, 9] for the computational effort of some 3-dimensional models. Reduced chemical schemes, for example ILDM-based methods or the flamelet approach [17], lead to a considerable speed-up of the computations when compared with a complete reaction mechanism, with small loss of accuracy. They drastically reduce the required computational memory as well. These chemical schemes are based on an analysis of chemical pathways that identifies the most important ones. The other reactions are then modeled using the main selected variables.

This paper proposes an additional analysis tool called REDSUB that is based on Machine Learning methods, specifically Subgroup Discovery [12, 19] and the concept of the Lazy Learner [15]. It analyzes simulation data of chemical processes obtained with a complete reaction scheme and searches for a compact representation that uses fewer variables while keeping errors within specified limits. To demonstrate efficacy, a laminar methane/air burner is considered that occurs in many practical applications. The numerical simulation evaluates a 2-dimensional model using detailed chemistry [10].
For each grid point, the simulation produces values for 29 chemical species, 3 thermodynamic properties (pressure, temperature, enthalpy), and 2 velocity components. Given this information, REDSUB selects 3 variables from which the values of the remaining 31 variables are estimated with good accuracy.

We begin with a general definition of the reduction problem. Let X be a finite collection of vectors in R^n, the n-dimensional real space. In the present setting, X contains the results of a simulation model for some process. For efficient use of X, for example for optimization purposes, we want to select a small subset J of {1, 2, ..., n} and construct a black box so that the following holds. For any vector x ∈ X, the black box accepts as input the x_j, j ∈ J, and outputs values that are close estimates of the x_j, j ∉ J.

For a simplified treatment of numerical accuracy and tolerances, we translate, scale, and round the values of the vectors x ∈ X so that the entries become uniformly comparable in magnitude. Thus, we select translation constants, scaling, and rounding so that all variables have integer values ranging from 0 to 10^6, with both bounds attained on X. From now on we assume that the vectors of X themselves are of this form.

In the situations of interest here, we are given some additional information. Specifically, with each vector x ∈ X an associated m-dimensional vector y of R^m and the value of a real function F(x) are supplied. For given input x_j, j ∈ J, the above black box must also output a close estimate of the function value F(x). On the other hand, the black box need not supply estimates of the values of the vector y. The reader may wonder why we introduce the function value F(x), since it could be declared to be an additional entry in the vectors of X. In a moment we introduce an assumption about the function F(·), and for that reason it is convenient to treat its values separately.

As a final condition that is part of the problem definition, the index set J selected in the reduction process must be a subset of a specified nonempty set I ⊆ {1, 2, ..., n} and must contain a possibly empty subset M ⊆ I. For example, the x_j, j ∈ I, may represent values that are easily measured in the laboratory and thus support a simplified partial verification of the full model via the reduced model. The subset M contains the indices of variables that we require in the reduced model for any number of reasons.

In the example combustion process discussed in Section 3, each vector x ∈ X contains n = 33 values for various chemical components, temperature, pressure, and two velocities. The function F(x) is the enthalpy. The associated vector y ∈ R^m has m = 2 and defines the point in the plane to which the values of x and F(x) apply. Problems of that size are easily handled by REDSUB. Indeed, in principle the method can handle cases where the vectors x of X have up to several hundred entries. The size of m is irrelevant. Once REDSUB has identified the index set J, the Lazy Learner of REDSUB can use J and the set X to estimate, for any vector where just the values x_j, j ∈ J, are given, the values of all x_j, j ∉ J, and F(x). This feature allows application of the results in settings similar to that producing X.
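As an aside, the translation, scaling, and rounding described above is simple to carry out per variable. The following sketch is ours and only illustrative (NumPy and the function name are assumptions, not part of REDSUB):

    import numpy as np

    def normalize_to_integer_grid(X, upper=10**6):
        """Translate, scale, and round each column of X so that its values
        become integers spanning 0..upper, with both bounds attained on X."""
        X = np.asarray(X, dtype=float)
        lo = X.min(axis=0)             # translation constant per variable
        span = X.max(axis=0) - lo      # scaling constant per variable
        span[span == 0.0] = 1.0        # guard against constant variables
        return np.rint((X - lo) / span * upper).astype(np.int64)

    # Example with three vectors of two variables; each column now spans 0..10^6
    X = [[300.0, 0.01], [1800.0, 0.04], [900.0, 0.02]]
    print(normalize_to_integer_grid(X))

By construction, both bounds 0 and 10^6 are attained on X, as required above.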
We introduce two assumptions concerning the vectors y and the function F(x). They are motivated by the applications considered here. The first assumption says that the vectors y constitute a particular sampling of a compact convex subspace of R^m. The second one demands that F(·) is close to being a one-to-one function on the data set X.

Assumption 1. Collectively, the vectors y associated with the vectors x ∈ X constitute a grid of some compact convex subset of R^m.

Assumption 2. F(·) is close to being a one-to-one function, in the sense that the number of distinct function values F(x), x ∈ X, is close to the number of vectors in X.

We motivate the assumptions by the combustion processes we have in mind. There, the vectors y are the coordinate values of the 1-, 2-, or 3-dimensional simulation. Since the simulation generates the points of a grid, Assumption 1 is trivially satisfied. Assumption 2 demands that the function distinguishes between almost all points x ∈ X. In the example application discussed later, the function is the enthalpy of the process and turns out to satisfy Assumption 2.

The paper proceeds as follows. Section 2 describes Algorithm REDSUB. Section 3 covers application of the method to a methane combustion process. Section 4 gives details of the subgroup discovery method used in REDSUB. Section 5 summarizes the results.

2 Algorithm REDSUB

In a traditional approach to the problem at hand, we could (1) estimate the function values with ridge regression [7, 8]; (2) derive the index set J via the estimating coefficients of the ridge regression equations of (1) and covariance considerations; and (3) use ridge regression once more to compute equations that estimate the values of the variables x_j, j ∉ J, and F(x) from those of the variables x_j, j ∈ J. The approach typically requires that the user specify nonlinear transformations of the variables for the regression steps (1) and (3). Here, we propose a new approach that is based on two Machine Learning concepts and does not require user-specified nonlinear transformations. Specifically, we use Subgroup Discovery [12, 19] to select interesting indices j from the set I; from these indices, the set J is defined. Then we use the concept of the Lazy Learner to predict values for the x_j, j ∉ J, and F(x).

2.1 Subgroup Discovery

The Subgroup Discovery method, called SUBARP, constructs polyhedra that represent important partial characterizations of level sets {x ∈ X | F(x) ≥ c} of F(·) for various values c. SUBARP has been constructed from the classification method Lsquare [2-5] and the extension of [18], using the general construction scheme of [14] for deriving subgroup discovery methods from rule-based classification methods. Details are included in Section 4.

The key arguments for use of the polyhedra are as follows. If a variable occurs in the specification of a polyhedron, then control of the value of that variable is important if function values are to lie in a certain level set. This is a direct consequence of the fact that Subgroup Discovery constructed the polyhedron. Accordingly, the frequency q_j with which a variable x_j occurs in the definition of the various polyhedra is a measure of the importance the variable generally has for inducing function values F(·). The frequency q_j is computed as follows. Define the length of an inequality to be the number of its nonzero coefficients. For a given polyhedron, the contribution to q_j is 0 if x_j does not occur in any inequality of the polyhedron, and otherwise is 1/k, where k is the length of the shortest inequality containing x_j.
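The frequency rule lends itself to a compact computational statement. The sketch below is only an illustration under an assumed data layout (each polyhedron a list of inequalities, each inequality a mapping from variable indices to its nonzero coefficients); SUBARP's internal representation may differ.

    from collections import defaultdict

    def variable_frequencies(polyhedra):
        """q[j]: summed contributions 1/k over all polyhedra, where k is the
        length (number of nonzero coefficients) of the shortest inequality of
        the polyhedron containing variable j; 0 if j does not occur at all."""
        q = defaultdict(float)
        for poly in polyhedra:
            shortest = {}                  # j -> length of shortest inequality containing j
            for ineq in poly:
                k = len(ineq)              # length = number of nonzero coefficients
                for j in ineq:
                    shortest[j] = min(shortest.get(j, k), k)
            for j, k in shortest.items():
                q[j] += 1.0 / k
        return dict(q)

    # Two hypothetical polyhedra over variables 0, 1, 2
    polys = [[{0: 1.0, 1: -2.0}, {1: 0.5}],   # x0 - 2*x1 <= b1 and 0.5*x1 <= b2
             [{2: 3.0}]]                      # 3*x2 <= b3
    print(variable_frequencies(polys))        # {0: 0.5, 1: 1.0, 2: 1.0}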
For a simplified discussion, momentarily assume that the indices j of high frequencies q_j constitute the set J. We know that the variables x_j, j ∈ J, are essential for determining the level sets of F(·). By Assumption 2, F(·) is close to being a one-to-one function; thus, function values F(x) or, equivalently, level sets may be used to estimate x vectors. Since the variables x_j, j ∈ J, are important for determining level sets of F(·), we know that such x_j are also important for determining entire x vectors, in particular, for estimating values of the variables x_j, j ∉ J.

The actual construction of the set J is a bit more involved and relies on iterative steps carried out by REDSUB. Details are discussed after the description of the Lazy Learner. Here we mention only that REDSUB splits the input data set X, which from now on we call a learning set, 50/50 into a training set S and a testing set T. The split is done in such a way that the y vectors corresponding to the vectors of S constitute a coarser subgrid of the original grid. SUBARP utilizes both sets S and T, while the Lazy Learner only uses S.

2.2 Lazy Learner

A Lazy Learner is a method that, well, never learns anything. Instead, it directly derives from a given training set an estimate for the case at hand and then discards any insight that could be gained from those computations [15]. In the present situation, given the index set J and the training set S, the Lazy Learner estimates for a given vector x the entries x_j, j ∉ J, and F(x) from the x_j, j ∈ J. This section describes that estimation process, which is a particular Locally Weighted Learning algorithm; see [1] for an excellent survey.

For the given vector x, let v be the subvector of x indexed by J, and define k = |J|. We first find a vector x^1 ∈ S whose subvector z^1 indexed by J is closest to v according to the infinity norm, which defines the distance between any two vectors g and h to be max_j |g_j - h_j|. We use that distance measure throughout unless the distance is explicitly specified to be Euclidean. If z^1 = v, then x^1 and F(x^1) supply the desired estimates. Otherwise, we use the following 5-step interpolation process. (1) Let L be the line passing through v and z^1, and denote by d the distance between the two points. Define p^1 and p^2 to be the two points on L whose distance from v is equal to d and 2d, respectively, such that on L the point v lies between z^1 and p^1 as well as between z^1 and p^2; see Figure 1. (2) Let Z be the set of vectors z such that z is a subvector of some x ∈ S and is within distance d of p^2. (3) If Z is empty, x^1 and F(x^1) supply the desired estimates. Otherwise, let z^2 be the vector of Z that is closest to p^1, and let x^2 be any vector of S with z^2 as subvector; break ties in the choices of z^2 and x^2 arbitrarily. (4) Project v onto the line K passing through z^1 and z^2, getting a point v'. The projection uses Euclidean distance. (5) If the point v' lies on the line segment of K defined by z^1 and z^2 and thus is a convex combination of z^1 and z^2, say v' = λz^1 + (1-λ)z^2 for some λ ∈ [0, 1], then the vector λx^1 + (1-λ)x^2 and λF(x^1) + (1-λ)F(x^2) provide the desired estimates. Otherwise, use x^1 and F(x^1) to obtain the estimates.

Fig. 1. Lazy Learner Interpolation Process
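For concreteness, the following sketch implements the 5-step interpolation as we read it above. It is an illustration rather than the authors' code, and the array-based data layout (rows of S as training vectors, F as the corresponding function values) is an assumption.

    import numpy as np

    def lazy_estimate(v, S, F, J):
        """Estimate a full vector x and F(x) from the subvector v = x[J].
        S: 2-D array of training vectors (one per row); F: 1-D array of F(x)
        values for the rows of S; J: list of selected indices."""
        S, F, v = np.asarray(S, float), np.asarray(F, float), np.asarray(v, float)
        Z = S[:, J]                                  # subvectors indexed by J
        dists = np.abs(Z - v).max(axis=1)            # infinity-norm distances to v
        i1 = int(dists.argmin())
        z1, d = Z[i1], dists[i1]
        if d == 0.0:                                 # exact match: no interpolation
            return S[i1], F[i1]
        p1 = 2.0 * v - z1                            # step (1): distance d beyond v on line L
        p2 = 3.0 * v - 2.0 * z1                      # step (1): distance 2d beyond v on line L
        near_p2 = np.abs(Z - p2).max(axis=1) <= d    # step (2): subvectors within d of p2
        if not near_p2.any():                        # step (3): Z empty -> nearest neighbor
            return S[i1], F[i1]
        cand = np.where(near_p2)[0]
        i2 = int(cand[np.abs(Z[cand] - p1).max(axis=1).argmin()])
        z2 = Z[i2]
        w = z2 - z1                                  # step (4): Euclidean projection onto line K
        denom = float(np.dot(w, w))
        if denom == 0.0:                             # degenerate case: z2 coincides with z1
            return S[i1], F[i1]
        t = float(np.dot(v - z1, w)) / denom         # v' = z1 + t*(z2 - z1)
        if 0.0 <= t <= 1.0:                          # step (5): v' on the segment -> interpolate
            lam = 1.0 - t                            # v' = lam*z1 + (1-lam)*z2
            return lam * S[i1] + (1 - lam) * S[i2], lam * F[i1] + (1 - lam) * F[i2]
        return S[i1], F[i1]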
The above choices of distance measure and interpolation were derived via experimentation. Indeed, the infinity norm emphasizes maximum deviations, and with that measure, the interpolation seemingly is effective in producing estimates of uniform accuracy for the entire range of variables and the function F(x). The estimation error for any vector x of the testing set T is the distance between the vector [x, F(x)] and the corresponding vector of estimated values. The average error is the sum of the estimation errors over the set T, divided by |T|. It is used next in Section 2.3.

2.3 Iterations of REDSUB

Algorithm REDSUB constructs the set J in an iterative process from the given learning set X, where in iteration i ≥ 1 a set J_i is selected. Recall that I ⊆ {1, 2, ..., n} contains the indices j of variables x_j that may occur in the reduced model, and that M has the indices j for which x_j must occur in that model. Furthermore, the set X has been split 50/50 into a training set S and a testing set T.

Iteration i ≥ 1 begins with the set J_{i-1}; for i = 1, the set J_{i-1} = J_0 is defined to be the set I of variables that may be considered for the reduced representation. Before iteration 1 commences, the Lazy Learner uses the index set J_0 and the training set S to compute estimates for the testing set T as well as the average error of those estimates. If the average error exceeds a user-specified bound, the method declares that a reduced model with acceptable estimation errors cannot be found, and stops.

Iteration i ≥ 1 proceeds as follows. If J_{i-1} = M, then no further reduction is possible, and REDSUB outputs J_{i-1} as the desired set J and stops; otherwise, SUBARP computes frequencies q_j, j ∈ J_{i-1}, as described in Section 2.1, except that J_{i-1} plays the role of J. Let q* be the minimum of the q_j with j ∈ (J_{i-1} - M). REDSUB processes each j ∈ (J_{i-1} - M) for which q_j is at or near the minimum q*, as follows. First, the index j is temporarily removed from J_{i-1}, resulting in a set J', and the Lazy Learner computes the average error as above, except that J' is used instead of J_0. Second, let j* ∈ (J_{i-1} - M) be the index for which the average error is smallest. If that smallest error is below a user-specified bound, the associated J' = J_{i-1} - {j*} is defined to be J_i, and iteration i + 1 begins; otherwise, the process stops, and J_{i-1} is declared to be the desired set J.

If q* = 0, which typically happens in early iterations i, then the method accelerates the process by also considering the case where all j ∈ (J_{i-1} - M) with q_j = 0 are simultaneously deleted. If the average error of that case is below the user-specified bound, then that case defines J_i, and iteration i + 1 begins; otherwise, the above process using j* is carried out.
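The control flow of this reduction loop can be summarized in a short sketch. The callables frequencies and average_error are hypothetical stand-ins for SUBARP (Section 2.1) and the Lazy Learner evaluation on the testing set (Section 2.2), and the tolerance tol models the phrase "at or near the minimum q*"; the sketch is illustrative only.

    def redsub(I, M, frequencies, average_error, error_bound, tol=0.0):
        """Skeleton of the REDSUB reduction loop of Section 2.3.
        I: indices that may appear in the reduced model; M: indices that must appear.
        frequencies(J): dict j -> q_j from subgroup discovery restricted to J.
        average_error(J): average Lazy Learner error on the testing set using J.
        Returns the selected index set J, or None if no acceptable model exists."""
        J = set(I)
        if average_error(J) > error_bound:     # even the full candidate set is too inaccurate
            return None
        while J != set(M):
            q = frequencies(J)
            free = [j for j in J if j not in M]
            q_star = min(q.get(j, 0.0) for j in free)
            if q_star == 0.0:                  # acceleration: try dropping all zero-frequency indices
                zeros = {j for j in free if q.get(j, 0.0) == 0.0}
                if len(zeros) > 1 and average_error(J - zeros) <= error_bound:
                    J = J - zeros
                    continue
            candidates = [j for j in free if q.get(j, 0.0) <= q_star + tol]
            errors = {j: average_error(J - {j}) for j in candidates}
            j_star = min(errors, key=errors.get)
            if errors[j_star] <= error_bound:
                J = J - {j_star}               # accept the deletion and start the next iteration
            else:
                break                          # no further deletion keeps the error acceptable
        return J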
Computational Complexity. REDSUB is polynomial if suitable polynomial implementations of SUBARP and the Lazy Learner are used. We should mention, though, that SUBARP in our implementation uses the Simplex Method for the solution of certain linear programming instances. Since the Simplex Method is not polynomial, our implementation of REDSUB cannot be claimed to be polynomial. But the Simplex Method is well known for solving linear programs rapidly and reliably, and for this reason it is used in SUBARP.

The next section presents computational results of REDSUB for a methane combustion process.

3 Application

In [10], a simulation model represents a partially-premixed laminar methane/air combustion process in 2-dimensional space. Thus, m = 2. We skip a review of the model and only mention that the simulation records 34 variables measuring important characteristics such as mixture composition, temperature, pressure, two velocities, and enthalpy. The mixture composition covers 29 gases, among them H2, H2O, O2, CO, CH4, CO2, and N2. The 29 gases define the set I from which the set J is to be selected. The set M of mandated selections is empty. The enthalpy plays the role of the function F(x), so the remaining n = 33 variables make up the vectors x. For the 7,310 grid points y shown in Figure 2, the simulation supplies vectors x, each with values for the 33 variables. Thus, the set X consists of 7,310 vectors.

Fig. 2. Numerical Grid in the Vicinity of the Flame Front

We split the data 50/50 into a learning set X, from which the reduced model is derived via SUBARP and the Lazy Learner, and a verification set V. The split is done so that the y vectors corresponding to the vectors of X constitute a coarser subgrid of the original grid produced by the simulation. That way the y vectors associated with X satisfy Assumption 1. The learning set X has 3,655 vectors. Of these, 3,412 vectors (93%) have distinct function values F(x). Thus, F(x) is close to one-to-one on X, and Assumption 2 is satisfied.

Recall that the length of an inequality is the number of its nonzero coefficients. One option of SUBARP controls the complexity of the polyhedra via a user-specified upper bound on the length of the defining inequalities. The option relies on the method of [18]. In principle, the upper bound on the length can be any value 2^l, l ≥ 0, but experimental results cited in the reference indicate that l ≤ 2 is preferred. We run SUBARP with each of the corresponding three bounds of 1, 2, and 4 on the length of the inequalities. In subgroup applications of SUBARP, we typically use 20 distinct values for any target variable, which is a variable to be investigated. Here, the target variable is F(x), and in agreement with prior uses of SUBARP we employ 20 distinct c values to define the level sets {x ∈ X | F(x) ≥ c}. For each of these level sets, SUBARP may find polyhedra contained in the level set as well as in its complement. Hence, a number of polyhedra may be determined for a given level set. Of these, we choose for each level set the polyhedron with highest significance for the computation of the frequencies q_j as described in Section 2.1.

We run REDSUB once for each of the three bounds on the length of inequalities. For each run, the overall error is the average error computed by the Lazy Learner divided by the range of the variable values. Due to the transformation described in Section 1, the latter range is uniformly equal to 10^6 for each variable. REDSUB stops the reduction process when the overall error begins to exceed a user-specified bound e.

Using e = 0.5%, each of the three REDSUB runs produces the same solution. It consists of the three variables H2, H2O, and N2. To evaluate the effectiveness of the solution variables, we employ the entire set X in the Lazy Learner and use it as a black box to estimate values for the verification set V. The overall error turns out to be 0.26%. Indeed, the estimated values are quite precise.
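The overall error just defined is easy to state in code. The sketch below is illustrative only; it assumes the uniform 0..10^6 scaling of Section 1 and takes the estimation routine (for example, the lazy_estimate sketch of Section 2.2) as a parameter.

    import numpy as np

    def overall_error(V, F_V, S, F_S, J, estimate, value_range=10**6):
        """Average infinity-norm distance between each vector [x, F(x)] of the
        verification set V and its estimate, divided by the uniform variable range.
        estimate(v, S, F_S, J) must return (estimated x, estimated F(x))."""
        V = np.asarray(V, dtype=float)
        errors = []
        for x, fx in zip(V, F_V):
            x_hat, f_hat = estimate(x[J], S, F_S, J)
            actual = np.append(x, fx)                 # the vector [x, F(x)]
            predicted = np.append(x_hat, f_hat)       # its estimate
            errors.append(np.abs(actual - predicted).max())
        return float(np.mean(errors)) / value_range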
For a demonstration, Figs. 3-8 show sample pictures for six variables, specifically, for C2H2, CH4, CO, HCO, the enthalpy, and the temperature. In each picture, the area of the flame is shown. On the right-hand side of the center vertical axis, the colors encode the values determined by the simulation. On the left-hand side, colors encoding the estimated values are displayed. Due to the symmetry, consistency of the color patterns of the two sides implies accurate estimation, and conversely.

Fig. 3. 3-Variable Solution: Accuracy for C2H2
Fig. 4. 3-Variable Solution: Accuracy for CH4
Fig. 5. 3-Variable Solution: Accuracy for CO
Fig. 6. 3-Variable Solution: Accuracy for HCO
Fig. 7. 3-Variable Solution: Accuracy for Enthalpy
Fig. 8. 3-Variable Solution: Accuracy for Temperature

We include a table that shows the correlation between estimated and actual values. Since H2, H2O, and N2 are used to produce the estimates, their correlation values are trivially equal to 1.0. Clearly, estimated and actual values are highly correlated for all variables.

Table 1. Correlation of Actual and Estimated Values Using H2, H2O, and N2

Variable      Correlation   Variable    Correlation
u-velocity    0.9942        CH2O        0.9998
v-velocity    0.9924        CH3         0.9998
Pressure      0.9923        CH3O        0.9996
Temperature   0.9989        CH2OH       0.9998
H             0.9999        CH4         0.9999
OH            0.9998        C2H         0.9996
O             0.9998        HCCO        0.9996
HO2           0.9996        C2H2        0.9998
H2            1.0000        CH2CO       0.9998
H2O           1.0000        C2H3        0.9997
O2            0.9999        C2H4        0.9997
CO            0.9999        C2H5        0.9997
CO2           0.9999        C2H6        0.9997
CH            0.9997        C           0.9996
HCO           0.9998        C2          0.9996
CH2S          0.9998        N2          1.0000
CH2           0.9997        Enthalpy    0.9929

For the variable HO2 and the enthalpy, whose correlation values approximately cover the range exhibited in Table 1, we include graphs that show the relationship between actual and estimated values (Figs. 9 and 10). Considering that in the two graphs almost all of the 3,655 points occur on or very close to the diagonal, estimation errors are clearly low. There are a few exceptions, most noticeably for a few small values of the enthalpy.

Fig. 9. 3-Variable Solution: HO2 actual versus estimated values
Fig. 10. 3-Variable Solution: Enthalpy actual versus estimated values
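The correlations reported in Table 1 (and in Table 2 below) compare actual and estimated values per variable. Assuming they are ordinary Pearson correlation coefficients, they can be reproduced along the following lines; this sketch is ours and purely illustrative.

    import numpy as np

    def per_variable_correlations(actual, estimated, names):
        """Pearson correlation between actual and estimated values, per variable.
        actual, estimated: arrays of shape (num_vectors, num_variables)."""
        actual = np.asarray(actual, dtype=float)
        estimated = np.asarray(estimated, dtype=float)
        return {name: float(np.corrcoef(actual[:, j], estimated[:, j])[0, 1])
                for j, name in enumerate(names)}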
We repeat the above evaluation except that we use the slightly larger error bound of e = 1%. Each of the three runs selects the two variables H2 and H2O. The overall error for the verification set V is 0.33%. Figs. 11-16 show the corresponding pictures. They demonstrate that reduction to the two variables H2 and H2O still produces potentially useful results, but at a lesser level of accuracy, as is evident from occasional and moderate disparities of the color patterns on the left-hand side versus the right-hand side. In agreement with the results shown for the 3-variable case, we also include a table showing the correlation of actual and estimated values for all variables and enthalpy (Table 2), and two graphs providing details for HO2 and enthalpy (Figs. 17 and 18).

The reader may wonder whether a much simpler approach could produce the same results. For example, a greedy method could use in each iteration the minimum average error computed by the Lazy Learner as the deletion criterion and thus would avoid the subgroup discovery process. We have implemented that method and obtained the following results. First, for the case of e = 0.5%, the simplified process obtains a solution with the three variables OH, H2, and O2. The overall error is virtually identical to that obtained earlier, 0.25% now versus 0.26% earlier. But for e = 1%, the simplified method finds a solution with the two variables H2 and O2. The average error turns out to be 0.94%, which is almost triple the 0.33% error of the two-variable solution derived earlier. Indeed, the solution obtained by the simplified method is essentially unusable.

Fig. 11. 2-Variable Solution: Accuracy for C2H2
Fig. 12. 2-Variable Solution: Accuracy for CH4
Fig. 13. 2-Variable Solution: Accuracy for CO
Fig. 14. 2-Variable Solution: Accuracy for HCO
Fig. 15. 2-Variable Solution: Accuracy for Enthalpy
Fig. 16. 2-Variable Solution: Accuracy for Temperature

Table 2. Correlation of Actual and Estimated Values Using H2 and H2O

Variable      Correlation   Variable    Correlation
u-velocity    0.9942        CH2O        0.9998
v-velocity    0.9890        CH3         0.9998
Pressure      0.9845        CH3O        0.9996
Temperature   0.9988        CH2OH       0.9998
H             0.9999        CH4         0.9663
OH            0.9997        C2H         0.9997
O             0.9998        HCCO        0.9996
HO2           0.9996        C2H2        0.9998
H2            1.0000        CH2CO       0.9998
H2O           1.0000        C2H3        0.9997
O2            0.9997        C2H4        0.9998
CO            0.9999        C2H5        0.9997
CO2           0.9996        C2H6        0.9997
CH            0.9997        C           0.9996
HCO           0.9998        C2          0.9996
CH2S          0.9998        N2          0.9519
CH2           0.9997        Enthalpy    0.9881

Fig. 17. 2-Variable Solution: HO2 actual versus estimated values
Fig. 18. 2-Variable Solution: Enthalpy actual versus estimated values

How close are the two solutions obtained above for two and three variables to optimal solutions with the same number of variables, assuming that the average error computed by the Lazy Learner is to be minimized? Since the number of variables is small, the question can be settled by enumeration of cases. For the 3-variable case, the optimal solution has the variables OH, H2, and O2, which is precisely the solution obtained by the greedy method. As stated earlier, the accuracy for that case is essentially the same as that attained by the REDSUB solution. For two variables, the optimal solution has the variables H2 and H2O, which are precisely the variables selected by REDSUB. We conclude that for this application, REDSUB has obtained solutions that are very close to or actually optimal.
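The enumeration used to settle the optimality question can be sketched in a few lines; as before, average_error is a hypothetical stand-in for the Lazy Learner evaluation of a candidate index set.

    from itertools import combinations

    def best_subset(I, k, average_error):
        """Enumerate all k-element subsets of the candidate indices I and
        return the one minimizing the Lazy Learner's average error."""
        return min(combinations(sorted(I), k), key=lambda J: average_error(set(J)))

With the 29 candidate gases of this application, the 3-variable case requires C(29,3) = 3,654 evaluations and the 2-variable case C(29,2) = 406, so exhaustive enumeration is indeed feasible.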
4 Algorithm SUBARP

Recall that I is the index set from which the variables of the reduced model are to be selected, and that the learning set X is split 50/50 into a training set S and a testing set T. Let Z be the collection of subvectors of the x ∈ S defined by the index set I. With each record of Z, we have a value t called the target value of the record, which is the function value F(x) of the associated x ∈ S.

Subgroup Discovery can find important relationships that link the vectors of Z and the target t; see [12, 19]. The subset of records defined by any such relationship is a subgroup. In the present setting, each relationship is defined by a logic formula involving linear inequalities, and each subgroup consists of a subset of vectors of Z lying in a certain polyhedron. There are a number of Subgroup Discovery methods; for a survey that unifies Subgroup Discovery with the related tasks of Contrast Set Mining and Emerging Pattern Mining, see [16]. In particular, any standard classification rule learning approach can be adapted to Subgroup Discovery [14]. SUBARP is one such method, adapted from the classification method Lsquare [2-5] and the extension of [18]. For completeness we include an intuitive summary of SUBARP that is specialized for the case at hand.

4.1 Target Discretization

We want to find relationships connected with the level sets of the function F(·) or their complements. Accordingly, we define a number of possible values c for the target t. For each c, we declare the set {t | t ≥ c} to be a range of the target t. For a given target range R, let A be the set of given records whose target values t are in R, and define B to be the set of remaining records. SUBARP tries to explain the difference between the sets A and B in a multi-step process. The steps involve feature selection, computation of two formulas, factor selection, and subgroup evaluation. They are described in subsequent sections. The first two steps use Lsquare, cited above.

We need to introduce the notion of a logic formula that assigns the value True or False to a given record of Z. The concept is best explained using an example. Suppose that a record has g and h among its attributes, and that the record values for these attributes are g = 3.5 and h = 4.7. The logic formula consists of terms that are linear inequalities involving the attributes, and the operators ∧ ("and") and ∨ ("or"). For a given record, a term evaluates to True if the inequality is satisfied by the attribute values of the record, and evaluates to False otherwise. Given True/False values for the terms, the formula is evaluated according to the standard rules of propositional logic. For example, (g ≤ 4) and (h > 3) are terms. A short formula using these terms is (g ≤ 4) ∧ (h > 3). For the above record with g = 3.5 and h = 4.7, both terms of the formula evaluate to True, and hence the formula has that value as well. As a second example, (g ≤ 4) ∨ (h > 5) produces the value True for the cited record, since at least one of the terms evaluates to True.

4.2 Feature Selection

This step, which is handled by Lsquare, repeatedly partitions the set A into subsets A1 and A2; correspondingly divides the set B into subsets B1 and B2; finds a logic formula that achieves the value True on the records of A1 and False on those of B1; and tests how often the formula achieves True for the records of A2 and False for those of B2. In total, 20 formulas are created. In Lsquare, these formulas are part of an ensemble voting scheme to classify records. Here, the frequency with which a given attribute occurs in the formulas is used as an indicator of the importance of the attribute in explaining the differences between the sets A and B. Using that indicator, a significance value is computed for each attribute. The attributes with significance value beyond a certain threshold are selected for the next step.

4.3 Computation of Two Formulas

Using the selected attributes, Lsquare computes two formulas. One of the formulas evaluates to True for the records of A and to False for those of B, while the second formula reverses the roles of A and B. Both formulas are in disjunctive normal form (DNF), which means that they consist of one or more clauses combined by "or." In turn, each clause consists of linear inequality terms as defined before, combined by "and." An example of a DNF formula is ((g ≤ 4) ∧ (h > 3)) ∨ ((g < 3) ∧ (h ≥ 2)), with the two clauses ((g ≤ 4) ∧ (h > 3)) and ((g < 3) ∧ (h ≥ 2)). Later, we refer to each clause as a factor. Next, using the general construction approach of [14], we derive factors, and thus implicitly subgroups, from the first DNF formula. The same process is carried out for the second DNF formula, except for reversal of the roles of A and B.
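The evaluation of such formulas is straightforward. The sketch below encodes terms, clauses (factors), and DNF formulas for the single-attribute threshold terms used in the examples above; SUBARP's terms are general linear inequalities, so this layout is an illustrative simplification of ours.

    from operator import le, lt, ge, gt

    # A term is (attribute name, comparison operator, threshold); a clause (factor)
    # is a conjunction of terms; a DNF formula is a disjunction of clauses.
    OPS = {"<=": le, "<": lt, ">=": ge, ">": gt}

    def eval_term(term, record):
        attr, op, threshold = term
        return OPS[op](record[attr], threshold)

    def eval_clause(clause, record):          # "and" of the terms
        return all(eval_term(t, record) for t in clause)

    def eval_dnf(formula, record):            # "or" of the clauses
        return any(eval_clause(c, record) for c in formula)

    # The example from the text: ((g <= 4) and (h > 3)) or ((g < 3) and (h >= 2))
    formula = [[("g", "<=", 4), ("h", ">", 3)], [("g", "<", 3), ("h", ">=", 2)]]
    print(eval_dnf(formula, {"g": 3.5, "h": 4.7}))   # True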
4.4 Factor Selection

Recall that the first DNF formula evaluates to False for the records of B. Let f be a factor of that formula. Since the formula consists of clauses combined by "or," the factor f also evaluates to False for all records of B. Let Q be the subset of records for which f evaluates to True. Evidently, Q is a subset of A.

We introduce a direct description of Q using an example. Suppose the factor f has the form f = ((g ≤ 4) ∧ (h > 3)). Let us assume that A was defined as the subset of records for which the values of the target t fall into the range R = {t ≥ 9}. Then Q consists of the records of the original data set satisfying t ≥ 9, g ≤ 4, and h > 3. But we know more than just this characterization. Indeed, each record of B violates at least one of the inequalities g ≤ 4 and h > 3. Put differently, for any record where t does not satisfy t ≥ 9, we know that at least one of the inequalities g ≤ 4 and h > 3 is violated.

So far, we have assumed that the formula evaluates to False for all records of B. In general, this goal may not be achieved, and the formula produces the desired False values for most but not all records of B. Put differently, Q then contains a few records of B. For the above example, this means that for some records where the target t does not satisfy t ≥ 9, both inequalities g ≤ 4 and h > 3 may actually be satisfied.

Potentially, the target range R and the factor f characterize an important configuration that corresponds to an interesting and useful subgroup. On the other hand, the information may be trivial and of no interest. To estimate which case applies, SUBARP computes for each factor a significance value that lies in the interval [0, 1]. The rule for computing the significance value is related to those used in other Subgroup Discovery methods, where the goal of interestingness [13] is measured with quality functions balancing conflicting goals such as (1) the size of the subgroup, (2) the length of the pattern or rule defining the subgroup, and (3) the extent to which the subgroup is contained in the set A. Specifically, SUBARP decides significance using the third measure and the probability with which a random process could create a set of records M for which the target t is in R, the factor f evaluates to True, and the size of M matches that of Q. We call that random process an alternate random process (ARP). Generally, an ARP is an imagined random process that can produce an intermediate or final result of SUBARP by random selection. SUBARP considers and evaluates several ARPs within and outside the Lsquare subroutine, and then structures decisions so that results claimed to be important can only be achieved by the ARPs with very low probability. The overall effect of this approach is that subgroups based on factors with very high significance usually turn out to be interesting and important.

Up to this point, the discussion has covered polyhedra representing subgroups contained in a level set. When the same computations are done with the roles of A and B reversed, polyhedra are found that represent subgroups contained in the complement of a level set. It is easily seen that the arguments made earlier about the importance of variables based on the earlier polyhedra apply as well when the latter polyhedra are used.
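As a small illustration of the subgroup Q just described (the record layout and function names are ours, chosen for exposition, with the factor given as a plain predicate), the records satisfying both the target range and the factor can be collected as follows.

    def subgroup_records(records, target, c, factor):
        """Collect the subgroup Q: records whose target value satisfies t >= c
        and for which the factor evaluates to True."""
        return [r for r in records if r[target] >= c and factor(r)]

    # Example from the text: R = {t >= 9} and f = ((g <= 4) and (h > 3))
    f = lambda r: r["g"] <= 4 and r["h"] > 3
    records = [{"t": 9.5, "g": 3.5, "h": 4.7},
               {"t": 8.0, "g": 3.0, "h": 5.0},
               {"t": 9.2, "g": 4.5, "h": 3.5}]
    print(subgroup_records(records, "t", 9, f))   # [{'t': 9.5, 'g': 3.5, 'h': 4.7}]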
4.5 Evaluation of Subgroups

Once the significance values have been computed for the subgroups, the associated target ranges and factors are used to compute a second significance value using the testing records of T, which up to this point have not been employed by SUBARP. The average of the significance values obtained from the training and testing data is assigned as the overall significance value of each subgroup. In numerous tests of data sets, it has turned out that only subgroups with very high significance value are potentially of interest and thus may produce new insight. Based on these results, only subgroups with significance above 0.90 are considered potentially useful. Finally, we relate the output of highly significant subgroups to the original problem.

4.6 Interpretation of Significant Subgroups

Let f be the factor of a very significant subgroup and R = {t | t ≥ c} be the associated target range. Define P to be the polyhedron P = {(x_j, j ∈ I) | factor f evaluates to True}. Then the polyhedron is an important partial characterization of the level set {x | F(x) ≥ c} of the function F(·). As discussed in Section 2, the variables used in the definition of P can therefore be considered an important part of the characterization of that level set. That fact supports the construction of the sets J_i described in Section 2 according to the frequencies q_j with which each variable x_j, j ∈ I, occurs in the definitions of the polyhedra of very significant subgroups.

5 Summary

This paper introduces a method called REDSUB for the reduction of combustion simulation models. REDSUB is based on the Machine Learning techniques of Subgroup Discovery and the Lazy Learner. Use of the method does not require any user-specified transformations of simulation data or other manual effort. The effectiveness of REDSUB is demonstrated with simulation data of a methane/air combustion process, where 34 variables representing 29 chemical species of the combustion mixture, 3 thermodynamic properties, and 2 velocity components are reduced to 3 variables. For the remaining 31 variables, the values are estimated with good accuracy over the entire range of possible values. An even smaller model using 2 variables still has reasonable accuracy.

References

1. C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11:11-73, 1997.
2. S. Bartnikowski, M. Granberry, J. Mugan, and K. Truemper. Transformation of rational and set data to logic data. In Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques. Springer, 2006.
3. G. Felici, F. Sun, and K. Truemper. Learning logic formulas and related error distributions. In Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques. Springer, 2006.
4. G. Felici and K. Truemper. A MINSAT approach for learning in logic domain. INFORMS Journal on Computing, 14:20-36, 2002.
5. G. Felici and K. Truemper. The Lsquare system for mining logic data. In Encyclopedia of Data Warehousing and Mining, pages 693-697. Idea Group Publishing, 2005.
6. R. Hilbert, F. Tap, H. El-Rabii, and D. Thévenin. Impact of detailed chemistry and transport models on turbulent combustion simulations. Progress in Energy and Combustion Science, 30:61-117, 2004.
7. A. E. Hoerl. Application of ridge analysis to regression problems. Chemical Engineering Progress, 58:54-59, 1962.
8. A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12:55-67, 1970.
9. G. Janiga, O. Gicquel, and D. Thévenin. High-resolution simulation of three-dimensional laminar burners using tabulated chemistry on parallel computers. In 2nd ECCOMAS Thematic Conference on Computational Combustion, 2007.
10. G. Janiga, A. Gordner, H. Shalaby, and D. Thévenin. Simulation of laminar burners using detailed chemistry on parallel computers. In European Conference on Computational Fluid Dynamics (ECCOMAS CFD 2006), 2006.
11. G. Janiga and D. Thévenin. Three-dimensional detailed simulation of laminar burners on parallel computers. In Proceedings of the European Combustion Meeting ECM2009, 2009.
12. W. Klösgen. EXPLORA: A multipattern and multistrategy discovery assistant. In Advances in Knowledge Discovery and Data Mining. AAAI Press, 1996.
13. W. Klösgen. Subgroup discovery. In Handbook of Data Mining and Knowledge Discovery. Oxford University Press, 2002.
14. N. Lavrač, B. Cestnik, D. Gamberger, and P. Flach. Decision support through subgroup discovery: Three case studies and the lessons learned. Machine Learning, 57:115-143, 2004.
15. T. Mitchell. Machine Learning. McGraw Hill, 1997.
16. P. K. Novak, N. Lavrač, and G. I. Webb. Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining. Journal of Machine Learning Research, 10:377-403, 2009.
17. N. Peters. Turbulent Combustion. Cambridge University Press, 2000.
18. K. Truemper. Improved comprehensibility and reliability of explanations via restricted halfspace discretization. In Proceedings of the International Conference on Machine Learning and Data Mining (MLDM 2009), 2009.
19. S. Wrobel. An algorithm for multi-relational discovery of subgroups. In Proceedings of the First European Conference on Principles of Data Mining and Knowledge Discovery, 1997.