A cost model to estimate the effort of data mining projects (DMCoMo)

Oscar Marbán, Ernestina Menasalvas, Covadonga Fernández-Baizán

Facultad de Informática, Universidad Politécnica de Madrid (U.P.M.), Campus de Montegancedo s/n, 28660 Boadilla del Monte, Madrid, Spain

Information Systems 33 (2008) 133–150. Received 26 February 2007; accepted 7 July 2007. Recommended by N. Koudas. doi:10.1016/j.is.2007.07.004

Corresponding author. Tel.: +34 913367388; fax: +34 913367393. E-mail addresses: omarban@fi.upm.es (O. Marbán), emenasalvas@fi.upm.es (E. Menasalvas), cfbaizan@fi.upm.es (C. Fernández-Baizán).

1 The work presented in this paper has been partially supported by UPM project ERDM ref. 14589.

Abstract

CRISP-DM is the standard for developing Data Mining projects. It proposes the processes and tasks that have to be carried out to develop a Data Mining project, and one of these tasks is estimating the cost of the project. In software development, many methods have been described to estimate the cost of developing a project (SLIM, SEER-SEM, PRICE-S and COCOMO). These methods are not appropriate for Data Mining projects, because in Data Mining the development of software is not the primary goal. Some methods have been proposed to estimate particular phases of a Data Mining project, but there is no method to estimate the global cost of a generic Data Mining project. This lack of estimation methods is the cause of many real-life project failures, which result from unrealistic estimates made at the beginning of the project. Consequently, in this paper we design and validate a parametric cost estimation model for Data Mining projects (DMCoMo1), similar to COCOMO or SLIM in software development. The drivers of the model are proposed first, and the equations of the model afterwards.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Data Mining; Knowledge discovery; Cost estimation; Parametric model

1. Introduction

The concept of CRM (Customer Relationship Management) dates back to when the caveman could choose whether he wanted to trade with Og or Thag. However, the term CRM was first used in the mid-1990s. CRM can be defined as "giving the client what he wants, when he wants it and where he wants it" [1]. The main objective of CRM projects is to recover the one-to-one relationship with the client, which has been lost as a consequence of the competitive environment in which modern companies operate. For this reason, companies have been developing CRM systems for the past 10 years in order to retain their clients. Three areas can be distinguished in CRM systems: operational CRM, collaborative CRM and analytic CRM. Analytic CRM analyzes operational data to optimize the relationship with the client. Due to the great volume of data that must be analyzed, Data Mining techniques must be used [2,3]. Accordingly, Data Mining research has been growing in recent years [4,5]. This growth is motivated by companies' need to find the knowledge hidden in their data, knowledge that allows them to compete against other companies. For this reason, companies are investing more and more resources in Data Mining projects [6]. The need for efficient methods to search for knowledge in data has led to the development of many Data Mining algorithms and tools [7–10]. However, due to the complexity of the Data Mining process, a Data Mining methodology is needed. That methodology is CRISP-DM [11], which evolved to solve the problems that companies faced in the development of Data Mining projects.
CRISP-DM is a process model for developing Data Mining projects and was proposed by a consortium of companies (Teradata, SPSS (ISL), DaimlerChrysler and OHRA). CRISP-DM defines the processes and tasks that have to be carried out in order to develop a successful Data Mining project, and for each task it also specifies the inputs and outputs. Hence, CRISP-DM provides a process model for Data Mining projects in the same way that ISO 12207 [12] and IEEE 1074 [13] do for software projects. In the "Business understanding" phase, CRISP-DM proposes a task to build the project plan. In this task the project has to be budgeted, calculating its cost while taking into account the time and personnel needed to develop the Data Mining project. However, CRISP-DM does not specify how to carry out this task. If we wish to rate the success or failure of a Data Mining project, we need a method to measure the goodness of the knowledge extracted by the models, the time spent obtaining that knowledge, the cost of the personnel and resources used in the project, and so on. It is also necessary to estimate the cost of the project, because if the cost of obtaining the knowledge is not affordable for the company, the project is not viable. Some research has been done on estimating the goodness of the knowledge extracted from data. In [14] a framework is proposed to estimate the goodness of knowledge after the Data Mining phase of CRM projects; this framework tries to maximize the value of the knowledge extracted. In [15] the value of customers is taken into account to maximize the benefit of predictive Data Mining models. Regarding the cost estimation of Data Mining projects, in [16] a cost estimation model for classification problems is proposed that can be used at any moment during the project. This model is based on NPV (Net Present Value) [17].
NPV is calculated as the difference between the money invested in the project and the money recovered from that investment. In the model presented in [16], NPV is used to decide whether the project should continue: NPV can be calculated at any point in the project, and the project continues only if NPV is positive. None of the previous estimation methods makes it possible to establish the effort, time and cost at the beginning of the project. We could try to use software estimation tools such as COCOMO II [18], SLIM [19] or PRICE-S [20] to estimate the cost of Data Mining projects. However, a closer look at these tools shows that they are not useful for this purpose, because their main input is the size, in lines of code, of the software to be developed. Other factors used in software estimation are the experience of the development team, the use of tools, the features of the development platform, and so forth. These features should also be used to estimate the cost of Data Mining projects. Nevertheless, estimating Data Mining projects additionally requires allowing for features specific to them, such as the characteristics of the data sources, the level of data integration, the kind of Data Mining problem to be solved and the number of models to build, inter alia. Software estimation methods do not consider these features; hence, they are not useful for estimating Data Mining projects. Consequently, we can say that there is currently no cost estimation method for Data Mining projects, even though such projects have been developed for the past 20 years. Therefore, in this paper we propose a parametric estimation model for Data Mining projects, named DMCoMo (Data Mining Cost Model). DMCoMo is a parametric cost estimation model in the style of the COCOMO family.
DMCoMo makes it possible to estimate the effort (in man-months) needed to develop a Data Mining project from its conception to its deployment. The rest of the paper is organized as follows. Section 2 presents the work related to this research. Section 3 describes DMCoMo, its cost drivers and its equations. Section 4 shows the results produced by DMCoMo in the estimation of Data Mining projects. Section 5 presents the conclusions and future lines of work. Finally, Appendix A shows the complete DMCoMo model.

2. Related work

Parametric cost estimation models were the first to be developed [21]. Rand Corporation developed the first parametric cost model, named Cost Estimating Relationship (CER) [21]. CER estimates the cost of aircraft, taking some of their features into account. Estimation tools were developed at the same time as the estimation methods, in order to automate the estimation process. PRICE-H [22] and PRICE-S [20] were the first estimation tools to implement parametric estimation methods: PRICE-H estimates the cost of developing hardware components, and PRICE-S the cost of developing software. Parametric estimation models have been developed to estimate many kinds of projects: software projects (COCOMO [18], SLIM [19], etc.), hardware projects (PRICE-H [22]), and even NASA space launches [23] or shipbuilding [24]. Parametric models use mathematical equations to obtain the estimates. The results of the estimations are dependent variables such as effort or development time, which depend on a set of independent variables called cost drivers. Examples of cost drivers are the lines of code of a software application, its required reliability, or the complexity of the application to be developed.
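As a concrete illustration of how such a parametric equation is evaluated, the sketch below computes a nominal estimate from one dominant cost driver and then refines it with multipliers. It is not any published calibration; the constants a and b and the multiplier values are invented for illustration only.

```python
# Minimal sketch of a parametric cost model (invented constants, not a
# published calibration): a nominal effort derived from one dominant
# driver is refined by multipliers for project-specific characteristics.

def parametric_effort(size, multipliers, a=3.0, b=1.1):
    """Nominal effort from a size-like driver, refined by cost drivers."""
    nominal = a * size ** b           # first approximation (dominant driver)
    for m in multipliers:             # refinement with cost-driver multipliers
        nominal *= m
    return nominal

# Example: size-like driver of 10 units, one penalty (1.15) and one
# discount (0.90) multiplier.
effort = parametric_effort(10, [1.15, 0.90])
```

A penalty multiplier above 1 (e.g. high required reliability) increases the estimate, while a discount below 1 (e.g. good tool support) decreases it, which is the usual way rating levels enter these models.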
Parametric models operate in a two-step process:

(1) A first approximation is made that depends on a reduced set of parameters whose weight in the final result is considered greater than the rest; these parameters are normally related to the product rather than to the features of the project.
(2) The final result is determined using another set of variables that refine the estimate by introducing the specific characteristics of the application and the development environment.

The accuracy of parametric estimation models rests on:

(1) A precise definition of the equations to be used. For example, non-linear equations have replaced linear ones in most parametric mathematical models.
(2) Constant refinement of the parameters used. This involves not only adding or removing parameters to reflect changes in technology, but also a thorough understanding of those selected. For example, COCOMO II [18] eliminated some of the effort multipliers used in COCOMO 81 [25], such as Execution Time Constraint (TIME), and introduced others, such as Documentation of the project (DOCU).
(3) An accurate calibration of the numerical values of each parameter's rating levels. New, revised and enlarged data sets, as well as new statistical methods, have been used for this purpose [26–29].
(4) A wise selection of the rating level of each parameter of the chosen model when calculating the estimates for a specific project [30].

To sum up, parametric cost estimation models for software development estimate the effort and time to develop a project by taking into account features of the software and the project, such as the size of the software and the characteristics of the project and of the development team.

2.1. Cost estimation models for Data Mining projects

No method has been proposed to estimate the cost of a full Data Mining project. Nevertheless, some proposals exist to estimate particular kinds of Data Mining projects.
These proposals are described below. In [31] a classification of the different costs in inductive learning processes is proposed. According to the authors, this classification could help in estimating the results of a predictive problem. The identified costs are:

Cost of misclassification errors: the cost due to models that do not correctly classify all the items presented to them.

Cost of tests: each test performed to obtain data may have an associated cost.

Cost of the teacher: a teacher is available to the learner, but each classification request that the learner makes to the teacher has an associated cost.

Cost of intervention: the cost of manipulating or modifying the values of the variables that participate in the classification.

Cost of unwanted achievements: unwanted outcomes arise from modifying some factors of the classification algorithm, which produces classification errors.

Cost of computation: computer resources are limited, so the cost of these resources must be taken into account.

Human–computer interaction costs: the cost of the personnel who use the learning software, including deciding which attributes to use, setting the parameters of the algorithm, converting the data to the format required by the algorithm and analyzing the resulting models.

Although this work presents a classification of the costs associated with inductive learning processes, it does not propose how to estimate them.

In [14] a model to estimate the value of the knowledge obtained in the Data Mining phases of CRM projects is proposed. This work proposes a microeconomic estimation framework: a pattern in the data is interesting only if it can be used when the company takes decisions. Therefore, a pattern is useful if it is transformed into information, information into actions and actions into value. In [14] the estimation problem becomes an optimization problem that can be formulated as follows:

max f(x),  x ∈ D,  (1)

where D is the domain of all possible decisions (production plans, marketing strategies, etc.) and f(x) is the usefulness of decision x ∈ D. In this work, Data Mining is studied from the economic point of view of optimization problems over great volumes of non-aggregated data. The framework uses combinatorial optimization, linear programming and game theory. The main objective of this work is to assess the usefulness of Data Mining operations in a quantitative way.

Other work [15] proposed an estimation model for predictive Data Mining models, in which the value of the clients is borne in mind when estimating the profit of the models. The estimation model proposed in [15] is based on the following business model:

P = (r · p) - c,  (2)

where P is the profit obtained from a client, c is the cost of acquiring the client, r is the income obtained from the client and p is the probability that the client will accept an offer from the company. This model is used to evaluate different predictive Data Mining models in order to obtain a greater profit P for the company.

In [16] the NPV model is applied to decide whether a project should continue. NPV [17] is defined as the sum of the discounted cash flows of the project, as shown in the equation

NPV = C0 + Σ (t=1 to ∞) Ct/(1+r)^t,  (3)

where C0 is the initial cash flow, usually negative, representing the initial investment. In [16] Eq. (3) is interpreted as follows: NPV represents the cost of developing the system, including the costs of hardware, software, personnel training, etc.; t is the time; Ct is the cash flow at time t; and r is the expected ROI (Return On Investment). The cash flow at a time t > 1 is the result of the decisions taken during the project and has two components: the cost of taking a decision and the cash flow that results from it. This method takes into account features of the project, such as the experience of the staff or the use of Data Mining tools, as the cost estimation methods for software development do. However, it does not estimate the effort of the project; it can only be used to decide whether the project should continue (NPV > 0) or halt (NPV < 0).

3. An estimation model for Data Mining projects: DMCoMo

The models described in Section 2.1 are not generic models for estimating Data Mining projects. Cost estimation models for software projects could be used to estimate Data Mining projects, but their main drawback is that they take the size of the software to be built as their main input, and in Data Mining projects no software is built. Hence, in this paper we propose a parametric estimation model, named DMCoMo [32], to estimate the effort of Data Mining projects. In the following, the cost drivers that affect the effort of a Data Mining project are proposed, grouped into six categories: Data, Data Mining Models, Platform, Techniques and Tools, Project, and Staff. Section 3.1 introduces the cost drivers of each group; the techniques to calculate each cost driver are described in Appendix A. The Delphi method [33–35] was used to establish the levels and descriptions of each DMCoMo cost driver.

3.1. Cost drivers for DMCoMo

3.1.1. Data cost drivers

The cost drivers in this group refer to the effort of data management in the project. Thus, if we work with few tables, few attributes and low dispersion, the effort is smaller than if we work with many tables, many attributes and highly dispersed attributes. These cost drivers take into account features of the Data Mining project such as data quality, integration level and location. They are grouped into five clusters: initial amount of data, dispersion, quality of data, data model availability and data privacy level.

The initial amount of data considers the number of tables (NTAB) in the database, the number of tuples (NTUP) and the number of attributes (NATR) of the tables stored in the databases to be used in the project. These cost drivers are measured before the preprocessing phase of Data Mining. NATR adds more effort than NTAB and NTUP, because the more attributes there are to manage, the greater the effort in the preprocessing phase.

Dispersion (DISP) is defined as the number of different values in the domain of an attribute, and it adds effort to the Data Mining project. A combination of variance (σ²) and entropy [36] is used to calculate the value of DISP: the variance of quantitative attributes and the entropy of qualitative attributes are computed, and DISP is then calculated using Eq. (A.1). Our experience shows that the greater the number of different values of an attribute, the greater the effort required to understand the models.

As far as data quality is concerned, it reflects how good the data are and is divided into two cost drivers: the percentage of null values in the data (PNUL) and whether the criteria of data codification are available (CCOD).
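The two data-quality ingredients just described (PNUL, and variance/entropy as the inputs to DISP) can be sketched as below. The exact formula combining variance and entropy into DISP is Eq. (A.1) in the appendix; this sketch only illustrates the three measurements themselves.

```python
# Sketch of the data-quality measurements described above: PNUL
# (percentage of null values), plus variance for quantitative attributes
# and Shannon entropy for qualitative ones (the inputs to DISP).
import math
from collections import Counter

def pnul(values):
    """Percentage of null (None) values in an attribute."""
    return 100.0 * sum(v is None for v in values) / len(values)

def variance(values):
    """Population variance of a quantitative attribute, ignoring nulls."""
    xs = [v for v in values if v is not None]
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def entropy(values):
    """Shannon entropy (bits) of a qualitative attribute, ignoring nulls."""
    xs = [v for v in values if v is not None]
    counts = Counter(xs)
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

An attribute whose values are all identical has zero variance (or zero entropy), i.e. no dispersion, while an attribute with many distinct values scores high on both measures, matching the intuition that dispersion increases comprehension effort.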
Null values must be taken into account when calculating the effort of the project, because they must be processed and different techniques could be applied: tuples that contain null values could be deleted, null values could be filled in through a predictive model, or the attribute that has null values could be removed. The success of a Data Mining project depends on the right definition of the problem to be solved and the right treatment of null values. CCOD adds the effort of transforming the data to be used by the algorithms: if the transformation criteria are given by an expert, the effort will be smaller than if the person responsible for the preprocessing phase has to devise them. Furthermore, if documentation of the data sources (data models, descriptions of the attributes, etc.) is available, understanding the data will be easier and it will help in establishing the problem to be solved; this effect on the effort is captured by the cost driver DMOD. As regards the privacy level of the data, it also influences the effort of the Data Mining project: if data are protected by law, there are useful data that cannot be used in the project. The cost driver that represents this effort is PRIV. The protected data could be substituted by external data such as demographic databases; this substitution requires extra effort, because the external data have to be integrated with the data of the project, and this effort is captured by the DEXT cost driver.

3.1.2. Data Mining model cost drivers

The number of Data Mining models (NMOD) to be created has to be considered when estimating the effort of a Data Mining project, because the more models there are, the greater the effort will be: the data have to be adapted to a larger number of Data Mining algorithms, and more algorithms have to be optimized.
In addition, the type of Data Mining model to be developed (TMOD), and the number of tuples (MTUP) and attributes (MATR) used by each model, have to be considered when estimating the effort of the project, as does the dispersion of each attribute. If new (derived) attributes have to be obtained, the effort increases; hence, a further cost driver, MDER, is introduced into DMCoMo. The cost driver TMOD considers the effort associated with the type of Data Mining model to be developed, because the effort of developing a predictive model differs from that of developing a descriptive model. The amount of data needed to develop each model after the preprocessing phase must be estimated per model, because different Data Mining algorithms need different data. Hence, the number of tuples and attributes, their types and their dispersion must be considered in the effort of each model. Added to that, the Data Mining techniques available to develop each model must be examined: if no suitable Data Mining technique is available, a new one will have to be developed, which adds extra effort to the project. Usually, the more Data Mining techniques are available, the less effort is required, because we can try all of them instead of optimizing the parameters of a single Data Mining algorithm.

3.1.3. Development platform cost drivers

This cluster is formed by cost drivers related to the development platform. The first cost driver in this group is NFUN, which represents the effort introduced into the Data Mining project by the number of data sources where the data are stored. The number of different data servers (NSER) and how they communicate (SCOM) also influence the effort of the project.
NFUN considers the number and type of the data sources. The more data sources there are, the greater the effort will be, because a greater number of data sources must be integrated. If the data are stored in a Data Warehouse, they are already integrated, so the effort is smaller than if they are stored in any other kind of storage medium. Moreover, working with files or with relational databases requires different effort; for example, an operation like a "join" of two or more tables is performed more easily in a relational database than on files. Additionally, different data servers do not share data easily, so the availability of native interconnection tools must be considered, since they help in communicating different data servers. This effort is captured by the cost driver NSER.

3.1.4. Techniques and tools cost drivers

The use of Data Mining tools to develop Data Mining models facilitates the work. Thus, the available Data Mining tools (TOOL), the techniques implemented by those tools (NTEC) and the integration of the tools with the rest of the tools available in the project (COMP, TCOM) must be used to compute the effort of the Data Mining project. The cost driver TOOL takes into account whether Data Mining tools are used in the project. A single Data Mining tool need not implement all Data Mining techniques; thus, if several Data Mining tools are available for the project, it is probable that a given technique is implemented in at least one of them. NTEC therefore represents the number of useful Data Mining techniques implemented in some available Data Mining tool. COMP (Compatibility) represents how compatible the Data Mining tools are with the rest of the software (text processors, spreadsheets, databases, etc.). TCOM reflects the compatibility between the different Data Mining tools used in the project.
This cost driver distinguishes between tools that can directly use a Data Mining model created by a different Data Mining tool and tools that must convert such models before using them. We also have to consider the effort of deciding which tool, technique and machine will be used to generate the models, because not all Data Mining tools execute Data Mining techniques in the same time and in the same way, nor can all of them run on every machine of the project. This effort is captured by the cost driver TOMM. Another cost driver that must be considered is TRAN: if the algorithms of the Data Mining tools have to be modified or adapted for the project, the modification implies extra effort, which is captured by TRAN. Lastly, the training the project staff needs in order to use the Data Mining tools also influences the effort of the project; it is captured by the cost driver NFOR. Relatedly, the user-friendliness of the Data Mining tools is considered in the cost driver TFRI: a user-friendly Data Mining tool reduces the effort of the project, because the work is easier.

3.1.5. Project cost drivers

Features of the project, such as the number of participating departments, must be considered in the computation of the total effort of the project. The cost drivers defined in this group are NDEP, DOCU, MSIM and SITE. NDEP represents the number of departments participating in the project. NDEP influences the effort because each department could have its own data model and different names for attributes, and some departments may even be reluctant to participate in the project; hence, a greater effort is necessary. The documentation (DOCU) to be produced in the project also influences its effort.
If a large amount of documentation has to be written, more effort will be required; not only the quantity of the documentation but also its complexity has to be taken into account. MSIM accounts for the extra effort of developing the same Data Mining model for multiple locations: multi-location development implies that the local data have to be understood and integrated, which requires extra effort. On the other hand, if the project is developed in different places (buildings, towns, countries, etc.), this implies an additional effort due to communications (telephone, ISDN, LAN, WAN, etc.); this effort is captured by the SITE cost driver.

3.1.6. Staff cost drivers

The staff of a Data Mining project is composed of sponsors, data analysts, data management specialists, business analysts, users and a project manager. These people come from different areas (computer specialists, statisticians, executives, etc.); hence, an additional effort is required to reach agreement on the decisions of the project. The following drivers are proposed to take into account the effort due to staff collaboration. PCON represents the time the staff has been working together: if the staff has been working together for a long time, the members of the team know each other and it is easier to reach agreement on decisions, whereas if the team has not previously worked together on a project, reaching agreement is more difficult. The ability of the staff to carry out different tasks in the project is very important, because if someone cannot work one day, another person can substitute for them; this feature is dealt with in the cost driver PCOM. Additionally, if the data are previously known (KDAT) to the project staff, the effort will be smaller than if the data are completely unknown to them. Familiarity with the type of problem (MFAM) to be solved is also important in determining the effort of the project.
Knowledge of the problem facilitates its resolution, making the problem easier to solve; hence, the effort will be smaller. Similarly, knowledge of the business (BCON) on which the project is based, the experience of the staff with similar problems and their experience with the Data Mining tools to be used in the project (TEXP) are features to be taken into account when calculating the effort of the Data Mining project. Lastly, the attitude of management is another factor that influences the effort of the project (ADIR): if management supports the project, it is easier to finish it successfully and the effort decreases, but if the project is not supported by management, the effort increases.

3.2. DMCoMo equation outline

Once the cost drivers have been defined, the equation of DMCoMo has to be outlined. In order to obtain the equation, information about the cost drivers and the effort of real Data Mining projects was gathered. The DMCoMo equation was created through multivariate linear regression [37,38], because that is the most usual way of obtaining the equation in parametric estimation methods [25,24,23]. The equation has the form of Eq. (4), where y is the dependent variable, xi is the ith independent variable, n is the number of independent variables, the ai are constants and ei is the error of the ith estimation:

y = a0 + Σ (i=1 to n) ai·xi + ei.  (4)

In order to obtain the equation, the following steps must be carried out [38]:

Step 1: Descriptive study of the input data.
Step 2: Study of outliers in the data.
Step 3: Correlation study between the input variables.
Step 4: Application of linear regression to obtain the equation.
Step 5: Statistical study of the significance level of the equation.

The resulting model will be reliable for estimating projects whose effort lies within the range of the projects used to create the equation, in our case between 90 and 185 man-months.
If the effort of a project is outside this range, the behavior of the model is unknown.

3.2.1. Data description

Information about different Data Mining projects was gathered from different organizations. Different kinds of projects were involved: marketing projects of Spanish enterprises, meteorological projects and medical projects. In order to gather the data, the form in Table 1 was used. The project manager of each project filled in the form with information about the project. The values that can be used in the questionnaire are Extra low (XB), Very low (MB), Low (B), Nominal (N), High (A), Very high (MA) or Extra high (XA). Additionally, in the duration field the number of months the project lasted must be entered, and in the persons field the number of people on the project staff. The effort (in man-months) required by the project is obtained by multiplying the value of the persons field by the value of the duration field, as shown in the equation

Effort(MM) = Duration(months) × Persons.  (5)

Later on, the qualitative values were translated into quantitative values in order to obtain the equation through linear regression. The translations are XB to 0, MB to 1, B to 2, N to 3, A to 4, MA to 5 and XA to 6.

Table 1. Data collection form (one Value cell per driver; Duration (months) and Persons are also recorded)

  Driver  Value   Driver  Value   Driver  Value   Driver  Value
  NTAB            TMOD            NTEC            SITE
  NTUP            MTUP            COMP            PCON
  NATR            MATR            TCOM            KDAT
  DISP            MDIS            TOMM            ADIR
  PNUL            MDER            TRAN            PEXP
  CCOD            MTEC            NFOR            MFAM
  DMOD            NFUN            TFRI            TEXP
  PRIV            NSER            NDEP            BCON
  DEXT            SCOM            DOCU
  NMOD            TOOL            MSIM
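The two data-preparation steps just described, translating the qualitative ratings into numbers and computing effort as in Eq. (5), can be sketched as follows. The example project record is hypothetical.

```python
# Sketch of the data preparation described above: qualitative
# questionnaire ratings are mapped to 0..6, and effort in man-months
# is Duration x Persons (Eq. (5)).

RATING = {"XB": 0, "MB": 1, "B": 2, "N": 3, "A": 4, "MA": 5, "XA": 6}

def effort_mm(duration_months, persons):
    """Effort in man-months, as in Eq. (5)."""
    return duration_months * persons

# Hypothetical (illustrative) project record with three drivers filled in:
project = {"NTAB": "A", "NTUP": "MA", "DISP": "N"}
numeric = {driver: RATING[level] for driver, level in project.items()}
# numeric == {"NTAB": 4, "NTUP": 5, "DISP": 3}

print(effort_mm(12, 10))  # a 12-month project with 10 people: prints 120
```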
Once the data were gathered, they were analyzed statistically to check whether all the variables are in the right range, whether they have null values, and whether their statistical distribution is appropriate for applying linear regression methods.

Fig. 1. Statistical data description. [Descriptive statistics: each of the 23 retained drivers (NTAB, NTUP, NATR, DISP, PNUL, DMOD, DEXT, NMOD, TMOD, MTUP, MATR, MTEC, NFUN, SCOM, TOOL, COMP, NFOR, NDEP, DOCU, SITE, KDAT, ADIR, MFAM) and the effort MM have 40 valid values; MM ranges from 90 to 184, with mean 121.48 and standard deviation 21.147.]

3.2.2. DMCoMo equation

In order to establish the regression equation, the data must first be statistically analyzed. This study examines the number of values, the maximum and minimum, and the standard deviation of each variable. In Fig. 1 we can see that all variables have 40 values, one for each project. The maximum and minimum are useful to check that all cost drivers take values within their ranges. The mean and standard deviation of the effort are also shown. Note that if no regression equation were created and the mean value (121.48 MM) were used as the estimate, the expected error in the effort would be 21.147 men-months, the standard deviation of the MM variable (effort).

The second step in creating the regression equation is to eliminate outliers, deleting the tuples that contain them from the data set. However, our data set was collected specifically for this experiment and the number of projects is small (40); hence, none of the tuples was deleted.

The next step is the study of correlation between cost drivers. The Spearman correlation coefficient was used.
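The baseline argument above (predicting the mean yields an expected error equal to the standard deviation) can be checked numerically; a small sketch with hypothetical effort values, not the paper's 40 projects:

```python
import numpy as np

# Hypothetical effort values in men-months (illustrative only).
mm = np.array([95.0, 110.0, 120.0, 125.0, 140.0, 160.0])

baseline = mm.mean()          # constant predictor: the mean effort
residuals = mm - baseline     # errors of the constant predictor
rmse = np.sqrt(np.mean(residuals ** 2))

# The RMS error of the mean predictor equals the population standard
# deviation of the data -- the role the 21.147 MM figure plays in Fig. 1.
assert abs(rmse - mm.std()) < 1e-9
```

Any useful regression model must therefore beat the standard deviation of the training efforts.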
We consider two cost drivers correlated if their Spearman correlation coefficient is above 0.5. Once the correlation coefficients were obtained, for each correlated pair one of the two cost drivers was removed. There are two possibilities for doing this: the first is the erasure of one of the cost drivers; the second is the integration of the two correlated cost drivers into a single one. In this paper we follow the first possibility, deleting the less significant cost driver of each pair. After the correlation study, 16 cost drivers were deleted; hence, only 23 cost drivers were considered to build the regression equation of DMCoMo. The retained drivers are shown in Table 2. The deleted cost drivers are CCOD, PRIV, MDIS, MDER, NSER, NTEC, TCOM, TOMM, TRAN, TFRI, MSIM, PCON, PCOM, PEXP, TEXP and BCON. At this point we have the final project data set on which to apply linear regression: the cost drivers in Table 2 are not correlated, or their correlation coefficient is below 0.5.
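The correlation-based pruning can be sketched as follows. This is a simplification of the paper's procedure (it drops the later driver of a correlated pair rather than the less significant one, and the rank computation ignores ties):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks
    (no tie handling; fine for distinct values)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]

def prune_correlated(data, threshold=0.5):
    """Keep a driver only if it is not correlated (|rho| > threshold)
    with any driver already kept. `data` maps driver name -> values."""
    kept = []
    for name in data:
        if all(abs(spearman(data[name], data[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

With a driver B that grows monotonically with A, B is dropped (rho = 1), while an uncorrelated driver C survives.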
Table 2. DMCoMo cost drivers (Name — Abrev.): Number of tables — NTAB; Number of tuples — NTUP; Number of attributes — NATR; Dispersion — DISP; Nulls percentage — PNUL; Data model availability — DMOD; External data needs — DEXT; Number of models — NMOD; Type of model — TMOD; Number of tuples for each model — MTUP; Number and type of attributes for each model — MATR; Problem type familiarity — MFAM; Techniques availability — MTEC; Number and type of data sources — NFUN; Distance and communication form — SCOM; Tools availability — TOOL; Compatibility — COMP; Training level of users — NFOR; Number of involved departments — NDEP; Documentation — DOCU; Multisite development — SITE; Data knowledge — KDAT; Directive attitude — ADIR.

The regression equation of the DMCoMo model will be similar to the one presented in the equation

y = a_0 + \sum_{i=1}^{n} a_i x_i + e_i,   (6)

where the dependent variable (y) is the effort measured in men-months (MM) that we wish to estimate, the independent variables (x_i) are the cost drivers that appear in Table 2, a_i are the values found through linear regression, and n is the number of cost drivers, in our case 23.

As a result of the linear regression, the a_i values shown in Fig. 2 are obtained: each value of the B column in Fig. 2 is an a_i value. The first, a_0, is the constant, and the rest of the a_i are named after the cost driver they multiply. Hence, the effort equation E(p) of DMCoMo is

E(p) = 78.752 + 2.802 NTAB + 1.953 NTUP + 2.115 NATR + 6.426 DISP + 0.345 PNUL + (−2.656) DMOD + 2.586 DEXT + (−0.456) NMOD + 6.032 TMOD + 4.312 MTUP + 4.966 MATR + (−2.591) MTEC + 3.943 NFUN + 0.896 SCOM + (−4.615) TOOL + (−1.831) COMP + (−4.698) NFOR + 2.931 NDEP + (−0.892) DOCU + 2.135 SITE + (−0.214) KDAT + (−3.756) ADIR + (−4.543) MFAM.   (7)

Fig. 2. Linear regression coefficients. [Unstandardized coefficients B with standard errors, standardized betas, t statistics and significance (Sig.) for the constant and the 23 drivers; dependent variable MM. The constant is 78.752 (Sig. .051); the most significant drivers are DISP (Sig. .007), TMOD (.042) and NFOR (.047); the least significant are PNUL (.877), NMOD (.902), SCOM (.802), DOCU (.753) and KDAT (.961).]

Fig. 3. Model summary. [R = .893, R Square = .798, Adjusted R Square = .507, Std. Error of the Estimate = 14.846.]

Once the DMCoMo equation is built, it must be statistically analyzed. This analysis is carried out with the ANOVA test and an analysis of residuals. The summary of the linear regression in Fig. 3 shows that the model predicts about 50% of the training projects (adjusted R² = .507) and explains about 80% of their variance (R² = .798). The typical error of the model is 14.846, which is smaller than the standard deviation of the data, 21.147 (see Fig. 1). Thus, the error in the estimation of new projects is smaller if we use DMCoMo instead of the mean value of the training data.

Fig. 4. ANOVA analysis results. [Regression: sum of squares 13913.477, df 23, mean square 604.934; Residual: 3526.498, df 16, 220.406; Total: 17439.975, df 39; F = 2.745, Sig. = .021; dependent variable MM.]

The result of the ANOVA analysis is shown in Fig. 4.
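Eq. (7) can be applied directly once the 23 drivers have been rated on the 0–6 scale. A sketch with the coefficients transcribed from Eq. (7); the all-Nominal example input is illustrative, not a project from the paper:

```python
# Coefficients of the 23-driver DMCoMo equation, Eq. (7).
CONST = 78.752
COEF = {
    "NTAB": 2.802, "NTUP": 1.953, "NATR": 2.115, "DISP": 6.426,
    "PNUL": 0.345, "DMOD": -2.656, "DEXT": 2.586, "NMOD": -0.456,
    "TMOD": 6.032, "MTUP": 4.312, "MATR": 4.966, "MTEC": -2.591,
    "NFUN": 3.943, "SCOM": 0.896, "TOOL": -4.615, "COMP": -1.831,
    "NFOR": -4.698, "NDEP": 2.931, "DOCU": -0.892, "SITE": 2.135,
    "KDAT": -0.214, "ADIR": -3.756, "MFAM": -4.543,
}

def dmcomo_23(drivers):
    """Effort estimate E(p) in men-months from the 23 rated drivers."""
    return CONST + sum(COEF[name] * value for name, value in drivers.items())

# Illustrative project with every driver at Nominal (value 3).
nominal = {name: 3 for name in COEF}
```

Note that the all-Nominal estimate falls inside the 90–185 men-month range for which the model is considered reliable.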
Looking at Fig. 4, where the ANOVA analysis is shown, we conclude that the regression model is statistically significant, because its p-value is smaller than 0.05. Hence, the model has a confidence level of 95% and the regression is statistically useful.

Although the model is useful, we have to take into account the relative statistical importance of each cost driver in the regression equation of DMCoMo. This importance is reflected in the Sig. column of Fig. 2: the larger the Sig. value, the less significant the cost driver. The clearly non-significant drivers are PNUL, NMOD, SCOM, DOCU and KDAT (Sig. above 0.75); these drivers do not have a great influence on the estimation of the effort of the project.

The normality of the residuals can be tested statistically using the Kolmogorov–Smirnov test (see Fig. 5). Since the Asymp. Sig. (2-tailed) value in Fig. 5 (0.564) is greater than 0.10, we conclude that the residuals follow a normal distribution with a confidence level of 90%. Hence, the regression model is acceptable. The previous tests allow us to establish that the regression model can be used, with an acceptable error, to estimate the effort of a Data Mining project.

Although the model is useful, in Fig. 2 we can see that several Sig. values are greater than 0.1. Those cost drivers are not significant and do not have a great influence on the regression equation; because of that, they could be removed without affecting the result of the estimation in an important way. We therefore use the step-wise method to build a second regression equation. This method keeps only the statistically significant variables in the regression equation. Using the same data that were used to build the first regression model (the 23 cost drivers of the 40 projects), the step-wise regression equation was created. The results are shown in Fig. 6.
Fig. 5. Kolmogorov–Smirnov test results. [One-sample test on the unstandardized residuals of the 23-driver model: N = 40, mean 0.000, std. deviation 9.509; most extreme differences: absolute .125, positive .125, negative −.088; Kolmogorov–Smirnov Z = .788, Asymp. Sig. (2-tailed) = .564. Test distribution is normal; calculated from data.]

Fig. 6. Step-wise regression coefficients. [Model 8, dependent variable MM. Unstandardized coefficients B (standard error, Sig.): constant 70.897 (13.505, .000), TMOD 7.257 (1.911, .001), DISP 4.792 (1.596, .005), MATR 4.615 (2.019, .029), MFAM −3.275 (1.522, .039), NFOR −3.842 (1.712, .032), DEXT 2.713 (1.897, .163), NTAB 2.368 (1.224, .062), NATR 2.885 (1.906, .140).]

Fig. 7. Step-wise model summary. [Model 8: R = .810, R Square = .656, Adjusted R Square = .568, Std. Error of the Estimate = 13.904.]

Fig. 8. ANOVA analysis results of the step-wise regression model. [Regression: sum of squares 11446.860, df 8, mean square 1430.857; Residual: 5993.115, df 31, 193.326; Total: 17439.975, df 39; F = 7.401, Sig. = .000.]

The number of drivers has thus been reduced through the step-wise method. The new equation is

E(p) = 70.897 + 2.368 NTAB + 2.885 NATR + 4.792 DISP + 2.713 DEXT + 7.257 TMOD + 4.615 MATR + (−3.842) NFOR + (−3.275) MFAM.   (8)

In Fig. 7 the features of the model of Fig. 6 are shown. This new model predicts about 56% of the training projects (adjusted R² = .568) and explains about 65% of the variance of the training data (R² = .656). The ANOVA analysis (see Fig. 8) shows that the model has a confidence level of 95%, because the p-value is smaller than 0.05; hence, the model is statistically significant. The analysis of the residuals of the regression shows that they follow a normal distribution (see the histogram and P–P plot in Fig. 9).
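The reduced model of Eq. (8) can be evaluated the same way; the coefficients are transcribed from Eq. (8), and the all-Nominal input is again only illustrative:

```python
# Coefficients of the 8-driver step-wise DMCoMo equation, Eq. (8).
CONST8 = 70.897
COEF8 = {
    "NTAB": 2.368, "NATR": 2.885, "DISP": 4.792, "DEXT": 2.713,
    "TMOD": 7.257, "MATR": 4.615, "NFOR": -3.842, "MFAM": -3.275,
}

def dmcomo_8(drivers):
    """Effort estimate E(p) in men-months from the 8 rated drivers."""
    return CONST8 + sum(COEF8[n] * v for n, v in drivers.items())
```

With only eight drivers to rate, this version is the one suited to early, loosely specified projects.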
Fig. 9. Residual analysis of the step-wise model. [Histogram of the regression standardized residuals (dependent variable MM; std. dev. = .89, mean = 0.00, N = 40) and normal P–P plot of the regression standardized residuals.]

Fig. 10. Kolmogorov–Smirnov test for the step-wise regression. [One-sample test on the unstandardized residuals: N = 40, mean 0.000, std. deviation 12.396; most extreme differences: absolute .088, positive .088, negative −.081; Kolmogorov–Smirnov Z = .557, Asymp. Sig. (2-tailed) = .916. Test distribution is normal; calculated from data.]

The normality of the residuals can also be tested with the Kolmogorov–Smirnov test (see Fig. 10): the Asymp. Sig. (2-tailed) value is greater than 0.10, so we can conclude that the residuals follow a normal distribution with a confidence level of 90%. Thus, the step-wise regression is useful.

Therefore, the two models (Eqs. (7) and (8)) are statistically useful for estimating the effort of Data Mining projects. The model created with the step-wise method is easier to apply because it has only eight cost drivers. Next, the two models will be used to estimate Data Mining projects, and the results of the estimations will be analyzed.

4. Experimentation and results

Once DMCoMo has been established, it will be used to estimate new Data Mining projects. To do that, data of 15 Data Mining projects were gathered in the same way as the data of the 40 training projects (see Section 3.2.1). Then, the two models (Eqs. (7) and (8)) were used to estimate the effort of these 15 new Data Mining projects. The results of the estimations are shown in Fig. 11, where Id. is the project identifier, MM is the real effort reported by the project manager, $E-MM (23 drivers) is the estimated effort using Eq. (7) and $E-MM (8 drivers) is the estimated effort using Eq. (8).

Fig. 11. Estimated effort.

Id. | MM | $E-MM (23 drivers) | $E-MM (8 drivers)
1 | 117 | 132.564 | 133.43
2 | 93 | 100.803 | 116.979
3 | 162 | 130.502 | 140.718
4 | 167 | 136.377 | 116.784
5 | 105 | 92.2694 | 125.987
6 | 168 | 146.425 | 138.377
7 | 108 | 93.0873 | 107.88
8 | 131 | 117.749 | 116.728
9 | 123 | 129.402 | 132.527
10 | 121 | 138.559 | 120.213
11 | 87 | 91.6333 | 99.284
12 | 127 | 101.14 | 114.264
13 | 113 | 87.5875 | 91.605
14 | 118 | 96.9164 | 94.563
15 | 154 | 132.528 | 110.079

If the real effort (MM) and the estimated efforts ($E-MM) are compared, we obtain the results shown in Table 3, which lists the minimum, maximum and mean errors of the estimation methods with respect to the real effort value, together with the standard deviation. The standard deviation shows that with the 23-cost-drivers model the error is 16.908 MM, and with the 8-cost-drivers model it is 23.105 MM.

Table 3. Comparison of real and estimated effort

                    | 23 drivers | 8 drivers
Minimum error       | 17.559     | 23.979
Maximum error       | 31.498     | 50.216
Mean error          | 11.097     | 8.972
Absolute mean error | 18.025     | 20.066
Standard deviation  | 16.908     | 23.105

In Fig. 12 the real and estimated efforts of the test projects are depicted. Fig. 13 shows the relative error produced by the estimation equations for each project. The relative error is calculated as shown in the following equation:

Relative error = (estimated value − real value) / real value.   (9)

It is worth highlighting that 66% of the estimations have an error smaller than 15%, and 13% of the estimations have an error greater than 20% and smaller than 22%, when the 23-cost-drivers model is used.
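The error statistics quoted above can be reproduced from the 23-driver column of Fig. 11; the data below are transcribed from that figure:

```python
# Real efforts (MM) and 23-driver estimates of the 15 test projects (Fig. 11).
real = [117, 93, 162, 167, 105, 168, 108, 131, 123, 121, 87, 127, 113, 118, 154]
est23 = [132.564, 100.803, 130.502, 136.377, 92.2694, 146.425, 93.0873,
         117.749, 129.402, 138.559, 91.6333, 101.14, 87.5875, 96.9164, 132.528]

# Eq. (9): relative error of each estimation.
rel = [(e - r) / r for e, r in zip(est23, real)]

# 10 of the 15 projects (66%) are estimated within 15%.
within_15pct = sum(1 for x in rel if abs(x) < 0.15)

# Signed mean error in MM, matching the 11.097 figure of Table 3
# (the model underestimates on average, hence the negative sign).
mean_error = sum(e - r for e, r in zip(est23, real)) / len(real)
```

The same computation on the 8-driver column yields the 53% and 26% figures quoted below for Eq. (8).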
If we use the 8-cost-drivers model, 53% of the estimations have an error below 15% and 26% have an error between 20% and 30%.

Fig. 12. Real and estimated efforts of 15 test projects. [Plot of MM, $E-MM (23 drivers) and $E-MM (8 drivers) for projects 1–15.]

Fig. 13. Relative error of estimations. [Bar chart of the relative error of both models for each of the 15 test projects; errors range from about 0% to 30%.]

5. Conclusions

In this paper we have presented a generic cost model for Data Mining projects. The cost model is a parametric one, like COCOMO. The model is composed of an equation and the cost drivers that affect a Data Mining project; hence, cost drivers for Data Mining projects have been proposed. DMCoMo estimates the effort of a Data Mining project, in men-months, taking some of its features into account. Two different equations are proposed for DMCoMo, obtained with different methods of multivariate linear regression: one has 23 cost drivers and can be used when the project is well defined, and the other has 8 cost drivers and can be used when the project is only loosely defined.

Appendix A. Complete model

In this appendix the DMCoMo cost drivers are summarized. The rating levels of the cost drivers, and the way of obtaining each rating level, are summarized in Table A.1.

DISP calculation:

DISP = (1/V) ( \sum_i s_i^2 + \sum_j H_j − M ),   (A.1)

where i runs over the qualitative attributes, j runs over the quantitative attributes, and V and M are the variance and the mean of the variance and entropy values of all attributes. Subtracting M and dividing by V normalizes the dispersion value into the range [0, 1].

PNUL calculation: the percentage of null values of each attribute must be computed with the help of Table A.2.
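A sketch of the ingredients of Eq. (A.1): variance for quantitative attributes and Shannon entropy [36] for qualitative ones. The exact normalization is ambiguous in the text, so `dispersion` is only our reading of the formula, with M and V taken as the mean and variance of the per-attribute scores:

```python
import math

def entropy(values):
    """Shannon entropy H of a qualitative attribute's value distribution."""
    n = len(values)
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def variance(values):
    """Population variance s^2 of a quantitative attribute."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)

def dispersion(qualitative, quantitative):
    """Our reading of Eq. (A.1): (sum of per-attribute scores - M) / V,
    where the scores are variances and entropies of the attributes."""
    scores = [variance(a) for a in quantitative] + [entropy(a) for a in qualitative]
    m = sum(scores) / len(scores)
    v = variance(scores)
    return (sum(scores) - m) / v if v else 0.0
```

Once DISP is computed, its rating level is looked up in the DISP row of Table A.1.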
Table A.1. DMCoMo cost drivers description. Rating values run from the lowest to the highest level (XB, MB, B, N, A, MA, EA); drivers marked with an asterisk take the numeric value computed as described in this appendix.

NTAB: up to 20 tables (XB); 20–60 (MB); 60–80 (B); 80–100 (N); 100–120 (A); 120–300 (MA); above 300 tables (EA).
NTUP: up to 5·10⁷ tuples; 5·10⁷–10·10⁷; 10·10⁷–20·10⁷; 20·10⁷–50·10⁷; more than 50·10⁷ tuples.
NATR: up to 500 attributes; 500–1000; 1000–1500; 1500–2000; more than 2000 attributes.
DISP*: 0 ≤ H < 0.2; 0.2 ≤ H < 0.4; 0.4 ≤ H < 0.6; 0.6 ≤ H < 0.8; 0.8 ≤ H ≤ 1.
PNUL*: 1; 2; 3; 4; 5.
DMOD: data model available for all models; for 90% of models; for 80–90%; for 70–80%; for 60–70%; for less than 60% of models.
DEXT: 1–3 external data sources; 3–5; 5–7; more than 7 external data sources.
NMOD: 1–3 models; 3–5; 5–7; more than 7 models.
TMOD*, MTUP*, MATR*, MTEC*: 2; 3; 4; 5.
NFUN: only 1 data source; 2–3 homogeneous data sources; 2–3 heterogeneous data sources; more than 3 heterogeneous data sources without data on paper; more than 3 heterogeneous data sources with data on paper.
SCOM: data in the machine where they will be analyzed; data in the same database; all data sources in the same building, communicating through a LAN; data sources in distinct places, but communicating; data sources in distinct places and not communicating.
TOOL: tools used for all models; for more than 70% of models; for 50–70% of models; for up to 50% of models; no tools used.
COMP*: 0; 1; 2; 3; 4; 5.
NFOR*: 1; 2; 3; 4; 5.
NDEP: 1 department; 2 departments; 3–5 departments; more than 5 departments.
DOCU: implanted model; all generated models; all models and the central Data Mining phases; all models and all Data Mining phases.
SITE: 1–6, see Table A.10.
KDAT: collaboration of business and data experts; collaboration of a data expert; data unknown, but a data description exists; no data description or data model.
ADIR: department management supports the project; department management supports the project and executive management does not oppose it; department management supports the project but executive management does not; neither department nor executive management supports the project.
MFAM*: 1; 2; 3; 4; 5.

To calculate the PNUL value, the following equation must be used, and the rating level of PNUL must then be looked up in Table A.1:

PNUL = ROUND( (1/n) \sum_{i=1}^{n} PNULp(i) ).   (A.2)

Table A.2. PNULp (PNUL for each attribute): up to 10% of null values — 1 (MB); 10–15% — 2 (B); 15–20% — 3 (N); 20–25% — 4 (A); more than 25% of null values — 5 (MA).

TMOD calculation: the TMODp value of each model has to be obtained using Table A.3 and combined with the following equation; the rating level must then be looked up in Table A.1:

TMOD = ROUND( (1/n) \sum_{i=1}^{n} TMODp(i) ).   (A.3)

Table A.3. TMODp (TMOD for each model): descriptive model, association — 1 (MB); descriptive model, clustering — 2 (B); descriptive model, sequential patterns — 3 (N); predictive model, classification — 4 (A); predictive model, prediction, estimation or temporal series — 5 (MA).

MTUP calculation: to compute the MTUP value, the MTUPp of each model must be obtained using Table A.4.

Table A.4. MTUPp (MTUP for each model): up to 5·10⁶ tuples — 1 (MB); between 5·10⁶ and 10·10⁶ — 2 (B); between 10·10⁶ and 20·10⁶ — 3 (N); between 20·10⁶ and 50·10⁶ — 4 (A); more than 50·10⁶ — 5 (MA).

Once MTUPp has been obtained for each model, MTUP is computed using the equation

MTUP = ROUND( (1/n) \sum_{i=1}^{n} MTUPp(i) ).   (A.4)

MATR calculation: the number and type of the attributes used by each model must be rated using Tables A.5 and A.6. Next, MATRn and MATRt have to be calculated using the equations

MATRn = ROUND( (1/n) \sum_{i=1}^{n} MATRnp(i) ),
MATRt = ROUND( (1/n) \sum_{i=1}^{n} MATRtp(i) ).   (A.5)

Table A.5. MATRnp (MATRn for each model): up to 10 attributes — 1 (MB); between 10 and 20 — 2 (B); between 30 and 50 — 3 (N); between 50 and 70 — 4 (A); more than 70 attributes — 5 (MA).

Table A.6. MATRtp (MATRt for each model): all attributes non-numeric — 1 (MB); more non-numeric attributes than numeric ones — 2 (B); 50% numeric and 50% non-numeric attributes — 3 (N); more numeric attributes than non-numeric ones — 4 (A); all attributes numeric — 5 (MA).

The MATR value is calculated using the following equation; the rating level is then obtained from Table A.1:

MATR = TRUNC( (MATRn + MATRt) / 2 ).   (A.6)

MTEC calculation: the MTECp value of each model is calculated using Table A.7, and the following equation is used to compute the overall MTEC value; its rating level has to be looked up in Table A.1:

MTEC = ROUND( (1/n) \sum_{i=1}^{n} MTECp(i) ).   (A.7)

Table A.7. MTECp (MTEC for each model). Descriptive models: more than four techniques available to generate the model — 1 (MB); three techniques — 2 (B); two techniques — 3 (N); one technique — 4 (A). Predictive models: more than four techniques available to generate the model — 2 (B); four techniques — 3 (N); three techniques — 4 (A); two techniques — 5 (MA); one technique — 6 (EA).

COMP calculation: compute the COMPp value of each tool using Table A.8; use the following equation to obtain the COMP value and Table A.1 to obtain its rating level:

COMP = ROUND( (1/n) \sum_{i=1}^{n} COMPp(i) ).   (A.8)

NFOR calculation: calculate the NFORp value of each tool using Table A.9.
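Several of the composite drivers (PNUL, TMOD, MTUP, MATRn/MATRt, MTEC, COMP, NFOR, MFAM) share the same aggregation pattern: rate each attribute, model or tool with its table, average the ratings, and round. A sketch of that pattern, using the PNUL thresholds of Table A.2 and reading ROUND as round-half-up (an assumption; the appendix does not define the rounding rule):

```python
def round_half_up(x):
    """ROUND as assumed for Eqs. (A.2)-(A.5): nearest integer, .5 up.
    (Python's built-in round() uses banker's rounding, so avoid it here.)"""
    return int(x + 0.5)

def pnul_p(null_pct):
    """Table A.2: rating of one attribute from its percentage of nulls."""
    for threshold, level in ((10, 1), (15, 2), (20, 3), (25, 4)):
        if null_pct <= threshold:
            return level
    return 5

def aggregate(ratings):
    """Eq. (A.2)-style aggregation: ROUND of the mean per-item rating."""
    return round_half_up(sum(ratings) / len(ratings))

def matr(matr_n, matr_t):
    """Eq. (A.6): MATR = TRUNC((MATRn + MATRt) / 2)."""
    return (matr_n + matr_t) // 2
```

For instance, three attributes with 5%, 12% and 30% nulls rate 1, 2 and 5, whose mean 2.67 rounds to a PNUL of 3 (Nominal).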
Table A.8. COMPp (COMP for each tool): total compatibility and integration with all the tools available in the company — 0 (XB); compatibility with the text editors, spreadsheets, DBMSs and Data Mining tools available in the company — 1 (MB); compatibility with the text editors, spreadsheets and DBMSs available in the company — 2 (B); compatibility with the text editors and spreadsheets available in the company — 3 (N); compatibility with the text editors available in the company — 4 (A); no compatibility with the tools available in the company — 5 (MA).

Table A.9. NFORp (NFOR for each tool): the tool uses wizards and intelligent agents that guide the user through the Data Mining process, so the user needs only a light knowledge of Data Mining techniques — 1 (MB); knowledge of Data Mining techniques, with a tool that uses wizards — 2 (B); light knowledge of Data Mining techniques and of the tool — 3 (N); knowledge of Data Mining techniques and expertise in the tool — 4 (A); expertise in both Data Mining techniques and the tool — 5 (MA).

Table A.10. SITE cost driver description (location / communication): in the same location, interactive multimedia — 1 (MB); same building or complex, broadband and, rarely, videoconference — 2 (B); same city or metropolitan area, broadband — 3 (N); several cities and several companies, narrowband and e-mail — 4 (A); several cities and several companies, telephone and fax — 5 (MA); international, telephone and mail — 6 (EA).
Table A.11. MFAMp (MFAM for each model): the project staff has been working together on the same kind of Data Mining projects as the new one and with similar data — 1 (MB); the staff has worked on the same kind of Data Mining projects as the new one and with similar data — 2 (B); the staff has worked on the same kind of Data Mining projects as the new one, but the data are different — 3 (N); the staff has worked on the same kind of Data Mining projects as the new one, but never in the same environment — 4 (A); the staff has never worked on Data Mining projects — 5 (MA).

Compute NFOR using the following equation and look up its rating level in Table A.1:

NFOR = ROUND( (1/n) \sum_{i=1}^{n} NFORp(i) ).   (A.9)

SITE cost driver: Table A.10 is used to obtain the SITE cost driver rating.

MFAM calculation: Table A.11 is used to obtain the MFAMp value of each model, and the following equation is used to obtain the overall MFAM value; the rating level of MFAM is obtained from Table A.1:

MFAM = ROUND( (1/n) \sum_{i=1}^{n} MFAMp(i) ).   (A.10)

References

[1] J. Dyché, The CRM Handbook: A Business Guide to Customer Relationship Management, first ed., Addison-Wesley, Reading, MA, 2001.
[2] G. Piatetsky-Shapiro, W. Frawley, Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA, 1991.
[3] U. Fayyad, G. Piatetsky-Shapiro, P. Smith, R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Cambridge, MA, 1996.
[4] L. DiLauro, What's Next in Monitoring Technology? Data Mining Finds a Calling in Call Centers, May 2000.
[5] B. Chatham, B.D. Temkin, K.M. Gardiner, T. Nakashima, CRM's Future: Humble Growth Through 2007, July 2002.
[6] KdNuggets.Com, http://www.kdnuggets.com/polls, 2002.
[7] ISL, Clementine User Guide, Version 5, Integral Solutions Limited, July 1995.
[8] IBM, Application Programming Interface and Utility Reference, IBM DB2 Intelligent Miner for Data, IBM, September 1999.
[9] I.H.
Witten, Data Mining: Practical Machine Learning Tools with Java Implementations, 2000.
[10] The Data Mining Research Group, DBMiner User Manual, Simon Fraser University, Intelligent Database Systems Laboratory, December 1997.
[11] P. Chapman (NCR), J. Clinton (SPSS), R. Kerber (NCR), T. Khabaza (SPSS), T. Reinartz (DaimlerChrysler), C. Shearer (SPSS), R. Wirth (DaimlerChrysler), CRISP-DM 1.0 Step-by-step Data Mining Guide, Technical Report, CRISP-DM, 2000.
[12] ISO, ISO/IEC Standard 12207:1995, Software Life Cycle Processes, International Organization for Standardization, Geneva, Switzerland, 1995.
[13] IEEE, Standard for Developing Software Life Cycle Processes, IEEE Std. 1074-1991, IEEE Computer Society, New York, USA, 1991.
[14] J. Kleinberg, C. Papadimitriou, P. Raghavan, A microeconomic view of data mining, J. Data Min. Knowl. Discovery 2 (4) (1998) 311–324.
[15] B. Masand, G. Piatetsky-Shapiro, A comparison of approaches for maximizing business payoff of prediction models, in: Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, 1996, pp. 195–201.
[16] P. Domingos, How to get a free lunch: a simple cost model for machine learning applications, in: Proceedings of the AAAI-98/ICML-98 Workshop on the Methodology of Applying Machine Learning, 1998.
[17] R.A. Brealey, S.C. Myers, Principles of Corporate Finance, fifth ed., McGraw-Hill, New York, NY, 1996.
[18] B.W. Boehm, C. Abts, A.W. Brown, S. Chulani, B.K. Clark, E. Horowitz, R. Madachy, D. Reifer, B. Steece, Software Cost Estimation with COCOMO II, Prentice-Hall, Englewood Cliffs, NJ, 2000.
[19] L.H. Putnam Sr., D.T. Putnam, L.H. Putnam Jr., M.A. Ross, Software Lifecycle Management (SLIM) Training: SLIM Estimate Exercises with Answers, Quantitative Software Management, McLean, VA, 2000.
[20] PRICE Systems LLC, PRICE S Reference Manual, Version 3.0, Lockheed-Martin, 1998.
[21] International Society of Parametric Analysts (ISPA), Parametric Cost Estimating Handbook, second ed., ISPA, 1999.
[22] PRICE Systems LLC, PRICE H Reference Manual, Version 3.0, Lockheed-Martin, 1998.
[23] J. Hamaker, Rules of thumb: space project cost trends over time holding technical performance constant, Parametric World Winter (2001–2002) 5–7.
[24] J. Hamaker, Using the minimum squared error regression approach, Parametric World 21 (3) (2002) 11–13.
[25] B. Boehm, Software Engineering Economics, Prentice-Hall, Englewood Cliffs, NJ, 1981.
[26] S. Chulani, B. Clark, B. Boehm, Calibration approach and results of COCOMO II.1997, in: 22nd Software Engineering Workshop, Goddard, NASA, 1997.
[27] S. Chulani, B. Clark, B. Boehm, B. Steece, Calibration approach and results of the COCOMO II post-architecture model, in: 20th Annual Conference of the International Society of Parametric Analysts (ISPA) and the 8th Annual Conference of the Society of Cost Estimating and Analysis (SCEA), 1998.
[28] T. Shrum, Calibration and validation of the CHECKPOINT model to the Air Force Electronic Systems Center software databases, Master's Thesis, Air Force Institute of Technology, 1997.
[29] L. Fischman, Calibrating a software evaluation model, in: ARMS Conference, 1997.
[30] J.J. Cuadrado Gallego, Método Matemático de Selección del Rango de las Variables de Entrada en los Modelos Paramétricos de Estimación Software, Ph.D. Thesis, Departamento de Informática, Escuela Politécnica Superior, Universidad Carlos III de Madrid, 2000.
[31] P. Turney, Types of cost in inductive concept learning, in: Workshop on Cost-Sensitive Learning at the 17th International Conference on Machine Learning (WCSL at ICML-2000), Stanford University, California, 2000, pp. 15–21.
[32] O.
Marbán, Modelo Matemático Paramétrico de Estimación para Proyectos de Data Mining (DMCoMo), Ph.D. Thesis, Facultad de Informática, Universidad Politécnica de Madrid, June 2003.
[33] H. Linstone, M. Turoff, The Delphi Method: Techniques and Applications, Addison-Wesley, Reading, MA, 1975.
[34] J.A. Farquhar, A Preliminary Inquiry into the Software Estimation Process, Technical Report RM-6271-PR, The Rand Corporation, 1970.
[35] S. Devnani-Chulani, Bayesian Analysis of Software Cost and Quality Models, Ph.D. Thesis, Faculty of the Graduate School, University of Southern California, May 1999.
[36] C.E. Shannon, W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, IL, 1949.
[37] W.E. Griffiths, R.C. Hill, G.G. Judge, Learning and Practicing Econometrics, Wiley, New York, 1993.
[38] S. Weisberg, Applied Linear Regression, Wiley, New York, 1985.