ARTICLE IN PRESS
Information Systems 33 (2008) 133–150
www.elsevier.com/locate/infosys
A cost model to estimate the effort of data
mining projects (DMCoMo)
Oscar Marbán, Ernestina Menasalvas, Covadonga Fernández-Baizán
Facultad de Informática, Universidad Politécnica de Madrid (U.P.M.), Campus de Montegancedo s/n.,
28660 Boadilla del Monte, Madrid, Spain
Received 26 February 2007; accepted 7 July 2007
Recommended by N. Koudas
Abstract
CRISP-DM is the standard process model for developing Data Mining projects. CRISP-DM proposes the processes and tasks that have to be carried out to develop a Data Mining project. One of the tasks proposed by CRISP-DM is the cost estimation of the Data Mining project.
In software development, many methods have been described to estimate the cost of project development (SLIM, SEER-SEM, PRICE-S and COCOMO). These methods are not appropriate for Data Mining projects, because in Data Mining software development is not the primary goal.
Some methods have been proposed to estimate the cost of particular phases of a Data Mining project, but there is no method to estimate the global cost of a generic Data Mining project. This lack of estimation methods is the cause of many real-life project failures, due to unrealistic estimates made at the beginning of the projects.
Consequently, in this paper we propose to design and validate a parametric cost estimation model for Data Mining projects (DMCoMo1), similar to COCOMO or SLIM in software development. The drivers of the model are proposed first, and then the equation of the model.
© 2007 Elsevier B.V. All rights reserved.
Keywords: Data Mining; Knowledge discovery; Cost estimation; Parametric model
1. Introduction
The concept of CRM (Customer Relationship Management) dates back to the time when the cave man could choose whether he wanted to trade with Og or Thag.
Corresponding author. Tel.: +34 913367388; fax: +34 913367393.
E-mail addresses: omarban@fi.upm.es (O. Marbán), emenasalvas@fi.upm.es (E. Menasalvas), cfbaizan@fi.upm.es (C. Fernández-Baizán).
1 The work presented in this paper has been partially supported by UPM project ERDM ref. 14589.
However, the term CRM was first used in the mid-1990s. CRM can be defined as: ‘‘to give the client what he wants, when he wants it and where he wants it’’ [1]. The main objective of CRM projects is to recover the one-to-one relationship with the client, which has been lost as a consequence of the competitive environment in which modern companies work. For this reason, companies have been developing CRM systems for the past 10 years in order to retain their clients. In CRM systems we can distinguish three areas: operational CRM, collaborative
0306-4379/$ - see front matter © 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.is.2007.07.004
CRM and analytic CRM. Analytic CRM analyzes the operational data to optimize the relationship with the client. Due to the great volume of data that must be analyzed, Data Mining techniques must be used [2,3].
Therefore, Data Mining research has been increasing in recent years [4,5]. This growth has been motivated by the need of companies to find the knowledge that is hidden in their data, knowledge that allows them to compete against other companies. For this reason, companies are investing more and more resources in Data Mining projects [6].
The need for efficient methods to search for knowledge in data has led to the development of many Data Mining algorithms and tools [7–10]. However, due to the complexity of the Data Mining process, a Data Mining methodology is needed. The de facto Data Mining methodology is CRISP-DM [11].
CRISP-DM evolved to solve the problems that companies had in the development of Data Mining projects. CRISP-DM is a process model to develop Data Mining projects and was proposed by a consortium of companies (Teradata, SPSS (ISL), Daimler-Chrysler and OHRA). CRISP-DM defines the processes and tasks that have to be done in order to develop a successful Data Mining project. For each task proposed by CRISP-DM, the inputs and outputs of the task are also specified. Hence, CRISP-DM provides a process model to develop Data Mining projects, just as ISO 12207 [12] and IEEE 1074 [13] do for software projects.
In the ‘‘Business understanding’’ phase, CRISP-DM proposes a task to make the project plan. In this task the project has to be budgeted, and the cost of the project has to be calculated taking into account the time and the personnel that are needed to develop the Data Mining project. However, CRISP-DM does not propose how to carry out this task.
If we wish to rate the success or failure of a Data Mining project, we need a method to calculate the goodness of the knowledge extracted by the model, the time used to obtain the knowledge, the cost of the personnel and resources used in the project, etc. However, the cost of the project also needs to be estimated, because if the cost of the knowledge is not affordable for the company, the project is not viable.
Some research has been done on estimating the goodness of the knowledge extracted from the data. Thus, in [14] a framework to estimate the goodness of knowledge after the Data Mining phase in CRM projects is proposed. This framework tries to maximize the value of the knowledge extracted. In [15] the value of customers is taken into account to maximize the benefit of predictive Data Mining models.
Regarding the cost estimation of Data Mining projects, in [16] a cost estimation model for classification problems is proposed, which can be used at any moment along the project. This model is based on NPV (Net Present Value) [17]. NPV is calculated as the difference between the money recovered from the investment and the money invested in the project. In the model presented in [16] the NPV is used to decide whether the project will continue: NPV is calculated at any point in the project, and the project continues only if NPV has a positive value.
None of the previous estimation methods allows establishing the effort, time and cost at the beginning of the project. Alternatively, we could try to use software estimation tools such as COCOMO II [18], SLIM [19] or PRICE-S [20] to estimate the cost of Data Mining projects. If we take a look at these tools, however, we can conclude that they are not useful for estimating Data Mining projects, because they use the size of the software to be developed, in lines of code, as the main input. Other factors used in software estimation are the experience of the development team, the use of tools, the features of the development platform and so forth. These features should also be used to estimate the cost of Data Mining projects. Nevertheless, if we wish to estimate Data Mining projects, we must also allow for other features of Data Mining projects, such as the characteristics of the data sources, the data integration level, the kind of Data Mining problem to be solved and the number of models to build, inter alia. Software estimation methods do not consider those features of Data Mining projects. Hence, software estimation methods are not useful for estimating Data Mining projects.
Consequently, we can say that nowadays there is no cost estimation method for Data Mining projects, although Data Mining projects have been developed for the past 20 years. Therefore, in this paper we propose a parametric estimation model for Data Mining projects. The model is named DMCoMo (Data Mining Cost Model). DMCoMo is a parametric cost estimation model in the style of the COCOMO family. DMCoMo allows estimating the effort (in man-months) that is needed to develop a Data Mining project, from its conception until its deployment.
The rest of the paper is organized as follows. In Section 2 the work related to this research is presented. Section 3 describes DMCoMo, with its cost drivers and equations. Section 4 shows the results produced by DMCoMo in the estimation of Data Mining projects. Section 5 presents the conclusions and future lines of work. Finally, Appendix A shows the complete DMCoMo model.
2. Related work
Parametric cost estimation models were the first to be developed [21]. The Rand Corporation developed the first parametric cost model, named Cost Estimating Relationship (CER) [21]. CER estimates the cost of aircraft, taking into account some of their features.
Estimation tools were developed at the same time as the estimation methods, to automate the estimation process. PRICE-H [22] and PRICE-S [20] were the first estimation tools that implemented parametric estimation methods. PRICE-H estimates the cost of developing hardware components, and PRICE-S estimates the cost of developing software.
Parametric estimation models have been developed to estimate different kinds of projects: software projects (COCOMO [18], SLIM [19], etc.), hardware projects (PRICE-H [22]), and even NASA space launches [23] or shipbuilding [24].
Parametric models use mathematical equations to obtain the values of the estimations. The results of the estimations are dependent variables such as effort or development time. These dependent variables depend on a set of independent variables called cost drivers. Examples of cost drivers are the lines of code of a software application, or the required reliability or complexity of the software application to be developed.
Parametric models operate in a two-step process:
(1) A first approximation or estimation is made, which depends on the values of a reduced set of parameters whose weight in the final result is considered greater than the rest, and which are normally related not to the features of the project but to those of the product.
(2) The final result is determined using another set of variables that allow the estimation to be refined by introducing the specific characteristics of the application and the development environment.
The accuracy of parametric estimation models is based on:
(1) A precise definition of the equations to be used. Thus, for example, non-linear equations have replaced linear ones in most parametric mathematical models.
(2) Constant refining of the parameters used. This involves not only adding or removing parameters to reflect changes in technology but also a thorough understanding of those selected. Thus, for example, COCOMO II [18] has eliminated some of the effort multipliers used in COCOMO 81 [25], like Execution Time Constraint (TIME), and introduced others, such as Documentation of the project (DOCU).
(3) An accurate calibration of the numerical values for each parameter's rating levels. New reviewed and enlarged data sets as well as new statistical methods have been used [26–29].
(4) A wise selection of the rating level for each parameter of the selected model in order to calculate the estimations for the specific project [30].
To sum up, parametric cost estimation models for software development estimate the effort and time to develop the project taking into account features of the software and of the project, such as the size of the software, the characteristics of the project and those of the development team.
2.1. Cost estimation models for Data Mining projects
No method has been proposed to estimate the cost of a full Data Mining project. Nevertheless, some proposals to estimate particular aspects of Data Mining projects have been made. These proposals are described subsequently.
In [31] a classification of the different costs in inductive learning processes is proposed. According to the authors, this classification could help in estimating the results of a predictive problem. The identified costs are:
Cost of misclassification errors: This cost is due to models that do not correctly classify all the items presented to them.
Cost of the test: Each test to obtain test data may have an associated cost.
Cost of the teacher: A teacher is available to the learner, but each classification request that the apprentice makes to the instructor has an associated cost.
Cost of intervention: These costs are associated with the manipulation or modification of the values of the variables that participate in the classification.
Cost of unwanted achievements: These unwanted outcomes are due to the modification of some factors of the classification algorithm, as a result of which errors are obtained in the classification.
Cost of computation: Computer resources are limited; hence, the cost of these kinds of resources must be taken into account.
Human–computer interaction costs: These costs take into account the cost of the personnel who use the learning software. This includes the costs of deciding the attributes to use, setting the parameters of the algorithm, converting data to the format required by the algorithm and analyzing the resulting models.
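The first of these, the cost of misclassification errors, is the one most often quantified in practice. As a minimal sketch (the confusion and cost matrices below are illustrative assumptions, not taken from [31]), the expected cost per classified item can be computed as:

```python
# Expected misclassification cost from a confusion matrix and a cost matrix.
# Both matrices are indexed [true_class][predicted_class]; values are illustrative.

def expected_misclassification_cost(confusion, cost):
    """Average misclassification cost per classified item."""
    total_items = sum(sum(row) for row in confusion)
    total_cost = sum(
        confusion[i][j] * cost[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )
    return total_cost / total_items

# Two-class example with 100 items: correct predictions cost nothing,
# and a false negative (row 1, col 0) is 5 times costlier than a false positive.
confusion = [[70, 10],
             [5, 15]]
cost = [[0, 1],
        [5, 0]]
print(expected_misclassification_cost(confusion, cost))  # (10*1 + 5*5) / 100 = 0.35
```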
Although this work presents a classification of the costs associated with inductive learning processes, it does not propose how to estimate them. Next, a cost estimation model that tries to minimize a cost function in inductive learning processes is described.
In [14] a model to estimate the value of the knowledge obtained in the Data Mining phases of CRM projects is proposed. This work proposes a microeconomic estimation framework in which a pattern in the data is interesting only if it can be used when decisions are taken by the company. Therefore, a pattern is useful if it is transformed into information, information into actions and actions into value. In [14] the estimation problem becomes an optimization problem that can be formulated as follows:

max_{x ∈ D} f(x),    (1)

where D is the domain of all possible decisions (production plans, marketing strategies, etc.) and f(x) is the usefulness of the decision x ∈ D. In this work, Data Mining is studied from the economic point of view of optimization problems in which a great volume of non-aggregated data is used. The framework uses combinatorial optimization, linear programming and game theory. The main objective of this work is to assess the usefulness of Data Mining operations in a quantitative way.
Other work [15] proposed an estimation model for predictive Data Mining models. In [15], to estimate the profit of predictive Data Mining models, the values of the clients are borne in mind. The estimation model proposed in [15] is based on the following business model:

P = (r × p) − c,    (2)

where P is the profit obtained from a client, c is the cost of getting a client, r is the income obtained from a client and p is the probability that the client will accept an offer of the company. Thus, this model is used to evaluate different predictive Data Mining models in order to obtain a greater profit P for the company.
In [16] the NPV model is applied to decide whether the project will continue. NPV [17] is defined from the cash flows of the project, discounted at the expected ROI (Return On Investment), as we can see in the equation

NPV = C_0 + Σ_{t=1}^{∞} C_t / (1 + r)^t,    (3)
where C_0 is the initial cash flow, which is usually negative and represents the initial investment.
In [16] Eq. (3) is interpreted as follows: NPV represents the cost of development of the system. This cost includes the costs of hardware, software, personnel training, etc.; t is the time, C_t is the cash flow at time t and r is the expected ROI. The cash flow at a time t > 1 is the result of the decisions that were taken during the project, and it has two components: the cost of taking a decision and the cash flow that results from the decision.
This method takes into account features of the project, such as the experience of the staff or the use of Data Mining tools, as the cost estimation methods for software development do. But this estimation method does not estimate the effort of the project; it can only be used to decide whether the project will continue (NPV > 0) or halt (NPV < 0).
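The continue/halt rule of [16] can be sketched as follows; the infinite sum of Eq. (3) is truncated at a finite horizon, and the cash-flow figures are invented for illustration:

```python
def npv(c0, cash_flows, r):
    """Net Present Value: the initial cash flow c0 (usually negative, the
    initial investment) plus each later cash flow C_t discounted by (1 + r)^t."""
    return c0 + sum(ct / (1 + r) ** t for t, ct in enumerate(cash_flows, start=1))

# Illustrative project: 100 invested up front, then 30, 40 and 50 recovered
# over three periods, with an expected ROI of r = 10%.
value = npv(-100, [30, 40, 50], 0.10)
print(round(value, 2))                       # -2.1
print("continue" if value > 0 else "halt")   # halt, since NPV < 0
```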
3. An estimation model for Data Mining projects: DMCoMo
The models that have been proposed (see Section 2.1) are not generic models for estimating Data Mining projects. Cost estimation models for software projects could be used to estimate Data
Mining projects, but their main inconvenience is that they use the size of the software to be built as the main input. In Data Mining projects, software is not built; hence, in this paper we propose a parametric estimation model to estimate the effort of Data Mining projects. The proposed model is named DMCoMo [32].
In the following, the cost drivers that affect the effort of a Data Mining project are proposed. The proposed cost drivers are grouped into six categories: Data, Data Mining Models, Platform, Techniques and Tools, Project and Staff. In Section 3.1 the cost drivers of each group are introduced. The techniques to calculate each cost driver are described in Appendix A. The Delphi method [33–35] was used to establish the foundations of the levels and descriptions of each cost driver of DMCoMo.
3.1. Cost drivers for DMCoMo
3.1.1. Data cost drivers
Cost drivers in this group refer to the effort of data management in the project. Thus, if we work with a few tables, a few attributes and a low dispersion, the effort is smaller than if we work with many tables, many attributes and a high dispersion of the attributes. These cost drivers take into account features of the Data Mining project such as data quality, integration level and location. Data cost drivers have been grouped into five clusters: initial amount of data, dispersion, quality of data, data model availability and data privacy level.
The initial amount of data considers the number of tables (NTAB) in the database, the number of tuples (NTUP) and the number of attributes (NATR) of the tables that are stored in the databases to be used in the project. These cost drivers are calculated before the preprocessing Data Mining phase. NATR adds a bigger effort than NTAB and NTUP, because the bigger the number of attributes to be managed, the bigger the effort in the preprocessing Data Mining phase.
Dispersion (DISP) is defined as the number of different values in the domain of an attribute. This cost driver adds some effort to the Data Mining project. A combination of variance (σ²) and entropy [36] is used to calculate the value of DISP. Thus, to compute DISP, the variance of quantitative attributes and the entropy of qualitative attributes are worked out. Once variance and entropy have specific values, DISP is calculated using Eq. (A.1). Our experience shows that the bigger the number of different values of an attribute, the bigger the effort required to understand the models.
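Eq. (A.1) itself is given in the appendix; its two ingredients can be sketched as follows (a minimal illustration of variance for quantitative attributes and entropy for qualitative ones, not the authors' exact combination):

```python
import math
from collections import Counter

def variance(values):
    """Dispersion of a quantitative attribute."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def entropy(values):
    """Shannon entropy (in bits) of a qualitative attribute."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Quantitative attribute with a single repeated value vs. a uniform
# qualitative attribute over four categories.
print(variance([10, 10, 10, 10]))      # 0.0 -> no dispersion
print(entropy(["N", "S", "E", "W"]))   # 2.0 bits -> maximal dispersion over 4 values
```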
As far as data quality is concerned, it reflects how clean the data are, and it is divided into two cost drivers: the percentage of null values in the data (PNUL) and whether the criteria of data codification are available (CCOD). Null values must be taken into account to calculate the effort of the project because they must be processed, and different techniques could be applied: for instance, tuples that contain null values could be deleted, null values of an attribute could be filled in through a predictive model, or the attribute that has null values may be erased. The success of the Data Mining project depends on the right definition of the problem to solve and the right treatment of null values. CCOD adds the effort of transforming data to be used by the algorithms. If the transformation criteria are given by an expert, the effort will be smaller than if the person responsible for the preprocessing Data Mining phase had to devise them.
Furthermore, if documentation of the data sources (data models, descriptions of attributes, etc.) is available, the comprehension of the data will be easier and it will help in establishing the problem to solve. This modification of the effort is considered in the cost driver DMOD.
As regards the privacy level of data, one may observe that it has an influence on the effort of the Data Mining project. If data are protected by law, there are useful data that cannot be used in the project. The cost driver that represents this effort is PRIV. The protected data could be substituted by external data, such as demographic databases. This substitution requires extra effort, because the external data have to be integrated with the data of the project. This effort is considered by the DEXT cost driver.
3.1.2. Data Mining model cost drivers
The number of Data Mining models (NMOD) to be created has to be considered when estimating the effort of a Data Mining project, because the bigger the number of Data Mining models, the bigger the effort will be, since the data have to be adapted to a bigger number of Data Mining algorithms and those algorithms have to be optimized. In addition, the type of Data Mining model to be developed (TMOD), and the number of tuples (MTUP) and the number of attributes (MATR) used by each model, have to be considered
to estimate the effort of the project, as well as the dispersion of each attribute. If new attributes (derived attributes) have to be obtained, the effort will also increase; hence, a new cost driver, MDER, is introduced in DMCoMo.
The cost driver TMOD considers the effort associated with the type of Data Mining model to be developed, because the effort of developing a predictive model is different from the effort required to develop a descriptive model. The amount of data that is necessary to develop each model after the preprocessing Data Mining phase must be estimated for each model, because different Data Mining algorithms need different data. Hence, the number of tuples and attributes, their types and their dispersion for each model must be considered in the effort of the Data Mining project.
Added to that, the Data Mining techniques available to develop a Data Mining model must be looked at. We have to consider whether a suitable Data Mining technique is available. If no Data Mining technique is available, we will have to develop a new one, and this adds additional effort to the project. Usually, the bigger the number of available Data Mining techniques, the smaller the effort required, because we can try all the Data Mining techniques instead of optimizing the parameters of a single Data Mining algorithm.
3.1.3. Development platform cost drivers
This cluster is formed by cost drivers related to the development platform. The first cost driver in this group is NFUN, which represents the effort introduced into the Data Mining project by the number of data sources where the data are stored. The number of different data servers (NSER) and how they communicate (SCOM) also influence the effort of the project.
NFUN considers the number and the type of data sources. The bigger the number of data sources, the bigger the effort will be, because a greater number of data sources must be integrated. If the data were stored in a Data Warehouse, the data would already be integrated; hence, the effort is smaller than if the data were stored in any other kind of storage medium. On the other hand, if we work with files or with relational databases, the required effort will be different. For example, an operation like the ‘‘join’’ of two or more tables is performed more easily in a relational database than in a file.
Additionally, different data servers do not share data easily; hence, native interconnection tools must be considered, for they could help in communicating different data servers. This effort is considered by the cost driver NSER.
3.1.4. Techniques and tools cost drivers
The use of Data Mining tools to develop Data Mining models facilitates the work. Thus, the available Data Mining tools (TOOL), the techniques implemented by the tools (NTEC) and the integration of the tools with the rest of the tools available in the project (COMP, TCOM) must be used to compute the effort of the Data Mining project.
The cost driver TOOL takes into account whether Data Mining tools are used in the project. Data Mining tools do not have to implement all Data Mining techniques; thus, if we have several Data Mining tools available for the project, it is probable that at least one of them implements the required technique. Therefore, NTEC represents the number of useful Data Mining techniques that are implemented in some Data Mining tool.
COMP (Compatibility) represents how compatible the Data Mining tools are with the rest of the software (text processors, spreadsheets, databases, etc.). TCOM reflects the compatibility between the different Data Mining tools that are used in the project. This cost driver distinguishes between tools that can use a Data Mining model that was created by a different Data Mining tool and tools that can convert Data Mining models to be used by them.
We also have to consider the effort of deciding which tool, technique and machine will be used to generate the models, because neither do all Data Mining tools perform Data Mining techniques in the same time and in the same way, nor do all Data Mining tools execute on every machine of the project. This effort is picked up by the cost driver TOMM.
Another cost driver that must be considered is TRAN. If the algorithms of the Data Mining tools have to be modified or adapted for the project, the modification will imply an extra effort that is considered by TRAN.
Lastly, the training of the staff of the project in the use of the Data Mining tools will influence the effort of the project, and it is gathered by the cost driver NFOR. Related to that, the level of user-friendliness of the Data Mining tools is considered in the cost driver TFRI. A user-friendly Data Mining tool reduces the effort of the project, because the work is easier.
3.1.5. Project cost drivers
Features of the project such as the number of
participating departments must be considered in the
computation of the total effort of the project. The cost drivers defined in this group are NDEP, DOCU, MSIM and SITE.
NDEP represents the number of departments participating in the project. NDEP influences the effort because each department could have its own data model and different names for the attributes, and some departments may not even want to participate in the project; hence, a greater effort is necessary. The documentation (DOCU) to be produced in the project also influences the effort of the project: if a high amount of documentation has to be written, it will require more effort, and not only the quantity of the documentation has to be taken into account but also its complexity. MSIM accounts for the extra effort of developing the same Data Mining model for multiple locations. Multiple-location development implies that local data have to be understood and integrated; hence, an extra effort is required. On the other hand, if the project is developed in different places (buildings, towns, countries, etc.), this will imply an additional effort due to the communications (telephone, ISDN, LAN, WAN, etc.). This effort is considered by the SITE cost driver.
3.1.6. Staff cost drivers
The staff of a Data Mining project is composed of sponsors, data analysts, data management specialists, business analysts, users and a project manager. These persons come from different areas: computer specialists, statisticians, executives, etc.; hence, an additional effort is required to reach agreement on the decisions of the project. The following drivers are proposed to take into account the effort due to staff collaboration.
PCON represents the time that the staff has been working together. If the staff has been working together for a long time, the persons in the team know each other and it is easier to reach agreement on decisions, but if the team has not previously worked together on a project, it is more difficult to reach an agreement. The ability of the staff to carry out different tasks in the project is very important, because if someone cannot work one day, another person can substitute for him. This feature is dealt with in the cost driver PCOM.
Additionally, if the data are previously known (KDAT) to the staff of the project, the effort will be less than if the data are completely unknown to them. The familiarity with the type of problem (MFAM) to be solved is also important to
determine the effort of the project. Knowledge of the problem facilitates its resolution and makes the problem easier to solve; hence, the effort will be smaller. Similarly, the knowledge of the business (BCON) on which the project is based, the experience of the staff with similar problems and the experience with the Data Mining tools to be used in the project (TEXP) are features to be taken into account in order to calculate the effort of the Data Mining project.
Lastly, the attitude of the management is another factor that influences the effort of the project (ADIR). If the management supports the project, it is easier to finish the project successfully and the effort will decrease, but if the project is not supported by the management, the effort will increase.
3.2. DMCoMo equation outline
Once the cost drivers have been defined, the equation of DMCoMo has to be outlined. In order to obtain the equation, information about the cost drivers and the effort of real Data Mining projects was gathered. The DMCoMo equation was created through multivariate linear regression [37,38], because that is the most usual way of obtaining the equation in parametric estimation methods [25,24,23]. The equation will be similar to Eq. (4), where y is the dependent variable, x_i is the ith independent variable, n is the number of independent variables, the a_i are constants and e_i is the error in the ith estimation:

y = a_0 + Σ_{i=1}^{n} a_i x_i + e_i.    (4)
In order to obtain the equation the following
steps must be carried out [38]:
Step 1: Descriptive study of the input data.
Step 2: Study of outliers in data.
Step 3: Correlation study between input variables.
Step 4: Application of linear regression to obtain
the equation.
Step 5: Statistical study of the significance level of
the equation.
The obtained model will be reliable for estimating projects whose effort is within the range of the projects that were used to create the equation, in our case between 90 and 185 man-months. If the effort of the project is outside that range, the behavior of the model is unknown.
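Step 4, the core of the procedure, fits the coefficients of Eq. (4) by ordinary least squares. A minimal sketch with invented data (the real model uses 40 projects and the full set of cost drivers, not the two hypothetical drivers shown here):

```python
import numpy as np

# Toy data: 6 "projects", 2 cost drivers each (rated 0-6), observed effort
# in man-months. All values are invented for illustration.
X = np.array([[3, 2], [5, 4], [2, 1], [6, 5], [4, 3], [1, 2]], dtype=float)
y = np.array([110.0, 150.0, 95.0, 170.0, 130.0, 100.0])

# Fit y = a0 + a1*x1 + a2*x2 by least squares: prepend a column of ones for a0.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a0, a1, a2 = coef

# Estimated effort for a new project with driver ratings (4, 4).
estimate = a0 + a1 * 4 + a2 * 4
print(estimate)
```

Steps 1, 2, 3 and 5 (descriptive study, outliers, correlations and significance) surround this fit and are what make the resulting equation trustworthy.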
3.2.1. Data description
Information about different Data Mining projects was gathered from several organizations. Different kinds of projects were involved: marketing projects of Spanish enterprises, meteorological projects and medical projects.
Table 1
Data collection form
Driver
NTAB
NTUP
NATR
DISP
PNUL
CCOD
DMOD
PRIV
DEXT
NMOD
Value Driver
TMOD
MTUP
MATR
MDIS
MDER
MTEC
NFUN
NSER
SCOM
TOOL
Value Driver
NTEC
COMP
TCOM
TOMM
TRAN
NFOR
TFRI
NDEP
DOCU
MSIM
Value Driver
Value
SITE
PCON
KDAT
ADIR
PEXP
MFAM
TEXP
BCON
In order to gather the data, the form in Table 1 was used. The project manager of each project filled in a form with information about the project. The values that can be used in the questionnaire are Extra low (XB), Very low (MB), Low (B), Nominal (N), High (A), Very high (MA) or Extra high (XA). Additionally, the duration field must be filled in with the number of months the project lasted, and the person field with the number of persons on the project staff. To obtain the effort (men month) required by the project, the value of the person field is multiplied by the value of the duration field, as shown in the equation

Effort(MM) = Duration(months) × Persons.   (5)

Later on, the qualitative values were translated into quantitative values in order to obtain the equation through linear regression. The translations are XB to 0, MB to 1, B to 2, N to 3, A to 4, MA to 5 and XA to 6.
Once the data were gathered, they were analyzed
statistically to observe whether all the variables are
Descriptive Statistics

Variable   N    Minimum   Maximum
NTAB       40   0         6
NTUP       40   1         5
NATR       40   1         5
DISP       40   1         5
PNUL       40   1         5
DMOD       40   1         5
DEXT       40   2         5
NMOD       40   2         5
TMOD       40   1         5
MTUP       40   1         5
MATR       40   1         5
MTEC       40   1         5
NFUN       40   1         4
SCOM       40   1         4
TOOL       40   1         5
COMP       40   1         5
NFOR       40   1         5
NDEP       40   2         4
DOCU       40   2         5
SITE       40   1         6
KDAT       40   1         3
ADIR       40   1         3
MFAM       40   1         5
MM         40   90        184   (mean = 121.48, std. deviation = 21.147)
Valid N (listwise): 40

Fig. 1. Statistical data description.
in the right range, they do not have null values and
their statistical distribution is appropriate to apply
linear regression methods.
3.2.2. DMCoMo equation
In order to establish the regression equation, the data must be statistically analyzed. This study examines the number of values, the maximum and minimum, and the standard deviation of each variable. In Fig. 1 we can see that all variables have 40 values, one for each project. The maximum and minimum are useful to check whether all cost drivers take values in their ranges. The mean and standard deviation of the effort are also shown. Note that if no regression equation were created and the mean value (121.48 MM) were used as the estimate, the error in the effort would be ±21.147 men month, the standard deviation of the MM (effort) variable.
The second step in creating the regression equation is to eliminate outliers from the data, deleting the tuples that contain them. However, since our data set contains only 40 projects, none of the tuples was deleted.
The next step is the study of the correlation between cost drivers. The Spearman correlation coefficient was used: we consider two cost drivers correlated if their Spearman correlation coefficient is above 0.5. Once the correlation coefficients were obtained, for each correlated pair one cost driver was removed. There are two possible ways to do this: deleting one of the two cost drivers, or integrating the two correlated cost drivers into one. In this paper we follow the first approach, deleting the less significant cost driver of each pair.
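The correlation criterion used for pruning can be sketched as follows. This is an illustrative pure-Python implementation of Spearman's rank correlation (the paper does not give its computation code); pairs with |rho| > 0.5 would have one driver dropped:

```python
# Sketch of the driver-pruning criterion: Spearman's rank correlation.
def _ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):                 # average ranks over ties
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

print(spearman([1, 2, 3, 4], [1, 2, 3, 4]))   # 1.0
```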
After the correlation study, 16 cost drivers were deleted. Hence, only 23 cost drivers were considered to build the regression equation of DMCoMo; they are shown in Table 2. The deleted cost drivers are CCOD, PRIV, MDIS, MDER, NSER, NTEC, TCOM, TOMM, TRAN, TFRI, MSIM, PCON, PCOM, PEXP, TEXP and BCON.
At this point, we have the final data set of Data Mining projects on which to apply linear regression. The cost drivers that appear in Table 2 are not correlated, or their correlation coefficient is under 0.5.
Table 2
DMCoMo cost drivers

Name                                              Abrev.
Number of tables                                  NTAB
Number of tuples                                  NTUP
Number of attributes                              NATR
Dispersion                                        DISP
Nulls percentage                                  PNUL
Data model availability                           DMOD
External data needs                               DEXT
Number of models                                  NMOD
Type of model                                     TMOD
Number of tuples for each model                   MTUP
Number and type of attributes for each model      MATR
Problem type familiarity                          MFAM
Techniques availability                           MTEC
Number and type of data sources                   NFUN
Distance and communication form                   SCOM
Tools availability                                TOOL
Compatibility                                     COMP
Training level of users                           NFOR
Number of involved departments                    NDEP
Documentation                                     DOCU
Multisite development                             SITE
Data knowledge                                    KDAT
Directive attitude                                ADIR
The regression equation of the DMCoMo model will be similar to the one presented in the equation

y = a_0 + Σ_{i=1}^{n} a_i x_i + e_i,   (6)
where the dependent variable (y) is the effort measured in men month (MM) that we wish to estimate, the independent variables (x_i) are the cost drivers that appear in Table 2, the a_i are the values found through linear regression, and n is the number of cost drivers, in our case 23. The a_i values obtained as a result of the linear regression are shown in Fig. 2: each value in the B column of Fig. 2 is an a_i value, with a_0 in the row labeled "Constant" and each remaining a_i in the row of the cost driver it multiplies. Hence, the effort equation E(p) of DMCoMo is as shown in the equation
E(p) = 78.752 + 2.802 NTAB + 1.953 NTUP
  + 2.115 NATR + 6.426 DISP
  + 0.345 PNUL + (-2.656) DMOD
  + 2.586 DEXT + (-0.456) NMOD
  + 6.032 TMOD + 4.312 MTUP
Coefficients (a. Dependent Variable: MM)

Model 1      B        Std. Error   Beta     t        Sig.
(Constant)   78.752   37.415                2.105    .051
NTAB         2.802    1.654        .264     1.695    .109
NTUP         1.953    2.108        .142     .927     .368
NATR         2.115    2.558        .131     .827     .421
DISP         6.426    2.096        .459     3.065    .007
PNUL         .345     2.204        .025     .157     .877
DMOD         -2.656   2.613        -.184    -1.017   .324
DEXT         2.586    2.853        .164     .906     .378
NMOD         -.456    3.654        -.020    -.125    .902
TMOD         6.032    2.727        .358     2.212    .042
MTUP         4.312    2.312        .293     1.865    .081
MATR         4.966    2.930        .313     1.695    .109
MTEC         -2.591   2.063        -.182    -1.256   .227
NFUN         3.943    3.723        .193     1.059    .305
SCOM         .896     3.521        .044     .254     .802
TOOL         -4.615   2.479        -.326    -1.861   .081
COMP         -1.831   3.100        -.126    -.591    .563
NFOR         -4.698   2.186        -.297    -2.149   .047
NDEP         2.931    4.230        .115     .693     .498
DOCU         -.892    2.783        -.049    -.321    .753
SITE         2.135    2.112        .165     1.011    .327
KDAT         -.214    4.258        -.008    -.050    .961
ADIR         -3.756   5.110        -.131    -.735    .473
MFAM         -4.543   2.562        -.323    -1.773   .095

Fig. 2. Linear regression coefficients.
  + 4.966 MATR + (-2.591) MTEC
  + 3.943 NFUN + 0.896 SCOM
  + (-4.615) TOOL + (-1.831) COMP
  + (-4.698) NFOR + 2.931 NDEP
  + (-0.892) DOCU + 2.135 SITE
  + (-0.214) KDAT + (-3.756) ADIR
  + (-4.543) MFAM.   (7)

Fig. 3. Model summary: R = 0.893, R square = 0.798, adjusted R square = 0.507, std. error of the estimate = 14.846.
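Eq. (7) is straightforward to evaluate programmatically. The following illustrative sketch encodes the coefficients from the B column of Fig. 2 (including the NFOR coefficient -4.698 as reported there); the all-nominal test input (every driver rated 3) is an assumption for illustration:

```python
# Sketch of Eq. (7): effort in men month from the 23 rated cost drivers.
COEF_23 = {
    "NTAB": 2.802, "NTUP": 1.953, "NATR": 2.115, "DISP": 6.426,
    "PNUL": 0.345, "DMOD": -2.656, "DEXT": 2.586, "NMOD": -0.456,
    "TMOD": 6.032, "MTUP": 4.312, "MATR": 4.966, "MTEC": -2.591,
    "NFUN": 3.943, "SCOM": 0.896, "TOOL": -4.615, "COMP": -1.831,
    "NFOR": -4.698, "NDEP": 2.931, "DOCU": -0.892, "SITE": 2.135,
    "KDAT": -0.214, "ADIR": -3.756, "MFAM": -4.543,
}

def dmcomo_effort_23(drivers):
    """E(p) = 78.752 + sum(a_i * driver_i); missing drivers default to 0."""
    return 78.752 + sum(c * drivers.get(n, 0) for n, c in COEF_23.items())

# A hypothetical all-nominal project (every driver rated 3).
print(round(dmcomo_effort_23({n: 3 for n in COEF_23}), 3))  # 124.322
```

Reassuringly, an all-nominal project comes out near the 121.48 MM mean of the training data.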
Once the DMCoMo equation is built, it must be statistically analyzed. This analysis is carried out with an ANOVA test and a residual analysis.
In Fig. 3 a summary of the linear regression is shown. It shows that the model is able to predict 50% of the training projects and explains 80% of their variance. The typical error of the model is 14.846. This error is smaller than the standard deviation of the data, 21.147 (see Fig. 1). Thus, the error in the estimation of new projects is smaller if we use DMCoMo instead of the mean value of the training data.
ANOVA (b. Dependent Variable: MM)

Model 1      Sum of Squares   df   Mean Square   F       Sig.
Regression   13913.477        23   604.934       2.745   .021
Residual     3526.498         16   220.406
Total        17439.975        39

Fig. 4. ANOVA analysis results.
The result of the ANOVA analysis is shown in Fig. 4. From it we conclude that the regression model is statistically significant because its p-value is smaller
than 0.05. Hence, the model has a confidence level
of 95%; thus, regression is statistically useful.
Although the model is useful, we have to take
into account the relative statistical importance of
each cost driver in the regression equation of
DMCoMo. This importance is reflected in the sig
column of Fig. 2. If the value of sig is greater than
0.01, then the cost driver is not significant. Nonsignificant drivers are PNUL, NMOD, SCOM,
DOCU, and KDAT. Thus, these drivers do not
have great influence on the estimation of the effort
of the project.
The normality of the residuals can be tested statistically using the Kolmogorov–Smirnov test (see Fig. 5).
We can conclude that the residuals follow a normal distribution with a confidence level of 90% because the Asymp. Sig. (2-tailed) value in Fig. 5 is greater than 0.10. Hence, the regression model is acceptable.
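The statistic behind that test can be sketched as follows. This illustrative pure-Python snippet computes the one-sample Kolmogorov–Smirnov statistic of a residual sample against a fitted normal distribution (the p-value lookup, which SPSS provides, is omitted); the sample values are assumptions:

```python
# Illustrative sketch: one-sample Kolmogorov-Smirnov statistic of the
# residuals against a fitted normal, via the erf-based normal CDF.
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def ks_statistic(sample):
    n = len(sample)
    mu = sum(sample) / n
    sigma = (sum((x - mu) ** 2 for x in sample) / (n - 1)) ** 0.5
    d = 0.0
    for i, x in enumerate(sorted(sample)):
        c = normal_cdf(x, mu, sigma)
        # max deviation between empirical and theoretical CDFs
        d = max(d, abs((i + 1) / n - c), abs(c - i / n))
    return d

print(ks_statistic([-2.0, -1.0, 0.0, 1.0, 2.0]))
```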
The previous tests allow us to establish that the regression model can be used, with an acceptable error, to estimate the effort of a Data Mining project.
Although the model is useful, in Fig. 2 we can see that several Sig values are greater than 0.1. Those cost drivers are not significant and do not have a great influence on the regression equation. Because of that, they could be deleted from the regression equation without affecting the result of the estimation in an important way. Hence, we applied the "step-wise" method to build the regression equation; this method keeps only the statistically significant variables.
Using the same data that were used to build the first regression model (the 23 cost drivers of the 40 projects), the "step-wise" regression equation was created. The results are shown in Fig. 6.
Coefficients (a. Dependent Variable: MM)

Model 8      B        Std. Error   Beta     t        Sig.
(Constant)   70.897   13.505                5.250    .000
TMOD         7.257    1.911        .431     3.798    .001
DISP         4.792    1.596        .342     3.003    .005
MATR         4.615    2.019        .291     2.286    .029
MFAM         -3.275   1.522        -.233    -2.152   .039
NFOR         -3.842   1.712        -.243    -2.244   .032
DEXT         2.713    1.897        .172     1.430    .163
NTAB         2.368    1.224        .223     1.935    .062
NATR         2.885    1.906        .179     1.514    .140

Fig. 6. "Step-wise" regression coefficients.
Model summary (model 8): R = 0.810, R square = 0.656, adjusted R square = 0.568, std. error of the estimate = 13.904.

Fig. 7. "Step-wise" model summary.
ANOVA (dependent variable: MM)

Model 8      Sum of Squares   df   Mean Square   F       Sig.
Regression   11446.860        8    1430.857      7.401   .000
Residual     5993.115         31   193.326
Total        17439.975        39

Fig. 8. ANOVA analysis results of "step-wise" regression model.
Then, the number of drivers has been reduced through the "step-wise" method. The new equation is shown in the equation

E(p) = 70.897 + 2.368 NTAB + 2.885 NATR
  + 4.792 DISP + 2.713 DEXT
  + 7.257 TMOD + 4.615 MATR
  + (-3.842) NFOR
  + (-3.275) MFAM.   (8)

Fig. 5. Kolmogorov–Smirnov test results: one-sample K–S test on the unstandardized residuals (N = 40; mean = 0.0000000; std. deviation = 9.50910240; most extreme differences: absolute = 0.125, positive = 0.125, negative = -0.088; Kolmogorov–Smirnov Z = 0.788; Asymp. Sig. (2-tailed) = 0.564; test distribution is normal, calculated from data).
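Like Eq. (7), the reduced model of Eq. (8) can be sketched as a small function. This is illustrative code using the coefficients of Fig. 6; the all-nominal test input is an assumption:

```python
# Sketch of Eq. (8), the reduced "step-wise" model with 8 cost drivers.
COEF_8 = {
    "NTAB": 2.368, "NATR": 2.885, "DISP": 4.792, "DEXT": 2.713,
    "TMOD": 7.257, "MATR": 4.615, "NFOR": -3.842, "MFAM": -3.275,
}

def dmcomo_effort_8(drivers):
    """E(p) = 70.897 + sum(a_i * driver_i); missing drivers default to 0."""
    return 70.897 + sum(c * drivers.get(n, 0) for n, c in COEF_8.items())

# A hypothetical all-nominal project (every driver rated 3).
print(round(dmcomo_effort_8({n: 3 for n in COEF_8}), 3))  # 123.436
```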
In Fig. 7 the features of the model of Fig. 6 are shown. This new model predicts 56% of the training projects and explains 65% of the variance of the training data.
The ANOVA analysis of this model (see Fig. 8) shows that it has a confidence level of 95%, because the p-value is smaller than 0.05. Hence, the model is statistically significant.
The analysis of the residuals of the regression shows that they follow a normal distribution (see the histogram and P–P plot in Fig. 9).
Fig. 9. Residual analysis of the "step-wise" model: histogram and normal P–P plot of the regression standardized residuals (dependent variable: MM; histogram: Std. Dev = 0.89, Mean = 0.00, N = 40).
One-sample Kolmogorov–Smirnov test on the unstandardized residuals: N = 40; mean = 0.0000000; std. deviation = 12.39635501; most extreme differences: absolute = 0.088, positive = 0.088, negative = -0.081; Kolmogorov–Smirnov Z = 0.557; Asymp. Sig. (2-tailed) = 0.916 (test distribution is normal, calculated from data).

Fig. 10. Kolmogorov–Smirnov test for "step-wise" regression.
The normality of the residuals can also be tested with the Kolmogorov–Smirnov test (see Fig. 10), where the Asymp. Sig. (2-tailed) value is greater than 0.10. Hence, we can conclude that the residuals follow a normal distribution with a confidence level of 90%. Thus, the "step-wise" regression is useful.
Therefore, the two models (Eqs. (7) and (8)) are statistically useful for estimating the effort of Data Mining projects. The model created with the "step-wise" method is easier to apply because it has only eight cost drivers. Next, the two models will be used to estimate new Data Mining projects, and the results of the estimations will be analyzed.
4. Experimentation and results

Once DMCoMo has been established, it can be used to estimate new Data Mining projects. To do so, data of 15 Data Mining projects were gathered in the same way as the data of the 40 training projects (see Section 3.2.1). Then, the two models (Eqs. (7) and (8)) were used to estimate the effort of these 15 new Data Mining projects. The results of the estimations are shown in Fig. 11, where Id. is the project identifier, MM is the real value of the effort reported by the project manager, $E-MM (23 cost drivers) is the estimated effort using Eq. (7) and $E-MM (8 cost drivers) is the estimated effort using Eq. (8).
If the real effort (MM) and the estimated efforts ($E-MM) are compared, we obtain the results shown in Table 3. Table 3 shows the maximum, mean and minimum errors of the estimation methods with respect to the real effort value, together with the standard deviation. The standard deviation shows that the error is ±16.908 MM if we use the 23-cost-drivers model and ±23.105 MM if we use the 8-cost-drivers model. In Fig. 12 the real and estimated efforts of the test projects are depicted.
Fig. 13 shows the relative error produced by the estimation equations for each project. The relative error is calculated as shown in the following equation:

Relative error = (estimated value - real value) / real value.   (9)

It is necessary to highlight that 66% of the estimations have an error smaller than 15% and 13% of the estimations have an error greater than 20% and
Id.   MM    $E-MM (23 drivers)   $E-MM (8 drivers)
1     117   132.564              133.430
2     93    100.803              116.979
3     162   130.502              140.718
4     167   136.377              116.784
5     105   92.2694              125.987
6     168   146.425              138.377
7     108   93.0873              107.880
8     131   117.749              116.728
9     123   129.402              132.527
10    121   138.559              120.213
11    87    91.6333              99.284
12    127   101.140              114.264
13    113   87.5875              91.605
14    118   96.9164              94.563
15    154   132.528              110.079

Fig. 11. Estimated effort.
Table 3
Comparison of real and estimated effort

                      23 drivers   8 drivers
Minimum error         17.559       23.979
Maximum error         31.498       50.216
Mean error            11.097       8.972
Absolute mean error   18.025       20.066
Standard deviation    16.908       23.105
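The error statistics of Table 3 are simple aggregates over the per-project estimation errors. The following illustrative sketch computes them for any pair of real/estimated effort lists (the two-project example values are assumptions, not the paper's data):

```python
# Sketch of the Table 3 aggregation: error statistics between real and
# estimated effort across a set of test projects.
import statistics

def error_stats(real, estimated):
    errs = [e - r for r, e in zip(real, estimated)]
    return {
        "min": min(errs),
        "max": max(errs),
        "mean": statistics.mean(errs),
        "abs_mean": statistics.mean(abs(e) for e in errs),
        "std": statistics.stdev(errs),
    }

# Illustrative values: one over- and one under-estimate of 10 MM each.
print(error_stats([100, 100], [110, 90]))
```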
smaller than 22% when the 23-cost-drivers model is used. If we use the 8-cost-drivers model, 53% of the estimations have an error below 15% and 26% of the estimations have an error between 20% and 30%.
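The relative error of Eq. (9) is trivial to compute; as an illustrative check, applying it to project 1 of the test set (real effort 117 MM, 23-driver estimate 132.564 MM, from Fig. 11) gives an error of about 13%:

```python
# Sketch of Eq. (9): relative error of an estimate against the real effort.
def relative_error(estimated, real):
    return (estimated - real) / real

# Project 1 of the test set: real 117 MM, 23-driver estimate 132.564 MM.
print(round(relative_error(132.564, 117), 3))  # 0.133
```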
5. Conclusions

In this paper we present a generic cost model for Data Mining projects. The cost model is a parametric one, like COCOMO.
The model is composed of an equation and the cost drivers that affect a Data Mining project; hence, cost drivers for Data Mining projects have been proposed. DMCoMo estimates the effort of a Data Mining project taking some of its features into account.
DMCoMo is useful for estimating the effort in men month. Two different equations are proposed for DMCoMo, obtained with different methods of multivariate linear regression. One equation has 23 cost drivers and can be used when the project is well defined; the other has 8 cost drivers and can be used when the project is only loosely defined.
Fig. 12. Real and estimated efforts of 15 test projects (MM, $E-MM (23 drivers) and $E-MM (8 drivers), in men month).

Appendix A. Complete model
Fig. 13. Relative error of estimations: bar chart of the relative errors (between 0% and 30%) of the $E-MM (23 drivers) and $E-MM (8 drivers) estimates for each of the 15 test projects.
In this appendix, the DMCoMo cost drivers are summarized. The rating levels of the cost drivers, and the way of obtaining them, are summarized in Table A.1.
DISP calculation:

DISP = (1/V) ( Σ_i s_i^2 + Σ_j H_j - M ),   (A.1)
where i runs over the quantitative attributes (with variances s_i^2), j runs over the qualitative attributes (with entropies H_j), and V and M are the variance and the mean of the variance and entropy values of all attributes. The subtraction of M and the division by V normalize the dispersion value into the range [0, 1].
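Eq. (A.1) can be sketched as follows. This is an illustrative implementation assuming the standard pairing of variance with numeric attributes and Shannon entropy with qualitative ones; the attribute values in the example are assumptions:

```python
# Illustrative sketch of the DISP computation (Eq. (A.1)): variance for
# numeric attributes, Shannon entropy for qualitative ones, then the
# normalization by the mean M and variance V of the per-attribute scores.
import math

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def entropy(values):
    n = len(values)
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def disp(numeric_attrs, qualitative_attrs):
    scores = ([variance(a) for a in numeric_attrs]
              + [entropy(a) for a in qualitative_attrs])
    m = sum(scores) / len(scores)   # M: mean of the per-attribute scores
    v = variance(scores) or 1.0     # V: their variance (guard against 0)
    return (sum(scores) - m) / v

print(disp([[1.0, 2.0, 3.0]], [["a", "b", "a", "b"]]))
```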
PNUL calculation: The percentage of null values
for each attribute must be computed with the help
of Table A.2.
Table A.1
DMCoMo cost drivers description

Rating levels run from XB (Extra low) to EA (Extra high); level descriptions are listed from lowest to highest rating.

NTAB: up to 20 tables (XB); 20–60 tables (MB); 60–80 tables (B); 80–100 tables (N); 100–120 tables (A); 120–300 tables (MA); above 300 tables (EA).
NTUP: up to 5×10^7 tuples; 5×10^7–10×10^7; 10×10^7–20×10^7; 20×10^7–50×10^7; more than 50×10^7 tuples.
NATR: up to 500 attributes; 500–1000; 1000–1500; 1500–2000; more than 2000 attributes.
DISP*: 0 <= H < 0.2; 0.2 <= H < 0.4; 0.4 <= H < 0.6; 0.6 <= H < 0.8; 0.8 <= H <= 1.
PNUL*: rating value 1–5.
DMOD: data model available for all models; for 90% of models; for 80–90%; for 70–80%; for 60–70%; for less than 60% of models.
DEXT: 1–3 external data sources; 3–5; 5–7; more than 7 external data sources.
NMOD: 1–3 models; 3–5; 5–7; more than 7 models.
TMOD*, MTUP*, MATR*, MTEC*: rating value 1–5.
NFUN: only 1 data source; 2–3 homogeneous data sources; 2–3 heterogeneous data sources; more than 3 heterogeneous data sources without data on paper; more than 3 heterogeneous data sources with data on paper.
SCOM: data in the machine where they will be analyzed; data in the same database; all data sources in the same building, communicating through a LAN; data sources in distinct places that communicate; data sources in distinct places that do not communicate.
TOOL: tools used for all models; for more than 70% of models; for 50–70% of models; for up to 50% of models; no tools used.
COMP*: rating value 0–5.
NFOR*: rating value 1–5.
NDEP: 1 department; 2 departments; 3–5 departments; more than 5 departments.
DOCU: implanted model; all generated models; all models and central Data Mining phases; all models and all Data Mining phases.
SITE: rating value 1–6 (see Table A.10).
KDAT: collaboration of business and data experts; collaboration of data expert; data unknown, but a data description exists; no data description or data model.
ADIR: department directive supports the project; department directive supports the project and executive does not oppose; department directive supports the project but not the executive; department directive does not support the project and the executive does not support it.
MFAM*: rating value 1–5 (see Table A.11).

*See description in Appendix A.
To calculate the PNUL value, the following equation must be used; the rating level of PNUL is then looked up in Table A.1:

PNUL = ROUND( Σ_{i=1}^{n} PNULp(i) / n ).   (A.2)

TMOD calculation: The TMOD value is computed using Table A.3 and the following equation; the rating level is then looked up in Table A.1:

TMOD = ROUND( Σ_{i=1}^{n} TMODp(i) / n ).   (A.3)
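Most appendix drivers share this same aggregation pattern: rate each item (attribute, model or tool) with the relevant sub-table, then round the mean. The following illustrative sketch assumes half-up rounding, since the paper only writes ROUND:

```python
# Sketch of the per-item aggregation used throughout Appendix A
# (Eqs. (A.2)-(A.4) and (A.7)-(A.10)): mean of the item ratings, rounded.
import math

def aggregate(ratings):
    """ROUND(sum(ratings) / n); half-up rounding is an assumption."""
    return math.floor(sum(ratings) / len(ratings) + 0.5)

print(aggregate([1, 2, 4]))  # mean 2.33 -> 2
print(aggregate([2, 3]))     # mean 2.5  -> 3
```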
MTUP calculation: To compute the MTUP value, the MTUP of each model must be obtained using Table A.4.
Table A.2
PNULp (PNUL for each attribute)

Level   Description                      Value
MB      Up to 10% of null values         1
B       10–15% of null values            2
N       15–20% of null values            3
A       20–25% of null values            4
MA      More than 25% of null values     5

Table A.3
TMODp (TMOD for each model)

Level   Description                                                   Value
MB      Descriptive model: Association                                1
B       Descriptive model: Clustering                                 2
N       Descriptive model: Sequential patterns                        3
A       Predictive model: Classification                              4
MA      Predictive model: Prediction, estimation or temporal series   5

Table A.5
MATRnp (MATRn for each model)

Level   Description                      Value
MB      Up to 10 attributes              1
B       Between 10 and 20 attributes     2
N       Between 30 and 50 attributes     3
A       Between 50 and 70 attributes     4
MA      More than 70 attributes          5

Table A.6
MATRtp (MATRt for each model)

Level   Description                                              Value
MB      All attributes non-numeric                               1
B       More non-numeric attributes than numeric attributes      2
N       50% numeric attributes and 50% non-numeric attributes    3
A       More numeric attributes than non-numeric attributes      4
MA      All attributes numeric                                   5
Table A.4
MTUPp (MTUP for each model)

Level   Description                         Value
MB      Up to 5×10^6 tuples                 1
B       Between 5×10^6 and 10×10^6          2
N       Between 10×10^6 and 20×10^6         3
A       Between 20×10^6 and 50×10^6         4
MA      More than 50×10^6                   5
Once MTUPp has been obtained for each model, MTUP is computed using the equation

MTUP = ROUND( Σ_{i=1}^{n} MTUPp(i) / n ).   (A.4)
MATR calculation: The number and type of attributes used by each model must be rated using Tables A.5 and A.6. Next, MATRn and MATRt are calculated using the following equations:

MATRn = ROUND( Σ_{i=1}^{n} MATRnp(i) / n ),
MATRt = ROUND( Σ_{i=1}^{n} MATRtp(i) / n ).   (A.5)

The MATR value is calculated using the following equation; the rating level is then obtained from Table A.1:

MATR = TRUNC( (MATRn + MATRt) / 2 ).   (A.6)
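The MATR combination mixes rounding and truncation, which is easy to get wrong; the following illustrative sketch makes the difference explicit (half-up rounding is again an assumption):

```python
# Sketch of the MATR combination (Eqs. (A.5)-(A.6)): round the two
# per-model means, then truncate their average.
import math

def round_half_up(x):
    return math.floor(x + 0.5)

def matr(matrnp, matrtp):
    matrn = round_half_up(sum(matrnp) / len(matrnp))
    matrt = round_half_up(sum(matrtp) / len(matrtp))
    return math.trunc((matrn + matrt) / 2)

print(matr([2, 3], [4, 4]))  # MATRn=3, MATRt=4 -> TRUNC(3.5) = 3
```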
MTEC calculation: The MTEC value for each model is calculated using Table A.7. The following equation is used to compute the MTEC value; the rating level of MTEC is looked up in Table A.1:

MTEC = ROUND( Σ_{i=1}^{n} MTECp(i) / n ).   (A.7)
COMP calculation: Compute the COMP value for each tool using Table A.8. Use the following equation to obtain the COMP value, and Table A.1 to obtain its rating level:

COMP = ROUND( Σ_{i=1}^{n} COMPp(i) / n ).   (A.8)
NFOR calculation: Calculate for each tool its
NFOR value using Table A.9.
Table A.7
MTECp (MTEC for each model)

Level   Descriptive models                                   Value   Predictive models                                    Value
MB      More than four techniques to generate the model      1       —                                                    —
B       Three techniques to generate the model               2       More than four techniques to generate the model      2
N       Two techniques to generate the model                 3       Four techniques to generate the model                3
A       One technique to generate the model                  4       Three techniques to generate the model               4
MA      —                                                    —       Two techniques to generate the model                 5
EA      —                                                    —       One technique to generate the model                  6
Table A.8
COMPp (COMP for each tool)

Level   Description                                                                                          Value
EB      Total compatibility and integration with all tools available in the company                          0
MB      Compatibility with text editors, spreadsheets, DBMS and Data Mining tools available in the company   1
B       Compatibility with text editors, spreadsheets and DBMS available in the company                      2
N       Compatibility with text editors and spreadsheets available in the company                            3
A       Compatibility with text editors available in the company                                             4
MA      No compatibility with the tools available in the company                                             5
Table A.9
NFORp (NFOR for each tool)

Level   Description                                                                                                                               Value
MB      Tool uses wizards and intelligent agents that guide the user through the Data Mining process; the user needs only a light knowledge of Data Mining techniques   1
B       Data Mining techniques knowledge; tool uses wizards                                                                                       2
N       Light Data Mining techniques and tool knowledge                                                                                           3
A       Data Mining techniques knowledge and expert in the tool                                                                                   4
MA      Expert in Data Mining techniques and in the tool                                                                                          5
Table A.10
SITE cost driver description

Level   Location                               Communication                            Value
MB      In the same location                   Interactive multimedia                   1
B       Same building or complex               Broadband and rarely videoconference     2
N       Same city or metropolitan area         Broadband                                3
A       Several cities and several companies   Narrowband, e-mail                       4
MA      Several cities and several companies   Telephone, FAX                           5
EA      International                          Telephone, mail                          6
Table A.11
MFAMp (MFAM for each model)

Level   Description                                                                                                                      Value
MB      Staff of the project has been working together and on the same kind of Data Mining projects as the new one, with similar data    1
B       Staff of the project has worked on the same kind of Data Mining projects as the new one and with similar data                    2
N       Staff of the project has worked on the same kind of Data Mining projects as the new one, but the data are different              3
A       Staff of the project has worked on the same kind of Data Mining projects as the new one, but never in the same environment       4
MA      Staff of the project has never worked on Data Mining projects                                                                    5
Compute NFOR using the following equation and look up its rating level in Table A.1:

NFOR = ROUND( Σ_{i=1}^{n} NFORp(i) / n ).   (A.9)
SITE cost driver description: Table A.10 is used to obtain the SITE rating level.

MFAM calculation: Table A.11 is used to obtain the MFAM value for each model. The following equation is used to obtain the MFAM value; the rating level of MFAM is obtained using Table A.1:

MFAM = ROUND( Σ_{i=1}^{n} MFAMp(i) / n ).   (A.10)
References
[1] J. Dyché, The CRM Handbook: A Business Guide to Customer Relationship Management, first ed., Addison-Wesley, Reading, MA, 2001.
[2] G. Piatetsky-Shaphiro, W. Frawley, Knowledge Discovery
in Databases, AAAI/MIT Press, Cambridge, MA, 1991.
[3] U. Fayyad, G. Piatetsky-Shapiro, P. Smith, R. Uthurusamy,
Advances in Knowledge Discovery and Data Mining,
AAAI/MIT Press, Cambridge, MA, 1996.
[4] L. DiLauro, What’s Next in Monitoring Technology? Data
Mining Finds a Calling in Call Centers, May 2000.
[5] B. Chatham, B.D. Temkin, K.M. Gardiner, T. Nakashima,
CRM’s Future: Humble Growth Through 2007, July
2002.
[6] KdNuggets.Com, <http://www.kdnuggets.com/polls>, 2002.
[7] ISL, Clementine User Guide, Version 5, ISL, Integral
Solutions Limited, July 1995.
[8] IBM, Application programming interface and utility reference, IBM DB2 Intelligent Miner for Data, IBM, September
1999.
[9] I.H. Witten, Data Mining: Practical Machine Learning
Tools with Java Implementations, 2000.
[10] The Data Mining Research Group, DBMiner User Manual, Simon Fraser University, Intelligent Database Systems Laboratory, December 1997.
[11] P. Chapman (NCR), J. Clinton (SPSS), R. Kerber (NCR),
T. Khabaza (SPSS), T. Reinartz (DaimlerChrysler), C.
Shearer (SPSS), R. Wirth (DaimlerChrysler), CRISP-DM
1.0 step-by-step data mining guide, Technical Report,
CRISP-DM, 2000.
[12] ISO, ISO/IEC Standard 12207:1995, Software Life Cycle Processes, International Organization for Standardization, Geneva, Switzerland, 1995.
[13] IEEE, Standard for Developing Software Life Cycle Processes, IEEE Std. 1074-1991, IEEE Computer Society, New York, USA, 1991.
[14] J. Kleinberg, C. Papadimitriou, P. Raghavan, A microeconomic view of data mining, J. Data Min. Knowl.
Discovery 2 (4) (1998) 311–324.
[15] B. Masand, G. Piatetsky-Shapiro, A comparison of approaches for maximizing business payoff of prediction models, in: Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, 1996, pp. 195–201.
[16] P. Domingos, How to get a free lunch: a simple cost model for machine learning applications, in: Proceedings of the AAAI-98/ICML-98 Workshop on the Methodology of Applying Machine Learning, 1998.
[17] R.A. Brealey, S.C. Myers, Principles of Corporate Finance,
fifth ed., McGraw-Hill, New York, NY, 1996.
[18] B.W. Boehm, C. Abts, A.W. Brown, S. Chulani, B.K. Clark,
E. Horowitz, R. Madachy, D. Reifer, B. Steece, Software
Cost Estimation with COCOMO II, Prentice-Hall, Englewood Cliffs, NJ, 2000.
[19] L.H. Putnam Sr., D.T. Putnam, L.H. Putnam Jr., M.A. Ross, Software Lifecycle Management (SLIM) Training: SLIM Estimate Exercises with Answers, Quantitative Software Management, McLean, VA, 2000.
[20] LLC PRICE Systems, PRICE S Reference Manual Version
3.0, Lockheed-Martin, 1998.
[21] International Society of Parametric Analysts (ISPA), Parametric Cost Estimating Handbook, second ed., International
Society of Parametric Analysts (ISPA), 1999.
[22] LLC PRICE Systems, PRICE H Reference Manual Version
3.0, Lockheed-Martin, 1998.
[23] J. Hamaker, Rules of thumb: space project cost trends over
time holding technical performance constant, Parametric
World Winter (2001–2002) 5–7.
[24] J. Hamaker, Using the minimum squared error regression
approach, Parametric World 21 (3) (2002) 11–13.
[25] B. Boehm, Software Engineering Economics, Prentice-Hall,
Englewood Cliffs, NJ, 1981.
[26] S. Chulani, B. Clark, B. Boehm, Calibration approach and
results of COCOMOII.1997, in: 22nd Software Engineering
Workshop, Goddard, NASA, 1997.
[27] S. Chulani, B. Clark, B. Boehm, B. Steece, Calibration
approach and results of the COCOMO II post-architecture
model, in: 20th Annual Conference of the International
Society of Parametric Analysts (ISPA) and the 8th Annual
Conference of the Society of Cost Estimating and Analysis
(SCEA), 1998.
[28] T. Shrum, Calibration and validation of the CHECKPOINT
model to the air force electronic systems center software databases, Master’s Thesis, Air Force Institute of Technology, 1997.
[29] L. Fischman, Calibrating a software evaluation model, in:
ARMS Conference, 1997.
[30] J.J. Cuadrado Gallego, Método Matemático de Selección del Rango de las Variables de Entrada en los Modelos Paramétricos de Estimación Software, Ph.D. Thesis, Departamento de Informática, Escuela Politécnica Superior, Universidad Carlos III de Madrid, 2000.
[31] P. Turney, Types of cost in inductive concept learning, in: Workshop on Cost-Sensitive Learning at the 17th International Conference on Machine Learning, WCSL at ICML-2000, Stanford University, California, 2000, pp. 15–21.
[32] O. Marbán, Modelo Matemático Paramétrico de Estimación Para Proyectos de Data Mining (DMCoMo), Ph.D. Thesis, Facultad de Informática, Universidad Politécnica de Madrid, June 2003.
[33] H. Linstone, M. Turoff, The Delphi Method: Techniques and Applications, Addison-Wesley, Reading, MA, 1975.
[34] J.A. Farquhar, A preliminary inquiry into the software estimation process, Technical Report RM-6271-PR, The Rand Corporation, 1970.
[35] S. Devnani-Chulani, Bayesian analysis of software cost and quality models, Ph.D. Thesis, Faculty of the Graduate School, University of Southern California, May 1999.
[36] C.E. Shannon, W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, IL, 1949.
[37] W.E. Griffiths, R.C. Hill, G.G. Judge, Learning and Practicing Econometrics, Wiley, New York, 1993.
[38] S. Weisberg, Applied Linear Regression, Wiley, New York, 1985.