A methodology for knowledge discovery:
a KDD roadmap
SYS Technical Report SYS-C99-01
J. C. W. Debuse, B. de la Iglesia, C. M. Howard and V. J. Rayward-Smith∗
April 26, 1999

∗ This work was supported by the Teaching Company Scheme under programme number 2552.
Abstract
A wealth of expert input is required within any successful project in knowledge discovery in
databases (KDD). A distillation of such expertise is described here within an outline methodology
for KDD, which is presented in the form of a roadmap. We extend the existing work in this area,
add clarity to the field and hope to make the comparison and exchange of ideas within the KDD
community more straightforward. Moreover, much of the expertise that has been acquired through
practical experience is brought within the reach of KDD practitioners; our work is thus of value to
KDD experts and novices alike.
1 Introduction
Any organisation which undertakes a project in knowledge discovery in databases (KDD) will require a
considerable degree of expert input to ensure the results produced are of high quality, valid, interesting,
novel and so on. The purpose of this document is to present such expert input as an outline
methodology, expressed in the form of a roadmap; this serves two key purposes. Firstly, we will
be creating a framework which should facilitate the exchange and comparison of ideas across different
parts of the KDD spectrum. Secondly, this framework should clarify the process and bring much of the
knowledge gained through practical experience within reach of KDD practitioners. We aim to extend
existing work to give a greater level of detail, and our framework will be used as the basis for the
commercial development of a research prototype data mining package [10].
The target audience of this paper is broad, being suitable for personnel at levels which range from
analyst through to end user. We describe KDD analysts as project leaders, with considerable KDD
process knowledge but potentially little or no domain, background or application area knowledge at the
start of the project. The end user (described as a business user in [1]) initiates the KDD process with
requests and receives the discovered knowledge. Personnel may of course fall somewhere between these
two extremes, such as an end user who has some knowledge of the KDD process. We also consider
personnel at the ‘management’ level; these are one level above the analyst/end user level and will
commission the project, control its budget and may also receive and act upon the discovered knowledge.
The aim of this document is to provide clear, detailed information for management and end users, as
well as being useful as a reference for analysts.
An important decision which must be made within any KDD project is the type (or types) of
personnel which are to be used. Analysts have the advantage of considerable KDD process knowledge;
this means that they are likely to know, from a relatively abstract KDD process perspective, the possible
pitfalls and benefits of the decisions that can be made, how best to undertake the required tasks and
how to evaluate the results. Analysts do, however, suffer the disadvantage of potentially having little or
no domain knowledge and may therefore make decisions which do not take this information
into account. The situation is reversed for end users; although they may have little or no KDD process
knowledge, they know a great deal about the data to be used and the area within which the extracted
knowledge is to be applied. Within a KDD project, end users are therefore much more likely to take
all of the relevant characteristics of their data into account and produce knowledge that is in a suitable
form and of appropriate quality and novelty to be useful within the desired application area. End users
are also much more likely to be able to interpret the results produced for the purpose of validation,
evaluation and integration with existing knowledge. The limited or non-existent KDD process knowledge
which end users possess does however mean that they are more likely than analysts to make poor or
incorrect decisions at this level.
If the project is suitably large and well resourced in terms of personnel, both analysts and end users
can be involved; their skills will clearly be complementary. However, in many projects only a single
person may be used and therefore a decision must be made regarding their desired area of expertise.
In such cases, we believe that an end user can carry out the project if suitable software is available
to support them. Such software would be aimed at a specific KDD project application area (such as
marketing); KDD process knowledge would be incorporated into the package and used to guide the end
user through the project. The package would therefore in effect be taking the end user along a route
through our KDD roadmap which is known to be suitable for the application area. The package would
thus protect the user as far as possible from making poor or incorrect KDD process decisions.
Figure 1: The KDD process roadmap

2 The KDD process
The KDD process is described in [4, 8, 18]; an earlier methodology for KDD is presented in [1]. A
description of the KDD process that is oriented more towards business processes than our own is given
in [3]. A concept for KDD is also described in [27]. The KDD process may be divided into the following
sub-phases.
1. Problem specification.
2. Resourcing.
3. Data cleansing.
4. Pre-processing.
5. Data mining.
6. Evaluation of results.
7. Interpretation of results.
8. Exploitation of results.
We present an illustration of our view of the KDD process at the broadest level in figure 1. Each of
the sub-phases illustrated within the figure is described in detail within sections 2.1 to 2.8.
We present the KDD process in the form of a roadmap, which has some parallels with the software
engineering process [19]. The map contains one and two way roads, junctions which may be taken, and
locations, representing processes to be undertaken, that may be stopped at. As with any map, provided
that the rules of the road are obeyed, any valid route may be taken. However, within section 3, we
present a suggested route for a specific type of KDD project that may provide guidance to the end user.
Figure 2: The problem specification stage

2.1 Problem specification
This stage is illustrated in figure 2; the purpose of the phase is to move from the position of having a
problem description, which may be loosely defined, to a tightly defined problem specification. Processes
performed within this phase include preliminary database examination and familiarisation, together with
determination of the required tasks, data availability and software and hardware requirements. The
feasibility of the project is then assessed and the detailed problem specification produced.
2.1.1 Inputs to the problem specification phase
The input to this phase is a problem description, which may be loosely defined; the output from the
phase is a problem specification, which is tightly defined. The application area of the data mining project
must be determined; this must have been established in very broad terms before the data mining project
is undertaken, but must be clarified at this point.
2.1.2 The travel log
A ‘travel log’ must be initiated, which is used to store details of the operations performed at each stage
of the project, routes taken through the roadmap and so on; each piece of information recorded will be
timestamped. This document is updated throughout the course of the project; this may be supported
and automated by a toolkit. The travel log is useful in allowing progress to be tracked and accurate
information concerning what has happened through the course of the project to be retrieved easily.
Recording precise details of operations that have been performed also allows them to be reversed if
necessary.
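As an illustration only, a toolkit might represent travel log entries in Python along the following lines;
the field names shown are hypothetical and any comparable structure recording the phase, the operation
performed, the parameters needed to reverse it and a timestamp would serve equally well.

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class TravelLogEntry:
        # Hypothetical fields: roadmap phase, a description of the operation
        # performed, and any parameters needed to reverse or repeat it.
        phase: str
        operation: str
        parameters: dict = field(default_factory=dict)
        timestamp: str = field(
            default_factory=lambda: datetime.now(timezone.utc).isoformat())

    travel_log = []
    travel_log.append(TravelLogEntry(phase="problem specification",
                                     operation="travel log initiated"))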
2.1.3 Preliminary database examination
A preliminary examination of the database or databases to be used is then made; some of the results
of this will subsequently be stored within the data dictionary, described in section 2.1.5. This phase,
together with subsequent phases, may be performed a large number of times during the course of the
project as the database or databases to be used are modified, added to and so on. It should be noted
that, at this stage, the actual databases may not yet be available and so their descriptions may have to
be examined instead. The following characteristics are determined.
2.1.3.1. The number of records.
2.1.3.2. The number of fields.
2.1.3.3. The proportion of the database which is missing.
2.1.3.4. The proportion of the fields which contain missing values.
2.1.3.5. The proportion of the records which contain missing values.
2.1.3.6. The accessibility of the database.
2.1.3.7. The linking required to form the database; for example, part of the database may be stored in
paper format and thus need to be put into electronic form and linked to the existing portion before
the project can begin.
2.1.3.8. The extent to which data from multiple sources can be integrated.
2.1.3.9. The speed with which the data can be accessed; for example, the process of getting the data
into electronic form may be time consuming and result in the production of databases within
which each record is several months old.
2.1.3.10. Noise level determination. It may be useful to establish the level of noise that exists within
the database, since this affects later aspects of the project such as the acceptable accuracy level
of discovered patterns. This may be measured by identifying ‘contradictions’ within the database, as
sketched below; these are records which have the same values for all input fields but differing output
field values¹.
It should be noted that such noise may be caused primarily by the intrinsic nature of the database
rather than inaccuracies in the data.
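As a sketch of how characteristics 2.1.3.3 to 2.1.3.5 and 2.1.3.10 might be computed for records held as
Python dictionaries, assuming that missing values are encoded as None (any other convention would
require only a small change):

    from collections import defaultdict

    def examine(records, input_fields, output_field):
        """Compute simple missing-value proportions and a contradiction count."""
        fields = input_fields + [output_field]
        n_records = len(records)
        n_values = n_records * len(fields)
        missing_values = sum(1 for r in records for f in fields if r[f] is None)
        records_with_missing = sum(1 for r in records
                                   if any(r[f] is None for f in fields))
        # Contradictions: identical input values but differing output values.
        outputs = defaultdict(set)
        for r in records:
            outputs[tuple(r[f] for f in input_fields)].add(r[output_field])
        contradictions = sum(1 for v in outputs.values() if len(v) > 1)
        return {"proportion of values missing": missing_values / n_values,
                "proportion of records with missing values":
                    records_with_missing / n_records,
                "contradictory input combinations": contradictions}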
2.1.4 Database familiarisation
Part of the database familiarisation process may require access to domain experts. It should be noted
that, as we have discussed, the data may not be available at this stage; if this is the case then other parts
of the familiarisation process may have to be deferred. The following are examples of familiarisation
processes; some of the results of these processes will subsequently be stored within the data dictionary,
described in section 2.1.5.
2.1.4.1. Database field type determination. There are a variety of terms that are used to describe type;
we will use the following.
Numerical. There are two numerical types; numerical discrete, describing integers (for example
the number of dependents which, in this case, can only be non-negative), and numerical
continuous, describing reals (such as temperature).
Categorical. This data is discrete and again has two types; categorical ordinal, describing categorical data with an implied ordering (such as size), and categorical nominal, describing
categorical data for which there is no implied ordering (such as sex).
For some fields, the type may be obvious; however, there may be cases where the actual
type is different to that which the field appears to have. For example, a field with integer values
may be an encoding of categorical nominal data, in which case it should be treated as such.
¹ The output field of a record will typically describe its class or the real value onto which the record must be mapped;
we describe the remainder of the fields as input.
2.1.4.2. Determination of database field semantics. Knowledge of the meaning of a database field may
influence the KDD process considerably. It may be known that two or more fields, although
different, are based on the same or similar measurements. Such knowledge may then be used
later in the KDD process; for example, only the most predictive of such a group of fields may be
used. The names given to fields may be abbreviated but should be explained in full within the
data dictionary; such abbreviations may potentially be misleading if used without reference to the
data dictionary. For example, a field with the name ‘No’ may potentially have meanings such as
‘number’ or ‘negative’; reference to the data dictionary will be necessary to determine the actual
meaning.
2.1.4.3. Reliability. Fields within the database, or even specific field values, may have varying levels
of reliability. Knowledge of these levels can be useful in determining pre-processing operations on
fields and interpretation of the discovered knowledge. It may also be possible to incorporate the
reliability information within the data mining algorithms to be used and thus target patterns that
are based on reliable data.
2.1.4.4. Determination of field value semantics. Knowledge of the meaning of field values may be
used to spot outliers or erroneous values in later phases. It may also allow missing values to be
understood more clearly and handled in a more appropriate manner. For example, some missing
values may be caused purely through error in the data collection process, whilst others may be
the result of some understood process; in the latter case, it may prove fruitful to treat the absence
of data within a field simply as an extra value which the field may take.
2.1.4.5. Simple statistics. A basic understanding of the nature of a field may be gained by examining
measures such as its range, mean, standard deviation, distribution and so on. If the data is not
available at this stage and such statistics have not been generated then they may be examined
at a later stage. The statistics may suggest that data cleansing (discussed within section 2.3) is
necessary; if this is to be performed then these statistics should be generated for the cleansed data.
2.1.4.6. Data visualisation. Familiarisation with the nature of each field may be achieved by simple
plots, whilst more complex visualisations may allow a deeper level of understanding of the data,
such as the effect of combinations of field values on class. Again, if the data is not available at
this stage and no visualisations have been made available in advance then visualisation may occur
at a later stage.
2.1.4.7. Domain knowledge acquisition. Specialised knowledge of the domain within which the project
is undertaken is crucial to its success. Such knowledge may be acquired by talking to
domain experts, studying relevant literature and so on.
2.1.5 Data dictionary
A data dictionary must be made available for each data source. The data dictionaries may have to be
created and may also need to be updated during the course of the project. Each data dictionary will
contain attribute names, types and ranges together with information regarding missing values and/or
reliability of values.
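One possible representation of a data dictionary entry is sketched below in Python; the example field,
its values and the chosen attributes are hypothetical and simply illustrate the kind of information
described above.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class DictionaryEntry:
        name: str                             # abbreviated name as stored in the data
        full_name: str                        # full explanation of the abbreviation
        field_type: str                       # e.g. "numerical discrete"
        value_range: Optional[tuple] = None   # (min, max) for numerical fields
        missing_code: Optional[str] = None    # how missing values are recorded
        reliability: Optional[str] = None     # note on the reliability of values

    data_dictionary = {
        "No": DictionaryEntry(name="No", full_name="number of dependents",
                              field_type="numerical discrete",
                              value_range=(0, 20), missing_code="-1",
                              reliability="self-reported"),
    }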
2.1.6 High level task (HLT) determination
The goal or goals of the data mining project must be determined, which will be prediction and/or
description [18]. We refer to the goals of the data mining project as high level tasks; low level tasks
must also be determined and this process is discussed within section 2.1.7. The goal of description is to
present discovered patterns in an understandable form; the goal of prediction is to predict unknown
values. This phase is essentially the process of determining whether a “black box” approach is
suitable; if so then description will not be one of the goals. It should be pointed out that some algorithms
may fulfil both goals; for example, a simple decision tree may be both understood and used to predict
future values.
2.1.7 Low level task (LLT) determination
The first step is to identify which tasks are feasible, based upon the database which is to be used and the
application area. For example, classification cannot be carried out unless each object has been assigned
a class and time series analysis obviously requires data with a time dimension. A target task or tasks
must then be selected from the set of feasible tasks. The selection process will depend largely on the
application area of the project and its goal or goals.
The following are examples of data mining tasks which may be carried out; descriptions of such tasks
may be found in [8, 16, 25].
Classification. Descriptions are found for a set of pre-defined classes within the database. A total
classification may be produced, in which case descriptions are produced for all classes within the
database; alternatively, a partial classification may be produced, within which descriptions are
only found for certain classes.
Clustering. The database is grouped into classes; a clustering of the data may be used for both
description and prediction.
Regression. A function which maps every record in the database onto a real value is produced. Such
a function is useful primarily for prediction, although it may be possible to express the function
(or some summarisation of its key features) in a form that may be used for description.
Dependency modelling. A model is produced which describes dependencies which are significant
between variables. Such a model is mainly useful for description, although it may also be used for
prediction if it is in a suitable form.
Time series analysis. Each record within the database has an associated time; patterns which exist
over time are generally sought. Such patterns may be used for both prediction and description.
Visualisation. Data is presented graphically in a way which facilitates visual identification of knowledge. Visualisations are clearly suitable for description.
It should be noted that it is possible to convert some low level tasks into alternatives. For example,
a database may be created within which the class field describes the current class of the record and the
remainder of the fields describe the values of its attributes a year ago. In such a case, a time series
analysis task effectively becomes a classification task.
The desired properties of the discovered knowledge must be determined at this stage. A measure or
set of measures of interest should then be defined for each required low level task. No single interest
measure suits every project, and the measure or measures used should reflect the desired characteristics
of the discovered knowledge. Interest measures may be based upon characteristics such as the accuracy
or generality for classification tasks, or the size of clusters for clustering tasks.
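As a hedged illustration for a partial classification task, the accuracy and generality of a single
discovered description could be combined into one interest measure as follows; the linear weighting is
an assumption made for the example rather than a recommendation of this methodology.

    def rule_interest(true_positives, false_positives, database_size,
                      accuracy_weight=0.7, generality_weight=0.3):
        """Combine the accuracy and generality of one discovered description."""
        covered = true_positives + false_positives
        accuracy = true_positives / covered if covered else 0.0
        generality = covered / database_size if database_size else 0.0
        return accuracy_weight * accuracy + generality_weight * generality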
2.1.8 Software and hardware requirements
An estimate of software requirements should be made at this stage. This may be fairly general, but
should give some indication of the hardware requirements and the cost of using necessary packages.
Typical software requirements include the following.
2.1.8.1. Database software. More than one package may be required if several databases, each in the
format of a different system, are used.
2.1.8.2. Spreadsheet software.
2.1.8.3. Software to support pre-processing operations. At this stage, the desired pre-processing operations, together with the algorithms to perform them, have not yet been determined; if a decision
cannot be reached at this stage then estimates must be made.
2.1.8.4. Software to support the data mining algorithms that will carry out the required high and low
level tasks. This, of course, means that such algorithms must be chosen at this point; if this is not
possible at this stage then estimates must be made.
2.1.8.5. KDD packages. These may include software to support pre-processing operations, database
interfacing, data mining algorithms and so on.
As discussed within section 2.1.9, the hardware requirements will depend both upon the software
requirements and the database or databases to be used.
2.1.9 Feasibility determination
The feasibility of mining for patterns within the database or databases is determined within the following
areas.
2.1.9.1 Missing and unreliable data. The value of characteristic 2.1.3.3. in section 2.1.3 (the proportion
of the database which is missing) may be so large that data mining is infeasible; similarly, the
proportion of the database which is unreliable may be infeasibly large. Alternatively, if the missing
or unreliable information occurs primarily within a subset of the records or
fields then it may be possible to use only certain records or certain fields. If this is not the case
then the incorporation of missing or unreliable data within the data mining algorithms may be
investigated to determine the feasibility of the project.
2.1.9.2 System performance. Once feasibility has been established from a missing data perspective, the
performance of the system on which the data mining will be carried out must be established. The
first step within this phase is to confirm that the system meets the requirements of the software
to be used within the project; once this has been done, the system performance must be measured
in the following areas.
2.1.9.2.1. The available hard disk space. The space required to store the database or databases
on the hard disk must be estimated and compared to the space available. If the available space
is insufficient then disk space may be increased or different databases used. Alternatively, one
or more steps from the data cleansing and pre-processing phases (described within sections 2.3
and 2.4 respectively) may be performed at a later stage to reduce the size of the database;
these include random sampling, feature subset selection, discretisation and clustering groups
of similar records together, so that data is effectively dealt with at a ‘macro’ rather than
‘micro’ level and the number of records is reduced.
2.1.9.2.2. The size of the available memory. Once the disk space feasibility has been established,
an estimate of the memory required by the database and software must be made. If this
exceeds the amount available then the same pre-processing steps as described for disk space
limitations may be undertaken. Again, if this does not render the project feasible then more
major project modifications must be undertaken such as upgrading the memory or using
different databases. It should be noted that the available memory may render some data
mining algorithms infeasible if they scale up poorly to large databases.
2.1.9.2.3. The database access speed (if flat files are not to be used).
2.1.9.2.4. The processor speed, measured using an appropriate benchmark.
Given an approximation of the amount of processor effort required by the data mining exercise
on the database or databases, and taking into account the database access speed, an estimate of
the time which the project will take can be made. The accuracy of this estimate will depend upon
the extent to which future phases have been planned. If this estimate significantly exceeds the
time available then the pre-processing steps described previously will be considered. Estimates
must be made of the time taken to perform the necessary pre-processing step or steps (including
those performed because of memory or hard disk limitations), together with the time which will
be required to perform data mining on the new data. If, even with the appropriate pre-processing
steps, the total time still exceeds that available, then the project must be redesigned by upgrading the
available processing power, using a different database or databases, allowing more time and so on.
2.1.9.3. Personnel. Estimated personnel requirements form part of the measure of project feasibility.
Provision must be made for domain experts and KDD experts; training may also need to be
undertaken.
2.1.9.4. Size of database regions of interest. If the regions of interest within the database are too small
then the project may be infeasible. For example, an organisation may be interested in rules that
describe a class of interest; if only a handful of records in a database containing millions of records
belong to the class then the project may be infeasible.
2.1.9.5. Low level task feasibility. As discussed within section 2.1.7, some low level data mining tasks
may prove infeasible given the available data. For example, if the records do not have an associated
time then time series analysis cannot be carried out.
2.1.9.6. Cost. The estimated total cost of the proposed project forms the final component of the
feasibility measure. In addition to determining feasibility, such information can also be used in
weighing up the potential costs and benefits of the project together with the risks involved; the
decision to run, revise or redesign the project can then be made in a more informed fashion.
2.1.10 Outputs from the problem specification phase
The output from this phase is a problem specification, which contains the following components.
2.1.10.1. A list of resource requirements, including cost, time, personnel, hardware and software. These
should be presented to management level personnel for approval.
2.1.10.2. The high and low level tasks to be undertaken within the project.
2.1.10.3. A data dictionary.
2.1.10.4. The feasibility of the project.
2.1.10.5. A travel log, which is updated at this point to record the above information. The travel log
will continue to be updated throughout the course of the project so that it contains a record of
everything that has happened within it.
A KDD toolkit can potentially offer support during this phase and produce the final problem specification document, which will accompany the travel log. The toolkit may also generate a suggested
route or routes through the KDD roadmap, based upon the nature of the project to be tackled.
Figure 3: The resourcing stage

2.2 Resourcing
This stage is illustrated in figure 3; the list of resource requirements, which is output from the problem
specification phase, is taken as input. Within this phase, the resources specified within the problem
specification, including the data mining algorithms that are to be used, are gathered. The resource
which may potentially be the most time consuming to gather within this phase is the data. The data
may not have been available within the previous stage, or may exist in forms which are time consuming
to convert into usable databases. For example, as we have previously discussed, part of the database
that is to be used may exist in paper form and thus require putting into electronic form and linking
with the existing components.
The data may be sourced from data warehouses. These are vast stores of data which some organisations maintain, and each one may contain all of the data which the organisation has ever gathered in
a particular area. Data warehouses will generally contain far more data than is manageable or required
by the KDD project; the project may also require data from several such warehouses. This has led to
the development of ‘data marts’, which contain the relevant data collected from one or possibly more
warehouses and which are much smaller than any single data warehouse. The data mart is therefore
similar to a shop, which generally takes its stock from a range of warehouses but contains much less
stock than any single warehouse. Data may potentially be more easily sourced from data marts, since
their data has been gathered from multiple data warehouses and is of a more manageable size than even
a single data warehouse.
The output from the phase is an “operational database”. This may be made up from a number of
different sources, each with its own database management system, but exists as a complete database
that is consistent in its structure, formatting, identifiers for missing values and so on. To create such
a database, procedures for transforming the data from each of the sources into the required structure
and format must clearly be established. There are a number of issues related to such transformations,
including the following.
2.2.1. Banding levels of data. Each source may contain data at a different banding level. For example,
age may be represented as raw values or alternatively be banded into intervals. If the banding
levels are different within each source then the methodology described in [15] can be used to
combine them.
2.2.2. Macro and micro level data. As previously discussed in section 2.1.9, if groups of similar
records are clustered together, the data is converted from micro to macro level. Within macro level
data, each record therefore represents a group of micro level records and often includes a count of
the number of corresponding records in the original database, whilst every record is represented
individually within micro level data. Data may be stored at different levels within different sources;
if this is the case then the levels should be made the same within the operational database. Converting
micro level data to macro level (sketched at the end of this section) is the most straightforward
way to accomplish this, since converting in the opposite direction generally requires access to an
original, micro level version of the data; this is really only a problem when banding has taken
place.
2.2.3. Gathering data from the world wide web. The web contains enormous quantities of data that
may prove useful within a KDD project. However, the principal problem in gathering and using
such data is dealing with the large quantities of unstructured information that is designed primarily
for human rather than machine consumption. A survey of data mining from the web is given in
[7].
2.2.4. Coding consistency. Data from each source may contain complex coding schemes. For example,
the value of a field may be a code that describes a node within a large, complex hierarchy of types.
Each source may use a different hierarchy; procedures must therefore be developed to allow such
codings to be translated into a common format.
2.2.5. Consistent data formatting. Representations and field names must be consistent, including their
use of upper and lower case characters; this information will be stored within the data dictionary
for the operational database, as described within section 2.1.5. A single format must be decided
on for the operational database, such as database tables; all of the data must then be converted
into this format.
2.2.6. Miscellaneous data formatting. The data must be converted into a format which is suitable for
the data mining algorithms to be used; this conversion will typically involve making use of suitable
field delimiters, adding appropriate descriptive headers to files and so on.
The operational database may be formed by creating a physical copy of the data from the various
sources, or alternatively exist only in ‘virtual’ form, drawing data directly from the sources when accessed. In both cases, the source databases remain unchanged by any transformations that are performed
to create the operational database.
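The micro to macro conversion of 2.2.2 and the code translation of 2.2.4 might be sketched as follows;
the field names, example values and code table are assumptions made purely for illustration.

    from collections import Counter

    def micro_to_macro(records, fields):
        """Group identical micro level records and attach a count field."""
        counts = Counter(tuple(r[f] for f in fields) for r in records)
        return [dict(zip(fields, values), count=n)
                for values, n in counts.items()]

    def translate_codes(records, field, code_map):
        """Map source-specific codes onto the common coding scheme."""
        return [dict(r, **{field: code_map.get(r[field], "UNKNOWN")})
                for r in records]

    source_a = [{"region": "EAST", "age_band": "18-30"},
                {"region": "EAST", "age_band": "18-30"}]
    macro = micro_to_macro(source_a, ["region", "age_band"])
    # macro == [{"region": "EAST", "age_band": "18-30", "count": 2}]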
Figure 4: The data cleansing stage

2.3 Data cleansing
Figure 4 gives an illustration of this phase, within which the aim is to prepare the data for subsequent
phases that involve learning. Operations such as the removal of errors, dealing with missing values and
perhaps balancing are therefore performed at this stage. Although the operations performed within this
phase may be classified as pre-processing, they differ from other pre-processing operations in two key
ways. Firstly, learning may be performed within the pre-processing phase but never occurs within this
phase. Secondly, this phase is generally only performed once for a given database or databases, whereas
pre-processing may be carried out a number of times.
The operations which are performed within this phase are the following; it should be noted that
database size reduction operations determined within the problem specification stage are made use of
here as mandatory data cleansing operations.
2.3.1 Outlier handling
As described in [22], many outliers may be classified as either errors or groups of interest. In the case
of the latter, the project will probably be concentrating upon the outliers. Within the data cleansing
phase of the project, only erroneous outliers are dealt with.
The process of dealing with erroneous outliers will generally require some domain knowledge to
determine what constitutes such an outlier. Domain knowledge is also often required to determine the
corrective action to apply to each form of outlier. For example, the presence of an outlier may suggest
that the value is erroneous and should be treated as missing data. Alternatively, some corrective
processing may be applied to the outlier to convert it into a valid value.
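One simple, purely illustrative, convention is to treat numerical values lying more than three standard
deviations from the field mean as candidate errors and to mark them as missing; the threshold, and the
choice of marking rather than correcting, are assumptions that a domain expert would need to confirm.

    import statistics

    def mark_outliers_as_missing(values, n_std=3.0):
        """Replace values far from the mean with None (treated as missing)."""
        known = [v for v in values if v is not None]
        if len(known) < 2:
            return list(values)
        mean, std = statistics.fmean(known), statistics.stdev(known)
        if std == 0:
            return list(values)
        return [None if v is not None and abs(v - mean) > n_std * std else v
                for v in values]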
2.3.2 Random sampling
If a sufficient number of records exist within the database or databases then they may be split at random
into a separate training and testing subset. The data mining algorithm or algorithms which are to be
used will later be applied to the training set; the patterns which they discover will then be evaluated
later using the testing set. The size of each of these subsets may be determined by the system on which
data mining will be carried out. For example, the available memory may only be sufficient to allow a
training set size which contains 10% of the complete database. (If balancing, described in section 2.3.4,
is to be performed then it should only be carried out on the training database.)
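The random split described above might be sketched as follows; the fixed seed allows the travel log to
record exactly how the split was produced, and the 10% training fraction is simply the figure used in
the example above.

    import random

    def split_records(records, training_fraction=0.1, seed=0):
        """Randomly split records into training and testing subsets."""
        rng = random.Random(seed)
        shuffled = list(records)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * training_fraction)
        return shuffled[:cut], shuffled[cut:]   # (training set, testing set)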
In some cases, there may be too few records to allow separate testing and training subsets to
be formed. In such cases, it is often still necessary to obtain some estimate of the extent to which
the discovered knowledge represents genuine patterns rather than noise. Under such circumstances,
alternative evaluation approaches must be used, such as those discussed within section 2.6.
2.3.3 Missing data handling
The approach which is to be used to deal with missing data must be determined and performed at this
stage. As previously described in [4], there are a variety of ways in which this may be performed. One
of the most straightforward is to simply mark the data as missing within this phase and allow the data
mining algorithm to deal with it in an appropriate manner. If the missing values are caused by some
understood process (and therefore the fact that they are missing represents useful information) then
the absence of data may be represented as an additional valid value which the field can take; otherwise,
missing values should be represented by a flag which alerts the data mining algorithm to the fact that no
data exists. Some examples of methods for handling such missing values within data mining algorithms
can be found in [2, 21, 23].
If missing data is not to be handled primarily within the data mining algorithm then there are two
main approaches for dealing with it as a pre-processing step.
The removal of missing data. This approach eliminates missing data in ways such as removing all
records containing missing data or all fields containing missing data. This approach may be used
in conjunction with handling missing values within the data mining algorithm; for example, all
records with a high proportion of missing values may be discarded. Databases within which
missing values occur in only a small proportion of the fields or records tend to be most suitable
for this approach.
Missing data estimation. The missing values are estimated, within the training database, using approaches ranging from the simple (such as replacing missing numeric values within a field with
the mean over all known examples) to the complex (such as training a neural network to predict
missing values for a field using the remaining fields [9]).
The first of these approaches may prove less time consuming than the second, but suffers the disadvantage of throwing away data. However, the second approach also potentially loses information,
since by filling in missing values their uncertainty is not recorded. This may be rectified by flagging the
filled-in values within the database. The measure of pattern quality used by data mining algorithms
could then incorporate the proportion of missing values upon which the pattern is based; this would
allow the user to encourage the production of patterns which are not based upon many missing values.
If this approach is used then the patterns produced may be evaluated by putting the missing values
back into the database before testing occurs.
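A minimal sketch of the flagging idea, assuming numerical fields held in dictionaries, missing values
encoded as None and simple mean imputation; the flag field name is hypothetical.

    import statistics

    def impute_mean_with_flag(records, field):
        """Fill missing (None) values with the field mean and flag each fill."""
        known = [r[field] for r in records if r[field] is not None]
        fill = statistics.fmean(known) if known else 0.0
        flag = field + "_was_missing"       # hypothetical flag field name
        for r in records:
            r[flag] = r[field] is None
            if r[flag]:
                r[field] = fill
        return records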
2.3.4 Database balancing
The database or databases to be used may be ‘balanced’ at this stage. This process allows the proportion
of records within a database which belong to a chosen minority class to be increased, which may improve
the performance of some data mining algorithms. Generally, balancing algorithms work in one of two
ways.
Data deletion. Records which do not belong to the chosen class are discarded at random, until the
proportion of records within the database which belong to the chosen class is sufficiently large.
This approach has the disadvantage of throwing away data.
Data duplication. Records which belong to the chosen class are duplicated at random, until the
proportion of records within the database which belong to the chosen class is sufficiently large.
The disadvantages of this approach are that the duplication of records may distort patterns within
the database and will result in the duplication of noise; the increase in database size may also impair
the performance of the data mining algorithm.
It should be noted that it may prove beneficial to produce a number of balanced databases, each of
which contains a different proportion of records that belong to the chosen class.
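The data duplication approach, for example, can be sketched in a few lines (data deletion is analogous,
discarding records outside the chosen class instead); the class field name and the target proportion are
assumptions made for the example.

    import random

    def balance_by_duplication(records, target_class, proportion=0.5,
                               class_field="class", seed=0):
        """Duplicate records of the chosen class at random until they make up
        the requested proportion of the (training) database."""
        rng = random.Random(seed)
        chosen = [r for r in records if r[class_field] == target_class]
        balanced = list(records)
        if not chosen or proportion >= 1.0:
            return balanced
        while (sum(r[class_field] == target_class for r in balanced)
               / len(balanced)) < proportion:
            balanced.append(dict(rng.choice(chosen)))
        return balanced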
Figure 5: The pre-processing stage

2.4 Pre-processing
Pre-processing is the first phase of the project within which learning may occur and is illustrated within
figure 5; this phase is generally performed a number of times during the course of the project. The
information gathered within the problem specification stage, in terms of available time, space and speed,
is used within this stage. As with the previous phase, database size reduction operations determined
within the problem specification stage are made use of here as mandatory pre-processing operations.
At this stage, pre-processing operations which are not mandatory may be considered, since many of
these may improve the quality of the results produced within the data mining phase. The following
operations may be performed within this phase.
2.4.1 Feature construction
Such techniques, as described in [11, 14], apply a set of constructive operators to a set of existing
database features to construct one or more new features. Good feature construction algorithms may
improve the performance of data mining algorithms considerably. The technique may also prove useful
when combined with feature subset selection to produce a small set of powerfully predictive fields. The
operators which are applied to the existing features within the database may range from the simple to
the complex, and domain knowledge may be incorporated within the process. For example, a domain
expert may know that it is not the values of field a or field b which are important in predicting a class
but the difference between them; creating a new field which represents the difference between the two
fields thus makes use of such domain knowledge. Ideas for feature construction often come from the
data visualisation phase (see section 2.1.4); for example, a straight line may be seen when the data
is visualised which indicates a potentially useful feature construction approach. A potential drawback
of feature construction is that fields may be produced which, though powerfully predictive, are highly
complex; this can lead to the production of knowledge which is difficult to understand.
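Continuing the example above, constructing a difference field from two existing fields a and b might
look as follows; the field names are those of the illustration, not of any real database.

    def construct_difference(records, field_a="a", field_b="b",
                             new_field="a_minus_b"):
        """Add a constructed feature holding the difference of two fields."""
        for r in records:
            r[new_field] = r[field_a] - r[field_b]
        return records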
2.4.2 Feature subset selection (FSS)
FSS [5, 12, 17] reduces the number of fields within the database, and can produce a highly predictive
subset. If separate training and testing databases exist then FSS should only be applied to the training
database. High quality feature subset selection algorithms may improve the performance of data mining
algorithms in terms of speed, accuracy and simplicity. The knowledge of powerfully predictive fields
may also represent important information in itself. Fields not deemed important might indicate features
that no longer need to be collected and stored. Information on the most important fields may also be
passed on to outside groups which will make use of it in their own ways.
A wide range of feature subset selection algorithms exist, which may make use of quality measures
from the fields of machine learning or statistics; a high quality approach should be used, since the
selection of a poor quality feature subset may potentially impair the performance of the data mining
algorithm or algorithms to be used. The speed of the FSS approach is also an important consideration in
this phase; some approaches may prove infeasible in the time available, or require more time to execute
than they save within the data mining phase.
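As a hedged illustration of a simple filter-style approach, the sketch below scores each candidate field
by how well the majority class within each of its values predicts the class, and keeps the k highest
scoring fields; a production project would normally use an established measure from machine learning
or statistics.

    from collections import Counter, defaultdict

    def purity(records, field, class_field="class"):
        """Fraction of records predicted correctly by the per-value majority class."""
        by_value = defaultdict(Counter)
        for r in records:
            by_value[r[field]][r[class_field]] += 1
        return sum(max(c.values()) for c in by_value.values()) / len(records)

    def select_top_k(records, candidate_fields, k, class_field="class"):
        ranked = sorted(candidate_fields, reverse=True,
                        key=lambda f: purity(records, f, class_field))
        return ranked[:k]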
2.4.3 Discretisation
A variety of such techniques are described in [6]. Some data mining algorithms require such pre-processing, but even those which do not may benefit. The potential benefits of discretisation are the same
as those for FSS; the data mining algorithm or algorithms to be used may yield improved performance
in terms of speed, accuracy and simplicity. Again, if separate training and testing databases exist then
discretisation should only be applied to the training database. The potential pitfalls to this approach
are similar to those of FSS; a poor quality discretisation scheme may impair the performance of a data
mining algorithm.
If the required task is regression and discretisation is performed on the numeric field whose value
is to be predicted then the task is effectively changed from regression to classification. Discretisation
algorithms may also be used to perform FSS; any fields which are discretised into a single interval may
clearly be removed from the database. Data mining can then be performed on the remaining fields,
using either their original or discretised form.
A large number of discretisation schemes are available; these may be grouped into a number of
categories [6].
Local or global. Local discretisation algorithms are applied to localised regions of the database, whilst
global methods discretise the whole database.
Unsupervised or supervised. Supervised methods make use of the class value for each record when
forming discretisations and may potentially produce intervals which are relatively homogeneous
with respect to the class. Unsupervised methods use only the value of the field to be discretised when
forming discretisations and therefore may potentially lose classification information.
Static or dynamic. Dynamic methods form discretisations for all features simultaneously, whilst
static approaches discretise each feature in turn individually.
Discretisation may also affect macro level data by merging classes.
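A minimal sketch of a global, unsupervised, static scheme (equal-width binning) is given below; the
number of intervals is an assumption, and a supervised scheme would also make use of the class values
as described above.

    def equal_width_bins(values, n_intervals=5):
        """Assign each numeric value to one of n equal-width intervals."""
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_intervals or 1.0   # guard against a constant field
        return [min(int((v - lo) / width), n_intervals - 1) for v in values]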
Figure 6: The data mining stage

2.5 Data mining
Within this phase, illustrated within figure 6, the data mining algorithm or algorithms to be used may
have to be determined. The data mining tasks which are required within the project will obviously
restrict the choice of data mining algorithm or algorithms; for example, if the required task is clustering
then a tree induction algorithm such as C4.5 may not be used. Similarly, the interest measure or
measures which are to be used (discussed previously within section 2.1.7) may affect the choice of data
mining algorithm or algorithms, since some algorithms may produce results in a form which are more
straightforward to evaluate than others using the interest measure or measures. If the decision has
been made not to use separate training and testing sets within section 2.3.2 then some form of error
estimation must be used within this phase, such as ‘Leave one out’ [26].
For each data mining task which is to be performed, there are a wide range of algorithms available;
some of these are fairly similar to each other, whilst others work in very different ways. Each algorithm
will typically have its own strengths and weaknesses, in terms of efficiency, suitability for certain types
of data, simplicity of patterns produced and so on; several different data mining algorithms should
therefore ideally be used to perform each different task. The amount of time available, together with
the performance of the system which is to be used to run the data mining algorithm, will also influence
the choice of data mining algorithms and the number which are used.
Each data mining algorithm which is to be used will typically have a number of parameters which
must be set before it can be executed. These will generally fall into the following categories and will
often have default values which may initially be used.
Algorithm parameters. These control the execution of the data mining algorithm in a manner which
affects its overall performance.
Problem parameters. These offer the user control over a variety of options related to the desired
characteristics of the discovered knowledge. For example, the user may be given control over
the number of clusters produced within a clustering algorithm, the desired generality level of the
discovered patterns for a classification algorithm, or the number of nodes within a neural network.
The desired properties of the discovered knowledge, determined in section 2.1.7, must be used to set
the problem parameters. Once suitable parameter values have been found, the data mining algorithm
or algorithms must be executed and the discovered knowledge examined. This preliminary evaluation
will not be particularly rigorous, since its purpose is primarily to determine whether the discovered
knowledge is worthy of closer scrutiny. If this is found to be the case then no more work needs to be
done within this phase; however, such an outcome at the first attempt is extremely rare. Typically, the
discovered knowledge will be unusable for reasons such as being overly complex, at an unsuitable level
of accuracy, of insufficient quality and so on. In such cases, the relevant parameter or parameters of the
data mining algorithm are set to new values in an attempt to rectify the situation and the algorithm
is re-run. This process repeats until satisfactory results are produced, or no set of parameter values
is found which gives such results; in the latter case, more drastic action is required, such as using an
alternative algorithm or examining the validity and suitability of the data used for the required tasks
and revising these accordingly.
It is also possible to combine the algorithmic parameter setting, algorithmic execution and preliminary evaluations into a single process. [13] report an approach within which the parameter selection for
C4.5 is automated, by minimising estimated error.
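The iterative parameter setting described above amounts to a search loop of roughly the following
shape; run_algorithm, interest and the parameter grid are placeholders for whichever data mining
algorithm, interest measure and parameters the project has chosen, not part of any particular package.

    from itertools import product

    def tune(run_algorithm, interest, parameter_grid, training_db, threshold):
        """Try parameter combinations until one yields sufficiently interesting
        knowledge, or return the best combination found."""
        best = None
        for combination in product(*parameter_grid.values()):
            params = dict(zip(parameter_grid.keys(), combination))
            knowledge = run_algorithm(training_db, **params)
            score = interest(knowledge)
            if best is None or score > best[0]:
                best = (score, params, knowledge)
            if score >= threshold:
                break
        return best   # if the threshold was never met, more drastic action is needed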
Figure 7: The evaluation stage

2.6 Evaluation of results
This phase is illustrated within figure 7. There are a range of approaches which may be used to evaluate
the results of a data mining exercise; the choice of these will be determined in part by the data mining goal
or goals, the required tasks and the application area. Within this phase, the test database is used for
evaluation; if this does not exist then this phase is effectively merged with the data mining phase since
the training database is used both for the production and testing of patterns. The areas within which
the discovered knowledge is evaluated are as follows.
2.6.1. Performance on test database. If separate training and testing sets have been generated then
the performance of the discovered knowledge on the testing set may be used to determine its
quality. If, for example, a set of rules is produced that is much more accurate when applied
to the training database than to the testing database, the rules are likely to be overfitting the
data and thus may be unsuitable for practical use. If no separate training and testing databases
exist then alternative approaches may be used to give some approximation of the performance
of the discovered knowledge on unseen data. These approaches, based on resampling or dividing
the database into smaller segments that are subsequently used for training and testing, include
cross-validation and bootstrapping [26]; a sketch of cross-validation is given at the end of this
section.
2.6.2. Simplicity. If description is a high level task of the project then the simplicity of the discovered
knowledge is likely to be crucial. The level of simplicity which is required in such cases will
be partly dependent on the application area; for example, if the discovered knowledge is to be
presented to domain experts then a far lower level of simplicity will be required than if it is to be
understood by general personnel.
2.6.3. Application area suitability. The suitability of the discovered knowledge for the area of application will generally be a crucial factor in determining the success of the data mining project;
if the knowledge which is discovered has no useful application then clearly the project must be
revised accordingly. For example, knowledge may be unsuitable because it is of insufficient quality
to be useful; if this is the case then revisions should be made in phases such as data mining and
pre-processing.
2.6.4. Generality. The generality level of the discovered knowledge (the proportion of the database
to which it applies) may be critical in some areas of application. Within some areas, maximum
benefit may be gained from knowledge which is very general, whilst in others the reverse may be
the case. Generality levels may be varied by making changes in areas such as the data mining
algorithm or algorithms used, their parameters, pre-processing and so on.
2.6.5. Visualisation. This is a potentially useful evaluation tool; by examining discovered knowledge
in a visual environment, complex characteristics may be easily assimilated. For example, a visualisation may be produced for a set of rules, showing their performance throughout the database.
Such a visualisation may be used to understand areas within which the discovered knowledge is
performing poorly, and may offer some insight into why this is happening and how it may be
rectified.
2.6.6. Statistical analysis. The field of statistics offers a wide range of approaches which are useful
in evaluating the discovered knowledge; some examples include the investigation of robustness,
significance, overfit and underfit.
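The k-fold cross-validation mentioned in 2.6.1 can be sketched as follows; train and test are
placeholders for the chosen data mining algorithm and for whatever performance measure the project
uses.

    import random

    def cross_validate(records, train, test, k=10, seed=0):
        """Return the mean performance over k train/test splits of the data."""
        rng = random.Random(seed)
        shuffled = list(records)
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        scores = []
        for i in range(k):
            held_out = folds[i]
            training = [r for j, fold in enumerate(folds) if j != i for r in fold]
            model = train(training)
            scores.append(test(model, held_out))
        return sum(scores) / k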
Figure 8: The interpretation stage

2.7 Interpretation of results
At this stage, presented within figure 8, evaluation is performed by domain experts, who are a particularly
valuable source of insight. They will be able to compare the discovered knowledge to their
own and determine how closely the two match. Wide differences would suggest one or more errors at some
stage within the data mining process and can be used to guide the search for these, together with the
revision of the approach. One would generally expect discovered patterns which are genuine to match
the knowledge of the domain expert, represent a refinement of it, or alternatively fit reasonably well
with their intuition and background knowledge.
The discovered knowledge may effectively represent hypotheses in which domain experts are interested. In such cases, the domain experts may wish to analyse these hypotheses using their own methods
of testing.
Domain experts will also be able to determine how the discovered knowledge fits with existing
knowledge within the application area. This is clearly a vital step for areas within which the new
patterns are to be put to use alongside existing knowledge.
Figure 9: The exploitation stage

2.8 Exploitation of results
If the project has reached this stage, illustrated within figure 9, then the discovered knowledge has
been evaluated to a considerable extent and is believed to be valid, of good quality and suitable for the
proposed application area. Within this phase, the patterns which have been produced are put to use;
this may often be a major undertaking for an organisation; efforts will therefore be made to minimise
the risks involved and maximise the potential benefits.
If the high level task within the project is description then the extracted knowledge is applied to
the required application area. For example, an organisation may change its procedures to incorporate
the discovered knowledge. This may require the involvement and consent of senior management.
The project may require a software application to be generated which embeds the discovered knowledge. This may be facilitated by the packages used within the project; for example, the discovered
knowledge may be exported in the form of C++ code.
The KDD process undertaken during the course of the project may be integrated within the company.
If the project is not a one-off then the travel log may be used as a starting point for the creation of an
automated version of the project. This automated version may then be set up to be regularly re-run as
the databases used within it are updated; changes in the discovered knowledge can then be noted and
reported. The reported changes in the discovered patterns could then be put into practice, which would
keep the organisation up to date with the environment within which it operates.
The process of putting the discovered knowledge into practice should ideally involve the minimum
of risk together with the maximum of benefit. To achieve such goals, it may prove beneficial to make
this process a gradual one. Initially, simulation of the process within which the knowledge is to be put
to use may be performed; this can be used to estimate the likely effects in a variety of different areas.
Once the simulation studies have been performed, the next stage will be to undertake small scale trials
of the discovered knowledge. If the results of these trials appear promising then the organisation may
expand them until full use is made of the discovered knowledge; its full benefits may then be realised.
3 A suggested KDD roadmap route for marketing applications
Within this section, we offer an example of a KDD roadmap route which is aimed at the application area
of marketing. Within this area, a wide variety of specific routes may be taken and so we will concentrate
upon presenting a single route at a general level, discussing likely directions, repetitions and so on; key
characteristics of the route are presented in order of execution. For the sake of brevity, only the most
pertinent points will be described; those which are omitted are not necessarily excluded from all such
projects.
A KDD toolkit may potentially generate a suggested route for a user, based upon the application
area within which the user is working. The suggested route can be determined more tightly through the
course of the project as the toolkit questions the user further regarding the nature of the application
area. The toolkit will therefore be providing low level information for personnel at the user level, which
will guide them along a route which is appropriate for their needs.
Problem specification. The databases which are to be used for marketing projects may contain very
large numbers of records, large numbers of fields and be noisy. The database is likely to contain a
variety of field types and may have missing or unreliable values; a KDD toolkit can perform automatic
database examination to determine field types. A considerable quantity of domain knowledge may
also be available.
The high level tasks which are required may be prediction and/or description; a range of low level
tasks may be required, based upon project types such as the following.
• Customer segmentation; it may prove useful to divide customers (or potential customers)
into a number of groups, each of which contains customers which are similar to each other.
This therefore represents a clustering task.
• Mailshot targeting. When mailshots are sent out to potential customers, only a very small
proportion of these are likely to respond; identifying the customer types which are likely to
respond and targeting them is therefore potentially beneficial. This therefore represents a
classification task.
• Customer profiling. An example of such a project is the creation of a credit scoring system,
which takes as input a number of customer characteristics and outputs a continuous numerical
value that represents an estimate of their credit worthiness. This represents a regression task.
A range of database packages may be required, since the data to be used may come from a variety
of sources, together with one or more KDD packages. Pre-processing tasks such as feature
construction are likely to prove useful, so suitable software may also be needed. The remainder
of this route is aimed at the mailshot targeting project type.
Resourcing. The key issue within this phase for marketing projects is the integration of databases from
multiple sources to form the operational database. The databases may be in different formats, use
different coding schemes and contain micro and macro level data, as well as having different banding
levels.
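The sketch below illustrates, under assumed file names, field names and coding schemes, how two such sources might be recoded to a common scheme and merged to form the operational database; it uses Python with the pandas library and is indicative rather than prescriptive.

    # Illustrative sketch of integrating two sources with different coding
    # schemes.  File names, field names and the recoding map are assumptions.
    import pandas as pd

    customers = pd.read_csv("branch_customers.csv")      # one record per customer
    responses = pd.read_csv("mailshot_responses.csv")    # responses coded "Y"/"N"

    # Harmonise the coding scheme of the response field to 1/0 before merging.
    responses["responded"] = responses["responded"].map({"Y": 1, "N": 0})

    # Merge on a shared key to form the operational database.
    operational = customers.merge(
        responses[["customer_id", "responded"]], on="customer_id", how="left")

    # Customers absent from the response file are treated as non-responders.
    operational["responded"] = operational["responded"].fillna(0).astype(int)

    operational.to_csv("operational.csv", index=False)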
Data cleansing. Erroneous outliers and missing values are likely to exist and must therefore be dealt
with at this phase. Each algorithm to be used for a low level task may have an associated set of
cleansing operations which may be required. For example, balancing may improve the performance
of certain algorithms when they are applied to databases containing a small class of interest, such
as those used within this type of project; a KDD toolkit may therefore suggest appropriate
cleansing operations under such circumstances.
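A hedged sketch of such cleansing operations is given below (Python with pandas); the field names, thresholds and the choice of median imputation and under-sampling are assumptions made purely for illustration.

    # Illustrative cleansing sketch: imputation, outlier removal and balancing.
    # The input file is assumed to come from a resourcing step such as the one
    # sketched above; field names and thresholds are assumptions.
    import pandas as pd

    data = pd.read_csv("operational.csv")

    # Impute missing income values with the median (a simple, common choice).
    data["income"] = data["income"].fillna(data["income"].median())

    # Remove gross outliers, e.g. impossible ages introduced by entry errors.
    data = data[(data["age"] >= 18) & (data["age"] <= 110)]

    # Balance the classes: keep all responders and an equal-sized random sample
    # of non-responders, since responders are the small class of interest.
    responders = data[data["responded"] == 1]
    non_responders = data[data["responded"] == 0].sample(
        n=len(responders), random_state=1)
    balanced = pd.concat([responders, non_responders]).sample(
        frac=1, random_state=1)   # shuffle

    balanced.to_csv("balanced.csv", index=False)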
Pre-processing. The size of the databases to be used means that random sampling is likely to be
carried out to create training and testing subsets. Feature construction algorithms can potentially
create new, powerfully predictive fields, and feature subset selection may also prove useful if the
number of fields is large.
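The following sketch illustrates these pre-processing steps under assumed field names: a random training/testing split, a simple constructed feature, and a crude filter-style feature subset selection based on correlation with the class. It is indicative only, not a recommended recipe.

    # Illustrative pre-processing sketch: random sampling, feature construction
    # and a crude filter-style feature subset selection.  Field names and the
    # input file (from a cleansing step like the one above) are assumptions.
    import pandas as pd

    data = pd.read_csv("balanced.csv")

    # Random sampling into training and testing subsets (here 70% / 30%).
    train = data.sample(frac=0.7, random_state=1)
    test = data.drop(train.index)

    # Feature construction: a derived field that may prove more predictive
    # than either of its components, e.g. spend per purchase.
    for subset in (train, test):
        subset["spend_per_purchase"] = (
            subset["total_spend"] / subset["purchase_count"].clip(lower=1))

    # Crude feature subset selection: keep numeric fields whose absolute
    # correlation with the class exceeds an (arbitrarily chosen) threshold.
    numeric = train.select_dtypes("number").drop(
        columns=["responded", "customer_id"], errors="ignore")
    correlations = numeric.corrwith(train["responded"]).abs()
    print("selected fields:", correlations[correlations > 0.05].index.tolist())

    train.to_csv("train.csv", index=False)
    test.to_csv("test.csv", index=False)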
Data mining. For the task of mailshot targeting, approaches such as neural networks, rule induction
and decision tree induction algorithms may be used. If the high level task is description then a
neural network is unlikely to be used. The databases which will be used for this task are likely to
contain a class of interest that is extremely small; decision tree induction algorithms are therefore
likely to prove less useful than rule induction algorithms that are capable of producing rules to
describe a pre-specified class. This phase is likely to be undertaken a considerable number of times
until satisfactory results are produced, and the project may have to return to the pre-processing,
cleansing or even resourcing phases to achieve this goal.
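As a toy illustration of rule induction aimed at a pre-specified class, the sketch below greedily grows a single conjunctive rule describing the responders. The field names, thresholds and the deliberately naive greedy search are assumptions made for illustration; this is not the algorithm of any particular data mining package.

    # Toy greedy induction of a single conjunctive rule describing the class of
    # interest (the responders).  A naive sketch only; field names are assumptions.
    import pandas as pd

    train = pd.read_csv("train.csv")        # training subset, as sketched above
    target = train["responded"] == 1

    # Candidate conditions: simple threshold tests on a few numeric fields.
    candidates = []
    for field in ["age", "income", "spend_per_purchase"]:
        for q in (0.25, 0.5, 0.75):
            value = train[field].quantile(q)
            candidates.append((field, ">=", value))
            candidates.append((field, "<", value))

    def matches(data, rule):
        """Boolean mask of the records covered by a conjunctive rule."""
        mask = pd.Series(True, index=data.index)
        for field, op, value in rule:
            mask &= (data[field] >= value) if op == ">=" else (data[field] < value)
        return mask

    rule, best_conf = [], 0.0
    MIN_COVERAGE = 50                       # assumed minimum number of records
    for _ in range(3):                      # grow at most three conditions
        best = None
        for cond in candidates:
            mask = matches(train, rule + [cond])
            if mask.sum() < MIN_COVERAGE:
                continue
            conf = target[mask].mean()      # confidence of the extended rule
            if conf > best_conf:
                best, best_conf = cond, conf
        if best is None:
            break
        rule.append(best)

    print("rule:", rule)
    print("confidence: %.2f  coverage: %d" % (best_conf, matches(train, rule).sum()))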
Evaluation of results. Suitability for the application area will be crucial within mailshot targeting
projects; test database performance, together with generality level, is also likely to be a useful
measure of quality. Simplicity may be required, since description will generally be a high level
task of the project, and statistical analysis is likely to be used to determine the significance of the
results. It is likely that the project will return to earlier phases at least once from this phase.
Interpretation of results. A considerable amount of domain expertise is likely to exist, which can be
used to further evaluate the discovered knowledge. It will generally be necessary to examine how
the new knowledge fits with existing knowledge and how it is to be used alongside it. Again, it is
likely that the project will return to earlier phases at least once from this phase.
Exploitation of results. If the results produced within the project so far are of sufficient quality then
the organisation will be keen to exploit them as rapidly as possible. The knowledge is likely to have
a limited shelf life and thus to degrade over time. In addition, competitors of the organisation
are likely to be undertaking their own projects in similar areas, and thus the competitive advantage
gained is maximised by rapid exploitation. However, the risk associated with exploitation can be
considerable, and so simulation and a degree of trialling are likely to be performed to reduce it.
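As an illustration of such simulation, the following sketch estimates the expected profit of mailing only the customers covered by a discovered rule against mailing the entire customer base, before any live trial is run; all costs, response rates and customer counts are hypothetical assumptions.

    # Illustrative simulation of exploiting the discovered rule, using
    # hypothetical costs, response rates and customer counts.
    COST_PER_MAILSHOT = 1.0         # assumed cost of contacting one customer
    PROFIT_PER_RESPONSE = 25.0      # assumed profit from one responder
    CUSTOMERS = 100000              # assumed size of the customer base

    RULE_COVERAGE = 0.10            # proportion of customers the rule targets
    RULE_RESPONSE_RATE = 0.12       # estimated from the test database
    OVERALL_RESPONSE_RATE = 0.03    # response rate if everyone is mailed

    def expected_profit(n_mailed, response_rate):
        responses = n_mailed * response_rate
        return responses * PROFIT_PER_RESPONSE - n_mailed * COST_PER_MAILSHOT

    mail_everyone = expected_profit(CUSTOMERS, OVERALL_RESPONSE_RATE)
    mail_targeted = expected_profit(CUSTOMERS * RULE_COVERAGE, RULE_RESPONSE_RATE)

    print("mail everyone:      %10.2f" % mail_everyone)   # loss in this example
    print("mail targeted only: %10.2f" % mail_targeted)   # profit in this example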
References
[1] R. J. Brachman and T. Anand. The process of knowledge discovery in databases: A human-centered
approach. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances
in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, 1995.
[2] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees.
Wadsworth and Brooks, Monterey, CA., 1984.
[3] P. Chapman, J. Clinton, J. H. Hejlesen, R. Kerber, T. Khabaza, T. Reinartz, and R. Wirth. The
current CRISP-DM process model for data mining. Distributed at a CRISP-DM Special Interest
Group meeting, 1998.
[4] J. C. W. Debuse. Exploitation of Modern Heuristic Techniques within a Commercial Data Mining
Environment. PhD thesis, University of East Anglia, 1997.
[5] P. A. Devijver and J. Kittler. Pattern Recognition: a Statistical Approach. Prentice-Hall International, London, 1982.
[6] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and unsupervised discretization of continuous
features. In Prieditis and Russell [20], pages 194–202.
[7] O. Etzioni. The world-wide web: Quagmire or gold mine? In U. M. Fayyad and R. Uthurusamy,
editors, Comm. ACM, volume 39(11), November 1996.
[8] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. Knowledge discovery and data mining: Towards
a unifying framework. In Simoudis et al. [24].
[9] A. Gupta and M. S. Lam. Estimating missing values using neural networks. Journal of the Operational Research Society, 47:229–238, 1996.
[10] C. M. Howard. The DataLamp package. School of Information Systems, University of East Anglia,
1998.
[11] A. Ittner and M. Schlosser. Discovery of relevant new features by generating non-linear decision
trees. In Simoudis et al. [24], pages 108–113.
[12] G. H. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In
W. W. Cohen and H. Hirsh, editors, Machine Learning: Proc. of the Eleventh Int. Conf., pages
121–129, San Francisco, 1994. Morgan Kaufmann.
[13] R. Kohavi and G. H. John. Automatic parameter selection by minimizing estimated error. In
Prieditis and Russell [20], pages 304–312.
[14] C. J. Matheus and L. A. Rendell. Constructive induction on decision trees. In Proc. of the Eleventh
Int. Joint Conf. on Artificial Intelligence. Morgan Kaufmann, 1989.
[15] S. McClean and B. Scotney. Distributed database management for uncertainty handling in data
mining. In Proc. of the Data Mining Conf., pages 291–311. UNICOM, 1996.
[16] Knowledge Discovery Nuggets. Siftware, 1998. www.kdnuggets.com/siftware.html.
[17] M. Pei, E. D. Goodman, W. F. Punch, and Y. Ding. Genetic algorithms for classification and
feature extraction. Proc. of the Classification Soc. Conf., 1995.
[18] G. Piatetsky-Shapiro. From data mining to knowledge discovery: the roadmap. Proc. of the Data
Mining Conf., pages 209–221, 1996.
[19] R. S. Pressman. Software engineering: a practitioner’s approach. McGraw-Hill, 1992.
[20] A. Prieditis and S. Russell, editors. Proc. of the Twelfth Int. Conf. on Machine Learning, San
Francisco, CA, 1995. Morgan Kaufmann.
[21] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[22] V. J. Rayward-Smith and J. C. W. Debuse. Knowledge discovery issues within the financial services
sector: the benefits of a rule based approach. Proc. of the UNICOM Data Mining / Data Warehouse
Seminar, 1998.
[23] V. J. Rayward-Smith, J. C. W. Debuse, and B. de la Iglesia. Using a genetic algorithm to data
mine in the financial services sector. In A. Macintosh and C. Cooper, editors, Applications and
Innovations in Expert Systems III, pages 237–252. SGES Publications, 1995.
[24] E. Simoudis, J. W. Han, and U. Fayyad, editors. Proc. of the Second Int. Conf. on Knowledge
Discovery and Data Mining (KDD-96), 1996.
[25] Pilot Software. Glossary of data mining terms, 1998. Available electronically from:
www.pilotsw.com/r and t/datamine/dmglos.htm.
[26] S. M. Weiss and C. A. Kulikowski. Computer systems that learn : classification and prediction
methods from statistics, neural nets, machine learning and expert systems. Morgan Kaufmann, San
Francisco, 1991.
[27] J. P. Yoon and L. Kerschberg. A framework for knowledge discovery and evolution in databases.
IEEE Trans. on Knowledge and Data Engineering, 5(6):973–979, 1993.