A methodology for knowledge discovery: a KDD roadmap
SYS Technical Report SYS-C99-01
J. C. W. Debuse, B. de la Iglesia, C. M. Howard and V. J. Rayward-Smith∗
April 26, 1999
∗ This work was supported by the Teaching Company Scheme under programme number 2552.

Abstract

A wealth of expert input is required within any successful project in knowledge discovery in databases (KDD). A distillation of such expertise is described here within an outline methodology for KDD, which is presented in the form of a roadmap. We extend the existing work in this area, add clarity to the field and hope to make the comparison and exchange of ideas within the KDD community more straightforward. Moreover, much of the expertise that has been acquired through practical experience is brought within the reach of KDD practitioners; our work is thus of value to KDD experts and novices alike.

1 Introduction

Any organisation which undertakes a project in knowledge discovery in databases (KDD) will require a considerable degree of expert input to ensure that the results produced are of high quality, valid, interesting, novel and so on. The purpose of this document is to present such expert input as an outline methodology, expressed in the form of a roadmap; this will serve two key purposes. Firstly, we will be creating a framework which should facilitate the exchange and comparison of ideas across different parts of the KDD spectrum. Secondly, this framework should clarify the process and bring much of the knowledge gained through practical experience within reach of KDD practitioners. We aim to extend existing work to give a greater level of detail, and our framework will be used as the basis for the commercial development of a research prototype data mining package [10].

The target audience of this paper is broad, ranging from analysts through to end users. We describe KDD analysts as project leaders, with considerable KDD process knowledge but potentially little or no domain, background or application area knowledge at the start of the project. The end user (described as a business user in [1]) initiates the KDD process with requests and receives the discovered knowledge. Personnel may of course fall somewhere between these two extremes, such as an end user who has some knowledge of the KDD process. We also consider personnel at the ‘management’ level; these are one level above the analyst/end user level and will commission the project, control its budget and may also receive and act upon the discovered knowledge. The aim of this document is to provide clear, detailed information for management and end users, as well as being useful as a reference for analysts.

An important decision which must be made within any KDD project is the type (or types) of personnel to be used. Analysts have the advantage of considerable KDD process knowledge; this means that they are likely to know, from a relatively abstract KDD process perspective, the possible pitfalls and benefits of the decisions that can be made, how best to undertake the required tasks and how to evaluate the results. Analysts do, however, suffer the disadvantage of potentially having little or no domain knowledge and may therefore make decisions which do not take this information into account. The situation is reversed for end users; although they may have little or no KDD process knowledge, they know a great deal about the data to be used and the area within which the extracted knowledge is to be applied. Within a KDD project, end users are therefore much more likely to take all of the relevant characteristics of their data into account and produce knowledge that is in a suitable form and of appropriate quality and novelty to be useful within the desired application area. End users are also much more likely to be able to interpret the results produced for the purpose of validation, evaluation and integration with existing knowledge. The limited or non-existent KDD process knowledge which end users possess does, however, mean that they are more likely than analysts to make poor or incorrect decisions at this level.

If the project is suitably large and well resourced in terms of personnel, both analysts and end users can be involved; their skills will clearly be complementary. However, in many projects only a single person may be used and a decision must therefore be made regarding their desired area of expertise. In such cases, we believe that an end user can carry out the project if suitable software is available to support them. Such software would be aimed at a specific KDD project application area (such as marketing); KDD process knowledge would be incorporated into the package and used to guide the end user through the project. The package would therefore in effect be taking the end user along a route through our KDD roadmap which is known to be suitable for the application area. The user would thus be protected by the package, as far as possible, from making poor or incorrect KDD process decisions.

2 The KDD process

The KDD process is described in [4, 8, 18]; an earlier methodology for KDD is presented in [1]. A description of the KDD process that is oriented more towards business processes than our own is given in [3]. A concept for KDD is also described in [27]. The KDD process may be divided into the following sub-phases.

1. Problem specification.
2. Resourcing.
3. Data cleansing.
4. Pre-processing.
5. Data mining.
6. Evaluation of results.
7. Interpretation of results.
8. Exploitation of results.

Figure 1: The KDD process roadmap

We present an illustration of our view of the KDD process at the broadest level in figure 1. Each of the sub-phases illustrated within the figure is described in detail within sections 2.1 to 2.8. We present the KDD process in the form of a roadmap, which has some parallels with the software engineering process [19]. The map contains one-way and two-way roads, junctions which may be taken, and locations, representing processes to be undertaken, at which the traveller may stop. As with any map, provided that the rules of the road are obeyed, any valid route may be taken. However, within section 3, we present a suggested route for a specific type of KDD project that may be of guidance to the end user.

2.1 Problem specification

Figure 2: The problem specification stage

This stage is illustrated in figure 2; the purpose of the phase is to move from the position of having a problem description, which may be loosely defined, to a tightly defined problem specification. Processes which are performed within this phase include preliminary database examination and familiarisation, determination of required tasks, assessment of data availability, and determination of software and hardware requirements. The feasibility of the project is then determined and the detailed problem specification produced.
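Before the individual steps of this stage are described, the roadmap metaphor of section 2 can be made concrete with a minimal sketch: phases are locations, permitted transitions are roads (some of them one-way), and a project's history is a route that must follow those roads, recorded stop by stop in a travel log (section 2.1.2). The Python below is purely illustrative; the transition table is an assumption made for the sketch, not a transcription of figure 1.

```python
from datetime import datetime, timezone

# The eight sub-phases of the KDD process (section 2), in their nominal order.
PHASES = [
    "problem specification", "resourcing", "data cleansing", "pre-processing",
    "data mining", "evaluation of results", "interpretation of results",
    "exploitation of results",
]

# Illustrative 'roads': which phases may follow each phase. Forward moves plus a
# few of the return roads mentioned later in the text (e.g. evaluation back to
# data mining or pre-processing); the authoritative map is figure 1, so this
# table is a placeholder rather than a transcription of it.
ROADS = {
    "problem specification": {"resourcing"},
    "resourcing": {"data cleansing"},
    "data cleansing": {"pre-processing"},
    "pre-processing": {"data mining"},
    "data mining": {"evaluation of results", "pre-processing"},
    "evaluation of results": {"interpretation of results", "data mining", "pre-processing"},
    "interpretation of results": {"exploitation of results", "data mining"},
    "exploitation of results": set(),
}

def is_valid_route(route):
    """Check that each step of a proposed route follows a road on the map."""
    return all(step in ROADS[here] for here, step in zip(route, route[1:]))

def travel_log_entry(phase, details):
    """A timestamped record of work performed at one stop (see section 2.1.2)."""
    return {"timestamp": datetime.now(timezone.utc).isoformat(),
            "phase": phase,
            "details": details}

print(is_valid_route(PHASES))  # the straight-through route is valid
print(travel_log_entry("resourcing", "linked customer and transaction tables"))
```

A toolkit supporting the methodology would maintain such a log automatically and reject routes that break the rules of the road.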
2.1.1 Inputs to the problem specification phase The input to this phase is a problem description, which may be loosely defined; the output from the phase is a problem specification, which is tightly defined. The application area of the data mining project must be determined; this must have been established in very broad terms before the data mining project is undertaken, but must be clarified at this point. 2.1.2 The travel log A ‘travel log’ must be initiated, which is used to store details of the operations performed at each stage of the project, routes taken through the roadmap and so on; each piece of information recorded will be timestamped. This document is updated throughout the course of the project; this may be supported and automated by a toolkit. The travel log is useful in allowing progress to be tracked and accurate information concerning what has happened through the course of the project to be retrieved easily. Recording precise details of operations that have been performed also allows them to be reversed if necessary. 2.1.3 Preliminary database examination A preliminary examination of the database or databases to be used is then made; some of the results of this will subsequently be stored within the data dictionary, described in section 2.1.5. This phase, together with subsequent phases, may be performed a large number of times during the course of the 3 project as the database or databases to be used are modified, added to and so on. It should be noted that, at this stage, the actual databases may not yet be available and so their descriptions may have to be examined instead. The following characteristics are determined. 2.1.3.1. The number of records. 2.1.3.2. The number of fields. 2.1.3.3. The proportion of the database which is missing. 2.1.3.4. The proportion of the fields which contain missing values. 2.1.3.5. The proportion of the records which contain missing values. 2.1.3.6. The accessibility of the database. 2.1.3.7. The linking required to form the database; for example, part of the database may be stored in paper format and thus need to be electronically created and linked to the existing portion before the project can begin. 2.1.3.8. The extent to which data from multiple sources can be integrated. 2.1.3.9. The speed with which the data can be accessed; for example, the process of getting the data into electronic form may be time consuming and result in the production of databases within which each record is several months old. 2.1.3.10. Noise level determination. It may be useful to establish the level of noise that exists within the database, since this affects later aspects of the project such as the acceptable accuracy level of discovered patterns. This may be measured by identifying ‘contradictions’ within the database; these are records which have the same values for all input fields but differing output field values1 . It should be noted that such noise may be caused primarily by the intrinsic nature of the database rather than inaccuracies in the data. 2.1.4 Database familiarisation Part of the database familiarisation process may require access to domain experts. It should be noted that, as we have discussed, the data may not be available at this stage; if this is the case then other parts of the familiarisation process may have to be deferred. The following are examples of familiarisation processes; some of the results of these processes will subsequently be stored within the data dictionary, described in section 2.1.5. 2.1.4.1. 
Database field type determination. There are a variety of terms that are used to describe type; we will use the following. Numerical. There are two numerical types; numerical discrete, describing integers (for example the number of dependents which, in this case, can only be non negative), and numerical continuous, describing reals (such as temperature). Categorical. This data is discrete and again has two types; categorical ordinal, describing categorical data with an implied ordering (such as size), and categorical nominal, describing categorical data for which there is no implied ordering (such as sex). For some types of field, the type may be obvious; however, there may be cases where the actual type is different to that which the field appears to have. For example, a field with integer value may be an encoding of categorical nominal data, in which case it should be treated as such. 1 The output field of a record will typically describe its class or the real value onto which the record must be mapped; we describe the remainder of the fields as input. 4 2.1.4.2. Determination of database field semantics. Knowledge of the meaning of a database field may influence the KDD process considerably. It may be known that two or more fields, although different, are based on the same or similar measurements. Such knowledge may then be used later in the KDD process; for example, only the most predictive of such a group of fields may be used. The names given to fields may be abbreviated but should be explained in full within the data dictionary; such abbreviations may potentially be misleading if used without reference to the data dictionary. For example, a field with the name ‘No’ may potentially have meanings such as ‘number’ or ‘negative’; reference to the data dictionary will be necessary to determine the actual meaning. 2.1.4.3. Reliability. Fields within the database, or even specific field values, may have varying levels of reliability. Knowledge of these levels can be useful in determining pre-processing operations on fields and interpretation of the discovered knowledge. It may also be possible to incorporate the reliability information within the data mining algorithms to be used and thus target patterns that are based on reliable data. 2.1.4.4. Determination of field value semantics. Knowledge of the meaning of field values may be used to spot outliers or erroneous values in later phases. It may also allow missing values to be understood more clearly and handled in a more appropriate manner. For example, some missing values may be caused purely through error in the data collection process, whilst others may be the result of some understood process; in the latter case, it may prove fruitful to treat the absence of data within a field simply as an extra value which the field may take. 2.1.4.5. Simple statistics. A basic understanding of the nature of a field may be gained by examining measures such as its range, mean, standard deviation, distribution and so on. If the data is not available at this stage and such statistics have not been generated then they may be examined at a later stage. The statistics may suggest that data cleansing (discussed within section 2.3) is necessary; if this is to be performed then these statistics should be generated for the cleansed data. 2.1.4.6. Data visualisation. 
Familiarisation with the nature of each field may be achieved by simple plots, whilst more complex visualisations may allow a deeper level of understanding of the data, such as the effect of combinations of field values on class. Again, if the data is not available at this stage and no visualisations have been made available in advance then visualisation may occur at a later stage. 2.1.4.7. Domain knowledge acquisition. Specialised knowledge of the domain within which the project is involved is crucial to the success of any project. Such knowledge may be acquired by talking to domain experts, studying relevant literature and so on. 2.1.5 Data dictionary A data dictionary must be made available for each data source. The data dictionaries may have to be created and may also need to be updated during the course of the project. Each data dictionary will contain attribute names, types and ranges together with information regarding missing values and/or reliability of values. 2.1.6 High level task (HLT) determination The goal or goals of the data mining project must be determined, which will be prediction and/or description [18]. We refer to the goals of the data mining project as high level tasks; low level tasks must also be determined and this process is discussed within section 2.1.7. The goal of description is to present discovered patterns in an understandable form; predicting unknown values is the requirement of prediction. This phase is essentially the process of determining whether a “black box” approach is suitable; if so then description will not be one of the goals. It should be pointed out that some algorithms may fulfil both goals; for example, a simple decision tree may be both understood and used to predict future values. 5 2.1.7 Low level task (LLT) determination The first step is to identify which tasks are feasible, based upon the database which is to be used and the application area. For example, classification cannot be carried out unless each object has been assigned a class and time series analysis obviously requires data with a time dimension. A target task or tasks must then be selected from the set of feasible tasks. The selection process will depend largely on the application area of the project and its goal or goals. The following are examples of data mining tasks which may be carried out; descriptions of such tasks may be found in [8, 16, 25]. Classification. Descriptions are found for a set of pre-defined classes within the database. A total classification may be produced, in which case descriptions are produced for all classes within the database; alternatively, a partial classification may be produced, within which descriptions are only found for certain classes. Clustering. The database is grouped into classes; a clustering of the data may be used for both description and prediction. Regression. A function which maps every record in the database onto a real value is produced. Such a function is useful primarily for prediction, although it may be possible to express the function (or some summarisation of its key features) in a form that may be used for description. Dependency modelling. A model is produced which describes dependencies which are significant between variables. Such a model is mainly useful for description, although it may also be used for prediction if it is in a suitable form. Time series analysis. Each record within the database has an associated time; patterns which exist over time are generally sought. 
Such patterns may be used for both prediction and description. Visualisation. Data is presented graphically in a way which facilitates visual identification of knowledge. Visualisations are clearly suitable for description. It should be noted that it is possible to convert some low level tasks into alternatives. For example, a database may be created within which the class field describes the current class of the record and the remainder of the fields describe the values of its attributes a year ago. In such a case, a time series analysis task effectively becomes a classification task. The desired properties of the discovered knowledge must be determined at this stage. A measure or set of measures of interest should then be defined for each required low level task. No single interest measure exists, and the measure or measures used should reflect the desired characteristics of the discovered knowledge. Interest measures may be based upon characteristics such as the accuracy or generality for classification tasks, or the size of clusters for clustering tasks. 2.1.8 Software and hardware requirements An estimate of software requirements should be made at this stage. This may be fairly general, but should give some indication of the hardware requirements and the cost of using necessary packages. Typical software requirements include the following. 2.1.8.1. Database software. More than one package may be required if several databases, each in the format of a different system, are used. 2.1.8.2. Spreadsheet software. 2.1.8.3. Software to support pre-processing operations. At this stage, the desired pre-processing operations, together with the algorithms to perform them, have not yet been determined; if a decision cannot be reached at this stage then estimates must be made. 2.1.8.4. Software to support the data mining algorithms that will carry out the required high and low level tasks. This, of course, means that such algorithms must be chosen at this point; if this is not possible at this stage then estimates must be made. 6 2.1.8.5. KDD packages. These may include software to support pre-processing operations, database interfacing, data mining algorithms and so on. As discussed within section 2.1.9, the hardware requirements will depend both upon the software requirements and the database or databases to be used. 2.1.9 Feasibility determination The feasibility of mining for patterns within the database or databases is determined within the following areas. 2.1.9.1 Missing and unreliable data. The size of characteristic 2.1.3.3. in section 2.1.3 (the proportion of the database which is missing) may be so large that data mining is infeasible; similarly, the proportion of the database which is unreliable may be infeasibly large. Alternatively, if the majority of the missing or unreliable information occurs primarily within a subset of the records or fields then it may be possible to use only certain records or certain fields. If this is not the case then the incorporation of missing or unreliable data within the data mining algorithms may be investigated to determine the feasibility of the project. 2.1.9.2 System performance. Once feasibility has been established from a missing data perspective, the performance of the system on which the data mining will be carried out must be established. 
The first step within this phase is to confirm that the system meets the requirements of the software to be used within the project; once this has been done, the system performance must be measured in the following areas. 2.1.9.2.1. The available hard disk space. The space required to store the database or databases on the hard disk must be estimated and compared to the available area. If the available space is insufficient then disk space may be increased or different databases used. Alternatively, one or more steps from the data cleansing and pre-processing phases (described within sections 2.3 and 2.4 respectively) may be performed at a later stage to reduce the size of the database; these include random sampling, feature subset selection, discretisation and clustering groups of similar records together, so that data is effectively dealt with at a ‘macro’ rather than ‘micro’ level and the number of records is reduced. 2.1.9.2.2. The size of the available memory. Once the disk space feasibility has been established, an estimate of the memory required by the database and software must be made. If this exceeds the amount available then the same pre-processing steps as described for disk space limitations may be undertaken. Again, if this does not render the project feasible then more major project modifications must be undertaken such as upgrading the memory or using different databases. It should be noted that the available memory may render some data mining algorithms infeasible if they scale up poorly to large databases. 2.1.9.2.3. The database access speed (if flat files are not to be used). 2.1.9.2.4. The processor speed, measured using an appropriate benchmark. Given an approximation of the amount of processor effort required within the data mining exercise given the database or databases and taking into account the database access speed, an estimate of the time which the project will take can be made. The accuracy of this estimate will depend upon the extent to which future phases have been planned. If this estimate significantly exceeds the time available then the pre-processing steps described previously will be considered. Estimates must be made of the time taken to perform the necessary pre-processing step or steps (including those performed because of memory or hard disk limitations), together with the time which will be required to perform data mining on the new data. If the total time exceeds that available for all of the appropriate pre-processing steps then the project must be redesigned, by upgrading the available processing power, using a different database or databases, allowing more time and so on. 2.1.9.3. Personnel. Estimated personnel requirements form part of the measure of project feasibility. Provision must be made for domain experts and KDD experts; training may also need to be undertaken. 7 2.1.9.4. Size of database regions of interest. If the regions of interest within the database are too small then the project may be infeasible. For example, an organisation may be interested in rules that describe a class of interest; if only a handful of records in a database containing millions of records belong to the class then the project may be infeasible. 2.1.9.5. Low level task feasibility. As discussed within section 2.1.7, some low level data mining tasks may prove infeasible given the available data. For example, if the records do not have an associated time then time series analysis cannot be carried out. 2.1.9.6. Cost. 
The estimated total cost of the proposed project forms the final component of the feasibility measure. In addition to determining feasibility, such information can also be used in weighing up the potential costs and benefits of the project together with the risks involved; the decision to run, revise or redesign the project can then be made in a more informed fashion. 2.1.10 Outputs from the problem specification phase The output from this phase is a problem specification, which contains the following components. 2.1.10.1. A list of resource requirements, including cost, time, personnel, hardware and software. These should be presented to management level personnel for approval. 2.1.10.2. The high and low level tasks to be undertaken within the project. 2.1.10.3. A data dictionary. 2.1.10.4. The feasibility of the project. 2.1.10.5. A travel log, which is updated at this point to record the above information. The travel log will continue to be updated throughout the course of the project so that it contains a record of everything that has happened within it. A KDD toolkit can potentially offer support during this phase and produce the final problem specification document, which will accompany the travel log. The toolkit may also generate a suggested route or routes through the KDD roadmap, based upon the nature of the project to be tackled. 2.2 Resourcing This stage is illustrated in figure 3; the list of resource requirements, which is output from the problem specification phase, is taken as input. Within this phase, the resources specified within the problem specification, including the data mining algorithms that are to be used, are gathered. The resource which may potentially be the most time consuming to gather within this phase is the data. The data may not have been available within the previous stage, or may exist in forms which are time consuming to convert into usable databases. For example, as we have previously discussed, part of the database that is to be used may exist in paper form and thus require putting into electronic form and linking with the existing components. The data may be sourced from data warehouses. These are vast stores of data which some organisations maintain, and each one may contain all of the data which the organisation has ever gathered in a particular area. Data warehouses will generally contain far more data than is manageable or required by the KDD project; the project may also require data from several such warehouses. This has lead to the development of ‘data marts’, which contain the relevant data collected from one or possibly more warehouses and which are much smaller than any single data warehouse. The data mart is therefore similar to a shop, which generally takes its stock from a range of warehouses but contains much less stock than any single warehouse. Data may potentially be more easily sourced from data marts, since their data has been gathered from multiple data warehouses and is of a more manageable size than even a single data warehouse. The output from the phase is an “operational database”. This may be made up from a number of different sources, each with its own database management system, but exists as a complete database that is consistent in its structure, formatting, identifiers for missing values and so on. To create such a database, procedures for transforming the data from each of the sources into the required structure 8 Figure 3: The resourcing stage and format must clearly be established. 
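As a small illustration of such a transformation procedure, the sketch below maps records from two hypothetical sources onto a single operational schema with consistent field names, casing and missing-value identifiers; the source names, field mappings and missing-value codes are invented for the example rather than taken from the methodology.

```python
# Illustrative only: two hypothetical source layouts mapped onto one
# operational schema with consistent names, casing and missing-value markers.
MISSING = None  # single identifier for missing values in the operational database

# How each source's field names map onto the operational field names (assumed).
FIELD_MAPS = {
    "source_a": {"CustNo": "customer_id", "AGE": "age", "Resp": "response"},
    "source_b": {"customer": "customer_id", "age_band": "age", "responded": "response"},
}

# Values that the sources use to denote 'missing' (assumed).
MISSING_CODES = {"", "n/a", "na", "-999", "?"}

def to_operational(record, source):
    """Transform one source record into the operational-database format."""
    out = {}
    for src_name, op_name in FIELD_MAPS[source].items():
        value = record.get(src_name)
        if isinstance(value, str):
            value = value.strip().lower()          # consistent formatting
        if value is None or str(value).lower() in MISSING_CODES:
            value = MISSING                        # consistent missing-value marker
        out[op_name] = value
    return out

# Records drawn from the two sources end up structurally identical.
print(to_operational({"CustNo": "00017", "AGE": "N/A", "Resp": "Y"}, "source_a"))
print(to_operational({"customer": "00018", "age_band": "30-39", "responded": ""}, "source_b"))
```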
There are a number of issues related to such transformations, including the following. 2.2.1. Banding levels of data. Each source may contain data at a different banding level. For example, age may be represented as raw values or alternatively be banded into intervals. If the banding levels are different within each source then the methodology described in [15] can be used to combine them. 2.2.2. Macro and micro level data. As we have previously discussed in section 2.1.9, if groups of records that are similar are clustered together, the data is converted from micro to macro level. Within macro level data, each record therefore represents a group of micro level records and often includes a numerical value of the number of corresponding records in the original database, whilst every record is represented individually within micro level data. Data may be stored at different levels within different sources; if this is the case then the levels should be made the same within the operational database. Converting micro level data to macro level is the most straightforward way to accomplish this, since converting in the opposite direction generally requires access to an original, micro level version of the data; this is really only a problem when banding has taken place. 2.2.3. Gathering data from the world wide web. The web contains enormous quantities of data that may prove useful within a KDD project. However, the principal problem in gathering and using such data is dealing with the large quantities of unstructured information that is designed primarily for human rather than machine consumption. A survey of data mining from the web is given in [7]. 2.2.4. Coding consistency. Data from each source may contain complex coding schemes. For example, the value of a field may be a code that describes a node within a large, complex hierarchy of types. Each source may use a different hierarchy; procedures must therefore be developed to allow such codings to be translated into a common format. 2.2.5. Consistent data formatting. Representations and field names must be consistent, including their use of upper and lower case characters; this information will be stored within the data dictionary 9 Figure 4: The data cleansing stage for the operational database, as described within section 2.1.5. A single format must be decided on for the operational database, such as database tables; all of the data must then be converted into this format. 2.2.6. Miscellaneous data formatting. The data must be converted into a format which is suitable for the data mining algorithms to be used; this conversion will typically involve making use of suitable field delimiters, adding appropriate descriptive headers to files and so on. The operational database may be formed by creating a physical copy of the data from the various sources, or alternatively exist only in ‘virtual’ form, drawing data directly from the sources when accessed. In both cases, the source databases remain unchanged by any transformations that are performed to create the operational database. 2.3 Data cleansing Figure 4 gives an illustration of this phase, within which the aim is to prepare the data for subsequent phases that involve learning. Operations such as the removal of errors, dealing with missing values and perhaps balancing are therefore performed at this stage. Although the operations performed within this phase may be classified as pre-processing, they differ from other pre-processing operations in two key ways. 
Firstly, learning may be performed within the pre-processing phase but never occurs within this phase. Secondly, this phase is generally only performed once for a given database or databases, whereas pre-processing may be carried out a number of times. The operations which are performed within this phase are the following; it should be noted that database size reduction operations determined within the problem specification stage are made use of here as mandatory data cleansing operations. 2.3.1 Outlier handling As described in [22], many outliers may be classified as either errors or groups of interest. In the case of the latter, the project will probably be concentrating upon the outliers. Within the data cleansing phase of the project, only erroneous outliers are dealt with. 10 The process of dealing with erroneous outliers will generally require some domain knowledge to determine what constitutes such an outlier. Domain knowledge is also often required to determine the corrective action to apply to each form of outlier. For example, the presence of an outlier may suggest that the value is erroneous and should be treated as missing data. Alternatively, some corrective processing may be applied to the outlier to convert it into a valid value. 2.3.2 Random sampling If a sufficient number of records exist within the database or databases then they may be split at random into a separate training and testing subset. The data mining algorithm or algorithms which are to be used will later be applied to the training set; the patterns which they discover will then be evaluated later using the testing set. The size of each of these subsets may be determined by the system on which data mining will be carried out. For example, the available memory may only be sufficient to allow a training set size which contains 10% of the complete database. (If balancing, described in section 2.3.4, is to be performed then it should only be carried out on the training database.) In some cases, there may be too few records to allow separate testing and training subsets to be formed. In such cases, it is often still necessary to obtain some estimate of the extent to which the discovered knowledge represents genuine patterns rather than noise. Under such circumstances, alternative evaluation approaches must be used, such as those discussed within section 2.6. 2.3.3 Missing data handling The approach which is to be used to deal with missing data must be determined and performed at this stage. As previously described in [4], there are a variety of ways in which this may be performed. One of the most straightforward is to simply mark the data as missing within this phase and allow the data mining algorithm to deal with it in an appropriate manner. If the missing values are caused by some understood process (and therefore the fact that they are missing represents useful information) then the absence of data may be represented as an additional valid value which the field can take; otherwise, missing values should be represented by a flag which alerts the data mining algorithm to the fact that no data exists. Some examples of methods for handling such missing values within data mining algorithms can be found in [2, 21, 23]. If missing data is not to be handled primarily within the data mining algorithm then there are two main approaches for dealing with it as a pre-processing step. The removal of missing data. 
This approach eliminates missing data in ways such as removing all records containing missing data or all fields containing missing data. This approach may be used in conjunction with handling missing values within the data mining algorithm; for example, all records with a high proportion of missing values may be discarded. Databases within which missing values occur in only a small proportion of the fields or records tend to be most suitable for this approach. Missing data estimation. The missing values are estimated, within the training database, using approaches ranging from the simple (such as replacing missing numeric values within a field with the mean over all known examples) to the complex (such as training a neural network to predict missing values for a field using the remaining fields [9]). The first of these approaches may prove less time consuming than the second, but suffers the disadvantage of throwing away data. However, the second approach also potentially loses information, since by filling in missing values their uncertainty is not recorded. This may be rectified by flagging the filled-in values within the database. The measure of pattern quality used by data mining algorithms could then incorporate the proportion of missing values upon which the pattern is based; this would allow the user to encourage the production of patterns which are not based upon many missing values. If this approach is used then the patterns produced may be evaluated by putting the missing values back into the database before testing occurs. 11 Figure 5: The pre-processing stage 2.3.4 Database balancing The database or databases to be used may be ‘balanced’ at this stage. This process allows the proportion of records within a database which belong to a chosen minority class to be increased, which may improve the performance of some data mining algorithms. Generally, balancing algorithms work in one of two ways. Data deletion. Records which do not belong to the chosen class are discarded at random, until the proportion of records within the database which belong to the chosen class is sufficiently large. This approach has the disadvantage of throwing away data. Data duplication. Records which belong to the chosen class are duplicated at random, until the proportion of records within the database which belong to the chosen class is sufficiently large. The disadvantages of this approach are that the duplication of records may distort patterns within the database and will result in the noise duplication; the increase in database size may also impair the performance of the data mining algorithm. It should be noted that it may prove beneficial to produce a number of balanced databases, each of which contains a different proportion of records that belong to the chosen class. 2.4 Pre-processing Pre-processing is the first phase of the project within which learning may occur and is illustrated within figure 5; this phase is generally performed a number of times during the course of the project. The information gathered within the problem specification stage, in terms of available time, space and speed, is used within this stage. As with the previous phase, database size reduction operations determined within the problem specification stage are made use of here as mandatory pre-processing operations. At this stage, pre-processing operations which are not mandatory may be considered, since many of these may improve the quality of the results produced within the data mining phase. 
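Before the individual pre-processing operations are described, the cleansing steps of sections 2.3.2 and 2.3.4 can be illustrated with a minimal sketch; the field names, class values and proportions below are assumptions made for the example, not recommendations.

```python
import random

def train_test_split(records, train_fraction=0.7, seed=0):
    """Random sampling (section 2.3.2): split the records into training and testing sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def balance_by_duplication(train, class_field, minority, target_proportion=0.5, seed=0):
    """Balancing by data duplication (section 2.3.4): duplicate records of the chosen
    minority class at random until they form the target proportion of the training set."""
    rng = random.Random(seed)
    minority_records = [r for r in train if r[class_field] == minority]
    balanced = train[:]
    while minority_records and \
            sum(r[class_field] == minority for r in balanced) / len(balanced) < target_proportion:
        balanced.append(rng.choice(minority_records))
    return balanced

# Hypothetical mailshot-style records with a small class of interest ('yes' responders).
records = [{"age": 18 + i, "response": "yes" if i % 10 == 0 else "no"} for i in range(80)]
train, test = train_test_split(records)
balanced_train = balance_by_duplication(train, "response", "yes")
print(len(train), len(test), len(balanced_train))
```

Note that, as stated in section 2.3.2, balancing is applied to the training subset only.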
The following operations may be performed within this phase. 12 2.4.1 Feature construction Such techniques, as described in [11, 14], apply a set of constructive operators to a set of existing database features to construct one or more new features. Good feature construction algorithms may improve the performance of data mining algorithms considerably. The technique may also prove useful when combined with feature subset selection to produce a small set of powerfully predictive fields. The operators which are applied to the existing features within the database may range from the simple to the complex, and domain knowledge may be incorporated within the process. For example, a domain expert may know that it is not the values of field a or field b which are important in predicting a class but the difference between them; creating a new field which represents the difference between the two fields thus makes use of such domain knowledge. Ideas for feature construction often come from the data visualisation phase (see section 2.1.4); for example, a straight line may be seen when the data is visualised which indicates a potentially useful feature construction approach. A potential drawback of feature construction is that fields may be produced which, though powerfully predictive, are highly complex; this can lead to the production of knowledge which is difficult to understand. 2.4.2 Feature subset selection (FSS) FSS [5, 12, 17] reduces the number of fields within the database, and can produce a highly predictive subset. If separate training and testing databases exist then FSS should only be applied to the training database. High quality feature subset selection algorithms may improve the performance of data mining algorithms in terms of speed, accuracy and simplicity. The knowledge of powerfully predictive fields may also represent important information in itself. Fields not deemed important might indicate features that no longer need to be collected and stored. Information on the most important fields may also be passed on to outside groups which will make use of it in their own ways. A wide range of feature subset selection algorithms exist, which may make use of quality measures from the fields of machine learning or statistics; a high quality approach should be used, since the selection of a poor quality feature subset may potentially impair the performance of the data mining algorithm or algorithms to be used. The speed of the FSS approach is also an important consideration in this phase; some approaches may prove infeasible in the time available, or require more time to execute than they save within the data mining phase. 2.4.3 Discretisation A variety of such techniques are described in [6]. Some data mining algorithms require such preprocessing, but even those which do not may benefit. The potential benefits of discretisation are the same as those for FSS; the data mining algorithm or algorithms to be used may yield improved performance in terms of speed, accuracy and simplicity. Again, if separate training and testing databases exist then discretisation should only be applied to the training database. The potential pitfalls to this approach are similar to those of FSS; a poor quality discretisation scheme may impair the performance of a data mining algorithm. If the required task is regression and discretisation is performed on the numeric field whose value is to be predicted then the task is effectively changed from regression to classification. 
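Two of these operations can be sketched briefly under assumed field names: a constructed feature formed as the difference between two existing fields (the kind of domain-informed construction mentioned in section 2.4.1), and a simple global, unsupervised, static equal-width discretisation of the new field (section 2.4.3). This is a minimal illustration rather than a recommended scheme.

```python
def add_difference_feature(records, field_a, field_b, new_field):
    """Feature construction (section 2.4.1): a domain expert may know that the
    difference between two fields is predictive, so materialise it as a new field."""
    for r in records:
        r[new_field] = r[field_a] - r[field_b]
    return records

def equal_width_discretise(records, field, bins=4):
    """A global, unsupervised, static discretisation (section 2.4.3): replace a
    numeric field with the index of an equal-width interval."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    for r in records:
        r[field + "_interval"] = min(int((r[field] - lo) / width), bins - 1)
    return records

# Hypothetical example: income and expenditure fields on a handful of records.
records = [{"income": 2300, "expenditure": 1900},
           {"income": 1500, "expenditure": 1600},
           {"income": 4100, "expenditure": 2200}]
records = add_difference_feature(records, "income", "expenditure", "surplus")
records = equal_width_discretise(records, "surplus", bins=3)
print(records)
```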
Discretisation algorithms may also be used to perform FSS; any fields which are discretised into a single interval may clearly be removed from the database. Data mining can then be performed on the remaining fields, using either their original or discretised form. A large number of discretisation schemes is available; these may be grouped into a number of categories [6]. Local or global. Local discretisation algorithms are applied to localised regions of the database, whilst global methods discretise the whole database. Unsupervised or supervised. Supervised methods make use of the class value for each record when forming discretisation and may potentially produce intervals which are relatively homogeneous with respect to this. Unsupervised methods use only the value of the field to be discretised when forming discretisations and therefore may potentially lose classification information. 13 Figure 6: The data mining stage Static or dynamic. Dynamic methods form discretisations for all features simultaneously, whilst static approaches discretise each feature in turn individually. Discretisation may also effect macro level data by merging classes. 2.5 Data mining Within this phase, illustrated within figure 6, the data mining algorithm or algorithms to be used may have to be determined. The data mining tasks which are required within the project will obviously restrict the choice of data mining algorithm or algorithms; for example, if the required task is clustering then a tree induction algorithm such as C4.5 may not be used. Similarly, the interest measure or measures which are to be used (discussed previously within section 2.1.7) may affect the choice of data mining algorithm or algorithms, since some algorithms may produce results in a form which are more straightforward to evaluate than others using the interest measure or measures. If the decision has been made not to use separate training and testing sets within section 2.3.2 then some form of error estimation must be used within this phase, such as ‘Leave one out’ [26]. For each data mining task which is to be performed, there are a wide range of algorithms available; some of these are fairly similar to each other, whilst others work in very different ways. Each algorithm will typically have its own strengths and weaknesses, in terms of efficiency, suitability for certain types of data, simplicity of patterns produced and so on; several different data mining algorithms should therefore ideally be used to perform each different task. The amount of time available, together with the performance of the system which is to be used to run the data mining algorithm, will also influence the choice of data mining algorithms and the number which are used. Each data mining algorithm which is to be used will typically have a number of parameters which must be set before it can be executed. These will generally fall into the following categories and will often have default values which may initially be used. Algorithm parameters. These control the execution of the data mining algorithm in a manner which affects its overall performance. Problem parameters. These offer the user control over a variety of options related to the desired characteristics of the discovered knowledge. 
For example, the user may be given control over 14 Figure 7: The evaluation stage the number of clusters produced within a clustering algorithm, the desired generality level of the discovered patterns for a classification algorithm, or the number of nodes within a neural network. The desired properties of the discovered knowledge, determined in section 2.1.7, must be used to set the problem parameters. Once suitable parameter values have been found, the data mining algorithm or algorithms must be executed and the discovered knowledge examined. This preliminary evaluation will not be particularly rigorous, since its purpose is primarily to determine whether the discovered knowledge is worthy of closer scrutiny. If this is found to be the case then no more work needs to be done within this phase; however, such an outcome at the first attempt is extremely rare. Typically, the discovered knowledge will be unusable for reasons such as being overly complex, at an unsuitable level of accuracy, of insufficient quality and so on. In such cases, the relevant parameter or parameters of the data mining algorithm are set to new values in an attempt to rectify the situation and the algorithm is re-run. This process repeats until satisfactory results are produced, or no set of parameter values is found which gives such results; in the latter case, more drastic action is required, such as using an alternative algorithm or examining the validity and suitability of the data used for the required tasks and revising these accordingly. It is also possible to combine the algorithmic parameter setting, algorithmic execution and preliminary evaluations into a single process. [13] report an approach within which the parameter selection for C4.5 is automated, by minimising estimated error. 2.6 Evaluation of results This phase is illustrated within figure 7. There are a range of approaches which may be used to evaluate the results of a data mining exercise; the choice of these will be made in part by the data mining goal or goals, the required tasks and the application area. Within this phase, the test database is used for evaluation; if this does not exist then this phase is effectively merged with the data mining phase since the training database is used both for the production and testing of patterns. The areas within which the discovered knowledge is evaluated are as follows. 2.6.1. Performance on test database. If separate training and testing sets have been generated then the performance of the discovered knowledge on the testing set may be used to determine its quality. If, for example, a set of rules is produced that is much more accurate when applied to the training database than to the testing database, the rules are likely to be overfitting the 15 data and thus may be unsuitable for practical use. If no separate training and testing databases exist then alternative approaches may be used to give some approximation of the performance of the discovered knowledge on unseen data. These approaches, based on resampling or dividing the database into smaller segments that are subsequently used for training and testing, include cross-validation and bootstrapping [26]. 2.6.2. Simplicity. If description is a high level task of the project then the simplicity of the discovered knowledge is likely to be crucial. 
The level of simplicity which is required in such cases will be partly dependent on the application area; for example, if the discovered knowledge is to be presented to domain experts then a far lower level of simplicity will be required than if it is to be understood by general personnel. 2.6.3. Application area suitability. The suitability of the discovered knowledge for the area of application will generally be a crucial factor in determining the success of the data mining project; if the knowledge which is discovered has no useful application then clearly the project must be revised accordingly. For example, knowledge may be unsuitable because it is of insufficient quality to be useful; if this is the case then revisions should be made in phases such as data mining and pre-processing. 2.6.4. Generality. The generality level of the discovered knowledge (the proportion of the database to which it applies) may be critical in some areas of application. Within some areas, maximum benefit may be gained from knowledge which is very general, whilst in others the reverse may be the case. Generality levels may be varied by making changes in areas such as the data mining algorithm or algorithms used, their parameters, preprocessing and so on. 2.6.5. Visualisation. This is a potentially useful evaluation tool; by examining discovered knowledge in a visual environment, complex characteristics may be easily assimilated. For example, a visualisation may be produced for a set of rules, showing their performance throughout the database. Such a visualisation may be used to understand areas within which the discovered knowledge is performing poorly, and may offer some insight into why this is happening and how it may be rectified. 2.6.6. Statistical analysis. The field of statistics offers a wide range of approaches which are useful in evaluating the discovered knowledge; some examples include the investigation of robustness, significance, overfit and underfit. 2.7 Interpretation of results At this stage, presented within figure 8, evaluation is performed by domain experts, who offer a particularly valuable source of evaluation. They will be able to compare the discovered knowledge to their own and determine how closely they match. Wide differences would suggest one or more errors at some stage within the data mining process and can be used to guide the search for these, together with the revision of the approach. One would generally expect discovered patterns which are genuine to match the knowledge of the domain expert, represent a refinement of it, or alternatively fit reasonably well with their intuition and background knowledge. The discovered knowledge may effectively represent hypotheses in which domain experts are interested. In such cases, the domain experts may wish to analyse these hypotheses using their own methods of testing. Domain experts will also be able to determine how the discovered knowledge fits with existing knowledge within the application area. This is clearly a vital step for areas within which the new patterns are to be put to use alongside existing knowledge. 2.8 Exploitation of results If the project has reached this stage, illustrated within figure 9, then the discovered knowledge has been evaluated to a considerable extent and is believed to be valid, of good quality and suitable for the proposed application area. 
Within this phase, the patterns which have been produced are put to use; 16 Figure 8: The interpretation stage this may often be a major undertaking for an organisation; efforts will therefore be made to minimise the risks involved and maximise the potential benefits. If the high level task within the project is description then the extracted knowledge is applied to the required application area. For example, an organisation may change its procedures to incorporate knowledge. This may require the involvement and consent of senior management. The project may require a software application to be generated which embeds the discovered knowledge. This may be facilitated by the packages used within the project; for example, the discovered knowledge may be exported in the form of C++ code. The KDD process undertaken during the course of the project may be integrated within the company. If the project is not a one-off then the travel log may be used as a starting point for the creation of an automated version of the project. This automated version may then be set up to be regularly re-run as the databases used within it are updated; changes in the discovered knowledge can then be noted and reported. The reported changes in the discovered patterns could then be put into practice, which would keep the organisation up to date with the environment within which it operates. The process of putting the discovered knowledge into practice should ideally involve the minimum of risk together with the maximum of benefit. To achieve such goals, it may prove beneficial to make this process a gradual one. Initially, simulation of the process within which the knowledge is to be put to use may be performed; this can be used to estimate the likely effects in a variety of different areas. Once the simulation studies have been performed, the next stage will be to undertake small scale trials of the discovered knowledge. If the results of these trials appear promising then the organisation may expand them until full use is made of the discovered knowledge; its full benefits may then be realised. 3 A suggested KDD roadmap route for marketing applications Within this section, we offer an example of a KDD roadmap route which is aimed at the application area of marketing. Within this area, a wide variety of specific routes may be taken and so we will concentrate upon presenting a single route at a general level, discussing likely directions, repetitions and so on; key characteristics of the route are presented in order of execution. For the sake of brevity, only the most pertinent points will be described; those which are missed out are not necessarily considered to be excluded from all such projects. A KDD toolkit may potentially generate a suggested route for a user, based upon the application area within which the user is working. The suggested route can be determined more tightly through the course of the project as the toolkit questions the user further regarding the nature of the application 17 Figure 9: The exploitation stage area. The toolkit will therefore be providing low level information for personnel at the user level, which will guide them along a route which is appropriate for their needs. Problem specification. The databases which are to be used for marketing projects may contain very large numbers of records, large numbers of fields and be noisy. 
The database is likely to contain a variety of types and may have missing or unreliable values; a KDD toolkit can perform automatic database examination to determine field types. A considerable quantity of domain knowledge may also be available. The high level tasks which are required may be prediction and/or description; a range of low level tasks may be required, based upon project types such as the following. • Customer segmentation; it may prove useful to divide customers (or potential customers) into a number of groups, each of which contains customers which are similar to each other. This therefore represents a clustering task. • Mailshot targeting. When mailshots are sent out to potential customers, only a very small proportion of these are likely to respond; identifying the customer types which are likely to respond and targeting them is therefore potentially beneficial. This therefore represents a classification task. • Customer profiling. An example of such a project is the creation of a credit scoring system, which takes as input a number of customer characteristics and outputs a continuous numerical value that represents an estimate of their credit worthiness. This represents a regression task. A range of database packages may be required, since the data to be used may come from a variety of sources. A KDD package or packages will also be required. Pre-processing tasks such as feature construction are likely to prove useful and therefore suitable software may also be required. The remainder of this route will be aimed at the mailshot targeting project type. Resourcing. The key issue within the phase for marketing projects is the integration of databases from multiple sources to form the operational database. The databases may be in different format, have different coding schemes and contain micro and macro level data as well as having different banding levels. Data cleansing. Erroneous outliers and missing values are likely to exist and therefore must be dealt with at this phase. Each algorithm which is to be used to undertake a low level task may have 18 an associated set of cleansing operations which may be required. For example, balancing may improve the performance of certain algorithms when they are used with databases which contain a small class of interest, such as those used within this type of project; a KDD toolkit may therefore suggest appropriate cleansing operations under such circumstances. Pre-processing. The size of the databases to be used means that random sampling is likely to be carried out to create testing and training subsets. Feature construction algorithms can potentially create new, powerfully predictive fields. Feature subset selection may also prove useful if there is a large number of fields. Data mining. For the task of mailshot targeting, approaches such as neural networks, rule induction and decision tree induction algorithms may be used. If the high level task is description then a neural network is unlikely to be used. The databases which will be used for this task are likely to contain a class of interest that is extremely small; decision tree induction algorithms are therefore likely to prove less useful than rule induction algorithms that are capable of producing rules to describe a pre-specified class. This phase is likely to be undertaken a considerable number of times until satisfactory results are produced and the project may have to return to the pre-processing, cleansing or even resourcing phase to help achieve this goal. 
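For a mailshot targeting task of this kind, the quality of an individual induced rule on the test database can be summarised by its accuracy (the proportion of the records it selects that really belong to the class of interest) and its generality or coverage (the proportion of the class of interest that it selects), in the spirit of the interest measures discussed in section 2.1.7. The rule, field names and records in the sketch below are hypothetical.

```python
def rule(record):
    """A hypothetical induced rule describing likely responders:
    IF age_interval = 2 AND owns_car = yes THEN response = yes."""
    return record["age_interval"] == 2 and record["owns_car"] == "yes"

def evaluate_rule(rule, test_records, class_field="response", target="yes"):
    """Accuracy and coverage of a rule for a pre-specified class of interest,
    measured on the test database (section 2.6.1)."""
    selected = [r for r in test_records if rule(r)]
    hits = [r for r in selected if r[class_field] == target]
    positives = [r for r in test_records if r[class_field] == target]
    accuracy = len(hits) / len(selected) if selected else 0.0
    coverage = len(hits) / len(positives) if positives else 0.0
    return accuracy, coverage

test_records = [
    {"age_interval": 2, "owns_car": "yes", "response": "yes"},
    {"age_interval": 2, "owns_car": "yes", "response": "no"},
    {"age_interval": 1, "owns_car": "no", "response": "no"},
    {"age_interval": 0, "owns_car": "yes", "response": "yes"},
]
print(evaluate_rule(rule, test_records))  # (0.5, 0.5) on this toy data
```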
Evaluation of results. Suitability for the application area will be crucial within mailshot targeting projects; test database performance, together with generality level, is also likely to be a useful measure of quality. Simplicity may be required, since description will generally be a high level task of the project, and statistical analysis is likely to be used to determine the significance of the results. It is likely that the project will return to earlier phases at least once from this phase.

Interpretation of results. A considerable amount of domain expertise is likely to exist, which can be used to further evaluate the discovered knowledge. It will generally be necessary to examine how the new knowledge fits with existing knowledge and how it is to be used alongside it. Again, it is likely that the project will return to earlier phases at least once from this phase.

Exploitation of results. If the results produced within the project so far are of sufficient quality then the organisation will be keen to exploit them as rapidly as possible. The knowledge is likely to have a limited shelf life and will thus degrade over time. In addition, competitors of the organisation are likely to be undertaking their own projects in similar areas, so the competitive advantage gained is maximised by rapid exploitation. However, the risk associated with exploitation can be considerable, and so simulation and a degree of trialling are likely to be performed to reduce it.
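As a rough illustration of the simulation that might precede full exploitation, the sketch below (plain Python; the customer base size, costs and response rates are entirely hypothetical figures, the latter taken, say, from test database performance) estimates the expected profit of mailing only the customers selected by the discovered rules compared with a blanket mailshot. Figures of this kind might inform the decision to proceed to small scale trials; a fuller simulation could sample response counts rather than using simple expectations.

    def expected_profit(n_mailed, response_rate, cost_per_mailshot=0.50,
                        profit_per_response=40.0):
        """Expected profit of a mailshot under assumed cost and response figures."""
        responses = n_mailed * response_rate
        return responses * profit_per_response - n_mailed * cost_per_mailshot

    # Entirely hypothetical figures: a customer base of 200,000, a 2% background
    # response rate, and discovered rules that select 15% of the base with an
    # estimated 8% response rate.
    customer_base = 200_000
    blanket = expected_profit(customer_base, response_rate=0.02)
    targeted = expected_profit(int(customer_base * 0.15), response_rate=0.08)

    print(f"Blanket mailshot : {blanket:12.2f}")
    print(f"Targeted mailshot: {targeted:12.2f}")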
References

[1] R. J. Brachman and T. Anand. The process of knowledge discovery in databases: A human-centered approach. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, 1995.

[2] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.

[3] P. Chapman, J. Clinton, J. H. Hejlesen, R. Kerber, T. Khabaza, T. Reinartz, and R. Wirth. The current CRISP-DM process model for data mining. Distributed at a CRISP-DM Special Interest Group meeting, 1998.

[4] J. C. W. Debuse. Exploitation of Modern Heuristic Techniques within a Commercial Data Mining Environment. PhD thesis, University of East Anglia, 1997.

[5] P. A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice-Hall International, London, 1982.

[6] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and unsupervised discretization of continuous features. In Prieditis and Russell [20], pages 194–202.

[7] O. Etzioni. The world-wide web: Quagmire or gold mine? In U. M. Fayyad and R. Uthurusamy, editors, Comm. ACM, volume 39(11), November 1996.

[8] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. Knowledge discovery and data mining: Towards a unifying framework. In Simoudis et al. [24].

[9] A. Gupta and M. S. Lam. Estimating missing values using neural networks. Journal of the Operational Research Society, 47:229–238, 1996.

[10] C. M. Howard. The DataLamp package. School of Information Systems, University of East Anglia, 1998.

[11] A. Ittner and M. Schlosser. Discovery of relevant new features by generating non-linear decision trees. In Simoudis et al. [24], pages 108–113.

[12] G. H. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In W. W. Cohen and H. Hirsh, editors, Machine Learning: Proc. of the Eleventh Int. Conf., pages 121–129, San Francisco, 1994. Morgan Kaufmann.

[13] R. Kohavi and G. H. John. Automatic parameter selection by minimizing estimated error. In Prieditis and Russell [20], pages 304–312.

[14] C. J. Matheus and L. A. Rendell. Constructive induction on decision trees. In Proc. of the Eleventh Int. Joint Conf. on Artificial Intelligence. Morgan Kaufmann, 1989.

[15] S. McClean and B. Scotney. Distributed database management for uncertainty handling in data mining. In Proc. of the Data Mining Conf., pages 291–311. UNICOM, 1996.

[16] Knowledge Discovery Nuggets. Siftware, 1998. www.kdnuggets.com/siftware.html.

[17] M. Pei, E. D. Goodman, W. F. Punch, and Y. Ding. Genetic algorithms for classification and feature extraction. In Proc. of the Classification Soc. Conf., 1995.

[18] G. Piatetsky-Shapiro. From data mining to knowledge discovery: the roadmap. In Proc. of the Data Mining Conf., pages 209–221, 1996.

[19] R. S. Pressman. Software Engineering: A Practitioner's Approach. McGraw-Hill, 1992.

[20] A. Prieditis and S. Russell, editors. Proc. of the Twelfth Int. Conf. on Machine Learning, San Francisco, CA, 1995. Morgan Kaufmann.

[21] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[22] V. J. Rayward-Smith and J. C. W. Debuse. Knowledge discovery issues within the financial services sector: the benefits of a rule based approach. In Proc. of the Unicom Data Mining / Data Warehouse seminar, 1998.

[23] V. J. Rayward-Smith, J. C. W. Debuse, and B. de la Iglesia. Using a genetic algorithm to data mine in the financial services sector. In A. Macintosh and C. Cooper, editors, Applications and Innovations in Expert Systems III, pages 237–252. SGES Publications, 1995.

[24] E. Simoudis, J. W. Han, and U. Fayyad, editors. Proc. of the Second Int. Conf. on Knowledge Discovery and Data Mining (KDD-96), 1996.

[25] Pilot Software. Glossary of data mining terms, 1998. Available electronically from: www.pilotsw.com/r and t/datamine/dmglos.htm.

[26] S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning and Expert Systems. Morgan Kaufmann, San Francisco, 1991.

[27] J. P. Yoon and L. Kerschberg. A framework for knowledge discovery and evolution in databases. IEEE Trans. on Knowledge and Data Engineering, 5(6):973–979, 1993.