Dr. Bjarne Berg
DRAFT
Overview of the CRISP-DM Methodology for Data Mining
The Cross Industry Standard Process for Data Mining (CRISP-DM) was initially developed by a consortium of members through a set of workshops with industry practitioners in 1996 and 1997. Following input from a panel of Special Interest Groups (SIGs), the authors consolidated and refined the model over the next few years and presented the methodology in 1999 as a tool for practitioners. The methodology consists of six phases that take the practitioner from the inception of a problem to the completion of the analysis.
The first phase is business understanding. This phase consists of determining the business objectives (why are we doing this) as well as assessing the business situation. This is done to place the data mining effort in the context of the problem and of the organization conducting the activity. As a result of this effort, clear goals are established, prioritized, and incorporated into a project plan with a scope statement, duration, dependencies, and resource allocations. This is often considered part of the ‘project preparation’ phase in other, more traditional System Development Life-Cycle (SDLC) methodologies.
The next phase is called data understanding. In this phase an initial set of data is collected to get a clearer understanding of what is available and to see how the data sets can be used to address the problem(s). The phase includes detailed documentation of the data through the creation of a data library that describes the data. In addition, the phase includes an initial exploration of the data (often done through descriptive statistics) as well as a verification of data quality, so that any issues are identified early in the project and can be addressed before significant effort has been consumed.
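As a minimal sketch of this exploration and quality-verification step, the following Python snippet profiles a data set with descriptive statistics and basic checks; the file name and columns are hypothetical placeholders, and the pandas library is assumed to be available.

    import pandas as pd

    # Hypothetical input file to be documented in the data library
    df = pd.read_csv("customers.csv")

    # Initial exploration through descriptive statistics
    print(df.describe(include="all"))

    # Data-quality verification: missing values and duplicate rows
    print(df.isna().sum())
    print("duplicate rows:", df.duplicated().sum())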
The third phase is data preparation. This includes selection of the actual data to be used in the project, which may be a whole population or simply a sample of the data, and it may include some level of cleansing as well. The next part of this phase is dedicated to data construction, where samples are created, organized based on the tool(s) selected, and integrated into a storage format that the tool can access. It may also include reformatting data into new data types, codes, indicators, and flags, as well as more structured formatting of unstructured data such as text, comments, and other non-numeric data.
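A minimal sketch of such cleansing and construction in Python, again with hypothetical file and column names:

    import pandas as pd

    df = pd.read_csv("customers.csv")

    # Cleansing: drop duplicate rows and rows missing the outcome field
    df = df.drop_duplicates().dropna(subset=["churned"])

    # Construction: recode a text field into an indicator flag
    df["is_premium"] = (df["plan"] == "premium").astype(int)

    # Selection: draw a random sample and store it for the modeling tool
    sample = df.sample(frac=0.2, random_state=42)
    sample.to_csv("prepared_sample.csv", index=False)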
The fourth phase consists of modeling. The first step is to select the appropriate modeling technique. This should be based on the sample size and data type as well as the problem being addressed. For many problems there may be more than one applicable technique, and the modeler can decide to use several to see which yields the best result. After a technique has been selected, it is important that the modeler does not simply engage in number ‘punching’ but instead takes a serious look at the test design of the problem. This includes a detailed approach to testing for validity (are we measuring what we think we are measuring) and reliability (is this only valid for this one data set, or can it be repeated). We would also test to see which assumptions may be violated, e.g., random sampling, sampling methods, normality, and homoscedasticity, as well as the impact of those violations on the test design and subsequent findings. The next step in this phase is to actually build the model. This is often done with a randomly selected subset of the sample; e.g., from an overall sample of 5,000, a random subset of 1,000 can be used to build a model and the remaining 4,000 can be used to assess the model results on known data points.
This is a very common approach when attempting to optimize the model building by leveraging multiple methods and models. The fourth phase ends with an assessment of the model and its ability to predict, illustrate, or explore the findings of the system.
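A minimal sketch of the build/assess split described above, using scikit-learn on synthetic data; the data set, the choice of logistic regression, and all figures are illustrative assumptions, not part of the methodology itself.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Hypothetical overall sample of 5,000 observations with 10 features
    rng = np.random.default_rng(42)
    X = rng.normal(size=(5000, 10))
    y = (X[:, 0] + rng.normal(scale=0.5, size=5000) > 0).astype(int)

    # Build the model on a random subset of 1,000; hold out the
    # remaining 4,000 to assess the result on known data points
    X_build, X_assess, y_build, y_assess = train_test_split(
        X, y, train_size=1000, random_state=42)

    model = LogisticRegression().fit(X_build, y_build)
    pred = model.predict(X_assess)
    print("holdout accuracy:", accuracy_score(y_assess, pred))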
The fifth phase of the methodology is evaluation. In this phase the first step is to evaluate the results and place them back into business context: what do the standard deviation, mean, and other statistical measures represent in business terms? Based on this context, the process is re-evaluated to see if any improvements can be made or if other techniques should be selected. It is important to note that in the previous phases of data preparation and modeling, the methodology recommends an iterative approach, revisiting the data preparation in a cyclical manner until the model has been refined. Later, in the evaluation phase, we are not revisiting the model creation but merely placing it in business context for an evaluation of reasonableness, significance (in statistical as well as business terms), and impact on the organization. As a result, the last step in this phase is the determination of the next steps of the process, which may result in repeating the whole project cycle as a distinctly new effort. This is because the findings of one data mining effort may need to be explored further before any deployment based on the results is considered.
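As a brief, purely illustrative sketch of restating statistical measures in business terms, where every figure is a hypothetical assumption:

    # Hypothetical model outputs and a business assumption
    mean_order_value = 120.0   # mean, in dollars
    std_order_value = 45.0     # standard deviation, in dollars
    orders_per_month = 10_000  # assumed monthly order volume

    # In business terms: expected monthly revenue and typical variation
    expected_revenue = mean_order_value * orders_per_month
    print(f"expected monthly revenue: ${expected_revenue:,.0f}")
    print(f"a typical order varies by about ${std_order_value:.0f} around the mean")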
If the findings are found to have significance and validity, the project may progress to the last phase, known as deployment. This is the phase where the project team asks how the findings should drive changes in the business model or the organizational behavior. This may result in a new way of interacting with customers through credit or new marketing initiatives, or simply a validation of known relationships to see how they might have changed over time. If the findings are actionable, a plan for deployment is created in this phase. This includes planning for the new technology, people, processes, and resources needed to take advantage of the findings. It also includes a plan for monitoring and maintenance of the proposed solution, often known as a ‘sustain organization’ or ‘sustain support model’. The last step in this phase is the creation of the actual final report of the project. This can be delivered through a range of media such as word-processing documents, collaboration rooms, web pages, and any other tools the project may decide to employ. As with any good methodology, CRISP-DM also advocates that the project end with a “lessons learned” session where the participants sit down to review the project shortly before its termination. The purpose of this step is to make sure that organizational learning occurs and that the mistakes made and approaches learned by the project are applied in future efforts by new team members and leveraged by the current members as well. Unfortunately, this step is often ignored. As a result, many organizations are ‘doomed’ to continue making the same mistakes project after project.
The CRISP-DM model organizes the sub-levels of the phases into a hierarchy consisting of generic tasks that are mapped to specialized tasks, which in turn have various process instances. These two lower levels are known as the CRISP processes, and the two higher levels are known as the CRISP Process Model. While the model is short and written at a generic, high level (less than 100 pages), it provides a solid framework for organizations to approach data mining. It is worth noting that some companies, such as SPSS and SAS, have used this methodology as a reference and have been instrumental in its development.