CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
Knowledge Discovery in Databases (KDD) is the process of automatic
discovery of previously unknown patterns, rules, and other regular contents
implicitly present in large volumes of data[1]. Data Mining (DM) denotes
discovery of patterns in a data set previously prepared in a specific way. DM
is often used as a synonym for KDD. However, strictly speaking DM is just a
central phase of the entire process of KDD.
The idea of automatic knowledge discovery in large databases is first
presented informally, by describing some practical needs of users of modern
database systems. The scope of KDD and DM is briefly presented in terms of
classification of KDD/DM problems and common points between KDD and
several other scientific and technical disciplines that have well-developed
methodologies and techniques used in the field of KDD.
1.2 DATA MINING AND WAREHOUSING CONCEPTS
The past couple of decades have seen a dramatic increase in the amount of
information or data being stored in electronic format. This accumulation of
data has taken place at an explosive rate[2]. It has been estimated that the
amount of information in the world doubles every 20 months and the sizes as
well as number of databases are increasing even faster[10]. There are many
examples that can be cited. Point-of-sale data in retail, policy and claim data in
insurance, medical history data in health care, and financial data in banking and
securities are some instances of the types of data being collected.
Data storage became easier as large amounts of computing power became
available at low cost; the falling cost of processing power and storage made
data cheap. New machine learning methods for knowledge representation,
based on logic programming and related techniques, were also introduced in
addition to traditional statistical analysis of data. These new methods tend to
be computationally intensive, hence the demand for more processing power.
It was recognized that information is at the heart of business operations and
that decision makers could make use of the data stored to gain valuable insight
into the business[5]. Database Management Systems gave access to the data
stored but this was only a small part of what could be gained from the data.
Traditional on-line transaction processing systems, OLTPs, are good at putting
data into databases quickly, safely and efficiently but are not good at
delivering meaningful analysis in return[6]. Analyzing data can provide
further knowledge about a business by going beyond the data explicitly stored
to derive knowledge about the business. Data mining, also called data
archaeology, data dredging, or data harvesting, is the process of extracting
hidden knowledge from large volumes of raw data and using it to make crucial
business decisions[11]. This is where Data Mining or Knowledge Discovery in
Databases (KDD) has obvious benefits for any enterprise.
1.3 DATA MINING DEFINITIONS
The term data mining has been stretched beyond its limits to apply to any form
of data analysis. Some of the numerous definitions of Data Mining, or
Knowledge Discovery in Databases are:
Extraction of interesting information or patterns from data in large databases
is known as data mining[13].
According to William J. Frawley, Gregory Piatetsky-Shapiro and Christopher
J. Matheus “Data Mining, or Knowledge Discovery in Databases (KDD) as it
is also known, is the nontrivial extraction of implicit, previously unknown, and
potentially useful information from data”[9]. This encompasses a number of
different technical approaches, such as clustering, data summarization,
learning classification rules, finding dependency networks, analyzing changes,
and detecting anomalies.
According to Marcel Holsheimer and Arno Siebes, "Data mining is the search
for relationships and global patterns that exist in large databases but are
'hidden' among the vast amount of data, such as a relationship between
patient data and their medical diagnosis. These relationships represent
valuable knowledge about the database and the objects in the database and, if
the database is a faithful mirror, of the real world registered by the
database"[12].
Data mining refers to "using a variety of techniques to identify nuggets of
information or decision-making knowledge in bodies of data and extracting
these in such a way that they can be put to use in areas such as decision
support, prediction, forecasting and estimation. The data is often voluminous
but, as it stands, of low value as no direct use can be made of it; it is the
hidden information in the data that is useful"[12].
Data mining is concerned with the analysis of data and the use of software
techniques for finding patterns and regularities in sets of data. It is the
computer which is responsible for finding the patterns by identifying the
underlying rules and features in the data. The idea is that it is possible to strike
gold in unexpected places as the data mining software extracts patterns not
previously discernible or so obvious that no one has noticed them before[2].
Data mining analysis tends to work from the data up and the best techniques
are those developed with an orientation towards large volumes of data, making
use of as much of the collected data as possible to arrive at reliable
conclusions and decisions. The analysis process starts with a set of data, uses a
methodology to develop an optimal representation of the structure of the data
during which time knowledge is acquired. Once knowledge has been acquired
this can be extended to larger sets of data working on the assumption that the
larger data set has a structure similar to the sample data. Again this is
analogous to a mining operation where large amounts of low-grade materials
are sifted through in order to find something of value.
1.4 DATA MINING PROCESS
Data mining operations require a systematic approach. The process of data
mining is generally specified in the form of an ordered list but the process is
not linear. At times, one may need to step back and rework on the previously
performed step[5].
The general phases in the data mining process to extract knowledge are:
1. Problem definition: This phase is to understand the problem and the
domain environment in which the problem occurs. We need to clearly
define the problem before we proceed further. Problem definition
specifies the limits within which the problem needs to be solved. It
also specifies the cost limitations to solve the problem.
2. Creating a database for data mining: This phase is to create a
database where the data to be mined are stored for knowledge
acquisition. Creating this database does not require creating a specialized
database management system; we can even use any storage where a large
amount of data is kept for data mining. The creation of the data mining
database consumes about 50% to 90% of the overall data mining
process.
3. Exploring the database: This phase is to select and examine
important data sets of a data mining database in order to determine
their feasibility to solve the problem. Exploring the database is a
time-consuming process and requires a good user interface and a computer
system with good processing speed.
4. Preparation for creating a data mining model: This phase is to
select variables to act as predictors. New variables may also be built
from the existing variables, and the range of each variable is defined
in order to support imprecise information.
5. Building a data mining model: This phase is to create multiple data
mining models and to select the best of these models. Building a data
mining model is an iterative process. At times we need to go back to
the problem definition phase in order to change the problem definition
itself. The data mining model that we select can be a decision tree, an
artificial neural network, or an association rule model.
6. Evaluating the data mining model: This phase is to evaluate the
accuracy of the selected data mining model. In data mining, the
evaluation parameter is data accuracy, used to test the working of
the model, because the information generated in the simulated
environment varies from the external environment. The errors that
occur during the evaluation phase need to be recorded, and the cost
and time involved in rectifying them need to be estimated. External
validation also needs to be performed in order to check whether the
selected model performs correctly when provided with real-world
values (a minimal sketch of this build-and-evaluate loop is given
after this list).
7. Deploying the data mining model: This phase is to deploy the built
and the evaluated data mining model in the external working
environment. A monitoring system should monitor the working of the
model and generate reports about its performance. The information in
the report helps enhance the selected data mining model.
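As a concrete illustration of phases 5 and 6, the following is a minimal sketch
of building several candidate models and selecting the best one by evaluating
accuracy on held-out data. It assumes the scikit-learn library and a synthetic
dataset; the model choices and parameters are illustrative only and are not
taken from the thesis.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Hypothetical data standing in for the data mining database (phase 2).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Phase 5: build several candidate models (a decision tree and a small
# neural network, two of the model types mentioned above).
candidates = {
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "neural network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                                    random_state=0),
}

# Phase 6: evaluate each model on held-out data and keep the most accurate.
best_name, best_score = None, 0.0
for name, model in candidates.items():
    model.fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    print(name, "accuracy:", round(score, 3))
    if score > best_score:
        best_name, best_score = name, score

print("selected model:", best_name)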
1.5 KNOWLEDGE DISCOVERY IN DATABASES (KDD)
With the enormous amount of data stored in files, databases, and other
repositories, it is increasingly important, if not necessary, to develop powerful
means for analysis and perhaps interpretation of such data and for the
extraction of interesting knowledge that could help in decision-making. Data
Mining, also popularly known as Knowledge Discovery in Databases (KDD),
refers to the nontrivial extraction of implicit, previously unknown and
potentially useful information from data in databases[6]. While data mining
and knowledge discovery in databases (or KDD) are frequently treated as
synonyms, data mining is actually part of the knowledge discovery process.
The following figure (Figure 1.1) shows data mining as a step in an iterative
knowledge discovery process.
Figure 1.1: Knowledge discovery process
The Knowledge Discovery in Databases process comprises a few steps
leading from raw data collections to some form of new knowledge. The
iterative process consists of the following steps:
1. Data cleaning: Also known as data cleansing, it is a phase in which
noisy data and irrelevant data are removed from the collection.
2. Data integration: At this stage, multiple data sources, often
heterogeneous, may be combined in a common source.
3. Data selection: At this step, the data relevant to the analysis is decided
on and retrieved from the data collection.
4. Data transformation: Also known as data consolidation, it is a phase
in which the selected data is transformed into forms appropriate for the
mining procedure.
5. Data mining: It is the crucial step in which clever techniques are
applied to extract potentially useful patterns.
6. Pattern evaluation: In this step, strictly interesting patterns
representing knowledge are identified based on given measures.
7. Knowledge representation: It is the final phase in which the
discovered knowledge is visually represented to the user. This essential
step uses visualization techniques to help users understand and
interpret the data mining results.
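The following is a minimal end-to-end sketch of these seven steps on a toy
example, assuming the pandas library; the table names, columns and values are
hypothetical and purely illustrative.

import pandas as pd

# Two hypothetical raw data sources (names and contents are invented).
sales = pd.DataFrame({"customer_id": [1, 2, 2, 3, None],
                      "amount": [120.0, 80.0, None, 300.0, 50.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "segment": ["retail", "retail", "corporate"]})

# 1. Data cleaning: remove noisy / incomplete records.
sales = sales.dropna()
sales["customer_id"] = sales["customer_id"].astype(int)

# 2. Data integration: combine the two heterogeneous sources.
data = sales.merge(customers, on="customer_id")

# 3. Data selection: keep only the attributes relevant to the analysis.
data = data[["segment", "amount"]].copy()

# 4. Data transformation: consolidate into a form suitable for mining.
data["amount_scaled"] = (data["amount"] - data["amount"].mean()) / data["amount"].std()

# 5. Data mining: a simple per-segment summarization stands in for this step.
patterns = data.groupby("segment")["amount"].agg(["count", "mean"])

# 6. Pattern evaluation: keep only patterns with enough support to be interesting.
interesting = patterns[patterns["count"] >= 2]

# 7. Knowledge representation: present the result to the user.
print(interesting)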
It is common to combine some of these steps together. For instance, data
cleaning and data integration can be performed together as a pre-processing
phase to generate a data warehouse[7]. Data selection and data
transformation can also be combined, where the consolidation of the data is
the result of the selection or, as in the case of data warehouses, the selection
is done on transformed data.
KDD is an iterative process. Once the discovered knowledge is presented
to the user, the evaluation measures can be enhanced, the mining can be
further refined, new data can be selected or further transformed, or new data
sources can be integrated, in order to get different, more appropriate results.
Data mining derives its name from the similarities between searching for
valuable information in a large database and mining rocks for a vein of
valuable ore. Both imply either sifting through a large amount of material or
ingeniously probing the material to exactly pinpoint where the values reside. It
is, however, a misnomer, since mining for gold in rocks is usually called "gold
mining" and not "rock mining", thus by analogy, data mining should have
been called "knowledge mining" instead[14]. Nevertheless, data mining
became the accepted customary term and very rapidly became a trend that even
overshadowed more general terms such as knowledge discovery in databases
(KDD) that describe a more complete process. Other similar terms referring to
data mining are: data dredging, knowledge extraction and pattern discovery.
1.6 DATA MINING VERSUS KNOWLEDGE DISCOVERY IN
DATABASES
The terms knowledge discovery in databases (KDD) and data mining are often
used interchangeably. In fact, there have been many other names given to this
process of discovering useful (hidden) patterns in data: knowledge extraction,
information discovery, exploratory data analysis, information harvesting, and
unsupervised pattern recognition[12]. Over the last few years KDD has been
used to refer to a process consisting of many steps, while data mining is only
one of these steps.
Definition 1.1 Knowledge Discovery in databases (KDD) is the process of
finding useful information and patterns in data[12].
Definition 1.2 Data mining is the use of algorithms to extract the information
and patterns derived by the KDD process[12].
The KDD process is often said to be nontrivial; however, we take the larger
view that KDD is an all-encompassing concept. A traditional SQL database
query can be viewed as the data mining part of a KDD process. Indeed, this
may be viewed as somewhat simple and trivial. However, this was not the case
30 years ago. If we were to advance 30 years into the future, we might find
that processes thought of today as nontrivial and complex will be viewed as
equally simple. The definition of KDD includes the keyword useful. Although
some definitions have included the term "potentially useful," we believe that if
the information found in the process is not useful, then it really is not
information. Of course, the idea of being useful is relative and depends on the
individuals involved.
KDD is a process that involves many different steps. The input to this process
is the data, and the output is the useful information desired by the users.
However, the objective may be unclear or inexact.
1.7 PROCESS MODELS OF DATA MINING
We need to follow a systematic approach of data mining for meaningful
retrieval of data from large data banks. Several process models have been
proposed by various individuals and organizations that provide systematic
steps for data mining. The four most popular process models of data mining
are[4]:
1. 5 A's process model
2. CRISP-DM process model
3. SEMMA process model
4. Six-Sigma process model
1.7.1 5 A's process model
The 5 A's process model has been proposed and used by SPSS Inc., Chicago,
USA. The 5 A's in this process model stand for Assess, Access, Analyse, Act,
and Automate[9]. SPSS uses this model as a preparatory step towards data
mining and does not provide any further description of how to perform the
various data mining tasks. After initially applying the 5 A's process model,
SPSS uses the CRISP-DM process model discussed in the next section, to
analyse data in a data bank.
Figure 1.2: The 5 A's process model
The 5 A's process model of data mining generally begins by first assessing the
problem in hand. The next logical step is to access or accumulate data that are
related to the problem. After that, we analyse the accumulated data from
different angles using various data mining techniques[8]. We then extract
meaningful information from the analysed data and implement the result in
solving the problem in hand. Finally, we try to automate the process of data
mining by building software that uses the various techniques applied in the 5
A's process model. Figure 1.2 shows the life cycle of the 5 A's process model.
1.7.2 CRISP-DM Process Model
The CRISP-DM Process model has been proposed by a group of vendors, viz.
NCR Systems Engineering Copenhagen (Denmark), Daimler-Benz AG
(Germany), SPSS/Integral Solutions Ltd. (United Kingdom), and OHRA V
(The Netherlands)[10]. In this process model, CRISP-DM stands for
Cross-Industry Standard Process for Data Mining.
Figure 1.3: CRISP-DM Process Model
The CRISP-DM process model provides several data mining techniques that
can be applied to a specific dataset. Moreover, it is also possible to use a
single data mining technique for different types of datasets. In such cases, the
CRISP-DM process model is not a strictly top-down process; rather, one may
jump from one phase of the model to another before completing a full cycle
of the process.
The life cycle of CRISP-DM process model consists of six phases:
1. Understanding the business: This phase is to understand the
objectives and requirements of the business problem and to generate a
data mining definition for the business problem.
2. Understanding the data: This phase is to first analyze the data
collected in the first phase and study its characteristics and matching
patterns to propose a hypothesis for solving the problem.
3. Preparing the data: This phase is to create final datasets that are input
to various modeling tools. The raw data items are first transformed and
cleaned to generate datasets that are in the form of tables, records, and
fields.
4. Modeling: This phase is to select and apply different modeling
techniques of data mining, then input the datasets prepared in the
previous phase to these modeling techniques and analyze the generated
output.
5. Evaluation: This phase is to evaluate the model or set of models
generated in the previous phase for better analysis of the refined data.
6. Deployment: This phase is to organize and implement the knowledge
gained from the evaluation phase in such a way that it is easy for the
end users to comprehend.
1.7.3 SEMMA Process Model
The SEMMA Process Model has been proposed and used by SAS Institute
Inc. In this process model, SEMMA stands for Sample, Explore, Modify,
Model, and Assess[6]. Figure 1.4 shows the life cycle of the SEMMA process
model.
The life cycle of the SEMMA Process Model consists of five phases:
1. Sample: This phase is to extract a portion from a large data bank such
that meaningful information can be retrieved from the extracted portion
of data. Selecting a portion from a large data bank significantly reduces
the amount of time required for processing.
2. Explore: This phase is to explore and refine the sample portion of data
using various statistical data mining techniques in order to search for
unusual trends and irregularities in the sample data. For example, an
online trading organization can use the technique of clustering to find a
group of consumers that have similar ordering patterns.
3. Modify: This phase is to modify the explored data by creating,
selecting, and transforming the predictive variables for the selection of
a prospective data mining model. As per the problem in hand, one may
need to add new predictive variables or delete existing predictive
variables to narrow down the search for a useful solution to the
problem.
Figure 1.4: SEMMA Process Model
4. Model: This phase is to select a data mining model that automatically
searches for a combination of data which can be used to predict the
required result for the problem. Some of the modeling techniques that
can be used as models are neural networks and statistical models.
5. Assess: This phase is to assess the usefulness and reliability of the data
generated by the model selected in the previous phase and to estimate
its performance. One can assess the selected model by applying the
sample data collected in the sample phase and checking the output
data.
1.7.4 Six-Sigma Process Model
Six-Sigma is a data-driven process model that eliminates defects, waste, and
quality control problems that generally occur in a production environment.
This model has been pioneered by Motorola and popularised by General
Electric (GE)[8]. Six-Sigma is very popular in various American industries
due to its easy implementation, and it is likely to be implemented worldwide.
This process model is based on various statistical techniques, use of various
types of data analysis techniques, and implementation of systematic training of
all the employees of an organization. Six-Sigma process model postulates a
sequence of five stages called DMAIC, which stands for Define, Measure,
Analyse, Improve and Control. Figure 1.5 shows the five phases in the life
cycle of the Six-Sigma process model:
Figure 1.5: Six-Sigma Process Model
The life cycle of the Six-Sigma process model consists of five phases:
1. Define: This phase is to define the goals of a project along with its
limitations. This phase also identifies the issues that need to be
addressed in order to achieve the defined goal.
2. Measure: This phase is to collect information about the current
process in which the work is done and to try to identify the basics of
the problem.
3. Analyze: This phase is to identify and verify the root causes of the
problem in hand, building on the information collected in the previous
phase.
4. Improve: This phase is to implement measures that address the
identified root causes in order to improve the process.
5. Control: This phase is to monitor the outcome of all the previous
phases and suggest improvement measures in each of the earlier phases.
1.8. DATA MINING FUNCTIONALITIES
The kinds of patterns that can be discovered depend upon the data mining
tasks employed. There are two types of data mining tasks: descriptive data
mining tasks that describe the general properties of the existing data,
and predictive data mining tasks that attempt to make predictions based on
inference from the available data[2]. The data mining functionalities and the variety
of knowledge they discover are briefly presented in the following list:
1. Characterization: Data characterization is a summarization of the general
features of objects in a target class, and produces what are
called characteristic rules. The data relevant to a user-specified class are
normally retrieved by a database query and run through a summarization
module to extract the essence of the data at different levels of abstraction.
With concept hierarchies on the attributes describing the target class,
the attribute-oriented induction method can be used, for example, to carry
out data summarization. With a data cube containing a summarization of
the data, simple OLAP operations fit the purpose of data characterization.
2. Discrimination: Data discrimination produces what are called discriminant
rules and is basically the comparison of the general features of objects
between two classes, referred to as the target class and the contrasting class.
The techniques used for data discrimination are very similar to the
techniques used for data characterization, with the exception that data
discrimination results include comparative measures.
3. Association analysis: Association analysis is the discovery of what are
commonly called association rules. It studies the frequency of items
occurring together in transactional databases and, based on a threshold
called support, identifies the frequent item sets. Another threshold,
confidence, which is the conditional probability that an item appears in a
transaction when another item appears, is used to pinpoint association
rules. Association analysis is commonly used for market basket analysis.
For example, it could be useful for the Video Store manager to know what
movies are often rented together, or if there is a relationship between
renting a certain type of movie and buying popcorn or pop. The
discovered association rules are of the form P -> Q [s, c], where P and Q
are conjunctions of attribute-value pairs, s (for support) is the probability
that P and Q appear together in a transaction, and c (for confidence) is the
conditional probability that Q appears in a transaction when P is present.
For example, the hypothetical association rule RentType(X, "game") AND
Age(X, "13-19") -> Buys(X, "pop") [s=2%, c=55%] would indicate that 2%
of the transactions considered are of customers aged between 13 and 19
who are renting a game and buying pop, and that there is a certainty of
55% that teenage customers who rent a game also buy pop (a small sketch
that computes these two measures is given after this list).
4. Classification: Classification analysis is the organization of data into given
classes. Also known as supervised classification, classification uses
given class labels to order the objects in the data collection. Classification
approaches normally use a training set where all objects are already
associated with known class labels. The classification algorithm learns from
the training set and builds a model, and the model is used to classify new
objects. For example, after starting a credit policy, the Video Store
managers could analyze the customers' behaviour vis-à-vis their credit and
label accordingly the customers who received credit with three possible
labels: "safe", "risky" and "very risky". The classification analysis would
generate a model that could be used to either accept or reject credit requests
in the future.
5. Prediction: Prediction has attracted considerable attention given the
potential implications of successful forecasting in a business context. There
are two major types of prediction: one can either try to predict some
unavailable data values or pending trends, or predict a class label for some
data. The latter is tied to classification. Once a classification model is built
based on a training set, the class label of an object can be foreseen from
the attribute values of the object and the attribute values of the classes.
Prediction, however, more often refers to the forecast of missing
numerical values, or of increase/decrease trends in time-related data. The
major idea is to use a large number of past values to estimate probable
future values.
6. Clustering: Similar to classification, clustering is the organization of data
into classes. However, unlike classification, in clustering the class labels are
unknown and it is up to the clustering algorithm to discover acceptable
classes. Clustering is also called unsupervised classification, because the
classification is not dictated by given class labels. There are many
clustering approaches, all based on the principle of maximizing the
similarity between objects in the same class (intra-class similarity) and
minimizing the similarity between objects of different classes (inter-class
similarity); a minimal clustering sketch is given at the end of this section.
7. Outlier analysis: Outliers are data elements that cannot be grouped into a
given class or cluster. Also known as exceptions or surprises, they are often
very important to identify. While outliers can be considered noise and
discarded in some applications, they can reveal important knowledge in
other domains, and thus can be very significant and their analysis valuable.
8. Evolution and deviation analysis: Evolution and deviation analysis
pertain to the study of time-related data that change over time. Evolution
analysis models evolutionary trends in the data, which allows
characterizing, comparing, classifying or clustering of time-related data.
Deviation analysis, on the other hand, considers differences between
measured values and expected values, and attempts to find the cause of the
deviations from the anticipated values.
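To make the support and confidence measures used above concrete, the
following is a minimal sketch in plain Python that computes s and c for a rule
P -> Q over a handful of made-up transactions; the transaction contents and
the resulting values are purely illustrative and do not correspond to the
percentages quoted in the example above.

# Each transaction is the set of attribute-value pairs seen in one customer visit.
transactions = [
    {"RentType=game", "Age=13-19", "Buys=pop"},
    {"RentType=game", "Age=13-19"},
    {"RentType=movie", "Age=20-35", "Buys=popcorn"},
    {"RentType=game", "Age=13-19", "Buys=pop"},
    {"RentType=movie", "Age=13-19"},
]

def support(itemset, transactions):
    # Fraction of all transactions that contain every item in the itemset.
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Conditional probability of Q given P: support(P and Q) / support(P).
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

P = {"RentType=game", "Age=13-19"}
Q = {"Buys=pop"}
print("support   s =", support(P | Q, transactions))
print("confidence c =", confidence(P, Q, transactions))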
It is common that users do not have a clear idea of the kind of patterns they
can discover or need to discover from the data at hand. It is therefore
important to have a versatile and inclusive data mining system that allows the
discovery of different kinds of knowledge and at different levels of
abstraction. This also makes interactivity an important attribute of a data
mining system.
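As a minimal illustration of the clustering functionality described in point 6
above, the following sketch assumes the scikit-learn library and groups a few
hypothetical customer records into two clusters with k-means; the data values
and parameters are invented for illustration only.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer records: (rentals per month, average spend).
X = np.array([[2, 10], [3, 12], [2, 11],      # an apparent low-activity group
              [20, 80], [22, 85], [19, 78]])  # an apparent high-activity group

# No class labels are supplied; the algorithm discovers the groups itself,
# maximizing intra-cluster similarity and minimizing inter-cluster similarity.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:", kmeans.labels_)
print("cluster centres:", kmeans.cluster_centers_)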
1.9 CATEGORIES OF DATA MINING SYSTEMS
There are many data mining systems available or being developed. Some are
specialized systems dedicated to a given data source or confined to limited
data mining functionalities, while others are more versatile and comprehensive.
Data mining systems can be categorized according to various criteria; among
other classifications are the following[4]:
1. Classification according to the type of data source mined: This
classification categorizes data mining systems according to the type of data
handled, such as spatial data, multimedia data, time-series data, text data,
World Wide Web, etc.
2. Classification according to the data model drawn on: This classification
categorizes data mining systems based on the data model involved, such as
relational database, object-oriented database, data warehouse,
transactional, etc.
3. Classification according to the kind of knowledge discovered: This
classification categorizes data mining systems based on the kind of
knowledge discovered or data mining functionalities, such as
characterization, discrimination, association, classification, clustering, etc.
Some systems tend to be comprehensive systems offering several data
mining functionalities together.
4. Classification according to mining techniques used: Data mining
systems employ and provide different techniques. This classification
categorizes data mining systems according to the data analysis approach
used such as machine learning, neural networks, genetic algorithms,
statistics, visualization, database-oriented or data warehouse-oriented, etc.
The classification can also take into account the degree of user interaction
involved in the data mining process such as query-driven systems,
interactive exploratory systems, or autonomous systems. A comprehensive
system would provide a wide variety of data mining techniques to fit
different situations and options, and offer different degrees of user
interaction.
1.10 DATA MINING ISSUES
There are many important implementation issues associated with data
mining[5]:
1. Human interaction: Since data mining problems are often not
precisely stated, interfaces may be needed with both domain and
technical experts. Technical experts are used to formulate the queries
and assist in interpreting the results. Users are needed to identify
training data and desired results.
2. Overfitting: When a model is generated that is associated with a given
database state, it is desirable that the model also fit future database
states. Overfitting occurs when the model does not fit future states.
This may be caused by assumptions that are made about the data or
may simply be caused by the small size of the training database. For
example, a classification model for an employee database may be
developed to classify employees as short, medium, or tall. If the
training database is quite small, the model might erroneously indicate
that a short person is anyone under five feet eight inches because
there is only one entry in the training database under five feet eight. In
this case, many future employees would be erroneously classified as
short. Overfitting can arise under other circumstances as well, even
though the data are not changing (a small illustration is given at the
end of this section).
3. Outliers: There are often many data entries that do not fit nicely into
the derived model. This becomes even more of an issue with very large
databases. If a model is developed that includes these outliers, then the
model may not behave well for data that are not outliers.
4. Interpretation of results: Currently, data mining output may require
experts to correctly interpret the results, which might otherwise be
meaningless to the average database user.
5. Visualization of results: To easily view and understand the output of
data mining algorithms, visualization of the results is helpful.
6. Large datasets: The massive datasets associated with data mining
create problems when applying algorithms designed for small datasets.
Many modeling applications scale exponentially with the dataset size
and thus are too inefficient for larger datasets. Sampling and
parallelization are effective tools to attack this scalability problem.
7. High dimensionality: A conventional database schema may be
composed of many different attributes. The problem here is that not all
attributes may be needed to solve a given data mining problem. This
problem is sometimes referred to as the dimensionality curse, meaning
that there are many attributes (dimensions) involved and it is difficult
to determine which ones should be used. One solution to this high
dimensionality problem is to reduce the number of attributes, which is
known as dimensionality reduction.
8. Multimedia data: Most previous data mining algorithms are targeted
to traditional data types (numeric, character, text, etc.). The use of
multimedia data, such as that found in GIS databases, complicates or
invalidates many proposed algorithms.
9. Missing data: During the preprocessing phase of KDD, missing data
may be replaced with estimates. This and other approaches to handling
missing data can lead to invalid results in the data mining step.
10. Irrelevant data: Some attributes in the database might not be of
interest to the data mining task being developed.
11. Noisy data: Some attribute values might be invalid or incorrect. These
values are often corrected before running the data mining application.
12. Changing data: Databases cannot be assumed to be static. However,
most data mining algorithms do assume a static database. This requires
that the algorithm be completely rerun anytime the database changes.
13. Integration: The KDD process is not currently integrated into normal
data processing activities. KDD requests may be treated as special,
unusual, or one-time needs. This makes them inefficient, ineffective,
and not general enough to be used on an ongoing basis. Integration of
data mining functions into traditional DBMS systems is certainly a
desirable goal.
14. Application: Determining the intended use for the information
obtained from the data mining function is a challenge. Indeed, how
business executives can effectively use the output is sometimes
considered the more difficult part, not the running of the algorithms
themselves.
These issues should be addressed by data mining algorithms and products.
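The following is a small illustration of the overfitting issue discussed in point
2, assuming the scikit-learn library: a decision tree fitted on a very small
training sample scores perfectly on its own training data but generalizes worse
than the same model fitted on a larger sample. The dataset is synthetic and the
parameters are illustrative only.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic, slightly noisy data standing in for a database state.
X, y = make_classification(n_samples=400, n_features=6, flip_y=0.1,
                           random_state=1)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.5,
                                                  random_state=1)

# A model fitted on a very small training database ...
tiny_model = DecisionTreeClassifier(random_state=1).fit(X_rest[:10], y_rest[:10])
# ... versus the same model fitted on the full training half.
full_model = DecisionTreeClassifier(random_state=1).fit(X_rest, y_rest)

print("tiny training set: train acc = %.2f, test acc = %.2f" % (
    accuracy_score(y_rest[:10], tiny_model.predict(X_rest[:10])),
    accuracy_score(y_test, tiny_model.predict(X_test))))
print("full training set: train acc = %.2f, test acc = %.2f" % (
    accuracy_score(y_rest, full_model.predict(X_rest)),
    accuracy_score(y_test, full_model.predict(X_test))))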