Data Quality Mining: Employing Classifiers for Assuring consistent Datasets
Fabian Grüning
Carl von Ossietzky Universität Oldenburg, Germany,
[email protected]
Abstract: Regardless of the concrete definition of the term “data quality”,
consistency always plays a major role. There are two main tasks when dealing
with the data quality of a database: first, the data quality has to be
measured, and second, if necessary, it must be improved. A classifier can
serve both purposes with regard to consistency demands: the distance between
the classified value and the stored value measures consistency, and the
classified value can be used for correction.
Keywords: data mining, data quality, classifiers, ontology, utilities
1 Introduction
A good introduction to the main topics of the field of “data quality” can be
found in (Scannapieco et al. 2005), where a motivation is given and relevant
data quality dimensions are highlighted. Having discussed an ontology that
describes such a definition and the semantic integration of data quality
aspects into given data schemas using an ontological approach in Grüning
(2006), we now turn to the application of data quality mining algorithms to
estimate the consistency of a given data set and to suggest correct values
where necessary. This is one of the four identified algorithms needed for
the holistic data quality management approach to be developed in a project
funded by a major German utility. One of its goals is to provide an
ICT infrastructure for managing the upcoming power plant mix consisting
of more decentralized, probably renewable, and sustainable power
plants, e.g. wind power and biogas plants, and combined heat and power
generation together with the conventional power plants. As many decisions
for controlling relevant parts of the system are made automatically, good
data quality is vital for the health of the overall system, as false data leads
to wrong decisions that may worsen the system’s overall performance. The
system is used for both day-to-day business and strategic decisions. An
example of a day-to-day decision is the regulation of conventional power
plants with the wind forecast in mind to provide an optimal integration of
sustainable power plants, like wind power plants, into the distribution grid. A
strategic decision might be where to build another wind park, taking
statistical series of wind measurements into account.
The system contains customer data as well as technical data about the
distribution grid and power plants. The data is critical for the company as
well as for the state, as it contains information about vital distribution
systems, so that concrete information about the data cannot be given in this
paper. Therefore the example given later in this paper will only contain a
simple list of dates. The paper therefore focuses on the concepts of the
approach outlined above.
The paper is structured as follows: First, we give a short summary of the
term data quality mining and of the dimensions belonging to it, with a focus
on consistency. We then choose, with justification, a concrete classification
algorithm that fits our needs in the examined field. The following section
describes the process of using a classifier for checking consistency in data
sets and gives an example of the algorithm’s performance. We then touch on
the subject of using domain experts’ knowledge by employing ontologies and
finally present conclusions and further work.
2 Excursus: About Data Quality Mining
The definition of data quality by Redman (1996) comprises four data
quality dimensions: accuracy, consistency, currency as a specialization of
timeliness constraints, and correctness. Having discussed the semantics of
those dimensions in the previous paper, we now concentrate on the
realization of the algorithms for data quality mining, namely for checking
and improving consistency.
The term “data quality mining” means that algorithms of the data mining
domain are utilized for the purpose of data quality management (see Hipp
et al. 2002). In this paper we discuss the consistency aspect of data
quality. We will explain why classification algorithms are reasonably
applicable for this purpose.
2.1 Consistency as a Data Quality Dimension
Whenever there is redundancy in a data set, inconsistencies might occur. A
good example is the topology of a distribution grid that consists of power
supply lines and connections. An inconsistency in a relationally oriented
data store leads to non-realizable topologies where, e.g., a power supply line
has only one connection or is connected more than twice. Such a data-centric
problem leads to real-world problems in the sense that power flow
algorithms cannot be applied to the data, so that management systems and
the power grid become unstable, or short circuits cannot be detected or are
registered all the time.
This example also shows that a consistency check can only be done by
considering a real-world entity, here the distribution grid, as a whole, and
that the verification of consistency works better the more closely the
semantic correlation between real-world entities and data schemas is
realized, so that relationships between the single data properties can be
utilized (see Noy and McGuinness 2001 and Grüning 2006). A particularly good
approach for assuring this correlation is using ontologies for modeling a
real-world extract, as they explicitly keep the relationships inside of and
between the examined real world’s concepts, in contrast to, for example,
normalized relational data schemas.
2.2 Employing Classifiers for Consistency Checking
Classification algorithms are used to classify one data item of a data record
by using the information of the remaining data items. For example, a triple
of two power supply lines and one connection implies that those two lines are
connected by that very connector. This is only true if the value of the
connector is in fact a valid identifier for a connector. If the connector’s
name deviates from the pattern that identifies such a connector, a different
resource is addressed and an invalid topology is represented. Such a
dependence can be learned by a classifier.
If the classified value and the stored value differ from one another, an
inconsistency in the dataset might have been identified, which can even be
corrected by using the classified value as a clue.
Classifiers are therefore in principle usable for finding and correcting
inconsistencies in datasets, and a prototype confirms this assumption, as
shown in the following sections.
To check for every possible consistency violation, every data item has to be
classified in turn using the rest of the data record. It is therefore
necessary to train n classifiers for a data record consisting of n data items.
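The scheme of one model per data item can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's implementation: a 1-nearest-neighbour predictor stands in for the SVM, and all function names and the toy records are hypothetical.

```python
from typing import Callable, List, Sequence

Record = Sequence[float]
Predictor = Callable[[Sequence[float]], float]


def train_nearest_neighbour(features: List[List[float]],
                            targets: List[float]) -> Predictor:
    """Stand-in learner (NOT an SVM): return the target of the closest training row."""
    def predict(x: Sequence[float]) -> float:
        def squared_distance(row: Sequence[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(row, x))
        best = min(range(len(features)), key=lambda i: squared_distance(features[i]))
        return targets[best]
    return predict


def train_per_item_models(records: List[Record]) -> List[Predictor]:
    """Train n models for records with n items: model j predicts item j from the rest."""
    n = len(records[0])
    models = []
    for j in range(n):
        features = [[v for k, v in enumerate(r) if k != j] for r in records]
        targets = [r[j] for r in records]
        models.append(train_nearest_neighbour(features, targets))
    return models


def item_discrepancies(models: List[Predictor], record: Record) -> List[float]:
    """Absolute distance between predicted and stored value for every item."""
    return [
        abs(models[j]([v for k, v in enumerate(record) if k != j]) - record[j])
        for j in range(len(record))
    ]


# Hypothetical error-free training records, already scaled to [0, 1].
training = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 1.0, 1.0)]
models = train_per_item_models(training)

# The second item of this record contradicts the other two items.
suspect = (0.5, 0.1, 0.5)
distances = item_discrepancies(models, suspect)
# The largest distance is found for the item at index 1.
```

Each model sees every item except the one it predicts, so a single corrupted item can also disturb the predictions for the other items; in this toy record the corrupted item nevertheless stands out.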
3 Using Support Vector Machines as a concrete Classification Algorithm
There are several different algorithms for classification tasks, such as
decision trees (C4.5), rule sets (CN2), neural networks, Bayes classifiers,
evolutionary algorithms, and support vector machines. A decision has to be
made as to which algorithm fits the needs of the classification task in the
field of checking consistency in data sets.
The classifiers have in common that their application consists of
two consecutive phases: In the first phase the algorithm learns the
characteristics of the data from a representative data set. This
phase is called the training phase. In the second phase the algorithm
classifies unknown data records utilizing the knowledge gained in phase one
(see Witten and Frank 2005).
There are two main requirements a classification algorithm has to fulfill in
this kind of application:
• The data set for the learning task, in which the algorithm adapts to the
characteristics of the data, is relatively small in comparison to the whole
data set. This is because the data set for the learning task has to be
constructed from error-free data so that the classifier will detect and
complain about data that differs from it. The labeling, i.e. the task of
deciding whether a data record is correct or not, has to be done by domain
experts and is therefore a complex and expensive task.
• The classification approach has to be quite general because not much is
known beforehand about the data to be classified. A well-suited
classification algorithm therefore needs only a few parameters to be
configured in order to be adjusted to the classification task.
Both demands are fulfilled by support vector machines (SVMs), as they
work well even with small data sets, and the configuration effort is
restricted to the choice and configuration of the kernel function that is used
to map the training set’s samples into the high-dimensional feature space
and to the adaptation of the coefficient weighting the costs for
misclassification (see Russell and Norvig 2003 for an introduction to SVMs).
A classifier’s output can also be understood as a recommendation in the
case where the classified value differs from the stored value. The SVM can
be used both as a classification and as a regression algorithm, making it
possible to give recommendations not only for discrete but also for
continuous values. The regression version of the SVM algorithm does not
differ much from the classifier version, in particular in the learning phase,
so that it is easy to use it either way.
4 Prototype for Consistency Checking
The considerations made so far have to be verified by applying them to
real-world data. For this reason a prototype was developed employing YALE
(see Mierswa 2007), a learning environment that allows orchestrating the
processes that are necessary in the field of learning algorithms. As the
whole approach for data quality mining is supported by a German utility,
real data was available for testing purposes.
We will show promising results from a prototype utilizing SVMs as the
classification algorithm for checking consistency in a given data set.
4.1 Phase I: Selection
To compile the training set for the first phase of the learning algorithm, a
selection from the existing data records has to be made (see figure 4.1).
On the one hand, all relevant equivalence classes for the classification task
have to be covered, which is addressed by stratified sampling; on the
other hand, the cardinality of the training set has to be restricted because
of the expensive labeling task for the training set (see section 3).
Therefore, absolute sampling ensures that the training set contains at most
a certain number of data records.
Fig. 4.1. Selection phase
The data itself is converted to interval scale (see Bortz 2005) by one of
the following algorithms: If the data is originally in nominal scale, it
is mapped to [0, 1] equidistantly. Ordinal data is normalized and thereby
also mapped to [0, 1], with the order of the data preserved.
Strings are treated separately: they are mapped to interval scale under a
given string distance function in such a way that similar strings are closer
to one another than less similar strings. The results are clusters of
similar strings that are normalized to [0, 1] and thereby retain a certain
amount of semantics.
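The mapping of nominal and ordinal data can be sketched as follows. The paper does not fix the details, so two assumptions are made here: the distinct nominal values are placed on [0, 1] in their sorted order, and ordinal values are normalized by min-max scaling. The function names and sample values are hypothetical, and the string mapping is not covered.

```python
from typing import Dict, Hashable, List, Sequence


def map_nominal(values: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Spread the distinct nominal values equidistantly over [0, 1].

    Assumption: the (arbitrary) order of the categories is their sorted order.
    """
    categories = sorted(set(values), key=str)
    if len(categories) == 1:
        return {categories[0]: 0.0}
    step = 1.0 / (len(categories) - 1)
    return {category: i * step for i, category in enumerate(categories)}


def normalize_ordinal(values: Sequence[float]) -> List[float]:
    """Min-max scale ordinal values to [0, 1]; the order of the values is preserved."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical attribute values, chosen only for illustration.
plant_types = ["wind", "biogas", "chp", "wind"]
years = [1990, 1995, 2000]

nominal_scale = map_nominal(plant_types)   # one number per distinct category
ordinal_scale = normalize_ordinal(years)   # one number per input value
```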
This preprocessing phase produces data sets that consist only of interval
scaled values, which are therefore suitable for processing by the
regression version of the SVM algorithm. We can now use the distance
between the outcome of the regression algorithm and the mapped value as a
measure of quality. The outcome of the regression algorithm can directly be
used as a recommendation for the correct value.
As a side note, we do not lose any practicability through the data’s
preprocessing, as it is still possible to establish arbitrary bounds in order
to use the classification version of the algorithm.
4.2 Phase II: Learning
In the learning phase the classifier adapts to the characteristics of the data
set. This mainly means adjusting the SVM parameter set so that it adapts
optimally to the training set. As Russell and Norvig (2003) describe, this
means choosing the kernel function that adapts best to the training set
and choosing suitable values for the kernel’s parameters for optimal results.
The learning phase consists of several steps (see figure 4.2):
1. In the preprocessing phase the data sets are completed where necessary
because the classification algorithm cannot handle empty data items.
This poses no problem: the values filled in are uniform, so they are not
characteristic for any data set and cannot influence the classification.
2. The next steps are executed repeatedly to find the optimal parameter
setting for the SVM: The training set is split into a learning subset and a
validation subset, as the cross validation procedure prescribes. The first
set is used for training the classifier and the second set is used for
validating the trained classifier. Cross validation avoids too strict an
adaptation to the training set, so that the classifier only adapts to the
characteristics of the training set and does not “mimic” it. After this has
been done for a defined number of such subset combinations, the overall
performance of the classifier is evaluated and associated with the parameter
configuration.
The more parameter combinations of the classification algorithm are
tested, the better the resulting classifier tends to be. This is one
of the strengths of SVMs, as only three variables are needed to configure
an SVM when the radial basis function is used as the kernel function. The
parameter space can therefore be searched in great detail for the optimal
parameter configuration, so that the resulting classifier is of high quality.
3. Finally, the optimal parameter configuration is used to train a
classifier with the whole training set. This classifier is stored for the
last step of the process of finding inconsistent data, namely applying the
classifier to unknown data records.
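Steps 2 and 3 can be sketched as a cross-validated parameter search. This is an illustration under stated assumptions rather than the YALE process used in the prototype: a simple RBF-weighted average with a single parameter gamma stands in for the SVM and its parameter set, and the function names, the grid, and the toy training set are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[float, float]  # (input value, target value)
Model = Callable[[float], float]


def k_fold_splits(samples: Sequence[Sample], k: int) -> List[Tuple[List[Sample], List[Sample]]]:
    """Split the samples into k (learning subset, validation subset) pairs."""
    folds = [list(samples[i::k]) for i in range(k)]
    splits = []
    for i in range(k):
        learning = [s for j, fold in enumerate(folds) if j != i for s in fold]
        splits.append((learning, folds[i]))
    return splits


def fit_kernel_smoother(learning: Sequence[Sample], gamma: float) -> Model:
    """Stand-in model (NOT an SVM): RBF-weighted average of the learning targets."""
    def predict(x: float) -> float:
        weights = [math.exp(-gamma * (x - xi) ** 2) for xi, _ in learning]
        total = sum(weights)
        if total == 0.0:  # all weights underflowed: fall back to the plain mean
            return sum(y for _, y in learning) / len(learning)
        return sum(w * y for w, (_, y) in zip(weights, learning)) / total
    return predict


def cross_validated_error(samples: Sequence[Sample], gamma: float, k: int) -> float:
    """Mean squared validation error, averaged over the k folds."""
    errors = []
    for learning, validation in k_fold_splits(samples, k):
        model = fit_kernel_smoother(learning, gamma)
        errors.append(sum((model(x) - y) ** 2 for x, y in validation) / len(validation))
    return sum(errors) / len(errors)


def grid_search(samples: Sequence[Sample], grid: Sequence[float],
                k: int = 4) -> Tuple[float, Model, Dict[float, float]]:
    """Score every parameter value, then retrain on the whole training set."""
    scores = {gamma: cross_validated_error(samples, gamma, k) for gamma in grid}
    best_gamma = min(scores, key=scores.get)
    return best_gamma, fit_kernel_smoother(samples, best_gamma), scores


# Hypothetical training set: a smooth relation between one input and one target.
training_set = [(i / 19, math.sin(3 * i / 19)) for i in range(20)]
best_gamma, final_model, scores = grid_search(training_set, grid=[0.1, 1.0, 10.0, 100.0])
```

The final model is fitted on all samples only after the parameter value has been chosen, which mirrors step 3; with an SVM the grid would span the kernel and cost parameters instead of a single gamma.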
Fig. 4.2. Learning phase
4.3 Phase III: Application
In the last phase (see figure 4.3) the classifier is applied to the data
records of the whole data set, searching for discrepancies between classified
and stored values. The more discrepancies are found, the lower the data
quality is regarding the consistency aspect.
As SVMs can also be used for regression, a concrete recommendation
for a correct value can be made in cases where inconsistencies occur.
Such a recommendation is not merely a range but a concrete value, in contrast
to other classification algorithms that are only capable of classification,
like decision trees, which again supports the choice of the classification
algorithm.
Fig. 4.3. Application phase
4.4 Results
A typical result is shown in Table 1. It was generated from a training set
consisting of 128 examples that had been verified to be valid. The classifier
was then used to find inconsistencies between the classified and the stored
values.
In the examples given there are two major discrepancies between the
stored and the classified values (the rows with the stored values 2000.0 and
2003.0).
The first one is the result of a deliberate falsification to demonstrate the
approach’s functionality. The correct value had been “1995”, so that the
distance is large relative to the remaining distances between stored and
classified values and implies an error in the specific data set. The
classified value can be used as a correction and matches the non-falsified
value quite well.
The second one also shows a large distance between the classified and
the stored value although no falsification has taken place. This example
shows that the training set missed a relevant equivalence class, so
that the algorithm wrongly detects an inconsistency. The user has to mark
this wrong classification. Such data sets are then included in the training
set so that in the next learning phase the classifier adapts better to the
data’s characteristics. This procedure may be repeated until the classifier
has adapted well enough to the relevant data set, or executed regularly to
adapt to changes in the underlying structure of the data.
Classified Value    Stored Value
1994.66             2000.0
1994.66             1995.0
1994.66             1995.0
1992.17             1995.0
1990.26             1990.0
1991.68             1990.0
1990.26             1990.0
1990.26             1990.0
1992.35             2003.0
[…]                 […]
Table 1: Prototype's results sample (classified and stored values are shown)
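The comparison of each distance with the remaining distances can be sketched as follows, using the nine rows listed in Table 1. The decision rule is an assumption made for this illustration, not the prototype's rule: a row is flagged when its distance exceeds ten times the median distance. The function name and the factor are hypothetical.

```python
from statistics import median
from typing import List, Sequence, Tuple

Row = Tuple[float, float]  # (classified value, stored value)

# The rows as listed in Table 1.
rows: List[Row] = [
    (1994.66, 2000.0), (1994.66, 1995.0), (1994.66, 1995.0),
    (1992.17, 1995.0), (1990.26, 1990.0), (1991.68, 1990.0),
    (1990.26, 1990.0), (1990.26, 1990.0), (1992.35, 2003.0),
]


def flag_discrepancies(rows: Sequence[Row], factor: float = 10.0) -> List[Row]:
    """Return the rows whose distance is large relative to the other distances.

    Assumption: "large" means more than `factor` times the median distance.
    """
    distances = [abs(classified - stored) for classified, stored in rows]
    threshold = factor * median(distances)
    return [row for row, d in zip(rows, distances) if d > threshold]


suspects = flag_discrepancies(rows)
# Flags the two rows with the stored values 2000.0 and 2003.0.
```

For a flagged row, the classified value (for example 1994.66 for the stored value 2000.0) serves as the recommended correction; whether the row is a real error or a gap in the training set still has to be decided by the user.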
5 Using Ontologies for further Improvements
As already pointed out in section 2.1, the usage of ontologies for modeling
the examined real-world extract is beneficial for building a
classifier for the discovery of inconsistencies in data sets.
Not only the semantic coherence of the modeled concepts is useful,
but also further information that the modeling domain expert can annotate to
the identified concepts. This information is made explicit and can therefore
be considered directly usable knowledge. We gave examples in section
4.1, where the information about the values’ scale was given by domain
experts and annotated to the data schema. These annotations can be used
to configure the data quality mining algorithms for further improvement
of the achieved results by adjusting them to the needs induced by the
underlying data schema and the domain expert’s knowledge, which would
otherwise not be available or would be difficult to utilize.
6 Conclusions and further Work
In this paper it was shown that classifiers can be employed to find
inconsistencies in data sets and to give concrete recommendations for correct
values. This approach was first made plausible through a discussion, together
with the decision to employ support vector machines as the classification
algorithm, and later through the results of a prototype.
For a holistic approach to data quality mining, the data quality dimensions
accuracy, correctness, and currency remain open for further research. The
solutions for these dimensions will be discussed in upcoming papers.
The positive influence of ontologies on the data quality mining approach in
particular and on checking for consistency problems in general, through the
additional semantic knowledge they provide in contrast to other modeling
techniques, was highlighted.
The results presented in this paper were achieved in a project funded by
EWE AG (see http://www.ewe.de/), which is a major German utility.
Bibliography
Bortz J (2005) Statistik. Springer Medizin Verlag, Heidelberg.
Grüning F (2006) Data Quality Mining in Ontologies for Utilities. In:
Managing Environmental Knowledge, 20th International Conference of
Informatics in Environmental Protection.
Hipp J, Güntzer U, Nakhaeizadeh G (2002) Data Mining of Association
Rules and the Process of Knowledge Discovery in Databases. In: Lecture
Notes in Computer Science: Advances in Data Mining: Applications in
E-Commerce, Medicine, and Knowledge Management, Springer
Berlin/Heidelberg, Volume 2394/2002.
Noy N F, McGuinness D L (2001) Ontology Development 101: A
Guide to Creating Your First Ontology. Stanford Knowledge Systems
Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics
Technical Report SMI-2001-0880.
Redman TC (1996) Data Quality for the Information Age. Artech
House, Inc.
Russell S, Norvig P (2003) Artificial Intelligence: A Modern Approach.
Prentice Hall.
Scannapieco M, Missier P, Batini C (2005) Data Quality at a Glance. In:
Datenbank-Spektrum, Volume 14, Pages 6-14.
Witten I H, Frank E (2005) Data Mining: Practical machine learning
tools and techniques. 2nd Edition, Morgan Kaufmann, San Francisco.
Mierswa I (2007) YALE Yet Another Learning Environment.
http://yale.sf.net/ (last access 31.1.2007)