Error Awareness Data Mining
Xingquan Zhu, Member, IEEE, and Xindong Wu, Senior Member, IEEE
Abstract—Real-world data mining applications often deal with
low-quality information sources where data collection
inaccuracy, device limitations, data transmission and
discretization errors, or man-made perturbations frequently
result in imprecise or vague data. Two common practices are to
adopt either data cleansing to enhance data consistency or
simply take noisy data as quality sources and feed them into the
data mining algorithms. Either way may substantially sacrifice
the mining performance. In this paper, we consider an error
awareness data mining framework, which takes advantage of
statistical error information (such as noise level and noise
distribution) to improve data mining results. We assume such
noise knowledge is available in advance, and propose a solution
to incorporate it into the mining process. More specifically, we
use noise knowledge to restore original data distributions, and
then use the restored information to modify the model built from
noise corrupted data. We present an Error Awareness Naïve
Bayes (EA_NB) classification algorithm, and provide extensive
experimental comparisons to demonstrate the effectiveness of
this effort.
1. INTRODUCTION
Data mining is dedicated to exploring meaningful information from
massive unstructured data, where the quality of the input plays an
essential role in ensuring the success of the mining process [1].
Many data mining methods, however, assume their input data comes
from quality sources and complies with nice distributions. In reality,
data often carries a significant amount of errors, which negatively
impact the mining algorithms. In addition, existing research from
privacy-preserving data mining [2-3][18] often uses intentionally
injected errors, commonly referred to as data perturbations, for
privacy preserving, such that sensitive information in the data
records can be protected, but knowledge in the dataset is still
available for mining practice. As these systematic or man-made
errors will eventually deteriorate the data quality, conducting
effective mining from such low-quality data sources becomes a
challenging and real issue for the data mining community.
General data mining applications consist of four major steps: data
preparation, data quality enhancement, actual mining, and post-mining processing [17]. When the underlying data bears a certain
amount of errors, a common practice is to cleanse the data and
enhance data consistency, so the subsequent mining process can
possibly achieve better performances. Data cleansing methods are
effective in their own scenarios, but some problems are still open:
1. Data cleansing only takes effect on certain types of errors, such
as class noise [1]. Although it has been demonstrated that
cleansing class noise often results in better learners [4], for
attribute noise, unfortunately, it is not an effective solution, as it
frequently incurs information loss.
2. No data cleansing methods can result in perfect data. As long as
some errors remain in the data, they will ultimately impact
the mining process.
3. For intentionally imposed errors, such as in privacy-preserving
data mining, data cleansing is simply not an option, as privacy-preserving
data mining intends to hide sensitive information in
data records, not to eliminate it.
4. Many suspicious data items in the database are only partially
noisy; their usefulness needs to be justified by the actual mining
algorithm, not by any universal cleansing criteria, which
have no idea about the mining methods the users are going to
use next.
5. Under a cleansing based data mining framework, data cleansing
and data mining are two isolated, independent operations with
no intrinsic connections with each other, so the data
mining process has no awareness of data errors.

X. Zhu and X. Wu are with the Department of Computer Science,
University of Vermont, Burlington, VT 05405 USA (corresponding author
fax: 802-656-0696; e-mail: [email protected]).
In addition to the above mentioned efforts, many other methods
such as data editing [5] and imputation [6] have also been used to
correct suspicious data entries and enhance data quality. In reality,
the benefits of these methods are often questionable, as any
unsupervised effort in data correction may incur more troubles, such
as ignoring outliers or introducing new errors. For many
applications, users are simply reluctant to change their data entries if
they are very serious about their data. It is obvious that data cleansing,
correction and editing all try to polish the data before it is fed into the
mining algorithms. The intuition behind this is straightforward, as many
data mining practitioners believe that enhancing data consistency
will consequently improve the mining performance, although
exceptions often occur. Unfortunately, all previous efforts simply try
to polish data quality for improved mining results, and they fail to
address the challenge of unifying data quality with the mining
process towards good performances. In other words, if we can let
data mining algorithms be aware of the underlying data errors, the
mining process may modify the model produced from the noisy data
and generate good results, as long as noise knowledge is known
before the actual mining process.
There are many cases in reality where statistical error information
in the database is known a priori:
• Information transmission, in wireless networks in particular,
often introduces a certain amount of errors into the communicated
data. For error control purposes, the statistical errors of the
signal transmission channel should be investigated in advance,
and can be used to estimate the error rate in the transmitted
information.
• When collecting information from different devices, the
inaccuracy level of each device is often available, as it is part of
the system features. For example, fluorescent labeling for gene
chips in microarray experiments usually contains inaccuracy
caused by many sources such as the influence of background
intensity. The values of collected gene chip data are often
associated with a probability to indicate the reliability of the
current value.
• Data discretization, a general procedure to discretize numerical
attributes, inherently introduces noise as well, as it uses a
certain number of discrete values to approximate infinite
continuous values. Such discretization errors can be measured
in advance and are therefore available to a mining procedure.
• As a representative example of artificial errors, privacy-preserving
data mining intentionally perturbs the data, so
private information in data records can be protected while
knowledge conveyed in the datasets is still minable. In such
cases, the level of errors introduced is certainly known to data
mining algorithms.
Most data mining methods, however, do not accommodate such
error information in their algorithm design. They either take noisy
data as quality sources, or adopt data cleansing beforehand to
eliminate errors. Either way may considerably sacrifice the
performance of the succeeding data mining algorithms. The above
observations raise an interesting and important concern on error
awareness data mining, where the previously known error
knowledge should be incorporated into the mining process for
improved mining results.
In this paper, we report our recent research efforts towards this
goal. We will provide an error awareness data mining framework
which accommodates noise knowledge to enhance the classification
accuracy. Our experimental results on real-world datasets will
demonstrate that such an error awareness data mining procedure is
certainly superior to cleansing based data mining, and can
significantly improve data mining results in noisy environments.
2. RELATED WORK
The problem of mining from noisy data sources has been a major
concern for the data mining community. As most data mining
algorithms crucially depend on their input data to produce reliable
models, a general consensus among data mining practitioners is that
low data quality often leads to wrong decisions or even ruins the
projects ("garbage in, garbage out") [1]. In reality, noise usually
takes two forms: class noise and attribute noise, depending on
whether the errors (inconsistency, contradiction or missing values)
were introduced to the class label or the attributes. For either type of
noise, providing effective algorithms to deal with noisy sources has
been a major part of data mining research. Existing endeavors from
data preprocessing [4] and data quality [7] perspectives have come
up with many solutions such as class noise identification [4],
erroneous attribute value location [8] and correction, missing
attribute value imputation [5-6] and acquisition [9], and editing
training instances for instance based learning [10]. An essential goal
of all these efforts is to enhance the quality of the training data so it
can possibly benefit the mining process. In practice, however, there
are many cases where data cleansing actually deteriorates the mining
performance, and one of the main reasons identified is information
loss from incorrectly eliminating, correcting or editing data. This is one
objective our proposed effort targets: the mining process should be
aware of the underlying data errors and make use of this information,
rather than try to cleanse the errors away.
It is understandable that no data processing effort can result in
perfect data, and in reality, most algorithms would still have to
conduct knowledge discovery from noisy sources, regardless of
whether they have noise handling mechanisms or not. The problem of
learning in noisy environments has been the focus of much attention
in machine learning and most inductive learning algorithms have a
mechanism for handling noise. For example, pruning in decision
trees is designed to reduce the chance that the trees are overfitting to
noise [11]. A common practice for reducing the noise impact is to
adopt some thresholding measures to remove poor knowledge
drawn from noisy data. Although simple, this mechanism has
achieved impressive results. For example, Integrative Windowing
[13] adopts good rule selection criteria to reduce the noise impact,
and Instance Based Learning [14] selects representative prototypes to
remove poor training samples. It is clear that in these algorithm
designs the existence of noise has been taken into consideration, but
they still follow the same direction as data cleansing, and have not
realized that noise knowledge can actually benefit the mining process
and should therefore be made use of.
Recent research in privacy-preserving data mining has raised an
issue of perturbing data entries to protect privacy and maintain data
mining performances, where randomization is one of the favorites
for this purpose. The intuition behind it is to intentionally introduce
errors (often in the form of randomness) into sensitive data entries,
and reveal randomized information about each record in exchange
for not having to reveal the original records to anyone [2-3].
Although the face values of the data records were modified, the imposed
randomness was controlled so that knowledge in the dataset is still
minable, with a small sacrifice in performance. Such a
randomization procedure requires a compromise between the levels
of privacy the system tries to protect and the mining performances
from the perturbed data:
1. As randomization can eventually ruin useful knowledge in the
dataset, the level of perturbations should be well controlled to
avoid making a perturbed dataset totally useless.
2. The distribution of the errors (the random data) is available for both
database managers and data mining practitioners, as without this
information, the mining results would deteriorate significantly.
For example, numerical attributes often adopt normally
distributed data perturbations.
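As a concrete illustration of the second point above, the following sketch (ours, not from the paper; numpy is assumed, and the perturb_numeric name and salary column are hypothetical) perturbs a numeric attribute with normally distributed noise whose parameters are published to the miner:

```python
import numpy as np

def perturb_numeric(values, sigma, seed=0):
    """Randomization for privacy: add zero-mean Gaussian noise to a
    numeric attribute. The distribution N(0, sigma^2) is published,
    so miners can later reconstruct the original data distribution."""
    rng = np.random.default_rng(seed)
    return values + rng.normal(0.0, sigma, size=len(values))

# Example: perturb a (hypothetical) salary column with sigma = 10000
salaries = np.array([52000.0, 71000.0, 48000.0])
print(perturb_numeric(salaries, sigma=10000.0))
```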
For categorical attributes, Du and Zhan [3] have adopted
randomized response techniques to scramble the original data entries
for privacy-preserving data mining. This method assumes that
attributes contain binary values only, and that the values are collected
in such a way that information providers tell the truth about all their
answers to sensitive questions with probability θ, and lie about all
their answers with probability 1−θ. For example, if the original
attribute values were A1=1, A2=1, and A3=0, then there is a probability
θ that all these values remain unchanged (users telling the truth), and
a probability 1−θ that all values are flipped to A1=0, A2=0, and A3=1
(users lying). The assumption of this approach, however, is too strong,
as it assumes that once users decide to lie they will lie on all
questions, which is not true in reality. In fact, users may randomly tell
the truth or lie on each single question, i.e., lie on attribute A1 but tell
the truth on attribute A2, or vice versa, so the perturbation introduced
to each attribute (question) may be totally independent. In our system,
we consider more realistic situations where perturbations are randomly
and independently introduced to each attribute.
3. ERROR AWARENESS DATA MINING
FOR NAIVE BAYES (NB)
3.1 Naïve Bayes Classification
In classification learning, each instance is described by a vector of
attribute values and a class label. A set of instances with their classes
is provided as the training data, and the learner is asked to predict a test
instance's class according to the evidence provided by the training
data. We define:
• X = <A1, A2, .., AM> as a vector of random variables denoting the observed attribute values (an instance with M attribute values);
• x = <a1, a2, .., aM> as a particular observed attribute value vector (a particular instance);
• X = x as shorthand for X1 = x1 ∧ X2 = x2 ∧ .. ∧ XM = xM;
• aij, j = 1, .., Mi, as a particular value of attribute Ai, where Mi denotes the number of attribute values in Ai;
• Y as a random variable denoting the class of an instance;
• Cl, l = 1, .., L, as a particular class label of a dataset with L classes.
Assuming that P(Y=Cl|x) denotes the probability that example x
belongs to class Cl, the Bayes theorem can be used to optimally
predict the class label of a previously unseen example x, given a set
of training examples in advance. With the Bayes theorem, the
expected classification error can be minimized by choosing
argmaxl {P(Y=Cl|x)}. Given an example x, the Bayes theorem
provides a method to compute P(Y=Cl|x) with Eq. (1):

$$P(Y = C_l \mid x) = \frac{P(Y = C_l) \cdot P(X = x \mid Y = C_l)}{P(X = x)} \qquad (1)$$

Assuming that the attributes are independent given the class,
P(X=x|Y=Cl) can be decomposed into the product P(x1|Cl) × P(x2|Cl)
× .. × P(xM|Cl). Then the probability that an example belongs to class
Cl is given by Eq. (2):

$$P(Y = C_l \mid x) = \frac{P(Y = C_l) \cdot \prod_{a} P(X_a = x_a \mid Y = C_l)}{P(X = x)} \qquad (2)$$

which can be rewritten in logarithmic form as Eq. (3):

$$P(Y = C_l \mid x) \propto \log(P(Y = C_l)) + \sum_{a} \log(P(X_a = x_a \mid Y = C_l)) \qquad (3)$$
The classifier obtained by using the discriminant function in Eq.
(2) is known as the Naïve Bayes classifier. The independence
assumption embodied in Eq. (2) makes NB classifiers very efficient
for large datasets, because an NB classifier does not use attribute
combinations as a predictor and can be constructed by only one scan
of the dataset with a linear time complexity. Although the
assumption of conditional independence among attributes is often
violated in reality, the classification performance of NB is
surprisingly good compared with other more complex classifiers,
especially when dealing with noisy datasets.
In noisy environments, erroneous attribute values change the
conditional probabilities P(X|Y=Cl), l=1,..,L, and consequently deteriorate
NB's performance. The essential goal of error awareness data
mining for NB classification is to let NB be aware of the
underlying data errors, and then attempt to restore the original
conditional probabilities and improve the NB classifier. In the case
that errors exist in the class label as well, the same approach can
be adopted to restore the prior probability P(Y=Cl), l=1,..,L.
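For reference in what follows, here is a minimal Python sketch of a count-based NB classifier using the log form of Eq. (3) (our illustration, not the authors' implementation; the Laplace smoothing is our addition to avoid zero probabilities):

```python
import math
from collections import Counter, defaultdict

def train_nb(X, y, smoothing=1.0):
    """Estimate log P(C_l) and log P(A_i = a_ij | C_l) by counting.
    X: list of tuples of categorical attribute values; y: class labels."""
    n, m = len(y), len(X[0])
    values = [set(row[i] for row in X) for i in range(m)]
    prior = Counter(y)
    counts = defaultdict(Counter)  # (class, attr index) -> value counts
    for row, c in zip(X, y):
        for i, v in enumerate(row):
            counts[(c, i)][v] += 1
    log_prior = {c: math.log(k / n) for c, k in prior.items()}
    log_cond = {(c, i, v): math.log((counts[(c, i)][v] + smoothing) /
                                    (prior[c] + smoothing * len(values[i])))
                for c in prior for i in range(m) for v in values[i]}
    return log_prior, log_cond

def predict(row, log_prior, log_cond):
    """Eq. (3): argmax_l log P(C_l) + sum_a log P(X_a = x_a | C_l)."""
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(log_cond[(c, i, v)] for i, v in enumerate(row)))

X = [('sunny', 'hot'), ('rainy', 'cool'), ('sunny', 'cool'), ('rainy', 'hot')]
y = ['no', 'yes', 'yes', 'no']
print(predict(('sunny', 'cool'), *train_nb(X, y)))
```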
3.2 Data Distribution Restoration for Naïve Bayes
Assume that the previously known noise level in attribute Ai is
denoted by pi, and that noise in each attribute is uniformly
distributed. A noise level pi indicates that any particular attribute
value, say aij, has a pi probability of being randomly corrupted to any
of the values ai1, .., aiMi, including itself. So for any two
values aij and aik, aij has a pi/Mi probability of being changed to aik,
and vice versa. Such a random transformation model for attribute
value aij is depicted in Fig. 1.
[Fig. 1. Random value transformation for attribute value aij: aij remains unchanged with probability 1 − pi + pi/Mi, and is changed to each other value aik, k ≠ j, with probability pi/Mi.]
Given data set D with |D| instances, assume it was corrupted from
an error free dataset E (which does not exist). Let |Dij| and |Eij| denote
the numbers of instances in D and E respectively which contain the
attribute value aij. When noise is uniformly distributed, as depicted
in Fig. 1, for any attribute Ai, the relationship between |Dij| and |Eij|,
j=1, 2, ..Mi, can be expressed as follows.
$$\begin{aligned} |D_{i1}| &= |E_{i1}| \cdot \left(1 - p_i + \tfrac{p_i}{M_i}\right) + \dots + |E_{ij}| \cdot \tfrac{p_i}{M_i} + \dots + |E_{iM_i}| \cdot \tfrac{p_i}{M_i} \\ &\ \vdots \\ |D_{ij}| &= |E_{i1}| \cdot \tfrac{p_i}{M_i} + \dots + |E_{ij}| \cdot \left(1 - p_i + \tfrac{p_i}{M_i}\right) + \dots + |E_{iM_i}| \cdot \tfrac{p_i}{M_i} \\ &\ \vdots \\ |D_{iM_i}| &= |E_{i1}| \cdot \tfrac{p_i}{M_i} + \dots + |E_{ij}| \cdot \tfrac{p_i}{M_i} + \dots + |E_{iM_i}| \cdot \left(1 - p_i + \tfrac{p_i}{M_i}\right) \end{aligned} \qquad (4)$$

Eq. (4) can be written in matrix form as Eq. (5):

$$A \cdot X = B \qquad (5)$$

where

$$A = \begin{bmatrix} 1 - p_i + \frac{p_i}{M_i} & \cdots & \frac{p_i}{M_i} & \cdots & \frac{p_i}{M_i} \\ \vdots & \ddots & \vdots & & \vdots \\ \frac{p_i}{M_i} & \cdots & 1 - p_i + \frac{p_i}{M_i} & \cdots & \frac{p_i}{M_i} \\ \vdots & & \vdots & \ddots & \vdots \\ \frac{p_i}{M_i} & \cdots & \frac{p_i}{M_i} & \cdots & 1 - p_i + \frac{p_i}{M_i} \end{bmatrix}, \quad X = \begin{bmatrix} |E_{i1}| \\ \vdots \\ |E_{ij}| \\ \vdots \\ |E_{iM_i}| \end{bmatrix}, \quad B = \begin{bmatrix} |D_{i1}| \\ \vdots \\ |D_{ij}| \\ \vdots \\ |D_{iM_i}| \end{bmatrix}$$
As we hold the corrupted dataset D, we can easily count the values
of |Dij|, j=1,..,Mi. Because the noise level pi is known as well, Eq. (5) is
just a system of linear equations with Mi variables |Eij|, j=1,..,Mi, and
Mi equations, which can be easily solved to estimate |Eij|, i.e., the
number of instances in E which contain attribute value aij.
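A minimal numpy sketch of this restoration step (our illustration; the function name is hypothetical):

```python
import numpy as np

def restore_attribute_counts(d_counts, p):
    """Solve Eq. (5), A . X = B, for one attribute with uniform noise.
    d_counts: observed counts |D_i1|, .., |D_iMi| in the noisy data D.
    p: the known noise level p_i. Returns estimated |E_i1|, .., |E_iMi|."""
    m = len(d_counts)
    A = np.full((m, m), p / m)             # off-diagonal entries: p_i / M_i
    np.fill_diagonal(A, 1.0 - p + p / m)   # diagonal: 1 - p_i + p_i / M_i
    return np.linalg.solve(A, np.asarray(d_counts, dtype=float))

# Example: a 3-valued attribute observed under 30% uniform noise
print(restore_attribute_counts([50, 30, 20], p=0.3))
```

Note that A equals (1 − pi)·I plus a constant pi/Mi matrix, so its eigenvalues are 1 and 1 − pi; the system stays well conditioned as long as pi is not close to 1.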
The results from Eq. (5) can only estimate the number of
instances w.r.t. each attribute value, regardless of the class label.
This information is not sufficient to solve our problem, as Naïve
Bayes needs to estimate the conditional probability given a particular
class Y=Cl, P(X | Y=Cl). For this purpose, we transform Eq. (5) by
pushing constraints onto the class labels.
Assume that the number of instances in E which contain attribute
value aij and class label Cl is denoted by |Eij^Cl|, and that the same type
of instances in the corrupted dataset D is denoted by |Dij^Cl|. Eq. (5)
can then be rewritten as follows:
$$A \cdot (X_1 + X_2 + \dots + X_l + \dots + X_L) = B_1 + B_2 + \dots + B_l + \dots + B_L \qquad (6)$$

where

$$X_l = \begin{bmatrix} |E_{i1}^{C_l}| & |E_{i2}^{C_l}| & \cdots & |E_{iM_i}^{C_l}| \end{bmatrix}^T, \quad B_l = \begin{bmatrix} |D_{i1}^{C_l}| & |D_{i2}^{C_l}| & \cdots & |D_{iM_i}^{C_l}| \end{bmatrix}^T, \quad B = B_1 + B_2 + \dots + B_L$$
Eq. (6), however, is unsolvable, as there are L·Mi variables
(|Eij^Cl|, l=1,..,L, j=1,..,Mi) but only Mi equations, although we know
exactly the values of B1, .., BL. An alternative is to decompose Eq. (6)
into a series of linear systems associated with each single class, as
denoted by Eq. (7).
$$\begin{cases} A \cdot X_1 = B_1 \\ A \cdot X_2 = B_2 \\ \ \dots \\ A \cdot X_L = B_L \\ X_1 + X_2 + \dots + X_L = X \end{cases} \qquad (7)$$
The rationale of Eq. (7) lies in the fact that errors are randomly
and independently distributed across all attributes, so instances in
each class suffer from almost the same level of errors. Once the
number of instances is large enough, estimating the attribute value
distribution from an instance subset or from the whole dataset makes
little difference. In the case that errors exist in the class
label as well, the decomposed equations in Eq. (7) may still hold, as
long as noise is randomly distributed among all classes.
However, because Eq. (7) estimates the attribute value distribution
under the constraint of the class label, it may result in higher
estimation errors for minor classes than for major classes. The reason
is that minor classes have a very limited number of instances, and it
is hard to assess whether noise in such a small number of instances
indeed complies with the transformation model in Fig. 1. As a result,
for minor classes, the estimated values (|Eij^Cl|, l=1,..,L, j=1,..,Mi) can
be seriously biased. Although we can do little to improve this shortfall,
Naïve Bayes inherently accommodates this issue already. As shown
in Eq. (1), the prior probability P(Cl) also takes part in the final
decision, and the final decision error is the product of the bias of the
conditional probability, Bias(P(X|Y=Cl)), and the prior probability
P(Cl). Although minor classes may have a larger Bias(P(X|Y=Cl)),
they also have a smaller P(Cl). Consequently, the bias from minor
classes is controlled and should not bring a large impact to the final
results.
3.3 Error Awareness Naïve Bayes (EA_NB)
With the above analysis, we can estimate the original attribute
value distribution (|Eij^Cl|, l=1,..,L, j=1,..,Mi, i=1,..,M) under the
constraint of the class Cl, l=1,..,L. These values can be directly used
to form the conditional probability P(X|Y=Cl). As NB assumes all
attributes are independent, we can repeat the same process for each
attribute, and use the estimated conditional probabilities for the final
classification. The pseudocode of the whole algorithm is depicted in
Fig. 2.
Procedure: ErrorAwarenessNaiveBayes()
Input: (1) D (a noisy dataset); (2) pi, i=1,..,M (the noise level of each attribute)
Output: Polished Naïve Bayes model.

For each class Cl, l=1,..,L
    Calculate the class prior probability P(Cl)
    For each attribute Ai, i=1,..,M
        Count the attribute value distribution |Dij^Cl|, j=1,..,Mi, from D
        Solve Eq. (7) to acquire the estimates |Eij^Cl|, j=1,..,Mi
    End For
End For
Take the estimated |Eij^Cl|, j=1,..,Mi; i=1,..,M; l=1,..,L, as conditional
probabilities, and combine them with the prior probabilities P(Cl) for error
awareness Naïve Bayes classification.
Fig. 2. Error Awareness Naïve Bayes
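Read alongside Fig. 2, a compact Python rendering of this procedure might look like the sketch below (ours, under the paper's assumptions, not the authors' code; the clipping of negative estimates is our own safeguard, since solving Eq. (7) on small samples can yield slightly negative counts):

```python
import math
import numpy as np

def train_ea_nb(X, y, noise_levels):
    """Error Awareness Naive Bayes (Fig. 2): per class and attribute,
    count |D_ij^{C_l}|, solve Eq. (7) for |E_ij^{C_l}|, and use the
    restored counts as conditional probabilities."""
    classes = sorted(set(y))
    m = len(X[0])
    values = [sorted(set(row[i] for row in X)) for i in range(m)]
    log_prior = {c: math.log(y.count(c) / len(y)) for c in classes}
    log_cond = {}
    for i in range(m):
        mi, p = len(values[i]), noise_levels[i]
        A = np.full((mi, mi), p / mi)
        np.fill_diagonal(A, 1.0 - p + p / mi)
        for c in classes:
            b = np.array([sum(1 for row, lab in zip(X, y)
                              if lab == c and row[i] == v)
                          for v in values[i]], dtype=float)  # B_l
            e = np.linalg.solve(A, b)      # estimated |E_ij^{C_l}|, Eq. (7)
            e = np.clip(e, 1e-6, None)     # guard against negative estimates
            for v, pr in zip(values[i], e / e.sum()):
                log_cond[(c, i, v)] = math.log(pr)
    return log_prior, log_cond
```

Prediction then proceeds exactly as in standard NB, combining log_prior and the restored log_cond through the discriminant of Eq. (3).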
4. EXPERIMENTAL EVALUATIONS
To evaluate the performance of the proposed error awareness Naïve
Bayes algorithm, we implemented both NB and EA_NB. In our
implementation, most NB classifications are based on the
discriminant function in Eq. (2); in the case that the class
distributions of all classes become indistinguishable (which happens
when the data is sparse or datasets have many attributes), we use Eq.
(3) instead. We evaluate our approach on 10 benchmark datasets
from the UCI data repository [15], where the numerical attributes are
equally discretized into 10 discrete values. Due to size restrictions,
we mainly analyze the results on several representative datasets. The
summarized results are reported in Table 1.
Most datasets in the UCI data repository have been carefully
examined by domain experts, and so do not contain much noise (at
least we do not know which instances and which attribute values are
erroneous). For comparative studies, we adopt a random corruption
model, which is exactly the same as the one shown in Fig. 1, to
manually inject errors into the attributes. Given a noise level pi, an
attribute value aij has a chance of pi to be changed to a random value
(including itself). So the actual noise level in Ai is pi − pi/Mi, which is
always lower than the designed value (Mi is the number of attribute
values in Ai). With the same noise level pi, the more attribute values
an attribute has, the higher its overall actual noise level. As we
assume noise is evenly distributed among all attribute values, the
noise has a much smaller effect on attributes with a large number of
possible values than on those with, say, only two values.
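A small sketch of this corruption model (ours; it mirrors the transformation of Fig. 1):

```python
import random

def corrupt_attribute(values, domain, p, seed=0):
    """Fig. 1 corruption model: with probability p, replace a value by
    one drawn uniformly from the domain (itself included), so the
    actual flip rate is p - p/|domain|."""
    rng = random.Random(seed)
    return [rng.choice(domain) if rng.random() < p else v for v in values]

column = ['a', 'b', 'a', 'c', 'b', 'a']
print(corrupt_attribute(column, domain=['a', 'b', 'c'], p=0.3))
```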
Most experiments are designed to assess the performances of the
proposed error awareness NB in noisy environments, in comparison
with the original NB classifiers trained from the same dataset. For
each experiment, we perform 10 times 10-fold cross-validation and
use the average accuracy as the final result. In each run, the dataset is
randomly (with a proportional partitioning scheme) divided into a
training set and a test set. The noise corruption model was applied on
the training set and this corrupted dataset was used to build NB and
EA_NB classifiers. All the learners are tested on the test set to
evaluate their performances.
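One fold of this protocol could be sketched as follows (our illustration; the stratified_split helper is hypothetical, and the corruption and training steps reuse the sketches shown earlier):

```python
import random

def stratified_split(y, test_frac=0.1, seed=0):
    """Proportional (stratified) train/test partition of instance indices."""
    rng = random.Random(seed)
    by_class = {}
    for idx, c in enumerate(y):
        by_class.setdefault(c, []).append(idx)
    test = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        test.extend(idxs[:max(1, round(len(idxs) * test_frac))])
    train = [i for i in range(len(y)) if i not in set(test)]
    return train, sorted(test)

# One fold: corrupt only the training portion, then train NB / EA_NB on
# the corrupted training data and score both on the untouched test data.
```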
4.1 Classification Accuracy Comparison
To evaluate the performance of EA_NB, we design the following
experiments. Given a dataset E, we first train an NB classifier and
denote its classification accuracy by "Original". We then introduce a
certain level of noise into E to build a corrupted dataset D, and learn
another NB classifier from D, with its performance denoted by
"Corrupted". With D and a noise level pi, we can build an EA_NB
classifier, represented as "EA_NB". As data cleansing is often
adopted in noisy environments to enhance data quality and improve
classification accuracy, we also apply a data cleansing method [5] on
D to remove all misclassified examples, and build another NB
classifier from the cleansed dataset. The performance of this NB
classifier is denoted by "Cleansing".
We compare the performances of the above four classifiers at
different noise levels, pi ∈ [0.1, 0.5], and report the detailed results
from four representative datasets in Fig. 4. The summarized results
from six other datasets are reported in Table 1. To justify the
performance of our own NB implementation and to show that NB is
indeed robust in noisy environments, we report the results of C4.5 in
Table 1 as well. In Fig. 4, the x-axis represents the noise level pi,
and the y-axis indicates the classification accuracy. The meaning of
each curve in Fig. 4 is explained in Fig. 3.

[Fig. 3. The meanings of curves in Fig. 4: "Original", "Corrupted", "Cleansing", and "EA_NB".]

Fig. 4. Classification accuracy comparison: (a) Car dataset; (b) Splice dataset; (c) Mushroom dataset; (d) Glass dataset.
Table 1. Classification accuracy comparison
Dataset | pi (%) | Original NB | Original C4.5 | Corrupted NB | Corrupted C4.5 | Cleansing NB | EA_NB
Adult   |   10   |    81.39    |     84.82     |    80.57     |     83.11      |    77.74     | 80.31
        |   30   |             |               |    79.11     |     79.65      |    77.92     | 80.19
        |   50   |             |               |    78.92     |     76.68      |    78.37     | 80.12
Krvskp  |   10   |    87.86    |     99.46     |    85.41     |     95.69      |    79.33     | 87.36
        |   30   |             |               |    82.93     |     82.55      |    79.24     | 86.08
        |   50   |             |               |    80.07     |     71.30      |    78.47     | 85.13
LED24   |   10   |   100.0     |    100.0      |   100.0      |     94.41      |   100.0      | 100.0
        |   30   |             |               |    98.97     |     78.62      |    99.88     | 99.96
        |   50   |             |               |    93.62     |     55.59      |    96.32     | 98.14
Nursery |   10   |    91.47    |     98.68     |    90.84     |     92.23      |    87.59     | 90.72
        |   30   |             |               |    90.31     |     77.08      |    87.22     | 90.49
        |   50   |             |               |    90.04     |     63.87      |    87.31     | 90.25
Wine    |   10   |    95.12    |     93.25     |    95.02     |     92.61      |    94.56     | 95.01
        |   30   |             |               |    92.63     |     90.32      |    90.04     | 94.42
        |   50   |             |               |    88.71     |     85.14      |    73.33     | 91.76
Zoo     |   10   |    96.08    |     92.96     |    90.16     |     91.97      |    90.52     | 94.05
        |   30   |             |               |    88.83     |     86.87      |    88.10     | 91.39
        |   50   |             |               |    84.54     |     77.03      |    82.41     | 86.17
As shown in Fig. 4, errors negatively impact the learners built
from noisy datasets. This is common sense, as corrupted datasets no
longer reveal the true data distributions, and therefore prevent NB
classifiers from making correct decisions. It is worth noting that
different datasets react differently to the same level of noise: a small
portion of noise can seriously deteriorate an NB learner (e.g., the
datasets in Fig. 4), or a significant amount of noise may still not
show much impact at all (e.g., the Adult and Nursery datasets in
Table 1). We believe this is an intrinsic feature of a dataset, which is
determined by factors like the number of instances and the complexity
of the concepts in the dataset. Meanwhile, as NB is a typical statistical
learner, noise normally does less harm to it in comparison with other
learning mechanisms; as shown in Table 1, C4.5 usually deteriorates
much faster than NB. Generally, for a dataset with a large number of
instances and a significant amount of redundancy, the existence of
errors does less harm, as noise cannot easily ruin the true data
distributions. On the other hand, for a dataset with a very limited
number of instances, where each instance appears to be important for
classification, adding a little bit of noise can make considerable
changes to the NB classifier, because noise in this case can easily
modify the data distributions and fool NB learners.
When noise is introduced to the attributes, data cleansing is
simply not an option to improve data mining performance. For
many datasets we used, the learners trained from the cleansed dataset
("Cleansing") are obviously inferior to the ones trained from the
original noisy data ("Corrupted"), where the results of "Cleansing" can
be as much as 7% lower than "Corrupted". This complies with the
previous observations from Quinlan [12]. The negative impact of
data cleansing may come from two possible reasons: (1) removing
suspicious instances, which do not comply with the existing model,
will inevitably eliminate good examples and incur information loss;
and (2) just because some attribute values are erroneous does not
necessarily mean the whole instance is useless; many other
attribute values of the noisy instance may still benefit the learning
theory, so the instance cannot simply be removed. The impact of these
two factors becomes extremely clear if the accuracy of the underlying
learner is low, as shown in Fig. 4(d). This is understandable, because
if a learner has only 50% accuracy, half of the removed instances are
actually not noise.
When incorporating statistical error information for error
awareness NB classification, we can achieve a very significant
improvement in classification accuracy (as "EA_NB" and
"Corrupted" demonstrate). Take the Car dataset in Fig. 4(a) as
an example, where the accuracy of EA_NB is 10% better than the
learner from the corrupted dataset. Similar results can be observed
from many benchmark datasets, where EA_NB is almost always
better than "Corrupted". The higher the noise level in the datasets, the
more improvement can be observed (when the noise level is no
larger than 50%). This indicates that although noise continuously
impacts the learning theory, having a data mining process
aware of the data errors can substantially reduce the noise impact and
enhance the mining results. It is true that the proposed effort relies on
statistical error information to ensure its success, and a very
limited number of instances may be insufficient for this purpose.
However, our results from Glass (214 examples and 7 classes) and
Wine (178 examples and 3 classes) indicate that even with a very
limited number of instances, the proposed effort can still work well
and achieve impressive results.
4.2 Classification Performance under Inexact Noise Levels
EA_NB considers the noise level pi as a priori knowledge given by
users. In reality, it can often be the case that the user-specified value
differs from the actual noise level in the dataset. If a tiny difference
between the user-specified value and the actual noise level in the
database brought a considerable impact to the system performance,
we would need to find solutions to enhance the robustness of our
algorithm. For this purpose, we adopt the following approach to
perturb the user-specified noise level pi.
Given a noise level pi, we first use this value to construct a noisy
dataset D. When learning an EA_NB classifier from D, we
intentionally change the noise level pi to pi ± δ(p)·pi, where δ(p) is
the amplitude of the perturbation and is uniformly selected from
[0, p], and p is controlled by users. This approach simulates
situations where users can only roughly guess the noise level.
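A small sketch of this perturbation scheme (ours; the function name is hypothetical):

```python
import random

def perturb_noise_level(p_i, p, seed=None):
    """Simulate an inexact user estimate: p_i +/- delta * p_i, where
    delta is drawn uniformly from [0, p] with a random sign."""
    rng = random.Random(seed)
    return p_i + rng.choice((-1.0, 1.0)) * rng.uniform(0.0, p) * p_i

# e.g. the "EA_NB+0.5" setting at a true noise level of 0.3
print(perturb_noise_level(0.3, p=0.5, seed=1))
```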
[Fig. 5. Classification performance comparison under inexact noise levels: (a) Krvskp dataset; (b) Car dataset. Curves: "Corrupted", "EA_NB+0", "EA_NB+0.1", "EA_NB+0.3", and "EA_NB+0.5"; x-axis: noise level pi; y-axis: accuracy.]
In Fig. 5, we report the results from the two most representative
datasets, Krvskp and Car, where the results from different values of p
are denoted by "EA_NB+p". We set the value of p from 0 to 0.5,
which means that the perturbation amplitude varies from 0 to 50% of
the original noise level, and provide the results at four values: 0, 0.1,
0.3 and 0.5 (as different values of p do not result in significant
changes in Fig. 5(b), we omit the result for p=0.3 there).
As shown in Fig. 5, when the user-specified noise level is
different from the actual noise level in the database, EA_NB
inevitably deteriorates, as inaccurate noise knowledge misleads
EA_NB into building a biased model. Most likely, the higher the
amplitude of the perturbation, the lower the system performance.
Depending on the intrinsic characteristics of each dataset, the
deterioration of the system performance varies significantly.
Determining which types of datasets are more sensitive to such a
perturbation is a nontrivial task and requires intensive studies on the
complexity of the underlying concepts, the data redundancy, and the
interactions among attributes. This investigation is beyond the scope
of this paper, but our observations from the two most representative
datasets indicate that, as long as the user-specified value is close to
the actual noise level in the dataset (e.g., off by no more than 30%),
the proposed effort can still achieve impressive results and far
outperform the learner trained from noise corrupted datasets. This
shows that EA_NB is fairly robust in reality, and can accommodate
deviations in users' input for effective mining.
5. DISCUSSION
To demonstrate the theme of error awareness data mining, we have
proposed a Naïve Bayes based solution. The same idea is actually
applicable to many other classification algorithms, such as the most
popular decision tree algorithms ID3/C4.5 [11-12] and CART [16].
For decision tree construction, ID3/C4.5 and CART adopt
information gain/gain ratio and the gini index, respectively, to
evaluate each attribute, and select the most informative attribute one
at a time to split the data into smaller subsets. This procedure repeats
until all instances in each split subset belong to one class or some
stopping criterion is satisfied. To ensure good performances,
calculating information gain and gini index values accurately is
essential, as incorrect values lead to poor splits and decrease system
performance. With statistical error information, we can restore the
original information gain or gini index values, similar to what we
have done with NB.
Take the gini index in CART as an example. For a data set S with
N instances and L classes, the gini index Gini(S) is defined as

$$Gini(S) = 1 - \sum_{l=1}^{L} f_l^2$$

where fl is the relative frequency of class l in S. With a certain
splitting criterion T that splits S into two subsets S1 and S2 with sizes
N1 and N2 respectively, the gini index GiniSplit(S, T) is defined as Eq. (9):

$$Gini_{Split}(S, T) = \frac{N_1}{N} Gini(S_1) + \frac{N_2}{N} Gini(S_2) \qquad (9)$$

To find the "best" splitting attribute, we have to enumerate all
possible splitting points (determined by the attribute values) for each
attribute, produce the corresponding subsets S1 and S2, and choose the
split with the smallest gini index.
In noisy environments, erroneous attribute values produce
incorrect class frequencies in the split subsets S1 and S2, and
therefore damage the true gini index values. With the error
information pi, we can estimate the original gini index values through
the following steps (a code sketch follows the list).
1. Given a dataset S, for each class Cl, l=1,..,L, adopt Eq. (7) to estimate the original attribute value distribution |Eij^Cl|, j=1,..,Mi; i=1,..,M.
2. Use Eq. (4) to estimate the original distribution of attribute Ai, |Eij|, j=1,..,Mi.
3. Compute the modified gini index of S w.r.t. each possible splitting point of attribute Ai, as denoted by Eq. (10):

$$Gini_{Split}(S, a_{ij}) = \frac{|E_{ij}|}{N}\left(1 - \sum_{l=1}^{L}\left(\frac{|E_{ij}^{C_l}|}{|E_{ij}|}\right)^2\right) + \frac{N - |E_{ij}|}{N}\left(1 - \sum_{l=1}^{L}\left(\frac{\sum_{k=1, k \neq j}^{M_i} |E_{ik}^{C_l}|}{N - |E_{ij}|}\right)^2\right) \qquad (10)$$

4. Enumerate all possible splitting points of all attributes, and choose the one with the smallest value for splitting.
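A vectorized sketch of the restored gini computation in steps 1-4 (our illustration; it assumes the per-class estimates from Eq. (7) are already in hand):

```python
import numpy as np

def gini_split_restored(e):
    """Eq. (10): gini of splitting S on A_i = a_ij vs. A_i != a_ij.
    e: L x M_i array; e[l, j] is the estimated |E_ij^{C_l}| from Eq. (7).
    Returns one GiniSplit score per candidate value j."""
    e = np.asarray(e, dtype=float)
    n = e.sum()
    nj = e.sum(axis=0)                     # restored |E_ij| per value j
    cl = e.sum(axis=1, keepdims=True)      # restored per-class totals
    gini_in = 1.0 - ((e / nj) ** 2).sum(axis=0)
    gini_out = 1.0 - (((cl - e) / (n - nj)) ** 2).sum(axis=0)
    return (nj / n) * gini_in + ((n - nj) / n) * gini_out

# Example: 2 classes, 3 candidate values; split on the smallest score
scores = gini_split_restored([[30.0, 5.0, 5.0], [10.0, 25.0, 25.0]])
print(scores, int(scores.argmin()))
```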
6. CONCLUSIONS
In this paper, we have proposed an error awareness data mining
framework which seamlessly unifies statistical error information and
a data mining algorithm for effective learning. The proposed effort
makes use of noise knowledge to modify the model built from noise
corrupted data, and has resulted in a substantial improvement in
comparison with the models built from the original noisy data and
the noise cleansed data. The novel features that distinguish the
proposed effort from existing endeavors are twofold: (1) we unify
noise knowledge and a general data mining algorithm into a unique
structure, whereas existing data mining activities often have no
awareness of the underlying data errors; and (2) instead of polishing
noisy data, like many cleansing based approaches do, we take the
advantage of noise knowledge to polish the model trained from noisy
data sources, and therefore the original data is well maintained.
While the strategies presented in this paper are specific to Naïve
Bayes, incorporating noise knowledge into the mining process for
error awareness data mining is the essential idea we want to convey
here. When data encloses a certain level of erroneous attribute
values, data cleansing is simply not an option to improve the data
mining performance, as it can result in substantial information loss.
Our error awareness data mining framework, which utilizes noise
knowledge to supervise the model construction, is therefore very
promising, as it can bridge the gap between low-quality data and the
mining process to enhance the system performance, and essentially
avoids the information loss incurred by data cleansing.
REFERENCES
[1] D. Luebbers, U. Grimmer, & M. Jarke, Systematic development of data
mining based data quality tools, Proc. of VLDB, Germany, 2003.
[2] R. Agrawal & R. Srikant, Privacy-preserving data mining, In Proc. of
ACM SIGMOD, pp. 439-450, 2000.
[3] W. Du & Z. Zhan, Using randomized response techniques for privacy-preserving data mining, in Proc. of 9th ACM SIGKDD, 2003.
[4] X. Zhu, X. Wu, & Q. Chen, Eliminating class noise in large datasets, in Proc. of ICML, pp. 920-927, 2003.
[5] I. Fellegi & D. Holt, A systematic approach to automatic edit and
imputation, J. of the American Statistical Association, vol.71, 1976.
[6] D. Rubin, Multiple Imputation for Nonresponse in Surveys, New York:
John Wiley & Sons, Inc., 1987.
[7] R. Wang, V. Storey, & C. Firth, A Framework for Analysis of Data
Quality Research, IEEE Trans. on KDE, 7(4), pp. 623-639, 1995.
[8] X. Zhu, X. Wu, & Y. Yang, Error detection and impact-sensitive
instance ranking in noisy datasets, in Proc. of AAAI, 2004.
[9] X. Zhu & X. Wu "Cost-Constrained Data Acquisition for Intelligent
Data Preparation", IEEE Trans. on KDE, 17(11), 2005.
[10] F. Ferri, J. Albert, & E. Vidal, Considerations about sample-size
sensitivity of a family of edited nearest-neighbor rules. IEEE Trans. on
SMC - Part B, 29, pp.667–672, 1999.
[11] J. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann,
San Mateo, CA, 1993.
[12] J. Quinlan, Induction of decision trees. Machine Learning, 1(1), 1986.
[13] J. Fuernkranz, Integrative Windowing, Journal of Artificial Intelligence
Research, vol.8, pp.129-164, 1998.
[14] D. Aha, D. Kibler, & M. Albert, Instance-based learning algorithms.
Machine Learning, 6(1):37-66, 1991.
[15] C. Blake & C. Merz, UCI Repository of Machine Learning Databases, 1998.
[16] L. Breiman, J. Friedman, R. Olshen, & C. Stone, Classification and Regression Trees, Wadsworth & Brooks, CA, 1984.
[17] M. Berry & G. Linoff. Mastering Data Mining, Wiley, 1999.
[18] Z. Huang, W. Du, and B. Chen, Deriving Private Information from
Randomized Data. In Proc. of ACM SIGMOD, Maryland, USA, 2005.