Error Awareness Data Mining

Xingquan Zhu, Member, IEEE, and Xindong Wu, Senior Member, IEEE

Abstract—Real-world data mining applications often deal with low-quality information sources where data collection inaccuracy, device limitations, data transmission and discretization errors, or man-made perturbations frequently result in imprecise or vague data. Two common practices are to adopt either data cleansing to enhance data consistency or simply to take the noisy data as a quality source and feed it into the data mining algorithms. Either way may substantially sacrifice mining performance. In this paper, we consider an error awareness data mining framework, which takes advantage of statistical error information (such as the noise level and noise distribution) to improve data mining results. We assume such noise knowledge is available in advance, and propose a solution to incorporate it into the mining process. More specifically, we use the noise knowledge to restore the original data distributions, and then use the restored information to modify the model built from the noise-corrupted data. We present an Error Awareness Naïve Bayes (EA_NB) classification algorithm, and provide extensive experimental comparisons to demonstrate the effectiveness of this effort.

1. INTRODUCTION

Data mining is dedicated to exploring meaningful information from massive unstructured data, where the quality of the input plays an essential role in ensuring the success of the mining process [1]. Many data mining methods, however, assume that their input data comes from quality sources and complies with nice distributions. In reality, data often carries a significant amount of errors, which negatively impact the mining algorithms.
In addition, existing research on privacy-preserving data mining [2-3][18] often uses intentionally injected errors, commonly referred to as data perturbations, for privacy preservation, such that sensitive information in the data records is protected but the knowledge in the dataset remains available for mining. As these systematic or man-made errors will eventually deteriorate the data quality, conducting effective mining from such low-quality data sources becomes a challenging and realistic issue for the data mining community.

General data mining applications consist of four major steps: data preparation, data quality enhancement, actual mining, and post-mining processing [17]. When the underlying data bears a certain amount of errors, a common practice is to cleanse the data and enhance data consistency, so the subsequent mining process can possibly achieve better performance. Data cleansing methods are effective in their own scenarios, but several problems remain open:

1. Data cleansing only takes effect on certain types of errors, such as class noise [1]. Although it has been demonstrated that cleansing class noise often results in better learners [4], for attribute noise it is unfortunately not an effective solution, as it frequently incurs information loss.
2. No data cleansing method can produce perfect data. As long as some errors remain in the data, they will ultimately impact the mining process.
3. For intentionally imposed errors, as in privacy-preserving data mining, data cleansing is simply not an option, since privacy-preserving data mining intends to hide sensitive information in the data records, not to eliminate it.
4. Many suspicious data items in the database are only partially noisy; their usefulness needs to be judged by the actual mining algorithm, not by a universal cleansing criterion that knows nothing about the mining methods the users are going to apply next.
5. Under a cleansing-based data mining framework, data cleansing and data mining are two isolated, independent operations with no intrinsic connections to each other, so the data mining process has no awareness of the data errors.

X. Zhu and X. Wu are with the Department of Computer Science, University of Vermont, Burlington, VT 05405 USA (corresponding author, fax: 802-656-0696; e-mail: [email protected]).

In addition to the above mentioned efforts, many other methods such as data editing [5] and imputation [6] have also been used to correct suspicious data entries and enhance data quality. In reality, the benefits of these methods are often questionable, as any unsupervised effort at data correction may cause more trouble, such as discarding outliers or introducing new errors. For many applications, users who are serious about their data are simply reluctant to change their data entries. It is obvious that data cleansing, correction and editing all try to polish the data before it is fed into the mining algorithms. The intuition behind them is straightforward, as many data mining practitioners believe that enhancing data consistency will consequently improve mining performance, although exceptions often occur. Unfortunately, all previous efforts simply try to polish data quality for improved mining results, and they fail to address the challenge of unifying data quality with the mining process towards good performance. In other words, if we can let data mining algorithms be aware of the underlying data errors, the mining process may modify the model produced from the noisy data and generate good results, as long as the noise knowledge is known before the actual mining process. There are many cases in reality where statistical error information about the database is known a priori.
• Information transmission, in wireless networks in particular, often introduces a certain amount of errors into the communicated data. For error control purposes, the statistical errors of the signal transmission channel are investigated in advance, and can be used to estimate the error rate in the transmitted information.
• When collecting information from different devices, the inaccuracy level of each device is often available, as it is part of the system's specifications. For example, fluorescent labeling for gene chips in microarray experiments usually contains inaccuracy caused by many sources, such as the influence of background intensity. The values of collected gene chip data are therefore often associated with a probability indicating the reliability of the current value.
• Data discretization, a general procedure to discretize numerical attributes, inherently introduces noise as well, as it uses a certain number of discrete values to approximate infinitely many continuous values. Such discretization errors can be measured in advance and are therefore available to a mining procedure.
• As a representative example of artificial errors, privacy-preserving data mining intentionally perturbs the data, so private information in the data records is protected while the knowledge conveyed in the dataset is still minable. In such cases, the level of errors introduced is certainly known to the data mining algorithms.

Most data mining methods, however, do not accommodate such error information in their algorithm design. They either take noisy data as a quality source, or adopt data cleansing beforehand to eliminate errors. Either way may considerably sacrifice the performance of the succeeding data mining algorithms. The above observations raise an interesting and important concern, error awareness data mining, in which previously known error knowledge should be incorporated into the mining process for improved mining results.
In this paper, we report our recent research efforts towards this goal. We provide an error awareness data mining framework which accommodates noise knowledge to enhance classification accuracy. Our experimental results on real-world datasets demonstrate that such an error awareness data mining procedure is clearly superior to cleansing-based data mining, and can significantly improve data mining results in noisy environments.

2. RELATED WORK

The problem of mining from noisy data sources has been a major concern for the data mining community. As most data mining algorithms crucially depend on their input data to produce reliable models, a general consensus among data mining practitioners is that low data quality often leads to wrong decisions or even ruins projects ("garbage in, garbage out") [1]. In reality, noise usually takes two forms, class noise and attribute noise, depending on whether the errors (inconsistency, contradiction or missing values) were introduced into the class label or the attributes. For either type of noise, providing effective algorithms to deal with noisy sources has been a major part of data mining research. Existing endeavors from the data preprocessing [4] and data quality [7] perspectives have produced many solutions, such as class noise identification [4], erroneous attribute value location and correction [8], missing attribute value imputation [5-6] and acquisition [9], and the editing of training instances for instance-based learning [10]. An essential goal of all these efforts is to enhance the quality of the training data so it can possibly benefit the mining process. In practice, however, there are many cases where data cleansing actually deteriorates mining performance, and one of the main reasons identified is information loss from incorrect elimination, correction or editing.
This is one objective our proposed effort targets: the mining process should be aware of the underlying data errors and make use of this information, rather than trying to cleanse them away. It is understandable that no data processing effort can result in perfect data, and in reality most algorithms still have to conduct knowledge discovery from noisy sources, regardless of whether they have noise handling mechanisms or not. The problem of learning in noisy environments has been the focus of much attention in machine learning, and most inductive learning algorithms have a mechanism for handling noise. For example, pruning in decision trees is designed to reduce the chance that the trees overfit the noise [11]. A common practice for reducing the impact of noise is to adopt some thresholding measure to remove poor knowledge drawn from noisy data. Although simple, this mechanism has achieved impressive results. For example, Integrative Windowing [13] adopts good-rule selection criteria to reduce the noise impact, and Instance Based Learning [14] selects representative prototypes to remove poor training samples. It is clear that in these algorithm designs the existence of noise has been taken into consideration, but they still follow the same direction as data cleansing, and have not realized that noise knowledge can actually benefit the mining process, and therefore do not make use of this information. Recent research in privacy-preserving data mining has raised the issue of perturbing data entries to protect privacy while maintaining data mining performance, where randomization is one of the favorite techniques for this purpose. The intuition behind it is to intentionally introduce errors (often in the form of randomness) into sensitive data entries, and to reveal randomized information about each record in exchange for not having to reveal the original records to anyone [2-3].
Although the face values of the data records are modified, the imposed randomness is controlled so that the knowledge in the dataset is still minable, with only a small sacrifice in performance. Such a randomization procedure requires a compromise between the level of privacy the system tries to protect and the mining performance obtainable from the perturbed data:

1. As randomization can eventually ruin useful knowledge in the dataset, the level of perturbation should be well controlled to avoid making a perturbed dataset totally useless.
2. The distribution of the errors (the random data) is available to both database managers and data mining practitioners, as without this information the mining results would deteriorate significantly. For example, numerical attributes often adopt normally distributed data perturbations.

For categorical attributes, Du and Zhan [3] adopted randomized response techniques to scramble the original data entries for privacy-preserving data mining. This method assumes that attributes contain binary values only, and that the values are collected in such a way that the information providers tell the truth about all their answers to sensitive questions with probability θ, and lie about all their answers with probability 1−θ. For example, if the original attribute values were A1=1, A2=1, and A3=0, then with probability θ all these values remain unchanged (the user telling the truth), and with probability 1−θ all values are flipped to A1=0, A2=0, and A3=1 (the user lying). The assumption of this approach, however, is too strong, as it assumes that once users decide to lie they will lie on all questions, which is not true in reality. In fact, users may randomly tell the truth or lie on each single question, i.e., lie on attribute A1 but tell the truth on attribute A2, or vice versa. So the perturbation introduced to each attribute (question) may be totally independent.
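To make the contrast concrete, the per-question independent perturbation can be sketched as follows (an illustrative sketch, not code from [3]; the function name is ours):

```python
import random

def randomize_record(record, theta, rng=random):
    """Per-attribute randomized response: independently keep each binary
    answer with probability theta, flip it with probability 1 - theta."""
    return [a if rng.random() < theta else 1 - a for a in record]

# Each attribute of [1, 1, 0] is perturbed independently, unlike the
# all-or-nothing (lie-on-everything) model described above.
print(randomize_record([1, 1, 0], theta=0.8))
```

With theta=1 every answer is truthful and the record is unchanged; with theta=0 every answer is flipped; intermediate values mix the two per question.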
In our system, we consider more realistic situations where perturbations are randomly and independently introduced to each attribute.

3. ERROR AWARENESS DATA MINING FOR NAIVE BAYES (NB)

3.1 Naïve Bayes Classification

In classification learning, each instance is described by a vector of attribute values and a class label. A set of instances with their classes is provided as the training data, and the learner is asked to predict a test instance's class according to the evidence provided by the training data. We define:

• X = <A1, A2, .., AM> as a vector of random variables denoting the observed attribute values (an instance with M attribute values);
• x = <a1, a2, .., aM> as a particular observed attribute value vector (a particular instance);
• X=x as shorthand for X1=x1 ∧ X2=x2 ∧ .. ∧ XM=xM;
• aij, j=1,..,Mi, as a particular value of attribute Ai, where Mi denotes the number of attribute values in Ai;
• Y as a random variable denoting the class of an instance; and
• Cl, l=1,..,L, as a particular class label of a dataset with L classes.

Assuming that P(Y=Cl | x) denotes the probability that example x belongs to class Cl, the Bayes theorem can be used to optimally predict the class label of a previously unseen example x, given a set of training examples in advance. With the Bayes theorem, the expected classification error can be minimized by choosing argmax_l {P(Y=Cl | x)}. Given an example x, the Bayes theorem computes P(Y=Cl | x) with Eq. (1):

  P(Y=Cl | x) = P(Y=Cl) · P(X=x | Y=Cl) / P(X=x)                        (1)

Assuming that the attributes are independent given the class, P(X=x | Y=Cl) can be decomposed into the product P(x1 | Cl) × P(x2 | Cl) × .. × P(xM | Cl). Then the probability that an example belongs to class Cl is given by Eq. (2):

  P(Y=Cl | x) = P(Y=Cl) · Π_a P(Xa=xa | Y=Cl) / P(X=x)                  (2)

which, by taking logarithms, can be rewritten as Eq. (3):

  P(Y=Cl | x) ∝ log(P(Y=Cl)) + Σ_a log(P(Xa=xa | Y=Cl))                 (3)

The classifier obtained by using the discriminant function in Eq. (2) is known as the Naïve Bayes classifier. The independence assumption embodied in Eq. (2) makes NB classifiers very efficient for large datasets, because an NB classifier does not use attribute combinations as predictors and can be constructed with only one scan of the dataset, i.e., with a linear time complexity. Although the assumption of conditional independence among attributes is often violated in reality, the classification performance of NB is surprisingly good compared with other more complex classifiers, especially when dealing with noisy datasets. In noisy environments, erroneous attribute values change the conditional probabilities P(X | Y=Cl), l=1,..,L, and consequently deteriorate NB's performance. The essential goal of error awareness data mining for NB classification is therefore to let NB be aware of the underlying data errors, restore the original conditional probabilities, and improve the NB classifier. In the case that errors exist in the class label as well, the same approach can be adopted to restore the prior probability P(Y=Cl), l=1,..,L.

3.2 Data Distribution Restoration for Naïve Bayes

Assume that the previously known noise level in attribute Ai is denoted by pi, and that noise in each attribute is uniformly distributed. A noise level pi indicates that any particular attribute value, say aij, has a pi probability of being randomly corrupted to any of the values ai1, .., aiMi, including itself. So for any two values aij and aik, aij has a pi/Mi probability of being changed to aik, and vice versa. Under this random transformation model, aij remains unchanged with probability 1 − pi + pi/Mi, and changes to each other value aik, k≠j, with probability pi/Mi, as depicted in Fig. 1.

Fig. 1. Random value transformation for attribute value aij

Given a dataset D with |D| instances, assume it was corrupted from an error-free dataset E (which does not exist). Let |Dij| and |Eij| denote the numbers of instances in D and E respectively which contain the attribute value aij. When noise is uniformly distributed, as depicted in Fig. 1, for any attribute Ai, the relationship between |Dij| and |Eij|, j=1, 2, .., Mi, can be expressed by Eq. (4):

  |Di1|  = |Ei1|·(1 − pi + pi/Mi) + .. + |Eij|·pi/Mi + .. + |EiMi|·pi/Mi
  ...
  |Dij|  = |Ei1|·pi/Mi + .. + |Eij|·(1 − pi + pi/Mi) + .. + |EiMi|·pi/Mi
  ...
  |DiMi| = |Ei1|·pi/Mi + .. + |Eij|·pi/Mi + .. + |EiMi|·(1 − pi + pi/Mi)  (4)

Eq. (4) can be written in matrix form as Eq. (5):

  A · X = B                                                             (5)

where A is the Mi × Mi matrix whose diagonal entries are 1 − pi + pi/Mi and whose off-diagonal entries are pi/Mi, X = [|Ei1|, .., |EiMi|]^T, and B = [|Di1|, .., |DiMi|]^T.

As we hold the corrupted dataset D, we can easily count the values |Dij|, j=1,..,Mi. Because the noise level pi is known as well, Eq. (5) is simply a system of Mi linear equations in the Mi variables |Eij|, j=1,..,Mi, which can be solved to estimate |Eij|, i.e., the number of instances in E which contain attribute value aij.

The results from Eq. (5) only estimate the number of instances w.r.t. each attribute value, regardless of the class label. This information is not sufficient to solve our problem, as Naïve Bayes needs to estimate the conditional probability given a particular class Y=Cl, P(X | Y=Cl). For this purpose, we transform Eq. (5) by pushing constraints onto the class labels. Let |Eij^Cl| denote the number of instances in E which contain attribute value aij and class label Cl, and let |Dij^Cl| denote the same type of instances in the corrupted dataset D. Eq. (5) can then be rewritten as Eq. (6):

  A · (X1 + X2 + .. + Xl + .. + XL) = B1 + B2 + .. + Bl + .. + BL       (6)

where Xl = [|Ei1^Cl|, |Ei2^Cl|, .., |EiMi^Cl|]^T, Bl = [|Di1^Cl|, |Di2^Cl|, .., |DiMi^Cl|]^T, and B = B1 + B2 + .. + BL.

Eq. (6), however, is unsolvable, as there are L·Mi variables (|Eij^Cl|, l=1,..,L, j=1,..,Mi) but only Mi equations, although we know exactly the values of B1, .., BL. An alternative is to decompose Eq. (6) into a series of linear systems associated with each single class, as denoted by Eq. (7):

  A · X1 = B1
  A · X2 = B2
  ...
  A · XL = BL
  X1 + X2 + .. + XL = X                                                 (7)

The rationale of Eq. (7) lies in the fact that errors are randomly and independently distributed across all attributes, so instances in each class suffer from almost the same level of errors. Once the number of instances is large enough, estimating the attribute value distribution from an instance subset or from the whole dataset does not make much difference. In the case that errors exist in the class labels as well, the decomposed equations in Eq. (7) may still hold, as long as noise is randomly distributed among all classes. However, because Eq. (7) estimates the attribute value distribution under the constraint of the class label, it may result in higher estimation errors for minor classes than for major classes. The reason is that a minor class has a very limited number of instances, which makes it hard to assess whether noise in this small number of instances indeed complies with the transformation model in Fig. 1. As a result, for minor classes, the estimated values (|Eij^Cl|, l=1,..,L, j=1,..,Mi) can be seriously biased. Although we can do little to improve this shortfall, Naïve Bayes inherently accommodates this issue. As shown in Eq. (1), the prior probability P(Cl) also takes part in the final decision, and the final decision error is the product of the bias of the conditional probability, Bias(P(X | Y=Cl)), and the prior probability P(Cl).
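As a concrete illustration, the restoration in Eqs. (5)-(7) amounts to solving one small linear system per attribute (and per class). A minimal sketch, assuming NumPy is available (the helper name `restore_counts` is ours):

```python
import numpy as np

def restore_counts(noisy_counts, p):
    """Estimate clean value counts |Eij| from noisy counts |Dij| (Eq. 5).

    noisy_counts: length-Mi vector of observed counts for one attribute
    p: known uniform noise level pi for that attribute
    """
    Mi = len(noisy_counts)
    # Transformation matrix A (Fig. 1): diagonal 1 - p + p/Mi, off-diagonal p/Mi
    A = np.full((Mi, Mi), p / Mi) + (1 - p) * np.eye(Mi)
    return np.linalg.solve(A, noisy_counts)

# Toy check: corrupt known clean counts in expectation, then restore them.
clean = np.array([60.0, 30.0, 10.0])
p = 0.3
Mi = len(clean)
A = np.full((Mi, Mi), p / Mi) + (1 - p) * np.eye(Mi)
noisy = A @ clean                 # expected counts after corruption
print(restore_counts(noisy, p))   # recovers [60. 30. 10.]
```

Eq. (7) applies the same solve to the per-class count vectors Bl, one class at a time.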
Although minor classes may have a larger Bias(P(X | Y=Cl)), they also have a smaller P(Cl). Consequently, the bias from minor classes is kept under control and should not have a large impact on the final results.

3.3 Error Awareness Naïve Bayes (EA_NB)

With the above analysis, we can estimate the original attribute value distribution (|Eij^Cl|, l=1,..,L, j=1,..,Mi, i=1,..,M) under the constraint of the class Cl, l=1,..,L. These values can be directly used as the conditional probabilities P(X | Y=Cl). As NB assumes all attributes are independent, we repeat the same process for each attribute, and use the estimated conditional probabilities for the final classification. The pseudocode of the whole algorithm is depicted in Fig. 2.

Procedure: ErrorAwarenessNaiveBayes()
Input:  (1) D (a noisy dataset); (2) pi, i=1,..,M (noise level for each attribute)
Output: Polished Naïve Bayes model
  For each class Cl, l=1,..,L
    Calculate the class prior probability P(Cl)
    For each attribute Ai, i=1,..,M
      Count the attribute value distribution |Dij^Cl|, j=1,..,Mi, from D
      Solve Eq. (7) and acquire the estimates |Eij^Cl|, j=1,..,Mi
    End For
  End For
  Take the estimates |Eij^Cl|, j=1,..,Mi; i=1,..,M; l=1,..,L, as conditional
  probabilities, and combine them with the prior probabilities P(Cl) for error
  awareness Naïve Bayes classification.

Fig. 2. Error Awareness Naïve Bayes

Most datasets in the UCI data repository have been carefully examined by domain experts, and so do not contain much noise (at least we do not know which instances and which attribute values are erroneous). For comparative studies, we adopt a random corruption model, exactly the same as the one shown in Fig. 1, to manually inject errors into the attributes. Given a noise level pi, an attribute value aij has a probability pi of being changed to any other random value (including itself). So the actual noise level in Ai is pi − pi/Mi, which is always lower than the designed value (Mi is the number of attribute values in Ai).
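This corruption model can be sketched as follows (an illustrative sketch; the function and variable names are ours):

```python
import random

def corrupt_attribute(values, domain, p, rng=random):
    """With probability p, replace each value by a uniformly random value
    drawn from the attribute's domain (possibly itself), as in Fig. 1.
    The effective noise level is therefore p - p/|domain|."""
    return [rng.choice(domain) if rng.random() < p else v for v in values]

vals = ["a"] * 1000
out = corrupt_attribute(vals, ["a", "b"], p=0.4)
# With Mi = 2, about p - p/Mi = 0.4 - 0.2 = 20% of values change on average.
print(sum(v != "a" for v in out))
```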
With the same noise level pi, the larger the number of attribute values, the higher the overall noise level in the attribute. As we assume noise is evenly distributed among all attribute values, it has a much smaller effect on attributes with a large number of possible values than on those that have, say, only two values.

4. EXPERIMENTAL EVALUATIONS

To evaluate the performance of the proposed error awareness Naïve Bayes algorithm, we implemented both NB and EA_NB. In our implementation, most NB classifications are based on the discriminant function in Eq. (2); in the case that the class distributions of all classes become indistinguishable (which happens when the data is sparse or the dataset has many attributes), we use Eq. (3) instead. We evaluate our approach on 10 benchmark datasets from the UCI data repository [15], where the numerical attributes are equally discretized into 10 discrete values. Due to size restrictions, we mainly analyze the results on several representative datasets; the summarized results are reported in Table 1.

Most experiments are designed to assess the performance of the proposed error awareness NB in noisy environments, in comparison with the original NB classifiers trained from the same dataset. For each experiment, we perform 10 runs of 10-fold cross-validation and use the average accuracy as the final result. In each run, the dataset is randomly divided (with a proportional partitioning scheme) into a training set and a test set. The noise corruption model is applied to the training set, and the corrupted data are used to build the NB and EA_NB classifiers. All learners are tested on the test set to evaluate their performance.

4.1 Classification Accuracy Comparison

To evaluate the performance of EA_NB, we design the following experiments. Given a dataset E, we first train an NB classifier and denote its classification accuracy by "Original". We then introduce a certain level of noise into E to build a corrupted dataset D, and learn another NB classifier from D, whose performance is denoted by "Corrupted". With D and a noise level pi, we build an EA_NB classifier, denoted "EA_NB". As data cleansing is often adopted in noisy environments to enhance data quality and improve classification accuracy, we also apply a data cleansing method [5] on D to remove all misclassified examples, and build another NB classifier from the cleansed dataset, denoted "Cleansing". We compare the performances of the above four classifiers at different noise levels, pi ∈ [0.1, 0.5], and report the detailed results from four representative datasets in Fig. 4. The summarized results from six other datasets are reported in Table 1. To verify our own NB implementation, and to show that NB is indeed robust in noisy environments, we report the results of C4.5 in Table 1 as well. In Fig. 4, the x-axis represents the noise level pi, and the y-axis indicates the classification accuracy. The meaning of each curve in Fig. 4 is explained in Fig. 3.

Fig. 3. The meanings of curves in Fig. 4 (Original, Corrupted, Cleansing, EA_NB)

Fig. 4. Classification accuracy comparison: (a) Car dataset; (b) Splice dataset; (c) Mushroom dataset; (d) Glass dataset

Table 1.
Classification accuracy comparison

Dataset  pi(%)  Original      Corrupted     Cleansing  EA_NB
                NB     C4.5   NB     C4.5   NB
Adult    10     81.39  84.82  80.57  83.11  77.74      80.31
         30                   79.11  79.65  77.92      80.19
         50                   78.92  76.68  78.37      80.12
Krvskp   10     87.86  99.46  85.41  95.69  79.33      87.36
         30                   82.93  82.55  79.24      86.08
         50                   80.07  71.30  78.47      85.13
LED24    10     100.0  100.0  100.0  94.41  100.0      100.0
         30                   98.97  78.62  99.88      99.96
         50                   93.62  55.59  96.32      98.14
Nursery  10     91.47  98.68  90.84  92.23  87.59      90.72
         30                   90.31  77.08  87.22      90.49
         50                   90.04  63.87  87.31      90.25
Wine     10     95.12  93.25  95.02  92.61  94.56      95.01
         30                   92.63  90.32  90.04      94.42
         50                   88.71  85.14  73.33      91.76
Zoo      10     96.08  92.96  90.16  91.97  90.52      94.05
         30                   88.83  86.87  88.10      91.39
         50                   84.54  77.03  82.41      86.17

As shown in Fig. 4, errors negatively impact the learners built from noisy datasets. This is to be expected, as corrupted datasets no longer reveal the true data distributions, and therefore prevent NB classifiers from making correct decisions. It is worth noting that different datasets react differently to the same level of noise: a small portion of noise can seriously deteriorate an NB learner (e.g., the datasets in Fig. 4), while a significant amount of noise may still show little impact at all (e.g., the Adult and Nursery datasets in Table 1). We believe this is an intrinsic feature of a dataset, determined by factors such as the number of instances and the complexity of the concepts in the dataset. Meanwhile, as NB is a typical statistical learner, noise normally does less harm to it than to other learning mechanisms; as shown in Table 1, C4.5 usually deteriorates much faster than NB. Generally, for a dataset with a large number of instances and a significant amount of redundancy, the existence of errors does less harm, as noise cannot easily ruin the true data distributions.
On the other hand, for a dataset with a very limited number of instances, where each instance appears to be important for classification, adding a little noise can make considerable changes to the NB classifier, because noise in this case can easily modify the data distributions and fool NB learners. When noise is introduced to the attributes, data cleansing is simply not an effective option for improving data mining performance. For many of the datasets we used, the learners trained from the cleansed dataset ("Cleansing") are clearly inferior to the ones trained from the original noisy data ("Corrupted"), with "Cleansing" results as much as 7% lower than "Corrupted". This complies with previous observations from Quinlan [12]. The negative impact of data cleansing may come from two possible causes: (1) removing suspicious instances, which do not comply with the existing model, inevitably eliminates good examples and incurs information loss; and (2) just because some attribute values are erroneous does not mean the whole instance is useless; many other attribute values of a noisy instance may still benefit the learning theory, and therefore the instance cannot simply be removed. The impact of these two factors becomes extremely clear when the accuracy of the underlying learner is low, as shown in Fig. 4(d). This is understandable: if a learner has only 50% accuracy, then half of the removed instances are actually not noise. When incorporating statistical error information for error awareness NB classification, we achieve a very significant improvement in classification accuracy (as the "EA_NB" and "Corrupted" curves demonstrate). Take the Car dataset in Fig. 4(a) as an example, where the accuracy of EA_NB is 10% better than that of the learner from the corrupted dataset. Similar results can be observed on many benchmark datasets, where EA_NB is almost always better than Corrupted.
The higher the noise level in the datasets, the more improvement can be observed (as long as the noise level is no larger than 50%). This indicates that although noise continuously impacts the learning theory, making the data mining process aware of the data errors can substantially reduce the noise impact and enhance the mining results. The proposed effort does rely on statistical error information for its success, and a very limited number of instances may be insufficient for this purpose; however, our results on Glass (214 examples and 7 classes) and Wine (178 examples and 3 classes) indicate that even with very few instances, the proposed effort can still work well and achieve impressive results.

4.2 Classification Performance under Inexact Noise Levels

EA_NB takes the noise level pi as a priori knowledge given by users. In reality, the user specified value will often differ from the actual noise level in the dataset. If a tiny difference between the user specified value and the actual noise level could bring a considerable impact to the system performance, we would need to find solutions to enhance the robustness of our algorithm. For this purpose, we adopt the following approach to perturb the user specified noise level pi. Given a noise level pi, we first use this value to construct a noisy dataset D. When learning an EA_NB classifier from D, we intentionally change the noise level pi to pi ± δ(p)⋅pi, where δ(p), the amplitude of the perturbation, is uniformly selected from [0, p], and p is controlled by the users. This approach simulates situations where users can only roughly guess the noise level.

[Fig. 5 Classification performance comparison under inexact noise levels: accuracy vs. noise level pi (0.1 to 0.5) for "Corrupted" and "EA_NB+p" on (a) the Krvskp dataset and (b) the Car dataset]

In Fig. 5, we report the results from the two most representative datasets, Krvskp and Car, where the results for different values of p are denoted by "EA_NB+p". We set p from 0 to 0.5, which means the perturbation amplitude varies from 0 to 50% of the original noise level, and report the results at four values, 0, 0.1, 0.3 and 0.5 (as different values of p do not produce significant changes in Fig. 5(b), we omit the result for p=0.3 there). As Fig. 5 shows, when the user specified noise level differs from the actual noise level in the dataset, EA_NB inevitably deteriorates, because inaccurate noise knowledge misleads EA_NB into building a biased model. In general, the higher the amplitude of the perturbation, the lower the system performance. Depending on the intrinsic characteristics of each dataset, the deterioration of the system performance varies significantly. Determining which types of datasets are more sensitive to such a perturbation is a nontrivial task and requires intensive study of the complexity of the underlying concepts, the data redundancy, and the interactions among attributes. This investigation is beyond the scope of this paper, but our observations on these two representative datasets indicate that, as long as the user specified value is close to the actual noise level (e.g., deviating by no more than 30%), the proposed effort can still achieve impressive results and far outperform the learner trained from the noise corrupted dataset. This shows that EA_NB is fairly robust in practice, and can accommodate deviations in users' input for effective mining.

5. DISCUSSION

To demonstrate the theme of error awareness data mining, we have proposed a Naïve Bayes based solution. The same idea is applicable to many other classification algorithms, such as the popular decision tree algorithms ID3/C4.5 [11-12] and CART [16].
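To make the decision tree connection concrete, the standard CART-style gini-based split selection that the following discussion builds on can be sketched as below. This is a minimal illustration with our own variable names and toy data; an error aware variant would substitute the restored frequency estimates for the raw counts used here.

```python
# Sketch: standard CART-style gini split selection on a categorical attribute.
# `data` is a list of (attribute_value, class_label) pairs.
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum of squared relative class frequencies."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_binary_split(data):
    """Try each attribute value v as a 'value == v vs. the rest' split and
    return (value, weighted gini) with the smallest weighted gini index."""
    n = len(data)
    best = None
    for v in {a for a, _ in data}:
        s1 = [c for a, c in data if a == v]   # subset S1: instances with value v
        s2 = [c for a, c in data if a != v]   # subset S2: the remaining instances
        if not s1 or not s2:
            continue                          # degenerate split, skip
        g = len(s1) / n * gini(s1) + len(s2) / n * gini(s2)
        if best is None or g < best[1]:
            best = (v, g)
    return best

data = [("red", "+"), ("red", "+"), ("red", "-"),
        ("blue", "-"), ("blue", "-"), ("green", "+")]
print(best_binary_split(data))  # best split isolates the pure "blue" subset
```

With noisy attribute values, the class counts inside `s1` and `s2` are distorted, which is exactly why the weighted gini values, and hence the chosen splits, degrade without error awareness.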
For decision tree construction, ID3/C4.5 and CART adopt the information gain/gain ratio and the gini index, respectively, to evaluate each attribute, and select the most informative attribute one at a time to split the data into smaller subsets. This procedure repeats until all instances in each split subset belong to one class or some stopping criterion is met. To ensure good performance, calculating the information gain and gini index values accurately is essential, as incorrect values lead to poor splits and degrade system performance. With statistical error information, we can restore the original information gain or gini index values, similar to what we have done with NB.

Take the gini index in CART as an example. For a dataset S with N instances and L classes, the gini index Gini(S) is defined as

$Gini(S) = 1 - \sum_{l=1}^{L} f_l^2$

where f_l is the relative frequency of class l in S. With a splitting criterion T that splits S into two subsets S1 and S2 of sizes N1 and N2 respectively, the gini index Gini_Split(S, T) is defined as

$Gini_{Split}(S, T) = \frac{N_1}{N} Gini(S_1) + \frac{N_2}{N} Gini(S_2)$   (9)

To find the "best" splitting attribute, we enumerate all possible splitting points (determined by the attribute values) of each attribute, produce the corresponding subsets S1 and S2, and choose the split with the smallest gini index. In noisy environments, erroneous attribute values produce incorrect class frequencies in the split subsets S1 and S2, and therefore corrupt the true gini index values. With error information pi, we can estimate the original gini index values through the following steps:

1. Given a dataset S, for each class Cl, l=1,…,L, adopt Eq. (7) to estimate the original attribute value distribution |E_ij^{C_l}|, j=1,…,Mi; i=1,…,M.
2. Use Eq. (4) to estimate the original distribution |E_ij|, j=1,…,Mi, of attribute Ai.
3. Compute the modified gini index of S, w.r.t. each possible splitting point a_ij of attribute Ai, by Eq. (10):

$Gini_{Split}(S, a_{ij}) = \frac{|E_{ij}|}{N}\left(1 - \sum_{l=1}^{L}\left(\frac{|E_{ij}^{C_l}|}{|E_{ij}|}\right)^2\right) + \frac{N - |E_{ij}|}{N}\left(1 - \sum_{l=1}^{L}\left(\frac{\sum_{k=1, k \neq j}^{M_i} |E_{ik}^{C_l}|}{N - |E_{ij}|}\right)^2\right)$   (10)

4. Enumerate all possible splitting points of all attributes, and choose the one with the smallest value for splitting.

6. CONCLUSIONS

In this paper, we have proposed an error awareness data mining framework which seamlessly unifies statistical error information and a data mining algorithm for effective learning. The proposed effort makes use of noise knowledge to modify the model built from noise corrupted data, and has achieved a substantial improvement over the models built from the original noisy data and from the noise cleansed data. The novel features that distinguish the proposed effort from existing endeavors are twofold: (1) we unify noise knowledge and a general data mining algorithm into a single structure, whereas existing data mining activities often have no awareness of the underlying data errors; and (2) instead of polishing the noisy data, as many cleansing based approaches do, we take advantage of noise knowledge to polish the model trained from the noisy data sources, so the original data is well maintained. While the strategies presented in this paper are specific to Naïve Bayes, incorporating noise knowledge into the mining process for error awareness data mining is the essential idea we want to convey. When data encloses a certain level of erroneous attribute values, data cleansing is simply not an option for improving data mining performance, as it can result in substantial information loss. Our error awareness data mining framework, which utilizes noise knowledge to supervise model construction, is therefore very promising: it bridges the gap between low-quality data and the mining process to enhance system performance, and essentially avoids the information loss incurred by data cleansing.

REFERENCES

[1] D. Luebbers, U. Grimmer, and M. Jarke, "Systematic development of data mining based data quality tools," in Proc. of VLDB, Germany, 2003.
[2] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proc. of ACM SIGMOD, pp. 439-450, 2000.
[3] W. Du and Z. Zhan, "Using randomized response techniques for privacy-preserving data mining," in Proc. of 9th ACM SIGKDD, 2003.
[4] X. Zhu, X. Wu, and Q. Chen, "Eliminating class noise in large datasets," in Proc. of ICML, pp. 920-927, 2003.
[5] I. Fellegi and D. Holt, "A systematic approach to automatic edit and imputation," J. of the American Statistical Association, vol. 71, 1976.
[6] D. Rubin, Multiple Imputation for Nonresponse in Surveys, New York: John Wiley & Sons, Inc., 1987.
[7] R. Wang, V. Storey, and C. Firth, "A framework for analysis of data quality research," IEEE Trans. on KDE, 7(4), pp. 623-639, 1995.
[8] X. Zhu, X. Wu, and Y. Yang, "Error detection and impact-sensitive instance ranking in noisy datasets," in Proc. of AAAI, 2004.
[9] X. Zhu and X. Wu, "Cost-constrained data acquisition for intelligent data preparation," IEEE Trans. on KDE, 17(11), 2005.
[10] F. Ferri, J. Albert, and E. Vidal, "Considerations about sample-size sensitivity of a family of edited nearest-neighbor rules," IEEE Trans. on SMC - Part B, 29, pp. 667-672, 1999.
[11] J. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993.
[12] J. Quinlan, "Induction of decision trees," Machine Learning, 1(1), 1986.
[13] J. Fuernkranz, "Integrative windowing," Journal of Artificial Intelligence Research, vol. 8, pp. 129-164, 1998.
[14] D. Aha, D. Kibler, and M. Albert, "Instance-based learning algorithms," Machine Learning, 6(1), pp. 37-66, 1991.
[15] C. Blake and C. Merz, UCI Repository of Machine Learning Databases, 1998.
[16] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees, Wadsworth & Brooks, CA, 1984.
[17] M. Berry and G. Linoff, Mastering Data Mining, Wiley, 1999.
[18] Z. Huang, W. Du, and B. Chen, "Deriving private information from randomized data," in Proc. of ACM SIGMOD, Maryland, USA, 2005.