On A New Scheme on Privacy Preserving Data Classification

Nan Zhang†  Shengquan Wang†  Wei Zhao†∗

Abstract

We address the privacy preserving data classification problem in a distributed system. Randomization has been proposed to preserve privacy in such circumstances. However, this approach was challenged in [12] by a privacy intrusion technique that is capable of reconstructing the private data in a relatively accurate manner. In this paper, we introduce a scheme based on algebraic techniques. Compared to the randomization approach, our new scheme can build classifiers with better accuracy while disclosing less private information. We also show that our scheme is immune to privacy intrusion attacks. Performance lower bounds in terms of both accuracy and privacy are established.

Keywords: Privacy, security, classification

1 Introduction

In this paper, we address issues related to privacy preserving data mining techniques. The purpose of data mining is to extract knowledge from large amounts of data [10]. Classification is one of the biggest challenges in data mining. In this paper, we focus on privacy preserving data classification. The objective of classification is to construct a model (classifier) that is capable of predicting the (categorical) class labels of data [10]. The model is usually represented by classification rules, decision trees, neural networks, or mathematical formulae that can be used for classification. The model is constructed by analyzing data tuples (i.e., samples) in a training data set, where the class label of each data tuple is provided. For example, suppose that a company has a database containing the age, occupation, and income of customers and wants to know whether a new customer is a potential buyer of a new product. To answer this question, the company first builds a model of the existing customers in the database, based on whether they have bought the new product. The model consists of a set of classification rules (e.g., (occupation = student) ∧ (age ≤ 20) → buyer). Then, the company uses the model to determine whether the new customer is a potential buyer of the product.

Classification techniques have been extensively studied for over twenty years [14]. However, only in recent years has the issue of privacy protection in classification been raised [2, 13]. In many situations, privacy is a very important concern. In the above example, the customers may not want to disclose their personal information (e.g., income) to the company. The objective of research on privacy preserving data classification is to develop techniques that can build accurate classification models without disclosing private information in the data being mined. The performance of privacy preserving techniques should be analyzed and compared in terms of both accuracy and privacy.

∗ This work was supported in part by the National Science Foundation under Contracts 0081761, 0324988, 0329181, by the Defense Advanced Research Projects Agency under Contract F30602-99-1-0531, and by Texas A&M University under its Telecommunication and Information Task Force Program. Any opinions, findings, conclusions, and/or recommendations expressed in this material, either expressed or implied, are those of the authors and do not necessarily reflect the views of the sponsors listed above.
† The authors are with the Department of Computer Science, Texas A&M University, College Station, TX 77843, USA. Email: {nzhang, swang, zhao}@cs.tamu.edu.
We consider a distributed environment where the training data tuples for classification are stored on multiple sites. We can classify distributed privacy preserving classification systems into two categories based on their infrastructures: Server-to-Server (S2S) and Client-to-Server (C2S). In the first category (S2S), the data tuples in the training data set are distributed across several servers. Each server holds a part of the training data set that contains numerous data tuples. The servers collaborate with each other to construct a classifier spanning all servers. Since the number of servers in the system is usually small (e.g., less than 10), the problem can be formulated as a variation of secure multiparty computation [13]. Existing privacy preserving classification algorithms in this category include decision tree classifiers [4, 13], a naïve Bayesian classifier for vertically partitioned data [16], and a naïve Bayesian classifier for horizontally partitioned data [11].

In the second category (C2S), a system usually consists of a data miner (server) and numerous data providers (clients). Each data provider holds only one data tuple. The data miner performs the data classification process on the aggregated data offered by the data providers. An online survey is a typical example of this type of system, as it can be modeled as one survey collector/analyzer (data miner) and thousands of survey respondents (data providers). Needless to say, both S2S and C2S systems have a broad range of applications. In this paper, we focus on studying privacy preserving data classification in C2S systems.

Most of the current studies on C2S systems tacitly assume that randomization is an effective approach to preserving privacy. However, this assumption has been challenged in [12]. It was shown that an illegal data miner may be able to reconstruct the private data even after they have been randomized. In this paper, we take an algebraic approach and develop a new scheme. Our new scheme has the following important features that distinguish it from previous approaches.

• Our scheme can help build classifiers that have better accuracy but disclose less private information. A lower bound on accuracy is derived and can be used to predict the accuracy of the system in practice.

• Our scheme is immune to privacy intrusion attacks. That is, the data miner cannot derive more private information from the data tuples it receives from the data providers if the data are properly perturbed based on our scheme.

• Our scheme allows every user to play a role in determining the tradeoff between accuracy and privacy. Specifically, we allow explicit negotiation between each data provider and the data miner in terms of the tradeoff between accuracy and privacy. This makes our system meet the needs of a wide range of users, from hard-core privacy protectionists to the marginally concerned.

• Our scheme is flexible and easy to implement. It does not require a distribution reconstruction component as previous approaches do. Thus, our privacy preserving component is transparent to the data classification process and can be readily integrated with existing systems as middleware.

The rest of the paper is organized as follows. The model of data miners is introduced in Section 2. We briefly review previous approaches in Section 3. In Section 4, we introduce our new scheme and its basic components. We present a theoretical analysis of accuracy and privacy in Section 5.
Theoretical bounds on accuracy and privacy metrics are also derived in this section. Then we compare the performance of our scheme with that of the previous approach in Section 6. Experimental results are presented in this section. The implementation of our scheme is discussed in Section 7, followed by final remarks in Section 8.

2 Model

In this section, we introduce our model of data miners. Due to the privacy concern introduced to the system, we classify data miners into two categories. One category is legal data miners. These data miners always act legally in that they only perform regular data mining tasks and would never intentionally compromise the privacy of data. The other category is illegal data miners. These data miners purposely discover the private information in the data being mined. Like adversaries in distributed systems, illegal data miners come in many forms. In most forms, their behavior is restricted from arbitrarily deviating from the protocol. In this paper, we focus on a particular subclass of illegal miners, curious data miners. That is, in our system, illegal data miners are honest but curious¹: they follow the protocol strictly (i.e., they are honest), but they may analyze all intermediate communications and received data to compromise the privacy of data providers (i.e., they are curious) [8].

3 Randomization Approach and its Problems

Based on the model of data miners, we review the randomization approach, which is currently used to preserve privacy in classification. We also point out the problems associated with the randomization approach that motivate us to design a new privacy preserving scheme for data classification.

To prevent privacy invasions by curious data miners, countermeasures must be implemented in the data classification process. Randomization is a commonly used approach. We briefly review it below. With the randomization approach, the entire privacy preserving classification process can be considered a two-step process. The first step is to transmit (randomized) data from the data providers to the data miner. That is, in this step, a data provider applies a randomization operator R(·) to the data tuple that it holds. Then, the data provider transmits the randomized data tuple to the data miner. In previous studies, several randomization operators have been proposed, including the random perturbation operator [2] and the random response operator [5]. These operators are shown in (3.1) and (3.2), respectively:

(3.1)  R(t) = t + r,

(3.2)  R(t) = t̄ if r ≥ θ;  R(t) = t if r < θ.

Here, t is the original data tuple, r is noise randomly generated from a predetermined distribution, and θ is a predetermined parameter. Note that the random response operator in (3.2) only applies to binary data tuples.

In the second step, the legal data miner performs the data classification process on the aggregated data. With the randomization approach, the legal data miner must first employ a distribution reconstruction algorithm, which intends to recover the original data distribution from the randomized data. There are several algorithms for reconstructing the original distribution [1, 2, 5]. In particular, an expectation maximization (EM) algorithm was proposed in [1]. The distribution reconstructed by the EM algorithm converges to the maximum likelihood estimate of the original distribution.

¹ The honest-but-curious behavior model is also known as the semi-honest behavior model.
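As a concrete illustration, the two operators in (3.1) and (3.2) can be sketched in a few lines of NumPy. The noise distribution (Gaussian) and the parameter values below are our own illustrative assumptions, not the settings used in [2] or [5]:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_perturbation(t, noise_std=1.0):
    # (3.1): add noise r drawn from a predetermined distribution to t;
    # a Gaussian is assumed here purely for illustration.
    r = rng.normal(0.0, noise_std, size=np.shape(t))
    return np.asarray(t) + r

def randomized_response(t, theta=0.7):
    # (3.2): for a binary tuple t, keep each bit when the uniform draw r
    # falls below theta, and flip it (i.e., return t-bar) when r >= theta.
    t = np.asarray(t)
    r = rng.uniform(size=t.shape)
    return np.where(r >= theta, 1 - t, t)
```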
Also in the second step, a curious data miner may invade privacy by applying a privacy data recovery algorithm to the randomized data supplied by the data providers.

Figure 1: Randomization Approach

Figure 1 depicts the classification process with the randomization approach. Clearly, any such data classification system should be measured by its capacity for both building accurate classifiers and preventing private data leakage.

3.1 The problems of randomization approach. While the randomization approach is intuitive, researchers have recently identified privacy breaches as one of its major problems. Kargupta, Datta, Wang, and Sivakumar showed that the spectral properties of randomized data could help curious data miners separate noise from private data [12]. Based on random matrix theory, they proposed a filtering method to reconstruct private data from the randomized data set. They demonstrated that randomization preserves very little privacy in many cases.

The randomization approach also suffers in efficiency, as it puts a heavy load on (legal) data miners at run time (because of the distribution reconstruction) [3]. It is shown that the cost of mining a randomized data set is "well within an order of magnitude" of that of mining the original data set.²

Another problem with the randomization approach is that it cannot be adapted to meet the needs of different kinds of users. A survey on Internet users (potential data providers) showed that there are 17% privacy fundamentalists, 56% privacy pragmatists, and 27% marginally concerned people. Privacy fundamentalists are extremely concerned about privacy. Privacy pragmatists are concerned about privacy, but much less so than the fundamentalists. Marginally concerned people are generally willing to provide their private data. The randomization approach treats all data providers in the same manner and cannot handle the differing needs of different data providers.

We believe that the following are the reasons behind the above mentioned problems.

• The randomization operator is user-invariant. The same perturbation algorithm is applied to all data providers. The reason is that in a system using the randomization approach, the communication is one-way: from the data providers to the data miner. As such, a data provider cannot obtain any user-specific guidance on the randomization of its private data.

• The randomization operator is attribute-invariant. The same perturbation algorithm is applied to all attributes. The distribution of every attribute, no matter how useful (or useless) it is in the classification, is equally maintained in the perturbation. For example, suppose that each data tuple has three attributes: age, occupation, and salary. Also, assume that more than 95% of test data tuples can be correctly classified using age as the only attribute. The wisest decision is to maintain only the distribution of age, which is the most useful attribute in classification. If the randomization approach is taken, the private information disclosed on the other two attributes is unnecessary (from the perspective of a data provider) because it does not contribute much to building the classifier. However, again, due to the lack of communication between the data miner and the data providers, a data provider cannot learn the correlation between different attributes. Thus, a data provider has no choice but to employ an attribute-invariant operator.

² Although the work in [3] is based on association rule mining, we believe that the similarity between the randomization operators in association rule mining and data classification makes the efficiency concern inherent in the randomization approach.
These properties are inherent in the randomization approach, and hence motivate us to develop a new scheme allowing two-way communication between the data miner and the data providers. The two-way communication helps preserve private information without incurring much overhead. Thereby, we significantly improve the performance in terms of accuracy, privacy, and efficiency. We describe the new scheme in the next section.

4 Our New Scheme

In this section, we introduce our scheme and its basic components. We take a two-way communication approach that substantially improves the performance while incurring little overhead.

Figure 2: Our New Scheme

4.1 Description of our new scheme. Figure 2 depicts the infrastructure of our new scheme. Our scheme has two key components: perturbation guidance (PG) on the data miner server and perturbation on the data providers. Compared to the randomization approach, our scheme does not have a distribution recovery component. Instead, the classifier construction procedure is performed on the perturbed data tuples (R(t)) directly.

Our scheme is a three-step process. In the first step, the data miner negotiates a (possibly different) perturbation level k with each data provider. The larger k is, the more contribution R(t) makes to the classification process. The smaller k is, the more private information is preserved. Thus, a privacy fundamentalist can choose a small k to preserve its privacy, while a privacy unconcerned data provider can choose a large k to contribute to the classification.

The second step is to transmit (perturbed) data from the data providers to the data miner. Since each data provider comes at a different time, this step can be considered an iterative process. In each stage, the data miner dispatches a reference (perturbation guidance) V_k to a data provider P_i. Here V_k depends on the perturbation level k negotiated by the data miner and the data provider P_i in the first step. Based on the received V_k, the perturbation component of P_i computes the perturbed data tuple R(t_i) from the original data tuple t_i. Then, P_i transmits R(t_i) to the perturbation guidance (PG) component of the data miner. PG then updates V_k based on R(t_i) and forwards R(t_i) to the classifier construction procedure. A curious data miner can also obtain R(t_i). In this case, the curious data miner uses a private data recovery algorithm to discover private information from R(t_i).

In the third step, the perturbed data tuples received by the data miner are used by the classifier construction procedure as the training data tuples. A classifier is built and delivered to the data miner.

4.2 Basic components. The basic components of our scheme are: a) the method of computing V_k, and b) the perturbation function R(·). Before presenting the details of these components, we first introduce some notions related to the training data set.

Notions of training data set. Suppose that T is a training data set consisting of m data tuples t_1, …, t_m. Each data tuple belongs to a predefined class, which is determined by its class label attribute a_0. In this paper, we consider labeled data from two classes, named C_0 and C_1. The class label attribute has two distinct values, 0 and 1, corresponding to classes C_0 and C_1, respectively.
Besides the class label attribute, each data tuple has n attributes, named a_1, …, a_n. The class label attribute of each data tuple is public (i.e., privacy-insensitive). All other attributes contain private information which needs to be preserved. We represent the private part of the training data set by an m × n matrix T = [t_1; …; t_m] = [a_1, …, a_n].³ We use T_0 and T_1 to represent the matrices of data tuples that belong to classes C_0 and C_1, respectively. We denote the j-th attribute of t_i by T_ij. An example of T is shown in Table 1. As we can see from the matrix, the first data tuple in T is t_1 = [20, 3.2, 18, 1]. It belongs to class C_1.

³ Here t_i and a_i are used somewhat ambiguously. In the context of the training data set, t_i is a data tuple and a_i is an attribute. In the context of the matrix, t_i represents a row vector in T and a_i represents a column vector in T.

Table 1: An Example of a Training Data Set

         a_1    a_2    a_3    a_4   |  a_0
  t_1     20    3.2     18      1   |    1
  ...    ...    ...    ...    ...   |  ...
  t_m     40    2.5     13      0   |    0

(The matrix T consists of the columns a_1, …, a_4; the class label column a_0 is shown alongside.) Notions used in the paper are listed in Appendix F.

Computation of V_k. In our scheme, V_k is an estimation of the first k eigenvectors of T_0^T T_0 − T_1^T T_1. The justification of V_k will be provided in Section 5. Now we show how to update V_k when a new data tuple is received. As we are considering the case where data tuples are iteratively fed to the data miner, the data miner keeps a copy of all received data tuples and updates it when a new data tuple is received. Let the current matrix of received data tuples be T*. When a new data tuple R(t) is received by the data miner, R(t) is appended to the bottom of T*.

Besides the received data tuples T*, the data miner also keeps track of two additional matrices: A*_0 = (T*_0)^T T*_0 and A*_1 = (T*_1)^T T*_1, where T*_0 and T*_1 are the matrices of received data tuples that belong to classes C_0 and C_1, respectively. Note that the update of A*_0 and A*_1 (after R(t) is received) does not need access to any data tuple other than the recently received R(t). Thus, we do not require the matrix of received data tuples to remain in main memory. In particular, if the class label attribute of R(t) is c (c ∈ {0, 1}), A*_c is updated as follows:

(4.3)  A*_c ← A*_c + R(t)^T R(t).

Given the updated A*_0 and A*_1, the computation of V_k is done in the following steps. Using singular value decomposition (SVD), we can decompose A* = A*_0 − A*_1 as

(4.4)  A* = V* Σ (V*)^T,

where Σ = diag(s_1, …, s_n) is a diagonal matrix with s_1 ≥ ··· ≥ s_n, s_i is the i-th eigenvalue of A*, and V* is an n × n unitary matrix composed of the eigenvectors of A*. V_k is composed of the first k eigenvectors of A* (i.e., the first k column vectors of V*), which correspond to the k largest eigenvalues of A*. In particular, if V* = [v_1, …, v_n], then

(4.5)  V_k = [v_1, …, v_k].

The computing cost of updating V_k is addressed in Section 6. A short sketch of this update loop follows.
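The following minimal NumPy sketch illustrates the PG component. The class and method names are ours, not the paper's, and `eigh` is used in place of a general SVD because A* is symmetric:

```python
import numpy as np

class PerturbationGuidance:
    """Sketch of the data miner's PG component (hypothetical names)."""

    def __init__(self, n):
        # A*_0 and A*_1 start as n x n zero matrices.
        self.A_star = [np.zeros((n, n)), np.zeros((n, n))]

    def receive(self, perturbed_tuple, c):
        # (4.3): A*_c <- A*_c + R(t)^T R(t); only the newly received tuple
        # is needed, so earlier tuples need not stay in main memory.
        t = np.asarray(perturbed_tuple, dtype=float).reshape(1, -1)
        self.A_star[c] += t.T @ t

    def guidance(self, k):
        # (4.4)-(4.5): decompose A* = A*_0 - A*_1 and keep the eigenvectors
        # belonging to the k largest eigenvalues. Since A* is symmetric,
        # eigh suffices in place of a full SVD.
        s, V = np.linalg.eigh(self.A_star[0] - self.A_star[1])
        order = np.argsort(s)[::-1]        # s_1 >= ... >= s_n
        return V[:, order[:k]]             # V_k is n x k
```

A data provider that receives this V_k computes its perturbed tuple as R(t) = t V_k V_k^T, which is exactly the perturbation function defined next.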
Perturbation function R(·). Once a data provider obtains V_k from the data miner, the data provider applies the perturbation function R(·) to its original data tuple t. The result is a perturbed data tuple that is transmitted to the data miner. In our scheme, the perturbation function R(·) is defined as follows:

(4.6)  R(t) = t V_k V_k^T.

We have now described our new scheme and its basic components. We are ready to analyze our scheme in terms of accuracy, privacy, and their tradeoff.

5 Theoretical Analysis of Our Scheme

In this section, we analyze our new scheme. We define metrics for accuracy and privacy and derive bounds on them, in order to provide guidelines for the tradeoff between these two measures and hence help system managers set parameters in practice.

5.1 Accuracy analysis. An accuracy measure should reflect the capability of the system to correctly classify the objects in a given population. We will define an accuracy metric and derive a lower bound on it.

Accuracy metric. Before we formally define an accuracy metric, let us review the process of building a classifier and observe what factors may impact accuracy. Recall that the process of building a classifier takes the following steps:

• The training data set T is sampled from a population T.

• Due to the privacy concern, we perturb each data tuple t in the training data set by the perturbation function R(·) and produce a perturbed data tuple R(t). In the previous section, we described our selection of R(·).

• A classifier construction procedure (CCP) is used to mine the perturbed training data set in order to produce a classifier.

Figure 3: Building a Classifier

Figure 3 shows the workflow of such a process. From this process, it is clear that many factors impact the accuracy. They include the characteristics of the training data set, the algorithm used by CCP, and the perturbation function used to perturb the training data tuples. Formally, we define the metric for accuracy as the probability that a test data tuple sampled from the population T can be correctly classified by the classifier produced by CCP. We denote the accuracy metric by l_a(T, R, CCP), where T is the sample from the population, R is the perturbation function, and CCP is the classifier construction procedure.

Lower bounds on accuracy. As we have observed, the accuracy measure of a given system depends on many factors. It remains an open problem to develop a systematic methodology allowing one to derive the value of the accuracy measure for a given system. Nevertheless, in this paper, we derive lower bounds on accuracy and discuss their implications in practice. We focus on a system where the classifier construction procedure uses the tree augmented naïve Bayesian classification algorithm (TAN) [7] to build the classifier. Note that this algorithm is an improved version of the traditional naïve Bayesian classification algorithm, as the independence assumption has been relaxed. The independence assumption assumes that all attributes in a data tuple are conditionally independent of each other. TAN relaxes this assumption by allowing certain dependencies between two attributes. An overview of TAN can be found in Appendix A.

We will start with two lemmas. First, we define a matrix

(5.7)  A = T_0^T T_0 − T_1^T T_1.

Recall that T_0 and T_1 are the matrices of original training data tuples that belong to classes C_0 and C_1, respectively. We assume that the data miner maintains an estimation of A. The estimation is defined as follows:

(5.8)  Ã = T̃_0^T T̃_0 − T̃_1^T T̃_1.

Here T̃_0 and T̃_1 are the matrices of perturbed data tuples that belong to classes C_0 and C_1, respectively. Note that Ã = A* when all the (perturbed) data tuples in the training data set have been received by the data miner. We found that the accuracy of our system depends on the estimation error of A, which is defined as follows:

(5.9)  ε = max_{r,q∈[1,n]} |A − Ã|_rq.

Formally, the following lemma can be established. Please refer to Appendix B for the proof of the lemma.

LEMMA 5.1.
For a given system where the classifier construction procedure uses TAN to build the classifier, if the estimation error of A is bounded by ε, then the accuracy of the system is approximately bounded by

(5.10)  l_a(T, R, TAN) ⪆ p − 0.4 v_r v_q · sqrt( ε² / (9 · Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} (α_i^r α_j^q)²) ),

where TAN denotes the classifier construction procedure, v_r and v_q are the numbers of values of a_r and a_q, respectively, α_i^r and α_j^q are the i-th and j-th possible values of a_r and a_q, respectively, and p is the accuracy of the system when ε = 0.

The following lemma provides an upper bound on the estimation error of A.

LEMMA 5.2. Let the (k + 1)-th largest eigenvalue of A be s_{k+1}. Then ∀r, q ∈ [1, n], we have

(5.11)  |A − Ã|_rq ≤ s_{k+1}.

For the proof, please refer to Appendix C. With this lemma, we can establish a lower bound on accuracy as follows.

THEOREM 5.1.

(5.12)  l_a(T, R, TAN) ⪆ p − 0.4 v_r v_q · sqrt( s_{k+1}² / (9 · Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} (α_i^r α_j^q)²) ).

The theorem is established by simply substituting (5.11) into (5.10). From this theorem, we can make the following observations.

• The accuracy measure increases monotonically with increasing p, which is the accuracy measure for the original system without the privacy preserving countermeasure. That is, the higher the accuracy in the original system, the higher the accuracy in the privacy preserving one.

• Given a training data set T, the accuracy measure increases monotonically with increasing perturbation level k (i.e., decreasing s_{k+1}). Thus, a privacy unconcerned data provider may contribute more to the classification by choosing a large k to help improve the accuracy.

5.2 Privacy analysis. A privacy measure is needed to properly quantify the privacy loss in the system. We will define a privacy metric and derive a lower bound on it.

Metrics defined in previous studies. Several privacy metrics have been proposed in the literature. We briefly review them below. In [2] and [18], two interval-based metrics were proposed. Suppose that, based on the perturbed data tuple R(t), the value of attribute a_r of an original data tuple can be estimated to lie in the interval (α_0^r, α_1^r) with c% confidence. By [2], the privacy measure of attribute a_r at level c% is then defined as the minimum width of the interval, min(α_1^r − α_0^r). In [18], this definition is revised by using the Lebesgue measure and hence becomes more mathematically rigorous. In [1], an information-theoretic metric was defined as follows. Given R(t), the privacy loss of t is P(t|R(t)) = 1 − 2^{−I(t;R(t))}, where I(t; R(t)) is the mutual information of t and R(t). This metric measures the average amount of privacy leakage. In contrast, a metric that measures the worst-case privacy loss was proposed in [6]. Each of these metrics has its pros and cons. Most of them are defined in the context of the randomization approach.⁴ Nevertheless, we now propose a general privacy metric to quantify privacy leakage in a given privacy preserving data mining system.

⁴ To make a fair comparison with the randomization approach, we will use the interval-based metric in [2] as the privacy metric in the performance comparison in Section 6.

A general privacy metric. Let the matrix of perturbed data tuples R(t) be T̃. Suppose that T̂ is the best approximation of T that a (curious) data miner can derive from T̃. We would like to measure the privacy leakage by the distance between T and T̂. Formally, we have the following definition.

DEFINITION 5.1. Given a matrix norm ‖·‖, we define the privacy measure l_p as
(5.13)  l_p = min_{T̂} ‖T − T̂‖,

where T̂ can be derived from T̃.

In the above definition, the distance between T and T̂ can always be formulated by a matrix norm (e.g., the Frobenius norm, also known as the Euclidean norm) of T − T̂. A list of commonly used matrix norms is shown in Table 2.

Table 2: Matrix Norms of an m × n matrix M

  norm      definition
  ‖M‖_1     max_j Σ_{i=1}^m |M_ij|
  ‖M‖_2     sqrt( max eigenvalue of M^T M )
  ‖M‖_∞    max_i Σ_{j=1}^n |M_ij|
  ‖M‖_F     sqrt( Σ_{i,j=1}^{m,n} M_ij² )

In comparison with the definitions used in previous studies, our definition of the privacy metric is a general one. The reason is that one can use different matrix norms to satisfy different measurement requirements. For example, if one wants to measure the average privacy loss, the Frobenius norm may be used. On the other hand, if one wants to analyze the worst case, the 1-norm and ∞-norm may be selected.

Immunity property. Before we derive an analytical bound on the privacy measure, we first show that our system is immune to privacy intrusion attacks. That is, an illegal data miner cannot derive a better approximation of the original data tuples from the perturbed data tuples it receives. This property can be established as follows. In our scheme, we have

(5.14)  R(t) = t V_k V_k^T.

Since V_k is composed of only the first k eigenvectors of A*, V_k V_k^T is a singular matrix with det(V_k V_k^T) = 0. That is, V_k V_k^T does not have an inverse matrix. Thus, t cannot be deduced from R(t) deterministically. Furthermore, we also claim that no better approximation of t can be derived from R(t). Let the Moore-Penrose pseudoinverse of V_k V_k^T be (V_k V_k^T)†. Due to the property of the pseudoinverse, given R(t), R(t)(V_k V_k^T)† is the shortest-length least squares solution for t in (5.14). Since V_k^T V_k = I_k, the matrix V_k V_k^T is an orthogonal projection, and (V_k V_k^T)† is equal to V_k V_k^T itself. Thus, the least squares approximation of t based on R(t) is t̂ = t V_k V_k^T V_k V_k^T = R(t). That is, no better approximation of t can be derived from R(t).

Analytical bound on l_p. We now derive a set of lower bounds on the privacy measure, depending on the matrix norm used.

THEOREM 5.2. The lower bounds on the privacy measure are given below:

  norm          lower bound on l_p
  ‖T − T̂‖_1    δ_{k+1} / (max_i ‖a_i + ã_i‖_∞)
  ‖T − T̂‖_2    ρ_{k+1}
  ‖T − T̂‖_∞   δ_{k+1} / (max_i ‖a_i + ã_i‖_1)
  ‖T − T̂‖_F    sqrt( Σ_{i=k+1}^n ρ_i² )

where the matrix norms are defined as in [9], ρ_i is the i-th singular value of T, s_i is the i-th eigenvalue of A, σ_i is the i-th eigenvalue of T_0^T T_0, τ_i is the i-th eigenvalue of T_1^T T_1, and

(5.15)  δ_i = 2 max{σ_i, τ_i} − s_i.

The proof of the theorem can be found in Appendix D. Theorem 5.2 establishes a quantitative relationship between the privacy measure and the spectra of T, A, T_0^T T_0, and T_1^T T_1. Note that the lower bounds also implicitly depend on the value of k. By inspecting these formulas, one can easily see that the smaller k is, the higher the lower bounds are. This is consistent with our intuition that a small k better protects the privacy.

6 Performance Evaluation

In this section, we first demonstrate the effectiveness of our scheme by presenting simulation results on a simple data set. Then, we compare the performance of our scheme and the randomization approach in two areas: a) the tradeoff between accuracy and privacy, and b) runtime efficiency.

Figure 4: Simulation results. (a) Original data; (b) randomized data; (c) perturbed data in our scheme. Each panel plots the number of elements against the attribute value (1 to 10).

6.1 Data distribution before and after randomization and perturbation.
We use a training data set of 1,000 data tuples, equally split between two classes. Each data tuple has 10 privacy-sensitive attributes a_1, …, a_10. Each attribute is independently and uniformly chosen from 1 to 10. The classification function is c = (a_1 > 5). That is, a data tuple is in class C_0 if a_1 ≤ 5; otherwise, the data tuple is in class C_1. Figure 4(a) shows the distribution of the original data. Each line represents the distribution of an attribute. Figure 4(b) shows the distribution of the randomized data after uniform randomization [2]. Figure 4(c) shows the distribution of the perturbed data in our scheme when the perturbation level k = 2.

Our scheme preserves the private information in a_2, …, a_10 better than the randomization approach. As we can see, the variance of a_2, …, a_10 after perturbation in our scheme is smaller than that of the randomized attributes under the randomization approach. On the other hand, our scheme leaves a_1 barely perturbed. The reason is that our scheme identifies a_1 as the only attribute that is needed in the classification. Thus, our scheme can identify the most useful attributes in classification and maintain the distribution of those attributes only.

6.2 Tradeoff between accuracy and privacy. In order to make a fair comparison between the performance of our scheme and the randomization approach proposed in [2], we use exactly the same training and testing data sets as in [2]. The training data set consists of 100,000 data tuples. The testing data set consists of 5,000 data tuples. Each data tuple has nine attributes. Five widely varied classification functions are used to measure the tradeoff between accuracy and privacy in different circumstances. A detailed description of the data set and the classification functions is available in Appendix E. We use the same classification algorithm, the ID3 decision tree algorithm, as in [2]. In our scheme, we update the perturbation guidance V_k each time 1,000 data tuples are received.

The comparison of the accuracy measure with the privacy level fixed at 75% [2] is shown in Figure 5. As we can see, our scheme outperforms the randomization approach on all five functions.

Figure 5: Comparison of performance (accuracy of the two schemes on functions Fn1-Fn5 at privacy level 75%).

A comparison between the accuracy of our scheme and that of the randomization approach at different privacy levels is shown in Figure 6. Function 2 is used in this figure. From this figure, we can observe a tradeoff between accuracy and privacy. We can also observe the role of the perturbation level k in our scheme (k = 3, …, 9 in the figure). In any case, our scheme outperforms the randomization approach for a wide range of k values.

Figure 6: Performance on Function 2 (privacy vs. accuracy for k = 3, …, 9).

6.3 Runtime efficiency. As we noted in Section 3, it is shown in [3] that the cost of mining a randomized data set is "well within an order of magnitude" of that of mining the original data set. In particular, the randomization approach proposed in [2] requires the original data distribution to be reconstructed before a decision tree classifier can be built on the randomized data set. The distribution reconstruction is a three-step process. We use the "ByClass" reconstruction algorithm as an example because, as stated in [2], it is a tradeoff between accuracy and efficiency.
In the first step, split points are determined to partition the domain of each attribute into intervals, each holding an estimated number of data points. The second step partitions data values into the different intervals. For each attribute, the values of the randomized data are sorted to be associated with an interval. In the third step, for each attribute, the original distribution is reconstructed for each class separately. The main purpose of the first two steps is to accelerate the computation of the third step. The time complexity of the algorithm is O(mn + nv²), where m is the number of training data tuples, n is the number of private attributes in a data tuple, and v is the number of intervals on each attribute. It is assumed in [2] that 10 ≤ v ≤ 100.

Note that the overhead of the randomization approach occurs on the critical time path. Since the distribution reconstruction is not an incremental algorithm, it has to be performed after all data tuples are collected and before the classifier is constructed. Besides, the distribution reconstruction algorithm requires access to the whole training data set, some of which may not be stored in main memory. This may incur even more serious overhead.

In our scheme, the perturbed data tuples are directly used to construct the classifier. The only overhead incurred on the data miner is updating the perturbation guidance V_k. Note that this overhead is not on the critical time path. Instead, it occurs during the collection of data. The time complexity of the updating process is O(n²). A heuristic for the number of updates is between 10 and 100. Since the number of attributes is much less than the number of data tuples (i.e., n ≪ m) in data classification, the overhead of our scheme is significantly less than that of the randomization approach. The space complexity of the updating process in our scheme is also O(n²). That is, the received data tuples need not remain in main memory. Besides, our scheme is inherently incremental. These features make our scheme scalable to very large training data sets. Since the perturbation level k is always a small number (e.g., k ≤ 10), the communication overhead (O(nk) per data provider) incurred by the two-way communication in our scheme is not significant.

7 Implementation

A prototypical system for privacy preserving data classification has been implemented using our new scheme. The goal of the system is to deliver an online survey solution that preserves the privacy of survey respondents. The survey collector/analyzer and the survey respondents are modeled as the data miner and the data providers, respectively. The system consists of a perturbation guidance component on web servers and a data perturbation component on web browsers. Both components are implemented as custom plug-ins that can easily be installed into existing systems. The architecture of our system is shown in Figure 7.

Figure 7: System implementation

As shown in the figure, there are three separate layers in our system: the user interface layer, the perturbation layer, and the web layer.
The top layer, named the user interface layer, provides the interface to the data providers and the data miner. The middle layer, named the perturbation layer, realizes our privacy preserving scheme and exploits the bottom layer to transfer information. The bottom layer, named the web layer, consists of web servers and web browsers. As an important feature of our system, the details of data perturbation on the middle layer are transparent to both the data providers and the data miner.

8 Final Remarks

In this paper, we propose a new scheme for privacy preserving data classification. Compared with previous approaches, we introduce a two-way communication mechanism between the data miner and the data providers that incurs little overhead. In particular, we let the data miner send perturbation guidance to the data providers. Using this intelligence, the data providers perturb the data tuples to be transmitted to the miner. As a result, our scheme achieves a better tradeoff between accuracy and privacy. Our work is preliminary and many extensions can be made. In addition to using a similar approach in association rule mining [17], we are currently investigating how to apply the approach to clustering problems. We would also like to investigate a new behavior model that is stronger than the honest-but-curious model and can be dealt with by our scheme.

References

[1] D. Agrawal and C. C. Aggarwal, "On the design and quantification of privacy preserving data mining algorithms," in Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM Press, 2001, pp. 247-255.
[2] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proceedings of the 19th ACM SIGMOD International Conference on Management of Data. ACM Press, 2000, pp. 439-450.
[3] S. Agrawal, V. Krishnan, and J. R. Haritsa, "On addressing efficiency concerns in privacy-preserving mining," in Proceedings of the 9th International Conference on Database Systems for Advanced Applications. Springer Verlag, 2004, pp. 439-450.
[4] W. Du and Z. Zhan, "Building decision tree classifier on private data," in Proceedings of the IEEE International Conference on Privacy, Security and Data Mining. Australian Computer Society, Inc., 2002, pp. 1-8.
[5] W. Du and Z. Zhan, "Using randomized response techniques for privacy-preserving data mining," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 2003, pp. 505-510.
[6] A. Evfimievski, J. Gehrke, and R. Srikant, "Limiting privacy breaches in privacy preserving data mining," in Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM Press, 2003, pp. 211-222.
[7] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, vol. 29, no. 2-3, pp. 131-163, 1997.
[8] O. Goldreich, Secure Multi-Party Computation. Cambridge University Press, 2004, ch. 7.
[9] G. H. Golub and C. F. V. Loan, Matrix Computations. Johns Hopkins University Press, 1996.
[10] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
[11] M. Kantarcioglu and J. Vaidya, "Privacy preserving naïve Bayes classifier for horizontally partitioned data," in Workshop on Privacy Preserving Data Mining, held in association with the 3rd IEEE International Conference on Data Mining, 2003.
[12] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, "On the privacy preserving properties of random data perturbation techniques," in Proceedings of the 3rd IEEE International Conference on Data Mining. IEEE Press, 2003, pp. 99-106.
[13] Y. Lindell and B. Pinkas, "Privacy preserving data mining," in Proceedings of the 20th Annual International Cryptology Conference on Advances in Cryptology. Springer Verlag, 2000, pp. 36-54.
[14] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.
[15] A. C.
Tamhane and D. D. Dunlop, Statistics and Data Analysis: From Elementary to Intermediate. Prentice Hall, 1999.
[16] J. Vaidya and C. Clifton, "Privacy preserving naïve Bayes classifier for vertically partitioned data," in Proceedings of the 4th SIAM Conference on Data Mining. SIAM Press, 2004, pp. 330-334.
[17] N. Zhang, S. Wang, and W. Zhao, "A new scheme on privacy preserving association rule mining," in Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases. Springer Verlag, 2004.
[18] Y. Zhu and L. Liu, "Optimal randomization for privacy preserving data mining," in Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 2004, pp. 761-766.

Appendix A: Overview of TAN

For the completeness of the paper, we briefly introduce the tree augmented naïve Bayesian classification algorithm (TAN). For details, please refer to [7]. Bayesian classification uses the posterior probability, Pr{t ∈ C_0 | t}, to predict the class membership of test samples. Naïve Bayesian classification makes an additional assumption, called "conditional independence" [10]: it assumes that the values of the attributes of a data tuple are independent of each other. By Bayes' theorem, the posterior probability of the class label of a data tuple t is

(1.16)  Pr{t ∈ C_i | t} = Pr{C_i} · Π_{r=1}^n Pr{a_r = α_i^r | C_i} / Π_{r=1}^n Pr{a_r = α_i^r}.

Recall that α_i^r is a possible value of a_r. If we represent a naïve Bayesian classifier by a dependency graph (Figure 8), we can see that no connection between two attributes is allowed.

Figure 8: The structure of a naïve Bayesian classifier

TAN relaxes this assumption by allowing each attribute to have one other attribute as its parent. That is, TAN allows the dependencies between attributes to form a tree topology. The dependency graph of a TAN classifier is shown in Figure 9.

Figure 9: An example of a TAN classifier

For a TAN classifier, the posterior probability of the class label of a data tuple t is

(1.17)  Pr{t ∈ C_i | t} = Pr{C_i} · Π_{r=1}^n Pr{a_r = α_i^r | C_i, a_q = α_j^q} / Π_{r=1}^n Pr{a_r = α_i^r | a_q = α_j^q},

where a_q is the parent node of a_r in the dependency tree. For example, in the example shown in Figure 9, the parent node of a_2 is a_1.

Appendix B: Proof of Lemma 5.1

LEMMA 5.1. For a given system where the classifier construction procedure uses TAN to build the classifier, if the estimation error of A is bounded by ε, then the accuracy of the system is approximately bounded by

(2.18)  l_a(T, R, TAN) ⪆ p − 0.4 v_r v_q · sqrt( ε² / (9 · Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} (α_i^r α_j^q)²) ),

where TAN denotes the classifier construction procedure, v_r and v_q are the numbers of values of a_r and a_q, respectively, α_i^r and α_j^q are the i-th and j-th possible values of a_r and a_q, respectively, and p is the accuracy of the system when ε = 0.

Proof. Let ε_t = p − l_a(T, R, TAN). In order to derive an upper bound on ε_t, we first consider ε_rq, defined as follows:

(2.19)  ε_rq = Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} | Pr{C_0, a_r = α_i^r, a_q = α_j^q} − Pr{C_0, ã_r = α_i^r, ã_q = α_j^q} |.

Here r, q ∈ [1, n]. Recall that v_r and v_q are the numbers of possible values of a_r and a_q, respectively.

REMARK B.1.

(2.20)  ε_t ≤ max_{r,q∈[1,n]} ε_rq.

For the simplicity of discussion, we do not include the details here. An intuitive explanation of the remark is to consider a TAN classifier built on two attributes only. That is, the dependency graph consists of three nodes: C_i, a_r, and a_q.
Σ_i Σ_j Pr{C_0, a_r = α_i^r, a_q = α_j^q} is the probability that a test sample is correctly classified by such a classifier built on the original data. Σ_i Σ_j Pr{C_0, ã_r = α_i^r, ã_q = α_j^q} is the correct classification rate of such a classifier built on the perturbed data. Thus, ε_rq is the error rate of classification made by such a classifier due to the perturbation.

REMARK B.2. Let the upper bound on ε_rq be δ. Suppose that the value of every element of A can be determined from the perturbed data tuples with error ε. An estimation of δ is

(2.21)  δ ≈ 0.4 v_r v_q · sqrt( ε² / (9 · Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} (α_i^r α_j^q)²) ).

We represent Pr{a_r = α_i^r} by Pr{α_i^r}. Other notions are similarly abbreviated. Recall that an element A_rq of A = T_0^T T_0 − T_1^T T_1 is

(2.22)  A_rq = Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} α_i^r α_j^q ( Pr{C_0, α_i^r, α_j^q} − Pr{C_1, α_i^r, α_j^q} )

(2.23)       = Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} α_i^r α_j^q Pr{α_i^r, α_j^q} ( 2 Pr{C_0 | α_i^r, α_j^q} − 1 ).

Given r, q, let the error of Pr{C_0, a_r = α_i^r, a_q = α_j^q} after perturbation be ε_ij. Here we make an assumption (approximation) that all ε_ij are independently and identically distributed (i.i.d.) random variables. We describe them by the normal distribution. Since Σ_i Σ_j ε_ij is always 0, we assume that the mean of ε_ij is 0. Let the variance of ε_ij be σ_ij². The distribution of |A_rq − Ã_rq| is then the normal distribution with mean µ = 0 and variance

(2.24)  σ² = Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} (α_i^r α_j^q)² · 4σ_ij².

Recall that the error of Ã_rq is upper bounded by ε:

(2.25)  max_{r,q∈[1,n]} |A_rq − Ã_rq| ≤ ε.

Thus, we have σ ≤ ε/3. Due to (2.24), σ_ij² can be bounded by

(2.26)  σ_ij² ≤ (1/4) · ε² / (9 · Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} (α_i^r α_j^q)²).

Consider the average absolute deviation (i.e., mean deviation) of ε_ij, which is mean_ij |ε_ij|. We have ε_rq = v_r v_q · mean_ij |ε_ij|. Since the average absolute deviation of a normal distribution is about 0.8 times the standard deviation [15], we can approximate δ by

(2.27)  δ ≈ 0.4 v_r v_q · sqrt( ε² / (9 · Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} (α_i^r α_j^q)²) ).

Appendix C: Proof of Lemma 5.2

LEMMA 5.2. Let the (k + 1)-th largest eigenvalue of A be s_{k+1}. Then ∀r, q ∈ [1, n], we have

(3.28)  |A − Ã|_rq ≤ s_{k+1}.

Proof. We consider the case where V_k is composed of the first k eigenvectors of A = T_0^T T_0 − T_1^T T_1. Thus, the bound derived is an approximation, as in practice the value of V_k is estimated from the current copy of T* (i.e., the currently received data tuples).⁵ Given the data tuple matrix T, recall that the matrix of perturbed data tuples is T̃:

(3.29)  T = [t_1; …; t_m],

(3.30)  T̃ = [t_1 V_k V_k^T; …; t_m V_k V_k^T] = T V_k V_k^T.

T̃_0 and T̃_1 are the matrices of perturbed data tuples that belong to classes C_0 and C_1, respectively. Recall that Ã = T̃_0^T T̃_0 − T̃_1^T T̃_1. Let the singular value decomposition of A be A = V Σ V^T. With some algebraic manipulation, we have

(3.31)  Ã = (T_0 V_k V_k^T)^T (T_0 V_k V_k^T) − (T_1 V_k V_k^T)^T (T_1 V_k V_k^T)
(3.32)    = V_k V_k^T (T_0^T T_0 − T_1^T T_1) V_k V_k^T
(3.33)    = V_k V_k^T V Σ V^T V_k V_k^T
(3.34)    = V_k Σ_k V_k^T,

where Σ_k = diag(s_1, …, s_k). Due to the theory of singular value decomposition [9], Ã is the optimal rank-k approximation of A. We have

(3.35)  max_{r,q∈[1,n]} |A − Ã|_rq ≤ ‖A − Ã‖_2 = s_{k+1},

where ‖A − Ã‖_2 is the spectral norm of A − Ã.

⁵ Fortunately, the estimated V_k converges well to its accurate value. Besides, the convergence is fairly fast in most cases.
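The chain (3.31)-(3.35) is easy to spot-check numerically. The sketch below uses synthetic matrices of our own (not the paper's experimental data); eigenvalues are ordered by magnitude so that the rank-k truncation remains optimal even when A is indefinite:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 200, 8, 3

T0 = rng.normal(size=(m, n))          # synthetic class-C0 tuples
T1 = rng.normal(size=(m, n))          # synthetic class-C1 tuples
A = T0.T @ T0 - T1.T @ T1             # (5.7)

s, V = np.linalg.eigh(A)
order = np.argsort(np.abs(s))[::-1]   # order by magnitude: A may be indefinite
Vk = V[:, order[:k]]
P = Vk @ Vk.T                         # the projection applied by R(t)

# (3.31)-(3.34): A-tilde computed from the perturbed matrices equals
# the rank-k truncation of A.
A_tilde = (T0 @ P).T @ (T0 @ P) - (T1 @ P).T @ (T1 @ P)

# (3.35): the element-wise estimation error never exceeds |s_{k+1}|.
s_k1 = np.abs(s[order[k]])
assert np.max(np.abs(A - A_tilde)) <= s_k1 + 1e-8
```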
Appendix D: Proof of Theorem 5.2

THEOREM 5.2. The lower bounds on the privacy measure are given below:

  norm          lower bound on l_p
  ‖T − T̂‖_1    δ_{k+1} / (max_i ‖a_i + ã_i‖_∞)
  ‖T − T̂‖_2    ρ_{k+1}
  ‖T − T̂‖_∞   δ_{k+1} / (max_i ‖a_i + ã_i‖_1)
  ‖T − T̂‖_F    sqrt( Σ_{i=k+1}^n ρ_i² )

where the matrix norms are defined as in [9], ρ_i is the i-th singular value of T, s_i is the i-th eigenvalue of A, σ_i is the i-th eigenvalue of T_0^T T_0, τ_i is the i-th eigenvalue of T_1^T T_1, and

(4.36)  δ_i = 2 max{σ_i, τ_i} − s_i.

Proof. Recall that T̂ is the least squares approximation of T that can be derived from T̃. For the simplicity of discussion, we take T̂ equal to T̃. We first show simple lower bounds on the spectral norm ‖T − T̃‖_2 and the Frobenius norm ‖T − T̃‖_F. After that, we derive lower bounds on the maximum absolute column sum norm (i.e., 1-norm) ‖T − T̃‖_1 and the maximum absolute row sum norm (i.e., ∞-norm) ‖T − T̃‖_∞ based on the special structure of T̃. Please refer to Table 5 for the notions used in the proof.

Recall that T̃ = T V_k V_k^T. Since the rank of V_k is k, we have rank(T̃) ≤ k. Let the optimal rank-k approximation of T be T_k. We have

(4.37)  ‖T̃ − T‖_F ≥ ‖T_k − T‖_F = sqrt( ρ_{k+1}² + ··· + ρ_n² ),

and

(4.38)  ‖T̃ − T‖_2 ≥ ‖T_k − T‖_2 = ρ_{k+1}.

As we proved in Appendix C, Ã is the optimal rank-k approximation of A. Thus, we have

(4.39)  ‖A − Ã‖_2 = s_{k+1}.

Note that T̃_0 and T̃_1 are rank-k approximations of T_0 and T_1, respectively, so T̃_0^T T̃_0 and T̃_1^T T̃_1 have rank at most k. Since the optimal rank-k approximations of T_0^T T_0 and T_1^T T_1 have spectral errors σ_{k+1} and τ_{k+1}, respectively, we have

(4.40)  ‖T_0^T T_0 − T̃_0^T T̃_0‖_2 ≥ σ_{k+1},

and

(4.41)  ‖T_1^T T_1 − T̃_1^T T̃_1‖_2 ≥ τ_{k+1}.

Note that all matrix norms satisfy the triangle inequality. Let ∆ = T^T T − T̃^T T̃. We claim

(4.42)  ‖∆‖_2 ≥ 2 max{σ_{k+1}, τ_{k+1}} − s_{k+1} = δ_{k+1}.

Indeed, since T^T T = T_0^T T_0 + T_1^T T_1 and A = T_0^T T_0 − T_1^T T_1, we have

(4.43)  ‖∆‖_2 = ‖2 T_0^T T_0 − 2 T̃_0^T T̃_0 − (A − Ã)‖_2
(4.44)        ≥ 2 ‖T_0^T T_0 − T̃_0^T T̃_0‖_2 − ‖A − Ã‖_2
(4.45)        ≥ 2 σ_{k+1} − s_{k+1},

and similarly ‖∆‖_2 ≥ 2 τ_{k+1} − s_{k+1}.

Note that for any matrix M, ‖M‖_2² ≤ ‖M‖_1 ‖M‖_∞. Since ∆ is a symmetric matrix, the 1-norm of ∆ is equal to the ∞-norm of ∆. Thus, we have

(4.46)  ‖∆‖_1 = ‖∆‖_∞ ≥ ‖∆‖_2 ≥ 2 max{σ_{k+1}, τ_{k+1}} − s_{k+1}.

By the definition of the 1-norm, we can always find a column index i ∈ [1, n] such that

(4.47)  ‖∆‖_1 = ‖T^T a_i − T̃^T ã_i‖_1
(4.48)        ⪅ ‖(T − T̃)^T (a_i + ã_i)‖_1
(4.49)        ≤ ‖T − T̃‖_∞ ‖a_i + ã_i‖_1.

Similarly, we can derive that

(4.50)  ‖∆‖_1 ≤ ‖T − T̃‖_1 ‖a_i + ã_i‖_∞.

Thus, the lower bounds on ‖T − T̃‖_1 and ‖T − T̃‖_∞ are as follows:

(4.51)  ‖T − T̃‖_1 ≥ (2 max{σ_{k+1}, τ_{k+1}} − s_{k+1}) / (max_i ‖a_i + ã_i‖_∞),

and

(4.52)  ‖T − T̃‖_∞ ≥ (2 max{σ_{k+1}, τ_{k+1}} − s_{k+1}) / (max_i ‖a_i + ã_i‖_1).
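The rank-based bounds (4.37) and (4.38) follow from the Eckart-Young theorem and can likewise be checked numerically. The data below are synthetic, and an arbitrary rank-k projection stands in for V_k V_k^T, since the argument only uses rank(T̃) ≤ k:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 200, 8, 3

T = rng.normal(size=(m, n))                 # synthetic private matrix
Q, _ = np.linalg.qr(rng.normal(size=(n, k)))
T_tilde = T @ Q @ Q.T                       # some rank-k perturbation of T

rho = np.linalg.svd(T, compute_uv=False)    # singular values rho_1 >= ...
gap = T - T_tilde

# (4.38): the spectral norm of the gap is at least rho_{k+1}.
assert np.linalg.norm(gap, 2) >= rho[k] - 1e-8
# (4.37): the Frobenius norm is at least sqrt(rho_{k+1}^2 + ... + rho_n^2).
assert np.linalg.norm(gap, "fro") >= np.sqrt(np.sum(rho[k:] ** 2)) - 1e-8
```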
Appendix E: Table of Attributes and Classification Functions

Table 3: Description of Attributes

  Attribute    Description
  salary       uniformly distributed on [20k, 150k]
  commission   if salary ≥ 75k then commission = 0, else uniformly distributed on [10k, 75k]
  age          uniformly distributed on [20, 80]
  elevel       uniformly chosen from 0..4
  car          uniformly chosen from 1..20
  zipcode      uniformly chosen from 0..9
  hvalue       uniformly distributed on [zipcode × 50k, zipcode × 100k]
  hyears       uniformly distributed on [1, 30]
  loan         uniformly distributed on [0, 500k]

Table 4: Classification Functions (condition for t ∈ C_0)

  F1: (age < 40) ∨ (age > 60)
  F2: ((age < 40) ∧ (50k ≤ salary ≤ 100k)) ∨ ((40 ≤ age < 60) ∧ (75k < salary < 125k)) ∨ ((age ≥ 60) ∧ (25k < salary < 75k))
  F3: ((age < 40) ∧ (((elevel ∈ [0, 1]) ∧ (25k ≤ salary ≤ 75k)) ∨ ((elevel ∈ [2, 3]) ∧ (50k ≤ salary ≤ 100k)))) ∨ ((40 ≤ age < 60) ∧ (((elevel ∈ [1, 3]) ∧ (50k ≤ salary ≤ 100k)) ∨ ((elevel = 4) ∧ (75k ≤ salary ≤ 125k)))) ∨ ((age ≥ 60) ∧ (((elevel ∈ [2, 4]) ∧ (50k ≤ salary ≤ 100k)) ∨ ((elevel = 1) ∧ (25k ≤ salary ≤ 75k))))
  F4: (0.67 × (salary + commission) − 0.2 × loan − 10k) > 0
  F5: (0.67 × (salary + commission) − 0.2 × loan + 0.2 × equity − 10k) > 0, where equity = 0.1 × hvalue × max(hyears − 20, 0)

Appendix F: Table of Notions

Table 5: Notions

  notion      definition
  m           number of data tuples in the training data set
  n           number of attributes besides the class label attribute
  t           a data tuple in the original training data set
  R(t)        a data tuple in the perturbed training data set
  a_r         the r-th attribute of an original data tuple
  α_i^r       the i-th value of a_r
  v_r         the number of possible values of a_r
  ã_r         the r-th attribute of a perturbed data tuple
  C_0         class 0
  C_1         class 1
  T           matrix of the original training data set
  T_0         matrix of original data tuples that belong to C_0
  T_1         matrix of original data tuples that belong to C_1
  T̃           matrix of the perturbed training data set
  T̃_0         matrix of perturbed data tuples that belong to C_0
  T̃_1         matrix of perturbed data tuples that belong to C_1
  T*          current matrix of received (perturbed) data tuples
  T*_0        matrix of received data tuples that belong to C_0
  T*_1        matrix of received data tuples that belong to C_1
  A           T_0^T T_0 − T_1^T T_1
  Ã           T̃_0^T T̃_0 − T̃_1^T T̃_1
  A*_0        (T*_0)^T T*_0
  A*_1        (T*_1)^T T*_1
  A*          A*_0 − A*_1
  ε           max_{r,q∈[1,n]} |A − Ã|_rq
  V*          matrix composed of the eigenvectors of A*
  V_k         matrix composed of the first k eigenvectors of A*
  ∆           T̃^T T̃ − T^T T
  ρ_i         i-th singular value of T
  s_i         i-th eigenvalue of A*
  Σ           diag(s_1, …, s_n)
  σ_i         i-th eigenvalue of T_0^T T_0
  τ_i         i-th eigenvalue of T_1^T T_1
  δ_i         2 max{σ_i, τ_i} − s_i
  |·|_rq      the element of a matrix with index (r, q)