On A New Scheme on Privacy Preserving Data Classification∗

Nan Zhang    Shengquan Wang    Wei Zhao†

∗ This work was supported in part by the National Science Foundation under Contracts 0081761, 0324988, 0329181, by the Defense Advanced Research Projects Agency under Contract F30602-99-1-0531, and by Texas A&M University under its Telecommunication and Information Task Force Program. Any opinions, findings, conclusions, and/or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors listed above.
† The authors are with the Department of Computer Science, Texas A&M University, College Station, TX 77843, USA. Email: {nzhang, swang, zhao}@cs.tamu.edu.

Abstract
We address the privacy preserving data classification problem in a
distributed system. Randomization has been proposed to preserve
privacy in such circumstances. However, this approach was challenged in [12] by a privacy intrusion technique that is capable of reconstructing the private data in a relatively accurate manner. In this paper, we introduce a scheme based on an algebraic technique. Compared to the randomization approach, our new scheme can build classifiers with better accuracy while disclosing less private information. We also show that our scheme is immune to privacy intrusion
attacks. Performance lower bounds in terms of both accuracy and
privacy are established.
Keywords: Privacy, security, classification
1 Introduction
In this paper, we address issues related to privacy preserving
data mining techniques. The purpose of data mining is to
extract knowledge from large amounts of data [10]. Classification is one of the biggest challenges in data mining. In this
paper, we focus on privacy preserving data classification.
The objective of classification is to construct a model
(classifier) that is capable of predicting the (categorical)
class labels of data [10]. The model is usually represented
by classification rules, decision trees, neural networks, or
mathematical formulae that can be used for classification.
The model is constructed by analyzing data tuples (i.e.,
samples) in a training data set, where the class label of
each data tuple is provided. For example, suppose that a
company has a database containing the age, occupation, and
income of customers and wants to know whether a new
customer is a potential buyer of a new product. To answer
this question, the company first builds a model which details
the existing customers in the database, based on whether they
have bought the new product. The model consists of a set of
classification rules (e.g., (occupation = student) ∧ (age ≤ 20)
→ buyer). Then, the company uses the model to determine
whether the new customer is a potential buyer of the product.
Classification techniques have been extensively studied
for over twenty years [14]. However, only in recent years has
the issue of privacy protection in classification been raised
[2, 13]. In many situations, privacy is a very important concern. In the above example, the customers may not want
to disclose their personal information (e.g., incomes) to the
company. The objective of research on privacy preserving
data classification is to develop techniques that can build accurate classification models without disclosing private information in the data being mined. The performance of privacy
preserving techniques should be analyzed and compared in
terms of both accuracy and privacy.
We consider a distributed environment where training
data tuples for classification are stored in multiple sites.
We can classify distributed privacy preserving classification systems into two categories based on their infrastructures: Server-to-Server (S2S) and Client-to-Server (C2S).
In the first category (S2S), data tuples in the training
data set are distributed across several servers. Each server
holds a part of the training data set that contains numerous
data tuples. The servers collaborate with each other to construct a classifier spanning all servers. Since the number of
servers in the system is usually small (e.g., less than 10),
the problem can be formulated as a variation of secure multiparty computation [13]. Existing privacy preserving classification algorithms in this category include decision tree
classifiers [4,13], naı̈ve Bayesian classifier for vertically partitioned data [16], and naı̈ve Bayesian classifier for horizontally partitioned data [11].
In the second category (C2S), a system usually consists of a data miner (server) and numerous data providers
(clients). Each data provider holds only one data tuple. The
data miner performs the data classification process on aggregated data offered by the data providers. An online survey
is a typical example of this type of system, as the system can
be modeled as one survey collector/analyzer (data miner) and
thousands of survey respondents (data providers).
Needless to say, both S2S and C2S systems have a broad
range of applications. In this paper, we focus on studying
privacy preserving data classification in C2S systems. Most
of the current studies on C2S systems tacitly assume that
randomization is an effective approach to preserving privacy.
However, this assumption has been challenged in [12]. It was
shown that an illegal data miner may be able to reconstruct the private data even if they have been randomized. In this paper,
we take an algebraic approach and develop a new scheme.
Our new scheme has the following important features to
distinguish it from previous approaches.
• Our scheme can help to build classifiers that have better
accuracy but disclose less private information. A lower bound on accuracy is derived and can be used to predict the system accuracy in practice.
• Our scheme is immune to privacy intrusion attacks.
That is, the data miner cannot derive more private
information from the data tuples it receives from the
data providers if the data are properly perturbed based
on our scheme.
• Our scheme allows every user to play a role in determining the tradeoff between accuracy and privacy. Specifically, we allow explicit negotiation between each data provider and the data miner in terms of the tradeoff between accuracy and privacy. This makes our system meet the needs of a wide range of users, from hard-core privacy protectionists to those who are only marginally concerned about privacy.
• Our scheme is flexible and easy to implement. It does not require a distribution reconstruction component as previous approaches do. Thus, our privacy preserving component is transparent to the data classification process and can be readily integrated with existing systems as middleware.
The rest of the paper is organized as follows: The model
of data miners is introduced in Section 2. We briefly review
previous approaches in Section 3. In Section 4, we introduce
our new scheme and its basic components. We present a
theoretical analysis on accuracy and privacy in Section 5.
Theoretical bounds on accuracy and privacy metrics are also
derived in this section. Then we make a comparison between
the performance of our scheme and the previous approach in
Section 6. Experimental results are presented in this section.
The implementation of our scheme is discussed in Section 7,
followed by final remarks in Section 8.
2 Model
In this section, we introduce our model of data miners. Due
to the privacy concern introduced to the system, we classify
the data miners into two categories. One category is legal
data miners. These data miners always act legally in that
they only perform regular data mining tasks and would never
intentionally compromise the privacy of data. The other category is illegal data miners. These data miners purposely attempt to discover private information in the data being mined.
Like adversaries in distributed systems, illegal data miners come in many forms. In most forms, their behavior is restricted so that they cannot arbitrarily deviate from the protocol. In this
paper, we focus on a particular subclass of illegal miners, curious data miners. That is, in our system, illegal data miners
are honest but curious 1 : they follow the protocol strictly (i.e.,
they are honest), but they may analyze all intermediate communications and received data to compromise the privacy of
data providers (i.e., they are curious) [8].
3 Randomization Approach and its Problems
Based on the model of data miners, we review the randomization approach, which is currently used to preserve privacy
in classification. We also point out the problems associated
with the randomization approach that motivate us to design a new privacy preserving scheme for data classification.
To prevent privacy invasions by curious data miners,
countermeasures must be implemented in the data classification process. Randomization is a commonly used approach.
We briefly review it below.
Based on the randomization approach, the entire privacy preserving classification process can be considered a
two-step process. The first step is to transmit (randomized)
data from the data providers to the data miner. That is, in
this step, a data provider applies a randomization operator
R(·) to the data tuple that the data provider holds. Then,
the data provider transmits the randomized data tuple to the
data miner. In previous studies, several randomization operators have been proposed including the random perturbation
operator [2] and the random response operator [5]. These
operators are shown in (3.1) and (3.2), respectively.
(3.1)    R(t) = t + r.

(3.2)    R(t) = t̄, if r ≥ θ;    R(t) = t, if r < θ.
Here, t is the original data tuple, r is the noise randomly
generated from a predetermined distribution, and θ is a
predetermined parameter. Note that the random response
operator in (3.2) only applies to binary data tuples.
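To make the two operators concrete, here is a minimal sketch (ours, not from the paper) using numpy; the Gaussian noise distribution, the threshold value, and the reading of t̄ as the bit-wise complement of a binary tuple are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_perturbation(t, noise_std=1.0):
    """Operator (3.1): add noise r drawn from a predetermined
    distribution (a Gaussian is assumed here for illustration)."""
    r = rng.normal(0.0, noise_std, size=len(t))
    return t + r

def random_response(t, theta=0.7):
    """Operator (3.2) for a binary tuple t: if the random draw r is at
    least theta, return the complement t-bar; otherwise return t."""
    r = rng.random()
    return 1 - t if r >= theta else t

print(random_perturbation(np.array([20.0, 3.2, 18.0])))
print(random_response(np.array([1, 0, 1, 1])))
```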
In the second step, the legal data miner performs the
data classification process on the aggregated data. With the
randomization approach, the legal data miner must first employ a distribution reconstruction algorithm which intends
to recover the original data distribution from the randomized data. There are several algorithms for reconstructing
the original distribution [1, 2, 5]. In particular, an expectation maximization (EM) algorithm was proposed in [1]. The
distribution reconstructed by the EM algorithm converges to the
maximum likelihood estimate of the original distribution.
1 The honest-but-curious behavior model is also known as semi-honest
behavior model.
Also in the second step, a curious data miner may invade privacy by using a private data recovery algorithm on the randomized data supplied by the data providers.

Figure 1: Randomization Approach

Figure 1 depicts the classification process with the randomization approach. Clearly, any such data classification system should be measured by its capacity for both building accurate classifiers and preventing private data leakage.
3.1 The problems of the randomization approach. While the randomization approach is intuitive, researchers have
recently identified privacy breaches as one of the major
problems with the randomization approach. Kargupta, Datta,
Wang, and Sivakumar showed that the spectral properties of
randomized data could help curious data miners to separate
noise from private data [12]. Based on random matrix theory,
they proposed a filtering method to reconstruct private data
from the randomized data set. They demonstrated that
randomization preserves very little privacy in many cases.
The randomization approach also suffers in efficiency, as it puts a heavy load on (legal) data miners at run time (because of the distribution reconstruction) [3]. It has been shown that the cost of mining a randomized data set is “well within an order of magnitude” with respect to that of mining the original data set.²

² Although the work is based on association rule mining, we believe that the similarity between the randomization operators in association rule mining and data classification makes the efficiency concern inherent in the randomization approach.
Another problem with the randomization approach is that it cannot be adapted to meet the needs of different kinds of users. A survey of Internet users (potential data providers) showed that there are 17% privacy fundamentalists, 56% privacy pragmatists, and 27% marginally concerned people. Privacy fundamentalists are extremely concerned about privacy. Privacy pragmatists are concerned about privacy, but much less than the fundamentalists. Marginally concerned people are generally willing to provide their private data. The randomization approach treats all the data providers in the same manner and cannot handle the differing needs of different data providers.

We believe that the following are the reasons behind the above mentioned problems.

• Randomization operator is user-invariant. In a system, the same perturbation algorithm is applied to all data providers. The reason is that in a system using the randomization approach, the communication is one-way: from the data providers to the data miner. As such, a data provider cannot obtain any user-specified guidance on the randomization of its private data.

• Randomization operator is attribute-invariant. The same perturbation algorithm is applied to all attributes. The distribution of every attribute, no matter how useful (or useless) it is in the classification, is equally maintained in the perturbation. For example, suppose that each data tuple has three attributes: age, occupation, and salary. Also, assume that more than 95% of test data tuples can be correctly classified using age as the only attribute. The wisest decision is to maintain only the distribution of age, which is the most useful attribute in the classification. If the randomization approach is taken, the private information disclosed on the other two attributes is unnecessary (from the perspective of a data provider) because it does not contribute much to building the classifier. However, again, due to the lack of communication between the data miner and the data providers, a data provider cannot learn the correlation between different attributes. Thus, a data provider has no choice but to employ an attribute-invariant operator.

These properties are inherent in the randomization approach, and hence motivate us to develop a new scheme that allows two-way communication between the data miner and the data providers. The two-way communication helps preserve private information while not incurring too much overhead. Thereby, we significantly improve the performance in terms of accuracy, privacy, and efficiency. We describe the new scheme in the next section.
4 Our New Scheme
In this section, we introduce our scheme and its basic components. We take a two-way communication approach that substantially improves the performance while incurring little overhead.
Figure 2: Our New Scheme
4.1 Description of our new scheme. Figure 2 depicts the
infrastructure of our new scheme. Our scheme has two key
components, perturbation guidance (PG) in the data miner
server and perturbation in the data providers. Compared
to the randomization approach, our scheme does not have
the distribution recovery component. Instead, the classifier
construction procedure is performed on the perturbed data
tuples (R(t)) directly.
Our scheme is a three-step process. In the first step, the
data miner negotiates a perturbation level k with each data provider. The larger k is, the more contribution
R(t) will make to the classification process. The smaller k
is, the more private information is preserved. Thus, a privacy
fundamentalist can choose a small k to preserve its privacy
while a privacy unconcerned data provider can choose a large
k to contribute to the classification.
The second step is to transmit (perturbed) data from the
data providers to the data miner. Since each data provider
comes at a different time, this step can be considered as an
iterative process. In each stage, the data miner dispatches
a reference (perturbation guidance) V k to a data provider
Pi . Here Vk depends on the perturbation level k negotiated
by the data miner and the data provider P i in the first
step. Based on the received V k , the perturbation component
of Pi computes the perturbed data tuple R(t i ) from the
original data tuple t i . Then, Pi transmits R(ti ) to the
perturbation guidance (PG) component of the data miner.
PG then updates Vk based on R(ti ) and forwards R(t i ) to
the classifier construction procedure. A curious data miner
can also obtain R(t i ). In this case, the curious data miner
uses a private data recovery algorithm to discover private information from R(ti).

In the third step, the perturbed data tuples received by the data miner are used by the classifier construction procedure as the training data tuples. A classifier is built and delivered to the data miner.

4.2 Basic components. The basic components of our scheme are: a) the method of computing Vk, and b) the perturbation function R(·). Before presenting the details of the components, we first introduce some notions of the training data set.
Notions of training data set. Suppose that T is a training
data set consisting of m data tuples t 1 , . . . tm . Each data
tuple belongs to a predefined class, which is determined by
its class label attribute a0 . In this paper, we consider the
labeled data from two classes, named C 0 and C1 . The class
label attribute has two distinct values 0 and 1, corresponding
to classes C0 and C1 , respectively. Besides the class label
attribute, each data tuple has n attributes, named a 1 , . . . , an .
The class label attribute of each data tuple is public (i.e., privacy-insensitive). All other attributes contain private information which needs to be preserved. We represent the
private part of the training data set by an m × n matrix
T = [t1 ; . . . ; tm ] = [a1 , . . . , an ].3 We use T0 and T1 to
represent the matrices of data tuples that belong to class
C0 and C1 , respectively. We denote the j-th attribute of
ti by T ij . An example of T is shown in Table 1. As
we can see from the matrix, the first data tuple in T is
t1 = [20, 3.2, 18, 1]. It belongs to class C1 .
Table 1: An Example of a Training Data Set

        a1     a2     a3     a4   |  a0
  t1    20     3.2    18     1    |  1
  ...   ...    ...    ...    ...  |  ...
  tm    40     2.5    13     0    |  0
Notions used in the paper are listed in Appendix F.
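As a small illustration of this notation, the sketch below (ours) builds T from the two tuples shown in Table 1 and splits it into T0 and T1 by the class label; the omitted middle rows are simply left out.

```python
import numpy as np

# The two tuples shown in Table 1 (the rows in between are omitted there).
T = np.array([[20, 3.2, 18, 1],    # t1, class label a0 = 1
              [40, 2.5, 13, 0]])   # tm, class label a0 = 0
a0 = np.array([1, 0])

T0 = T[a0 == 0]   # data tuples that belong to class C0
T1 = T[a0 == 1]   # data tuples that belong to class C1
print(T0, T1, sep="\n")
```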
Computation of Vk. In our scheme, Vk is an estimation of the first k eigenvectors of T0ᵀT0 − T1ᵀT1. The justification of Vk will be provided in Section 5. Now we show how to
update Vk when a new data tuple is received.
As we are considering the case where data tuples are
iteratively fed to the data miner, the data miner keeps a copy
of all received data tuples and updates it when a new data
tuple is received. Let the current matrix of received data
tuples be T ∗ . When a new data tuple R(t) is received by
the data miner, R(t) is appended to the bottom of T ∗ .
³ Here ti and ai are used somewhat ambiguously. In the context of the training data set, ti is a data tuple and ai is an attribute. In the context of the matrix, ti represents a row vector in T and ai represents a column vector in T.
Besides the received data tuples T ∗ , the data miner also
keeps track of two additional matrices: A∗0 = T0∗ᵀT0∗ and A∗1 = T1∗ᵀT1∗, where T0∗ and T1∗ are the matrices of received
data tuples that belong to class C 0 and C1 , respectively. Note
that the update of A ∗0 and A∗1 (after R(t) is received) does not
need access to any data tuple other than the recently received
R(t). Thus, we do not require the matrix of data tuples T
to remain in main memory. In particular, if the class label
attribute of R(t) is c (c ∈ {0, 1}), A ∗c is updated as follows.
(4.3)    A∗c = A∗c + R(t)ᵀR(t).

Given the updated A∗0 and A∗1, the computation of Vk is done in the following steps. Using singular value decomposition (SVD), we can decompose A∗ = A∗0 − A∗1 as

(4.4)    A∗ = V∗ Σ V∗ᵀ,
where Σ = diag(s1 , . . . , sn ) is a diagonal matrix with s 1 ≥
· · · ≥ sn , si is the i-th eigenvalue of A ∗ , and V ∗ is an n × n
unitary matrix composed of the eigenvectors of A ∗ .
Vk is composed of the first k eigenvectors of A ∗ (i.e.,
the first k column vectors of V ∗ ), which correspond to
the k largest eigenvalues of A ∗ . In particular, if V ∗ =
[v1 , . . . , vn ], then
(4.5)
Vk = [v1 , . . . , vk ].
The computing cost of updating V k is addressed in Section 6.
Perturbation function R(·). Once a data provider obtains
Vk from the data miner, the data provider employs a perturbation function R(·) on the original data tuple t. The result
is a perturbed data tuple that is transmitted to the data miner.
In our scheme, the perturbation function R(·) is defined as
follows.
(4.6)    R(t) = t Vk Vkᵀ.
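The following sketch (ours, not the authors' implementation) shows how the two components could fit together with numpy: the data miner maintains A∗0 and A∗1 incrementally as in (4.3), derives Vk from A∗ = A∗0 − A∗1 as in (4.4)–(4.5), and a data provider applies (4.6). The class name and the bootstrap guidance below are illustrative assumptions.

```python
import numpy as np

class PerturbationGuidance:
    """Data-miner side: keeps A*_0 and A*_1 and derives V_k."""
    def __init__(self, n, k):
        self.A = [np.zeros((n, n)), np.zeros((n, n))]   # A*_0, A*_1
        self.k = k

    def update(self, perturbed_tuple, class_label):
        # (4.3): A*_c <- A*_c + R(t)^T R(t); only the new tuple is needed
        t = perturbed_tuple.reshape(1, -1)
        self.A[class_label] += t.T @ t

    def guidance(self):
        # (4.4)-(4.5): V_k = first k eigenvectors of A* = A*_0 - A*_1,
        # ordered by decreasing eigenvalue
        A_star = self.A[0] - self.A[1]
        eigvals, eigvecs = np.linalg.eigh(A_star)
        order = np.argsort(eigvals)[::-1]
        return eigvecs[:, order[:self.k]]

def perturb(t, Vk):
    """Data-provider side, equation (4.6): R(t) = t V_k V_k^T."""
    return t @ Vk @ Vk.T

# Toy usage: one provider perturbs a tuple using the current guidance.
pg = PerturbationGuidance(n=4, k=2)
Vk = np.linalg.qr(np.random.default_rng(1).normal(size=(4, 2)))[0]  # bootstrap guidance
t = np.array([20.0, 3.2, 18.0, 1.0])
Rt = perturb(t, Vk)
pg.update(Rt, class_label=1)
```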
We have now described our new scheme and its basic
components. We are ready to analyze our scheme in terms
of accuracy, privacy and their tradeoff.
5 Theoretical Analysis of Our Scheme
In this section, we analyze our new scheme. We will define
metrics for accuracy and privacy and derive their bounds, in
order to provide guidelines for the tradeoff between these
two measures and hence help system managers set parameters in practice.
5.1 Accuracy analysis. An accuracy measure should reflect the capability of the system to correctly classify the objects in a given population. We will define an accuracy metric and derive a lower bound on it.

Accuracy metric. Before we formally define an accuracy metric, let us review the process of building a classifier and observe what factors may impact accuracy. Recall that the process of building a classifier takes the following steps:

• Training data set T is sampled from a population T.

• Due to the privacy concern, we perturb each data tuple t in the training data set by the perturbation function R(·) and produce a perturbed data tuple R(t). In the previous section, we described our selection of R(·).

• A classifier construction procedure (CCP) is used to mine the perturbed training data set in order to produce a classifier.
Figure 3: Building a Classifier
Figure 3 shows the workflow of such a process. From this
process, it is clear that many factors impact the accuracy.
They include the characteristics of the training data set, the
algorithm used by CCP, and the perturbation function used
to perturb the training data tuples. Formally, we define a
metric for accuracy as the probability that a test data tuple
sampled from the population T can be correctly classified
by the classifier produced by CCP. We denote the accuracy
metric by la (T, R, CCP) where T is the sample from the
population, R is the perturbation function, and CCP is the
classifier construction procedure.
Lower bounds on accuracy. As we have observed, the accuracy measure of a given system depends on many factors. It remains an open problem to develop a systematic
methodology allowing one to derive the value of accuracy
measures for a given system. Nevertheless, in this paper, we
derive lower bounds of accuracy and discuss their implications in practice. We will focus on a system where the classifier construction procedure uses the tree augmented naı̈ve
Bayesian classification algorithm (TAN) [7] to build the classifier. Note that this algorithm is an improved version over
the traditional naı̈ve Bayesian classification algorithm, as the
independence assumption has been relaxed. The independence assumption states that all attributes in a data tuple
are conditionally independent of each other. TAN relaxes
this assumption by allowing certain dependencies between
two attributes. An overview of TAN can be found in Appendix A.
We will start with two lemmas. First, we define a matrix

(5.7)    A = T0ᵀT0 − T1ᵀT1.
Recall that T0 and T1 are the matrices of original training
data tuples that belong to class C 0 and C1 , respectively. We
assume that the data miner maintains an estimation of A. The
estimation is defined as follows.
à = T̃0 T̃0 − T̃1 T̃1 .
(5.8)
Here T̃0 and T̃1 are the matrices of perturbed data tuples that
belong to class C0 and C1 , respectively. Note that à = A∗
when all the (perturbed) data tuples in the training data set
have been received by the data miner. We find that the accuracy of our system depends on the estimation error of A, which is defined as follows:

(5.9)    ε = max_{r,q∈[1,n]} |A − Ã|rq.
Formally, the following lemma can be established. Please
refer to Appendix B for the proof of the lemma.
LEMMA 5.1. For a given system where the classifier construction procedure uses TAN to build the classifier, if the estimation error of A is bounded by ε, then the accuracy of the system is approximately bounded from below by

(5.10)    la(T, R, TAN) ≳ p − 0.4 vr vq · ε² / (9 · Σ_{i=1}^{vr} Σ_{j=1}^{vq} αri αqj),

where TAN denotes the classifier construction procedure, vr and vq are the numbers of possible values of ar and aq, respectively, αri and αqj are the i-th and j-th possible values of ar and aq, respectively, and p is the accuracy of the system when ε = 0.
The following lemma provides an upper bound on the
estimation error of A.
LEMMA 5.2. Let the (k + 1)-th largest eigenvalue of A be sk+1. Then, ∀r, q ∈ [1, n], we have

(5.11)    |A − Ã|rq ≤ sk+1.
For the proof, please refer to Appendix C. With this lemma,
we can establish a lower bound of accuracy as follows.
THEOREM 5.1.

(5.12)    la(T, R, TAN) ≳ p − 0.4 vr vq · s²k+1 / (9 · Σ_{i=1}^{vr} Σ_{j=1}^{vq} αri αqj).
The theorem is established by simply substituting (5.11)
into (5.10). From this theorem, we can make the following
observations.
• The accuracy measure increases monotonically with increasing p, which is the accuracy measure for the original system without the privacy preserving countermeasure.
That is, the higher the accuracy in the original system,
the higher the accuracy in the privacy preserving one.
• Given a training data set T , the accuracy measure
increases monotonically with increasing perturbation level k (i.e., decreasing sk+1). Thus, a privacy unconcerned data provider may contribute more to the classification by choosing a large k to help improve the
accuracy.
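As a quick illustration of how the bound in Theorem 5.1 could be evaluated in practice, here is a small helper (ours; all numeric inputs below are hypothetical).

```python
def accuracy_lower_bound(p, s_k1, alpha_r, alpha_q):
    """Approximate lower bound (5.12):
    p - 0.4 * vr * vq * s_{k+1}^2 / (9 * sum_{i,j} alpha_r[i] * alpha_q[j])."""
    vr, vq = len(alpha_r), len(alpha_q)
    denom = 9.0 * sum(a * b for a in alpha_r for b in alpha_q)
    return p - 0.4 * vr * vq * s_k1 ** 2 / denom

# Hypothetical numbers: two attributes with values 1..5, s_{k+1} = 0.8.
print(accuracy_lower_bound(p=0.95, s_k1=0.8, alpha_r=range(1, 6), alpha_q=range(1, 6)))
```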
5.2 Privacy analysis. A privacy measure is needed to
properly quantify the privacy loss in the system. We will
define a privacy metric and derive a lower bound on it.
Metrics defined in previous studies. Several privacy metrics have been proposed in the literature. We briefly review
them below.
In [2] and [18], two interval-based metrics were proposed. Suppose that based on the perturbed data tuple R(t),
the value of attribute a r of an original data tuple can be estimated to lie in interval (α r0 , αr1 ) with c% confidence. By [2],
the privacy measure of attribute a r at level c% is then defined
as the minimum width of the interval, which is min(α r1 −αr0 ).
In [18], this definition is revised by using the Lebesgue measure, which makes it more mathematically rigorous.
In [1], an information-theoretic metric was defined as
follows. Given R(t), the privacy loss of t is given by P(t|R(t)) = 1 − 2^(−I(t;R(t))), where I(t; R(t)) is the mutual
information of t and R(t). This metric measures the average
amount of privacy leakage. In contrast, a metric that measures the worst-case privacy loss was proposed in [6].
Each of these metrics has its pros and cons. Most of
them are defined in the context of the randomization approach 4 . Nevertheless, we now propose a general privacy
metric to quantify privacy leakage in a given privacy preserving data mining system.
⁴ To make a fair comparison with the randomization approach, we will use the interval-based metric in [2] as the privacy metric in the performance comparison in Section 6.

A general privacy metric. Let the matrix of perturbed data tuples R(t) be T̃. Suppose that T̂ is the best approximation of T that a (curious) data miner can derive from T̃. We would like to measure the privacy leakage by the distance between T and T̂. Formally, we have the following:

DEFINITION 5.1. Given a matrix norm ‖·‖, we define the privacy measure lp as

(5.13)    lp = min_{T̂} ‖T − T̂‖,

where T̂ can be derived from T̃.

In the above definition, the distance between T and T̂ can always be formulated by a matrix norm (e.g., the Frobenius norm, also known as the Euclidean norm) of T − T̂. A list of commonly used matrix norms is shown in Table 2. In comparison with the definitions used in previous studies, our privacy metric is a general one, because one can use different matrix norms to satisfy different measurement requirements. For example, if one wants to measure the average privacy loss, the Frobenius norm may be used. On the other hand, if one wants to analyze the worst case, the 1-norm and ∞-norm may be selected.

Table 2: Matrix Norms of an m × n Matrix M

  norm      definition
  ‖M‖1      maxj Σ_{i=1}^{m} |Mij|
  ‖M‖2      √(largest eigenvalue of MᵀM)
  ‖M‖∞      maxi Σ_{j=1}^{n} |Mij|
  ‖M‖F      √(Σ_{i=1}^{m} Σ_{j=1}^{n} Mij²)

Immunity property. Before we derive an analytical bound on the privacy measure, we first show that our system is immune to privacy intrusion attacks. That is, an illegal data miner cannot derive a better approximation of the original data tuples from the perturbed data tuples it receives. This property can be established as follows. In our scheme, we have

(5.14)    R(t) = t Vk Vkᵀ.

Since Vk is composed of the first k eigenvectors of A, Vk Vkᵀ is a singular matrix with det(Vk Vkᵀ) = 0. That is, Vk Vkᵀ does not have an inverse matrix. Thus, t cannot be deduced from R(t) deterministically. Furthermore, we also claim that no better approximation of t can be derived from R(t). Let the Moore–Penrose pseudoinverse of Vk Vkᵀ be (Vk Vkᵀ)†. Due to the property of the pseudoinverse, given R(t), R(t)(Vk Vkᵀ)† is the minimum-norm least-squares solution for t in (5.14). Since Vk Vkᵀ is an orthogonal projection (the columns of Vk are orthonormal), (Vk Vkᵀ)† is equal to Vk Vkᵀ. Thus, the least-squares approximation of t based on R(t) is t̂ = t Vk Vkᵀ Vk Vkᵀ = R(t). That is, no better approximation of t can be derived from R(t).

Analytical bound on lp. We now derive a set of lower bounds on the privacy measure, depending on the matrix norm used.

THEOREM 5.2. The lower bounds on the privacy measure are given below:

  norm            lower bound on lp
  ‖T − T̂‖1       δk+1 / (maxi ‖ai + ãi‖∞)
  ‖T − T̂‖2       ρk+1
  ‖T − T̂‖∞       δk+1 / (maxi ‖ai + ãi‖1)
  ‖T − T̂‖F       √(Σ_{i=k+1}^{n} ρi²)

where the matrix norms are defined in [9], ρi is the i-th eigenvalue of T, si is the i-th eigenvalue of A, σi is the i-th eigenvalue of T0ᵀT0, τi is the i-th eigenvalue of T1ᵀT1, and

(5.15)    δi = 2 max{σi, τi} − si.

The proof of the theorem can be found in Appendix D.

Theorem 5.2 establishes a quantitative relationship between the privacy measure and the eigenvalues of T, A, T0ᵀT0, and T1ᵀT1. Note that the lower bounds also implicitly depend on the value of k. By inspecting these formulas, one can easily see that the smaller k is, the higher the lower bounds are. This is consistent with our intuition that a small k would better protect the privacy.
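The immunity property and Definition 5.1 are easy to check numerically. The sketch below (ours, with random data standing in for a real training set) verifies that the pseudoinverse of Vk Vkᵀ is Vk Vkᵀ itself, so least-squares recovery returns the perturbed tuples unchanged, and then evaluates lp under the Frobenius norm with T̂ = T̃.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 50, 6, 2
T = rng.uniform(1, 10, size=(m, n))              # original private tuples (random stand-in)
Vk = np.linalg.qr(rng.normal(size=(n, k)))[0]    # stand-in perturbation guidance
P = Vk @ Vk.T                                    # projection used by R(t) = t P

T_tilde = T @ P                                  # perturbed tuples received by the miner

# Immunity: the pseudoinverse of the projection equals the projection itself,
# so the least-squares "recovery" gives back the perturbed tuples unchanged.
assert np.allclose(np.linalg.pinv(P), P)
assert np.allclose(T_tilde @ np.linalg.pinv(P), T_tilde)

# Privacy measure (Definition 5.1) with the Frobenius norm, taking T_hat = T_tilde.
l_p = np.linalg.norm(T - T_tilde, ord="fro")
print(l_p)
```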
6 Performance Evaluation

In this section, we first demonstrate the effectiveness of our scheme by presenting simulation results on a simple data set. Then, we compare the performance of our scheme and the randomization approach in two areas: a) tradeoff between accuracy and privacy, and b) runtime efficiency.

6.1 Data distribution before and after randomization and perturbation. We use a training data set of 1,000 data tuples, equally split between two classes. Each data tuple has 10 privacy-sensitive attributes a1, . . . , a10. Each attribute is independently and uniformly chosen from 1 to 10. The classification function is c = (a1 > 5). That is, a data tuple is in group C0 if a1 ≤ 5. Otherwise, the data tuple is in group C1. Figure 4(a) shows the distribution of the original data. Each line represents the distribution of an attribute. Figure 4(b) shows the distribution of the randomized data after uniform randomization [2]. Figure 4(c) shows the distribution of the perturbed data in our scheme when the perturbation level k = 2.

Our scheme preserves the private information in a2, . . . , a10 better than the randomization approach. As we can see, the variance of a2, . . . , a10 after perturbation in our scheme is smaller than that of the attributes randomized by the randomization approach. On the other hand, our scheme leaves a1 barely perturbed. The reason is that our scheme identifies a1 as the only attribute that is needed in the classification. Thus, our scheme can identify the most useful attributes in classification and maintain the distribution of only these attributes.

Figure 4: Simulation results. (a) Original data; (b) randomized data; (c) perturbed data in our scheme. Each panel plots the number of elements against the attribute value (1 to 10) for attributes a1, . . . , a10.
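The simple data set of Section 6.1 is easy to reproduce. The sketch below (ours) generates the 1,000-tuple data set, computes Vk with k = 2 directly from the class-wise matrices, and compares per-attribute variances before and after perturbation; ordering the eigenvectors by eigenvalue magnitude is our simplification, not necessarily the paper's exact rule.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 1000, 10, 2

T = rng.integers(1, 11, size=(m, n)).astype(float)   # each attribute uniform on 1..10
labels = (T[:, 0] > 5).astype(int)                    # class label: c = (a1 > 5)

T0, T1 = T[labels == 0], T[labels == 1]               # C0: a1 <= 5, C1: a1 > 5
A = T0.T @ T0 - T1.T @ T1                             # A = T0'T0 - T1'T1

eigvals, eigvecs = np.linalg.eigh(A)
top = np.argsort(np.abs(eigvals))[::-1][:k]           # top-k directions (by magnitude)
Vk = eigvecs[:, top]

T_pert = T @ Vk @ Vk.T                                # R(t) = t Vk Vk' for every tuple

print("variance per attribute (original): ", T.var(axis=0).round(2))
print("variance per attribute (perturbed):", T_pert.var(axis=0).round(2))
```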
6.2 Tradeoff between accuracy and privacy. In order to make a fair comparison between the performance of our scheme and the randomization approach proposed in [2], we use exactly the same training and testing data sets as in [2]. The training data set consists of 100,000 data tuples. The testing data set consists of 5,000 data tuples. Each data tuple has nine attributes. Five widely varied classification functions are used to measure the tradeoff between accuracy and privacy in different circumstances. A detailed description of the data set and the classification functions is available in Appendix E. We use the same classification algorithm, the ID3 decision tree algorithm, as in [2]. In our scheme, we update the perturbation guidance Vk once every 1,000 data tuples are received.

The comparison of the accuracy measure with the privacy level fixed at 75% [2] is shown in Figure 5. As we can see, our scheme outperforms the randomization approach on all five functions.

Figure 5: Comparison of performance. Accuracy (%) of our new scheme and the randomization approach on classification functions Fn1 through Fn5, with the privacy level fixed at 75%.

A comparison between the accuracy of our scheme and that of the randomization approach at different privacy levels is shown in Figure 6. Function 2 is used in this figure. From this figure, we can observe a tradeoff between accuracy and privacy. We can also observe the role of the perturbation level k in our scheme. In any case, our scheme outperforms the randomization approach for a wide range of k values.

Figure 6: Performance on Function 2. Privacy (%) versus accuracy (%) for our new scheme at perturbation levels k = 3, . . . , 9 and for the randomization approach.

6.3 Runtime efficiency. As we have addressed in Section 3, it is shown in [3] that the cost of mining a randomized data set is “well within an order of magnitude” with respect to that of mining the original data set. In particular, the randomization approach proposed in [2] requires the original data distribution to be reconstructed before a decision tree classifier can be built on the randomized data set. The distribution reconstruction is a three-step process. We use the “ByClass” reconstruction algorithm as an example because, as stated in [2], it is a tradeoff between accuracy and efficiency.

In the first step, split points are determined to partition the domain of each attribute into intervals. There is an estimated number of data points in each interval. The second step partitions data values into different intervals. For each attribute, the values of the randomized data are sorted to be associated with an interval. In the third step, for each attribute, the original distribution is reconstructed for each class separately. The main purpose of the first two steps is to accelerate the computation of the third step. The time complexity of the algorithm is O(mn + nv²), where m is the number of training data tuples, n is the number of private attributes in a data tuple, and v is the number of intervals on each attribute. It is assumed in [2] that 10 ≤ v ≤ 100.

Note that the overhead of the randomization approach occurs on the critical time path. Since the distribution reconstruction is not an incremental algorithm, it has to be performed after all data tuples are collected and before the classifier is constructed. Besides, the distribution reconstruction algorithm requires access to the whole training data set, some of which may not be stored in the main memory. This problem may incur even more serious overhead.

In our scheme, the perturbed data tuples are directly used to construct the classifier. The only overhead incurred on the data miner is updating the perturbation guidance Vk. Note that this overhead is not on the critical time path. Instead, it occurs during the collection of data. The time complexity of the updating process is O(n²). As a heuristic, the number of updates is between 10 and 100. Since the number of attributes is much smaller than the number of data tuples (i.e., n ≪ m) in data classification, the overhead of our scheme is significantly less than the overhead of the randomization approach.

The space complexity of the updating process in our scheme is also O(n²). That is, the received data tuples need not remain in the main memory. Besides, our scheme is inherently incremental. These features make our scheme scalable to very large training data sets.

Since the perturbation level k is always a small number (e.g., k ≤ 10), the communication overhead (O(nk) per data provider) incurred by the two-way communication in our scheme is not significant.
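As a rough, purely illustrative comparison of the two overheads (the numbers below are hypothetical, not measurements from the paper):

```python
m, n, v, updates = 100_000, 9, 100, 100

randomization_overhead = m * n + n * v ** 2   # O(mn + nv^2), on the critical path
our_overhead = updates * n ** 2               # O(n^2) per update of V_k, off the critical path

print(randomization_overhead, our_overhead)   # 990000 vs 8100
```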
7 Implementation
A prototypical system for privacy preserving data classification has been implemented using our new scheme. The
goal of the system is to deliver an online survey solution that
preserves the privacy of survey respondents. The survey collector/analyzer and the survey respondents are modeled as
the data miner and the data providers, respectively. The system consists of a perturbation guidance component on web
servers and a data perturbation component on web browsers.
Both components are implemented as custom plug-ins that
one can easily install to existing systems. The architecture
of our system is shown in Figure 7.
Figure 7: System implementation
As is shown in the figure, there are three separate
layers in our system: user interface layer, perturbation layer,
and web layer. The top layer, named user interface layer,
provides an interface to the data providers and the data miner. The middle layer, named perturbation layer, realizes our privacy preserving scheme and exploits the bottom layer to transfer information. The bottom layer, named web layer, consists of web servers and web browsers. As an important feature of our system, the details of data perturbation on the middle layer are transparent to both the data providers and the data miner.
8 Final Remarks
In this paper, we propose a new scheme on privacy preserving data classification. Compared with previous approaches, we introduce a two-way communication mechanism between the data miner and the data providers with little overhead. In particular, we let the data miner send perturbation guidance to data providers. Using this intelligence,
data providers perturb the data tuples to be transmitted to the
miner. As a result, our scheme has the benefit of a better
tradeoff between accuracy and privacy.
Our work is preliminary and many extensions can be
made. In addition to using a similar approach in association
rule mining [17], we are currently investigating how to
apply the approach to clustering problems. We would like
to investigate a new behavior model that is stronger than
the honest-but-curious model, and can be dealt with by our
scheme.
References

[1] D. Agrawal and C. C. Aggarwal, “On the design and quantification of privacy preserving data mining algorithms,” in Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM Press,
2001, pp. 247–255.
[2] R. Agrawal and R. Srikant, “Privacy-preserving data mining,”
in Proceedings of the 19th ACM SIGMOD International
Conference on Management of Data. ACM Press, 2000,
pp. 439–450.
[3] S. Agrawal, V. Krishnan, and J. R. Haritsa, “On addressing efficiency concerns in privacy-preserving mining,” in Proceedings of the 9th International Conference on Database Systems
for Advanced Applications. Springer Verlag, 2004, pp. 439–
450.
[4] W. Du and Z. Zhan, “Building decision tree classifier on private data,” in Proceedings of the IEEE International Conference on Privacy, Security and Data Mining. Australian Computer Society, Inc., 2002, pp. 1–8.
[5] W. Du and Z. Zhan, “Using randomized response techniques
for privacy-preserving data mining,” in Proceedings of the
9th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. ACM Press, 2003, pp. 505–
510.
[6] A. Evfimievski, J. Gehrke, and R. Srikant, “Limiting privacy
breaches in privacy preserving data mining,” in Proceedings
of the twenty-second ACM SIGMOD-SIGACT-SIGART sym-
posium on Principles of Database Systems. ACM Press, 2003, pp. 211–222.
[7] N. Friedman, D. Geiger, and M. Goldszmidt, “Bayesian network classifiers,” Machine Learning, vol. 29, no. 2-3, pp. 131–163, 1997.
[8] O. Goldreich, Secure Multi-Party Computation. Cambridge University Press, 2004, ch. 7.
[9] G. H. Golub and C. F. Van Loan, Matrix Computations. Johns Hopkins University Press, 1996.
[10] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
[11] M. Kantarcioglu and J. Vaidya, “Privacy preserving naïve Bayes classifier for horizontally partitioned data,” in Workshop on Privacy Preserving Data Mining held in association with the 3rd IEEE International Conference on Data Mining, 2003.
[12] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, “On the privacy preserving properties of random data perturbation techniques,” in Proceedings of the 3rd IEEE International Conference on Data Mining. IEEE Press, 2003, pp. 99–106.
[13] Y. Lindell and B. Pinkas, “Privacy preserving data mining,” in Proceedings of the 20th Annual International Cryptology Conference on Advances in Cryptology. Springer Verlag, 2000, pp. 36–54.
[14] J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[15] A. C. Tamhane and D. D. Dunlop, Statistics and Data Analysis: From Elementary to Intermediate. Prentice Hall, 1999.
[16] J. Vaidya and C. Clifton, “Privacy preserving naïve Bayes classifier for vertically partitioned data,” in Proceedings of the 4th SIAM Conference on Data Mining. SIAM Press, 2004, pp. 330–334.
[17] N. Zhang, S. Wang, and W. Zhao, “A new scheme on privacy preserving association rule mining,” in Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases. Springer Verlag, 2004.
[18] Y. Zhu and L. Liu, “Optimal randomization for privacy preserving data mining,” in Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 2004, pp. 761–766.
Appendix A Overview of TAN
For the completeness of the paper, we briefly introduce
the tree augmented naı̈ve Bayesian classification algorithm
(TAN). For details, please refer to [7]. Bayesian classification uses the posterior probability, Pr{t ∈ C 0 |t}, to predict
the class membership of test samples. Naı̈ve Bayesian classification makes an additional assumption, which is called
“conditional independence” [10]. It assumes that the values
of the attributes of a data tuple are conditionally independent of each other, given the class label. By Bayes' theorem, the posterior probability of the class label for a data tuple t is

(1.16)    Pr{t ∈ Ci | t} = Pr{Ci} · Π_{r=1}^{n} Pr{ar = αri | Ci} / Π_{r=1}^{n} Pr{ar = αri}.
Recall that αri is a possible value of a r . If we represent a
naı̈ve Bayesian classifier by a dependency graph (Figure 8),
we can see that no connection between two attributes is
allowed. TAN relaxes this assumption by allowing each
Figure 8: The structure of a naı̈ve Bayesian classifier
attribute to have one other attribute as its parent. That is,
TAN allows the dependency between attributes to form a
tree topology. The dependency graph of a TAN classifier
is shown in Figure 9. For a TAN classifier, the posterior
Figure 9: An example of TAN classifier
probability of the class label on a data tuple t is
(1.17) Pr{t ∈ Ci |t}
n
Pr{Ci } r=1 Pr{ar = αri |Ci , aq = αqj }
n
,
=
q
r
r=1 Pr{ar = αi |aq = αj }
where aq is the parent node of a r in the dependency tree. For
example, in the example shown in Figure 9, the parent node
of a2 is a1 .
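To make (1.17) concrete, here is a small sketch (ours) that scores both classes for a toy TAN model in which a1 is the parent of a2, as in Figure 9; all probability tables below are made up.

```python
# TAN posterior for a toy model with two attributes, where a1 is the
# parent of a2 (as in Figure 9). All probability tables are hypothetical.
prior = {0: 0.5, 1: 0.5}                                    # Pr{C_i}
p_a1 = {0: {1: 0.7, 2: 0.3}, 1: {1: 0.2, 2: 0.8}}           # Pr{a1 | C_i}
p_a2 = {0: {(1, 1): 0.6, (1, 2): 0.4, (2, 1): 0.5, (2, 2): 0.5},
        1: {(1, 1): 0.1, (1, 2): 0.9, (2, 1): 0.3, (2, 2): 0.7}}  # Pr{a2 | C_i, a1}

def tan_posterior(a1, a2):
    """Numerator of (1.17) for each class, then normalized over classes;
    the class-independent denominator in (1.17) cancels in the normalization."""
    score = {c: prior[c] * p_a1[c][a1] * p_a2[c][(a1, a2)] for c in prior}
    total = sum(score.values())
    return {c: s / total for c, s in score.items()}

print(tan_posterior(a1=1, a2=2))
```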
Appendix B Proof of Lemma 5.1
LEMMA 5.1. For a given system where the classifier construction procedure uses TAN to build the classifier, if the estimation error of A is bounded by ε, then the accuracy of the system is approximately bounded from below by

(2.18)    la(T, R, TAN) ≳ p − 0.4 vr vq · ε² / (9 · Σ_{i=1}^{vr} Σ_{j=1}^{vq} αri αqj),

where TAN denotes the classifier construction procedure, vr and vq are the numbers of possible values of ar and aq, respectively, αri and αqj are the i-th and j-th possible values of ar and aq, respectively, and p is the accuracy of the system when ε = 0.
Proof. Let εt = p − la(T, R, TAN). In order to derive an upper bound on εt, we first consider εrq, defined as follows:

(2.19)    εrq = Σ_{i=1}^{vr} Σ_{j=1}^{vq} | Pr{C0, ar = αri, aq = αqj} − Pr{C0, ãr = αri, ãq = αqj} |.
Here r, q ∈ [1, n]. Recall that vr and vq are the number of
possible values of a r and aq , respectively.
REMARK B.1.

(2.20)    εt ≤ max_{r,q∈[1,n]} εrq.
For the simplicity of discussion, we do not include the details here. An intuitive explanation of the remark is to consider a TAN classifier built on two attributes only. That is, the dependency graph consists of three nodes: Ci, ar, and aq. Σ_i Σ_j Pr{C0, ar = αri, aq = αqj} is the probability that a test sample is correctly classified by such a classifier built on the original data. Σ_i Σ_j Pr{C0, ãr = αri, ãq = αqj} is the correct classification rate of such a classifier built on the perturbed data. Thus, εrq is the error rate of the classification made by such a classifier due to the perturbation.
REMARK B.2. Let the upper bound on εrq be δ. Suppose that the value of every element of A can be determined from the perturbed data tuples with error ε. An estimate of δ is

(2.21)    δ ≈ 0.4 vr vq · ε² / (9 · Σ_{i=1}^{vr} Σ_{j=1}^{vq} αri αqj).
We represent Pr{ar = αri} by Pr{αri}. Other notions are similarly abbreviated. Recall that an element Arq of A = T0ᵀT0 − T1ᵀT1 is

(2.22)    Arq = Σ_{i=1}^{vr} Σ_{j=1}^{vq} αri αqj (Pr{C0, αri, αqj} − Pr{C1, αri, αqj})
(2.23)         = Σ_{i=1}^{vr} Σ_{j=1}^{vq} αri αqj Pr{αri, αqj} (2 Pr{C0 | αri, αqj} − 1).
Given r and q, let the error of Pr{C0, ar = αri, aq = αqj} after perturbation be εij. Here we make an assumption (approximation) that all εij are independently and identically distributed (i.i.d.) random variables. We describe them by the normal distribution. Since Σ_i Σ_j εij is always 0, we assume that the mean of εij is 0. Let the variance of εij be σ²ij. The distribution of Arq − Ãrq is then a normal distribution with mean µ = 0 and variance

(2.24)    σ² = Σ_{i=1}^{vr} Σ_{j=1}^{vq} αri αqj · 4σ²ij.
Recall that the error of Ãrq is upper bounded by ε:

(2.25)    max_{r,q∈[1,n]} |Arq − Ãrq| ≤ ε.

Thus, we have σ ≤ ε/3. Due to (2.24), σ²ij can be bounded by

(2.26)    σ²ij ≤ (ε²/4) · 1 / (9 · Σ_{i=1}^{vr} Σ_{j=1}^{vq} αri αqj).

Consider the average absolute deviation (i.e., mean deviation) of εij, which is mean_{ij} |εij|. We have εrq = vr vq · mean_{ij} |εij|. Since the average absolute deviation of a normal distribution is about 0.8 times the standard deviation [15], we can approximate δ by

(2.27)    δ ≈ 0.4 vr vq · ε² / (9 · Σ_{i=1}^{vr} Σ_{j=1}^{vq} αri αqj).
Appendix C Proof of Lemma 5.2
LEMMA 5.2. Let the (k + 1)-th largest eigenvalue of A be sk+1. Then, ∀r, q ∈ [1, n], we have

(3.28)    |A − Ã|rq ≤ sk+1.
Proof. We consider the case where Vk is composed of the first k eigenvectors of A = T0ᵀT0 − T1ᵀT1. Thus, the bound derived is an approximation, as in practice the value of Vk is estimated from the current copy of T∗ (i.e., the currently received data tuples)⁵. Given the data tuple matrix T, recall that the matrix of perturbed data tuples is T̃:

(3.29)    T = [t1; . . . ; tm],
(3.30)    T̃ = [t1 Vk Vkᵀ; . . . ; tm Vk Vkᵀ] = T Vk Vkᵀ.

T̃0 and T̃1 are the matrices of perturbed data tuples that belong to class C0 and C1, respectively. Recall that Ã = T̃0ᵀT̃0 − T̃1ᵀT̃1. Let the singular value decomposition of A be A = V Σ Vᵀ. With some algebraic manipulation, we have

(3.31)    Ã = (T0 Vk Vkᵀ)ᵀ(T0 Vk Vkᵀ) − (T1 Vk Vkᵀ)ᵀ(T1 Vk Vkᵀ)
(3.32)       = Vk Vkᵀ (T0ᵀT0 − T1ᵀT1) Vk Vkᵀ
(3.33)       = Vk Vkᵀ V Σ Vᵀ Vk Vkᵀ
(3.34)       = Vk Σk Vkᵀ,

where Σk = diag(s1, . . . , sk). Due to the theory of singular value decomposition [9], Ã is the optimal rank-k approximation of A. We have

(3.35)    max_{r,q∈[1,n]} |A − Ã|rq ≤ ‖A − Ã‖2 = sk+1,

where ‖A − Ã‖2 is the spectral norm of A − Ã.

⁵ Fortunately, the estimated Vk converges well to its accurate value. Besides, the convergence is fairly fast in most cases.
Appendix D Proof of Theorem 5.2
THEOREM 5.2. The lower bounds on the privacy measure are given below:

  norm            lower bound on lp
  ‖T − T̂‖1       δk+1 / (maxi ‖ai + ãi‖∞)
  ‖T − T̂‖2       ρk+1
  ‖T − T̂‖∞       δk+1 / (maxi ‖ai + ãi‖1)
  ‖T − T̂‖F       √(Σ_{i=k+1}^{n} ρi²)

where the matrix norms are defined in [9], ρi is the i-th eigenvalue of T, si is the i-th eigenvalue of A, σi is the i-th eigenvalue of T0ᵀT0, τi is the i-th eigenvalue of T1ᵀT1, and

(4.36)    δi = 2 max{σi, τi} − si.
Proof. Recall that T̃ is the least-squares approximation of T that can be derived from T̃. For the simplicity of discussion, we consider T̂ equal to T̃. We first show a simple lower bound on the spectral norm ‖T − T̃‖2 and the Frobenius norm ‖T − T̃‖F. After that, we derive lower bounds on the maximum absolute column sum norm (i.e., 1-norm) ‖T − T̃‖1 and the maximum absolute row sum norm (i.e., infinity norm) ‖T − T̃‖∞ based on the special structure of T̃. Please refer to Table 5 for the notions used in the proof. Recall that T̃ = T Vk Vkᵀ. Since the rank of Vk is k, we have rank(T̃) ≤ k. Let the optimal rank-k approximation of T be Tk. We have

(4.37)    ‖T̃ − T‖F ≥ ‖Tk − T‖F = √(ρ²k+1 + · · · + ρ²n),

and

(4.38)    ‖T̃ − T‖2 ≥ ‖Tk − T‖2 = ρk+1.

As we have proved in Appendix C, Ã is the optimal rank-k approximation of A. Thus, we have

(4.39)    ‖A − Ã‖2 = sk+1.

Note that T̃0 and T̃1 are rank-k approximations of T0 and T1, respectively. Let the optimal rank-k approximations of T0 and T1 be T0k and T1k, respectively. We have

(4.40)    ‖T0ᵀT0 − T̃0ᵀT̃0‖2 ≥ σk+1,

and

(4.41)    ‖T1ᵀT1 − T̃1ᵀT̃1‖2 ≥ τk+1.

Note that all matrix norms satisfy the triangle inequality. Let ∆ = TᵀT − T̃ᵀT̃. Thus,

(4.42)    ‖∆‖2 ≥ 2 max{σk+1, τk+1} − sk+1.

To see this, note that ∆ = 2T0ᵀT0 − 2T̃0ᵀT̃0 − (A − Ã), so

(4.43)    ‖∆‖2 = ‖2T0ᵀT0 − 2T̃0ᵀT̃0 − (A − Ã)‖2
(4.44)         ≥ 2‖T0ᵀT0 − T̃0ᵀT̃0‖2 − ‖A − Ã‖2
(4.45)         ≥ 2σk+1 − sk+1.

Similarly, we have ‖∆‖2 ≥ 2τk+1 − sk+1.

Note that for any matrix M, ‖M‖2² ≤ ‖M‖1 ‖M‖∞, where ‖M‖1 is the maximum absolute column sum of M and ‖M‖∞ is the maximum absolute row sum of M. Since ∆ is a symmetric matrix, the 1-norm of ∆ is equal to the infinity norm of ∆. Thus, we have

(4.46)    ‖∆‖1 = ‖∆‖∞ ≥ ‖∆‖2 ≥ 2 max{σk+1, τk+1} − sk+1.

Due to the definition of the 1-norm, we can always find a column of T with index i ∈ [1, n] such that

(4.47)    ‖∆‖1 = ‖Tᵀai − T̃ᵀãi‖1
(4.48)         ≲ ‖(T − T̃)ᵀ(ai + ãi)‖1
(4.49)         ≤ ‖T − T̃‖∞ ‖ai + ãi‖1.

Similarly, we can derive that

(4.50)    ‖∆‖1 ≤ ‖T − T̃‖1 ‖ai + ãi‖∞.

Thus, the lower bounds on ‖T − T̃‖1 and ‖T − T̃‖∞ are as follows:

(4.51)    ‖T − T̃‖1 ≥ (2 max{σk+1, τk+1} − sk+1) / (maxi ‖ai + ãi‖∞),

and

(4.52)    ‖T − T̃‖∞ ≥ (2 max{σk+1, τk+1} − sk+1) / (maxi ‖ai + ãi‖1).
Appendix E Table of Attributes and Classification
Functions
Table 3: Description of Attributes

  Attribute    Description
  salary       uniformly distributed on [20k, 150k]
  commission   if salary ≥ 75k then commission = 0,
               else uniformly distributed on [10k, 75k]
  age          uniformly distributed on [20, 80]
  elevel       uniformly chosen from 0..4
  car          uniformly chosen from 1..20
  zipcode      uniformly chosen from 0..9
  hvalue       uniformly distributed on [zipcode × 50k, zipcode × 100k]
  hyears       uniformly distributed on [1, 30]
  loan         uniformly distributed on [0, 500k]
Table 4: Classification Functions

  Function    Condition for t ∈ C0
  F1          (age < 40) ∨ (age > 60)
  F2          ((age < 40) ∧ (50k ≤ salary ≤ 100k)) ∨ ((40 ≤ age < 60) ∧ (75k < salary < 125k)) ∨ ((age ≥ 60) ∧ (25k < salary < 75k))
  F3          ((age < 40) ∧ (((elevel ∈ [0, 1]) ∧ (25k ≤ salary ≤ 75k)) ∨ ((elevel ∈ [2, 3]) ∧ (50k ≤ salary ≤ 100k)))) ∨ ((40 ≤ age < 60) ∧ (((elevel ∈ [1, 3]) ∧ (50k ≤ salary ≤ 100k)) ∨ ((elevel = 4) ∧ (75k ≤ salary ≤ 125k)))) ∨ ((age ≥ 60) ∧ (((elevel ∈ [2, 4]) ∧ (50k ≤ salary ≤ 100k)) ∨ ((elevel = 1) ∧ (25k ≤ salary ≤ 75k))))
  F4          (0.67 × (salary + commission) − 0.2 × loan − 10k) > 0
  F5          (0.67 × (salary + commission) − 0.2 × loan + 0.2 × equity − 10k) > 0, where equity = 0.1 × hvalue × max(hyears − 20, 0)
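For reference, a partial code sketch of these generators and functions (ours; taking the max against 0 in Function 5's equity term is our reading of the table):

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_tuple():
    """Draw one synthetic tuple following Table 3."""
    salary = rng.uniform(20_000, 150_000)
    commission = 0.0 if salary >= 75_000 else rng.uniform(10_000, 75_000)
    age = rng.uniform(20, 80)
    elevel = rng.integers(0, 5)
    car = rng.integers(1, 21)
    zipcode = rng.integers(0, 10)
    hvalue = rng.uniform(zipcode * 50_000, zipcode * 100_000)
    hyears = rng.uniform(1, 30)
    loan = rng.uniform(0, 500_000)
    return dict(salary=salary, commission=commission, age=age, elevel=elevel,
                car=car, zipcode=zipcode, hvalue=hvalue, hyears=hyears, loan=loan)

def f1(t):   # Function 1
    return t["age"] < 40 or t["age"] > 60

def f4(t):   # Function 4
    return 0.67 * (t["salary"] + t["commission"]) - 0.2 * t["loan"] - 10_000 > 0

def f5(t):   # Function 5 (equity = 0.1 * hvalue * max(hyears - 20, 0))
    equity = 0.1 * t["hvalue"] * max(t["hyears"] - 20, 0)
    return 0.67 * (t["salary"] + t["commission"]) - 0.2 * t["loan"] + 0.2 * equity - 10_000 > 0

t = sample_tuple()
print(t["age"], f1(t), f4(t), f5(t))
```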
Appendix F Table of Notions
Table 5: Notions

  notion    definition
  m         number of data tuples in the training data set
  n         number of attributes besides the class label attribute
  t         a data tuple in the original training data set
  R(t)      a data tuple in the perturbed training data set
  ar        r-th attribute of an original data tuple
  αri       the i-th value of ar
  vr        the number of possible values of ar
  ãr        the r-th attribute of a perturbed data tuple
  C0        class 0
  C1        class 1
  T         matrix of the original training data set
  T0        matrix of original data tuples that belong to C0
  T1        matrix of original data tuples that belong to C1
  T̃         matrix of the perturbed training data set
  T̃0        matrix of perturbed data tuples that belong to C0
  T̃1        matrix of perturbed data tuples that belong to C1
  T∗        current matrix of received (perturbed) data tuples
  T0∗       matrix of received data tuples that belong to C0
  T1∗       matrix of received data tuples that belong to C1
  A         T0ᵀT0 − T1ᵀT1
  Ã         T̃0ᵀT̃0 − T̃1ᵀT̃1
  A∗0       T0∗ᵀT0∗
  A∗1       T1∗ᵀT1∗
  A∗        A∗0 − A∗1
  ε         maxr,q∈[1,n] |A − Ã|rq
  V∗        matrix composed of the eigenvectors of A∗
  Vk        matrix composed of the first k eigenvectors of A∗
  ∆         T̃ᵀT̃ − TᵀT
  ρi        i-th eigenvalue of T
  si        i-th eigenvalue of A∗
  Σ         diag(s1, · · · , sn)
  σi        i-th eigenvalue of T0ᵀT0
  τi        i-th eigenvalue of T1ᵀT1
  δi        2 max{σi, τi} − si
  |·|rq     the element of a matrix with index r and q