Data Min Knowl Disc (2007) 14:131–170
DOI 10.1007/s10618-006-0051-9
Privacy-preserving boosting
Sébastien Gambs · Balázs Kégl · Esma Aïmeur
Received: 16 April 2005 / Accepted: 10 May 2006 /
Published online: 26 January 2007
Springer Science+Business Media, LLC 2007
Abstract We describe two algorithms, BiBoost (Bipartite Boosting) and
MultBoost (Multiparty Boosting), that allow two or more participants to
construct a boosting classifier without explicitly sharing their data sets. We
analyze both the computational and the security aspects of the algorithms.
The algorithms inherit the excellent generalization performance of AdaBoost.
Experiments indicate that the algorithms are better than AdaBoost executed
separately by the participants, and that, independently of the number of participants, they perform close to AdaBoost executed using the entire data set.
Keywords Privacy-preserving data mining · Boosting · AdaBoost · Distributed learning · Secure multiparty computation
Responsible Editor: Charu Aggarwal.

Sébastien Gambs · Balázs Kégl (B) · Esma Aïmeur
Department of Computer Science and Operations Research, University of Montreal, C. P. 6128, Succ. Centre-Ville, Montréal, Québec, Canada, H3C 3J7
e-mail: [email protected]

Sébastien Gambs
e-mail: [email protected]

Esma Aïmeur
e-mail: [email protected]

1 Introduction

The principal goal of data mining can be described as finding useful information in a vast amount of data. The recent appearance of sources of huge, heterogeneous data made it difficult to use traditional methods developed for small, well-structured databases, so data mining has been increasingly employing
techniques developed in artificial intelligence and statistical machine learning. Data mining is now used in a wide range of different areas such as finances,
bioinformatics, and astrophysics, to name a few.
Secure multiparty computation is a branch of cryptography. Its goal is to
realize distributed tasks in a secure manner, where the definition of security can
take different flavors such as preserving the privacy of the data or protecting the
computation against malicious participants. A typical task is to compute some
function f (x, y) in which input x is in the hands of one participant and input y is in
the hands of the other. For the computation to be considered totally secure, the
two participants should learn nothing after the completion of the task, except
for what can be inferred from their own input and the function’s output. The first
technique that enabled the implementation of an arbitrary probabilistic computation between two participants in a secure manner was given by Yao (1986).
His results were later generalized to multiple participants. First, Chaum et al.
(1987) proposed computational protection under cryptographic assumptions.
The model was further extended to provide unconditional protection under
the assumption that at least some proportion of the participants are honest
(Ben-Or et al. 1988, Chaum et al. 1988). Although these methods are universal
and general, they can be very inefficient in terms of both communication and
computational complexity when the inputs are large and when the function to
compute is relatively complex.
In this paper, we study the intersection of data mining and secure multiparty
computation. In particular, we consider the task of constructing a classifier in
a distributed and secure manner. In machine learning, the goal of classification
is to predict the class y of an object based on a vector $\mathbf{x} = (x^{(1)}, \ldots, x^{(d)})$ of observations¹ about this object. Typically, the classifier is learned by using a
finite training set Dn = {(x1 , y1 ), . . . , (xn , yn )}, where each element consists of
a vector xi of d observations and its class yi given by an expert or obtained
by observing the past. For example, x can contain a set of descriptors or attributes of a bank’s client, and y can represent a binary decision whether the
client is a good candidate for a loan. There are numerous high-performance
learning algorithms that can be used for this task. The objective of this paper
is to extend one of these algorithms, AdaBoost (Freund and Schapire 1997),
to the case when the training data Dn is shared between two or more participants. If the participants wish to fully disclose their databases to each other,
they can use a standard learning algorithm on the union of their data sets.
There are several application domains, however, when data sets are highly confidential (client records of financial institutions, medical records, source code
of proprietary software systems, etc.), so participants are reluctant to directly
communicate their databases. At the same time, they might want to collaborate in order to perform a task of mutual interest (for example, two banks
might want to design a system that detects potential cases of fraud). The goal
1 We will use bold symbols to denote real vectors throughout the paper.
of such collaboration is to design a classifier that performs better than the
classifiers that the participants could learn using only their separate data sets,
while, at the same time, disclosing as little information as possible on their data
sets.
Two principal security models are generally considered in the literature (Goldreich 2004). In the first model, the participants are called semi-honest (also
known as passive or honest-but-curious). This model corresponds to the situation where the participants follow the execution of their prescribed protocols
without any attempt to cheat, but they try to learn as much information as possible about the other participant’s data by analyzing the information exchanged
during the protocol. This model is weaker than if we had allowed participants
to be malicious and to cheat actively during the protocol, but it is nevertheless
useful and almost always considered in privacy-preserving data mining. For
the simplicity of the analysis, we will mainly focus on the semi-honest model.
Note, however, that there are techniques to upgrade the model so that it can
deal with malicious participants at the cost of increasing the complexity (both
computational and communicational).
AdaBoost (Freund and Schapire 1997) is one of the best off-the-shelf learning
methods developed in the last decade. It constructs a classifier in an incremental
fashion by adding simple classifiers to a pool, and using their weighted “vote” to
determine the final classification. The algorithm can be extended to multiparty
classification in a natural way, and, as we will show, it performs outstandingly
on benchmark data. Although we cannot achieve total security (when all the
information exchanged can be derived from the final classifier), the information
overhead is minimal during the protocol, and it is unclear whether it can be used
at all to reverse-engineer the training data sets. Throughout the paper, we will
consider binary classification, where the class y of every observation is −1 or
+1. The algorithm can be extended to multiclass classification and to regression
along the lines of Schapire and Singer (1999) and Kégl (2003), respectively. Note
that similar algorithms have been proposed by Fan et al. (1999) and Lazarevic
and Obradovic (2002). They extended AdaBoost to distributed learning; however, they did not analyze the security and the privacy aspects of the proposed
methods.
The outline of this paper is as follows. We present privacy-preserving data
mining in Sect. 2 and AdaBoost in Sect. 3. In Sect. 4, we introduce BiBoost
(Bipartite Boosting), the extension of AdaBoost to the case of two participants and we analyze its communication and computational complexity. In
Sect. 5, we propose MultBoost (Multiparty Boosting), a further extension of
the algorithm to more than two participants. In Sect. 6, we describe some privacy
models commonly used in data mining, and discuss the cryptographic security
of BiBoost and MultBoost as well as the impact of the obtained classifier
on privacy. In Sect. 7, we demonstrate the algorithms’ performance on benchmark data sets, and we compare them to the standard AdaBoost algorithm.
Finally, we give some possible generalizations of our approach and conclude in
Sect. 8.
2 Privacy-preserving data mining
Today, the Internet makes it possible to reach and connect sources of information throughout the world. At the same time, many questions are raised
concerning the security and privacy of the data. Privacy-preserving data mining
is an emerging field that studies how data mining algorithms affect the privacy
of data and tries to find and analyze new algorithms that preserve this privacy.
Recently, Verykios et al. (2004) presented an overview of the field. In particular,
they identified three different approaches that deal with the privacy-preserving
issues in data mining.
The algorithms of the first approach perturb the values of selected attributes
of individual records before communicating the data sets. These modifications
can include altering the value of an attribute $x_i^{(j)}$ by either perturbing it (Atallah et al. 1999) or replacing it with the “unknown” value (Chang and Moskowitz 2000), swapping the values $x_i^{(j)}$ and $x_{i'}^{(j)}$ of an attribute between two records (Fienberg and McIntyre 2004), or using a coarser granularity by merging several possible values of an attribute (Chang and Moskowitz 2000). Of course,
this technique (called sanitization) will increase uncertainty (noise) in the data,
which makes the learning task more difficult. The objective of these algorithms
is to find a satisfactory trade-off between the privacy of the data and the accuracy of the learned classifier. A specific example of a sanitization method is
the k-anonymization procedure (Iyengar 2002, Sweeney 2002) which proceeds
through generalizations and suppressions of attribute values. In particular, it
generates a k-anonymized dataset that has the property that each record in the
dataset is indistinguishable from at least k − 1 other records within the dataset.
This implies that no specific individual within the k-anonymized dataset can be
identified with probability better than 1/k, even with the help of linking attacks. Bayardo and Agrawal (2005) recently gave a practical algorithm to find a k-anonymization that is optimal in terms of a cost function that quantifies the loss
of information.
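To make the k-anonymity property concrete, here is a small Python check (the attribute names and records are invented for illustration and are not from the paper): it verifies that every combination of quasi-identifier values in a generalized table appears at least k times.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check whether every combination of quasi-identifier values
    appears at least k times in the (generalized) dataset."""
    combos = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())

# Toy example with generalized attributes (hypothetical data).
records = [
    {"age": "30-39", "zip": "120**", "disease": "flu"},
    {"age": "30-39", "zip": "120**", "disease": "cold"},
    {"age": "40-49", "zip": "130**", "disease": "flu"},
    {"age": "40-49", "zip": "130**", "disease": "asthma"},
]
print(is_k_anonymous(records, ["age", "zip"], k=2))  # True
```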
The algorithms of the second approach first randomize the data in a global
manner by adding independent Gaussian or uniform noise to the attribute values. The recipient of the “polluted” data set then either reconstructs the data
distribution (by using, e.g., Expectation-Maximization, Agrawal and Aggarwal
2001), or constructs a classifier directly on the noisy data (Agrawal and Srikant
2000). The goal, as in the first approach, is to hide the particular attribute
values while preserving the joint distribution of the data. In this approach, the
intensity of the noise is used to balance between data privacy and model accuracy. Evfimievski et al. (2003) formally analyzed this trade-off, and proposed a
method to limit privacy breaches without any knowledge on the data distribution.
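As a minimal sketch of this second approach (our own illustration, with an assumed noise-scale parameter sigma that plays the privacy/accuracy trade-off role described above), the data holder releases only a noisy copy of the attribute matrix:

```python
import numpy as np

def randomize(X, sigma, rng=None):
    """Return a 'polluted' copy of the attribute matrix X (n x d) with
    i.i.d. Gaussian noise of standard deviation sigma added to every value."""
    rng = np.random.default_rng(rng)
    return X + rng.normal(0.0, sigma, size=X.shape)

X = np.array([[1.0, 50.0], [2.0, 60.0], [3.0, 70.0]])
X_noisy = randomize(X, sigma=5.0, rng=0)  # the recipient sees only X_noisy
```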
Our algorithm belongs to the third, cryptography-based approach (Clifton et
al. 2002), which is very different in spirit from the first two. Instead of altering
the data sets to hide sensitive information, we do not directly communicate
the data sets at all, rather we distribute the learning procedure between the
participants. The objective in this approach is to preserve the privacy of the
participants’ data while approximating as much as possible the performance
of the classifier that they would have obtained had they fully disclosed their
data sets to each other. Lindell and Pinkas (2002) proposed the first algorithm
that followed this approach. They described a privacy-preserving, distributed
extension of ID3 (Quinlan 1986), a well-known learning algorithm that builds
a decision tree using an entropy-based metric. This approach is also related to
distributed learning, where communication is constrained due to limited channel capacity. Predd et al. (2006) have recently analyzed statistical consistency
in this distributed learning model. We return to these methods in more details
in Sect. 6.
3 AdaBoost
AdaBoost (Freund and Schapire 1997) is one of the best general purpose learning methods developed in the last decade. Its development was originally motivated by a rather theoretical question within the Probably Approximately Correct (PAC) Learning framework (Valiant 1984). It later inspired several learning
theoretical results and, due to its simplicity, flexibility, and excellent performance on real-world data, it has gained popularity among practitioners.
AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion. In each iteration, it calls a simple learning algorithm
(called the weak learner) that returns a classifier. The final classification will
be decided by a weighted “vote” of the weak classifiers, where each weight
decreases with the error of the corresponding weak classifier. The weak classifiers have to be only slightly better than a random guess (from where their
name derives), which gives great flexibility to the design of the weak classifier
(or feature) set. If there is no particular a priori knowledge available on the
domain of the learning problem, decision trees or, in the extreme case, decision
stumps (decision trees with two leaves) are often used. A decision stump can
be defined by three parameters: the index j of the attribute² that it cuts, the threshold θ of the cut, and the sign of the decision. Formally,

$$h_{j,\theta+}(\mathbf{x}) = 2I\{x^{(j)} \ge \theta\} - 1 = \begin{cases} +1 & \text{if } x^{(j)} \ge \theta, \\ -1 & \text{otherwise}, \end{cases} \tag{1}$$

and

$$h_{j,\theta-}(\mathbf{x}) = -h_{j,\theta+}(\mathbf{x}) = 2I\{x^{(j)} < \theta\} - 1 = \begin{cases} +1 & \text{if } x^{(j)} < \theta, \\ -1 & \text{otherwise}, \end{cases} \tag{2}$$
where the indicator function I {A} is 1 if its argument A is true and 0 otherwise.
Although decision stumps may seem very simple, they satisfy the weak learnability condition, and, when boosted, they yield excellent classifiers in practice.

2 We assume that attributes are real valued, so observations $\mathbf{x}$ are in $\mathbb{R}^d$.
Also, finding the best decision stump using exhaustive search can be done efficiently in O(nd) time, where n is the number of training points, and d is the
number of attributes of an observation x. In this paper we will use decision
stumps as weak classifiers. Nevertheless, most of our results can be extended to
boost other weak learners.
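For concreteness, the following sketch (our own code, not the authors') performs the exhaustive weighted search for the best stump over all attributes and thresholds; after sorting an attribute, the weighted error of $h_{j,\theta+}$ is maintained incrementally while the threshold sweeps the data, and the negated stump $h_{j,\theta-}$ is evaluated for free as one minus that error. The per-call sort would be done once up front in the presorted O(nd) variant mentioned above.

```python
import numpy as np

def best_stump(X, y, w):
    """Exhaustive weighted search for the best decision stump (definitions (1)-(2)).
    X is n x d, y is in {-1,+1}, w is a weight vector summing to 1.
    Returns (attribute index j, threshold theta, sign)."""
    n, d = X.shape
    best, best_err = (None, None, None), np.inf

    def consider(j, theta, err_plus):
        nonlocal best, best_err
        # h_{j,theta,+} has error err_plus; its negation h_{j,theta,-} has error 1 - err_plus
        for sign, err in ((+1, err_plus), (-1, 1.0 - err_plus)):
            if err < best_err:
                best_err, best = err, (j, theta, sign)

    for j in range(d):
        order = np.argsort(X[:, j])
        xs, ys, ws = X[order, j], y[order], w[order]
        # theta below all values: h_{j,theta,+} predicts +1 everywhere
        err_plus = float(np.sum(ws[ys == -1]))
        consider(j, xs[0] - 1.0, err_plus)
        for i in range(n):
            # raising theta above xs[i] flips the prediction on point i to -1
            err_plus += ws[i] if ys[i] == +1 else -ws[i]
            if i == n - 1 or xs[i + 1] > xs[i]:
                theta = xs[i] + 1.0 if i == n - 1 else 0.5 * (xs[i] + xs[i + 1])
                consider(j, theta, err_plus)
    return best
```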
For the formal description of AdaBoost, let the training set be $D_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i = (x^{(1)}, \ldots, x^{(d)}) \in \mathbb{R}^d$ is an observation vector containing d real-valued attributes, and $y_i \in \{-1, 1\}$ is a binary label. The algorithm maintains a weight distribution $w^{(t)} = (w_1^{(t)}, \ldots, w_n^{(t)})$ over the data points. The weights are initialized uniformly in line 1, and are updated in each iteration in lines 7–10 (Fig. 1). The weight distribution remains normalized in each iteration, that is, $\sum_{i=1}^{n} w_i^{(t)} = 1$ for all t. We suppose that we are given a set H of classifiers $h : \mathbb{R}^d \to \{-1, 1\}$ that assign one of the two labels to every observation. In addition to H, we are also provided a weak learner algorithm that, in each iteration t, returns the weak classifier $h^{(t)} \in H$ that minimizes the weighted error

$$\varepsilon_-^{(t)} = \sum_{i=1}^{n} w_i^{(t)}\, I\{h^{(t)}(x_i) \ne y_i\}. \tag{3}$$

The coefficient $\alpha^{(t)}$ of $h^{(t)}$ is set in line 5 and the weights $w_i^{(t)}$ of training points are updated in lines 7–10. Since $\varepsilon_-^{(t)} < 1/2$ (otherwise we would flip the labels and return $-h^{(t)}$ in line 3), the weight update formulas in lines 8 and 10 indicate that we increase the weights of misclassified points and decrease the weights of correctly classified points. As the algorithm progresses, the weights of frequently misclassified points will increase, so weak classifiers will concentrate more and more on these “hard” data points. After T iterations,³ the algorithm returns the weighted average $f^{(T)}(\cdot) = \sum_{t=1}^{T} \alpha^{(t)} h^{(t)}(\cdot)$ of the weak classifiers. The sign of $f^{(T)}(x)$ is then used as the final classification of x.
There exist numerous versions and extensions of AdaBoost. In this paper,
we will use a variant proposed by Schapire and Singer (1999), which allows the
weak classifier not only to answer “−1” or “+1”, but also to abstain by returning
“0”. For the formal description of this extended algorithm, we first re-define⁴ the weighted error $\varepsilon_-^{(t)}$ as

$$\varepsilon_-^{(t)} = \sum_{i=1}^{n} w_i^{(t)}\, I\{h^{(t)}(x_i) = -y_i\}, \tag{4}$$

the weighted rate of correctly classified points as

$$\varepsilon_+^{(t)} = \sum_{i=1}^{n} w_i^{(t)}\, I\{h^{(t)}(x_i) = y_i\}, \tag{5}$$

and the weighted abstention rate as

$$\varepsilon_0^{(t)} = 1 - \varepsilon_-^{(t)} - \varepsilon_+^{(t)} = \sum_{i=1}^{n} w_i^{(t)}\, I\{h^{(t)}(x_i) = 0\}. \tag{6}$$

3 T is an appropriately chosen constant that can be set, for example, by cross-validation.
4 The definition (3) is no longer valid if $h^{(t)}$ is allowed to return 0.

Fig. 1 The pseudocode of AdaBoost. $D_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ is the training set, H is the set of weak classifiers, and T is the number of iterations. $w^{(t)}$ is the weighting over the data points in the tth iteration, and $\alpha^{(t)}$ is the weight of the tth weak classifier $h^{(t)}$
The coefficient $\alpha^{(t)}$ is set⁵ to $\frac{1}{2}\ln\left(\varepsilon_+/\varepsilon_-\right)$ in line 5, and the weight update in lines 7–10 becomes⁶

$$w_i^{(t+1)} \leftarrow w_i^{(t)} \times \begin{cases} \dfrac{1}{2\varepsilon_- + \varepsilon_0\sqrt{\varepsilon_-/\varepsilon_+}} & \text{if } h^{(t)}(x_i) = -y_i, \\[2mm] \dfrac{1}{2\varepsilon_+ + \varepsilon_0\sqrt{\varepsilon_+/\varepsilon_-}} & \text{if } h^{(t)}(x_i) = y_i, \\[2mm] \dfrac{1}{\varepsilon_0 + 2\sqrt{\varepsilon_+\varepsilon_-}} & \text{if } h^{(t)}(x_i) = 0. \end{cases} \tag{7}$$

5 In order to avoid a singularity when $\varepsilon_-$ is small, Schapire and Singer (1999) suggest that $\alpha^{(t)} = \frac{1}{2}\ln\frac{\varepsilon_+ + \delta}{\varepsilon_- + \delta}$ be used, where δ is a small appropriate constant.
6 Note that we will omit the iteration index (t) where it does not cause confusion.
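In code, the quantities (4)–(6) and the update (7) of the abstention variant can be sketched as follows (our own helper names; a small delta guards against zero errors, in the spirit of footnote 5). One can check that the three denominators keep the weight distribution normalized, as stated above.

```python
import numpy as np

def abstaining_round_stats(pred, y, w):
    """pred takes values in {-1, 0, +1}; w is normalized to sum to 1.
    Returns (eps_minus, eps_plus, eps_zero) as in (4)-(6)."""
    eps_minus = float(np.sum(w[pred == -y]))
    eps_plus = float(np.sum(w[pred == y]))
    return eps_minus, eps_plus, 1.0 - eps_minus - eps_plus

def abstaining_update(w, pred, y, delta=1e-10):
    """One round of the abstention variant: returns the re-weighted
    distribution and the coefficient alpha."""
    eps_minus, eps_plus, eps_zero = abstaining_round_stats(pred, y, w)
    eps_minus, eps_plus = max(eps_minus, delta), max(eps_plus, delta)  # guard against zeros
    alpha = 0.5 * np.log(eps_plus / eps_minus)
    # denominators of the three cases of the update (7)
    z_wrong = 2 * eps_minus + eps_zero * np.sqrt(eps_minus / eps_plus)
    z_right = 2 * eps_plus + eps_zero * np.sqrt(eps_plus / eps_minus)
    z_abstain = eps_zero + 2 * np.sqrt(eps_plus * eps_minus)
    w_new = np.where(pred == -y, w / z_wrong,
                     np.where(pred == y, w / z_right, w / z_abstain))
    return w_new, alpha
```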
It can be shown that the (unweighted) training error

$$R(f^{(T)}) = \frac{1}{n}\sum_{i=1}^{n} I\{\mathrm{sign}(f^{(T)}(x_i)) \ne y_i\} \tag{8}$$

of the final classifier can be upper bounded by

$$\prod_{t=1}^{T}\left(\varepsilon_0^{(t)} + 2\sqrt{\varepsilon_+^{(t)}\varepsilon_-^{(t)}}\right),$$

so the goal of the weak learner is to minimize $\varepsilon_0 + 2\sqrt{\varepsilon_+\varepsilon_-}$ in each iteration. If there exists a constant δ > 0 for which $\varepsilon_0^{(t)} + 2\sqrt{\varepsilon_+^{(t)}\varepsilon_-^{(t)}} \le 1 - \delta$ for all t, then the training error can be upper bounded by $R(f^{(T)}) \le e^{-T\delta}$. This means that it becomes 0 after at most $T = \lceil \ln n/\delta \rceil + 1$ iterations. For suboptimal weak classifiers the convergence can be slower; nevertheless, the algorithm can continue with any weak classifier for which

$$\varepsilon_+ > \varepsilon_-, \tag{9}$$

which guarantees that $\varepsilon_0 + 2\sqrt{\varepsilon_+\varepsilon_-} < 1$. We will use this condition (9) in the next section when showing the algorithmic convergence of the privacy-preserving extension of AdaBoost.
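Spelling out the step from the per-iteration bound to the exponential rate (this only restates the argument above):

$$R(f^{(T)}) \le \prod_{t=1}^{T}\left(\varepsilon_0^{(t)} + 2\sqrt{\varepsilon_+^{(t)}\varepsilon_-^{(t)}}\right) \le (1-\delta)^T \le e^{-T\delta},$$

and since $R(f^{(T)})$ is an integer multiple of $1/n$, it is exactly 0 as soon as $e^{-T\delta} < 1/n$, that is, once $T > \ln n/\delta$.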
4 BiBoost
In this section we extend AdaBoost to the case when the data set Dn is split
between two participants, Alice and Bob, who want to obtain a final classifier
without exchanging their data sets. The basic idea of the algorithm, which we
call BiBoost for Bipartite Boosting,7 is that Alice and Bob compute two separate
weak classifiers in each iteration, and merge the two classifiers into a ternary
classifier. This merged classifier outputs Alice’s (or Bob’s) label for data points
on which the two separate classifiers agree, and abstains if they disagree. In
Lemma 1 we will show that if the two separate classifiers are optimal on their
respective data sets, then the merged classifier will satisfy condition (9), so the
algorithm converges.
Formally, let $D^A_{n^A} = \{(x^A_1, y^A_1), \ldots, (x^A_{n^A}, y^A_{n^A})\}$ and $D^B_{n^B} = \{(x^B_1, y^B_1), \ldots, (x^B_{n^B}, y^B_{n^B})\}$ be Alice's and Bob's data sets, respectively.⁷

7 An earlier version of this algorithm was described in Aïmeur et al. (2004) under the name of MABoost.
Fig. 2 The pseudocode of BiBoost. $D^A_{n^A} = \{(x^A_1, y^A_1), \ldots, (x^A_{n^A}, y^A_{n^A})\}$ and $D^B_{n^B} = \{(x^B_1, y^B_1), \ldots, (x^B_{n^B}, y^B_{n^B})\}$ are Alice's and Bob's training sets, respectively, H is the set of weak classifiers, and T is the number of iterations
The algorithm maintains two weight distributions, $w^{A(t)}$ and $w^{B(t)}$, over the data points. The weights are initialized in line 1, and are updated in each iteration in line 13 (Fig. 2) using the formulas (7) designed for ternary weak classifiers. It can be shown that if the weights are initialized non-uniformly to an arbitrary weight vector $w^{(1)} = (w^{(1)}_1, \ldots, w^{(1)}_n)$, then AdaBoost minimizes the weighted training error

$$R_{w^{(1)}}(f^{(T)}) = \sum_{i=1}^{n} w^{(1)}_i\, I\{\mathrm{sign}(f^{(T)}(x_i)) \ne y_i\} \tag{10}$$

instead of (8). In line 1 we initialize the weight vectors uniformly within the data sets of Alice and Bob, which means that we give equal weight to the two data sets $D^A_{n^A}$ and $D^B_{n^B}$; in other words, we minimize

$$\frac{1}{2n^A}\sum_{i=1}^{n^A} I\{\mathrm{sign}(f^{(T)}(x^A_i)) \ne y^A_i\} + \frac{1}{2n^B}\sum_{i=1}^{n^B} I\{\mathrm{sign}(f^{(T)}(x^B_i)) \ne y^B_i\}.$$
The advantage of this initialization is that Alice and Bob do not have to communicate their respective data sizes. If they wish to minimize the unweighted training error (8), then they should exchange $n^A$ and $n^B$, and initialize $w^{(1)}$ uniformly to $\left(1/(n^A + n^B), \ldots, 1/(n^A + n^B)\right)$. Using the weighted error (10) also gives flexibility to the participants in their experimental design. For example, they can assign different weights to positive and negative examples when the cost of misclassification is different for false positives and false negatives.
In line 4, Alice and Bob select a subset H(t) of the weak classifier set H. They
will use this set to return two weak classifiers, hA and hB , separately by minimizing the weighted error over their respective data sets (lines 5–6). In the simplest
case, H(t) can be identical to H in every iteration. In certain situations, it can
be more convenient to simplify the task of the weak learner by using simpler
weak classifier sets that can differ in each iteration. In Sect. 4.1 we will describe
three particular strategies which can be used by Alice and Bob to select H(t) in
a privacy-preserving manner.
We merge the weak classifiers $h^A$ and $h^B$ into a ternary classifier in line 7 by taking their pointwise average. Since $h^A$ and $h^B$ are both binary classifiers⁸ $\mathbb{R}^d \to \{-1, 1\}$, the formula in line 7 is equivalent to

$$h^{(t)}(x) = \begin{cases} h^A(x) & \text{if } h^A(x) = h^B(x), \\ 0 & \text{otherwise}, \end{cases} \tag{11}$$

that is, $h^{(t)}$ will agree with $h^A$ (or $h^B$) if $h^A$ and $h^B$ agree, and abstain otherwise. To calculate the coefficient $\alpha^{(t)}$ (line 11), and to update the weights of the data points (lines 12–13), we calculate the weighted error, the weighted rate of correctly classified points, and the weighted abstention rate of $h^{(t)}$ in lines 8–10 using the formulas (4), (5), and (6), respectively. As before, the final classifier can be obtained as the sign of the weighted “vote” of the weak classifiers (line 14).

8 The formula in line 7 works also in the case of real-valued weak classifiers, so the algorithm can be easily extended to confidence-rated AdaBoost (Schapire and Singer 1999).
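The merging rule (11) itself is a one-liner; a sketch with our own function name, where each party evaluates the (already exchanged) stumps locally and only the merged ±1/0 predictions enter the statistics (4)–(6):

```python
import numpy as np

def merge_predictions(pred_a, pred_b):
    """Rule (11): keep the common label where the two weak classifiers agree,
    abstain (output 0) where they disagree."""
    return np.where(pred_a == pred_b, pred_a, 0)
```

Each participant then computes the abstention-aware statistics of the merged classifier on its own data, and only these aggregates (the ω quantities defined below) need to be exchanged.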
According to condition (9), BiBoost will converge if $\varepsilon_+ > \varepsilon_-$ in each iteration. We now proceed to show that if $h^A$ and $h^B$ are optimal within $H^{(t)}$ on the data sets of Alice and Bob, respectively, then the merged classifier satisfies the convergence criterion (9) of AdaBoost with abstention. First we say that $h^A$ and $h^B$ are $H^{(t)}$-optimal if they minimize the weighted error on their respective data sets, that is,
$$h^A = \mathop{\mathrm{argmin}}_{h \in H^{(t)}} \sum_{i=1}^{n^A} w^A_i\, I\{h(x^A_i) \ne y^A_i\} \qquad\text{and}\qquad h^B = \mathop{\mathrm{argmin}}_{h \in H^{(t)}} \sum_{i=1}^{n^B} w^B_i\, I\{h(x^B_i) \ne y^B_i\}.$$

Now, let

$$\varepsilon^A_- = \sum_{i=1}^{n^A} w^A_i\, I\{h^A(x^A_i) = -y^A_i\} \qquad\text{and}\qquad \varepsilon^A_+ = \sum_{i=1}^{n^A} w^A_i\, I\{h^A(x^A_i) = y^A_i\}$$

be the weighted error and the weighted rate of correctly classified points, respectively, of Alice's classifier $h^A$ on Alice's data set $D^A_{n^A}$, and let $\varepsilon^B_-$ and $\varepsilon^B_+$ be defined similarly for Bob. Furthermore, let

$$\omega^A_- = \sum_{i=1}^{n^A} w^A_i\, I\{h^{(t)}(x^A_i) = -y^A_i\} \qquad\text{and}\qquad \omega^A_+ = \sum_{i=1}^{n^A} w^A_i\, I\{h^{(t)}(x^A_i) = y^A_i\} \tag{12}$$

be the weighted error and the weighted rate of correctly classified points, respectively, of the merged classifier $h^{(t)}$ on Alice's data set $D^A_{n^A}$, and let $\omega^B_-$ and $\omega^B_+$ be defined similarly for Bob. Note the subtle difference between the definitions of $\varepsilon^A_\pm$ and $\omega^A_\pm$: the former uses Alice's classifier $h^A$, whereas the latter uses the merged classifier $h^{(t)}$. With this notation, the weighted error (4) and the weighted rate of correctly classified points (5) can be expressed as

$$\varepsilon_- = \omega^A_- + \omega^B_- \qquad\text{and}\qquad \varepsilon_+ = \omega^A_+ + \omega^B_+,$$

respectively. The following lemma provides a sufficient condition for the convergence of BiBoost in the case when $H^{(t)}$ is closed under negation, that is, for every base classifier $h \in H^{(t)}$, $-h$ is also an element of $H^{(t)}$. The condition is trivially true for the set of decision stumps (see (1) and (2)), and for any function set used by practical learning algorithms.

Lemma 1 If $H^{(t)}$ is closed under negation and if $h^A$ and $h^B$ are $H^{(t)}$-optimal, then $\varepsilon_- \le \varepsilon_+$. Furthermore, $\varepsilon_- = \varepsilon_+$ only if $-h^A$ is optimal on Bob's data set $D^B_{n^B}$ and $-h^B$ is optimal on Alice's data set $D^A_{n^A}$.

Proof The main observation of the proof is that $\omega^A_- \le \omega^A_+$; otherwise the classifier $-h^B$ would have a lower error on Alice's data set $D^A_{n^A}$ than Alice's chosen classifier $h^A$, which would violate the optimality of $h^A$. To see this, first note that by the definition (11) of the merged classifier $h^{(t)}$, $-h^B(x)$ agrees with $h^A(x)$ on observations x on which $h^{(t)}$ abstains, and they disagree otherwise. Thus, the error of $-h^B$ on $D^A_{n^A}$ is the sum of 1) the error of $h^A$ on the subset of $D^A_{n^A}$ where $h^{(t)}$ abstains, that is, $(\varepsilon^A_- - \omega^A_-)$, and 2) the rate of correctly classified points of $h^A$ on the subset of $D^A_{n^A}$ where $h^{(t)}$ does not abstain, that is, $\omega^A_+$. Hence, if $\omega^A_+ < \omega^A_-$, then the error $(\varepsilon^A_- - \omega^A_- + \omega^A_+)$ of $-h^B$ is less than the error $\varepsilon^A_-$ of $h^A$, which violates the optimality of $h^A$. It can be shown in a similar way that $\omega^B_- \le \omega^B_+$, and the first statement of the Lemma follows. The second statement is also easy to see by observing that $\omega^A_- \le \omega^A_+$, $\omega^B_- \le \omega^B_+$, and $\varepsilon_- = \varepsilon_+$ imply $\omega^A_- = \omega^A_+$ and $\omega^B_- = \omega^B_+$, so the error of $-h^B$ on $D^A_{n^A}$ is $(\varepsilon^A_- - \omega^A_- + \omega^A_+) = \varepsilon^A_-$, and, similarly, the error of $-h^A$ on $D^B_{n^B}$ is $\varepsilon^B_-$.

Remark 1 If $H^{(t)}$ is closed under negation, then $\varepsilon^A_- \le \varepsilon^A_+$.⁹ Moreover, if $-h^A$ is optimal, then $\varepsilon^A_- = \varepsilon^A_+$; this can happen only if we have equality for all weak classifiers, in which case AdaBoost would saturate and should be stopped. The same thing happens in BiBoost if Alice and Bob pick exactly the same classifier but with opposite signs. If we add to the protocol that in this case Alice and Bob should go back to line 4 and agree on another subset $H^{(t)}$, then BiBoost has to be stopped only if Alice and Bob pick exactly opposite classifiers for all possible subsets.

Remark 2 The $H^{(t)}$-optimality of $h^A$ and $h^B$ is crucial for the lemma. If Alice runs AdaBoost alone on her data set, she can continue with any weak classifier $h^A$ for which $\varepsilon^A_- < \varepsilon^A_+$, so she is not required to find the minimizer of the weighted error. On the other hand, it is easy to find an example of a non-optimal $h^A$ for which $\varepsilon^A_- < \varepsilon^A_+$, yet $\varepsilon_- > \varepsilon_+$.¹⁰ Such an $h^A$ would be admissible for AdaBoost if Alice were training alone, but it would make the merged classifier fail in BiBoost.

Remark 3 The condition in the lemma is sufficient for BiBoost's convergence; however, it is not necessary. In practice, it is plausible that most of the time $\varepsilon_- < \varepsilon_+$ even if $h^A$ and $h^B$ are not $H^{(t)}$-optimal. If $H^{(t)}$-optimality cannot be guaranteed, condition (9) should be verified after line 9, and in case it is not satisfied, the algorithm should return to line 4 to select new weak classifiers.

9 Otherwise, by switching labels, the error of $-h^A$ (that is, $\varepsilon^A_+$) would be lower than the error $\varepsilon^A_-$ of $h^A$, so $h^A$ would not be optimal.
10 For example, let $D^A_3 = \{(0, -1), (2, -1), (4, +1)\}$, $D^B_3 = \{(6, -1), (4, -1), (2, +1)\}$, and $w^{A(t)} = w^{B(t)} = (0.15, 0.2, 0.15)$. Let Alice choose the suboptimal decision stump $h^A(x) = h_{1,1+}(x) = 2I\{x \ge 1\} - 1$, and let Bob choose $h^B(x) = h_{1,5-}(x) = 2I\{x < 5\} - 1$. Since $\varepsilon^A_- = \varepsilon^B_- = 0.2 < 0.3 = \varepsilon^A_+ = \varepsilon^B_+$, both stumps would be admissible for AdaBoost. For the merged classifier, $\varepsilon_- = 0.4 > 0.3 = \varepsilon_+$, so BiBoost cannot continue.
4.1 The choice of H(t)
The generality of the lemma allows us to consider various strategies to select
the set H(t) of weak classifiers. In the simplest case, H(t) can be identical to H
in every iteration. In certain cases, it may be convenient to divide H into k (not
necessarily disjoint) subsets H1 , . . . , Hk , and select one of them as H(t) in each
iteration. There are multiple reasons why such a partitioning of the weak classifier space can be useful. First, from a computational point of view, we might
be able to guarantee Hj -optimality on the subsets, and even if Hj -optimality
cannot be guaranteed on the subsets, this design can allow Alice and Bob to
go back to line 4 and select another alternative when condition (9) is violated.
Second, from the privacy-preserving aspect, weak classifiers that are optimal
over a possibly large set H may reveal much more information about the data
sets than weak classifiers that were selected from a small subset of H. Of course,
in this case, Bob and Alice have to agree on the subset selected for H(t) , and
the communication and computational costs may cancel this advantage (see
Sects. 4.2 and 6.2 for a detailed analysis). Finally, from a statistical point of
view, it has been argued (Amit et al. 2000, Friedman 2002, Kolcz et al. 2002)
that introducing randomness into the weak classifier selection can improve the
generalization performance of AdaBoost (our experiments in Sect. 7 confirmed
this argument), and partitioning H provides a natural framework for randomization. In this section we describe three general alternatives that can be used to
select H(t) , and discuss their communicational and computational implications.
4.1.1 The WholeSet alternative
In the simplest case, Alice and Bob will use H(t) = H in each iteration. They
do not need to communicate in line 4; they only have to exchange their weak
classifiers hA and hB .
4.1.2 The CommonSubset alternative
In general, Alice and Bob can divide H into k (not necessarily disjoint) subsets
H1 , . . . , Hk , and select one of them as H(t) in each iteration. In the experiments
(Sect. 7) we will use decision stumps as weak classifiers, and $H_j$, $j = 1, \ldots, d$, will be the set of stumps that cut the jth attribute, that is, $H_j = \{h_{j,\theta+}, h_{j,\theta-} : \theta \in \mathbb{R}\}$
(see the definitions (1) and (2)). This may be the simplest way to apply this
alternative, nevertheless, the protocol can be used for any partitioning of any
weak classifier set.
In certain situations, it is possible that some of the subsets are not admissible
for Alice or Bob. In multi-class classification, or if the subset Hj is not closed
under negation, it is possible that Alice (Bob) cannot find a weak classifier in
$H_j$ with $\varepsilon^A_- \le \varepsilon^A_+$ ($\varepsilon^B_- \le \varepsilon^B_+$). Moreover, it can happen even in binary classification that $\varepsilon^A_-$ is equal to $\varepsilon^A_+$ within the numerical precision of the computer for all weak classifiers in $H_j$. In this situation Alice and Bob must find a subset
Hj which is admissible for both of them. If such a subset does not exist, they
want to be aware of that, and if there are several such subsets, they want to
choose one randomly from among them. In the rest of this section we describe
a protocol that allows Alice and Bob to find a common admissible subset in a
privacy-preserving manner.
To formalize the problem, Alice and Bob represent their admissible subsets
by two binary vectors, $b^A = (b^A_1, \ldots, b^A_k)$ and $b^B = (b^B_1, \ldots, b^B_k)$, where $b^A_j$ (or $b^B_j$) is 1 if $H_j$ is admissible for Alice (or Bob), and 0 otherwise. Their goal is to select an index j randomly from the set $\{\ell : b^A_\ell \wedge b^B_\ell = 1\}$. Formally, they want to compute j as a function $\phi(b^A, b^B)$. What makes the problem non-trivial is that Alice and Bob do not want to disclose any unnecessary information on $b^A$ and $b^B$ while computing $\phi(b^A, b^B)$;¹¹ in other words, they want to find a solution based on secure multiparty computation. We call this task the random rendez-vous problem, referring to the equivalent problem where $b^A_j$ and $b^B_j$ are indicators of free slots in Alice's and Bob's agendas, respectively, and the goal is to schedule a meeting without disclosing any unnecessary information on their agendas.
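To pin down the functionality φ that must be realized, here is a plain, non-private reference implementation (our own sketch; it obviously assumes both bit vectors in one place, which is exactly what the protocols discussed next avoid):

```python
import random

def phi(b_a, b_b, rng=random):
    """Random rendez-vous functionality: return a uniformly random index j
    with b_a[j] = b_b[j] = 1, or None if no common admissible subset exists."""
    common = [j for j, (x, y) in enumerate(zip(b_a, b_b)) if x == 1 and y == 1]
    return rng.choice(common) if common else None
```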
The problem of computing φ(bA , bB ) is equivalent to computing a random
element of the intersection of two sets. This latter problem was solved and
analyzed in the area of communication complexity by Kalyanasundaran and
Schnitger (1987). They proved that the computation of the intersection requires
Ω(k) bits of communication on the average. They, however, were concerned only
about the issue of minimizing communication; they were not interested in the
privacy-preserving aspect of the computation. Nevertheless, the result provides
us with a benchmark since any distributed and privacy-preserving protocol has
to use at least as many bits of communication as the best non privacy-preserving
distributed protocol.
A secure and efficient way to compute φ(bA , bB ) would be by adapting
the secure set intersection protocol proposed in Kissner and Song (2005). In
their paper, the authors describe how to use and manipulate polynomial representations of the sets to implement privacy-preserving operations. Their set
intersection protocol requires that each participant encode his set as a polynomial, encrypt the polynomial using an additively homomorphic scheme (such as
Paillier’s cryptosystem Paillier 2000), do some randomization on the encrypted
polynomial, then exchange it with the other participant, who will also encrypt
and randomize the received polynomial before sending back the result. Finally,
the two participants have to cooperate in order to perform a group decryption
and reveal the intersection of the two sets. It is also possible to modify the
protocol to make it secure against malicious participants using standard cryptographic techniques involving zero-knowledge proofs at the cost of increasing
the complexity of the protocol. As an alternative, a similar protocol (Freedman
et al. 2004) could be also used to compute the set intersection efficiently.
11 For example, they do not want to reveal any admissible subsets other than $H_j$, not even the number of their admissible subsets.
Note that to implement the random rendez-vous protocol (that is, to compute
φ(bA , bB )), we only need to find one random element of the intersection, so we
can modify the last part of Kissner and Song (2005)’s protocol in the following
way. Instead of decrypting all the doubly-encrypted elements at once which
would reveal the entire intersection, the participants should shuffle the list and
proceed element by element until a match is found (or all the elements of the
list have been ruled out, in which case the intersection is empty). This allows them to reveal only one element of the set intersection, chosen at random, instead of the whole intersection.
4.1.3 The RandomSubset alternative
In this alternative, Alice first chooses an admissible subset $H_{j^A}$ randomly from among all her admissible subsets, and announces $j^A$ to Bob. If $H_{j^A}$ is also admissible for Bob, they agree to set $H^{(t)} = H_{j^A}$ without having to run the random rendez-vous protocol. If $H_{j^A}$ is not admissible for Bob, then he chooses an admissible subset $H_{j^B}$ randomly from among all his admissible subsets, and they set $H^{(t)} = H_{j^A} \cup H_{j^B}$.
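Unlike the CommonSubset alternative, this selection needs no cryptography; a sketch of the decision logic (function and variable names are ours):

```python
import random

def random_subset_choice(admissible_a, admissible_b, rng=random):
    """admissible_a / admissible_b are the index sets of subsets H_j admissible
    for Alice and Bob. Returns the indices whose union forms H^(t)."""
    j_a = rng.choice(sorted(admissible_a))       # Alice announces j_A
    if j_a in admissible_b:                      # H_{j_A} admissible for Bob too
        return {j_a}
    j_b = rng.choice(sorted(admissible_b))       # otherwise Bob announces j_B
    return {j_a, j_b}                            # H^(t) = H_{j_A} union H_{j_B}
```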
4.2 Complexity
In this section we discuss the complexity of BiBoost. Since it is a distributed
algorithm, we can analyze two different measures of complexity, the communication complexity, which is the number of bits exchanged during the execution of
the protocol, and the computational complexity, which considers the processing
time required to execute the algorithm.
4.2.1 Communication complexity
To quantify the amount of communication used in BiBoost, we need to determine the number of bits exchanged during an iteration. First observe that the
only steps that require communication between Alice and Bob are
1. agreeing on H(t) in line 4 (Fig. 2),
2. exchanging the weak classifiers hA and hB in line 7, and
3. exchanging the errors ε+ and ε− for calculating the coefficient α (t) in line 11
and for updating the weights in line 13.
The communication cost of the first step depends on the protocol used to
agree on H(t) . In the WholeSet alternative there is no communication, so
this step costs nothing. Using the CommonSubset alternative with the random
rendez-vous protocol of Kissner and Song (2005), we need to communicate
Θ(k log k) bits, where k is the number of subsets considered. In the RandomSubset alternative, Alice and Bob have to communicate their selection of $H_{j^A}$ and $H_{j^B}$, which costs Θ(log k) bits. For quantifying the communication cost of
Table 1 Communication and computational cost (using decision stumps) of BiBoost

Alternative       Communication cost      Computational cost
WholeSet          Θ(Tm)                   Θ(Tdn)
CommonSubset      Θ(T(m + k log k))       Θ(Td(n + e))
RandomSubset      Θ(T(m + log k))         Θ(Tdn)

T is the number of iterations, m is the number of bits needed to encode weak classifiers in H(t), k is the number of weak classifier sets, d is the number of attributes (dimensions), n = n^A + n^B is the total size of the data set, and e is the time required to perform one encryption
exchanging $h^A$ and $h^B$, we suppose that we need m bits to encode weak classifiers in $H^{(t)}$, so the second step costs Θ(m) bits. Encoding the errors to finite
precision can be done using a constant number of bits, so the communication
cost of this step is constant. The final communication costs are summarized in
Table 1. Note that although the WholeSet alternative seems to have the lowest
cost, m can be much larger than in the CommonSubset and RandomSubset
alternatives, since the weak classifier set H can be large compared to the subsets Hj . In particular, if H is the set of all decision stumps and Hj is the set of
decision stumps on the jth attribute, then k is equal to the number of attributes
d, so, to encode a stump we need log d bits more in the WholeSet alternative
than in the RandomSubset alternative. Hence, assuming the same number of
iterations, the two alternatives have the same communication cost.
4.2.2 Computational complexity
The computational cost of lines 7 through 13 is Θ(n), where $n = n^A + n^B$. Since
training the weak classifiers in lines 5 and 6 is at least linear in n, the computational cost of BiBoost is dominated by the selection of H(t) (line 4), and the
training of the weak classifiers (lines 5 and 6).
The computation time required to agree on H(t) depends again on which
alternative is chosen. It is negligible if the WholeSet or RandomSubset alternative is used. To analyze the computational cost of the set intersection protocol
of Kissner and Song (2005) used in the CommonSubset alternative, let e stand
for the time required to encrypt one element of the set. For each weak classifier
set $H_j$, Alice and Bob have to perform two encryptions, which take Θ(ke) time.
The computational cost of finding the weak classifier depends on the learning algorithm used. It is at least linear in n unless the algorithm uses only a
subsample of the data Dn to train the weak classifiers. In particular, the best
decision stump can be found in Θ(dn) steps if the data points are sorted in each attribute. Sorting the data points can be done in Θ(dn log n) time, but it has to be done only once before the boosting iteration starts. The overall cost of this step for all T iterations is therefore in Θ(dn(T + log n)).
Summing up (see Table 1), the overall computational complexity of all T iterations of BiBoost is in Θ(dn log n + Td(n + e)) if the CommonSubset alternative is used and d = k. In the likely case that d < 2^n and n < 2^T, this reduces to Θ(Td(n + e)). If the WholeSet or RandomSubset alternative is used, the overall computational complexity of BiBoost is Θ(dn(T + log n)), which reduces to Θ(Tdn) if n < 2^T. In any case, by using decision stumps as weak classifiers, the
time required to apply our solution increases linearly with the number of attributes. If there are too many attributes, the processing time of BiBoost could be
prohibitive.
5 MultBoost
In this section, we describe MultBoost (Multiparty Boosting) that generalizes
BiBoost to multiple participants. In general, extending a bipartite protocol to
the multiparty case is a non-trivial task in cryptography. For example,
BiBoost’s closest relative, the privacy-preserving version of ID3 (Lindell and
Pinkas 2002) exists only as a bipartite algorithm, and the authors acknowledge
that extending it to the multiparty case would be difficult. The main difficulty
of moving from a bipartite to a multiparty setting is that some steps that were
previously considered secure are no longer so. For example, the “naive” computation of the weak classifier’s coefficient α (t) is no longer secure, since the
errors of the participants’ individual weak classifiers cannot be inferred from
the final classifier, so, according to our security paradigm, these errors should
now be protected and not revealed in the clear. The security of agreeing on a
common subset H(t) must also be revised in the multiparty setting. The descriptions of the individual classifiers still do not have to be protected (assuming that
they are explicitly contained in the merged classifier), however, the origin of
each individual classifier (which one came from which participant) must now
be hidden.
The communication model considered in the multiparty setting is the following. The existence of a private channel between each pair of participants
is assumed. Practically, this means that each participant has the possibility of
sending a secret message to any other participant without the possibility for
anybody else to learn any information about the content of the message. The
participants have also access to a broadcast channel which takes a message from
one participant and broadcasts it to every other participants. If this channel does
not exist physically, it is always possible for one participant in the semi-honest
model to emulate it by using the individual private channels. This emulation has
a communication cost of Θ(M) bits, where M is the number of participants, since
the selected participant has to send M messages using private channels instead
of one message over the broadcast channel. However, MultBoost’s protocols
use the broadcast channel in a limited way, so the communication complexity
of these protocols is not affected by whether the broadcast channel is real or
simulated. Note that in a model where the participants could be malicious, the
emulation of a broadcast channel would have a significant impact both on the
communication complexity of the protocol and its security. For example, if a
broadcast channel is available, any multiparty computation can be made secure
against at most half of the participants (Goldreich et al. 1987, Rabin and Ben-Or
1989), whereas without the broadcast channel the security is only guaranteed if at least two thirds of the participants are honest. In our case, as long as the word “broadcast” is not explicitly used during the description of the protocol, the communication is done using private individual channels.

Fig. 3 The pseudocode of MultBoost. $D^1_{n^1}, \ldots, D^M_{n^M}$ are the training sets of the M participants, H is the set of weak classifiers, and T is the number of iterations
For the formal description of MultBoost, let $D^p_{n^p} = \{(x^p_1, y^p_1), \ldots, (x^p_{n^p}, y^p_{n^p})\}$ be the data set of the pth participant, $p = 1, \ldots, M$. The algorithm maintains a weight distribution $w^{p(t)}$ over the data points. The weights are initialized uniformly in lines 1–2, and are updated in each iteration in line 16 (Fig. 3) using
the formulas (7) designed for ternary weak classifiers. In line 4, the participants
select a subset H(t) of the weak classifier set H. They will use this set to return
their weak classifiers hp separately by minimizing the weighted error over their
respective data sets (lines 5–6). The WholeSet alternative, when H(t) = H in
each iteration, is easy to use also in MultBoost, but it has the same disadvantages as in BiBoost. The CommonSubset alternative might be difficult to use
in MultBoost. The random rendez-vous protocol could easily be extended to
the multiparty case for a cost that is linear in the number of participants, but
the goal of the protocol might be too ambitious in the sense that finding a subset Hj that is admissible for all participants can be more and more difficult as
M grows. In our implementation, we adopted a variant of the RandomSubset
alternative. Each participant selects an admissible subset $H^p$ randomly from among its admissible subsets, and then they set $H^{(t)} = \bigcup_{p=1}^{M} H^p$.
The weak classifiers $h^1, \ldots, h^M$ are merged into a ternary classifier in line 7. In our implementation, we use the simple rule

$$h^{(t)}(x) = \begin{cases} \mathrm{sign}\left(\sum_{p=1}^{M} h^p(x)\right) & \text{if } \sum_{p=1}^{M} h^p(x) \ne 0, \\ 0 & \text{otherwise}, \end{cases} \tag{13}$$

that is, we take a majority vote among the participants' weak classifiers. This rule simplifies to the rule (11) that we used in the bipartite case. Other rules could also be used in this step. For example, one could output the “raw”, confidence-rated (real-valued) weak classifier $\frac{1}{M}\sum_{p=1}^{M} h^p(x)$, or weight the votes in (13) according to the correctness of the individual classifiers. Unfortunately, we cannot extend Lemma 1 to any of these rules, which means that we cannot theoretically guarantee that $h^{(t)}$ satisfies the condition (9) even if the individual classifiers $h^p$ are all $H^{(t)}$-optimal on their respective data sets $D^p_{n^p}$. Nevertheless, in experiments, the merged classifier (using the rule (13)) almost never fails the condition (9). On the rare occasion when it happens, we can go back to line 4 and agree on a different $H^{(t)}$.
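Rule (13) in code, assuming each row of preds holds one participant's ±1 predictions on the points being evaluated (our own sketch):

```python
import numpy as np

def merge_multiparty(preds):
    """preds has shape (M, n) with entries in {-1, +1}.
    Majority vote per point, abstaining (0) on ties, as in (13)."""
    vote = preds.sum(axis=0)
    return np.sign(vote).astype(int)  # np.sign(0) == 0, i.e., abstain on ties
```

With M = 2 this reduces to the bipartite rule (11).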
The security of this step must also be revised in MultBoost. In BiBoost,
both participants know the weak classifier of the other participant by subtracting their own weak classifier from $h^{(t)}$. In MultBoost, however, participants cannot
identify the owners of individual weak classifiers just by looking at h(t) , so these
identities must be hidden when the classifiers are merged. To solve this problem,
we use an anonymous broadcast protocol, which takes a set of objects as input,
and outputs them in a random order in such a way that it is impossible to trace
the origin of the objects.12 This kind of channel was suggested among many
other ideas in a seminal paper of Chaum (1981). Examples of recent efficient
protocols implementing this kind of channel can be found in Furukawa and
Sako (2001) and Neff (2001).
After obtaining the merged classifier h(t) , the participants compute its coefficient α (t) in lines 11–14. There is no change in the computation compared to
BiBoost, however, the summation of the individual errors (lines 11–12) must
be done securely. In BiBoost, Bob can infer the error rate $\omega^A$ (12) of Alice using $\alpha^{(t)}$ and his own error rate $\omega^B$. On the other hand, in MultBoost, the participants can reconstruct only the sum $\varepsilon_-$ of their individual errors $\varepsilon^p_-$, and not the
12 Note that the same protocol can also be used to hide the origins of the weak classifier sets Hp
in case the RandomSubset alternative is used in line 4 of Fig. 3. This would prevent a potential
information leak during the agreement phase.
individual errors themselves. Therefore, the individual errors $\varepsilon^1_-, \ldots, \varepsilon^M_-$ should now be protected and not revealed in the clear. To compute $\varepsilon_- = \sum_{p=1}^{M} \varepsilon^p_-$ securely, we use a secure sum computation protocol. An implementation of this protocol in the context of privacy-preserving association rules can be found in Kantarcioglǔ and Clifton (2004b). This protocol is based on the work started by Shamir on linear secret sharing (Shamir 1979). It involves generating shares by adding random numbers to mask the value of the true elements, distributing the randomized shares, computing the sum of the shares using modular addition, and then canceling the effect of the randomization in order to reveal the global result. By using the secure sum computation protocol, the participants compute the weighted error $\varepsilon_-$ and the weighted rate of correctly classified points $\varepsilon_+$ of the merged classifier $h^{(t)}$. Then they obtain the coefficient $\alpha^{(t)}$ and update the point weights $w^p_i$ as in BiBoost.
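The masking idea behind the secure sum can be illustrated with additive shares modulo a large public modulus Q (a toy sketch with our own names; errors would first be rescaled to integers, and a real implementation also needs the communication pattern and protections described in Kantarcioglǔ and Clifton (2004b)):

```python
import random

def make_shares(value, n_shares, Q, rng=random):
    """Split a non-negative integer value into n_shares additive shares modulo Q."""
    shares = [rng.randrange(Q) for _ in range(n_shares - 1)]
    shares.append((value - sum(shares)) % Q)
    return shares

def secure_sum(values, Q=2**61 - 1):
    """Each participant splits its value into shares, one per participant;
    each participant adds up the shares it received; the partial sums are
    then combined to reveal only the global sum modulo Q."""
    M = len(values)
    all_shares = [make_shares(v, M, Q) for v in values]     # row p: shares of value p
    partial = [sum(all_shares[p][q] for p in range(M)) % Q for q in range(M)]
    return sum(partial) % Q

# e.g. errors are rescaled to integers before sharing: int(round(eps * 10**6))
```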
5.1 Complexity
In this section we study the communication and computational complexities of
MultBoost. When analyzing the communication complexity, we will suppose
that we have a global resource (communication channel) that the participants
share, and we compute the total amount of communication exchanged between
the participants. On the other hand, in the section on the computational complexity, we will analyze the time required for each participant. Therefore, for
the total time, the formulas should simply include an additional factor of M.
5.1.1 Communication complexity
As in BiBoost the following three steps require communication between the
participants:
1. Agreeing on H(t) in line 4 (Fig. 3),
2. exchanging the weak classifiers hp in line 7, and
p
p
3. exchanging the errors ε+ and ε− for calculating the coefficient α (t) in lines
11–12.
The communication cost of the first step depends on the protocol used to
agree on H(t) . In the WholeSet alternative there is no communication, so this
step costs nothing. In the RandomSubset alternative (which we used in our
implementation), every participant has to send his choice $H^p$ to every other participant, which costs Θ(M² log k) bits. The communication cost of the anonymous broadcast protocol (second step) is also Θ(M² log k) using the algorithms in Furukawa and Sako (2001) and Neff (2001). The secure sum computation protocol (third step) is done in Θ(M²) using the protocol described in Kantarcioglǔ and Clifton (2004b). Since the cost of one iteration of MultBoost is dominated by the cost of the second step, this does not change the asymptotic complexity of the algorithm. Summing up, the cost of one iteration of MultBoost is Θ(M² log k), and so the total cost is Θ(TM² log k) for T iterations.
5.1.2 Computational complexity
The computational complexity of the “learning” steps of MultBoost is Θ((T + log n) dn), where $n = \sum_{p=1}^{M} n^p$, as in BiBoost (assuming again that decision stumps are used for weak learners). The anonymous broadcast protocol has a computational cost of Θ(eM) for each participant, where e is the time required to perform one encryption. The cost of SecureSumComputation is Θ(M) per participant. Hence, the total computational cost of MultBoost is Θ((T + log n)dn + eMT). Interestingly, if M > n, the cost of encrypting the values during the anonymous broadcast protocol could exceed the cost of learning.
6 Privacy
Privacy is a difficult notion to formalize. It can take different flavors and meanings depending on the context, and there is no consensus yet on how it should
be defined in the data mining setting. Although several attempts have been
made to tackle this question (see Clifton et al. 2004 for a non-exhaustive list),
the definitions proposed are often restricted to one of the three particular
approaches outlined in Sect. 2. In this section, we first give an overview of the
privacy models used in these three approaches. Then we analyze BiBoost and
MultBoost within the standard paradigm used commonly in cryptographybased approaches. This approach considers a protocol perfectly secure if its
execution does not reveal more information than the output itself, which means
that it entirely overlooks any information leaked by the final classifier itself. In
the last subsection we elaborate on this criticism and analyze the security of the
classifier that BiBoost produces.
6.1 Privacy models in data mining
In the first approach of privacy-preserving data mining, the data is altered
through a sanitization process which tries to preserve privacy while maintaining the utility of the data. Within this approach, Bertino et al. (2005) have recently proposed a formal framework which allows comparing different privacy-preserving algorithms using criteria such as efficiency, accuracy, scalability, or level of privacy.
The goal of the second approach is to randomize the data by adding uniform
or Gaussian noise to the feature values. In this model, two notions are commonly used to measure privacy: conditional entropy (Agrawal and Aggarwal 2001) and the notion of privacy breaches (Evfimievski 2002). Conditional entropy
is an information-theoretic measure which computes the mutual information
between the original and the randomized distribution. Low mutual information
leads to high privacy preservation but learning becomes less reliable. Privacy
breaches are used to model a change of confidence regarding the estimated
value of a particular attribute of a particular record. Evfimievski et al. (2003)
describe a technique that can be used to limit privacy breaches without any
knowledge of the original distribution. They also point out interesting links
between the notions of conditional entropy and privacy breaches, and describe
situations in which privacy breaches can occur despite low mutual information
between the original and the randomized data.
In the third, cryptography-based approach, the objective is to preserve the
privacy of the participants’ data while approximating as much as possible the
performance of the classifier that they would have obtained had they fully
disclosed their data sets to each other. Besides BiBoost, cryptography-based
versions of decision trees (Lindell and Pinkas 2002), naïve Bayes classifier
(Kantarcioglǔ and Vaidya 2004), neural networks (Chang and Lu 2001), support vector machines (Yu et al. 2006), and k-means (Kruger et al. 2005) have
been developed. Privacy in these methods is defined within the usual cryptographic paradigm which states that a multiparty protocol is considered perfectly
secure if its execution does not reveal more information than the output of the
protocol itself (Goldreich 2004). The rationale behind this definition is that
the purpose of the protocol is to make its output available to all parties, and
therefore there is no way to avoid revealing the information that this entails. It
turns out that this well-accepted notion has serious shortcomings in our context
because our purpose is not to compute securely one particular well-specified
classifier. Much to the contrary, our study aims at determining which classifier
can be obtained as privately as possible, and various choices of strategy (such
as which alternative is taken in line 4 of BiBoost) will result in different classifiers. In particular, it would be unfair to claim that alternative x is more secure
than alternative y simply because the protocol leaks less information in the first
case than in the second, other than what can be inferred by the resulting classifiers. Indeed, it could be that the classifier itself resulting from alternative x
leaks much more information on sensitive data than the classifier resulting from
the other alternative. To make this problem even more conspicuous, consider
the extreme case of building a privacy-preserving k-nearest-neighbor classifier.
Since the classifier itself contains a copy of the data, the participants could simply exchange their datasets in the clear and still have a perfectly secure protocol
according to the cryptographic definition.13
Several papers studying the notion of privacy have raised this question,
although none of them (including the ones describing the cryptography-based
privacy-preserving classifiers cited above) has found a general model that could
be used to analyze and compare the methods. In Kantarcioglǔ et al. (2004), the
authors discuss some aspects of how the results of a data mining process can
violate privacy. They model the classifier as a “black-box,” which means that
an adversary can request an instance to be classified without getting any other
information on the classifier. They consider a multi-task classification model in
13 In a different (although related) model the goal is to classify a given test example privately using
a distributed k-nearest-neighbor classifier. See Kantarcioglǔ and Clifton (2004a) for a privacy-preserving protocol solving this problem using cryptographic techniques with the help of an untrusted
third party.
which the goal is to predict several attributes at the same time using common
observations. The particular question that they study is whether an adversary
can improve a classifier that predicts one attribute by using the output of another
(black-box) classifier that predicts a different attribute. This question is interesting
in the sense that it helps us understand how the output of a data mining process
(such as the prediction of a classifier) could be used to attack the privacy of
some unobserved attributes. However, modeling the classifier as a black-box
does not take into account how the model description might reveal sensitive
information. The angle of attack of Dinur and Nissim (2003) is different: they
try to give a computational definition of privacy in the context of databases.
In this setting, privacy is preserved if it is computationally infeasible (e.g., no
polynomial-time algorithm exists) to retrieve the original information from the
randomized data. In particular, the authors prove that unless the perturbation is of magnitude at least Ω(√n) (where n is the number of records in the
database), a polynomial-time adversary can always recover almost the whole
database. They also show that such a large perturbation does not imply automatically that the resulting randomized database will be useless. The approach
taken by the authors is interesting in the sense that it is a “reverse-engineering”
view of cryptography: we first look at which information should not be leaked
and look for functions that meet the privacy requirements, and then devise a
privacy-preserving version of these functions with the help of cryptography.
Another related question is how a priori knowledge of an adversary can
cause privacy breaches. Suppose that Alice and Bob run a privacy-preserving
algorithm to obtain a classifier g, and suppose that neither the form of g nor
the algorithm leaks any information regarding Bob’s data set. In principle, it is
still possible that Alice can retrieve some information on Bob’s data by running
the classification algorithm on her data set to obtain the classifier gA , and then
examining the difference between g and gA . A related problem is studied by
Chang and Lu (2001) under the name of oblivious learning. In this model, Alice
has a classifier with a fixed architecture and Bob has a training dataset. Alice
wants to train her model without learning anything about Bob’s dataset and
without Bob learning anything about Alice's classifier. In particular, Chang
and Lu (2001) applied this form of learning to neural networks.
In the following two sections, we first analyze BiBoost in the “traditional”
cryptographic paradigm, then we discuss the security of the obtained classifier.
6.2 Cryptographic security of BiBoost
Keeping in mind the caveat discussed in the previous section, it is nevertheless
interesting to study the amount of information leaked during the execution of
our various alternatives, above and beyond the information leaked by the final
classifier. First recall that we consider here the semi-honest model, in which
participants are expected to follow their prescribed behavior during the execution of the protocol. Yet, they attempt to learn as much as possible concerning
the other participant’s sensitive data by analyzing the information exchanged
during the protocol (Goldreich 2004). For the analysis of security, it suffices to
concentrate on the steps enumerated in Sect. 4.2.1, since those are the only steps in
which communication takes place. However, it turns out that all the information
exchanged in the clear during the second and third steps is implicitly contained
in the description of the final classifier.
The security of the first step depends on the choice of which alternative is
chosen in line 4 of BiBoost, so we discuss the security issues separately for the
three alternatives in the following subsections.
6.2.1 The WholeSet alternative
The WholeSet alternative (Sect. 4.1.1) is the easiest to analyze because everything that is said in the clear (the descriptions of hA and hB ) can be reconstructed from knowledge of the final classifier. In other words, our protocol is
perfectly secure (in the semi-honest model) according to the standard definition
of security for multiparty computation (Goldreich 2004). Far from using this
observation for recommending this alternative, we use it to reinforce our claim
that the standard definition of security is inadequate in our context since much
more information leaks from the final classifier when this alternative is taken.
6.2.2 The CommonSubset alternative
For the analysis of the CommonSubset alternative (Sect. 4.1.2), first note that,
in principle, it would be possible to replace our random rendez-vous protocol
with a perfectly secure protocol, under the usual cryptographic assumptions,
since the task at hand falls under the generic technique of secure two-party
computation (Goldreich 2004). This would result in a perfectly secure protocol
for building the final classifier under the usual cryptographic definition of security in the semi-honest model. Nevertheless, we advocate against this approach
because it would be much less efficient than our proposed solution, and the gain
in security would mostly be an illusion, as we explained at the beginning
of this section: the final classifier itself leaks a fair amount of information.
For the analysis of the random rendez-vous protocol, let us assume for simplicity that the encryption schemes are strong enough that no information can
be gained from encrypted values. This ensures that no information leaks about
the values of the set intersection other than the value chosen at random. What
leaks, however, as explained earlier, is a probabilistic estimate of the number
of admissible subsets (which corresponds to the size of the set intersection).
Clearly, this information could not be inferred by analyzing the final classifier
alone.
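To make the nature of this leak concrete, here is a small simulation (with made-up set sizes) of a simplified rendez-vous in which candidates are drawn uniformly until one is admissible for both parties; the number of rounds of a run is then geometric with success probability |A ∩ B|/|universe|, so the round counts observed over many boosting iterations yield a probabilistic estimate of the intersection size. The candidate space and stopping rule below are simplifying assumptions of ours and are not meant to reproduce the exact protocol of Sect. 4.1.2.

import random

def rendez_vous_rounds(universe_size, alice_admissible, bob_admissible, rng):
    """Simplified random rendez-vous: draw candidate subsets uniformly until one is
    admissible for both parties, and return the (observable) number of rounds."""
    rounds = 0
    while True:
        rounds += 1
        candidate = rng.randrange(universe_size)
        if candidate in alice_admissible and candidate in bob_admissible:
            return rounds

rng = random.Random(0)
universe = 1000
alice = set(rng.sample(range(universe), 300))   # Alice's admissible subsets
bob = set(rng.sample(range(universe), 300))     # Bob's admissible subsets
true_size = len(alice & bob)

# A single round count is geometric with success probability |A ∩ B| / universe,
# so averaging over many boosting iterations gives an increasingly sharp estimate
# of the intersection size; this is exactly the leak discussed above.
runs = [rendez_vous_rounds(universe, alice, bob, rng) for _ in range(200)]
estimate = universe / (sum(runs) / len(runs))
print("true intersection size:", true_size)
print("estimate from round counts:", round(estimate, 1))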
6.2.3 The RandomSubset alternative
Not much information leaks if the RandomSubset alternative (Sect. 4.1.3) is
used. Recall that, in each iteration, Alice announces a subset HjA chosen at
random from among her admissible subsets. If HjA is admissible for Bob, then it
Privacy-preserving boosting
155
is the only subset used in this iteration. In this case, no information leaks from
the interaction that cannot be reconstructed from the final classifier, which is
perfect from the standard cryptographic viewpoint. Assume now that HjA is not
admissible for Bob, which forces him to reveal his randomly chosen admissible
subset HjB . The most interesting situation occurs if Alice’s best weak classifier
in HjB is better than her initially chosen weak classifier in HjA , since in this
case, only HjB will be used in the final classifier, and no trace that HjA had been
considered will be kept in the classifier. Therefore, the interaction leaks information that cannot be reconstructed by analyzing the final classifier. Namely, Alice
learns that Bob does not have a good weak classifier in HjA and Bob learns not
only that Alice does have a good weak classifier in HjB , but also that her weak
classifier in HjB is better than her initially chosen weak classifier in HjA.
6.3 Cryptographic security of MultBoost
There is a new security threat that needs to be considered in the multiparty
setting. A collusion in the semi-honest model corresponds to a scenario where
a fixed subset of the participants agree to cooperate by exchanging all the information they have gathered during the execution of the protocol. The goal of the
colluders is to learn as much information as possible on the other participants’
input. The security of a protocol can be defined according to the maximum
number of participants that can collude without compromising the privacy of
the remaining participants' inputs. Note that in the bipartite case, the notion of
collusion has no meaning since a collusion between Alice and Bob is equivalent
to exchanging their inputs.
The security of an anonymous broadcast protocol comes from the execution
of multiple rounds of shuffling and decryption by different and independent
mixers, carried out so that the decrypted output cannot be linked to the
encrypted input.
The security of the secure sum computation protocol is based on the masking
ability of random numbers. By looking at the individual shares, none of the participants can determine the other participants’ original numbers εp . Moreover,
even if two or more participants collude, they will fail to extract the individual
values hidden inside since the random shares are different and independent.
This makes the protocol secure against any collusion of up to M −2 participants.
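The masking idea can be illustrated by the following minimal sketch of one standard additive-sharing construction of a secure sum (integer values, arithmetic modulo a large public constant); the way shares are dealt out here is an illustrative choice of ours rather than a verbatim description of the protocol used by MultBoost.

import random

MODULUS = 2 ** 64  # public modulus, large enough that the true sum never wraps around

def make_shares(value, num_parties, rng):
    """Split an integer value into num_parties additive shares modulo MODULUS."""
    shares = [rng.randrange(MODULUS) for _ in range(num_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def secure_sum(private_values, rng=None):
    """Each party splits its private value eps_p into shares and sends one share to
    every party; each party then publishes only the sum of the shares it received,
    and the published partial sums add up to the true total."""
    rng = rng or random.Random()
    m = len(private_values)
    all_shares = [make_shares(v, m, rng) for v in private_values]
    received = [[all_shares[p][q] for p in range(m)] for q in range(m)]
    partial_sums = [sum(r) % MODULUS for r in received]  # the only values made public
    return sum(partial_sums) % MODULUS

# Real-valued errors would first be scaled to integers (e.g., counts of mistakes).
eps = [3, 7, 2, 9, 4]
assert secure_sum(eps) == sum(eps)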
There is an unexpected benefit that comes from the use of the anonymous
broadcast protocol. The final output of MultBoost is a weighted linear combination of the merged classifiers h(t) computed in each iteration. A merged
classifier is constructed from M weak classifiers whose descriptions are explicitly
contained inside the merged classifier but whose origins are unknown. Trying
to reconstruct the original data points (or even estimate the data distribution)
of a specific participant would require being able to trace back which classifiers belong to him, which is impossible unless we have some other information
sources. Even if the origins are known, reconstructing the data sets seems to be a
difficult problem; without the origins, the reconstruction is virtually impossible.
6.4 Privacy-preservation of the final classifier
It is an open question how to evaluate the information revealed by the description of a classifier in general, or how to compare the descriptions of two different classifiers from the aspect of privacy. Intuitively, “opaque” classifiers (such
as neural networks) seem better than “transparent” ones that either use the
training points explicitly (such as nearest neighbor classifiers or support vector
machines), or contain rules that can be easily reverse-engineered (such as deep
decision trees that contain only a few training points in each leaf). An upper
bound on the information revealed by a classifier may be obtained from the
length of its description, but it still does not tell us anything about the quality of
the obtained information. Even though there exists no comprehensive framework in which we could analyze BiBoost, we argue that the flexibility and the
robustness of the algorithm together with the incremental construction of the
classifier allow the participants to select the best settings for their particular privacy-preservation model. We start with some general comments in Sect. 6.4.1.
In Sect. 6.4.2 we consider a concrete privacy model based on k-anonymization,
and we describe a subroutine which can be added to the main boosting loop to
guarantee that the final classifier preserves the privacy of the participants’ data.
This method is general in the sense that it does not assume any particular form
of the base classifiers. In Sect. 6.4.3 we analyze the particular model when the
base classifiers are decision stumps from another angle of privacy-preservation.
6.4.1 General comments
We start with some general comments. First, in most of the learning algorithms
the functional form of the classifier is fixed. When the classifier is explicitly
related to the training points (SVM, nearest neighbor), privacy-preservation
becomes impossible, but even in other widely used algorithms (neural nets or
trees), the inherent rigidity of the model makes it difficult to adapt them to a
given privacy-preservation criterion. On the other hand, BiBoost gives great
flexibility to a user in selecting the family of weak classifiers in order to defend
against particular privacy breaches. Second, the inherent robustness of the
weak-classifier selection allows the user to adopt additional privacy-preserving
measures. Since we do not require the best weak classifier to be returned in each
iteration, the participants can add random noise to their data before finding a
weak classifier, or add noise to the classifier itself after selection. In fact, as we
argued in Sect. 4.1, such randomization may even improve the generalization
ability of the final classifier. Third, our protocol for selecting and merging the participants' weak classifiers allows Alice to refuse Bob's proposition if she determines
that the merged weak classifier (or the final classifier f (t−1) together with the
new merged weak classifier h(t) ) would reveal sensitive information on her
dataset. This protocol would generate an interesting trade-off: the “inner loop”
would decrease the “traditional” cryptographic security of the algorithm since
it would allow more information to leak during learning in order to increase
the privacy-preservation of the final classifier. Fourth, since the final classifier is
constructed incrementally, the participants can stop any time if they determine
that increasing the complexity of the final classifier would be detrimental to
privacy. If, in a particular application, privacy-preservation can be measured
numerically, the participants can even define a quantitative trade-off between
generalization error and privacy-preservation, and stop the algorithm when the
combined criterion cannot be further improved.
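As an illustration of this fourth point, the sketch below grows the classifier incrementally and stops when a weighted combination of validation error and a numeric privacy cost stops improving; the combined criterion, the weight lam, and the patience-based stopping rule are illustrative assumptions of ours, since the concrete privacy measure is left application-dependent.

def run_with_privacy_tradeoff(boost_one_iteration, validation_error, privacy_cost,
                              lam=1.0, patience=10, max_iterations=1000):
    """Grow the boosted classifier incrementally and stop when the combined
    criterion error + lam * privacy_cost has not improved for `patience` rounds.

    boost_one_iteration(t) performs iteration t and returns the current classifier,
    validation_error(f) estimates its generalization error, and privacy_cost(f) is
    an application-specific numeric measure of the privacy loss caused by f."""
    best_value, best_classifier, since_best = float("inf"), None, 0
    for t in range(1, max_iterations + 1):
        f = boost_one_iteration(t)
        value = validation_error(f) + lam * privacy_cost(f)
        if value < best_value:
            best_value, best_classifier, since_best = value, f, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_classifier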
6.4.2 Guaranteeing k-anonymity
For a concrete example, consider the definition of privacy proposed recently by
Chawla et al. (2005). Similarly to k-anonymization (Sweeney 2002), the goal in
this model is to “blend in with the crowd”. More formally, we would like each
data point to be indistinguishable from at least k − 1 other points, where k is
chosen by the participants. In the classification setup, this translates to avoiding
homogeneous decision regions with less than k (but more than zero) data points.
More concretely in BiBoost, given the final classifier f (T) we say that we cannot
distinguish two data points x1 and x2 if h(t) (x1 ) = h(t) (x2 ) for all t = 1, . . . , T.
To preserve the privacy of the data in this sense, the participants must verify
in each boosting iteration whether adding a new merged weak classifier would
create indistinguishable subsets of size less than k points in their respective data
sets. In the case when the new base classifier would generate such a region, they
can go back to line 4 (Fig. 2) and choose another subset H(t) , or terminate the
algorithm.
To verify whether a base classifier h^(t+1) would generate a "bad" region, the participants have to maintain the partitioning S^(t) = {S_1^(t), . . . , S_{k^(t)}^(t)} of their data sets generated by the base classifiers, where each subset S_i^(t) represents a homogeneous region, that is, ∀x, x′ ∈ S_i^(t), ∀j ≤ t : h^(j)(x) = h^(j)(x′). When the new base classifier h^(t+1) is added, it is sufficient to verify for each S_i^(t) whether there exist two points x, x′ ∈ S_i^(t) for which h^(t+1)(x) ≠ h^(t+1)(x′), and, in this case, split S_i^(t) into S_i^(t+1) = S_i^(t) ∩ {x : h^(t+1)(x) = h^(t+1)(x′)} and S_{i′}^(t+1) = S_i^(t) ∩ {x : h^(t+1)(x) ≠ h^(t+1)(x′)}. Empty sets do not have to be represented, so if h^(t+1)(x) = h^(t+1)(x′) for all the points x, x′ in S_i^(t), then the subset is simply copied to S_i^(t+1) = S_i^(t). To verify whether h^(t+1) would generate a "bad" region, we check whether |S_i^(t+1)| < k for all i = 1, . . . , k^(t+1). Since the number of subsets k^(t) cannot be larger than n, both the splitting and the verification can be done in O(n) time, so these operations do not increase the asymptotic running time of the algorithm.
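A minimal sketch of this bookkeeping, run by each participant on his own data set (the representation of regions as lists of point indices and the callable h_new are our own illustrative choices):

from collections import defaultdict

def refine_partition(partition, h_new, points):
    """Split every homogeneous region according to the output of the new base
    classifier: two points stay in the same region iff h_new gives them the same
    output. `partition` is a list of lists of point indices into `points`."""
    new_partition = []
    for region in partition:
        groups = defaultdict(list)
        for i in region:
            groups[h_new(points[i])].append(i)
        new_partition.extend(groups.values())  # empty groups never appear
    return new_partition

def creates_bad_region(partition, k):
    """A "bad" region is a non-empty homogeneous region with fewer than k points."""
    return any(len(region) < k for region in partition)

def try_to_add_base_classifier(partition, h_new, points, k):
    """Tentatively refine the partition with h_new; if a bad region would appear,
    keep the old partition so that another subset H(t) can be proposed instead."""
    candidate = refine_partition(partition, h_new, points)
    if creates_bad_region(candidate, k):
        return False, partition
    return True, candidate

# The initial partition puts all local points into a single region:
# partition = [list(range(len(points)))]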
6.4.3 Privacy guarantees for decision stumps
If the participants choose decision stumps as weak classifiers, the final classifier
preserves privacy in the following very strong sense. It is clear that just by looking at the set of selected decision stumps, we cannot extract information other
than the set of possible values of individual attributes (and even extracting all
the projections is non-trivial, sometimes impossible). In other words, f contains
information only on the marginal distribution of the data points, and there is no
way to “connect” the projections. For simplicity, assume that all the n projections in all the d dimensions are different, and we manage to find them at least
within an error interval. The number of possible data sets consistent with these
"measurements" is (n!)^(d−1), and the probability of finding a data point at an
arbitrary combination of the projections ("pinpointing" a data point) is 1/n^(d−1).
If projections can coincide, the analysis is more complicated but the number
of consistent combinations is still exponentially large both in the number of
dimensions and the number of data points.
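As a purely illustrative calculation with made-up values of n and d, the two quantities above can be evaluated as follows:

from math import factorial

n, d = 100, 10                                   # made-up values for illustration
consistent_data_sets = factorial(n) ** (d - 1)   # (n!)^(d-1)
pinpoint_probability = 1.0 / n ** (d - 1)        # 1/n^(d-1)

print("number of digits of (n!)^(d-1):", len(str(consistent_data_sets)))
print("pinpointing probability:", pinpoint_probability)  # 1e-18 for these values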
7 Empirical results
In this section we present experimental results with BiBoost (Sect. 7.1) and
MultBoost (Sect. 7.2).
7.1 BiBoost
We tested BiBoost on three benchmark binary classification problems from
the UCI (University of California at Irvine) data repository (Blake and Merz
1998): the “sonar”, the “spambase”,14 and the “ionosphere” data sets. We use
9-fold cross validation in the following manner. We split the initial data set
Dn into 9 sets of equal size T1 , . . . , T9 , and conduct 9 trials. For each trial
i, the test set is Ti , and the training set is composed of the union of all the
other remaining sets Tj, for j ≠ i. The baseline for the comparison is the plain,
non-distributed AdaBoost, which uses the entire training set in each trial. For
testing BiBoost, we split the training sets into two equal parts. At trial i, Alice’s
training set is composed of the four blocks Ti+1 (mod 9) , . . . , Ti+4 (mod 9) located
after Ti , and the training set of Bob is composed of the four remaining blocks,
Ti+5 (mod 9) , . . . , Ti+8 (mod 9) . The test and training errors are averaged over the
trials. Fig. 4 shows the test and training error curves of one run of BiBoost using
the CommonSubset alternative on the sonar data set.
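The splitting scheme described above can be sketched as follows (the blocks are formed here by simple round-robin slicing; any partition into nine equal-sized blocks works the same way):

def nine_fold_bipartite_splits(data):
    """Yield (alice_train, bob_train, test) for the nine trials: trial i uses block
    T_i as test set, the four blocks following it cyclically as Alice's training
    set, and the remaining four blocks as Bob's training set."""
    blocks = [data[j::9] for j in range(9)]   # nine blocks of (almost) equal size
    for i in range(9):
        test = blocks[i]
        alice = [x for j in range(1, 5) for x in blocks[(i + j) % 9]]
        bob = [x for j in range(5, 9) for x in blocks[(i + j) % 9]]
        yield alice, bob, test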
Figs. 5, 6 and 7 show the behavior of different boosting algorithms on the
three data sets. In each figure, the left graph in the top row displays the results
obtained by applying standard AdaBoost on the union of Alice’s and Bob’s
training sets. The error curves in the right graph in the top row were generated
by running AdaBoost on Alice’s and Bob’s training sets separately (we call this
version separated AdaBoost), and averaging the training and test errors. These
two experiments provide us with lower and upper baselines. With standard
AdaBoost, we use all the training data, and with separated AdaBoost we do
14 In the experiments with the spambase dataset, we used only 400 randomly selected points out
of the 4,000 original points.
Fig. 4 Evolution of Alice’s training error, Bob’s training error and the test error of BiBoost during
one run of the CommonSubset alternative on the sonar data set
not communicate at all during the boosting iterations. Hence, we expect that
BiBoost will perform between these two extreme cases.
Surprisingly, in our preliminary experiments we found that BiBoost’s test
error was much closer to the test error of standard AdaBoost than that of
separated AdaBoost. We suspected that BiBoost’s excellent performance may
partly be explained by the inherent randomization when the CommonSubset
and RandomSubset alternatives are used to select weak classifier sets. It has
been suggested by Friedman (2002) that AdaBoost’s performance may improve
if randomization is introduced into the weak classifier selection. The argument that randomization of weak classifiers can improve generalization has
also appeared in a more general context in Amit et al. (2000) and Kolcz et al.
(2002). Although these approaches are slightly different from ours, the result
is similar: in each iteration, we select a good but suboptimal weak classifier.
To imitate the specific randomization that BiBoost uses, in each iteration, we
choose a random subset Hj from among the subsets that contain an admissible weak classifier, and select the optimal weak classifier in Hj . The results of
the resulting variants (which we call randomized AdaBoost and randomized
separated AdaBoost) are depicted in the second row of each figure.
The two bottom rows of Figs. 5–7 show BiBoost's training and test errors.
The first three plots correspond to the three alternatives described in Sect. 4.1.
The second figure of the bottom row displays the results that we obtained with a
modified version of the WholeSet alternative. In preliminary experiments, we
found that the non-randomized version (WholeSet alternative) of BiBoost can
easily and rapidly saturate. Saturation occurs when Alice and Bob pick opposite
weak classifiers15 hA = −hB . In this situation, we suggest in the first remark
after Lemma 1 to go back to line 4 (Fig. 2), and choose another admissible
15 More precisely, it is sufficient if for all data points x_i, h^A(x_i) = −h^B(x_i).
[Figure 5 contains eight plots of error rate versus the number of iterations t (logarithmic scale), showing training and test error for standard AdaBoost, separated AdaBoost, randomized AdaBoost, randomized separated AdaBoost, and the WholeSet, CommonSubset, RandomSubset, and modified WholeSet alternatives of BiBoost on the sonar data set.]
Fig. 5 Comparison of standard AdaBoost, randomized AdaBoost, and four alternatives of
BiBoost on the sonar data set during 1000 iterations
[Figure 6 contains the corresponding eight plots of error rate versus t for the spambase data set.]
Fig. 6 Comparison of standard AdaBoost, randomized AdaBoost, and four alternatives of BiBoost on the spambase data set during 1,000 iterations
[Figure 7 contains the corresponding eight plots of error rate versus t for the ionosphere data set.]
Fig. 7 Comparison of standard AdaBoost, randomized AdaBoost, and four alternatives of BiBoost on the ionosphere data set during 1,000 iterations
subset H(t) . In the WholeSet alternative, however, there is only one set that
contains all weak classifiers. To resolve this problem, we decided to switch to the
RandomSubset alternative for one iteration when saturation occurs. The error
curves indicate that, although this trick can work a certain number of times,
there always comes a moment when saturation is inevitable and happens with
nearly every possible admissible subset.
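In code, the saturation test of footnote 15 is the simple check below (h_alice and h_bob stand for the two selected weak classifiers; the function is a sketch, not part of the original algorithm description):

def is_saturated(h_alice, h_bob, points):
    """Saturation in the sense of footnote 15: the two selected weak classifiers
    disagree on every data point, that is, h_alice(x) = -h_bob(x) for all x."""
    return all(h_alice(x) == -h_bob(x) for x in points)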
The error curves confirm the “common sense” observation that boosting is
relatively immune to overfitting: the test error curves are flat during a large
span of iterations, and the asymptotic test error is usually not much larger than
the minimum. Nevertheless, overfitting does happen, and the different versions of the algorithm converge at different speeds, so, for fair comparison, the
algorithms should be stopped early after a number of iterations validated on
hold-out data. In the standard, non-separated case we used simple validation.
In each of the nine cross-validation experiments the training data T_tr is first split
further into training and validation sets T′_tr and T_val (at the rate of 2:1 in our
experiments). Then we run the algorithm on T′_tr and measure the error on T_val.
To smooth the error curve, we average over windows of 5 iterations and choose
the middle of the window in which the validation error is minimal as the optimal
number of iterations T̂. Then we rerun the algorithm on T_tr = T′_tr ∪ T_val, and
measure the error on the hold-out test set after T̂ iterations. The final error is
then the average of the validated test errors over the cross validation folds.
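The window-based choice of T̂ described above can be sketched as follows, assuming the per-iteration validation errors have already been collected in a list:

def optimal_number_of_iterations(validation_errors, window=5):
    """Average the validation error over sliding windows of `window` iterations and
    return the (1-based) middle iteration of the window with the smallest average;
    this is the estimate of the optimal number of iterations used above."""
    best_t, best_avg = None, float("inf")
    for start in range(len(validation_errors) - window + 1):
        avg = sum(validation_errors[start:start + window]) / window
        if avg < best_avg:
            best_avg, best_t = avg, start + window // 2 + 1
    return best_t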
In a distributed environment where the privacy of the data should be preserved, validation adds another layer of complexity to the algorithms. Finding
an efficient and privacy-preserving validation protocol is a crucial question in
every learning method where the learned classifier is sensitive to the choice of
complexity hyperparameters (such as the number of neurons or the depth of
the decision tree). The most privacy-preserving scheme is to let each participant
validate the hyperparameters separately on their data sets, and then somehow
combine the estimated parameters. The main drawback of this approach is
that the separate training sets are much smaller than the unified training sets.
Assuming that the optimal complexity grows with the size of the training set, this
procedure can seriously underestimate the optimal complexity. For small data
sets the resulting validation sets can also be very small, resulting in a parameter estimate with a high variance. In our case, we found that this protocol stopped
BiBoost far too early, and the problem was particularly accentuated with more
than two participants (Sect. 7.2).
To avoid this problem, we adopted the following protocol. In each cross-validation fold, both Alice and Bob split their training sets into T^A_tr/T^A_val and T^B_tr/T^B_val. They run BiBoost using their training sets D^A = T^A_tr and D^B = T^B_tr, and estimate the hyperparameter based on the error measured on the unified validation set T^A_val ∪ T^B_val. They do not have to actually communicate their validation
points (which would be a major privacy breach), only the error of the combined
classifier f (t) measured on their sets. This results in a minor leak that is similar to the leak that occurs in lines 8 and 9 (in Fig. 2) when they exchange the
weighted training error of the weak classifier. The advantage of the protocol
Table 2 Test errors (with standard deviations) of the different algorithms after an optimal number
of iterations estimated using cross-validation

                                  Sonar            Spambase         Ionosphere
Standard AdaBoost                 0.164 (0.054)    0.070 (0.037)    0.100 (0.037)
Separated AdaBoost                0.212 (0.084)    0.098 (0.036)    0.114 (0.040)
Randomized AdaBoost               0.202 (0.082)    0.062 (0.034)    0.083 (0.029)
Randomized separated AdaBoost     0.257 (0.065)    0.076 (0.033)    0.092 (0.023)
BiBoost (WholeSet)                0.226 (0.085)    0.074 (0.049)    0.108 (0.040)
BiBoost (CommonSubset)            0.197 (0.049)    0.052 (0.034)    0.108 (0.020)
BiBoost (RandomSubset)            0.202 (0.070)    0.072 (0.033)    0.085 (0.031)
BiBoost (modified WholeSet)       0.226 (0.079)    0.070 (0.047)    0.102 (0.038)
is that the “effective” sizes of the training and validation sets are the same
as in the non-separated case, and they do not depend on the number of participants. Although this protocol leaks more information on the participants’
data sets than the first scheme, we feel that this is a good compromise between
privacy preservation and generalization performance. Table 2 summarizes the
results obtained using this validation technique for BiBoost and “traditional”
validation for AdaBoost and randomized AdaBoost.
By observing the results on the different data sets, the following conclusions
can be drawn. First note that if Alice and Bob were to run AdaBoost alone on
their own training sets only (separated AdaBoost), then the performance of the
resulting classifier would be significantly worse than if they were running the
algorithm on the complete training set composed of the union of their databases
(standard AdaBoost). This is not really surprising since usually the accuracy of a
classifier grows with the number of training data points. The second observation
is that randomized AdaBoost is usually better than standard AdaBoost both
for the complete training set and for the separated training sets, as suggested
in Amit et al. (2000), Friedman (2002) and Kolcz et al. (2002). Note, however,
that in our case there is a small price to pay with randomized AdaBoost, that
is, a slower convergence of the training error (roughly half of that of the standard AdaBoost). We can observe a similar relation between the deterministic
(WholeSet alternative) and randomized (CommonSubset and RandomSubset
alternatives) versions of BiBoost.
In general, the CommonSubset and RandomSubset alternatives of BiBoost
perform reasonably well. In particular, their test errors are always lower than
separated AdaBoost's test error, and usually also lower than randomized separated AdaBoost's test error, although here the difference is less substantial. Their test
errors are also only slightly higher than standard (randomized) AdaBoost’s test
error, which means that BiBoost performs close to its lower limit. In general,
the test error in the WholeSet alternative is slightly higher than the test error
of the CommonSubset and RandomSubset alternatives. On the other hand, this
alternative converges faster than the others, so, in cases where sparseness of the
final classifier (a low number of weak classifiers) is important, this alternative
can also be useful.
Fig. 8 Comparison of randomized AdaBoost, randomized separated AdaBoost, and MultBoost
with 25 and 50 participants on the pendigits data set during 15,000 iterations
7.2 Empirical results with MultBoost
We tested the MultBoost algorithm on the pendigits dataset which also comes
from the UCI repository. We chose not to reuse the ionosphere, sonar and
spambase datasets as we did with BiBoost because these datasets do not contain enough data points if they are split among a high number of participants.
We tried to use the pendigits dataset also with BiBoost but we could not observe
any significant difference between the standard and separated versions of AdaBoost and randomized AdaBoost, probably due to the large size of the data
set. The pendigits dataset contains 7,494 data points, each represented by 16
attributes. Originally, this dataset was designed for multi-class classification with a total of 10 classes (one for each digit ranging from 0 to 9). We
chose instead to transform it into a binary classification task by assigning the
negative class to all even numbers and the positive class to the odd numbers.
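The relabeling is the trivial mapping sketched below (pendigits_examples is a hypothetical list of (attributes, digit) pairs):

def to_binary_labels(pendigits_examples):
    """Map the ten pendigits classes to a binary task: odd digits become the
    positive class (+1) and even digits the negative class (-1)."""
    return [(attributes, +1 if digit % 2 == 1 else -1)
            for attributes, digit in pendigits_examples]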
We tested MultBoost with M = 25 and 50 participants. This time, due to
the size of the dataset, we used 2-fold validation instead of 9-fold. This means
that for each fold, half of the dataset was used for testing, while the remaining
half was used for training and split equally between the participants. Since the
total number of training points is fixed, it is possible to observe how MultBoost behaves when the total amount of information remains the same but the
number of participants changes. During the experiments we observed that it is
not necessary to make each participant "vote" in each iteration to obtain the
merged classifier (13). In particular, we ran MultBoost with 3, 4, 5 and M voters
and compared the test errors. The actual voters were selected randomly in each
iteration. Fig. 8 shows the test curves for randomized AdaBoost, randomized
separated AdaBoost, and MultBoost with M voters on the pendigits data set.
Similarly to BiBoost, the optimal number of iterations T̂ was estimated using simple validation. In the case of MultBoost, we generalized the approach outlined in Sect. 7.1. During the validation pass, each participant p separates his individual dataset into a training set T^p_tr and a validation set T^p_val. The participants then execute MultBoost using their training sets T^p_tr and record the error of the resulting classifier on their validation sets in each iteration. To compute the validation error, they do not have to disclose their errors; instead, they can run the secure sum computation protocol (described in Sect. ). As in BiBoost, we smoothed the error curve by averaging over 5 iterations and set T̂ to the middle of the minimum error window. Finally, we re-ran MultBoost using the unified training sets T^p_tr ∪ T^p_val, stopping after the estimated optimal number of iterations T̂. Table 3 summarizes the results.
Table 3 Test errors (with standard deviations) of the different algorithms obtained on the pendigits
dataset after an optimal number of iterations estimated using simple validation

                                  25 participants    50 participants
Standard AdaBoost                 0.0502 (0.0051)
Randomized AdaBoost               0.0550 (0.0027)
Separated AdaBoost                0.1412 (0.0029)    0.1807 (0.0001)
Randomized separated AdaBoost     0.1346 (0.0016)    0.1701 (0.0019)
MultBoost (3 voters)              0.0620 (0.0013)    0.0551 (0.0024)
MultBoost (4 voters)              0.0518 (0.0012)    0.0531 (0.008)
MultBoost (5 voters)              0.0543 (0.0032)    0.0562 (0.0015)
MultBoost (M voters)              0.0582 (0.0033)    0.0586 (0.0053)
Table 4 Test errors (with standard deviations) of MultBoost on the pendigits dataset with 25%
and 50% of the attributes removed

                        25% of the attributes removed         50% of the attributes removed
                        25 participants   50 participants     25 participants   50 participants
MultBoost (3 voters)    0.0559 (0.0017)   0.0591 (0.0009)     0.0943 (0.0006)   0.0903 (0.0020)
MultBoost (5 voters)    0.0572 (0.0052)   0.0642 (0.0009)     0.0945 (0.0056)   0.0899 (0.0021)
It is clear that the higher the number of participants, the more we gain by
using MultBoost instead of separated AdaBoost. In particular, there is a significant difference of roughly 7–8% in accuracy between separated AdaBoost (both standard
and randomized) and MultBoost for 25 participants, and of more than 10% for
50 participants. Note also that we do not require all the participants to output
one of their weak classifiers in each iteration since MultBoost with only 3, 4,
or 5 voters per iteration performs as well as MultBoost with all the participants voting in each iteration. This is important for privacy-preservation since
it means that we can achieve high accuracy with little exchange of information.
In Sect. 6.4 we argued that, to ensure that the final classifier itself preserves
the privacy of the data sets, the participants may restrict the pool of base classifiers used to build the final linear combination. In a set of experiments, we
tested how such a limitation affects the performance of the learned classifier.
In particular, we randomly removed 25% and 50% of the original attributes
(features), and ran the algorithm on the truncated data sets. Table 4 summarizes
the results. Although the errors are worse than when all the attributes are used,
they are still significantly better than separated randomized AdaBoost. This
suggests that even in the case where some attributes are considered sensitive,
running MultBoost with a restricted set of attributes can still lead to a better
performance than running AdaBoost alone.
8 Conclusion and future directions
In this paper, we have described two algorithms, BiBoost (Bipartite Boosting)
and MultBoost (Multiparty Boosting), that allow two or more participants to
construct a boosting classifier without explicitly sharing their data sets. We have
analyzed both the computational and the security aspects of the algorithms. On
the one hand, the algorithms inherit the excellent generalization performance
of AdaBoost. We have demonstrated in experiments on benchmark data sets
that the algorithms are better than AdaBoost executed separately by the participants, and that they perform close to AdaBoost executed using the entire
training set. On the other hand, the algorithms are almost privacy-preserving
under our security model. Although the participants do exchange implicit information on their data sets, it is unclear at this point how this information could
be used (and whether it can be used at all) to reverse-engineer the training data
sets.
In this framework, we could also analyze the impact of using other families
of weak classifiers, such as decision trees or neural networks with a few hidden
units. It is clear that by using “stronger” weak classifiers, the number of iterations required for convergence could be reduced, so the participants would
have less time to reverse-engineer the training data sets. On the other hand,
this would come at the price of communicating more bits in each iteration to
describe the weak classifiers, which could potentially be more damaging for
privacy. It is also clear that the privacy-preserving aspect cannot be captured
only by traditional measures of complexity: the particular form of the weak
classifiers can be equally important. For example, some classifiers, such as nearest neighbor, or Support Vector Machines (Cortes and Vapnik 1995), explicitly
contain some or all the training data points in their description, so using them
as weak classifiers would obviously breach privacy, independently of whether
they are complex or not in an information theoretical sense.
This dilemma can also be viewed as a specific case of a broader issue, namely
that the “traditional” model of secure multiparty computation does not seem
to be adequate here. In particular, the participants are not interested in how
much information is leaked with respect to the information contained in f; rather, they
would like to control the total direct information on their data sets that is communicated using the protocols, including the information contained in f . In
the extreme case of f being a nearest neighbor classifier, f would contain the
description of all data points, whereas it could be computed totally securely
under the traditional paradigm of secure multiparty computation. Although
this is a crucial question, it is often overlooked in the literature on cryptography-based privacy-preserving algorithms.
We see two research directions that could be explored to solve this problem. First, some studies consider the secure multiparty computation of an
approximation of a function f instead of the function itself (Feigenbaum et al.
2001). The usual situation when such an algorithm can be useful is when f is hard
to compute. In our case, the focus would be different: we would use an approximation to hide information about f . In the privacy-preserving protocol of ID3
(Lindell and Pinkas 2002), the authors use an approximation of the original
algorithm both because it is more efficient to compute in a privacy-preserving
manner and also because it leaks less information on the participants’ inputs.
The second direction to explore is to introduce some kind of randomization
into the learning process. Some algorithms, for example neural networks with
random initial weights, could be used in a natural way, while some others could
be changed explicitly to include randomization. As we indicated in Sect. 6.4,
BiBoost and MultBoost can easily accommodate both approaches.
Acknowledgements We would like to acknowledge the help and give our deepest thanks to Gilles
Brassard for all the useful discussions we had with him and for all the work he has done on reviewing an early version of this paper. We would also like to thank both reviewers for their insightful
and detailed comments that greatly improved the paper. The authors are supported in part by the
Natural Sciences and Engineering Research Council of Canada.
References
Agrawal D, Aggarwal CC (2001) On the design and quantification of privacy preserving data mining
algorithms. In: Proceedings of the 20th ACM symposium of principles of databases systems,
pp 247–255
Agrawal R, Srikant R (2000) Privacy-preserving data mining. In: Proceedings of the ACM SIGMOD international conference on management of data, pp 439–450
Aïmeur E, Brassard G, Gambs S, Kégl B (2004) Privacy-preserving boosting. In: Proceedings of
the international workshop on privacy and security issues in data mining, in conjunction with
PKDD’04, pp 51–69
Amit Y, Blanchard G, Wilder K (2000) Multiple randomized classifiers: MRCL. Technical Report
496, Department of Statistics, University of Chicago
Atallah MJ, Bertino E, Elmagarmid AK, Ibrahim M, Verykios VS (1999) Disclosure limitations
of sensitive rules. In: Proceedings of the IEEE knowledge and data engineering workshop, pp
45–52
Bayardo R, Agrawal R (2005) Data privacy through optimal k-anonymization. In: Proceedings of
the 21st IEEE international conference on data engineering, pp 217–228
Ben-Or M, Goldwasser S, Wigderson A (1988) Completeness theorems for non-cryptographic
fault-tolerant distributed computation. In Proceedings of the 20th ACM annual symposium on
the theory of computing, pp 1–10
Bertino E, Fovino IN, Provenza LP (2005) A framework for evaluating privacy preserving data
mining algorithms. Data Mining Knowledge Discovery 11(2):121–154
Blake CL, Merz CJ (1998) UCI repository of machine learning databases. Available at
http://www.ics.uci.edu/∼mlearn/MLRepository.html
Chang L, Moskowitz IL (2000) An integrated framework for database inference and privacy
protection. In: Proceedings of data and applications security, pp 161–172
Chang Y-C, Lu C-J (2001) Oblivious polynomial evaluation and oblivious neural learning. In:
Proceedings of Asiacrypt’01, pp 369–384
Chaum D (1981) Untraceable electronic mail, return addresses, and digital pseudonyms. Commun
ACM 24(2):84–88
Chaum D, Crépeau C, Damgård I (1988) Multiparty unconditionally secure protocols. In: Proceedings of the 20th ACM annual symposium on the theory of computing, pp 11–19
Chaum D, Damgård I, van de Graaf J (1987) Multiparty computations ensuring privacy of each
party’s input and correctness of the result. In: Proceedings of Crypto’87, pp 87–119
Chawla S, Dwork C, McSherry F, Smith A, Wee H (2005) Towards privacy in public databases. In:
Proceedings of the 2nd theory of cryptography conference, pp 363–385
Clifton C, Kantarcioglǔ M, Vaidya J (2004) Data mining: next generation challenges and future
directions, chapter Defining privacy for data mining. AAAI/MIT Press
Clifton C, Kantarcioglǔ M, Vaidya J, Lin X, Zhu MY (2002) Tools for privacy preserving distributed
data mining. SIGKDD Explor 4(2):28–34
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
Dinur I, Nissim K (2003) Revealing information while preserving privacy. In: Proceedings of the
22nd ACM SIGACT-SIGMOD-SIGART symposium on principles of databases systems, pp
202–210
Evfimievski A (2002) Randomization in privacy preserving data mining. SIGKDD Explor 4(2):
43–48
Evfimievski A, Gehrke JE, Srikant R (2003) Limiting privacy breaches in privacy preserving data
mining. In: Proceedings of the 22nd ACM SIGACT-SIGMOD-SIGART symposium on principles of databases systems, pp 211–222
Fan W, Stolfo SJ, Zhang J (1999) The application of AdaBoost for distributed, scalable and on-line
learning. In: Proceedings of the 5th ACM SIGKDD international conference on knowledge
discovery and data mining, pp 362–366
Feigenbaum J, Ishai Y, Malkin T, Nissim K, Strauss M, Wright R (2001) Secure multiparty computation of approximations. In: Proceedings of the 28th international colloquium on automata,
languages and programming, pp 927–938
Fienberg SE, McIntyre J (2004) Data swapping: variations on a theme by Dalenius and Reiss. In:
Proceedings of privacy in statistical databases, pp 14–29
Freedman M, Nissim K, Pinkas B (2004) Efficient private matching and set intersection. In: Proceedings of Eurocrypt’04, pp 1–19
Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an
application to boosting. J Comput System Sci 55:119–139
Friedman J (2002) Stochastic gradient boosting. Comput Stat Data Anal 38(4):367–378
Furukawa J, Sako K (2001) An efficient scheme for proving a shuffle. In: Proceedings of Crypto
2001, pp 368–387
Goldreich O (2004) Foundations of cryptography, volume II: basic applications. Cambridge University Press
Goldreich O, Micali S, Wigderson A (1987) How to play any mental game – A completeness theorem for protocols with honest majority. In: Proceedings of the 19th ACM symposium on theory
of computing, pp 218–229
Iyengar V (2002) Transforming data to satisfy privacy constraints. In: Proceedings of the 8th ACM
SIGKDD international conference on knowledge discovery and data mining, pp 279–288
Kalyanasundaram B, Schnitger G (1987) The probabilistic communication complexity of set intersection. In:
Proceedings of the 2nd annual IEEE conference on structure in complexity theory, pp 41–47
Kantarcioglǔ M, Clifton C (2004a) Privacy-preserving distributed k-nn classifier. In:
European conference on principles of data mining and knowledge discovery,
pp 279–290
Kantarcioglǔ M, Clifton C (2004b) Privacy-preserving distributed mining of association rules on
horizontally partitioned data. IEEE Trans Knowl Data Eng 16(9):1026–1037
Kantarcioglǔ M, Jin J, Clifton C (2004) When do data mining results violate privacy? In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data
mining, pp 599–604
Kantarcioglǔ M, Vaidya J (2004) Privacy preserving naive bayes classifier for horizontally partitioned data. In: Proceedings of the workshop on privacy preserving data mining held in
association with the third IEEE international conference on data mining
Kégl B (2003) Robust regression by boosting the median. In: Proceedings of the 16th conference
on computational learning theory, pp 258–272
Kissner L, Song D (2005) Privacy-preserving set operations. In: Proceedings of Crypto 2005,
pp 241–257
Kolcz A, Xiaomei S, Kalita J (2002). Efficient handling of high-dimensional feature spaces by
randomized classifier ensembles. In: Proceedings of SIGKDD’02, pp 307–313
Kruger L, Jha S, McDaniel P (2005) Privacy preserving clustering. In: Proceedings of the 10th
European symposium on research in computer security, pp 397–417
Lazarevic A, Obradovic Z (2002) Boosting algorithms for parallel and distributed learning. Distrib
Parallel Databases 11(2):203–229
Lindell Y, Pinkas B (2002) Privacy preserving data mining. J Cryptol 15:177–206
Paillier P (2000) Public-key cryptosystems based on composite degree residuosity classes. In: Proceedings of Asiacrypt’00, pp 573–584
Neff A (2001) A verifiable secret shuffle and its application to e-voting. In: ACM CCS, pp 116–125
Predd JB, Kulkarni SR, Poor HV (2006) Consistency in models for distributed learning under communication constraints. IEEE Trans Inf Theory 52(1):52–63
Quinlan J (1986) Induction of decision trees. Mach Learn 1(1):81–106
Rabin T, Ben-Or M (1989) Verifiable secret sharing and multiparty protocols with honest majority.
In: Proceedings of the 21st ACM symposium on theory of computing, pp 73–85
Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions.
Mach Learn 37(3):297–336
Shamir A (1979) How to share a secret. Communications of the ACM 22(11):612–613
Sweeney L (2002) Achieving k-anonymity privacy protection using generalization and suppression.
Int J Uncertainty, Fuzziness, Knowledge-based Syst 10(5):571–588
Valiant L (1984) A theory of the learnable. Communications of the ACM 27(11):1134–1142
Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art
in privacy preserving data mining. SIGMOD Record 33(1):50–57
Yao AC (1986) How to generate and exchange secrets. In: Proceedings of the 27th IEEE symposium
on foundations of computer science, pp 162–167
Yu H, Jiang X, Vaidya J (2006) Privacy-preserving SVM using nonlinear kernels on horizontally
partitioned data. In: Proceedings of the 21st annual ACM symposium on applied computing,
pp 603–610