Data Min Knowl Disc (2007) 14:131–170
DOI 10.1007/s10618-006-0051-9

Privacy-preserving boosting

Sébastien Gambs · Balázs Kégl · Esma Aïmeur

Received: 16 April 2005 / Accepted: 10 May 2006 / Published online: 26 January 2007
© Springer Science+Business Media, LLC 2007

Abstract  We describe two algorithms, BiBoost (Bipartite Boosting) and MultBoost (Multiparty Boosting), that allow two or more participants to construct a boosting classifier without explicitly sharing their data sets. We analyze both the computational and the security aspects of the algorithms. The algorithms inherit the excellent generalization performance of AdaBoost. Experiments indicate that the algorithms are better than AdaBoost executed separately by the participants, and that, independently of the number of participants, they perform close to AdaBoost executed using the entire data set.

Keywords  Privacy-preserving data mining · Boosting · AdaBoost · Distributed learning · Secure multiparty computation

Responsible Editor: Charu Aggarwal.

Sébastien Gambs · Balázs Kégl (B) · Esma Aïmeur
Department of Computer Science and Operations Research, University of Montreal,
C. P. 6128, Succ. Centre-Ville, Montréal, Québec, Canada, H3C 3J7
e-mail: [email protected]
Sébastien Gambs e-mail: [email protected]
Esma Aïmeur e-mail: [email protected]

1 Introduction

The principal goal of data mining can be described as finding useful information in a vast amount of data. The recent appearance of sources of huge, heterogeneous data has made it difficult to use traditional methods developed for small, well-structured databases, so data mining has been increasingly employing techniques developed in artificial intelligence and statistical machine learning. Data mining is now used in a wide range of areas such as finance, bioinformatics, and astrophysics, to name a few. Secure multiparty computation is a branch of cryptography.
Its goal is to realize distributed tasks in a secure manner, where the definition of security can take different flavors, such as preserving the privacy of the data or protecting the computation against malicious participants. A typical task is to compute some function $f(x, y)$ in which input $x$ is in the hands of one participant and input $y$ is in the hands of the other. For the computation to be considered totally secure, the two participants should learn nothing after the completion of the task, except for what can be inferred from their own input and the function's output. The first technique that enabled the implementation of an arbitrary probabilistic computation between two participants in a secure manner was given by Yao (1986). His results were later generalized to multiple participants. First, Chaum et al. (1987) proposed computational protection under cryptographic assumptions. The model was further extended to provide unconditional protection under the assumption that at least some proportion of the participants are honest (Ben-Or et al. 1988; Chaum et al. 1988). Although these methods are universal and general, they can be very inefficient in terms of both communication and computational complexity when the inputs are large and the function to compute is relatively complex. In this paper, we study the intersection of data mining and secure multiparty computation. In particular, we consider the task of constructing a classifier in a distributed and secure manner. In machine learning, the goal of classification is to predict the class $y$ of an object based on a vector $\mathbf{x} = (x^{(1)}, \ldots, x^{(d)})$ of observations¹ about this object. Typically, the classifier is learned by using a finite training set $D_n = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, where each element consists of a vector $\mathbf{x}_i$ of $d$ observations and its class $y_i$, given by an expert or obtained by observing the past.
For example, $\mathbf{x}$ can contain a set of descriptors or attributes of a bank's client, and $y$ can represent a binary decision on whether the client is a good candidate for a loan. There are numerous high-performance learning algorithms that can be used for this task. The objective of this paper is to extend one of these algorithms, AdaBoost (Freund and Schapire 1997), to the case when the training data $D_n$ is shared between two or more participants. If the participants wish to fully disclose their databases to each other, they can use a standard learning algorithm on the union of their data sets. There are several application domains, however, where data sets are highly confidential (client records of financial institutions, medical records, source code of proprietary software systems, etc.), so participants are reluctant to directly communicate their databases. At the same time, they might want to collaborate in order to perform a task of mutual interest (for example, two banks might want to design a system that detects potential cases of fraud). The goal of such collaboration is to design a classifier that performs better than the classifiers that the participants could learn using only their separate data sets while, at the same time, disclosing as little information as possible about their data sets. Two principal security models are generally considered in the literature (Goldreich 2004). In the first model, the participants are called semi-honest (also known as passive or honest-but-curious). This model corresponds to the situation where the participants follow the execution of their prescribed protocols without any attempt to cheat, but they try to learn as much information as possible about the other participant's data by analyzing the information exchanged during the protocol.

¹ We will use bold symbols to denote real vectors throughout the paper.
This model is weaker than if we had allowed participants to be malicious and to cheat actively during the protocol, but it is nevertheless useful and almost always considered in privacy-preserving data mining. For the simplicity of the analysis, we will mainly focus on the semi-honest model. Note, however, that there are techniques to upgrade the model so that it can deal with malicious participants, at the cost of increasing the complexity (both computational and communicational). AdaBoost (Freund and Schapire 1997) is one of the best off-the-shelf learning methods developed in the last decade. It constructs a classifier in an incremental fashion by adding simple classifiers to a pool and using their weighted "vote" to determine the final classification. The algorithm can be extended to multiparty classification in a natural way and, as we will show, it performs outstandingly on benchmark data. Although we cannot achieve total security (which would require that all the information exchanged be derivable from the final classifier), the information overhead is minimal during the protocol, and it is unclear whether it can be used at all to reverse-engineer the training data sets. Throughout the paper, we will consider binary classification, where the class $y$ of every observation is −1 or +1. The algorithm can be extended to multiclass classification and to regression along the lines of Schapire and Singer (1999) and Kégl (2003), respectively. Note that similar algorithms have been proposed by Fan et al. (1999) and Lazarevic and Obradovic (2002); they extended AdaBoost to distributed learning but did not analyze the security and privacy aspects of the proposed methods. The outline of this paper is as follows. We present privacy-preserving data mining in Sect. 2 and AdaBoost in Sect. 3. In Sect. 4, we introduce BiBoost (Bipartite Boosting), the extension of AdaBoost to the case of two participants, and we analyze its communication and computational complexity. In Sect.
5, we propose MultBoost (Multiparty Boosting), a further extension of the algorithm to more than two participants. In Sect. 6, we describe some privacy models commonly used in data mining, and discuss the cryptographic security of BiBoost and MultBoost as well as the impact of the obtained classifier on privacy. In Sect. 7, we demonstrate the algorithms' performance on benchmark data sets, and we compare them to the standard AdaBoost algorithm. Finally, we give some possible generalizations of our approach and conclude in Sect. 8.

2 Privacy-preserving data mining

Today, the Internet makes it possible to reach and connect sources of information throughout the world. At the same time, many questions are raised concerning the security and privacy of the data. Privacy-preserving data mining is an emerging field that studies how data mining algorithms affect the privacy of data, and tries to find and analyze new algorithms that preserve this privacy. Recently, Verykios et al. (2004) presented an overview of the field. In particular, they identified three different approaches that deal with privacy-preserving issues in data mining. The algorithms of the first approach perturb the values of selected attributes of individual records before communicating the data sets. These modifications can include altering the value of an attribute $x_i^{(j)}$ by perturbing it (Atallah et al. 1999) or replacing it with the "unknown" value (Chang and Moskowitz 2000), swapping the values $x_i^{(j)}$ and $x_{i'}^{(j)}$ of an attribute between two records (Fienberg and McIntyre 2004), or using a coarser granularity by merging several possible values of an attribute (Chang and Moskowitz 2000). Of course, this technique (called sanitization) will increase uncertainty (noise) in the data, which makes the learning task more difficult. The objective of these algorithms is to find a satisfactory trade-off between the privacy of the data and the accuracy of the learned classifier.
A specific example of a sanitization method is the k-anonymization procedure (Iyengar 2002; Sweeney 2002), which proceeds through generalizations and suppressions of attribute values. In particular, it generates a k-anonymized dataset with the property that each record in the dataset is indistinguishable from at least $k - 1$ other records within the dataset. This implies that no specific individual within the k-anonymized dataset can be identified with probability better than $1/k$, even with the help of linking attacks. Bayardo and Agrawal (2005) recently gave a practical algorithm to find a k-anonymization that is optimal in terms of a cost function that quantifies the loss of information. The algorithms of the second approach first randomize the data in a global manner by adding independent Gaussian or uniform noise to the attribute values. The recipient of the "polluted" data set then either reconstructs the data distribution (by using, e.g., Expectation-Maximization; Agrawal and Aggarwal 2001) or constructs a classifier directly on the noisy data (Agrawal and Srikant 2000). The goal, as in the first approach, is to hide the particular attribute values while preserving the joint distribution of the data. In this approach, the intensity of the noise is used to balance between data privacy and model accuracy. Evfimievski et al. (2003) formally analyzed this trade-off, and proposed a method to limit privacy breaches without any knowledge of the data distribution. Our algorithm belongs to the third, cryptography-based approach (Clifton et al. 2002), which is very different in spirit from the first two. Instead of altering the data sets to hide sensitive information, we do not directly communicate the data sets at all; rather, we distribute the learning procedure between the participants.
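The k-anonymity property described above has a simple operational form: a table is k-anonymous when every combination of quasi-identifier values occurs at least k times. A minimal sketch (the attribute generalizations below are invented for illustration):

```python
from collections import Counter

def is_k_anonymous(records, k):
    """True iff every quasi-identifier tuple occurs at least k times."""
    return all(c >= k for c in Counter(records).values())

# Hypothetical sanitized records: ages generalized to decades,
# ZIP codes truncated to their first three digits.
sanitized = [("30-39", "532**"), ("30-39", "532**"),
             ("40-49", "537**"), ("40-49", "537**")]
```

Here each record is indistinguishable from at least one other, so a linking attack identifies any individual with probability at most 1/2.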
The objective in this approach is to preserve the privacy of the participants' data while approximating as closely as possible the performance of the classifier that they would have obtained had they fully disclosed their data sets to each other. Lindell and Pinkas (2002) proposed the first algorithm that followed this approach. They described a privacy-preserving, distributed extension of ID3 (Quinlan 1986), a well-known learning algorithm that builds a decision tree using an entropy-based metric. This approach is also related to distributed learning, where communication is constrained due to limited channel capacity. Predd et al. (2006) have recently analyzed statistical consistency in this distributed learning model. We return to these methods in more detail in Sect. 6.

3 AdaBoost

AdaBoost (Freund and Schapire 1997) is one of the best general-purpose learning methods developed in the last decade. Its development was originally motivated by a rather theoretical question within the Probably Approximately Correct (PAC) learning framework (Valiant 1984). It later inspired several learning-theoretical results and, due to its simplicity, flexibility, and excellent performance on real-world data, it has gained popularity among practitioners. AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion. In each iteration, it calls a simple learning algorithm (called the weak learner) that returns a classifier. The final classification is decided by a weighted "vote" of the weak classifiers, where each weight decreases with the error of the corresponding weak classifier. The weak classifiers have to be only slightly better than a random guess (whence their name), which gives great flexibility to the design of the weak classifier (or feature) set.
If there is no particular a priori knowledge available on the domain of the learning problem, decision trees or, in the extreme case, decision stumps (decision trees with two leaves) are often used. A decision stump can be defined by three parameters: the index $j$ of the attribute² that it cuts, the threshold $\theta$ of the cut, and the sign of the decision. Formally,

$$h_{j,\theta+}(\mathbf{x}) = 2\mathbb{I}\{x^{(j)} \ge \theta\} - 1 = \begin{cases} +1 & \text{if } x^{(j)} \ge \theta, \\ -1 & \text{otherwise}, \end{cases} \qquad (1)$$

and

$$h_{j,\theta-}(\mathbf{x}) = -h_{j,\theta+}(\mathbf{x}) = 2\mathbb{I}\{x^{(j)} < \theta\} - 1 = \begin{cases} +1 & \text{if } x^{(j)} < \theta, \\ -1 & \text{otherwise}, \end{cases} \qquad (2)$$

where the indicator function $\mathbb{I}\{A\}$ is 1 if its argument $A$ is true and 0 otherwise. Although decision stumps may seem very simple, they satisfy the weak learnability condition and, when boosted, they yield excellent classifiers in practice. Also, finding the best decision stump using exhaustive search can be done efficiently in $O(nd)$ time, where $n$ is the number of training points and $d$ is the number of attributes of an observation $\mathbf{x}$. In this paper we will use decision stumps as weak classifiers. Nevertheless, most of our results can be extended to boost other weak learners. For the formal description of AdaBoost, let the training set be $D_n = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, where $\mathbf{x}_i = (x^{(1)}, \ldots, x^{(d)}) \in \mathbb{R}^d$ is an observation vector containing $d$ real-valued attributes, and $y_i \in \{-1, 1\}$ is a binary label. The algorithm maintains a weight distribution $\mathbf{w}^{(t)} = (w_1^{(t)}, \ldots, w_n^{(t)})$ over the data points. The weights are initialized uniformly in line 1, and are updated in each iteration in lines 7–10 (Fig. 1). The weight distribution remains normalized in each iteration, that is, $\sum_{i=1}^n w_i^{(t)} = 1$ for all $t$. We suppose that we are given a set $H$ of classifiers $h: \mathbb{R}^d \to \{-1, 1\}$ that assign one of the two labels to every observation.

² We assume that attributes are real valued, so observations $\mathbf{x}$ are in $\mathbb{R}^d$.
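To make the weak learner concrete, here is a minimal sketch of exhaustive decision-stump search over definitions (1) and (2). For clarity it scans candidate thresholds naively in O(n²d) per call; the O(nd) bound mentioned above is achieved with a single sorted sweep per attribute, which we omit here. The function name and return convention are ours:

```python
import numpy as np

def best_stump(X, y, w):
    """Brute-force search for the decision stump minimizing the weighted error.

    The stump is h(x) = s * (2 * I[x[j] >= theta] - 1), i.e. s = +1 gives
    h_{j,theta+} of Eq. (1) and s = -1 gives h_{j,theta-} of Eq. (2).
    Candidate thresholds are midpoints between consecutive attribute values,
    plus one threshold below all points.
    """
    n, d = X.shape
    best, best_err = None, np.inf
    for j in range(d):
        vals = np.unique(X[:, j])
        thetas = np.concatenate(([vals[0] - 1.0], (vals[:-1] + vals[1:]) / 2.0))
        for theta in thetas:
            pred = 2 * (X[:, j] >= theta) - 1          # h_{j,theta+} predictions
            for s in (+1, -1):
                err = np.sum(w[s * pred != y])          # weighted error
                if err < best_err:
                    best, best_err = (j, theta, s), err
    return best, best_err
```

On a toy one-dimensional set such as `X = [[0],[1],[2],[3]]`, `y = [-1,-1,1,1]` with uniform weights, the search recovers a zero-error stump with a threshold between 1 and 2.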
In addition to $H$, we are also provided with a weak learner algorithm that, in each iteration $t$, returns the weak classifier $h^{(t)} \in H$ that minimizes the weighted error

$$\varepsilon_-^{(t)} = \sum_{i=1}^n w_i^{(t)} \mathbb{I}\{h^{(t)}(\mathbf{x}_i) \ne y_i\}. \qquad (3)$$

The coefficient $\alpha^{(t)}$ of $h^{(t)}$ is set in line 5, and the weights $w_i^{(t)}$ of the training points are updated in lines 7–10. Since $\varepsilon_-^{(t)} < 1/2$ (otherwise we would flip the labels and return $-h^{(t)}$ in line 3), the weight update formulas in lines 8 and 10 indicate that we increase the weights of misclassified points and decrease the weights of correctly classified points. As the algorithm progresses, the weights of frequently misclassified points will increase, so weak classifiers will concentrate more and more on these "hard" data points. After $T$ iterations,³ the algorithm returns the weighted average $f^{(T)}(\cdot) = \sum_{t=1}^T \alpha^{(t)} h^{(t)}(\cdot)$ of the weak classifiers. The sign of $f^{(T)}(\mathbf{x})$ is then used as the final classification of $\mathbf{x}$. There exist numerous versions and extensions of AdaBoost. In this paper, we will use a variant proposed by Schapire and Singer (1999), which allows the weak classifier not only to answer "−1" or "+1", but also to abstain by returning "0". For the formal description of this extended algorithm, we first re-define⁴ the weighted error $\varepsilon_-^{(t)}$ as

$$\varepsilon_-^{(t)} = \sum_{i=1}^n w_i^{(t)} \mathbb{I}\{h^{(t)}(\mathbf{x}_i) = -y_i\}, \qquad (4)$$

the weighted rate of correctly classified points as

$$\varepsilon_+^{(t)} = \sum_{i=1}^n w_i^{(t)} \mathbb{I}\{h^{(t)}(\mathbf{x}_i) = y_i\}, \qquad (5)$$

and the weighted abstention rate as

$$\varepsilon_0^{(t)} = 1 - \varepsilon_-^{(t)} - \varepsilon_+^{(t)} = \sum_{i=1}^n w_i^{(t)} \mathbb{I}\{h^{(t)}(\mathbf{x}_i) = 0\}. \qquad (6)$$

Fig. 1  The pseudocode of AdaBoost. $D_n = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ is the training set, $H$ is the set of weak classifiers, and $T$ is the number of iterations. $\mathbf{w}^{(t)}$ is the weighting over the data points in the $t$th iteration, and $\alpha^{(t)}$ is the weight of the $t$th weak classifier $h^{(t)}$.

The coefficient $\alpha^{(t)}$ is set⁵ to $\frac{1}{2}\ln(\varepsilon_+/\varepsilon_-)$ in line 5, and the weight update in lines 7–10 becomes⁶

$$w_i^{(t+1)} \leftarrow w_i^{(t)} \times \begin{cases} \dfrac{1}{2\varepsilon_- + \varepsilon_0\sqrt{\varepsilon_-/\varepsilon_+}} & \text{if } h^{(t)}(\mathbf{x}_i) = -y_i, \\[2mm] \dfrac{1}{2\varepsilon_+ + \varepsilon_0\sqrt{\varepsilon_+/\varepsilon_-}} & \text{if } h^{(t)}(\mathbf{x}_i) = y_i, \\[2mm] \dfrac{1}{\varepsilon_0 + 2\sqrt{\varepsilon_+\varepsilon_-}} & \text{if } h^{(t)}(\mathbf{x}_i) = 0. \end{cases} \qquad (7)$$

It can be shown that the (unweighted) training error

$$R(f^{(T)}) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{\operatorname{sign}(f^{(T)}(\mathbf{x}_i)) \ne y_i\} \qquad (8)$$

of the final classifier can be upper bounded by $\prod_{t=1}^T \bigl(\varepsilon_0^{(t)} + 2\sqrt{\varepsilon_+^{(t)}\varepsilon_-^{(t)}}\bigr)$, so the goal of the weak learner is to minimize $\varepsilon_0 + 2\sqrt{\varepsilon_+\varepsilon_-}$ in each iteration. If there exists a constant $\delta > 0$ for which $\varepsilon_0^{(t)} + 2\sqrt{\varepsilon_+^{(t)}\varepsilon_-^{(t)}} \le 1 - \delta$ for all $t$, then the training error can be upper bounded by $R(f^{(T)}) \le e^{-T\delta}$. This means that it becomes 0 after at most $T = \lfloor \ln n/\delta \rfloor + 1$ iterations. For suboptimal weak classifiers the convergence can be slower; nevertheless, the algorithm can continue with any weak classifier for which

$$\varepsilon_+ > \varepsilon_-, \qquad (9)$$

which guarantees that $\varepsilon_0 + 2\sqrt{\varepsilon_+\varepsilon_-} < 1$. We will use condition (9) in the next section when showing the algorithmic convergence of the privacy-preserving extension of AdaBoost.

³ $T$ is an appropriately chosen constant that can be set, for example, by cross-validation.
⁴ The definition (3) is no longer valid if $h^{(t)}$ is allowed to return 0.
⁵ In order to avoid a singularity when $\varepsilon_-$ is small, Schapire and Singer (1999) suggest that $\alpha^{(t)} = \frac{1}{2}\ln\frac{\varepsilon_+ + \delta}{\varepsilon_- + \delta}$ be used, where $\delta$ is a small appropriate constant.
⁶ Note that we will omit the iteration index $(t)$ where it does not cause confusion.

4 BiBoost

In this section we extend AdaBoost to the case when the data set $D_n$ is split between two participants, Alice and Bob, who want to obtain a final classifier without exchanging their data sets. The basic idea of the algorithm, which we call BiBoost for Bipartite Boosting,⁷ is that Alice and Bob compute two separate weak classifiers in each iteration, and merge the two classifiers into a ternary classifier. This merged classifier outputs Alice's (or Bob's) label for data points on which the two separate classifiers agree, and abstains if they disagree.
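One full round of the abstaining weight update (7) can be sketched in a few lines; the three multiplicative factors are those of (7), and the smoothing constant δ from footnote 5 guards the logarithm. The function name is ours, and the sketch assumes ε₊ and ε₋ are both strictly positive:

```python
import numpy as np

def abstaining_round(w, h_out, y, delta=1e-12):
    """Compute alpha and the new weights for a ternary weak classifier.

    w     : current normalized weights, shape (n,)
    h_out : weak-classifier outputs in {-1, 0, +1}, shape (n,)
    y     : labels in {-1, +1}, shape (n,)
    Assumes eps_plus > 0 and eps_minus > 0.
    """
    eps_minus = w[h_out == -y].sum()       # weighted error, Eq. (4)
    eps_plus = w[h_out == y].sum()         # weighted correct rate, Eq. (5)
    eps_zero = 1.0 - eps_minus - eps_plus  # weighted abstention rate, Eq. (6)
    alpha = 0.5 * np.log((eps_plus + delta) / (eps_minus + delta))
    # The three factors of update (7); algebraically they equal
    # exp(-alpha * y * h) / Z with Z = eps_zero + 2*sqrt(eps_plus*eps_minus),
    # so the new weights sum to one without an explicit renormalization pass.
    r = np.sqrt(eps_minus / eps_plus)
    factor = np.where(h_out == -y, 1.0 / (2 * eps_minus + eps_zero * r),
             np.where(h_out == y,  1.0 / (2 * eps_plus + eps_zero / r),
                                   1.0 / (eps_zero + 2 * np.sqrt(eps_plus * eps_minus))))
    return alpha, w * factor
```

The built-in normalization is worth noting: because the three denominators share the common value Z, the returned weight vector is again a distribution.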
In Lemma 1 we will show that if the two separate classifiers are optimal on their respective data sets, then the merged classifier will satisfy condition (9), so the algorithm converges. Formally, let $D_{n_A}^A = \{(\mathbf{x}_1^A, y_1^A), \ldots, (\mathbf{x}_{n_A}^A, y_{n_A}^A)\}$ and $D_{n_B}^B = \{(\mathbf{x}_1^B, y_1^B), \ldots, (\mathbf{x}_{n_B}^B, y_{n_B}^B)\}$ be Alice's and Bob's data sets, respectively. The algorithm maintains two weight distributions, $\mathbf{w}_A^{(t)}$ and $\mathbf{w}_B^{(t)}$, over the data points. The weights are initialized in line 1, and are updated in each iteration in line 13 (Fig. 2) using the formulas (7) designed for ternary weak classifiers.

Fig. 2  The pseudocode of BiBoost. $D_{n_A}^A = \{(\mathbf{x}_1^A, y_1^A), \ldots, (\mathbf{x}_{n_A}^A, y_{n_A}^A)\}$ and $D_{n_B}^B = \{(\mathbf{x}_1^B, y_1^B), \ldots, (\mathbf{x}_{n_B}^B, y_{n_B}^B)\}$ are Alice's and Bob's training sets, respectively, $H$ is the set of weak classifiers, and $T$ is the number of iterations.

It can be shown that if the weights are initialized non-uniformly to an arbitrary weight vector $\mathbf{w}^{(1)} = (w_1^{(1)}, \ldots, w_n^{(1)})$, then AdaBoost minimizes the weighted training error

$$R_{\mathbf{w}^{(1)}}(f^{(T)}) = \sum_{i=1}^n w_i^{(1)} \mathbb{I}\{\operatorname{sign}(f^{(T)}(\mathbf{x}_i)) \ne y_i\} \qquad (10)$$

instead of (8). In line 1 we initialize the weight vectors uniformly within the data sets of Alice and Bob, which means that we give equal weight to the two data sets $D_{n_A}^A$ and $D_{n_B}^B$; in other words, we minimize

$$\frac{1}{2n_A}\sum_{i=1}^{n_A} \mathbb{I}\{\operatorname{sign}(f^{(T)}(\mathbf{x}_i^A)) \ne y_i^A\} + \frac{1}{2n_B}\sum_{i=1}^{n_B} \mathbb{I}\{\operatorname{sign}(f^{(T)}(\mathbf{x}_i^B)) \ne y_i^B\}.$$

The advantage of this initialization is that Alice and Bob do not have to communicate their respective data sizes. If they wish to minimize the unweighted training error (8), they should exchange $n_A$ and $n_B$, and initialize $\mathbf{w}^{(1)}$ uniformly to $(1/(n_A + n_B), \ldots, 1/(n_A + n_B))$. Using the weighted error (10) also gives flexibility to the participants in their experimental design.

⁷ An earlier version of this algorithm was described in Aïmeur et al. (2004) under the name of MABoost.
For example, they can assign different weights to positive and negative examples when the cost of misclassification is different for false positives and false negatives. In line 4, Alice and Bob select a subset $H^{(t)}$ of the weak classifier set $H$. They will use this set to find two weak classifiers, $h^A$ and $h^B$, separately, by minimizing the weighted error over their respective data sets (lines 5–6). In the simplest case, $H^{(t)}$ can be identical to $H$ in every iteration. In certain situations, it can be more convenient to simplify the task of the weak learner by using simpler weak classifier sets that can differ in each iteration. In Sect. 4.1 we will describe three particular strategies which can be used by Alice and Bob to select $H^{(t)}$ in a privacy-preserving manner. We merge the weak classifiers $h^A$ and $h^B$ into a ternary classifier in line 7 by taking their pointwise average. Since $h^A$ and $h^B$ are both binary classifiers⁸ $\mathbb{R}^d \to \{-1, 1\}$, the formula in line 7 is equivalent to

$$h^{(t)}(\mathbf{x}) = \begin{cases} h^A(\mathbf{x}) & \text{if } h^A(\mathbf{x}) = h^B(\mathbf{x}), \\ 0 & \text{otherwise}, \end{cases} \qquad (11)$$

that is, $h^{(t)}$ will agree with $h^A$ (or $h^B$) if $h^A$ and $h^B$ agree, and abstain otherwise. To calculate the coefficient $\alpha^{(t)}$ (line 11), and to update the weights of the data points (lines 12–13), we calculate the weighted error, the weighted rate of correctly classified points, and the weighted abstention rate of $h^{(t)}$ in lines 8–10 using the formulas (4), (5), and (6), respectively. As before, the final classifier can be obtained as the sign of the weighted "vote" of the weak classifiers (line 14). According to condition (9), BiBoost will converge if $\varepsilon_+ > \varepsilon_-$ in each iteration. We now proceed to show that if $h^A$ and $h^B$ are optimal within $H^{(t)}$ on the data sets of Alice and Bob, respectively, then the merged classifier satisfies the convergence criterion (9) of AdaBoost with abstention.
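The merging rule (11) is just a pointwise average of two ±1-valued classifiers; a two-line sketch (with an invented helper name):

```python
def merge(h_a, h_b):
    """Merge two binary classifiers (returning -1 or +1) into the ternary
    classifier of Eq. (11): the common label where they agree, 0 (abstain)
    where they disagree.  Implemented as the pointwise average of line 7:
    (+1 + 1)/2 = +1, (-1 - 1)/2 = -1, (+1 - 1)/2 = 0.
    """
    return lambda x: (h_a(x) + h_b(x)) // 2
```

With two one-dimensional stumps, e.g. `h_a(x) = +1 iff x >= 1` and `h_b(x) = +1 iff x < 5`, the merged classifier answers +1 on [1, 5) and abstains outside that interval.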
⁸ The formula in line 7 works also in the case of real-valued weak classifiers, so the algorithm can be easily extended to confidence-rated AdaBoost (Schapire and Singer 1999).

First we say that $h^A$ and $h^B$ are $H^{(t)}$-optimal if they minimize the weighted error on their respective data sets, that is,

$$h^A = \operatorname*{argmin}_{h \in H^{(t)}} \sum_{i=1}^{n_A} w_i^A \mathbb{I}\{h(\mathbf{x}_i^A) \ne y_i^A\} \quad\text{and}\quad h^B = \operatorname*{argmin}_{h \in H^{(t)}} \sum_{i=1}^{n_B} w_i^B \mathbb{I}\{h(\mathbf{x}_i^B) \ne y_i^B\}.$$

Now, let

$$\varepsilon_-^A = \sum_{i=1}^{n_A} w_i^A \mathbb{I}\{h^A(\mathbf{x}_i^A) = -y_i^A\} \quad\text{and}\quad \varepsilon_+^A = \sum_{i=1}^{n_A} w_i^A \mathbb{I}\{h^A(\mathbf{x}_i^A) = y_i^A\}$$

be the weighted error and the weighted rate of correctly classified points, respectively, of Alice's classifier $h^A$ on Alice's data set $D_{n_A}^A$, and let $\varepsilon_-^B$ and $\varepsilon_+^B$ be defined similarly for Bob. Furthermore, let

$$\omega_-^A = \sum_{i=1}^{n_A} w_i^A \mathbb{I}\{h^{(t)}(\mathbf{x}_i^A) = -y_i^A\} \quad\text{and}\quad \omega_+^A = \sum_{i=1}^{n_A} w_i^A \mathbb{I}\{h^{(t)}(\mathbf{x}_i^A) = y_i^A\} \qquad (12)$$

be the weighted error and the weighted rate of correctly classified points, respectively, of the merged classifier $h^{(t)}$ on Alice's data set $D_{n_A}^A$, and let $\omega_-^B$ and $\omega_+^B$ be defined similarly for Bob. Note the subtle difference between the definitions of $\varepsilon_\pm^A$ and $\omega_\pm^A$: the former uses Alice's classifier $h^A$, whereas the latter uses the merged classifier $h^{(t)}$. With this notation, the weighted error (4) and the weighted rate of correctly classified points (5) can be expressed as

$$\varepsilon_- = \omega_-^A + \omega_-^B \quad\text{and}\quad \varepsilon_+ = \omega_+^A + \omega_+^B,$$

respectively. The following lemma provides a sufficient condition for the convergence of BiBoost in the case when $H^{(t)}$ is closed under negation, that is, for every base classifier $h \in H^{(t)}$, $-h$ is also an element of $H^{(t)}$. The condition is trivially true for the set of decision stumps (see (1) and (2)), and for any function set used by practical learning algorithms.

Lemma 1  If $H^{(t)}$ is closed under negation and if $h^A$ and $h^B$ are $H^{(t)}$-optimal, then $\varepsilon_- \le \varepsilon_+$. Furthermore, $\varepsilon_- = \varepsilon_+$ only if $-h^A$ is optimal on Bob's data set $D_{n_B}^B$ and $-h^B$ is optimal on Alice's data set $D_{n_A}^A$.

Proof  The main observation of the proof is that $\omega_-^A \le \omega_+^A$; otherwise the classifier $-h^B$ would have a lower error on Alice's data set $D_{n_A}^A$ than Alice's chosen classifier $h^A$, which would violate the optimality of $h^A$. To see this, first note that by the definition (11) of the merged classifier $h^{(t)}$, $-h^B(\mathbf{x})$ agrees with $h^A(\mathbf{x})$ on observations $\mathbf{x}$ on which $h^{(t)}$ abstains, and they disagree otherwise. Thus, the error of $-h^B$ on $D_{n_A}^A$ is the sum of (1) the error of $h^A$ on the subset of $D_{n_A}^A$ where $h^{(t)}$ abstains, that is, $(\varepsilon_-^A - \omega_-^A)$, and (2) the rate of correctly classified points of $h^A$ on the subset of $D_{n_A}^A$ where $h^{(t)}$ does not abstain, that is, $\omega_+^A$. Hence, if $\omega_+^A < \omega_-^A$, then the error $(\varepsilon_-^A - \omega_-^A + \omega_+^A)$ of $-h^B$ is less than the error $\varepsilon_-^A$ of $h^A$, which violates the optimality of $h^A$. It can be shown in a similar way that $\omega_-^B \le \omega_+^B$, and the first statement of the Lemma follows. The second statement is also easy to see by observing that $\omega_-^A \le \omega_+^A$, $\omega_-^B \le \omega_+^B$, and $\varepsilon_- = \varepsilon_+$ imply $\omega_-^A = \omega_+^A$ and $\omega_-^B = \omega_+^B$, so the error of $-h^B$ on $D_{n_A}^A$ is $(\varepsilon_-^A - \omega_-^A + \omega_+^A) = \varepsilon_-^A$, and, similarly, the error of $-h^A$ on $D_{n_B}^B$ is $\varepsilon_-^B$.

Remark 1  If $H^{(t)}$ is closed under negation, then $\varepsilon_-^A \le \varepsilon_+^A$.⁹ Moreover, if $h^A$ is optimal, then $\varepsilon_-^A = \varepsilon_+^A$ can happen only if we have equality for all weak classifiers, in which case AdaBoost would saturate and should be stopped. The same thing happens in BiBoost if Alice and Bob pick exactly the same classifier but with opposite signs. If we add to the protocol that in this case Alice and Bob should go back to line 4 and agree on another subset $H^{(t)}$, then BiBoost has to be stopped only if Alice and Bob pick exactly opposite classifiers for all possible subsets.

Remark 2  The $H^{(t)}$-optimality of $h^A$ and $h^B$ is crucial for the lemma.

⁹ Otherwise, by switching labels, the error of $-h^A$ (that is, $\varepsilon_+^A$) would be lower than the error $\varepsilon_-^A$ of $h^A$, so $h^A$ would not be optimal.
If Alice runs AdaBoost alone on her data set, she can continue with any weak classifier $h^A$ for which $\varepsilon_-^A < \varepsilon_+^A$, so she is not required to find the minimizer of the weighted error. On the other hand, it is easy to find an example of a non-optimal $h^A$ for which $\varepsilon_-^A < \varepsilon_+^A$, yet $\varepsilon_- > \varepsilon_+$.¹⁰ Such an $h^A$ would be admissible for AdaBoost if Alice were training alone, but it would make the merged classifier fail in BiBoost.

Remark 3  The condition in the lemma is sufficient for BiBoost's convergence; however, it is not necessary. In practice, it is plausible that most of the time $\varepsilon_- < \varepsilon_+$ even if $h^A$ and $h^B$ are not $H^{(t)}$-optimal. If $H^{(t)}$-optimality cannot be guaranteed, condition (9) should be verified after line 9, and in case it is not satisfied, the algorithm should return to line 4 to select new weak classifiers.

¹⁰ For example, let $D_3^A = \{(0, -1), (2, -1), (4, +1)\}$, $D_3^B = \{(6, -1), (4, -1), (2, +1)\}$, and $\mathbf{w}_A^{(t)} = \mathbf{w}_B^{(t)} = (0.15, 0.2, 0.15)$. Let Alice choose the suboptimal decision stump $h^A(x) = h_{1,1+}(x) = 2\mathbb{I}\{x \ge 1\} - 1$, and let Bob choose $h^B(x) = h_{1,5-}(x) = 2\mathbb{I}\{x < 5\} - 1$. Since $\varepsilon_-^A = \varepsilon_-^B = 0.2 < 0.3 = \varepsilon_+^A = \varepsilon_+^B$, both stumps would be admissible for AdaBoost. For the merged classifier, $\varepsilon_- = 0.4 > 0.3 = \varepsilon_+$, so BiBoost cannot continue.

4.1 The choice of $H^{(t)}$

The generality of the lemma allows us to consider various strategies to select the set $H^{(t)}$ of weak classifiers. In the simplest case, $H^{(t)}$ can be identical to $H$ in every iteration. In certain cases, it may be convenient to divide $H$ into $k$ (not necessarily disjoint) subsets $H_1, \ldots, H_k$, and select one of them as $H^{(t)}$ in each iteration. There are multiple reasons why such a partitioning of the weak classifier space can be useful.
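The counterexample of footnote 10 is easy to verify numerically; the sketch below recomputes the per-participant rates with formulas (4) and (5), and the merged rates via (12):

```python
# Footnote 10 example: suboptimal stumps pass AdaBoost's test individually,
# yet the merged classifier violates convergence condition (9).
D_a = [(0, -1), (2, -1), (4, +1)]   # Alice's (x, y) pairs
D_b = [(6, -1), (4, -1), (2, +1)]   # Bob's (x, y) pairs
w = [0.15, 0.2, 0.15]               # same weights on both sides (sum 0.5 each)

h_a = lambda x: 1 if x >= 1 else -1     # h_{1,1+}, suboptimal for Alice
h_b = lambda x: 1 if x < 5 else -1      # h_{1,5-}
h_t = lambda x: (h_a(x) + h_b(x)) // 2  # merged ternary classifier, Eq. (11)

def rates(h, data, w):
    """Weighted error and weighted correct rate of h on one data set."""
    err = sum(wi for (x, y), wi in zip(data, w) if h(x) == -y)
    cor = sum(wi for (x, y), wi in zip(data, w) if h(x) == y)
    return err, cor

# eps_- = omega_-^A + omega_-^B and eps_+ = omega_+^A + omega_+^B
eps_minus = rates(h_t, D_a, w)[0] + rates(h_t, D_b, w)[0]
eps_plus = rates(h_t, D_a, w)[1] + rates(h_t, D_b, w)[1]
```

Running this confirms the footnote's numbers: each stump alone satisfies ε₋ = 0.2 < 0.3 = ε₊, while the merged classifier has ε₋ = 0.4 > 0.3 = ε₊.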
First, from a computational point of view, we might be able to guarantee $H_j$-optimality on the subsets, and even if $H_j$-optimality cannot be guaranteed, this design allows Alice and Bob to go back to line 4 and select another alternative when condition (9) is violated. Second, from the privacy-preserving aspect, weak classifiers that are optimal over a possibly large set $H$ may reveal much more information about the data sets than weak classifiers that were selected from a small subset of $H$. Of course, in this case, Alice and Bob have to agree on the subset selected for $H^{(t)}$, and the communication and computational costs may cancel this advantage (see Sects. 4.2 and 6.2 for a detailed analysis). Finally, from a statistical point of view, it has been argued (Amit et al. 2000; Friedman 2002; Kolcz et al. 2002) that introducing randomness into the weak classifier selection can improve the generalization performance of AdaBoost (our experiments in Sect. 7 confirm this argument), and partitioning $H$ provides a natural framework for randomization. In this section we describe three general alternatives that can be used to select $H^{(t)}$, and discuss their communicational and computational implications.

4.1.1 The WholeSet alternative

In the simplest case, Alice and Bob will use $H^{(t)} = H$ in each iteration. They do not need to communicate in line 4; they only have to exchange their weak classifiers $h^A$ and $h^B$.

4.1.2 The CommonSubset alternative

In general, Alice and Bob can divide $H$ into $k$ (not necessarily disjoint) subsets $H_1, \ldots, H_k$, and select one of them as $H^{(t)}$ in each iteration. In the experiments (Sect. 7) we will use decision stumps as weak classifiers, and $H_j$, $j = 1, \ldots, d$, will be the set of stumps that cut the $j$th attribute, that is, $H_j = \{h_{j,\theta+}, h_{j,\theta-} : \theta \in \mathbb{R}\}$ (see the definitions (1) and (2)). This may be the simplest way to apply this alternative; nevertheless, the protocol can be used for any partitioning of any weak classifier set.
In certain situations, it is possible that some of the subsets are not admissible for Alice or Bob. In multi-class classification, or if the subset $H_j$ is not closed under negation, it is possible that Alice (Bob) cannot find a weak classifier in $H_j$ with $\varepsilon_-^A \le \varepsilon_+^A$ ($\varepsilon_-^B \le \varepsilon_+^B$). Moreover, it can happen even in binary classification that $\varepsilon_-^A$ is equal to $\varepsilon_+^A$ within the numerical precision of the computer for all weak classifiers in $H_j$. In this situation, Alice and Bob must find a subset $H_j$ which is admissible for both of them. If such a subset does not exist, they want to be aware of that, and if there are several such subsets, they want to choose one randomly from among them. In the rest of this section we describe a protocol that allows Alice and Bob to find a common admissible subset in a privacy-preserving manner. To formalize the problem, Alice and Bob represent their admissible subsets by two binary vectors, $\mathbf{b}^A = (b_1^A, \ldots, b_k^A)$ and $\mathbf{b}^B = (b_1^B, \ldots, b_k^B)$, where $b_j^A$ (or $b_j^B$) is 1 if $H_j$ is admissible for Alice (or Bob), and 0 otherwise. Their goal is to select an index $j$ randomly from the set $\{\ell : b_\ell^A \wedge b_\ell^B = 1\}$. Formally, they want to compute $j$ as a function $\phi(\mathbf{b}^A, \mathbf{b}^B)$. What makes the problem non-trivial is that Alice and Bob do not want to disclose any unnecessary information about $\mathbf{b}^A$ and $\mathbf{b}^B$ while computing $\phi(\mathbf{b}^A, \mathbf{b}^B)$;¹¹ in other words, they want to find a solution based on secure multiparty computation. We call this task the random rendez-vous problem, referring to the equivalent problem where $b_j^A$ and $b_j^B$ are indicators of free slots in Alice's and Bob's agendas, respectively, and the goal is to schedule a meeting without disclosing any unnecessary information about their agendas. The problem of computing $\phi(\mathbf{b}^A, \mathbf{b}^B)$ is equivalent to computing a random element of the intersection of two sets. This latter problem was solved and analyzed in the area of communication complexity by Kalyanasundaram and Schnitger (1987).
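As a point of reference, the function φ itself (without any privacy protection) is a one-liner; the difficulty addressed in the rest of this section is computing it without revealing bᴬ and bᴮ. The helper name is ours:

```python
import random

def phi_plain(b_a, b_b, rng=random):
    """Non-private reference implementation of phi(b^A, b^B): a uniformly
    random index j with b_a[j] = b_b[j] = 1, or None if no subset is
    admissible for both participants.  A privacy-preserving deployment
    must compute the same function under a secure set-intersection
    protocol rather than by pooling the vectors like this.
    """
    common = [j for j, (a, b) in enumerate(zip(b_a, b_b)) if a and b]
    return rng.choice(common) if common else None
```

Any secure protocol for the random rendez-vous problem must agree with this reference output distribution while exchanging only protected messages.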
They proved that the computation of the intersection requires Ω(k) bits of communication on average. They, however, were concerned only with the issue of minimizing communication; they were not interested in the privacy-preserving aspect of the computation. Nevertheless, the result provides us with a benchmark, since any distributed and privacy-preserving protocol has to use at least as many bits of communication as the best non-privacy-preserving distributed protocol. A secure and efficient way to compute φ(b^A, b^B) would be to adapt the secure set intersection protocol proposed in Kissner and Song (2005). In their paper, the authors describe how to use and manipulate polynomial representations of the sets to implement privacy-preserving operations. Their set intersection protocol requires that each participant encode his set as a polynomial, encrypt the polynomial using an additively homomorphic scheme (such as Paillier's cryptosystem, Paillier 2000), do some randomization on the encrypted polynomial, then exchange it with the other participant, who will also encrypt and randomize the received polynomial before sending back the result. Finally, the two participants have to cooperate in order to perform a group decryption and reveal the intersection of the two sets. It is also possible to make the protocol secure against malicious participants using standard cryptographic techniques involving zero-knowledge proofs, at the cost of increasing the complexity of the protocol. As an alternative, a similar protocol (Freedman et al. 2004) could also be used to compute the set intersection efficiently.

11 For example, they do not want to reveal any admissible subsets other than Hj, not even the number of their admissible subsets.
Note that to implement the random rendez-vous protocol (that is, to compute φ(b^A, b^B)), we only need to find one random element of the intersection, so we can modify the last part of Kissner and Song (2005)'s protocol in the following way. Instead of decrypting all the doubly-encrypted elements at once, which would reveal the entire intersection, the participants should shuffle the list and proceed element by element until a match is found (or all the elements of the list have been ruled out, in which case the intersection is empty). This reveals only one element of the set intersection, chosen at random, instead of the whole intersection.

4.1.3 The RandomSubset alternative

In this alternative, Alice first chooses an admissible subset HjA randomly from among all her admissible subsets, and announces jA to Bob. If HjA is also admissible for Bob, they agree to set H(t) = HjA without having to run the random rendez-vous protocol. If HjA is not admissible for Bob, then he chooses an admissible subset HjB randomly from among all his admissible subsets, and they set H(t) = HjA ∪ HjB.

4.2 Complexity

In this section we discuss the complexity of BiBoost. Since it is a distributed algorithm, we can analyze two different measures of complexity: the communication complexity, which is the number of bits exchanged during the execution of the protocol, and the computational complexity, which considers the processing time required to execute the algorithm.

4.2.1 Communication complexity

To quantify the amount of communication used in BiBoost, we need to determine the number of bits exchanged during an iteration. First observe that the only steps that require communication between Alice and Bob are

1. agreeing on H(t) in line 4 (Fig. 2),
2. exchanging the weak classifiers hA and hB in line 7, and
3. exchanging the errors ε+ and ε- for calculating the coefficient α(t) in line 11 and for updating the weights in line 13.
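The RandomSubset agreement can be sketched in a few lines (a plain, non-cryptographic simulation written from one vantage point; in the actual protocol only the indices jA and, if needed, jB travel between the parties):

```python
import random

def random_subset_agreement(admissible_a, admissible_b, rng=random):
    """Simulate the RandomSubset alternative: Alice announces a random
    admissible index j_A; if it is also admissible for Bob they use H_{j_A},
    otherwise Bob announces his own j_B and they use H_{j_A} union H_{j_B}.
    Returns the set of subset indices making up H(t)."""
    j_a = rng.choice(sorted(admissible_a))
    if j_a in admissible_b:
        return {j_a}
    j_b = rng.choice(sorted(admissible_b))
    return {j_a, j_b}
```

Note that the asymmetry (Bob speaks only when Alice's choice fails for him) is exactly what causes the mild information leak analyzed in Sect. 6.2.3.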
The communication cost of the first step depends on the protocol used to agree on H(t). In the WholeSet alternative there is no communication, so this step costs nothing. Using the CommonSubset alternative with the random rendez-vous protocol of Kissner and Song (2005), we need to communicate Θ(k log k) bits, where k is the number of subsets considered. In the RandomSubset alternative, Alice and Bob have to communicate their selection of HjA and HjB, which costs Θ(log k) bits. For quantifying the communication cost of exchanging hA and hB, we suppose that we need m bits to encode weak classifiers in H(t), so the second step costs Θ(m) bits. Encoding the errors to finite precision can be done using a constant number of bits, so the communication cost of this step is constant. The final communication costs are summarized in Table 1.

Table 1 Communication and computational cost (using decision stumps) of BiBoost

Alternative     Communication cost   Computational cost
WholeSet        Θ(Tm)                Θ(Tdn)
CommonSubset    Θ(T(m + k log k))    Θ(Td(n + e))
RandomSubset    Θ(T(m + log k))      Θ(Tdn)

T is the number of iterations, m is the number of bits needed to encode weak classifiers in H(t), k is the number of weak classifier sets, d is the number of attributes (dimensions), n = nA + nB is the total size of the data set, and e is the time required to perform one encryption.

Note that although the WholeSet alternative seems to have the lowest cost, m can be much larger than in the CommonSubset and RandomSubset alternatives, since the weak classifier set H can be large compared to the subsets Hj. In particular, if H is the set of all decision stumps and Hj is the set of decision stumps on the jth attribute, then k is equal to the number of attributes d, so, to encode a stump, we need log d more bits in the WholeSet alternative than in the RandomSubset alternative. Hence, assuming the same number of iterations, the two alternatives have the same communication cost.
4.2.2 Computational complexity

The computational cost of lines 7 through 13 is Θ(n), where n = nA + nB. Since training the weak classifiers in lines 5 and 6 is at least linear in n, the computational cost of BiBoost is dominated by the selection of H(t) (line 4) and the training of the weak classifiers (lines 5 and 6). The computation time required to agree on H(t) depends again on which alternative is chosen. It is negligible if the WholeSet or RandomSubset alternative is used. To analyze the computational cost of the set intersection protocol of Kissner and Song (2005) used in the CommonSubset alternative, let e stand for the time required to encrypt one element of the set. For each weak classifier set Hj, Alice and Bob have to perform two encryptions, which takes Θ(ke) time in total. The computational cost of finding the weak classifier depends on the learning algorithm used. It is at least linear in n unless the algorithm uses only a subsample of the data Dn to train the weak classifiers. In particular, the best decision stump can be found in Θ(dn) steps if the data points are sorted in each attribute. Sorting the data points can be done in Θ(dn log n) time, but it has to be done only once before the boosting iteration starts. The overall cost of this step for all T iterations is therefore in Θ(dn(T + log n)). Summing up (see Table 1), the overall computational complexity of all T iterations of BiBoost is in Θ(dn log n + Td(n + e)) if the CommonSubset alternative is used and d = k. In the likely case that d < 2^n and n < 2^T, this reduces to Θ(Td(n + e)). If the WholeSet or RandomSubset alternative is used, the overall computational complexity of BiBoost is Θ(dn(T + log n)), which reduces to Θ(Tdn) if n < 2^T. In any case, by using decision stumps as weak classifiers, the time required to apply our solution increases linearly with the number of attributes. If there are too many attributes, the processing time of BiBoost could be prohibitive.
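The Θ(dn) stump search can be sketched as a single sweep per attribute over presorted point indices (an illustrative Python sketch; tie-handling for equal attribute values is simplified, and the threshold returned is the last point swept past rather than a midpoint):

```python
def best_stump(X, y, w, order):
    """Find the weighted-error-minimizing decision stump in one Theta(dn)
    sweep per attribute.  X is a list of points, y a list of labels in
    {-1, +1}, w the point weights, and order[j] the indices of the points
    sorted by attribute j (computed once, in Theta(dn log n), before the
    boosting iterations start).  Returns (attribute, threshold, sign, error)."""
    total = sum(w)
    best = None  # (attribute j, threshold theta, sign, weighted error)
    for j in range(len(X[0])):
        # threshold below every point: the "+" stump predicts +1 everywhere
        err_plus = sum(wi for wi, yi in zip(w, y) if yi == -1)
        for sign, err in (("+", err_plus), ("-", total - err_plus)):
            if best is None or err < best[3]:
                best = (j, float("-inf"), sign, err)
        for i in order[j]:
            # moving the threshold just above X[i][j] flips the "+" prediction on point i
            err_plus += w[i] if y[i] == 1 else -w[i]
            for sign, err in (("+", err_plus), ("-", total - err_plus)):
                if best is None or err < best[3]:
                    best = (j, X[i][j], sign, err)
    return best
```

The inner loop does constant work per point, so the whole search is Θ(dn) once the sort orders are available, which is the bound used in the analysis above.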
5 MultBoost

In this section, we describe MultBoost (Multiparty Boosting), which generalizes BiBoost to multiple participants. In general, extending a bipartite protocol to the multiparty case is a non-trivial task in cryptography. For example, BiBoost's closest relative, the privacy-preserving version of ID3 (Lindell and Pinkas 2002), exists only as a bipartite algorithm, and the authors acknowledge that extending it to the multiparty case would be difficult. The main difficulty of moving from a bipartite to a multiparty setting is that some steps that were previously considered secure are no longer so. For example, the "naive" computation of the weak classifier's coefficient α(t) is no longer secure: the errors of the participants' individual weak classifiers cannot be inferred from the final classifier, so, according to our security paradigm, these errors should now be protected and not revealed in the clear. The security of agreeing on a common subset H(t) must also be revised in the multiparty setting. The descriptions of the individual classifiers still do not have to be protected (assuming that they are explicitly contained in the merged classifier); however, the origin of each individual classifier (which one came from which participant) must now be hidden. The communication model considered in the multiparty setting is the following. The existence of a private channel between each pair of participants is assumed. Practically, this means that each participant has the possibility of sending a secret message to any other participant without the possibility for anybody else to learn any information about the content of the message. The participants also have access to a broadcast channel which takes a message from one participant and broadcasts it to every other participant. If this channel does not exist physically, it is always possible for one participant in the semi-honest model to emulate it by using the individual private channels.
This emulation has a communication cost of Θ(M) bits, where M is the number of participants, since the selected participant has to send M messages using private channels instead of one message over the broadcast channel. However, MultBoost's protocols use the broadcast channel in a limited way, so the communication complexity of these protocols is not affected by whether the broadcast channel is real or simulated. Note that in a model where the participants could be malicious, the emulation of a broadcast channel would have a significant impact both on the communication complexity of the protocol and on its security. For example, if a broadcast channel is available, any multiparty computation can be made secure against at most half of the participants (Goldreich et al. 1987; Rabin and Ben-Or 1989), whereas without the broadcast channel the security is only guaranteed if at least two thirds of the participants are honest. In our case, as long as the word "broadcast" is not explicitly used in the description of the protocol, the communication is done using private individual channels.

Fig. 3 The pseudocode of MultBoost. D^1_{n_1}, ..., D^M_{n_M} are the training sets of the M participants, H is the set of weak classifiers, and T is the number of iterations

For the formal description of MultBoost, let D^p_{n_p} = ((x^p_1, y^p_1), ..., (x^p_{n_p}, y^p_{n_p})) be the data set of the pth participant, p = 1, ..., M. The algorithm maintains a weight distribution w^{p,(t)} over the data points. The weights are initialized uniformly in lines 1-2, and are updated in each iteration in line 16 (Fig. 3) using the formulas (7) designed for ternary weak classifiers. In line 4, the participants select a subset H(t) of the weak classifier set H. They will use this set to train their weak classifiers h^p separately by minimizing the weighted error over their respective data sets (lines 5-6).
The WholeSet alternative, when H(t) = H in each iteration, is easy to use also in MultBoost, but it has the same disadvantages as in BiBoost. The CommonSubset alternative might be difficult to use in MultBoost. The random rendez-vous protocol could easily be extended to the multiparty case for a cost that is linear in the number of participants, but the goal of the protocol might be too ambitious in the sense that finding a subset Hj that is admissible for all participants becomes more and more difficult as M grows. In our implementation, we adopted a variant of the RandomSubset alternative. Each participant selects an admissible subset H^p randomly from among its admissible subsets, and then they set H(t) = ∪_{p=1}^M H^p. The weak classifiers h^1, ..., h^M are merged into a ternary classifier in line 7. In our implementation, we use the simple rule

h(t)(x) = sign( Σ_{p=1}^M h^p(x) )  if Σ_{p=1}^M h^p(x) ≠ 0,
h(t)(x) = 0                         otherwise,                      (13)

that is, we take a majority vote among the participants' weak classifiers. This rule simplifies to the rule (11) that we used in the bipartite case. Other rules could also be used in this step. For example, one could output the "raw", confidence-rated (real-valued) weak classifier (1/M) Σ_{p=1}^M h^p(x), or weight the votes in (13) according to the correctness of the individual classifiers. Unfortunately, we cannot extend Lemma 1 to any of these rules, which means that we cannot theoretically guarantee that h(t) satisfies the condition (9) even if the individual classifiers h^p are all H(t)-optimal on their respective data sets D^p_{n_p}. Nevertheless, in our experiments, the merged classifier (using the rule (13)) almost never fails the condition (9). On the rare occasions when this happens, we can go back to line 4 and agree on a different H(t). The security of this step must also be revised in MultBoost. In BiBoost, both participants know the weak classifier of the other participant by subtracting their own weak classifier from h(t).
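The merging rule (13) is simply a majority vote with abstention over the participants' ternary outputs:

```python
def merge(weak_outputs):
    """Rule (13): majority vote over the participants' weak classifier
    outputs h^p(x), each in {-1, 0, +1}; abstain (output 0) on a tie."""
    s = sum(weak_outputs)
    return (s > 0) - (s < 0)  # sign of the sum, 0 if the sum is 0
```

With two participants this reduces to agreeing when both vote the same way and abstaining otherwise, matching the bipartite rule (11).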
In MultBoost, however, participants cannot identify the owners of individual weak classifiers just by looking at h(t), so these identities must be hidden when the classifiers are merged. To solve this problem, we use an anonymous broadcast protocol, which takes a set of objects as input, and outputs them in a random order in such a way that it is impossible to trace the origin of the objects.12 This kind of channel was suggested, among many other ideas, in a seminal paper of Chaum (1981). Examples of recent efficient protocols implementing this kind of channel can be found in Furukawa and Sako (2001) and Neff (2001). After obtaining the merged classifier h(t), the participants compute its coefficient α(t) in lines 11-14. There is no change in the computation compared to BiBoost; however, the summation of the individual errors (lines 11-12) must be done securely. In BiBoost, Bob can infer the error rate ωA (12) of Alice using α(t) and his error rate ωB. On the other hand, in MultBoost, the participants can reconstruct only the sum ε- of their individual errors ε^p_-, and not the individual errors themselves. Therefore, the individual errors ε^1_-, ..., ε^M_- should now be protected and not revealed in the clear. To compute ε- = Σ_{p=1}^M ε^p_- securely, we use a secure sum computation protocol. An implementation of this protocol in the context of privacy-preserving association rules can be found in Kantarcioglu and Clifton (2004b). This protocol is based on the work started by Shamir on linear secret sharing (Shamir 1979).

12 Note that the same protocol can also be used to hide the origins of the weak classifier sets H^p in case the RandomSubset alternative is used in line 4 of Fig. 3. This would prevent a potential information leak during the agreement phase.
It involves generating shares by adding random numbers to mask the value of the true elements, distributing the randomized shares, computing the sum of the shares using modular addition, and then canceling the effect of the randomization in order to reveal the global result. By using the secure sum computation protocol, the participants compute the weighted error ε- and the weighted rate of correctly classified points ε+ of the merged classifier h(t). Then they obtain the coefficient α(t) and update the point weights w^p_i as in BiBoost.

5.1 Complexity

In this section we study the communication and computational complexities of MultBoost. When analyzing the communication complexity, we suppose that we have a global resource (communication channel) that the participants share, and we compute the total amount of communication exchanged between the participants. On the other hand, in the section on the computational complexity, we analyze the time required for each participant. Therefore, for the total time, the formulas should simply include an additional factor of M.

5.1.1 Communication complexity

As in BiBoost, the following three steps require communication between the participants:

1. agreeing on H(t) in line 4 (Fig. 3),
2. exchanging the weak classifiers h^p in line 7, and
3. exchanging the errors ε^p_+ and ε^p_- for calculating the coefficient α(t) in lines 11-12.

The communication cost of the first step depends on the protocol used to agree on H(t). In the WholeSet alternative there is no communication, so this step costs nothing. In the RandomSubset alternative (which we used in our implementation), every participant has to send his choice H^p to every other participant, which costs Θ(M^2 log k) bits. The communication cost of the anonymous broadcast protocol (second step) is also Θ(M^2 log k) using the algorithms in Furukawa and Sako (2001) and Neff (2001).
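The share-based secure sum described above can be sketched as follows (a single-process simulation with an illustrative modulus; in practice the real-valued errors ε^p_- would first be encoded as fixed-point integers, and the shares would travel over the private pairwise channels):

```python
import random

def secure_sum(values, modulus=1 << 32, rng=random):
    """Secure sum by additive secret sharing: each participant splits its
    value into M random shares that sum to the value mod `modulus` and sends
    one share to each participant; the per-participant share-sums are then
    published and added.  Any M-1 shares of a value are uniformly random,
    which is what hides the individual contributions."""
    m = len(values)
    # shares[p][q] is the share participant p sends to participant q
    shares = []
    for v in values:
        r = [rng.randrange(modulus) for _ in range(m - 1)]
        shares.append(r + [(v - sum(r)) % modulus])
    # each participant q publishes the sum of the shares it received
    partial = [sum(shares[p][q] for p in range(m)) % modulus for q in range(m)]
    return sum(partial) % modulus
```

As long as the true sum is smaller than the modulus, the published total equals Σ_p values[p] exactly, while no participant ever sees another participant's value in the clear.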
The secure sum computation protocol (third step) is done in Θ(M^2) using the protocol described in Kantarcioglu and Clifton (2004b). Since the cost of one iteration of MultBoost is dominated by the cost of the second step, this does not change the asymptotic complexity of the algorithm. Summing up, the cost of one iteration of MultBoost is Θ(M^2 log k), and so the total cost is Θ(TM^2 log k) for T iterations.

5.1.2 Computational complexity

The computational complexity of the "learning" steps of MultBoost is Θ((T + log n)dn), where n = Σ_{p=1}^M n^p, as in BiBoost (assuming again that decision stumps are used as weak learners). The anonymous broadcast protocol has a computational cost of Θ(eM) for each participant, where e is the time required to perform one encryption. The cost of the secure sum computation is Θ(M) per participant. Hence, the total computational cost of MultBoost is Θ((T + log n)dn + eMT). Interestingly, if M > n, the cost of encrypting the values during the anonymous broadcast protocol could exceed the cost of learning.

6 Privacy

Privacy is a difficult notion to formalize. It can take different flavors and meanings depending on the context, and there is no consensus yet on how it should be defined in the data mining setting. Although several attempts have been made to tackle this question (see Clifton et al. 2004 for a non-exhaustive list), the definitions proposed are often restricted to one of the three particular approaches outlined in Sect. 2. In this section, we first give an overview of the privacy models used in these three approaches. Then we analyze BiBoost and MultBoost within the standard paradigm commonly used in cryptography-based approaches. This approach considers a protocol perfectly secure if its execution does not reveal more information than the output itself, which means that it entirely overlooks any information leaked by the final classifier itself.
In the last subsection we elaborate on this criticism and analyze the security of the classifier that BiBoost produces.

6.1 Privacy models in data mining

In the first approach of privacy-preserving data mining, the data is altered through a sanitization process which tries to preserve privacy while maintaining the utility of the data. Within this approach, Bertino et al. (2005) have recently proposed a formal framework which allows one to compare different privacy-preserving algorithms using criteria such as efficiency, accuracy, scalability, or level of privacy. The goal of the second approach is to randomize the data by adding uniform or Gaussian noise to the feature values. In this model, two notions are commonly used to measure privacy: conditional entropy (Agrawal and Aggarwal 2001) and the notion of privacy breaches (Evfimievski 2002). Conditional entropy is an information-theoretic measure which computes the mutual information between the original and the randomized distribution. Low mutual information leads to high privacy preservation, but learning becomes less reliable. Privacy breaches are used to model a change of confidence regarding the estimated value of a particular attribute of a particular record. Evfimievski et al. (2003) describe a technique that can be used to limit privacy breaches without any knowledge of the original distribution. They also point out interesting links between the notions of conditional entropy and privacy breaches, and describe situations in which privacy breaches can occur despite low mutual information between the original and the randomized data. In the third, cryptography-based approach, the objective is to preserve the privacy of the participants' data while approximating as much as possible the performance of the classifier that they would have obtained had they fully disclosed their data sets to each other.
Besides BiBoost, cryptography-based versions of decision trees (Lindell and Pinkas 2002), the naïve Bayes classifier (Kantarcioglu and Vaidya 2004), neural networks (Chang and Lu 2001), support vector machines (Yu et al. 2006), and k-means (Kruger et al. 2005) have been developed. Privacy in these methods is defined within the usual cryptographic paradigm, which states that a multiparty protocol is considered perfectly secure if its execution does not reveal more information than the output of the protocol itself (Goldreich 2004). The rationale behind this definition is that the purpose of the protocol is to make its output available to all parties, and therefore there is no way to avoid revealing the information that this entails. It turns out that this well-accepted notion has serious shortcomings in our context, because our purpose is not to compute securely one particular well-specified classifier. Much to the contrary, our study aims at determining which classifier can be obtained as privately as possible, and various choices of strategy (such as which alternative is taken in line 4 of BiBoost) will result in different classifiers. In particular, it would be unfair to claim that alternative x is more secure than alternative y simply because the protocol leaks less information in the first case than in the second, beyond what can be inferred from the resulting classifiers. Indeed, it could be that the classifier resulting from alternative x itself leaks much more information on sensitive data than the classifier resulting from the other alternative. To make this problem even more conspicuous, consider the extreme case of building a privacy-preserving k-nearest-neighbor classifier.
Since the classifier itself contains a copy of the data, the participants could simply exchange their data sets in the clear and still have a perfectly secure protocol according to the cryptographic definition.13 Several papers studying the notion of privacy have raised this question, although none of them (including the ones describing the cryptography-based privacy-preserving classifiers cited above) has found a general model that could be used to analyze and compare the methods. In Kantarcioglu et al. (2004), the authors discuss some aspects of how the results of a data mining process can violate privacy. They model the classifier as a "black box," which means that an adversary can request an instance to be classified without getting any other information on the classifier. They consider a multi-task classification model in which the goal is to predict several attributes at the same time using common observations. The particular question that they study is whether an adversary can improve his classifier for predicting an attribute by using the output of another (black-box) classifier that predicts another attribute. This question is interesting in the sense that it helps in understanding how the output of a data mining process (such as the prediction of a classifier) could be used to attack the privacy of some unobserved attributes. However, modeling the classifier as a black box does not take into account how the model description might reveal sensitive information. The angle of attack of Dinur and Nissim (2003) is different: they try to give a computational definition of privacy in the context of databases.

13 In a different (although related) model, the goal is to classify a given test example privately using a distributed k-nearest-neighbor classifier. See Kantarcioglu and Clifton (2004a) for a privacy-preserving protocol solving this problem using cryptographic techniques with the help of an untrusted third party.
In this setting, privacy is preserved if it is computationally infeasible (e.g., no polynomial-time algorithm exists) to retrieve the original information from the randomized data. In particular, the authors prove that unless the perturbation is of magnitude at least Ω(√n) (where n is the number of records in the database), a polynomial-time adversary can always recover almost the whole database. They also show that such a large perturbation does not automatically imply that the resulting randomized database will be useless. The approach taken by the authors is interesting in the sense that it is a "reverse-engineering" view of cryptography: we first look at which information should not be leaked, look for functions that meet the privacy requirements, and then devise a privacy-preserving version of these functions with the help of cryptography. Another related question is how the a priori knowledge of an adversary can cause privacy breaches. Suppose that Alice and Bob run a privacy-preserving algorithm to obtain a classifier g, and suppose that neither the form of g nor the algorithm leaks any information regarding Bob's data set. In principle, it is still possible that Alice can retrieve some information on Bob's data by running the classification algorithm on her data set to obtain the classifier gA, and then examining the difference between g and gA. A related problem is studied by Chang and Lu (2001) under the name of oblivious learning. In this model, Alice has a classifier with a fixed architecture and Bob has a training data set. Alice wants to train her model without learning anything about Bob's data set and without Bob learning anything about Alice's classifier. In particular, Chang and Lu (2001) applied this form of learning to neural networks. In the following two sections, we first analyze BiBoost in the "traditional" cryptographic paradigm; then we discuss the security of the obtained classifier.
6.2 Cryptographic security of BiBoost

Keeping in mind the caveat discussed in the previous section, it is nevertheless interesting to study the amount of information leaked during the execution of our various alternatives, above and beyond the information leaked by the final classifier. First recall that we consider here the semi-honest model, in which participants are expected to follow their prescribed behavior during the execution of the protocol. Yet, they attempt to learn as much as possible about the other participant's sensitive data by analyzing the information exchanged during the protocol (Goldreich 2004). For the analysis of security, it suffices to concentrate on the steps enumerated in Sect. 4.2.1, since those are the only steps in which communication takes place. Moreover, it turns out that all the information exchanged in the clear during the second and third steps is implicitly contained in the description of the final classifier. The security of the first step depends on which alternative is chosen in line 4 of BiBoost, so we discuss the security issues separately for the three alternatives in the following subsections.

6.2.1 The WholeSet alternative

The WholeSet alternative (Sect. 4.1.1) is the easiest to analyze, because everything that is said in the clear (the descriptions of hA and hB) can be reconstructed from knowledge of the final classifier. In other words, our protocol is perfectly secure (in the semi-honest model) according to the standard definition of security for multiparty computation (Goldreich 2004). Far from using this observation to recommend this alternative, we use it to reinforce our claim that the standard definition of security is inadequate in our context, since much more information leaks from the final classifier when this alternative is taken.

6.2.2 The CommonSubset alternative

For the analysis of the CommonSubset alternative (Sect.
4.1.2), first note that, in principle, it would be possible to replace our random rendez-vous protocol with a perfectly secure protocol, under the usual cryptographic assumptions, since the task at hand falls under the generic technique of secure two-party computation (Goldreich 2004). This would result in a perfectly secure protocol for building the final classifier under the usual cryptographic definition of security in the semi-honest model. Nevertheless, we advocate against this approach because it would be much less efficient than our proposed solution, and the gain in security would mostly be an illusion, as we explained at the beginning of this section: the final classifier itself leaks a fair amount of information. For the analysis of the random rendez-vous protocol, let us assume for simplicity that the encryption schemes are strong enough that no information can be gained from encrypted values. This ensures that no information leaks about the values of the set intersection other than the value chosen at random. What leaks, however, as explained earlier, is a probabilistic estimate of the number of admissible subsets (which corresponds to the size of the set intersection). Clearly, this information could not be inferred by analyzing the final classifier alone.

6.2.3 The RandomSubset alternative

Not much information leaks if the RandomSubset alternative (Sect. 4.1.3) is used. Recall that, in each iteration, Alice announces a subset HjA chosen at random from among her admissible subsets. If HjA is admissible for Bob, then it is the only subset used in this iteration. In this case, no information leaks from the interaction that cannot be reconstructed from the final classifier, which is perfect from the standard cryptographic viewpoint. Assume now that HjA is not admissible for Bob, which forces him to reveal his randomly chosen admissible subset HjB.
The most interesting situation occurs if Alice's best weak classifier in HjB is better than her initially chosen weak classifier in HjA, since in this case only HjB will be used in the final classifier, and no trace that HjA had been considered will be kept in the classifier. Therefore, the interaction leaks information that cannot be reconstructed by analyzing the final classifier. Namely, Alice learns that Bob does not have a good weak classifier in HjA, and Bob learns not only that Alice does have a good weak classifier in HjB, but also that her weak classifier in HjB is better than her initially chosen weak classifier in HjA.

6.3 Cryptographic security of MultBoost

There is a new security threat that needs to be considered in the multiparty setting. A collusion in the semi-honest model corresponds to a scenario where a fixed subset of the participants agree to cooperate by exchanging all the information they have gathered during the execution of the protocol. The goal of the colluders is to learn as much information as possible on the other participants' inputs. The security of a protocol can be defined according to the maximum number of participants that can collude without compromising the privacy of the remaining participants' inputs. Note that in the bipartite case, the notion of collusion has no meaning, since a collusion between Alice and Bob is equivalent to exchanging their inputs. The security of an anonymous broadcast protocol comes from the execution of multiple rounds of shuffling and decryption by different and independent mixers, which is done so that the decrypted output cannot be linked to the encrypted input. The security of the secure sum computation protocol is based on the masking ability of random numbers. By looking at the individual shares, none of the participants can determine the other participants' original numbers ε^p.
Moreover, even if two or more participants collude, they will fail to extract the individual values hidden inside, since the random shares are different and independent. This makes the protocol secure against any collusion of up to M − 2 participants. There is an unexpected benefit that comes from the use of the anonymous broadcast protocol. The final output of MultBoost is a weighted linear combination of the merged classifiers h^(t) computed in each iteration. A merged classifier is constructed from M weak classifiers whose descriptions are explicitly contained inside the merged classifier but whose origins are unknown. Trying to reconstruct the original data points (or even to estimate the data distribution) of a specific participant would require tracing back which classifiers belong to him, which is impossible without additional information sources. Even if the origins are known, reconstructing the data sets seems to be a difficult problem; without the origins, the reconstruction is virtually impossible.

6.4 Privacy-preservation of the final classifier

It is an open question how to evaluate the information revealed by the description of a classifier in general, or how to compare the descriptions of two different classifiers from the aspect of privacy. Intuitively, “opaque” classifiers (such as neural networks) seem better than “transparent” ones that either use the training points explicitly (such as nearest neighbor classifiers or support vector machines) or contain rules that can be easily reverse-engineered (such as deep decision trees that contain only a few training points in each leaf). An upper bound on the information revealed by a classifier may be obtained from the length of its description, but this still tells us nothing about the quality of the obtained information.
Even though there exists no comprehensive framework in which we could analyze BiBoost, we argue that the flexibility and the robustness of the algorithm, together with the incremental construction of the classifier, allow the participants to select the best settings for their particular privacy-preservation model. We start with some general comments in Sect. 6.4.1. In Sect. 6.4.2 we consider a concrete privacy model based on k-anonymization, and we describe a subroutine that can be added to the main boosting loop to guarantee that the final classifier preserves the privacy of the participants' data. This method is general in the sense that it does not assume any particular form of the base classifiers. In Sect. 6.4.3 we analyze, from another angle of privacy-preservation, the particular model in which the base classifiers are decision stumps.

6.4.1 General comments

First, in most learning algorithms the functional form of the classifier is fixed. When the classifier is explicitly related to the training points (SVM, nearest neighbor), privacy-preservation becomes impossible, but even in other widely used algorithms (neural nets or trees), the inherent rigidity of the model makes it difficult to adapt them to a given privacy-preservation criterion. BiBoost, on the other hand, gives the user great flexibility in selecting the family of weak classifiers in order to defend against particular privacy breaches. Second, the inherent robustness of the weak-classifier selection allows the user to adopt additional privacy-preserving measures. Since we do not require that the best weak classifier be returned in each iteration, the participants can add random noise to their data before finding a weak classifier, or add noise to the classifier itself after selection. In fact, as we argued in Sect. 4.1, such randomization may even improve the generalization ability of the final classifier.
Third, our protocol for selecting and merging the participants' weak classifiers allows Alice to refuse Bob's proposition if she determines that the merged weak classifier (or the final classifier f^(t−1) together with the new merged weak classifier h^(t)) would reveal sensitive information about her data set. This creates an interesting trade-off: such an “inner loop” would decrease the “traditional” cryptographic security of the algorithm, since it would allow more information to leak during learning, in order to increase the privacy-preservation of the final classifier. Fourth, since the final classifier is constructed incrementally, the participants can stop at any time if they determine that increasing the complexity of the final classifier would be detrimental to privacy. If, in a particular application, privacy-preservation can be measured numerically, the participants can even define a quantitative trade-off between generalization error and privacy-preservation, and stop the algorithm when the combined criterion cannot be further improved.

6.4.2 Guaranteeing k-anonymity

For a concrete example, consider the definition of privacy proposed recently by Chawla et al. (2005). Similarly to k-anonymization (Sweeney 2002), the goal in this model is to “blend in with the crowd”. More formally, we would like each data point to be indistinguishable from at least k − 1 other points, where k is chosen by the participants. In the classification setup, this translates to avoiding homogeneous decision regions that contain fewer than k (but more than zero) data points. More concretely, in BiBoost, given the final classifier f^(T), we say that two data points x_1 and x_2 cannot be distinguished if h^(t)(x_1) = h^(t)(x_2) for all t = 1, . . . , T.
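The verification described next, which maintains the partition of a data set into homogeneous regions and refines it with every new base classifier, can be sketched as follows (a minimal sketch; representing weak classifiers as ±1-valued Python callables and the helper names are our assumptions):

```python
def refine_partition(partition, h):
    """Split every homogeneous subset according to the output of the
    new base classifier h; empty halves are not represented."""
    refined = []
    for subset in partition:
        pos = [x for x in subset if h(x) > 0]
        neg = [x for x in subset if h(x) <= 0]
        refined.extend(part for part in (pos, neg) if part)
    return refined

def creates_bad_region(partition, h, k):
    """True iff adding h would create a non-empty indistinguishable
    region with fewer than k data points."""
    return any(len(subset) < k for subset in refine_partition(partition, h))

# toy example: points on the real line, a decision stump at x = 2, k = 2
data = [0.5, 1.0, 1.5, 3.0]
stump = lambda x: 1 if x > 2 else -1
creates_bad_region([data], stump, k=2)  # -> True: {3.0} is a singleton region
```

Since every point belongs to exactly one subset, both the refinement and the size check touch each point once, matching the O(n) per-iteration cost analysis in the text.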
To preserve the privacy of the data in this sense, the participants must verify in each boosting iteration whether adding a new merged weak classifier would create indistinguishable subsets of fewer than k points in their respective data sets. If the new base classifier would generate such a region, they can go back to line 4 (Fig. 2) and choose another subset H^(t), or terminate the algorithm. To verify whether a base classifier h^(t+1) would generate a “bad” region, the participants have to maintain the partitioning S^(t) = {S_1^(t), . . . , S_{k^(t)}^(t)} of their data sets generated by the base classifiers, where each subset S_i^(t) represents a homogeneous region, that is, for all x, x' ∈ S_i^(t) and all j ≤ t, h^(j)(x) = h^(j)(x'). When the new base classifier h^(t+1) is added, it is sufficient to verify for each S_i^(t) whether there exist two points x, x' ∈ S_i^(t) for which h^(t+1)(x) ≠ h^(t+1)(x'), and, in this case, to split S_i^(t) into the two subsets {x ∈ S_i^(t) : h^(t+1)(x) = 1} and {x ∈ S_i^(t) : h^(t+1)(x) = −1}. Empty sets do not have to be represented, so if h^(t+1)(x) = h^(t+1)(x') for all points x, x' ∈ S_i^(t), the subset is simply copied to S_i^(t+1) = S_i^(t). To verify whether h^(t+1) would generate a “bad” region, we check whether |S_i^(t+1)| < k for some i = 1, . . . , k^(t+1). Since the number of subsets k^(t) cannot be larger than n, both the splitting and the verification can be done in O(n) time, so these operations do not increase the asymptotic running time of the algorithm.

6.4.3 Privacy guarantees for decision stumps

If the participants choose decision stumps as weak classifiers, the final classifier preserves privacy in the following very strong sense. It is clear that just by looking at the set of selected decision stumps, we cannot extract information other than the set of possible values of the individual attributes (and even extracting all the projections is non-trivial, sometimes impossible).
In other words, f contains information only about the marginal distributions of the data points, and there is no way to “connect” the projections. For simplicity, assume that all the n projections in all the d dimensions are different, and that we manage to find each of them at least within an error interval. The number of possible data sets consistent with these “measurements” is (n!)^(d−1), and the probability of finding a data point at an arbitrary combination of the projections (“pinpointing” a data point) is 1/n^(d−1). If projections can coincide, the analysis is more complicated, but the number of consistent combinations is still exponentially large both in the number of dimensions and in the number of data points.

7 Empirical results

In this section we present experimental results with BiBoost (Sect. 7.1) and MultBoost (Sect. 7.2).

7.1 BiBoost

We tested BiBoost on three benchmark binary classification problems from the UCI (University of California at Irvine) data repository (Blake and Merz 1998): the “sonar”, the “spambase” (footnote 14), and the “ionosphere” data sets. We use 9-fold cross validation in the following manner. We split the initial data set D_n into 9 sets of equal size, T_1, . . . , T_9, and conduct 9 trials. For each trial i, the test set is T_i, and the training set is composed of the union of all the remaining sets T_j, j ≠ i. The baseline for the comparison is the plain, non-distributed AdaBoost, which uses the entire training set in each trial. For testing BiBoost, we split the training sets into two equal parts. At trial i, Alice's training set is composed of the four blocks T_{i+1 (mod 9)}, . . . , T_{i+4 (mod 9)} located after T_i, and Bob's training set is composed of the four remaining blocks, T_{i+5 (mod 9)}, . . . , T_{i+8 (mod 9)}. The test and training errors are averaged over the trials. Fig. 4 shows the test and training error curves of one run of BiBoost using the CommonSubset alternative on the sonar data set.
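The cyclic block assignment described above can be sketched as follows (a minimal sketch; `split_trial` and the toy blocks are our own names, and blocks are 0-indexed here rather than 1-indexed as in the text):

```python
def split_trial(blocks, i):
    """For trial i, block T_i is the test set; Alice receives the four
    blocks that follow T_i cyclically, and Bob the remaining four."""
    n = len(blocks)  # n = 9 in the paper's setup
    test = blocks[i]
    alice = [blocks[(i + j) % n] for j in range(1, 5)]
    bob = [blocks[(i + j) % n] for j in range(5, 9)]
    return test, alice, bob

blocks = [[b] for b in range(9)]  # toy stand-ins for the blocks
test, alice, bob = split_trial(blocks, 8)
# test = [8]; Alice holds blocks 0-3, Bob holds blocks 4-7
```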
Figs. 5, 6 and 7 show the behavior of the different boosting algorithms on the three data sets. In each figure, the left graph in the top row displays the results obtained by applying standard AdaBoost on the union of Alice's and Bob's training sets. The error curves in the right graph of the top row were generated by running AdaBoost on Alice's and Bob's training sets separately (we call this version separated AdaBoost) and averaging the training and test errors. These two experiments provide us with lower and upper baselines. With standard AdaBoost, we use all the training data, and with separated AdaBoost the participants do not communicate at all during the boosting iterations. Hence, we expect BiBoost to perform between these two extreme cases.

[Figure: training and test error-rate curves over the boosting iterations t.]
Fig. 4 Evolution of Alice's training error, Bob's training error and the test error of BiBoost during one run of the CommonSubset alternative on the sonar data set

Surprisingly, in our preliminary experiments we found that BiBoost's test error was much closer to the test error of standard AdaBoost than to that of separated AdaBoost. We suspected that BiBoost's excellent performance may partly be explained by the inherent randomization when the CommonSubset and RandomSubset alternatives are used to select weak classifier sets. It has been suggested by Friedman (2002) that AdaBoost's performance may improve if randomization is introduced into the weak classifier selection. The argument that randomization of weak classifiers can improve generalization has also appeared in a more general context in Amit et al. (2000) and Kolcz et al. (2002).

Footnote 14: In the experiments with the spambase data set, we used only 400 randomly selected points out of the 4,000 original points.
Although these approaches are slightly different from ours, the result is similar: in each iteration, a good but suboptimal weak classifier is selected. To imitate the specific randomization that BiBoost uses, in each iteration we choose a random subset H_j from among the subsets that contain an admissible weak classifier, and select the optimal weak classifier in H_j. The results of the resulting variants (which we call randomized AdaBoost and randomized separated AdaBoost) are depicted in the second row of each figure. The two bottom rows of Figs. 5–7 show BiBoost's training and test errors. The first three plots correspond to the three alternatives described in Sect. 4.1. The second figure of the bottom row displays the results that we obtained with a modified version of the WholeSet alternative. In preliminary experiments, we found that the non-randomized (WholeSet) version of BiBoost can easily and rapidly saturate. Saturation occurs when Alice and Bob pick opposite weak classifiers h^A = −h^B (footnote 15). In this situation, we suggest in the first remark after Lemma 1 to go back to line 4 (Fig. 2) and choose another admissible subset H^(t). In the WholeSet alternative, however, there is only one set, which contains all weak classifiers. To resolve this problem, we decided to switch to the RandomSubset alternative for one iteration when saturation occurs. The error curves indicate that, although this trick can work a certain number of times, a moment always comes when saturation is inevitable and occurs with nearly every admissible subset. The error curves also confirm the “common sense” observation that boosting is relatively immune to overfitting: the test error curves are flat over a large span of iterations, and the asymptotic test error is usually not much larger than the minimum.

Footnote 15: More precisely, it is sufficient if h^A(x_i) = −h^B(x_i) for all data points x_i.

[Figure: eight panels of training and test error-rate curves over the boosting iterations t.]
Fig. 5 Comparison of standard AdaBoost, randomized AdaBoost, and four alternatives of BiBoost on the sonar data set during 1,000 iterations

[Figure: eight panels of training and test error-rate curves over the boosting iterations t.]
Fig. 6 Comparison of standard AdaBoost, randomized AdaBoost, and four alternatives of BiBoost on the spambase data set during 1,000 iterations

[Figure: eight panels of training and test error-rate curves over the boosting iterations t.]
Fig. 7 Comparison of standard AdaBoost, randomized AdaBoost, and four alternatives of BiBoost on the ionosphere data set during 1,000 iterations
Nevertheless, overfitting does happen, and the different versions of the algorithm converge at different speeds, so, for a fair comparison, the algorithms should be stopped early, after a number of iterations validated on hold-out data. In the standard, non-separated case we used simple validation. In each of the nine cross-validation experiments, the training data T_tr is first split further into a training set T'_tr and a validation set T_val (at a rate of 2:1 in our experiments). Then we run the algorithm on T'_tr and measure the error on T_val. To smooth the error curve, we average over windows of 5 iterations and choose the middle of the window in which the validation error is minimal as the optimal number of iterations T̂. Then we rerun the algorithm on T_tr = T'_tr ∪ T_val, and measure the error on the hold-out test set after T̂ iterations. The final error is the average of the validated test errors over the cross-validation folds. In a distributed environment where the privacy of the data should be preserved, validation adds another layer of complexity to the algorithms. Finding an efficient and privacy-preserving validation protocol is a crucial question in every learning method where the learned classifier is sensitive to the choice of complexity hyperparameters (such as the number of neurons or the depth of the decision tree). The most privacy-preserving scheme is to let each participant validate the hyperparameters separately on his own data set, and then somehow combine the estimated parameters. The main drawback of this approach is that the separate training sets are much smaller than the unified training set. Assuming that the optimal complexity grows with the size of the training set, this procedure can seriously underestimate the optimal complexity. For small data sets, the resulting validation sets can also be very small, resulting in a parameter estimate with high variance.
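The window-smoothing rule used to pick T̂ can be sketched as follows (a minimal sketch; the function and variable names are ours, and the error values are toy numbers):

```python
def optimal_iterations(val_errors, window=5):
    """Average the validation-error curve over sliding windows of
    `window` iterations and return the (1-based) middle iteration of
    the window with the smallest mean error."""
    means = [sum(val_errors[i:i + window]) / window
             for i in range(len(val_errors) - window + 1)]
    best = min(range(len(means)), key=lambda i: means[i])
    return best + window // 2 + 1  # 1-based middle of the best window

# toy validation-error curve over 8 boosting iterations
optimal_iterations([0.30, 0.22, 0.18, 0.15, 0.14, 0.15, 0.17, 0.20])  # -> 5
```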
In our case, we found that this protocol stopped BiBoost far too early, and the problem was particularly accentuated with more than two participants (Sect. 7.2). To avoid this problem, we adopted the following protocol. In each cross-validation fold, both Alice and Bob split their training sets into T_tr^A/T_val^A and T_tr^B/T_val^B. They run BiBoost using their training sets D^A = T_tr^A and D^B = T_tr^B, and estimate the hyperparameter based on the error measured on the unified validation set T_val^A ∪ T_val^B. They do not have to actually communicate their validation points (which would be a major privacy breach), only the error of the combined classifier f^(t) measured on their sets. This results in a minor leak similar to the one that occurs in lines 8 and 9 (Fig. 2) when they exchange the weighted training error of the weak classifier. The advantage of the protocol is that the “effective” sizes of the training and validation sets are the same as in the non-separated case, and they do not depend on the number of participants. Although this protocol leaks more information about the participants' data sets than the first scheme, we feel that it is a good compromise between privacy preservation and generalization performance.

Table 2 Test errors (with standard deviations) of the different algorithms after an optimal number of iterations estimated using cross-validation

                               Sonar          Spambase       Ionosphere
Standard AdaBoost              0.164 (0.054)  0.070 (0.037)  0.100 (0.037)
Separated AdaBoost             0.212 (0.084)  0.098 (0.036)  0.114 (0.040)
Randomized AdaBoost            0.202 (0.082)  0.062 (0.034)  0.083 (0.029)
Randomized separated AdaBoost  0.257 (0.065)  0.076 (0.033)  0.092 (0.023)
BiBoost (WholeSet)             0.226 (0.085)  0.074 (0.049)  0.108 (0.040)
BiBoost (CommonSubset)         0.197 (0.049)  0.052 (0.034)  0.108 (0.020)
BiBoost (RandomSubset)         0.202 (0.070)  0.072 (0.033)  0.085 (0.031)
BiBoost (modified WholeSet)    0.226 (0.079)  0.070 (0.047)  0.102 (0.038)
Table 2 summarizes the results obtained using this validation technique for BiBoost and “traditional” validation for AdaBoost and randomized AdaBoost. The following conclusions can be drawn from the results on the different data sets. First, note that if Alice and Bob were to run AdaBoost on their own training sets only (separated AdaBoost), the resulting classifier would perform significantly worse than if they ran the algorithm on the complete training set composed of the union of their databases (standard AdaBoost). This is not really surprising, since the accuracy of a classifier usually grows with the number of training points. The second observation is that randomized AdaBoost is usually better than standard AdaBoost, both for the complete training set and for the separated training sets, as suggested in Amit et al. (2000), Friedman (2002) and Kolcz et al. (2002). Note, however, that in our case there is a small price to pay with randomized AdaBoost, namely a slower convergence of the training error (roughly half the speed of standard AdaBoost). We can observe a similar relation between the deterministic (WholeSet alternative) and randomized (CommonSubset and RandomSubset alternatives) versions of BiBoost. In general, the CommonSubset and RandomSubset alternatives of BiBoost perform reasonably well. In particular, their test errors are always lower than separated AdaBoost's test error, and usually also lower than randomized AdaBoost's test error, although here the difference is less substantial. Their test errors are also only slightly higher than standard (randomized) AdaBoost's test error, which means that BiBoost performs close to its lower limit. In general, the test error of the WholeSet alternative is slightly higher than that of the CommonSubset and RandomSubset alternatives.
On the other hand, this alternative converges faster than the others, so, when sparseness of the final classifier (a low number of weak classifiers) is important, it can also be useful.

[Figure: two panels of test error-rate curves over the boosting iterations t, for 25 and 50 participants.]
Fig. 8 Comparison of randomized AdaBoost, randomized separated AdaBoost, and MultBoost with 25 and 50 participants on the pendigits data set during 15,000 iterations

7.2 Empirical results with MultBoost

We tested the MultBoost algorithm on the pendigits dataset, which also comes from the UCI repository. We chose not to reuse the ionosphere, sonar and spambase datasets, which we used with BiBoost, because they do not contain enough data points once split among a large number of participants. We also tried to use the pendigits dataset with BiBoost, but we could not observe any significant difference between the standard and separated versions of AdaBoost and randomized AdaBoost, probably due to the large size of the data set. The pendigits dataset contains 7,494 data points, each represented by 16 attributes. Originally, this dataset was designed for multi-class classification with a total of 10 classes (one for each digit from 0 to 9). We chose instead to transform it into a binary classification task by assigning the negative class to the even digits and the positive class to the odd digits. We tested MultBoost with M = 25 and 50 participants. This time, due to the size of the dataset, we used 2-fold cross validation instead of 9-fold. This means that for each fold, half of the dataset was used for testing, while the remaining half was used for training and split equally between the participants.
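The even/odd relabeling of pendigits described above can be sketched in one line (the function name is ours):

```python
def to_binary_label(digit):
    """Map a 10-class pendigits label (0-9) to a binary label:
    even digits -> -1 (negative class), odd digits -> +1 (positive class)."""
    return 1 if digit % 2 == 1 else -1

[to_binary_label(d) for d in range(4)]  # -> [-1, 1, -1, 1]
```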
Since the total number of training points is fixed, it is possible to observe how MultBoost behaves when the total amount of information remains the same but the number of participants changes. During the experiments, we observed that it is not necessary to make every participant “vote” in each iteration to obtain the merged classifier (13). In particular, we ran MultBoost with 3, 4, 5 and M voters and compared the test errors. The actual voters were selected randomly in each iteration. Fig. 8 shows the test curves for randomized AdaBoost, randomized separated AdaBoost, and MultBoost with M voters on the pendigits data set. Similarly to BiBoost, the optimal number of iterations T̂ was estimated using simple validation. In the case of MultBoost, we generalized the approach outlined in Sect. 7. During the validation pass, each participant p separates his individual dataset into a training set T_tr^p and a validation set T_val^p. The participants then execute MultBoost using their training sets T_tr^p and record the error of the resulting classifier on their validation sets in each iteration. To compute the
validation error, they do not have to disclose their errors; instead, they can run the secure sum computation protocol (described in Sect. ). As in BiBoost, we smoothed the error curve by averaging over windows of 5 iterations and set T̂ to the middle of the minimum-error window. Finally, we re-ran MultBoost using the unified training sets T_tr^p ∪ T_val^p, stopping at the estimated optimal number of iterations T̂. Table 3 summarizes the results. It is clear that the higher the number of participants, the more we gain by using MultBoost instead of separated AdaBoost. In particular, there is a significant difference of roughly 7–8% in accuracy between separated AdaBoost (both standard and randomized) and MultBoost for 25 participants, and of more than 10% for 50 participants.

Table 3 Test errors (with standard deviations) of the different algorithms obtained on the pendigits dataset after an optimal number of iterations estimated using simple validation

                               25 participants   50 participants
Standard AdaBoost              0.0502 (0.0051)
Randomized AdaBoost            0.0550 (0.0027)
Separated AdaBoost             0.1412 (0.0029)   0.1346 (0.0016)
Randomized separated AdaBoost  0.1807 (0.0001)   0.1701 (0.0019)
MultBoost (3 voters)           0.0620 (0.0013)   0.0518 (0.0012)
MultBoost (4 voters)           0.0543 (0.0032)   0.0582 (0.0033)
MultBoost (5 voters)           0.0551 (0.0024)   0.0531 (0.008)
MultBoost (M voters)           0.0562 (0.0015)   0.0586 (0.0053)

(Standard and randomized AdaBoost are run on the entire training set, so their errors do not depend on the number of participants.)

Table 4 Test errors (with standard deviations) of MultBoost on the pendigits dataset with 25% and 50% of the attributes removed

                      25% of the attributes removed      50% of the attributes removed
                      25 participants  50 participants   25 participants  50 participants
MultBoost (3 voters)  0.0559 (0.0017)  0.0572 (0.0052)   0.0591 (0.0009)  0.0642 (0.0009)
MultBoost (5 voters)  0.0943 (0.0006)  0.0945 (0.0056)   0.0903 (0.0020)  0.0899 (0.0021)
Note also that we do not require all the participants to output one of their weak classifiers in each iteration, since MultBoost with only 3, 4, or 5 voters per iteration performs as well as MultBoost with all the participants voting in each iteration. This is important for privacy-preservation, since it means that we can achieve high accuracy with little exchange of information. In Sect. 6.4 we argued that, to ensure that the final classifier itself preserves the privacy of the data sets, the participants may restrict the pool of base classifiers used to build the final linear combination. In a set of experiments, we tested how such a limitation affects the performance of the learned classifier. In particular, we randomly removed 25% and 50% of the original attributes (features), and ran the algorithm on the truncated data sets. Table 4 summarizes the results. Although the errors are worse than when all the attributes are used, they are still significantly better than those of separated randomized AdaBoost. This suggests that even when some attributes are considered sensitive, running MultBoost with a restricted set of attributes can still lead to better performance than running AdaBoost alone.

8 Conclusion and future directions

In this paper, we have described two algorithms, BiBoost (Bipartite Boosting) and MultBoost (Multiparty Boosting), that allow two or more participants to construct a boosting classifier without explicitly sharing their data sets. We have analyzed both the computational and the security aspects of the algorithms. On the one hand, the algorithms inherit the excellent generalization performance of AdaBoost. We have demonstrated in experiments on benchmark data sets that the algorithms are better than AdaBoost executed separately by the participants, and that they perform close to AdaBoost executed using the entire training set.
On the other hand, the algorithms are almost privacy-preserving under our security model. Although the participants do exchange implicit information about their data sets, it is unclear at this point how this information could be used (and whether it can be used at all) to reverse-engineer the training data sets. In this framework, we could also analyze the impact of using other families of weak classifiers, such as decision trees or neural networks with a few hidden units. It is clear that by using “stronger” weak classifiers, the number of iterations required for convergence could be reduced, so the participants would have less time to reverse-engineer the training data sets. On the other hand, this would come at the price of communicating more bits in each iteration to describe the weak classifiers, which could potentially be more damaging for privacy. It is also clear that the privacy-preserving aspect cannot be captured by traditional measures of complexity alone: the particular form of the weak classifiers can be equally important. For example, some classifiers, such as nearest neighbor or support vector machines (Cortes and Vapnik 1995), explicitly contain some or all of the training data points in their description, so using them as weak classifiers would obviously breach privacy, independently of whether they are complex in an information-theoretical sense. This dilemma can also be viewed as a specific case of a broader issue, namely that the “traditional” model of secure multiparty computation does not seem to be adequate here. In particular, the participants are not interested in how much information is leaked relative to the information contained in f; rather, they would like to control the total direct information about their data sets that is communicated by the protocols, including the information contained in f.
In the extreme case of f being a nearest neighbor classifier, f would contain the description of all data points, even though it could be computed totally securely under the traditional paradigm of secure multiparty computation. Although this is a crucial question, it is often overlooked in the literature on cryptography-based privacy-preserving algorithms. We see two research directions that could be explored to solve this problem. First, some studies consider the secure multiparty computation of an approximation of a function f instead of the function itself (Feigenbaum et al. 2001). The usual situation in which such an algorithm is useful is when f is hard to compute. In our case, the focus would be different: we would use an approximation to hide information about f. In the privacy-preserving protocol for ID3 (Lindell and Pinkas 2002), the authors use an approximation of the original algorithm, both because it is more efficient to compute in a privacy-preserving manner and because it leaks less information about the participants' inputs. The second direction to explore is to introduce some kind of randomization into the learning process. Some algorithms, for example neural networks with random initial weights, could be used in a natural way, while others could be changed explicitly to include randomization. As we indicated in Sect. 6.4, BiBoost and MultBoost can easily accommodate both approaches.

Acknowledgements We give our deepest thanks to Gilles Brassard for all the useful discussions we had with him and for all the work he did reviewing an early version of this paper. We would also like to thank both reviewers for their insightful and detailed comments, which greatly improved the paper. The authors are supported in part by the Natural Sciences and Engineering Research Council of Canada.
References

Agrawal D, Aggarwal CC (2001) On the design and quantification of privacy preserving data mining algorithms. In: Proceedings of the 20th ACM symposium on principles of database systems, pp 247–255
Agrawal R, Srikant R (2000) Privacy-preserving data mining. In: Proceedings of the ACM SIGMOD international conference on management of data, pp 439–450
Aïmeur E, Brassard G, Gambs S, Kégl B (2004) Privacy-preserving boosting. In: Proceedings of the international workshop on privacy and security issues in data mining, in conjunction with PKDD’04, pp 51–69
Amit Y, Blanchard G, Wilder K (2000) Multiple randomized classifiers: MRCL. Technical Report 496, Department of Statistics, University of Chicago
Atallah MJ, Bertino E, Elmagarmid AK, Ibrahim M, Verykios VS (1999) Disclosure limitation of sensitive rules. In: Proceedings of the IEEE knowledge and data engineering exchange workshop, pp 45–52
Bayardo R, Agrawal R (2005) Data privacy through optimal k-anonymization. In: Proceedings of the 21st IEEE international conference on data engineering, pp 217–228
Ben-Or M, Goldwasser S, Wigderson A (1988) Completeness theorems for non-cryptographic fault-tolerant distributed computation. In: Proceedings of the 20th ACM annual symposium on the theory of computing, pp 1–10
Bertino E, Fovino IN, Provenza LP (2005) A framework for evaluating privacy preserving data mining algorithms. Data Min Knowl Disc 11(2):121–154
Blake CL, Merz CJ (1998) UCI repository of machine learning databases. Available at http://www.ics.uci.edu/∼mlearn/MLRepository.html
Chang L, Moskowitz IL (2000) An integrated framework for database inference and privacy protection. In: Proceedings of data and applications security, pp 161–172
Chang Y-C, Lu C-J (2001) Oblivious polynomial evaluation and oblivious neural learning. In: Proceedings of Asiacrypt’01, pp 369–384
Chaum D (1981) Untraceable electronic mail, return addresses and digital pseudonyms. Commun ACM 24(2):84–88
Chaum D, Crépeau C, Damgård I (1988) Multiparty unconditionally secure protocols. In: Proceedings of the 20th ACM annual symposium on the theory of computing, pp 11–19
Chaum D, Damgård I, van de Graaf J (1987) Multiparty computations ensuring privacy of each party’s input and correctness of the result. In: Proceedings of Crypto’87, pp 87–119
Chawla S, Dwork C, McSherry F, Smith A, Wee H (2005) Towards privacy in public databases. In: Proceedings of the 2nd theory of cryptography conference, pp 363–385
Clifton C, Kantarcioglu M, Vaidya J (2004) Defining privacy for data mining. In: Data mining: next generation challenges and future directions. AAAI/MIT Press
Clifton C, Kantarcioglu M, Vaidya J, Lin X, Zhu MY (2002) Tools for privacy preserving distributed data mining. SIGKDD Explor 4(2):28–34
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
Dinur I, Nissim K (2003) Revealing information while preserving privacy. In: Proceedings of the 22nd ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems, pp 202–210
Evfimievski A (2002) Randomization in privacy preserving data mining. SIGKDD Explor 4(2):43–48
Evfimievski A, Gehrke JE, Srikant R (2003) Limiting privacy breaches in privacy preserving data mining. In: Proceedings of the 22nd ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems, pp 211–222
Fan W, Stolfo SJ, Zhang J (1999) The application of AdaBoost for distributed, scalable and on-line learning. In: Proceedings of the 5th ACM SIGKDD international conference on knowledge discovery and data mining, pp 362–366
Feigenbaum J, Ishai Y, Malkin T, Nissim K, Strauss M, Wright R (2001) Secure multiparty computation of approximations. In: Proceedings of the 28th international colloquium on automata, languages and programming, pp 927–938
Fienberg SE, McIntyre J (2004) Data swapping: variations on a theme by Dalenius and Reiss. In: Proceedings of privacy in statistical databases, pp 14–29
Freedman M, Nissim K, Pinkas B (2004) Efficient private matching and set intersection. In: Proceedings of Eurocrypt’04, pp 1–19
Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55:119–139
Friedman J (2002) Stochastic gradient boosting. Comput Stat Data Anal 38(4):367–378
Furukawa J, Sako K (2001) An efficient scheme for proving a shuffle. In: Proceedings of Crypto 2001, pp 368–387
Goldreich O (2004) Foundations of cryptography, volume II: basic applications. Cambridge University Press
Goldreich O, Micali S, Wigderson A (1987) How to play any mental game – a completeness theorem for protocols with honest majority. In: Proceedings of the 19th ACM symposium on theory of computing, pp 218–229
Iyengar V (2002) Transforming data to satisfy privacy constraints. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining, pp 279–288
Kalyanasundaram B, Schnitger G (1987) The probabilistic communication complexity of set intersection. In: Proceedings of the 2nd annual IEEE conference on structure in complexity theory, pp 41–47
Kantarcioglu M, Clifton C (2004a) Privacy-preserving distributed k-nn classifier. In: European conference on principles of data mining and knowledge discovery, pp 279–290
Kantarcioglu M, Clifton C (2004b) Privacy-preserving distributed mining of association rules on horizontally partitioned data. IEEE Trans Knowl Data Eng 16(9):1026–1037
Kantarcioglu M, Jin J, Clifton C (2004) When do data mining results violate privacy? In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining, pp 599–604
Kantarcioglu M, Vaidya J (2004) Privacy preserving naive Bayes classifier for horizontally partitioned data. In: Proceedings of the workshop on privacy preserving data mining held in association with the third IEEE international conference on data mining
Kégl B (2003) Robust regression by boosting the median. In: Proceedings of the 16th conference on computational learning theory, pp 258–272
Kissner L, Song D (2005) Privacy-preserving set operations. In: Proceedings of Crypto 2005, pp 241–257
Kolcz A, Xiaomei S, Kalita J (2002) Efficient handling of high-dimensional feature spaces by randomized classifier ensembles. In: Proceedings of SIGKDD’02, pp 307–313
Kruger L, Jha S, McDaniel P (2005) Privacy preserving clustering. In: Proceedings of the 10th European symposium on research in computer security, pp 397–417
Lazarevic A, Obradovic Z (2002) Boosting algorithms for parallel and distributed learning. Distrib Parallel Databases 11(2):203–229
Lindell Y, Pinkas B (2002) Privacy preserving data mining. J Cryptol 15:177–206
Neff A (2001) A verifiable secret shuffle and its application to e-voting. In: Proceedings of the 8th ACM conference on computer and communications security, pp 116–125
Paillier P (1999) Public-key cryptosystems based on composite degree residuosity classes. In: Proceedings of Eurocrypt’99, pp 223–238
Predd JB, Kulkarni SR, Poor HV (2006) Consistency in models for distributed learning under communication constraints. IEEE Trans Inform Theory 52(1):52–63
Quinlan J (1986) Induction of decision trees. Mach Learn 1(1):81–106
Rabin T, Ben-Or M (1989) Verifiable secret sharing and multiparty protocols with honest majority. In: Proceedings of the 21st ACM symposium on theory of computing, pp 73–85
Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37(3):297–336
Shamir A (1979) How to share a secret. Commun ACM 22(11):612–613
Sweeney L (2002) Achieving k-anonymity privacy protection using generalization and suppression. Int J Uncertain Fuzziness Knowl-Based Syst 10(5):571–588
Valiant L (1984) A theory of the learnable. Commun ACM 27(11):1134–1142
Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. SIGMOD Record 33(1):50–57
Yao AC (1986) How to generate and exchange secrets. In: Proceedings of the 27th IEEE symposium on foundations of computer science, pp 162–167
Yu H, Jiang X, Vaidya J (2006) Privacy-preserving SVM using nonlinear kernels on horizontally partitioned data. In: Proceedings of the 21st annual ACM symposium on applied computing, pp 603–610